Qwen3-0.6B

  1. Manually download the model and upload it to the Raspberry Pi 5, or pull the model repository with the following command.
Tip
If git lfs is not installed, refer to the git lfs installation guide first.
git clone https://huggingface.co/AXERA-TECH/Qwen3-0.6B
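If git lfs is not yet set up, on Raspberry Pi OS (Debian-based) it can usually be installed and initialized like this:

sudo apt update
sudo apt install git-lfs
git lfs install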

File Description

m5stack@raspberrypi:~/Qwen3-0.6B $ ls -lh
total 4.4M
-rw-r--r-- 1 m5stack m5stack    0 Jul 25 15:48 config.json
-rw-r--r-- 1 m5stack m5stack 959K Jul 25 15:54 main_ax650
-rw-r--r-- 1 m5stack m5stack 1.7M Jul 25 15:54 main_axcl_aarch64
-rw-r--r-- 1 m5stack m5stack 1.8M Jul 25 15:54 main_axcl_x86
-rw-r--r-- 1 m5stack m5stack  277 Jul 25 15:47 post_config.json
drwxr-xr-x 2 m5stack m5stack 4.0K Jul 25 15:47 qwen2.5_tokenizer
drwxr-xr-x 2 m5stack m5stack 4.0K Jul 25 15:47 qwen3-0.6b-ax630c
drwxr-xr-x 2 m5stack m5stack 4.0K Jul 25 15:47 qwen3-0.6b-ax650
drwxr-xr-x 2 m5stack m5stack 4.0K Jul 25 15:47 qwen3_tokenizer
-rw-r--r-- 1 m5stack m5stack 7.6K Jul 25 15:47 qwen3_tokenizer_uid.py
-rw-r--r-- 1 m5stack m5stack  11K Jul 25 15:47 README.md
-rw-r--r-- 1 m5stack m5stack  577 Jul 25 15:47 run_qwen3_0.6b_int8_ctx_ax630c.sh
-rw-r--r-- 1 m5stack m5stack  574 Jul 25 15:47 run_qwen3_0.6b_int8_ctx_ax650.sh
-rw-r--r-- 1 m5stack m5stack  594 Jul 25 15:47 run_qwen3_0.6b_int8_ctx_axcl_aarch64.sh
-rw-r--r-- 1 m5stack m5stack  590 Jul 25 15:47 run_qwen3_0.6b_int8_ctx_axcl_x86.sh
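If any model files in this listing show up as only a few bytes, they are likely git lfs pointer files rather than the actual weights; fetching them explicitly from inside the repository usually resolves this:

git lfs pull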
  2. Create a virtual environment
python -m venv qwen
  3. Activate the virtual environment
source qwen/bin/activate
  4. Install dependencies
pip install transformers jinja2
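To verify that the dependencies installed correctly into the environment, a quick import check can be run (only transformers and jinja2 are assumed here):

python -c "import transformers, jinja2; print(transformers.__version__, jinja2.__version__)"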
  5. Start the tokenizer service
python qwen3_tokenizer_uid.py --port 12345
The host IP defaults to localhost, and the port here is set to 12345. Once the service is running, the output will look like this:
(qwen) m5stack@raspberrypi:~/Qwen3-0.6B $ python qwen3_tokenizer_uid.py --port 12345
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
Tip
The following operations require opening a new terminal on the Raspberry Pi.
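Optionally, confirm from the new terminal that the tokenizer service is listening on port 12345 before continuing:

ss -ltn | grep 12345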
  6. Set executable permissions
chmod +x main_axcl_aarch64 run_qwen3_0.6b_int8_ctx_axcl_aarch64.sh
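The launch script wraps the call to the main_axcl_aarch64 binary with the matching model and tokenizer settings; if you want to see the exact invocation before running it, inspect the script:

cat run_qwen3_0.6b_int8_ctx_axcl_aarch64.sh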
  7. Start the Qwen3 model inference service
./run_qwen3_0.6b_int8_ctx_axcl_aarch64.sh

Once it starts successfully, the output will look like this:

m5stack@raspberrypi:~/rsp/Qwen3-0.6B$ ./run_qwen3_0.6b_int8_ctx_axcl_aarch64.sh
[I][                            Init][ 136]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][  57]: uid: abf93a3d-2d6a-4ddb-8c9b-42a208a012f7
bos_id: -1, eos_id: 151645
  3% | ██                                |   1 /  31 [1.11s<34.47s, 0.90 count/s] tokenizer init ok[I][Init][  45]: LLaMaEmbedSelector use mmap
  6% | ███                               |   2 /  31 [1.11s<17.25s, 1.80 count/s] embed_selector init ok
  96% | ███████████████████████████████   |  30 /  31 [31.91s<32.98s, 0.94 count/s] init 27 axmodel
  100% | ████████████████████████████████ |  31 /  31 [36.09s<36.09s, 0.86 count/s] init post axmodel ok,remain_cmm(5068 MB)
[I][                            Init][ 237]: max_token_len : 2559
[I][                            Init][ 240]: kv_cache_size : 1024, kv_cache_num: 2559
[I][                            Init][ 248]: prefill_token_num : 128
[I][                            Init][ 252]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 252]: grp: 2, prefill_max_token_num : 512
[I][                            Init][ 252]: grp: 3, prefill_max_token_num : 1024
[I][                            Init][ 252]: grp: 4, prefill_max_token_num : 1536
[I][                            Init][ 252]: grp: 5, prefill_max_token_num : 2048
[I][                            Init][ 256]: prefill_max_token_num : 2048
________________________
|    ID| remain cmm(MB)|
========================
|     0|           5068|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 1,
    "top_p": 0.8
}

[I][                            Init][ 279]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][          GenerateKVCachePrefill][ 335]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][          GenerateKVCachePrefill][ 372]: input_num_token:21
[I][                            main][ 236]: precompute_len: 21
[I][                            main][ 237]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> hello
[I][                      SetKVCache][ 628]: prefill_grpid:2 kv_cache_num:512 precompute_len:21 input_num_token:12
[I][                      SetKVCache][ 631]: current prefill_max_token_num:1920
[I][                             Run][ 869]: input token num : 12, prefill_split_num : 1
[I][                             Run][ 901]: input_num_token:12
[I][                             Run][1030]: ttft: 670.51 ms
<think>

</think>

Hello! How can I assist you today?

[N][                             Run][1182]: hit eos,avg 12.88 token/s

[I][                      GetKVCache][ 597]: precompute_len:46, remaining:2002
prompt >>
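One note on the sampling settings in the load_config dump above: with enable_top_k_sampling true and top_k set to 1, decoding is effectively greedy, while the temperature, top_p, and repetition-penalty values are loaded but disabled. As a rough, illustrative Python sketch of what a top-k selection step does (not the runtime's actual implementation):

import numpy as np

def top_k_sample(logits, k=1):
    # Keep only the k highest-scoring tokens, renormalize, and sample.
    top = np.argsort(logits)[-k:]               # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))  # with k=1 this reduces to argmax

logits = np.array([1.0, 3.5, 0.2, 2.8])
print(top_k_sample(logits, k=1))                # always prints 1, i.e. greedy decoding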
Model          Quantization   TTFT (ms)   token/s
Qwen3-0.6B     w8a16          670.51      12.88
Qwen3-1.7B     w8a16          796.38       7.38
Qwen2.5-0.5B   w4a16          -           27.05
Qwen2.5-1.5B   w4a16          -           15.06

(TTFT = time to first token; w8a16/w4a16 = 8-/4-bit weights with 16-bit activations.)