pdf-icon

Product Guide

Offline Voice Recognition

Industrial Control

IoT Measuring Instruments

Air Quality

Module13.2 PPS

Ethernet Camera

DIP Switch Usage Guide

Module GPS v2.0

Module GNSS

Module ExtPort For Core2

Module LoRa868 V1.2

Qwen2.5-VL-3B-Instruct

  1. Manually download the model and upload it to raspberrypi5, or pull the model repository using the following command.
Tip
If git lfs is not installed, please refer to the git lfs installation guide for installation.
git clone https://huggingface.co/AXERA-TECH/Qwen2.5-VL-3B-Instruct

File Description

m5stack@raspberrypi:~/rsp/Qwen2.5-VL-3B-Instruct $ ls -lh
total 10M
-rw-rw-r-- 1 m5stack m5stack    0 Aug 12 16:37 config.json
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 16:52 image
-rw-rw-r-- 1 m5stack m5stack 6.3M Aug 12 16:52 main
-rwxrwxr-x 1 m5stack m5stack  132 Aug 12 16:37 main_ax650
-rwxrwxr-x 1 m5stack m5stack 1.8M Aug 12 16:52 main_axcl_aarch64
-rwxrwxr-x 1 m5stack m5stack 1.8M Aug 12 16:52 main_axcl_x86
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 16:37 python
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 16:52 qwen2_5-vl-3b-image-ax650
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 16:47 Qwen2.5-VL-3B-Instruct-AX650-chunk_prefill_512
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 16:55 qwen2_5-vl-3b-video-ax650
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 16:37 qwen2_5-vl-tokenizer
-rw-rw-r-- 1 m5stack m5stack 9.0K Aug 12 16:37 qwen2_tokenizer_images.py
-rw-rw-r-- 1 m5stack m5stack  16K Aug 12 16:37 qwen2_tokenizer_video_308.py
-rw-rw-r-- 1 m5stack m5stack  14K Aug 12 16:37 README.md
-rwxrwxr-x 1 m5stack m5stack  708 Aug 12 16:37 run_qwen2_5_vl_image_axcl_aarch64.sh
-rwxrwxr-x 1 m5stack m5stack  704 Aug 12 16:37 run_qwen2_5_vl_image_axcl_x86.sh
-rwxrwxr-x 1 m5stack m5stack  780 Aug 12 16:37 run_qwen2_5_vl_image.sh
-rwxrwxr-x 1 m5stack m5stack  705 Aug 12 16:37 run_qwen2_5_vl_video_axcl_aarch64.sh
-rwxrwxr-x 1 m5stack m5stack  701 Aug 12 16:37 run_qwen2_5_vl_video_axcl_x86.sh
-rwxrwxr-x 1 m5stack m5stack  777 Aug 12 16:37 run_qwen2_5_vl_video.sh
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 16:37 video
Tip
If the qwen virtual environment has already been created previously, there is no need to create it again — simply activate it.
  1. Create a virtual environment
python -m venv qwen
  1. Activate the virtual environment
source qwen/bin/activate
  1. Install dependencies
pip install transformers jinja2
  1. Start the image tokenizer parser
python qwen2_tokenizer_images.py --port 12345
  1. Run the tokenizer service. The Host IP is set to localhost by default, and the port number should be set to 12345. Once running, the information will be as follows:
(qwen) m5stack@raspberrypi:~/rsp/Qwen2.5-VL-3B-Instruct $ python qwen2_tokenizer_images.py --port 12345
None None 151645 <|im_end|>
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151653, 74785, 419, 2168, 13, 151645, 198, 151644, 77091, 198]
281
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14990, 1879, 151645, 198, 151644, 77091, 198]
21
http://localhost:12345
Tip
The following operations require opening a new terminal on the raspberrypi.
  1. Set executable permissions
chmod +x main_axcl_aarch64 run_qwen2_5_vl_image_axcl_aarch64.sh
  1. Start the Qwen2.5-VL model image inference service
./run_qwen2_5_vl_image_axcl_aarch64.sh

Once started successfully, the information will be as follows:

m5stack@raspberrypi:~/rsp/Qwen2.5-VL-3B-Instruct $ ./run_qwen2_5_vl_image_axcl_aarch64.sh
[I][                            Init][ 162]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
  2% | █                                 |   1 /  39 [0.01s<0.20s, 200.00 count/s] tokenizer init ok
  [I][Init][  45]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  39 /  39 [72.58s<78.62s, 0.50 count/s] init post axmodel ok,remain_cmm(-1 MB)
input size: 1
    name: hidden_states [unknown] [unknown]
        1 x 1024 x 392 x 3 size:1204224


output size: 1
    name: hidden_states_out
        256 x 2048 size:2097152

[I][                            Init][ 267]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][                            Init][ 328]: image encoder output float32

[I][                            Init][ 340]: max_token_len : 1023
[I][                            Init][ 343]: kv_cache_size : 256, kv_cache_num: 1023
[I][                            Init][ 351]: prefill_token_num : 128
[I][                            Init][ 355]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 355]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 355]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 355]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 355]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 359]: prefill_max_token_num : 512
________________________
|    ID| remain cmm(MB)|
========================
|     0|             -1|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[E][                     load_config][ 278]: config file(post_config.json) open failed
[W][                            Init][ 452]: load postprocess config(post_config.json) failed
[I][                            Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> who are you?
image >>
[I][                             Run][ 625]: input token num : 23, prefill_split_num : 1
[I][                             Run][ 659]: input_num_token:23
[I][                             Run][ 796]: ttft: 558.68 ms
I am a large language model created by Alibaba Cloud. I am called Qwen.

[N][                             Run][ 949]: hit eos,avg 4.81 token/s

prompt >>
  1. Analyze the image.
prompt >> describe this picture
image >> image/ssd_car.jpg
[I][                          Encode][ 539]: image encode time : 773.948975 ms, size : 524288
[I][                             Run][ 625]: input token num : 280, prefill_split_num : 3
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:24
[I][                             Run][ 796]: ttft: 2076.95 ms
The picture shows a busy urban street scene in what appears to be a city center. A red double-decker bus is prominently featured in the foreground, with a large advertisement on its side that reads, "THINGS GET MORE EXITING WHEN YOU SAY 'YES' VirginMoney.co.uk." The bus is driving on the right side of the road, which is typical in the UK.

In the background, there are several pedestrians walking on the sidewalk, and a few cars are visible on the road. The buildings lining the street are modern and have large windows, suggesting a commercial or business district. The street is well-maintained, with clear pedestrian crossings and a clear separation between the road and the sidewalk.

The overall atmosphere of the image is bustling and typical of a busy city environment.

[N][                             Run][ 949]: hit eos,avg 4.83 token/s
Tip
The following operation requires stopping the previous tokenizer service by using CTRL + C to terminate it.
  1. Start the video tokenizer parser
python qwen2_tokenizer_video_308.py --port 12345
  1. Run the tokenizer service. The Host IP is set to localhost by default, and the port number should be set to 12345. Once running, the information will be as follows:
(qwen) m5stack@raspberrypi:~/rsp/Qwen2.5-VL-3B-Instruct $ python qwen2_tokenizer_video_308.py --port 12345
None None 151645 <|im_end|>
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151653, 53481, 100158, 99487, 87140, 104597, 151645, 198, 151644, 77091, 198]
510
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14990, 1879, 151645, 198, 151644, 77091, 198]
21
http://localhost:12345
Tip
The following operations require opening a new terminal on the raspberrypi.
  1. Set executable permissions
chmod +x main_axcl_aarch64 run_qwen2_5_vl_video_axcl_aarch64.sh
  1. Start the Qwen2.5-VL model video inference service
./run_qwen2_5_vl_video_axcl_aarch64.sh

Once started successfully, the information will be as follows:

m5stack@raspberrypi:~/rsp/Qwen2.5-VL-3B-Instruct $ ./run_qwen2_5_vl_video_axcl_aarch64.sh
[I][                            Init][ 162]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
  2% | █                                 |   1 /  39 [0.00s<0.12s, 333.33 count/s] tokenizer init ok
100% | ████████████████████████████████ |  39 /  39 [75.71s<77.70s, 0.50 count/s] init post axmodel ok,remain_cmm(3220 MB)
input size: 1
    name: hidden_states [unknown] [unknown]
        1 x 484 x 392 x 3 size:569184


output size: 1
    name: hidden_states_out
        121 x 2048 size:991232

[I][                            Init][ 267]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][                            Init][ 328]: image encoder output float32

[I][                            Init][ 340]: max_token_len : 1023
[I][                            Init][ 343]: kv_cache_size : 256, kv_cache_num: 1023
[I][                            Init][ 351]: prefill_token_num : 128
[I][                            Init][ 355]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 355]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 355]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 355]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 355]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 359]: prefill_max_token_num : 512
________________________
|    ID| remain cmm(MB)|
========================
|     0|             -1|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[E][                     load_config][ 278]: config file(post_config.json) open failed
[W][                            Init][ 452]: load postprocess config(post_config.json) failed
[I][                            Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> Describe the content of this video
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                          Encode][ 539]: image encode time : 1480.802002 ms, size : 991232
[I][                             Run][ 625]: input token num : 511, prefill_split_num : 4
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:127
[I][                             Run][ 796]: ttft: 3032.74 ms
The video depicts two ground squirrels engaging in a playful interaction on a rocky path. The squirrels are facing each other, and their front paws are raised, as if they are playfully bumping or scratching each other. The background features a scenic mountain landscape with green hills and a clear blue sky, suggesting a sunny day. The squirrels' fur is a mix of brown, black, and white, and their tails are bushy and fluffy. The overall atmosphere of the video is light-hearted and playful, capturing a moment of natural behavior in a beautiful outdoor setting.

[N][                             Run][ 949]: hit eos,avg 4.86 token/s

prompt >>
Model Quantization image encode (ms) tftt (ms) token/s
Qwen2.5-3B-VL w8a16 773.95 558.68 4.81
On This Page