SmolVLM2-500M-Video-Instruct

SmolVLM2-500M-Video-Instruct is a lightweight multimodal video-language model that understands video content, generates text about it, and supports instruction-based interaction.

  1. Manually download the model and upload it to the Raspberry Pi 5, or clone the model repository with the following command.
Note: If git lfs is not installed, install it first by following the git lfs installation instructions.
git clone https://huggingface.co/AXERA-TECH/SmolVLM2-500M-Video-Instruct
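The note above can be checked before cloning. The following is a minimal sketch, assuming that a standard git lfs installation places an executable named `git-lfs` on the PATH; the helper name `missing_tools` is hypothetical and not part of the model repository.

```python
import shutil


def missing_tools(tools=("git", "git-lfs")):
    """Return the tools from `tools` that cannot be found on PATH."""
    # Assumption: git lfs is installed as a separate `git-lfs` executable.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install before cloning:", ", ".join(missing))
    else:
        print("git and git-lfs found; the clone command can be run.")
```

If `git-lfs` is reported missing, a plain `git clone` may still succeed but leave small pointer files in place of the large model files.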

File Description:

m5stack@raspberrypi:~/rsp/SmolVLM2-500M-Video-Instruct $ ls -lh
total 40K
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 09:12 assets
-rw-rw-r-- 1 m5stack m5stack    0 Aug 12 09:12 config.json
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 09:12 embeds
-rw-rw-r-- 1 m5stack m5stack  10K Aug 12 09:12 infer_axmodel.py
-rw-rw-r-- 1 m5stack m5stack 2.5K Aug 12 09:12 README.md
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 09:12 smolvlm2_axmodel
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 09:12 smolvlm2_tokenizer
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 09:12 utils
drwxrwxr-x 2 m5stack m5stack 4.0K Aug 12 09:13 vit_model
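To confirm that the clone produced real files rather than git lfs pointer stubs, a check along the following lines may help. This is a sketch under stated assumptions: the helper names `is_lfs_pointer` and `check_repo` are hypothetical, the list of expected paths is taken from the listing above and from the run command used later in this guide, and the pointer detection relies on git lfs pointer files beginning with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of a git lfs pointer file (per the git lfs pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

# Paths referenced by the listing above and by the run command in this guide.
EXPECTED_PATHS = [
    "infer_axmodel.py",
    "assets/bee.jpg",
    "smolvlm2_axmodel",
    "smolvlm2_tokenizer",
    "vit_model/vision_model.axmodel",
]


def is_lfs_pointer(path):
    """Return True if the file starts with the git lfs pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def check_repo(root):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []
    for rel in EXPECTED_PATHS:
        path = root / rel
        if not path.exists():
            problems.append(f"missing: {rel}")
        elif path.is_file() and is_lfs_pointer(path):
            problems.append(f"LFS pointer, not real content: {rel}")
    return problems


if __name__ == "__main__":
    # Run from inside the cloned SmolVLM2-500M-Video-Instruct directory.
    for problem in check_repo("."):
        print(problem)
```

An empty result suggests the expected files are present; it does not verify that their contents are complete.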
  2. Create a virtual environment
python -m venv smolvlm
  3. Activate the virtual environment
source smolvlm/bin/activate
  4. Install dependencies
pip install https://github.com/AXERA-TECH/pyaxengine/releases/download/0.1.3.rc1/axengine-0.1.3-py3-none-any.whl
pip install transformers torch torchvision tqdm pillow num2words onnx onnxruntime
  5. Run
python infer_axmodel.py --axmodel_path smolvlm2_axmodel/ --vit_model vit_model/vision_model.axmodel --hf_model smolvlm2_tokenizer/ -i assets/bee.jpg
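Before launching the command above, it can be useful to confirm that the installed packages are importable in the active virtual environment, and to assemble the same command programmatically for other input images. The sketch below is illustrative only: the helper names `missing_modules` and `build_command` are hypothetical, the import names are assumed to correspond to the pip packages (for example, `pillow` is imported as `PIL`), and the flags are copied from the command above.

```python
import importlib.util
import sys

# Import names assumed to correspond to the pip packages installed earlier
# (pillow is imported as PIL; axengine comes from the pyaxengine wheel).
REQUIRED_MODULES = [
    "axengine", "transformers", "torch", "torchvision",
    "tqdm", "PIL", "num2words", "onnx", "onnxruntime",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


def build_command(image_path):
    """Build the inference command from this guide for a given image path."""
    return [
        sys.executable, "infer_axmodel.py",
        "--axmodel_path", "smolvlm2_axmodel/",
        "--vit_model", "vit_model/vision_model.axmodel",
        "--hf_model", "smolvlm2_tokenizer/",
        "-i", image_path,
    ]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        # Print the command; pass the list to subprocess.run to execute it.
        print(" ".join(build_command("assets/bee.jpg")))
```

Using `sys.executable` keeps the command tied to the interpreter of the activated `smolvlm` environment rather than whichever `python` happens to be first on the PATH.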

Test image: assets/bee.jpg

Run result:

(smolvlm) m5stack@raspberrypi:~/rsp/SmolVLM2-500M-Video-Instruct $ python infer_axmodel.py --axmodel_path smolvlm2_axmodel/ --vit_model vit_model/vision_model.axmodel --hf_model smolvlm2_tokenizer/ -i assets/bee.jpg
[INFO] Available providers:  ['AXCLRTExecutionProvider']
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 4.1-patch1-dirty 59be8f11-dirty
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
Init InferenceSession:   0%|                                                                                       | 0/32 [00:00<?, ?it/s][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 4.1-patch1-dirty 59be8f11-dirty
Init InferenceSession: 100%|██████████████████████████████████████████████████████████████████████████████| 32/32 [00:10<00:00,  2.98it/s]
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 4.1-patch1-dirty 59be8f11-dirty
Model loaded successfully!
slice_indices: [0, 1, 2, 3, 4, 5, 6, 7, 8]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
Slice prefill done: 3
Slice prefill done: 4
Slice prefill done: 5
Slice prefill done: 6
Slice prefill done: 7
Slice prefill done: 8
answer >>  The image depicts a close-up view of a pink flower with a bee on it. The bee, which appears to be a bumblebee, is perched on the flower's center, which is surrounded by a cluster of other flowers. The bee is in the process of collecting nectar from the flower, which is a common behavior for bees. The flower itself has a yellow center with a cluster of yellow stamens surrounding it. The petals of the flower are a vibrant shade of pink, and the bee is positioned very close to the camera, making it the focal point of the image. The background is slightly blurred, but it appears to be a garden or a field with other flowers and plants, contributing to the overall natural setting of the image.
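When the script is run from another program, the output shown above can be reduced to the parts of interest. The following is a minimal sketch, assuming the log keeps the format shown in this run (one `Slice prefill done: N` line per slice and a single-line answer prefixed with `answer >>`); the function name `summarize_log` is hypothetical and the format is not a documented interface of `infer_axmodel.py`.

```python
import re


def summarize_log(log_text):
    """Extract the prefill slice count and the answer text from a run log."""
    # One "Slice prefill done: N" line is printed per prefill slice.
    slices = re.findall(r"^Slice prefill done: (\d+)$", log_text, flags=re.MULTILINE)
    answer = None
    for line in log_text.splitlines():
        # Assumption: the answer is printed on a single line.
        if line.startswith("answer >>"):
            answer = line[len("answer >>"):].strip()
    return {"prefill_slices": len(slices), "answer": answer}


if __name__ == "__main__":
    sample = (
        "Model loaded successfully!\n"
        "Slice prefill done: 0\n"
        "Slice prefill done: 1\n"
        "answer >>  A bee on a flower.\n"
    )
    print(summarize_log(sample))
```

For the run shown above, this approach would report nine prefill slices, matching `slice_indices: [0, 1, 2, 3, 4, 5, 6, 7, 8]`.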