Sherpa-ONNX is a lightweight speech recognition and processing toolkit built on ONNX Runtime. It focuses on high-performance, low-latency streaming and offline speech recognition on local devices. The project supports AI Pyramid as an inference backend and can take advantage of NPU acceleration. For more details, refer to the official documentation.
Choose one of the following methods to obtain the speech recognition models:
Method 1: Manual Download
Visit the SenseVoice Model Repository to download the models, then upload them to the AI Pyramid device.
Method 2: Command-Line Clone
git clone https://huggingface.co/M5Stack/SenseVoiceSmall-axmodel
After cloning is complete, the directory structure is as follows:
root@m5stack-AI-Pyramid:~/SenseVoiceSmall-axmodel# ls -lh
total 328K
drwxr-xr-x 2 root root 4.0K Jan 9 17:27 ax630c
drwxr-xr-x 2 root root 4.0K Jan 9 17:27 ax650
-rw-r--r-- 1 root root 24 Jan 9 17:27 README.md
drwxr-xr-x 2 root root 4.0K Jan 9 17:27 test_wavs
-rw-r--r-- 1 root root 309K Jan 9 17:27 tokens.txt
Download the precompiled Sherpa-ONNX binary package:
Method 1: Manual Download
Visit the Sherpa-ONNX Release Page to download it.
Method 2: Command-Line Download
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.20/sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared.tar.bz2
tar xvf sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared.tar.bz2
rm sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared.tar.bz2
The extracted bin directory contains multiple precompiled executables. The main files are listed below:
root@m5stack-AI-Pyramid:~/SenseVoiceSmall-axmodel# ls -lh sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/
total 42M
-rwxr-xr-x 1 1001 1001 1.8M Dec 17 21:10 sherpa-onnx
-rwxr-xr-x 1 1001 1001 1.8M Dec 17 21:10 sherpa-onnx-alsa
-rwxr-xr-x 1 1001 1001 1.7M Dec 17 21:10 sherpa-onnx-alsa-offline
-rwxr-xr-x 1 1001 1001 1.9M Dec 17 21:10 sherpa-onnx-microphone
-rwxr-xr-x 1 1001 1001 1.9M Dec 17 21:10 sherpa-onnx-microphone-offline
-rwxr-xr-x 1 1001 1001 1.7M Dec 17 21:10 sherpa-onnx-offline
-rwxr-xr-x 1 1001 1001 1.8M Dec 17 21:10 sherpa-onnx-vad-with-offline-asr
-rwxr-xr-x 1 1001 1001 1.8M Dec 17 21:10 sherpa-onnx-vad-alsa-offline-asr
...
Run the following command to recognize a single audio file:
./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-offline --sense-voice-model=ax650/model-10-seconds.axmodel --tokens=tokens.txt --provider=axera test_wavs/en.wav
Example recognition output:
root@m5stack-AI-Pyramid:~/SenseVoiceSmall-axmodel# ./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-offline --sense-voice-model=ax650/model-10-seconds.axmodel --tokens=tokens.txt --provider=axera test_wavs/en.wav
/k2-fsa/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-offline --sense-voice-model=ax650/model-10-seconds.axmodel --tokens=tokens.txt --provider=axera test_wavs/en.wav
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(...))
Creating recognizer ...
recognizer created in 1.128 s
Started
Done!
test_wavs/en.wav
{"lang": "<|en|>", "emotion": "<|EMO_UNKNOWN|>", "event": "<|Speech|>", "text": "the tribal chieftain called for the boy and presented him with fifty pieces of gold", ...}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.119 s
Real time factor (RTF): 0.119 / 7.152 = 0.017
To perform segmented recognition on long audio files, you first need to obtain a VAD (Voice Activity Detection) model.
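As an aside on the single-file output above: the real-time factor (RTF) is the elapsed processing time divided by the duration of the audio, so values below 1 mean the recognizer runs faster than real time. The sketch below simply recomputes the figure from the two numbers printed in that log; the variable names are illustrative and not part of Sherpa-ONNX.

```shell
# Recompute the RTF from the figures printed in the example log.
elapsed=0.119     # "Elapsed seconds" reported by sherpa-onnx-offline
duration=7.152    # audio length in seconds, taken from the RTF line of the log

# RTF = processing time / audio duration, rounded to three decimals.
rtf=$(awk -v e="$elapsed" -v d="$duration" 'BEGIN { printf "%.3f", e / d }')
echo "RTF: $rtf"
```

The same division can be applied to the elapsed time of any other run, provided the audio duration is known.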
Download the VAD Model
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
Download Test Audio (or use a local audio file)
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav
Run Long Audio Recognition
./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-vad-with-offline-asr --silero-vad-model=silero_vad.onnx --sense-voice-model=ax650/model-10-seconds.axmodel --tokens=tokens.txt --provider=axera Obama.wav
Example recognition output:
root@m5stack-AI-Pyramid:~/SenseVoiceSmall-axmodel# ./sherpa-onnx-vad-with-offline-asr ...
9.286 -- 12.428: everybody all right everybody go ahead and have a seat
13.094 -- 14.988: how's everybody doing today
...
320.518 -- 323.660: and this isn't just important for your own life and your own future
324.678 -- 333.004: what you make of your education will decide nothing less than the future of this country the future of america depends on you
num threads: 2
decoding method: greedy_search
Elapsed seconds: 23.002 s
Real time factor (RTF): 0.069
Run the following command to list available audio devices on the system:
aplay -l
Example device list:
card 3: Audio_3 [AB13X USB Audio], device 0: USB Audio [USB Audio]
Run the following command to start real-time microphone recognition (modify the device name as needed):
./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-vad-alsa-offline-asr --silero-vad-model=silero_vad.onnx --sense-voice-model=ax650/model-10-seconds.axmodel --tokens=tokens.txt --provider=axera plughw:3,0
Replace plughw:3,0 (card 3, device 0 in the example device list) with the actual identifier of your audio device.
Example real-time recognition output:
Started. Please speak
0: hello
1: how are you
^C
Caught Ctrl + C. Exiting...
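The device identifier passed to the microphone command follows the ALSA pattern plughw:CARD,DEVICE, using the card and device numbers from the device list. As a minimal sketch, the helper below derives that identifier from one line of the listing. The function name alsa_id_from_line is hypothetical, and the sample line is copied from the example device list above.

```shell
# Hypothetical helper: convert one "card N: ..., device M: ..." line
# from the ALSA device list into the identifier plughw:N,M.
alsa_id_from_line() {
  printf '%s\n' "$1" |
    sed -n 's/^card \([0-9][0-9]*\):.*, device \([0-9][0-9]*\):.*/plughw:\1,\2/p'
}

# Sample line taken from the example device list in this guide.
line='card 3: Audio_3 [AB13X USB Audio], device 0: USB Audio [USB Audio]'
device=$(alsa_id_from_line "$line")
echo "$device"
```

The resulting value can then be passed as the last argument of sherpa-onnx-vad-alsa-offline-asr in place of the hard-coded plughw:3,0. Note that aplay -l lists playback devices; arecord -l prints capture devices in the same line format, which may be the more relevant list when the microphone and speaker are on different cards.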