AI Pyramid - Sherpa-ONNX Speech Recognition

Sherpa-ONNX is a lightweight speech recognition and processing toolkit built on ONNX Runtime. It targets high-performance, low-latency streaming and offline speech recognition on local devices. The project supports AI Pyramid as an inference backend and can use the NPU for acceleration. For more details, refer to the official Sherpa-ONNX documentation.

1. Obtain Model Files

Choose one of the following methods to obtain the speech recognition models:

Method 1: Manual Download

Visit the SenseVoice Model Repository to download the models, then upload them to the AI Pyramid device.

Method 2: Command-Line Clone

Dependency Check
If git lfs is not installed on your system, refer to the git lfs Installation Guide to install it first.

git clone https://huggingface.co/M5Stack/SenseVoiceSmall-axmodel

1.1 Model File Description

After cloning is complete, the directory structure is as follows:

root@m5stack-AI-Pyramid:~/SenseVoiceSmall-axmodel# ls -lh
total 328K
drwxr-xr-x 2 root root 4.0K Jan  9 17:27 ax630c
drwxr-xr-x 2 root root 4.0K Jan  9 17:27 ax650
-rw-r--r-- 1 root root   24 Jan  9 17:27 README.md
drwxr-xr-x 2 root root 4.0K Jan  9 17:27 test_wavs
-rw-r--r-- 1 root root 309K Jan  9 17:27 tokens.txt
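
If the repository is cloned without git lfs, the large model files can end up as small Git LFS pointer stubs instead of real model data. The sketch below is one way to check for that. It relies on the fact that a pointer file starts with the line `version https://git-lfs.github.com/spec/v1`. The names `is_lfs_pointer` and `check_models` are hypothetical helpers, and the `*.axmodel` glob for both chip directories is an assumption about the file names inside them.

```shell
#!/bin/sh
# is_lfs_pointer FILE: succeed if FILE looks like a Git LFS pointer stub
# (a short text file) rather than the real binary content.
is_lfs_pointer() {
  head -c 200 "$1" 2>/dev/null |
    grep -q '^version https://git-lfs\.github\.com/spec/v1'
}

# check_models: run from the repository root; report every model file
# that is still a pointer stub. The *.axmodel glob is an assumption.
check_models() {
  for f in ax650/*.axmodel ax630c/*.axmodel; do
    [ -e "$f" ] || continue          # skip globs that matched nothing
    if is_lfs_pointer "$f"; then
      echo "LFS pointer, not downloaded: $f"
    fi
  done
}

check_models
```

If any file is reported, installing git lfs and running `git lfs pull` inside the repository should fetch the real content.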

2. Obtain Precompiled Binaries

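Download the precompiled Sherpa-ONNX binary package into the SenseVoiceSmall-axmodel directory, because the commands in the following sections use paths relative to that directory: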

Method 1: Manual Download

Visit the Sherpa-ONNX Release Page to download it.

Method 2: Command-Line Download

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.20/sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared.tar.bz2

2.1 Extract the Files

tar xvf sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared.tar.bz2
rm sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared.tar.bz2

2.2 Executable File Description

The extracted bin directory contains multiple precompiled executables. The main files are listed below:

root@m5stack-AI-Pyramid:~/SenseVoiceSmall-axmodel# ls -lh sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/
total 42M
-rwxr-xr-x 1 1001 1001 1.8M Dec 17 21:10 sherpa-onnx
-rwxr-xr-x 1 1001 1001 1.8M Dec 17 21:10 sherpa-onnx-alsa
-rwxr-xr-x 1 1001 1001 1.7M Dec 17 21:10 sherpa-onnx-alsa-offline
-rwxr-xr-x 1 1001 1001 1.9M Dec 17 21:10 sherpa-onnx-microphone
-rwxr-xr-x 1 1001 1001 1.9M Dec 17 21:10 sherpa-onnx-microphone-offline
-rwxr-xr-x 1 1001 1001 1.7M Dec 17 21:10 sherpa-onnx-offline
-rwxr-xr-x 1 1001 1001 1.8M Dec 17 21:10 sherpa-onnx-vad-with-offline-asr
-rwxr-xr-x 1 1001 1001 1.8M Dec 17 21:10 sherpa-onnx-vad-alsa-offline-asr
...
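
Every later command spells out the full path to the bin directory, so it can be convenient to put that directory on PATH for the current shell session. This is an optional sketch and not part of the official instructions. `SHERPA_BIN` is a hypothetical variable name, and the path assumes the archive was extracted in the current directory.

```shell
#!/bin/sh
# Directory created by extracting the v1.12.20 archive in the current directory.
SHERPA_BIN="$PWD/sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin"

# Prepend it to PATH so the tools can be called by name in this session.
PATH="$SHERPA_BIN:$PATH"
export PATH

# Show which sherpa-onnx-offline would run, or say that none was found.
command -v sherpa-onnx-offline ||
  echo "sherpa-onnx-offline not found in $SHERPA_BIN"
```

The change lasts only for the current shell. The remaining examples keep the full path so that they work without this step.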

3. Speech Recognition Examples

3.1 Offline File Recognition

Run the following command to recognize a single audio file:

./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-offline \
  --sense-voice-model=ax650/model-10-seconds.axmodel \
  --tokens=tokens.txt \
  --provider=axera \
  test_wavs/en.wav

Example recognition output:

root@m5stack-AI-Pyramid:~/SenseVoiceSmall-axmodel# ./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-offline --sense-voice-model=ax650/model-10-seconds.axmodel --tokens=tokens.txt --provider=axera test_wavs/en.wav
/k2-fsa/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-offline --sense-voice-model=ax650/model-10-seconds.axmodel --tokens=tokens.txt --provider=axera test_wavs/en.wav

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(...))
Creating recognizer ...
recognizer created in 1.128 s
Started
Done!

test_wavs/en.wav
{"lang": "<|en|>", "emotion": "<|EMO_UNKNOWN|>", "event": "<|Speech|>", "text": "the tribal chieftain called for the boy and presented him with fifty pieces of gold", ...}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.119 s
Real time factor (RTF): 0.119 / 7.152 = 0.017
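
The last line of the output reports the real-time factor (RTF), which is the processing time divided by the audio duration. A value below 1 means recognition runs faster than real time. The sketch below recomputes that figure and also pulls the `text` field out of a result line. Both `rtf` and `transcript_text` are hypothetical helper names, the sample line is trimmed to the fields shown above, and the `sed` pattern assumes the transcript contains no escaped double quotes.

```shell
#!/bin/sh
# rtf ELAPSED_SECONDS AUDIO_SECONDS: print elapsed/audio to three decimals.
rtf() {
  awk -v e="$1" -v d="$2" 'BEGIN { printf "%.3f\n", e / d }'
}

# transcript_text: read result lines on stdin and print only the value of
# the "text" field. Assumes the text has no escaped double quotes.
transcript_text() {
  sed -n 's/.*"text": "\([^"]*\)".*/\1/p'
}

# Values taken from the example output above.
rtf 0.119 7.152    # prints 0.017

# Sample result line, trimmed to the fields shown in the example output.
sample='{"lang": "<|en|>", "emotion": "<|EMO_UNKNOWN|>", "event": "<|Speech|>", "text": "the tribal chieftain called for the boy and presented him with fifty pieces of gold"}'
printf '%s\n' "$sample" | transcript_text
```

For anything beyond a quick check, a real JSON parser is a safer choice than `sed`.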

3.2 Long Audio Segmented Recognition with VAD

To recognize long audio files, first obtain a VAD (Voice Activity Detection) model. The VAD splits the recording into speech segments, and each segment is then recognized separately.

Download the VAD Model

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

Download Test Audio (or use a local audio file)

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav

Run Long Audio Recognition

./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-vad-with-offline-asr \
  --silero-vad-model=silero_vad.onnx \
  --sense-voice-model=ax650/model-10-seconds.axmodel \
  --tokens=tokens.txt \
  --provider=axera \
  Obama.wav

Example recognition output:

root@m5stack-AI-Pyramid:~/SenseVoiceSmall-axmodel# ./sherpa-onnx-vad-with-offline-asr ...
9.286 -- 12.428: everybody all right everybody go ahead and have a seat
13.094 -- 14.988: how's everybody doing today
...
320.518 -- 323.660: and this isn't just important for your own life and your own future
324.678 -- 333.004: what you make of your education will decide nothing less than the future of this country the future of america depends on you
num threads: 2
decoding method: greedy_search
Elapsed seconds: 23.002 s
Real time factor (RTF): 0.069
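
Each result line has the form `start -- end: text`, with both times given in seconds. As a rough sketch of post-processing this output, the hypothetical `segment_durations` helper below prints the start, end, and length of each segment. It assumes the timestamps appear exactly in the format shown above.

```shell
#!/bin/sh
# segment_durations: read "START -- END: text" lines on stdin and print
# START, END, and the segment length in seconds, separated by tabs.
# Lines whose second field is not "--" (logs, statistics) are ignored.
segment_durations() {
  awk '$2 == "--" {
    s = $1
    e = $3
    sub(/:$/, "", e)                 # drop the colon that follows END
    printf "%s\t%s\t%.3f\n", s, e, e - s
  }'
}

# Two lines copied from the example output above.
segment_durations <<'EOF'
9.286 -- 12.428: everybody all right everybody go ahead and have a seat
13.094 -- 14.988: how's everybody doing today
num threads: 2
EOF
```

With the two sample lines, the segment lengths come out as 3.142 and 1.894 seconds, and the `num threads` line is skipped.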

4. Real-Time Microphone Recognition

4.1 Hardware Requirements

Hardware Note
The official Sherpa-ONNX examples currently do not support the built-in four-channel microphone on AI Pyramid. An external USB microphone or other compatible audio input device is required.

4.2 List Microphone Devices

Run the following command to list the audio capture devices available on the system:

arecord -l

Example device list:

card 3: Audio_3 [AB13X USB Audio], device 0: USB Audio [USB Audio]

4.3 Start Real-Time Recognition

Run the following command to start real-time microphone recognition (modify the device name as needed):

./sherpa-onnx-v1.12.20-axera-ax650-linux-aarch64-shared/bin/sherpa-onnx-vad-alsa-offline-asr \
  --silero-vad-model=silero_vad.onnx \
  --sense-voice-model=ax650/model-10-seconds.axmodel \
  --tokens=tokens.txt \
  --provider=axera \
  plughw:3,0

Device Selection
Replace plughw:3,0 with the actual identifier of your audio device.
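
The two numbers in the ALSA identifier come straight from the device listing: `card 3` and `device 0` become `plughw:3,0`. A minimal sketch of that mapping, using a hypothetical `alsa_device` helper and the sample line from section 4.2:

```shell
#!/bin/sh
# alsa_device: read ALSA device listing lines on stdin and print one
# plughw:CARD,DEVICE identifier for every "card N: ..., device M: ..." line.
alsa_device() {
  sed -n 's/^card \([0-9][0-9]*\):.*, device \([0-9][0-9]*\):.*/plughw:\1,\2/p'
}

# Sample line from the device list in section 4.2.
echo 'card 3: Audio_3 [AB13X USB Audio], device 0: USB Audio [USB Audio]' |
  alsa_device    # prints plughw:3,0
```

If several devices are listed, the helper prints one identifier per line, and you still need to pick the one that belongs to your microphone.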

Example real-time recognition output:

Started. Please speak
0: hello
1: how are you
^C
Caught Ctrl + C. Exiting...