Speech to Text

Convert input audio into text through the API.

Preparation

Before running the example program, install the corresponding model package on the device. For installation instructions, see the Model List section; for detailed model descriptions, see the Model Introduction section.

Complete the following steps on the LLM device before running the example program:

Use the apt package manager to install the SenseVoice model package.

apt install llm-model-sense-voice-small-10s-ax650

Install the ffmpeg tool.

apt install ffmpeg

After installation, restart the OpenAI service so that the new model takes effect.

systemctl restart llm-openai-api

Example Program

On the PC, use the OpenAI API to send an audio file to the device for transcription. Before running the example program, replace the IP address in base_url below with the actual IP address of the device.

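from openai import OpenAI

# Point the client at the OpenAI service running on the device.
client = OpenAI(
    api_key="sk-",
    base_url="http://192.168.20.186:8000/v1"
)

# Open the audio in binary mode; the context manager closes it after the upload.
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="sense-voice-small-10s-ax650",
        file=audio_file
    )

print(transcript)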
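If the device address changes often, the request can be wrapped in a small helper. The following is a minimal sketch rather than part of the official example: the function names build_base_url and transcribe are hypothetical, and port 8000 is assumed from the base_url shown above.

```python
def build_base_url(device_ip: str, port: int = 8000) -> str:
    """Return the service endpoint for a device address (port 8000 is assumed)."""
    return f"http://{device_ip}:{port}/v1"


def transcribe(path: str, device_ip: str) -> str:
    """Upload one audio file and return only the recognized text."""
    # Imported here so build_base_url can be used without the openai package.
    from openai import OpenAI

    client = OpenAI(api_key="sk-", base_url=build_base_url(device_ip))
    # The context manager closes the file once the upload has finished.
    with open(path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="sense-voice-small-10s-ax650",
            file=audio_file,
        )
    return transcript.text
```

With the device reachable, calling transcribe("speech.mp3", "192.168.20.186") should return the transcribed text as a plain string.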

Request Parameters

Parameter Name	Type	Required	Example Value	Description
file	file	Yes	-	Audio file object to be transcribed (not the file name). Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
model	string	Yes	sense-voice-small-10s-ax650	Name of the model to use. SenseVoice models support automatic multilingual recognition, including Chinese, English, Japanese, Cantonese, and Korean, among others.
language	string	No	-	The model detects the language automatically.
response_format	string	No	json	Response format. Currently, only `json` is supported, and it is the default value.
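Because the file parameter accepts only the formats listed above, it can help to check the file extension on the PC before uploading. The sketch below is one possible client-side check; the helper name is_supported_audio is hypothetical, and an extension is only a hint about the real container format, so the service may still reject a mislabeled file.

```python
from pathlib import Path

# Formats taken from the description of the file parameter above.
SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension matches one of the listed formats."""
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_FORMATS
```

Files that fail this check could be converted to a supported format first, for example with ffmpeg.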

Response Example

Transcription(text=' Thank you. Thank you everybody. All right everybody go ahead and have a seat. How\'s everybody doing today? .....',
logprobs=None, task='transcribe', language='en', duration=334.234, segments=12, sample_rate=16000, channels=1, bit_depth=16) 
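The fields of the response can be read individually instead of printing the whole object. The following sketch assumes that the returned object exposes the fields shown above (text, language, duration) as attributes; the helper name describe is hypothetical, and a stand-in object replaces the real response so the snippet runs without a device.

```python
from types import SimpleNamespace


def describe(transcript) -> str:
    """Build a one-line summary from the fields of a transcription result."""
    # The example response starts with a space, so strip the text.
    text = getattr(transcript, "text", "").strip()
    language = getattr(transcript, "language", None) or "unknown"
    duration = getattr(transcript, "duration", None)
    length = f"{duration:.1f}s" if duration is not None else "unknown length"
    return f"[{language}, {length}] {text}"


# Stand-in for a real response, using values from the example above.
sample = SimpleNamespace(
    text=" Thank you. Thank you everybody.", language="en", duration=334.234
)
print(describe(sample))  # [en, 334.2s] Thank you. Thank you everybody.
```

Using getattr with a default keeps the helper usable if a field is missing from the response.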
