CosyVoice2 is a high-quality speech synthesis system based on large language models, capable of generating natural and fluent speech. This document provides a complete invocation method compatible with the OpenAI API. Users can get started quickly by installing the corresponding StackFlow software packages.
Refer to AI Pyramid Software Package Update to complete the installation of the following dependency packages and models:
Update the package index:

```shell
apt update
```

Install the core dependency packages:

```shell
apt install lib-llm llm-sys llm-cosy-voice llm-openai-api
```

Install the CosyVoice2 model:

```shell
apt install llm-model-cosyvoice2-0.5b-ax650
```

Run the `systemctl restart llm-openai-api` command to update the model list.

Synthesize speech with curl:

```shell
curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CosyVoice2-0.5B-ax650",
    "response_format": "wav",
    "input": "But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee."
  }' \
  -o output.wav
```

Or with the OpenAI Python client:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key="sk-",
    base_url="http://127.0.0.1:8000/v1"
)

speech_file_path = Path(__file__).parent / "output.wav"

with client.audio.speech.with_streaming_response.create(
    model="CosyVoice2-0.5B-ax650",
    voice="prompt_data",
    response_format="wav",
    input='But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.',
) as response:
    response.stream_to_file(speech_file_path)
```

Choose one of the following methods to obtain the CosyVoice2 cloning scripts:
Method 1: Manual Download
Visit the CosyVoice2 Script Repository, download the scripts, and upload them to the AI Pyramid device.
Method 2: Command-Line Clone
```shell
git clone --recurse-submodules https://huggingface.co/M5Stack/CosyVoice2-scripts
```

After cloning completes, the directory structure is as follows:

```
root@m5stack-AI-Pyramid:~/CosyVoice2-scripts# ls -lh
total 28K
drwxr-xr-x 2 root root 4.0K Jan  9 10:26 asset
drwxr-xr-x 2 root root 4.0K Jan  9 10:26 CosyVoice-BlankEN
drwxr-xr-x 2 root root 4.0K Jan  9 10:27 frontend-onnx
drwxr-xr-x 3 root root 4.0K Jan  9 10:26 pengzhendong
-rw-r--r-- 1 root root   24 Jan  9 10:26 README.md
-rw-r--r-- 1 root root  103 Jan  9 10:26 requirements.txt
drwxr-xr-x 3 root root 4.0K Jan  9 10:26 scripts
```

Enter the CosyVoice2-scripts directory.
```shell
cd CosyVoice2-scripts/
```

Install Python virtual-environment support, then create and activate an environment:

```shell
apt install python3.10-venv
python3 -m venv cosyvoice
source cosyvoice/bin/activate
```

Install the Python dependencies (CPU builds of PyTorch):

```shell
pip3 install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
```

Run the voice processing script to generate voice feature files:
```shell
python3 scripts/process_prompt.py --prompt_text asset/zh_woman1.txt --prompt_speech asset/zh_woman1.wav --output zh_woman1
```

Example output after a successful run:

```
(cosyvoice) root@m5stack-AI-Pyramid:~/CosyVoice2-scripts# python3 scripts/process_prompt.py --prompt_text asset/zh_woman1.txt --prompt_speech asset/zh_woman1.wav --output zh_woman1
2026-01-09 10:41:18.655905428 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"
prompt_text 希望你以后能够做的比我还好呦。
fmax 8000
prompt speech token size: torch.Size([1, 87])
```

(The `prompt_text` line echoes the reference prompt, Chinese for "I hope you'll do even better than me.")

Copy the processed voice feature files to the model data directory:
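Before running the copy command, it can help to confirm that the feature directory was actually produced and is non-empty; the sketch below uses only the standard library, and `voice_dir_ready` is an illustrative helper name, not part of the CosyVoice2 scripts:

```python
from pathlib import Path

def voice_dir_ready(path):
    """Return True if the voice feature directory exists and contains files."""
    p = Path(path)
    return p.is_dir() and any(p.iterdir())

# Example (hypothetical local path):
# voice_dir_ready("zh_woman1")
```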
```shell
cp -r zh_woman1 /opt/m5stack/data/CosyVoice2-0.5B-ax650/
```

Restart the model service to load the new voice configuration:

```shell
systemctl restart llm-sys
```

You can also point the `prompt_dir` field in the `/opt/m5stack/data/models/mode_CosyVoice2-0.5B-ax650.json` file to the new voice directory. Each time the voice is replaced, the model service needs to be reinitialized.

Synthesize speech with the cloned voice via curl:

```shell
curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CosyVoice2-0.5B-ax650",
    "voice": "zh_woman1",
    "response_format": "wav",
    "input": "But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee."
  }' \
  -o output.wav
```

Or with the OpenAI Python client:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key="sk-",
    base_url="http://127.0.0.1:8000/v1"
)

speech_file_path = Path(__file__).parent / "output.wav"

with client.audio.speech.with_streaming_response.create(
    model="CosyVoice2-0.5B-ax650",
    voice="zh_woman1",
    response_format="wav",
    input='But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.',
) as response:
    response.stream_to_file(speech_file_path)
```

Running the interactive example below requires the OpenAI Python package and a restart of the LLM services.
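The `voice` value passed to the API corresponds to a feature directory under the model data directory (an assumption based on the copy step above). A minimal sketch for enumerating the voices installed on the device; `list_voice_dirs` is an illustrative helper name:

```python
from pathlib import Path

def list_voice_dirs(data_dir):
    """Return sorted names of subdirectories, each a candidate `voice` value."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())

# Example (device path from the copy step above):
# list_voice_dirs("/opt/m5stack/data/CosyVoice2-0.5B-ax650/")
```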
Install the package and restart the services:

```shell
pip3 install openai
systemctl restart llm-*
```

```python
# main.py
from pathlib import Path
from openai import OpenAI
import subprocess


def main():
    # Initialize the OpenAI client
    client = OpenAI(
        api_key="sk-",  # Replace with your actual API key
        base_url="http://127.0.0.1:8000/v1"
    )

    # Temporary file paths
    base_dir = Path(__file__).parent
    raw_audio_path = base_dir / "temp_raw.wav"
    transcoded_audio_path = base_dir / "temp_48k_stereo.wav"

    print("=== Interactive Speech Synthesis Mode ===")
    print("Enter text and press Enter to generate speech.")
    print("Type 'quit' or 'exit' to stop.\n")

    while True:
        # 1. Read user input
        input_text = input("Enter text (quit/exit to stop): ").strip()

        # Exit condition
        if input_text.lower() in ["quit", "exit"]:
            print("Exiting program...")
            break

        if not input_text:
            print("Error: Input text cannot be empty.\n")
            continue

        try:
            # 2. Generate raw audio from the TTS API
            print("Generating speech...")
            with client.audio.speech.with_streaming_response.create(
                model="CosyVoice2-0.5B-ax650",
                voice="zh_woman1",
                response_format="wav",
                input=input_text,
            ) as response:
                response.stream_to_file(raw_audio_path)

            # 3. Transcode to 48 kHz stereo WAV using FFmpeg
            print("Transcoding audio...")
            ffmpeg_cmd = [
                "ffmpeg",
                "-y",            # Overwrite output file if it exists
                "-i", str(raw_audio_path),
                "-ar", "48000",  # Set sample rate to 48 kHz
                "-ac", "2",      # Set channel count to stereo
                "-f", "wav",
                str(transcoded_audio_path)
            ]
            # Remove stdout/stderr redirection if you need FFmpeg logs for debugging
            subprocess.run(
                ffmpeg_cmd,
                check=True,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL
            )

            # 4. Play the transcoded audio with tinyplay
            print("Playing audio...\n")
            tinyplay_cmd = ["tinyplay", str(transcoded_audio_path)]
            subprocess.run(tinyplay_cmd, check=True)
        except subprocess.CalledProcessError as e:
            print(
                f"Command execution failed. Please make sure FFmpeg and tinyplay "
                f"are installed and available in PATH: {e}\n"
            )
        except Exception as e:
            print(f"An error occurred: {e}\n")
        finally:
            # Remove temporary files
            raw_audio_path.unlink(missing_ok=True)
            transcoded_audio_path.unlink(missing_ok=True)


if __name__ == "__main__":
    main()
```
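After the FFmpeg step in the script above, the temporary file should be a 48 kHz stereo WAV. A small check using only the standard library `wave` module can confirm the transcode settings took effect; `wav_params` is an illustrative helper, not part of the script:

```python
import wave

def wav_params(path):
    """Return (channels, sample_rate) read from a WAV file's header."""
    with wave.open(str(path), "rb") as w:
        return w.getnchannels(), w.getframerate()

# Example: after the transcode step in main.py,
# wav_params("temp_48k_stereo.wav") is expected to be (2, 48000).
```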