
AI Pyramid - CosyVoice2 Voice Cloning

CosyVoice2 is a high-quality speech synthesis system built on large language models that generates natural, fluent speech. This document describes a complete invocation workflow compatible with the OpenAI API; users can get started quickly by installing the corresponding StackFlow software packages.

1. Preparation

Refer to AI Pyramid Software Package Update to complete the installation of the following dependency packages and models.

Update the package index:

apt update

Install core dependency packages:

apt install lib-llm llm-sys llm-cosy-voice llm-openai-api

Install the CosyVoice2 model:

apt install llm-model-cosyvoice2-0.5b-ax650
Model Update Notice
Each time you install a new model, you must manually run systemctl restart llm-openai-api to refresh the model list.
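After the restart, you can confirm that the model was registered by listing the models exposed by the OpenAI-compatible endpoint. The sketch below assumes the service listens on 127.0.0.1:8000, as in the invocation examples later in this document:

```python
import json
import urllib.request


def list_model_ids(base_url: str) -> list[str]:
    """Return the ids of all models exposed by an OpenAI-compatible endpoint.

    The /models path follows the OpenAI API convention; the host and port
    are assumptions matching the examples in this document.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    # After `systemctl restart llm-openai-api`, the CosyVoice2 model
    # should appear in this list.
    try:
        for model_id in list_model_ids("http://127.0.0.1:8000/v1"):
            print(model_id)
    except OSError:
        print("llm-openai-api service is not reachable on 127.0.0.1:8000")
```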
Performance Notes
CosyVoice2 is a high-performance neural speech synthesis model. Although it produces natural, fluent speech, it has limitations on resource-constrained devices: generated audio is capped at 27 seconds, and the initial model load can take a relatively long time. Plan the expected audio length around these limits for your application scenario.
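The 27-second ceiling matters for long passages. One pragmatic workaround (a sketch, not part of the official API) is to split long input on sentence boundaries and synthesize each chunk separately; the character budget below is a hypothetical proxy for audio length that you would tune against your actual speaking rate:

```python
import re


def chunk_text(text: str, max_chars: int = 150) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars.

    max_chars is a rough stand-in for the ~27-second audio ceiling; tune
    it to your language and speaking rate. A single sentence longer than
    max_chars is emitted on its own rather than split mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?。！？])\s*", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in filter(None, sentences):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent through the /v1/audio/speech endpoint as in the examples below, and the resulting WAV files concatenated.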

2. Basic Invocation Examples

Using Curl

curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CosyVoice2-0.5B-ax650",
    "response_format": "wav",
    "input": "But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.
"
  }' \
  -o output.wav

Using Python

from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key="sk-",
    base_url="http://127.0.0.1:8000/v1"
)

speech_file_path = Path(__file__).parent / "output.wav"
with client.audio.speech.with_streaming_response.create(
  model="CosyVoice2-0.5B-ax650",
  voice="prompt_data",
  response_format="wav",
  input='But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.',
) as response:
  response.stream_to_file(speech_file_path)

3. Voice Cloning

3.1 Obtain Cloning Scripts

Choose one of the following methods to obtain the CosyVoice2 cloning scripts:

Method 1: Manual Download

Visit the CosyVoice2 Script Repository to download the scripts, then upload them to the AI Pyramid device.

Method 2: Command-Line Clone

Dependency Check
If git lfs is not installed on the system, refer to the git lfs Installation Guide to install it.

git clone --recurse-submodules https://huggingface.co/M5Stack/CosyVoice2-scripts

3.2 Directory Structure Description

After cloning is complete, the directory structure is as follows:

root@m5stack-AI-Pyramid:~/CosyVoice2-scripts# ls -lh
total 28K
drwxr-xr-x 2 root root 4.0K Jan  9 10:26 asset
drwxr-xr-x 2 root root 4.0K Jan  9 10:26 CosyVoice-BlankEN
drwxr-xr-x 2 root root 4.0K Jan  9 10:27 frontend-onnx
drwxr-xr-x 3 root root 4.0K Jan  9 10:26 pengzhendong
-rw-r--r-- 1 root root   24 Jan  9 10:26 README.md
-rw-r--r-- 1 root root  103 Jan  9 10:26 requirements.txt
drwxr-xr-x 3 root root 4.0K Jan  9 10:26 scripts

3.3 Process Audio Samples

Step 1: Create a Virtual Environment

Enter the CosyVoice2-scripts directory.

cd CosyVoice2-scripts/

First-Time Operation
If you have never created a Python virtual environment on this device, first run apt install python3.10-venv.

python3 -m venv cosyvoice

Step 2: Activate the Virtual Environment

source cosyvoice/bin/activate

Step 3: Install Dependency Packages

pip3 install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

Step 4: Run the Processing Script

Run the voice processing script to generate voice feature files:

python3 scripts/process_prompt.py --prompt_text asset/zh_woman1.txt --prompt_speech asset/zh_woman1.wav --output zh_woman1

Example output after successful script execution:

(cosyvoice) root@m5stack-AI-Pyramid:~/CosyVoice2-scripts# python3 scripts/process_prompt.py --prompt_text asset/zh_woman1.txt --prompt_speech asset/zh_woman1.wav --output zh_woman1
2026-01-09 10:41:18.655905428 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"
prompt_text 希望你以后能够做的比我还好呦。
fmax 8000
prompt speech token size: torch.Size([1, 87])
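If you have several voice samples to process, the same command line can be generated per sample. The helper below is a sketch that assumes the naming convention from the example above: each asset/<name>.wav has a matching asset/<name>.txt transcript, and the feature directory reuses <name>:

```python
import subprocess
from pathlib import Path


def build_commands(asset_dir: str) -> list[list[str]]:
    """Build one process_prompt.py invocation per <name>.wav/<name>.txt pair.

    Pairs without a transcript are silently skipped.
    """
    commands = []
    for wav in sorted(Path(asset_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            commands.append([
                "python3", "scripts/process_prompt.py",
                "--prompt_text", str(txt),
                "--prompt_speech", str(wav),
                "--output", wav.stem,
            ])
    return commands


if __name__ == "__main__":
    # Run inside CosyVoice2-scripts/ with the virtual environment active.
    for cmd in build_commands("asset"):
        subprocess.run(cmd, check=True)
```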

3.4 Deploy Voice Data to the Model Directory

Copy the processed voice feature files to the model data directory:

cp -r zh_woman1 /opt/m5stack/data/CosyVoice2-0.5B-ax650/

Restart the model service to load the new voice configuration:

systemctl restart llm-sys

Voice Replacement Notes
To replace the default cloned voice, modify the prompt_dir field in /opt/m5stack/data/models/mode_CosyVoice2-0.5B-ax650.json to point to the new voice directory. The model service must be reinitialized each time the voice is replaced.
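The edit can also be scripted. The sketch below takes the config path and the prompt_dir field name from the note above, but the exact JSON layout (a flat, top-level "prompt_dir" key) is an assumption; inspect the file on your device before relying on it:

```python
import json
from pathlib import Path

# Config path from the note above; the top-level "prompt_dir" key is an
# assumption about the file layout -- verify it on your device first.
CONFIG = Path("/opt/m5stack/data/models/mode_CosyVoice2-0.5B-ax650.json")


def set_prompt_dir(config_path: Path, new_dir: str) -> None:
    """Rewrite the model config so prompt_dir points at a new voice directory."""
    config = json.loads(config_path.read_text())
    config["prompt_dir"] = new_dir
    config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
```

After calling set_prompt_dir(CONFIG, "/opt/m5stack/data/CosyVoice2-0.5B-ax650/zh_woman1"), restart the model service (systemctl restart llm-sys) so the new voice is loaded.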

4. Invoke Using the Cloned Voice

Using Curl

curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CosyVoice2-0.5B-ax650",
    "voice": "zh_woman1",
    "response_format": "wav",
    "input": "But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.
"
  }' \
  -o output.wav

Using Python

from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key="sk-",
    base_url="http://127.0.0.1:8000/v1"
)

speech_file_path = Path(__file__).parent / "output.wav"
with client.audio.speech.with_streaming_response.create(
  model="CosyVoice2-0.5B-ax650",
  voice="zh_woman1",
  response_format="wav",
  input='But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.',
) as response:
  response.stream_to_file(speech_file_path)

Demonstration

Running this demo requires installing the OpenAI Python package and restarting the LLM services.

pip3 install openai
systemctl restart llm-*
# main.py

from pathlib import Path
from openai import OpenAI
import subprocess


def main():
    # Initialize the OpenAI client
    client = OpenAI(
        api_key="sk-",  # Replace with your actual API key
        base_url="http://127.0.0.1:8000/v1"
    )

    # Temporary file paths
    base_dir = Path(__file__).parent
    raw_audio_path = base_dir / "temp_raw.wav"
    transcoded_audio_path = base_dir / "temp_48k_stereo.wav"

    print("=== Interactive Speech Synthesis Mode ===")
    print("Enter text and press Enter to generate speech.")
    print("Type 'quit' or 'exit' to stop.\n")

    while True:
        # 1. Read user input
        input_text = input("Enter text (quit/exit to stop): ").strip()

        # Exit condition
        if input_text.lower() in ["quit", "exit"]:
            print("Exiting program...")
            break

        if not input_text:
            print("Error: Input text cannot be empty.\n")
            continue

        try:
            # 2. Generate raw audio from the TTS API
            print("Generating speech...")
            with client.audio.speech.with_streaming_response.create(
                model="CosyVoice2-0.5B-ax650",
                voice="zh_woman1",
                response_format="wav",
                input=input_text,
            ) as response:
                response.stream_to_file(raw_audio_path)

            # 3. Transcode to 48 kHz stereo WAV using FFmpeg
            print("Transcoding audio...")
            ffmpeg_cmd = [
                "ffmpeg",
                "-y",                    # Overwrite output file if it exists
                "-i", str(raw_audio_path),
                "-ar", "48000",          # Set sample rate to 48 kHz
                "-ac", "2",              # Set channel count to stereo
                "-f", "wav",
                str(transcoded_audio_path)
            ]

            # Run transcoding
            # Remove stdout/stderr redirection if you need FFmpeg logs for debugging
            subprocess.run(
                ffmpeg_cmd,
                check=True,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL
            )

            # 4. Play the transcoded audio with tinyplay
            print("Playing audio...\n")
            tinyplay_cmd = ["tinyplay", str(transcoded_audio_path)]
            subprocess.run(tinyplay_cmd, check=True)

        except subprocess.CalledProcessError as e:
            print(
                f"Command execution failed. Please make sure FFmpeg and tinyplay "
                f"are installed and available in PATH: {e}\n"
            )
        except Exception as e:
            print(f"An error occurred: {e}\n")
        finally:
            # Remove temporary files
            raw_audio_path.unlink(missing_ok=True)
            transcoded_audio_path.unlink(missing_ok=True)


if __name__ == "__main__":
    main()