
AI Pyramid - CosyVoice2 Voice Cloning

CosyVoice2 is a high-quality speech synthesis system built on large language models that generates natural, fluent speech. This document describes a complete invocation workflow compatible with the OpenAI API; users can get started quickly by installing the corresponding StackFlow software packages.

1. Preparation

Refer to AI Pyramid Software Package Update to complete the installation of the following dependency packages and models.

Update the package index:

apt update

Install core dependency packages:

apt install lib-llm llm-sys llm-cosy-voice llm-openai-api

Install the CosyVoice2 model:

apt install llm-model-cosyvoice2-0.5b-ax650
Model Update Notice
Each time you install a new model, you must manually run systemctl restart llm-openai-api to refresh the model list.
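After the restart, you can confirm that the model was registered by listing the models exposed by the OpenAI-compatible endpoint. The sketch below assumes the service listens on 127.0.0.1:8000, as in the invocation examples later in this document:

```python
import json
import urllib.request


def list_model_ids(base_url: str) -> list[str]:
    """Return the ids of all models exposed by an OpenAI-compatible endpoint.

    The /models path follows the OpenAI API convention; the host and port
    are assumptions matching the examples in this document.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    # After `systemctl restart llm-openai-api`, the CosyVoice2 model
    # should appear in this list.
    try:
        for model_id in list_model_ids("http://127.0.0.1:8000/v1"):
            print(model_id)
    except OSError:
        print("llm-openai-api service is not reachable on 127.0.0.1:8000")
```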
Performance Notes
CosyVoice2 is a high-performance neural speech synthesis model. Although it produces natural, fluent speech, it has limitations on resource-constrained devices: generated audio is capped at 27 seconds, and the initial model load can take a relatively long time. Plan the expected audio length around these limits for your application scenario.
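The 27-second ceiling matters for long passages. One pragmatic workaround (a sketch, not part of the official API) is to split long input on sentence boundaries and synthesize each chunk separately; the character budget below is a hypothetical proxy for audio length that you would tune against your actual speaking rate:

```python
import re


def chunk_text(text: str, max_chars: int = 150) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars.

    max_chars is a rough stand-in for the ~27-second audio ceiling; tune
    it to your language and speaking rate. A single sentence longer than
    max_chars is emitted on its own rather than split mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?。！？])\s*", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in filter(None, sentences):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent through the /v1/audio/speech endpoint as in the examples below, and the resulting WAV files concatenated.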

2. Basic Invocation Examples

Using Curl

curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CosyVoice2-0.5B-ax650",
    "response_format": "wav",
    "input": "But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.
"
  }' \
  -o output.wav

Using Python

from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key="sk-",
    base_url="http://127.0.0.1:8000/v1"
)

speech_file_path = Path(__file__).parent / "output.wav"
with client.audio.speech.with_streaming_response.create(
  model="CosyVoice2-0.5B-ax650",
  voice="prompt_data",
  response_format="wav",
  input='But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.',
) as response:
  response.stream_to_file(speech_file_path)

3. Voice Cloning

3.1 Obtain Cloning Scripts

Choose one of the following methods to obtain the CosyVoice2 cloning scripts:

Method 1: Manual Download

Visit the CosyVoice2 Script Repository to download the scripts, then upload them to the AI Pyramid device.

Method 2: Command-Line Clone

Dependency Check
If git lfs is not installed on the system, refer to the git lfs Installation Guide to install it.

git clone --recurse-submodules https://huggingface.co/M5Stack/CosyVoice2-scripts

3.2 Directory Structure Description

After cloning is complete, the directory structure is as follows:

root@m5stack-AI-Pyramid:~/CosyVoice2-scripts# ls -lh
total 28K
drwxr-xr-x 2 root root 4.0K Jan  9 10:26 asset
drwxr-xr-x 2 root root 4.0K Jan  9 10:26 CosyVoice-BlankEN
drwxr-xr-x 2 root root 4.0K Jan  9 10:27 frontend-onnx
drwxr-xr-x 3 root root 4.0K Jan  9 10:26 pengzhendong
-rw-r--r-- 1 root root   24 Jan  9 10:26 README.md
-rw-r--r-- 1 root root  103 Jan  9 10:26 requirements.txt
drwxr-xr-x 3 root root 4.0K Jan  9 10:26 scripts

3.3 Process Audio Samples

Step 1: Create a Virtual Environment

Enter the CosyVoice2-scripts directory.

cd CosyVoice2-scripts/

First-Time Operation
If you have never created a Python virtual environment on this device, first run apt install python3.10-venv.

python3 -m venv cosyvoice

Step 2: Activate the Virtual Environment

source cosyvoice/bin/activate

Step 3: Install Dependency Packages

pip3 install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

Step 4: Run the Processing Script

Run the voice processing script to generate voice feature files:

python3 scripts/process_prompt.py --prompt_text asset/zh_woman1.txt --prompt_speech asset/zh_woman1.wav --output zh_woman1

Example output after successful script execution:

(cosyvoice) root@m5stack-AI-Pyramid:~/CosyVoice2-scripts# python3 scripts/process_prompt.py --prompt_text asset/zh_woman1.txt --prompt_speech asset/zh_woman1.wav --output zh_woman1
2026-01-09 10:41:18.655905428 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"
prompt_text 希望你以后能够做的比我还好呦。
fmax 8000
prompt speech token size: torch.Size([1, 87])
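If you have several voice samples to process, the same command line can be generated per sample. The helper below is a sketch that assumes the naming convention from the example above: each asset/<name>.wav has a matching asset/<name>.txt transcript, and the feature directory reuses <name>:

```python
import subprocess
from pathlib import Path


def build_commands(asset_dir: str) -> list[list[str]]:
    """Build one process_prompt.py invocation per <name>.wav/<name>.txt pair.

    Pairs without a transcript are silently skipped.
    """
    commands = []
    for wav in sorted(Path(asset_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            commands.append([
                "python3", "scripts/process_prompt.py",
                "--prompt_text", str(txt),
                "--prompt_speech", str(wav),
                "--output", wav.stem,
            ])
    return commands


if __name__ == "__main__":
    # Run inside CosyVoice2-scripts/ with the virtual environment active.
    for cmd in build_commands("asset"):
        subprocess.run(cmd, check=True)
```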

3.4 Deploy Voice Data to the Model Directory

Copy the processed voice feature files to the model data directory:

cp -r zh_woman1 /opt/m5stack/data/CosyVoice2-0.5B-ax650/

Restart the model service to load the new voice configuration:

systemctl restart llm-sys

Voice Replacement Notes
To replace the default cloned voice, modify the prompt_dir field in /opt/m5stack/data/models/mode_CosyVoice2-0.5B-ax650.json to point to the new voice directory. The model service must be reinitialized each time the voice is replaced.
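The edit can also be scripted. The sketch below takes the config path and the prompt_dir field name from the note above, but the exact JSON layout (a flat, top-level "prompt_dir" key) is an assumption; inspect the file on your device before relying on it:

```python
import json
from pathlib import Path

# Config path from the note above; the top-level "prompt_dir" key is an
# assumption about the file layout -- verify it on your device first.
CONFIG = Path("/opt/m5stack/data/models/mode_CosyVoice2-0.5B-ax650.json")


def set_prompt_dir(config_path: Path, new_dir: str) -> None:
    """Rewrite the model config so prompt_dir points at a new voice directory."""
    config = json.loads(config_path.read_text())
    config["prompt_dir"] = new_dir
    config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
```

After calling set_prompt_dir(CONFIG, "/opt/m5stack/data/CosyVoice2-0.5B-ax650/zh_woman1"), restart the model service (systemctl restart llm-sys) so the new voice is loaded.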

4. Invoke Using the Cloned Voice

Using Curl

curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CosyVoice2-0.5B-ax650",
    "voice": "zh_woman1",
    "response_format": "wav",
    "input": "But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.
"
  }' \
  -o output.wav

Using Python

from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key="sk-",
    base_url="http://127.0.0.1:8000/v1"
)

speech_file_path = Path(__file__).parent / "output.wav"
with client.audio.speech.with_streaming_response.create(
  model="CosyVoice2-0.5B-ax650",
  voice="zh_woman1",
  response_format="wav",
  input='But thy eternal summer shall not fade, Nor lose possession of that fair thou ow’st; Nor shall Death brag thou wander’st in his shade, When in eternal lines to time thou grow’st; So long as men can breathe or eyes can see, So long lives this, and this gives life to thee.',
) as response:
  response.stream_to_file(speech_file_path)

Demonstration

Running this demo requires installing the OpenAI Python package and restarting the LLM services.

pip3 install openai
systemctl restart llm-*
# main.py

from pathlib import Path
from openai import OpenAI
import subprocess


def main():
    # Initialize the OpenAI client
    client = OpenAI(
        api_key="sk-",  # Replace with your actual API key
        base_url="http://127.0.0.1:8000/v1"
    )

    # Temporary file paths
    base_dir = Path(__file__).parent
    raw_audio_path = base_dir / "temp_raw.wav"
    transcoded_audio_path = base_dir / "temp_48k_stereo.wav"

    print("=== Interactive Speech Synthesis Mode ===")
    print("Enter text and press Enter to generate speech.")
    print("Type 'quit' or 'exit' to stop.\n")

    while True:
        # 1. Read user input
        input_text = input("Enter text (quit/exit to stop): ").strip()

        # Exit condition
        if input_text.lower() in ["quit", "exit"]:
            print("Exiting program...")
            break

        if not input_text:
            print("Error: Input text cannot be empty.\n")
            continue

        try:
            # 2. Generate raw audio from the TTS API
            print("Generating speech...")
            with client.audio.speech.with_streaming_response.create(
                model="CosyVoice2-0.5B-ax650",
                voice="zh_woman1",
                response_format="wav",
                input=input_text,
            ) as response:
                response.stream_to_file(raw_audio_path)

            # 3. Transcode to 48 kHz stereo WAV using FFmpeg
            print("Transcoding audio...")
            ffmpeg_cmd = [
                "ffmpeg",
                "-y",                    # Overwrite output file if it exists
                "-i", str(raw_audio_path),
                "-ar", "48000",          # Set sample rate to 48 kHz
                "-ac", "2",              # Set channel count to stereo
                "-f", "wav",
                str(transcoded_audio_path)
            ]

            # Run transcoding
            # Remove stdout/stderr redirection if you need FFmpeg logs for debugging
            subprocess.run(
                ffmpeg_cmd,
                check=True,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL
            )

            # 4. Play the transcoded audio with tinyplay
            print("Playing audio...\n")
            tinyplay_cmd = ["tinyplay", str(transcoded_audio_path)]
            subprocess.run(tinyplay_cmd, check=True)

        except subprocess.CalledProcessError as e:
            print(
                f"Command execution failed. Please make sure FFmpeg and tinyplay "
                f"are installed and available in PATH: {e}\n"
            )
        except Exception as e:
            print(f"An error occurred: {e}\n")
        finally:
            # Remove temporary files
            raw_audio_path.unlink(missing_ok=True)
            transcoded_audio_path.unlink(missing_ok=True)


if __name__ == "__main__":
    main()