Configuring TTS

Function Introduction

To match different personas and scenarios, you may need to:

  • Select different text-to-speech (TTS) vendors, such as Volcano Engine, MiniMax, Aliyun, etc.
  • Configure different voices.
  • Customize the TTS audio, such as volume, speed, and tone.
  • Filter the content sent to TTS using special rules. For example, in "(happily) The weather is really nice today", the content inside the parentheses is filtered out.

Prerequisites

  1. Enable the AI Agent service.
  2. Enable the corresponding TTS vendor service in one of the following ways:
    • Method 1: Try the service directly with the zego_test account.
    • Method 2: Purchase the TTS service through ZEGO. Contact ZEGOCLOUD sales to obtain an account and authentication information.
    • Method 3: Purchase the TTS service on your own and obtain the key and other authentication information.

Usage Method

Currently, TTS-related parameters can be set through four interfaces:

Interface | Description
Register Agent | Set the vendor, voice, speed, and other parameters.
Create Agent Instance, Create Digital Human Agent Instance | Set the vendor, voice, speed, and other parameters.
  Note: If not set, the TTS parameters carried by the registered agent (RegisterAgent) are used by default.
Update Agent Instance | Set the voice, speed, and other parameters.
  Note: Does not support modifying the FilterText parameter.
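
The fallback and update rules above can be sketched as client-side helpers. This is a minimal sketch, not part of any ZEGO SDK: the function names are hypothetical, and it assumes the fallback applies to the TTS object as a whole rather than field by field. FilterText comes from the note above; Vendor and CharacterFilter come from the notes in the TTS parameter description.

```python
from typing import Optional

# Fields the documentation marks as not updatable through Update Agent Instance.
NON_UPDATABLE_FIELDS = ("Vendor", "FilterText", "CharacterFilter")


def effective_tts(registered_tts: dict, instance_tts: Optional[dict]) -> dict:
    """Return the TTS config an instance would use under the documented fallback."""
    # Assumption: the fallback replaces the whole TTS object, not individual fields.
    return registered_tts if instance_tts is None else instance_tts


def rejected_update_fields(update_tts: dict) -> list:
    """List the fields in an update payload that the docs mark as non-updatable."""
    return [name for name in NON_UPDATABLE_FIELDS if name in update_tts]
```

Calling rejected_update_fields before sending an update request gives an early, local warning instead of relying on the server response.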

TTS Parameters Description

Parameter Name | Type | Required | Description
Vendor | String | Yes | Text-to-speech (TTS) service provider. Optional values:
  • Aliyun: Aliyun TTS (note: this is standard speech synthesis, not CosyVoice).
  • CosyVoice: Aliyun CosyVoice TTS.
  • ByteDance: Volcano Engine unidirectional streaming TTS.
  • ByteDanceV3: Volcano Engine V3 unidirectional streaming TTS.
  • ByteDanceFlowing: Volcano Engine bidirectional streaming TTS.
  • MiniMax: MiniMax TTS.
  Note: This parameter cannot be updated when updating the agent instance.
Params | Object | Yes | TTS configuration parameters, in JSON object format. Contains the app parameter (for authentication) and other parameters (for adjusting TTS effects). See the Params description below.
FilterText | Array of Object | No | Filters out text enclosed by the specified punctuation marks from the content input to TTS (usually the content returned by the LLM or the Text parameter value of the SendAgentInstanceTTS interface) before speech synthesis. For example, to filter out the content inside the parentheses in "(happily) Zego Technology welcomes you!" before synthesis, set it to [{"BeginCharacters": "(", "EndCharacters": ")"}].
  Note: FilterText is an array of objects. Each object contains two string parameters: BeginCharacters and EndCharacters.
  Note: This parameter cannot be updated when updating the agent instance.
TerminatorText | String | No | Sets the termination text of TTS. If the content input to TTS (usually the content returned by the LLM or the Text parameter value of the SendAgentInstanceTTS interface) contains the TerminatorText string, everything from that string onward (inclusive) is not synthesized in this round of TTS.
  Note: The maximum length is 4 characters. With bidirectional streaming TTS, only one character can be set.
CharacterFilter | Array of String | No | The specified strings in the content input to TTS (usually the content returned by the LLM or the Text parameter value of the SendAgentInstanceTTS interface) are excluded from speech synthesis.
  Note: Each string in the array is a string to be filtered out and must not exceed 2 characters.
  Note: This parameter cannot be updated when updating the agent instance.
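
The constraints in this table can be checked locally before a request is sent. The following is a minimal sketch under stated assumptions: check_tts_config is a hypothetical helper, lengths are counted as Python characters (the service may count differently), and the one-character limit for bidirectional streaming is assumed to apply to the ByteDanceFlowing vendor.

```python
KNOWN_VENDORS = {
    "Aliyun", "CosyVoice", "ByteDance", "ByteDanceV3", "ByteDanceFlowing", "MiniMax",
}


def check_tts_config(tts: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    if tts.get("Vendor") not in KNOWN_VENDORS:
        problems.append("Vendor must be one of: " + ", ".join(sorted(KNOWN_VENDORS)))
    if not isinstance(tts.get("Params"), dict):
        problems.append("Params is required and must be a JSON object")

    for i, rule in enumerate(tts.get("FilterText", [])):
        if not isinstance(rule, dict):
            problems.append(f"FilterText[{i}] must be an object")
            continue
        for key in ("BeginCharacters", "EndCharacters"):
            if not isinstance(rule.get(key), str):
                problems.append(f"FilterText[{i}].{key} must be a string")

    terminator = tts.get("TerminatorText")
    if terminator is not None:
        if not isinstance(terminator, str) or len(terminator) > 4:
            problems.append("TerminatorText must be a string of at most 4 characters")
        elif tts.get("Vendor") == "ByteDanceFlowing" and len(terminator) > 1:
            # Assumption: "bidirectional streaming" refers to ByteDanceFlowing.
            problems.append("TerminatorText is limited to 1 character for bidirectional streaming")

    for i, item in enumerate(tts.get("CharacterFilter", [])):
        if not isinstance(item, str) or len(item) > 2:
            problems.append(f"CharacterFilter[{i}] must be a string of at most 2 characters")

    return problems
```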

Params Description

Parameter Name | Type | Required | Description
app | object | Yes | Used for TTS service authentication. The required structure of the app parameter varies depending on the value of Vendor. See the app parameter instructions for each vendor below.
Other Params | - | No | Besides the app parameter, you can also provide additional TTS configuration parameters to further customize the speech synthesis effect. These parameters are passed directly through to the corresponding TTS service provider.

Refer to the official documentation of each vendor (according to the value of Vendor) for detailed parameter information.

The definitions of the app parameter and other TTS parameters vary by vendor. Please refer to the parameter instructions for each vendor below.
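
As an illustration of how Params combines the app object with pass-through parameters, here is a minimal sketch. The helper build_tts_config is hypothetical, and the keys inside app and other_params are placeholders only, because the real field names are vendor-specific and not listed here.

```python
import json
from typing import Optional


def build_tts_config(vendor: str, app: dict, other_params: Optional[dict] = None) -> dict:
    """Assemble a TTS object: app for authentication, other parameters passed through."""
    params = dict(other_params or {})
    params["app"] = app  # the authentication object always occupies the "app" key
    return {"Vendor": vendor, "Params": params}


# Placeholder values only: replace them with the fields your vendor documents.
tts = build_tts_config(
    vendor="MiniMax",
    app={"<vendor-specific-auth-field>": "<value>"},
    other_params={"<vendor-specific-parameter>": "<value>"},
)
print(json.dumps({"TTS": tts}, indent=2))
```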

Filtering text content before speech synthesis

Note
Content filtering is optional. Configure it according to your actual needs.

In some scenarios, you may need to filter the text generated by the LLM before speech synthesis. For example, a companion app may instruct the LLM to use parentheses to indicate emotion or tone in the conversation. This parenthesized content is only used in the subtitles displayed to users and does not need to be synthesized:

(happily) You are welcome to ZEGOCLOUD!

Here "(happily)" does not need to be synthesized.

There are three ways to control text filtering:

Filter text content between specified punctuation marks using FilterText.

For example, to filter the content inside parentheses in "(happily) You are welcome to ZEGOCLOUD!", set it to:

"TTS": {
  ...
  "FilterText": [
    {
      "BeginCharacters": "(",
      "EndCharacters": ")"
    }
  ],
  ...
}
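
To preview the effect of a FilterText rule before configuring it, the behavior can be approximated locally. This is a rough sketch, not the service's implementation: apply_filter_text is a hypothetical helper, and it assumes that the markers are removed together with the enclosed text, that matching is non-greedy, and that surrounding whitespace is left untouched.

```python
import re


def apply_filter_text(text: str, rules: list) -> str:
    """Remove every span from BeginCharacters through EndCharacters, markers included."""
    for rule in rules:
        begin, end = rule["BeginCharacters"], rule["EndCharacters"]
        if not begin or not end:
            continue  # skip incomplete rules instead of guessing their meaning
        pattern = re.escape(begin) + r".*?" + re.escape(end)
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text


rules = [{"BeginCharacters": "(", "EndCharacters": ")"}]
print(repr(apply_filter_text("(happily) You are welcome to ZEGOCLOUD!", rules)))
# prints ' You are welcome to ZEGOCLOUD!'
```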

Filter text content after a specific string using TerminatorText.

For example, to filter the content after "#" in "You are welcome to ZEGOCLOUD! #2025-01-01.", set it to:

"TTS": {
  ...
  "TerminatorText": "#",
  ...
}
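
The TerminatorText rule can be approximated locally in the same spirit. This is a minimal sketch with a hypothetical helper name; it assumes the first occurrence of the terminator is the cut point and that text before it is passed on unchanged.

```python
def apply_terminator_text(text: str, terminator: str) -> str:
    """Keep only the text before the first occurrence of the terminator."""
    if not terminator:
        return text
    # partition returns the whole text as the head when the terminator is absent.
    head, _sep, _tail = text.partition(terminator)
    return head


print(repr(apply_terminator_text("You are welcome to ZEGOCLOUD! #2025-01-01.", "#")))
# prints 'You are welcome to ZEGOCLOUD! '
```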

Filter specific strings using CharacterFilter.

For example, to filter the symbols "-" and "*" in "- Tomorrow 10am meeting", set it to:

"TTS": {
  ...
  "CharacterFilter": ["-", "*"],
  ...
}
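
A local approximation of CharacterFilter is a plain substring removal. This is a minimal sketch with a hypothetical helper name; it assumes every occurrence of each listed string is dropped and nothing else is changed.

```python
def apply_character_filter(text: str, filters: list) -> str:
    """Delete every occurrence of each listed string from the text."""
    for item in filters:
        if item:  # ignore empty strings, which would match everywhere
            text = text.replace(item, "")
    return text


print(repr(apply_character_filter("- Tomorrow 10am meeting", ["-", "*"])))
# prints ' Tomorrow 10am meeting'
```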
