Controlling AI Agent Speech Emotion
Scenario Description
Some TTS models support specifying the emotion used during synthesis. In a real-time voice interaction scenario, you can combine this capability with the LLM's system prompt so that the AI outputs an emotion matching its persona, making the agent more expressive and emotionally engaging.
For example, MiniMax's Speech series supports specifying multiple emotions during TTS synthesis: "happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral" (calm), and "fluent". For detailed parameter descriptions, refer to Synchronized Speech Synthesis WebSocket.
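As an illustration, in MiniMax's WebSocket API the emotion is carried in the voice settings of the Task Start message. The sketch below shows only the relevant fields; everything other than voice_setting.emotion is abbreviated or assumed, so consult the vendor's reference for the actual message format:

```json
{
  "event": "task_start",
  "voice_setting": {
    "voice_id": "your_voice_id",
    "emotion": "happy"
  }
}
```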
Feature Overview

Supported TTS Models and Control Tags
- The timbre/emotion list may change; always refer to the latest list provided by the TTS vendor.
- "ZEGO control parameters" means that ZEGO AI Agent uniformly controls the TTS emotion through the specified parameters embedded in the text to be synthesized. The LLM must output metadata in this exact format for emotion control to take effect, and the parameter values must match the timbre/emotion names defined by the TTS vendor. A concrete example is shown after the table.
| TTS Vendor | Supported Models | Supported Timbres/Emotions | Where to Try | ZEGO Control Parameters |
|---|---|---|---|---|
| MiniMax | Speech series | Some emotions are only supported by certain models; refer to Synchronized Speech Synthesis WebSocket -> Task Start -> voice_setting -> emotion | Voice Debugging Console | `{"emotion": emotion}` |
| Doubao TTS (unidirectional streaming) | 1.0 and 2.0 series | Emotional timbres are available in both Chinese and English; for the full list, refer to Timbre List -> Emotion Parameters | Multiple emotional timbres in Doubao TTS Model | `{"emotion": emotion}` `{"emotion_scale": scale}` |
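Concretely, the metadata the LLM emits carries literal values, for example (the emotion names and the scale value here are illustrative; use only values your chosen model actually supports):

```text
MiniMax:    {"emotion": "happy"}
Doubao TTS: {"emotion": "happy"} {"emotion_scale": 3}
```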
Implementing Emotion Control for the AI Agent's Voice Output
Achieving this involves three main steps:
- Specify the format of the emotion-control metadata in the LLM output.
- Have the LLM output content in the specified emotion-control format.
- Have the TTS vendor synthesize speech with the emotion control parameters (ZEGO AI Agent handles this automatically).
Prerequisites
- Enable the AI Agent service.
- Confirm that the TTS model or timbre you use supports emotion tags.
- Confirm that the ZEGO AI Agent service supports the corresponding TTS model and tag; see Supported TTS Models and Control Tags.
Usage Steps
In the following example, text wrapped in "[[" and "]]" in the LLM output is treated as metadata: the LLM is instructed to emit emotion-control metadata in this format, and the emotion and emotion_scale parameters are then extracted from it to control the voice emotion.
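With those delimiters, a single LLM reply and what gets extracted from it look like this (the reply text itself is made up for illustration):

```text
LLM output:         [[{"emotion": "sad"}]]I'm sorry to hear that. Let me see what I can do.
Text sent to TTS:   I'm sorry to hear that. Let me see what I can do.
Extracted metadata: {"emotion": "sad"}
```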
Specify the format of the emotion-control metadata in the LLM output
Configure the AdvancedConfig.LLMMetaInfo parameter of the Create Agent Instance interface to specify how the emotion-control metadata is extracted from the LLM output. For example:
"LLMMetaInfo" : {
"BeginCharacters": "[[",
"EndCharacters": "]]"
}Let the LLM output the content according to the specified control emotion format
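In context, LLMMetaInfo sits under AdvancedConfig in the Create Agent Instance request body. Below is a minimal sketch; the AgentId and UserId values are placeholders and other fields are omitted, so refer to the Create Agent Instance interface for the full request:

```json
{
  "AgentId": "your_agent_id",
  "UserId": "user_1",
  "AdvancedConfig": {
    "LLMMetaInfo": {
      "BeginCharacters": "[[",
      "EndCharacters": "]]"
    }
  }
}
```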
Let the LLM output the content according to the specified emotion-control format
- After deciding which TTS vendor to use according to Supported TTS Models and Control Tags, follow the ZEGO control parameter format corresponding to that vendor.
- The keys in the metadata JSON can only be emotion or emotion_scale, and the values must exactly match the emotion parameter values supported by the TTS vendor.
- The emotion values in the examples below are illustrative. Any emotion the TTS vendor actually supports may appear in the text output by the LLM, though an application usually permits only a subset (for example, a customer service agent would not be allowed to sound sad).

The following are LLM.SystemPrompt examples for the Register Agent and Create Agent Instance interfaces when using MiniMax and Doubao (ByteDance) TTS; adjust them to your actual needs:
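A minimal sketch of such a system prompt for MiniMax, assuming the [[ / ]] delimiters configured above and a persona and emotion whitelist of your choosing:

```text
You are a friendly, upbeat assistant.
Before every reply, first output the emotion of the reply as metadata wrapped
in [[ and ]], in the form [[{"emotion": "happy"}]]. The emotion value must be
one of: happy, sad, surprised, neutral. Then output the reply text itself.
Example: [[{"emotion": "happy"}]]Great to hear from you! How can I help today?
```

For Doubao TTS, additionally instruct the model that it may emit an intensity tag such as [[{"emotion_scale": 3}]], keeping the value within the range documented in Timbre List -> Emotion Parameters.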
Let the TTS vendor synthesize the voice with emotion control parameters
Now you can start a voice conversation with the AI agent instance you created! When the content output by the LLM contains emotion control parameters, the AI Agent service automatically passes them to the TTS vendor's interface, so the agent interacts with you with rich, emotional voice performance. 🎉🎉🎉
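The extraction the service performs on your behalf can be pictured with the following simplified sketch. It illustrates the mechanism only and is not the actual implementation; the function name and regex are ours:

```python
import json
import re

# Simplified illustration of how [[...]] metadata can be split from the spoken
# text; the real AI Agent service performs this server-side.
META_PATTERN = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)

def extract_emotion(llm_text: str) -> tuple[dict, str]:
    """Return (tts_params, spoken_text) parsed from an LLM reply."""
    params: dict = {}

    def _collect(match: re.Match) -> str:
        try:
            params.update(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # Ignore malformed metadata rather than reading it aloud
        return ""  # Strip the metadata from the text sent to TTS

    spoken = META_PATTERN.sub(_collect, llm_text)
    return params, spoken.strip()

params, text = extract_emotion('[[{"emotion": "happy"}]]Your order has shipped!')
print(params)  # {'emotion': 'happy'}
print(text)    # Your order has shipped!
```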
