Controlling AI Agent Speech Emotion
Scenario Description
Some TTS models support specifying the emotion used during synthesis. In a real-time voice interaction scenario, you can combine this capability with the LLM's system prompt so that the AI outputs an emotion matching its persona, making the agent more expressive and emotionally engaging.
For example, MiniMax's Speech series supports specifying multiple emotions during TTS synthesis: "happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral" (calm), and "fluent". For detailed parameter descriptions, refer to Synchronized Speech Synthesis WebSocket.
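As an illustration, in MiniMax's WebSocket API the emotion is carried in the voice settings of the Task Start message. The sketch below shows only the relevant fields; everything other than voice_setting.emotion is abbreviated or assumed, so consult the vendor's reference for the actual message format:

```json
{
  "event": "task_start",
  "voice_setting": {
    "voice_id": "your_voice_id",
    "emotion": "happy"
  }
}
```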
Feature Overview

Supported TTS Models and Control Tags
- The timbre/emotion list may change; always refer to the latest list provided by the TTS vendor.
- "ZEGO control parameters" means that ZEGO AI Agent uniformly controls the TTS emotion through the specified parameters embedded in the text to be synthesized. The LLM must output metadata in this exact format for emotion control to take effect, and the parameter values must match the timbre/emotion names defined by the TTS vendor. A concrete example is shown after the table.
| TTS Vendor | Supported Models | Supported Timbres/Emotions | Where to Try | ZEGO Control Parameters |
|---|---|---|---|---|
| MiniMax | Speech series | Some emotions are only supported by certain models; refer to Synchronized Speech Synthesis WebSocket -> Task Start -> voice_setting -> emotion | Voice Debugging Console | `{"emotion": emotion}` |
| Doubao TTS (unidirectional streaming) | 1.0 and 2.0 series | Emotional timbres are available in both Chinese and English; for the full list, refer to Timbre List -> Emotion Parameters | Multiple emotional timbres in Doubao TTS Model | `{"emotion": emotion}` `{"emotion_scale": scale}` |
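Concretely, the metadata the LLM emits carries literal values, for example (the emotion names and the scale value here are illustrative; use only values your chosen model actually supports):

```text
MiniMax:    {"emotion": "happy"}
Doubao TTS: {"emotion": "happy"} {"emotion_scale": 3}
```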
Implementing Emotion Control for the AI Agent's Voice Output
Achieving this involves three main steps:
- Specify the format of the emotion-control metadata in the LLM output.
- Have the LLM output content in the specified emotion-control format.
- Have the TTS vendor synthesize speech with the emotion control parameters (ZEGO AI Agent handles this automatically).
Prerequisites
- Enable the AI Agent service.
- Confirm that the TTS model or timbre you use supports emotion tags.
- Confirm that the ZEGO AI Agent service supports the corresponding TTS model and tag; see Supported TTS Models and Control Tags.
Usage Steps
In the following example, text wrapped in "[[" and "]]" in the LLM output is treated as metadata: the LLM is instructed to emit emotion-control metadata in this format, and the emotion and emotion_scale parameters are then extracted from it to control the voice emotion.
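With those delimiters, a single LLM reply and what gets extracted from it look like this (the reply text itself is made up for illustration):

```text
LLM output:         [[{"emotion": "sad"}]]I'm sorry to hear that. Let me see what I can do.
Text sent to TTS:   I'm sorry to hear that. Let me see what I can do.
Extracted metadata: {"emotion": "sad"}
```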
Specify the format of the emotion-control metadata in the LLM output
Configure the AdvancedConfig.LLMMetaInfo parameter of the Create Agent Instance interface to specify how the emotion-control metadata is extracted from the LLM output. For example:
"LLMMetaInfo" : {
"BeginCharacters": "[[",
"EndCharacters": "]]"
}Let the LLM output the content according to the specified control emotion format
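In context, LLMMetaInfo sits under AdvancedConfig in the Create Agent Instance request body. Below is a minimal sketch; the AgentId and UserId values are placeholders and other fields are omitted, so refer to the Create Agent Instance interface for the full request:

```json
{
  "AgentId": "your_agent_id",
  "UserId": "user_1",
  "AdvancedConfig": {
    "LLMMetaInfo": {
      "BeginCharacters": "[[",
      "EndCharacters": "]]"
    }
  }
}
```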
Let the LLM output the content according to the specified emotion-control format
- After deciding which TTS vendor to use according to Supported TTS Models and Control Tags, follow the ZEGO control parameter format corresponding to that vendor.
- The keys in the metadata JSON can only be emotion or emotion_scale, and the values must exactly match the emotion parameter values supported by the TTS vendor.
- The emotion values in the examples below are illustrative. Any emotion the TTS vendor actually supports may appear in the text output by the LLM, though an application usually permits only a subset (for example, a customer service agent would not be allowed to sound sad).

The following are LLM.SystemPrompt examples for the Register Agent and Create Agent Instance interfaces when using MiniMax and Doubao (ByteDance) TTS; adjust them to your actual needs:
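A minimal sketch of such a system prompt for MiniMax, assuming the [[ / ]] delimiters configured above and a persona and emotion whitelist of your choosing:

```text
You are a friendly, upbeat assistant.
Before every reply, first output the emotion of the reply as metadata wrapped
in [[ and ]], in the form [[{"emotion": "happy"}]]. The emotion value must be
one of: happy, sad, surprised, neutral. Then output the reply text itself.
Example: [[{"emotion": "happy"}]]Great to hear from you! How can I help today?
```

For Doubao TTS, additionally instruct the model that it may emit an intensity tag such as [[{"emotion_scale": 3}]], keeping the value within the range documented in Timbre List -> Emotion Parameters.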
Let the TTS vendor synthesize the voice with emotion control parameters
Now you can start a voice conversation with the AI agent instance you created! When the content output by the LLM contains emotion control parameters, the AI Agent service automatically passes them to the TTS vendor's interface, so the agent interacts with you with rich, emotional voice performance. 🎉🎉🎉
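The extraction the service performs on your behalf can be pictured with the following simplified sketch. It illustrates the mechanism only and is not the actual implementation; the function name and regex are ours:

```python
import json
import re

# Simplified illustration of how [[...]] metadata can be split from the spoken
# text; the real AI Agent service performs this server-side.
META_PATTERN = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)

def extract_emotion(llm_text: str) -> tuple[dict, str]:
    """Return (tts_params, spoken_text) parsed from an LLM reply."""
    params: dict = {}

    def _collect(match: re.Match) -> str:
        try:
            params.update(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # Ignore malformed metadata rather than reading it aloud
        return ""  # Strip the metadata from the text sent to TTS

    spoken = META_PATTERN.sub(_collect, llm_text)
    return params, spoken.strip()

params, text = extract_emotion('[[{"emotion": "happy"}]]Your order has shipped!')
print(params)  # {'emotion': 'happy'}
print(text)    # Your order has shipped!
```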
