Release Notes

V2

2025-07-31

Server v2.4.15

New Features

Feature	Description	Documentation
WindowSize and LoadMessageCount maximum value adjusted to 200	The maximum values of the `MessageHistory.WindowSize` and `MessageHistory.ZIM.LoadMessageCount` fields of the Create Agent Instance and Create Digital Human Agent Instance interfaces are adjusted to 200.	Create Agent Instance, Create Digital Human Agent Instance
TTS adds TerminatorText field	The `TTS` field of the register agent and create/update agent instance interfaces adds a `TerminatorText` field, which sets the termination text for TTS. If the input TTS text contains the `TerminatorText` string, the content from that string onward (including the string itself) is not synthesized in this round of TTS.
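
The `TerminatorText` behavior described above can be illustrated with a small sketch. This is not the service's implementation; `apply_terminator` is a hypothetical helper, and it assumes a plain substring match against the first occurrence of the terminator.

```python
def apply_terminator(text: str, terminator: str) -> str:
    """Return the part of `text` that would still be synthesized.

    Assumption: everything from the first occurrence of `terminator`
    onward (including the terminator itself) is dropped. An empty
    terminator, or no match, leaves the text unchanged.
    """
    if not terminator:
        return text
    index = text.find(terminator)
    return text if index == -1 else text[:index]


if __name__ == "__main__":
    # "[END]" is an illustrative terminator string, not a documented default.
    print(repr(apply_terminator("Sure, goodbye! [END] internal notes", "[END]")))
```

A sketch like this can help check on the business side which part of an LLM reply will actually be spoken before choosing a terminator string.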

Improvements & Optimizations

Optimized the sentence break logic of unidirectional streaming TTS.

2025-06-26

Server v2.4.0

New Features

Feature	Description	Documentation
Digital Human Video Call	Supports creating a digital human image in the Digital Human PaaS Service and creating a digital human agent instance for real-time video interaction with the digital human. Digital human driving latency is within 500 ms, and end-to-end latency (from the end of user speech to the AI digital human video appearing) is within 2 s. Video is ultra-clear at true 1080P, with realistic facial expressions and accurate lip movement. Supports all languages, especially English and Chinese.	Implement Digital Human Video Call
Multi-agent multi-voice output	Supports multi-voice output when interacting with multiple AI agents by actively calling TTS.	Send Agent Instance TTS

Improvements & Optimizations

Updated the default model of MiniMax TTS (Text-to-Speech) to speech-02-turbo, and optimized its latency to approximately 300ms.

2025-06-19

Server v2.3.0

New Features

Feature	Description	Documentation
Support retrieving average latency information when instance is destroyed	Latency information includes: LLM-related latency (first token latency in ms, LLM output speed in tokens/sec); TTS-related latency (first audio frame latency in ms); and total server latency (ms).	Get Agent Service Status & Latency Data
Support Alibaba CosyVoice TTS bidirectional streaming	By configuring the Vendor as Alibaba CosyVoice when creating an agent and setting up supported voice tones, you can achieve AI real-time voice calls based on CosyVoice.	-
Support callbacks for agent instance creation success and destruction	Can be used in conjunction with agent instance status query, server exception callback, and agent interruption callback to manage the entire lifecycle process of the agent	Get Agent Service Status & Latency Data

Improvements & Optimizations

During the integration testing period, no separate account application and authentication are required to use services from some ZEGO-supported LLMs (Doubao, MiniMax, Tongyi Qianwen, Stepfun, etc.) and TTS vendors (MiniMax, BytePlus, Alibaba CosyVoice). For details, please refer to Quick Start.
Updated support for MiniMax TTS WebSocket unidirectional streaming, further optimizing latency and voice tone effects.
Reduced end-to-end latency by 100-200 ms; it can be reduced to under 1 second when enabled through technical support.

2025-05-30

Server v2.2.0

New Features

Feature	Description	Documentation
1 user vs multiple AI roles	Note: This feature is in beta testing. Please contact ZEGOCLOUD Business for details.	-
Request body contains agent instance and user information when calling LLM	When creating an agent instance, if the `AddAgentInfo` field is set to `true`, the AI Agent backend adds an `agent_info` field to the request body sent to the custom LLM. The field contains `room_id`, `user_id`, and `agent_instance_id`. This allows personalized responses for different users or agent instances, such as selecting different function calls or memory based on the user ID.	Configuring LLM
Callback for each round of user speech audio segment	When creating an agent instance, if the `UserAudioData` field of `CallbackConfig` is set to 1, the AI Agent backend sends a callback containing the audio data from the previous 1-1.5 seconds of the user's speech in each round of conversation (no callback is sent if the speech is shorter than 1 second). The business side can use this audio to implement voiceprint recognition and other capabilities.	Receiving Callback
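
As a rough illustration of how a custom LLM service might use `agent_info`, the sketch below reads the field from a JSON request body and picks per-user memory. Only the `agent_info` field and its three keys come from the note above; the rest of the body shape, the `USER_MEMORY` table, and both function names are hypothetical.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical per-user memory store, keyed by user_id.
USER_MEMORY: Dict[str, str] = {"user_42": "Prefers short answers."}


def extract_agent_info(request_body: str) -> Dict[str, Optional[Any]]:
    """Read agent_info from the request body; missing values become None."""
    payload = json.loads(request_body)
    info = payload.get("agent_info") or {}
    return {
        "room_id": info.get("room_id"),
        "user_id": info.get("user_id"),
        "agent_instance_id": info.get("agent_instance_id"),
    }


def build_system_prompt(request_body: str, base_prompt: str) -> str:
    """Append per-user memory to the base prompt when the user is known."""
    info = extract_agent_info(request_body)
    memory = USER_MEMORY.get(info["user_id"])
    return f"{base_prompt}\n{memory}" if memory else base_prompt


if __name__ == "__main__":
    body = json.dumps({
        "messages": [],  # assumed placeholder for the rest of the request
        "agent_info": {
            "room_id": "room_1",
            "user_id": "user_42",
            "agent_instance_id": "inst_1",
        },
    })
    print(build_system_prompt(body, "You are a helpful assistant."))
```

Falling back to the base prompt when `agent_info` is absent keeps the handler usable for instances created without `AddAgentInfo`.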

Improvements & Optimizations

Fixed a user experience issue in which subtitles and LLM callbacks were triggered too early when ASR multi-sentence concatenation is enabled. For details, please refer to Speech Recognition Segmentation.

2025-05-16

Server v2.1.0

New Features

Feature	Description	Documentation
Multi-user vs 1 Agent	Supports multiple users simultaneously interacting with one AI agent through voice. Features include voice interruption, manual interruption, proactive agent speech, and the agent's ability to distinguish and respond to different users. Note: Contact ZEGOCLOUD Technical Support for details.	-
Speech Recognition Segmentation	Supports voice detection threshold settings and pause duration settings to balance latency and speech recognition segmentation.	Speech Recognition Segmentation
More TTS Service Providers	Added support for Alibaba Cloud and MiniMax, with bidirectional streaming API support for BytePlus.	Agent Parameter Description - TTS
Interrupt Agent	Supports disabling voice interruption while enabling manual interruption, enabling scenarios like manual interruption and Push-to-talk intercom voice interaction.	Interrupt Agent
Context Management	Supports agent instance-level context management capabilities, including context querying and resetting.	AI Short-term Memory (Agent Context) Management
LLM Content Filtering	Supports filtering LLM output content, enabling emoji filtering and specific word replacement. Note: Contact ZEGOCLOUD Technical Support for details.	-
Callback Events	Enables developers to receive agent interruption events, user speech behavior, and agent speech behavior through server-side callbacks.	Get AI Agent Status, Receiving Callback
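
To make the LLM Content Filtering entry above more concrete, here is a minimal sketch of emoji filtering plus word replacement. It is not the service's filter: the Unicode ranges are a simplified approximation of emoji, and `filter_llm_output` and its `replacements` argument are hypothetical names.

```python
import re
from typing import Dict

# Simplified emoji ranges (an assumption; real emoji coverage is broader).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "]+"
)


def filter_llm_output(text: str, replacements: Dict[str, str]) -> str:
    """Remove emoji, then replace each listed word with its substitute."""
    cleaned = EMOJI_PATTERN.sub("", text)
    for word, substitute in replacements.items():
        cleaned = cleaned.replace(word, substitute)
    return cleaned


if __name__ == "__main__":
    print(filter_llm_output("Nice\U0001F600 darn", {"darn": "****"}))
```

Running the emoji pass before word replacement means a substitute string is never itself stripped by the emoji pattern.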

Improvements & Optimizations

Comprehensive optimization of integration examples, providing business service control pages and supporting client sample code. For details, refer to Quick Start.
Further improved speech recognition and interruption accuracy, especially when external music is playing.
Further optimized end-to-end voice latency, reducing delay by more than 200 ms.
Added token authentication support for real-time audio and video (RTC), enhancing interaction security without affecting agent interaction.

2025-04-25

Server v2.0.0

Version Update

Enhanced onboarding experience, enabling voice calls with AI agents with fewer than 10 lines of code.
Upgraded full-process audio handling capabilities, significantly improving the accuracy of speech interruption and recognition, especially in noisy environments, while playing BGM, or during cross talk (AI and user speaking simultaneously), covering various environments such as home, office, and public spaces for AI interaction.
Supports features including: custom third-party large language models (LLMs), natural speech interruptions within 500 ms, real-time subtitles, AI agent status queries, proactive LLM invocation, and proactive TTS invocation.
Upgraded architecture: ZEGOCLOUD AI agent supports multi-user vs multi-AI agent for more flexible interaction formats.

V1

2025-03-21

Server v1.4.0

New Features

Added a Query Agent Status server-side interface.
When creating a session, added a Pass-through Third-party Parameters field to the text-to-speech configuration object.
For MiniMax text-to-speech services, the Pass-through Third-party Parameters field now includes a Model field.
The ASR configuration object has added Hotwords and Extended Parameters fields.
Added a Remove History field to the request parameters of the server-side interface used for actively invoking text-to-speech services.

2025-02-10

Server v1.3.0

New Features

Added server-side callback for abnormal events.
Added a Sentence Pause Duration field to the text-to-speech configuration object.

2025-01-16

Server v1.2.0

New Features

Added Response Format Types and Response Message Name fields to the large language model configuration object when creating a session.
Added a User ID (required) field to the request parameters of session and conversation-related server-side interfaces, as well as those used for actively invoking large language models and text-to-speech services.
Added API Type and Resource ID fields to the extended parameters of the text-to-speech configuration object.

2025-01-08

Server v1.1.0

New Features

Added a Session ID field to the server-side interface for obtaining session lists, supporting querying session details by session ID.
Added a Conversation History Mode field to the server-side interface for creating sessions, which controls whether session history messages are saved.

Improvements & Optimizations

Adjusted room event message protocol.

Deprecated & Removed

Removed the Account Source field from large language model and text-to-speech configuration objects.

2024-12-31

Server v1.0.0

Version Update

Comprehensive service reliability & stability.
Lower end-to-end latency and interruption delay.
Updated audio processing capabilities, supporting noisy environments and covering over 80% of scenarios.
Agent template library.
Supports active invocation of large language models.
Supports active invocation of text-to-speech services.
Supports custom RAG and other capabilities.
Added an Ignore Bracketed Text field to the large language model configuration object, supporting filtering out emojis from large language model texts.

Beta

2024-12-16

Server v0.5.0

New Features

Added a server-side interface for proactively calling the text-to-speech service.
Added a server-side interface for proactively calling the large language model service.
Added a server-side callback interface for obtaining results from the large language model service.
The session creation server-side interface added an Enable Large Language Model Server Message configuration.
The large language model configuration object added an Ignore Bracketed Text field, supporting filtering of emoticons in the large language model's text.

Improvements & Optimizations

Unified the Timestamp field for customizing per-round conversation prompts with the large language model to Int type.

2024-12-05

Server v0.3.0

New Features

Added a Conversation Configuration field to server-side interfaces for creating, updating, and querying sessions.
Added a protocol for a custom pre-processing server-side interface for large language model prompts.
The text-to-speech configuration object added Ignore Bracketed Text and Ignore Custom Bracketed Text fields, supporting ignoring certain input content for text-to-speech services, such as content within Chinese and English brackets.
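
The bracket-ignoring behavior above can be sketched as follows. This is an illustration only: it assumes the default pairs are ASCII and full-width parentheses, handles non-nested brackets, and uses a hypothetical `strip_bracketed` helper whose `pairs` argument stands in for the custom bracket setting.

```python
import re
from typing import Iterable, Tuple

# Assumed defaults: English (ASCII) and Chinese (full-width) parentheses.
DEFAULT_PAIRS: Tuple[Tuple[str, str], ...] = (("(", ")"), ("（", "）"))


def strip_bracketed(
    text: str, pairs: Iterable[Tuple[str, str]] = DEFAULT_PAIRS
) -> str:
    """Remove each bracketed span, brackets included (non-nested only)."""
    for opening, closing in pairs:
        # Non-greedy match from an opening bracket to the nearest closing one.
        pattern = re.escape(opening) + r".*?" + re.escape(closing)
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text


if __name__ == "__main__":
    print(strip_bracketed("你好（微笑）世界"))
    # Custom pairs, e.g. square brackets:
    print(repr(strip_bracketed("Hi [waves] all", [("[", "]")])))
```

Text with an unmatched opening bracket is left untouched by this sketch, which avoids silently dropping the rest of a sentence.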

2024-11-26

Server v0.2.0

New Features

Added an Extended Parameters field applicable to text-to-speech services, supporting replicated voices from BytePlus and MiniMax.
Added error codes such as 410003101.

Bug Fixes

Fixed an issue where the AI agent could not interrupt properly under certain scenarios.

2024-10-01

Server v0.1.0

Version Release

Supports basic scenarios such as AI real-time voice calls and IM text chats.
Supports switching between large language models (LLMs), text-to-speech (TTS) service providers, and voice tones.