Overview

2026-02-05

Note

The Real-time Interactive AI Agent has been fully upgraded and released as version 2.0. ZEGOCLOUD has built a new generation of real-time interactive AI specifically for AI agents:

Comprehensive upgrade of end-to-end AI voice processing capabilities, achieving >95% recognition and interruption accuracy, with special optimization for scenarios such as double talk and background music (BGM).
Fully optimized interaction architecture, supporting multi-user and multi-AI interaction scenarios.
Greatly improved integration experience and usability.

For details, please refer to the Release Notes.

Product Introduction

ZEGOCLOUD AI Agent (hereinafter referred to as "Interactive AI" or "AI Agent") enables you to quickly implement ultra-low latency IM text & image chat, voice calls, and digital human voice calls between users and AI (agents) by integrating the SDK and server APIs. This meets the needs of scenarios such as AI companionship, AI customer service, and AI digital human live streaming. ZEGOCLOUD AI Agent supports custom persona, voice, and avatar settings, supports multiple large language models (LLM) and text-to-speech (TTS) services, as well as long-term memory, external knowledge base, and model fine-tuning, enabling more advanced and perfect AI agents.

Product Advantages

Multimodal Interactive AI Agent

Personalized Persona: Define the AI's personality and role. Use best practices for prompts, combined with RAG, LoRA, and other methods to better match roles and meet specific requirements.
Rich Voice Options & Voice Cloning: Over a hundred highly human-like voices suitable for emotional companionship, customer service, e-commerce, etc., with support for voice cloning.
Multimodal Interaction: IM text & image messages, real-time voice calls, video calls, and more.
Premium Photo-based Digital Human: With just one photo and as low as 200ms latency, give your AI a real-time interactive avatar with accurate lip sync and realistic facial expressions.

Real-time Voice Call Capabilities

Response latency as low as 1s. Full streaming processing, with global access via ZEGOCLOUD's proprietary MSDN (Media Streaming Delivery Network) nodes, achieving global latency as low as 1s.
Natural voice interruption in just 500ms. Rapid and accurate human voice detection for smooth, non-intrusive interruptions, with no crosstalk during consecutive interruptions.
Accurate speaking state detection. Ensures that replies are not mistakenly split into multiple sentences, providing more precise AI responses without affecting reply latency.

AI Audio Processing Tailored for Agents

AI Noise Suppression (AI ANS): Eliminates environmental noise, music, distant human voices, etc., supporting interaction in various environments such as offices, homes, and cars.
AI Voice Activity Detection (AI VAD): Accurately detects valid human speech, filtering out soft responses like "um", "oh", as well as coughs and human-like noises.
AI Echo Cancellation (AI AEC): Precisely removes AI voice and background music picked up by the microphone, preventing AI from interrupting itself and improving speech accuracy during interruptions. Also supports volume ducking and adaptive playback volume.

Personalized Integration

Easy Integration: Add AI agents to IM, real-time voice calls, or digital human calls with less than 10 lines of code.
Flexible LLM and TTS Plugin Selection: Supports a wide range of domestic and international providers such as Volcano Ark (Doubao), MiniMax, Volcano Engine, Alibaba Cloud, Stepfun, and open-source models.
Highly Available and Cost-effective Service: Optimized invocation of ASR, LLM, and TTS, efficiently utilizing concurrency and usage to reduce overall costs.

Product Features

Module	Feature	Description
Voice Call with AI Agent	Create, Update, Delete, Query AI Agent	Create AI agents, including adjusting the agent's virtual user profile, persona (system prompt), voice, and parameters for LLM and TTS.
		Initiate AI Agent Voice Call (Single User)	Create an AI agent to achieve real-time voice calls with AI with latency as low as 1s.
		Multi-user Interaction with AI Agent (Beta)	Create a group AI agent instance to enable multi-user interaction with a single AI agent. Note This feature is in beta. Please contact ZEGOCLOUD sales for details.
		Single User with Multiple AI Agents (Beta)	Create AI agents and configure voice mapping rules to enable a single user to interact with multiple AI agents. Note This feature is in beta. Please contact ZEGOCLOUD sales for details.
		AI Audio Processing for Interaction	Automatically filters out noise from the user side during conversations and eliminates far-field human voices, achieving more accurate voice interruption and ASR recognition.
		Natural Voice Interruption	During real-time voice calls, the AI agent intelligently detects user interruption intent and stops its output.
		Real-time Transcription	The conversation between the AI agent and the user is converted to text in real time and displayed on the client.
		ASR Configuration Management	Adjust the ASR used by the AI agent: Vendor models: Tencent, Alibaba Bailian, Microsoft, etc. Supports hot words, recognition language, and other adjustments.
Basic Capabilities	ASR Configuration Management	Adjust the ASR used by the AI agent: Vendor models: Tencent, Alibaba Bailian, Microsoft, etc. Supports hot words, recognition language, and other adjustments.
		LLM Management	Adjust the LLM used by the AI agent. Commercial LLMs: OpenAI, MiniMax, Tongyi Qianwen, Volcano Ark (Doubao), Stepfun, Wenxin Yiyan. Open-source LLMs compatible with OpenAI Chat Completions API.
		TTS Management	Supports various TTS services and related capabilities: Supported providers: Volcano Engine (unidirectional & bidirectional streaming), Alibaba Cloud (CosyVoice), MiniMax; Supports various models, public voices, voice cloning, and adjustments for speed and pitch.
		Digital Human Management	Integrate ZEGOCLOUD digital humans into RTC real-time video interaction. Premium photo-based digital humans require only one photo or image to obtain a 1080P digital human, which can be used as an AI avatar during voice calls.
		Add/Delete/Update AI Agent Instances	Create or delete an AI Agent instance to start a voice or digital human interaction with the agent.
		Get AI Agent Status	Receive server callbacks to get the agent's speaking status; also query agent status via API, including idle, listening, thinking, and speaking states.
		Memory (Context) Source	The agent's memory (context) can be provided externally or by binding to ZIM (ZEGOCLOUD In-app Chat) chat history.
		Memory (Context) Update	During the agent instance lifecycle, each conversation is recorded as context for the agent's memory. Memory can be cleared to start a new conversation.
		Memory (Context) Archiving	Convert the dialogue between users and AI agents into text information and store it.
		ASR Hot Words	For specialized vocabulary such as role names, temporary hot words can be set to improve speech recognition accuracy.
		Proactive LLM Invocation	Simulate user questions by customizing messages sent to LLM, and after LLM responds, send voice to users via TTS. Can be used to implement context-based welcome messages and other scenarios.
		Proactive TTS Invocation	TTS can be invoked at any time to achieve AI's proactive broadcasting, thus satisfying scenarios such as AI welcome messages or user reminders. Also supports configuring whether to add to history records and context.
Advanced Capabilities	AI Agent Interruption Mode Control	The form of interruption when the agent is speaking can include multiple options, and multiple selections are possible: Natural voice interruption: When the agent receives voice input, i.e., when the user speaks, it interrupts the agent's speech. Manual interruption: Control interruption through server-side APIs to enable users to interrupt via buttons or business-side management.
		Filter LLM Output and TTS Input	Filtering based on certain rules, such as Chinese and English brackets, emoji expressions, etc., for more controllable AI behavior.
		Speech Recognition Segmentation Optimization	Support for voice detection segmentation threshold settings and pause duration settings to achieve balance between delay and voice segmentation. For details, see Speech Segmentation Control.
Best Practices	Role-playing Prompt Optimization	When using AI agents for role-playing, learn how to write system prompts to better showcase the effect.
		Better Output with RAG	Support for AI external knowledge base to achieve more basic scripts, company information, and other content. For details, see Using AI Agent with RAG.
		IM Chat with AI and Initiate Voice Call	Based on ZIM, enables text message interaction with AI and sharing memory to initiate voice calls.
		Memory Module	For longer time spans and where AI needs to remember more basic user information (e.g., age, place of birth, preferences), conduct regular summaries and conclusions to achieve smarter AI interactions.
		LoRA, SFT Model Fine-tuning	When there are very high demands for the AI character, fine-tuning of the LLM can be performed. For example, in scenarios where a cloned host replaces a real person.
		AI Voice Chat with Cloned Voice	Apply the cloned voice in the voice call process to achieve communication with an AI agent of a specific voice.