Overview

Note

ZEGOCLOUD AI Agent has been fully upgraded and version 2.0 is now released, ZEGOCLOUD has developed a new generation of real-time interactive AI specifically designed for AI agents:

The end-to-end AI voice processing capability has been comprehensively upgraded, achieving over 95% accuracy in recognition and interruption handling, especially in scenarios with double-talk or BGM;
The interactive architecture has been fully optimized to support multi-user and multi-AI interaction scenarios;
The integration experience and usability have been significantly improved.

For more details, please refer to the Release Notes.

What is ZEGOCLOUD AI Agent?

ZEGOCLOUD AI Agent provides SDK and server APIs to help you quickly achieve ultra-low latency IM text and image chatting, voice calls, digital human voice calls, and other interactive features between users and AI agents, thereby fulfilling scenarios such as AI companionship, AI customer service, AI digital human live streaming, etc.

ZEGOCLOUD AI Agent supports custom settings for persona, timbre, appearance, etc., and is compatible with various large language models (LLMs) and text-to-speech services (TTS). It also supports long-term memory, external knowledge bases, and model fine-tuning, thereby delivering a more perfect AI agent.

Why Choose ZEGOCLOUD AI Agent?

Customizable Character: You can define the personality and character of AI agents through prompts best practices, combined with RAG, LoRA, etc., to better match roles and meet exclusive needs.
Rich Timbres & Voice Cloning: Over a hundred hyper-realistic timbres suitable for various scenarios such as emotional companionship, customer service, e-commerce, etc., with voice cloning capabilities.
Multi-modal Interaction: Instant text messages, real-time voice calls, video calls, etc.
Extended Premium Photo Digital Human: In as quickly as 200 ms, a single photo is all it takes to generate a real-time interactive AI avatar—complete with precise lip synchronization and lifelike facial rendering.

Real-time Voice Call Capability

Response Delay Reduced to 1 Second Worldwide. ZEGOCLOUD AI Agent adopts fully stream-based processing, leverages our global MSDN (Real-time Sequential Data Network), and achieves an as quick as 1s end-to-end response delay anywhere in the world.
Natural Voice Interruption in 500 ms. ZEGOCLOUD AI Agent rapidly and accurately detects human speech, seamlessly halts its responses within 500 ms upon interruption, and ensures no cross talk even under successive interrupts.
Accurate Speaking-State Detection. While maintaining low response latency, ZEGOCLOUD AI Agent prevents sentence fragments from being split, resulting in more precise AI replies.

AI Audio Processing Capabilities for Agents

AI Noise Reduction (AI ANS). Eliminates environmental noise, music, distant environmental human voices, etc., supporting interactions in various environments such as offices, homes, cars, etc.
AI Voice Activity Detection (AI VAD). Accurately identifies effective human voices, filtering out soft responses like "um", "oh", as well as coughs and other noises resembling human sounds.
AI Echo Cancellation (AI AEC). Precisely eliminates AI voices and background music re-captured by microphones, preventing AI speech from interrupting itself, improving the accuracy of voice when interrupting AI. Also combines functions such as volume ducking and playback volume self-adaptation.

Customized Integration

Easy Integration: With fewer than 10 lines of code, you can embed the AI agent into instant messaging, real-time voice calls, or digital-human conversations in your app.
Flexible Selection of LLM and TTS Plugins: ZEGOCLOUD AI Agent supports multiple vendors both domestic and international, such as ModelArk (Douyin), MiniMax, BytePlus, Alibaba Cloud, Stepfun, etc., and also supports open-source models.
Highly Available, Cost-Efficient Services: By optimizing ASR, LLM, and TTS calls for concurrency and usage, ZEGOCLOUD minimizes end-to-end latency and reduces overall operational costs.

What Can ZEGOCLOUD AI Agent Do?

Module	Function	Description
Voice Calls with AI Agents	Create, Modify, Delete, Query AI Agents	Create an AI agent, including adjusting the basic information description of the AI agent virtual user, including persona (system prompt), timbre, etc., as well as parameters of LLM and TTS used by the agent.
	Initiate AI Agent Voice Call	Through creating an AI agent, achieve real-time voice calls with AI with a minimum delay of 1s.
	Multi-user Interaction with AI Agent (Beta)	Achieve multi-user interaction with a single AI agent by creating group AI agent instances. Note Feature in beta testing, please contact ZEGOCLOUD business for details.
	Single user vs multiple AI roles (Beta)	Achieve single user interaction with multiple AI agents by creating AI agent instances and configuring voice color mapping rules. Note Feature in beta testing, please contact ZEGOCLOUD Business for details.
	AI Audio Processing Capability for AI Interaction	Automatically filters out user-side noise generated during conversations and removes far-field human voices, achieving more precise voice interruption effects and more accurate ASR speech recognition.
	Natural Voice Interruption	During real-time voice calls, the AI agent intelligently identifies the user's intention to interrupt the conversation and stops its output.
	Real-time Broadcast	The dialogue information between the AI agent and the user will be converted into text in real-time and displayed by the client.
Basic Capabilities	Large Language Model (LLM) Management	Adjust the large language model (LLM) applied by the AI agent. Commercial LLMs: OpenAI, MiniMax, Qwen, Volcano Ark, Stepfun, ERNIE. Open-source LLMs compatible with OpenAI Chat Completions API.
	Text-to-Speech (TTS) Management	Support for various TTS providers and related capabilities: Supported service providers: BytePlus (Large Model Voice Synthesis & Streaming Text-to-Speech), Alibaba Cloud (CosyVoice), MiniMax; Various models, public timbres, voice cloning from vendors, and support for speed and tone adjustments.
	Digital Human Management	Integrate digital human images into RTC real-time video interactions based on ZEGO digital human. With premium photo digital humans, you can obtain a 1080P digital human with just a single photo or image, and assign an AI avatar during voice calls.
	Add/Delete/Modify AI Agent Instances	Create or delete an AI Agent instance to initiate voice or digital human interaction with the agent.
	Get AI Agent Status	Receive corresponding server callbacks to get the AI agent's start speaking and end speaking status; also can query AI agent status API to get states including idle, listening, thinking, speaking, etc.
	Interact with AI via IM and makes voice calls	Based on ZIM, implement text message interaction with AI and share memory to initiate voice calls.
	Memory (Context) Source	The AI agent's memory (context) can be provided through external input or by binding historical records from In-app Chat (ZIM).
	Memory (Context) Update	During the lifecycle of this AI agent instance, record the content of each conversation and use it as subsequent context messages for the agent's memory. Memory can be cleared to restart the conversation.
	Memory (Context) Archiving	Convert the dialogue between users and AI agents into text information and store it
	Speech Recognition Hot Words	For specialized vocabulary such as role names, temporary hot words can be set to improve speech recognition accuracy.
	Proactive LLM Invocation	Simulate user questions by customizing messages sent to LLM, and after LLM responds, send voice to users via TTS. Can be used to implement context-based welcome messages and other scenarios.
Proactive TTS Invocation	TTS can be invoked at any time to achieve AI's proactive broadcasting, thus satisfying scenarios such as AI welcome messages or user reminders. Also supports configuring whether to add to history records and context
Advanced Capabilities	AI Agent Interruption Mode Control	The form of interruption when the agent is speaking can include multiple options, and multiple selections are possible: Natural voice interruption: When the agent receives voice input, i.e., when the user speaks, it interrupts the agent's speech. Manual interruption: Control interruption through server-side APIs to enable users to interrupt via buttons or business-side management.
	Filter LLM Output and TTS Input	Filtering based on certain rules, such as Chinese and English brackets, emoji expressions, etc., for more controllable AI behavior.
	Speech Recognition Segmentation Optimization	Support for voice detection segmentation threshold settings and pause duration settings to achieve balance between delay and voice segmentation.
Best Practices	Role-playing Prompt Optimization	When using AI agents for role-playing, learn how to write system prompts to better showcase the effect.
	Better Output with RAG	Support for AI external knowledge base to achieve more basic scripts, company information, and other content.
	Memory Module	For longer time spans and where AI needs to remember more basic user information (e.g., age, place of birth, preferences), conduct regular summaries and conclusions to achieve smarter AI interactions.
	LoRA, SFT Model Fine-tuning	When there are very high demands for the AI character, fine-tuning of the LLM can be performed. For example, in scenarios where a cloned host replaces a real person.
	AI Voice Chat with Cloned Voice	Apply the cloned voice in the voice call process to achieve communication with an AI agent of a specific voice.