Overview
ZEGOCLOUD AI Agent has been fully upgraded, and version 2.0 is now available. ZEGOCLOUD has developed a new generation of real-time interactive AI designed specifically for AI agents:
- The end-to-end AI voice processing capability has been comprehensively upgraded, achieving over 95% accuracy in recognition and interruption handling, especially in scenarios with double-talk or background music (BGM);
- The interactive architecture has been fully optimized to support multi-user and multi-AI interaction scenarios;
- The integration experience and usability have been significantly improved.
For more details, please refer to the Release Notes.
What is ZEGOCLOUD AI Agent?
ZEGOCLOUD AI Agent provides SDKs and server APIs that help you quickly build ultra-low-latency interactions between users and AI agents, including IM text and image chat, voice calls, and digital human voice calls. These features support scenarios such as AI companionship, AI customer service, and AI digital human live streaming.
ZEGOCLOUD AI Agent supports custom settings for persona, timbre, appearance, and more, and is compatible with various large language models (LLMs) and text-to-speech (TTS) services. It also supports long-term memory, external knowledge bases, and model fine-tuning, so you can build a more capable AI agent.
Why Choose ZEGOCLOUD AI Agent?
Multi-modal Interactive Agent
- Customizable Character: Define the personality and character of AI agents by following prompt best practices, combined with RAG, LoRA, and similar techniques, to better match roles and meet your specific needs.
- Rich Timbres & Voice Cloning: Over a hundred hyper-realistic timbres suitable for various scenarios such as emotional companionship, customer service, e-commerce, etc., with voice cloning capabilities.
- Multi-modal Interaction: Instant text messages, real-time voice calls, video calls, etc.
- Extended Premium Photo Digital Human: In as little as 200 ms, a single photo is all it takes to generate a real-time interactive AI avatar, complete with precise lip synchronization and lifelike facial rendering.
Real-time Voice Call Capability
- Response Delay Reduced to 1 Second Worldwide. ZEGOCLOUD AI Agent uses fully stream-based processing and ZEGOCLOUD's global MSDN (Massive Serial Data Network) to achieve an end-to-end response delay as low as 1 s anywhere in the world.
- Natural Voice Interruption in 500 ms. ZEGOCLOUD AI Agent rapidly and accurately detects human speech, halts its response within 500 ms of an interruption, and avoids crosstalk even under successive interruptions.
- Accurate Speaking-State Detection. While maintaining low response latency, ZEGOCLOUD AI Agent avoids splitting a user's sentence into fragments, which results in more precise AI replies.
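The trade-off behind speaking-state detection (reply quickly, but do not cut a sentence at every short pause) can be illustrated with a small sketch. This is not ZEGOCLOUD's implementation or API: the function `segment_utterances`, its parameters, and the per-frame speech flags are all hypothetical, and the code assumes some voice activity detector has already labeled each audio frame as speech or silence.

```python
from typing import List, Tuple


def segment_utterances(
    frames: List[bool],
    frame_ms: int = 20,
    max_pause_ms: int = 500,
) -> List[Tuple[int, int]]:
    """Group per-frame speech flags into (start_ms, end_ms) utterances.

    Hypothetical helper for illustration only. An utterance is closed only
    after silence has lasted at least max_pause_ms, so shorter pauses stay
    inside the same utterance.
    """
    utterances: List[Tuple[int, int]] = []
    start = None        # index of the first speech frame of the open utterance
    last_speech = None  # index of the most recent speech frame

    for i, is_speech in enumerate(frames):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None:
            # Number of silent frames seen since the last speech frame.
            silence_ms = (i - last_speech) * frame_ms
            if silence_ms >= max_pause_ms:
                utterances.append((start * frame_ms, (last_speech + 1) * frame_ms))
                start = None

    # Flush an utterance that is still open when the input ends.
    if start is not None:
        utterances.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return utterances


# 200 ms of speech, a 200 ms pause, 200 ms of speech, then 600 ms of silence.
frames = [True] * 10 + [False] * 10 + [True] * 10 + [False] * 30
one_sentence = segment_utterances(frames, max_pause_ms=500)
two_fragments = segment_utterances(frames, max_pause_ms=200)
```

With the longer pause threshold the 200 ms gap is treated as part of one sentence; with the shorter threshold the same audio is split into two fragments. A smaller threshold lets the agent respond sooner but increases the risk of replying to half a sentence.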
AI Audio Processing Capabilities for Agents
- AI Noise Reduction (AI ANS). Eliminates environmental noise, music, and distant human voices, supporting interactions in environments such as offices, homes, and cars.
- AI Voice Activity Detection (AI VAD). Accurately identifies valid human speech, filtering out soft responses such as "um" and "oh", as well as coughs and other noises that resemble human sounds.
- AI Echo Cancellation (AI AEC). Precisely eliminates AI voices and background music re-captured by the microphone, preventing the AI's own speech from interrupting it and improving the accuracy of voice interruption. It also incorporates volume ducking and adaptive playback volume.
Customized Integration
- Easy Integration: With fewer than 10 lines of code, you can embed the AI agent into instant messaging, real-time voice calls, or digital-human conversations in your app.
- Flexible Selection of LLM and TTS Plugins: ZEGOCLOUD AI Agent supports multiple domestic and international vendors, such as ModelArk (Douyin), MiniMax, BytePlus, Alibaba Cloud, and Stepfun, as well as open-source models.
- Highly Available, Cost-Efficient Services: By optimizing ASR, LLM, and TTS calls for concurrency and usage, ZEGOCLOUD minimizes end-to-end latency and reduces overall operational costs.
What Can ZEGOCLOUD AI Agent Do?
Module | Function | Description |
---|---|---|
Voice Calls with AI Agents | Create, Modify, Delete, Query AI Agents | Create an AI agent and adjust the basic information of the agent's virtual user, including persona (system prompt) and timbre, as well as the parameters of the LLM and TTS the agent uses. |
Voice Calls with AI Agents | Initiate AI Agent Voice Call | After creating an AI agent, hold real-time voice calls with it at a delay as low as 1 s. |
Voice Calls with AI Agents | Multi-user Interaction with AI Agent (Beta) | Achieve multi-user interaction with a single AI agent by creating group AI agent instances. Note: This feature is in beta testing; contact the ZEGOCLOUD business team for details. |
Voice Calls with AI Agents | Single-user Interaction with Multiple AI Agents (Beta) | Achieve single-user interaction with multiple AI agents by creating AI agent instances and configuring timbre mapping rules. Note: This feature is in beta testing; contact the ZEGOCLOUD business team for details. |
Voice Calls with AI Agents | AI Audio Processing Capability for AI Interaction | Automatically filters out user-side noise generated during conversations and removes far-field human voices, giving more precise voice interruption and more accurate speech recognition (ASR). |
Voice Calls with AI Agents | Natural Voice Interruption | During real-time voice calls, the AI agent intelligently identifies the user's intention to interrupt the conversation and stops its output. |
Voice Calls with AI Agents | Real-time Broadcast | The dialogue between the AI agent and the user is converted into text in real time and displayed on the client. |
Basic Capabilities | Large Language Model (LLM) Management | Adjust the large language model (LLM) used by the AI agent. |
Basic Capabilities | Text-to-Speech (TTS) Management | Supports various TTS providers and related capabilities. |
Basic Capabilities | Add/Delete/Modify AI Agent Instances | Create or delete an AI agent instance to initiate voice or digital human interaction with the agent. |
Basic Capabilities | Get AI Agent Status | Receive server callbacks when the AI agent starts and stops speaking; you can also call the status query API to get states including idle, listening, thinking, and speaking. |
Basic Capabilities | Interact with AI via IM and Make Voice Calls | Based on ZIM, exchange text messages with the AI and initiate voice calls that share the same memory. |
Basic Capabilities | Memory (Context) Source | The AI agent's memory (context) can be provided through external input or by binding historical records from In-app Chat (ZIM). |
Basic Capabilities | Memory (Context) Update | During the lifecycle of an AI agent instance, the content of each conversation is recorded and used as subsequent context messages for the agent's memory. Memory can be cleared to restart the conversation. |
Basic Capabilities | Memory (Context) Archiving | Convert the dialogue between users and AI agents into text and store it. |
Basic Capabilities | Speech Recognition Hot Words | For specialized vocabulary such as role names, temporary hot words can be set to improve speech recognition accuracy. |
Basic Capabilities | Proactive LLM Invocation | Simulate user questions by customizing the messages sent to the LLM; after the LLM responds, the reply is sent to users as voice via TTS. This can be used for context-based welcome messages and similar scenarios. |
Basic Capabilities | Proactive TTS Invocation | TTS can be invoked at any time so that the AI speaks proactively, covering scenarios such as AI welcome messages or user reminders. You can also configure whether the spoken text is added to the history records and context. |
Advanced Capabilities | AI Agent Interruption Mode Control | Several interruption modes are available for when the agent is speaking, and more than one can be selected at once. |
Advanced Capabilities | Filter LLM Output and TTS Input | Filter text according to rules (for example, Chinese and English brackets or emoji) for more controllable AI behavior. |
Advanced Capabilities | Speech Recognition Segmentation Optimization | Set the voice detection segmentation threshold and the pause duration to balance delay against speech segmentation. |
Best Practices | Role-playing Prompt Optimization | When using AI agents for role-playing, learn how to write system prompts that bring out the character more effectively. |
Best Practices | Better Output with RAG | Connect an external knowledge base so that the AI can draw on basic scripts, company information, and other content. |
Best Practices | Memory Module | Over longer time spans, or where the AI needs to remember more basic user information (e.g., age, place of birth, preferences), summarize conversations regularly to achieve smarter AI interactions. |
Best Practices | LoRA, SFT Model Fine-tuning | When the requirements for the AI character are very demanding, the LLM can be fine-tuned, for example when a cloned host stands in for a real person. |
Best Practices | AI Voice Chat with Cloned Voice | Apply a cloned voice during voice calls to communicate with an AI agent that has a specific voice. |
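
As a rough illustration of the "Filter LLM Output and TTS Input" row above, the sketch below removes bracketed text and emoji from an LLM reply before it would be handed to a TTS engine. It is not ZEGOCLOUD's API or its actual filtering rules: `clean_for_tts` is a hypothetical helper, the assumption that bracketed content (such as stage directions) should be dropped entirely is ours, and the emoji range is approximate rather than exhaustive.

```python
import re

# Text inside ASCII or full-width (Chinese) parentheses and square brackets.
# Assumption: the bracketed content itself should not be spoken.
_BRACKETED = re.compile(r"\([^()]*\)|（[^（）]*）|\[[^\[\]]*\]|【[^【】]*】")

# Approximate emoji coverage: main pictograph blocks, misc symbols and
# dingbats, and the emoji variation selector. Not an exhaustive list.
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def clean_for_tts(text: str) -> str:
    """Return text with bracketed spans and emoji removed (hypothetical helper)."""
    text = _BRACKETED.sub("", text)
    text = _EMOJI.sub("", text)
    # Collapse the double spaces that removals can leave behind.
    return re.sub(r"\s{2,}", " ", text).strip()


english = clean_for_tts("Hello (smiles) there! 😊")
chinese = clean_for_tts("你好（微笑）世界")
```

Note that this simple pattern only handles non-nested brackets; nested or unbalanced brackets would need a small parser instead of a single regular expression.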