Overview


Note

ZEGOCLOUD AI Agent has been fully upgraded, and version 2.0 is now released. ZEGOCLOUD has developed a new generation of real-time interactive AI designed specifically for AI agents:

  • End-to-end AI voice processing has been comprehensively upgraded, achieving over 95% accuracy in recognition and interruption handling, especially in scenarios with double-talk or background music (BGM);
  • The interactive architecture has been fully optimized to support multi-user and multi-AI interaction scenarios;
  • The integration experience and usability have been significantly improved.

For more details, please refer to the Release Notes.

What is ZEGOCLOUD AI Agent?

ZEGOCLOUD AI Agent provides SDKs and server APIs that help you quickly build ultra-low-latency interactions between users and AI agents, including IM text and image chat, voice calls, and digital human voice calls. These support scenarios such as AI companionship, AI customer service, and AI digital human live streaming.

ZEGOCLOUD AI Agent supports custom settings for persona, timbre, appearance, and more, and is compatible with various large language models (LLMs) and text-to-speech (TTS) services. It also supports long-term memory, external knowledge bases, and model fine-tuning, so you can build an agent that fits your scenario more closely.

Why Choose ZEGOCLOUD AI Agent?

Multi-modal Interactive Agent

  • Customizable Character: You can define the personality and character of AI agents by following prompt best practices, combined with RAG, LoRA, etc., to better match roles and meet your specific needs.
  • Rich Timbres & Voice Cloning: Over a hundred hyper-realistic timbres suitable for various scenarios such as emotional companionship, customer service, e-commerce, etc., with voice cloning capabilities.
  • Multi-modal Interaction: Instant text messages, real-time voice calls, video calls, etc.
  • Extended Premium Photo Digital Human: In as little as 200 ms, a single photo is all it takes to generate a real-time interactive AI avatar, complete with precise lip synchronization and lifelike facial rendering.

Real-time Voice Call Capability

  • Response Delay Reduced to 1 Second Worldwide. ZEGOCLOUD AI Agent uses fully stream-based processing and our global MSDN (Real-time Sequential Data Network) to achieve an end-to-end response delay as low as 1 s anywhere in the world.
  • Natural Voice Interruption in 500 ms. ZEGOCLOUD AI Agent quickly and accurately detects human speech, stops its response within 500 ms of an interruption, and avoids crosstalk even under successive interruptions.
  • Accurate Speaking-State Detection. While keeping response latency low, ZEGOCLOUD AI Agent avoids cutting a user's sentence into fragments, which results in more precise AI replies.
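
The interruption behavior described above can be pictured as a playback loop that checks for detected user speech between audio chunks. The following is a minimal Python sketch of that idea only. The 20 ms chunk size, the `speech_detected` callback, and the function name are assumptions made for illustration; they are not part of the ZEGOCLOUD SDK, which handles interruption internally.

```python
from typing import Callable, Iterable, List

CHUNK_MS = 20  # assumed playback chunk duration, for illustration only


def play_until_interrupted(
    chunks: Iterable[bytes],
    speech_detected: Callable[[int], bool],
    play: Callable[[bytes], None],
) -> int:
    """Play TTS chunks until the user starts speaking.

    Returns the milliseconds of audio played before stopping.
    `speech_detected(elapsed_ms)` is a hypothetical stand-in for a VAD signal.
    """
    elapsed_ms = 0
    for chunk in chunks:
        if speech_detected(elapsed_ms):
            break  # stop at once; the remaining chunks are discarded
        play(chunk)
        elapsed_ms += CHUNK_MS
    return elapsed_ms


played: List[bytes] = []
reply = [b"\x00" * 640] * 100   # 100 chunks, i.e. 2 s of placeholder audio
user_speaks_at_ms = 300         # simulated moment the user cuts in
stopped_at = play_until_interrupted(
    reply, lambda t: t >= user_speaks_at_ms, played.append
)
print(stopped_at, len(played))  # 300 15
```

Because the check runs once per chunk, the stop latency in this sketch is bounded by one chunk duration after speech is detected; smaller chunks trade a little overhead for a faster stop.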

AI Audio Processing Capabilities for Agents

  • AI Noise Reduction (AI ANS). Eliminates environmental noise, music, and distant background voices, supporting interaction in environments such as offices, homes, and cars.
  • AI Voice Activity Detection (AI VAD). Accurately identifies valid human speech, filtering out short backchannel responses such as "um" and "oh", as well as coughs and other noises that resemble human speech.
  • AI Echo Cancellation (AI AEC). Precisely removes AI speech and background music re-captured by the microphone, which prevents the AI's own speech from interrupting it and improves the accuracy of user voice interruption. It also incorporates volume ducking and adaptive playback volume.
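
The internals of AI VAD are not described on this page. To illustrate only the general idea of discarding very short sounds (an "um" or a cough) while keeping real speech, here is a toy energy-threshold detector with a minimum-duration rule. The frame length, threshold, and minimum duration are assumed values, and this simple technique is a stand-in, not ZEGOCLOUD's method.

```python
from typing import List, Optional, Tuple

FRAME_MS = 20            # assumed frame length
ENERGY_THRESHOLD = 0.1   # assumed level above which a frame counts as voiced
MIN_SPEECH_MS = 300      # assumed minimum duration for a run to count as speech


def detect_speech(energies: List[float]) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for voiced runs of at least MIN_SPEECH_MS."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    # A trailing 0.0 closes any run that is still open at the end.
    for i, energy in enumerate(energies + [0.0]):
        voiced = energy > ENERGY_THRESHOLD
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            if (i - start) * FRAME_MS >= MIN_SPEECH_MS:
                segments.append((start * FRAME_MS, i * FRAME_MS))
            start = None
    return segments


# 100 ms burst (too short, dropped), 200 ms silence, then 400 ms of speech
frames = [0.5] * 5 + [0.0] * 10 + [0.5] * 20
print(detect_speech(frames))  # [(300, 700)]
```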

Customized Integration

  • Easy Integration: With fewer than 10 lines of code, you can embed the AI agent into instant messaging, real-time voice calls, or digital-human conversations in your app.
  • Flexible Selection of LLM and TTS Plugins: ZEGOCLOUD AI Agent supports multiple vendors, both domestic and international, such as ModelArk (Douyin), MiniMax, BytePlus, Alibaba Cloud, and Stepfun, and also supports open-source models.
  • Highly Available, Cost-Efficient Services: By optimizing ASR, LLM, and TTS calls for concurrency and usage, ZEGOCLOUD minimizes end-to-end latency and reduces overall operational costs.
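
As a rough picture of what per-agent LLM and TTS selection could look like on the application side, the sketch below models an agent configuration and validates the chosen vendors before it is sent anywhere. Every field name, the `AgentConfig` class, and the placeholder model and voice values are hypothetical; the vendor names are the ones listed above. The actual request schema is defined by the ZEGOCLOUD server API reference and may differ.

```python
from dataclasses import asdict, dataclass

# Vendor names come from the list above; using them as plain string
# identifiers is an assumption made for this example.
KNOWN_VENDORS = {"ModelArk", "MiniMax", "BytePlus", "Alibaba Cloud", "Stepfun"}


@dataclass
class AgentConfig:
    """Hypothetical application-side model of an agent's settings."""
    name: str
    system_prompt: str
    llm_vendor: str
    llm_model: str
    tts_vendor: str
    tts_voice: str

    def validate(self) -> None:
        if not self.system_prompt.strip():
            raise ValueError("system_prompt must not be empty")
        for role, vendor in (("LLM", self.llm_vendor), ("TTS", self.tts_vendor)):
            if vendor not in KNOWN_VENDORS:
                raise ValueError(f"unsupported {role} vendor: {vendor}")


config = AgentConfig(
    name="support-agent",
    system_prompt="You are a friendly customer service assistant.",
    llm_vendor="ModelArk",
    llm_model="example-model",   # placeholder, not a real model name
    tts_vendor="MiniMax",
    tts_voice="example-voice",   # placeholder, not a real voice name
)
config.validate()
payload = asdict(config)         # plain dict, ready to serialize as JSON
```

Validating locally before calling a server API keeps configuration mistakes out of the request path; swapping vendors then means changing two fields rather than integration code.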

What Can ZEGOCLOUD AI Agent Do?

| Module | Function | Description |
| --- | --- | --- |
| Voice Calls with AI Agents | Create, Modify, Delete, Query AI Agents | Create and manage AI agents, including the basic profile of the agent's virtual user (persona via system prompt, timbre, etc.) and the LLM and TTS parameters the agent uses. |
| | Initiate AI Agent Voice Call | After creating an AI agent, start a real-time voice call with it, with a delay as low as 1 s. |
| | Multi-user Interaction with AI Agent (Beta) | Achieve multi-user interaction with a single AI agent by creating group AI agent instances. Note: this feature is in beta testing; please contact ZEGOCLOUD business staff for details. |
| | Single User with Multiple AI Roles (Beta) | Achieve single-user interaction with multiple AI agents by creating AI agent instances and configuring timbre mapping rules. Note: this feature is in beta testing; please contact ZEGOCLOUD business staff for details. |
| | AI Audio Processing Capability for AI Interaction | Automatically filters out user-side noise generated during conversations and removes far-field human voices, resulting in more precise voice interruption and more accurate speech recognition (ASR). |
| | Natural Voice Interruption | During real-time voice calls, the AI agent intelligently identifies the user's intention to interrupt the conversation and stops its output. |
| | Real-time Broadcast | The dialogue between the AI agent and the user is converted into text in real time and displayed by the client. |
| Basic Capabilities | Large Language Model (LLM) Management | Adjust the large language model (LLM) used by the AI agent. |
| | Text-to-Speech (TTS) Management | Support for various TTS providers and related capabilities. Supported service providers: BytePlus (Large Model Voice Synthesis & Streaming Text-to-Speech), Alibaba Cloud (CosyVoice), MiniMax. Also supports various models, public timbres, vendor voice cloning, and speed and tone adjustments. |
| | Add/Delete/Modify AI Agent Instances | Create or delete an AI agent instance to initiate voice or digital human interaction with the agent. |
| | Get AI Agent Status | Receive server callbacks when the AI agent starts and stops speaking; you can also call the AI agent status query API to get states including idle, listening, thinking, and speaking. |
| | Interact with AI via IM and Make Voice Calls | Based on ZIM, exchange text messages with the AI and start voice calls that share the same memory. |
| | Memory (Context) Source | The AI agent's memory (context) can be provided through external input or by binding historical records from In-app Chat (ZIM). |
| | Memory (Context) Update | During the lifecycle of an AI agent instance, the content of each conversation is recorded and used as subsequent context messages for the agent's memory. Memory can be cleared to restart the conversation. |
| | Memory (Context) Archiving | Convert the dialogue between users and AI agents into text and store it. |
| | Speech Recognition Hot Words | For specialized vocabulary such as role names, temporary hot words can be set to improve speech recognition accuracy. |
| | Proactive LLM Invocation | Simulate user questions by customizing the messages sent to the LLM; after the LLM responds, the reply is spoken to users via TTS. This can be used for context-based welcome messages and similar scenarios. |
| | Proactive TTS Invocation | TTS can be invoked at any time so that the AI speaks proactively, for scenarios such as AI welcome messages or user reminders. You can also configure whether the spoken content is added to history records and context. |
| Advanced Capabilities | AI Agent Interruption Mode Control | Controls how the agent can be interrupted while it is speaking. More than one mode can be enabled at the same time. Natural voice interruption: the agent's speech is interrupted when it receives voice input, i.e., when the user speaks. Manual interruption: interruption is triggered through server-side APIs, so users can interrupt via buttons or the business side can manage it. |
| | Filter LLM Output and TTS Input | Filter content according to rules, such as text in Chinese or English brackets and emoji, for more controllable AI behavior. |
| | Speech Recognition Segmentation Optimization | Supports setting the voice detection segmentation threshold and pause duration to balance delay against segmentation accuracy. |
| Best Practices | Role-playing Prompt Optimization | When using AI agents for role-playing, learn how to write system prompts that bring out the role more effectively. |
| | Better Output with RAG | Connect an external knowledge base so the AI can draw on basic scripts, company information, and other content. |
| | Memory Module | For longer time spans, where the AI needs to remember more basic user information (e.g., age, place of birth, preferences), summarize conversations regularly to achieve smarter AI interactions. |
| | LoRA, SFT Model Fine-tuning | When the requirements for the AI character are very demanding, the LLM can be fine-tuned, for example in scenarios where a cloned host replaces a real person. |
| | AI Voice Chat with Cloned Voice | Apply a cloned voice in the voice call so that users talk with an AI agent that has a specific voice. |
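
As an illustration of the "Filter LLM Output and TTS Input" capability listed above, the following sketch strips bracketed text (ASCII and full-width brackets) and emoji from an LLM reply before it would be passed to TTS. The regular expressions and the `filter_for_tts` name are assumptions for this example; the product's actual filtering rules are configured through its API and may differ.

```python
import re

# Text inside ASCII or full-width (Chinese) parentheses and square brackets.
BRACKETED = re.compile(r"\([^()]*\)|（[^（）]*）|\[[^\[\]]*\]|【[^【】]*】")
# A rough emoji range; real emoji coverage is broader than this.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def filter_for_tts(text: str) -> str:
    """Remove bracketed asides and emoji, then collapse leftover whitespace."""
    text = BRACKETED.sub("", text)
    text = EMOJI.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(filter_for_tts("Hello! (waves) Nice to meet you \U0001F60A"))
# Hello! Nice to meet you
```

Filtering before TTS keeps stage directions and emoji, which LLMs often emit in role-play, from being read aloud.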
