Overview


Note

The Real-time Interactive AI Agent has been fully upgraded and released as version 2.0. ZEGOCLOUD has built a new generation of real-time interactive AI specifically for AI agents:

  • Comprehensive upgrade of end-to-end AI voice processing capabilities, achieving >95% recognition and interruption accuracy, with special optimization for scenarios such as double talk and background music (BGM).
  • Fully optimized interaction architecture, supporting multi-user and multi-AI interaction scenarios.
  • Greatly improved integration experience and usability.

For details, please refer to the Release Notes.

Product Introduction

ZEGOCLOUD AI Agent (hereinafter referred to as "Interactive AI" or "AI Agent") enables you to quickly implement ultra-low-latency IM text and image chat, voice calls, and digital human voice calls between users and AI agents by integrating the SDK and server APIs. This meets the needs of scenarios such as AI companionship, AI customer service, and AI digital human live streaming. ZEGOCLOUD AI Agent supports custom persona, voice, and avatar settings and multiple large language model (LLM) and text-to-speech (TTS) services. It also supports long-term memory, external knowledge bases, and model fine-tuning for building more capable AI agents.

Product Advantages

Multimodal Interactive AI Agent

  • Personalized Persona: Define the AI's personality and role. Use best practices for prompts, combined with RAG, LoRA, and other methods to better match roles and meet specific requirements.
  • Rich Voice Options & Voice Cloning: Over a hundred highly human-like voices suitable for emotional companionship, customer service, e-commerce, etc., with support for voice cloning.
  • Multimodal Interaction: IM text & image messages, real-time voice calls, video calls, and more.
  • Premium Photo-based Digital Human: With just one photo and as low as 200ms latency, give your AI a real-time interactive avatar with accurate lip sync and realistic facial expressions.

Real-time Voice Call Capabilities

  • Response latency as low as 1s. Full streaming processing, combined with global access via ZEGOCLOUD's proprietary MSDN (Media Streaming Delivery Network) nodes, keeps response latency as low as 1s worldwide.
  • Natural voice interruption in just 500ms. Rapid and accurate human voice detection for smooth, non-intrusive interruptions, with no crosstalk during consecutive interruptions.
  • Accurate speaking state detection. Ensures that replies are not mistakenly split into multiple sentences, providing more precise AI responses without affecting reply latency.

AI Audio Processing Tailored for Agents

  • AI Noise Suppression (AI ANS): Eliminates environmental noise, music, distant human voices, etc., supporting interaction in various environments such as offices, homes, and cars.
  • AI Voice Activity Detection (AI VAD): Accurately detects valid human speech, filtering out soft responses like "um", "oh", as well as coughs and human-like noises.
  • AI Echo Cancellation (AI AEC): Precisely removes AI voice and background music picked up by the microphone, preventing AI from interrupting itself and improving speech accuracy during interruptions. Also supports volume ducking and adaptive playback volume.

Personalized Integration

  • Easy Integration: Add AI agents to IM, real-time voice calls, or digital human calls with fewer than 10 lines of code.
  • Flexible LLM and TTS Plugin Selection: Supports a wide range of domestic and international providers such as Volcano Ark (Doubao), MiniMax, Volcano Engine, Alibaba Cloud, Stepfun, and open-source models.
  • Highly Available and Cost-effective Service: Optimized invocation of ASR, LLM, and TTS, efficiently utilizing concurrency and usage to reduce overall costs.

Product Features

| Module | Feature | Description |
| --- | --- | --- |
| Voice Call with AI Agent | Create, Update, Delete, Query AI Agent | Create AI agents, including adjusting the agent's virtual user profile, persona (system prompt), voice, and parameters for LLM and TTS. |
| Voice Call with AI Agent | Initiate AI Agent Voice Call (Single User) | Create an AI agent for real-time voice calls with latency as low as 1s. |
| Voice Call with AI Agent | Multi-user Interaction with AI Agent (Beta) | Create a group AI agent instance to enable multi-user interaction with a single AI agent. Note: This feature is in beta. Please contact ZEGOCLOUD sales for details. |
| Voice Call with AI Agent | Single User with Multiple AI Agents (Beta) | Create AI agents and configure voice mapping rules to enable a single user to interact with multiple AI agents. Note: This feature is in beta. Please contact ZEGOCLOUD sales for details. |
| Voice Call with AI Agent | AI Audio Processing for Interaction | Automatically filters out user-side noise during conversations and eliminates far-field human voices, for more accurate voice interruption and speech recognition. |
| Voice Call with AI Agent | Natural Voice Interruption | During real-time voice calls, the AI agent intelligently detects user interruption intent and stops its output. |
| Voice Call with AI Agent | Real-time Transcription | The conversation between the AI agent and the user is converted to text in real time and displayed on the client. |
| Voice Call with AI Agent | ASR Configuration Management | Adjust the ASR used by the AI agent. Vendor models: Tencent, Alibaba Bailian, Microsoft, etc. Supports hot words, recognition language, and other adjustments. |
| Basic Capabilities | LLM Management | Adjust the LLM used by the AI agent. Commercial LLMs: OpenAI, MiniMax, Tongyi Qianwen, Volcano Ark (Doubao), Stepfun, Wenxin Yiyan. Open-source LLMs compatible with the OpenAI Chat Completions API are also supported. |
| Basic Capabilities | TTS Management | Supports various TTS services and related capabilities. Supported providers: Volcano Engine (unidirectional and bidirectional streaming), Alibaba Cloud (CosyVoice), MiniMax. Supports various models, public voices, voice cloning, and speed and pitch adjustments. |
| Basic Capabilities | Digital Human Management | Integrate ZEGOCLOUD digital humans into RTC real-time video interaction. Premium photo-based digital humans require only one photo or image to produce a 1080P digital human, which can be used as an AI avatar during voice calls. |
| Basic Capabilities | Add/Delete/Update AI Agent Instances | Create or delete an AI Agent instance to start a voice or digital human interaction with the agent. |
| Basic Capabilities | Get AI Agent Status | Receive server callbacks for the agent's speaking status, or query the agent's status via API, including idle, listening, thinking, and speaking states. |
| Basic Capabilities | Memory (Context) Source | The agent's memory (context) can be provided externally or by binding to ZIM (ZEGOCLOUD In-app Chat) chat history. |
| Basic Capabilities | Memory (Context) Update | During the agent instance lifecycle, each conversation is recorded as context for the agent's memory. Memory can be cleared to start a new conversation. |
| Basic Capabilities | Memory (Context) Archiving | Convert the dialogue between users and AI agents into text and store it. |
| Basic Capabilities | ASR Hot Words | For specialized vocabulary such as role names, temporary hot words can be set to improve speech recognition accuracy. |
| Basic Capabilities | Proactive LLM Invocation | Simulate user questions by sending custom messages to the LLM; after the LLM responds, the reply is delivered to users as speech via TTS. Can be used for context-based welcome messages and similar scenarios. |
| Basic Capabilities | Proactive TTS Invocation | Invoke TTS at any time for proactive AI announcements, such as welcome messages or user reminders. You can also configure whether the speech is added to history records and context. |
| Advanced Capabilities | AI Agent Interruption Mode Control | Controls how the agent can be interrupted while it is speaking; multiple modes can be enabled together. Natural voice interruption: the agent stops speaking when it receives voice input, that is, when the user speaks. Manual interruption: interruption is triggered through server-side APIs, so users can interrupt via buttons or the business side can manage it. |
| Advanced Capabilities | Filter LLM Output and TTS Input | Filters content based on rules, such as Chinese and English brackets and emoji, for more controllable AI behavior. |
| Advanced Capabilities | Speech Recognition Segmentation Optimization | Supports setting the voice detection segmentation threshold and the pause duration to balance latency against speech segmentation. |
| Best Practices | Role-playing Prompt Optimization | Learn how to write system prompts that get the best results when using AI agents for role-playing. |
| Best Practices | Better Output with RAG | Connect an external knowledge base so that the AI can draw on basic scripts, company information, and other content. For details, see Using AI Agent with RAG. |
| Best Practices | IM Chat with AI and Initiate Voice Call | Based on ZIM, exchange text messages with the AI and initiate voice calls that share the same memory. |
| Best Practices | Memory Module | For longer time spans, or where the AI needs to remember more basic user information (e.g., age, place of birth, preferences), periodically summarize conversations to achieve smarter AI interactions. |
| Best Practices | LoRA, SFT Model Fine-tuning | When the demands on the AI character are very high, the LLM can be fine-tuned, for example in scenarios where a cloned host replaces a real person. |
| Best Practices | AI Voice Chat with Cloned Voice | Apply a cloned voice in voice calls so that users talk to an AI agent with a specific voice. |
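
To make the "Filter LLM Output and TTS Input" feature listed above more concrete, the following is a minimal Python sketch of rule-based filtering applied to LLM text before it is sent to TTS. It is an illustration only, not ZEGOCLOUD's implementation or API: the function name `filter_for_tts`, the bracket rules, and the emoji ranges are all assumptions chosen for the example.

```python
import re

# Hypothetical rule: drop text wrapped in English (ASCII) or Chinese
# (full-width) brackets, e.g. stage directions like "(smiles)" or "（微笑）".
_BRACKETED = re.compile(r"\([^()]*\)|（[^（）]*）|\[[^\[\]]*\]|【[^【】]*】")

# Hypothetical rule: drop common emoji. These ranges are a rough
# approximation and do not cover every emoji code point.
_EMOJI = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero-width joiner
    "]+"
)


def filter_for_tts(llm_text: str) -> str:
    """Return llm_text with bracketed asides and emoji removed."""
    text = llm_text
    # Repeat until stable so that nested brackets are removed inside-out.
    previous = None
    while previous != text:
        previous = text
        text = _BRACKETED.sub("", text)
    text = _EMOJI.sub("", text)
    # Collapse the whitespace left behind by removed spans.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(filter_for_tts("Hello (smiles) there! 😊"))
    print(filter_for_tts("你好（微笑），很高兴见到你【停顿】"))
```

Filtering of this kind keeps the TTS engine from reading stage directions or emoji names aloud. In the actual product the rules are configured through the service, so refer to the official documentation for the supported options.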
