
How to Build an AI Voice Agent


Building an AI voice agent is often harder than it looks. Many developers spend weeks wiring together speech recognition, natural language processing, and voice synthesis just to get basic voice features working. Each service brings its own authentication steps, audio handling challenges, and hours of debugging, which quickly makes the process overwhelming.

Instead of dealing with all that complexity, imagine deploying a production-ready AI voice agent that understands speech, processes requests intelligently, and responds with natural, human-like voices in just minutes.

ZEGOCLOUD conversational AI provides a unified platform that manages the entire voice pipeline for you. There is no need for audio engineering, service orchestration, or infrastructure maintenance. With simple APIs, your application can turn voice input into intelligent, real-time responses.

This tutorial will guide you through the process. By the end, you will have built an AI voice agent that feels natural to interact with, delivers contextual replies instantly, and enhances user experiences across different scenarios.

How to Build an AI-Powered Voice Agent in 10 Minutes

ZEGOCLOUD built voice agents on top of their proven communication platform. The same technology that powers millions of voice calls now includes AI agents with speech recognition, language processing, and voice synthesis built right in.

This means you get years of audio optimization and network reliability that already work in production. Instead of relying on experimental voice AI services, you are building on enterprise-grade communication tools with AI capabilities already integrated.

Your voice agent runs on established WebRTC protocols, tested audio codecs, and streaming technology that handles critical business communications worldwide. You get AI conversation capabilities through a foundation that already works reliably at scale.

👉 Schedule a Demo


Prerequisites

Before building your AI voice agent, ensure you have these components ready:

  • ZEGOCLOUD developer account with valid AppID and ServerSecret – Sign up
  • Any compatible LLM API key for intelligent conversation processing and contextual response generation.
  • Node.js 18+ installed locally with npm for package management and development servers.
  • Testing environment with microphone access since voice agents require real audio hardware capabilities.
  • Basic familiarity with React hooks and Express.js patterns for following the implementation steps.

1. Project Setup and Configuration

Before we proceed with the setup, the diagram below shows how the parts of the voice agent we're building connect:

(Diagram: ZEGOCLOUD AI voice agent architecture)

Now, create the project structure for your voice agent:

mkdir ai-voice-agent
cd ai-voice-agent
mkdir server client

Install backend dependencies for voice agent processing:

cd server
npm init -y
npm install express cors dotenv axios typescript tsx
npm install --save-dev @types/express @types/cors @types/node

Create a .env file in the server directory and fill in the values below:

ZEGO_APP_ID=your_zego_app_id
ZEGO_SERVER_SECRET=your_zego_server_secret
ZEGO_API_BASE_URL=https://aigc-aiagent-api.zegotech.cn
DASHSCOPE_API_KEY=your_dashscope_api_key
PORT=8080
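It helps to validate these variables at startup so a missing key fails fast instead of surfacing as a confusing error deep inside a request handler. The sketch below is illustrative (the `loadConfig` and `requireEnv` names are not part of the tutorial's code); in the server you would call `dotenv.config()` first and then pass `process.env` in:

```typescript
// Illustrative config loader: fail fast on missing keys at startup.
type Env = Record<string, string | undefined>

// Hypothetical helper: read a required variable or throw with a clear message
function requireEnv(env: Env, key: string): string {
  const value = env[key]
  if (!value) throw new Error(`Missing required environment variable: ${key}`)
  return value
}

export function loadConfig(env: Env) {
  return {
    ZEGO_APP_ID: requireEnv(env, 'ZEGO_APP_ID'),
    ZEGO_SERVER_SECRET: requireEnv(env, 'ZEGO_SERVER_SECRET'),
    // Optional values fall back to the defaults used in this tutorial
    ZEGO_API_BASE_URL: env.ZEGO_API_BASE_URL ?? 'https://aigc-aiagent-api.zegotech.cn',
    DASHSCOPE_API_KEY: env.DASHSCOPE_API_KEY ?? '',
    PORT: Number(env.PORT ?? 8080)
  }
}
```

With this in place, `const CONFIG = loadConfig(process.env)` gives you the `CONFIG` object that the server snippets below read from.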

2. Voice Agent Server Implementation

2.1 Authentication System

Generate ZEGOCLOUD authentication tokens for voice sessions:

app.get('/api/token', (req: Request, res: Response): void => {
  try {
    const userId = req.query.user_id as string
    const roomId = req.query.room_id as string

    if (!userId) {
      res.status(400).json({ error: 'user_id required' })
      return
    }

    const payload = {
      room_id: roomId || '',
      privilege: { 1: 1, 2: 1 }, // Login and publish permissions
      stream_id_list: null
    }

    const token = generateToken04(
      parseInt(CONFIG.ZEGO_APP_ID, 10),
      userId,
      CONFIG.ZEGO_SERVER_SECRET,
      3600,
      JSON.stringify(payload)
    )

    res.json({ token })
  } catch (error) {
    res.status(500).json({ error: 'Failed to generate token' })
  }
})

2.2 Voice Agent Registration

Configure the AI agent with voice-specific capabilities including speech recognition, text-to-speech, and conversation parameters:

async function registerAgent(): Promise<string> {
  if (REGISTERED_AGENT_ID) return REGISTERED_AGENT_ID

  const agentId = `agent_${Date.now()}`
  const agentConfig = {
    AgentId: agentId,
    Name: 'AI Assistant',
    LLM: {
      Url: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions',
      ApiKey: CONFIG.DASHSCOPE_API_KEY || 'zego_test',
      Model: 'qwen-plus',
      SystemPrompt: 'You are a helpful AI assistant. Be concise and friendly. Respond in the same language as the user. Keep responses under 100 words for better voice conversation flow.',
      Temperature: 0.7,
      TopP: 0.9,
      Params: { 
        max_tokens: 200
      }
    },
    TTS: {
      Vendor: 'CosyVoice',
      Params: {
        app: { 
          api_key:'zego_test'
        },
        payload: {
          model: 'cosyvoice-v2',
          parameters: {
            voice: 'longxiaochun_v2', // Natural Chinese voice
            speed: 1.0,
            volume: 0.8
          }
        }
      },
      FilterText: [
        {
          BeginCharacters: '(',
          EndCharacters: ')'
        },
        {
          BeginCharacters: '[',
          EndCharacters: ']'
        }
      ]
    },
    ASR: {
      HotWord: 'ZEGOCLOUD|10,AI|8,Assistant|8,money|10,help|8',
      VADSilenceSegmentation: 1500,  // Wait 1.5 seconds before processing speech
      PauseInterval: 2000  // Concatenate speech within 2 seconds
    }
  }

  const result = await makeZegoRequest('RegisterAgent', agentConfig)
  if (result.Code !== 0) {
    throw new Error(`RegisterAgent failed: ${result.Code} ${result.Message}`)
  }

  REGISTERED_AGENT_ID = agentId
  console.log('Agent registered:', agentId)
  return agentId
}

The ASR configuration is crucial for voice agents – VADSilenceSegmentation prevents cutting off users mid-sentence, while PauseInterval handles natural speech patterns.
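The `makeZegoRequest` helper used above wraps ZEGOCLOUD's server API, which expects signed common query parameters on every request. A minimal sketch follows; the md5-based signature formula (`AppId + SignatureNonce + ServerSecret + Timestamp`) reflects ZEGOCLOUD's server-side signing scheme, but verify it against the official signing docs before relying on it:

```typescript
import crypto from 'node:crypto'

// Build the common query parameters ZEGOCLOUD server APIs expect.
// Assumption: Signature = md5(AppId + SignatureNonce + ServerSecret + Timestamp).
export function buildSignedQuery(
  action: string,
  appId: number,
  serverSecret: string,
  nonce: string = crypto.randomBytes(8).toString('hex'),
  timestamp: number = Math.floor(Date.now() / 1000)
): string {
  const signature = crypto.createHash('md5')
    .update(`${appId}${nonce}${serverSecret}${timestamp}`)
    .digest('hex')
  return new URLSearchParams({
    Action: action,
    AppId: String(appId),
    SignatureNonce: nonce,
    SignatureVersion: '2.0',
    Timestamp: String(timestamp),
    Signature: signature
  }).toString()
}

// POST the JSON body to the signed URL and return the parsed response,
// which carries the { Code, Message, Data } fields checked above.
export async function makeZegoRequest(
  action: string,
  body: object,
  baseUrl: string,
  appId: number,
  serverSecret: string
): Promise<any> {
  const url = `${baseUrl}/?${buildSignedQuery(action, appId, serverSecret)}`
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body)
  })
  return response.json()
}
```

In the server code, the base URL, AppID, and ServerSecret come from the `.env` values configured earlier.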

2.3 Voice Session Management

Create voice agent instances that handle real-time audio streams and voice processing:

app.post('/api/start', async (req: Request, res: Response): Promise<void> => {
  try {
    const { room_id, user_id, user_stream_id } = req.body

    if (!room_id || !user_id) {
      res.status(400).json({ error: 'room_id and user_id required' })
      return
    }

    const agentId = await registerAgent()

    const userStreamId = user_stream_id || `${user_id}_stream`
    const agentUserId = `agent_${room_id}`
    const agentStreamId = `agent_stream_${room_id}`

    const instanceConfig = {
      AgentId: agentId,
      UserId: user_id,
      RTC: {
        RoomId: room_id,
        AgentUserId: agentUserId,
        AgentStreamId: agentStreamId,
        UserStreamId: userStreamId
      },
      MessageHistory: {
        SyncMode: 1,
        Messages: [],
        WindowSize: 10
      },
      CallbackConfig: {
        ASRResult: 1,        // Voice transcription events
        LLMResult: 1,        // AI response events  
        Exception: 1,        // Error handling
        Interrupted: 1,      // Voice interruption support
        UserSpeakAction: 1,  // User speech start/stop
        AgentSpeakAction: 1  // AI speech start/stop
      },
      AdvancedConfig: {
        InterruptMode: 0  // Enable natural voice interruption
      }
    }

    const result = await makeZegoRequest('CreateAgentInstance', instanceConfig)

    if (result.Code !== 0) {
      res.status(400).json({ error: result.Message || 'Failed to create instance' })
      return
    }

    res.json({
      success: true,
      agentInstanceId: result.Data?.AgentInstanceId,
      agentUserId: agentUserId,
      agentStreamId: agentStreamId,
      userStreamId: userStreamId
    })

  } catch (error: any) {
    console.error('Start error:', error)
    res.status(500).json({ error: error.message || 'Internal error' })
  }
})

2.4 Session Cleanup

Properly terminate voice agent instances and release audio resources:

app.post('/api/stop', async (req: Request, res: Response): Promise<void> => {
  try {
    const { agent_instance_id } = req.body

    if (!agent_instance_id) {
      res.status(400).json({ error: 'agent_instance_id required' })
      return
    }

    const result = await makeZegoRequest('DeleteAgentInstance', {
      AgentInstanceId: agent_instance_id
    })

    if (result.Code !== 0) {
      res.status(400).json({ error: result.Message || 'Failed to delete instance' })
      return
    }

    res.json({ success: true })

  } catch (error: any) {
    console.error('Stop error:', error)
    res.status(500).json({ error: error.message || 'Internal error' })
  }
})

The complete server code is here.

3. Frontend Voice Interface

Set up the React client for voice interaction:

cd ../client
npm create vite@latest . -- --template react-ts
npm install zego-express-engine-webrtc axios framer-motion lucide-react tailwindcss zod

Environment configuration:

VITE_ZEGO_APP_ID=your_zego_app_id
VITE_ZEGO_SERVER=wss://webliveroom-api.zegocloud.com/ws
VITE_API_BASE_URL=http://localhost:8080
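The hooks in later sections call an `agentAPI` module that wraps the backend routes from section 2. A minimal sketch of that client is below; the factory shape and error handling are assumptions (the repository version uses axios, but the built-in `fetch` works equally well), while the endpoint paths and request bodies match the Express routes shown earlier:

```typescript
// Thin client over the voice agent backend (/api/token, /api/start, /api/stop).
export function createAgentAPI(baseURL: string) {
  return {
    async getToken(userId: string): Promise<{ token: string }> {
      const res = await fetch(`${baseURL}/api/token?user_id=${encodeURIComponent(userId)}`)
      if (!res.ok) throw new Error(`Token request failed: ${res.status}`)
      return res.json()
    },
    async startSession(roomId: string, userId: string): Promise<any> {
      const res = await fetch(`${baseURL}/api/start`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ room_id: roomId, user_id: userId })
      })
      if (!res.ok) throw new Error(`Start failed: ${res.status}`)
      return res.json()
    },
    async stopSession(agentInstanceId: string): Promise<any> {
      const res = await fetch(`${baseURL}/api/stop`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ agent_instance_id: agentInstanceId })
      })
      if (!res.ok) throw new Error(`Stop failed: ${res.status}`)
      return res.json()
    }
  }
}
```

In the Vite client you would create it once with `createAgentAPI(import.meta.env.VITE_API_BASE_URL)` and export the result as `agentAPI`.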

4. ZEGOCLOUD Voice Integration

4.1 Audio Stream Management

The core workflow: Initialize Engine → Setup Audio → Join Room → Handle Voice Events → Process Responses

export class ZegoService {
  private static instance: ZegoService
  private zg: ZegoExpressEngine | null = null
  private audioElement: HTMLAudioElement | null = null
  private currentRoomId: string | null = null
  private currentUserId: string | null = null
  private localStream: any = null

  private setupAudioElement(): void {
    // Create dedicated audio element for AI voice output
    this.audioElement = document.getElementById('ai-audio-output') as HTMLAudioElement
    if (!this.audioElement) {
      this.audioElement = document.createElement('audio')
      this.audioElement.id = 'ai-audio-output'
      this.audioElement.autoplay = true
      this.audioElement.controls = false
      this.audioElement.style.display = 'none'
      document.body.appendChild(this.audioElement)
    }

    this.audioElement.volume = 0.8
    this.audioElement.muted = false

    // Audio event monitoring for voice agent debugging
    this.audioElement.addEventListener('loadstart', () => {
      console.log('🔊 AI voice loading started')
    })

    this.audioElement.addEventListener('canplay', () => {
      console.log('🔊 AI voice ready to play')
    })

    this.audioElement.addEventListener('play', () => {
      console.log('🔊 AI voice playback started')
    })
  }
}

4.2 Voice Message Processing

Handle real-time voice transcription and AI response streaming:

private setupEventListeners(): void {
  if (!this.zg) return

  // Process voice messages from ZEGOCLOUD AI agent
  this.zg.on('recvExperimentalAPI', (result: any) => {
    const { method, content } = result
    if (method === 'onRecvRoomChannelMessage') {
      try {
        const message = JSON.parse(content.msgContent)
        console.log('🎯 Voice message received:', message)
        this.handleRoomMessage(message)
      } catch (error) {
        console.error('Failed to parse voice message:', error)
      }
    }
  })

  // Monitor AI agent audio streams
  this.zg.on('roomStreamUpdate', async (_roomID: string, updateType: string, streamList: any[]) => {
    console.log('📡 Audio stream update:', updateType, streamList.length, 'streams')

    if (updateType === 'ADD' && streamList.length > 0) {
      for (const stream of streamList) {
        const userStreamId = this.currentUserId ? `${this.currentUserId}_stream` : null

        // Skip user's own audio stream to prevent feedback
        if (userStreamId && stream.streamID === userStreamId) {
          console.log('🚫 Skipping user\'s own stream:', stream.streamID)
          continue
        }

        try {
          console.log('🔗 Playing AI agent audio stream:', stream.streamID)

          // Connect AI agent audio stream to playback
          const mediaStream = await this.zg!.startPlayingStream(stream.streamID)
          if (mediaStream) {
            const remoteView = await this.zg!.createRemoteStreamView(mediaStream)
            if (remoteView && this.audioElement) {
              await remoteView.play(this.audioElement, { 
                enableAutoplayDialog: false,
                muted: false
              })
              console.log('✅ AI agent audio connected and playing')

              this.audioElement.muted = false
              this.audioElement.volume = 0.8
            }
          }
        } catch (error) {
          console.error('❌ Failed to play agent audio stream:', error)
        }
      }
    } else if (updateType === 'DELETE') {
      console.log('📴 AI agent audio stream disconnected')
      if (this.audioElement) {
        this.audioElement.srcObject = null
      }
    }
  })
}
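The `handleRoomMessage` call inside the listener above fans each parsed message out to subscribers, which is how the React hook in section 5 registers its handler via `zegoService.current.onRoomMessage(...)`. A minimal observer sketch of that wiring (the `RoomMessageBus` name and unsubscribe return value are assumptions, not the repository's exact code):

```typescript
type RoomMessageHandler = (message: any) => void

export class RoomMessageBus {
  private handlers: RoomMessageHandler[] = []

  // The hook registers its handler here; the returned function
  // unsubscribes it, e.g. on React component unmount.
  onRoomMessage(handler: RoomMessageHandler): () => void {
    this.handlers.push(handler)
    return () => {
      this.handlers = this.handlers.filter(h => h !== handler)
    }
  }

  // Called from the 'recvExperimentalAPI' listener with the parsed message
  handleRoomMessage(message: any): void {
    for (const handler of this.handlers) handler(message)
  }
}
```

Keeping this indirection in the service means the ZEGO event wiring is set up once, while any number of UI consumers can subscribe and unsubscribe independently.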

4.3 Voice Session Coordination

Join voice rooms and establish bidirectional audio communication:

async joinRoom(roomId: string, userId: string): Promise<boolean> {
  if (!this.zg) {
    console.error('❌ ZEGO not initialized')
    return false
  }

  try {
    this.currentRoomId = roomId
    this.currentUserId = userId

    // Get authentication token for voice session
    console.log('🔑 Getting token for user:', userId)
    const { token } = await agentAPI.getToken(userId)

    // Join ZEGOCLOUD room for voice communication
    console.log('🚪 Logging into room:', roomId)
    await this.zg.loginRoom(roomId, token, {
      userID: userId,
      userName: userId
    })

    // Enable voice message reception from AI agent
    console.log('📢 Enabling voice message reception')
    this.zg.callExperimentalAPI({ 
      method: 'onRecvRoomChannelMessage', 
      params: {} 
    })

    // Create local audio stream for voice input
    console.log('🎤 Creating local audio stream')
    const localStream = await this.zg.createZegoStream({
      camera: { 
        video: false, 
        audio: true
      }
    })

    if (localStream) {
      this.localStream = localStream
      const streamId = `${userId}_stream`

      // Publish user audio stream to AI agent
      console.log('📤 Publishing audio stream:', streamId)
      await this.zg.startPublishingStream(streamId, localStream)

      console.log('✅ Voice room joined successfully')
      return true
    } else {
      throw new Error('Failed to create local audio stream')
    }
  } catch (error) {
    console.error('❌ Failed to join voice room:', error)
    return false
  }
}

The complete ZEGOCLOUD integration code is available here.

5. Voice Processing Hook

5.1 Voice Event Handling

Manage voice transcription and AI response streaming with real-time state updates:

const setupMessageHandlers = useCallback((conv: ConversationMemory) => {
  const handleRoomMessage = (data: any) => {
    try {
      const { Cmd, Data: msgData } = data
      console.log('Voice message received:', { Cmd, msgData })

      if (Cmd === 3) {
        // User voice transcription processing
        const { Text: transcript, EndFlag, MessageId } = msgData

        if (transcript && transcript.trim()) {
          dispatch({ type: 'SET_TRANSCRIPT', payload: transcript })
          dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })

          if (EndFlag) {
            // Complete voice input - create user message
            const messageId = MessageId || `voice_${Date.now()}_${Math.random().toString(36).substr(2, 6)}`

            const userMessage: Message = {
              id: messageId,
              content: transcript.trim(),
              sender: 'user',
              timestamp: Date.now(),
              type: 'voice',
              transcript: transcript.trim()
            }

            addMessageSafely(userMessage, conv.id)
            dispatch({ type: 'SET_TRANSCRIPT', payload: '' })
            dispatch({ type: 'SET_AGENT_STATUS', payload: 'thinking' })
          }
        }
      } else if (Cmd === 4) {
        // AI agent voice response streaming
        const { Text: content, MessageId, EndFlag } = msgData
        if (!content || !MessageId) return

        if (EndFlag) {
          // Final response chunk - complete the message
          const currentStreaming = streamingMessages.current.get(MessageId) || ''
          const finalContent = currentStreaming + content

          dispatch({ type: 'UPDATE_MESSAGE', payload: {
            id: MessageId,
            updates: { 
              content: finalContent, 
              isStreaming: false 
            }
          }})

          streamingMessages.current.delete(MessageId)
          dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })

          // Save completed AI response
          const finalMessage: Message = {
            id: MessageId,
            content: finalContent,
            sender: 'ai',
            timestamp: Date.now(),
            type: 'text'
          }
          memoryService.addMessage(conv.id, finalMessage)
        } else {
          // Streaming response chunk - update in real-time
          const currentStreaming = streamingMessages.current.get(MessageId) || ''
          const updatedContent = currentStreaming + content
          streamingMessages.current.set(MessageId, updatedContent)

          if (!processedMessageIds.current.has(MessageId)) {
            // Create new streaming message
            const streamingMessage: Message = {
              id: MessageId,
              content: updatedContent,
              sender: 'ai',
              timestamp: Date.now(),
              type: 'text',
              isStreaming: true
            }

            processedMessageIds.current.add(MessageId)
            dispatch({ type: 'ADD_MESSAGE', payload: streamingMessage })
          } else {
            // Update existing streaming message
            dispatch({ type: 'UPDATE_MESSAGE', payload: {
              id: MessageId,
              updates: { content: updatedContent, isStreaming: true }
            }})
          }

          dispatch({ type: 'SET_AGENT_STATUS', payload: 'speaking' })
        }
      }
    } catch (error) {
      console.error('Error handling voice message:', error)
      dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
    }
  }

  zegoService.current.onRoomMessage(handleRoomMessage)
}, [addMessageSafely])
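The `dispatch` calls above assume a reducer over the voice-session state. A minimal sketch of the branches this handler exercises is below; the state and action shapes are inferred from the actions used in the handler, so treat them as illustrative rather than the repository's exact types:

```typescript
interface Message { id: string; content: string; [key: string]: any }

export interface VoiceState {
  messages: Message[]
  transcript: string
  agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
}

type VoiceAction =
  | { type: 'ADD_MESSAGE'; payload: Message }
  | { type: 'UPDATE_MESSAGE'; payload: { id: string; updates: Partial<Message> } }
  | { type: 'SET_TRANSCRIPT'; payload: string }
  | { type: 'SET_AGENT_STATUS'; payload: VoiceState['agentStatus'] }

export function voiceReducer(state: VoiceState, action: VoiceAction): VoiceState {
  switch (action.type) {
    case 'ADD_MESSAGE':
      // Append a completed user message or a new streaming AI message
      return { ...state, messages: [...state.messages, action.payload] }
    case 'UPDATE_MESSAGE':
      // Merge streaming chunks into an existing message by id
      return {
        ...state,
        messages: state.messages.map(m =>
          m.id === action.payload.id ? { ...m, ...action.payload.updates } : m
        )
      }
    case 'SET_TRANSCRIPT':
      return { ...state, transcript: action.payload }
    case 'SET_AGENT_STATUS':
      return { ...state, agentStatus: action.payload }
    default:
      return state
  }
}
```

Because the reducer is pure, the streaming updates stay predictable even when ASR and LLM chunks interleave rapidly during a conversation.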

5.2 Voice Session Management

Coordinate voice session lifecycle with proper cleanup:

const startSession = useCallback(async (existingConversationId?: string): Promise<boolean> => {
  if (state.isLoading || state.isConnected) return false

  dispatch({ type: 'SET_LOADING', payload: true })

  try {
    const roomId = `room_${Date.now()}_${Math.random().toString(36).substr(2, 6)}`
    const userId = `user_${Date.now()}_${Math.random().toString(36).substr(2, 6)}`

    // Initialize ZEGO voice engine
    console.log('Initializing ZEGO voice service...')
    await zegoService.current.initialize()

    // Join voice room for bidirectional audio
    console.log('Joining voice room:', roomId)
    const joinResult = await zegoService.current.joinRoom(roomId, userId)
    if (!joinResult) throw new Error('Failed to join ZEGO voice room')

    // Start AI agent for voice processing
    console.log('Starting AI voice agent...')
    const result = await agentAPI.startSession(roomId, userId)

    const conv = initializeConversation(existingConversationId)
    if (!conv) throw new Error('Failed to initialize conversation')

    const newSession: ChatSession = {
      roomId,
      userId,
      agentInstanceId: result.agentInstanceId,
      isActive: true,
      conversationId: conv.id,
      voiceSettings: defaultVoiceSettings
    }

    dispatch({ type: 'SET_SESSION', payload: newSession })
    dispatch({ type: 'SET_CONNECTED', payload: true })

    setupMessageHandlers(conv)

    console.log('Voice session started successfully')
    return true
  } catch (error) {
    console.error('Failed to start voice session:', error)
    dispatch({ type: 'SET_ERROR', payload: error instanceof Error ? error.message : 'Failed to start voice session' })
    return false
  } finally {
    dispatch({ type: 'SET_LOADING', payload: false })
  }
}, [state.isLoading, state.isConnected, initializeConversation, setupMessageHandlers])

6. Application Integration

The complete voice agent implementation is available in the Zego Agent GitHub repository. Run both servers to test voice functionality:

Start the backend:

cd server && npm run dev

Start the frontend:

cd client && npm run dev

The voice agent processes speech through ZEGOCLOUD’s ASR, generates responses via Dashscope, and delivers audio through CosyVoice synthesis – all coordinated through real-time WebRTC streams.

Run a Demo

Conclusion

This AI voice agent makes conversations smooth and natural, without complicated setup. Users can simply speak and get real-time, intelligent responses.

With ZEGOCLOUD handling audio processing, streaming, and speech recognition, you can focus on creating engaging experiences instead of technical overhead.

Its flexible and scalable foundation lets you build AI voice agents that match your needs. You can customize personality, connect with business systems, and extend features while keeping a reliable core for voice interactions.

FAQ

Q1: How to create an AI voice agent?

You can create an AI voice agent by combining speech recognition, natural language processing, and text-to-speech. Using platforms like ZEGOCLOUD, you can simplify this process with one SDK that manages the full voice pipeline.

Q2: Can I create my own AI agent?

Yes. With modern APIs and SDKs, anyone can build a custom AI agent. You do not need to train models from scratch since you can integrate existing AI services to handle speech and conversation.

Q3: How do I build my own AI voice?

To build your own AI voice, you need text-to-speech technology that converts responses into natural speech. ZEGOCLOUD integrates this with speech recognition and AI processing so you can create a complete voice interaction system.

Q4: What is the architecture of a voice AI agent?

A voice AI agent typically has four layers: speech recognition to capture input, natural language processing to understand meaning, decision-making powered by a language model, and voice synthesis to respond naturally. ZEGOCLOUD connects these components through a unified architecture.
