
How to Develop an AI Assistant

AI assistants help people get answers and complete tasks quickly through voice and text conversations. They understand what you say, process your questions, and give helpful responses in real time. In this guide, we will build an AI assistant application using ZEGOCLOUD that works with both voice and text input. The assistant will remember conversations, present a clean interface, and deliver clear spoken responses, so users can ask questions, look up information, and get help with everyday tasks through natural conversation.

How to Build an AI Assistant with ZEGOCLOUD

AI assistants help users complete daily tasks more efficiently by providing fast and intelligent responses. In this project, we use ZEGOCLOUD to build an AI assistant with real-time interaction capabilities.

ZEGOCLOUD’s AI Agent tools allow the assistant to join a chat room as a virtual user. It can listen to voice or text input, understand user intent, and respond with natural speech in real time.

By combining speech recognition, language processing, and voice synthesis, ZEGOCLOUD simplifies the setup of conversational AI. You can connect your language model, configure voice settings, and run AI agents that manage conversations automatically.

Prerequisites

Before beginning development, ensure you have:

  • A ZEGOCLOUD account with AI Agent services enabled → Sign up here
  • Node.js 18+ and npm installed
  • Valid AppID and ServerSecret from the ZEGOCLOUD console
  • A DashScope API key for the language model (you can use zego_test during the trial period)
  • Modern browser with microphone access (Chrome or Edge recommended)
  • Fundamental understanding of web development concepts

Step 1. Project Setup and Architecture

The complete implementation for this guide is available in the zego-assistant repository.

1.1 Architecture Overview

Our AI assistant has two main parts. The backend uses Express and handles ZEGOCLOUD authentication, registers the AI agent, and provides API endpoints for starting sessions, sending messages, and creating access tokens.

The frontend is a React application that creates a chat interface. Users can switch between voice and text input. It uses ZEGOCLOUD’s WebRTC engine for real-time connections and processes messages from users and the AI agent. All conversations save to browser storage so users keep their chat history.

The backend handles API management while ZEGOCLOUD handles real-time audio streaming and message routing. This keeps your server simple while providing professional voice and text communication.

1.2 Environment Setup and Dependencies

Create the foundational project structure:

mkdir zego-assistant && cd zego-assistant
mkdir server client

Backend Configuration

cd server
npm init -y
npm install express cors dotenv axios typescript tsx
npm install --save-dev @types/express @types/cors @types/node

Create server/.env:

ZEGO_APP_ID=your_numeric_app_id
ZEGO_SERVER_SECRET=your_32_character_secret
DASHSCOPE_API_KEY=your_dashscope_api_key
PORT=8080

Configure development scripts in server/package.json:

{
  "scripts": {
    "dev": "tsx watch src/server.ts",
    "build": "tsc",
    "start": "node dist/server.js"
  }
}

Frontend Configuration

cd ../client
npm create vite@latest . -- --template react-ts
npm install zego-express-engine-webrtc axios framer-motion lucide-react tailwindcss zod

Create client/.env:

VITE_ZEGO_APP_ID=your_numeric_app_id
VITE_ZEGO_SERVER=wss://webrtc-api.zegocloud.com/ws
VITE_API_BASE_URL=http://localhost:8080

Implement configuration validation:

// client/src/config.ts
import { z } from 'zod'

const configSchema = z.object({
  ZEGO_APP_ID: z.string().min(1, 'ZEGO App ID is required'),
  API_BASE_URL: z.string().url('Valid API base URL required'),
  ZEGO_SERVER: z.string().url('Valid ZEGO server URL required'),
})

const rawConfig = {
  ZEGO_APP_ID: import.meta.env.VITE_ZEGO_APP_ID,
  API_BASE_URL: import.meta.env.VITE_API_BASE_URL,
  ZEGO_SERVER: import.meta.env.VITE_ZEGO_SERVER || 'wss://webrtc-api.zegocloud.com/ws',
}

export const config = configSchema.parse(rawConfig)

This validation ensures the application fails gracefully if environment variables are missing or malformed.

Step 2. Building the AI Assistant Server

The backend manages ZEGOCLOUD authentication, agent configuration, and session orchestration.

2.1 ZEGOCLOUD API Authentication

ZEGOCLOUD APIs require MD5-based signature authentication:

// server/src/server.ts
import crypto from 'crypto'
import axios from 'axios'
import dotenv from 'dotenv'
dotenv.config()

const CONFIG = {
  ZEGO_APP_ID: process.env.ZEGO_APP_ID!,
  ZEGO_SERVER_SECRET: process.env.ZEGO_SERVER_SECRET!,
  ZEGO_API_BASE_URL: 'https://aigc-aiagent-api.zegotech.cn/',
  PORT: process.env.PORT || 8080,
}

function generateZegoSignature(action: string) {
  const timestamp = Math.floor(Date.now() / 1000)
  const nonce = crypto.randomBytes(8).toString('hex')

  const signString = CONFIG.ZEGO_APP_ID + nonce + CONFIG.ZEGO_SERVER_SECRET + timestamp
  const signature = crypto.createHash('md5').update(signString).digest('hex')

  return {
    Action: action,
    AppId: CONFIG.ZEGO_APP_ID,
    SignatureNonce: nonce,
    SignatureVersion: '2.0',
    Timestamp: timestamp,
    Signature: signature
  }
}

async function makeZegoRequest(action: string, body: object = {}) {
  const queryParams = generateZegoSignature(action)
  const queryString = Object.entries(queryParams)
    .map(([k, v]) => `${k}=${encodeURIComponent(String(v))}`)
    .join('&')

  const url = `${CONFIG.ZEGO_API_BASE_URL}?${queryString}`
  const response = await axios.post(url, body, {
    headers: { 'Content-Type': 'application/json' },
    timeout: 30000
  })
  return response.data
}

This authentication mechanism secures every ZEGOCLOUD API interaction.

2.2 Configuring the AI Assistant Agent

The agent configuration defines the assistant’s personality and capabilities:

// server/src/server.ts
let REGISTERED_AGENT_ID: string | null = null

async function registerAgent(): Promise<string> {
  if (REGISTERED_AGENT_ID) return REGISTERED_AGENT_ID

  const agentId = `agent_${Date.now()}`
  const agentConfig = {
    AgentId: agentId,
    Name: 'AI Assistant',
    LLM: {
      Url: 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions',
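      // 'zego_test' works only during the trial period; in production, supply your real key (e.g. process.env.DASHSCOPE_API_KEY)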
      ApiKey: 'zego_test',
      Model: 'qwen-plus',
      SystemPrompt: `You are a helpful AI assistant. Provide clear, accurate, and useful information on a wide range of topics. Be concise but thorough in your responses. Keep responses conversational and under 100 words for natural voice flow. Help users with questions, tasks, and problem-solving in a friendly and professional manner.`,
      Temperature: 0.7,
      TopP: 0.9,
      Params: { max_tokens: 200 }
    },
    TTS: {
      Vendor: 'ByteDance',
      Params: {
        app: { appid: 'zego_test', token: 'zego_test', cluster: 'volcano_tts' },
        speed_ratio: 1,
        volume_ratio: 1,
        pitch_ratio: 1,
        audio: { rate: 24000 }
      }
    },
    ASR: {
      Vendor: 'Tencent',
      Params: {
        engine_model_type: '16k_en',
        hotword_list: 'assistant|10,help|8,question|8,answer|8,information|8'
      },
      VADSilenceSegmentation: 1500,
      PauseInterval: 2000
    }
  }

  const result = await makeZegoRequest('RegisterAgent', agentConfig)
  if (result.Code !== 0) {
    throw new Error(`RegisterAgent failed: ${result.Message}`)
  }

  REGISTERED_AGENT_ID = agentId
  return agentId
}

Key configuration elements:

  • SystemPrompt: Establishes the assistant’s helpful and professional personality
  • Temperature: 0.7 balances creativity with consistency for natural responses
  • max_tokens: 200 ensures responses remain concise for smooth voice delivery
  • VADSilenceSegmentation: 1500ms pause detection for natural speech processing
  • PauseInterval: 2000ms wait time before finalizing speech transcription

The agent registers once per server instance and serves all user sessions.

2.3 Session Management and Agent Instances

The /api/start endpoint creates agent instances and establishes communication channels:

// server/src/server.ts
import express from 'express'
import cors from 'cors'
import { createRequire } from 'module'
const require = createRequire(import.meta.url)
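// zego-token.cjs is ZEGOCLOUD's token04 generator helper; a Node.js version is provided with their server-side token samples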
const { generateToken04 } = require('../zego-token.cjs')

const app = express()
app.use(express.json())
app.use(cors())

app.post('/api/start', async (req, res) => {
  const { room_id, user_id, user_stream_id } = req.body

  if (!room_id || !user_id) {
    res.status(400).json({ error: 'room_id and user_id required' })
    return
  }

  const agentId = await registerAgent()
  const userStreamId = user_stream_id || `${user_id}_stream`
  const agentUserId = `agent_${room_id}`
  const agentStreamId = `agent_stream_${room_id}`

  const instanceConfig = {
    AgentId: agentId,
    UserId: user_id,
    RTC: {
      RoomId: room_id,
      AgentUserId: agentUserId,
      AgentStreamId: agentStreamId,
      UserStreamId: userStreamId
    },
    MessageHistory: {
      SyncMode: 1,
      Messages: [],
      WindowSize: 10
    },
    AdvancedConfig: {
      InterruptMode: 0
    }
  }

  const result = await makeZegoRequest('CreateAgentInstance', instanceConfig)

  if (result.Code !== 0) {
    res.status(400).json({ error: result.Message })
    return
  }

  res.json({
    success: true,
    agentInstanceId: result.Data?.AgentInstanceId,
    agentUserId,
    agentStreamId,
    userStreamId
  })
})

The response provides the agentInstanceId required for message sending and session cleanup.
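
If you want to smoke-test this endpoint before wiring up the frontend, a quick script like the following works with Node 18's built-in fetch (the file name and IDs are placeholders):

// scratch-test.ts — quick manual check of /api/start (Node 18+ has fetch built in; IDs are placeholders)
async function main() {
  const res = await fetch('http://localhost:8080/api/start', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      room_id: 'room_demo',
      user_id: 'user_demo',
      user_stream_id: 'user_demo_stream'
    })
  })

  // Expected shape: { success: true, agentInstanceId: '...', agentUserId: 'agent_room_demo', ... }
  console.log(await res.json())
}

main()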

2.4 Text Message Processing

Users can send text messages when voice input isn’t preferred:

// server/src/server.ts
app.post('/api/send-message', async (req, res) => {
  const { agent_instance_id, message } = req.body

  if (!agent_instance_id || !message) {
    res.status(400).json({ error: 'agent_instance_id and message required' })
    return
  }

  const result = await makeZegoRequest('SendAgentInstanceLLM', {
    AgentInstanceId: agent_instance_id,
    Text: message,
    AddQuestionToHistory: true,
    AddAnswerToHistory: true
  })

  if (result.Code !== 0) {
    res.status(400).json({ error: result.Message })
    return
  }

  res.json({ success: true })
})

The agent processes text messages identically to voice transcriptions, maintaining conversation context through message history.

2.5 WebRTC Token Generation

The frontend requires authentication tokens to join ZEGOCLOUD rooms:

// server/src/server.ts
app.get('/api/token', (req, res) => {
  const userId = req.query.user_id as string
  const roomId = req.query.room_id as string

  if (!userId) {
    res.status(400).json({ error: 'user_id required' })
    return
  }

  const payload = {
    room_id: roomId || '',
    privilege: { 1: 1, 2: 1 },
    stream_id_list: null
  }

  const token = generateToken04(
    parseInt(CONFIG.ZEGO_APP_ID, 10),
    userId,
    CONFIG.ZEGO_SERVER_SECRET,
    3600,
    JSON.stringify(payload)
  )

  res.json({ token })
})

Tokens remain valid for 3600 seconds (1 hour) and grant both publish and play privileges for seamless communication.

2.6 Session Cleanup

When users end sessions, proper resource cleanup is essential:

// server/src/server.ts
app.post('/api/stop', async (req, res) => {
  const { agent_instance_id } = req.body

  if (!agent_instance_id) {
    res.status(400).json({ error: 'agent_instance_id required' })
    return
  }

  const result = await makeZegoRequest('DeleteAgentInstance', {
    AgentInstanceId: agent_instance_id
  })

  if (result.Code !== 0) {
    res.status(400).json({ error: result.Message })
    return
  }

  res.json({ success: true })
})

app.listen(CONFIG.PORT, () => {
  console.log(`Server running on port ${CONFIG.PORT}`)
})

This endpoint releases agent resources and prevents unnecessary processing after session termination.

Step 3. WebRTC Integration with ZegoExpressEngine

The frontend leverages ZegoExpressEngine for all real-time communication. A service class encapsulates the SDK to provide a clean interface for React components.

3.1 ZEGO Service Initialization

// client/src/services/zego.ts
import { ZegoExpressEngine } from 'zego-express-engine-webrtc'
import { agentAPI } from './api'
import { config } from '../config'

export class ZegoService {
  private static instance: ZegoService
  private zg: ZegoExpressEngine | null = null
  private isInitialized = false
  private currentRoomId: string | null = null
  private currentUserId: string | null = null
  private localStream: any = null
  private audioElement: HTMLAudioElement | null = null

  static getInstance(): ZegoService {
    if (!ZegoService.instance) {
      ZegoService.instance = new ZegoService()
    }
    return ZegoService.instance
  }

  async initialize(): Promise<void> {
    if (this.isInitialized) return

    this.zg = new ZegoExpressEngine(
      parseInt(config.ZEGO_APP_ID), 
      config.ZEGO_SERVER
    )

    this.setupEventListeners()
    this.setupAudioElement()
    this.isInitialized = true
  }

  private setupAudioElement(): void {
    this.audioElement = document.getElementById('ai-audio-output') as HTMLAudioElement
    if (!this.audioElement) {
      this.audioElement = document.createElement('audio')
      this.audioElement.id = 'ai-audio-output'
      this.audioElement.autoplay = true
      this.audioElement.style.display = 'none'
      document.body.appendChild(this.audioElement)
    }
    this.audioElement.volume = 0.8
  }
}

The singleton pattern ensures only one ZEGO engine instance exists per browser session.

3.2 Room Management and Audio Publishing

// client/src/services/zego.ts (continued)
async joinRoom(roomId: string, userId: string): Promise<boolean> {
  if (!this.zg) return false

  if (this.currentRoomId === roomId && this.currentUserId === userId) {
    return true
  }

  try {
    if (this.currentRoomId) {
      await this.leaveRoom()
    }

    this.currentRoomId = roomId
    this.currentUserId = userId

    const { token } = await agentAPI.getToken(userId)

    await this.zg.loginRoom(roomId, token, {
      userID: userId,
      userName: userId
    })

    this.zg.callExperimentalAPI({ 
      method: 'onRecvRoomChannelMessage', 
      params: {} 
    })

    const localStream = await this.zg.createZegoStream({
      camera: { video: false, audio: true }
    })

    if (localStream) {
      this.localStream = localStream
      const streamId = `${userId}_stream`

      await this.zg.startPublishingStream(streamId, localStream)
      return true
    }

    throw new Error('Failed to create local stream')
  } catch (error) {
    console.error('Failed to join room:', error)
    this.currentRoomId = null
    this.currentUserId = null
    return false
  }
}

async enableMicrophone(enabled: boolean): Promise<boolean> {
  if (!this.localStream) return false

  const audioTrack = this.localStream.getAudioTracks()[0]
  if (audioTrack) {
    audioTrack.enabled = enabled
    return true
  }
  return false
}

The enableMicrophone method provides granular control over voice transmission to the AI agent.
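
Both joinRoom above and the endSession flow in Step 4 call leaveRoom, which the snippets don't show. A minimal sketch of what it might look like, assuming the same fields and the engine's standard stream APIs:

// client/src/services/zego.ts (continued) — hedged sketch of leaveRoom
async leaveRoom(): Promise<void> {
  if (!this.zg || !this.currentRoomId) return

  try {
    // Stop publishing and release the local microphone stream
    if (this.localStream && this.currentUserId) {
      this.zg.stopPublishingStream(`${this.currentUserId}_stream`)
      this.zg.destroyStream(this.localStream)
      this.localStream = null
    }

    await this.zg.logoutRoom(this.currentRoomId)
  } catch (error) {
    console.error('Failed to leave room cleanly:', error)
  } finally {
    this.currentRoomId = null
    this.currentUserId = null
  }
}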

3.3 Event Handling and Message Processing

// client/src/services/zego.ts (continued)
private setupEventListeners(): void {
  if (!this.zg) return

  this.zg.on('recvExperimentalAPI', (result: any) => {
    const { method, content } = result
    if (method === 'onRecvRoomChannelMessage') {
      try {
        const message = JSON.parse(content.msgContent)
        this.handleRoomMessage(message)
      } catch (error) {
        console.error('Failed to parse room message:', error)
      }
    }
  })

  this.zg.on('roomStreamUpdate', async (_roomID, updateType, streamList) => {
    if (updateType === 'ADD') {
      for (const stream of streamList) {
        const userStreamId = this.currentUserId ? `${this.currentUserId}_stream` : null

        if (userStreamId && stream.streamID === userStreamId) {
          continue
        }

        try {
          const mediaStream = await this.zg!.startPlayingStream(stream.streamID)
          if (mediaStream) {
            const remoteView = await this.zg!.createRemoteStreamView(mediaStream)
            if (remoteView && this.audioElement) {
              await remoteView.play(this.audioElement, { 
                enableAutoplayDialog: false,
                muted: false
              })
            }
          }
        } catch (error) {
          console.error('Failed to play agent stream:', error)
        }
      }
    }
  })
}

private messageCallback: ((message: any) => void) | null = null

private handleRoomMessage(message: any): void {
  if (this.messageCallback) {
    this.messageCallback(message)
  }
}

onRoomMessage(callback: (message: any) => void): void {
  this.messageCallback = callback
}

Room messages contain ASR transcriptions and LLM responses. The callback pattern enables React components to handle these events without tight coupling to the ZEGO service.
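
The payload handed to this callback is what the Step 4 handlers destructure. A rough TypeScript description of it, inferred from that handler code rather than taken from an official SDK type:

// Inferred shape of room channel messages (field names taken from the useChat handlers below)
interface RoomChannelMessage {
  Cmd: 3 | 4             // 3 = ASR transcription event, 4 = LLM response event
  Data: {
    Text?: string        // transcript fragment or response text
    MessageId?: string   // id used to key the resulting chat message
    EndFlag?: boolean    // true once the transcription/response is final
  }
}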

Step 4. React Chat Interface and State Management

The React application manages conversation state, displays messages, and provides intuitive voice and text input options.

4.1 Advanced State Management with useReducer

// client/src/hooks/useChat.ts
import { useCallback, useRef, useReducer } from 'react'
import { ZegoService } from '../services/zego'
import { agentAPI } from '../services/api'
import { memoryService } from '../services/memory'
import type { Message, ChatSession } from '../types'

interface ChatState {
  messages: Message[]
  session: ChatSession | null
  isLoading: boolean
  isConnected: boolean
  isRecording: boolean
  currentTranscript: string
  agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
  error: string | null
}

type ChatAction = 
  | { type: 'ADD_MESSAGE'; payload: Message }
  | { type: 'SET_SESSION'; payload: ChatSession | null }
  | { type: 'SET_LOADING'; payload: boolean }
  | { type: 'SET_CONNECTED'; payload: boolean }
  | { type: 'SET_RECORDING'; payload: boolean }
  | { type: 'SET_TRANSCRIPT'; payload: string }
  | { type: 'SET_AGENT_STATUS'; payload: 'idle' | 'listening' | 'thinking' | 'speaking' }
  | { type: 'SET_ERROR'; payload: string | null }

function chatReducer(state: ChatState, action: ChatAction): ChatState {
  switch (action.type) {
    case 'ADD_MESSAGE':
      return { ...state, messages: [...state.messages, action.payload] }
    case 'SET_SESSION':
      return { ...state, session: action.payload }
    case 'SET_LOADING':
      return { ...state, isLoading: action.payload }
    case 'SET_CONNECTED':
      return { ...state, isConnected: action.payload }
    case 'SET_RECORDING':
      return { ...state, isRecording: action.payload }
    case 'SET_TRANSCRIPT':
      return { ...state, currentTranscript: action.payload }
    case 'SET_AGENT_STATUS':
      return { ...state, agentStatus: action.payload }
    case 'SET_ERROR':
      return { ...state, error: action.payload }
    default:
      return state
  }
}

export const useChat = () => {
  const [state, dispatch] = useReducer(chatReducer, {
    messages: [],
    session: null,
    isLoading: false,
    isConnected: false,
    isRecording: false,
    currentTranscript: '',
    agentStatus: 'idle',
    error: null
  })

  const zegoService = useRef(ZegoService.getInstance())
  const processedMessageIds = useRef(new Set<string>())

  // ...the message handlers, session lifecycle, and return value in the sections below all live inside this hook

Using useReducer provides predictable state updates and simplifies debugging complex state transitions.
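
The Message, ChatSession, and ConversationMemory types imported above aren't listed elsewhere in this guide; a plausible client/src/types.ts, inferred from how the fields are used across the hooks and services:

// client/src/types.ts — inferred from usage in this guide, not copied from the repository
export interface Message {
  id: string
  content: string
  sender: 'user' | 'ai'
  timestamp: number
  type: 'text' | 'voice'
}

export interface ChatSession {
  roomId: string
  userId: string
  agentInstanceId: string
  isActive: boolean
  conversationId: string
}

export interface ConversationMemory {
  id: string
  title: string
  messages: Message[]
  createdAt: number
  updatedAt: number
}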

4.2 ASR and LLM Event Processing

// client/src/hooks/useChat.ts (continued)
const setupMessageHandlers = useCallback((conversationId: string) => {
  const handleRoomMessage = (data: any) => {
    const { Cmd, Data: msgData } = data

    // Cmd 3: ASR transcription events
    if (Cmd === 3) {
      const { Text: transcript, EndFlag, MessageId } = msgData

      if (transcript && transcript.trim()) {
        dispatch({ type: 'SET_TRANSCRIPT', payload: transcript })
        dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })

        if (EndFlag) {
          const userMessage: Message = {
            id: MessageId || `voice_${Date.now()}`,
            content: transcript.trim(),
            sender: 'user',
            timestamp: Date.now(),
            type: 'voice'
          }

          dispatch({ type: 'ADD_MESSAGE', payload: userMessage })
          memoryService.addMessage(conversationId, userMessage)
          dispatch({ type: 'SET_TRANSCRIPT', payload: '' })
          dispatch({ type: 'SET_AGENT_STATUS', payload: 'thinking' })
        }
      }
    }

    // Cmd 4: LLM response events
    if (Cmd === 4) {
      const { Text: content, MessageId, EndFlag } = msgData
      if (!content || !MessageId) return

      dispatch({ type: 'SET_AGENT_STATUS', payload: 'speaking' })

      if (EndFlag) {
        const aiMessage: Message = {
          id: MessageId,
          content,
          sender: 'ai',
          timestamp: Date.now(),
          type: 'text'
        }

        dispatch({ type: 'ADD_MESSAGE', payload: aiMessage })
        memoryService.addMessage(conversationId, aiMessage)
        dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
      }
    }
  }

  zegoService.current.onRoomMessage(handleRoomMessage)
}, [])

The agent status transitions through listening → thinking → speaking → idle, providing clear visual feedback about the AI’s current activity.

4.3 Session Lifecycle Management

// client/src/hooks/useChat.ts (continued)
const startSession = useCallback(async (): Promise<boolean> => {
  if (state.isLoading || state.isConnected) return false

  dispatch({ type: 'SET_LOADING', payload: true })

  try {
    const roomId = `room_${Date.now()}_${Math.random().toString(36).substr(2, 6)}`
    const userId = `user_${Date.now()}_${Math.random().toString(36).substr(2, 6)}`

    await zegoService.current.initialize()

    const joinResult = await zegoService.current.joinRoom(roomId, userId)
    if (!joinResult) throw new Error('Failed to join ZEGO room')

    const result = await agentAPI.startSession(roomId, userId)

    const conversation = memoryService.createOrGetConversation()

    const newSession: ChatSession = {
      roomId,
      userId,
      agentInstanceId: result.agentInstanceId,
      isActive: true,
      conversationId: conversation.id
    }

    dispatch({ type: 'SET_SESSION', payload: newSession })
    dispatch({ type: 'SET_CONNECTED', payload: true })

    setupMessageHandlers(conversation.id)

    return true
  } catch (error) {
    dispatch({ type: 'SET_ERROR', payload: error.message })
    return false
  } finally {
    dispatch({ type: 'SET_LOADING', payload: false })
  }
}, [state.isLoading, state.isConnected, setupMessageHandlers])

const endSession = useCallback(async () => {
  if (!state.session) return

  try {
    if (state.isRecording) {
      await zegoService.current.enableMicrophone(false)
      dispatch({ type: 'SET_RECORDING', payload: false })
    }

    if (state.session.agentInstanceId) {
      await agentAPI.stopSession(state.session.agentInstanceId)
    }

    await zegoService.current.leaveRoom()

    dispatch({ type: 'SET_SESSION', payload: null })
    dispatch({ type: 'SET_CONNECTED', payload: false })
    dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
  } catch (error) {
    console.error('Failed to end session:', error)
  }
}, [state.session, state.isRecording])

Sessions are isolated by unique room IDs, enabling multiple concurrent user sessions without interference.

4.4 Dual Input Mode Implementation

// client/src/hooks/useChat.ts (continued)
const sendTextMessage = useCallback(async (content: string) => {
  if (!state.session?.agentInstanceId) return

  const trimmedContent = content.trim()
  if (!trimmedContent) return

  try {
    const userMessage: Message = {
      id: `text_${Date.now()}`,
      content: trimmedContent,
      sender: 'user',
      timestamp: Date.now(),
      type: 'text'
    }

    dispatch({ type: 'ADD_MESSAGE', payload: userMessage })
    memoryService.addMessage(state.session.conversationId, userMessage)
    dispatch({ type: 'SET_AGENT_STATUS', payload: 'thinking' })

    await agentAPI.sendMessage(state.session.agentInstanceId, trimmedContent)
  } catch (error) {
    dispatch({ type: 'SET_ERROR', payload: 'Failed to send message' })
    dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
  }
}, [state.session])

const toggleVoiceRecording = useCallback(async () => {
  if (!state.isConnected) return

  try {
    if (state.isRecording) {
      await zegoService.current.enableMicrophone(false)
      dispatch({ type: 'SET_RECORDING', payload: false })
      dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
    } else {
      const success = await zegoService.current.enableMicrophone(true)
      if (success) {
        dispatch({ type: 'SET_RECORDING', payload: true })
        dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })
      }
    }
  } catch (error) {
    console.error('Failed to toggle recording:', error)
  }
}, [state.isConnected, state.isRecording])

  return {
    ...state,
    startSession,
    sendTextMessage,
    toggleVoiceRecording,
    endSession
  }
}

The hook exposes a clean API that React components can use without understanding the underlying ZEGOCLOUD or API complexities.

4.5 Persistent Conversation Memory

Conversations are stored in browser localStorage to maintain continuity across page refreshes:

// client/src/services/memory.ts
import type { ConversationMemory, Message } from '../types'

class MemoryService {
  private conversations: Map<string, ConversationMemory> = new Map()

  constructor() {
    this.loadFromStorage()
  }

  private loadFromStorage(): void {
    const stored = localStorage.getItem('ai_conversations')
    if (stored) {
      const conversations: ConversationMemory[] = JSON.parse(stored)
      conversations.forEach(conv => {
        this.conversations.set(conv.id, conv)
      })
    }
  }

  private saveToStorage(): void {
    const conversations = Array.from(this.conversations.values())
    localStorage.setItem('ai_conversations', JSON.stringify(conversations))
  }

  createOrGetConversation(id?: string): ConversationMemory {
    const conversationId = id || `conv_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`

    if (this.conversations.has(conversationId)) {
      return this.conversations.get(conversationId)!
    }

    const newConversation: ConversationMemory = {
      id: conversationId,
      title: 'New Conversation',
      messages: [],
      createdAt: Date.now(),
      updatedAt: Date.now()
    }

    this.conversations.set(conversationId, newConversation)
    this.saveToStorage()
    return newConversation
  }

  addMessage(conversationId: string, message: Message): void {
    const conversation = this.conversations.get(conversationId)
    if (!conversation) return

    conversation.messages.push(message)
    conversation.updatedAt = Date.now()

    if (conversation.messages.length === 1 && message.sender === 'user') {
      conversation.title = message.content.slice(0, 50)
    }

    this.saveToStorage()
  }

  getAllConversations(): ConversationMemory[] {
    return Array.from(this.conversations.values())
      .sort((a, b) => b.updatedAt - a.updatedAt)
  }

  deleteConversation(conversationId: string): void {
    this.conversations.delete(conversationId)
    this.saveToStorage()
  }
}

export const memoryService = new MemoryService()

This enables users to review previous conversations and maintain context across sessions.
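
A quick usage sketch of the service (the message values here are illustrative):

import { memoryService } from './services/memory'

// Start (or resume) a conversation and append a message to it
const conversation = memoryService.createOrGetConversation()
memoryService.addMessage(conversation.id, {
  id: `text_${Date.now()}`,
  content: 'Hello, assistant!',
  sender: 'user',
  timestamp: Date.now(),
  type: 'text'
})

// Conversations come back sorted by most recent activity
console.log(memoryService.getAllConversations().map(c => c.title))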

Step 5. Building the User Interface Components

The interface provides a clean, professional environment for AI assistance interactions.

5.1 Main Chat Session Component

// client/src/components/ChatSession.tsx
import { useEffect, useRef } from 'react'
import { motion } from 'framer-motion'
import { MessageBubble } from './Chat/MessageBubble'
import { VoiceInput } from './VoiceInput'
import { useChat } from '../hooks/useChat'
import { Bot, Phone, PhoneOff } from 'lucide-react'

export const ChatSession = () => {
  const messagesEndRef = useRef<HTMLDivElement>(null)
  const { 
    messages, 
    isLoading, 
    isConnected, 
    isRecording,
    currentTranscript,
    agentStatus,
    startSession, 
    sendTextMessage, 
    toggleVoiceRecording,
    endSession
  } = useChat()

  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' })
  }, [messages])

  if (!isConnected && messages.length === 0) {
    return (
      <div className="flex flex-col h-full bg-black">
        <audio id="ai-audio-output" autoPlay style={{ display: 'none' }} />

        <div className="flex-1 flex flex-col items-center justify-center">
          <motion.div initial={{ opacity: 0, y: 20 }} animate={{ opacity: 1, y: 0 }}>
            <div className="w-24 h-24 bg-gradient-to-br from-blue-600 to-blue-700 rounded-full flex items-center justify-center mb-8 mx-auto">
              <Bot className="w-12 h-12 text-white" />
            </div>

            <h2 className="text-3xl font-semibold mb-4">AI Assistant</h2>
            <p className="text-gray-400 mb-10 max-w-md text-center">
              Your intelligent companion for questions, tasks, and conversations.
            </p>

            <button
              onClick={startSession}
              disabled={isLoading}
              className="px-8 py-4 bg-blue-600 hover:bg-blue-700 rounded-full flex items-center space-x-3 mx-auto transition-colors"
            >
              <Phone className="w-5 h-5" />
              <span>{isLoading ? 'Starting...' : 'Start Chat'}</span>
            </button>
          </motion.div>
        </div>
      </div>
    )
  }

  return (
    <div className="flex flex-col h-full bg-black">
      <audio id="ai-audio-output" autoPlay style={{ display: 'none' }} />

      {/* Status Bar */}
      <div className="bg-gray-900/50 border-b border-gray-800 px-6 py-3">
        <div className="flex items-center justify-between">
          <div className="flex items-center space-x-3">
            <div className={`w-3 h-3 rounded-full ${isConnected ? 'bg-green-400 animate-pulse' : 'bg-gray-600'}`} />
            <span className="text-sm text-gray-400">
              {agentStatus === 'listening' && 'Listening...'}
              {agentStatus === 'thinking' && 'Processing...'}
              {agentStatus === 'speaking' && 'Responding...'}
              {agentStatus === 'idle' && 'Ready'}
            </span>
          </div>

          {isConnected && (
            <button
              onClick={endSession}
              className="px-4 py-2 bg-red-600/80 hover:bg-red-600 rounded-lg flex items-center space-x-2 transition-colors"
            >
              <PhoneOff className="w-4 h-4" />
              <span>End Chat</span>
            </button>
          )}
        </div>
      </div>

      {/* Messages */}
      <div className="flex-1 overflow-y-auto px-6 py-6">
        {messages.map((message) => (
          <MessageBubble key={message.id} message={message} />
        ))}

        {agentStatus === 'thinking' && (
          <motion.div
            initial={{ opacity: 0, y: 20 }}
            animate={{ opacity: 1, y: 0 }}
            className="flex justify-start mb-6"
          >
            <div className="flex items-center space-x-3">
              <div className="w-10 h-10 bg-gradient-to-br from-blue-600 to-blue-700 rounded-full flex items-center justify-center">
                <Bot className="w-5 h-5 text-white" />
              </div>
              <div className="bg-gray-800 rounded-2xl px-5 py-3">
                <div className="flex space-x-1">
                  <div className="w-2 h-2 bg-blue-400 rounded-full animate-bounce" />
                  <div className="w-2 h-2 bg-blue-400 rounded-full animate-bounce" style={{ animationDelay: '0.1s' }} />
                  <div className="w-2 h-2 bg-blue-400 rounded-full animate-bounce" style={{ animationDelay: '0.2s' }} />
                </div>
              </div>
            </div>
          </motion.div>
        )}

        <div ref={messagesEndRef} />
      </div>

      {/* Input */}
      {isConnected && (
        <VoiceInput 
          onSendMessage={sendTextMessage}
          isRecording={isRecording}
          onToggleRecording={toggleVoiceRecording}
          currentTranscript={currentTranscript}
          agentStatus={agentStatus}
        />
      )}
    </div>
  )
}

The component manages three distinct states: welcome screen, active session, and message display with smooth transitions.

5.2 Intelligent Voice Input Component

// client/src/components/VoiceInput.tsx
import { useState } from 'react'
import { motion } from 'framer-motion'
import { Mic, MicOff, Send, Type } from 'lucide-react'

interface VoiceInputProps {
  onSendMessage: (message: string) => void
  isRecording: boolean
  onToggleRecording: () => void
  currentTranscript: string
  agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
}

export const VoiceInput = ({ 
  onSendMessage, 
  isRecording, 
  onToggleRecording, 
  currentTranscript,
  agentStatus 
}: VoiceInputProps) => {
  const [textInput, setTextInput] = useState('')
  const [inputMode, setInputMode] = useState<'voice' | 'text'>('voice')

  const handleSendText = () => {
    if (textInput.trim()) {
      onSendMessage(textInput.trim())
      setTextInput('')
    }
  }

  const handleKeyPress = (e: React.KeyboardEvent) => {
    if (e.key === 'Enter' && !e.shiftKey) {
      e.preventDefault()
      handleSendText()
    }
  }

  const isDisabled = agentStatus === 'thinking' || agentStatus === 'speaking'

  return (
    <div className="bg-gray-900 border-t border-gray-800 p-4">
      <div className="max-w-4xl mx-auto">
        {/* Mode Toggle */}
        <div className="flex justify-center mb-4">
          <div className="bg-gray-800 rounded-lg p-1 flex">
            <button
              onClick={() => setInputMode('voice')}
              className={`px-4 py-2 rounded-md flex items-center space-x-2 transition-colors ${
                inputMode === 'voice' ? 'bg-blue-600 text-white' : 'text-gray-400 hover:text-white'
              }`}
            >
              <Mic className="w-4 h-4" />
              <span>Voice</span>
            </button>
            <button
              onClick={() => setInputMode('text')}
              className={`px-4 py-2 rounded-md flex items-center space-x-2 transition-colors ${
                inputMode === 'text' ? 'bg-blue-600 text-white' : 'text-gray-400 hover:text-white'
              }`}
            >
              <Type className="w-4 h-4" />
              <span>Text</span>
            </button>
          </div>
        </div>

        {inputMode === 'voice' ? (
          <div className="flex flex-col items-center space-y-4">
            {currentTranscript && (
              <motion.div
                initial={{ opacity: 0, y: 10 }}
                animate={{ opacity: 1, y: 0 }}
                className="bg-gray-800 rounded-lg p-4 max-w-2xl w-full"
              >
                <p className="text-gray-300 text-center">{currentTranscript}</p>
              </motion.div>
            )}

            <motion.button
              onClick={onToggleRecording}
              disabled={isDisabled}
              className={`w-16 h-16 rounded-full flex items-center justify-center transition-all ${
                isRecording ? 'bg-red-600 hover:bg-red-700 scale-110' : 'bg-blue-600 hover:bg-blue-700'
              } ${isDisabled ? 'opacity-50 cursor-not-allowed' : 'hover:scale-105'}`}
              whileTap={{ scale: 0.95 }}
            >
              {isRecording ? <MicOff className="w-6 h-6 text-white" /> : <Mic className="w-6 h-6 text-white" />}
            </motion.button>

            <p className="text-sm text-gray-400 text-center">
              {isRecording ? 'Tap to stop recording' : 'Tap to start speaking'}
            </p>
          </div>
        ) : (
          <div className="flex items-end space-x-3">
            <textarea
              value={textInput}
              onChange={(e) => setTextInput(e.target.value)}
              onKeyPress={handleKeyPress}
              placeholder="Type your message..."
              disabled={isDisabled}
              className="flex-1 bg-gray-800 border border-gray-700 rounded-lg px-4 py-3 text-white resize-none focus:outline-none focus:border-blue-500 transition-colors"
              rows={1}
              style={{ minHeight: '48px', maxHeight: '120px' }}
            />
            <button
              onClick={handleSendText}
              disabled={!textInput.trim() || isDisabled}
              className="w-12 h-12 bg-blue-600 hover:bg-blue-700 disabled:opacity-50 disabled:cursor-not-allowed rounded-lg flex items-center justify-center transition-colors"
            >
              <Send className="w-5 h-5 text-white" />
            </button>
          </div>
        )}
      </div>
    </div>
  )
}

Users can seamlessly switch between voice and text input based on their preference or environment constraints.

5.3 Message Display Component

// client/src/components/Chat/MessageBubble.tsx
import { motion } from 'framer-motion'
import { User, Bot, Mic } from 'lucide-react'

interface MessageBubbleProps {
  message: {
    id: string
    content: string
    sender: 'user' | 'ai'
    timestamp: number
    type: 'text' | 'voice'
  }
}

export const MessageBubble = ({ message }: MessageBubbleProps) => {
  const isUser = message.sender === 'user'

  return (
    <motion.div
      initial={{ opacity: 0, y: 20 }}
      animate={{ opacity: 1, y: 0 }}
      className={`flex mb-6 ${isUser ? 'justify-end' : 'justify-start'}`}
    >
      <div className={`flex items-start space-x-3 max-w-2xl ${isUser ? 'flex-row-reverse space-x-reverse' : ''}`}>
        <div className={`w-10 h-10 rounded-full flex items-center justify-center flex-shrink-0 ${
          isUser ? 'bg-gradient-to-br from-green-600 to-green-700' : 'bg-gradient-to-br from-blue-600 to-blue-700'
        }`}>
          {isUser ? <User className="w-5 h-5 text-white" /> : <Bot className="w-5 h-5 text-white" />}
        </div>

        <div className={`rounded-2xl px-5 py-3 ${
          isUser ? 'bg-green-600 text-white' : 'bg-gray-800 text-gray-100'
        }`}>
          <p className="text-sm leading-relaxed whitespace-pre-wrap">{message.content}</p>
          {message.type === 'voice' && (
            <div className="flex items-center mt-2 opacity-70">
              <Mic className="w-3 h-3 mr-1" />
              <span className="text-xs">Voice message</span>
            </div>
          )}
        </div>
      </div>
    </motion.div>
  )
}

Messages animate smoothly into view and use distinct visual styling for user and AI responses, with indicators for voice messages.

Step 6. API Client Service

The frontend communicates with the backend through a centralized, robust API service:

// client/src/services/api.ts
import axios from 'axios'
import { config } from '../config'

const api = axios.create({
  baseURL: config.API_BASE_URL,
  timeout: 30000,
  headers: { 'Content-Type': 'application/json' }
})

// Request interceptor for logging
api.interceptors.request.use((config) => {
  console.log(`📤 API Request: ${config.method?.toUpperCase()} ${config.url}`)
  return config
})

// Response interceptor for error handling
api.interceptors.response.use(
  (response) => {
    console.log(`✅ API Response: ${response.status} ${response.config.url}`)
    return response
  },
  (error) => {
    console.error(`❌ API Error: ${error.response?.status} ${error.config?.url}`, error.response?.data)
    return Promise.reject(error)
  }
)

export const agentAPI = {
  async startSession(roomId: string, userId: string) {
    const response = await api.post('/api/start', {
      room_id: roomId,
      user_id: userId,
      user_stream_id: `${userId}_stream`,
    })

    if (!response.data?.success) {
      throw new Error(response.data?.error || 'Session start failed')
    }

    return {
      agentInstanceId: response.data.agentInstanceId
    }
  },

  async sendMessage(agentInstanceId: string, message: string) {
    const response = await api.post('/api/send-message', {
      agent_instance_id: agentInstanceId,
      message: message.trim(),
    })

    if (!response.data?.success) {
      throw new Error(response.data?.error || 'Message send failed')
    }
  },

  async stopSession(agentInstanceId: string) {
    await api.post('/api/stop', {
      agent_instance_id: agentInstanceId,
    })
  },

  async getToken(userId: string) {
    const response = await api.get(`/api/token?user_id=${encodeURIComponent(userId)}`)

    if (!response.data?.token) {
      throw new Error('No token returned')
    }

    return { token: response.data.token }
  }
}

This abstraction centralizes error handling and request logging, and gives you a single place to add retry logic later.

Step 7. Running and Testing the Application

7.1 Backend Startup

From the server directory:

npm install
npm run dev

Verify server health by accessing http://localhost:8080/health. Expected response:

{
  "status": "healthy",
  "timestamp": "2025-12-22T18:00:00.000Z",
  "registered": false,
  "config": {
    "appId": true,
    "serverSecret": true,
    "dashscope": true
  }
}

The registered field becomes true after the first session initializes the agent.
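
The server code in Step 2 doesn't include this route; if you want the same response shape, a minimal sketch that reuses CONFIG and REGISTERED_AGENT_ID from Step 2 could look like this:

// server/src/server.ts — hedged sketch of a /health route matching the response above
app.get('/health', (_req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    registered: REGISTERED_AGENT_ID !== null,
    config: {
      appId: Boolean(CONFIG.ZEGO_APP_ID),
      serverSecret: Boolean(CONFIG.ZEGO_SERVER_SECRET),
      dashscope: Boolean(process.env.DASHSCOPE_API_KEY)
    }
  })
})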

7.2 Frontend Startup

From the client directory:

npm install
npm run dev

Navigate to http://localhost:5173 in Chrome or Edge. You should see the welcome screen with a blue bot icon and “Start Chat” button.


Conclusion

Your AI assistant is ready to provide smart, 24/7 support through voice and text conversations. The system uses ZEGOCLOUD’s real-time infrastructure with language models to create a helpful digital assistant that users can rely on for information and problem-solving.

You can extend this foundation with features such as multi-language support, conversation analytics, user authentication, or specialized knowledge areas. The same pattern works well for customer service bots, educational assistants, or any application where smart conversational AI improves the user experience. The clean separation between frontend and backend, combined with conversation memory, ensures your AI assistant can grow and adapt to meet user needs while maintaining good performance and reliability.

FAQ

Q1: What is an AI assistant?

An AI assistant is a software application that can understand user input, process requests, and respond with helpful information using text or voice.

Q2: What are the core components needed to build an AI assistant?

A typical AI assistant includes speech recognition (ASR), natural language processing (NLP or LLM), and text-to-speech (TTS), along with a system to manage conversations.

Q3: What are common use cases for AI assistants?

Common use cases include customer support, productivity tools, education, virtual companions, and voice-controlled applications.
