AI assistants help people get answers and complete tasks quickly through voice and text conversations. They can understand spoken or typed input, process questions, and respond helpfully in real time. In this guide, we will build an AI assistant application using ZEGOCLOUD that works with both voice and text input. The assistant will remember conversations, present a clean interface, and deliver clear spoken responses. Users can ask questions, get information, and receive help with various tasks through natural conversation.
How to Build an AI Assistant with ZEGOCLOUD
AI assistants help users complete daily tasks more efficiently by providing fast and intelligent responses. In this project, we use ZEGOCLOUD to build an AI assistant with real-time interaction capabilities.
ZEGOCLOUD’s AI Agent tools allow the assistant to join a chat room as a virtual user. It can listen to voice or text input, understand user intent, and respond with natural speech in real time.
By combining speech recognition, language processing, and voice synthesis, ZEGOCLOUD simplifies the setup of conversational AI. You can connect your language model, configure voice settings, and run AI agents that manage conversations automatically.
Prerequisites
Before beginning development, ensure you have:
- A ZEGOCLOUD account with AI Agent services enabled (sign up on the ZEGOCLOUD website)
- Node.js 18+ and npm installed
- Valid AppID and ServerSecret from the ZEGOCLOUD console
- A DashScope API key for the language model (you can use zego_test for testing during the trial period)
- A modern browser with microphone access (Chrome or Edge recommended)
- Fundamental understanding of web development concepts
Step 1. Project Setup and Architecture
The complete implementation for this guide is available in the zego-assistant repository.
1.1 Architecture Overview
Our AI assistant has two main parts. The backend uses Express and handles ZEGOCLOUD authentication, registers the AI agent, and provides API endpoints for starting sessions, sending messages, and creating access tokens.
The frontend is a React application that creates a chat interface. Users can switch between voice and text input. It uses ZEGOCLOUD’s WebRTC engine for real-time connections and processes messages from users and the AI agent. All conversations save to browser storage so users keep their chat history.
The backend handles API management while ZEGOCLOUD handles real-time audio streaming and message routing. This keeps your server simple while providing professional voice and text communication.
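For orientation, the files created in the rest of this guide fit into a layout roughly like this (only the files discussed here are listed; exact paths may differ slightly in the repository):
zego-assistant/
  server/
    .env
    zego-token.cjs          # ZEGOCLOUD token04 helper used by /api/token
    src/server.ts           # Express API: signatures, agent registration, sessions
  client/
    .env
    src/config.ts           # environment validation
    src/types.ts            # Message / ChatSession / ConversationMemory types
    src/services/zego.ts    # ZegoExpressEngine wrapper
    src/services/api.ts     # backend API client
    src/services/memory.ts  # localStorage conversation memory
    src/hooks/useChat.ts    # conversation state and session lifecycle
    src/components/         # ChatSession, VoiceInput, MessageBubble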
1.2 Environment Setup and Dependencies
Create the foundational project structure:
mkdir zego-assistant && cd zego-assistant
mkdir server client
Backend Configuration
cd server
npm init -y
npm install express cors dotenv axios typescript tsx
npm install --save-dev @types/express @types/cors @types/node
Create server/.env:
ZEGO_APP_ID=your_numeric_app_id
ZEGO_SERVER_SECRET=your_32_character_secret
DASHSCOPE_API_KEY=your_dashscope_api_key
PORT=8080
Configure development scripts in server/package.json:
{
"scripts": {
"dev": "tsx watch src/server.ts",
"build": "tsc",
"start": "node dist/server.js"
}
}
Frontend Configuration
cd ../client
npm create vite@latest . -- --template react-ts
npm install zego-express-engine-webrtc axios framer-motion lucide-react tailwindcss zod
Create client/.env:
VITE_ZEGO_APP_ID=your_numeric_app_id
VITE_ZEGO_SERVER=wss://webrtc-api.zegocloud.com/ws
VITE_API_BASE_URL=http://localhost:8080
Implement configuration validation:
// client/src/config.ts
import { z } from 'zod'
const configSchema = z.object({
ZEGO_APP_ID: z.string().min(1, 'ZEGO App ID is required'),
API_BASE_URL: z.string().url('Valid API base URL required'),
ZEGO_SERVER: z.string().url('Valid ZEGO server URL required'),
})
const rawConfig = {
ZEGO_APP_ID: import.meta.env.VITE_ZEGO_APP_ID,
API_BASE_URL: import.meta.env.VITE_API_BASE_URL,
ZEGO_SERVER: import.meta.env.VITE_ZEGO_SERVER || 'wss://webrtc-api.zegocloud.com/ws',
}
export const config = configSchema.parse(rawConfig)
This validation makes the application fail fast at startup with a clear error message if any environment variable is missing or malformed.
Step 2. Building the AI Assistant Server
The backend manages ZEGOCLOUD authentication, agent configuration, and session orchestration.
2.1 ZEGOCLOUD API Authentication
ZEGOCLOUD APIs require MD5-based signature authentication:
// server/src/server.ts
import crypto from 'crypto'
import axios from 'axios'
import dotenv from 'dotenv'
dotenv.config()
const CONFIG = {
ZEGO_APP_ID: process.env.ZEGO_APP_ID!,
ZEGO_SERVER_SECRET: process.env.ZEGO_SERVER_SECRET!,
ZEGO_API_BASE_URL: 'https://aigc-aiagent-api.zegotech.cn/',
PORT: process.env.PORT || 8080, // referenced by app.listen in section 2.6
}
function generateZegoSignature(action: string) {
const timestamp = Math.floor(Date.now() / 1000)
const nonce = crypto.randomBytes(8).toString('hex')
const signString = CONFIG.ZEGO_APP_ID + nonce + CONFIG.ZEGO_SERVER_SECRET + timestamp
const signature = crypto.createHash('md5').update(signString).digest('hex')
return {
Action: action,
AppId: CONFIG.ZEGO_APP_ID,
SignatureNonce: nonce,
SignatureVersion: '2.0',
Timestamp: timestamp,
Signature: signature
}
}
async function makeZegoRequest(action: string, body: object = {}) {
const queryParams = generateZegoSignature(action)
const queryString = Object.entries(queryParams)
.map(([k, v]) => `${k}=${encodeURIComponent(String(v))}`)
.join('&')
const url = `${CONFIG.ZEGO_API_BASE_URL}?${queryString}`
const response = await axios.post(url, body, {
headers: { 'Content-Type': 'application/json' },
timeout: 30000
})
return response.data
}
This authentication mechanism secures every ZEGOCLOUD API interaction.
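Concretely, the JSON payload travels in the POST body while the signature parameters travel in the query string. With placeholder values, a RegisterAgent call assembled by makeZegoRequest looks like this:
POST https://aigc-aiagent-api.zegotech.cn/?Action=RegisterAgent&AppId=1234567890&SignatureNonce=9f3a1c2b8d4e5f60&SignatureVersion=2.0&Timestamp=1735000000&Signature=<md5 of AppId + nonce + ServerSecret + timestamp>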
2.2 Configuring the AI Assistant Agent
The agent configuration defines the assistant’s personality and capabilities:
// server/src/server.ts
let REGISTERED_AGENT_ID: string | null = null
async function registerAgent(): Promise<string> {
if (REGISTERED_AGENT_ID) return REGISTERED_AGENT_ID
const agentId = `agent_${Date.now()}`
const agentConfig = {
AgentId: agentId,
Name: 'AI Assistant',
LLM: {
Url: 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions',
ApiKey: process.env.DASHSCOPE_API_KEY || 'zego_test', // use your DashScope key; zego_test works during the trial period
Model: 'qwen-plus',
SystemPrompt: `You are a helpful AI assistant. Provide clear, accurate, and useful information on a wide range of topics. Be concise but thorough in your responses. Keep responses conversational and under 100 words for natural voice flow. Help users with questions, tasks, and problem-solving in a friendly and professional manner.`,
Temperature: 0.7,
TopP: 0.9,
Params: { max_tokens: 200 }
},
TTS: {
Vendor: 'ByteDance',
Params: {
app: { appid: 'zego_test', token: 'zego_test', cluster: 'volcano_tts' },
speed_ratio: 1,
volume_ratio: 1,
pitch_ratio: 1,
audio: { rate: 24000 }
}
},
ASR: {
Vendor: 'Tencent',
Params: {
engine_model_type: '16k_en',
hotword_list: 'assistant|10,help|8,question|8,answer|8,information|8'
},
VADSilenceSegmentation: 1500,
PauseInterval: 2000
}
}
const result = await makeZegoRequest('RegisterAgent', agentConfig)
if (result.Code !== 0) {
throw new Error(`RegisterAgent failed: ${result.Message}`)
}
REGISTERED_AGENT_ID = agentId
return agentId
}
Key configuration elements:
- SystemPrompt: Establishes the assistant’s helpful and professional personality
- Temperature: 0.7 balances creativity with consistency for natural responses
- max_tokens: 200 ensures responses remain concise for smooth voice delivery
- VADSilenceSegmentation: 1500ms pause detection for natural speech processing
- PauseInterval: 2000ms wait time before finalizing speech transcription
The agent registers once per server instance and serves all user sessions.
2.3 Session Management and Agent Instances
The /api/start endpoint creates agent instances and establishes communication channels:
// server/src/server.ts
import express from 'express'
import cors from 'cors'
import { createRequire } from 'module'
const require = createRequire(import.meta.url)
const { generateToken04 } = require('../zego-token.cjs')
const app = express()
app.use(express.json())
app.use(cors())
app.post('/api/start', async (req, res) => {
const { room_id, user_id, user_stream_id } = req.body
if (!room_id || !user_id) {
res.status(400).json({ error: 'room_id and user_id required' })
return
}
const agentId = await registerAgent()
const userStreamId = user_stream_id || `${user_id}_stream`
const agentUserId = `agent_${room_id}`
const agentStreamId = `agent_stream_${room_id}`
const instanceConfig = {
AgentId: agentId,
UserId: user_id,
RTC: {
RoomId: room_id,
AgentUserId: agentUserId,
AgentStreamId: agentStreamId,
UserStreamId: userStreamId
},
MessageHistory: {
SyncMode: 1,
Messages: [],
WindowSize: 10
},
AdvancedConfig: {
InterruptMode: 0
}
}
const result = await makeZegoRequest('CreateAgentInstance', instanceConfig)
if (result.Code !== 0) {
res.status(400).json({ error: result.Message })
return
}
res.json({
success: true,
agentInstanceId: result.Data?.AgentInstanceId,
agentUserId,
agentStreamId,
userStreamId
})
})
The response provides the agentInstanceId required for message sending and session cleanup.
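With placeholder IDs, a successful response looks like the following; the agentInstanceId comes from ZEGOCLOUD and is the value the client must keep for /api/send-message and /api/stop:
{
  "success": true,
  "agentInstanceId": "1907xxxxxxxxxxxxxxx",
  "agentUserId": "agent_room_1735000000000_ab12cd",
  "agentStreamId": "agent_stream_room_1735000000000_ab12cd",
  "userStreamId": "user_1735000000000_ef34gh_stream"
}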
2.4 Text Message Processing
Users can send text messages when voice input isn’t preferred:
// server/src/server.ts
app.post('/api/send-message', async (req, res) => {
const { agent_instance_id, message } = req.body
if (!agent_instance_id || !message) {
res.status(400).json({ error: 'agent_instance_id and message required' })
return
}
const result = await makeZegoRequest('SendAgentInstanceLLM', {
AgentInstanceId: agent_instance_id,
Text: message,
AddQuestionToHistory: true,
AddAnswerToHistory: true
})
if (result.Code !== 0) {
res.status(400).json({ error: result.Message })
return
}
res.json({ success: true })
})
The agent processes text messages identically to voice transcriptions, maintaining conversation context through message history.
2.5 WebRTC Token Generation
The frontend requires authentication tokens to join ZEGOCLOUD rooms:
// server/src/server.ts
app.get('/api/token', (req, res) => {
const userId = req.query.user_id as string
const roomId = req.query.room_id as string
if (!userId) {
res.status(400).json({ error: 'user_id required' })
return
}
const payload = {
room_id: roomId || '',
privilege: { 1: 1, 2: 1 },
stream_id_list: null
}
const token = generateToken04(
parseInt(CONFIG.ZEGO_APP_ID, 10),
userId,
CONFIG.ZEGO_SERVER_SECRET,
3600,
JSON.stringify(payload)
)
res.json({ token })
})
Tokens remain valid for 3600 seconds (1 hour) and grant both publish and play privileges for seamless communication. The generateToken04 helper loaded from zego-token.cjs is ZEGOCLOUD's token04 implementation for Node.js and is included in the example repository.
2.6 Session Cleanup
When users end sessions, proper resource cleanup is essential:
// server/src/server.ts
app.post('/api/stop', async (req, res) => {
const { agent_instance_id } = req.body
if (!agent_instance_id) {
res.status(400).json({ error: 'agent_instance_id required' })
return
}
const result = await makeZegoRequest('DeleteAgentInstance', {
AgentInstanceId: agent_instance_id
})
if (result.Code !== 0) {
res.status(400).json({ error: result.Message })
return
}
res.json({ success: true })
})
app.listen(CONFIG.PORT, () => {
console.log(`Server running on port ${CONFIG.PORT}`)
})
This endpoint releases agent resources and prevents unnecessary processing after session termination.
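Step 7 verifies the server through a /health endpoint that is part of the repository but not covered above. A minimal sketch that produces the response shown there could sit alongside the other routes:
// server/src/server.ts (sketch): health check used in Step 7
app.get('/health', (_req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    registered: REGISTERED_AGENT_ID !== null, // true once the first session registers the agent
    config: {
      appId: Boolean(CONFIG.ZEGO_APP_ID),
      serverSecret: Boolean(CONFIG.ZEGO_SERVER_SECRET),
      dashscope: Boolean(process.env.DASHSCOPE_API_KEY)
    }
  })
})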
Step 3. WebRTC Integration with ZegoExpressEngine
The frontend leverages ZegoExpressEngine for all real-time communication. A service class encapsulates the SDK to provide a clean interface for React components.
3.1 ZEGO Service Initialization
// client/src/services/zego.ts
import { ZegoExpressEngine } from 'zego-express-engine-webrtc'
import { config } from '../config'
import { agentAPI } from './api' // joinRoom fetches a room token through the backend
export class ZegoService {
private static instance: ZegoService
private zg: ZegoExpressEngine | null = null
private isInitialized = false
private currentRoomId: string | null = null
private currentUserId: string | null = null
private localStream: any = null
private audioElement: HTMLAudioElement | null = null
static getInstance(): ZegoService {
if (!ZegoService.instance) {
ZegoService.instance = new ZegoService()
}
return ZegoService.instance
}
async initialize(): Promise<void> {
if (this.isInitialized) return
this.zg = new ZegoExpressEngine(
parseInt(config.ZEGO_APP_ID),
config.ZEGO_SERVER
)
this.setupEventListeners()
this.setupAudioElement()
this.isInitialized = true
}
private setupAudioElement(): void {
this.audioElement = document.getElementById('ai-audio-output') as HTMLAudioElement
if (!this.audioElement) {
this.audioElement = document.createElement('audio')
this.audioElement.id = 'ai-audio-output'
this.audioElement.autoplay = true
this.audioElement.style.display = 'none'
document.body.appendChild(this.audioElement)
}
this.audioElement.volume = 0.8
}
}
The singleton pattern ensures only one ZEGO engine instance exists per browser session.
3.2 Room Management and Audio Publishing
// client/src/services/zego.ts (continued)
async joinRoom(roomId: string, userId: string): Promise<boolean> {
if (!this.zg) return false
if (this.currentRoomId === roomId && this.currentUserId === userId) {
return true
}
try {
if (this.currentRoomId) {
await this.leaveRoom()
}
this.currentRoomId = roomId
this.currentUserId = userId
const { token } = await agentAPI.getToken(userId)
await this.zg.loginRoom(roomId, token, {
userID: userId,
userName: userId
})
this.zg.callExperimentalAPI({
method: 'onRecvRoomChannelMessage',
params: {}
})
const localStream = await this.zg.createZegoStream({
camera: { video: false, audio: true }
})
if (localStream) {
this.localStream = localStream
const streamId = `${userId}_stream`
await this.zg.startPublishingStream(streamId, localStream)
return true
}
throw new Error('Failed to create local stream')
} catch (error) {
console.error('Failed to join room:', error)
this.currentRoomId = null
this.currentUserId = null
return false
}
}
async enableMicrophone(enabled: boolean): Promise<boolean> {
if (!this.localStream) return false
const audioTrack = this.localStream.getAudioTracks()[0]
if (audioTrack) {
audioTrack.enabled = enabled
return true
}
return false
}
The enableMicrophone method provides granular control over voice transmission to the AI agent.
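As a usage sketch (assuming initialize() and joinRoom() have already succeeded), any caller can mute or unmute the user through the singleton without tearing down the session:
// Mute the local microphone while the agent is speaking, then re-enable it
const zego = ZegoService.getInstance()
await zego.enableMicrophone(false)
// ...when the user wants to talk again
await zego.enableMicrophone(true)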
3.3 Event Handling and Message Processing
// client/src/services/zego.ts (continued)
private setupEventListeners(): void {
if (!this.zg) return
this.zg.on('recvExperimentalAPI', (result: any) => {
const { method, content } = result
if (method === 'onRecvRoomChannelMessage') {
try {
const message = JSON.parse(content.msgContent)
this.handleRoomMessage(message)
} catch (error) {
console.error('Failed to parse room message:', error)
}
}
})
this.zg.on('roomStreamUpdate', async (_roomID, updateType, streamList) => {
if (updateType === 'ADD') {
for (const stream of streamList) {
const userStreamId = this.currentUserId ? `${this.currentUserId}_stream` : null
if (userStreamId && stream.streamID === userStreamId) {
continue
}
try {
const mediaStream = await this.zg!.startPlayingStream(stream.streamID)
if (mediaStream) {
const remoteView = await this.zg!.createRemoteStreamView(mediaStream)
if (remoteView && this.audioElement) {
await remoteView.play(this.audioElement, {
enableAutoplayDialog: false,
muted: false
})
}
}
} catch (error) {
console.error('Failed to play agent stream:', error)
}
}
}
})
}
private messageCallback: ((message: any) => void) | null = null
private handleRoomMessage(message: any): void {
if (this.messageCallback) {
this.messageCallback(message)
}
}
onRoomMessage(callback: (message: any) => void): void {
this.messageCallback = callback
}
Room messages contain ASR transcriptions and LLM responses. The callback pattern enables React components to handle these events without tight coupling to the ZEGO service.
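The messages arriving through this callback have a small, consistent shape. The interface below is inferred from how the handlers in Step 4.2 read the Cmd 3 and Cmd 4 payloads:
// Shape of room channel messages, inferred from the handlers in useChat (Step 4.2)
interface RoomChannelMessage {
  Cmd: 3 | 4              // 3 = ASR transcription, 4 = LLM response
  Data: {
    Text: string          // transcript fragment or answer text
    EndFlag?: boolean     // true once the transcript or answer is final
    MessageId?: string    // stable id, useful for deduplicating updates
  }
}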
Step 4. React Chat Interface and State Management
The React application manages conversation state, displays messages, and provides intuitive voice and text input options.
4.1 Advanced State Management with useReducer
// client/src/hooks/useChat.ts
import { useCallback, useRef, useReducer } from 'react'
import { ZegoService } from '../services/zego'
import { agentAPI } from '../services/api'
import { memoryService } from '../services/memory'
import type { Message, ChatSession } from '../types'
interface ChatState {
messages: Message[]
session: ChatSession | null
isLoading: boolean
isConnected: boolean
isRecording: boolean
currentTranscript: string
agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
error: string | null
}
type ChatAction =
| { type: 'ADD_MESSAGE'; payload: Message }
| { type: 'SET_SESSION'; payload: ChatSession | null }
| { type: 'SET_LOADING'; payload: boolean }
| { type: 'SET_CONNECTED'; payload: boolean }
| { type: 'SET_RECORDING'; payload: boolean }
| { type: 'SET_TRANSCRIPT'; payload: string }
| { type: 'SET_AGENT_STATUS'; payload: 'idle' | 'listening' | 'thinking' | 'speaking' }
| { type: 'SET_ERROR'; payload: string | null }
function chatReducer(state: ChatState, action: ChatAction): ChatState {
switch (action.type) {
case 'ADD_MESSAGE':
return { ...state, messages: [...state.messages, action.payload] }
case 'SET_CONNECTED':
return { ...state, isConnected: action.payload }
case 'SET_RECORDING':
return { ...state, isRecording: action.payload }
case 'SET_TRANSCRIPT':
return { ...state, currentTranscript: action.payload }
case 'SET_AGENT_STATUS':
return { ...state, agentStatus: action.payload }
case 'SET_SESSION':
return { ...state, session: action.payload }
case 'SET_LOADING':
return { ...state, isLoading: action.payload }
case 'SET_ERROR':
return { ...state, error: action.payload }
default:
return state
}
}
export const useChat = () => {
const [state, dispatch] = useReducer(chatReducer, {
messages: [],
session: null,
isLoading: false,
isConnected: false,
isRecording: false,
currentTranscript: '',
agentStatus: 'idle',
error: null
})
const zegoService = useRef(ZegoService.getInstance())
const processedMessageIds = useRef(new Set<string>())
// ...the hook's callbacks and return value are defined in the sections below
Using useReducer provides predictable state updates and simplifies debugging complex state transitions.
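The Message, ChatSession, and ConversationMemory types imported from ../types are not shown elsewhere in this guide; a minimal client/src/types.ts consistent with how the fields are used looks like this:
// client/src/types.ts (sketch; field names follow their usage in useChat and memoryService)
export interface Message {
  id: string
  content: string
  sender: 'user' | 'ai'
  timestamp: number
  type: 'text' | 'voice'
}
export interface ChatSession {
  roomId: string
  userId: string
  agentInstanceId: string
  isActive: boolean
  conversationId: string
}
export interface ConversationMemory {
  id: string
  title: string
  messages: Message[]
  createdAt: number
  updatedAt: number
}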
4.2 ASR and LLM Event Processing
// client/src/hooks/useChat.ts (continued)
const setupMessageHandlers = useCallback((conversationId: string) => {
const handleRoomMessage = (data: any) => {
const { Cmd, Data: msgData } = data
// Cmd 3: ASR transcription events
if (Cmd === 3) {
const { Text: transcript, EndFlag, MessageId } = msgData
if (transcript && transcript.trim()) {
dispatch({ type: 'SET_TRANSCRIPT', payload: transcript })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })
if (EndFlag) {
const userMessage: Message = {
id: MessageId || `voice_${Date.now()}`,
content: transcript.trim(),
sender: 'user',
timestamp: Date.now(),
type: 'voice'
}
dispatch({ type: 'ADD_MESSAGE', payload: userMessage })
memoryService.addMessage(conversationId, userMessage)
dispatch({ type: 'SET_TRANSCRIPT', payload: '' })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'thinking' })
}
}
}
// Cmd 4: LLM response events
if (Cmd === 4) {
const { Text: content, MessageId, EndFlag } = msgData
if (!content || !MessageId) return
dispatch({ type: 'SET_AGENT_STATUS', payload: 'speaking' })
if (EndFlag) {
const aiMessage: Message = {
id: MessageId,
content,
sender: 'ai',
timestamp: Date.now(),
type: 'text'
}
dispatch({ type: 'ADD_MESSAGE', payload: aiMessage })
memoryService.addMessage(conversationId, aiMessage)
dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
}
}
}
zegoService.current.onRoomMessage(handleRoomMessage)
}, [])
The agent status transitions through listening → thinking → speaking → idle, providing clear visual feedback about the AI’s current activity.
4.3 Session Lifecycle Management
// client/src/hooks/useChat.ts (continued)
const startSession = useCallback(async (): Promise<boolean> => {
if (state.isLoading || state.isConnected) return false
dispatch({ type: 'SET_LOADING', payload: true })
try {
const roomId = `room_${Date.now()}_${Math.random().toString(36).substr(2, 6)}`
const userId = `user_${Date.now()}_${Math.random().toString(36).substr(2, 6)}`
await zegoService.current.initialize()
const joinResult = await zegoService.current.joinRoom(roomId, userId)
if (!joinResult) throw new Error('Failed to join ZEGO room')
const result = await agentAPI.startSession(roomId, userId)
const conversation = memoryService.createOrGetConversation()
const newSession: ChatSession = {
roomId,
userId,
agentInstanceId: result.agentInstanceId,
isActive: true,
conversationId: conversation.id
}
dispatch({ type: 'SET_SESSION', payload: newSession })
dispatch({ type: 'SET_CONNECTED', payload: true })
setupMessageHandlers(conversation.id)
return true
} catch (error) {
dispatch({ type: 'SET_ERROR', payload: error instanceof Error ? error.message : 'Failed to start session' })
return false
} finally {
dispatch({ type: 'SET_LOADING', payload: false })
}
}, [state.isLoading, state.isConnected, setupMessageHandlers])
const endSession = useCallback(async () => {
if (!state.session) return
try {
if (state.isRecording) {
await zegoService.current.enableMicrophone(false)
dispatch({ type: 'SET_RECORDING', payload: false })
}
if (state.session.agentInstanceId) {
await agentAPI.stopSession(state.session.agentInstanceId)
}
await zegoService.current.leaveRoom()
dispatch({ type: 'SET_SESSION', payload: null })
dispatch({ type: 'SET_CONNECTED', payload: false })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
} catch (error) {
console.error('Failed to end session:', error)
}
}, [state.session, state.isRecording])
Sessions are isolated by unique room IDs, enabling multiple concurrent user sessions without interference.
4.4 Dual Input Mode Implementation
// client/src/hooks/useChat.ts (continued)
const sendTextMessage = useCallback(async (content: string) => {
if (!state.session?.agentInstanceId) return
const trimmedContent = content.trim()
if (!trimmedContent) return
try {
const userMessage: Message = {
id: `text_${Date.now()}`,
content: trimmedContent,
sender: 'user',
timestamp: Date.now(),
type: 'text'
}
dispatch({ type: 'ADD_MESSAGE', payload: userMessage })
memoryService.addMessage(state.session.conversationId, userMessage)
dispatch({ type: 'SET_AGENT_STATUS', payload: 'thinking' })
await agentAPI.sendMessage(state.session.agentInstanceId, trimmedContent)
} catch (error) {
dispatch({ type: 'SET_ERROR', payload: 'Failed to send message' })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
}
}, [state.session])
const toggleVoiceRecording = useCallback(async () => {
if (!state.isConnected) return
try {
if (state.isRecording) {
await zegoService.current.enableMicrophone(false)
dispatch({ type: 'SET_RECORDING', payload: false })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'idle' })
} else {
const success = await zegoService.current.enableMicrophone(true)
if (success) {
dispatch({ type: 'SET_RECORDING', payload: true })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })
}
}
} catch (error) {
console.error('Failed to toggle recording:', error)
}
}, [state.isConnected, state.isRecording])
return {
...state,
startSession,
sendTextMessage,
toggleVoiceRecording,
endSession
}
The hook exposes a clean API that React components can use without understanding the underlying ZEGOCLOUD or API complexities.
4.5 Persistent Conversation Memory
Conversations are stored in browser localStorage to maintain continuity across page refreshes:
// client/src/services/memory.ts
import type { ConversationMemory, Message } from '../types'
class MemoryService {
private conversations: Map<string, ConversationMemory> = new Map()
constructor() {
this.loadFromStorage()
}
private loadFromStorage(): void {
const stored = localStorage.getItem('ai_conversations')
if (stored) {
const conversations: ConversationMemory[] = JSON.parse(stored)
conversations.forEach(conv => {
this.conversations.set(conv.id, conv)
})
}
}
private saveToStorage(): void {
const conversations = Array.from(this.conversations.values())
localStorage.setItem('ai_conversations', JSON.stringify(conversations))
}
createOrGetConversation(id?: string): ConversationMemory {
const conversationId = id || `conv_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`
if (this.conversations.has(conversationId)) {
return this.conversations.get(conversationId)!
}
const newConversation: ConversationMemory = {
id: conversationId,
title: 'New Conversation',
messages: [],
createdAt: Date.now(),
updatedAt: Date.now()
}
this.conversations.set(conversationId, newConversation)
this.saveToStorage()
return newConversation
}
addMessage(conversationId: string, message: Message): void {
const conversation = this.conversations.get(conversationId)
if (!conversation) return
conversation.messages.push(message)
conversation.updatedAt = Date.now()
if (conversation.messages.length === 1 && message.sender === 'user') {
conversation.title = message.content.slice(0, 50)
}
this.saveToStorage()
}
getAllConversations(): ConversationMemory[] {
return Array.from(this.conversations.values())
.sort((a, b) => b.updatedAt - a.updatedAt)
}
deleteConversation(conversationId: string): void {
this.conversations.delete(conversationId)
this.saveToStorage()
}
}
export const memoryService = new MemoryService()
This enables users to review previous conversations and maintain context across sessions.
Step 5. Building the User Interface Components
The interface provides a clean, professional environment for AI assistance interactions.
5.1 Main Chat Session Component
// client/src/components/ChatSession.tsx
import { useEffect, useRef } from 'react'
import { motion } from 'framer-motion'
import { MessageBubble } from './Chat/MessageBubble'
import { VoiceInput } from './VoiceInput'
import { useChat } from '../hooks/useChat'
import { Bot, Phone, PhoneOff } from 'lucide-react'
export const ChatSession = () => {
const messagesEndRef = useRef<HTMLDivElement>(null)
const {
messages,
isLoading,
isConnected,
isRecording,
currentTranscript,
agentStatus,
startSession,
sendTextMessage,
toggleVoiceRecording,
endSession
} = useChat()
useEffect(() => {
messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' })
}, [messages])
if (!isConnected && messages.length === 0) {
return (
<div className="flex flex-col h-full bg-black">
<audio id="ai-audio-output" autoPlay style={{ display: 'none' }} />
<div className="flex-1 flex flex-col items-center justify-center">
<motion.div initial={{ opacity: 0, y: 20 }} animate={{ opacity: 1, y: 0 }}>
<div className="w-24 h-24 bg-gradient-to-br from-blue-600 to-blue-700 rounded-full flex items-center justify-center mb-8 mx-auto">
<Bot className="w-12 h-12 text-white" />
</div>
<h2 className="text-3xl font-semibold mb-4">AI Assistant</h2>
<p className="text-gray-400 mb-10 max-w-md text-center">
Your intelligent companion for questions, tasks, and conversations.
</p>
<button
onClick={startSession}
disabled={isLoading}
className="px-8 py-4 bg-blue-600 hover:bg-blue-700 rounded-full flex items-center space-x-3 mx-auto transition-colors"
>
<Phone className="w-5 h-5" />
<span>{isLoading ? 'Starting...' : 'Start Chat'}</span>
</button>
</motion.div>
</div>
</div>
)
}
return (
<div className="flex flex-col h-full bg-black">
<audio id="ai-audio-output" autoPlay style={{ display: 'none' }} />
{/* Status Bar */}
<div className="bg-gray-900/50 border-b border-gray-800 px-6 py-3">
<div className="flex items-center justify-between">
<div className="flex items-center space-x-3">
<div className={`w-3 h-3 rounded-full ${isConnected ? 'bg-green-400 animate-pulse' : 'bg-gray-600'}`} />
<span className="text-sm text-gray-400">
{agentStatus === 'listening' && 'Listening...'}
{agentStatus === 'thinking' && 'Processing...'}
{agentStatus === 'speaking' && 'Responding...'}
{agentStatus === 'idle' && 'Ready'}
</span>
</div>
{isConnected && (
<button
onClick={endSession}
className="px-4 py-2 bg-red-600/80 hover:bg-red-600 rounded-lg flex items-center space-x-2 transition-colors"
>
<PhoneOff className="w-4 h-4" />
<span>End Chat</span>
</button>
)}
</div>
</div>
{/* Messages */}
<div className="flex-1 overflow-y-auto px-6 py-6">
{messages.map((message) => (
<MessageBubble key={message.id} message={message} />
))}
{agentStatus === 'thinking' && (
<motion.div
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
className="flex justify-start mb-6"
>
<div className="flex items-center space-x-3">
<div className="w-10 h-10 bg-gradient-to-br from-blue-600 to-blue-700 rounded-full flex items-center justify-center">
<Bot className="w-5 h-5 text-white" />
</div>
<div className="bg-gray-800 rounded-2xl px-5 py-3">
<div className="flex space-x-1">
<div className="w-2 h-2 bg-blue-400 rounded-full animate-bounce" />
<div className="w-2 h-2 bg-blue-400 rounded-full animate-bounce" style={{ animationDelay: '0.1s' }} />
<div className="w-2 h-2 bg-blue-400 rounded-full animate-bounce" style={{ animationDelay: '0.2s' }} />
</div>
</div>
</div>
</motion.div>
)}
<div ref={messagesEndRef} />
</div>
{/* Input */}
{isConnected && (
<VoiceInput
onSendMessage={sendTextMessage}
isRecording={isRecording}
onToggleRecording={toggleVoiceRecording}
currentTranscript={currentTranscript}
agentStatus={agentStatus}
/>
)}
</div>
)
}
The component manages three distinct states: welcome screen, active session, and message display with smooth transitions.
5.2 Intelligent Voice Input Component
// client/src/components/VoiceInput.tsx
import { useState } from 'react'
import { motion } from 'framer-motion'
import { Mic, MicOff, Send, Type } from 'lucide-react'
interface VoiceInputProps {
onSendMessage: (message: string) => void
isRecording: boolean
onToggleRecording: () => void
currentTranscript: string
agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
}
export const VoiceInput = ({
onSendMessage,
isRecording,
onToggleRecording,
currentTranscript,
agentStatus
}: VoiceInputProps) => {
const [textInput, setTextInput] = useState('')
const [inputMode, setInputMode] = useState<'voice' | 'text'>('voice')
const handleSendText = () => {
if (textInput.trim()) {
onSendMessage(textInput.trim())
setTextInput('')
}
}
const handleKeyPress = (e: React.KeyboardEvent) => {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault()
handleSendText()
}
}
const isDisabled = agentStatus === 'thinking' || agentStatus === 'speaking'
return (
<div className="bg-gray-900 border-t border-gray-800 p-4">
<div className="max-w-4xl mx-auto">
{/* Mode Toggle */}
<div className="flex justify-center mb-4">
<div className="bg-gray-800 rounded-lg p-1 flex">
<button
onClick={() => setInputMode('voice')}
className={`px-4 py-2 rounded-md flex items-center space-x-2 transition-colors ${
inputMode === 'voice' ? 'bg-blue-600 text-white' : 'text-gray-400 hover:text-white'
}`}
>
<Mic className="w-4 h-4" />
<span>Voice</span>
</button>
<button
onClick={() => setInputMode('text')}
className={`px-4 py-2 rounded-md flex items-center space-x-2 transition-colors ${
inputMode === 'text' ? 'bg-blue-600 text-white' : 'text-gray-400 hover:text-white'
}`}
>
<Type className="w-4 h-4" />
<span>Text</span>
</button>
</div>
</div>
{inputMode === 'voice' ? (
<div className="flex flex-col items-center space-y-4">
{currentTranscript && (
<motion.div
initial={{ opacity: 0, y: 10 }}
animate={{ opacity: 1, y: 0 }}
className="bg-gray-800 rounded-lg p-4 max-w-2xl w-full"
>
<p className="text-gray-300 text-center">{currentTranscript}</p>
</motion.div>
)}
<motion.button
onClick={onToggleRecording}
disabled={isDisabled}
className={`w-16 h-16 rounded-full flex items-center justify-center transition-all ${
isRecording ? 'bg-red-600 hover:bg-red-700 scale-110' : 'bg-blue-600 hover:bg-blue-700'
} ${isDisabled ? 'opacity-50 cursor-not-allowed' : 'hover:scale-105'}`}
whileTap={{ scale: 0.95 }}
>
{isRecording ? <MicOff className="w-6 h-6 text-white" /> : <Mic className="w-6 h-6 text-white" />}
</motion.button>
<p className="text-sm text-gray-400 text-center">
{isRecording ? 'Tap to stop recording' : 'Tap to start speaking'}
</p>
</div>
) : (
<div className="flex items-end space-x-3">
<textarea
value={textInput}
onChange={(e) => setTextInput(e.target.value)}
onKeyPress={handleKeyPress}
placeholder="Type your message..."
disabled={isDisabled}
className="flex-1 bg-gray-800 border border-gray-700 rounded-lg px-4 py-3 text-white resize-none focus:outline-none focus:border-blue-500 transition-colors"
rows={1}
style={{ minHeight: '48px', maxHeight: '120px' }}
/>
<button
onClick={handleSendText}
disabled={!textInput.trim() || isDisabled}
className="w-12 h-12 bg-blue-600 hover:bg-blue-700 disabled:opacity-50 disabled:cursor-not-allowed rounded-lg flex items-center justify-center transition-colors"
>
<Send className="w-5 h-5 text-white" />
</button>
</div>
)}
</div>
</div>
)
}
Users can seamlessly switch between voice and text input based on their preference or environment constraints.
5.3 Message Display Component
// client/src/components/Chat/MessageBubble.tsx
import { motion } from 'framer-motion'
import { User, Bot, Mic } from 'lucide-react'
interface MessageBubbleProps {
message: {
id: string
content: string
sender: 'user' | 'ai'
timestamp: number
type: 'text' | 'voice'
}
}
export const MessageBubble = ({ message }: MessageBubbleProps) => {
const isUser = message.sender === 'user'
return (
<motion.div
initial={{ opacity: 0, y: 20 }}
animate={{ opacity: 1, y: 0 }}
className={`flex mb-6 ${isUser ? 'justify-end' : 'justify-start'}`}
>
<div className={`flex items-start space-x-3 max-w-2xl ${isUser ? 'flex-row-reverse space-x-reverse' : ''}`}>
<div className={`w-10 h-10 rounded-full flex items-center justify-center flex-shrink-0 ${
isUser ? 'bg-gradient-to-br from-green-600 to-green-700' : 'bg-gradient-to-br from-blue-600 to-blue-700'
}`}>
{isUser ? <User className="w-5 h-5 text-white" /> : <Bot className="w-5 h-5 text-white" />}
</div>
<div className={`rounded-2xl px-5 py-3 ${
isUser ? 'bg-green-600 text-white' : 'bg-gray-800 text-gray-100'
}`}>
<p className="text-sm leading-relaxed whitespace-pre-wrap">{message.content}</p>
{message.type === 'voice' && (
<div className="flex items-center mt-2 opacity-70">
<Mic className="w-3 h-3 mr-1" />
<span className="text-xs">Voice message</span>
</div>
)}
</div>
</div>
</motion.div>
)
}
Messages animate smoothly into view and use distinct visual styling for user and AI responses, with indicators for voice messages.
Step 6. API Client Service
The frontend communicates with the backend through a centralized, robust API service:
// client/src/services/api.ts
import axios from 'axios'
import { config } from '../config'
const api = axios.create({
baseURL: config.API_BASE_URL,
timeout: 30000,
headers: { 'Content-Type': 'application/json' }
})
// Request interceptor for logging
api.interceptors.request.use((config) => {
console.log(`📤 API Request: ${config.method?.toUpperCase()} ${config.url}`)
return config
})
// Response interceptor for error handling
api.interceptors.response.use(
(response) => {
console.log(`✅ API Response: ${response.status} ${response.config.url}`)
return response
},
(error) => {
console.error(`❌ API Error: ${error.response?.status} ${error.config?.url}`, error.response?.data)
return Promise.reject(error)
}
)
export const agentAPI = {
async startSession(roomId: string, userId: string) {
const response = await api.post('/api/start', {
room_id: roomId,
user_id: userId,
user_stream_id: `${userId}_stream`,
})
if (!response.data?.success) {
throw new Error(response.data?.error || 'Session start failed')
}
return {
agentInstanceId: response.data.agentInstanceId
}
},
async sendMessage(agentInstanceId: string, message: string) {
const response = await api.post('/api/send-message', {
agent_instance_id: agentInstanceId,
message: message.trim(),
})
if (!response.data?.success) {
throw new Error(response.data?.error || 'Message send failed')
}
},
async stopSession(agentInstanceId: string) {
await api.post('/api/stop', {
agent_instance_id: agentInstanceId,
})
},
async getToken(userId: string) {
const response = await api.get(`/api/token?user_id=${encodeURIComponent(userId)}`)
if (!response.data?.token) {
throw new Error('No token returned')
}
return { token: response.data.token }
}
}
This abstraction centralizes error handling and request logging, and gives you a single place to add cross-cutting behavior such as retries.
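If you later need automatic retries for transient failures, one option is a small response interceptor; this is a sketch, not part of the repository:
// client/src/services/api.ts (optional sketch): retry a failed GET once on network errors
api.interceptors.response.use(undefined, async (error) => {
  const cfg: any = error.config
  // Only retry idempotent GETs that failed before receiving any response
  if (cfg && cfg.method === 'get' && !error.response && !cfg._retried) {
    cfg._retried = true
    await new Promise((resolve) => setTimeout(resolve, 500)) // brief backoff
    return api.request(cfg)
  }
  return Promise.reject(error)
})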
Step 7. Running and Testing the Application
7.1 Backend Startup
From the server directory:
npm install
npm run dev
Verify server health by accessing http://localhost:8080/health. Expected response:
{
"status": "healthy",
"timestamp": "2025-12-22T18:00:00.000Z",
"registered": false,
"config": {
"appId": true,
"serverSecret": true,
"dashscope": true
}
}
The registered field becomes true after the first session initializes the agent.
7.2 Frontend Startup
From the client directory:
npm install
npm run dev
Navigate to http://localhost:5173 in Chrome or Edge. You should see the welcome screen with a blue bot icon and “Start Chat” button.
Conclusion
Your AI assistant is ready to provide smart, 24/7 support through voice and text conversations. The system uses ZEGOCLOUD’s real-time infrastructure with language models to create a helpful digital assistant that users can rely on for information and problem-solving.
You can extend this foundation with features such as multi-language support, conversation analytics, user authentication, or specialized knowledge areas. The same pattern works well for customer service bots, educational assistants, or any application where smart conversational AI improves the user experience. The clean separation between frontend and backend, combined with conversation memory, ensures your AI assistant can grow and adapt to meet user needs while maintaining good performance and reliability.
FAQ
Q1: What is an AI assistant?
An AI assistant is a software application that can understand user input, process requests, and respond with helpful information using text or voice.
Q2: What are the core components needed to build an AI assistant?
A typical AI assistant includes speech recognition (ASR), natural language processing (NLP or LLM), and text-to-speech (TTS), along with a system to manage conversations.
Q3: What are common use cases for AI assistants?
Common use cases include customer support, productivity tools, education, virtual companions, and voice-controlled applications.