Video-based AI agents are becoming common in modern products, appearing as virtual recruiters, onboarding assistants, and training coaches; an AI interview assistant is one of the most practical examples. These agents do more than answer questions: they look you in the eye, speak with natural timing, and guide you through a structured conversation. Behind that smooth experience is a complex stack of real-time audio and video, speech recognition, LLM reasoning, text-to-speech, and avatar rendering that must stay in sync. In this guide, you will build an AI interview assistant that welcomes a candidate, asks structured questions, and responds naturally in real time.
How to Develop an AI Interview Assistant
Instead of a classic chatbot, ZEGOCLOUD treats the AI interviewer as another participant in a real-time room:
- The candidate joins a ZEGOCLOUD room with a microphone stream.
- The AI agent joins the same room, listening to the candidate’s voice and replying with speech.
- The Digital Human binds to the agent’s voice stream and outputs a synchronized video stream.
- Your web app just plays the candidate’s audio and the digital human’s video and manages UI state.
Under the hood, the Digital Human SDK turns the agent’s TTS audio into a talking avatar, and ZegoExpressEngine carries all of those audio/video streams through the same room so the browser simply subscribes to the digital human stream like any other remote video.
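The participants and streams described above can be summarized as a tiny model; the names below are purely illustrative, not ZEGOCLOUD SDK types:

```typescript
// Conceptual model of the interview room: two participants, three streams.
// These names are illustrative only; they are not ZEGOCLOUD SDK types.
type Participant = 'candidate' | 'ai_agent'

interface RoomStream {
  owner: Participant
  kind: 'mic_audio' | 'agent_audio' | 'avatar_video'
}

const interviewRoom: RoomStream[] = [
  { owner: 'candidate', kind: 'mic_audio' },   // published by the browser
  { owner: 'ai_agent', kind: 'agent_audio' },  // the agent's TTS output
  { owner: 'ai_agent', kind: 'avatar_video' }  // digital human bound to that audio
]

// The browser publishes its own mic and subscribes to everything else.
const streamsToPlay = interviewRoom.filter(s => s.owner !== 'candidate')
console.log(streamsToPlay.length) // 2
```

The point is that the digital human video is just another remote stream in the room; the browser needs no special media path for it.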
Prerequisites
Before you start, make sure you have:
- A ZEGOCLOUD account with Agent and Digital Human services enabled → Sign up here
- Node.js 18+ and npm.
- A valid AppID and ServerSecret from the ZEGOCLOUD console.
- A DashScope (or other LLM) API key for interview logic. You can use zego_test for testing within the trial period.
- A modern desktop browser (Chrome/Edge) with microphone access.
1. Project Setup
The complete project implementation for this guide is available in the zego-digital-human repository.
1.1 Architecture Overview
The implementation is structured as:
- Backend (server)
  - Express app exposing /api/start, /api/start-digital-human, /api/send-message, /api/token, /api/stop.
  - ZEGOCLOUD MD5 signature generation.
  - Agent registration for an interview-oriented LLM, TTS, and ASR profile.
  - Unified “digital human agent instance” creation and cleanup.
- Frontend (client)
  - React app created with Vite + TypeScript.
  - ZegoExpressEngine WebRTC wrapper (ZegoService) for joining rooms, publishing the mic, and playing remote streams.
  - Digital human view that hosts the avatar video.
  - Interview flow hook (useInterview) managing connection state, ASR/LLM events, and UI.
The backend only exposes REST endpoints; all real-time media is handled via ZEGOCLOUD.
1.2 Installing Dependencies and Environment
Create the base structure:
mkdir zego-digital-human && cd zego-digital-human
mkdir server client
Backend setup
cd server
npm init -y
npm install express cors dotenv axios typescript tsx
npm install --save-dev @types/express @types/cors @types/node
Add server/.env:
ZEGO_APP_ID=your_numeric_app_id
ZEGO_SERVER_SECRET=your_32_character_secret
DASHSCOPE_API_KEY=your_dashscope_api_key
ALLOWED_ORIGINS=https://your-frontend-domain.com,http://localhost:5173
PORT=8080
Use tsx for development:
// server/package.json (scripts)
{
"scripts": {
"dev": "tsx watch src/server.ts",
"build": "tsc",
"start": "node dist/server.js",
"type-check": "tsc --noEmit"
}
}
Frontend setup
cd ../client
npm create vite@latest . -- --template react-ts
npm install zego-express-engine-webrtc axios framer-motion lucide-react tailwindcss zod
Add client/.env:
VITE_ZEGO_APP_ID=your_numeric_app_id
VITE_ZEGO_SERVER=wss://webliveroom-api.zegocloud.com/ws
VITE_API_BASE_URL=http://localhost:8080
Validate config on the client:
// client/src/config.ts
import { z } from 'zod'
const configSchema = z.object({
ZEGO_APP_ID: z.string().min(1, 'ZEGO App ID is required'),
ZEGO_SERVER: z.string().url('Valid ZEGO server URL required'),
API_BASE_URL: z.string().url('Valid API base URL required'),
})
const rawConfig = {
ZEGO_APP_ID: import.meta.env.VITE_ZEGO_APP_ID,
ZEGO_SERVER: import.meta.env.VITE_ZEGO_SERVER,
API_BASE_URL: import.meta.env.VITE_API_BASE_URL,
}
export const config = configSchema.parse(rawConfig)
This fails fast at startup if any environment variable is missing or malformed, instead of failing later with a confusing runtime error.
2. Building the Interview Agent Server
All backend logic lives in server/src/server.ts. The core steps:
- Generate ZEGOCLOUD signatures.
- Register an interview agent once per process.
- Start voice-only and digital human sessions.
- Provide tokens and cleanup endpoints.
2.1 ZEGOCLOUD API Authentication
The Agent and Digital Human APIs share a signature scheme based on MD5:
// server/src/server.ts
import crypto from 'crypto'
import axios from 'axios'
import dotenv from 'dotenv'
dotenv.config()
const CONFIG = {
ZEGO_APP_ID: process.env.ZEGO_APP_ID!,
ZEGO_SERVER_SECRET: process.env.ZEGO_SERVER_SECRET!,
ZEGO_AIAGENT_API_BASE_URL: 'https://aigc-aiagent-api.zegotech.cn',
ZEGO_DIGITAL_HUMAN_API_BASE_URL: 'https://aigc-digitalhuman-api.zegotech.cn'
}
function generateZegoSignature(action: string) {
const timestamp = Math.floor(Date.now() / 1000)
const nonce = crypto.randomBytes(8).toString('hex')
// Critical: AppId + SignatureNonce + ServerSecret + Timestamp
const signString = CONFIG.ZEGO_APP_ID + nonce + CONFIG.ZEGO_SERVER_SECRET + timestamp
const signature = crypto.createHash('md5').update(signString).digest('hex')
return {
Action: action,
AppId: CONFIG.ZEGO_APP_ID,
SignatureNonce: nonce,
SignatureVersion: '2.0',
Timestamp: timestamp,
Signature: signature
}
}
async function makeZegoRequest(
action: string,
body: object = {},
apiType: 'aiagent' | 'digitalhuman' = 'aiagent'
) {
const queryParams = generateZegoSignature(action)
const queryString = Object.entries(queryParams)
.map(([k, v]) => `${k}=${encodeURIComponent(String(v))}`)
.join('&')
const baseUrl =
apiType === 'digitalhuman'
? CONFIG.ZEGO_DIGITAL_HUMAN_API_BASE_URL
: CONFIG.ZEGO_AIAGENT_API_BASE_URL
const url = `${baseUrl}?${queryString}`
const response = await axios.post(url, body, {
headers: { 'Content-Type': 'application/json' },
timeout: 30000
})
return response.data
}
You will reuse makeZegoRequest for every Agent and Digital Human operation.
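Before wiring the signature into endpoints, you can sanity-check the concatenation order (AppId + SignatureNonce + ServerSecret + Timestamp) in isolation; the inputs below are made-up test values:

```typescript
import { createHash } from 'crypto'

// Standalone reimplementation of the signature string, useful for comparing
// against an implementation in another language. Inputs are fake test values.
function md5Signature(appId: string, nonce: string, secret: string, timestamp: number): string {
  return createHash('md5')
    .update(appId + nonce + secret + String(timestamp))
    .digest('hex')
}

const sig = md5Signature('1234567890', 'a1b2c3d4e5f60718', 'not_a_real_secret', 1700000000)
console.log(sig.length) // 32 — MD5 always yields a 32-character lowercase hex digest
```

Because the signature is deterministic for fixed inputs, a check like this is an easy way to debug "signature invalid" responses caused by a wrong concatenation order.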
2.2 Defining the Interview Agent (LLM, TTS, ASR)
Next, define a reusable interview agent with a focused system prompt and streaming preferences:
// server/src/server.ts
let REGISTERED_AGENT_ID: string | null = null
const AGENT_CONFIG = {
LLM: {
Url: 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions',
ApiKey: process.env.DASHSCOPE_API_KEY || 'zego_test',
Model: 'qwen-plus',
SystemPrompt: `
You are a professional job interviewer.
INTERVIEW PHASES:
1. Introduction: brief greeting and self-introduction question.
2. Technical: ask about skills, projects, and problem-solving.
3. Behavioral: explore teamwork, conflict, and challenges.
4. Closing: invite questions and wrap up politely.
RULES:
- Ask ONE clear question at a time.
- Keep questions under two sentences.
- Acknowledge answers briefly before moving on.
- Conclude with: "Thank you for your time today. This concludes our interview."
`,
Temperature: 0.7,
TopP: 0.9,
Params: { max_tokens: 400 }
},
TTS: {
Vendor: 'ByteDance',
Params: {
app: {
appid: 'zego_test',
token: 'zego_test',
cluster: 'volcano_tts'
},
speed_ratio: 1,
volume_ratio: 1,
pitch_ratio: 1,
audio: { rate: 24000 }
},
FilterText: [
{ BeginCharacters: '(', EndCharacters: ')' },
{ BeginCharacters: '[', EndCharacters: ']' }
],
TerminatorText: '#'
},
ASR: {
Vendor: 'Tencent',
Params: {
engine_model_type: '16k_en',
hotword_list: 'interview|10,experience|8,project|8,team|8,challenge|8,skills|8'
},
VADSilenceSegmentation: 1500,
PauseInterval: 2000
}
}
async function registerAgent(): Promise<string> {
if (REGISTERED_AGENT_ID) return REGISTERED_AGENT_ID
const agentId = `interview_agent_${Date.now()}`
const payload = { AgentId: agentId, Name: 'AI Interview Assistant', ...AGENT_CONFIG }
const result = await makeZegoRequest('RegisterAgent', payload)
if (result.Code !== 0) {
throw new Error(`RegisterAgent failed: ${result.Code} ${result.Message}`)
}
REGISTERED_AGENT_ID = agentId
return agentId
}
The agent is registered only once per server process and reused across sessions.
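One caveat: registerAgent checks REGISTERED_AGENT_ID before awaiting, so two concurrent first requests could both call RegisterAgent. If that matters in your deployment, memoize the in-flight promise instead; a minimal sketch, where the register callback stands in for the real API call:

```typescript
// Memoize the in-flight registration promise so concurrent callers share a
// single API call; reset on failure so a later request can retry.
let agentPromise: Promise<string> | null = null

function registerAgentOnce(register: () => Promise<string>): Promise<string> {
  if (!agentPromise) {
    agentPromise = register().catch(err => {
      agentPromise = null // allow retry after a failed registration
      throw err
    })
  }
  return agentPromise
}
```

Even if two requests race, both await the same promise, so RegisterAgent is hit once per process.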
2.3 Voice-Only Agent Session and Token Endpoint
Even with a digital human, a basic voice agent and token endpoint are useful and share the same patterns.
// server/src/server.ts
import express from 'express'
import cors from 'cors'
import { createRequire } from 'module'
const require = createRequire(import.meta.url)
const { generateToken04 } = require('../zego-token.cjs')
const app = express()
app.use(express.json())
app.use(cors())
function sanitizeRTCId(id: string) {
const s = String(id || '').replace(/[^A-Za-z0-9_.-]/g, '')
return s || `room_${Date.now().toString(36)}`
}
app.post('/api/start', async (req, res) => {
const { room_id, user_id, user_stream_id } = req.body
if (!room_id || !user_id) {
res.status(400).json({ error: 'room_id and user_id required' })
return
}
const agentId = await registerAgent()
const roomId = sanitizeRTCId(room_id)
const userStreamId = (user_stream_id || `${user_id}_stream`)
.toLowerCase()
.replace(/[^a-z0-9_.-]/g, '')
.slice(0, 128)
const instanceConfig = {
AgentId: agentId,
UserId: String(user_id).slice(0, 32),
RTC: {
RoomId: roomId,
AgentUserId: `ai_${roomId}`,
AgentStreamId: `ai_stream_${roomId}`,
UserStreamId: userStreamId
},
MessageHistory: {
SyncMode: 1,
Messages: [],
WindowSize: 10
},
AdvancedConfig: { InterruptMode: 0 }
}
const result = await makeZegoRequest('CreateAgentInstance', instanceConfig, 'aiagent')
if (result.Code !== 0) {
res.status(400).json({ error: result.Message || 'Failed to create instance' })
return
}
res.json({
success: true,
agentInstanceId: result.Data.AgentInstanceId,
agentUserId: instanceConfig.RTC.AgentUserId,
agentStreamId: instanceConfig.RTC.AgentStreamId,
userStreamId
})
})
app.get('/api/token', (req, res) => {
const userId = ((req.query.user_id as string) || '').trim()
const roomId = ((req.query.room_id as string) || '').trim()
if (!userId) {
res.status(400).json({ error: 'user_id required' })
return
}
const appId = Number(CONFIG.ZEGO_APP_ID)
const secret = CONFIG.ZEGO_SERVER_SECRET
const payload = {
room_id: roomId,
privilege: { 1: 1, 2: 1, 3: 1 },
stream_id_list: null
}
const token = generateToken04(appId, userId, secret, 3600, JSON.stringify(payload))
res.json({ token })
})
The frontend uses /api/token to log in with ZegoExpressEngine.
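On the client side, the matching call is a thin wrapper around GET /api/token. This sketch stands in for the repository's digitalHumanAPI.getToken; the fetchFn parameter is an addition here purely so the helper can be exercised without a live server:

```typescript
// Hypothetical client-side wrapper for GET /api/token. The query parameter
// names (user_id, room_id) match the Express handler above.
type FetchLike = (url: string) => Promise<{ json(): Promise<any> }>

async function getToken(
  baseUrl: string,
  userId: string,
  roomId: string,
  fetchFn: FetchLike = fetch as unknown as FetchLike
): Promise<string> {
  const qs = new URLSearchParams({ user_id: userId, room_id: roomId })
  const res = await fetchFn(`${baseUrl}/api/token?${qs}`)
  const data = await res.json()
  if (typeof data.token !== 'string') throw new Error('No token in /api/token response')
  return data.token
}
```

Throwing when the token is missing keeps the failure close to its cause rather than surfacing later as a cryptic loginRoom error.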
2.4 Starting a Digital Human Interview Session
The digital human endpoint creates a unified agent instance that includes both voice and avatar configuration:
// server/src/server.ts
app.post('/api/start-digital-human', async (req, res) => {
try {
const { room_id, user_id, user_stream_id, digital_human_id } = req.body
if (!room_id || !user_id) {
res.status(400).json({ error: 'room_id and user_id required' })
return
}
const roomIdRTC = sanitizeRTCId(room_id)
const userStreamId = (user_stream_id || `${user_id}_stream`)
.toLowerCase()
.replace(/[^a-z0-9_.-]/g, '')
.slice(0, 128)
const agentId = await registerAgent()
const normalizedUserId = String(user_id).replace(/[^a-zA-Z0-9_-]/g, '').slice(0, 32)
const digitalHumanId = digital_human_id || 'your_digital_human_id'
const agentUserId = `agt_${roomIdRTC}`.slice(0, 32)
const agentStreamId = `agt_stream_${roomIdRTC}`.slice(0, 128)
const payload = {
AgentId: agentId,
UserId: normalizedUserId,
RTC: {
RoomId: roomIdRTC,
AgentUserId: agentUserId,
AgentStreamId: agentStreamId,
UserStreamId: userStreamId
},
DigitalHuman: {
DigitalHumanId: digitalHumanId,
ConfigId: 'web',
EncodeCode: 'H264'
},
MessageHistory: {
SyncMode: 1,
Messages: [],
WindowSize: 10
},
AdvancedConfig: { InterruptMode: 0 }
}
const result = await makeZegoRequest('CreateDigitalHumanAgentInstance', payload, 'aiagent')
if (result.Code !== 0) {
res.status(400).json({
error: result.Message || 'Failed to create digital human agent instance',
code: result.Code,
requestId: result.RequestId
})
return
}
const digitalHumanConfig = result.Data.DigitalHumanConfig
res.json({
success: true,
agentInstanceId: result.Data.AgentInstanceId,
agentStreamId,
roomId: roomIdRTC,
digitalHumanId,
digitalHumanConfig,
unifiedDigitalHuman: true
})
} catch (error: any) {
res.status(500).json({ error: error.message || 'Internal error' })
}
})
The response includes:
- agentInstanceId – for text messages and teardown.
- agentStreamId – the agent’s audio stream.
- roomId – the room the browser should join.
- digitalHumanConfig – avatar configuration for the Digital Human SDK.
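In TypeScript terms the payload looks roughly like this; the interface is inferred from the res.json call above, not an official SDK type:

```typescript
// Approximate shape of the /api/start-digital-human response, mirroring the
// res.json(...) call in the handler. digitalHumanConfig is treated as opaque
// and forwarded to the Digital Human SDK as-is.
interface StartDigitalHumanResponse {
  success: boolean
  agentInstanceId: string
  agentStreamId: string
  roomId: string
  digitalHumanId: string
  digitalHumanConfig: unknown
  unifiedDigitalHuman: boolean
}

// Example payload of the kind an integration test might assert against:
const example: StartDigitalHumanResponse = {
  success: true,
  agentInstanceId: 'inst_abc123',
  agentStreamId: 'agt_stream_interview_x1',
  roomId: 'interview_x1',
  digitalHumanId: 'your_digital_human_id',
  digitalHumanConfig: {},
  unifiedDigitalHuman: true
}
```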
2.5 Stopping the Session and Cleaning Up
When the candidate ends the interview, you must stop both the agent instance and any digital human task:
// server/src/server.ts
app.post('/api/stop', async (req, res) => {
const { agent_instance_id } = req.body
if (!agent_instance_id) {
res.status(400).json({ error: 'agent_instance_id required' })
return
}
// Optional: collect metrics before teardown
try {
const status = await makeZegoRequest('QueryAgentInstanceStatus', {
AgentInstanceId: agent_instance_id
})
console.log('Interview performance:', {
llmFirstTokenLatency: status.Data?.LLMFirstTokenLatency,
ttsFirstAudioLatency: status.Data?.TTSFirstAudioLatency
})
} catch {
console.warn('Could not fetch metrics')
}
const result = await makeZegoRequest('DeleteAgentInstance', {
AgentInstanceId: agent_instance_id
})
if (result.Code !== 0) {
res.status(400).json({ error: result.Message || 'Failed to delete instance' })
return
}
res.json({ success: true })
})
You can also expose a /api/cleanup endpoint using QueryDigitalHumanStreamTasks to force-stop any orphaned video streams.
2.6 Optional: Listing Available Digital Humans
To let your frontend choose between different avatars, add:
// server/src/server.ts
app.get('/api/digital-humans', async (_req, res) => {
const result = await makeZegoRequest('GetDigitalHumanList', {}, 'digitalhuman')
if (result.Code !== 0) {
res.status(400).json({
error: result.Message || 'Failed to query digital humans',
code: result.Code,
requestId: result.RequestId
})
return
}
res.json({
success: true,
digitalHumans: result.Data?.List || []
})
})
The client can present this as a simple avatar selector before starting the interview.
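A small pure helper can shape that list into selector options. The DigitalHumanId and Name field names are assumptions about the List items, so verify them against the actual API response:

```typescript
// Map the /api/digital-humans payload into UI selector options, falling back
// to the id when an avatar has no display name.
interface DigitalHumanInfo {
  DigitalHumanId: string
  Name?: string
}

function toAvatarOptions(list: DigitalHumanInfo[]): { value: string; label: string }[] {
  return list.map(dh => ({ value: dh.DigitalHumanId, label: dh.Name || dh.DigitalHumanId }))
}

console.log(toAvatarOptions([{ DigitalHumanId: 'dh_01', Name: 'Ava' }, { DigitalHumanId: 'dh_02' }]))
// [ { value: 'dh_01', label: 'Ava' }, { value: 'dh_02', label: 'dh_02' } ]
```

The selected value is then passed as digital_human_id to /api/start-digital-human.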
3. WebRTC Integration: ZegoExpressEngine Wrapper
On the frontend, all WebRTC logic lives inside ZegoService (client/src/services/zego.ts). It:
- Manages a single ZegoExpressEngine instance.
- Joins/leaves rooms.
- Publishes the candidate’s mic.
- Plays remote audio and digital human video streams.
- Exposes callbacks for ASR/LLM room messages and player state.
3.1 Initializing ZegoExpressEngine
// client/src/services/zego.ts
import { ZegoExpressEngine } from 'zego-express-engine-webrtc'
import { VoiceChanger } from 'zego-express-engine-webrtc/voice-changer'
import { config } from '../config'
import { digitalHumanAPI } from './digitalHumanAPI'
export class ZegoService {
private static instance: ZegoService
private zg: ZegoExpressEngine | null = null
private isInitialized = false
private currentRoomId: string | null = null
private currentUserId: string | null = null
private localStream: MediaStream | null = null
private audioElement: HTMLAudioElement | null = null
// ... other fields
static getInstance(): ZegoService {
if (!ZegoService.instance) ZegoService.instance = new ZegoService()
return ZegoService.instance
}
async initialize(): Promise<void> {
if (this.isInitialized) return
try {
try { ZegoExpressEngine.use(VoiceChanger) } catch {}
this.zg = new ZegoExpressEngine(
parseInt(config.ZEGO_APP_ID),
config.ZEGO_SERVER,
{ scenario: 7 } // digital human / AI scenario
)
try {
const rtc = await this.zg.checkSystemRequirements('webRTC')
const mic = await this.zg.checkSystemRequirements('microphone')
if (!rtc?.result) throw new Error('WebRTC not supported')
if (!mic?.result) console.warn('Microphone permission not granted yet')
} catch {}
this.setupEventListeners()
this.setupMediaElements()
this.isInitialized = true
} catch (error) {
console.error('ZEGO initialization failed:', error)
throw error
}
}
private setupMediaElements() {
this.audioElement = document.getElementById('ai-audio-output') as HTMLAudioElement
if (!this.audioElement) {
this.audioElement = document.createElement('audio')
this.audioElement.id = 'ai-audio-output'
this.audioElement.autoplay = true
this.audioElement.style.display = 'none'
document.body.appendChild(this.audioElement)
}
}
// ...
}
This ensures there is only one engine instance per browser tab.
3.2 Joining the Room and Publishing the Mic
// client/src/services/zego.ts
async joinRoom(roomId: string, userId: string): Promise<boolean> {
if (!this.zg) return false
if (this.currentRoomId === roomId && this.currentUserId === userId) return true
try {
if (this.currentRoomId) await this.leaveRoom()
this.currentRoomId = roomId
this.currentUserId = userId
const { token } = await digitalHumanAPI.getToken(userId, roomId)
await this.zg.loginRoom(roomId, token, {
userID: userId,
userName: userId
})
// Enable room message callbacks (ASR / LLM events)
this.zg.callExperimentalAPI({
method: 'onRecvRoomChannelMessage',
params: {}
})
const localStream = await this.zg.createStream({
camera: { video: false, audio: true }
})
this.localStream = localStream
const streamId = `${userId}_stream`
await this.zg.startPublishingStream(streamId, localStream, {
enableAutoSwitchVideoCodec: true
})
return true
} catch (error) {
console.error('Failed to join room:', error)
this.currentRoomId = null
this.currentUserId = null
this.localStream = null
return false
}
}
async enableMicrophone(enabled: boolean): Promise<boolean> {
if (!this.localStream) return false
const track = this.localStream.getAudioTracks?.()[0]
if (track) {
track.enabled = enabled
return true
}
return false
}
This is all the logic your React components need to toggle recording.
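To keep the UI's recording flag from drifting out of sync with the actual audio track, a component can fold the boolean result of enableMicrophone back into its state; a minimal sketch, with apply standing in for zegoService.enableMicrophone:

```typescript
// Keep the UI recording flag in sync with the real track state. The apply
// callback stands in for zegoService.enableMicrophone, which returns false
// when there is no local audio track yet.
interface MicState { recording: boolean }

function toggleRecording(state: MicState, apply: (on: boolean) => boolean): MicState {
  const next = !state.recording
  return apply(next) ? { recording: next } : state
}

console.log(toggleRecording({ recording: false }, () => true)) // { recording: true }
```

If the stream is not ready yet, the state is left unchanged, so the mic button never shows "recording" while nothing is actually captured.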
3.3 Handling Streams and Attaching the Digital Human Video
When ZEGOCLOUD adds new streams to the room, you decide which ones to play:
// client/src/services/zego.ts
private remoteViews = new Map<string, any>()
private playingStreamIds = new Set<string>()
private messageCallback: ((message: any) => void) | null = null
private setupEventListeners(): void {
if (!this.zg) return
this.zg.on('recvExperimentalAPI', (result: any) => {
const { method, content } = result
if (method === 'onRecvRoomChannelMessage') {
try {
const msg = JSON.parse(content.msgContent)
this.handleRoomMessage(msg)
} catch (e) {
console.error('Parse room message failed:', e)
}
}
})
this.zg.on('roomStreamUpdate', async (_roomID, updateType, streamList) => {
if (updateType === 'ADD') {
for (const stream of streamList) {
const streamId = stream.streamID
const userStreamId = this.currentUserId ? `${this.currentUserId}_stream` : null
if (userStreamId && streamId === userStreamId) continue
if (this.playingStreamIds.has(streamId)) continue
this.playingStreamIds.add(streamId)
try {
const mediaStream = await this.zg!.startPlayingStream(streamId)
if (!mediaStream) continue
const remoteView = await (this.zg as any).createRemoteStreamView(mediaStream)
if (!remoteView) continue
// Audio for agent / digital human is always enabled here
Promise.resolve(remoteView.playAudio({ enableAutoplayDialog: true })).catch(() => {})
this.remoteViews.set(streamId, remoteView)
} catch (error) {
console.error('Failed to start remote stream:', streamId, error)
}
}
}
if (updateType === 'DELETE') {
for (const stream of streamList) {
const rv = this.remoteViews.get(stream.streamID)
if (rv?.destroy) rv.destroy()
this.remoteViews.delete(stream.streamID)
this.playingStreamIds.delete(stream.streamID)
}
}
})
}
private handleRoomMessage(message: any): void {
if (this.messageCallback) {
this.messageCallback(message)
}
}
onRoomMessage(callback: (message: any) => void): void {
this.messageCallback = callback
}
To attach a specific digital human video stream into the UI, expose:
// client/src/services/zego.ts (core idea)
private dhVideoStreamId: string | null = null
setDigitalHumanStream(streamId: string | null): void {
this.dhVideoStreamId = streamId
if (!streamId) return
void this.startDigitalHumanPlayback(streamId)
}
private async startDigitalHumanPlayback(streamId: string): Promise<void> {
if (!this.zg) return
const mediaStream = await this.zg.startPlayingStream(streamId)
if (!mediaStream) return
const remoteView = await (this.zg as any).createRemoteStreamView(mediaStream)
if (!remoteView) return
// Attach audio
Promise.resolve(remoteView.playAudio({ enableAutoplayDialog: true })).catch(() => {})
// Attach video element into #remoteSteamView container
const attach = async () => {
const container = document.getElementById('remoteSteamView')
if (!container) {
setTimeout(attach, 200)
return
}
const result = await Promise.resolve(remoteView.playVideo(container, {
enableAutoplayDialog: false
}))
setTimeout(() => {
const videoEl = container.querySelector('video') as HTMLVideoElement | null
if (!videoEl) return
if (!videoEl.srcObject) {
videoEl.srcObject = mediaStream
videoEl.load()
void videoEl.play()
}
document.dispatchEvent(
new CustomEvent('zego-digital-human-video-state', { detail: { ready: true } })
)
}, 150)
}
attach()
}
The only requirement from React is to provide an element with id remoteSteamView; the service takes care of attaching and repairing the <video> element.
4. React Interview Experience
With media and backend in place, the rest is React:
- useInterview – orchestrates the session.
- DigitalHuman – displays the avatar and status.
- ChatPanel – shows the transcript.
- VoiceMessageInput – allows typed or spoken answers.
- App – a small state machine for welcome → interview → summary.
4.1 Interview State Hook
useInterview is the central hook that ties together ZegoService and the digital human APIs.
// client/src/hooks/useInterview.ts
import { useCallback, useRef, useEffect, useReducer } from 'react'
import { ZegoService } from '../services/zego'
import { digitalHumanAPI } from '../services/digitalHumanAPI'
import type { Message, ChatSession, ZegoRoomMessage } from '../types'
interface InterviewState {
messages: Message[]
session: ChatSession | null
isLoading: boolean
isConnected: boolean
isRecording: boolean
currentTranscript: string
agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
error: string | null
questionsAsked: number
isInterviewComplete: boolean
startTime: number | null
}
// reducer implementation omitted for brevity (check project repository)
export const useInterview = () => {
const [state, dispatch] = useReducer(interviewReducer, {
messages: [],
session: null,
isLoading: false,
isConnected: false,
isRecording: false,
currentTranscript: '',
agentStatus: 'idle',
error: null,
questionsAsked: 0,
isInterviewComplete: false,
startTime: null
})
const zegoService = useRef(ZegoService.getInstance())
const processedMessageIds = useRef(new Set<string>())
const addMessageSafely = useCallback((message: Message) => {
if (processedMessageIds.current.has(message.id)) return
processedMessageIds.current.add(message.id)
dispatch({ type: 'ADD_MESSAGE', payload: message })
if (
message.sender === 'ai' &&
message.content.toLowerCase().includes('this concludes our interview')
) {
setTimeout(() => {
dispatch({ type: 'SET_INTERVIEW_COMPLETE', payload: true })
}, 2000)
}
}, [])
const setupMessageHandlers = useCallback(() => {
const handleRoomMessage = (data: ZegoRoomMessage) => {
const { Cmd, Data: msgData } = data
if (Cmd === 3) {
// ASR events (candidate speech)
const { Text: transcript, EndFlag, MessageId } = msgData
if (!transcript?.trim()) return
dispatch({ type: 'SET_TRANSCRIPT', payload: transcript })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })
if (EndFlag) {
const message: Message = {
id: MessageId || `voice_${Date.now()}`,
content: transcript.trim(),
sender: 'user',
timestamp: Date.now(),
type: 'voice',
transcript: transcript.trim()
}
addMessageSafely(message)
dispatch({ type: 'SET_TRANSCRIPT', payload: '' })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'thinking' })
dispatch({ type: 'INCREMENT_QUESTIONS_ASKED' })
}
}
if (Cmd === 4) {
// LLM events (AI interviewer responses)
const { Text: content, MessageId, EndFlag } = msgData
if (!content || !MessageId) return
dispatch({ type: 'SET_AGENT_STATUS', payload: 'speaking' })
if (EndFlag) {
const final: Message = {
id: `ai_${Date.now()}`,
content,
sender: 'ai',
timestamp: Date.now(),
type: 'text'
}
addMessageSafely(final)
setTimeout(() => {
dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })
}, 8000)
}
}
}
zegoService.current.onRoomMessage(handleRoomMessage)
}, [addMessageSafely])
const startInterview = useCallback(async () => {
if (state.isLoading || state.isConnected) return false
dispatch({ type: 'SET_LOADING', payload: true })
dispatch({ type: 'SET_START_TIME', payload: Date.now() })
dispatch({ type: 'SET_ERROR', payload: null })
try {
const roomId = `interview_${Date.now().toString(36)}`
const userId = `candidate_${Date.now().toString(36)}`
await zegoService.current.initialize()
const result = await digitalHumanAPI.startInterview(roomId, userId)
const joinedRoomId = result.roomId || roomId
const joined = await zegoService.current.joinRoom(joinedRoomId, userId)
if (!joined) throw new Error('Failed to join ZEGO room')
if (result.agentStreamId) {
zegoService.current.setAgentAudioStream(result.agentStreamId)
}
if (result.digitalHumanVideoStreamId) {
zegoService.current.setDigitalHumanStream(result.digitalHumanVideoStreamId)
}
const session: ChatSession = {
roomId: joinedRoomId,
userId,
agentInstanceId: result.agentInstanceId,
agentStreamId: result.agentStreamId,
digitalHumanTaskId: result.digitalHumanTaskId,
digitalHumanVideoStreamId: result.digitalHumanVideoStreamId,
digitalHumanId: result.digitalHumanId,
isActive: true,
voiceSettings: {
isEnabled: false,
autoPlay: true,
speechRate: 1.0,
speechPitch: 1.0
}
}
dispatch({ type: 'SET_SESSION', payload: session })
dispatch({ type: 'SET_CONNECTED', payload: true })
setupMessageHandlers()
await digitalHumanAPI.sendMessage(
session.agentInstanceId!,
'Please start the interview with a short greeting and your first question.'
)
return true
} catch (error: any) {
dispatch({ type: 'SET_ERROR', payload: error.message || 'Failed to start interview' })
return false
} finally {
dispatch({ type: 'SET_LOADING', payload: false })
}
}, [state.isLoading, state.isConnected, setupMessageHandlers])
// sendTextMessage, toggleVoiceRecording, endInterview, cleanup omitted...
return {
...state,
startInterview,
// sendTextMessage,
// toggleVoiceRecording,
// toggleVoiceSettings,
// endInterview
}
}
The hook hides all ZEGOCLOUD details from components.
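Because the Cmd === 3 handling is pure apart from the dispatches, its semantics are easy to test in isolation. A hypothetical reducer with the same behavior (partials update the live transcript, EndFlag commits the utterance):

```typescript
// Fold streaming ASR events into { partial, finals }, mirroring the
// Cmd === 3 branch above: partials update the live transcript, and an
// EndFlag event commits the utterance and clears the partial.
interface AsrEvent { Text: string; EndFlag?: boolean }
interface TranscriptState { partial: string; finals: string[] }

function applyAsrEvent(state: TranscriptState, ev: AsrEvent): TranscriptState {
  const text = ev.Text?.trim()
  if (!text) return state
  if (ev.EndFlag) return { partial: '', finals: [...state.finals, text] }
  return { ...state, partial: text }
}

let s: TranscriptState = { partial: '', finals: [] }
s = applyAsrEvent(s, { Text: 'I worked' })
s = applyAsrEvent(s, { Text: 'I worked on a React app', EndFlag: true })
console.log(s) // { partial: '', finals: [ 'I worked on a React app' ] }
```

Extracting this kind of pure logic from the hook makes the event handling unit-testable without mocking ZEGOCLOUD at all.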
4.2 Digital Human Component
The DigitalHuman component hosts the video container and overlays connection status and current question.
// client/src/components/Interview/DigitalHuman.tsx
import { useEffect, useState } from 'react'
import { motion, AnimatePresence } from 'framer-motion'
import { useDigitalHuman } from '../../hooks/useDigitalHuman'
import { ZegoService } from '../../services/zego'
import { Volume2, VolumeX, Video, VideoOff, Circle } from 'lucide-react'
interface DigitalHumanProps {
isConnected: boolean
agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
currentQuestion?: string
}
export const DigitalHuman = ({ isConnected, agentStatus, currentQuestion }: DigitalHumanProps) => {
const { isVideoEnabled, isAudioEnabled, toggleVideo, toggleAudio } = useDigitalHuman()
const [videoReady, setVideoReady] = useState(false)
useEffect(() => {
ZegoService.getInstance().ensureVideoContainer()
}, [isConnected])
useEffect(() => {
const handler = (event: Event) => {
const { detail } = event as CustomEvent<{ ready: boolean }>
setVideoReady(!!detail?.ready)
}
document.addEventListener('zego-digital-human-video-state', handler)
return () => document.removeEventListener('zego-digital-human-video-state', handler)
}, [])
const status = {
idle: { text: 'Ready', color: 'bg-slate-400' },
listening: { text: 'Listening', color: 'bg-emerald-500' },
thinking: { text: 'Processing', color: 'bg-blue-500' },
speaking: { text: 'Speaking', color: 'bg-violet-500' }
}[agentStatus]
return (
<div className="relative w-full h-full bg-slate-900 flex items-center justify-center overflow-hidden">
{/* Digital human video container */}
<div
id="remoteSteamView"
className={`absolute inset-0 w-full h-full transition-opacity duration-300 ${
videoReady && isVideoEnabled ? 'opacity-100' : 'opacity-0'
}`}
/>
<style>{`
#remoteSteamView {
display: flex;
align-items: center;
justify-content: center;
}
#remoteSteamView > div {
width: 100%;
height: 100%;
}
#remoteSteamView video {
width: 100%;
height: 100%;
object-fit: cover;
}
`}</style>
{/* Status + controls */}
{isConnected && (
<motion.div
initial={{ opacity: 0, y: -20 }}
animate={{ opacity: 1, y: 0 }}
className="absolute top-6 left-6 right-6 flex items-center justify-between"
>
<div className="flex items-center space-x-3 bg-black/50 rounded-full px-4 py-2 border border-white/10">
<motion.div
className={`w-2.5 h-2.5 rounded-full ${status.color}`}
animate={{ scale: [1, 1.3, 1], opacity: [1, 0.7, 1] }}
transition={{ repeat: Infinity, duration: 2 }}
/>
<span className="text-white text-sm font-medium">
{status.text}
</span>
</div>
<div className="flex items-center space-x-2">
<button
onClick={toggleVideo}
className="p-2.5 rounded-full bg-black/50 text-white"
title={isVideoEnabled ? 'Disable video' : 'Enable video'}
>
{isVideoEnabled ? <Video className="w-4 h-4" /> : <VideoOff className="w-4 h-4" />}
</button>
<button
onClick={toggleAudio}
className="p-2.5 rounded-full bg-black/50 text-white"
title={isAudioEnabled ? 'Mute audio' : 'Unmute audio'}
>
{isAudioEnabled ? <Volume2 className="w-4 h-4" /> : <VolumeX className="w-4 h-4" />}
</button>
</div>
</motion.div>
)}
{/* Optional: show current question overlay when agent is speaking */}
<AnimatePresence>
{currentQuestion && agentStatus === 'speaking' && (
<motion.div
initial={{ opacity: 0, y: 30 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -20 }}
className="absolute bottom-0 left-0 right-0 p-8 bg-gradient-to-t from-black/80 via-black/50 to-transparent"
>
<div className="bg-white/95 rounded-2xl p-6 shadow-2xl flex items-start space-x-3">
<Circle className="w-5 h-5 text-violet-500 mt-1" />
<p className="text-slate-900 font-medium text-lg leading-relaxed">
{currentQuestion}
</p>
</div>
</motion.div>
)}
</AnimatePresence>
</div>
)
}
This encapsulates all avatar-related UI concerns.
4.3 Chat Panel and Voice Input
The chat panel shows the full message history:
// client/src/components/Interview/ChatPanel.tsx (simplified)
import { useEffect, useRef } from 'react'
import { motion, AnimatePresence } from 'framer-motion'
import type { Message } from '../../types'
import { User, Bot } from 'lucide-react'
interface ChatPanelProps {
messages: Message[]
isTyping: boolean
}
export const ChatPanel = ({ messages, isTyping }: ChatPanelProps) => {
const endRef = useRef<HTMLDivElement>(null)
useEffect(() => {
endRef.current?.scrollIntoView({ behavior: 'smooth' })
}, [messages, isTyping])
return (
<div className="flex flex-col flex-1 bg-slate-900/50">
<div className="flex-1 overflow-y-auto px-6 py-4 space-y-4">
<AnimatePresence initial={false}>
{messages.map((m) => (
<motion.div
key={m.id}
initial={{ opacity: 0, y: 10 }}
animate={{ opacity: 1, y: 0 }}
className={`flex gap-3 ${m.sender === 'user' ? 'justify-end' : 'justify-start'}`}
>
{m.sender === 'ai' && (
<div className="w-8 h-8 rounded-full bg-violet-600 flex items-center justify-center">
<Bot className="w-4 h-4 text-white" />
</div>
)}
<div
className={`max-w-[75%] rounded-2xl px-4 py-3 ${
m.sender === 'user'
? 'bg-blue-600 text-white'
: 'bg-slate-800 text-slate-100'
}`}
>
<p className="text-sm whitespace-pre-wrap">{m.content}</p>
</div>
{m.sender === 'user' && (
<div className="w-8 h-8 rounded-full bg-blue-600 flex items-center justify-center">
<User className="w-4 h-4 text-white" />
</div>
)}
</motion.div>
))}
</AnimatePresence>
{isTyping && (
<motion.div
initial={{ opacity: 0, y: 10 }}
animate={{ opacity: 1, y: 0 }}
className="flex gap-3"
>
<div className="w-8 h-8 rounded-full bg-violet-600 flex items-center justify-center">
<Bot className="w-4 h-4 text-white" />
</div>
<div className="bg-slate-800 rounded-2xl px-4 py-3">
<div className="flex gap-1">
<span className="w-2 h-2 bg-slate-400 rounded-full animate-bounce" />
<span
className="w-2 h-2 bg-slate-400 rounded-full animate-bounce"
style={{ animationDelay: '150ms' }}
/>
<span
className="w-2 h-2 bg-slate-400 rounded-full animate-bounce"
style={{ animationDelay: '300ms' }}
/>
</div>
</div>
</motion.div>
)}
<div ref={endRef} />
</div>
</div>
)
}
The voice input lets candidates type or speak their answers and reflects the interviewer’s status. The implementation is similar to a standard chat input with an extra mic toggle and a transcript area.
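One detail worth getting right is what the send button actually submits when both a typed draft and a live transcript exist. A minimal sketch of that decision (the names `VoiceState` and `resolveOutgoingText` are illustrative, not from the repo):

```typescript
// Illustrative helper: decide what to send when the user presses "Send".
// Typed text takes priority over the interim voice transcript.
export interface VoiceState {
  isRecording: boolean
  transcript: string // interim ASR text while the mic is on
}

export function resolveOutgoingText(typed: string, voice: VoiceState): string | null {
  const typedTrimmed = typed.trim()
  if (typedTrimmed) return typedTrimmed // explicit typed text wins
  const spoken = voice.transcript.trim()
  if (voice.isRecording && spoken) return spoken // fall back to the live transcript
  return null // nothing to send
}
```

Keeping this as a pure function makes the input component trivial to unit-test independently of the ASR pipeline.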
4.4 Interview Room Layout
Finally, the InterviewRoom component ties everything together and returns a summary when the interview ends:
// client/src/components/Interview/InterviewRoom.tsx
import { useEffect, useState, useCallback, useMemo } from 'react'
import { motion } from 'framer-motion'
import { DigitalHuman } from './DigitalHuman'
import { ChatPanel } from './ChatPanel'
import { Button } from '../UI/Button'
import { useInterview } from '../../hooks/useInterview'
import { PhoneOff, Clock } from 'lucide-react'
import type { Message } from '../../types'
export interface InterviewSummary {
duration: string
questionsCount: number
responsesCount: number
messages: Message[]
}
interface InterviewRoomProps {
onComplete: (data: InterviewSummary) => void
}
export const InterviewRoom = ({ onComplete }: InterviewRoomProps) => {
const [currentTime, setCurrentTime] = useState(Date.now())
const {
messages,
isLoading,
isConnected,
isRecording,
error,
agentStatus,
questionsAsked,
isInterviewComplete,
startTime,
startInterview,
endInterview
} = useInterview()
useEffect(() => {
void startInterview()
}, [])
useEffect(() => {
if (!isConnected) return
const id = setInterval(() => setCurrentTime(Date.now()), 1000)
return () => clearInterval(id)
}, [isConnected])
useEffect(() => {
if (!isInterviewComplete || !startTime) return
const secs = Math.floor((Date.now() - startTime) / 1000)
const data: InterviewSummary = {
duration: `${Math.floor(secs / 60)}:${(secs % 60).toString().padStart(2, '0')}`,
questionsCount: messages.filter(m => m.sender === 'ai').length,
responsesCount: messages.filter(m => m.sender === 'user').length,
messages
}
onComplete(data)
}, [isInterviewComplete, startTime, messages, onComplete])
const formatDuration = useCallback((now: number) => {
if (!startTime) return '0:00'
const secs = Math.floor((now - startTime) / 1000)
const mins = Math.floor(secs / 60)
return `${mins}:${(secs % 60).toString().padStart(2, '0')}`
}, [startTime])
const statusDisplay = useMemo(() => {
if (isInterviewComplete) return { text: 'Interview completed', color: 'text-emerald-500' }
if (isLoading && !isConnected) return { text: 'Connecting...', color: 'text-blue-500' }
if (!isConnected) return { text: 'Connecting...', color: 'text-blue-500' }
const map = {
listening: { text: 'Listening...', color: 'text-emerald-500' },
thinking: { text: 'Thinking...', color: 'text-blue-500' },
speaking: { text: 'Speaking...', color: 'text-violet-500' },
idle: { text: 'Ready', color: 'text-slate-400' }
} as const
return map[agentStatus]
}, [isConnected, isInterviewComplete, isLoading, agentStatus])
return (
<div className="h-screen flex flex-col bg-slate-950">
{/* Header */}
<motion.header
initial={{ y: -20, opacity: 0 }}
animate={{ y: 0, opacity: 1 }}
className="bg-slate-900/80 backdrop-blur-xl border-b border-slate-800"
>
<div className="px-6 py-4 flex items-center justify-between">
<div>
<h1 className="text-lg font-bold text-white">AI Interview</h1>
<p className={`text-sm font-medium ${statusDisplay.color}`}>
{statusDisplay.text}
</p>
{error && (
<p className="text-xs text-red-400 mt-1">
{error}
</p>
)}
</div>
{isConnected && (
<div className="flex items-center space-x-4">
<div className="flex items-center space-x-2 text-sm text-slate-400">
<Clock className="w-4 h-4" />
<span className="tabular-nums">{formatDuration(currentTime)}</span>
</div>
{isRecording && (
<div className="px-3 py-1 rounded-full border border-emerald-500/40 bg-emerald-500/10 flex items-center space-x-2">
<span className="w-2 h-2 rounded-full bg-emerald-400 animate-pulse" />
<span className="text-xs font-semibold text-emerald-300">Mic On / Listening</span>
</div>
)}
<div className="px-3 py-1 bg-blue-500/10 rounded-full">
<span className="text-xs font-semibold text-blue-400">
Q{questionsAsked}
</span>
</div>
<Button
onClick={endInterview}
variant="secondary"
size="sm"
disabled={isLoading}
className="bg-slate-800 hover:bg-red-500/10 text-slate-300 hover:text-red-400 border-slate-700"
>
<PhoneOff className="w-4 h-4 mr-2" />
End
</Button>
</div>
)}
</div>
</motion.header>
{/* Body */}
<div className="flex-1 flex flex-col lg:flex-row overflow-hidden">
<div className="w-full lg:w-1/2">
<DigitalHuman
isConnected={isConnected}
agentStatus={agentStatus}
currentQuestion=""
/>
</div>
<div className="w-full lg:w-1/2 border-t lg:border-t-0 lg:border-l border-slate-800">
<ChatPanel
messages={messages}
isTyping={agentStatus === 'thinking' || agentStatus === 'speaking'}
/>
</div>
</div>
</div>
)
}
Your top-level App component only needs to choose between the welcome screen, interview screen, and summary view.
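That screen switching can be modeled as a tiny state machine. A sketch under assumed names (`Screen`, `AppEvent`, and `nextScreen` are illustrative, not from the repo):

```typescript
// Illustrative screen state machine for the top-level App component.
export type Screen = 'welcome' | 'interview' | 'summary'
export type AppEvent = 'start' | 'complete' | 'restart'

// Returns the next screen for a given event; unknown events are ignored.
export function nextScreen(current: Screen, event: AppEvent): Screen {
  switch (current) {
    case 'welcome':
      return event === 'start' ? 'interview' : current
    case 'interview':
      return event === 'complete' ? 'summary' : current
    case 'summary':
      return event === 'restart' ? 'welcome' : current
  }
}
```

In the App component, the current `Screen` value would decide which of the three views renders, and `InterviewRoom`'s `onComplete` callback would fire the `'complete'` event.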
5. Frontend API Client
The React app talks to the backend through a small wrapper in client/src/services/digitalHumanAPI.ts. It hides raw URLs and response shapes.
// client/src/services/digitalHumanAPI.ts
import axios from 'axios'
import { config } from '../config'
const api = axios.create({
baseURL: config.API_BASE_URL,
timeout: 30000,
headers: { 'Content-Type': 'application/json' }
})
export const digitalHumanAPI = {
async startInterview(roomId: string, userId: string) {
const requestData = {
room_id: roomId,
user_id: userId,
user_stream_id: `${userId}_stream`,
// digital_human_id: optional override
}
const response = await api.post('/api/start-digital-human', requestData)
if (!response.data || !response.data.success) {
throw new Error(response.data?.error || 'Digital human interview start failed')
}
return {
agentInstanceId: response.data.agentInstanceId,
agentStreamId: response.data.agentStreamId,
digitalHumanTaskId: response.data.digitalHumanTaskId,
digitalHumanVideoStreamId: response.data.digitalHumanVideoStreamId,
digitalHumanConfig: response.data.digitalHumanConfig,
roomId: response.data.roomId || roomId,
digitalHumanId: response.data.digitalHumanId,
unifiedDigitalHuman: response.data.unifiedDigitalHuman
}
},
async stopInterview(agentInstanceId: string, digitalHumanTaskId?: string) {
if (!agentInstanceId) return
await api.post('/api/stop', { agent_instance_id: agentInstanceId })
if (digitalHumanTaskId) {
await api.post('/api/stop-digital-human', { task_id: digitalHumanTaskId })
}
},
async sendMessage(agentInstanceId: string, message: string) {
const trimmed = (message || '').trim()
if (!agentInstanceId || !trimmed) return
const response = await api.post('/api/send-message', {
agent_instance_id: agentInstanceId,
message: trimmed
})
if (!response.data?.success) {
throw new Error(response.data?.error || 'Message send failed')
}
},
async getToken(userId: string, roomId?: string) {
const params = new URLSearchParams({ user_id: userId })
if (roomId) params.append('room_id', roomId)
const response = await api.get(`/api/token?${params.toString()}`)
if (!response.data?.token) {
throw new Error('No token returned')
}
return { token: response.data.token }
},
async healthCheck() {
const response = await api.get('/health')
return response.data
}
}
This keeps network code out of hooks and components.
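Because agent creation involves several upstream services, the first call can occasionally time out. If you see that in practice, one option is a small retry wrapper around the API methods. This is an optional sketch (`withRetry` is not part of the repo):

```typescript
// Illustrative retry helper for transient API failures (not part of the repo).
// Runs the async operation up to `attempts` times, pausing between tries.
export async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  delayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs))
      }
    }
  }
  throw lastError
}

// Usage: const session = await withRetry(() => digitalHumanAPI.startInterview(roomId, userId))
```

Only retry idempotent or safely repeatable calls this way; blindly retrying `sendMessage` could duplicate a candidate's answer.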
6. Running and Testing Your Digital Human Interviewer
6.1 Backend
From the `server` directory:
npm install # if not already installed
npm run dev
Check that:
- `http://localhost:8080/health` returns `status: "healthy"`.
- No `ZEGO_APP_ID`/`ZEGO_SERVER_SECRET` errors are logged.
- Outbound calls to ZEGOCLOUD succeed (no signature errors).
6.2 Frontend
From the `client` directory:
npm install
npm run dev
Open http://localhost:5173 in a desktop browser. You should:
1. See a welcome screen describing the AI Interview Assistant.
2. Click “Start Interview”, which will:
   - Ask the backend to create a digital human agent instance.
   - Join the room via `ZegoExpressEngine`.
   - Attach the digital human video to the main panel.
3. Hear the AI interviewer greet you and ask an introductory question.
4. Answer by typing in the chat input, or by voice: press the mic button and speak.

At the end, when the interviewer says the closing phrase or the candidate clicks the End button, the app shows a simple summary with total duration, questions asked, and responses given.
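Those summary fields mirror what `InterviewRoom` computes in its completion effect. Factored out as a pure function, they are easy to sanity-check in isolation (`buildSummary` is an illustrative sketch, not code from the repo):

```typescript
// Illustrative pure version of the summary InterviewRoom assembles on completion.
interface Msg {
  sender: 'user' | 'ai'
  content: string
}

export function buildSummary(messages: Msg[], startTime: number, now: number) {
  const secs = Math.floor((now - startTime) / 1000)
  return {
    // mm:ss string, matching the header's formatDuration output
    duration: `${Math.floor(secs / 60)}:${(secs % 60).toString().padStart(2, '0')}`,
    questionsCount: messages.filter((m) => m.sender === 'ai').length,
    responsesCount: messages.filter((m) => m.sender === 'user').length
  }
}
```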
Conclusion
You now have a complete digital human interview flow built on ZEGOCLOUD:
- The server manages ZEGOCLOUD authentication, agent registration, and the digital human lifecycle.
- The client handles WebRTC, streams, and ASR/LLM events.
- The React UI presents a polished, guided interview experience with a realistic avatar.
From here you can:
- Customize the LLM prompt for different interview types (engineering, sales, product).
- Use `/api/digital-humans` to let users choose from multiple avatars.
- Persist and analyze interview transcripts for scoring and feedback.
- Embed the interview experience into your own application shell or dashboard.
ZEGOCLOUD handles the difficult real-time streaming and avatar animation layers so you can stay focused on interview design, scoring, and integration into your product.