Video-based AI agents are becoming common in modern products, appearing as virtual recruiters, onboarding assistants, and training coaches; an AI interview assistant is one of the most practical examples. These agents do more than answer questions: they look you in the eye, speak with natural timing, and guide you through a structured conversation. Behind that smooth experience is a complex stack of real-time audio and video, speech recognition, LLM reasoning, text-to-speech, and avatar rendering that must stay in sync. In this guide, you will build an AI interview assistant that welcomes a candidate, asks structured questions, and responds naturally in real time.
How to Develop an AI Interview Assistant
Instead of a classic chatbot, ZEGOCLOUD treats the AI interviewer as another participant in a real-time room:
- The candidate joins a ZEGOCLOUD room with a microphone stream.
- The AI agent joins the same room, listening to the candidate’s voice and replying with speech.
- The Digital Human binds to the agent’s voice stream and outputs a synchronized video stream.
- Your web app just plays the candidate’s audio and the digital human’s video and manages UI state.
Under the hood, the Digital Human SDK turns the agent’s TTS audio into a talking avatar, and ZegoExpressEngine carries all of those audio/video streams through the same room so the browser simply subscribes to the digital human stream like any other remote video.
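The participants and streams described above can be summarized as a tiny model; the names below are purely illustrative, not ZEGOCLOUD SDK types:

```typescript
// Conceptual model of the interview room: two participants, three streams.
// These names are illustrative only; they are not ZEGOCLOUD SDK types.
type Participant = 'candidate' | 'ai_agent'

interface RoomStream {
  owner: Participant
  kind: 'mic_audio' | 'agent_audio' | 'avatar_video'
}

const interviewRoom: RoomStream[] = [
  { owner: 'candidate', kind: 'mic_audio' },   // published by the browser
  { owner: 'ai_agent', kind: 'agent_audio' },  // the agent's TTS output
  { owner: 'ai_agent', kind: 'avatar_video' }  // digital human bound to that audio
]

// The browser publishes its own mic and subscribes to everything else.
const streamsToPlay = interviewRoom.filter(s => s.owner !== 'candidate')
console.log(streamsToPlay.length) // 2
```

The point is that the digital human video is just another remote stream in the room; the browser needs no special media path for it.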
Prerequisites
Before you start, make sure you have:
- A ZEGOCLOUD account with Agent and Digital Human services enabled → Sign up here
- Node.js 18+ and npm.
- A valid AppID and ServerSecret from the ZEGOCLOUD console.
- A DashScope (or other LLM) API key for interview logic. You can use zego_test for testing within the trial period.
- A modern desktop browser (Chrome/Edge) with microphone access.
1. Project Setup
The complete project implementation for this guide is available in the zego-digital-human repository.
1.1 Architecture Overview
The implementation is structured as:
- Backend (server)
  - Express app exposing /api/start, /api/start-digital-human, /api/send-message, /api/token, /api/stop.
  - ZEGOCLOUD MD5 signature generation.
  - Agent registration for an interview-oriented LLM, TTS, and ASR profile.
  - Unified “digital human agent instance” creation and cleanup.
- Frontend (client)
  - React app created with Vite + TypeScript.
  - ZegoExpressEngine WebRTC wrapper (ZegoService) for joining rooms, publishing the mic, and playing remote streams.
  - Digital human view that hosts the avatar video.
  - Interview flow hook (useInterview) managing connection state, ASR/LLM events, and UI.
The backend only exposes REST endpoints; all real-time media is handled via ZEGOCLOUD.
1.2 Installing Dependencies and Environment
Create the base structure:
mkdir zego-digital-human && cd zego-digital-human
mkdir server client
Backend setup
cd server
npm init -y
npm install express cors dotenv axios typescript tsx
npm install --save-dev @types/express @types/cors @types/node
Add server/.env:
ZEGO_APP_ID=your_numeric_app_id
ZEGO_SERVER_SECRET=your_32_character_secret
DASHSCOPE_API_KEY=your_dashscope_api_key
ALLOWED_ORIGINS=https://your-frontend-domain.com,http://localhost:5173
PORT=8080
Use tsx for development:
// server/package.json (scripts)
{
"scripts": {
"dev": "tsx watch src/server.ts",
"build": "tsc",
"start": "node dist/server.js",
"type-check": "tsc --noEmit"
}
}
Frontend setup
cd ../client
npm create vite@latest . -- --template react-ts
npm install zego-express-engine-webrtc axios framer-motion lucide-react tailwindcss zod
Add client/.env:
VITE_ZEGO_APP_ID=your_numeric_app_id
VITE_ZEGO_SERVER=wss://webliveroom-api.zegocloud.com/ws
VITE_API_BASE_URL=http://localhost:8080
Validate config on the client:
// client/src/config.ts
import { z } from 'zod'
const configSchema = z.object({
ZEGO_APP_ID: z.string().min(1, 'ZEGO App ID is required'),
ZEGO_SERVER: z.string().url('Valid ZEGO server URL required'),
API_BASE_URL: z.string().url('Valid API base URL required'),
})
const rawConfig = {
ZEGO_APP_ID: import.meta.env.VITE_ZEGO_APP_ID,
ZEGO_SERVER: import.meta.env.VITE_ZEGO_SERVER,
API_BASE_URL: import.meta.env.VITE_API_BASE_URL,
}
export const config = configSchema.parse(rawConfig)
This fails fast at startup if any environment variable is missing or malformed, instead of failing later with a confusing runtime error.
2. Building the Interview Agent Server
All backend logic lives in server/src/server.ts. The core steps:
- Generate ZEGOCLOUD signatures.
- Register an interview agent once per process.
- Start voice-only and digital human sessions.
- Provide tokens and cleanup endpoints.
2.1 ZEGOCLOUD API Authentication
The Agent and Digital Human APIs share a signature scheme based on MD5:
// server/src/server.ts
import crypto from 'crypto'
import axios from 'axios'
import dotenv from 'dotenv'
dotenv.config()
const CONFIG = {
ZEGO_APP_ID: process.env.ZEGO_APP_ID!,
ZEGO_SERVER_SECRET: process.env.ZEGO_SERVER_SECRET!,
ZEGO_AIAGENT_API_BASE_URL: 'https://aigc-aiagent-api.zegotech.cn',
ZEGO_DIGITAL_HUMAN_API_BASE_URL: 'https://aigc-digitalhuman-api.zegotech.cn'
}
function generateZegoSignature(action: string) {
const timestamp = Math.floor(Date.now() / 1000)
const nonce = crypto.randomBytes(8).toString('hex')
// Critical: AppId + SignatureNonce + ServerSecret + Timestamp
const signString = CONFIG.ZEGO_APP_ID + nonce + CONFIG.ZEGO_SERVER_SECRET + timestamp
const signature = crypto.createHash('md5').update(signString).digest('hex')
return {
Action: action,
AppId: CONFIG.ZEGO_APP_ID,
SignatureNonce: nonce,
SignatureVersion: '2.0',
Timestamp: timestamp,
Signature: signature
}
}
async function makeZegoRequest(
action: string,
body: object = {},
apiType: 'aiagent' | 'digitalhuman' = 'aiagent'
) {
const queryParams = generateZegoSignature(action)
const queryString = Object.entries(queryParams)
.map(([k, v]) => `${k}=${encodeURIComponent(String(v))}`)
.join('&')
const baseUrl =
apiType === 'digitalhuman'
? CONFIG.ZEGO_DIGITAL_HUMAN_API_BASE_URL
: CONFIG.ZEGO_AIAGENT_API_BASE_URL
const url = `${baseUrl}?${queryString}`
const response = await axios.post(url, body, {
headers: { 'Content-Type': 'application/json' },
timeout: 30000
})
return response.data
}
You will reuse makeZegoRequest for every Agent and Digital Human operation.
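Before wiring the signature into endpoints, you can sanity-check the concatenation order (AppId + SignatureNonce + ServerSecret + Timestamp) in isolation; the inputs below are made-up test values:

```typescript
import { createHash } from 'crypto'

// Standalone reimplementation of the signature string, useful for comparing
// against an implementation in another language. Inputs are fake test values.
function md5Signature(appId: string, nonce: string, secret: string, timestamp: number): string {
  return createHash('md5')
    .update(appId + nonce + secret + String(timestamp))
    .digest('hex')
}

const sig = md5Signature('1234567890', 'a1b2c3d4e5f60718', 'not_a_real_secret', 1700000000)
console.log(sig.length) // 32 — MD5 always yields a 32-character lowercase hex digest
```

Because the signature is deterministic for fixed inputs, a check like this is an easy way to debug "signature invalid" responses caused by a wrong concatenation order.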
2.2 Defining the Interview Agent (LLM, TTS, ASR)
Next, define a reusable interview agent with a focused system prompt and streaming preferences:
// server/src/server.ts
let REGISTERED_AGENT_ID: string | null = null
const AGENT_CONFIG = {
LLM: {
Url: 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions',
ApiKey: process.env.DASHSCOPE_API_KEY || 'zego_test',
Model: 'qwen-plus',
SystemPrompt: `
You are a professional job interviewer.
INTERVIEW PHASES:
1. Introduction: brief greeting and self-introduction question.
2. Technical: ask about skills, projects, and problem-solving.
3. Behavioral: explore teamwork, conflict, and challenges.
4. Closing: invite questions and wrap up politely.
RULES:
- Ask ONE clear question at a time.
- Keep questions under two sentences.
- Acknowledge answers briefly before moving on.
- Conclude with: "Thank you for your time today. This concludes our interview."
`,
Temperature: 0.7,
TopP: 0.9,
Params: { max_tokens: 400 }
},
TTS: {
Vendor: 'ByteDance',
Params: {
app: {
appid: 'zego_test',
token: 'zego_test',
cluster: 'volcano_tts'
},
speed_ratio: 1,
volume_ratio: 1,
pitch_ratio: 1,
audio: { rate: 24000 }
},
FilterText: [
{ BeginCharacters: '(', EndCharacters: ')' },
{ BeginCharacters: '[', EndCharacters: ']' }
],
TerminatorText: '#'
},
ASR: {
Vendor: 'Tencent',
Params: {
engine_model_type: '16k_en',
hotword_list: 'interview|10,experience|8,project|8,team|8,challenge|8,skills|8'
},
VADSilenceSegmentation: 1500,
PauseInterval: 2000
}
}
async function registerAgent(): Promise<string> {
if (REGISTERED_AGENT_ID) return REGISTERED_AGENT_ID
const agentId = `interview_agent_${Date.now()}`
const payload = { AgentId: agentId, Name: 'AI Interview Assistant', ...AGENT_CONFIG }
const result = await makeZegoRequest('RegisterAgent', payload)
if (result.Code !== 0) {
throw new Error(`RegisterAgent failed: ${result.Code} ${result.Message}`)
}
REGISTERED_AGENT_ID = agentId
return agentId
}
The agent is registered only once per server process and reused across sessions.
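One caveat: registerAgent checks REGISTERED_AGENT_ID before awaiting, so two concurrent first requests could both call RegisterAgent. If that matters in your deployment, memoize the in-flight promise instead; a minimal sketch, where the register callback stands in for the real API call:

```typescript
// Memoize the in-flight registration promise so concurrent callers share a
// single API call; reset on failure so a later request can retry.
let agentPromise: Promise<string> | null = null

function registerAgentOnce(register: () => Promise<string>): Promise<string> {
  if (!agentPromise) {
    agentPromise = register().catch(err => {
      agentPromise = null // allow retry after a failed registration
      throw err
    })
  }
  return agentPromise
}
```

Even if two requests race, both await the same promise, so RegisterAgent is hit once per process.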
2.3 Voice-Only Agent Session and Token Endpoint
Even with a digital human, a basic voice agent and token endpoint are useful and share the same patterns.
// server/src/server.ts
import express from 'express'
import cors from 'cors'
import { createRequire } from 'module'
const require = createRequire(import.meta.url)
const { generateToken04 } = require('../zego-token.cjs')
const app = express()
app.use(express.json())
app.use(cors())
function sanitizeRTCId(id: string) {
const s = String(id || '').replace(/[^A-Za-z0-9_.-]/g, '')
return s || `room_${Date.now().toString(36)}`
}
app.post('/api/start', async (req, res) => {
const { room_id, user_id, user_stream_id } = req.body
if (!room_id || !user_id) {
res.status(400).json({ error: 'room_id and user_id required' })
return
}
const agentId = await registerAgent()
const roomId = sanitizeRTCId(room_id)
const userStreamId = (user_stream_id || `${user_id}_stream`)
.toLowerCase()
.replace(/[^a-z0-9_.-]/g, '')
.slice(0, 128)
const instanceConfig = {
AgentId: agentId,
UserId: String(user_id).slice(0, 32),
RTC: {
RoomId: roomId,
AgentUserId: `ai_${roomId}`,
AgentStreamId: `ai_stream_${roomId}`,
UserStreamId: userStreamId
},
MessageHistory: {
SyncMode: 1,
Messages: [],
WindowSize: 10
},
AdvancedConfig: { InterruptMode: 0 }
}
const result = await makeZegoRequest('CreateAgentInstance', instanceConfig, 'aiagent')
if (result.Code !== 0) {
res.status(400).json({ error: result.Message || 'Failed to create instance' })
return
}
res.json({
success: true,
agentInstanceId: result.Data.AgentInstanceId,
agentUserId: instanceConfig.RTC.AgentUserId,
agentStreamId: instanceConfig.RTC.AgentStreamId,
userStreamId
})
})
app.get('/api/token', (req, res) => {
const userId = ((req.query.user_id as string) || '').trim()
const roomId = ((req.query.room_id as string) || '').trim()
if (!userId) {
res.status(400).json({ error: 'user_id required' })
return
}
const appId = Number(CONFIG.ZEGO_APP_ID)
const secret = CONFIG.ZEGO_SERVER_SECRET
const payload = {
room_id: roomId,
privilege: { 1: 1, 2: 1, 3: 1 },
stream_id_list: null
}
const token = generateToken04(appId, userId, secret, 3600, JSON.stringify(payload))
res.json({ token })
})
The frontend uses /api/token to log in with ZegoExpressEngine.
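On the client side, the matching call is a thin wrapper around GET /api/token. This sketch stands in for the repository's digitalHumanAPI.getToken; the fetchFn parameter is an addition here purely so the helper can be exercised without a live server:

```typescript
// Hypothetical client-side wrapper for GET /api/token. The query parameter
// names (user_id, room_id) match the Express handler above.
type FetchLike = (url: string) => Promise<{ json(): Promise<any> }>

async function getToken(
  baseUrl: string,
  userId: string,
  roomId: string,
  fetchFn: FetchLike = fetch as unknown as FetchLike
): Promise<string> {
  const qs = new URLSearchParams({ user_id: userId, room_id: roomId })
  const res = await fetchFn(`${baseUrl}/api/token?${qs}`)
  const data = await res.json()
  if (typeof data.token !== 'string') throw new Error('No token in /api/token response')
  return data.token
}
```

Throwing when the token is missing keeps the failure close to its cause rather than surfacing later as a cryptic loginRoom error.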
2.4 Starting a Digital Human Interview Session
The digital human endpoint creates a unified agent instance that includes both voice and avatar configuration:
// server/src/server.ts
app.post('/api/start-digital-human', async (req, res) => {
try {
const { room_id, user_id, user_stream_id, digital_human_id } = req.body
if (!room_id || !user_id) {
res.status(400).json({ error: 'room_id and user_id required' })
return
}
const roomIdRTC = sanitizeRTCId(room_id)
const userStreamId = (user_stream_id || `${user_id}_stream`)
.toLowerCase()
.replace(/[^a-z0-9_.-]/g, '')
.slice(0, 128)
const agentId = await registerAgent()
const normalizedUserId = String(user_id).replace(/[^a-zA-Z0-9_-]/g, '').slice(0, 32)
const digitalHumanId = digital_human_id || 'your_digital_human_id'
const agentUserId = `agt_${roomIdRTC}`.slice(0, 32)
const agentStreamId = `agt_stream_${roomIdRTC}`.slice(0, 128)
const payload = {
AgentId: agentId,
UserId: normalizedUserId,
RTC: {
RoomId: roomIdRTC,
AgentUserId: agentUserId,
AgentStreamId: agentStreamId,
UserStreamId: userStreamId
},
DigitalHuman: {
DigitalHumanId: digitalHumanId,
ConfigId: 'web',
EncodeCode: 'H264'
},
MessageHistory: {
SyncMode: 1,
Messages: [],
WindowSize: 10
},
AdvancedConfig: { InterruptMode: 0 }
}
const result = await makeZegoRequest('CreateDigitalHumanAgentInstance', payload, 'aiagent')
if (result.Code !== 0) {
res.status(400).json({
error: result.Message || 'Failed to create digital human agent instance',
code: result.Code,
requestId: result.RequestId
})
return
}
const digitalHumanConfig = result.Data.DigitalHumanConfig
res.json({
success: true,
agentInstanceId: result.Data.AgentInstanceId,
agentStreamId,
roomId: roomIdRTC,
digitalHumanId,
digitalHumanConfig,
unifiedDigitalHuman: true
})
} catch (error: any) {
res.status(500).json({ error: error.message || 'Internal error' })
}
})
The response includes:
- agentInstanceId – for text messages and teardown.
- agentStreamId – the agent’s audio stream.
- roomId – the room the browser should join.
- digitalHumanConfig – avatar configuration for the Digital Human SDK.
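In TypeScript terms the payload looks roughly like this; the interface is inferred from the res.json call above, not an official SDK type:

```typescript
// Approximate shape of the /api/start-digital-human response, mirroring the
// res.json(...) call in the handler. digitalHumanConfig is treated as opaque
// and forwarded to the Digital Human SDK as-is.
interface StartDigitalHumanResponse {
  success: boolean
  agentInstanceId: string
  agentStreamId: string
  roomId: string
  digitalHumanId: string
  digitalHumanConfig: unknown
  unifiedDigitalHuman: boolean
}

// Example payload of the kind an integration test might assert against:
const example: StartDigitalHumanResponse = {
  success: true,
  agentInstanceId: 'inst_abc123',
  agentStreamId: 'agt_stream_interview_x1',
  roomId: 'interview_x1',
  digitalHumanId: 'your_digital_human_id',
  digitalHumanConfig: {},
  unifiedDigitalHuman: true
}
```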
2.5 Stopping the Session and Cleaning Up
When the candidate ends the interview, you must stop both the agent instance and any digital human task:
// server/src/server.ts
app.post('/api/stop', async (req, res) => {
const { agent_instance_id } = req.body
if (!agent_instance_id) {
res.status(400).json({ error: 'agent_instance_id required' })
return
}
// Optional: collect metrics before teardown
try {
const status = await makeZegoRequest('QueryAgentInstanceStatus', {
AgentInstanceId: agent_instance_id
})
console.log('Interview performance:', {
llmFirstTokenLatency: status.Data?.LLMFirstTokenLatency,
ttsFirstAudioLatency: status.Data?.TTSFirstAudioLatency
})
} catch {
console.warn('Could not fetch metrics')
}
const result = await makeZegoRequest('DeleteAgentInstance', {
AgentInstanceId: agent_instance_id
})
if (result.Code !== 0) {
res.status(400).json({ error: result.Message || 'Failed to delete instance' })
return
}
res.json({ success: true })
})
You can also expose a /api/cleanup endpoint using QueryDigitalHumanStreamTasks to force-stop any orphaned video streams.
2.6 Optional: Listing Available Digital Humans
To let your frontend choose between different avatars, add:
// server/src/server.ts
app.get('/api/digital-humans', async (_req, res) => {
const result = await makeZegoRequest('GetDigitalHumanList', {}, 'digitalhuman')
if (result.Code !== 0) {
res.status(400).json({
error: result.Message || 'Failed to query digital humans',
code: result.Code,
requestId: result.RequestId
})
return
}
res.json({
success: true,
digitalHumans: result.Data?.List || []
})
})
The client can present this as a simple avatar selector before starting the interview.
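A small pure helper can shape that list into selector options. The DigitalHumanId and Name field names are assumptions about the List items, so verify them against the actual API response:

```typescript
// Map the /api/digital-humans payload into UI selector options, falling back
// to the id when an avatar has no display name.
interface DigitalHumanInfo {
  DigitalHumanId: string
  Name?: string
}

function toAvatarOptions(list: DigitalHumanInfo[]): { value: string; label: string }[] {
  return list.map(dh => ({ value: dh.DigitalHumanId, label: dh.Name || dh.DigitalHumanId }))
}

console.log(toAvatarOptions([{ DigitalHumanId: 'dh_01', Name: 'Ava' }, { DigitalHumanId: 'dh_02' }]))
// [ { value: 'dh_01', label: 'Ava' }, { value: 'dh_02', label: 'dh_02' } ]
```

The selected value is then passed as digital_human_id to /api/start-digital-human.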
3. WebRTC Integration: ZegoExpressEngine Wrapper
On the frontend, all WebRTC logic lives inside ZegoService (client/src/services/zego.ts). It:
- Manages a single ZegoExpressEngine instance.
- Joins/leaves rooms.
- Publishes the candidate’s mic.
- Plays remote audio and digital human video streams.
- Exposes callbacks for ASR/LLM room messages and player state.
3.1 Initializing ZegoExpressEngine
// client/src/services/zego.ts
import { ZegoExpressEngine } from 'zego-express-engine-webrtc'
import { VoiceChanger } from 'zego-express-engine-webrtc/voice-changer'
import { config } from '../config'
import { digitalHumanAPI } from './digitalHumanAPI'
export class ZegoService {
private static instance: ZegoService
private zg: ZegoExpressEngine | null = null
private isInitialized = false
private currentRoomId: string | null = null
private currentUserId: string | null = null
private localStream: MediaStream | null = null
private audioElement: HTMLAudioElement | null = null
// ... other fields
static getInstance(): ZegoService {
if (!ZegoService.instance) ZegoService.instance = new ZegoService()
return ZegoService.instance
}
async initialize(): Promise<void> {
if (this.isInitialized) return
try {
try { ZegoExpressEngine.use(VoiceChanger) } catch {}
this.zg = new ZegoExpressEngine(
parseInt(config.ZEGO_APP_ID),
config.ZEGO_SERVER,
{ scenario: 7 } // digital human / AI scenario
)
try {
const rtc = await this.zg.checkSystemRequirements('webRTC')
const mic = await this.zg.checkSystemRequirements('microphone')
if (!rtc?.result) throw new Error('WebRTC not supported')
if (!mic?.result) console.warn('Microphone permission not granted yet')
} catch {}
this.setupEventListeners()
this.setupMediaElements()
this.isInitialized = true
} catch (error) {
console.error('ZEGO initialization failed:', error)
throw error
}
}
private setupMediaElements() {
this.audioElement = document.getElementById('ai-audio-output') as HTMLAudioElement
if (!this.audioElement) {
this.audioElement = document.createElement('audio')
this.audioElement.id = 'ai-audio-output'
this.audioElement.autoplay = true
this.audioElement.style.display = 'none'
document.body.appendChild(this.audioElement)
}
}
// ...
}
This ensures there is only one engine instance per browser tab.
3.2 Joining the Room and Publishing the Mic
// client/src/services/zego.ts
async joinRoom(roomId: string, userId: string): Promise<boolean> {
if (!this.zg) return false
if (this.currentRoomId === roomId && this.currentUserId === userId) return true
try {
if (this.currentRoomId) await this.leaveRoom()
this.currentRoomId = roomId
this.currentUserId = userId
const { token } = await digitalHumanAPI.getToken(userId, roomId)
await this.zg.loginRoom(roomId, token, {
userID: userId,
userName: userId
})
// Enable room message callbacks (ASR / LLM events)
this.zg.callExperimentalAPI({
method: 'onRecvRoomChannelMessage',
params: {}
})
const localStream = await this.zg.createStream({
camera: { video: false, audio: true }
})
this.localStream = localStream
const streamId = `${userId}_stream`
await this.zg.startPublishingStream(streamId, localStream, {
enableAutoSwitchVideoCodec: true
})
return true
} catch (error) {
console.error('Failed to join room:', error)
this.currentRoomId = null
this.currentUserId = null
this.localStream = null
return false
}
}
async enableMicrophone(enabled: boolean): Promise<boolean> {
if (!this.localStream) return false
const track = this.localStream.getAudioTracks?.()[0]
if (track) {
track.enabled = enabled
return true
}
return false
}
This is all the logic your React components need to toggle recording.
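To keep the UI's recording flag from drifting out of sync with the actual audio track, a component can fold the boolean result of enableMicrophone back into its state; a minimal sketch, with apply standing in for zegoService.enableMicrophone:

```typescript
// Keep the UI recording flag in sync with the real track state. The apply
// callback stands in for zegoService.enableMicrophone, which returns false
// when there is no local audio track yet.
interface MicState { recording: boolean }

function toggleRecording(state: MicState, apply: (on: boolean) => boolean): MicState {
  const next = !state.recording
  return apply(next) ? { recording: next } : state
}

console.log(toggleRecording({ recording: false }, () => true)) // { recording: true }
```

If the stream is not ready yet, the state is left unchanged, so the mic button never shows "recording" while nothing is actually captured.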
3.3 Handling Streams and Attaching the Digital Human Video
When ZEGOCLOUD adds new streams to the room, you decide which ones to play:
// client/src/services/zego.ts
private remoteViews = new Map<string, any>()
private playingStreamIds = new Set<string>()
private messageCallback: ((message: any) => void) | null = null
private setupEventListeners(): void {
if (!this.zg) return
this.zg.on('recvExperimentalAPI', (result: any) => {
const { method, content } = result
if (method === 'onRecvRoomChannelMessage') {
try {
const msg = JSON.parse(content.msgContent)
this.handleRoomMessage(msg)
} catch (e) {
console.error('Parse room message failed:', e)
}
}
})
this.zg.on('roomStreamUpdate', async (_roomID, updateType, streamList) => {
if (updateType === 'ADD') {
for (const stream of streamList) {
const streamId = stream.streamID
const userStreamId = this.currentUserId ? `${this.currentUserId}_stream` : null
if (userStreamId && streamId === userStreamId) continue
if (this.playingStreamIds.has(streamId)) continue
this.playingStreamIds.add(streamId)
try {
const mediaStream = await this.zg!.startPlayingStream(streamId)
if (!mediaStream) continue
const remoteView = await (this.zg as any).createRemoteStreamView(mediaStream)
if (!remoteView) continue
// Audio for agent / digital human is always enabled here
Promise.resolve(remoteView.playAudio({ enableAutoplayDialog: true })).catch(() => {})
this.remoteViews.set(streamId, remoteView)
} catch (error) {
console.error('Failed to start remote stream:', streamId, error)
}
}
}
if (updateType === 'DELETE') {
for (const stream of streamList) {
const rv = this.remoteViews.get(stream.streamID)
if (rv?.destroy) rv.destroy()
this.remoteViews.delete(stream.streamID)
this.playingStreamIds.delete(stream.streamID)
}
}
})
}
private handleRoomMessage(message: any): void {
if (this.messageCallback) {
this.messageCallback(message)
}
}
onRoomMessage(callback: (message: any) => void): void {
this.messageCallback = callback
}
To attach a specific digital human video stream into the UI, expose:
// client/src/services/zego.ts (core idea)
private dhVideoStreamId: string | null = null
setDigitalHumanStream(streamId: string | null): void {
this.dhVideoStreamId = streamId
if (!streamId) return
void this.startDigitalHumanPlayback(streamId)
}
private async startDigitalHumanPlayback(streamId: string): Promise<void> {
if (!this.zg) return
const mediaStream = await this.zg.startPlayingStream(streamId)
if (!mediaStream) return
const remoteView = await (this.zg as any).createRemoteStreamView(mediaStream)
if (!remoteView) return
// Attach audio
Promise.resolve(remoteView.playAudio({ enableAutoplayDialog: true })).catch(() => {})
// Attach video element into #remoteSteamView container
const attach = async () => {
const container = document.getElementById('remoteSteamView')
if (!container) {
setTimeout(attach, 200)
return
}
const result = await Promise.resolve(remoteView.playVideo(container, {
enableAutoplayDialog: false
}))
setTimeout(() => {
const videoEl = container.querySelector('video') as HTMLVideoElement | null
if (!videoEl) return
if (!videoEl.srcObject) {
videoEl.srcObject = mediaStream
videoEl.load()
void videoEl.play()
}
document.dispatchEvent(
new CustomEvent('zego-digital-human-video-state', { detail: { ready: true } })
)
}, 150)
}
attach()
}
The only requirement from React is to provide an element with id remoteSteamView; the service takes care of attaching and repairing the <video> element.
4. React Interview Experience
With media and backend in place, the rest is React:
- useInterview – orchestrates the session.
- DigitalHuman – displays the avatar and status.
- ChatPanel – shows the transcript.
- VoiceMessageInput – allows typed or spoken answers.
- App – a small state machine for welcome → interview → summary.
4.1 Interview State Hook
useInterview is the central hook that ties together ZegoService and the digital human APIs.
// client/src/hooks/useInterview.ts
import { useCallback, useRef, useEffect, useReducer } from 'react'
import { ZegoService } from '../services/zego'
import { digitalHumanAPI } from '../services/digitalHumanAPI'
import type { Message, ChatSession, ZegoRoomMessage } from '../types'
interface InterviewState {
messages: Message[]
session: ChatSession | null
isLoading: boolean
isConnected: boolean
isRecording: boolean
currentTranscript: string
agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
error: string | null
questionsAsked: number
isInterviewComplete: boolean
startTime: number | null
}
// reducer implementation omitted for brevity (check project repository)
export const useInterview = () => {
const [state, dispatch] = useReducer(interviewReducer, {
messages: [],
session: null,
isLoading: false,
isConnected: false,
isRecording: false,
currentTranscript: '',
agentStatus: 'idle',
error: null,
questionsAsked: 0,
isInterviewComplete: false,
startTime: null
})
const zegoService = useRef(ZegoService.getInstance())
const processedMessageIds = useRef(new Set<string>())
const addMessageSafely = useCallback((message: Message) => {
if (processedMessageIds.current.has(message.id)) return
processedMessageIds.current.add(message.id)
dispatch({ type: 'ADD_MESSAGE', payload: message })
if (
message.sender === 'ai' &&
message.content.toLowerCase().includes('this concludes our interview')
) {
setTimeout(() => {
dispatch({ type: 'SET_INTERVIEW_COMPLETE', payload: true })
}, 2000)
}
}, [])
const setupMessageHandlers = useCallback(() => {
const handleRoomMessage = (data: ZegoRoomMessage) => {
const { Cmd, Data: msgData } = data
if (Cmd === 3) {
// ASR events (candidate speech)
const { Text: transcript, EndFlag, MessageId } = msgData
if (!transcript?.trim()) return
dispatch({ type: 'SET_TRANSCRIPT', payload: transcript })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })
if (EndFlag) {
const message: Message = {
id: MessageId || `voice_${Date.now()}`,
content: transcript.trim(),
sender: 'user',
timestamp: Date.now(),
type: 'voice',
transcript: transcript.trim()
}
addMessageSafely(message)
dispatch({ type: 'SET_TRANSCRIPT', payload: '' })
dispatch({ type: 'SET_AGENT_STATUS', payload: 'thinking' })
dispatch({ type: 'INCREMENT_QUESTIONS_ASKED' })
}
}
if (Cmd === 4) {
// LLM events (AI interviewer responses)
const { Text: content, MessageId, EndFlag } = msgData
if (!content || !MessageId) return
dispatch({ type: 'SET_AGENT_STATUS', payload: 'speaking' })
if (EndFlag) {
const final: Message = {
id: `ai_${Date.now()}`,
content,
sender: 'ai',
timestamp: Date.now(),
type: 'text'
}
addMessageSafely(final)
setTimeout(() => {
dispatch({ type: 'SET_AGENT_STATUS', payload: 'listening' })
}, 8000)
}
}
}
zegoService.current.onRoomMessage(handleRoomMessage)
}, [addMessageSafely])
const startInterview = useCallback(async () => {
if (state.isLoading || state.isConnected) return false
dispatch({ type: 'SET_LOADING', payload: true })
dispatch({ type: 'SET_START_TIME', payload: Date.now() })
dispatch({ type: 'SET_ERROR', payload: null })
try {
const roomId = `interview_${Date.now().toString(36)}`
const userId = `candidate_${Date.now().toString(36)}`
await zegoService.current.initialize()
const result = await digitalHumanAPI.startInterview(roomId, userId)
const joinedRoomId = result.roomId || roomId
const joined = await zegoService.current.joinRoom(joinedRoomId, userId)
if (!joined) throw new Error('Failed to join ZEGO room')
if (result.agentStreamId) {
zegoService.current.setAgentAudioStream(result.agentStreamId)
}
if (result.digitalHumanVideoStreamId) {
zegoService.current.setDigitalHumanStream(result.digitalHumanVideoStreamId)
}
const session: ChatSession = {
roomId: joinedRoomId,
userId,
agentInstanceId: result.agentInstanceId,
agentStreamId: result.agentStreamId,
digitalHumanTaskId: result.digitalHumanTaskId,
digitalHumanVideoStreamId: result.digitalHumanVideoStreamId,
digitalHumanId: result.digitalHumanId,
isActive: true,
voiceSettings: {
isEnabled: false,
autoPlay: true,
speechRate: 1.0,
speechPitch: 1.0
}
}
dispatch({ type: 'SET_SESSION', payload: session })
dispatch({ type: 'SET_CONNECTED', payload: true })
setupMessageHandlers()
await digitalHumanAPI.sendMessage(
session.agentInstanceId!,
'Please start the interview with a short greeting and your first question.'
)
return true
} catch (error: any) {
dispatch({ type: 'SET_ERROR', payload: error.message || 'Failed to start interview' })
return false
} finally {
dispatch({ type: 'SET_LOADING', payload: false })
}
}, [state.isLoading, state.isConnected, setupMessageHandlers])
// sendTextMessage, toggleVoiceRecording, endInterview, cleanup omitted...
return {
...state,
startInterview,
// sendTextMessage,
// toggleVoiceRecording,
// toggleVoiceSettings,
// endInterview
}
}
The hook hides all ZEGOCLOUD details from components.
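Because the Cmd === 3 handling is pure apart from the dispatches, its semantics are easy to test in isolation. A hypothetical reducer with the same behavior (partials update the live transcript, EndFlag commits the utterance):

```typescript
// Fold streaming ASR events into { partial, finals }, mirroring the
// Cmd === 3 branch above: partials update the live transcript, and an
// EndFlag event commits the utterance and clears the partial.
interface AsrEvent { Text: string; EndFlag?: boolean }
interface TranscriptState { partial: string; finals: string[] }

function applyAsrEvent(state: TranscriptState, ev: AsrEvent): TranscriptState {
  const text = ev.Text?.trim()
  if (!text) return state
  if (ev.EndFlag) return { partial: '', finals: [...state.finals, text] }
  return { ...state, partial: text }
}

let s: TranscriptState = { partial: '', finals: [] }
s = applyAsrEvent(s, { Text: 'I worked' })
s = applyAsrEvent(s, { Text: 'I worked on a React app', EndFlag: true })
console.log(s) // { partial: '', finals: [ 'I worked on a React app' ] }
```

Extracting this kind of pure logic from the hook makes the event handling unit-testable without mocking ZEGOCLOUD at all.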
4.2 Digital Human Component
The DigitalHuman component hosts the video container and overlays connection status and current question.
// client/src/components/Interview/DigitalHuman.tsx
import { useEffect, useState } from 'react'
import { motion, AnimatePresence } from 'framer-motion'
import { useDigitalHuman } from '../../hooks/useDigitalHuman'
import { ZegoService } from '../../services/zego'
import { Volume2, VolumeX, Video, VideoOff, Circle } from 'lucide-react'
interface DigitalHumanProps {
isConnected: boolean
agentStatus: 'idle' | 'listening' | 'thinking' | 'speaking'
currentQuestion?: string
}
export const DigitalHuman = ({ isConnected, agentStatus, currentQuestion }: DigitalHumanProps) => {
const { isVideoEnabled, isAudioEnabled, toggleVideo, toggleAudio } = useDigitalHuman()
const [videoReady, setVideoReady] = useState(false)
useEffect(() => {
ZegoService.getInstance().ensureVideoContainer()
}, [isConnected])
useEffect(() => {
const handler = (event: Event) => {
const { detail } = event as CustomEvent<{ ready: boolean }>
setVideoReady(!!detail?.ready)
}
document.addEventListener('zego-digital-human-video-state', handler)
return () => document.removeEventListener('zego-digital-human-video-state', handler)
}, [])
const status = {
idle: { text: 'Ready', color: 'bg-slate-400' },
listening: { text: 'Listening', color: 'bg-emerald-500' },
thinking: { text: 'Processing', color: 'bg-blue-500' },
speaking: { text: 'Speaking', color: 'bg-violet-500' }
}[agentStatus]
return (
<div className="relative w-full h-full bg-slate-900 flex items-center justify-center overflow-hidden">
{/* Digital human video container */}
<div
id="remoteSteamView"
className={`absolute inset-0 w-full h-full transition-opacity duration-300 ${
videoReady && isVideoEnabled ? 'opacity-100' : 'opacity-0'
}`}
/>
<style>{`
#remoteSteamView {
display: flex;
align-items: center;
justify-content: center;
}
#remoteSteamView > div {
width: 100%;
height: 100%;
}
#remoteSteamView video {
width: 100%;
height: 100%;
object-fit: cover;
}
`}</style>
{/* Status + controls */}
{isConnected && (
<motion.div
initial={{ opacity: 0, y: -20 }}
animate={{ opacity: 1, y: 0 }}
className="absolute top-6 left-6 right-6 flex items-center justify-between"
>
<div className="flex items-center space-x-3 bg-black/50 rounded-full px-4 py-2 border border-white/10">
<motion.div
className={`w-2.5 h-2.5 rounded-full ${status.color}`}
animate={{ scale: [1, 1.3, 1], opacity: [1, 0.7, 1] }}
transition={{ repeat: Infinity, duration: 2 }}
/>
<span className="text-white text-sm font-medium">
{status.text}
</span>
</div>
<div className="flex items-center space-x-2">
<button
onClick={toggleVideo}
className="p-2.5 rounded-full bg-black/50 text-white"
title={isVideoEnabled ? 'Disable video' : 'Enable video'}
>
{isVideoEnabled ? <Video className="w-4 h-4" /> : <VideoOff className="w-4 h-4" />}
</button>
<button
onClick={toggleAudio}
className="p-2.5 rounded-full bg-black/50 text-white"
title={isAudioEnabled ? 'Mute audio' : 'Unmute audio'}
>
{isAudioEnabled ? <Volume2 className="w-4 h-4" /> : <VolumeX className="w-4 h-4" />}
</button>
</div>
</motion.div>
)}
{/* Optional: show current question overlay when agent is speaking */}
<AnimatePresence>
{currentQuestion && agentStatus === 'speaking' && (
<motion.div
initial={{ opacity: 0, y: 30 }}
animate={{ opacity: 1, y: 0 }}
exit={{ opacity: 0, y: -20 }}
className="absolute bottom-0 left-0 right-0 p-8 bg-gradient-to-t from-black/80 via-black/50 to-transparent"
>
<div className="bg-white/95 rounded-2xl p-6 shadow-2xl flex items-start space-x-3">
<Circle className="w-5 h-5 text-violet-500 mt-1" />
<p className="text-slate-900 font-medium text-lg leading-relaxed">
{currentQuestion}
</p>
</div>
</motion.div>
)}
</AnimatePresence>
</div>
)
}
This encapsulates all avatar-related UI concerns.
4.3 Chat Panel and Voice Input
The chat panel shows the full message history:
// client/src/components/Interview/ChatPanel.tsx (simplified)
import { useEffect, useRef } from 'react'
import { motion, AnimatePresence } from 'framer-motion'
import type { Message } from '../../types'
import { User, Bot } from 'lucide-react'
interface ChatPanelProps {
messages: Message[]
isTyping: boolean
}
export const ChatPanel = ({ messages, isTyping }: ChatPanelProps) => {
const endRef = useRef<HTMLDivElement>(null)
useEffect(() => {
endRef.current?.scrollIntoView({ behavior: 'smooth' })
}, [messages, isTyping])
return (
<div className="flex flex-col flex-1 bg-slate-900/50">
<div className="flex-1 overflow-y-auto px-6 py-4 space-y-4">
<AnimatePresence initial={false}>
{messages.map((m) => (
<motion.div
key={m.id}
initial={{ opacity: 0, y: 10 }}
animate={{ opacity: 1, y: 0 }}
className={`flex gap-3 ${m.sender === 'user' ? 'justify-end' : 'justify-start'}`}
>
{m.sender === 'ai' && (
<div className="w-8 h-8 rounded-full bg-violet-600 flex items-center justify-center">
<Bot className="w-4 h-4 text-white" />
</div>
)}
<div
className={`max-w-[75%] rounded-2xl px-4 py-3 ${
m.sender === 'user'
? 'bg-blue-600 text-white'
: 'bg-slate-800 text-slate-100'
}`}
>
<p className="text-sm whitespace-pre-wrap">{m.content}</p>
</div>
{m.sender === 'user' && (
<div className="w-8 h-8 rounded-full bg-blue-600 flex items-center justify-center">
<User className="w-4 h-4 text-white" />
</div>
)}
</motion.div>
))}
</AnimatePresence>
{isTyping && (
<motion.div
initial={{ opacity: 0, y: 10 }}
animate={{ opacity: 1, y: 0 }}
className="flex gap-3"
>
<div className="w-8 h-8 rounded-full bg-violet-600 flex items-center justify-center">
<Bot className="w-4 h-4 text-white" />
</div>
<div className="bg-slate-800 rounded-2xl px-4 py-3">
<div className="flex gap-1">
<span className="w-2 h-2 bg-slate-400 rounded-full animate-bounce" />
<span
className="w-2 h-2 bg-slate-400 rounded-full animate-bounce"
style={{ animationDelay: '150ms' }}
/>
<span
className="w-2 h-2 bg-slate-400 rounded-full animate-bounce"
style={{ animationDelay: '300ms' }}
/>
</div>
</div>
</motion.div>
)}
<div ref={endRef} />
</div>
</div>
)
}
The voice input lets candidates type or speak their answers and reflects the interviewer’s status. The implementation is similar to a standard chat input with an extra mic toggle and a transcript area.
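One detail worth getting right is what the send button actually submits when both a typed draft and a live transcript exist. A minimal sketch of that decision (the names `VoiceState` and `resolveOutgoingText` are illustrative, not from the repo):

```typescript
// Illustrative helper: decide what to send when the user presses "Send".
// Typed text takes priority over the interim voice transcript.
export interface VoiceState {
  isRecording: boolean
  transcript: string // interim ASR text while the mic is on
}

export function resolveOutgoingText(typed: string, voice: VoiceState): string | null {
  const typedTrimmed = typed.trim()
  if (typedTrimmed) return typedTrimmed // explicit typed text wins
  const spoken = voice.transcript.trim()
  if (voice.isRecording && spoken) return spoken // fall back to the live transcript
  return null // nothing to send
}
```

Keeping this as a pure function makes the input component trivial to unit-test independently of the ASR pipeline.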
4.4 Interview Room Layout
Finally, the InterviewRoom component ties everything together and returns a summary when the interview ends:
// client/src/components/Interview/InterviewRoom.tsx
import { useEffect, useState, useCallback, useMemo } from 'react'
import { motion } from 'framer-motion'
import { DigitalHuman } from './DigitalHuman'
import { ChatPanel } from './ChatPanel'
import { Button } from '../UI/Button'
import { useInterview } from '../../hooks/useInterview'
import { PhoneOff, Clock } from 'lucide-react'
import type { Message } from '../../types'
export interface InterviewSummary {
duration: string
questionsCount: number
responsesCount: number
messages: Message[]
}
interface InterviewRoomProps {
onComplete: (data: InterviewSummary) => void
}
export const InterviewRoom = ({ onComplete }: InterviewRoomProps) => {
const [currentTime, setCurrentTime] = useState(Date.now())
const {
messages,
isLoading,
isConnected,
isRecording,
error,
agentStatus,
questionsAsked,
isInterviewComplete,
startTime,
startInterview,
endInterview
} = useInterview()
useEffect(() => {
void startInterview()
}, [])
useEffect(() => {
if (!isConnected) return
const id = setInterval(() => setCurrentTime(Date.now()), 1000)
return () => clearInterval(id)
}, [isConnected])
useEffect(() => {
if (!isInterviewComplete || !startTime) return
const secs = Math.floor((Date.now() - startTime) / 1000)
const data: InterviewSummary = {
duration: `${Math.floor(secs / 60)}:${(secs % 60).toString().padStart(2, '0')}`,
questionsCount: messages.filter(m => m.sender === 'ai').length,
responsesCount: messages.filter(m => m.sender === 'user').length,
messages
}
onComplete(data)
}, [isInterviewComplete, startTime, messages, onComplete])
const formatDuration = useCallback((now: number) => {
if (!startTime) return '0:00'
const secs = Math.floor((now - startTime) / 1000)
const mins = Math.floor(secs / 60)
return `${mins}:${(secs % 60).toString().padStart(2, '0')}`
}, [startTime])
const statusDisplay = useMemo(() => {
if (isInterviewComplete) return { text: 'Interview completed', color: 'text-emerald-500' }
if (isLoading && !isConnected) return { text: 'Connecting...', color: 'text-blue-500' }
if (!isConnected) return { text: 'Connecting...', color: 'text-blue-500' }
const map = {
listening: { text: 'Listening...', color: 'text-emerald-500' },
thinking: { text: 'Thinking...', color: 'text-blue-500' },
speaking: { text: 'Speaking...', color: 'text-violet-500' },
idle: { text: 'Ready', color: 'text-slate-400' }
} as const
return map[agentStatus]
}, [isConnected, isInterviewComplete, isLoading, agentStatus])
return (
<div className="h-screen flex flex-col bg-slate-950">
{/* Header */}
<motion.header
initial={{ y: -20, opacity: 0 }}
animate={{ y: 0, opacity: 1 }}
className="bg-slate-900/80 backdrop-blur-xl border-b border-slate-800"
>
<div className="px-6 py-4 flex items-center justify-between">
<div>
<h1 className="text-lg font-bold text-white">AI Interview</h1>
<p className={`text-sm font-medium ${statusDisplay.color}`}>
{statusDisplay.text}
</p>
{error && (
<p className="text-xs text-red-400 mt-1">
{error}
</p>
)}
</div>
{isConnected && (
<div className="flex items-center space-x-4">
<div className="flex items-center space-x-2 text-sm text-slate-400">
<Clock className="w-4 h-4" />
<span className="tabular-nums">{formatDuration(currentTime)}</span>
</div>
{isRecording && (
<div className="px-3 py-1 rounded-full border border-emerald-500/40 bg-emerald-500/10 flex items-center space-x-2">
<span className="w-2 h-2 rounded-full bg-emerald-400 animate-pulse" />
<span className="text-xs font-semibold text-emerald-300">Mic On / Listening</span>
</div>
)}
<div className="px-3 py-1 bg-blue-500/10 rounded-full">
<span className="text-xs font-semibold text-blue-400">
Q{questionsAsked}
</span>
</div>
<Button
onClick={endInterview}
variant="secondary"
size="sm"
disabled={isLoading}
className="bg-slate-800 hover:bg-red-500/10 text-slate-300 hover:text-red-400 border-slate-700"
>
<PhoneOff className="w-4 h-4 mr-2" />
End
</Button>
</div>
)}
</div>
</motion.header>
{/* Body */}
<div className="flex-1 flex flex-col lg:flex-row overflow-hidden">
<div className="w-full lg:w-1/2">
<DigitalHuman
isConnected={isConnected}
agentStatus={agentStatus}
currentQuestion=""
/>
</div>
<div className="w-full lg:w-1/2 border-t lg:border-t-0 lg:border-l border-slate-800">
<ChatPanel
messages={messages}
isTyping={agentStatus === 'thinking' || agentStatus === 'speaking'}
/>
</div>
</div>
</div>
)
}
Your top-level App component only needs to choose between the welcome screen, interview screen, and summary view.
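That screen switching can be modeled as a tiny state machine. A sketch under assumed names (`Screen`, `AppEvent`, and `nextScreen` are illustrative, not from the repo):

```typescript
// Illustrative screen state machine for the top-level App component.
export type Screen = 'welcome' | 'interview' | 'summary'
export type AppEvent = 'start' | 'complete' | 'restart'

// Returns the next screen for a given event; unknown events are ignored.
export function nextScreen(current: Screen, event: AppEvent): Screen {
  switch (current) {
    case 'welcome':
      return event === 'start' ? 'interview' : current
    case 'interview':
      return event === 'complete' ? 'summary' : current
    case 'summary':
      return event === 'restart' ? 'welcome' : current
  }
}
```

In the App component, the current `Screen` value would decide which of the three views renders, and `InterviewRoom`'s `onComplete` callback would fire the `'complete'` event.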
5. Frontend API Client
The React app talks to the backend through a small wrapper in client/src/services/digitalHumanAPI.ts. It hides raw URLs and response shapes.
// client/src/services/digitalHumanAPI.ts
import axios from 'axios'
import { config } from '../config'
const api = axios.create({
baseURL: config.API_BASE_URL,
timeout: 30000,
headers: { 'Content-Type': 'application/json' }
})
export const digitalHumanAPI = {
async startInterview(roomId: string, userId: string) {
const requestData = {
room_id: roomId,
user_id: userId,
user_stream_id: `${userId}_stream`,
// digital_human_id: optional override
}
const response = await api.post('/api/start-digital-human', requestData)
if (!response.data || !response.data.success) {
throw new Error(response.data?.error || 'Digital human interview start failed')
}
return {
agentInstanceId: response.data.agentInstanceId,
agentStreamId: response.data.agentStreamId,
digitalHumanTaskId: response.data.digitalHumanTaskId,
digitalHumanVideoStreamId: response.data.digitalHumanVideoStreamId,
digitalHumanConfig: response.data.digitalHumanConfig,
roomId: response.data.roomId || roomId,
digitalHumanId: response.data.digitalHumanId,
unifiedDigitalHuman: response.data.unifiedDigitalHuman
}
},
async stopInterview(agentInstanceId: string, digitalHumanTaskId?: string) {
if (!agentInstanceId) return
await api.post('/api/stop', { agent_instance_id: agentInstanceId })
if (digitalHumanTaskId) {
await api.post('/api/stop-digital-human', { task_id: digitalHumanTaskId })
}
},
async sendMessage(agentInstanceId: string, message: string) {
const trimmed = (message || '').trim()
if (!agentInstanceId || !trimmed) return
const response = await api.post('/api/send-message', {
agent_instance_id: agentInstanceId,
message: trimmed
})
if (!response.data?.success) {
throw new Error(response.data?.error || 'Message send failed')
}
},
async getToken(userId: string, roomId?: string) {
const params = new URLSearchParams({ user_id: userId })
if (roomId) params.append('room_id', roomId)
const response = await api.get(`/api/token?${params.toString()}`)
if (!response.data?.token) {
throw new Error('No token returned')
}
return { token: response.data.token }
},
async healthCheck() {
const response = await api.get('/health')
return response.data
}
}
This keeps network code out of hooks and components.
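Because agent creation involves several upstream services, the first call can occasionally time out. If you see that in practice, one option is a small retry wrapper around the API methods. This is an optional sketch (`withRetry` is not part of the repo):

```typescript
// Illustrative retry helper for transient API failures (not part of the repo).
// Runs the async operation up to `attempts` times, pausing between tries.
export async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  delayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs))
      }
    }
  }
  throw lastError
}

// Usage: const session = await withRetry(() => digitalHumanAPI.startInterview(roomId, userId))
```

Only retry idempotent or safely repeatable calls this way; blindly retrying `sendMessage` could duplicate a candidate's answer.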
6. Running and Testing Your Digital Human Interviewer
6.1 Backend
From the `server` directory:
npm install # if not already installed
npm run dev
Check that:
- `http://localhost:8080/health` returns `status: "healthy"`.
- No `ZEGO_APP_ID`/`ZEGO_SERVER_SECRET` errors are logged.
- Outbound calls to ZEGOCLOUD succeed (no signature errors).
6.2 Frontend
From the `client` directory:
npm install
npm run dev
Open http://localhost:5173 in a desktop browser. You should:
1. See a welcome screen describing the AI Interview Assistant.
2. Click “Start Interview”, which will:
   - Ask the backend to create a digital human agent instance.
   - Join the room via `ZegoExpressEngine`.
   - Attach the digital human video to the main panel.
3. Hear the AI interviewer greet you and ask an introductory question.
4. Answer by typing in the chat input, or by voice: press the mic button and speak.

At the end, when the interviewer says the closing phrase or the candidate clicks the End button, the app shows a simple summary with total duration, questions asked, and responses given.
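Those summary fields mirror what `InterviewRoom` computes in its completion effect. Factored out as a pure function, they are easy to sanity-check in isolation (`buildSummary` is an illustrative sketch, not code from the repo):

```typescript
// Illustrative pure version of the summary InterviewRoom assembles on completion.
interface Msg {
  sender: 'user' | 'ai'
  content: string
}

export function buildSummary(messages: Msg[], startTime: number, now: number) {
  const secs = Math.floor((now - startTime) / 1000)
  return {
    // mm:ss string, matching the header's formatDuration output
    duration: `${Math.floor(secs / 60)}:${(secs % 60).toString().padStart(2, '0')}`,
    questionsCount: messages.filter((m) => m.sender === 'ai').length,
    responsesCount: messages.filter((m) => m.sender === 'user').length
  }
}
```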
Conclusion
You now have a complete digital human interview flow built on ZEGOCLOUD:
- The server manages ZEGOCLOUD authentication, agent registration, and the digital human lifecycle.
- The client handles WebRTC, streams, and ASR/LLM events.
- The React UI presents a polished, guided interview experience with a realistic avatar.
From here you can:
- Customize the LLM prompt for different interview types (engineering, sales, product).
- Use `/api/digital-humans` to let users choose from multiple avatars.
- Persist and analyze interview transcripts for scoring and feedback.
- Embed the interview experience into your own application shell or dashboard.
ZEGOCLOUD handles the difficult real-time streaming and avatar animation layers so you can stay focused on interview design, scoring, and integration into your product.