How to Build an AI Virtual Receptionist

An AI virtual receptionist greets visitors, answers questions, and books appointments around the clock using real-time voice and video. Businesses deploying the best AI virtual receptionist voice technology report up to 35% lower operational costs while maintaining 24/7 availability. This guide explains what an AI receptionist is, compares it with traditional answering services, and walks through building one with ZEGOCLOUD’s real-time communication SDK using production-ready code extracted from a working demo.

What Is an AI Receptionist?

An AI receptionist is a software system that handles front-desk tasks through natural conversation. Specifically, it uses speech recognition to listen, a large language model to understand intent, text-to-speech to respond, and a digital human avatar to deliver responses face-to-face over live video. As a result, an AI receptionist engages callers and walk-in visitors with spoken language and visual presence, unlike a chatbot limited to text.

AI Receptionist vs Traditional Answering Services

Traditional answering services rely on human operators or basic IVR menus. In both cases, these approaches carry limitations that an AI receptionist addresses directly.

Cost and availability. A live receptionist costs between $30,000 and $45,000 per year in the US (Bureau of Labor Statistics, 2024) and works standard business hours. An answering service adds $200 to $800 per month with per-minute charges that spike during peak hours. An AI receptionist runs 24/7 at a fraction of the cost, with no overtime, sick days, or shift scheduling.
Call handling capacity. A human receptionist handles one call at a time, while IVR systems handle more but frustrate callers with rigid menu trees. According to a 2024 Zendesk CX report, 61% of consumers would switch to a competitor after a single poor support experience. By contrast, an AI virtual receptionist handles many concurrent conversations with consistent quality, limited by platform concurrency quotas rather than by staffing.
Consistency and scalability. Human operators vary in knowledge, tone, and accuracy depending on fatigue, training gaps, or mood. Meanwhile, Grand View Research projects the global conversational AI market will reach $41.4 billion by 2030, driven by demand for consistent, scalable customer interactions. Consequently, an AI receptionist delivers the same accurate answer on call one and call ten thousand.
Response quality. Traditional IVR forces callers through “press 1 for sales, press 2 for support” menus. On the other hand, an AI receptionist understands natural language: “I need to reschedule my Thursday appointment” gets handled immediately. Furthermore, Gartner predicts that by 2026, conversational AI deployments will reduce contact center agent labor costs by $80 billion.

What to Look For in an AI Receptionist

Not all AI receptionist solutions are equal. Evaluate these criteria before committing:

Voice latency. Conversations feel natural only when the round-trip response time stays under 2 seconds. Choose a platform that processes ASR, LLM inference, TTS, and lip-sync rendering in a single pipeline with minimal hops.
Visual presence. A digital human face builds trust and engagement. Research from the University of Southern California’s Institute for Creative Technologies shows that people respond more positively to conversational agents with visible facial expressions and lip movement.
Integration flexibility. The top AI virtual receptionist platforms let you plug in your own LLM (GPT-4, Claude, or open-source models), TTS voices, and calendar booking systems. Vendor lock-in limits future customization.
Real-time infrastructure. Global coverage matters for businesses with customers across regions. Look for SDKs with WebRTC-based streaming, adaptive bitrate, and data centers in your target markets.
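
One way to reason about the 2-second target is to add up a per-stage latency budget. The sketch below is illustrative only: checkLatencyBudget is a helper name invented for this example, and the stage timings are hypothetical placeholders, not measurements from any platform.

```javascript
// Sum per-stage latencies and compare the total against a round-trip budget.
// All numbers passed in are the caller's own measurements or estimates.
function checkLatencyBudget(stagesMs, budgetMs = 2000) {
  const entries = Object.entries(stagesMs);
  if (entries.length === 0) throw new Error("At least one stage is required");
  const totalMs = entries.reduce((sum, [, ms]) => sum + ms, 0);
  // The slowest stage is usually the first place to look when over budget.
  const [slowestStage] = entries.sort((a, b) => b[1] - a[1])[0];
  return { totalMs, withinBudget: totalMs <= budgetMs, slowestStage };
}

// Hypothetical timings in milliseconds, for illustration only.
const report = checkLatencyBudget({ asr: 300, llm: 700, tts: 250, lipSync: 150, network: 200 });
console.log(report);
```

Measuring each stage separately during a vendor trial makes it easier to see whether a slow response comes from the model, the voice engine, or the network.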

Build vs Buy AI Virtual Receptionist

Organizations face two paths: subscribe to a hosted AI receptionist service or build one using communication APIs.

Buy (hosted service). Platforms like Smith.ai, Ruby, and Abby Connect offer pre-built AI receptionist tiers starting at $140 to $300 per month. These work well for small businesses that need a quick fix. The tradeoff is limited customization: you cannot change the avatar, swap the LLM, or integrate deeply with internal systems. Data flows through a third party, which raises compliance concerns for healthcare and finance.

Build (API-based). Using a real-time communication SDK like ZEGOCLOUD, you assemble the AI pipeline yourself: choose the LLM, pick the voice, design the avatar, and own the data. Development takes days, not months, because the SDK handles WebRTC, media routing, and digital human rendering. The code example in the next section shows the entire server and client in under 400 lines.

When to build. Build when you need custom branding, regulatory control over data, integration with internal CRMs or EHRs, or multilingual support across more than two languages. The per-call cost drops as volume grows, making it more economical than hosted services above roughly 500 calls per month.

Build an AI Phone Virtual Receptionist

This section walks through building a working AI virtual receptionist with voice interaction, a digital human avatar, and real-time video streaming. Every code sample is extracted from a running demo built with ZEGOCLOUD’s Express SDK and AI Agent API.

ZEGOCLOUD provides a real-time communication platform with WebRTC-based audio and video streaming, delivering sub-300ms latency across 200+ data centers worldwide. The Express SDK handles media capture, encoding, transport, and playback in the browser. The AI Agent API orchestrates the full AI pipeline (ASR, LLM, TTS, and digital human rendering) on the server side, so the browser only needs to send audio and receive video.

Architecture Overview

The application follows a three-tier architecture:

┌──────────────────────────────────────────────────────┐
│  Browser (React + Vite)                              │
│  ┌──────────────┐    ┌──────────────────────────┐   │
│  │  UI Layer     │    │  ZEGO Express SDK        │   │
│  │  Status       │    │  (WebRTC Engine)         │   │
│  │  Video        │    │                          │   │
│  │  Mic Toggle   │    │                          │   │
│  └──────────────┘    └──────────────────────────┘   │
└──────────────────────────────────────────────────────┘
         │                        │
         │ REST API calls         │ WebRTC
         ▼                        ▼
┌──────────────────────────────────────────────────────┐
│  Server (Next.js API Routes)                         │
│  ┌────────────────┐  ┌────────────────┐             │
│  │ POST /api/     │  │ POST /api/     │             │
│  │ agent          │  │ instance       │             │
│  └────────────────┘  └────────────────┘             │
│  ┌────────────────┐  ┌────────────────┐             │
│  │ GET /api/      │  │ MD5 Signature  │             │
│  │ token          │  │ Authentication │             │
│  └────────────────┘  └────────────────┘             │
└──────────────────────────────────────────────────────┘
         │                        │
         ▼                        ▼
┌──────────────────────────────────────────────────────┐
│  ZEGOCLOUD                                            │
│  ┌────────────────┐  ┌────────────────────────┐     │
│  │ AI Agent API   │  │ RTC Infrastructure     │     │
│  └────────────────┘  │ Room · Stream · Relay  │     │
│  ┌────────────────┐  └────────────────────────┘     │
│  │ AI Pipeline    │  ┌────────────────────────┐     │
│  │ ASR → LLM →   │  │ Digital Human          │     │
│  │ TTS → Lip-Sync│  │ Renderer (1080P)       │     │
│  └────────────────┘  └────────────────────────┘     │
└──────────────────────────────────────────────────────┘

The browser calls server APIs to register the AI agent, create a digital human instance, and obtain an RTC token. It then connects to the RTC room via the ZEGO Express SDK, publishes the user’s audio stream, and receives the avatar’s video in real time. The server handles API authentication and token generation, but never touches media directly.

Step 1: Create a ZEGOCLOUD Account

Sign up at the ZEGOCLOUD Console. After creating a project, copy the App ID and Server Secret from the project settings page. These credentials authenticate all API requests.

Step 2: Set Up the Project

Create a Next.js server and a Vite React client in separate directories:

ai-receptionist/
├── server/                    # Next.js backend
│   ├── app/api/
│   │   ├── agent/route.js     # Agent registration
│   │   ├── instance/route.js  # Instance create/delete
│   │   └── token/route.js     # Token generation
│   └── .env                   # APP_ID, SERVER_SECRET
└── web-react/                 # React + Vite frontend
    ├── src/App.jsx            # UI + SDK logic
    └── .env                   # VITE_APP_ID, VITE_API_BASE_URL

Server dependencies (Next.js 15, React 19):

{
  "dependencies": {
    "next": "^15.3.3",
    "react": "^19.1.0",
    "react-dom": "^19.1.0"
  }
}

Frontend dependencies:

{
  "dependencies": {
    "react": "^19.1.0",
    "react-dom": "^19.1.0",
    "zego-express-engine-webrtc": "^3.11.0"
  }
}

Install dependencies, starting from the ai-receptionist/ root:

# Server
cd server && npm install

# Frontend (step back out of server/ first)
cd ../web-react && npm install

Step 3: Server Configuration and Authentication

Create .env in the server/ directory:

# From ZEGOCLOUD console
APP_ID=your_app_id
SERVER_SECRET=your_32_char_secret
TOKEN_EXPIRE_SECONDS=3600
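
A misconfigured .env tends to surface later as an opaque signature or token error, so it can help to validate these values at startup. The following is a minimal sketch, assuming a hypothetical helper named loadServerConfig that is not part of any SDK. The length check reflects the token code in Step 6, which selects the AES variant from the secret's byte length.

```javascript
// Validate server environment values before handling any request.
// loadServerConfig is a hypothetical helper written for this example.
function loadServerConfig(env) {
  const appId = Number(env.APP_ID);
  if (!Number.isInteger(appId) || appId <= 0) {
    throw new Error("APP_ID must be a positive integer");
  }
  const serverSecret = env.SERVER_SECRET || "";
  // The Step 6 token code accepts only 16-, 24-, or 32-byte keys.
  if (![16, 24, 32].includes(Buffer.byteLength(serverSecret))) {
    throw new Error("SERVER_SECRET must be 16, 24, or 32 bytes long");
  }
  const tokenExpireSeconds = Number(env.TOKEN_EXPIRE_SECONDS || 3600);
  if (!Number.isInteger(tokenExpireSeconds) || tokenExpireSeconds <= 0) {
    throw new Error("TOKEN_EXPIRE_SECONDS must be a positive integer");
  }
  return { appId, serverSecret, tokenExpireSeconds };
}
```

Calling loadServerConfig(process.env) once when the server boots turns a missing credential into an immediate, readable failure.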

All requests to the ZEGOCLOUD AI Agent API require MD5 signature authentication. The signature combines the App ID, a random nonce, the server secret, and a Unix timestamp. This utility is shared across all API route handlers:

import crypto from "crypto";
import { NextResponse } from "next/server";

const getAppId = () => Number(process.env.APP_ID || process.env.ZEGO_APPID || 0);
const getServerSecret = () => process.env.SERVER_SECRET || process.env.ZEGO_SERVER_SECRET || "";

const generateSignature = (appId, serverSecret, signatureNonce, timestamp) => {
  return crypto
    .createHash("md5")
    .update(`${appId}${signatureNonce}${serverSecret}${timestamp}`)
    .digest("hex");
};

const sendAgentRequest = async (action, body) => {
  const appId = getAppId();
  const serverSecret = getServerSecret();
  const timestamp = Math.floor(Date.now() / 1000);
  const signatureNonce = crypto.randomBytes(8).toString("hex");
  const signature = generateSignature(appId, serverSecret, signatureNonce, timestamp);

  const url = new URL("https://aigc-aiagent-api.zegotech.cn/");
  url.searchParams.set("Action", action);
  url.searchParams.set("AppId", appId.toString());
  url.searchParams.set("SignatureNonce", signatureNonce);
  url.searchParams.set("Timestamp", timestamp.toString());
  url.searchParams.set("Signature", signature);
  url.searchParams.set("SignatureVersion", "2.0");

  const response = await fetch(url.toString(), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return await response.json();
};
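
To make the signing scheme easier to check in isolation, the sketch below rebuilds the same query string as a pure function with fixed inputs. buildSignedQuery is a name introduced here for illustration, and the credentials are placeholders. It uses CommonJS require so it runs as a standalone script; inside the Next.js route files, keep the import syntax shown above.

```javascript
const crypto = require("crypto");

// Build the signed query string for one AI Agent API call.
// Nonce and timestamp are passed in so the result is deterministic and testable.
function buildSignedQuery({ action, appId, serverSecret, signatureNonce, timestamp }) {
  // Same concatenation order as generateSignature: AppId + nonce + secret + timestamp.
  const signature = crypto
    .createHash("md5")
    .update(`${appId}${signatureNonce}${serverSecret}${timestamp}`)
    .digest("hex");
  const params = new URLSearchParams({
    Action: action,
    AppId: String(appId),
    SignatureNonce: signatureNonce,
    Timestamp: String(timestamp),
    Signature: signature,
    SignatureVersion: "2.0",
  });
  return { signature, query: params.toString() };
}

// Placeholder credentials, for illustration only.
const example = buildSignedQuery({
  action: "RegisterAgent",
  appId: 1234567890,
  serverSecret: "0123456789abcdef0123456789abcdef",
  signatureNonce: "a1b2c3d4e5f60718",
  timestamp: 1700000000,
});
console.log(example.query);
```

Because the signature depends on the timestamp and a fresh nonce, a new one has to be computed for every request rather than cached.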

Step 4: Register the AI Agent

The RegisterAgent API configures the AI pipeline in a single call. This defines the LLM, TTS voice, and ASR provider the receptionist uses. Place this in server/app/api/agent/route.js:

export const POST = async (request) => {
  const body = await request.json();
  const agentId = body.agentId || "ai_avatar_agent";
  const agentName = body.agentName || "AI Avatar";

  const result = await sendAgentRequest("RegisterAgent", {
    AgentId: agentId,
    Name: agentName,
    LLM: {
      Url: "https://ark.cn-beijing.volces.com/api/v3/chat/completions",
      ApiKey: "zego_test",
      Model: "doubao-1-5-pro-32k-250115",
      SystemPrompt: "You are a friendly AI virtual receptionist. Greet visitors, answer questions, and help with appointments concisely.",
    },
    TTS: {
      Vendor: "ByteDance",
      Params: {
        app: {
          appid: "zego_test",
          token: "zego_test",
          cluster: "volcano_tts",
        },
        audio: {
          voice_type: "zh_female_wanwanxiaohe_moon_bigtts",
        },
      },
    },
    ASR: {
      Vendor: "Tencent",
    },
  });

  if (result.Code === 0 || result.Code === 410001008) {
    return NextResponse.json({ code: 0, agentId });
  }
  return NextResponse.json({ code: result.Code }, { status: 500 });
};

Registration is idempotent: calling it multiple times with the same AgentId returns code 410001008, which the code treats as success. Using "zego_test" as the API key activates the platform’s built-in test mode, so you can evaluate the full pipeline without connecting your own LLM provider.

To use your own LLM, replace the LLM block with any OpenAI-compatible endpoint:

LLM: {
  Url: "https://api.openai.com/v1/chat/completions",
  ApiKey: "sk-your-key",
  Model: "gpt-4o",
  SystemPrompt: "You are a professional virtual receptionist for Acme Corp...",
}
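
Hardcoding a real provider key in a route file risks committing it to version control. A minimal sketch of one alternative, assuming hypothetical environment variable names (LLM_URL, LLM_API_KEY, LLM_MODEL, LLM_SYSTEM_PROMPT) that you would add to the server .env yourself:

```javascript
// Build the LLM block from environment variables instead of literals.
// The variable names are hypothetical; pick whatever fits your deployment.
function buildLlmConfig(env) {
  const apiKey = env.LLM_API_KEY;
  if (!apiKey) throw new Error("LLM_API_KEY is not set");
  return {
    Url: env.LLM_URL || "https://api.openai.com/v1/chat/completions",
    ApiKey: apiKey,
    Model: env.LLM_MODEL || "gpt-4o",
    SystemPrompt: env.LLM_SYSTEM_PROMPT || "You are a professional virtual receptionist.",
  };
}
```

The returned object can be passed as the LLM field of the RegisterAgent request body, for example LLM: buildLlmConfig(process.env).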

Step 5: Create the Digital Human Instance

Once the agent is registered, create a digital human instance to connect the AI pipeline to an RTC room for real-time audio and video streaming. Place this in server/app/api/instance/route.js:

export const POST = async (request) => {
  const body = await request.json();

  const result = await sendAgentRequest("CreateDigitalHumanAgentInstance", {
    AgentId: body.agentId,
    UserId: body.userId,
    RTC: {
      RoomId: body.roomId,
      AgentStreamId: body.agentStreamId,
      AgentUserId: body.agentUserId,
      UserStreamId: body.userStreamId,
    },
    DigitalHuman: {
      DigitalHumanId: body.digitalHumanId,
      ConfigId: "web",
      EncodeCode: "H264",
    },
    MessageHistory: {
      SyncMode: 1,
      Messages: [],
      WindowSize: 10,
    },
  });

  if (result.Code === 0) {
    return NextResponse.json({
      code: 0,
      data: { agentInstanceId: result.Data?.AgentInstanceId },
    });
  }
  return NextResponse.json({ code: result.Code }, { status: 500 });
};

Key parameters:

DigitalHumanId identifies the avatar to render. Use c4b56d5c-db98-4d91-86d4-5a97b507da97 for the public test avatar.
ConfigId: "web" optimizes rendering for browser playback.
EncodeCode: "H264" ensures browser-compatible video decoding.
MessageHistory with WindowSize: 10 enables multi-turn conversation.

The production code also handles concurrent limit errors (codes 410001031 and 410000011) by automatically cleaning up stale instances and retrying, preventing transient failures from breaking the session.
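
The demo's retry code is not reproduced in this guide, so the following is only a sketch of how such cleanup-and-retry logic could look. createInstanceWithRetry and both callback parameters are hypothetical names. The two error codes are the ones named above.

```javascript
// Error codes the guide identifies as concurrency-limit errors.
const CONCURRENT_LIMIT_CODES = new Set([410001031, 410000011]);

// createInstance: async () => API response object with a numeric Code field.
// cleanupStaleInstances: async () => void, removes leftover instances.
// Both are injected so the retry logic stays independent of any specific API call.
async function createInstanceWithRetry(createInstance, cleanupStaleInstances, maxRetries = 1) {
  let result = await createInstance();
  let attempt = 0;
  while (attempt < maxRetries && CONCURRENT_LIMIT_CODES.has(result.Code)) {
    await cleanupStaleInstances(); // free capacity before trying again
    result = await createInstance();
    attempt += 1;
  }
  return result; // the caller still checks result.Code for other failures
}
```

Capping the retries keeps a persistent quota problem from turning into an endless loop. Any other error code is returned to the caller unchanged.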

Step 6: Generate the RTC Token

The browser authenticates with the RTC infrastructure using a ZEGO Token04. This token uses AES-CBC encryption with the server secret. Place this in server/app/api/token/route.js:

import { createCipheriv } from "crypto";
import { NextResponse } from "next/server";

const makeRandomIv = () => {
  const chars = "0123456789abcdefghijklmnopqrstuvwxyz";
  const out = [];
  for (let i = 0; i < 16; i += 1) {
    out.push(chars.charAt(Math.floor(Math.random() * chars.length)));
  }
  return out.join("");
};

const getAlgorithm = (key) => {
  const length = Buffer.from(key).length;
  if (length === 16) return "aes-128-cbc";
  if (length === 24) return "aes-192-cbc";
  if (length === 32) return "aes-256-cbc";
  throw new Error(`Invalid ServerSecret length: ${length}`);
};

const generateToken04 = (appId, userId, secret, effectiveTimeInSeconds, payload = "") => {
  const tokenInfo = {
    app_id: appId,
    user_id: userId,
    nonce: Math.ceil(-2147483648 + 4294967295 * Math.random()),
    ctime: Math.floor(Date.now() / 1000),
    expire: Math.floor(Date.now() / 1000) + effectiveTimeInSeconds,
    payload,
  };

  const iv = makeRandomIv();
  const cipher = createCipheriv(getAlgorithm(secret), secret, iv);
  cipher.setAutoPadding(true);
  const encryptBuf = Buffer.concat([
    cipher.update(JSON.stringify(tokenInfo)),
    cipher.final(),
  ]);

  const b1 = new Uint8Array(8);
  const b2 = new Uint8Array(2);
  const b3 = new Uint8Array(2);
  new DataView(b1.buffer).setBigInt64(0, BigInt(tokenInfo.expire), false);
  new DataView(b2.buffer).setUint16(0, iv.length, false);
  new DataView(b3.buffer).setUint16(0, encryptBuf.byteLength, false);

  const buf = Buffer.concat([
    Buffer.from(b1), Buffer.from(b2), Buffer.from(iv),
    Buffer.from(b3), Buffer.from(encryptBuf),
  ]);

  return `04${Buffer.from(buf).toString("base64")}`;
};

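export const GET = async (request) => {
  const appId = Number(process.env.APP_ID || process.env.ZEGO_APPID || 0);
  const serverSecret = process.env.SERVER_SECRET || process.env.ZEGO_SERVER_SECRET || "";
  // Honor TOKEN_EXPIRE_SECONDS from .env, falling back to one hour
  const expireSeconds = Number(process.env.TOKEN_EXPIRE_SECONDS) || 3600;
  const userId = new URL(request.url).searchParams.get("userId");
  // Reject requests without a userId instead of issuing a token for "null"
  if (!userId) {
    return NextResponse.json({ error: "userId is required" }, { status: 400 });
  }
  const token = generateToken04(appId, userId, serverSecret, expireSeconds);
  return NextResponse.json({ token });
};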
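
When a token is rejected, decoding it locally can help confirm that the layout and expiry are what you expect. The sketch below reverses the byte layout written by generateToken04. decodeToken04 is a debugging helper invented for this example, not an SDK function, and it uses CommonJS require so it runs as a standalone script.

```javascript
const { createDecipheriv } = require("crypto");

// Parse and decrypt a token with the layout produced by generateToken04:
// "04" + base64( expire:int64 BE | ivLen:uint16 BE | iv | cipherLen:uint16 BE | ciphertext )
function decodeToken04(token, secret) {
  if (!token.startsWith("04")) throw new Error("Not a version 04 token");
  const buf = Buffer.from(token.slice(2), "base64");
  let offset = 0;
  const expire = Number(buf.readBigInt64BE(offset));
  offset += 8;
  const ivLength = buf.readUInt16BE(offset);
  offset += 2;
  const iv = buf.subarray(offset, offset + ivLength);
  offset += ivLength;
  const cipherLength = buf.readUInt16BE(offset);
  offset += 2;
  const ciphertext = buf.subarray(offset, offset + cipherLength);

  // The key length selects the AES variant, mirroring getAlgorithm.
  const algorithm = `aes-${Buffer.byteLength(secret) * 8}-cbc`;
  const decipher = createDecipheriv(algorithm, secret, iv);
  const json = Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
  return { expire, tokenInfo: JSON.parse(json) };
}
```

Run this only on the server or in a local script, since it needs the server secret. The decoded tokenInfo should show the same app_id, user_id, and expire values that were passed to the generator.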

Step 7: Frontend Connection and Streaming

The React frontend orchestrates the full flow: register the agent, create the instance, get the token, initialize the RTC engine, join the room, publish audio, and receive the avatar’s video stream.

Configure the frontend .env:

VITE_APP_ID=your_app_id
VITE_API_BASE_URL=http://localhost:3000

The startConversation function in src/App.jsx handles the complete sequence:

const clientConfig = {
  appId: Number(import.meta.env.VITE_APP_ID || 0),
  apiBaseUrl: import.meta.env.VITE_API_BASE_URL || "http://localhost:3000",
};

const generateId = (prefix) =>
  `${prefix}_${Date.now()}_${Math.random().toString(36).substring(2, 9)}`;

const startConversation = async () => {
  const userId = generateId("user");
  const roomId = generateId("room");
  const userStreamId = generateId("user_stream");
  const agentStreamId = generateId("agent_stream");
  const agentUserId = generateId("agent_user");

  // Step 1: Register agent
  await fetch(`${clientConfig.apiBaseUrl}/api/agent`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ agentId: "ai_avatar_agent", agentName: "AI Receptionist" }),
  });

  // Step 2: Create digital human instance
  await fetch(`${clientConfig.apiBaseUrl}/api/instance`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      agentId: "ai_avatar_agent", userId, roomId,
      agentStreamId, agentUserId, userStreamId,
      digitalHumanId: "c4b56d5c-db98-4d91-86d4-5a97b507da97",
    }),
  });

  // Step 3: Get RTC token
  const tokenRes = await fetch(`${clientConfig.apiBaseUrl}/api/token?userId=${userId}`);
  const { token } = await tokenRes.json();

  // Step 4: Initialize ZEGO Express SDK
  const { ZegoExpressEngine } = await import("zego-express-engine-webrtc");
  const engine = new ZegoExpressEngine(clientConfig.appId, "");

  // Listen for the avatar's video stream
  engine.on("roomStreamUpdate", async (roomID, updateType, streamList) => {
    if (updateType === "ADD") {
      for (const stream of streamList) {
        const mediaStream = await engine.startPlayingStream(stream.streamID, {
          jitterBufferTarget: 500,
        });
        const remoteView = engine.createRemoteStreamView(mediaStream);
        remoteView.play("remote-video", { enableAutoplayDialog: false });
      }
    }
  });

  // Step 5: Login room
  await engine.loginRoom(roomId, token, {
    userID: userId,
    userName: userId,
  });

  // Step 6: Publish local audio
  const localStream = await engine.createZegoStream({
    camera: { video: false, audio: true },
  });
  await engine.startPublishingStream(userStreamId, localStream);
};

The roomStreamUpdate event fires when the digital human starts streaming. Setting jitterBufferTarget: 500 minimizes playback latency. The avatar’s video renders into the #remote-video DOM element automatically.
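
The handler above plays every stream added to the room and ignores removals. A slightly more defensive variant, sketched below, plays only the avatar's stream and stops playback when that stream is deleted. makeStreamUpdateHandler is a name made up for this example, and it assumes the avatar publishes under the agentStreamId passed when the instance was created.

```javascript
// Build a roomStreamUpdate handler that plays only the agent's stream
// and stops playback when that stream is removed from the room.
// `engine` must expose startPlayingStream, createRemoteStreamView, and
// stopPlayingStream, as the ZEGO Express Web SDK engine does.
function makeStreamUpdateHandler(engine, agentStreamId, containerId) {
  return async (roomID, updateType, streamList) => {
    for (const stream of streamList) {
      if (stream.streamID !== agentStreamId) continue; // ignore unrelated streams
      if (updateType === "ADD") {
        const mediaStream = await engine.startPlayingStream(stream.streamID, {
          jitterBufferTarget: 500,
        });
        const remoteView = engine.createRemoteStreamView(mediaStream);
        remoteView.play(containerId, { enableAutoplayDialog: false });
      } else if (updateType === "DELETE") {
        engine.stopPlayingStream(stream.streamID);
      }
    }
  };
}
```

It would be registered in place of the inline callback, for example engine.on("roomStreamUpdate", makeStreamUpdateHandler(engine, agentStreamId, "remote-video")).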

Microphone control and session cleanup:

// Mute/unmute without destroying the stream
const toggleMic = () => {
  const engine = engineRef.current;
  if (!engine) return;
  const newMicState = !isMicOn;
  engine.muteMicrophone(!newMicState);
  setIsMicOn(newMicState);
};

// Clean up all resources in reverse order
const endConversation = async () => {
  const engine = engineRef.current;
  if (engine) {
    engine.stopPublishingStream(userStreamId);
    engine.logoutRoom(roomId);
    engine.destroyEngine();
  }
  await fetch(`${clientConfig.apiBaseUrl}/api/instance`, {
    method: "DELETE",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ agentInstanceId }),
  });
};

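The muteMicrophone method toggles audio capture without destroying the stream, avoiding re-requesting microphone permissions on unmute. The endConversation function unpublishes the stream, leaves the room, destroys the engine, and deletes the server-side instance, ensuring no orphaned resources remain. Both snippets assume the component stores the engine in engineRef when the conversation starts and keeps roomId, userStreamId, and the agentInstanceId returned by /api/instance in refs or state, since startConversation declares them as local variables.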
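
The client sends DELETE /api/instance, but the matching server handler is not shown in this guide. The following is a sketch of its core logic under two loud assumptions: that the delete action is named "DeleteAgentInstance" and that it takes an AgentInstanceId field. Confirm both against the AI Agent API reference before relying on them. The signed-request helper from Step 3 is passed in as an argument so the function carries no framework imports.

```javascript
// Core logic for DELETE /api/instance.
// ASSUMPTION: action name "DeleteAgentInstance" and field "AgentInstanceId"
// are not confirmed by this guide; verify them in the API reference.
async function deleteInstance(sendAgentRequest, agentInstanceId) {
  if (!agentInstanceId) {
    return { status: 400, body: { code: -1, message: "agentInstanceId is required" } };
  }
  const result = await sendAgentRequest("DeleteAgentInstance", {
    AgentInstanceId: agentInstanceId,
  });
  if (result.Code === 0) {
    return { status: 200, body: { code: 0 } };
  }
  return { status: 500, body: { code: result.Code } };
}

// In server/app/api/instance/route.js this could be wired up as:
//   export const DELETE = async (request) => {
//     const { agentInstanceId } = await request.json();
//     const { status, body } = await deleteInstance(sendAgentRequest, agentInstanceId);
//     return NextResponse.json(body, { status });
//   };
```

Returning a plain status and body keeps the logic testable with a stubbed sendAgentRequest, independent of Next.js.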

Step 8: Run and Test

Start the server and frontend in separate terminals:

# Terminal 1: Server
cd server
npm install && npm run dev

# Terminal 2: Frontend
cd web-react
npm install && npm run dev

Open the browser at http://localhost:5173, click “Start Conversation,” and grant microphone access. The AI receptionist avatar appears and begins listening for voice input. Speak naturally, and the receptionist responds with synchronized lip movement and voice in under 1.5 seconds.

Conclusion

An AI virtual receptionist built with ZEGOCLOUD requires three server API routes and a single React component. The platform handles ASR, LLM orchestration, TTS, lip-sync rendering, and WebRTC delivery, so you focus on the receptionist’s personality and business logic. With sub-300ms streaming latency across 200+ global nodes, the result is a responsive, lifelike receptionist that works around the clock. The code in this guide is production-ready: clone it, configure your credentials, and deploy.

FAQ

Q1: What is the best AI virtual receptionist voice technology?

The best AI virtual receptionist voice technology combines real-time ASR, an LLM for understanding intent, a natural-sounding TTS engine, and synchronized lip movement on a digital human avatar. ZEGOCLOUD’s AI Agent API chains these four services into a single pipeline with sub-1.5-second end-to-end latency. The TTS vendor is configurable: ByteDance’s voice engine ships in the default setup, and you can swap in Google Cloud TTS, Amazon Polly, or any provider that fits your language and tone requirements.

Q2: How much does an AI virtual receptionist cost?

Running an AI virtual receptionist on ZEGOCLOUD involves three cost components: the ZEGOCLOUD platform fee (which includes a free tier for development), the LLM provider (OpenAI GPT-4o costs roughly $0.005 per conversation turn, while ByteDance Doubao is significantly cheaper), and the TTS provider (typically $0.01 to $0.03 per minute of generated audio). For a typical small business handling 500 calls per month, the total cost lands between $50 and $150, compared to $300 to $800 for a traditional answering service.
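
The arithmetic behind an estimate like this can be sketched as follows. The per-unit prices come from the answer above. The turns per call, TTS minutes per call, and platform fee are assumptions chosen for illustration and should be replaced with your own traffic data and plan pricing.

```javascript
// Rough monthly cost estimate from per-unit prices.
// turnsPerCall, ttsMinutesPerCall, and platformFee are assumptions, not quoted figures.
function estimateMonthlyCost({
  calls,
  turnsPerCall,
  ttsMinutesPerCall,
  llmCostPerTurn = 0.005, // GPT-4o figure quoted in the answer above
  ttsCostPerMinute = 0.02, // midpoint of the quoted $0.01 to $0.03 range
  platformFee = 0, // placeholder; depends on the ZEGOCLOUD plan
}) {
  const llm = calls * turnsPerCall * llmCostPerTurn;
  const tts = calls * ttsMinutesPerCall * ttsCostPerMinute;
  return { llm, tts, total: llm + tts + platformFee };
}

// Assumed workload: 500 calls, 10 turns and 2 minutes of generated speech per call.
const estimate = estimateMonthlyCost({ calls: 500, turnsPerCall: 10, ttsMinutesPerCall: 2 });
console.log(estimate);
```

Under these assumptions the LLM and TTS usage comes to about $45 per month before any platform fee, so the final figure depends heavily on call length and the plan you choose.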

Q3: Can an AI virtual receptionist handle multiple calls at once?

Yes. Each call creates an independent RTC room and a separate AI agent instance on the server. Because the ASR, LLM, and TTS processing happens on ZEGOCLOUD’s infrastructure, the browser only handles audio capture and video playback. This means you can run hundreds of concurrent receptionist sessions from a single deployment without adding browser-side resources.

Q4: Is an AI virtual receptionist better than a human receptionist?

An AI virtual receptionist is better for consistency, availability, and cost at scale. It works 24/7, handles unlimited concurrent conversations, and delivers the same answer quality on every call. A human receptionist is better for empathy in sensitive situations, complex physical tasks (handing out badges, escorting visitors), and nuanced judgment calls that fall outside the LLM’s training. Most businesses benefit from a hybrid approach: the AI handles routine inquiries and after-hours calls, while human staff focus on high-value interactions.