How to Build a Conversational AI on Android

Building conversational AI for Android is becoming one of the most in-demand capabilities in modern mobile applications. With the rising adoption of real-time voice assistants, interactive agents, and AI-driven communication interfaces, developers are increasingly looking for efficient ways to integrate natural voice interaction into their apps.

In the past, implementing conversational AI on Android required a complex setup spanning multiple technologies: speech recognition, large language models, text-to-speech, and low-latency audio processing. This often meant long development cycles and significant engineering effort.

ZEGOCLOUD simplifies this process by providing ready-to-use components for voice conversational AI. In this tutorial, we will walk through how to create an AI bot for Android using ZEGOCLOUD, from integration to real-time voice interaction. By the end, you will have a working demo that supports natural, continuous voice conversations powered by AI, suitable for production environments. Let’s get started.

Conversational AI for Android Built by ZEGOCLOUD

ZEGOCLOUD provides a conversational AI solution tailored for Android applications. It helps developers add voice interactions that understand user speech, generate LLM-based replies, and return spoken responses through text-to-speech. The SDK is designed to fit naturally into Android projects, whether you are working with Kotlin, Java, or Jetpack-based UI frameworks.

The Android solution supports fast voice capture, accurate speech recognition, and multi-turn conversation logic. It works smoothly with native Android APIs for permissions, audio input, and lifecycle control, which makes the integration process direct and predictable. You can build features that run well across different Android devices, screen sizes, and hardware levels.

Developers can customize prompts, connect to different AI model endpoints, or design UI components that follow Android patterns such as Material Design. Use it to create AI assistants, language learning apps, customer support agents, or voice companions that operate within your own activity or fragment layouts. With ZEGOCLOUD’s tools, you can bring conversational AI to Android and build reliable voice experiences that feel native to the platform.

Download Complete Source Code

Why reinvent the wheel? Here’s the complete working code:

| Component | Repository |
| --- | --- |
| Backend Server + Web Client | github.com/ZEGOCLOUD/blog-aiagent-server-and-web |
| Android Client | github.com/ZEGOCLOUD/blog-aiagent-android |

Prerequisites

Before moving into implementation, ensure that the basic development environment is prepared:

  • Android Studio (latest stable version)
  • A ZEGOCLOUD account for accessing SDK features and console configuration
  • Familiarity with Kotlin, as the tutorial will be based on modern Android development practices
  • A backend service for handling AI interaction logic. In this example, we will use a simple Next.js deployment on Vercel to manage requests

Once these prerequisites are ready, you can proceed with the integration steps and begin building the conversational AI experience on Android.

Architecture Overview

Before moving into the implementation, it is helpful to understand the basic workflow of ZEGOCLOUD’s Conversational AI and the roles of both the client and the server in the process.

Why Do We Need Both Server and Client?

ZEGOCLOUD’s Conversational AI is a cloud-based service that handles the core processing required for real-time voice interaction. It performs automatic speech recognition to convert audio into text, leverages large language models to generate intelligent responses, and applies text-to-speech to output natural voice replies. This allows the application to provide an end-to-end conversational experience without requiring complex local processing.

| Component | Responsibilities |
| --- | --- |
| Backend Server | Securely stores credentials (AppID, ServerSecret, LLM API keys), generates authentication tokens, and manages the AI agent lifecycle via ZEGO's server APIs |
| Android Client | Captures the user's voice, streams audio to/from ZEGO's RTC cloud, plays AI responses, and displays real-time subtitles |
| ZEGOCLOUD | Hosts the AI Agent that processes audio streams, runs the ASR → LLM → TTS pipeline, and manages real-time communication |

System Architecture

This architecture ensures that all sensitive credentials remain securely on your server, while the client and ZEGOCLOUD handle real-time communication and AI processing. It provides a secure and efficient foundation for delivering seamless voice-based conversational experiences.

(Figure: system architecture diagram showing steps 1-11 of the workflow between the Android client, your backend server, and ZEGOCLOUD)

How to Build a Real-time Conversational AI on Android

To build a real-time conversational AI on Android, you mainly need to handle voice input, convert it to text, send it to an AI model, and return a spoken reply. Below are the basic steps using ZEGOCLOUD’s Conversational AI SDK.

Step 1: Set Up Your Backend Server

The backend server is responsible for generating authentication tokens and handling communication with ZEGOCLOUD’s AI Agent APIs. For this tutorial, we will use a simple backend built with Next.js, which can be deployed on Vercel for convenience and fast configuration.

1.1 Environment Variables

Create a .env.local file with the following configuration:

# ZEGO Configuration (from ZEGOCLOUD Console: https://console.zegocloud.com/)
NEXT_PUBLIC_ZEGO_APP_ID=your_app_id
ZEGO_SERVER_SECRET=your_server_secret_32_chars

# AI Agent Configuration
ZEGO_AGENT_ID=aiAgent1
ZEGO_AGENT_NAME=AI Assistant

# System Prompt - Define your AI's personality
SYSTEM_PROMPT="You are my best friend who I can talk to about anything. You're warm, understanding, and always there for me. Respond naturally like a close friend would - be supportive, share opinions, and engage in genuine conversation."

# LLM Configuration (Large Language Model)
LLM_URL=https://your-llm-provider.com/api/chat/completions
LLM_API_KEY=your_llm_api_key
LLM_MODEL=your_model_name

# TTS Configuration (Text-to-Speech)
TTS_VENDOR=ByteDance
TTS_APP_ID=zego_test
TTS_TOKEN=zego_test
TTS_CLUSTER=volcano_tts
TTS_VOICE_TYPE=zh_female_wanwanxiaohe_moon_bigtts

How to obtain these values:

| Variable | Description | How to Obtain |
| --- | --- | --- |
| NEXT_PUBLIC_ZEGO_APP_ID | Your ZEGO application ID | ZEGOCLOUD Console → Project Settings |
| ZEGO_SERVER_SECRET | 32-character secret key | ZEGOCLOUD Console → Project Settings |
| SYSTEM_PROMPT | AI personality definition | Customize based on your use case |
| LLM_URL | LLM API endpoint | From your LLM provider (OpenAI, Doubao, Claude, etc.) |
| LLM_API_KEY | LLM authentication key | From your LLM provider dashboard |
| LLM_MODEL | Model name to use | e.g., gpt-4, doubao-1-5-pro-32k |
| TTS_* | Text-to-Speech config | Use zego_test for testing, or your own TTS service |

1.2 Token Generation API

The token API generates authentication tokens for clients to connect to ZEGO’s RTC service:

// app/api/zego/token/route.ts
import { NextRequest, NextResponse } from 'next/server';
import crypto from 'crypto';

function generateToken(appId: number, userId: string, secret: string,
                       effectiveTimeInSeconds: number): string {
  const tokenInfo = {
    app_id: appId,
    user_id: userId,
    nonce: Math.floor(Math.random() * 2147483647),
    ctime: Math.floor(Date.now() / 1000),
    expire: Math.floor(Date.now() / 1000) + effectiveTimeInSeconds,
    payload: ''
  };

  const plainText = JSON.stringify(tokenInfo);
  const nonce = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', secret, nonce);
  const encrypted = Buffer.concat([cipher.update(plainText, 'utf8'),
                                   cipher.final(), cipher.getAuthTag()]);

  const expireBuf = Buffer.alloc(8);
  expireBuf.writeBigInt64BE(BigInt(tokenInfo.expire), 0);

  const buf = Buffer.concat([
    expireBuf,                       // expire time (8 bytes, big-endian)
    Buffer.from([0, 12]), nonce,     // IV length (2 bytes) + 12-byte IV
    Buffer.from([encrypted.length >> 8, encrypted.length & 0xff]), encrypted,
    Buffer.from([1])                 // encryption mode: 1 = GCM
  ]);
  return '04' + buf.toString('base64');
}

export async function POST(request: NextRequest) {
  const { userId } = await request.json();
  const token = generateToken(
    parseInt(process.env.NEXT_PUBLIC_ZEGO_APP_ID!),
    userId,
    process.env.ZEGO_SERVER_SECRET!,
    3600
  );
  return NextResponse.json({ token });
}

1.3 AI Agent API with Signature

The signature mechanism ensures secure communication with ZEGO’s server APIs:

// app/api/zego/utils.ts
import crypto from 'crypto';

// Generate API signature: MD5(AppId + SignatureNonce + ServerSecret + Timestamp)
export function generateSignature(appId: number, signatureNonce: string,
                                  serverSecret: string, timestamp: number): string {
  const str = appId.toString() + signatureNonce + serverSecret + timestamp.toString();
  return crypto.createHash('md5').update(str).digest('hex');
}

// Send request to ZEGO AI Agent API
export async function sendZegoRequest<T>(action: string, body: object): Promise<T> {
  const appId = parseInt(process.env.NEXT_PUBLIC_ZEGO_APP_ID!);
  const serverSecret = process.env.ZEGO_SERVER_SECRET!;
  const signatureNonce = crypto.randomBytes(8).toString('hex');
  const timestamp = Math.floor(Date.now() / 1000);
  const signature = generateSignature(appId, signatureNonce, serverSecret, timestamp);

  const url = new URL('https://aigc-aiagent-api.zegotech.cn');
  url.searchParams.set('Action', action);
  url.searchParams.set('AppId', appId.toString());
  url.searchParams.set('SignatureNonce', signatureNonce);
  url.searchParams.set('Timestamp', timestamp.toString());
  url.searchParams.set('Signature', signature);
  url.searchParams.set('SignatureVersion', '2.0');

  const response = await fetch(url.toString(), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body)
  });

  const result = await response.json();
  return result.Data as T;
}

1.4 Deploy to Vercel

Deploy your Next.js server to Vercel with these steps:

  • Push your code to GitHub
  • Go to Vercel and import your repository
  • Add all environment variables in Vercel’s project settings
  • Click “Deploy”

Your server will be available at https://your-project.vercel.app.

Step 2: Build the Android Client

Now, let’s build conversational AI for Android step by step.

2.1 Create Android Project

Create a new Android project in Android Studio with:

  • Language: Kotlin
  • Minimum SDK: API 24

2.2 Configure Dependencies

CRITICAL: Use AI Agent Optimized SDK

You MUST use the AI Agent optimized version of ZEGO Express SDK to receive subtitle messages via onRecvExperimentalAPI. The standard SDK from Maven (im.zego:express-video) does NOT support this feature, and subtitles will not work!

Download the optimized SDK from: ZEGO AI Agent SDK Download Page

Step 1: Download and integrate the SDK manually

  • Download the AI Agent optimized SDK (v3.22.0.46173 or later) from the link above
  • Extract the ZIP and copy the files into your project (resulting layout shown below):
      ◦ ZegoExpressEngine.jar → app/libs/
      ◦ arm64-v8a/libZegoExpressEngine.so → app/libs/arm64-v8a/
      ◦ armeabi-v7a/libZegoExpressEngine.so → app/libs/armeabi-v7a/
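
After copying, the app module should contain the following (this is the layout the jniLibs.srcDirs setting in the next step points at):

app/
└── libs/
    ├── ZegoExpressEngine.jar
    ├── arm64-v8a/
    │   └── libZegoExpressEngine.so
    └── armeabi-v7a/
        └── libZegoExpressEngine.so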

Step 2: Configure app/build.gradle

android {
    // ...
    defaultConfig {
        ndk {
            abiFilters 'armeabi-v7a', 'arm64-v8a'
        }
    }

    sourceSets {
        main {
            jniLibs.srcDirs = ['libs']
        }
    }
}

dependencies {
    // ZEGO Express SDK (AI Agent optimized version - DO NOT use Maven version!)
    implementation files('libs/ZegoExpressEngine.jar')

    // Network & JSON
    implementation 'com.squareup.okhttp3:okhttp:4.12.0'
    implementation 'com.google.code.gson:gson:2.10.1'

    // Coroutines for async operations
    implementation 'org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3'
}

Why not use Maven? The AI Agent optimized SDK includes special features for receiving room channel messages (subtitles) that are not available in the standard public SDK. This is required for the onRecvExperimentalAPI callback to work properly.

2.3 Required Permissions

Add these permissions to AndroidManifest.xml:

<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.MODIFY_AUDIO_SETTINGS" />

2.4 App Configuration

Create a configuration file to centralize all settings. This makes it easy to switch between environments:

// config/AppConfig.kt
object AppConfig {
    // ZEGO App ID - must match your backend server
    const val APP_ID: Long = 123456789L  // Replace with your App ID

    // Backend server URL (your Vercel deployment)
    const val SERVER_URL = "https://your-project.vercel.app"

    // Generate unique IDs for testing
    fun generateUserId(): String = "user${System.currentTimeMillis() % 100000}"
    fun generateRoomId(): String = "room${System.currentTimeMillis() % 100000}"
}

Important: The APP_ID here must match NEXT_PUBLIC_ZEGO_APP_ID in your backend’s .env.local file.

2.5 Initialize ZEGO Express Engine

Create a manager class with optimized audio settings for AI conversation:

class ZegoExpressManager(private val application: Application) {
    private var engine: ZegoExpressEngine? = null

    fun initEngine() {
        val profile = ZegoEngineProfile().apply {
            appID = AppConfig.APP_ID  // Use config value
            scenario = ZegoScenario.HIGH_QUALITY_CHATROOM
            application = this@ZegoExpressManager.application
        }

        engine = ZegoExpressEngine.createEngine(profile, eventHandler)

        // Optimize audio settings for AI conversation
        engine?.apply {
            enableAGC(true)  // Automatic Gain Control
            enableANS(true)  // Automatic Noise Suppression
            setANSMode(ZegoANSMode.MEDIUM)
        }
    }
}
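
The initEngine call above passes an eventHandler that is not shown. Below is a minimal sketch of it, assuming the AI Agent optimized SDK exposes the same IZegoEventHandler callbacks as the standard Express SDK; the onRecvExperimentalAPI property is the hook that the subtitle code in section 2.7 assigns:

// Inside ZegoExpressManager -- assigned by the Activity (see section 2.7)
var onRecvExperimentalAPI: ((String) -> Unit)? = null

private val eventHandler = object : IZegoEventHandler() {
    // Room channel messages (AI subtitles) arrive through this callback
    // when using the AI Agent optimized SDK.
    override fun onRecvExperimentalAPI(content: String) {
        this@ZegoExpressManager.onRecvExperimentalAPI?.invoke(content)
    }

    // Helpful while integrating: log room connection state changes.
    override fun onRoomStateChanged(
        roomID: String, reason: ZegoRoomStateChangedReason,
        errorCode: Int, extendedData: JSONObject
    ) {
        Log.d("ZegoManager", "Room state: $reason (error $errorCode)")
    }
}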

2.6 Room Login, Publishing & Playing Streams

Implement the core RTC logic. This covers steps 3 and 9 from the architecture diagram:

// Login to room with token
fun loginRoom(roomId: String, userId: String, token: String, callback: (Int) -> Unit) {
    val user = ZegoUser(userId)
    val config = ZegoRoomConfig().apply {
        this.token = token
    }
    engine?.loginRoom(roomId, user, config) { errorCode, _ ->
        callback(errorCode)
    }
}

// Start publishing local audio (user's voice)
fun startPublishing(streamId: String) {
    engine?.startPublishingStream(streamId)
}

// Start playing remote audio (AI agent's voice)
fun startPlaying(streamId: String) {
    engine?.startPlayingStream(streamId)
}

// Leave the room and release streams (used when ending the call)
fun logoutRoom(roomId: String) {
    engine?.logoutRoom(roomId)
}
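
The ViewModel in section 2.9 calls loginRoom as a suspend function that returns the error code directly. A thin wrapper over the callback API above bridges the two; a minimal sketch (this three-argument suspend overload is our own helper, not part of the SDK):

import kotlinx.coroutines.suspendCancellableCoroutine
import kotlin.coroutines.resume

// Suspend overload: val code = zegoManager.loginRoom(roomId, userId, token)
suspend fun ZegoExpressManager.loginRoom(
    roomId: String, userId: String, token: String
): Int = suspendCancellableCoroutine { cont ->
    loginRoom(roomId, userId, token) { errorCode ->
        if (cont.isActive) cont.resume(errorCode)
    }
}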

2.7 Room Login, Publishing & Playing Streams

Alright, here’s the core RTC logic – this is steps 3 and 9 from our architecture diagram:

// Login to room with token
fun loginRoom(roomId: String, userId: String, token: String, callback: (Int) -> Unit) {
    val user = ZegoUser(userId)
    val config = ZegoRoomConfig().apply {
        this.token = token
    }
    engine?.loginRoom(roomId, user, config) { errorCode, _ ->
        callback(errorCode)
    }
}

// Start publishing local audio (user's voice)
fun startPublishing(streamId: String) {
    engine?.startPublishingStream(streamId)
}

// Start playing remote audio (AI agent's voice)
fun startPlaying(streamId: String) {
    engine?.startPlayingStream(streamId)
}

2.7 Display Subtitles

ZEGO provides official subtitle components that handle message parsing, sorting, and UI display. Use AudioChatMessageParser to parse messages and AIChatListView to display them:

// Create the parser
private val audioChatMessageParser = AudioChatMessageParser()

// Set up the listener
audioChatMessageParser.setAudioChatMessageListListener(object : AudioChatMessageParser.AudioChatMessageListListener {
    override fun onMessageListUpdated(messagesList: MutableList<AudioChatMessage>) {
        runOnUiThread {
            binding.messageList.onMessageListUpdated(messagesList)
        }
    }
    override fun onAudioChatStateUpdate(statusMessage: AudioChatAgentStatusMessage) {
        // Handle agent status updates if needed
    }
})

// In onRecvExperimentalAPI callback, parse the message
zegoManager.onRecvExperimentalAPI = { content ->
    try {
        val json = JSONObject(content)
        if (json.getString("method") == "liveroom.room.on_recive_room_channel_message") {
            val msgContent = json.getJSONObject("params").getString("msg_content")
            audioChatMessageParser.parseAudioChatMessage(msgContent)
        }
    } catch (e: Exception) { e.printStackTrace() }
}

Official Components: Download AudioChatMessageParser.java and AIChatListView.java from ZEGO AI Agent Subtitle Guide. These handle ASR (Cmd=3) and LLM (Cmd=4) message merging automatically.
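
If you prefer typed parsing over manual JSONObject access, the envelope shown above can be modeled with Gson (already added in section 2.2). A sketch; the field names mirror the payload used in the callback handler above:

// Typed model of the experimental-API envelope
data class ExperimentalApiEnvelope(
    val method: String,
    val params: Params
) {
    data class Params(val msg_content: String)
}

private val gson = Gson()

fun handleExperimentalApi(content: String) {
    runCatching {
        val envelope = gson.fromJson(content, ExperimentalApiEnvelope::class.java)
        if (envelope.method == "liveroom.room.on_recive_room_channel_message") {
            audioChatMessageParser.parseAudioChatMessage(envelope.params.msg_content)
        }
    }.onFailure { it.printStackTrace() }
}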

2.8 UI Layout

The layout is simple: top section for status and controls, bottom for subtitles:

<androidx.constraintlayout.widget.ConstraintLayout
    xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    android:layout_width="match_parent" android:layout_height="match_parent">

    <!-- Top: status and call button -->
    <LinearLayout android:id="@+id/topSection" android:orientation="vertical"
        android:layout_width="match_parent" android:layout_height="wrap_content"
        app:layout_constraintTop_toTopOf="parent">
        <TextView android:id="@+id/tvStatus"
            android:layout_width="wrap_content" android:layout_height="wrap_content" />
        <Button android:id="@+id/btnCall" android:text="Start Call"
            android:layout_width="wrap_content" android:layout_height="wrap_content" />
    </LinearLayout>

    <!-- Bottom: conversation subtitles -->
    <com.zegocloud.aiagent.subtitle.AIChatListView android:id="@+id/messageList"
        android:layout_width="match_parent" android:layout_height="0dp"
        app:layout_constraintTop_toBottomOf="@id/topSection"
        app:layout_constraintBottom_toBottomOf="parent" />
</androidx.constraintlayout.widget.ConstraintLayout>
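
Wiring this layout to the call flow is straightforward. A sketch of the Activity side, assuming view binding is enabled and the ViewModel from the next section exposes isConnected as a StateFlow; MainViewModelFactory is a hypothetical helper for injecting the manager:

class MainActivity : AppCompatActivity() {
    private lateinit var binding: ActivityMainBinding
    private val viewModel: MainViewModel by viewModels {
        MainViewModelFactory(zegoManager)  // hypothetical factory injecting the manager
    }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        binding = ActivityMainBinding.inflate(layoutInflater)
        setContentView(binding.root)

        binding.btnCall.setOnClickListener {
            if (viewModel.isConnected.value) viewModel.endCall()
            else startCallWithPermission()  // permission helper from section 2.3
        }

        lifecycleScope.launch {
            viewModel.isConnected.collect { connected ->
                binding.btnCall.text = if (connected) "End Call" else "Start Call"
                binding.tvStatus.text = if (connected) "Connected" else "Tap to start"
            }
        }
    }
}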

2.9 Complete Call Flow

Here’s how all the pieces work together. The ViewModel orchestrates the complete flow when the user taps “Start Call”:

// viewmodel/MainViewModel.kt
class MainViewModel(private val zegoManager: ZegoExpressManager) : ViewModel() {

    // Observable UI state collected by the Activity
    private val _isLoading = MutableStateFlow(false)
    val isLoading: StateFlow<Boolean> = _isLoading
    private val _isConnected = MutableStateFlow(false)
    val isConnected: StateFlow<Boolean> = _isConnected
    private val _error = MutableStateFlow<String?>(null)
    val error: StateFlow<String?> = _error
    private var _currentRoomId: String? = null

    fun startCall() {
        viewModelScope.launch {
            _isLoading.value = true
            try {
                // Generate unique IDs for this session
                val roomId = AppConfig.generateRoomId()
                val userId = AppConfig.generateUserId()
                val userStreamId = "${roomId}_${userId}_main"

                // Initialize ZEGO Engine
                zegoManager.initEngine()

                // Step 1-2: Request token from backend
                val token = ApiService.getToken(userId)
                    ?: throw Exception("Failed to get token")

                // Step 3: Login to ZEGO room
                val loginResult = zegoManager.loginRoom(roomId, userId, token)
                if (loginResult != 0) throw Exception("Failed to login room")

                // Step 3: Publish local audio stream (user's voice)
                zegoManager.startPublishing(userStreamId)

                // Step 4-8: Request backend to create AI Agent
                val agentStreamId = ApiService.startAgent(roomId, userId, userStreamId)
                    ?: throw Exception("Failed to start agent")

                // Step 9: Play AI Agent's audio stream
                zegoManager.startPlaying(agentStreamId)

                _isConnected.value = true
                _currentRoomId = roomId
            } catch (e: Exception) {
                _error.value = e.message
            } finally {
                _isLoading.value = false
            }
        }
    }

    // Step 10-11: End call and cleanup
    fun endCall() {
        viewModelScope.launch {
            _currentRoomId?.let { roomId ->
                ApiService.stopAgent(roomId)
                zegoManager.logoutRoom(roomId)
            }
            _isConnected.value = false
            _currentRoomId = null
        }
    }
}

This flow corresponds to steps 1-11 in the System Architecture diagram:

| Step | Action | Code |
| --- | --- | --- |
| 1-2 | Request token | ApiService.getToken() |
| 3 | Login room & publish stream | loginRoom() + startPublishing() |
| 4-8 | Start AI agent | ApiService.startAgent() |
| 9 | Play agent stream | startPlaying() |
| 10-11 | Stop AI agent | ApiService.stopAgent() |
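
The ApiService referenced above is a small HTTP helper that is not shown in the snippets. A minimal sketch using the OkHttp and Gson dependencies from section 2.2; the /api/zego/token route is the one built in step 1.2, while the agent route names and response fields here are assumptions that must match whatever your backend actually exposes:

// network/ApiService.kt
import com.google.gson.Gson
import com.google.gson.JsonObject
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

object ApiService {
    private val client = OkHttpClient()
    private val gson = Gson()
    private val jsonType = "application/json; charset=utf-8".toMediaType()

    private suspend fun post(path: String, body: Map<String, String>): JsonObject? =
        withContext(Dispatchers.IO) {
            val request = Request.Builder()
                .url("${AppConfig.SERVER_URL}$path")
                .post(gson.toJson(body).toRequestBody(jsonType))
                .build()
            client.newCall(request).execute().use { resp ->
                if (!resp.isSuccessful) return@use null
                gson.fromJson(resp.body?.string(), JsonObject::class.java)
            }
        }

    suspend fun getToken(userId: String): String? =
        post("/api/zego/token", mapOf("userId" to userId))?.get("token")?.asString

    // Assumed route: asks the backend to create the AI agent (steps 4-8)
    // and returns the agent's stream ID.
    suspend fun startAgent(roomId: String, userId: String, userStreamId: String): String? =
        post("/api/zego/start", mapOf(
            "roomId" to roomId, "userId" to userId, "userStreamId" to userStreamId
        ))?.get("agentStreamId")?.asString

    // Assumed route: stops the agent for the given room (steps 10-11).
    suspend fun stopAgent(roomId: String) {
        post("/api/zego/stop", mapOf("roomId" to roomId))
    }
}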

Step 3: Run the Demo

Build and run the app on a physical device (microphone behavior on emulators can be unreliable). Grant the microphone permission when prompted, then tap Start Call: your voice is streamed to the AI agent, its spoken reply plays back in real time, and live subtitles for both sides appear in the conversation list.

Conclusion

And with that, you now have a fully functional conversational AI system running on Android. What once required weeks of development can now be achieved in a much shorter time, thanks to simplified workflows and modern real-time AI tooling.

Here is a quick overview of what you accomplished:

  • A secure backend with proper token generation
  • A clean and modular Android client built with the ZEGO Express SDK
  • Real-time voice interaction with AI
  • Live subtitle support for improved accessibility

A few practical recommendations as you continue exploring:

  • Refine the SYSTEM_PROMPT to shape your AI’s role, tone, and behavior. This step significantly impacts the user experience.
  • Consider integrating visual avatars for a more engaging interface.
  • Always validate performance on real devices, as microphone handling can differ from emulator environments.

This architecture is flexible and can be extended to customer support assistants, language learning platforms, AI companions, and many other real-time interaction scenarios.

If you are interested in building your own voice-driven application, get started with ZEGOCLOUD. The free tier provides more than enough resources for prototyping and experimentation.
