Video Digital Human Shooting Guide

2026-03-20

This article introduces how to capture your avatar and voice samples.

Note

Avatar and voice capture can be done separately; you don't need to use a camera to record audio.

Prerequisites

Please contact your business or pre-sales representative to discuss your use cases and requirements. We will provide shooting recommendations based on your specific needs.

Avatar Capture

The avatar capture process consists of four steps: prepare hardware, set up the site, shoot the model, and submit files.

1 Prepare Hardware

Please configure your shooting hardware according to one of the following two parameter requirements.

| Item | Parameter Requirement 1 | Parameter Requirement 2 |
| --- | --- | --- |
| Recording Specification | PAL format; 4K 50P | PAL format; 1080p 50P |
| Recording Duration | Greater than 12 min | Greater than 12 min |
| Camera Encoding Format, Bitrate, Sampling Standard | H.264; no bitrate requirement; 8-bit or higher color depth (10-bit 4:2:2 recommended) | H.264; maximum bitrate; 8-bit or higher color depth (10-bit 4:2:2 recommended) |
| Notes | - | When recording at 1080p resolution, try to have the model occupy more pixels in the frame while ensuring body movements don't extend beyond the frame |
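The requirements above can be turned into a quick pre-submission check. The following is a minimal sketch, not a ZEGO tool: the `Recording` class and its field names are hypothetical, "4K" is assumed to mean UHD 3840x2160, and the values are assumed to have been read from the camera or a media inspector beforehand. The "maximum bitrate" requirement for 1080p depends on the camera model and is not checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """Hypothetical summary of one video file; field names are illustrative."""
    width: int
    height: int
    fps: float
    codec: str         # for example "h264"
    bit_depth: int     # bits per color channel
    duration_s: float  # length in seconds


# Assumption: "4K" means UHD 3840x2160 and "1080p" means 1920x1080.
ALLOWED_RESOLUTIONS = {(3840, 2160), (1920, 1080)}
MIN_DURATION_S = 12 * 60  # "Greater than 12 min"


def check_recording(rec: Recording) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if (rec.width, rec.height) not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution {rec.width}x{rec.height} is neither 4K nor 1080p")
    if abs(rec.fps - 50.0) > 0.01:
        problems.append(f"frame rate {rec.fps} is not 50P")
    if rec.codec.lower() != "h264":
        problems.append(f"codec {rec.codec} is not H.264")
    if rec.bit_depth < 8:
        problems.append(f"color depth {rec.bit_depth} bit is below 8 bit")
    if rec.duration_s <= MIN_DURATION_S:
        problems.append("duration must be greater than 12 min")
    return problems


if __name__ == "__main__":
    take = Recording(3840, 2160, 50.0, "h264", 10, 13 * 60)
    print(check_recording(take) or "basic checks passed")
```

A check like this only covers the numeric parameters. Framing, lighting, and keying quality still need to be reviewed by eye.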

2 Set Up the Site

Shoot against a green screen so that the background can be chroma keyed in post-production. Make sure the green screen is flat, without obvious wrinkles. You can use a roll paper background or green screen cloth.

If you use a cloth screen, stretch it as flat as possible with several heavy-duty clips. Wrinkles cause uneven lighting, which makes keying difficult and degrades the final result.

3 Shoot the Model

During the shooting process, the model and director need to complete the following tasks to achieve the best results.

Model Requirements

Note | Details
Styling
  • Avoid bright green, highly reflective, or translucent materials (such as tassels and lace). Also avoid fine stripes and grid patterns, which can cause moiré in the footage.
  • Keep the edges of the model's styling smooth and well defined. Avoid loose or hollow hair edges that leave small gaps through which the green screen is visible.
  • If accessories (such as earrings, ribbons, silk scarves, or tassels) extend beyond the model's outline or swing with movement, replace or secure them.
  • Do not wear small bright green items, such as jade rings, bracelets, or necklaces.
Beginning and End
  • At the start and at the end of recording, the model must hold a neutral pose for 10 s. The choice of neutral pose is up to you, but the mouth must be closed and the body and limbs must stay still. The same neutral pose is used for rhythmic pauses during recording.
Rhythmic Pause
  • After every 3-4 sentences, the model must pause for about 2 s with the mouth closed and the body returned to the neutral pose.
Recording and Lip Sync
  • Throughout the shooting, speaking speed and volume should remain consistent without excessive or rapid fluctuations.
  • The model should not open their mouth without making sound.
  • During speech, the model's lip movements should be obvious to help AI identify lip shape characteristics under different pronunciations, making the digital human effect more realistic.
  • When the model opens their mouth without making sound, coughs, yawns, or sneezes, shooting must be stopped. Resume after adjustments.
Head Movements
  • During shooting, the model should always face the camera lens, avoiding excessive left-right or up-down tilting (only slight head turning or nodding is allowed).
  • When the model makes large head turning movements, shooting must be stopped. Resume after adjustments.
Body Movements
During shooting, the model can use body movements to make the overall appearance vivid and expressive. However, if any of the following rules is violated, reshooting is required:
  • Arms must not block the face.
  • Arms or other body parts must not extend beyond the camera frame.
  • Body movements must return to the neutral pose during rhythmic pauses. Keep the transition natural and at a moderate pace, and avoid abrupt changes.
  • Use generic body movements and avoid gestures with specific meanings (such as OK, numbers, crossed arms indicating negation, or pointing directions). Error examples:
    • OK
    • Two fingers up
    • Arms crossed
    • Showing palm
    • Pointing at camera
    • Pointing with arm

Director Notes

Note | Details
Actor Should Perform Naturally and Expressively
  • Is the script reading fluent and natural? If there are too many pauses, pause the shoot and have the model become familiar with the script before resuming capture.
  • Are the movement transitions smooth? If body language is stiff, design a few gestures in advance and have the model practice them before resuming capture.
Details Affecting Green Screen Shooting Results
  • If the model's clothes are highly reflective or translucent (such as satin, lace, or mesh), remind the model to change.
  • Minimize unnecessary green screen area on set. For example, when shooting in a seated position, remove the green screen under the model's feet to avoid green light reflecting from the floor onto the model.
  • When shooting in a standing position, use transparent acrylic panels or tape to secure the green screen and reduce floor wrinkles.
  • The model should be at least 2 m away from the green screen, and further if conditions allow.
  • You can use side backlighting from behind the model to eliminate green spill on the model's surface and outline the model's silhouette.
  • Use an aperture of F6 or smaller (a larger f-number) so that the model's eyes and outline edges are in sharp focus. Blurred edges greatly degrade post-production green screen keying.
Establish Basic Rapport with the Model
  • Use gestures to remind the model of the shooting progress (how many minutes in).
  • Use specific gestures or a whiteboard to remind the model to hold the 10 s neutral pose at the beginning and end.
  • Pay attention to the model's speech habits, and promptly stop and correct issues such as tone sounds, microphone popping, or opening the mouth without making sound.
Watch for Model's Makeup Changes
  • After multiple failed (NG) takes, facial oiliness increases and the model's appearance on camera starts to change. The director should promptly remind the model to touch up makeup or apply powder.

4 Submit Files

After recording is complete, please submit the video files to ZEGO personnel and indicate the camera brand used and whether log mode was used.

Lighting Setup Reference

Here is a lighting setup for reference: "4 Steps to Create High-Quality Green Screen Keying for Live Streaming". This setup uses dual side backlighting to outline the silhouette, which helps eliminate green reflections on the model's surface when the site is too small to keep the model far enough from the green screen (4 m or less). The main light and fill light in front of the model can be adjusted according to the shooting theme.

Voice Capture

The voice capture process consists of four steps: prepare script, prepare recording equipment, start recording, and end recording and submit.

1 Prepare Script

The script used for voice capture must meet the following requirements:

  • More than 6000 characters.

  • Content should match the digital human's application industry/scenario context.

  • Please refer to the template below to adjust the script format, inserting pauses and instructional notes.

    Game Script

    Opening lines (Can be more enthusiastic) Hello hello, welcome new viewers! Today is a special livestream for [Game Name]. Many benefits will be given to everyone. Let me announce in advance, today's group buying products include various items - food, drinks, and entertainment. We won't disappoint anyone and will definitely surprise you all. (Pause 2s, close mouth naturally)

    Today we're going to play a very interesting game together - "[Game Name]"! In the game, we need to build our own army to fight against the opposing army and ultimately destroy the enemy castle. Viewers with any questions can send messages in the chat at any time! (Pause 2s, close mouth naturally)

    【Gameplay Introduction】 "[Game Name]" is a livestream bullet screen game set in a medieval fantasy continent. Viewers in the livestream chat can input commands to join one side, then recruit soldiers to participate in the battle. During combat, you can summon legions, giants, snowmen, elephant soldiers, and even dragons by gifting items to gain strategic advantage. The goal is to destroy the enemy castle. The game features a fresh and vibrant cartoon art style. The game offers various camera modes for streamers to choose from. Whether soldiers, buildings, or dragons, all have detailed art models. Players can feel the intensity and excitement of a real battlefield fantasy army battle. (Pause 2s, close mouth naturally)

    Welcome to all new viewers in the livestream! Please follow, like, and share our livestream. Love you all! Thank you to our top supporter for the gift support. I hope you have fun in the game and continue to support game livestreams. (Pause 2s, close mouth naturally)

    Next, I'll briefly introduce the game rules. In the game, we're divided into red and blue teams. Each team has its own soldiers, buildings, dragons, etc. Viewers can summon legions, giants, snowmen, elephant soldiers, and even dragons by gifting items to gain strategic advantage. The ultimate goal is to destroy the enemy castle. (Pause 2s, close mouth naturally)

    Tactical strategy is very important in the game. We need to arrange and adjust based on factors like the enemy's formation and terrain. Brothers can also choose different unit combinations based on their preferences and strategies. For example, we can choose units with ranged attack capabilities to protect our position, while also choosing units with strong attack capabilities to directly attack the enemy castle. (Pause 2s, close mouth naturally)

    [Continue with rest of script...] (Pause 2s, close mouth naturally)
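Before recording, it can help to check a draft script against the length requirement and count its pause markers. The sketch below is illustrative only: the function name `summarize_script` is hypothetical, and it assumes that pause notes written as `(Pause 2s, ...)` and whitespace do not count toward the 6000 characters. How ZEGO counts characters is not stated in this guide, so adjust the rule if your representative advises otherwise.

```python
import re

MIN_CHARS = 6000  # from the requirement "More than 6000 characters"

# Matches pause notes in the template style, e.g. "(Pause 2s, close mouth naturally)".
PAUSE_MARKER = re.compile(r"\(Pause \d+s[^)]*\)")


def summarize_script(text: str) -> dict:
    """Count spoken characters and pause markers in a script draft."""
    # Assumption: pause notes and whitespace are not counted as script characters.
    spoken = PAUSE_MARKER.sub("", text)
    spoken_chars = len("".join(spoken.split()))
    return {
        "spoken_chars": spoken_chars,
        "pause_markers": len(PAUSE_MARKER.findall(text)),
        "long_enough": spoken_chars > MIN_CHARS,
    }


if __name__ == "__main__":
    draft = "Hello there. (Pause 2s, close mouth naturally) Bye."
    print(summarize_script(draft))
```

Other instructional notes, such as "(Can be more enthusiastic)", are still counted by this sketch. Strip them first if you want a stricter count.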

2 Prepare Recording Equipment

  • It is recommended to use professional microphones from brands like Rode, DJI, Sony, or Moman.
  • If using a camera for recording, please set the camera's audio recording level to manual mode.
  • If using a computer-connected microphone for recording, please adjust the microphone or audio interface settings.
  • Adjust the distance and position from the microphone to ensure no popping when speaking.

3 Start Recording

After starting recording, please ensure the following requirements are met:

  • No background noise or ambient sounds.
  • The emotion of reading the script should match expectations and remain consistent.
  • Clear pronunciation, distinct enunciation, clear sentence breaks, with 2s pauses between each sentence.

4 End Recording and Submit

After recording ends, please play back and check once to ensure the following valid audio standards are met.

Standard Item | Details
Audio Duration, Format, and Other Parameters
  • Effective audio duration: more than 20 min.
  • File format: WAV. MP3 and AAC are not recommended.
  • Sample rate: 44100 Hz or higher.
  • Bit depth: 16-bit or higher.
Audio Quality
  • Human voice is overall pure and prominent.
  • No voice clipping during speech.
  • No noisy background sounds.
  • No echo or reverb.
  • No microphone popping.
  • No obvious electrical interference sounds.
Voice Recording
  • Clear pronunciation and enunciation.
  • Overall smooth speech with minimal stuttering.
  • Maintain consistent voice tone and emotion throughout recording.
  • Maintain 1s - 2s pauses between sentences.
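The numeric parameters in this checklist can be verified automatically during playback review. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files. The function name `check_wav` is hypothetical. Note that it measures the total file length, which is only an upper bound on the effective duration, because pauses and unusable passages are included.

```python
import wave

MIN_RATE_HZ = 44100       # "44100Hz or higher"
MIN_SAMPLE_BYTES = 2      # 16 bit = 2 bytes per sample
MIN_DURATION_S = 20 * 60  # "More than 20 min"


def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the numeric checks passed."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            sample_bytes = wav.getsampwidth()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # Raised for MP3, AAC, or any file that is not readable PCM WAV.
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if rate < MIN_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_RATE_HZ} Hz")
    if sample_bytes < MIN_SAMPLE_BYTES:
        problems.append(f"bit depth {sample_bytes * 8} bit is below 16 bit")
    duration_s = frames / rate if rate else 0.0
    if duration_s <= MIN_DURATION_S:
        problems.append(f"total duration {duration_s / 60:.1f} min is not more than 20 min")
    return problems


if __name__ == "__main__":
    import sys
    for wav_path in sys.argv[1:]:
        print(wav_path, check_wav(wav_path) or "numeric checks passed")
```

The audio quality and voice items still require listening to the recording. This sketch cannot detect echo, popping, or inconsistent emotion.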

FAQ

  • High background noise and clipping may be caused by leaving the camera's audio recording level on automatic. Adjust the recording device level manually to maintain a good signal-to-noise ratio and avoid obvious background noise. If there is electrical buzzing, consult the device provider to eliminate the electrical noise, or seek help from ZEGO pre-sales service.
  • Control the environment to avoid background voices and traffic sounds.
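A test recording can be screened for clipping before the formal session. The sketch below is one possible approach, not a ZEGO-defined criterion: it reports the fraction of samples at or near full scale in a 16-bit PCM WAV file. The function name `clipped_fraction` and the 0.999 threshold are assumptions, and what fraction counts as unacceptable is left to your judgment.

```python
import array
import sys
import wave


def clipped_fraction(path: str, threshold: float = 0.999) -> float:
    """Fraction of samples whose magnitude reaches `threshold` of 16-bit full scale."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()      # WAV sample data is little-endian
    if not samples:
        return 0.0

    limit = int(32767 * threshold)
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


if __name__ == "__main__":
    for wav_path in sys.argv[1:]:
        print(f"{wav_path}: {clipped_fraction(wav_path):.4%} of samples near full scale")
```

The whole file is loaded into memory and scanned in pure Python, so this is best suited to short test takes rather than the full 20-minute recording. A result well above zero suggests lowering the recording level and recording another test take.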

You can refer to the following steps:

A loose 3.5mm connection cable or recording device malfunction can cause electronic distortion. Please check if the cable is inserted correctly or replace the recording device.

The main cause of microphone popping is when the microphone head is in the airflow of plosive sounds during speech. You can adjust using any of these methods:

  • Adjust the microphone position, such as clipping the microphone on the collar.
  • Adjust the microphone angle to avoid airflow hitting the microphone. Generally, place the microphone diagonally below or above the mouth. You can monitor while adjusting.

Recording in an empty room with bare, flat surfaces produces severe echo. Record in a furnished room instead, and position yourself in a corner rather than in the center of the room.

Device issues and personal pronunciation can both cause muffled sound. If it's a device issue, it's recommended to use the suggested equipment for recording. If it's a personal issue, drinking water to moisten the throat can partially improve it.

Please refer to the following suggestions:

  • Add pause markers in the script to control speaking speed.
  • Try to enunciate clearly and accurately.
  • Consider personal speaking habits. These are not strictly restricted, but note that the cloned voice may reproduce habits such as swallowed words and soft pronunciation.

Generally, using a phone to record audio is not recommended because recording quality is relatively poor. If you must use a phone, we recommend using iPhone 12 or later models. Note that phone recording requires the following:

  1. Go to Settings, select "Voice Memos > Audio Quality > Lossless".
  2. Use the iPhone's built-in Voice Memos for audio recording.
Note

Using a phone for recording is prone to microphone popping. Please be sure to do a test recording and confirm there are no issues before formal recording.

You can record in segments, provided that the voice tone and emotion show no obvious differences between segments. The total effective recording duration must be no less than 20 minutes. Audio with mixed emotions or inconsistent voice tone will be judged unqualified and cannot proceed to the cloning process.

We recommend estimating the script duration in advance. If it's less than 20 min, please prepare more content. During recording, if you finish the script but the audio duration is still not enough, please find another script. You cannot reread content that has already been read.
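The estimate can be done with simple arithmetic once you know the speaker's reading pace. In the sketch below, `chars_per_minute` is a value you measure yourself, for example by timing a one-minute test read. It is not a figure from this guide, and the function names are hypothetical. Pauses are ignored, so treat the result as a rough guide.

```python
import math

TARGET_MIN = 20  # minimum effective duration required by this guide


def estimated_minutes(char_count: int, chars_per_minute: float) -> float:
    """Estimate reading time in minutes from script length and measured pace."""
    return char_count / chars_per_minute


def extra_chars_needed(char_count: int, chars_per_minute: float,
                       target_min: float = TARGET_MIN) -> int:
    """Characters still missing to reach `target_min` minutes (0 if already enough)."""
    missing = target_min * chars_per_minute - char_count
    return max(0, math.ceil(missing))


if __name__ == "__main__":
    # Example paces are placeholders; measure the actual speaker before relying on them.
    for pace in (250, 400):
        print(pace, estimated_minutes(6000, pace), extra_chars_needed(6000, pace))
```

For example, a 6000-character script read at a measured 400 characters per minute lasts about 15 minutes, so roughly 2000 more characters of new content would be needed.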
