Building Real-Time Interaction (RTI) in the Metaverse

The metaverse is one of the most prominent current technology trends, and many vendors are optimistic about its potential. As a result, vendors are actively building their own solutions to bring the metaverse to their respective fields faster. This article looks at real-time interaction in metaverse scenarios and the underlying ZEGOCLOUD technical capabilities that support it.

This article covers three topics: an analysis of real-time interaction (RTI) in the metaverse, the key technical capabilities for metaverse scenarios, and a detailed case study of a ZEGOCLOUD metaverse scenario.

The Metaverse and Real-Time Interaction (RTI)

ZEGOCLOUD recognizes that the metaverse is one of the future trends in internet development. With the advancement and spread of artificial intelligence (AI), real-time communication (RTC), game development, and blockchain, the metaverse is becoming increasingly accessible. We believe that the metaverse can bring new user experiences, a complete commercial ecosystem, enhanced interactivity, and deeper immersion. It can offer experiences that come close to reality or even surpass it, shifting from fulfilling functional needs to fulfilling emotional needs, so that the virtual world feels as real as what users see around them.

In addition, the metaverse will bring new forms of identity and interaction, along with the ability to hold and accumulate valuable digital assets, which creates more business opportunities for enterprises.


From RTC to RTI

Based on this understanding, we realized that real-time communication (RTC) is no longer sufficient to cover all online interactions. Interactivity is especially important in the context of the metaverse. To address this, we have upgraded our product from RTC to real-time interaction (RTI) for the metaverse.

Moving from real-time communication (RTC) to real-time interaction (RTI) expands ZEGOCLOUD’s product capabilities and service offerings. RTI brings together ZEGOCLOUD’s existing capabilities and also sets our future research direction: the pursuit of better interactive experiences that meet emotional needs.

The upgrade from RTC to RTI has enabled ZEGOCLOUD to enhance the following technological capabilities:

  1. Intelligent and realistic visual quality: Mobile real-time super-resolution uses AI prediction to achieve higher resolutions at a lower bandwidth cost. Subject segmentation with transparent (alpha) channel transmission brings together virtual and real elements, enabling diverse live streaming experiences. The self-developed Z264 encoder improves overall video quality under similar conditions, particularly in complex scenes such as those with heavy motion.
  2. Immersive audio quality: AI-based contextual noise reduction identifies and eliminates various types of noise in different scenarios, intelligently switching noise reduction modes based on the scene. Spatial audio allows users to perceive sound from different directions, providing a sense of directionality in interactions.
  3. Infinite scenarios and gameplay possibilities: Real-time synchronization of multiple user states, virtual avatars, and interactive audiovisual experiences for thousands of participants make large-scale interactions more scalable.
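
To make the idea of directional sound more concrete, here is a minimal sketch. It is not ZEGOCLOUD's implementation: it assumes a flat 2D scene, constant-power stereo panning, and inverse-distance attenuation, and the function name `stereo_gains` and its parameters are hypothetical.

```python
import math

def stereo_gains(listener_xy, listener_facing_rad, source_xy, ref_distance=1.0):
    """Return (left_gain, right_gain) for a sound source in a 2D scene.

    Assumed model (not taken from any SDK): constant-power panning based on
    the source's bearing relative to the listener, plus inverse-distance
    attenuation beyond ref_distance. Angles are counterclockwise radians.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)

    # Bearing of the source relative to the direction the listener faces.
    # Positive bearing means the source is to the listener's left.
    bearing = math.atan2(dy, dx) - listener_facing_rad

    # pan is -1.0 for fully left and +1.0 for fully right.
    pan = -math.sin(bearing)

    # Constant-power panning keeps left^2 + right^2 equal to 1 before attenuation.
    angle = (pan + 1.0) * math.pi / 4.0
    attenuation = ref_distance / max(distance, ref_distance)
    return math.cos(angle) * attenuation, math.sin(angle) * attenuation
```

In this simplified model a source directly behind the listener pans the same way as one directly ahead; a production system would typically use a fuller 3D model to distinguish the two.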

Analysis of Key Technical Capabilities in the Metaverse Scenario

After introducing the upgrade strategy and related concepts of ZEGOCLOUD real-time interaction, let’s take a closer look at the popular technical capabilities related to the metaverse.

Massive live co-hosting for 10,000+ participants

It is commonly assumed that if too many participants speak at the same time, the voices become unclear, so products cap the number of simultaneous speakers. In practice, this cap usually stems from technical constraints rather than product management decisions. When participants turn on their microphones, each client publishes its audio and video streams to the server, and a large number of publishers puts significant strain on the server. Traditional RTC approaches therefore limit the number of participants speaking simultaneously in the same room, either on the business side or within the SDK.

No limitations on the number of simultaneous speakers

ZEGOCLOUD RTI, on the other hand, does not impose such limitations on the number of simultaneous speakers. So, is this massive live co-hosting capability useful? The answer is yes. For example, in online events or online concerts with a massive audience of thousands of participants, we not only need to hear the performers’ voices but also the voices of many audience members. The capability of massive live co-hosting for 10,000+ is extremely valuable as it creates an authentic atmosphere and immersion.

The traditional method of publishing audio and video streams from the client to the server and then forwarding each of them is not feasible at the scale of massive live co-hosting for 10,000+ participants. Instead, with ZEGOCLOUD RTI, clients publish their audio and video streams to the server, where the streams are routed and converged at edge nodes before being played back on the clients. The audio each client receives carries the information from all active speakers, preserving as much realism and atmosphere as possible.



Enabling massive live co-hosting poses significant technical challenges:

  1. High concurrency: Through improvements, ZEGOCLOUD can now support up to 1 million simultaneous online users in a single room.
  2. Massive convergence of network traffic and computational load: ZEGOCLOUD separates audio and video streams and converges only the audio streams. To reduce the computational load, the client performs some pre-computation on the data, so that the server can route it directly without repeating those calculations.
  3. Ensuring smooth audio without any dropped frames: ZEGOCLOUD prioritizes the integrity of audio data in each routing stage to prevent interruptions caused by routing strategies.
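
As an illustration of what converging audio streams can mean in the simplest case, the sketch below mixes several frames of signed 16-bit PCM samples into one frame by summing and clipping. This is an assumed, generic mixing technique, not a description of ZEGOCLOUD's edge-node logic, and `mix_frames` is a hypothetical name.

```python
def mix_frames(frames):
    """Mix equally sized frames of signed 16-bit PCM samples into one frame.

    frames is a list of lists of ints, one inner list per speaker.
    Samples are summed per position and clipped to the 16-bit range.
    """
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same number of samples")

    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        # Clip instead of wrapping around, so overloads do not turn into noise.
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

With many simultaneous speakers, plain summation clips often, so real mixers usually add gain control or select only the most active speakers; those refinements are left out here.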

Real-Time Status Synchronization for Multiple Guests

Real-time status synchronization among multiple guests is a common requirement. In the metaverse, however, user status covers far more complex and diverse data, including movement status, virtual avatar actions, facial expressions, and item status. These statuses must also be synchronized in real time; otherwise, interaction becomes choppy and the user experience suffers.

Currently, ZEGOCLOUD achieves real-time signaling with a delay of around 60 ms, thanks to globally unified real-time monitoring and scheduling of signaling, which lets each client connect to a nearby edge node.

To ensure a better user experience in the virtual world, the concept of user perspective is introduced on the server side. The server can dynamically obtain the field of view of the virtual character based on its current position, and provide real-time event notifications to the client with relevant visual information. This allows the client to integrate a sense of direction and spatial awareness into real-time audio interaction, providing a more immersive virtual world experience for end users.
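
The following minimal sketch shows one way a server could decide which other users fall within a character's field of view. It is an assumption for illustration only: it uses a flat 2D world, a cone-shaped view with a distance limit, and hypothetical names (`users_in_view`, `others`), none of which come from the ZEGOCLOUD SDK.

```python
import math

def users_in_view(viewer_pos, viewer_facing_rad, fov_rad, max_distance, others):
    """Return the sorted ids of users inside the viewer's field of view.

    others maps user_id -> (x, y). A user is visible when it is within
    max_distance and within fov_rad / 2 of the viewer's facing direction.
    """
    visible = []
    for user_id, (x, y) in others.items():
        dx = x - viewer_pos[0]
        dy = y - viewer_pos[1]
        distance = math.hypot(dx, dy)
        if distance > max_distance:
            continue
        if distance == 0.0:
            # A user at exactly the same position has no bearing; treat as visible.
            visible.append(user_id)
            continue
        # Signed angle between the facing direction and the direction to the
        # user, normalized into the range [-pi, pi).
        angle = math.atan2(dy, dx) - viewer_facing_rad
        angle = (angle + math.pi) % (2.0 * math.pi) - math.pi
        if abs(angle) <= fov_rad / 2.0:
            visible.append(user_id)
    return sorted(visible)
```

A server built along these lines would rerun the check whenever positions change and notify the client only about users that enter or leave the result.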

Overall architecture

Based on the overall architecture, we can see that:

  1. Easy to integrate: Building the 3D virtual world scene in a metaverse app relies on a 3D engine, so development requires Unity or Unreal Engine (UE). The lowest layer of the ZEGO SDK is written in C, and its public interfaces are exposed in C++ and Unity C#, so it can be built into a mixed-language project as a single module. As a result, both 3D virtual scene developers and business application developers can use the programming languages they already know, without additional learning costs.
  2. Providing a status synchronization server for easy access to all status information: With the status synchronization server provided by ZEGOCLOUD, the business side can easily subscribe to the status information of all users and design the relevant business logic around it.

Although it may sound complex, it is actually easy to implement. After you create an instance and log in, ZEGOCLOUD automatically separates status notifications by type. The business side only needs to handle the notification events it cares about.
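
The pattern of handling only the notification types you care about can be sketched as a small publish/subscribe dispatcher. This is a generic illustration, not the ZEGO SDK interface: the class `StatusNotifier`, its methods, and the status type names are all hypothetical.

```python
from collections import defaultdict

class StatusNotifier:
    """Hypothetical dispatcher that routes status updates to handlers by type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, status_type, handler):
        """Register handler(user_id, payload) for one status type."""
        self._handlers[status_type].append(handler)

    def publish(self, status_type, user_id, payload):
        """Deliver an update to matching handlers and return how many ran."""
        handlers = self._handlers.get(status_type, [])
        for handler in handlers:
            handler(user_id, payload)
        return len(handlers)


# Usage: the business side subscribes only to movement updates.
moves = []
notifier = StatusNotifier()
notifier.subscribe("movement", lambda user_id, payload: moves.append((user_id, payload)))
notifier.publish("movement", "user_1", {"x": 1.0, "y": 2.0})
notifier.publish("expression", "user_1", {"smile": 0.8})  # no subscriber, so nothing runs
```

Keeping handlers keyed by type means adding a new status category does not affect existing subscribers.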

ZEGOCLOUD Virtual Avatar

With the development of AI, especially the widespread use of applications like ChatGPT, AI-generated virtual avatars have become common. So what is the difference between AI-generated virtual avatars and ZEGOCLOUD’s virtual avatars? ZEGOCLOUD’s virtual avatars are not produced by AI generation. Their creation process is more complex and involves stages such as original artwork design, modeling, animation, and rendering. The original artwork is created by professional artists, and we have our own modeling standards and animation processes.

Each step follows ZEGOCLOUD’s design specifications to ensure that the virtual avatars are complete and well-designed, with all the necessary body parts, clothes, accessories, and more. Combining design specifications with AI capabilities, virtual avatars can resemble real humans. In summary, the most significant feature of ZEGOCLOUD’s virtual avatars is the ability to achieve fine-grained control through simple programming.

AI-generated avatar

It utilizes powerful and stable facial recognition technology to analyze and train on a massive amount of data, accurately replicating the facial features and shapes of real human faces in virtual avatars.

Self-defined avatar

It allows for parameter adjustments of various facial features using skeletal animation. It combines synthesized facial features with artistic elements such as makeup and accessories, enabling natural replacements and customizations on virtual avatars.

Avatars expose an extensive set of adjustable parameters, providing a wide range of possibilities for achieving the desired effect. These adjustments can be made manually in the application, programmatically by developers through APIs, or automatically using AI. When a user takes a photo or uploads an image, AI extracts facial features from the image to generate a highly realistic virtual avatar.

Facial expression mirroring and body pose recognition

Facial expression mirroring can be driven either by the real-time camera feed or by dynamic text input. Camera-driven control relies on precise facial keypoint recognition to capture and replicate facial expressions in real time. Body pose recognition recognizes movements in real time from camera input, extracts body position information, and uses it to drive the avatar’s movements. Although camera-driven control places certain requirements on the environment and on the user’s actions, it can be used in many scenarios.

Speech & text driven

The virtual avatars can also be driven by speech and text inputs. Speech-driven control analyzes voice waveform information in real time to drive facial and mouth expressions, resulting in natural and realistic expressions. Text-driven control allows virtual avatars to read out text naturally and to engage in text or voice-based conversations by incorporating automatic speech recognition (ASR) and natural language processing (NLP).
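
A very simplified way to picture speech-driven mouth movement is to map the loudness of each audio frame to a mouth-openness value. The sketch below does only that, using RMS energy and exponential smoothing; it is an assumed stand-in, since real speech-driven animation typically models much more than loudness. The function name `mouth_openness` and its parameters are hypothetical.

```python
import math

def mouth_openness(samples, frame_size=160, full_scale=32768.0, smoothing=0.5):
    """Map 16-bit PCM samples to one mouth-openness value in [0, 1] per frame.

    Each full frame's RMS level is normalized by full_scale, then blended
    with the previous value so the mouth does not flicker between frames.
    A trailing partial frame is ignored.
    """
    openness = []
    previous = 0.0
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        target = min(1.0, rms / full_scale)
        # Exponential smoothing between the previous value and the new target.
        previous = smoothing * previous + (1.0 - smoothing) * target
        openness.append(previous)
    return openness
```

With the default `frame_size` of 160 samples, each value covers 10 ms of audio at an assumed 16 kHz sample rate.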

ZEGOCLOUD’s design team has created virtual avatars in various styles, including realistic, cartoon-styled, and anime-styled avatars.

ZEGOCLOUD supports integrating virtual avatars with RTC (Real-Time Communication), enabling multiple virtual avatars to interact within the same space. One virtual avatar can observe the movements, expressions, actions, and audiovisual interactions of other virtual avatars in the shared space. This integration allows users to experience the emotional changes of others in the virtual world, similar to the real world.

Case Study of ZEGOCLOUD Metaverse Scenario

After discussing ZEGOCLOUD virtual avatars, let’s look at a complete case study of real-time interaction in the metaverse.

In this case study, virtual avatars can change their outfits and customize their faces. The overall scenario is a karaoke room. Virtual avatars interact with virtual objects, for example by sitting on chairs, holding microphones to sing, giving gifts, and performing dance movements. Virtual avatars also interact with real humans, singing and chatting together. If users prefer not to show their faces, the app can capture and render their facial expressions on the avatar while synchronizing the audio.

Overall Framework of the Case Study

The overall framework consists of two main parts: programming and 3D art resource design.

3D art resource design: ZEGOCLOUD collaborates closely with multiple professional 3D art teams and can provide various blockouts for scenarios. Decorations are designed and dynamically updated on top of the blockout to produce the final presentation, which is then arranged with ZEGOCLOUD’s orchestration tool. Clients can also supply their own concept art for modeling; the resulting assets are interactively orchestrated using the ZEGO SDK and loaded dynamically.

Programming: The software system solution is complex and can be divided across multiple processes. The one described here is the dual-process solution: the host app runs the main RTC business logic, and the entire metaverse environment runs in an independent process.

Unity displays the virtual environment UI, calls ZEGOCLOUD’s avatar-driving capabilities, and communicates across processes with the communication capabilities, so that the virtual avatar’s position signaling and status synchronization signaling work together. Finally, communication across multiple clients is achieved through ZEGOCLOUD’s servers.


The MetaWorld SDK provides basic capabilities. It also offers more advanced capabilities, such as interactive componentization, super screens, orchestration of subject segmentation, and other interesting features. In addition to static orchestration, we also provide dynamic orchestration capabilities. It is possible to dynamically orchestrate resources entirely through the app’s interface capabilities, including styles, positions, and interaction methods for characters and objects. We have launched a solution for creating a virtual world, which uses this dynamic orchestration of resources to initially provide an empty space and allow players to create their own worlds.

This concludes our overview of ZEGOCLOUD’s key underlying technologies for real-time interaction in the metaverse. We look forward to creating more possibilities!
