Interactive Live Streaming in Audio/Video Technology Boom

The developments in audio and video technology are changing our lives, and there are emerging needs for Interactive Live Streaming.

Key trends in technology include immersion, high fidelity, and strong interaction.
Rich social interaction experiences in virtual spaces are the most typical immersive live-streaming scenario.
Digital human (virtual avatar) live streaming is becoming a trend, with “controllability” being its most significant advantage.
Users desire virtual interaction experiences that feel like face-to-face conversations.
Advancements in virtual spatial interaction technology are needed to deliver more immersive experiences.
ZEGOCLOUD Real-Time Interaction (RTI) offers high-definition, low-latency, immersive experiences with intense interaction.
With robust RTC basic capabilities and 5G network development, the future will bring new phenomena, such as remote car insurance claims and many other exciting applications.

From teleconferences to live streaming entertainment and discussions of future interactions in the metaverse, audio and video technology plays a crucial role in changing our lives. The latest audio and video technology has also breathed new life into interactive live streaming.

So what’s new about audio and video technology? Are there any emerging needs in interactive live streaming? What can we learn from state-of-the-art audio and video technology?

KL, the video processing mastermind at ZEGOCLOUD with 18 years of experience and over 120 patents filed, shared his insights at a recent technical salon.

Video Processing and AI Optimization in Interactive Live Streaming Solutions Experience

Q: Can you tell us a little bit about yourself? You can focus on your work experience.

KL: At ZEGOCLOUD, my primary responsibilities include video processing, technology research and development, and AI inference engine optimization. With 18 years of experience in business and technical fields, I’ve worked on various projects, including a naked-eye 3D display system based on depth cameras in 2007 and a system combining images or videos from multiple cameras onto a curved screen. My expertise is in video processing, including video enhancement using traditional AI technology and camera 3A algorithms. I’ve also applied for over 120 patents in this field.

Q: The evolution and iteration of audio and video technology are very fast. Can you share some impressive technological breakthroughs in your past decades of career?

KL: From early on, both audio and video have been evolving toward higher definition, smoother playback, and more real-time performance. Looking ahead in Interactive Live Streaming, three technology trends are developing:

First, immersion provides users with an immersive experience by clearly portraying facial features and clearly conveying sound.

Second, high fidelity – using holographic projection or virtual remote control to generate realistic human images in virtual spaces.

Third, strong interaction – emphasizing strong interaction in the metaverse social field. This also applies when communicating with clients and users.

To let the three trends sink in, I will give two examples, both of which are based on my previous projects.

The first project was in 2007, when we developed an end-to-end demo for a naked-eye 3D display system based on depth cameras. Our main goal was to create a stereoscopic and realistic image that could be viewed by the naked eye from a 3D display.

Another project was a highly interactive pseudo-AI live class app developed by one of our customers in 2020. Usually, a true live class costs $40 per lesson. Our client’s approach to this project involved converting live-streamed classes into recorded videos, thus reducing teacher costs. The app pushed the recorded videos and set up various questions and interactive segments based on the lecture content, guiding and encouraging students to participate in class interactions. This way, the students still get to have a live class experience. It significantly reduced the cost per lesson to only $1 without compromising user experience. This makes high-quality online courses accessible to underdeveloped regions, demonstrating the positive changes technological progress brings.

Emerging scenarios and technical challenges in interactive live streaming

Q: The year 2020 was dubbed the first year of the metaverse, and the metaverse has been a trending topic in recent years. This year, new scenarios such as immersive live streaming have emerged. Have you observed any other interesting scenarios for metaverse applications?

KL: Well, it’s up to someone else to define what’s interesting, but we can draw inspiration from our customers. At ZEGOCLOUD, we strive to build better and more comprehensive solutions to meet the needs of our customers and users. Based on customer feedback, we’ve identified two typical scenarios. The first is focused on creating a rich social interaction experience in virtual spaces. People have diverse social interaction needs, including one-to-one and group communication, which require the perception of the five senses, spatial audio feedback, voice interaction, body language interaction, and expression interaction. The second scenario is centered around entertainment, such as gaming, live streaming, and video on demand (VOD), with most applications initially developed by gaming manufacturers.

Q: Immersive live streaming is expected to be the next magnet for investors. What do you think? What is behind this trend?

KL: Starting in 2020, we began investing in the development of high-fidelity digital humans. In 2022, we published a paper in Neurocomputing titled “Progressive representation recalibration for lightweight super-resolution.” Based on our ongoing research on live streaming technology and our industry observations, we expect high-fidelity digital human (or virtual avatar) live streaming to become a trend. Moreover, the development of digital human technology, including 3D facial reconstruction, body motion capture, rendering, synthesis, and intelligent interaction, will significantly reduce the cost of digital human production, making high-fidelity digital human live streaming accessible to more users.

What is the most significant advantage of digital humans? The answer is “controllability.” Unlike actual human live streaming, where presenters may leave or pursue other opportunities, virtual human presenters obey all commands. In immersive live streaming, we can also build and customize easily switchable scenes for users at very low construction cost. Through digital technology, especially developments based on NeRF technology, users can enjoy a brand-new experience.

Q: Nowadays, users have increasingly higher expectations for live streaming experiences, such as immersive experiences and high-resolution videos. What are the unique features of user demands in new live-streaming interactive scenarios, and what functionalities do they require the most?

KL: We have learned from our communication with customer companies that users need powerful basic capabilities to cover a broader range of mobile devices. Hence, we emphasize high fidelity, high definition, smoothness, and real-time performance. However, achieving high definition and real-time performance on mobile devices is a significant challenge. For example, a client company may use our super-resolution technology to upscale 540P video to 1080P in real time on mobile devices. However, only a few companies can currently provide this technology. Another issue is whether specific Android models can run at 540P. These practical issues reflect how well a company can implement super-resolution, interpolation, and even frame-doubling technologies.

Returning to the question itself, since users are paying for the experience, can we enhance the technical capabilities of our clients, the app and software development companies? For example, in Southeast Asia and India, users may only get a frame rate of 7 to 8 frames per second and may have low-performance devices. Can we achieve frame interpolation on mobile devices? This is the first essential characteristic.
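To make frame interpolation concrete, here is a minimal sketch of the simplest possible form: inserting a linearly blended frame between each pair of neighbouring frames. It is only an illustration. Production interpolation on mobile devices typically relies on motion estimation or learned models, and the interview does not say which method ZEGOCLOUD uses. The names `blend_frames` and `double_frame_rate` are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame: rows of pixel intensities


def blend_frames(prev: Frame, nxt: Frame, t: float = 0.5) -> Frame:
    """Linearly blend two equally sized frames at time t in [0, 1]."""
    return [
        [(1.0 - t) * a + t * b for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev, nxt)
    ]


def double_frame_rate(frames: List[Frame]) -> List[Frame]:
    """Insert one blended frame between every pair of neighbouring frames."""
    if not frames:
        return []
    out: List[Frame] = [frames[0]]
    for prev, nxt in zip(frames, frames[1:]):
        out.append(blend_frames(prev, nxt))  # synthetic in-between frame
        out.append(nxt)                      # original frame
    return out
```

A sequence of N frames becomes 2N - 1 frames, so a stream at 7 to 8 frames per second would play back at roughly twice that rate. Plain blending produces ghosting on fast motion, which is one reason real-time interpolation on low-end phones is hard.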

Secondly, users desire clear audio and video quality and want to interact and communicate with others. They hope for a strong interaction experience where it feels like the other person is sitting right next to them, like having a face-to-face conversation. This is the kind of strong interaction that ZEGOCLOUD aims to enhance with its future technologies.

Q: New Interactive Live Streaming scenarios have attracted a lot of attention. In immersive live streaming and co-streaming scenarios, what challenges do they pose for audio and video technology? What capabilities still need to be improved?

KL: For example, strong interaction demands strong real-time performance, resilience to weak networks, and advanced spatial and audio-related technologies. Moreover, improvements in audio and video technologies, as well as spatial interaction, are basic requirements. The interaction also involves voice and action interaction, such as immediately feeding back the other person’s reaction once a user’s action is received. This creates a strong interaction, making Interactive Live Streaming possible.

Achieving these scenarios on hardware, such as having a large screen the same size as a real person, would provide an even stronger sense of experience and communication. Cross-screen interaction is another common scenario where two people, one on the left and one on the right, could play catch with an object. Can the person on the right catch it through visual cues or with a glove? Such spatial communication is also very challenging. However, these are all imagined scenarios, and there is still a long way to go to achieve them. Most of our interactions are in 4G scenarios on mobile devices. Still, if 5G is widely popularized, there should be no major issues with latency and high-definition. The development will move towards interactive space and immersive voice. There will be more applications and gameplay in the future.

Q: Taking virtual live broadcasting as an example, there are high demands for the realism of virtual hosts and the scenarios in which they are broadcasting. What technologies are required to support it?

KL: ZEGOCLOUD and our peers continuously strive to improve our capabilities in this industry. This healthy competition benefits both the ecosystem and users.

In reality, we prioritize constant improvement towards higher definition, smoother, and more real-time performance in the ecosystem. This includes developing technological capabilities for audio and video interactions and scenario-based AI-powered noise suppression. For example, when a child is taking an online class at home, and the kitchen is noisy, this requires active noise suppression and spatial 3D audio technology.

More specifically, immersive audio technology, including Channel-based Audio (CBA), Object-based Audio (OBA), and Scene-based Audio (SBA), can bring a completely different experience to ordinary users when the underlying algorithms are optimized.
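As a rough illustration of the object-based idea, an audio object can be thought of as a mono signal plus position metadata that is rendered to speakers only at playback time, whereas channel-based audio fixes the speaker layout when the content is produced. The sketch below renders one object to stereo with constant-power panning. It is a deliberately simplified stand-in for a real spatial renderer, and `pan_gains` and `render_object` are hypothetical names.

```python
import math
from typing import List, Tuple


def pan_gains(pan: float) -> Tuple[float, float]:
    """Constant-power stereo gains for pan in [-1 (left), +1 (right)]."""
    pan = max(-1.0, min(1.0, pan))
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to 0 .. pi/2
    return math.cos(angle), math.sin(angle)


def render_object(samples: List[float], pan: float) -> List[Tuple[float, float]]:
    """Render one mono audio object to (left, right) sample pairs."""
    left, right = pan_gains(pan)
    return [(s * left, s * right) for s in samples]
```

Because the squared gains always sum to one, the perceived loudness stays roughly constant as the object moves between left and right.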

Interpretation of RTI capabilities

Q: In 2022, ZEGOCLOUD proposed the Real-Time Interaction (RTI) concept to summarize its capabilities. How should this capability be understood?

KL: First and foremost, on behalf of ZEGOCLOUD, I would like to express my respect to all our peers. In fact, all relevant concepts are proposed to improve the ecosystem. Real-Time Interaction (RTI) is a value-added service built on top of Real-Time Communication (RTC). RTI covers all products and technical capabilities necessary to reproduce or exceed reality in real-time interaction scenarios, including RTC, in-app chat, live streaming technologies, virtual avatars, AI vision, and status synchronization. We must work together with our peers and partners to achieve this.

RTI, the new frontier of RTC

RTI aims to achieve higher definition, smoother performance, lower latency, immersive experience, high fidelity, and intense interaction. Specifically, video technology includes real-time frame interpolation on mobile devices, real-time super-resolution on clients, subject segmentation, transmission, and low-light enhancement. In terms of immersive sound quality, this includes scenario-based AI noise reduction, spatial audio, and range voice. There are unlimited features and scenarios, such as voice chat among tens of thousands of users and multi-user status synchronization.

For example, one of our clients provides a value-added service for real-time exams based on RTC, mainly for customers in the education industry. Universities are no longer satisfied with products that only offer basic interactive communication and screen-clicking exam functions. They need more services, such as monitoring whether examinees are cheating, whether someone is in front of or behind the examinee’s camera, or whether there is a possibility of cheating in a certain location. In addition, in exam monitoring or learning systems, students might need real-time scoring or error correction when playing the piano or singing. ZEGOCLOUD must constantly refine and explore new technologies to meet these evolving needs and increase real-time interactive capabilities and means.

Use case scenarios

Our most typical user scenario is mobile-based. While discussing frame interpolation, super-resolution, and subject segmentation on mobile devices, we also thought of low-light enhancement technology. This technology helps to capture the user’s face clearly when shooting videos in dark rooms with the lights turned off. A Southeast Asian client once asked to be able to see the user’s face clearly, even in dark environments, while maintaining 720P resolution at 30 frames per second, which posed a great technological challenge. However, we later overcame this difficulty by using a 2-millisecond 720P low-light enhancement, which satisfied the users, although it introduced some noise.
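The interview does not describe how the low-light enhancement works. For readers unfamiliar with the idea, the sketch below shows one classical and very cheap approach, gamma correction through a lookup table, purely as an assumed example and not as ZEGOCLOUD's method. The function names are hypothetical.

```python
from typing import List


def build_gamma_lut(gamma: float = 0.5) -> List[int]:
    """Map each 8-bit level v to 255 * (v / 255) ** gamma; gamma < 1 brightens."""
    return [round(255.0 * (v / 255.0) ** gamma) for v in range(256)]


def brighten(frame: List[List[int]], lut: List[int]) -> List[List[int]]:
    """Apply the lookup table to every pixel of an 8-bit grayscale frame."""
    return [[lut[p] for p in row] for row in frame]
```

A table lookup costs one memory read per pixel, which is why this family of methods is attractive under tight per-frame time limits. Brightening dark pixels this way also amplifies sensor noise in the shadows, a trade-off similar to the one mentioned above.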

Technological capabilities and challenges

ZEGOCLOUD has several examples of AI capabilities. For instance, in super-resolution, their aim is to cover as many device models as possible. Currently, for phones with the Snapdragon 660 processor, the resolution can be doubled from 640×480 to 1280×960 in about 52 milliseconds. For phones with the Snapdragon 855 processor, the resolution can be doubled from 640×480 to 1280×960 in about 20 milliseconds. Recently, one of their major clients needed real-time 960×540 super-resolution to 1080P, and ZEGOCLOUD was able to meet their needs.
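As a quick sanity check on these figures, the sketch below converts a per-frame processing time into the highest frame rate it could sustain and computes how many more pixels the upscaled frame contains. It assumes, which the interview does not state, that super-resolution runs once per frame, serially, with nothing else competing for the time. The helper names are hypothetical.

```python
from typing import Tuple


def max_fps(latency_ms: float) -> float:
    """Highest frame rate sustainable if each frame takes latency_ms."""
    return 1000.0 / latency_ms


def pixel_growth(src: Tuple[int, int], dst: Tuple[int, int]) -> float:
    """Ratio of output pixel count to input pixel count."""
    return (dst[0] * dst[1]) / (src[0] * src[1])


snapdragon_660_fps = max_fps(52.0)  # about 19.2 frames per second
snapdragon_855_fps = max_fps(20.0)  # 50 frames per second
growth = pixel_growth((640, 480), (1280, 960))  # 4.0
```

Under that assumption, 52 milliseconds per frame supports about 19 frames per second and 20 milliseconds supports 50. Doubling both dimensions means every output frame holds four times as many pixels as its input, and the same factor applies to 960×540 upscaled to 1920×1080.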

Another example is the technical challenges in green screen subject segmentation, often seen as a relatively simple technology. The most typical problem with green screens is color spillage. Clients have put forward new requirements to ensure timeliness and prevent color spillage. This involves suppressing noise and cleanly removing folds. These issues may seem simple, but they are very challenging to deal with. ZEGOCLOUD was able to meet the client’s requirements through 3 or 4 convolution models with a size of only about 5KB. They have delved deeper to excel in technology that everyone thinks is simple but needs to be done better.
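To show what color spillage means in practice, here is a minimal per-pixel sketch of classical chroma keying with spill suppression. It is not the convolution-based approach described above, only a textbook baseline, and the threshold value and function names are assumptions chosen for illustration.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each 0..255


def green_alpha(pixel: Pixel, threshold: int = 40) -> float:
    """Foreground opacity: 0.0 for clear green screen, 1.0 for the subject."""
    r, g, b = pixel
    excess = g - max(r, b)  # how strongly green dominates the other channels
    if excess <= 0:
        return 1.0
    return max(0.0, 1.0 - excess / threshold)


def suppress_spill(pixel: Pixel) -> Pixel:
    """Clamp green to the larger of red and blue to remove a green tint."""
    r, g, b = pixel
    return (r, min(g, max(r, b)), b)
```

Green light reflected from the screen tints skin, hair, and clothing edges, so those pixels are neither clearly background nor clearly subject. A fixed rule like this one handles clean studio footage but struggles with noise and folds in the cloth, which is where the requirements described above become demanding.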

The work on AI-powered noise reduction, voice detection, spatial audio, multi-user voice chat, and multi-user status synchronization mentioned earlier also reflects ZEGOCLOUD’s rich technology ecosystem and its efforts to enhance user experience in various dimensions. These explorations have also improved their basic and value-added service capabilities.

Q: At present, RTI has outstanding advantages in terms of picture quality, sound quality, and various user scenarios. What technical challenges has ZEGOCLOUD encountered in realizing these capabilities?

KL: The technologies we discussed include traditional and deep learning-based techniques. Let’s take super-resolution, a deep learning-based method, as an example. What aspects can we focus on to solve the problem of super-resolution? First, it is essential to note that a deep learning model cannot be too large if it is to run on Android devices. This is because even the best deep-learning inference engine cannot deliver sufficient speed when the model is too large or has too many operators.

Additionally, large models can consume too much memory. Since super-resolution is an “add-on” service while RTC is a fundamental capability, we must avoid excessive memory consumption. However, if the model is too small, the super-resolution effect will decrease, which is our problem. Therefore, model design, such as using knowledge distillation or training large models to create smaller ones, is an important consideration.
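One common way to express knowledge distillation for an image-to-image task such as super-resolution is a loss that pulls the small student model toward both the ground-truth image and the large teacher model's output. The sketch below shows that general idea on flat pixel lists. The weighting `alpha`, the use of mean absolute error, and the function names are assumptions for illustration, not details from the interview.

```python
from typing import Sequence


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute error between two equally long pixel sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student: Sequence[float],
    teacher: Sequence[float],
    target: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Weighted sum of the ground-truth loss and the teacher-imitation loss."""
    return alpha * l1(student, target) + (1.0 - alpha) * l1(student, teacher)
```

Setting `alpha` to 1.0 recovers ordinary supervised training, while lower values lean more on the teacher, which can help a model that is too small to learn the mapping well from the data alone.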

Second, there is the issue of data. By simulating the process of data degradation after understanding the details of the business, we can take super-resolution to the next level and even achieve better results than small models.
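Simulating degradation usually means taking clean high-resolution frames and deliberately damaging them the way the real pipeline would, so the model trains on realistic input and target pairs. The sketch below uses box downsampling plus Gaussian noise as a stand-in. The actual degradations in a given business scenario (compression artifacts, camera blur, and so on) are not specified in the interview, and all names here are hypothetical.

```python
import random
from typing import List, Tuple

Image = List[List[float]]  # grayscale image, values in 0..255


def downsample_2x(img: Image) -> Image:
    """Average each 2x2 block (box filter) to halve width and height."""
    return [
        [
            (img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
            for x in range(0, len(img[0]) - 1, 2)
        ]
        for y in range(0, len(img) - 1, 2)
    ]


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add Gaussian noise and clamp the result to the 0..255 range."""
    return [
        [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in img
    ]


def make_training_pair(
    hr: Image, sigma: float = 2.0, seed: int = 0
) -> Tuple[Image, Image]:
    """Return (degraded low-res input, original high-res target)."""
    rng = random.Random(seed)  # seeded so the pair is reproducible
    return add_noise(downsample_2x(hr), sigma, rng), hr
```

The closer the simulated damage is to what devices and networks actually produce, the better a small model trained on these pairs tends to perform on real footage.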

Finally, there is the issue of model training and inference quantization, which involves compression, model compilation, and optimization of inference engines specifically for super-resolution. This end-to-end process requires technical personnel from different areas to work together. Large companies may face the challenge of integrating resources across departments, where inference engine optimization is separate, and data, model design, and mobile development may all belong to different departments. ZEGOCLOUD is constantly improving and optimizing to provide users with a better experience.
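For readers new to quantization, the sketch below shows the basic arithmetic of symmetric 8-bit quantization with a single scale per tensor: floating-point weights are mapped to small integers and recovered approximately at inference time. This is a generic textbook scheme under stated assumptions, not a description of ZEGOCLOUD's inference engine, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integer codes in [-127, 127] sharing one scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]
```

Each weight is stored in one byte instead of the four bytes of a 32-bit float, and the rounding error per weight is at most half the scale. Whether that error is acceptable for super-resolution quality is exactly the kind of question that requires model, data, and inference engineers to work together.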

Q: Can you share more specific examples with us? What problems can RTI solve, and in what scenarios is it suitable?

KL: In interactive live streaming, RTI technology is used in voice chat rooms, virtual spaces, and more. For example, voice chat rooms give users an anonymous communication space. Still, traditional anonymity features may lead to complex user backgrounds and various types of background noise. Additionally, voice chat rooms have limited microphone access, which can restrict emotional and physical communication. ZEGOCLOUD has implemented noise reduction, spatial sound effects, and high-fidelity technology to address these issues. ZEGOCLOUD’s virtual avatar is also used to drive body movements and expressions, improve sound quality, and convey emotions, resulting in a better communication experience than traditional voice chat rooms. Furthermore, RTI can overcome the limitations of traditional RTC microphone access, allowing more people to speak freely.

Future outlook

Q: Based on the metaverse concept and advancements in audio and video technology, what new phenomena or demands do you think will emerge in the future?

KL: At present, some scenarios are already here to stay, for example, autonomous driving and remote inspection or diagnosis. In Industry 4.0, wearing VR glasses can allow remote guidance of users in factories or remote consultations. Remote car insurance claims are another example, where claims adjusters typically need to instruct users on where to take pictures of the damage. With remote claims adjustment, staff can wear VR glasses and guide users to take photos of the damage with their phones in real time, improving efficiency. Additionally, there may be applications in education, remote visa application, and other areas.

With robust RTC basic capabilities and the development of 5G networks, I believe many unimaginable applications will emerge, such as intelligent robots that humans can monitor to perform tasks. As technology continues to advance, we can expect to see more applications that go beyond our imagination.

Q: What are ZEGOCLOUD’s plans for the future around the technology trends you just mentioned?

KL: The ultimate goal of technological development is to solve customer problems. As mentioned earlier, we have been researching and investing in forward-looking technologies. In 2022, we also published a paper, Progressive Transformer Machine for Natural Character Reenactment. We discussed our investment and research in RTC-related audio and video technology, AI-based high-fidelity, 3D facial reconstruction, high-fidelity NeRF, etc.

At the same time, we want to build our capabilities and work with other industry players to jointly build an ecosystem. We plan to continue to develop our abilities as hardware becomes more ubiquitous and technology costs decrease. As a result, we can solve our customers’ pain points faster.

Advice for careers in the audio and video field

Q: Large technology companies have already standardized technologies, such as noise suppression in voice chatting, making it difficult for smaller companies to compete. How can this issue be addressed?

KL: This is a good question. Noise reduction in voice chatting involves deep learning. A model trained on data from one scenario may perform differently in another scenario. Here is a typical example. We applied a third-party speech database, which holds the voice data of adults only, to a project featuring children’s voices, and we got a recognition rate of 70% to 80%. After we added a children’s voice database, the recognition rate increased to 90%. The current generalization ability of AI is limited.

In many cases, big companies hope to achieve standardization, which means the data trained in one scenario can be used in other scenarios. However, this is currently impossible and can only be achieved in specific scenarios. Therefore, this creates opportunities for many small and medium-sized companies to compete and even outperform big companies in certain areas.

Q: I’ve been engaging in device driver development for two years and want to switch to the audio and video field. Can you give me some suggestions about where I should start?

KL: I actually changed my career path as well. When I first graduated, I worked as a math teacher at a university for a year and then switched to becoming a programmer. At first, I worked on algorithm-related projects and then moved into the audio and video industry. I started with traditional audio and video processing techniques before moving into AI technology and then onto framework inference engine optimization and model design. My advice for anyone looking to switch careers is to start by mastering a specific aspect of audio and video technology, or at least by being able to run a demo, understand the code and algorithms, and then optimize the algorithm based on user scenarios. As a next step, compare and analyze the advantages and disadvantages of different companies, choose a good path and your own positioning, focus on mastering a single technology first, and then expand to other areas.