OpenAI released its latest flagship model, GPT-4o, at its spring launch event. Building on the previous-generation GPT-4, which worked primarily with text, GPT-4o supports low-latency, real-time conversation. The ‘o’ in GPT-4o stands for ‘omni’: the model accepts text, audio, and images as input and generates multimodal output, including audio and visuals, in real time.

OpenAI has clearly optimized GPT-4o for real-time interactive scenarios. The model can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, which is close to human response time in conversation. Evaluation data at https://thefastest.ai/ shows a shorter user waiting time (measured as “TTFT”, Time To First Token), which means users get feedback sooner and the experience moves closer to human conversational interaction.
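As a rough illustration of what TTFT measures, the sketch below times how long a streamed response takes to deliver its first chunk. The names `measure_ttft` and `fake_stream` are hypothetical helpers invented for this example; `fake_stream` merely simulates a streaming model with sleeps, and none of this reflects the methodology used by thefastest.ai.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], float, int]:
    """Consume a streamed response and return (ttft_s, total_s, chunk_count).

    ttft_s is the delay between starting to read and receiving the first
    chunk; it is None when the stream yields nothing.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # First chunk arrived: this is the "time to first token".
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttft, total, count


def fake_stream(first_delay: float, gap: float, chunks: int) -> Iterator[str]:
    """Hypothetical stand-in for a model's streaming API (assumption)."""
    time.sleep(first_delay)  # simulated wait before the first chunk
    for i in range(chunks):
        if i:
            time.sleep(gap)  # simulated gap between later chunks
        yield f"chunk-{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream(0.05, 0.01, 5))
    print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, chunks: {n}")
```

In a real application the iterable would be the streaming response of whichever model API is in use; the point is only that TTFT is measured at the first chunk, not at the end of the response.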

The vision of “real-time remote interaction with AGI” is interaction that feels as natural as talking to another person: multimodal and real-time. AGI perceives and responds to real-world information through cameras, microphones, and speakers, and processes it with a multimodal large model, just as humans obtain, process, and respond to information through their senses and brain.

This is analogous to how AI systems like JARVIS in the Iron Man movies can be remotely accessed to assist with specific tasks, using multimodal perception and real-time processing capabilities.

RTC will be a core capability for real-time remote interaction with AGI

As the “brain”, a large-scale AI model runs at high speed in cloud computing centers, while the “sensory organs”, such as cameras and microphones, are distributed around the globe. To achieve real-time remote interaction with AGI, sensory information must reach the brain in real time and with high fidelity, and the processed result must travel back to the user just as quickly.

To achieve this goal, OpenAI first optimized the model itself. GPT-4o natively supports cross-modal reasoning across text, audio, and visual input, without separate conversion components such as ASR (automatic speech recognition) and TTS (text-to-speech). In addition, OpenAI has introduced real-time communication (RTC) technology into its GPT-4o applications, which marks a significant milestone.

By comparison, the voice mode of GPT-4, which relied on the previous approach, had an average audio response time of around 5.4 seconds, far too slow for real-time interaction. The audio interaction mode of GPT-4o averages just 320 ms, which makes real-time remote interaction feasible.

[Figure: Flowchart of a GPT-based remote interactive application]

In summary, GPT-4o has brought AGI into a new era of real-time interaction. Combining the new generation of large-scale AI models with real-time RTC networks makes remote human-computer interaction more natural and smoother.

ZEGOCLOUD RTC seamlessly integrates with multimodal large language models

As a leading company in the RTC field, ZEGOCLOUD has been actively exploring and implementing the integration of AGI (Artificial General Intelligence) into remote real-time interactions. ZEGOCLOUD RTC’s unique real-time audio/video and high-frequency data transmission capabilities allow it to seamlessly integrate with new multimodal large language models, providing users with a more natural real-time interaction experience.

Key features that enable ZEGOCLOUD RTC to support real-time remote interaction with AGI include:

Extremely low latency: Large language models themselves need 200+ ms to process information, so the transport layer must add as little delay as possible. ZEGOCLOUD RTC has a minimum end-to-end latency of 60 ms and an average of 200 ms, which meets the real-time sensory requirements of human-AGI interaction.
Resilience to poor network conditions: Ubiquitous remote AGI interaction depends on stable data transmission, even on weak networks. ZEGOCLOUD RTC can maintain a smooth interaction experience with 80% audio packet loss and 70% video packet loss.
High-fidelity data transmission: Higher-fidelity input helps large-scale AI models understand information and make decisions more accurately. ZEGOCLOUD RTC’s proprietary video codec, video quality enhancement algorithms, and 48 kHz full-band audio sampling ensure high-quality audio and video transmission.
Flexible and optimized deployment: Placing RTC transmission nodes closer to the computing centers of large-scale AI models can further reduce latency. ZEGOCLOUD RTC’s 500+ dynamic multi-cloud nodes can be strategically placed near data centers for a more reliable, lower-latency interaction experience.
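To see how the figures quoted in this article combine, the sketch below adds up a simple latency budget for one conversational turn. It is an illustration under stated assumptions, not a measurement: `LatencyBudget` is a hypothetical helper, and the model assumes that the quoted RTC latency is one-way and is paid twice per turn (once uplink, once downlink), which the article does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Hypothetical per-turn latency model (assumption, not a vendor formula)."""

    rtc_one_way_ms: float  # RTC end-to-end latency quoted in the article
    model_ms: float        # model audio response time quoted in the article
    rtc_legs: int = 2      # assumption: one uplink leg plus one downlink leg

    @property
    def total_ms(self) -> float:
        # Transport delay for every leg plus the model's own response time.
        return self.rtc_one_way_ms * self.rtc_legs + self.model_ms


# All input figures come from the article text above.
SCENARIOS = {
    "GPT-4o, best case": LatencyBudget(rtc_one_way_ms=60, model_ms=232),
    "GPT-4o, average": LatencyBudget(rtc_one_way_ms=200, model_ms=320),
    "GPT-4 voice mode, average": LatencyBudget(rtc_one_way_ms=200, model_ms=5400),
}

if __name__ == "__main__":
    for name, budget in SCENARIOS.items():
        print(f"{name}: {budget.total_ms:.0f} ms per turn")
```

Under these assumptions, transport accounts for a large share of the GPT-4o budget but is negligible next to the 5.4-second response time of the earlier voice mode, which is why network latency only becomes the deciding factor once the model itself responds in a few hundred milliseconds.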

Various implementations of new real-time interaction scenarios

Building on these technical advantages, ZEGOCLOUD has been among the first in the RTC industry to explore AGI-enabled real-time remote interaction and has already put it into practice in several industry scenarios.

AI-powered mock interviews: By leveraging AIGC (AI-Generated Content) technology to create digital interviewers, combined with ZEGOCLOUD’s strengths in real-time interaction, this solution simulates realistic job interview scenarios, allowing students to improve their interview skills in a low-pressure environment.

Intelligent customer service: ZEGOCLOUD’s real-time interactive Digital Human can provide accurate responses by integrating large-scale AI models or relevant knowledge bases. The quality of service can even exceed that of human counterparts, while helping customers reduce the labour costs associated with manual responses. Digital Human operates 24/7, facilitating smoother communication and significantly improving the efficiency and experience of online consultations and business processes.

In addition, the real-time interaction enabled by RTC + AI is proving valuable in social entertainment scenarios such as emotional companionship and live streaming, as well as in industries such as online education and remote healthcare.

GPT-4o has revolutionized the way people interact remotely with large-scale AI models, while also placing new demands on RTC in terms of low latency and high-fidelity data transmission. Going forward, ZEGOCLOUD will continue to deepen its focus on the real-time interaction industry, explore new scenarios and benefits of RTC + AI, and provide users with an even higher-quality real-time interactive experience.

The Rise of Multimodal AGI: GPT-4o and ZEGOCLOUD RTC Usher in a New Era of Seamless Remote Interaction
