Real-Time Voice Optimization Process (II)

Different scenarios place different requirements on a Real-Time Voice system. Implementation means selecting the most appropriate solution and algorithm at each step for the scenario at hand, and sometimes this also requires the cooperation of hardware.

The following are the optimization features for different scenarios.

KTV Real-Time Voice

The core requirement in the KTV Real-Time Voice scenario is to play the music accompaniment along with the host’s voice while the lyrics are displayed synchronously. The sound effects and quality of the entire scene must be high, and the delay between the host’s voice and the accompaniment must be low to keep the experience real-time.

Therefore, the targeted optimization plan for this scenario is to support accompaniment playback, singing, and mixing in the pre-processing stage and to synchronize lyrics transmission through SEI technology. We choose light noise reduction while supporting tone changing and sound effects.
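The mixing step in the pre-processing stage can be sketched as follows. This is a minimal illustration, assuming mono 16-bit PCM frames held as Python lists; the function name `mix_pcm16` and the default gains are hypothetical and not taken from any SDK.

```python
from typing import List

INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm16(voice: List[int], accompaniment: List[int],
              voice_gain: float = 1.0, music_gain: float = 0.6) -> List[int]:
    """Mix two mono 16-bit PCM frames sample by sample.

    The gains are illustrative defaults (assumptions). The shorter frame is
    padded with silence, and the sum is clamped to the int16 range so that
    loud passages saturate instead of wrapping around.
    """
    length = max(len(voice), len(accompaniment))
    mixed = []
    for i in range(length):
        v = voice[i] if i < len(voice) else 0
        m = accompaniment[i] if i < len(accompaniment) else 0
        sample = int(round(v * voice_gain + m * music_gain))
        mixed.append(max(INT16_MIN, min(INT16_MAX, sample)))
    return mixed
```

Hard clamping is the simplest way to keep the mix in range; a production mixer would more likely use a limiter to avoid audible distortion.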

Real-Time Voice in Chat Room

The core requirements of a standard voice chat room are a clear voice and low performance overhead. At the same time, it must support background music playback, voice change, sound mixing, and cooperation with some business logic.

In the pre-processing stage, we must support background music playback, volume change, and sound effects. Aggressive noise reduction is turned on to ensure that the human voice is transmitted clearly. On the mobile terminal, you can also use the call volume to eliminate echo. In the encoding stage, the sampling rate and bit rate are reduced appropriately to ensure transmission.

In some scenarios, people in the chat room may frequently get on and off the microphone. If someone keeps entering and leaving the room, parts of the sound may be swallowed or disturbed.

We also need to provide permission control for room management, with functions like kicking people out, to ensure a smooth and serene chat environment.

Live Game Audio

Before talking about the live game audio, let’s talk about the principle of echo cancellation.

In a single-talk case, echo cancellation is relatively easy, and a more aggressive processing strategy can be adopted. Double talk is a scenario where multiple parties are talking simultaneously and their voices may mix. In this case, echo cancellation is more complicated. On the one hand, it is necessary to protect the near-end voice signal from being damaged; on the other hand, it is necessary to eliminate echo as much as possible.

When the far-end echo is 6-8 decibels higher than the near-end voice, eliminating the echo could damage the near-end voice. If the far-end echo is more than 18 times higher than the near-end voice, it may completely mask the near-end voice, and echo cancellation performs poorly.

In that situation, we can adopt strategies such as suppressing the far-end echo and the near-end voice together and then filling in an appropriate amount of comfort noise.
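The decision described above can be sketched as a comparison of the echo level against the near-end level. This is only an illustration: the function names, the strategy labels, and the noise level are hypothetical, `damage_db` follows the 6-decibel figure in the text, and the default for `mask_db` is an assumption that reads the second figure as decibels.

```python
import math
import random
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a frame; 0.0 for an empty frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def echo_to_near_db(echo: Sequence[float], near: Sequence[float]) -> float:
    """Level of the far-end echo relative to the near-end voice, in dB."""
    eps = 1e-9  # avoids log(0) and division by zero on silent frames
    return 20.0 * math.log10(max(rms(echo), eps) / max(rms(near), eps))


def choose_strategy(ratio_db: float, damage_db: float = 6.0,
                    mask_db: float = 18.0) -> str:
    """Pick a handling strategy from the echo-to-near-end ratio.

    mask_db is an assumed threshold, not a value confirmed by the text.
    """
    if ratio_db >= mask_db:
        return "suppress_and_comfort_noise"
    if ratio_db >= damage_db:
        return "cancel_with_near_end_protection"
    return "cancel"


def comfort_noise(n: int, level: float = 50.0, seed: int = 0) -> List[float]:
    """Low-level Gaussian noise to fill a fully suppressed frame."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, level) for _ in range(n)]
```

A real echo canceller cannot observe the echo and the near-end voice separately, so the ratio would have to be estimated, for example from the far-end reference signal and the adaptive filter output.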

Now let’s return to the live game audio scene. As we mentioned above, the core requirements of live game audio are noise elimination, a clear human voice, low latency, low performance overhead, low bandwidth, and support for game development frameworks.

Therefore, the targeted optimization solution in the game scene uses aggressive noise reduction in the pre-processing stage. The mobile terminal uses the call volume to eliminate echo and detects the human voice through VAD (voice activity detection) so that only the human voice is transmitted. It then appropriately enhances the human voice through AGC (automatic gain control).
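The VAD and AGC steps can be sketched together as follows. This is a deliberately simple, energy-based illustration that assumes 16-bit PCM frames; the threshold, target level, and gain cap are hypothetical values, and production systems typically use more robust detectors and smoothed gain.

```python
import math
from typing import List


def frame_rms(frame: List[int]) -> float:
    """Root-mean-square level of one PCM frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame: List[int], threshold: float = 500.0) -> bool:
    """Energy-based VAD: treat a frame as speech if it is loud enough.

    The threshold is an assumed value for illustration only.
    """
    return frame_rms(frame) >= threshold


def apply_agc(frame: List[int], target_rms: float = 3000.0,
              max_gain: float = 4.0) -> List[int]:
    """Scale a frame toward target_rms, capping the gain and clamping to int16."""
    level = frame_rms(frame)
    if level == 0.0:
        return list(frame)
    gain = min(max_gain, target_rms / level)
    return [max(-32768, min(32767, int(round(s * gain)))) for s in frame]


def preprocess(frames: List[List[int]]) -> List[List[int]]:
    """Keep only frames that the VAD marks as speech, then apply AGC to them."""
    return [apply_agc(f) for f in frames if is_speech(f)]
```

For example, `preprocess` drops a silent frame entirely and scales a quiet speech frame up toward the target level before it reaches the encoder.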

The encoding stage appropriately reduces the sampling rate and bit rate, and the playback stage reduces buffering to achieve lower delay and truly real-time voice.

Real-Time Voice in E-learning scenarios

The human voice must be clear in educational scenarios, such as a small 1v1 class. In addition:

Teachers may play video courseware during class, which requires media playback.
The equipment used in class may be more diverse, which requires higher compatibility.
Many people may speak at the same time in a small class, which requires high-quality echo cancellation.
Students may make mistakes in class, which requires a fool-proof design.

The targeted optimization solution in the pre-processing stage is as follows: the mobile terminal uses the system volume to eliminate echo, the PC terminal uses an aggressive echo cancellation strategy, media playback and mixing are supported, and the speech volume is appropriately increased through AGC.

In the encoding stage, we appropriately reduce the bit rate and sampling rate. We also add fool-proof processing, such as equipment detection, multi-device priority selection, and troubleshooting of faulty equipment.
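Multi-device priority selection combined with skipping faulty equipment can be sketched as below. The `AudioDevice` type, its fields, and the device names are hypothetical; in practice the `working` flag would come from whatever equipment detection the application performs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AudioDevice:
    name: str
    priority: int   # lower number means more preferred (assumed convention)
    working: bool   # result of a hypothetical device self-test


def pick_device(devices: List[AudioDevice]) -> Optional[AudioDevice]:
    """Return the highest-priority device that passed detection, or None."""
    usable = [d for d in devices if d.working]
    return min(usable, key=lambda d: d.priority, default=None)


devices = [
    AudioDevice("usb_headset", priority=0, working=False),  # faulty, skipped
    AudioDevice("bluetooth_headset", priority=1, working=True),
    AudioDevice("built_in_mic", priority=2, working=True),
]
selected = pick_device(devices)
```

Here the preferred USB headset fails detection, so the selection falls back to the next device in priority order without the student having to intervene.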

Music Sharing

Music sharing requires high sound quality, support for accompaniment playback, and a specific fool-proof design.

The targeted optimization plan is to support accompaniment mixing in the pre-processing stage and to preserve sound quality in the encoding stage by using a sampling rate of 48 kHz or higher.

A bit rate of 192 kbps or more is generally recommended for encoding, and dual-channel support is required. The accompaniment needs to be presented simultaneously during playback. Where appropriate, a little video quality may be sacrificed to maintain the sound quality.
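To put these numbers in perspective, the following sketch compares the uncompressed PCM bit rate with the recommended encoded bit rate. It assumes 16-bit samples, which the text does not specify, and reads the recommended figure as 192 kbps.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int,
                     channels: int) -> float:
    """Uncompressed PCM bit rate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# 48 kHz, dual channel, 16-bit samples (the bit depth is an assumption)
raw_kbps = pcm_bitrate_kbps(48_000, 16, 2)

# Recommended minimum encoded bit rate, read as kbps
encoded_kbps = 192

# How much the encoder has to compress the raw stream
compression_ratio = raw_kbps / encoded_kbps
```

Under these assumptions the raw stream is 1536 kbps, so encoding at 192 kbps corresponds to roughly 8:1 compression.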

In addition, another critical point in some scenarios is that the quality of the mobile phone or other hardware itself should be high.

Smart Device

For smart devices, power consumption must be low, and the network environment they operate in will not be perfect. The requirements for resisting network jitter are therefore higher, and the device may need to support voice recognition at the same time. We also provide signaling channel support.
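One common way to absorb network jitter is a small reordering buffer on the receiving side. The sketch below is a minimal illustration and not the implementation described in this article; the `JitterBuffer` class and its `depth` parameter are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Reorder packets by sequence number before playback.

    depth is the number of packets held back before playout starts.
    A larger depth tolerates more jitter but adds more delay.
    """

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth
        self._heap: List[Tuple[int, bytes]] = []
        self._next_seq: Optional[int] = None

    def push(self, seq: int, payload: bytes) -> None:
        # Drop packets that arrive after their playout slot has passed.
        if self._next_seq is not None and seq < self._next_seq:
            return
        heapq.heappush(self._heap, (seq, payload))

    def pop(self) -> Optional[Tuple[int, bytes]]:
        """Return the oldest buffered packet once enough are queued, else None."""
        if len(self._heap) < self.depth:
            return None
        seq, payload = heapq.heappop(self._heap)
        self._next_seq = seq + 1
        return seq, payload
```

The `depth` value captures the trade-off between jitter tolerance and latency: scenarios that need the lowest delay would keep it small, while devices on unstable networks would raise it.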

IoT devices or VR platforms are different. We need to provide multi-platform support, such as some embedded and Android platforms, and try to use hardware codecs.

In addition, the bit rate of the device is relatively high. To reduce power consumption, we use VAD to detect silence in the pre-processing stage, which reduces the encoding cost.
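Skipping the encoder on silent frames can be sketched as follows. The function takes per-frame VAD decisions as input; the `hangover` parameter, which keeps encoding for a few frames after speech ends so that word endings are not cut off, is an assumption added for illustration.

```python
from typing import List


def frames_to_encode(speech_flags: List[bool], hangover: int = 2) -> List[bool]:
    """Decide per frame whether the encoder should run.

    speech_flags holds one VAD decision per frame. After speech stops,
    encoding continues for `hangover` more frames, then silent frames
    are skipped to save CPU and power.
    """
    decisions = []
    remaining = 0
    for is_speech in speech_flags:
        if is_speech:
            remaining = hangover
            decisions.append(True)
        elif remaining > 0:
            remaining -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions


flags = [True, False, False, False, False, True]
plan = frames_to_encode(flags, hangover=2)
skipped = plan.count(False)
```

In this example two of the six frames are never passed to the encoder, which is where the saving in coding consumption comes from.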

Conclusion

Each business scenario is different, and different optimization solutions are required.

We need to choose the most suitable approach at each link of the end-to-end voice chain, weigh the trade-offs between the links, and arrive at the most suitable Real-Time Voice solution for the current scenario.

The development of the Internet of Things brings more scenarios, and Real-Time Voice applications will have more possibilities. 5G continues to develop and is expected to reach us in the coming years. It brings a better transmission network, and real-time voice quality is expected to improve further.

In addition, voice AI technology is constantly improving in areas such as speech recognition, feature extraction, and emotion recognition. These capabilities will combine with real-time voice technology to enable more use cases, and advances in hardware will also improve real-time voice processing.