Different scenarios place different requirements on a real-time voice system, and each has its own direction of optimization. In practice, this means selecting the most appropriate solution and algorithm at each step of the pipeline according to the scenario, sometimes with the cooperation of hardware.
The following are the optimization features and directions in different scenarios:
In the KTV voice radio scene, the core requirement is playing accompaniment music or an accompaniment MV, which leads to the following requirements:
The accompaniment must be mixed with the host’s voice, the lyrics must be displayed in sync with the music, the sound effects and sound quality of the whole scene should be good, and the delay between the host’s voice and the accompaniment should be low.
The targeted optimization plan for this scene is therefore to support accompaniment playback, singing, and mixing in the pre-processing stage, and to transmit the lyrics in sync through SEI (Supplemental Enhancement Information) messages. We choose light noise reduction, and also support pitch change and sound effects.
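The mixing step in pre-processing can be sketched as follows. This is only a minimal illustration, assuming 16-bit PCM samples held in plain Python lists; the function name `mix_pcm16` and the gain values are hypothetical and do not come from any particular SDK.

```python
def mix_pcm16(voice, accompaniment, voice_gain=1.0, music_gain=0.6):
    """Mix two 16-bit PCM sample sequences into one.

    The shorter input is padded with silence, and the result is clamped
    to the int16 range so that loud passages clip instead of wrapping.
    The default gains are illustrative assumptions.
    """
    length = max(len(voice), len(accompaniment))
    mixed = []
    for i in range(length):
        v = voice[i] if i < len(voice) else 0
        m = accompaniment[i] if i < len(accompaniment) else 0
        sample = int(round(v * voice_gain + m * music_gain))
        mixed.append(max(-32768, min(32767, sample)))
    return mixed


if __name__ == "__main__":
    print(mix_pcm16([1000, -1000], [1000, 1000, 1000], 1.0, 0.5))
```

Lowering `music_gain` relative to `voice_gain` keeps the host audible over the accompaniment; a production mixer would normally use a limiter rather than hard clamping.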
The core requirements of an ordinary voice chat room are clear voice and low performance consumption. It also needs to support background music playback, voice change, sound mixing, and cooperation with some business logic.
In the pre-processing stage, we need to support background music playback, voice change, and sound effects. Aggressive noise reduction is turned on to keep the human voice clear. On mobile devices, the call volume can also be used to eliminate echo. In the encoding stage, the sampling rate and bit rate are reduced appropriately to keep transmission reliable.
In some scenarios, people in the chat room go on and off the mic frequently. If someone keeps entering and leaving in this way, the beginning of their speech may be swallowed. Instead, we can keep everyone in the room publishing and playing streams, with non-speakers muted, so that going on and off the mic is only virtual. This protects the quality of the voice chat room.
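The idea of a virtual on/off-mic switch can be sketched as follows. This is a minimal illustration under the assumption that audio is handled frame by frame; the class `VirtualMicSeat` and its methods are hypothetical names, not part of any real SDK.

```python
class VirtualMicSeat:
    """Keeps a participant's stream alive and sends silence while off mic."""

    def __init__(self):
        self.on_mic = False

    def set_on_mic(self, on_mic):
        # Only a flag flips here; the stream itself is never torn down,
        # so there is no connection setup delay when the user speaks.
        self.on_mic = on_mic

    def next_frame(self, captured):
        """Return the frame to publish for this tick."""
        if self.on_mic:
            return list(captured)
        # Off mic: publish a silent frame of the same length.
        return [0] * len(captured)


if __name__ == "__main__":
    seat = VirtualMicSeat()
    print(seat.next_frame([12, -7, 3]))
    seat.set_on_mic(True)
    print(seat.next_frame([12, -7, 3]))
```

Because the stream is already established, switching to the speaking state takes effect on the very next frame, which is what avoids the swallowed first words.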
In addition, we need to provide permission control for mic seat management, such as kicking people out, to better support this scene.
3. Live game audio
Before talking about the live game audio, let’s talk about the principle of echo cancellation.
The core requirement of live game audio is echo cancellation.
Let’s look at the single-talk scene first.
In the single-talk case, echo cancellation is relatively easy, and a more aggressive processing strategy can be adopted. If we can determine with high probability that it is single talk, we can simply remove the entire captured signal and fill in an appropriate amount of comfort noise.
Double talk is the scene where multiple parties talk at the same time. The near-end voice and the far-end echo are mixed together in the signal captured by the microphone, which makes echo cancellation harder: on the one hand the near-end voice must be protected from damage, and on the other hand the echo must be eliminated as far as possible.
Generally speaking, when the far-end echo is 6-8 decibels higher than the near-end voice, eliminating the echo will damage the near-end voice. If the far-end echo is more than 18 times higher than the near-end voice, it may completely cover the near-end voice, and echo cancellation cannot work well.
In this case, a more radical strategy can be adopted, such as removing the far-end echo and the near-end voice together and then filling in an appropriate amount of comfort noise.
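The "remove everything and fill in comfort noise" strategy can be sketched as follows. This is a simplified illustration, not a real echo canceller: it assumes the echo level and near-end level (in dB) have already been estimated elsewhere, and `threshold_db`, `comfort_noise`, and `process_frame` are hypothetical names and tuning parameters.

```python
import math
import random

_DEFAULT_RNG = random.Random(0)


def level_db(samples):
    """RMS level of a frame in dB, for float samples in [-1.0, 1.0]."""
    if not samples:
        return -120.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-6))


def comfort_noise(n, amplitude=0.001, rng=None):
    """Generate n samples of low-level uniform noise (amplitude is assumed)."""
    rng = rng if rng is not None else _DEFAULT_RNG
    return [rng.uniform(-amplitude, amplitude) for _ in range(n)]


def process_frame(mic_frame, echo_db, near_db, threshold_db, rng=None):
    """Replace the frame with comfort noise when the echo dominates.

    echo_db and near_db are level estimates supplied by the caller;
    threshold_db is a tuning value chosen by the caller.
    """
    if echo_db - near_db >= threshold_db:
        return comfort_noise(len(mic_frame), rng=rng)
    return list(mic_frame)


if __name__ == "__main__":
    frame = [0.2, -0.2, 0.2, -0.2]
    print(round(level_db(frame), 2))
    print(process_frame(frame, echo_db=-10.0, near_db=-40.0, threshold_db=20.0))
```

Filling with comfort noise rather than digital silence avoids the "dead line" impression that listeners get when the background drops out completely.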
Now let’s return to the live game audio scene.
As mentioned above, the core requirement of live game audio is to eliminate echo. It also needs clear voice, low latency, low performance consumption, low bandwidth, and support for game development frameworks.
Therefore, the targeted optimization for the game scene is to use aggressive noise reduction in the pre-processing stage. On mobile devices, the call volume is used to eliminate echo, VAD detects the human voice so that only voice is transmitted, and AGC then boosts the voice appropriately.
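The VAD-then-AGC part of this chain can be sketched as follows. This is a toy illustration only: the energy threshold stands in for a real voice activity detector, the gain is computed per frame without smoothing, and all names and default values (`is_voice`, `apply_agc`, `threshold`, `target_rms`, `max_gain`) are assumptions made for the example.

```python
import math


def rms(frame):
    """Root-mean-square level of a frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_voice(frame, threshold=0.02):
    """Energy-only VAD: a stand-in for a real detector (threshold assumed)."""
    return rms(frame) >= threshold


def apply_agc(frame, target_rms=0.1, max_gain=4.0):
    """Scale the frame toward target_rms, capping the gain at max_gain."""
    level = rms(frame)
    if level == 0.0:
        return list(frame)
    gain = min(target_rms / level, max_gain)
    # Clamp to the valid float sample range after applying the gain.
    return [max(-1.0, min(1.0, s * gain)) for s in frame]


def preprocess(frames):
    """Keep only the frames judged to be voice, boosted by the AGC."""
    return [apply_agc(f) for f in frames if is_voice(f)]


if __name__ == "__main__":
    frames = [[0.0] * 4, [0.05, -0.05, 0.05, -0.05], [0.001] * 4]
    print(preprocess(frames))
```

Only the second frame passes the gate here, so only one frame would go on to the encoder. A real AGC would also smooth the gain over time to avoid audible pumping.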
In the encoding stage, the sampling rate and bit rate are reduced appropriately, and in the playback stage the buffers are reduced appropriately to achieve lower delay.
In educational scenarios, such as 1v1 and small-class lessons, the core requirement is that the human voice must be clear. In addition:
- Teachers may play video courseware during class, so media playback is required;
- The devices used in class may be quite diverse, so high compatibility is required;
- Many people may speak at the same time in a small class, so high-quality echo cancellation is required;
- Students may make operating mistakes in class, so fool-proof design is required.
The targeted optimization is as follows. In the pre-processing stage, mobile devices use the system volume to eliminate echo, and PCs use an aggressive echo cancellation strategy. We support media playback and mixing, and raise the speech volume appropriately through AGC.
In the encoding stage, we reduce the bit rate and sampling rate appropriately. We also add fool-proof handling, such as device detection, priority-based selection among multiple devices, and troubleshooting of faulty devices.
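Priority-based device selection that skips faulty devices can be sketched as follows. This is a minimal illustration; the device records, the `kind` and `working` fields, and the priority order are all hypothetical and would come from the platform's device enumeration in a real application.

```python
def pick_device(devices, priority):
    """Return the name of the first working device in priority order.

    devices:  list of dicts with "name", "kind", and "working" keys
              (an assumed shape, for illustration only).
    priority: list of device kinds, most preferred first.
    Returns None when no working device matches.
    """
    for kind in priority:
        for device in devices:
            if device["kind"] == kind and device["working"]:
                return device["name"]
    return None


if __name__ == "__main__":
    devices = [
        {"name": "Built-in mic", "kind": "builtin", "working": True},
        {"name": "BT headset", "kind": "bluetooth", "working": False},
        {"name": "USB headset", "kind": "wired", "working": True},
    ]
    print(pick_device(devices, ["bluetooth", "wired", "builtin"]))
```

In this example the preferred Bluetooth headset fails its check, so the selection falls through to the wired headset without the student having to do anything.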
The core requirements of the music sparring scene are high sound quality, support for accompaniment playback, and a certain amount of fool-proof design.
The targeted optimization is to support accompaniment mixing in the pre-processing stage and to preserve sound quality as far as possible in the encoding stage, using a sampling rate of 48 kHz or higher.
We generally recommend an encoding bit rate of 192 kbps or more, with stereo (dual-channel) support. The accompaniment needs to be displayed in sync during playback. Where appropriate, a little video quality may be sacrificed to preserve the sound quality.
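To put these numbers in perspective, the short calculation below compares the raw PCM bit rate with the recommended encoded bit rate. It assumes 16-bit samples and that the recommended figure of 192 is in kbps; both are assumptions made for this example.

```python
def pcm_bitrate_kbps(sample_rate_hz, channels, bits_per_sample=16):
    """Uncompressed PCM bit rate in kbps (16-bit samples assumed by default)."""
    return sample_rate_hz * channels * bits_per_sample / 1000


if __name__ == "__main__":
    raw = pcm_bitrate_kbps(48000, 2)  # 48 kHz, two channels
    encoded = 192                     # recommended bit rate, assumed kbps
    print(raw)                        # 1536.0
    print(raw / encoded)              # 8.0
```

At 48 kHz stereo the encoder therefore works at roughly an 8:1 compression ratio under these assumptions, which leaves far more room for musical detail than the reduced rates used in the voice-only scenes.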
In addition, in some scenarios, there is another very important point: the quality of the mobile phone or the hardware itself should be high.
For smart devices, first of all, power consumption must be low. The network environment they operate in is often poor, so the requirements for resisting network jitter are higher, and voice recognition may need to be supported at the same time. We also provide signaling channel support.
IoT and VR devices also run on different platforms, so we need to provide multi-platform support, such as some embedded platforms and Android, and use hardware codecs where possible.
In addition, the bit rate of the device is relatively high. To reduce power consumption, we use VAD in the pre-processing stage to detect silence and reduce the encoding cost.
To summarize the scenarios above, each business scenario has a different focus. Even similar business scenarios require different optimization solutions when they are implemented on different clients and in different business areas.
Therefore, we need to choose what best suits the current scenario across the end-to-end voice pipeline, weigh the trade-offs between its stages, and arrive at the most suitable solution.
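One way to make these per-scenario choices explicit is a table of profiles, sketched below. The keys and feature names are hypothetical labels invented for this example; the values simply restate the choices described for each scene above and are not settings from a real SDK.

```python
# Hypothetical per-scenario profiles; the labels are illustrative only.
PROFILES = {
    "ktv": {
        "noise_reduction": "light",
        "features": {"accompaniment_mix", "sei_lyrics", "voice_change", "sound_effects"},
    },
    "voice_chat": {
        "noise_reduction": "aggressive",
        "features": {"background_music", "voice_change", "sound_effects", "reduced_encoding"},
    },
    "game": {
        "noise_reduction": "aggressive",
        "features": {"vad", "agc", "reduced_encoding", "small_playback_buffer"},
    },
    "education": {
        "features": {"media_mix", "agc", "reduced_encoding", "device_checks"},
    },
    "music_sparring": {
        "sample_rate_hz": 48000,
        "channels": 2,
        "features": {"accompaniment_mix"},
    },
    "smart_device": {
        "features": {"vad", "hardware_codec"},
    },
}


def scenarios_using(feature):
    """List the scenarios whose profile enables the given feature."""
    return sorted(name for name, p in PROFILES.items() if feature in p["features"])


if __name__ == "__main__":
    print(scenarios_using("vad"))
    print(scenarios_using("reduced_encoding"))
```

Laying the profiles side by side makes the trade-offs visible: the voice-only scenes share reduced encoding, while the music scenes keep full quality and mix in the accompaniment.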
The development of the Internet of Things brings more scenarios, and real-time voice applications will have more possibilities. 5G continues to develop and will be rolled out in the coming years. It brings a better transmission network, and the quality of real-time voice is expected to improve further.
In addition, voice AI technology is constantly improving in areas such as speech recognition, feature extraction, and emotion recognition. It will combine with real-time voice technology to enable more ways of using it, and the development of hardware will also bring better real-time voice processing.