To help developers better understand the concepts of audio and video and quickly get started with the development of audio and video apps, ZEGOCLOUD and its audio and video development experts have designed this course “Advanced Audio and Video Development”.
This course consists of a series of lessons starting from explaining the basic concepts of audio and video to tackling problems related to audio and video app development and then using SDKs to develop audio and video apps.
Before we get into the topic, let’s take a look at the basic processing flow of audio and video data in the RTC scenario. In combination with actual use cases, this flow can be explained from the perspective of two roles: the host and the participant.
Audio and video data flow process
Simply put, the host captures and sends audio and video data, and the participant receives and plays the data. The host and the participant are connected over a real-time network.
In addition, the audio and video data captured by the host may have issues such as noise, echo, and large data volume, making it unsuitable to directly transmit the data over the network. The data pulled by the participant from the host is encoded and compressed and cannot be directly played back.
To solve these problems, we introduced modules such as the pre/post-processing module and the encoding/decoding module to form a basic symmetric data flow process over the network, as shown in the following figure.
Note that the figure shows a simple one-way process from host to participant, which is called a single-host scenario. The process can also be two-way in scenarios such as co-hosting, where two users have an audio or video call through an app such as WeChat. In this case, each party acts as both host and participant, and the data flow is two-way.
The audio and video data flow in the RTC scenario basically complies with the preceding process. In this lesson, we’ll focus on the Pre-processing module.
Three A’s in Audio Processing
There are many features for audio pre-processing. Some of them are designed to offer better sound quality. For example, acoustic echo cancellation and ambient noise suppression remove irrelevant signals from the raw sound signals to make the sound pure. Some are to make the sound more interesting, such as pitch shifting, virtual stereo, and reverberation. These features add special effects to the sound.
Among the pre-processing features that offer better sound quality, the following three A’s for audio processing are worth noting:
- Acoustic echo cancellation (AEC)
- Ambient noise suppression (ANS)
- Automatic gain control (AGC)
They are commonly used in audio processing and will be described one by one later.
You can infer the functions of ANS and AGC from their names, because noise and gain control are concepts we are all familiar with. AEC, however, may seem relatively mysterious. From the name alone, you can only tell that it removes echo signals from sounds.
What is an echo in the RTC scenario? Why do echoes exist in voice signals? Why do we need to remove them?
Causes of Echo
Echoes are everywhere in our daily lives. Imagine this scene: you shout “Hello” toward the mountains. What happens? You first hear your own voice “Hello”, and soon afterward the mountain’s “reply”, which is also “Hello”. This “reply” from the mountains is the phenomenon called echo.
A simple explanation of echo in physics is as follows: After sounds are created by the vibration of the sound source, they spread around. Some sounds are directly transmitted to the human ears and are thus called direct sounds. Some are reflected by obstacles such as walls before they are transmitted to the human ears and are thus called reflected sounds.
If the interval between a direct sound and a reflected sound is more than 0.1 s, you can distinguish the two, and the latter is the so-called echo. After learning about echo in everyday life and physics, let’s take a look at the echo that AEC has to deal with in audio pre-processing. Echo in the RTC scenario is more complicated. Take co-hosting as an example.
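As a quick back-of-the-envelope illustration of that 0.1 s threshold (a rough sketch assuming sound travels at about 343 m/s in air), we can estimate how far away a reflecting surface must be for its reflection to register as a distinct echo:

```python
# Rough estimate: minimum distance to a reflector for the reflected
# sound to arrive at least 0.1 s after the direct sound.
SPEED_OF_SOUND = 343.0   # m/s in air at roughly room temperature
THRESHOLD = 0.1          # s, the perceptual threshold mentioned above

# The reflected sound travels to the obstacle and back, so the
# round trip covers twice the distance.
min_distance = SPEED_OF_SOUND * THRESHOLD / 2
# min_distance is about 17 m: reflections from nearer surfaces blend
# with the direct sound instead of being heard as a separate echo.
```

This is why you don’t hear distinct echoes in ordinary rooms: the reflections arrive far sooner than 0.1 s and blend into the direct sound, yet, as we’ll see, AEC still has to deal with them.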
In the following figure, user A and user B are co-hosting, and they each have their respective microphones and speakers.
Technically, the voices of user A and user B are captured by their own microphones, transmitted to the other party through the network, and played out on the other party’s speaker. Everything seems fine. However, something unexpected may get involved during an actual interaction, as shown in the following figure.
1. At a moment, user A starts to speak and generates voice A, which is captured by microphone A, transmitted to user B through the network, and then becomes voice A1 which is to be played.
2. After voice A1 is played on speaker B, it is captured by microphone B as voice A2 (echo A2 in the figure) through direct transmission and reflection by the surrounding environment.
3. Then, voice A2 is transmitted to user A through the network and played out on speaker A.
In the preceding process, user A finds that soon after speaking something, they hear their own voice. This is the echo in the RTC scenario. Echo A2 mainly includes direct echo (A1 is played on the speaker and directly enters the microphone without any reflection) and indirect echo (after A1 is played on the speaker, the sound enters the microphone after one or more times of reflection by the environment). (Note: Only acoustic echo is discussed here, and line echo caused by device wire exceptions is beyond the scope of this video.)
The preceding process describes the “single-talk” scenario of co-hosting (only one person speaks while the other listens). If user A and user B speak at the same time (double-talk), echo A2 will be mixed with user B’s voice B and then transmitted to user A. This significantly affects user A. Likewise, user B will also have the same problem of hearing an echo of their own voice.
If this situation is not dealt with, both parties will repeatedly hear the echoes of their own voices and cannot hear each other clearly. It would be a terrible experience. Whether in daily voice calls, in-game party chats, online karaoke, or other scenarios, echo is a big problem that developers must pay attention to and solve. AEC, one of the three A’s, is designed to solve exactly this problem. Next, let’s take a look at how AEC works.
Principle of AEC
1. Basic logic of AEC
Based on the previous analysis, the echo heard by user A is captured by user B’s microphone. To solve this problem, we must start from user B’s end, as shown in the following figure.
In the preceding figure, voice A1 is played out on the speaker. It is a known signal, and we call it the reference signal. According to how echo A2 and voice B are generated, we call them the far-end echo signal and the near-end voice signal, respectively. These two signals generate a mixed signal C (C = A2 + B) after they are captured by the microphone. The mixed signal C is easy to get, but the A2 and B signals in it are like salt and sugar dissolved in a glass of water, which are difficult to distinguish.
To sum up, if we can subtract the far-end echo A2 from signal C, only the “clean” near-end voice B (B1 = C – A2) is left. This is what the AEC module in the figure is for. The process seems simple: since A2 is the echo of the known reference signal A1, the two should sound similar. Does that mean we can simply substitute A1 for A2 and subtract A1 from signal C directly? Unfortunately, the problem is far more complex.
From when the reference signal A1 is played to when echo A2 is captured, the transmission path is loudspeaker -> room -> microphone (LRM). The room environment is time-varying, and so is the LRM path. These uncertain factors make the two signals very different at the digital processing level. If the reference signal A1 is directly subtracted from signal C, the output will have a lot of residues and is very different from voice B.
Therefore, we cannot directly replace A2 with A1 or easily pick out A2. Is there anything we can do? Of course, there is. The voices A1 and echo A2 are like twins with similar appearances and different personalities. There is still a non-negligible correlation between them. We can use this correlation to come up with an indirect solution:
The LRM path is mathematically simulated and solved by using the function F(x) where A2 = F(A1). Subtracting F(A1) from the mixed signal C can also achieve AEC. This is the key to the AEC algorithm. Its basic logic is to simulate the path by estimating the characteristic parameters of the echo path, use the path function F(x) obtained from simulation and the reference signal A1 to calculate the echo signal A2, and then subtract A2 from the captured signal C to create a “clean” signal output. That is:
Ideal output: C – A2 = B
Actual output: C – F(A1) = B1
Difference between the two: B – B1 = F(A1) – A2
If we can accurately find the echo path, then F(A1) = A2 and B1 = B. This perfectly implements AEC. However, it is very difficult to achieve “F(A1) = A2” in real-life situations. We not only need to deal with the complex external reflection environment but also need to consider the possible exceptions that might be introduced by the voice capture and playback devices.
Designing an excellent AEC algorithm involves a great deal of work: it requires mathematical knowledge, signal processing knowledge, and lots of practical experience. As app developers, we don’t have to delve into the details of the algorithm at the beginning, but we must understand its basic principle, which will help us solve echo problems in practical use cases.
2. Basic principle of the AEC algorithm
Based on the previous discussion, the input signals for AEC mainly include the reference signal (the aforementioned voice A1), the far-end echo signal (the aforementioned echo A2), and the near-end voice signal (the aforementioned voice B). The desired output signal is a clean near-end voice. The echo path LRM in the environment is unknown and we usually need a linear filter to simulate the path. Because LRM is dynamic and time-varying, a filter with fixed parameters cannot meet the requirements. Therefore, we need an adaptive filter that is able to dynamically adjust parameters according to the changes in its own state and the environment.
In short, we need an adaptive linear filter to find the echo path F(x). The adaptive linear filter can estimate the echo signal A2 according to the reference signal A1 in a complex and volatile environment, and use the correlation between A1 and A2 to remove as many echoes from signal C as possible.
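To make this concrete, here is a toy sketch of such an adaptive filter, using the classic normalized least-mean-squares (NLMS) update (an illustrative example, not ZEGOCLOUD’s actual implementation; the signal names follow the notation above):

```python
import numpy as np

def nlms_echo_cancel(ref, mic, taps=128, mu=0.5, eps=1e-8):
    """Toy NLMS adaptive filter: estimate the echo path F(x) from the
    reference signal `ref` (A1) and subtract the estimated echo F(A1)
    from the microphone signal `mic` (C = A2 + B).
    Returns the error signal, i.e. the estimated near-end voice B1."""
    w = np.zeros(taps)       # adaptive estimate of the echo path
    x = np.zeros(taps)       # sliding window of recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = ref[n]
        echo_est = w @ x         # F(A1): estimated echo sample
        e = mic[n] - echo_est    # B1 = C - F(A1)
        # NLMS update: step size normalized by the input power, so the
        # filter keeps adapting as the reference level changes.
        w = w + mu * e * x / (x @ x + eps)
        out[n] = e
    return out

# Tiny demo: the "room" is a simple delay-and-attenuate echo path.
rng = np.random.default_rng(0)
far = rng.standard_normal(4000)                        # far-end reference A1
echo = 0.6 * np.concatenate([np.zeros(8), far[:-8]])   # far-end echo A2
mic = echo                        # single-talk: no near-end voice mixed in
residual = nlms_echo_cancel(far, mic)
# After the filter converges, the residual energy falls far below the
# echo energy.
```

Real AEC filters work in the frequency domain on short frames and handle much longer echo tails, but the adapt-estimate-subtract loop is the same idea.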
After being processed by the adaptive linear filter, the near-end voice B is purified to some extent, but generally, there are still some residual echoes. Therefore, a second round of echo cancellation is needed to process the residual echoes according to the residual amount. Here, the residual amount refers to the correlation between the residual echoes and the far-end reference signal. A greater correlation means more residual echoes and vice versa.
Finally, there may be a small number of stubborn residuals, such as non-linear signals introduced by the distortion of devices like speakers and microphones. These non-linear signals cannot be removed by linear filters and need to be clipped non-linearly.
To sum up, a complete AEC requires linear adaptive filtering + residual echo suppression + non-linear clipping.
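The last two stages can also be sketched in simplified form. The toy example below (illustrative only; real implementations work per frequency band on short overlapping frames) attenuates frames of the linear filter’s output in proportion to their correlation with the far-end reference, then clips tiny leftovers non-linearly:

```python
import numpy as np

def suppress_residual(residual, ref, frame=256, floor=1e-3):
    """Toy post-processing after the adaptive filter:
    1) residual echo suppression driven by the correlation between the
       residual and the far-end reference signal,
    2) non-linear clipping of whatever survives below a small threshold."""
    out = np.copy(residual)
    for start in range(0, len(residual) - frame + 1, frame):
        r = residual[start:start + frame]
        x = ref[start:start + frame]
        # Normalized correlation between residual and far-end reference:
        # high correlation means the frame is mostly residual echo.
        denom = np.sqrt((r @ r) * (x @ x)) + 1e-12
        corr = abs(r @ x) / denom
        gain = 1.0 - corr          # attenuate echo-like frames harder
        out[start:start + frame] = r * gain
    # Non-linear clipping: zero out tiny leftovers instead of scaling them.
    out[np.abs(out) < floor] = 0.0
    return out
```

A frame that is a pure scaled copy of the reference gets a gain near zero and is then clipped away entirely, while a frame uncorrelated with the reference passes through almost untouched.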
3. AEC policies in single-talk and double-talk scenarios
As we mentioned earlier, co-hosting scenarios are divided into two types according to the number of users speaking at the same time: single-talk and double-talk. In these two scenarios, the input signals for AEC are different, and so are the processing policies.
First, we can determine whether it is a double-talk case by comparing the characteristics of the far-end signal and the near-end signal, such as peak correlation, frequency domain correlation, and amplitude similarity. If the energy of each signal is high and the correlation is very low, it’s probably a double-talk scenario.
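A heavily simplified, frame-level version of such a detector might look like the following (illustrative only; production detectors combine several features and smooth their decisions over time):

```python
import numpy as np

def is_double_talk(near_frame, far_frame, energy_floor=1e-6,
                   corr_threshold=0.5):
    """Toy double-talk detector: flag a frame as double-talk when both
    signals carry energy but the near-end frame correlates only weakly
    with the far-end frame (i.e. it is not just an echo of the far end)."""
    e_near = near_frame @ near_frame
    e_far = far_frame @ far_frame
    if e_near < energy_floor or e_far < energy_floor:
        return False                      # one side is silent: single-talk
    corr = abs(near_frame @ far_frame) / np.sqrt(e_near * e_far)
    return corr < corr_threshold          # weak correlation => double-talk
```

When a frame is flagged as double-talk, the canceller would typically freeze or slow down the adaptive filter’s updates to avoid diverging on the near-end voice.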
If it is a single-talk scenario, since only the far-end user A speaks, the voice signals captured by user B’s microphone only contain far-end echo without near-end voice. In this case, AEC is relatively easy, and we can even use more aggressive policies, such as directly removing all voice signals and properly filling in comfort noise to improve the listening experience. A linear adaptive filter can provide a better AEC effect, reducing the workload of subsequent residual echo suppression and non-linear clipping.
If it is a double-talk scenario, since both the far-end and near-end users are talking at the same time, the signals captured by the microphone contain the far-end echo and near-end voice. The mixing of the two makes the processing difficult. We must remove the far-end echo without compromising the sound quality of the near-end voice. If the far-end echo has higher energy than the near-end voice (for example, over 6–8 dB), it is difficult to avoid damage to the near-end voice during the echo cancellation process. In this case, we must appropriately reduce the cancellation strength of the adaptive filter and adjust the policies of subsequent residual echo suppression and non-linear clipping accordingly.
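To make the energy comparison concrete, the echo-to-voice ratio is usually expressed in decibels. The helper below is an illustrative sketch (not part of any SDK):

```python
import numpy as np

def energy_ratio_db(echo, voice):
    """Ratio of far-end echo energy to near-end voice energy, in dB.
    Positive values mean the echo is louder than the near-end voice."""
    return 10 * np.log10((echo @ echo) / (voice @ voice))

# Example: an echo with twice the amplitude of the voice has 4x the
# energy, i.e. about +6 dB.
voice = np.ones(100)
echo = 2 * np.ones(100)
ratio = energy_ratio_db(echo, voice)   # roughly 6.02 dB
```

A ratio above roughly 6 dB, as in this example, is where protecting the near-end voice during cancellation starts to become difficult.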
AEC technology has long been a cutting-edge field for major audio and video technology providers. The quality of residual echo processing and the degree of protection of the near-end sound quality together reflect the maturity of an AEC algorithm.
ZEGOCLOUD SDK’s proprietary audio and video engine, based on a great deal of verification in practice and application feedback, provides optimal performance in residual echo suppression and non-linear clipping.
To meet users’ different requirements for sound quality, the SDK supports different AEC levels (such as soft, balanced, and aggressive) and delivers industry-leading results in double-talk, music, and other demanding scenarios, providing good echo cancellation while preserving sound quality. In addition to application-level AEC, the SDK also supports the device system’s built-in AEC, which is more aggressive: it cancels echo more thoroughly but causes greater damage to sound quality. Still, system AEC has particular advantages in certain scenarios, which will be briefly discussed later.
Echo Problems in Practical Applications
From the above content, we systematically learned about the definition of echo in the RTC scenario and the basic principle of AEC. Next, let’s use this knowledge to locate and solve echo problems in practical applications.
First of all, we must be clear that if one of the two users hears the echo of their own voice during co-hosting, it is very likely that AEC at the other end is not working well. Of course, there are some exceptions:
- The user is using the In-Ear Monitor feature.
- A headphone circuit fault causes line echo.
- The user’s sound card is set to loop back the captured audio.
- A software error occurs.
- The user deliberately requests that their own voice be sent back.
Although the final result is that users repeatedly hear their own voice, these situations are not echo problems in the conventional sense and cannot be solved by AEC. We can avoid them through other means, such as device tuning, correct usage, and business logic.
After excluding these less common “echo” problems, we can use the formula C – F(A1) = B1, together with the preceding figure, to analyze the remaining common problems one by one.
1. Problems with Signal C
Signal C is the mixed signal captured by the microphone and is the subject of AEC. It consists of the near-end voice and the far-end echo. If the energy of the far-end echo in signal C is much greater than that of the near-end voice, for example, because the speaker is too close to the microphone or the speaker’s output volume is so high that it drowns out the near-end voice, AEC can cause unexpected damage. In this case, you are advised to turn down the volume of the local playback device.
2. Problems with Reference Signal A1
Signal A1 is the reference signal used for AEC. The AEC algorithm estimates the echo from this reference signal by using the function F(A1), so the accuracy of A1 directly affects the AEC result. The greater the difference between the actually played sound signal and the reference signal, the more difficult the simulation and estimation. Ideally, the reference signal A1 equals the signal that the speaker is about to play. This does not hold in the following circumstances:
2.1 The actual played signal is changed. For example, the output device has processed the sound of signal A1, resulting in a big difference between the actual played signal and the reference signal. As a result, F(A1) calculated based on the wrong signal cannot achieve a good AEC result.
2.2 The reference signal A1 cannot be obtained. This is generally because the executor of AEC is not the producer of signal A1. For example, app A uses its proprietary algorithm for AEC, but the signal played by the speaker contains audio generated by app B (for example, music software is playing music in the background). Because app A is not the producer of the audio and has no system-level permission, it is impossible for app A to identify the audio as a reference signal, let alone other processing steps.
The echo problems caused by the alteration or absence of signal A1 are difficult to solve algorithmically. Apart from disabling device-side sound processing and avoiding third-party audio playback, we can only rely on the pre-processing module of the system hardware. Because the system module has the highest permission level, it can obtain the final, complete signal played by the system speaker. This is the natural advantage of system pre-processing over app-level pre-processing.
3. Problems With the Echo Path F(x)
If we can get the correct reference signal, and the energy ratio of the far-end echo to near-end voice in the mixed signal C is reasonable, but the AEC result is still unsatisfactory, it may be because there is a problem with the simulation of the echo path. If the problem of the AEC algorithm itself is ruled out, it may be caused by frequent changes in the playback and capture environment (including hardware/ambient environment).
For example, the audio device is constantly moving or is covered, or the user suddenly enters a noisy corridor from an empty room. All this will cause the LRM path to change and the filter to be unable to adapt and promptly converge (or even fail to converge). Then, an echo occurs. In this case, we need to further optimize the AEC algorithm and increase the speed of adaptation and convergence. We also need to improve the co-hosting environment and ensure its stability.
Finally, one efficient fix that solves most common echo problems is simply to wear headphones.
When you wear headphones, the far-end audio goes straight into your ears and is barely transmitted into the surrounding environment, so it will not be captured by the microphone and turn into an echo. Moreover, in scenarios with extremely high requirements on sound quality and latency, such as multi-person real-time online karaoke, we also recommend that users wear headphones and disable AEC to avoid the sound quality damage and extra processing time that AEC introduces.
Of course, there are many other factors that lead to echo problems and lots of solutions in the RTC scenario. Do not attempt to use the same solution to fix every problem. Instead, on the premise of understanding the principle of AEC, we should use theory to guide our practice and build experience from practice to complete our knowledge system.
As mentioned earlier, AEC technology has long been one of the cutting-edge fields of major RTC service providers. It has more complex, more profound, and of course more interesting content for everyone to explore. Today, we have only made a preliminary exploration of the basic principle and simple applications. Although it is far from enough to allow you to design an excellent AEC algorithm, we hope it can help you take the first step in practice.