Before Bell invented the telephone in 1876, voice communication basically relied on shouting: whoever shouted loudest could communicate over the longest distance.

Since then, people have been able to hear distant voices over the telephone, which was the first application of real-time voice technology. As technology, hardware, and infrastructure have developed, real-time voice and video technologies have advanced as well, and they are now deeply embedded in many fields.

I. Social

In the social field, anchor co-hosting and multi-person co-hosting, where guests join a live stream, are common real-time audio and video scenarios. Here we need to keep the audio and video of multiple participants synchronized, optimize performance to reduce CPU consumption, and apply 3D audio processing to make the sound better.

II. In-game chat

An in-game chat room is a pure real-time voice scenario: while playing, you chat with your teammates or with the anchor. The human voice must stay smooth and the delay must be low enough.

III. Online education

In online education, the scenarios include one-on-one tutoring, small classes, large classes, dual-teacher classes, and music training. All of them require smooth voice, and some place very high demands on sound quality.

IV. E-commerce live broadcast

Live-stream sales have become very popular in recent years. This scenario needs to play background music (BGM) and display product information at the same time.

All of the popular scenarios above use real-time voice technology, so let's look at its general processing flow and standard optimization methods.

When the telephone first appeared, people were satisfied simply to be able to hear each other. As the user experience has kept improving, expectations of real-time voice processing systems have kept rising. The requirements can be summed up in one sentence: hear the voice you need, as fast as possible.

We can break this sentence down into three goals.

The first goal: low latency. When the sound is heard at the far end, its delay relative to the sound at the source should be as small as possible; in other words, it should arrive as fast as possible.

The second goal: high fidelity. The sound output by the system should match the original scene as closely as possible. For example, at a live concert, the audience expects the sound of the live instruments to be reproduced faithfully.

The third goal: change the sound as needed. We may add background sounds, music, and sound effects, shift the pitch to achieve special effects, and remove sounds we don't want, such as noise and other sounds that users don't care about.

With these three goals in mind, let's walk through the processing flow of real-time voice.

Starting from audio acquisition, the signal goes through pre-processing, encoding, sending, network transmission, receiving, decoding, post-processing, and audio playback.

For each step, we look at how it affects the three goals above and what should be done to serve them.

1. Acquisition

Acquisition is the process of turning sound, which is pressure fluctuations in the air, into a digital signal. The human ear can hear frequencies from roughly 20 Hz to 20 kHz. According to the Nyquist sampling theorem, the sampling rate must be at least twice the highest frequency to be captured, which is 40 kHz here. With some margin added, the industry settled on a sampling rate of 44.1 kHz, which is recognized as sufficient for human hearing.
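As a rough illustration of the Nyquist argument, the sketch below computes the minimum sampling rate for a 20 kHz upper limit and shows which frequency a pure tone would appear at if it were sampled too slowly. The function names are hypothetical and chosen only for this example.

```python
def nyquist_rate(max_freq_hz: float) -> float:
    """Minimum sampling rate needed to represent frequencies up to max_freq_hz."""
    return 2.0 * max_freq_hz


def aliased_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone after sampling at sample_rate_hz.

    Frequencies above half the sampling rate fold back into the
    0 .. sample_rate_hz / 2 band.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


print(nyquist_rate(20_000))               # 40000.0
print(aliased_frequency(20_000, 44_100))  # 20000 (preserved: below 22.05 kHz)
print(aliased_frequency(20_000, 32_000))  # 12000 (aliased: above 16 kHz)
```

At 44.1 kHz the 20 kHz tone is preserved, while at an assumed lower rate of 32 kHz it would be heard as a 12 kHz tone.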

2. Pre-processing

Pre-processing covers roughly six operations: the "3A" algorithms we often talk about (automatic gain control, acoustic echo cancellation, and automatic noise suppression), plus silence detection, equalization, and pitch shifting.
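To make two of these operations concrete, here is a minimal sketch of energy-based silence detection and a very simplified gain adjustment. It assumes samples are floats in the range -1.0 to 1.0; the threshold, target level, and function names are assumptions made for this example. Production gain control smooths the gain over time rather than computing it per frame.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_silence(frame: Sequence[float], threshold: float = 0.01) -> bool:
    """Energy-based silence detection; the threshold is an assumed value."""
    return rms(frame) < threshold


def apply_gain(frame: Sequence[float], target_rms: float = 0.1,
               max_gain: float = 10.0) -> List[float]:
    """Scale one frame toward target_rms, with the gain capped at max_gain."""
    level = rms(frame)
    if level == 0.0:
        return list(frame)
    gain = min(target_rms / level, max_gain)
    # Clip to the valid sample range so amplification cannot overflow.
    return [max(-1.0, min(1.0, s * gain)) for s in frame]


quiet = [0.001, -0.002, 0.001, -0.001]
speech = [0.02, -0.03, 0.025, -0.02]
print(is_silence(quiet))   # True
print(is_silence(speech))  # False
boosted = apply_gain(speech)
```

A frame flagged as silence can be skipped or encoded at a lower rate, while a quiet but active frame is amplified toward the target level.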

Note that mobile phone systems generally provide hardware pre-processing, and we can also implement pre-processing in software.

If the system supports hardware pre-processing, using it gives better echo cancellation, which makes it more suitable for voice-centered scenarios. If we turn off the system's pre-processing and do the pre-processing in software instead, echo cancellation is somewhat less effective, but music fidelity is better.

In terms of the three goals above, the pre-processing stage affects latency and changing the sound as needed. It does not help high fidelity, because pre-processing cannot make the sound any closer to the original.

3. Encoding

Why do we need to encode? Simply put, to reduce the amount of data. If encoding is not done well, the delay becomes very large and the sound quality suffers, so encoding is a key stage for both reducing delay and delivering high-fidelity sound. Deliberately changing the sound is generally not done during encoding. The direction of encoding optimization is therefore to reduce delay and improve sound quality.
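The arithmetic behind "reducing the amount of data" can be sketched as follows. The example compares uncompressed 16-bit stereo PCM at 44.1 kHz with an assumed encoder target of 64 kbps, and computes how much audio a frame-based encoder must buffer before it can start encoding. The 64 kbps target, the 1024-sample frame size, and the function names are illustrative assumptions, not values from any particular codec.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def frame_duration_ms(samples_per_frame: int, sample_rate_hz: int) -> float:
    """Audio duration covered by one encoder frame, in milliseconds."""
    return samples_per_frame * 1000 / sample_rate_hz


raw_kbps = pcm_bitrate_kbps(44_100, 16, 2)
encoded_kbps = 64.0  # assumed target bitrate for this example

print(raw_kbps)                           # 1411.2
print(round(raw_kbps / encoded_kbps, 2))  # 22.05
print(frame_duration_ms(1024, 44_100))    # about 23.2 ms buffered per frame
```

Larger frames usually compress more efficiently but add buffering delay, which is one reason the encoding stage has to trade delay against sound quality.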

4. Transmission

Transmission is a major source of delay, so reducing latency requires a lot of work here. Packet loss during transmission is also a major cause of degraded sound quality, so we should do our best to reduce it in order to keep sound quality high. Finally, the transmission stage generally does not change the sound.

Common transmission protocols fall into TCP-based and UDP-based protocols. The TCP-based ones are the familiar RTMP, HTTP-FLV, HLS, and so on. These protocols are relatively standard, and integrating a player on each platform is relatively simple. Generally speaking, there is no packet loss, but the delay is relatively large.

Here, "no packet loss" assumes that the network is still tolerable. If the network is extremely poor, data will be lost even over TCP.

UDP-based protocols include WebRTC and the protocols developed by various RTC vendors, as well as reliable UDP-based protocols such as SRT. They implement rate control algorithms such as GCC and BBR, use forward error correction (FEC) to reduce the impact of packet loss, and use NACK retransmission so that lost packets can still be obtained.
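To show the idea behind FEC, here is a minimal sketch that uses a single XOR parity packet per group of equal-length packets. This is one of the simplest possible redundancy schemes, chosen for illustration; it is not the scheme used by WebRTC, SRT, or any particular vendor, and the function names are hypothetical. When more than one packet in a group is lost, the parity cannot repair it, and retransmission is needed instead.

```python
from typing import List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """Build one parity packet as the byte-wise XOR of equal-length packets."""
    if not packets:
        raise ValueError("need at least one packet")
    size = len(packets[0])
    if any(len(p) != size for p in packets):
        raise ValueError("packets must have equal length")
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received: List[Optional[bytes]], parity: bytes) -> List[Optional[bytes]]:
    """Rebuild at most one lost packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if not missing:
        return list(received)
    if len(missing) > 1:
        raise ValueError("one parity packet can repair only one loss")
    present = [p for p in received if p is not None]
    # XOR of the surviving packets and the parity yields the lost packet.
    result = list(received)
    result[missing[0]] = xor_parity(present + [parity])
    return result


group = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(group)
print(recover_missing([b"aaaa", None, b"cccc"], parity))
# [b'aaaa', b'bbbb', b'cccc']
```

The cost is extra bandwidth (one additional packet per group here), paid in exchange for repairing a loss without waiting a round trip for retransmission.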

Of course, no matter how good the protocol is, it cannot make up for a bad network. For this reason, in order to transmit data faster and more steadily, a dedicated high-quality transmission network has been built so that each endpoint connects to a nearby node, and BGP resources have been purchased to guarantee transmission quality.

5. Post-processing

In the receiving, decoding, post-processing, and playback stages, poor handling increases the delay. These stages also affect the fidelity of the sound, and some of the requirements for changing the sound are implemented in post-processing.

One technology worth mentioning here is NetEQ, which can be used to keep the sound continuous after network jitter or packet loss.

A simple way to understand NetEQ is that it makes decisions based on the audio data that has been received. Suppose a piece of data is missing, or earlier network jitter held a piece of data back so that it arrived too late to be played on time. Depending on whether there is too much or too little buffered data, playback is sped up or slowed down so that the audio stays continuous.
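The "too much or too little" decision can be sketched as a comparison between the amount of buffered audio and a target level. This is only a toy model of the idea described above, not NetEQ's actual algorithm; the enum, the function name, and the 60 ms target with a 20 ms tolerance are assumptions made for this example.

```python
from enum import Enum


class PlayoutAction(Enum):
    SPEED_UP = "speed_up"    # too much buffered audio: play slightly faster
    NORMAL = "normal"        # buffer near the target: play at normal speed
    SLOW_DOWN = "slow_down"  # too little buffered audio: stretch playback


def choose_action(buffered_ms: float, target_ms: float = 60.0,
                  tolerance_ms: float = 20.0) -> PlayoutAction:
    """Pick a playout action from the amount of audio waiting in the buffer."""
    if buffered_ms > target_ms + tolerance_ms:
        return PlayoutAction.SPEED_UP
    if buffered_ms < target_ms - tolerance_ms:
        return PlayoutAction.SLOW_DOWN
    return PlayoutAction.NORMAL


# A burst of late packets arrives, then the network stalls again.
for buffered in (120.0, 60.0, 10.0):
    print(buffered, choose_action(buffered).value)
```

Speeding up drains a backlog and pulls the delay back down, while slowing down buys time for late packets to arrive before the buffer runs dry.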

Post-processing can also add effects that change the sound, or run speech recognition. In the playback stage, the sound is rendered and played out through the audio output device.

Across the stages above, different optimizations serve the three goals; these are the relatively common optimization methods.

Different scenarios emphasize different aspects of a real-time voice system, so each has its own optimization direction. Sometimes one aspect must be sacrificed to guarantee another. The core idea of scenario-based optimization is to choose, at each stage of real-time voice processing, the processing solution that best suits the scenario.

In practice, this means selecting the most suitable solution and algorithm at each stage according to the scenario, sometimes with the cooperation of hardware.