Video content accounts for more than 80% of all internet traffic, and this share is expected to keep growing. As a result, building an efficient video compression system that delivers higher-quality frames within a given bandwidth budget is crucial. Artificial intelligence is not new to video processing: most major video conferencing programs can now replace a real-life background with a different one or apply sophisticated AI-based background blurring. Real-time AI-based video compression goes a step further, using AI not only to generate the subject in real time but also to alter it in useful ways, such as aligning the speaker's face with a virtual front-facing camera. The technology could usher in a new era of crisper, more reliable video conferencing, especially for people with slow Internet connections, while consuming less bandwidth than current solutions.

## A brief introduction to traditional video compression

This section provides a basic overview of video compression. The encoder generates a bitstream from the input frames, and the decoder reconstructs the video frames from the received bitstream. The input frame $x_t$ is divided into a set of equally sized square blocks (for example, $8 \times 8$). The encoder of a classic video compression algorithm then proceeds as follows:

**Step 1: Motion estimation.** Estimate the motion between the current frame $x_t$ and the previously reconstructed frame $\hat{x}_{t-1}$. A motion vector $v_t$ is obtained for each block.
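As a rough illustration of Step 1, the sketch below performs exhaustive block matching with the sum of absolute differences (SAD) as the matching cost. The function names, the `(dy, dx)` offset convention, and the choice of SAD are assumptions made for this example; production codecs use faster search strategies and sub-pixel refinement.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame stored as rows of pixel values


def sad(cur: Frame, ref: Frame, by: int, bx: int, dy: int, dx: int, n: int) -> int:
    """Sum of absolute differences between the n x n block of `cur` at
    (by, bx) and the block of `ref` displaced by (dy, dx)."""
    return sum(
        abs(cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j])
        for i in range(n)
        for j in range(n)
    )


def estimate_motion(cur: Frame, ref: Frame, by: int, bx: int,
                    n: int, search: int) -> Tuple[int, int]:
    """Return the (dy, dx) offset into `ref` that best matches the block
    of `cur` at (by, bx), searching offsets in [-search, search]."""
    h, w = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, by, bx, 0, 0, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that fall outside the reference frame.
            inside = (0 <= by + dy and by + dy + n <= h
                      and 0 <= bx + dx and bx + dx + n <= w)
            if not inside:
                continue
            cost = sad(cur, ref, by, bx, dy, dx, n)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best
```

The returned offset plays the role of the motion vector $v_t$ for one block; calling the function once per block yields the full motion field.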

**Step 2: Motion compensation.** Based on the motion vector $v_t$ obtained in Step 1, the predicted frame $\bar{x}_t$ is generated by copying the corresponding pixels from the previously reconstructed frame to the current frame. The residual between the original frame $x_t$ and the predicted frame $\bar{x}_t$ is then $r_t = x_t - \bar{x}_t$.
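A minimal sketch of Step 2 for a single block, assuming integer motion vectors given as `(dy, dx)` offsets into the reference frame. The helper names `predict_block` and `residual_block` are made up for this example.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame stored as rows of pixel values


def predict_block(ref: Frame, by: int, bx: int,
                  mv: Tuple[int, int], n: int) -> Frame:
    """Copy the n x n block that the motion vector points to in the
    previously reconstructed frame `ref`."""
    dy, dx = mv
    return [[ref[by + dy + i][bx + dx + j] for j in range(n)] for i in range(n)]


def residual_block(cur: Frame, pred: Frame, by: int, bx: int) -> Frame:
    """Residual r_t = x_t - x_bar_t for the block of `cur` at (by, bx)."""
    n = len(pred)
    return [[cur[by + i][bx + j] - pred[i][j] for j in range(n)] for i in range(n)]
```

When the prediction is good, the residual is close to zero almost everywhere, which is what makes it cheaper to code than the raw block.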

**Step 3: Transform and quantization.** The residual $r_t$ from Step 2 is quantized to $\hat{y}_t$. Before quantization, a linear transform (e.g., DCT) is applied to improve compression performance.
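The following sketch shows one way Step 3 could look for a single residual block: a separable orthonormal 2-D DCT followed by uniform scalar quantization. The single step size `step` is an assumption for illustration; real codecs use integer transform approximations and per-frequency quantization parameters.

```python
import math
from typing import List

Block = List[List[float]]


def dct_1d(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        ))
    return out


def dct_2d(block: Block) -> Block:
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def quantize(coeffs: Block, step: float) -> List[List[int]]:
    """Uniform scalar quantization: divide by the step size and round."""
    return [[round(c / step) for c in row] for row in coeffs]
```

For a flat block, all of the energy lands in the top-left (DC) coefficient and every other coefficient quantizes to zero, which is the effect the transform is meant to achieve on smooth residuals.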

**Step 4: Inverse transform.** The quantized result $\hat{y}_t$ from Step 3 is passed through the inverse transform to obtain the reconstructed residual $\hat{r}_t$.
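A matching sketch for Step 4, assuming the transform in Step 3 was an orthonormal DCT and the quantizer a uniform one with step size `step` (both are illustrative assumptions): the quantized levels are scaled back and passed through the inverse DCT.

```python
import math
from typing import List

Block = List[List[float]]


def idct_1d(c: List[float]) -> List[float]:
    """Inverse of the orthonormal DCT-II (i.e. an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def idct_2d(coeffs: Block) -> Block:
    """Separable 2-D inverse DCT: invert every row, then every column."""
    rows = [idct_1d(r) for r in coeffs]
    cols = [idct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dequantize(levels: List[List[int]], step: float) -> Block:
    """Map integer quantization levels back to coefficient values."""
    return [[v * step for v in row] for row in levels]
```

The output only approximates the original residual, because the rounding in Step 3 discards information that the inverse transform cannot restore.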

**Step 5: Entropy coding.** The motion vector $v_t$ from Step 1 and the quantized result $\hat{y}_t$ from Step 3 are both encoded into bits by an entropy coding method and transmitted to the decoder.
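To give a feel for what entropy coding buys, the sketch below computes Huffman code lengths for a list of quantized symbols and the resulting bit count. Huffman coding is used here only as a simple stand-in: H.264 and H.265 rely on context-adaptive schemes (CAVLC and CABAC), which are not reproduced here.

```python
import heapq
from collections import Counter
from typing import Dict, Hashable, List


def huffman_code_lengths(symbols: List[Hashable]) -> Dict[Hashable, int]:
    """Return the Huffman code length (in bits) of each distinct symbol."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        # A lone symbol still needs one bit per occurrence in this sketch.
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def coded_bits(symbols: List[Hashable]) -> int:
    """Total number of bits needed to code `symbols` with the Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)
```

For example, `coded_bits([0, 0, 0, 0, 1, 1, 2, 3])` is 14, compared with 16 bits for a fixed 2-bit code, because the frequent symbol `0` receives a 1-bit codeword.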

**Step 6: Frame reconstruction.** The reconstructed frame $\hat{x}_t$ is obtained by adding $\bar{x}_t$ from Step 2 and $\hat{r}_t$ from Step 4, i.e. $\hat{x}_t = \hat{r}_t + \bar{x}_t$. The reconstructed frame is then used in Step 1 for motion estimation of the $(t + 1)$-th frame.
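Step 6 amounts to an element-wise addition. The sketch below adds one detail that is an assumption rather than part of the description above: the result is clipped to the 8-bit range $[0, 255]$.

```python
from typing import List

Frame = List[List[int]]


def reconstruct(pred: Frame, rec_residual: Frame) -> Frame:
    """x_hat_t = x_bar_t + r_hat_t, clipped to the assumed 8-bit pixel range."""
    return [
        [min(255, max(0, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(pred, rec_residual)
    ]
```

Because the encoder runs this same reconstruction, both sides predict frame $t + 1$ from an identical reference and their predictions do not drift apart.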

To generate the reconstructed frame $\hat{x}_t$, the decoder takes the bits produced by the encoder at Step 5 and performs motion compensation (Step 2), inverse quantization and transform (Step 4), and finally frame reconstruction (Step 6).

Figure 1.a illustrates the traditional video compression framework. Note that all of the modules are present on the encoder side, whereas the decoder side omits the modules shown in blue.

**Figure 1**: (a) The predictive coding architecture used by the traditional video codecs H.264 and H.265. (b) The end-to-end video compression network. The modules shown in blue are not included on the decoder side. (Source)

## End-to-end deep video compression network

A high-level overview of the end-to-end video compression architecture is shown in Figure 1.b. There is a one-to-one correspondence between the traditional video compression framework and the proposed deep learning-based framework. The differences and the relationship between the two are summarized below:

**Step N1: Motion estimation and compression.** We employ a CNN model to estimate the optical flow, which serves as the motion information $v_t$. Instead of directly encoding the raw optical flow values, an MV encoder-decoder network (Figure 1.b) compresses and decodes the optical flow, with $\hat{m}_t$ denoting the quantized motion representation. The MV decoder net then decodes the corresponding reconstructed motion information $\hat{v}_t$. Details are given in Figure 2.

**Figure 2**: MV encoder-decoder network. $Conv(3,128,2)$ denotes a convolution with a kernel size of $3 \times 3$, $128$ output channels, and a stride of $2$. GDN/IGDN is the nonlinear transform function. The binary feature map is only used for illustration. (Source)

**Step N2: Motion compensation.** Based on the optical flow obtained in Step N1, a motion compensation network produces the predicted frame $\bar{x}_t$. Details are given in Figure 3.
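Flow-based motion compensation commonly starts by warping the previous reconstructed frame with the optical flow. The sketch below shows only such a backward warp with bilinear interpolation; it assumes the flow stores a `(dy, dx)` displacement per pixel and clamps samples at the frame border. The learned network that turns the warped frame into the final prediction is not shown.

```python
import math
from typing import List, Tuple

Frame = List[List[float]]
Flow = List[List[Tuple[float, float]]]  # per-pixel (dy, dx) displacement


def bilinear_sample(frame: Frame, y: float, x: float) -> float:
    """Sample `frame` at a fractional position, clamping to the border."""
    h, w = len(frame), len(frame[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * frame[y0][x0] + wx * frame[y0][x1]
    bottom = (1 - wx) * frame[y1][x0] + wx * frame[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp(prev: Frame, flow: Flow) -> Frame:
    """Backward warp: each output pixel reads `prev` at its displaced position."""
    h, w = len(prev), len(prev[0])
    return [
        [bilinear_sample(prev, y + flow[y][x][0], x + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]
```

Unlike the block copy of the traditional Step 2, this operation works per pixel and handles fractional displacements, which is what makes a dense optical flow usable as motion information.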

**Figure 3**: Motion compensation network. (Source)

**Steps N3-N4: Transform, quantization, and inverse transform.** We replace the linear transform of Step 3 with a highly non-linear residual encoder-decoder network, which maps the residual $r_t$ to the representation $y_t$. $y_t$ is then quantized to $\hat{y}_t$. Because plain rounding is not differentiable, a quantization approach compatible with gradient-based training is used so that the whole system can be trained end to end. The reconstructed residual $\hat{r}_t$ is obtained by feeding the quantized representation $\hat{y}_t$ into the residual decoder network.
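One widely used way to make quantization trainable in learned compression is to add uniform noise in $[-0.5, 0.5]$ during training and to round only at test time. The sketch below illustrates that idea on a plain list of values; treating it as the exact scheme used here is an assumption, and the function name and `training` flag are made up for this example.

```python
import random
from typing import List, Optional


def quantize(y: List[float], training: bool,
             rng: Optional[random.Random] = None) -> List[float]:
    """Training: add uniform noise as a differentiable stand-in for rounding.
    Testing: round each value to the nearest integer."""
    if training:
        rng = rng or random.Random()
        return [v + rng.uniform(-0.5, 0.5) for v in y]
    return [float(round(v)) for v in y]
```

The noisy values stay within half a unit of the input, just like rounded values do, but the operation has a gradient of one with respect to the input, so the encoder can still receive training signal through it.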

**Step N5: Entropy coding.** During the testing stage, the quantized motion representation $\hat{m}_t$ from Step N1 and the residual representation $\hat{y}_t$ from Step N3 are both coded into bits and transmitted to the decoder. During the training stage, we employ CNNs (the bit rate estimation net in Figure 1.b) to obtain the probability distribution of each symbol in $\hat{m}_t$ and $\hat{y}_t$, from which the number of bits is estimated.
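Given the probability that a model assigns to each symbol that actually occurred, the ideal code length is the sum of $-\log_2 p$ over the symbols. The sketch below computes this estimate from a plain list of probabilities; in the real system those probabilities would come from the bit rate estimation network, which is not modeled here.

```python
import math
from typing import List


def estimated_bits(probs: List[float]) -> float:
    """Ideal code length in bits: sum of -log2(p) over the coded symbols.

    `probs[i]` is the probability the model assigned to the i-th symbol
    that actually occurred; every entry must lie in (0, 1].
    """
    if any(p <= 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log2(p) for p in probs)
```

For instance, `estimated_bits([0.5, 0.25, 0.25])` is 5.0 bits. The better the predicted distribution matches the symbols, the smaller this number becomes, which is why it can serve as a rate term during training.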

**Step N6: Frame reconstruction.** This is the same as Step 6 of the traditional video compression method described in the previous section.

## Next section

In the next section, we will investigate the effectiveness of the methods described above for video compression.