Abstract

In this work, we propose a novel deep learning approach to ultra-low bitrate video compression for video conferencing applications. To address the shortcomings of current video compression paradigms when the available bandwidth is extremely limited, we adopt a model-based approach that employs deep neural networks to encode motion information as keypoint displacements and to reconstruct the video signal at the decoder side. The overall system is trained end-to-end by minimizing a reconstruction error on the decoded frames. Objective and subjective quality evaluations demonstrate that, at the same visual quality, the proposed approach provides an average bitrate reduction of more than 80% compared to HEVC.

Overview

In this work, we describe, for the first time, a video coding pipeline that employs the popular First Order Model for image animation to achieve long-term video frame prediction. Our scheme is open loop: we encode the first frame of the video using conventional Intra coding and transmit in the bitstream the keypoints extracted from subsequent frames. At the decoder side, the received keypoints are used to warp the Intra reference frame and reconstruct the video. We also propose and analyze an adaptive Intra frame selection scheme that, in conjunction with varying the quantization of Intra frames, enables attaining different rate-distortion operating points. Our experiments, validated with several video quality metrics and a subjective test campaign, show average bitrate savings of over 80% compared to HEVC, demonstrating the potential of this approach for ultra-low bitrate video conferencing.
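To see why transmitting keypoints instead of pixel data enables kbps-range operation, consider the per-frame payload: the First Order Model tracks a small fixed set of keypoints (10 by default), each a 2D coordinate. The sketch below is our own illustration, not the paper's exact bitstream format: it uniformly quantizes keypoint coordinates to 8 bits and counts the resulting raw rate, ignoring entropy coding and the per-keypoint Jacobian terms the model also predicts.

```python
import numpy as np

NUM_KEYPOINTS = 10   # default number of keypoints in the First Order Model
FPS = 25             # assumed frame rate for the rate estimate

def quantize_keypoints(kp, bits=8):
    """Uniformly quantize keypoint coordinates in [-1, 1] to integer symbols."""
    levels = 2 ** bits
    q = np.round((kp + 1) / 2 * (levels - 1))
    return np.clip(q, 0, levels - 1).astype(np.int32)

def dequantize_keypoints(q, bits=8):
    """Map integer symbols back to coordinates in [-1, 1]."""
    levels = 2 ** bits
    return q / (levels - 1) * 2 - 1

# Raw payload per inter frame: 10 keypoints x (x, y) x 8 bits = 160 bits,
# i.e. 160 * 25 fps = 4000 bps = 4 kbps before any entropy coding.
kp = np.random.uniform(-1, 1, size=(NUM_KEYPOINTS, 2))
rec = dequantize_keypoints(quantize_keypoints(kp))
bits_per_frame = NUM_KEYPOINTS * 2 * 8
print(bits_per_frame * FPS / 1000, "kbps")  # 4.0 kbps
```

Only the Intra reference frame is coded at full pixel rate; every subsequent frame costs on the order of a hundred bits, which is what pushes the operating points down toward ~1.5 Kbps.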

Basic coding pipeline

Results

VoxCeleb is a large audio-visual dataset of human speech containing 22,496 videos extracted from YouTube. We follow the pre-processing of \cite{siarohin2019first}: we filter out sequences with resolution lower than 256x256 and resize the remaining videos to 256x256, preserving the aspect ratio. The dataset is split into 12,331 training and 444 test videos. For evaluation, in order to obtain high-quality videos, we select the 90 test videos with the highest resolution before resizing.

[Video comparison: Original Video | HEVC (>50 Kbps) | HEVC (~10 Kbps) | Ours (~10 Kbps) | Ours (~1.5 Kbps)]

Xiph.org is a public repository of video sequences widely used in video processing (``News'', ``Akiyo'', ``Salesman'', etc.). We select 16 talking-head sequences that correspond to the targeted video conferencing scenario; the full list of videos used in the experiments is available on the GitHub page of the paper. The region of interest is cropped to a resolution of 256x256, with the speaker's face comprising about 75% of the frame.

[Video comparison: Original Video | HEVC (>50 Kbps) | HEVC (~10 Kbps) | Ours (~10 Kbps) | Ours (~1.5 Kbps)]

Quantitative Results

We compare our approach against the standard HM 16 (HEVC) reference software in low-delay configuration by performing a subjective test on 10 videos (8 from VoxCeleb and 2 from Xiph.org) using Amazon Mechanical Turk (AMT). Each sequence is encoded at 8 different bitrate configurations (4 with HEVC, 4 with the proposed method) ranging from 5 Kbps to 55 Kbps. We implement a simple Double Stimulus Impairment Scale (DSIS) test interface on AMT.

[Figure: mean opinion scores and confidence intervals]

BD quality and BD rate (%) of the proposed method vs. HEVC:

            VoxCeleb                Xiph.org
            BD quality   BD rate    BD quality   BD rate
PSNR        2.88         65.50      3.14         72.44
MS-SSIM     0.070        83.96      0.075        86.41
VIF         0.027        72.29      0.021        68.02
VMAF        37.43        82.29      31.04        83.44

Correlation between MOS and objective quality metrics:

            PCC                     SROCC
            BD quality   BD rate    BD quality   BD rate
PSNR        0.92         0.95       0.77         0.92
MS-SSIM     0.94         0.95       0.84         0.95
VIF         0.38         0.75       0.28         0.75
VMAF        0.94         0.96       0.80         0.89
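The BD-rate figures above follow the standard Bjontegaard calculation: fit log-rate as a cubic polynomial of quality for each codec, then average the gap between the two curves over their overlapping quality range. A self-contained sketch with made-up rate-distortion points (not the paper's data):

```python
import numpy as np

def bd_rate(rate_anchor, qual_anchor, rate_test, qual_test):
    """Bjontegaard delta rate: average bitrate difference (%) at equal quality."""
    # Fit log-rate as a cubic polynomial of quality for each codec.
    p_a = np.polyfit(qual_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(qual_test, np.log(rate_test), 3)
    # Integrate both curves over the overlapping quality range.
    lo = max(min(qual_anchor), min(qual_test))
    hi = min(max(qual_anchor), max(qual_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100

# Hypothetical RD points (kbps, PSNR dB): the test codec reaches each quality
# level at half the anchor's rate, so the BD-rate is -50%.
anchor_rate, anchor_qual = [10, 20, 40, 80], [30, 33, 36, 39]
test_rate, test_qual = [5, 10, 20, 40], [30, 33, 36, 39]
print(bd_rate(anchor_rate, anchor_qual, test_rate, test_qual))  # ≈ -50.0
```

A negative BD-rate means the test codec needs less bitrate than the anchor for the same quality, which is how the -82.35% figure against HEVC should be read.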

The Mean Opinion Scores, computed with 95% confidence intervals, show that our approach clearly outperforms HEVC (average BD-MOS = 1.68, BD-rate = -82.35\%). We further computed the Pearson Correlation Coefficient (PCC) and the Spearman Rank-Order Correlation Coefficient (SROCC) between MOS and commonly used quality metrics (PSNR, MS-SSIM, VIF, VMAF), for both our codec and HEVC. We observe that, with the exception of VIF, these metrics correlate well with human judgment, at least on the tested data. Based on these preliminary results, we proceed with an extensive performance evaluation of the proposed method using these quality metrics.
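The correlation analysis itself is standard: PCC measures linear agreement between MOS and a metric, while SROCC measures rank (monotonic) agreement. A minimal sketch using scipy, with illustrative per-sequence values that are not the paper's data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-sequence scores: subjective BD-MOS against the BD quality
# of one objective metric (values are made up for illustration).
bd_mos    = np.array([1.2, 1.5, 1.8, 1.1, 2.0, 1.7, 1.4, 1.9])
bd_metric = np.array([2.5, 3.0, 3.6, 2.3, 4.1, 3.3, 2.8, 3.9])

pcc, _ = pearsonr(bd_mos, bd_metric)      # linear correlation
srocc, _ = spearmanr(bd_mos, bd_metric)   # rank-order correlation
print(f"PCC={pcc:.2f}, SROCC={srocc:.2f}")  # both close to 1 here
```

A metric like VIF can score low on both coefficients while still being a reasonable fidelity measure; it simply tracks subjective preference poorly on this kind of content, which is why we exclude it from the subsequent evaluation.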