Deep Animation Models for Ultra-Low Bitrate Video Conferencing

Goluck Konuko¹, Stéphane Lathuilière², Giuseppe Valenzise³

¹Université Paris-Saclay, CentraleSupélec, L2S ²IP Paris, Télécom Paris, LTCI ³Université Paris-Saclay, CNRS, CentraleSupélec, L2S

Abstract

Deep generative models, and particularly facial animation schemes, can be used in video conferencing applications to efficiently compress a video through a sparse set of keypoints, without the need to transmit dense motion vectors.

While these schemes bring significant coding gains over conventional video codecs at low bitrates, their performance saturates quickly when the available bandwidth increases. In this paper, we propose a layered, hybrid coding scheme to overcome this limitation. Specifically, we extend a codec based on facial animation by adding an auxiliary stream consisting of a very low bitrate version of the video, obtained through a conventional video codec (e.g., HEVC).

The animated and auxiliary videos are combined through a novel fusion module. Our results show consistent average BD-Rate savings in excess of 30% on a large dataset of video conferencing sequences, extending the operational bitrate range beyond that of a facial animation codec alone.

Coding Framework

A conventional video codec (light red module) is used on both the encoder and decoder sides to transmit the input video at a very low bitrate. The animation module provides the decoder with the initial frame of the video and the facial keypoints that describe the motion between the initial frame and the current frame. A multi-scale fusion module combines the low-quality video provided by the conventional video codec with the output of the image animation module to generate the output frame.
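The decoder-side data flow described above can be sketched as follows. This is only an illustration under stated assumptions: `FramePacket`, `decode_frame` and the three `toy_*` functions are hypothetical names, and the stand-ins (a byte-to-pixel "decoder", an animator that ignores motion, a plain pixel average) replace the conventional video decoder and the learned animation and multi-scale fusion networks, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[List[float]]                  # toy grayscale frame, values in [0, 1]
Keypoints = Sequence[Tuple[float, float]]  # sparse (x, y) facial keypoints


@dataclass
class FramePacket:
    """Per-frame payload (hypothetical container, not the actual bitstream syntax)."""
    keypoints: Keypoints  # motion of the current frame relative to the initial frame
    base_layer: bytes     # current frame coded by the conventional codec at very low bitrate


def decode_frame(
    initial_frame: Frame,
    packet: FramePacket,
    base_decoder: Callable[[bytes], Frame],
    animator: Callable[[Frame, Keypoints], Frame],
    fusion: Callable[[Frame, Frame], Frame],
) -> Frame:
    """Reconstruct one frame from the two streams."""
    base_frame = base_decoder(packet.base_layer)           # low-quality auxiliary frame
    animated = animator(initial_frame, packet.keypoints)   # animation-based prediction
    return fusion(animated, base_frame)                    # combine both into the output


# Trivial stand-ins so the sketch runs end to end (assumptions, not the real modules).
def toy_base_decoder(payload: bytes) -> Frame:
    # Interpret each byte as one pixel of a single-row frame.
    return [[b / 255 for b in payload]]


def toy_animator(initial: Frame, keypoints: Keypoints) -> Frame:
    # Ignore the motion and return a copy of the initial frame.
    return [row[:] for row in initial]


def toy_fusion(animated: Frame, base: Frame) -> Frame:
    # Plain pixel average in place of a learned multi-scale fusion.
    return [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(animated, base)]


initial = [[0.0, 1.0]]
packet = FramePacket(keypoints=[(0.5, 0.5)], base_layer=bytes([255, 0]))
output = decode_frame(initial, packet, toy_base_decoder, toy_animator, toy_fusion)
```

Passing the three modules as callables keeps the sketch independent of any particular codec or network: only the order of operations (decode the base layer, animate the initial frame, fuse the two) is taken from the description above.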

Visual Results

HEVC - HM reference test model in its low-latency configuration

VVC - VTM reference test model in its low-latency configuration

DAC - Our original deep animation codec with adaptive refresh for quality enhancement

H-DAC (HDAC) - Animation-based coding with a scalable-quality base layer

Video panels: Reference | Target+KP | HEVC (~5 kbps) | DAC (~5 kbps) | HEVC (~10 kbps) | HDAC (~10 kbps)

The DAC uses the reference frame and the keypoints extracted from the target frames to reconstruct the output video. The HDAC uses the reference frame, the keypoints from the target frames, and the base layer frame encoded by HEVC at 5 kbps. At comparable bitrates, both the DAC and the HDAC outperform HEVC in perceptual quality of the output.

Quantitative Metrics

At very low bitrates, the animation-based coding framework outperforms HEVC on all image quality metrics considered. However, the DAC framework has a limited range of quality scalability: above about 25 kbps, it performs worse than the HEVC codec. Using the base layer extends this range to about 70 kbps for the HDAC.

Over the low bitrate range, the HDAC framework achieves bitrate savings of more than 30% relative to the HEVC codec, as shown in the table below:

Metric	HEVC anchor (BD Quality / BD Rate in %)	VVC anchor (BD Quality / BD Rate in %)
PSNR	1.07 / -33.36	0.97 / -30.7
MS-SSIM	0.02 / -33.41	0.02 / -28.33
msVGG	-19.6 / -48.84	-20.04 / -41.64
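For readers unfamiliar with the BD-Rate figures in the table above, the sketch below shows how such a number can be computed from two rate-quality curves: average the difference in log-bitrate over the quality range both curves share, then convert it to a percentage. It is a simplified variant, not the evaluation code behind the table: the original Bjøntegaard method fits a cubic polynomial to each curve, whereas this sketch uses piecewise-linear interpolation to stay dependency-free. The function names and the rate-quality points are made up for illustration.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of ys over sorted, distinct xs at point x."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside the sampled range")


def _mean_log_rate(quality, rate, lo, hi):
    """Average of log10(rate) over the quality interval [lo, hi]."""
    points = sorted(zip(quality, rate))
    qs = [q for q, _ in points]
    logs = [math.log10(r) for _, r in points]
    # Integration nodes: interval ends plus every sample strictly inside.
    nodes = [lo] + [q for q in qs if lo < q < hi] + [hi]
    vals = [_interp(q, qs, logs) for q in nodes]
    # Trapezoidal rule, exact for a piecewise-linear curve.
    area = sum(
        (b - a) * (va + vb) / 2
        for a, b, va, vb in zip(nodes, nodes[1:], vals, vals[1:])
    )
    return area / (hi - lo)


def bd_rate(anchor_rate, anchor_quality, test_rate, test_quality):
    """Average bitrate difference in percent; negative means the test codec saves bits."""
    lo = max(min(anchor_quality), min(test_quality))
    hi = min(max(anchor_quality), max(test_quality))
    if hi <= lo:
        raise ValueError("quality ranges do not overlap")
    diff = (_mean_log_rate(test_quality, test_rate, lo, hi)
            - _mean_log_rate(anchor_quality, anchor_rate, lo, hi))
    return (10 ** diff - 1) * 100


# Illustrative points only (kbps, dB); these are NOT measurements from this work.
anchor_rate, anchor_psnr = [5, 10, 20, 40], [30, 33, 36, 39]
test_rate, test_psnr = [2.5, 5, 10, 20], [30, 33, 36, 39]
saving = bd_rate(anchor_rate, anchor_psnr, test_rate, test_psnr)
# saving is close to -50.0: the test curve needs half the bitrate at equal quality
```

Read this way, a BD-Rate of -33.36 for PSNR against HEVC means that, averaged over the shared quality range, the HDAC needs about a third fewer bits than HEVC for the same PSNR.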

BibTeX

@article{konuko2022hdac,
  author    = {Konuko, Goluck and Lathuilière, Stéphane and Valenzise, Giuseppe},
  title     = {H-DAC: Hybrid coding with Deep Animation Models for Ultra-Low Bitrate Video Conferencing},
  journal   = {ICIP},
  year      = {2022},
}