Deep Animation Models for Ultra-Low Bitrate Video Conferencing

1 Université Paris-Saclay, CentraleSupélec, L2S; 2 IP Paris, Télécom Paris, LTCI; 3 Université Paris-Saclay, CNRS, CentraleSupélec, L2S

Abstract

Deep generative models, and particularly facial animation schemes, can be used in video conferencing applications to efficiently compress a video through a sparse set of keypoints, without the need to transmit dense motion vectors.

While these schemes bring significant coding gains over conventional video codecs at low bitrates, their performance saturates quickly when the available bandwidth increases. In this paper, we propose a layered, hybrid coding scheme to overcome this limitation. Specifically, we extend a codec based on facial animation by adding an auxiliary stream consisting of a very low bitrate version of the video, obtained through a conventional video codec (e.g., HEVC).

The animated and auxiliary videos are combined through a novel fusion module. Our results show consistent average BD-Rate gains in excess of -30% on a large dataset of video conferencing sequences, extending the operational range of bitrates of a facial animation codec alone.

Coding Framework

A conventional video codec (light red module) runs on both the encoder and decoder sides to transmit the input video at a very low bitrate. The animation module gives the decoder the initial frame of the video and the facial keypoints that describe the motion between the initial frame and the current frame. A multi-scale fusion module then combines the low-quality video from the conventional codec with the output of the image animation module to generate the output frame.

Figure: Coding framework
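
The decoder-side data flow described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the real animation and fusion modules are learned networks, whereas here `animate` is a hypothetical stand-in that shifts the reference frame by the mean keypoint displacement, and `fuse` is a hypothetical two-level blend that takes coarse structure from the base layer and fine detail from the animated frame. All function names, the (row, col) keypoint layout and the float images in [0, 1] are our own choices for this sketch.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by 2x2 average pooling (H and W must be even)."""
    h, w = img.shape[:2]
    return img.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def upsample(img):
    """Double the resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def animate(reference, kp_reference, kp_current):
    """Hypothetical stand-in for the animation module.

    Shifts the reference frame by the mean keypoint displacement.
    Keypoints are (N, 2) arrays of (row, col) pixel positions.
    """
    dy, dx = np.round((kp_current - kp_reference).mean(axis=0)).astype(int)
    return np.roll(reference, shift=(int(dy), int(dx)), axis=(0, 1))

def fuse(animated, base, levels=2):
    """Hypothetical stand-in for the multi-scale fusion module.

    Keeps the coarse content of the base-layer frame and adds the
    fine detail (animated minus its own coarse version) on top.
    """
    low_a, low_b = animated, base
    for _ in range(levels):
        low_a, low_b = downsample(low_a), downsample(low_b)
    for _ in range(levels):
        low_a, low_b = upsample(low_a), upsample(low_b)
    detail = animated - low_a
    return np.clip(low_b + detail, 0.0, 1.0)

def decode_frame(reference, kp_reference, kp_current, base_frame):
    """Combine the animation output with the decoded base-layer frame."""
    animated = animate(reference, kp_reference, kp_current)
    return fuse(animated, base_frame)
```

With `levels=2`, the frame height and width must be divisible by 4. Calling `decode_frame` with identical keypoints and a base frame equal to the reference returns the reference unchanged, which is a quick sanity check of the wiring.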

Visual Results

HEVC - Low-latency configuration of the HEVC reference software (HM test model).

VVC - Low-latency configuration of the VVC reference software (VTM test model).

DAC - Our original deep animation codec, with adaptive refresh for quality enhancement.

H-DAC - Animation-based coding with a scalable-quality base layer.

Video panels: Reference, Target+KP, HEVC (~5 kbps), DAC (~5 kbps), HEVC (~10 kbps), HDAC (~10 kbps)

The DAC reconstructs the output video from the reference frame and the keypoints extracted from the target frames. The HDAC uses the reference frame, the keypoints from the target frames and a base-layer frame encoded by HEVC at 5 kbps. At comparable bitrates, both the DAC and the HDAC outperform HEVC in perceptual quality.

Quantitative Metrics

At very low bitrates, the animation-based coding framework outperforms HEVC on all image quality metrics considered. However, the DAC framework has a limited range of quality scalability: above about 25 kbps, it performs worse than HEVC. Adding the base layer extends this range to about 70 kbps for the HDAC.

PSNR RD Curve
MS-SSIM RD Curve

Over the low-bitrate range, the HDAC framework achieves bitrate savings of over 30% relative to HEVC, as shown in the table below:

Metric     HEVC                          VVC
           BD Quality / BD Rate (%)      BD Quality / BD Rate (%)
PSNR       1.07 / -33.36                 0.97 / -30.7
MS-SSIM    0.02 / -33.41                 0.02 / -28.33
msVGG      -19.6 / -48.84                -20.04 / -41.64
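
For readers unfamiliar with the BD-Rate figures in the table, the following is a minimal sketch of the usual Bjøntegaard-delta computation: fit log-bitrate as a cubic polynomial of quality for each codec, integrate both fits over the shared quality interval, and convert the average gap to a percentage. We assume this standard procedure here; the exact tool used to produce the table is not stated on this page, and the rate-distortion points below are synthetic values made up purely for illustration. A negative result means the test codec needs less bitrate than the anchor for the same quality.

```python
import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average bitrate difference (%) of the test codec versus the anchor."""
    log_ra = np.log10(rate_anchor)
    log_rt = np.log10(rate_test)

    # Fit log10(rate) as a cubic polynomial of the quality score.
    poly_a = np.polyfit(quality_anchor, log_ra, 3)
    poly_t = np.polyfit(quality_test, log_rt, 3)

    # Integrate only over the quality interval covered by both curves.
    lo = max(min(quality_anchor), min(quality_test))
    hi = min(max(quality_anchor), max(quality_test))
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)

    # Mean log-rate gap, converted back to a percentage change.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0

# Synthetic example (not measured data): the test codec reaches the same
# quality at half the bitrate of the anchor at every point.
anchor_rate = [10.0, 20.0, 40.0, 80.0]   # kbps
anchor_psnr = [30.0, 33.0, 35.0, 36.0]   # dB
test_rate = [r / 2.0 for r in anchor_rate]
print(bd_rate(anchor_rate, anchor_psnr, test_rate, anchor_psnr))  # about -50
```

Each curve needs at least four rate-quality points for the cubic fit to be well determined. For a metric where lower is better, such as msVGG, the same function applies, since it only requires quality scores on a common scale for both codecs.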

Related Links

A number of works were released concurrently with or subsequent to our initial work on animation-based video communication. They provide valuable insight into the computer-vision and model-optimization aspects of the problem and may inspire further curiosity about this line of work.

Face-vid2vid: Proposes a neural talking-head video synthesis model with novel-view rendering capability and demonstrates its application to video conferencing.

Motion-SPADE: Explores quality and bandwidth trade-offs for image animation approaches based on static landmarks, dynamic landmarks or segmentation maps, and proposes mobile-compatible model architectures for low-latency chat applications.

Beyond Keypoint Coding: Temporal Evolution Inference with Compact Feature Representation for Talking Face Video Compression: Proposes a novel sparse representation for animation-based coding. We updated our own keypoint quantization and entropy coding processes to match those used by this work.

BibTeX

@article{konuko2022hdac,
  author    = {Konuko, Goluck and Lathuilière, Stéphane and Valenzise, Giuseppe},
  title     = {H-DAC: Hybrid coding with Deep Animation Models for Ultra-Low Bitrate Video Conferencing},
  journal   = {ICIP},
  year      = {2022},
}