FaceLift: Single Image to 3D Head with View Generation and GS-LRM

1University of California, Merced 2Adobe Research

FaceLift takes a single image of a human face as input and generates a high-fidelity 3D Gaussian head representation. The generated Gaussian representation enables high-quality, full-head novel view synthesis (NVS) while accurately capturing fine details of the face and hair.

Abstract

We present FaceLift, a novel feed-forward approach for rapid, high-quality 360-degree head reconstruction from a single image. Our pipeline begins by employing a multi-view latent diffusion model that generates consistent side and back views of the head from a single facial input. These generated views then serve as input to a GS-LRM reconstructor, which produces a comprehensive 3D representation using Gaussian Splats. To train our system, we develop a dataset of multi-view renderings using synthetic 3D human head assets. The diffusion-based multi-view generator is trained exclusively on synthetic head images, while the GS-LRM reconstructor undergoes initial training on Objaverse followed by fine-tuning on synthetic head data. FaceLift excels at preserving identity and maintaining view consistency across reconstructions. Despite being trained solely on synthetic data, our method demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that FaceLift outperforms state-of-the-art methods in 3D head reconstruction, highlighting its practical applicability and robust performance on real-world images. In addition to single-image reconstruction, FaceLift supports video inputs for 4D novel view synthesis and seamlessly integrates with 2D reanimation techniques to enable 3D facial animation.

Method

Overview of FaceLift. Given a single image of a human face as input, we train an image-conditioned multi-view diffusion model to generate novel views covering the entire head. By leveraging pre-trained weights and high-quality synthetic data, our multi-view latent diffusion model can hallucinate unseen views of the human head with high fidelity and multi-view consistency. We then fine-tune a GS-LRM, which takes the multi-view images and their camera poses as input and produces 3D Gaussian splats representing the head. The resulting 3D Gaussian representation enables full-head novel view synthesis.
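The two-stage pipeline above can be sketched as follows. This is a minimal illustration of the data flow only: the function names, view count, image resolution, and per-Gaussian parameter count are assumptions for the sketch, not the authors' actual API or model internals.

```python
# Sketch of the two-stage FaceLift pipeline: (1) multi-view generation,
# (2) GS-LRM reconstruction into 3D Gaussian splats.
# All sizes and names below are illustrative assumptions.
import numpy as np

NUM_VIEWS = 6     # assumed number of views covering the head
IMG_SIZE = 256    # assumed render resolution
GAUSS_PARAMS = 14 # e.g. position (3) + scale (3) + rotation (4) + opacity (1) + color (3)

def multiview_diffusion(face_image: np.ndarray) -> np.ndarray:
    """Stand-in for the image-conditioned multi-view latent diffusion model,
    which hallucinates consistent side and back views from one facial input."""
    return np.stack([face_image] * NUM_VIEWS)  # placeholder: real model denoises latents

def camera_poses(num_views: int) -> np.ndarray:
    """One 4x4 extrinsic per view, cameras circling the head about the y-axis."""
    poses = np.repeat(np.eye(4)[None], num_views, axis=0)
    angles = np.linspace(0, 2 * np.pi, num_views, endpoint=False)
    for pose, a in zip(poses, angles):
        pose[0, 0], pose[0, 2] = np.cos(a), np.sin(a)
        pose[2, 0], pose[2, 2] = -np.sin(a), np.cos(a)
    return poses

def gs_lrm(views: np.ndarray, poses: np.ndarray) -> np.ndarray:
    """Stand-in for the GS-LRM reconstructor: maps posed multi-view images to
    a set of 3D Gaussian splats (here, one Gaussian per input pixel)."""
    n = views.shape[0] * IMG_SIZE * IMG_SIZE
    return np.zeros((n, GAUSS_PARAMS))  # placeholder splat parameters

def facelift(face_image: np.ndarray) -> np.ndarray:
    views = multiview_diffusion(face_image)          # stage 1: view generation
    splats = gs_lrm(views, camera_poses(NUM_VIEWS))  # stage 2: 3D reconstruction
    return splats

splats = facelift(np.zeros((IMG_SIZE, IMG_SIZE, 3)))
print(splats.shape)
```

The key design point the sketch preserves is that the reconstructor never sees the original photo alone: it always consumes the full set of generated views plus their known camera poses, which is what makes full-head novel view synthesis possible from a single input.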

Results

Single Image to 3D Head


FaceLift is a feed-forward approach that lifts a single facial image to a detailed 3D reconstruction with preserved identity features.

Video as Input for 4D Novel View Synthesis

Given a video as input, FaceLift processes each frame independently and generates a 3D Gaussian sequence, which enables 4D novel view synthesis.
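Since each frame is lifted independently, video support reduces to mapping the single-image pipeline over the frames. A minimal sketch, where `lift_frame` is a hypothetical stand-in for the full single-image pipeline and the Gaussian-set sizes are assumed:

```python
# Sketch of FaceLift's video handling: frames are lifted independently,
# yielding one set of Gaussian splats per frame (a "4D" sequence).
import numpy as np

def lift_frame(frame: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the single-image FaceLift pipeline
    (multi-view generation followed by GS-LRM reconstruction)."""
    num_gaussians, params_per_gaussian = 4096, 14  # assumed sizes
    return np.zeros((num_gaussians, params_per_gaussian))

def lift_video(frames: list[np.ndarray]) -> list[np.ndarray]:
    # No temporal model: each frame is processed on its own, and the
    # per-frame Gaussian sets together support 4D novel view synthesis.
    return [lift_frame(f) for f in frames]

video = [np.zeros((256, 256, 3)) for _ in range(8)]
gaussian_sequence = lift_video(video)
print(len(gaussian_sequence))  # one Gaussian set per frame
```

The same per-frame strategy is what allows FaceLift to pair with 2D reanimation methods: any technique that produces animated face frames can feed this loop.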

Input Video

4D Rendering Results

FaceLift can be combined with 2D face animation methods like LivePortrait to achieve 3D face animation.

Input Image

2D Animation (by LivePortrait)

3D Animation


BibTeX

@misc{lyu2024facelift,
      title={FaceLift: Single Image to 3D Head with View Generation and GS-LRM}, 
      author={Weijie Lyu and Yi Zhou and Ming-Hsuan Yang and Zhixin Shu},
      year={2024},
      eprint={2412.17812},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.17812},
}