Bringing NeRFs to the Latent Space:
Inverse Graphics Autoencoder

* equal contribution
1 Criteo AI Lab, Paris, France
2 LASTIG, Université Gustave Eiffel, IGN-ENSG, F-94160 Saint-Mandé
3 Université Côte d’Azur, CNRS, I3S, France

Abstract

While pre-trained image autoencoders are increasingly utilized in computer vision, the application of inverse graphics in 2D latent spaces has been under-explored. Yet, besides reducing training and rendering complexity, applying inverse graphics in the latent space enables valuable interoperability with other latent-based 2D methods. The major challenge is that inverse graphics cannot be directly applied to such image latent spaces because they lack an underlying 3D geometry. In this paper, we propose an Inverse Graphics Autoencoder (IG-AE) that specifically addresses this issue. To this end, we regularize an image autoencoder with 3D geometry by aligning its latent space with jointly trained latent 3D scenes. We utilize the trained IG-AE to bring NeRFs to the latent space with a latent NeRF training pipeline, which we implement in an open-source extension of the Nerfstudio framework, thereby unlocking latent scene learning for its supported methods. We experimentally confirm that latent NeRFs trained with IG-AE achieve improved quality compared to those trained with a standard autoencoder, while training and rendering faster than NeRFs trained in the image space.

Method

We propose a Latent NeRF Training Pipeline compatible with most NeRF architectures and autoencoders (AEs). We then identify standard AEs as a limiting factor for latent NeRF quality, and propose IG-AE to address it.

Latent NeRF Training Pipeline


Latent NeRF Training. We train a latent NeRF in two stages. First, we train the chosen NeRF method \(F_\theta\) using its own loss \(\mathcal{L}_{F_\theta}\), which matches rendered latents \(\tilde{z}_p\) to encoded latents \(z_p\). Second, we align the learned scene with the RGB space by additionally fine-tuning the decoder via \(\mathcal{L}_\mathrm{align}\), which matches decoded renderings \(\tilde{x}_p\) to ground-truth images \(x_p\).
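To make the two stages concrete, here is a minimal PyTorch-style sketch of this training loop. All names (nerf, encoder, decoder, render_latent, views) are illustrative placeholders rather than our actual implementation, and both losses are shown as simple MSEs for brevity.

import torch
import torch.nn.functional as F

def train_latent_nerf(nerf, encoder, decoder, views, nerf_opt, dec_opt,
                      stage1_steps=30_000, stage2_steps=5_000):
    # Stage 1: fit the NeRF in the latent space with its own loss,
    # matching rendered latents to encoded latents.
    for _ in range(stage1_steps):
        x_p, pose = views.sample()               # ground-truth image and camera pose
        with torch.no_grad():
            z_p = encoder(x_p)                   # target latent view
        z_tilde = nerf.render_latent(pose)       # latent rendering
        loss = F.mse_loss(z_tilde, z_p)          # stands in for L_{F_theta}
        nerf_opt.zero_grad(); loss.backward(); nerf_opt.step()

    # Stage 2: keep the NeRF loss and add decoder fine-tuning (L_align),
    # aligning decoded renderings with ground-truth images.
    for _ in range(stage2_steps):
        x_p, pose = views.sample()
        with torch.no_grad():
            z_p = encoder(x_p)
        z_tilde = nerf.render_latent(pose)
        x_tilde = decoder(z_tilde)               # decoded rendering
        loss = F.mse_loss(z_tilde, z_p) + F.mse_loss(x_tilde, x_p)
        nerf_opt.zero_grad(); dec_opt.zero_grad()
        loss.backward()
        nerf_opt.step(); dec_opt.step()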


IG-AE Training


IG-AE Training. We train IG-AE by applying a 3D regularization to its 2D latent space. Specifically, we train the encoder \(E_\phi\) and decoder \(D_\psi\) to preserve 3D consistency by supervising them with 3D-consistent latent images. We obtain such latent images \(\tilde{z}_{s,p}\) from a set of latent Tri-Planes \(\{T_1, ..., T_N\}\) that are learned jointly, where each Tri-Plane \(T_i\) learns the latent scene corresponding to the ground-truth RGB scene \(M_i\). This optimization relies on two reconstructive objectives. In the latent space, \(\mathcal{L}_\mathrm{latent}\) aligns the Tri-Plane renderings \(\tilde{z}_{s,p}\) with the encoded ground-truth views \(z_{s,p}\), updating both the latent Tri-Planes and the encoder. In the RGB space, \(\mathcal{L}_\mathrm{RGB}\) aligns the ground-truth views \(x_{s,p}\) with the decoded renderings \(\tilde{x}_{s,p}\), updating both the latent Tri-Planes and the decoder. In addition, we preserve the auto-encoding performance of IG-AE by adopting reconstruction losses on synthetic and real data, via \(\mathcal{L}_\mathrm{ae}^\mathrm{(synth)}\) and \(\mathcal{L}_\mathrm{ae}^\mathrm{(real)}\) respectively.
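To make the interplay of the four objectives explicit, the sketch below shows one possible IG-AE optimization step. The names (scenes.sample_view, triplanes[i].render) and the uniform loss weights are placeholder assumptions; the actual sampling strategy and loss weighting may differ.

import torch
import torch.nn.functional as F

def igae_step(encoder, decoder, triplanes, scenes, real_images, opt,
              w_latent=1.0, w_rgb=1.0, w_ae=1.0):
    # Sample a scene index i and a posed ground-truth view (x_sp, pose) of it.
    i, x_sp, pose = scenes.sample_view()

    # Latent-space objective: the Tri-Plane rendering should match the encoded
    # view (gradients reach the latent Tri-Plane T_i and the encoder E_phi).
    z_sp = encoder(x_sp)
    z_tilde = triplanes[i].render(pose)
    loss_latent = F.mse_loss(z_tilde, z_sp)

    # RGB-space objective: the decoded rendering should match the ground-truth
    # view (gradients reach the latent Tri-Plane T_i and the decoder D_psi).
    x_tilde = decoder(z_tilde)
    loss_rgb = F.mse_loss(x_tilde, x_sp)

    # Auto-encoding objectives on synthetic and real images, preserving the
    # reconstruction quality of the autoencoder.
    loss_ae_synth = F.mse_loss(decoder(encoder(x_sp)), x_sp)
    loss_ae_real = F.mse_loss(decoder(encoder(real_images)), real_images)

    loss = (w_latent * loss_latent + w_rgb * loss_rgb
            + w_ae * (loss_ae_synth + loss_ae_real))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

Note that a single optimizer over the encoder, decoder, and Tri-Plane parameters reproduces the update pattern described above, since each loss only back-propagates into the modules appearing in its computation graph.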

Comparison with NeRFs trained in a standard latent space

This section compares NeRFs trained in the standard latent space of an AE with NeRFs trained in the 3D-aware latent space of our IG-AE, using our latent NeRF training pipeline implemented as an extension to Nerfstudio.
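For the RGB-space comparisons shown below, latent renderings must be decoded before computing image metrics. The following sketch illustrates one way such an evaluation could be set up; the names and the ×8 latent downsampling factor are assumptions for illustration, not details of our extension.

import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_psnr(nerf, decoder, test_views, latent_downsample=8):
    psnrs = []
    for x_gt, pose in test_views:                  # x_gt: (3, H, W) in [0, 1]
        h, w = x_gt.shape[-2:]
        # Render at the reduced latent resolution, then decode back to RGB.
        z_tilde = nerf.render_latent(pose,
                                     height=h // latent_downsample,
                                     width=w // latent_downsample)
        x_tilde = decoder(z_tilde.unsqueeze(0)).squeeze(0).clamp(0, 1)
        mse = F.mse_loss(x_tilde, x_gt)
        psnrs.append(-10.0 * torch.log10(mse))     # PSNR in dB
    return torch.stack(psnrs).mean()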

Qualitative comparisons on ShapeNet scenes: Bag, Hat, and Vase.

BibTeX


      @article{ig-ae,
        title={{Bringing NeRFs to the Latent Space: Inverse Graphics Autoencoder}}, 
        author={Antoine Schnepf and Karim Kassab and Jean-Yves Franceschi and Laurent Caraffa and Flavian Vasile and Jeremie Mary and Andrew Comport and Valérie Gouet-Brunet},
        journal={arXiv preprint arXiv:2410.22936},
        year={2024}
      }