Latent NeRF Training. We train a Latent NeRF in two stages. First, we train the chosen NeRF method \(F_\theta\) using its native loss \(\mathcal{L}_{F_\theta}\), which matches the rendered latents \(\tilde{z}_p\) to the encoded latents \(z_p\). Second, we align with the scene in the RGB space by fine-tuning the decoder via \(\mathcal{L}_\mathrm{align}\), which matches the decoded renderings \(\tilde{x}_p\) to the ground-truth images \(x_p\).
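The two-stage objective above can be sketched as two mean-squared errors, one per stage. This is a minimal illustration with random arrays standing in for latents and images; the tensor shapes and the use of an MSE are assumptions for the sketch, not specifics from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    """Mean-squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

# Stage 1: train F_theta on latents (shapes are illustrative assumptions).
z_p = rng.normal(size=(8, 8, 4))        # encoded latent of view p, E_phi(x_p)
z_tilde_p = rng.normal(size=(8, 8, 4))  # latent rendered by the NeRF F_theta
loss_F = mse(z_tilde_p, z_p)            # stands in for L_{F_theta}

# Stage 2: decoder fine-tuning in RGB space.
x_p = rng.normal(size=(64, 64, 3))        # ground-truth RGB view
x_tilde_p = rng.normal(size=(64, 64, 3))  # decoded rendering D_psi(z_tilde_p)
loss_align = mse(x_tilde_p, x_p)          # stands in for L_align

print(loss_F, loss_align)
```

In a real run, `loss_F` would update the NeRF parameters \(\theta\) in stage one, and `loss_align` would update the decoder \(\psi\) in stage two.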
IG-AE Training. We train IG-AE by applying a 3D regularization to its 2D latent space. Specifically, we train the encoder \(E_\phi\) and decoder \(D_\psi\) to preserve 3D consistency by supervising them with 3D-consistent latent images. We obtain such 3D-consistent latent images \(\tilde{z}_{s,p}\) from a set of Tri-Planes \(\{T_1, ..., T_N\}\) that are learned simultaneously, where each Tri-Plane \(T_i\) learns the latent scene corresponding to the ground-truth RGB scene \(M_i\). This optimization is driven by two reconstructive objectives. In the latent space, \(\mathcal{L}_\mathrm{latent}\) aligns the Tri-Plane renderings \(\tilde{z}_{s,p}\) with the encoded ground-truth view \(z_{s,p}\), updating both the latent Tri-Planes and the encoder. In the RGB space, \(\mathcal{L}_\mathrm{RGB}\) aligns the ground-truth view \(x_{s,p}\) with the decoded rendering \(\tilde{x}_{s,p}\), updating both the latent Tri-Planes and the decoder. In addition, we preserve the auto-encoding performance of IG-AE by adopting a reconstructive loss on synthetic and real data, via \(\mathcal{L}_\mathrm{ae}^\mathrm{(synth)}\) and \(\mathcal{L}_\mathrm{ae}^\mathrm{(real)}\) respectively.
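The four objectives above combine into one total training loss. The sketch below assembles them as mean-squared errors over random stand-in arrays; the shapes, the MSE choice, and the unweighted sum are assumptions for illustration (the paper may weight the terms differently).

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(a, b):
    """Mean-squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

# Stand-ins for one scene s and pose p (shapes are illustrative assumptions).
z_sp = rng.normal(size=(8, 8, 4))          # encoded GT view E_phi(x_{s,p})
z_tilde_sp = rng.normal(size=(8, 8, 4))    # Tri-Plane latent rendering
x_sp = rng.normal(size=(64, 64, 3))        # ground-truth RGB view
x_tilde_sp = rng.normal(size=(64, 64, 3))  # decoded Tri-Plane rendering

# Auto-encoding samples; reconstructions are simulated as small perturbations.
x_synth = rng.normal(size=(64, 64, 3))
x_real = rng.normal(size=(64, 64, 3))
x_synth_rec = x_synth + 0.01 * rng.normal(size=x_synth.shape)  # D_psi(E_phi(x))
x_real_rec = x_real + 0.01 * rng.normal(size=x_real.shape)

loss_latent = mse(z_tilde_sp, z_sp)      # updates Tri-Planes + encoder
loss_rgb = mse(x_tilde_sp, x_sp)         # updates Tri-Planes + decoder
loss_ae_synth = mse(x_synth_rec, x_synth)
loss_ae_real = mse(x_real_rec, x_real)

total_loss = loss_latent + loss_rgb + loss_ae_synth + loss_ae_real
print(total_loss)
```

Which parameters each term updates follows the text: `loss_latent` flows to the Tri-Planes and encoder, `loss_rgb` to the Tri-Planes and decoder, and the auto-encoding terms to both encoder and decoder.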
@article{ig-ae,
title={{Bringing NeRFs to the Latent Space: Inverse Graphics Autoencoder}},
author={Antoine Schnepf and Karim Kassab and Jean-Yves Franceschi and Laurent Caraffa and Flavian Vasile and Jeremie Mary and Andrew Comport and Valérie Gouet-Brunet},
journal={arXiv preprint arXiv:2410.22936},
year={2024}
}