NVIDIA Introduces LiTo: Surface Light Field Tokenization for Realistic 3D Object Generation
Key Takeaways
- LiTo introduces a unified 3D latent representation that simultaneously captures both geometry and view-dependent appearance, overcoming limitations of prior methods that handle these separately
- The approach leverages surface light field sampling from RGB-depth images to realistically reproduce complex lighting effects including specular highlights and Fresnel reflections
- By conditioning a latent flow matching model on single input images, LiTo enables generation of 3D objects with material and lighting-consistent appearances
Summary
NVIDIA researchers have unveiled LiTo (Surface Light Field Tokenization), a 3D latent representation that jointly models object geometry and view-dependent appearance by treating RGB-depth images as samples of a surface light field. Prior approaches typically focus on either 3D geometry reconstruction or view-independent appearance prediction. LiTo instead encodes random subsamples of the surface light field into a compact set of latent vectors, yielding a single 3D latent space that captures view-dependent effects such as specular highlights and Fresnel reflections under complex lighting conditions.
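To make the idea of "RGB-depth images as samples of a surface light field" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, a camera-to-world pose matrix, and z-depth maps; the function name `rgbd_to_light_field_samples` and all parameters are hypothetical. Each valid pixel becomes one sample: a 3D surface point, the unit direction from that point toward the camera, and the observed color. A random subset of these samples is what an encoder of this kind would consume.

```python
import numpy as np

def rgbd_to_light_field_samples(rgb, depth, K, cam_to_world, num_samples, rng):
    """Unproject an RGB-D image into (position, view direction, color) samples.

    Assumptions (not from the article): pinhole intrinsics K (3x3),
    cam_to_world pose (4x4), depth stored as z-depth, 0 meaning "no surface".
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]            # pixel row (v) and column (u) indices
    valid = depth > 0
    u, v, z = u[valid], v[valid], depth[valid]

    # Pinhole unprojection into camera coordinates.
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z], axis=-1)

    # Transform surface points into world coordinates.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    pts_world = pts_cam @ R.T + t

    # View direction: unit vector from the surface point to the camera center.
    dirs = t - pts_world
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    colors = rgb[valid]

    # Random subsample of the light field observations.
    n = min(num_samples, len(pts_world))
    idx = rng.choice(len(pts_world), size=n, replace=False)
    return pts_world[idx], dirs[idx], colors[idx]
```

Because the same surface point is seen from different directions in different views, pooling samples from several RGB-D images gives multiple (direction, color) observations per point, which is what lets a representation capture specular highlights rather than a single averaged color.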
The researchers also trained a latent flow matching model on this representation to learn its distribution conditioned on a single input image. This enables the generation of 3D objects whose appearance remains consistent with the lighting and materials present in the input. According to the team's experiments, LiTo achieves higher visual quality and better input fidelity than existing methods.
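The conditional flow matching step described in the summary can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not LiTo's training code: it assumes the common linear interpolation path between Gaussian noise and the target latents, a plain mean-squared-error objective, and Euler integration at sampling time. The names `flow_matching_loss`, `sample`, and `velocity_fn` are hypothetical; `velocity_fn` stands in for whatever network predicts the velocity from the noisy latents, the time, and an embedding of the input image.

```python
import numpy as np

def flow_matching_loss(velocity_fn, latents, cond, rng):
    """Flow matching loss on a batch of latent token sets.

    latents: (B, N, D) target latents; cond: per-example image conditioning.
    Assumes the linear path x_t = (1 - t) * noise + t * latents.
    """
    b = latents.shape[0]
    noise = rng.standard_normal(latents.shape)
    t = rng.uniform(size=(b, 1, 1))          # one time value per example
    x_t = (1.0 - t) * noise + t * latents    # point on the noise-to-data path
    target = latents - noise                 # constant velocity along that path
    pred = velocity_fn(x_t, t, cond)
    return np.mean((pred - target) ** 2)

def sample(velocity_fn, cond, shape, rng, steps=50):
    """Generate latents by Euler-integrating the velocity field from t=0 to t=1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((shape[0], 1, 1), i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In this setup the conditioning image only enters through `cond`, so the same trained velocity model can produce different 3D latents for different input images; the generated latents would then be decoded back into geometry and view-dependent appearance.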
Editorial Opinion
LiTo represents a meaningful advancement in 3D generative AI by addressing a long-standing challenge: realistic reproduction of view-dependent appearance alongside accurate geometry. The unified approach to modeling both aspects within a single latent space is conceptually elegant and practically valuable for applications requiring photorealistic 3D content generation. If the visual quality claims hold up in broader evaluation, this could become a foundational technique for next-generation 3D synthesis systems.



