Déjà View: Looping Transformers Achieve 3D Reconstruction with 8–10× Fewer Parameters
Key Takeaways
- Déjà View uses a single looped transformer block instead of scaling model size, cutting parameter count by 8–10× while matching larger baselines
- The number of refinement steps (K) is exposed as an inference-time compute knob, letting users trade reconstruction quality against computational cost
- Explicit iteration proved to be a stronger inductive bias than raw model capacity for multi-view 3D reconstruction
Summary
Researchers have introduced Déjà View, a 3D reconstruction architecture that challenges the industry's scaling paradigm by replacing increasingly large feed-forward transformers with a single transformer block applied recursively. At just 117M parameters, Déjà View matches or exceeds billion-parameter baselines on five benchmarks spanning indoor scenes, outdoor environments, object-centric captures, and driving scenarios, while using 8–10× fewer parameters and 1.9–2.3× less compute.
The key insight underpinning Déjà View is that transformer layers often behave as repeated applications of similar operations, and multi-view reconstruction networks refine their predictions progressively through depth. Rather than inefficiently capturing this through unique parameters at each layer, Déjà View makes iteration explicit in the architecture, exposing the number of refinement steps (K) as an inference-time compute knob. This allows users to dynamically trade computational resources against reconstruction quality from a single trained checkpoint.
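To make the weight-sharing idea concrete, here is a minimal sketch in plain Python. It is not Déjà View's implementation: `ToyBlock`, `looped_forward`, `stacked_forward`, and the sizes `dim` and `depth` are all hypothetical names and values chosen for illustration, and the "block" is a single weight matrix rather than a real transformer block. It shows only that a looped model's parameter count does not depend on the number of steps, while a conventionally stacked model's grows with depth.

```python
import random


class ToyBlock:
    """Hypothetical stand-in for a transformer block: one dim x dim weight matrix."""

    def __init__(self, dim, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def num_params(self):
        return sum(len(row) for row in self.w)

    def __call__(self, x):
        # Residual update x + W x, so each call nudges the current estimate.
        return [xi + sum(w * xj for w, xj in zip(row, x))
                for xi, row in zip(x, self.w)]


def looped_forward(block, x, k):
    """Apply the SAME block k times; k can be chosen freely at inference."""
    for _ in range(k):
        x = block(x)
    return x


def stacked_forward(blocks, x):
    """Conventional depth: a different block, with its own weights, per layer."""
    for block in blocks:
        x = block(x)
    return x


rng = random.Random(0)
dim, depth = 8, 10  # illustrative sizes only, not taken from the paper
shared = ToyBlock(dim, rng)
stack = [ToyBlock(dim, rng) for _ in range(depth)]

looped_params = shared.num_params()                   # dim * dim, independent of k
stacked_params = sum(b.num_params() for b in stack)   # depth * dim * dim

x0 = [1.0] * dim
cheap = looped_forward(shared, x0, k=2)       # fewer steps, less compute
thorough = looped_forward(shared, x0, k=10)   # more steps, same parameters
```

In this toy setup, raising `k` from 2 to 10 costs five times as many block applications but adds no parameters, which is the trade-off the compute knob exposes. The ratio between `stacked_params` and `looped_params` here is an artifact of the chosen `depth` and says nothing about the real model's layer count.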
The model initializes per-view features from a pretrained DINOv2 encoder, then recurrently applies a single transformer block made up of frame-attention and global-attention sub-blocks. Because step counts are sampled during training from a defined range, one checkpoint supports any inference step count. Testing revealed that explicit looped iteration outperforms an otherwise identical variant with independent per-step parameters, suggesting that architectural iteration provides a stronger inductive bias than raw capacity.
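The refinement loop described above can be sketched roughly as follows. This is a toy under loud assumptions, not the authors' code: attention uses identity query/key/value projections with no learned weights, a damped residual replaces normalization layers, random numbers stand in for DINOv2 features, and the names `attend`, `block`, `refine`, `sample_steps` as well as the step range 2 to 6 are hypothetical. It illustrates only the control flow: frame attention within each view, global attention across all views, the same block repeated K times, and K drawn at random during training.

```python
import math
import random


def attend(tokens):
    """Toy self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * v[d] for w, v in zip(weights, tokens)) / total
                 for d in range(dim)]
        # Damped residual keeps the toy bounded without a normalization layer.
        out.append([0.5 * (qi + mi) for qi, mi in zip(q, mixed)])
    return out


def block(views):
    """One refinement step: frame attention, then global attention."""
    # Frame attention: tokens only see other tokens from the same view.
    views = [attend(view) for view in views]
    # Global attention: tokens from all views attend to each other jointly.
    per_view = len(views[0])  # assumes every view has the same token count
    flat = attend([tok for view in views for tok in view])
    return [flat[i * per_view:(i + 1) * per_view] for i in range(len(views))]


def refine(views, k):
    """Apply the same block k times."""
    for _ in range(k):
        views = block(views)
    return views


def sample_steps(rng, k_min, k_max):
    """Draw a step count for one training iteration (range is hypothetical)."""
    return rng.randint(k_min, k_max)


rng = random.Random(0)
num_views, tokens_per_view, dim = 3, 4, 5
# Random features stand in for the pretrained per-view encoder features.
features = [[[rng.uniform(-1, 1) for _ in range(dim)]
             for _ in range(tokens_per_view)] for _ in range(num_views)]

k_train = sample_steps(rng, k_min=2, k_max=6)  # varies per training iteration
refined = refine(features, k_train)
refined_more = refine(features, 8)  # inference may pick a different step count
```

Because this sketch has no learned weights, it cannot say anything about reconstruction quality at different step counts; it only shows that the same function can be run for any K without changing the model.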
- At 117M parameters, Déjà View achieves state-of-the-art inlier ratio and pose accuracy across all five benchmarks
Editorial Opinion
Déjà View is a valuable counterpoint to the scaling-centric narrative that has dominated AI progress. As vision transformers have swollen to billions of parameters, this work demonstrates that thoughtful architectural design, here making computational patterns explicit, can outperform brute-force scale. The inference-time compute knob is particularly compelling: it suggests we're moving toward AI systems that adapt to hardware constraints rather than demanding ever more resources. If this efficiency trend holds across other domains, it could democratize access to state-of-the-art computer vision.



