
Flux4D: Flow-based unsupervised 4D reconstruction

NeurIPS 2025

Jingkang Wang*, Henry Che*†, Yun Chen*, Ze Yang, Lily Goli†, Sivabalan Manivasagam, Raquel Urtasun


Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations, in a fully unsupervised manner. By adopting only photometric losses and enforcing an "as static as possible" regularization, Flux4D learns to decompose dynamic elements directly from raw data without requiring pre-trained supervised models or foundational priors, simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.

Demo


Motivation

Reconstructing dynamic scenes from the real world enables diverse content creation for various applications, such as virtual reality and self-driving simulation. To be effective, the 4D reconstructions must have high photorealism at novel views, be scalable to large datasets, be efficient to reconstruct and render, and enable controllable scene editing and manipulation.

Recently, differentiable rendering approaches such as NeRF and 3DGS have achieved realistic reconstructions of large driving scenes for sensor simulation. However, these methods rely on manual annotations to separate foreground and background entities, and reconstruct composable 3D scene representations via energy minimization and differentiable rendering. As a result, they have difficulty scaling to large amounts of unlabelled data and are slow to train, requiring hours of optimization for each new scene.

To enable more scalable reconstruction, we propose Flux4D, an unsupervised generalizable 4D reconstruction model. Our method takes as input localized multi-sensor data, and requires no annotations or pre-trained models. Flux4D directly reconstructs a photorealistic 4D scene representation with recovered flow, geometry, and appearance within seconds.

Flux4D is a simple and scalable framework for unsupervised 4D reconstruction, significantly outperforming existing unsupervised methods and achieving competitive performance with label-supervised per-scene reconstruction while being substantially faster.

Method

Flux4D is an unsupervised and generalizable approach that learns to reconstruct 4D scenes via three simple steps.

  • Step 1. Lifting sensor observations: Flux4D begins by lifting raw LiDAR and camera data into an initial set of 3D Gaussians.

  • Step 2. Predicting dynamics: A neural network then predicts high-fidelity 3D Gaussians and how they move over time, creating a temporally consistent scene representation.

  • Step 3. Rendering and training: Finally, Flux4D renders the warped Gaussians into novel views and is trained across many scenes using photometric reconstruction and “as static as possible” losses (a minimal code sketch of these steps follows below).
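Putting the three steps together, the sketch below outlines the pipeline in PyTorch. The tiny network, the constant-velocity motion model, and the loss weight lambda_static are illustrative assumptions made for exposition, not the paper's architecture or hyperparameters; rendering relies on an external differentiable 3DGS rasterizer that is not reproduced here.

```python
import torch
import torch.nn as nn

class Flux4DSketch(nn.Module):
    """Toy stand-in for the prediction network (Step 2); names are hypothetical."""

    def __init__(self, feat_dim=64):
        super().__init__()
        # Maps each lifted point to refined Gaussian parameters plus a velocity.
        self.head = nn.Sequential(
            nn.Linear(6, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3 + 3 + 3 + 1),  # delta-mean, velocity, rgb, opacity
        )

    def lift(self, lidar_xyz, point_rgb):
        # Step 1: lift LiDAR points (colored by projecting into the cameras)
        # into initial Gaussian centers with per-point appearance.
        return torch.cat([lidar_xyz, point_rgb], dim=-1)  # (N, 6)

    def forward(self, lidar_xyz, point_rgb, t):
        # Step 2: predict refined Gaussians and how they move.
        feats = self.lift(lidar_xyz, point_rgb)
        d_mean, velocity, rgb, opacity = self.head(feats).split([3, 3, 3, 1], dim=-1)
        # Warp centers to time t with a constant-velocity motion model
        # (an assumption made for this sketch).
        means_t = lidar_xyz + d_mean + velocity * t
        return means_t, velocity, rgb.sigmoid(), opacity.sigmoid()


def loss_fn(pred_image, gt_image, velocity, lambda_static=0.01):
    # Step 3: photometric reconstruction loss plus an "as static as possible"
    # penalty that discourages motion unless it is needed to explain the data.
    # pred_image would come from splatting the warped Gaussians with a
    # differentiable 3DGS rasterizer (not shown); lambda_static is illustrative.
    photometric = (pred_image - gt_image).abs().mean()
    static_reg = velocity.norm(dim=-1).mean()
    return photometric + lambda_static * static_reg
```

Training this across many scenes, rather than per scene, is what lets the photometric and static regularization signals alone produce a clean static/dynamic decomposition.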

Unsupervised 4D Reconstruction

We evaluate Flux4D on several urban driving datasets, including PandaSet. We first showcase the 4D reconstruction performance on 8-second sequences. The top row shows the rendered image and our learned flow, where higher saturation indicates faster-moving objects and hue encodes motion direction. The bottom row shows the ground truth image and flow estimated from tracklet instance annotations.
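For reference, the hue/saturation flow encoding described above can be produced with a short colorization routine. The sketch below, using NumPy and Matplotlib's HSV conversion, is an illustrative assumption of how such visualizations are typically generated, not the exact rendering code used for the figures.

```python
import numpy as np
import matplotlib.colors as mcolors

def flow_to_rgb(flow_uv, max_speed=None):
    """Colorize an (H, W, 2) field of 2D motion vectors:
    hue encodes direction, saturation encodes speed."""
    u, v = flow_uv[..., 0], flow_uv[..., 1]
    speed = np.sqrt(u ** 2 + v ** 2)
    if max_speed is None:
        max_speed = max(float(speed.max()), 1e-6)
    hue = (np.arctan2(v, u) + np.pi) / (2 * np.pi)  # direction -> hue in [0, 1]
    sat = np.clip(speed / max_speed, 0.0, 1.0)      # faster -> more saturated
    val = np.ones_like(hue)                         # full brightness
    return mcolors.hsv_to_rgb(np.stack([hue, sat, val], axis=-1))
```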

Since Flux4D reconstructs 4D scenes and motion flows within seconds, we can scale to large datasets and reconstruct a diverse variety of scenes, from city to suburban areas, and in different lighting conditions.

Flux4D is simple and general enough to scale to larger real-world datasets such as the Waymo Open Dataset (WOD) and Argoverse 2, and it effectively handles complex dynamic scenarios in the wild. Flux4D processes full-resolution images (≥ 1920 × 1080) and can be efficiently scaled to higher resolutions without significant overhead. Importantly, all scenes shown below are unseen during Flux4D training. We use WOD solely for research purposes and final benchmarking.

Comparison to State-of-the-Art

We first evaluate novel view synthesis on PandaSet sequences and compare with SoTA methods. Flux4D surpasses unsupervised methods and achieves performance competitive with supervised methods (top block), without requiring any annotations.

Novel view synthesis on interpolated views.

Existing unsupervised methods produce blurry renderings or cannot handle dynamic actors well. In contrast, our method realistically recovers their appearance and motion without relying on any pre-trained models or additional regularization terms such as geometric constraints or cycle consistency.

Qualitative comparison with SoTA unsupervised methods.

We further evaluate Flux4D's capability for future frame prediction beyond the observed frames. This challenging task requires precise motion estimation, temporal consistency, occlusion reasoning, and a comprehensive 4D scene understanding. Flux4D outperforms existing unsupervised methods in both photometric accuracy and geometric consistency.

Future prediction (0.5s ahead).

Moreover, on extrapolation Flux4D even outperforms supervised approaches that rely on explicit annotations, demonstrating the robustness of our predicted scene representation. This highlights Flux4D's ability to model scene dynamics, which is critical for world modeling, simulation, and scene understanding in autonomous systems.
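As a concrete illustration of why an explicit flow makes extrapolation straightforward, the toy snippet below pushes Gaussian centers beyond the observed window under a constant-velocity assumption; the actual Flux4D motion model and rendering pipeline are richer, so treat this as a sketch only.

```python
import torch

def extrapolate_gaussians(means, velocities, t_last, t_future):
    """means, velocities: (N, 3) tensors predicted at the last observed frame.
    Returns Gaussian centers advanced to t_future (e.g. 0.5 s ahead)."""
    dt = t_future - t_last
    # Constant-velocity assumption: each Gaussian keeps its predicted motion.
    # Rendering the advanced Gaussians then proceeds as for observed frames.
    return means + velocities * dt
```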

Scaling Laws

Flux4D’s effectiveness stems from multi-scene training, which leverages diverse driving data as implicit regularization. Unlike per-scene methods that require complex regularization or pre-trained models, Flux4D improves its scene decomposition and motion estimation naturally as the amount of training data increases.

Scaling analysis of Flux4D.

We observe consistent improvements in photometric accuracy and motion estimation as the training data scale increases. This confirms that unsupervised 4D reconstruction benefits significantly from diverse real-world scenarios and suggests that Flux4D can continue to improve with more data, highlighting its potential for scalable 4D reconstruction.

Controllable Camera Simulation

Since Flux4D produces high-quality motion flows, we can cluster the predicted 3D Gaussians based on their velocity to automatically create 3D instance masks without any annotations (first row, rightmost panel), enabling realistic and controllable camera simulation as shown in the second row.

Flux4D predicts editable representations for controllable camera simulation without requiring any labels.
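A hedged sketch of this idea: threshold the predicted per-Gaussian speed to separate dynamic from static Gaussians, then cluster the dynamic ones in position-velocity space (here with scikit-learn's DBSCAN). The threshold, feature scaling, and choice of clustering algorithm are illustrative assumptions rather than the exact procedure used by Flux4D.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def instance_ids_from_flow(means, velocities, speed_thresh=0.5, eps=1.0):
    """means, velocities: (N, 3) arrays of Gaussian centers and predicted velocities.
    Returns one instance id per Gaussian; -1 marks static background or noise."""
    speed = np.linalg.norm(velocities, axis=-1)
    dynamic = speed > speed_thresh            # assumed speed threshold (m/s)
    ids = np.full(len(means), -1, dtype=int)
    if dynamic.any():
        # Cluster moving Gaussians jointly on position and velocity so that
        # nearby points moving together fall into the same instance.
        feats = np.concatenate([means[dynamic], velocities[dynamic]], axis=-1)
        ids[dynamic] = DBSCAN(eps=eps, min_samples=10).fit_predict(feats)
    return ids
```

The resulting per-instance groups can then be removed, re-posed, or inserted before re-rendering, which is what enables the controllable camera simulation shown above.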

Here we showcase more simulation videos including removing dynamic actors (first row, rightmost panel), changing camera viewpoints (second row), and actor manipulation/insertion (third row).

Conclusion

We present Flux4D, a scalable flow-based unsupervised framework for reconstructing large-scale dynamic scenes by directly predicting 3D Gaussians and their motion dynamics. By relying solely on photometric losses and enforcing an "as static as possible" regularization, Flux4D effectively decomposes dynamic elements without requiring any supervision, pre-trained models, or foundational priors. Our method enables fast reconstruction, scales efficiently to large datasets, and generalizes well to unseen environments. Extensive experiments on outdoor driving datasets demonstrate state-of-the-art performance in scalability, generalization, and reconstruction quality. We hope this work paves the way for efficient, unsupervised 4D scene reconstruction at scale.

BibTeX

@inproceedings{wang2025flux4d,
  title={Flux4D: Flow-based Unsupervised 4D Reconstruction},
  author={Jingkang Wang and Henry Che and Yun Chen and Ze Yang and Lily Goli and Sivabalan Manivasagam and Raquel Urtasun},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=FeUGQ6AiKR}
}