DIO: Decomposable Implicit 4D Occupancy-Flow World Model

CVPR 2025

Christopher Diehl*, Quinlan Sykora*, Ben Agro, Thomas Gilles, Sergio Casas, Raquel Urtasun


We present DIO, a flexible world model that estimates the scene occupancy-flow from a sparse set of LiDAR observations and decomposes it into individual instances. DIO not only completes instance shapes at the present time, but also forecasts their occupancy-flow evolution over a future horizon. Thanks to its flexible prompt representation, DIO can take instance prompts from off-the-shelf models like 3D detectors, achieving state-of-the-art performance on the task of 4D semantic occupancy completion and forecasting on the Argoverse 2 dataset. Moreover, our world model can easily and effectively be transferred to downstream tasks like LiDAR point cloud forecasting, ranking first among all baselines in the Argoverse 4D occupancy forecasting challenge.



Motivation

Planning a safe motion requires robots such as self-driving vehicles (SDVs) to have an accurate understanding of the world. An important challenge for perception systems is understanding individual objects and their shape from sparse and noisy observations. Moreover, predictions of the environment at the current time are not sufficient: to plan a safe trajectory, a robot also needs a precise understanding of how the 4D (3D + time) scene will evolve into the future, which is what a world model provides. These systems benefit from reasoning about instances to get a complete understanding of the scene, allowing a more fine-grained understanding of the interactions between the SDV and the objects around it. For instance, if a planned trajectory overlaps with forecasted future occupancy, it is important to understand where that occupancy came from in order to reason about who has the right-of-way. However, for any autonomous system to be robust to rare or unpredictable events, it needs to contend with more objects than can reasonably be covered by human annotations. For example, if a tree or some branches were to fall on the road, we would like our autonomous vehicle to avoid these obstacles, even though we cannot reasonably label every single instance of foliage in our training set.

While there are methods that address each of these challenges individually, so far no solution addresses all of them simultaneously. For example, one common practice is to train a model to put boxes around objects of interest and to predict the movement of these boxes as rigid bodies. We refer to these methods as instance-based. While they do allow for individual instance reasoning and for prediction of the scene into the future, they are fundamentally limited by human annotations, as they generally require manual labels to determine which objects to put boxes on. They also simplify the geometry of entities in the scene by treating them as boxes with no fine-grained details.

Another strategy is to predict the 4D occupancy of the scene as a whole. Our previous work, UnO, does this by learning to predict occupancy from the LiDAR returns of future sensor sweeps. This bypasses the need for human annotations, captures high-resolution details of entities, and forecasts their movement into the future. However, the downside of having no human annotations is that this kind of model treats the entire scene as a single entity. This lack of decomposition means the autonomous system cannot reason about the interactions between objects in the scene in any interpretable manner.

Our method, DIO, is the first to address all of the above concerns in a single model. DIO is a 4D occupancy-flow world model that exploits unsupervised LiDAR data together with object bounding box labels as supervision, allowing it to predict both the occupancy of the scene as a whole and the occupancy of individual instances. DIO also improves on both the fine-grained detail and the forecasting ability of previous models: it can predict 4D scene occupancy and flow, decompose it into individual objects, and be transferred to the task of LiDAR point cloud forecasting. In contrast to other approaches, it requires only 3D bounding box labels to predict instances that approximate the true geometry.


Method

The DIO model takes two distinct inputs. The first is a series of historical LiDAR sweeps: the sensor readings the model uses to see the world. The second is a prompt, made up of two components: a query point, which we denote q, and a source point, which we denote s. The query point is a 4D coordinate in space and time, written (x, y, z, t), denoting the point for which we want to know the occupancy probability. Put simply, if we give the model a query point q, it tells us the probability of that location being occupied at that point in time. The source point s is a point in 3D space (x, y, z) that tells the model which entity we are currently looking at. If the source point lies within an object in the scene at the present time, then the model outputs occupancy only for that specific object, leaving the rest of the scene empty. If the source point does not lie within any occupied space, then the model outputs no occupancy at all. Finally, if we do not give the model a source point, it outputs the occupancy of the scene as a whole. This structure lets us obtain instance-level information while still providing occupancy for everything in the scene, ensuring that we are not restricted by the scope of the labelled set.
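
To make this prompt interface concrete, here is a minimal sketch in PyTorch. The names (OccupancyPrompt, query_occupancy), the tensor shapes, and the model signature are illustrative assumptions rather than the paper's actual API.

from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class OccupancyPrompt:
    q: torch.Tensor            # (N, 4) query points (x, y, z, t)
    s: Optional[torch.Tensor]  # (N, 3) source points (x, y, z); None = whole scene

def query_occupancy(model, lidar_history, prompt: OccupancyPrompt) -> torch.Tensor:
    # Returns the probability of each query point being occupied at its time t.
    # With a source point, the probability refers only to the instance containing
    # s at the present time; without one, it refers to the scene as a whole.
    logits = model(lidar_history, prompt.q, prompt.s)
    return torch.sigmoid(logits)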

Architecture

At a high level, the architecture can be broken down into two main sections: the 3D feature encoder and the decomposable occupancy-flow decoder. The 3D feature encoder makes use of recent advancements in sparse convolutions to produce a multi-level sparse feature map of the scene in 3D, taking 5 past LiDAR sweeps as input. The decoder takes in both the multi-resolution feature map from the encoder and the prompt provided by the user. It uses a multi-scale interpolator to extract features from all the input feature maps at the locations of the query point q and the source point s. If no source point is provided, we replace the interpolated source features with a learned feature embedding. Using this information, the decoder predicts a set of offsets to interpolate an additional set of features from the feature maps, which are combined with the embedded prompts and passed through a small MLP to produce the occupancy and instantaneous flow predictions. The decoder is kept extremely small, allowing tens of thousands of prompts to be run in parallel in under 20 ms.
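
As a rough illustration, a decoder along these lines could look as follows. This is a hedged sketch: interpolate is a dense trilinear stand-in for the paper's sparse multi-scale interpolator, the query time is ignored for simplicity, and the layer sizes and number of offsets are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def interpolate(feat_grid, pts, extent=50.0):
    # Trilinear lookup of features at metric points; a dense stand-in for the
    # paper's sparse multi-scale interpolator. feat_grid: (1, C, D, H, W) with
    # axes (z, y, x); pts: (N, 3) or (N, K, 3) in metres within [-extent, extent].
    shape = pts.shape[:-1]
    grid = (pts.reshape(1, -1, 1, 1, 3) / extent).clamp(-1, 1)  # normalise to [-1, 1]
    feats = F.grid_sample(feat_grid, grid, align_corners=True)  # (1, C, N*K, 1, 1)
    return feats.squeeze(-1).squeeze(-1).squeeze(0).t().reshape(*shape, -1)

class DecomposableDecoder(nn.Module):
    def __init__(self, feat_dim: int, num_offsets: int = 4):
        super().__init__()
        self.no_source = nn.Parameter(torch.zeros(feat_dim))  # used when s is absent
        self.offset_head = nn.Linear(2 * feat_dim, num_offsets * 3)
        self.head = nn.Sequential(
            nn.Linear((2 + num_offsets) * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1 + 3),  # occupancy logit + instantaneous (x, y, z) flow
        )

    def forward(self, feat_grid, q, s=None):
        # The query time t (q[..., 3]) is ignored in this simplified stand-in.
        f_q = interpolate(feat_grid, q[..., :3])          # features at the query point
        f_s = interpolate(feat_grid, s) if s is not None else self.no_source.expand_as(f_q)
        ctx = torch.cat([f_q, f_s], dim=-1)
        offsets = self.offset_head(ctx).view(*q.shape[:-1], -1, 3)  # learned offsets
        f_off = interpolate(feat_grid, q[..., None, :3] + offsets)  # (N, K, C)
        out = self.head(torch.cat([ctx, f_off.flatten(-2)], dim=-1))
        return out[..., :1], out[..., 1:]                 # occupancy logit, flow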


Training

Our goal is to learn the fine-grained shape of objects and to decompose them into separate instances. The scene-wise unsupervised LiDAR training from our previous work UnO provides detailed occupancy labels but does not include any instance information. On the other hand, labelled boxes identify separate objects, but their shapes are very coarse since they are constrained to 3D boxes. We aim to combine the advantages of both by prompting DIO to reconstruct occupancy inside specific instance boxes. Training can therefore be broken down into two separate paradigms: scene occupancy prompting and instance occupancy prompting.

Scene Occupancy Prompting

We take inspiration from our previous work UnO to train the model using unsupervised LiDAR rays. More specifically, we generate positive occupied labels by taking the small region located behind each LiDAR point, and negative labels by taking the region between the sensor and each LiDAR point. This gives us continuous regions in space, from which we sample positive and negative points in equal proportion during training. We then query the model at these sampled locations and train it using a simple binary cross-entropy loss. Since these LiDAR-based training labels do not contain any instance information, the prompts used at this stage do not contain a source point. By training with future LiDAR sweeps, we give the model the ability to forecast occupancy rather than merely complete it at the present time.
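
A minimal sketch of this label generation, assuming rays with known sensor origins from a single (possibly future) sweep; the 0.4 m region behind each return and the per-ray sample counts are illustrative choices, not the paper's values.

import torch

def sample_ray_labels(origins, points, n_per_ray=2, eps=0.4):
    # origins, points: (R, 3) sensor origins and LiDAR returns from one sweep.
    # Negatives are sampled uniformly in the free space between the sensor and
    # the return; positives in the region up to eps metres behind the return.
    d = points - origins                                   # ray directions (unnormalised)
    u_neg = torch.rand(len(points), n_per_ray, 1)          # fraction along the free segment
    neg = origins[:, None] + u_neg * d[:, None]
    u_pos = 1.0 + eps * torch.rand(len(points), n_per_ray, 1) / d.norm(dim=-1, keepdim=True)[:, None]
    pos = origins[:, None] + u_pos * d[:, None]
    xyz = torch.cat([neg, pos], dim=1).reshape(-1, 3)
    occ = torch.cat([torch.zeros(len(points), n_per_ray),
                     torch.ones(len(points), n_per_ray)], dim=1).reshape(-1)
    return xyz, occ                                        # query locations and labels

# Each (x, y, z) is then paired with the sweep's timestamp t to form the query q,
# prompted with no source point, and supervised with binary cross-entropy, e.g.
# loss = F.binary_cross_entropy_with_logits(model(history, q, s=None), occ)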

Instance Occupancy Prompting

In the next stage of training, we provide the model with non-empty source points and train it to predict the occupancy of individual entities. To do this, we begin by sampling LiDAR-based occupancy points, just as in the scene occupancy prompting step. Then, we identify the sampled points that are both labelled as occupied and lie within the labelled bounding box of a given object. These qualify as the positive occupied training points, and all other sampled points are set to unoccupied. We then sample a random source point within the bounding box of the object and train the model with these labels, as sketched below. This ultimately teaches the model to predict occupancy for only that given object when the source point lies within the object itself.
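
Here is a simplified sketch of that label assignment; it assumes axis-aligned boxes for clarity, whereas real labels are oriented 3D boxes.

import torch

def instance_labels(xyz, occ, box_min, box_max):
    # xyz: (M, 3) sampled query locations; occ: (M,) scene occupancy labels.
    # box_min, box_max: (3,) bounds of one labelled instance at the query time.
    inside = ((xyz >= box_min) & (xyz <= box_max)).all(dim=-1)
    inst_occ = occ * inside.float()          # occupied AND inside the box stays positive
    s = box_min + torch.rand(3) * (box_max - box_min)  # random source point in the box
    return inst_occ, s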

However, this raises the question: what happens when the source point lies outside the bounds of an object? Here, we sample a source point that is both unoccupied and lies outside the bounds of any bounding box. The desired behaviour in this case is for no occupancy to be predicted, so all labels are assigned as unoccupied.

In practice, we find that the occupancy of a nearby object can sometimes “bleed” into the occupancy of a different object: given a source point, the model predicts the occupancy of both that object and some of its closest neighbours. To mitigate this issue, we add more negative query points around the bounding box of a given object. More specifically, for a source point that lies within a given labelled entity, we additionally sample several thousand negative query points outside the bounding box of this object during training.
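
A small sketch of this negative sampling; the sample count and the margin around the box are assumed values.

import torch

def hard_negative_queries(box_min, box_max, n=4096, margin=2.0):
    # Uniformly sample points in a shell around the box and reject any that
    # fall inside it; all survivors are labelled unoccupied for this prompt.
    lo, hi = box_min - margin, box_max + margin
    pts = lo + torch.rand(4 * n, 3) * (hi - lo)
    outside = ~((pts >= box_min) & (pts <= box_max)).all(dim=-1)
    return pts[outside][:n]                  # up to n negatives around the instance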

Overall, this allows us to create an object-based decomposition of the instances in the scene. These decomposed objects and their corresponding occupancy clouds are also forecasted into the future.


Results

Our model produces high-fidelity predictions of both scene and instance occupancy. At a scene level, we observe that the new encoder and decoder structure produces higher-fidelity reconstructions of the scene than previous models, including our previous work UnO.

[Figure: qualitative comparison of scene occupancy predictions from UnO and DIO]

At an instance level, we find that our predictions for individual instances are actually more accurate than their scene-level counterparts. We hypothesize that this is due to the additional information and focus the source point provides to the model.

When we interpolate the source point between objects, we can see more directly how the model has learned a general abstraction of what an object is. 

This hypothesis is further supported by the fact that, when we evaluate the occupancy predictions of our model on objects it was not trained with, it nevertheless produces very accurate and realistic occupancy predictions.

When compared to other scene occupancy forecasting models, DIO outperforms even the state of the art: on the Argoverse 2 4D occupancy forecasting challenge, it ranks as the top-performing model. Visually, it produces extremely realistic reconstructions and forecasts of the environment.

LiDAR Forecasting

In general, LiDAR forecasting is one of the most easily verifiable ways to evaluate the performance of an occupancy model. To this end, we make a small adaptation of our model to forecast LiDAR rays into the future. Taking inspiration again from our previous work UnO, we train a new rendering model that takes in occupancy sampled along a ray and predicts the length of the LiDAR ray from this information. The renderer is trained to minimize the error between the predicted ray length and the real LiDAR depths. For simplicity, we only use the scene-level occupancy predictions of the model during this training process.
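
A rendering head along these lines might look like the following sketch; the architecture, number of samples per ray, and L2 loss are our assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class RayRenderer(nn.Module):
    def __init__(self, n_samples: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_samples, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, occ_along_ray):                # (R, n_samples) occupancy per ray
        return self.mlp(occ_along_ray).squeeze(-1)   # (R,) predicted ray length in metres

# renderer = RayRenderer()
# loss = torch.nn.functional.mse_loss(renderer(occ_samples), gt_depths)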


Conclusion

In this work, we propose DIO, a decomposable implicit occupancy world model capable of jointly predicting 4D scene occupancy, decomposed instance occupancy, and flow in continuous 3D space and time. We demonstrate that DIO outperforms state-of-the-art unsupervised occupancy baselines in occupancy prediction and LiDAR point cloud prediction experiments. Additionally, it exhibits strong open-set prediction capabilities, indicating its ability to learn a generalized occupancy representation of both individual instances and the scene. We also show that our model can be flexibly prompted to decompose scene occupancy into instance-based occupancy. While our method requires less labelled data than related work, it still relies on bounding box labels during training.

BibTeX

@inproceedings{diehl2025dio,
  title={DIO: Decomposable Implicit 4D Occupancy-Flow World Model},
  author={Christopher Diehl and Quinlan Sykora and Ben Agro and Thomas Gilles and Sergio Casas and Raquel Urtasun},
  booktitle={Conference on Computer Vision and Pattern Recognition},
  year={2025},
}