MAD: Memory-Augmented Detection of 3D Objects

CVPR 2025

Ben Agro, Sergio Casas, Patrick Wang, Thomas Gilles, Raquel Urtasun


To perceive, humans use memory to fill in gaps caused by our limited visibility, whether due to occlusion or our narrow field of view. However, most 3D object detectors are limited to using sensor evidence from a short temporal window (0.1s-0.3s). In this work, we present a simple and effective add-on for enhancing any existing 3D object detector with long-term memory regardless of its sensor modality (e.g., LiDAR, camera) and network architecture. We propose a model to effectively align and fuse object proposals from a detector with object proposals from a memory bank of past predictions, exploiting trajectory forecasts to align proposals across time. We propose a novel schedule to train our model on temporal data that balances data diversity and the gap between training and inference. By applying our method to existing LiDAR and camera-based detectors on the Waymo Open Dataset (WOD) and Argoverse 2 Sensor (AV2) dataset, we demonstrate significant improvements in detection performance (+2.5 to +7.6 AP points). Our method attains the best performance on the WOD 3D detection leaderboard among online methods (excluding ensembles or test-time augmentation).



Motivation

Most self-driving vehicles (SDVs) utilize a 3D object detector to recognize and localize objects in 3D space. This task is challenging due to occlusion, large intra-class variability, and distant objects, which typically have limited sensor observations. To overcome these challenges, human drivers rely on their memory. For example, they may drive more cautiously when remembering a previously observed but now occluded cyclist, who may suddenly enter the road. 

A common approach for improving 3D object detectors is to aggregate a short temporal window of past sensor observations. Towards this goal, most existing LiDAR-based methods transform a short buffer of sensor data into the current ego (SDV) coordinate frame to align past and current evidence. Similarly, camera-based methods stack multiple images as input to existing architectures. These methods cannot handle long temporal sequences due to computational and memory constraints. Moreover, temporal stacks of 3D/Bird's-Eye-View (BEV) representations like point clouds or lifted camera features require a large receptive field, especially for fast-moving objects, further increasing computational burden.

There is growing interest in long-term temporal fusion. Scene-level memory approaches recurrently fuse scene-level features, but they can struggle to capture relevant foreground objects. Other approaches associate objects in memory over time via tracking, aggregating past information for each particular object; however, the tracker's associations may contain mistakes that compound over time and lead to information loss. Yet other methods leverage attention from current detection proposals to past sensor or object information, but they can be challenging to scale to long histories and suffer from false negatives, since the proposals refined into the final detections come only from the present time.

We present a simple and sensor-agnostic add-on for enhancing any existing 3D object detector with long-term memory. We refer to it as MAD --- short for Memory-Augmented Detection. MAD is a transformer-based model that fuses proposals from a detector with proposals from a memory bank representing past beliefs. Inspired by recent developments, we exploit joint detection and trajectory forecasting. By storing explicit trajectory forecasts in the memory bank, we can estimate object poses at arbitrary future timestamps for all the objects in the memory. This enables us to enrich the set of proposals by aligning memory proposals with the current observations.


Method

MAD is a plug-and-play module to enhance existing 3D object detectors with the ability to perform long-horizon temporal fusion. Our only requirement of the detector is that each detection includes an object bounding box, multi-class confidence scores, and a feature vector capturing local context. We demonstrate the generalizability of MAD by augmenting and improving various LiDAR-based and camera-based detectors.

We enable long-horizon temporal understanding through a memory bank that captures all the relevant information on objects, including where we expect them to move. These trajectory forecasts allow us to align the memory objects with the current detector proposals in space and time. Importantly, we do not require the object detector to provide motion forecasts; instead, MAD computes them. To compensate for ego-motion, we assume the ego is localized --- which is the norm in modern self-driving platforms--- and store the ego pose in the memory along with the model outputs.
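To make the memory-bank contents and the forecast-based alignment concrete, below is a minimal NumPy sketch. It is an illustration under our own assumptions, not the paper's implementation: the field names, shapes, and interpolation scheme are hypothetical, and ego_pose is assumed to be a 4x4 ego-to-global transform.

from dataclasses import dataclass
import numpy as np

@dataclass
class MemoryEntry:
    timestamp: float            # time at which this entry was produced
    ego_pose: np.ndarray        # (4, 4) ego-to-global transform at `timestamp`
    boxes: np.ndarray           # (N, 7) boxes: x, y, z, l, w, h, yaw in the past ego frame
    scores: np.ndarray          # (N, C) multi-class confidence scores
    features: np.ndarray        # (N, D) per-object feature vectors
    forecasts: np.ndarray       # (N, T, 2) forecasted x, y waypoints in the past ego frame
    forecast_times: np.ndarray  # (T,) absolute timestamps of the forecast waypoints

def align_to_current(entry: MemoryEntry, t_now: float, ego_pose_now: np.ndarray) -> np.ndarray:
    """Estimate object centers at t_now, expressed in the current ego frame."""
    n = entry.boxes.shape[0]
    xy = np.empty((n, 2))
    for i in range(n):
        # Temporal alignment: interpolate the stored trajectory forecast to the query time.
        xy[i, 0] = np.interp(t_now, entry.forecast_times, entry.forecasts[i, :, 0])
        xy[i, 1] = np.interp(t_now, entry.forecast_times, entry.forecasts[i, :, 1])
    # Spatial alignment: past ego frame -> global -> current ego frame (homogeneous coords).
    xyz1 = np.concatenate([xy, entry.boxes[:, 2:3], np.ones((n, 1))], axis=1)
    to_current = np.linalg.inv(ego_pose_now) @ entry.ego_pose
    return (to_current @ xyz1.T).T[:, :3]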

Architecture

At every inference step, MAD takes as input the detection proposals, the current timestamp (e.g., LiDAR sweep-end time or camera capture time), and the ego pose in a global coordinate frame. It then retrieves objects from memory, aligns them spatially and temporally, and extracts high-dimensional features from the aligned boxes and trajectory forecasts.

We refer to the aligned boxes and trajectory forecasts with the extracted features as memory proposals. A proposal merging mechanism then fuses detection and memory proposals by rescoring their confidence scores and applying standard post-processing. Finally, our refinement transformer iteratively refines the object detections and trajectory forecasts in the merged proposals with cross-attention to the memory and factorized self-attention.

In preparation for future inferences, the memory bank is then updated by appending the model outputs (a.k.a. refined proposals) and removing older model outputs to keep the memory bounded in size.
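The structural sketch below summarizes one inference step as described above. It builds on the MemoryEntry/align_to_current sketch from earlier; the learned components (the proposal-merging module with rescoring and NMS, and the refinement transformer) are passed in as callables because we only outline the data flow, and the memory capacity value is our assumption.

def mad_step(detections, t_now, ego_pose_now, memory_bank,
             merge_fn, refine_fn, max_memory_frames=16):
    # 1) Retrieve memory entries and align their objects to the current time and ego pose.
    memory_proposals = [
        (entry, align_to_current(entry, t_now, ego_pose_now)) for entry in memory_bank
    ]

    # 2) Merge detection and memory proposals: learned rescoring followed by standard
    #    post-processing (e.g., NMS) yields a single set of candidates.
    merged = merge_fn(detections, memory_proposals)

    # 3) Iteratively refine boxes and trajectory forecasts with cross-attention to the
    #    memory and factorized self-attention over the merged proposals.
    refined = refine_fn(merged, memory_bank)

    # 4) Append the refined proposals and drop the oldest entries to keep the memory bounded.
    memory_bank.append(refined)
    while len(memory_bank) > max_memory_frames:
        memory_bank.pop(0)
    return refined, memory_bank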

Training

We first train an off-the-shelf 3D detector following their original training strategy. This stage can be omitted if a pre-trained 3D detector is available. Then, we train all the parameters in MAD as a subsequent stage, with the 3D detector weights frozen. Pre-training and freezing the 3D detector is important to ensure the detection proposals do not change throughout MAD training. Note that we train a separate MAD for each 3D detector, as each detector has different features, detection distribution and calibration.

Before detailing our proposed MAD training, we discuss some possibilities and trade-offs when training temporal fusion models. Training on unordered examples has the advantage of satisfying the assumption of i.i.d. examples (better learning dynamics). However, it differs from evaluation, where the model is rolled out on long sequences and consumes its previous outputs. Training on long sequences of ordered data has the advantage of being closer to evaluation, but it has worse learning dynamics since consecutive examples are heavily correlated (there are few changes in the scene from one frame to the next). If, instead, gradients are accumulated over a long sequence and used to update the model parameters once per sequence, a sequence becomes one example (satisfying the i.i.d. assumption), but the training duration is multiplied by the sequence length if the number of model updates is kept constant.

To tackle these challenges, we design a novel training schedule. We propose to train MAD on increasingly long chunks of ordered data, using single frames at the beginning and entire sequences at the end of training. To train object memory on short chunks (or even single frames) of data while maintaining a reasonable amount of memory inputs, we propose to maintain a cache of memory banks across training and use it to build the memory proposals for each training example.

Training schedule

Autonomous driving datasets organize their data into driving logs, each around 20s in duration with data captured at 10Hz, meaning each log has around 200 frames. Each log has a unique identifier. For the first 25% of training, we sample single frames (that is, consecutive training examples are random frames from random logs). Throughout the rest of training, we sample sequential chunks of gradually increasing size: 48 frames for (25%, 50%] of training, 96 frames for (50%, 75%], and 144 frames for (75%, 100%]. We train with a single cosine decay learning rate schedule with no resets. The intuition is that when the learning rate is high and the model weights change the most, the model is exposed to more diverse data; then, when the learning rate is lower, the model is tuned closer to the evaluation setting, where it consumes its previous outputs.
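As a concrete reference, the chunk-size schedule described above can be written as a small function of training progress; the frame counts come from the text, while the function name and interface are ours.

def chunk_length(progress: float) -> int:
    """Map training progress in [0, 1] to the length of the sampled chunk (in frames)."""
    if progress <= 0.25:
        return 1      # single random frames: maximally diverse, close to i.i.d.
    elif progress <= 0.50:
        return 48     # short ordered chunks
    elif progress <= 0.75:
        return 96     # medium ordered chunks
    else:
        return 144    # near-full sequences: closest to the deployment setting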

Exploiting a Cache of Memory Banks

If the schedule described above is followed naively during the individual frame and short chunk training, the model cannot consume its previous outputs and thus would not learn to use memory during this phase of training. To address this problem, we introduce a cache of previous memory banks. This cache is a mapping from the unique driving log identifier to a memory bank. At the start of training, we initialize the cache with empty memory banks for all logIDs. On a given training iteration, we index the cache with the logID of the current training example to obtain the memory bank. If available, we retrieve the memory proposals from this memory bank. We update the retrieved memory bank at the end of the training iteration with the model outputs, replacing any existing entry with the same timestamp. Note that during training we do not limit the size of the memory bank.
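A minimal sketch of such a cache is shown below; the class and method names are ours, and the stored payload is left abstract since the exact memory-bank contents follow the architecture section.

from collections import defaultdict

class MemoryBankCache:
    def __init__(self):
        # Maps a driving-log identifier to its memory bank: a list of
        # (timestamp, model_outputs) pairs kept sorted by timestamp.
        self.banks = defaultdict(list)

    def retrieve(self, log_id: str):
        """Return the (possibly empty) memory bank for this log."""
        return self.banks[log_id]

    def update(self, log_id: str, timestamp: float, outputs) -> None:
        """Insert the latest outputs, replacing any existing entry with the same timestamp."""
        bank = [e for e in self.banks[log_id] if e[0] != timestamp]
        bank.append((timestamp, outputs))
        bank.sort(key=lambda e: e[0])
        self.banks[log_id] = bank  # unbounded during training, as described above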

There are a few challenges to training with the object memory cache that we address:

  • To train these models efficiently on large datasets, we use a distributed data training scheme, meaning we split examples in the minibatch across multiple GPUs. Each GPU has a unique index called a rank. Each rank maintains a separate cache to prevent the cache from filling up the RAM and to avoid synchronization costs. To guarantee high cache hit rates, we ensure that training examples from a given logID are always placed on the same rank during training (see the sketch after this list).

  • The cache is filled with MAD outputs, which are inaccurate at the beginning of training. We do not want erroneous model outputs to fill the cache; otherwise, the model may not learn to trust the memory proposals. To mitigate this, we only start filling the cache (and training with memory proposals) after 2.5% of training, after which performance is reasonable.

  • To make the model robust to variable latency and to the presence or absence of memory proposals, we randomize the target timestamps for which memory elements are retrieved during training by randomly sampling the time stride and the number of target timestamps.
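For the first point above, one simple way to keep all frames of a log on the same rank is a deterministic hash of the log identifier; the hashing scheme below is our assumption, not necessarily the one used in MAD.

import hashlib

def rank_for_log(log_id: str, world_size: int) -> int:
    """Deterministically assign a driving log to one of `world_size` ranks."""
    # md5 is used instead of Python's built-in hash(), which is not stable across processes.
    digest = hashlib.md5(log_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % world_size

Each rank then trains only on examples whose log maps to its own index, so after warm-up its local cache contains every previously seen frame of those logs and cache hits stay high.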


Results

Enhancing 3D LiDAR-Based Detectors

We enhanced three LiDAR-based methods with MAD on WOD and AV2. Our model brings significant improvements to all detectors on both datasets. These gains are largest for single-frame detectors, where the memory provides the most additional information. The fact that the MAD-augmented single-frame detectors outperform the multi-frame detectors clearly shows the effectiveness of our method relative to the common point aggregation approach.

Comparison against SOTA on WOD leaderboard

By augmenting SAFDNet 4f with MAD, we achieve the best performance on the WOD leaderboard among all online methods that do not use ensembles or test-time augmentation.

Memory Components Ablation

We ablate the different components of our memory pipeline, showing that both the proposed memory attention and memory proposals have a positive effect. This is intuitive as the memory proposals let MAD recover from false negative detection proposals, which is complementary to memory cross-attention that allows MAD to use all memory information for refinement (bypassing the filtering in proposal merging). We also find that using trajectory forecasting to align memory proposals to the current time is important, particularly for fast-moving objects. Without forecasting, the memory proposals from previous frames will be far from the current position of those objects, making it challenging for the model to leverage the memory effectively. Finally, we find the proposed learned rescoring of the merged detection and memory proposals is crucial for good performance. Without it, MAD cannot enhance the base detector because the proposal scores from the 3D detector and the memory are miscalibrated with respect to each other before being post-processed in the proposal merging step (i.e., NMS).

Training Schedule Ablation

We first train MAD with three different chunk sizes (i.e., sequences with {144, 48, 1} frames), each with and without the memory bank cache. Training with long chunks (144 frames) provides good performance because there is a small gap between training and evaluation. Training with shorter chunks performs worse because there is a more significant gap between training and evaluation.

Our proposed schedule strategy allows MAD to learn generalized patterns over a diverse set of examples quickly by training on short chunks (more i.i.d. data) at the beginning when the learning rate is higher while refining its understanding on long chunks (closer to the deployment setting) towards the end when the learning rate is lower.

Conclusion

In this paper, we propose a simple, effective, and sensor-modality-agnostic add-on for enhancing any existing 3D object detector with long-term memory. To achieve this, we design a transformer-based model that uses joint detection and trajectory forecasting to populate a memory bank with spatio-temporal object trajectories. Our model effectively fuses memory proposals with detection proposals by reading previous memory entries and aligning them with the current time and ego pose. We also propose a novel training strategy that increases data diversity while keeping the training-to-inference gap low. Our approach is general and highly effective, bringing impressive improvements to a variety of LiDAR-based and camera-based detectors.

BibTeX

@inproceedings{agro2025mad,
  title={MAD: Memory-Augmented Detection of 3D Objects},
  author={Ben Agro and Sergio Casas and Patrick Wang and Thomas Gilles and Raquel Urtasun},
  booktitle={Conference on Computer Vision and Pattern Recognition},
  year={2025}
}