
In this paper, we present Dense Trajectory Fields (DTF), a novel low-level holistic approach inspired by optical-flow and trajectory methods, which addresses spatial and temporal aspects at once. A DTF contains the dense, long-term trajectories of all pixels of a reference frame over an entire sequence. We estimate it with DTF-Net, a fast and lightweight neural network comprising three main components:

Our novel approach, Dense Trajectory Fields (DTF), describes the 2D motion of all pixels of a reference frame. It is a 3D motion volume indexed by (t, i, j) that unifies optical-flow and trajectories: it is dense and long-term, able to handle large occlusions, and can leverage both temporal and spatial contexts.
Unlike existing approaches, DTF is holistic: it considers all pixels at once, which is more efficient, and leads to more consistent results both temporally and spatially.
A brief analysis shows that this motion space and the video space have different layouts, hence they need to be treated separately.
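To make the data layout concrete, here is a minimal sketch in plain PyTorch of how a DTF and its visibility mask could be stored and sliced; the tensor names and sizes are illustrative assumptions, not taken from our released code.

```python
import torch

# Hypothetical sizes: T frames, reference frame of resolution H x W.
T, H, W = 24, 384, 512

# A DTF can be stored as a dense (T, H, W, 2) volume: for every pixel (i, j) of
# the reference frame, dtf[t, i, j] gives its 2D position in frame t.
# A companion (T, H, W) visibility mask flags occlusions.
dtf = torch.zeros(T, H, W, 2)
visibility = torch.ones(T, H, W, dtype=torch.bool)

# Spatial slice: the full trajectory of pixel (i, j) over the sequence.
i, j = 100, 200
trajectory = dtf[:, i, j]      # shape (T, 2)

# Temporal slice: the motion of all pixels from the reference frame to frame t,
# i.e. an optical-flow-like field.
t = 10
flow_to_t = dtf[t]             # shape (H, W, 2)
```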

DTF-Net is a residual architecture that refines video and motion embeddings in parallel. Video embeddings are obtained by applying a simple CNN encoder to the input sequence, reducing the resolution to 1/8. Motion embeddings are initialized to zero. The estimated motion and visibility can be read out at any time by applying a common motion head, shared across the network.
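As an illustration only, the following PyTorch skeleton sketches this refinement loop; the class names (DTFNetSketch, RefinementLayer, MotionHead), embedding dimensions, and layer sizes are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class MotionHead(nn.Module):
    """Shared head: reads motion embeddings, predicts 2D motion and visibility.
    (Hypothetical layer sizes; the paper only states that the head is shared.)"""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 3)           # (dx, dy, visibility logit)

    def forward(self, motion_emb):              # (..., dim)
        out = self.proj(motion_emb)
        return out[..., :2], out[..., 2]


class RefinementLayer(nn.Module):
    """Placeholder residual layer; the actual layer performs Centroid Summarization."""
    def __init__(self, dim):
        super().__init__()
        self.video_update = nn.Conv2d(2 * dim, dim, 3, padding=1)
        self.motion_update = nn.Conv2d(2 * dim, dim, 3, padding=1)

    def forward(self, video_emb, motion_emb):
        x = torch.cat([video_emb, motion_emb], dim=1)
        return video_emb + self.video_update(x), motion_emb + self.motion_update(x)


class DTFNetSketch(nn.Module):
    def __init__(self, dim=128, num_layers=8):
        super().__init__()
        # Simple CNN encoder bringing the input frames down to 1/8 resolution.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        self.layers = nn.ModuleList([RefinementLayer(dim) for _ in range(num_layers)])
        self.motion_head = MotionHead(dim)      # shared across all layers

    def forward(self, frames):                  # frames: (T, 3, H, W)
        video_emb = self.encoder(frames)        # (T, dim, H/8, W/8)
        motion_emb = torch.zeros_like(video_emb)   # motion embeddings start at zero
        for layer in self.layers:
            video_emb, motion_emb = layer(video_emb, motion_emb)
        # Motion and visibility can be read out after any layer via the shared head.
        motion, vis = self.motion_head(motion_emb.permute(0, 2, 3, 1))
        return motion, vis                      # (T, H/8, W/8, 2), (T, H/8, W/8)
```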

Exhaustively processing all pixels would be cumbersome. We observe that motion is redundant and scattered across the video, so it can be clustered, through a process we call Centroid Summarization.
At each layer, we learn a set of tokens that builds attention maps A, which we interpret as attention clusters. From each cluster, we deduce a single localized centroid, whose trajectory is obtained by applying the motion head to its motion embeddings. We refine its video and motion embeddings using correlations of patches around its trajectory, processed by a temporal CNN. The video and motion updates are applied back to the whole sequence using the reciprocal attention: by simply transposing the original attention A, we keep the same pixel-centroid affinity.
We repeat these operations over 8 layers. Each layer learns its own set of tokens, building its own summarization of the video.
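The sketch below illustrates one possible reading of Centroid Summarization in plain PyTorch; the shapes, the number of centroids, and the small MLP standing in for the patch correlations and temporal CNN are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CentroidSummarization(nn.Module):
    """K learned tokens attend over the N pixels of the (downsampled) video;
    the transpose of the same attention map broadcasts the centroid updates
    back to all pixels (reciprocal attention)."""
    def __init__(self, dim=128, num_centroids=32):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_centroids, dim))   # learned per layer
        # Stand-in for the patch correlations + temporal CNN used in the paper.
        self.refine = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2 * dim))

    def forward(self, video_emb, motion_emb):
        # video_emb, motion_emb: (N, dim), N = number of pixels at 1/8 resolution.
        attn = F.softmax(self.tokens @ video_emb.t(), dim=-1)         # (K, N): clusters A

        # Summarize: one video / motion embedding per centroid.
        video_c = attn @ video_emb                                    # (K, dim)
        motion_c = attn @ motion_emb                                  # (K, dim)

        # Refine the centroids (toy MLP instead of trajectory correlations).
        video_upd, motion_upd = self.refine(
            torch.cat([video_c, motion_c], dim=-1)).chunk(2, dim=-1)  # (K, dim) each

        # Reciprocal attention: transposing A scatters the updates back to every
        # pixel with the same pixel-centroid affinity.
        return (video_emb + attn.t() @ video_upd,
                motion_emb + attn.t() @ motion_upd)
```

Stacking 8 such layers, each with its own learned tokens, corresponds to the repeated summarizations described above.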
We show in order: reference frame with some points to track, visibility, some point trajectories (spatial slices of DTF), and the optical-flow (temporal slice of DTF). By construction, DTF-Net gives smooth and coherent results.

Thanks to its efficient architecture, DTF-Net is faster than state-of-the-art methods, achieving over 300 FPS on 512×384 sequences (tested on an RTX 4090). Although DTF-Net is not always the most accurate, it exhibits the best spatial and temporal consistency.
For now, our method does not transfer well to real-world sequences, lacking generalization in its ability to build complex cluster motions. On some data, it may not yet be on par with specialized methods on their respective tasks (trajectories & optical-flow). As it stands, it can only be used offline.
We introduce Dense Trajectory Fields, a novel approach to low-level, dense and long-term motion estimation, which aims to track all pixels of a reference frame over an entire sequence. We propose DTF-Net, a neural architecture built on an iterative refinement of image and motion features. To alleviate the cost of processing the full video, we use a reciprocal attention mechanism to build salient centroids and refine their trajectories using patch-to-patch cost-volumes. We extend the existing Kubric dataset to provide pixel-wise ground truth on which to train our model. We evaluate our method against existing trajectory and optical-flow algorithms, and show how both behave on each other's task. DTF-Net offers a fast solution that exhibits good spatial and temporal consistency.
@InProceedings{Tournadre_2024_DTF,
author = {Tournadre, Marc and Soladi\'e, Catherine and Stoiber, Nicolas and Richard, Pierre-Yves},
title = {Dense Trajectory Fields: Consistent and Efficient Spatio-Temporal Pixel Tracking},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
month = {December},
year = {2024},
pages = {2212-2230}
}
Big thanks to Indramal for his awesome website template.
Parts of the code are inspired by the RAFT repository.
Data processing relies on Kubric.
Evaluation relies on the TAPNet benchmark.
Thanks to the authors for providing their code.
Dense Trajectory Fields by Marc Tournadre @ Dynamixyz - TakeTwo is licensed under CC BY-NC-SA 4.0