One Trajectory, One Token:

Grounded Video Tokenization via Panoptic Sub-object Trajectory

1University of Washington   2Allen Institute for Artificial Intelligence   3Woven by Toyota, Inc  

TL;DR: We propose to tokenize videos with panoptic sub-object trajectories, surpassing traditional space-time patch tokenization by a large margin on video understanding tasks while using 10x fewer tokens.

Abstract

Effective video tokenization is critical for scaling transformer models to long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% average top-5 recall on video-text retrieval while using 10x fewer tokens. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average 5.2% performance improvement across 6 VideoQA benchmarks while training 4x faster and requiring 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
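To make the token-count claim concrete, the back-of-the-envelope comparison below contrasts how the two tokenization schemes scale. The frame count, resolution, patch size, and trajectory count are illustrative assumptions, not the exact settings used in the paper.

def spacetime_patch_tokens(num_frames, height, width, patch=16, tubelet=2):
    # ViT3D-style tokenization: token count grows linearly with video length.
    return (num_frames // tubelet) * (height // patch) * (width // patch)

def trajectory_tokens(num_trajectories):
    # Grounded tokenization: one token per panoptic sub-object trajectory,
    # so the count tracks scene complexity rather than duration.
    return num_trajectories

print(spacetime_patch_tokens(64, 224, 224))  # 32 * 14 * 14 = 6272 tokens
print(trajectory_tokens(300))                # ~300 tokens if the scene has ~300 sub-object trajectories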

Method



(1) A trajectory generation pipeline that produces trajectories for all sub-objects in a video, built from an off-the-shelf segmenter and tracker via a parallelizable algorithm (details in paper).
(2) A trajectory encoder that encodes dynamic trajectories into fixed-size tokens.
(3) The trajectory tokens serve as input to the transformer encoder. We train the encoder with a CLIP objective. A minimal sketch of this flow is shown below.
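The PyTorch sketch below illustrates stages (2) and (3) under simplifying assumptions: per-trajectory region features are assumed to be pre-extracted and padded, and the module names (TrajectoryPooler, TrajViTSketch, clip_loss) are hypothetical placeholders rather than the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryPooler(nn.Module):
    """Encode a variable-length trajectory of region features into one fixed-size token."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats, mask):
        # feats: (num_traj, T, dim) padded features; mask: (num_traj, T), True where valid
        m = mask.unsqueeze(-1).float()
        pooled = (feats * m).sum(1) / m.sum(1).clamp(min=1)  # masked temporal mean
        return self.proj(pooled)                             # (num_traj, dim)

class TrajViTSketch(nn.Module):
    """Trajectory tokens -> transformer encoder -> single video embedding."""
    def __init__(self, dim=512, depth=6, heads=8):
        super().__init__()
        self.pool = TrajectoryPooler(dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, feats, mask):
        tokens = self.pool(feats, mask).unsqueeze(0)      # (1, num_traj, dim)
        tokens = torch.cat([self.cls, tokens], dim=1)     # prepend a [CLS] token
        return self.encoder(tokens)[:, 0]                 # (1, dim) video embedding

def clip_loss(video_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired video/text embeddings (CLIP objective).
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature
    labels = torch.arange(len(v), device=v.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2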





Example of generated trajectories





Regular Video Understanding Tasks



We compare TrajViT with a standard ViT using space-time patch tokens (ViT3D) and with state-of-the-art token merging methods on a wide range of video understanding tasks, including action classification, video-text retrieval, and spatio-temporal detection. TrajViT outperforms ViT3D on all tasks, while all token merging baselines underperform ViT3D on average.



Efficiency Analysis



We compare efficiency along four different axes under varying frame counts (on the ActivityNet benchmark). Despite the additional overhead of trajectory generation, TrajViT trains faster, consumes less GPU memory, and runs faster inference for videos with more than 64 frames.
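As a rough illustration of how such a sweep could be reproduced, the sketch below times a forward pass and records peak GPU memory while varying the frame count; model and make_inputs are placeholders for any video encoder and its input builder, not our exact profiling code.

import time
import torch

@torch.no_grad()
def profile(model, make_inputs, frame_counts=(16, 32, 64, 128), device="cuda"):
    model = model.to(device).eval()
    for t in frame_counts:
        inputs = make_inputs(t).to(device)              # build an input clip with t frames
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        model(inputs)
        torch.cuda.synchronize(device)
        latency_ms = (time.perf_counter() - start) * 1e3
        peak_gb = torch.cuda.max_memory_allocated(device) / 1e9
        print(f"{t:4d} frames: {latency_ms:8.1f} ms, {peak_gb:6.2f} GB peak")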



Scaling Behavior



We show that our method scales as well as the standard vision transformer. Additionally, TrajViT can naturally process image data by treating each image segment as a trajectory of length one, whereas a standard video transformer cannot. This allows seamless joint training on both videos and images. Consequently, we see a larger performance improvement for TrajViT when adding image data to the pretraining corpus.
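The snippet below sketches this image-as-video trick under the same assumed interface as the earlier encoder sketch: each image segment's feature simply gains a temporal axis of length one, so it flows through the trajectory encoder unchanged.

import torch

def image_to_trajectories(segment_feats):
    # segment_feats: (num_segments, dim) per-segment features from a single image.
    # Each segment becomes a trajectory of length one, so images and videos
    # can share the same encoder and be mixed in one pretraining batch.
    traj_feats = segment_feats.unsqueeze(1)                              # (num_segments, 1, dim)
    traj_mask = torch.ones(segment_feats.size(0), 1, dtype=torch.bool)   # every step is valid
    return traj_feats, traj_mask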



Application in VideoLLMs



Lastly, we train two VideoLLMs by connecting Llama3 with TrajViT and ViT3D as the video encoder, respectively. On six VideoQA benchmarks, the average accuracy of TrajViT-LLM surpasses ViT3D-LLM by 5.24%, while training 4x faster and using 18x fewer inference FLOPs.
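The sketch below shows the common way a video encoder is wired into an LLM, which is what this setup amounts to conceptually: project the encoder's tokens into the LLM embedding space and prepend them to the text embeddings. The module names and dimensions are assumptions for illustration, not the exact recipe used in the paper.

import torch
import torch.nn as nn

class VideoToLLMProjector(nn.Module):
    """Map video-encoder tokens into the LLM's embedding space."""
    def __init__(self, vision_dim=512, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_tokens):        # (B, num_video_tokens, vision_dim)
        return self.mlp(video_tokens)       # (B, num_video_tokens, llm_dim)

def build_llm_inputs(video_tokens, text_embeds, projector):
    # Prepend projected video tokens to the text token embeddings fed to the LLM.
    # With TrajViT's ~10x fewer tokens, this visual prefix is much shorter than ViT3D's.
    visual_prefix = projector(video_tokens)
    return torch.cat([visual_prefix, text_embeds], dim=1)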

BibTeX

@article{zheng2025one,
  title={One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory},
  author={Zheng, Chenhao and Zhang, Jieyu and Salehi, Mohammadreza and Gao, Ziqi and Iyengar, Vishnu and Kobori, Norimasa and Kong, Quan and Krishna, Ranjay},
  journal={arXiv preprint arXiv:2505.23617},
  year={2025}
}