Lecture Notes for feeding curiosity, nurturing knowledge, and inspiring a lifelong love of learning.

Transformers for Multi-Object Detection and Tracking

A structured explanation of Transformer-based multi-object detection and tracking, moving from the DETR foundation to object queries, track queries, temporal association, memory-based tracking, and green-learning models such as MOTT.

  • DETR: Core foundation. Object detection as direct set prediction with object queries and Hungarian matching.
  • MOT: Tracking extends detection by requiring persistent object identities across video frames.
  • Queries: Object queries become track queries that carry identity information through time.
  • MOTT: Green-learning MOT reduces redundant Transformer components for efficient tracking.

1. Motivation: Why Transformers Matter for Detection and Tracking

Multi-object detection and tracking are central tasks in computer vision. Object detection answers the question: what objects are present and where are they? Multi-object tracking adds a temporal question: which object in the current frame corresponds to which object in previous frames?

Traditional multi-object tracking pipelines often follow the tracking-by-detection paradigm. First, an object detector finds bounding boxes in each frame. Then, a separate association module links detections across time using motion models, appearance embeddings, re-identification networks, graph optimization, or hand-designed heuristics.
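
The two-stage pipeline described above can be sketched in a few lines. This is a minimal illustration rather than any specific tracker: it assumes boxes in (x1, y1, x2, y2) format and uses greedy IoU matching as a stand-in for the motion models or appearance embeddings a real system would use. The names `iou` and `associate` and the 0.3 threshold are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: Dict[int, Box], detections: List[Box], next_id: int,
              iou_threshold: float = 0.3) -> Tuple[Dict[int, Box], int]:
    """Greedily link current detections to previous tracks by highest IoU.

    Unmatched detections start new tracks; unmatched tracks are dropped.
    """
    candidates = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    updated: Dict[int, Box] = {}
    used = set()
    for score, tid, j in candidates:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to be the same object
        if tid in updated or j in used:
            continue  # each track and each detection is used at most once
        updated[tid] = detections[j]
        used.add(j)
    for j, det in enumerate(detections):
        if j not in used:
            updated[next_id] = det  # a newly appearing object gets a fresh identity
            next_id += 1
    return updated, next_id


previous = {0: (10, 10, 50, 50), 1: (100, 100, 140, 140)}
detections = [(102, 101, 142, 141), (12, 11, 52, 51), (200, 200, 220, 220)]
current, next_id = associate(previous, detections, next_id=2)
print(current)
```

Because the detector and the association step are separate, a missed or poorly localized box in the first stage directly limits what the second stage can link.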

Simple definition: Transformer-based multi-object tracking is a family of methods that uses attention, object queries, track queries, and temporal modeling to jointly detect objects and preserve their identities across video frames.

Why this matters

  • Unified modeling: Transformers can represent detection, spatial relations, and temporal association in one framework.
  • End-to-end training: DETR-style methods reduce the need for anchors, non-maximum suppression, and manually tuned association rules.
  • Identity-aware tracking: track queries can carry object identity across frames.
  • Long-range reasoning: attention can model interactions among objects, frames, and trajectories.
  • Deployment pressure: real-time MOT requires accuracy, speed, low memory, and low energy consumption.

2. Basic Concept: Detection Versus Tracking

In image-level object detection, the model receives a single image and predicts a set of object categories and bounding boxes. In video-level multi-object tracking, the model receives a sequence of frames and predicts both bounding boxes and identity labels over time.

Frame t → Detector → Boxes → Association → Tracks

Task | Input | Output | Main difficulty
Object detection | Single image | Object classes and bounding boxes | Localization, classification, scale variation, and small objects
Multi-object tracking | Video sequence | Bounding boxes plus persistent identities | Identity switches, occlusion, reappearance, and crowded scenes
Tracking-by-detection | Frame detections plus association logic | Linked object trajectories | Detector errors can propagate into tracker errors
Joint detection and tracking | Video frames or frame features | Detections and identities from one model | Balancing detection quality and temporal association

Object queries versus track queries

Transformer-based MOT methods often reinterpret tracking as query propagation. A query that detects an object in one frame can be propagated into the next frame as a track query. If the query continues to attend to the same target, the model preserves object identity.

Object query → Detection → Track query → Next frame

The key intuition is that tracking is not only about finding objects; it is about maintaining object-centered representations across time.

3. DETR as the Foundation

The most important foundation for Transformer-based object detection is DETR: Detection Transformer. DETR reformulated object detection as a direct set prediction problem.

Instead of generating dense anchors and removing duplicate boxes with non-maximum suppression, DETR predicts a fixed-size set of object candidates. Each predicted candidate is matched to a ground-truth object using bipartite matching. Unmatched predictions are trained as “no object.”

Image → CNN Backbone → Transformer Encoder → Decoder (with Object Queries) → Boxes + Classes

Why DETR is attractive for MOT

  • Set prediction: MOT also involves predicting a set of active objects at each frame.
  • Object queries: queries provide a natural unit for object-level reasoning.
  • Global attention: objects can interact with all image regions and with each other.
  • End-to-end formulation: tracking can be framed as extending object queries across time.

Limitations of vanilla DETR

Slow convergence

Vanilla DETR typically requires long training schedules compared with mature CNN detectors.

Small-object difficulty

Global attention over low-resolution features can be weak for small and crowded objects.

High computational cost

Attention and decoder computation can become expensive for high-resolution visual inputs.

Tracking extension is non-trivial

Detection alone does not solve identity consistency across frames.

4. Mathematical Formulation

Transformer-based detection and tracking can be explained through four connected ideas: attention, set prediction, query propagation, and tracking loss design.

4.1 Scaled dot-product attention

The basic Transformer attention operation maps queries, keys, and values into context-aware representations.

Attention(Q, K, V) = softmax((QKᵀ) / √dₖ) V

Symbol | Meaning
Q | Queries: what the model is looking for.
K | Keys: representations used for attention matching.
V | Values: information retrieved after attention weights are computed.
dₖ | Key dimension used to scale dot products for numerical stability.
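
The formula can be traced step by step in plain Python. This is a small sketch for intuition only: it handles a single attention head, represents matrices as lists of rows, and omits the learned projections that produce Q, K, and V in a real Transformer. The function names are chosen for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One scaled similarity score per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
print(result)
```

The query is more similar to the first key, so the output leans toward the first value vector while still mixing in the second.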

4.2 DETR set prediction

Given an image, DETR predicts a fixed set of N candidates:

ŷᵢ = (ĉᵢ, b̂ᵢ), i = 1, ..., N

Here, ĉᵢ is the predicted class distribution and b̂ᵢ is the predicted bounding box. Ground-truth objects are matched to predictions using Hungarian bipartite matching:

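σ* = arg min_σ Σᵢ L_match(yᵢ, ŷ_{σ(i)})

Here σ ranges over one-to-one assignments of ground-truth objects to predictions, and yᵢ = (cᵢ, bᵢ) is the i-th ground-truth class and box.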
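
To make the matching objective concrete, the sketch below searches all one-to-one assignments by brute force. This is deliberately not the Hungarian algorithm: it is an exhaustive stand-in that returns the same optimum on tiny inputs. Practical implementations typically call a dedicated solver such as SciPy's `linear_sum_assignment`. The cost values are made up for illustration, and at least one ground-truth object is assumed.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def match(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Find the assignment sigma that minimizes the total matching cost.

    cost[i][j] plays the role of L_match(y_i, y_hat_j) for M ground truths
    and N >= M predictions. sigma[i] is the prediction assigned to ground truth i.
    """
    num_gt, num_pred = len(cost), len(cost[0])
    best_sigma: List[int] = []
    best_cost = float("inf")
    for sigma in permutations(range(num_pred), num_gt):
        total = sum(cost[i][sigma[i]] for i in range(num_gt))
        if total < best_cost:
            best_sigma, best_cost = list(sigma), total
    return best_sigma, best_cost


# Two ground-truth objects, four predictions (illustrative costs).
cost = [
    [0.9, 0.1, 0.5, 0.8],
    [0.2, 0.3, 0.7, 0.6],
]
sigma, total = match(cost)
unmatched = [j for j in range(len(cost[0])) if j not in sigma]
print(sigma, unmatched)
```

Predictions left in `unmatched` are the ones that would be supervised as "no object".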

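After matching, the detection loss sums a classification term and two box-regression terms over the matched pairs:

L_DETR = Σᵢ [ λ_cls L_cls(cᵢ, ĉ_{σ*(i)}) + λ_L1 ||bᵢ − b̂_{σ*(i)}||₁ + λ_giou L_giou(bᵢ, b̂_{σ*(i)}) ]

The box terms apply only to pairs whose ground truth is a real object; predictions assigned to “no object” contribute only the classification term.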
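
The loss for one matched pair can be sketched as follows. This is an illustration under simplifying assumptions: boxes use normalized (x1, y1, x2, y2) corners with positive area, the classification term is a plain negative log-likelihood, and the weights 1, 5, and 2 are illustrative defaults rather than values taken from this text.

```python
import math
from typing import Sequence


def giou(a: Sequence[float], b: Sequence[float]) -> float:
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def pair_loss(gt_class: int, gt_box: Sequence[float],
              class_probs: Sequence[float], pred_box: Sequence[float],
              w_cls: float = 1.0, w_l1: float = 5.0, w_giou: float = 2.0) -> float:
    """Weighted sum of classification, L1, and GIoU terms for one matched pair."""
    l_cls = -math.log(class_probs[gt_class])             # L_cls
    l_l1 = sum(abs(g - p) for g, p in zip(gt_box, pred_box))  # ||b - b_hat||_1
    l_giou = 1.0 - giou(gt_box, pred_box)                # L_giou
    return w_cls * l_cls + w_l1 * l_l1 + w_giou * l_giou


loss = pair_loss(
    gt_class=1,
    gt_box=(0.20, 0.20, 0.60, 0.60),
    class_probs=(0.1, 0.8, 0.1),
    pred_box=(0.25, 0.20, 0.65, 0.60),
)
print(round(loss, 4))
```

The GIoU term stays informative even when boxes do not overlap, which a plain IoU loss would not.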

4.3 Query-based tracking

In Transformer MOT, the query set at frame t often contains two types of queries: propagated track queries for existing objects and detection queries for new objects.

Q_t = [ Q_track^{t−1}, Q_detect ]

Existing track queries are updated using current-frame features:

Q_track^t = f_θ(Q_track^{t−1}, F_t)

The output may include class, box, identity state, and confidence:

ŷᵢᵗ = (ĉᵢᵗ, b̂ᵢᵗ, idᵢᵗ, sᵢᵗ)
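
The query bookkeeping in this subsection can be sketched without a real network. In the example below, `decoder` is a hypothetical callable standing in for f_θ and the current-frame features F_t; it returns an updated query and a confidence score for each input query. The two thresholds and the toy decoder are assumptions made purely for illustration.

```python
from typing import Callable, List, Optional, Tuple

Query = List[float]
Decoder = Callable[[List[Query]], List[Tuple[Query, float]]]


def step(track_queries: List[Tuple[int, Query]],
         detect_queries: List[Query],
         decoder: Decoder,
         next_id: int,
         keep_threshold: float = 0.5,
         birth_threshold: float = 0.7) -> Tuple[List[Tuple[int, Query]], int]:
    """Process one frame: update existing tracks and start new ones."""
    # Q_t = [Q_track^{t-1}, Q_detect]; detect queries carry no identity yet.
    ids: List[Optional[int]] = [tid for tid, _ in track_queries]
    ids += [None] * len(detect_queries)
    queries = [q for _, q in track_queries] + list(detect_queries)

    outputs = decoder(queries)  # stand-in for f_theta(Q_t, F_t)

    survivors: List[Tuple[int, Query]] = []
    for tid, (new_query, score) in zip(ids, outputs):
        if tid is not None and score >= keep_threshold:
            survivors.append((tid, new_query))      # identity carried forward
        elif tid is None and score >= birth_threshold:
            survivors.append((next_id, new_query))  # new object enters
            next_id += 1
        # Everything else is dropped: a lost track or an empty detect query.
    return survivors, next_id


def fake_decoder(queries: List[Query]) -> List[Tuple[Query, float]]:
    """Toy decoder: the first component of each query doubles as its score."""
    return [([0.9 * x for x in q], q[0]) for q in queries]


tracks = [(0, [0.9, 0.1]), (1, [0.2, 0.4])]
detect = [[0.8, 0.0], [0.1, 0.0]]
tracks, next_id = step(tracks, detect, fake_decoder, next_id=2)
print([tid for tid, _ in tracks], next_id)
```

Identity here is never predicted as a class; it is simply the label attached to a query that keeps surviving from frame to frame.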

4.4 Tracking objective

A simplified tracking loss may combine detection, localization, identity, and track-state terms:

L_MOT = L_cls + L_box + L_id + L_track_state

In some end-to-end Transformer trackers, identity is not represented by explicit identity classification. Instead, identity is preserved implicitly through track-query propagation.

5. Taxonomy of Transformer-Based Detection and Tracking Models

A clear way to explain the literature is to classify methods according to how Transformers are used: as set-prediction detectors, efficient DETR variants, query-propagation trackers, memory-based trackers, or green-learning architectures.

Category | Representative models | Main idea | Relevance to MOT
Set-prediction detectors | DETR | Predict objects as a set using object queries and Hungarian matching. | Foundation for query-based tracking.
Efficient DETR variants | Deformable DETR, DAB-DETR, DN-DETR, DINO | Improve convergence, query design, small-object detection, and training stability. | Better detection improves tracking quality.
Query-propagation trackers | TransTrack, TrackFormer | Use previous-frame object information as current-frame tracking queries. | Connects object queries to identity-preserving track queries.
End-to-end MOT Transformers | MOTR, MOTRv2 | Propagate and update track queries across frames. | Reduces reliance on post-processing association pipelines.
Memory-based trackers | MeMOT, MeMOTR | Use memory to recover identities after occlusion or long temporal gaps. | Improves robustness in crowded scenes and long videos.
Green-learning trackers | MOTT | Remove redundant Transformer components and keep effective modules. | Targets real-time efficiency and reduced computation.

General evolution

DETR → Deformable DETR → TransTrack → TrackFormer → MOTR → MOTRv2 / Memory MOT

6. Important Transformer MOT Models

Transformer MOT evolved from adapting DETR-style object queries to building identity-aware track queries, memory modules, and detector-guided tracking systems.

TransTrack

TransTrack extends DETR by using object features from the previous frame as track queries for the current frame. It also uses detection queries to discover newly appearing objects.

Tags: Query propagation, Frame-to-frame

TrackFormer

TrackFormer formulates MOT as frame-to-frame set prediction. It propagates track queries through time and removes the need for a separate graph-based association module.

Tags: Set prediction, Track queries

MOTR

MOTR is an end-to-end multi-object tracking framework that explicitly introduces track queries and updates them over time to preserve identities.

Tags: End-to-end MOT, Temporal queries

MOTRv2

MOTRv2 addresses MOTR's weak detection quality by bootstrapping the tracker with a pretrained object detector, whose detections guide the tracking queries.

Tags: Detection quality, Bootstrap

MeMOT

MeMOT introduces memory into Transformer tracking so that identities can be linked over longer temporal gaps.

Tags: Memory, Long-term association

MeMOTR

MeMOTR extends memory-based MOT using long-term memory injection and association-aware temporal modeling.

Tags: Memory injection, Occlusion recovery

The key conceptual transition is from object queries, which localize objects in one image, to track queries, which preserve object identity across frames.

7. MOTT Case Study

The paper “MOTT: A new model for multi-object tracking based on green learning paradigm” is an efficiency-focused contribution to Transformer MOT. It argues that many MOT systems have become complicated by accumulating neural modules, and that a more efficient design can be achieved by keeping only the Transformer components that are effective for tracking.

Paper information

Title: MOTT: A new model for multi-object tracking based on green learning paradigm
Authors: Shan Wu, Amnir Hadachi, Chaoru Lu, Damien Vivet
Journal: AI Open
Volume and pages: Volume 4, 2023, pages 145–153
DOI: 10.1016/j.aiopen.2023.09.002
Keywords: Multi-object tracking, pedestrian tracking, green learning, Transformer, end-to-end
Official implementation: GitHub, simonwu53/MOTT

Where it fits in the taxonomy

MOTT fits best under green-learning Transformer MOT. It belongs to the DETR-inspired tracking family, but its main contribution is not simply adding more modules. Instead, it studies which Transformer components are useful for MOT and removes redundant parts to improve the accuracy-runtime-computation balance.

CSWin Encoder + Deformable DETR Decoder → Lightweight MOT Transformer → Tracks

Main idea

MOTT proposes a pruned Transformer-based MOT architecture. Instead of increasing tracking performance by accumulating more neural modules, it keeps effective Transformer components and removes unnecessary ones. The paper positions this as a green learning approach because it emphasizes efficiency, runtime, and reduced computational cost.

How to position MOTT in a literature review: MOTT is an efficiency-oriented Transformer MOT model that bridges DETR-style tracking architectures with green-learning principles.

Reported architectural interpretation

Dimension | MOTT's position | Interpretation
Detection foundation | DETR-family Transformer detection | Uses the object-query and Transformer-decoder tradition established by DETR-style models.
Tracking paradigm | End-to-end Transformer MOT | Designed as a unified tracking model rather than a heavily modular tracking-by-detection pipeline.
Efficiency strategy | Pruning and removing redundant modules | Focuses on keeping effective Transformer components for tracking while reducing computation.
Green learning | Accuracy-runtime-computation balance | Evaluates tracking as a deployment-oriented problem, not only a benchmark-score problem.
Reported efficiency | Up to 62% FLOPs saving and nearly twice as fast compared with other Transformer-based models | Suggests that careful architectural simplification can preserve competitive performance while improving speed.

Why MOTT is useful

  • It adds an efficiency viewpoint: many MOT papers focus mainly on accuracy, while MOTT emphasizes runtime and FLOPs.
  • It connects MOT to green AI: model design is evaluated through computational sustainability.
  • It questions architectural accumulation: adding more modules is not always the best path to better MOT.
  • It complements MOTR and TrackFormer: it belongs to the Transformer-MOT family but focuses on lightweight design.

Critical interpretation

MOTT is valuable because it highlights a practical weakness of many Transformer MOT systems: they can become too computationally heavy for online scenarios. However, the broader research question remains open: how far can pruning and lightweight Transformer design go before association robustness, small-object tracking, or occlusion recovery begins to degrade?

8. Datasets and Evaluation Metrics

Transformer MOT methods must be evaluated not only by tracking accuracy, but also by association quality, speed, memory, and computational cost.

Important datasets

Dataset | Scenario | Why it matters
MOT17 | Pedestrian tracking | Classic benchmark for comparing multi-object tracking methods.
MOT20 | Crowded pedestrian scenes | Tests robustness under dense crowds and occlusion.
DanceTrack | Human tracking with similar appearance and complex motion | Reduces the usefulness of simple appearance matching and stresses motion and association reasoning.
BDD100K | Autonomous driving | Tests tracking in diverse road scenes with vehicles, pedestrians, and environmental variation.
nuScenes | 3D autonomous-driving perception | Useful for 3D multi-object tracking and sensor-fusion research.

Important metrics

Metric | Meaning | Strength | Limitation
MOTA | Multi-object tracking accuracy. | Classic metric combining false positives, false negatives, and identity switches. | Can overemphasize detection errors and underrepresent association quality.
IDF1 | Identity F1 score. | Measures identity preservation and association consistency. | Less focused on localization quality.
HOTA | Higher Order Tracking Accuracy. | Balances detection, association, and localization. | More complex to interpret than single-error metrics.
DetA | Detection accuracy component of HOTA. | Separates detection quality from association quality. | Must be interpreted together with AssA.
AssA | Association accuracy component of HOTA. | Directly measures temporal association quality. | Does not alone describe detector performance.
FPS / latency | Runtime speed. | Important for online and real-time systems. | Hardware-dependent.
FLOPs | Floating-point operation count. | Useful for estimating computational complexity. | Does not always predict real hardware latency.

For Transformer MOT, it is not enough to report only MOTA or HOTA. A fair evaluation should also include FPS, FLOPs, memory, model size, hardware, and whether the method runs online or offline.
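
The headline metrics reduce to short formulas once the error counts are known. The sketch below assumes the counts have already been produced by an evaluation toolkit; the matching that yields them is the hard part and is not shown. All numbers are invented for illustration. Note that HOTA is normally averaged over a range of localization thresholds, so `hota_at_threshold` covers a single threshold only.

```python
import math


def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / total ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def hota_at_threshold(det_a: float, ass_a: float) -> float:
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


def fps(num_frames: int, elapsed_seconds: float) -> float:
    """Throughput in frames per second; depends on the hardware used."""
    return num_frames / elapsed_seconds


print(round(mota(100, 50, 10, 1000), 3))
print(round(idf1(800, 100, 200), 3))
print(round(hota_at_threshold(0.64, 0.49), 3))
print(round(fps(600, 24.0), 1))
```

Because MOTA adds identity switches to the much larger detection error counts, a tracker can score well on MOTA while associating poorly, which is why IDF1 and AssA are reported alongside it.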

9. Applications

Transformer-based multi-object detection and tracking models are useful wherever visual systems must detect multiple objects and preserve their identities over time.

Application area | Tracking requirement | Why Transformers are useful
Autonomous driving | Track vehicles, pedestrians, cyclists, and obstacles. | Attention can model interactions among road users and scene context.
Traffic monitoring | Count vehicles, estimate flow, and detect congestion. | Transformer trackers can integrate detection and association across frames.
Surveillance | Maintain identities through crowded scenes and occlusion. | Track queries and memory can help preserve identity.
Robotics | Track humans, objects, and moving agents in real time. | Query-based models can support object-centric scene understanding.
Sports analytics | Track players and ball trajectories. | Attention helps model interactions and coordinated motion.
Smart cities | Monitor pedestrian and vehicle movement at scale. | Efficient MOT models such as MOTT are relevant for low-cost deployment.

10. Challenges and Research Gaps

Although Transformer MOT is powerful, good results are not automatic. Successful tracking depends on detection quality, temporal association, memory design, computational efficiency, and evaluation methodology.

Detection quality bottleneck

If the detector misses an object, the tracker cannot reliably preserve its identity. This is why improved DETR variants and detector-bootstrapped trackers remain important.

Occlusion and reappearance

Track queries can lose identity when objects disappear behind obstacles or reappear after long gaps. Memory-based Transformers attempt to address this limitation.
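
The idea of recovering an identity from memory can be illustrated with a toy memory bank. This is a simplified stand-in and not the mechanism of MeMOT or MeMOTR: it stores one smoothed embedding per track and re-identifies a reappearing object by cosine similarity. The class name, the momentum, the 30-frame gap, and the 0.8 similarity threshold are all assumptions for the example.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class TrackMemory:
    """Keeps a smoothed embedding and last-seen frame for each identity."""

    def __init__(self, momentum: float = 0.9, max_gap: int = 30,
                 min_similarity: float = 0.8) -> None:
        self.momentum = momentum
        self.max_gap = max_gap
        self.min_similarity = min_similarity
        self.bank: Dict[int, Tuple[List[float], int]] = {}

    def update(self, track_id: int, embedding: List[float], frame: int) -> None:
        """Blend the new embedding into the stored one (exponential moving average)."""
        if track_id in self.bank:
            old, _ = self.bank[track_id]
            m = self.momentum
            embedding = [m * o + (1 - m) * e for o, e in zip(old, embedding)]
        self.bank[track_id] = (list(embedding), frame)

    def recover(self, embedding: List[float], frame: int) -> Optional[int]:
        """Return the stored identity most similar to a reappearing object, if any."""
        best_id, best_sim = None, self.min_similarity
        for tid, (stored, last_seen) in self.bank.items():
            if frame - last_seen > self.max_gap:
                continue  # too old to trust
            sim = cosine(stored, embedding)
            if sim >= best_sim:
                best_id, best_sim = tid, sim
        return best_id


memory = TrackMemory()
memory.update(7, [1.0, 0.0], frame=10)
memory.update(8, [0.0, 1.0], frame=10)
print(memory.recover([0.9, 0.1], frame=25))   # close to track 7, within the gap
print(memory.recover([0.9, 0.1], frame=100))  # same appearance, but the gap is too long
```

The `max_gap` and `min_similarity` settings expose the trade-off raised in the open questions: a longer memory recovers more identities but also raises the risk of wrong re-identification.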

Identity switches

Similar-looking objects in crowded scenes can cause track identities to swap, especially when motion patterns overlap.

Small-object tracking

Small or distant objects are difficult for DETR-style models, especially when feature resolution is low.

Computational cost

Attention over high-resolution video features can be expensive. This motivates Deformable DETR, sparse attention, memory compression, and green-learning designs.
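
A back-of-the-envelope count shows why resolution matters so much. The sketch below is an order-of-magnitude estimate only: it counts multiply-accumulates for the score and aggregation steps of a single attention layer and ignores projections, multiple heads, and multi-scale features. The sparse variant, in which each query attends to a fixed number of locations, is written in the spirit of deformable attention and is not a description of its exact cost. The image size, strides, and 32 sampled locations are assumptions.

```python
def num_tokens(height: int, width: int, stride: int) -> int:
    """Number of feature-map positions when the image is downsampled by `stride`."""
    return (height // stride) * (width // stride)


def dense_attention_macs(n: int, d: int) -> int:
    """Roughly n*n*d for Q K^T plus n*n*d for the weighted sum of V."""
    return 2 * n * n * d


def sparse_attention_macs(n: int, k: int, d: int) -> int:
    """Same two steps when each of the n queries attends to only k locations."""
    return 2 * n * k * d


height, width, dim = 736, 1280, 256
coarse = num_tokens(height, width, stride=32)
fine = num_tokens(height, width, stride=8)

print(coarse, fine)
print(dense_attention_macs(fine, dim) // dense_attention_macs(coarse, dim))
print(dense_attention_macs(fine, dim) // sparse_attention_macs(fine, 32, dim))
```

Moving from stride 32 to stride 8 multiplies the token count by 16 and the dense attention cost by 256, whereas the sparse count grows only linearly with the number of tokens.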

Benchmark overfitting

A method may perform well on MOT17 but fail in autonomous driving, dance, sports, or heavily crowded scenes.

Important open questions

  1. How should object queries become stable track queries across long video sequences?
  2. How much memory is necessary for robust recovery after occlusion?
  3. Can green-learning MOT match larger Transformer trackers without losing robustness?
  4. What is the best balance between detection quality and association quality?
  5. Are Transformer MOT models truly end-to-end if they still rely on detector pretraining, thresholds, or heuristic track management?
  6. How should MOT models be evaluated for deployment using HOTA, IDF1, FPS, FLOPs, memory, and energy cost?

Possible thesis direction: Compare query-propagation, memory-based, and green-learning Transformer MOT models under a unified evaluation protocol that includes HOTA, IDF1, FPS, FLOPs, and energy cost.

11. Conclusion

Transformer-based multi-object detection and tracking begins with DETR's reformulation of object detection as set prediction. DETR introduced object queries, Hungarian matching, and an end-to-end detection framework that naturally inspired tracking extensions.

In MOT, object queries evolve into track queries. Methods such as TransTrack, TrackFormer, and MOTR use query propagation to preserve object identity across frames. Later work such as MOTRv2, MeMOT, and MeMOTR shows that strong detection and memory are still crucial for robust tracking.

MOTT adds an important efficiency-oriented perspective. Instead of increasing architectural complexity, it asks which Transformer modules are actually necessary for multi-object tracking. This makes it especially relevant for real-time and green-AI deployment.

The strongest explanation structure is therefore: DETR foundation → object queries → track queries → temporal association → memory → efficiency and green learning.

References and Key Papers

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
  2. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. ECCV 2020. https://arxiv.org/abs/2005.12872
  3. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable DETR: Deformable Transformers for End-to-End Object Detection. ICLR 2021. https://arxiv.org/abs/2010.04159
  4. Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., & Zhang, L. (2022). DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. ICLR 2022. https://arxiv.org/abs/2201.12329
  5. Li, F., Zhang, H., Liu, S., Guo, J., Ni, L. M., & Zhang, L. (2022). DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. CVPR 2022. https://arxiv.org/abs/2203.01305
  6. Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., & Shum, H.-Y. (2022). DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. https://arxiv.org/abs/2203.03605
  7. Sun, P., Jiang, Y., Zhang, R., Xie, E., Cao, J., Hu, X., Kong, T., Yuan, Z., Wang, C., & Luo, P. (2020). TransTrack: Multiple Object Tracking with Transformer. https://arxiv.org/abs/2012.15460
  8. Meinhardt, T., Kirillov, A., Leal-Taixé, L., & Feichtenhofer, C. (2022). TrackFormer: Multi-Object Tracking with Transformers. CVPR 2022. https://arxiv.org/abs/2101.02702
  9. Zeng, F., Dong, B., Zhang, T., Wang, C., Zhang, X., & Wei, Y. (2022). MOTR: End-to-End Multiple-Object Tracking with Transformer. ECCV 2022. https://arxiv.org/abs/2105.03247
  10. Zhang, Y., Wang, T., & Zhang, X. (2023). MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors. https://arxiv.org/abs/2211.09791
  11. Cai, J., Xu, M., Li, W., Xiong, Y., & Xia, W. (2022). MeMOT: Multi-Object Tracking with Memory. https://arxiv.org/abs/2203.16761
  12. Gao, R., Wang, L., Wang, B., & Guo, X. (2023). MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking. https://arxiv.org/abs/2307.15700
  13. Wu, S., Hadachi, A., Lu, C., & Vivet, D. (2023). MOTT: A new model for multi-object tracking based on green learning paradigm. AI Open, 4, 145–153. DOI: 10.1016/j.aiopen.2023.09.002. Official implementation: https://github.com/simonwu53/MOTT
  14. Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., Liu, W., & Wang, X. (2022). ByteTrack: Multi-Object Tracking by Associating Every Detection Box. ECCV 2022. https://arxiv.org/abs/2110.06864
  15. Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., & Leibe, B. (2021). HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. IJCV. https://arxiv.org/abs/2009.07736
  16. Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., & Luo, P. (2022). DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion. CVPR 2022. https://arxiv.org/abs/2111.14690
  17. Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., & Darrell, T. (2020). BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. CVPR 2020. https://arxiv.org/abs/1805.04687
  18. Bernardin, K., & Stiefelhagen, R. (2008). Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing.