Transformers for Multi-Object Detection and Tracking

1. Motivation: Why Transformers Matter for Detection and Tracking

Multi-object detection and tracking are central tasks in computer vision. Object detection answers the question: what objects are present and where are they? Multi-object tracking adds a temporal question: which object in the current frame corresponds to which object in previous frames?

Traditional multi-object tracking pipelines often follow the tracking-by-detection paradigm. First, an object detector finds bounding boxes in each frame. Then, a separate association module links detections across time using motion models, appearance embeddings, re-identification networks, graph optimization, or hand-designed heuristics.

Simple definition: Transformer-based multi-object tracking is a family of methods that uses attention, object queries, track queries, and temporal modeling to jointly detect objects and preserve their identities across video frames.

Why this matters

Unified modeling: Transformers can represent detection, spatial relations, and temporal association in one framework.
End-to-end training: DETR-style methods reduce the need for anchors, non-maximum suppression, and manually tuned association rules.
Identity-aware tracking: track queries can carry object identity across frames.
Long-range reasoning: attention can model interactions among objects, frames, and trajectories.
Deployment pressure: real-time MOT requires accuracy, speed, low memory, and low energy consumption.

2. Basic Concept: Detection Versus Tracking

In image-level object detection, the model receives a single image and predicts a set of object categories and bounding boxes. In video-level multi-object tracking, the model receives a sequence of frames and predicts both bounding boxes and identity labels over time.

Frame t → Detector → Boxes → Association → Tracks

Task	Input	Output	Main difficulty
Object detection	Single image	Object classes and bounding boxes	Localization, classification, scale variation, and small objects.
Multi-object tracking	Video sequence	Bounding boxes plus persistent identities.	Identity switches, occlusion, reappearance, and crowded scenes.
Tracking-by-detection	Frame detections plus association logic	Linked object trajectories.	Detector errors can propagate into tracker errors.
Joint detection and tracking	Video frames or frame features	Detections and identities from one model.	Balancing detection quality and temporal association.
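To make the tracking-by-detection row concrete, the following minimal sketch links per-frame boxes by greedy IoU matching. It is only an illustration of the "separate association module" idea: the function names, the (x1, y1, x2, y2) box format, and the 0.3 IoU threshold are assumptions made for this example, and real trackers usually add motion models, appearance cues, and a grace period before ending a track.

```python
def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, next_id, iou_thresh=0.3):
    """Greedy IoU association for one frame.

    tracks: dict mapping track id -> box from the previous frame.
    detections: list of boxes from the current frame.
    Returns (dict mapping track id -> current box, next unused id).
    """
    # Score every (track, detection) pair and visit the best overlaps first.
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    current, used_dets = {}, set()
    for score, tid, j in pairs:
        if score < iou_thresh:
            break
        if tid in current or j in used_dets:
            continue
        current[tid] = detections[j]
        used_dets.add(j)
    # Unmatched detections start new tracks; unmatched tracks simply end here.
    for j, det in enumerate(detections):
        if j not in used_dets:
            current[next_id] = det
            next_id += 1
    return current, next_id


# Two frames of toy detections (hypothetical data).
frame1 = [(0, 0, 10, 10), (50, 50, 60, 60)]
frame2 = [(52, 50, 62, 60), (1, 0, 11, 10), (100, 100, 110, 110)]
tracks, next_id = associate({}, frame1, next_id=1)
tracks, next_id = associate(tracks, frame2, next_id)
```

After the second call, the two original identities follow their shifted boxes and the third detection receives a new identity. Any detector miss in a frame would end a track in this sketch, which is one way detector errors propagate into tracker errors.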

Object queries versus track queries

Transformer-based MOT methods often reinterpret tracking as query propagation. A query that detects an object in one frame can be propagated into the next frame as a track query. If the query continues to attend to the same target, the model preserves object identity.

Object query → Detection → Track query → Next frame

The key intuition is that tracking is not only about finding objects; it is about maintaining object-centered representations across time.

3. DETR as the Foundation

The most important foundation for Transformer-based object detection is DETR: Detection Transformer. DETR reformulated object detection as a direct set prediction problem.

Instead of generating dense anchors and removing duplicate boxes with non-maximum suppression, DETR predicts a fixed-size set of object candidates. Each predicted candidate is matched to a ground-truth object using bipartite matching. Unmatched predictions are trained as “no object.”

Image → CNN Backbone → Transformer Encoder → Object Queries → Decoder → Boxes + Classes

Why DETR is attractive for MOT

Set prediction: MOT also involves predicting a set of active objects at each frame.
Object queries: queries provide a natural unit for object-level reasoning.
Global attention: objects can interact with all image regions and with each other.
End-to-end formulation: tracking can be framed as extending object queries across time.

Limitations of vanilla DETR

Slow convergence

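Vanilla DETR needs far longer training schedules than mature CNN detectors: the original paper trains for 300 to 500 epochs on COCO, whereas Deformable DETR reports better accuracy with roughly ten times fewer epochs.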

Small-object difficulty

Global attention over low-resolution features can be weak for small and crowded objects.

High computational cost

Attention and decoder computation can become expensive for high-resolution visual inputs.

Tracking extension is non-trivial

Detection alone does not solve identity consistency across frames.

4. Mathematical Formulation

Transformer-based detection and tracking can be explained through four connected ideas: attention, set prediction, query propagation, and tracking loss design.

4.1 Scaled dot-product attention

The basic Transformer attention operation maps queries, keys, and values into context-aware representations.

Attention(Q, K, V) = softmax((QKᵀ) / √dₖ) V

Symbol	Meaning
Q	Queries: what the model is looking for.
K	Keys: representations used for attention matching.
V	Values: information retrieved after attention weights are computed.
dₖ	Key dimension. Dividing by √dₖ keeps dot products from growing with the dimension, which would otherwise saturate the softmax and shrink its gradients.
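The attention formula can be traced by hand on a tiny example. The sketch below implements it on plain Python lists so that every step is visible; the function names and the toy Q, K, V values are assumptions made for illustration, and a real model would use batched tensor operations and multiple heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on nested lists.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (output, weights), where weights is n_q x n_k.
    """
    scale = math.sqrt(len(K[0]))  # sqrt(d_k)
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return output, weights


# Toy example: two object queries attending over three image tokens.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to one, and the first query places equal weight on the first and third tokens because both keys have the same dot product with it.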

4.2 DETR set prediction

Given an image, DETR predicts a fixed set of N candidates:

ŷᵢ = (ĉᵢ, b̂ᵢ), i = 1, ..., N

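Here, ĉᵢ is the predicted class distribution and b̂ᵢ is the predicted bounding box. The ground-truth set is padded with “no object” entries to size N, so that every ground-truth entry yᵢ is assigned to exactly one prediction. The Hungarian algorithm finds the assignment, a permutation σ of the N predictions, with the lowest total matching cost:

σ* = arg min_{σ ∈ S_N} Σᵢ L_match(yᵢ, ŷ_{σ(i)})

where S_N is the set of all permutations of N elements and the sum runs over i = 1, ..., N.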
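The matching step can be illustrated with a brute-force search over permutations, which returns the same optimum as the Hungarian algorithm but is only practical for very small N. The cost used here (negative class probability plus L1 box distance) is a simplified stand-in for L_match, and all names and toy values are assumptions made for this sketch.

```python
from itertools import permutations

NO_OBJECT = None  # padding entry that stands in for the "no object" class


def match_cost(gt, pred):
    """Simplified matching cost: -p(class) + L1 box distance, 0 for padding."""
    if gt is NO_OBJECT:
        return 0.0
    cls, box = gt
    probs, pred_box = pred
    l1 = sum(abs(a - b) for a, b in zip(box, pred_box))
    return -probs[cls] + l1


def best_assignment(gts, preds):
    """Exhaustive search over permutations sigma; feasible only for tiny N."""
    n = len(preds)
    padded = list(gts) + [NO_OBJECT] * (n - len(gts))
    best_sigma, best_cost = None, float("inf")
    for sigma in permutations(range(n)):
        cost = sum(match_cost(padded[i], preds[sigma[i]]) for i in range(n))
        if cost < best_cost:
            best_sigma, best_cost = sigma, cost
    return best_sigma, best_cost


# Two ground-truth objects as (class index, (cx, cy, w, h)); N = 3 predictions
# as (class probabilities, box). All numbers are made up for the example.
gts = [(0, (0.2, 0.2, 0.1, 0.1)), (1, (0.7, 0.7, 0.2, 0.2))]
preds = [
    ([0.10, 0.80], (0.68, 0.72, 0.2, 0.2)),
    ([0.05, 0.05], (0.50, 0.50, 0.3, 0.3)),
    ([0.90, 0.05], (0.21, 0.19, 0.1, 0.1)),
]
sigma, cost = best_assignment(gts, preds)
```

In this toy case `sigma` maps the first ground-truth object to the third prediction, the second object to the first prediction, and the padding entry to the remaining prediction, which would therefore be trained as “no object.”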

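The final detection loss is evaluated under the optimal assignment σ*. Every matched pair contributes a classification term, while the box terms apply only when the ground-truth entry is a real object:

L_DETR = Σᵢ [ λ_cls L_cls(cᵢ, ĉ_{σ*(i)}) + 1[cᵢ ≠ ∅] ( λ_L1 ||bᵢ − b̂_{σ*(i)}||₁ + λ_giou L_giou(bᵢ, b̂_{σ*(i)}) ) ]

Here yᵢ = (cᵢ, bᵢ) is the i-th ground-truth entry and ∅ denotes the “no object” class.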
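The two box-regression terms can be computed directly for a single matched pair. The sketch below assumes corner-format boxes (x1, y1, x2, y2) with positive area for both terms, which is a simplification chosen for readability, and uses weights of 5 and 2 as example values for λ_L1 and λ_giou. The function names are hypothetical.

```python
def iou_and_giou(a, b):
    """IoU and generalized IoU for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Area of the smallest axis-aligned box that encloses both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (hull - union) / hull
    return iou, giou


def box_loss(gt, pred, w_l1=5.0, w_giou=2.0):
    """Weighted L1 plus GIoU loss for one matched (ground truth, prediction) pair."""
    l1 = sum(abs(g - p) for g, p in zip(gt, pred))
    _, giou = iou_and_giou(gt, pred)
    return w_l1 * l1 + w_giou * (1.0 - giou)


perfect = box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
_, disjoint_giou = iou_and_giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

A perfect prediction gives zero loss. For the two disjoint boxes the IoU is zero but the GIoU is negative, so the GIoU term still provides a training signal that depends on how far apart the boxes are, which plain IoU cannot do.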

4.3 Query-based tracking

In Transformer MOT, the query set at frame t often contains two types of queries: propagated track queries for existing objects and detection queries for new objects.

Q_t = [ Q_track^{t−1}, Q_detect ]

Existing track queries are updated using current-frame features:

Q_track^t = f_θ(Q_track^{t−1}, F_t)

The output may include class, box, identity state, and confidence:

ŷᵢᵗ = (ĉᵢᵗ, b̂ᵢᵗ, idᵢᵗ, sᵢᵗ)
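The bookkeeping behind these equations can be sketched without any neural network. In the example below, `decoder` is a placeholder callable standing in for f_θ, and the `Query` class, the `step` function, and both score thresholds are hypothetical choices made for illustration. Published trackers typically use more careful rules for starting and ending tracks.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Query:
    embedding: List[float]          # placeholder for a learned vector
    track_id: Optional[int] = None  # None: detection query without identity


def step(track_queries, detect_queries, decoder, next_id,
         birth_thresh=0.7, keep_thresh=0.4):
    """One frame of query-based tracking.

    decoder maps a list of queries to a list of (updated_embedding, score)
    pairs; it stands in for f_theta applied to current-frame features F_t.
    """
    queries = track_queries + detect_queries  # Q_t = [Q_track^{t-1}, Q_detect]
    outputs = decoder(queries)
    next_tracks = []
    for q, (emb, score) in zip(queries, outputs):
        if q.track_id is not None and score >= keep_thresh:
            next_tracks.append(Query(emb, q.track_id))     # identity carried over
        elif q.track_id is None and score >= birth_thresh:
            next_tracks.append(Query(emb, next(next_id)))  # new track is born
        # Anything else is dropped: an ended track or an empty detection slot.
    return next_tracks


def scripted_decoder(scores):
    """Toy decoder that keeps embeddings and returns preset confidence scores."""
    return lambda qs: [(q.embedding, s) for q, s in zip(qs, scores)]


ids = count(1)
detect = [Query([0.0]), Query([1.0])]
tracks = step([], detect, scripted_decoder([0.9, 0.2]), ids)
tracks = step(tracks, detect, scripted_decoder([0.8, 0.1, 0.95]), ids)
tracks = step(tracks, detect, scripted_decoder([0.3, 0.9, 0.1, 0.1]), ids)
remaining_ids = [q.track_id for q in tracks]
```

Over the three toy frames, track 1 is born, track 2 joins it, and track 1 is then dropped when its score falls below the keep threshold, leaving only track 2. No explicit matching between frames is needed because each identity travels with its query.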

4.4 Tracking objective

A simplified tracking loss may combine detection, localization, identity, and track-state terms:

L_MOT = L_cls + L_box + L_id + L_track_state

In some end-to-end Transformer trackers, identity is not represented by explicit identity classification. Instead, identity is preserved implicitly through track-query propagation.

5. Taxonomy of Transformer-Based Detection and Tracking Models

A clear way to explain the literature is to classify methods according to how Transformers are used: as set-prediction detectors, efficient DETR variants, query-propagation trackers, end-to-end MOT Transformers, memory-based trackers, or green-learning architectures.

Category	Representative models	Main idea	Relevance to MOT
Set-prediction detectors	DETR	Predict objects as a set using object queries and Hungarian matching.	Foundation for query-based tracking.
Efficient DETR variants	Deformable DETR, DAB-DETR, DN-DETR, DINO	Improve convergence, query design, small-object detection, and training stability.	Better detection improves tracking quality.
Query-propagation trackers	TransTrack, TrackFormer	Use previous-frame object information as current-frame tracking queries.	Connects object queries to identity-preserving track queries.
End-to-end MOT Transformers	MOTR, MOTRv2	Propagate and update track queries across frames.	Reduces reliance on post-processing association pipelines.
Memory-based trackers	MeMOT, MeMOTR	Use memory to recover identities after occlusion or long temporal gaps.	Improves robustness in crowded scenes and long videos.
Green-learning trackers	MOTT	Remove redundant Transformer components and keep effective modules.	Targets real-time efficiency and reduced computation.

General evolution

DETR → Deformable DETR → TransTrack → TrackFormer → MOTR → MOTRv2 / Memory MOT

6. Important Transformer MOT Models

Transformer MOT evolved from adapting DETR-style object queries to building identity-aware track queries, memory modules, and detector-guided tracking systems.

TransTrack

TransTrack extends DETR by using object features from the previous frame as track queries for the current frame. It also uses detection queries to discover newly appearing objects.

Keywords: query propagation, frame-to-frame

TrackFormer

TrackFormer formulates MOT as frame-to-frame set prediction. It propagates track queries through time and removes the need for a separate graph-based association module.

Keywords: set prediction, track queries

MOTR

MOTR is an end-to-end multi-object tracking framework that explicitly introduces track queries and updates them over time to preserve identities.

Keywords: end-to-end MOT, temporal queries

MOTRv2

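MOTRv2 addresses MOTR's weak detection quality by bootstrapping it with a pretrained object detector. Proposals from the detector (YOLOX in the paper) serve as anchors for the queries, which gives the Transformer a strong detection prior and lets it concentrate on association.

Keywords: detection quality, bootstrapping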

MeMOT

MeMOT introduces memory into Transformer tracking so that identities can be linked over longer temporal gaps.

Keywords: memory, long-term association

MeMOTR

MeMOTR extends memory-based MOT using long-term memory injection and association-aware temporal modeling.

Keywords: memory injection, occlusion recovery
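The intuition behind memory-based trackers can be illustrated with a very simple per-track memory that is updated only while the object is visible. This is not the mechanism of MeMOT or MeMOTR, which use learned attention over stored states. The exponential moving average, the momentum value, and the function names below are assumptions chosen to show why a slowly changing memory helps after occlusion.

```python
import math


def update_memory(memory, observation, visible, momentum=0.9):
    """Blend the new embedding into memory only while the object is visible."""
    if not visible:
        return list(memory)  # freeze the memory during occlusion
    return [momentum * m + (1.0 - momentum) * o
            for m, o in zip(memory, observation)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


memory = [1.0, 0.0]
memory = update_memory(memory, [0.9, 0.1], visible=True)
# During occlusion the per-frame feature is unreliable, so it is ignored.
memory = update_memory(memory, [0.0, 1.0], visible=False)
# On reappearance, the stored memory is still close to the true appearance.
similarity = cosine(memory, [1.0, 0.05])
```

Because the occluded frame does not overwrite the stored state, the reappearing object can still be compared against a representation that resembles it.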

The key conceptual transition is from object queries, which localize objects in one image, to track queries, which preserve object identity across frames.

7. MOTT Case Study

The paper “MOTT: A new model for multi-object tracking based on green learning paradigm” is an efficiency-focused contribution to Transformer MOT. It argues that many MOT systems grow complicated by stacking multiple neural modules, and that a more efficient design can be achieved by keeping only the Transformer components that are effective for tracking.

Paper information

Title	MOTT: A new model for multi-object tracking based on green learning paradigm
Authors	Shan Wu, Amnir Hadachi, Chaoru Lu, Damien Vivet
Journal	AI Open
Volume and pages	Volume 4, 2023, pages 145–153
DOI	10.1016/j.aiopen.2023.09.002
Keywords	Multi-object tracking, pedestrian tracking, green learning, Transformer, end-to-end
Official implementation	GitHub: simonwu53/MOTT

Where it fits in the taxonomy

MOTT fits best under green-learning Transformer MOT. It belongs to the DETR-inspired tracking family, but its main contribution is not simply adding more modules. Instead, it studies which Transformer components are useful for MOT and removes redundant parts to improve the accuracy-runtime-computation balance.

CSWin Encoder + Deformable DETR Decoder → Lightweight MOT Transformer → Tracks

Main idea

MOTT proposes a pruned Transformer-based MOT architecture. Instead of increasing tracking performance by accumulating more neural modules, it keeps effective Transformer components and removes unnecessary ones. The paper positions this as a green learning approach because it emphasizes efficiency, runtime, and reduced computational cost.

How to position MOTT in a literature review: MOTT is an efficiency-oriented Transformer MOT model that bridges DETR-style tracking architectures with green-learning principles.

Reported architectural interpretation

Dimension	MOTT's position	Interpretation
Detection foundation	DETR-family Transformer detection	Uses the object-query and Transformer-decoder tradition established by DETR-style models.
Tracking paradigm	End-to-end Transformer MOT	Designed as a unified tracking model rather than a heavily modular tracking-by-detection pipeline.
Efficiency strategy	Pruning and removing redundant modules	Focuses on keeping effective Transformer components for tracking while reducing computation.
Green learning	Accuracy-runtime-computation balance	Evaluates tracking as a deployment-oriented problem, not only a benchmark-score problem.
Reported efficiency	Up to 62% FLOPs saving and nearly twice as fast compared with other Transformer-based models	Suggests that careful architectural simplification can preserve competitive performance while improving speed.

Why MOTT is useful

It adds an efficiency viewpoint: many MOT papers focus mainly on accuracy, while MOTT emphasizes runtime and FLOPs.
It connects MOT to green AI: model design is evaluated through computational sustainability.
It questions architectural accumulation: adding more modules is not always the best path to better MOT.
It complements MOTR and TrackFormer: it belongs to the Transformer-MOT family but focuses on lightweight design.

Critical interpretation

MOTT is valuable because it highlights a practical weakness of many Transformer MOT systems: they can become too computationally heavy for online scenarios. However, the broader research question remains open: how far can pruning and lightweight Transformer design go before association robustness, small-object tracking, or occlusion recovery begins to degrade?

8. Datasets and Evaluation Metrics

Transformer MOT methods must be evaluated not only by tracking accuracy, but also by association quality, speed, memory, and computational cost.

Important datasets

Dataset	Scenario	Why it matters
MOT17	Pedestrian tracking	Classic benchmark for comparing multi-object tracking methods.
MOT20	Crowded pedestrian scenes	Tests robustness under dense crowds and occlusion.
DanceTrack	Human tracking with similar appearance and complex motion	Reduces the usefulness of simple appearance matching and stresses motion and association reasoning.
BDD100K	Autonomous driving	Tests tracking in diverse road scenes with vehicles, pedestrians, and environmental variation.
nuScenes	3D autonomous-driving perception	Useful for 3D multi-object tracking and sensor-fusion research.

Important metrics

Metric	Meaning	Strength	Limitation
MOTA	Multi-object tracking accuracy.	Classic metric combining false positives, false negatives, and identity switches.	Can overemphasize detection errors and underrepresent association quality.
IDF1	Identity F1 score.	Measures identity preservation and association consistency.	Less focused on localization quality.
HOTA	Higher Order Tracking Accuracy.	Balances detection, association, and localization.	More complex to interpret than single-error metrics.
DetA	Detection accuracy component of HOTA.	Separates detection quality from association quality.	Must be interpreted together with AssA.
AssA	Association accuracy component of HOTA.	Directly measures temporal association quality.	Does not alone describe detector performance.
FPS / latency	Runtime speed.	Important for online and real-time systems.	Hardware-dependent.
FLOPs	Floating-point operation count.	Useful for estimating computational complexity.	Does not always predict real hardware latency.

For Transformer MOT, it is not enough to report only MOTA or HOTA. A fair evaluation should also include FPS, FLOPs, memory, model size, hardware, and whether the method runs online or offline.
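The accuracy metrics in the table reduce to short formulas once the error counts are known. The sketch below uses the standard definitions MOTA = 1 − (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), together with the HOTA score at a single localization threshold, which is the geometric mean of DetA and AssA. The full HOTA score averages this value over a range of thresholds. The counts are made-up numbers for illustration, and the function names are hypothetical.

```python
import math


def mota(fn, fp, idsw, num_gt):
    """CLEAR MOT accuracy from summed error counts over all frames."""
    return 1.0 - (fn + fp + idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """Identity F1 from identity-level true/false positives and false negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def hota_at_threshold(det_a, ass_a):
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


# Hypothetical counts for a sequence with 1000 ground-truth boxes.
example_mota = mota(fn=150, fp=50, idsw=10, num_gt=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=200)
example_hota = hota_at_threshold(det_a=0.64, ass_a=0.49)
```

In this example the 10 identity switches account for less than five percent of the 210 errors counted by MOTA, which shows why MOTA mostly reflects detection quality and why it is usually reported together with IDF1 or HOTA.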

9. Applications

Transformer-based multi-object detection and tracking models are useful wherever visual systems must detect multiple objects and preserve their identities over time.

Application area	Tracking requirement	Why Transformers are useful
Autonomous driving	Track vehicles, pedestrians, cyclists, and obstacles.	Attention can model interactions among road users and scene context.
Traffic monitoring	Count vehicles, estimate flow, and detect congestion.	Transformer trackers can integrate detection and association across frames.
Surveillance	Maintain identities through crowded scenes and occlusion.	Track queries and memory can help preserve identity.
Robotics	Track humans, objects, and moving agents in real time.	Query-based models can support object-centric scene understanding.
Sports analytics	Track players and ball trajectories.	Attention helps model interactions and coordinated motion.
Smart cities	Monitor pedestrian and vehicle movement at scale.	Efficient MOT models such as MOTT are relevant for low-cost deployment.

10. Challenges and Research Gaps

Although Transformer MOT is powerful, good tracking does not follow automatically from adopting the architecture. Successful tracking depends on detection quality, temporal association, memory design, computational efficiency, and evaluation methodology.

Detection quality bottleneck

If the detector misses an object, the tracker cannot reliably preserve its identity. This is why improved DETR variants and detector-bootstrapped trackers remain important.

Occlusion and reappearance

Track queries can lose identity when objects disappear behind obstacles or reappear after long gaps. Memory-based Transformers attempt to address this limitation.

Identity switches

Similar-looking objects in crowded scenes can cause track identities to swap, especially when motion patterns overlap.

Small-object tracking

Small or distant objects are difficult for DETR-style models, especially when feature resolution is low.

Computational cost

Attention over high-resolution video features can be expensive. This motivates Deformable DETR, sparse attention, memory compression, and green-learning designs.
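A rough count shows where the cost comes from. Dense self-attention compares every token with every other token, so the number of query-key pairs grows with the square of the token count, while sparse schemes such as deformable attention let each query look at a fixed number of sampled locations. The sketch below counts pairs only, not full FLOPs. The 1080×1920 resolution, the strides, and the value of 32 samples per query are assumptions chosen for illustration.

```python
def num_tokens(height, width, stride):
    """Number of feature-map tokens at a given backbone stride (floor division)."""
    return (height // stride) * (width // stride)


def dense_attention_pairs(n_tokens):
    """Query-key pairs scored by full self-attention."""
    return n_tokens * n_tokens


def sparse_attention_pairs(n_tokens, samples_per_query):
    """Pairs scored when each query attends to a fixed number of locations."""
    return n_tokens * samples_per_query


coarse = num_tokens(1080, 1920, stride=32)   # low-resolution feature map
fine = num_tokens(1080, 1920, stride=8)      # high-resolution feature map

dense_coarse = dense_attention_pairs(coarse)
dense_fine = dense_attention_pairs(fine)
sparse_fine = sparse_attention_pairs(fine, samples_per_query=32)
```

Moving from stride 32 to stride 8 raises the token count from 1,980 to 32,400 and the dense pair count from about 3.9 million to about 1.05 billion per layer, while the sparse variant stays near 1.04 million pairs at the finer resolution. This is the basic reason high-resolution features, which help small-object tracking, are hard to combine with dense attention.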

Benchmark overfitting

A method may perform well on MOT17 but fail in autonomous driving, dance, sports, or heavily crowded scenes.

Important open questions

How should object queries become stable track queries across long video sequences?
How much memory is necessary for robust recovery after occlusion?
Can green-learning MOT match larger Transformer trackers without losing robustness?
What is the best balance between detection quality and association quality?
Are Transformer MOT models truly end-to-end if they still rely on detector pretraining, thresholds, or heuristic track management?
How should MOT models be evaluated for deployment using HOTA, IDF1, FPS, FLOPs, memory, and energy cost?

Possible thesis direction: Compare query-propagation, memory-based, and green-learning Transformer MOT models under a unified evaluation protocol that includes HOTA, IDF1, FPS, FLOPs, and energy cost.

11. Conclusion

Transformer-based multi-object detection and tracking begins with DETR's reformulation of object detection as set prediction. DETR introduced object queries, Hungarian matching, and an end-to-end detection framework that naturally inspired tracking extensions.

In MOT, object queries evolve into track queries. Methods such as TransTrack, TrackFormer, and MOTR use query propagation to preserve object identity across frames. Later work such as MOTRv2, MeMOT, and MeMOTR shows that strong detection and memory are still crucial for robust tracking.

MOTT adds an important efficiency-oriented perspective. Instead of increasing architectural complexity, it asks which Transformer modules are actually necessary for multi-object tracking. This makes it especially relevant for real-time and green-AI deployment.

The strongest explanation structure is therefore: DETR foundation → object queries → track queries → temporal association → memory → efficiency and green learning.

References and Key Papers

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. ECCV 2020. https://arxiv.org/abs/2005.12872
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable DETR: Deformable Transformers for End-to-End Object Detection. ICLR 2021. https://arxiv.org/abs/2010.04159
Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., & Zhang, L. (2022). DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. ICLR 2022. https://arxiv.org/abs/2201.12329
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L. M., & Zhang, L. (2022). DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. CVPR 2022. https://arxiv.org/abs/2203.01305
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., & Shum, H.-Y. (2023). DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. ICLR 2023. https://arxiv.org/abs/2203.03605
Sun, P., Jiang, Y., Zhang, R., Xie, E., Cao, J., Hu, X., Kong, T., Yuan, Z., Wang, C., & Luo, P. (2020). TransTrack: Multiple Object Tracking with Transformer. arXiv preprint. https://arxiv.org/abs/2012.15460
Meinhardt, T., Kirillov, A., Leal-Taixé, L., & Feichtenhofer, C. (2022). TrackFormer: Multi-Object Tracking with Transformers. CVPR 2022. https://arxiv.org/abs/2101.02702
Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., & Wei, Y. (2022). MOTR: End-to-End Multiple-Object Tracking with Transformer. ECCV 2022. https://arxiv.org/abs/2105.03247
Zhang, Y., Wang, T., & Zhang, X. (2023). MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors. CVPR 2023. https://arxiv.org/abs/2211.09791
Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., & Soatto, S. (2022). MeMOT: Multi-Object Tracking with Memory. CVPR 2022. https://arxiv.org/abs/2203.16761
Gao, R., & Wang, L. (2023). MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking. ICCV 2023. https://arxiv.org/abs/2307.15700
Wu, S., Hadachi, A., Lu, C., & Vivet, D. (2023). MOTT: A new model for multi-object tracking based on green learning paradigm. AI Open, 4, 145–153. DOI: 10.1016/j.aiopen.2023.09.002. Official implementation: https://github.com/simonwu53/MOTT
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., Liu, W., & Wang, X. (2022). ByteTrack: Multi-Object Tracking by Associating Every Detection Box. ECCV 2022. https://arxiv.org/abs/2110.06864
Luiten, J., Ošep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., & Leibe, B. (2021). HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. IJCV. https://arxiv.org/abs/2009.07736
Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., & Luo, P. (2022). DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion. CVPR 2022. https://arxiv.org/abs/2111.14690
Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., & Darrell, T. (2020). BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. CVPR 2020. https://arxiv.org/abs/1805.04687
Bernardin, K., & Stiefelhagen, R. (2008). Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing.