JEPA: Joint-Embedding Predictive Architecture — Fundamentals, Mathematics, and Literature Review

1. Motivation: Why JEPA Exists

Modern self-supervised learning often asks a model to predict a part of the input that has been hidden from it. The central question is what the model should predict. Pixel reconstruction, token prediction, and autoregressive generation can be powerful, but they may force the model to spend capacity on low-level details rather than semantic structure.

JEPA addresses this by predicting in a learned representation space. Instead of reconstructing missing pixels, future frames, or exact sensory details, a JEPA predicts the embedding of a target from the embedding of a context. The target embedding is intended to contain useful predictable structure while ignoring irrelevant or unpredictable details.

Simple definition: A Joint-Embedding Predictive Architecture is a self-supervised learning framework in which a model learns representations by predicting the latent embedding of a target observation from the latent embedding of a related context observation.

Why this matters

Semantic abstraction: the model can focus on high-level structure instead of exact reconstruction.
Efficiency: predicting embeddings can be cheaper than predicting pixels or long token sequences.
World modeling: latent prediction is naturally connected to learning how the world evolves.
Multimodal potential: the same idea can be adapted to images, video, speech, trajectories, language, and robotics.
Planning: if future latent states can be predicted, an agent can evaluate candidate actions in latent space.

2. Basic Concept of JEPA

The classical JEPA setup contains a context, a target, encoders that map them into representation space, and a predictor that maps the context representation to the target representation. The model is trained so that the predicted target embedding is close to the actual target embedding.

Context x → Encoder f_θ → Context embedding s_x → Predictor g_φ → Predicted target ŝ_y

Target y → Target encoder f_ξ → Target embedding s_y → Loss D(ŝ_y, s_y)

Element	Meaning	Role in JEPA
Context	The visible, current, or conditioning part of the data.	Provides information from which the missing or future target must be inferred.
Target	The hidden, future, masked, or related part of the data.	Provides the representation the model must predict.
Context encoder	A neural network that maps the context to a latent representation.	Learns semantic features useful for prediction.
Target encoder	A second encoder, often updated by exponential moving average.	Creates stable target embeddings for the prediction loss.
Predictor	A small network that maps context embeddings to target embeddings.	Encourages the encoder to represent predictable structure.
Latent loss	A distance between predicted and target embeddings.	Trains the system without pixel reconstruction or labels.

JEPA versus reconstruction

In a pixel-reconstruction method, the model must reproduce detailed observations. In JEPA, the model only needs to predict the representation of the hidden or future content. This distinction is important because many pixel-level details are unpredictable, noisy, or irrelevant for downstream reasoning.

The key intuition is that JEPA does not ask, “What exactly are the missing pixels?” It asks, “What abstract representation should the missing or future content have?”
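A toy illustration of this difference, assuming a hypothetical observation made of one predictable signal value and one unpredictable noise value. The `encode` function is hand-written to keep only the signal; in a real JEPA the encoder would be learned, and nothing guarantees it separates signal from noise this cleanly.

```python
import random

random.seed(0)

def make_target(signal):
    """Target observation: a predictable signal plus unpredictable noise."""
    return [signal, random.gauss(0.0, 1.0)]

def encode(obs):
    """Hypothetical encoder that keeps the signal and drops the noise."""
    return [obs[0]]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

signals = [0.1 * i for i in range(100)]
targets = [make_target(s) for s in signals]

# Best possible guess in observation space: the signal, plus the noise mean (0).
pixel_preds = [[s, 0.0] for s in signals]
pixel_loss = sum(mse(p, t) for p, t in zip(pixel_preds, targets)) / len(targets)

# In latent space the prediction only has to match the encoded target.
latent_preds = [[s] for s in signals]
latent_loss = sum(mse(p, encode(t)) for p, t in zip(latent_preds, targets)) / len(targets)

print(f"observation-space loss: {pixel_loss:.3f}, latent loss: {latent_loss:.3f}")
```

The observation-space loss stays above zero however good the predictor is, because the noise cannot be predicted. The latent loss can reach zero because the encoder never represents the noise.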

3. Mathematical Formulation

Let x be a context observation and y be a target observation. The two may be different image regions, video clips, time steps, trajectory segments, or modalities. JEPA maps both into a shared embedding space and trains a predictor to estimate the target embedding from the context embedding.

s_x = f_θ(x)
s_y = f_ξ(y)
ŝ_y = g_φ(s_x, c)
L_JEPA = D(ŝ_y, stopgrad(s_y))

Symbol	Meaning
x	Context input, such as visible image patches or earlier video frames.
y	Target input, such as masked patches, later frames, or another related view.
f_θ	Context encoder with trainable parameters θ.
f_ξ	Target encoder with parameters ξ, often updated as an EMA teacher.
g_φ	Predictor network with parameters φ.
c	Conditioning information such as mask position, time offset, action, or modality.
D	A distance function, commonly L1, L2, smooth L1, or cosine-style distance.
stopgrad	Stops gradient flow through the target branch to stabilize training.
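A minimal sketch of this objective in plain Python, assuming tiny linear maps for f_θ, f_ξ, and g_φ and a mean-squared distance for D. The matrices, the scalar conditioning value `c`, and all function names are illustrative and not taken from any JEPA codebase. Plain Python has no automatic differentiation, so the stop-gradient is marked only by a comment.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def context_encoder(x, theta):      # s_x = f_theta(x)
    return matvec(theta, x)

def target_encoder(y, xi):          # s_y = f_xi(y)
    return matvec(xi, y)

def predictor(s_x, c, phi):         # s_hat_y = g_phi(s_x, c)
    # The conditioning value c (e.g. a time offset) is appended to the input.
    return matvec(phi, s_x + [c])

def distance(a, b):                 # D: mean squared error
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def jepa_loss(x, y, c, theta, xi, phi):
    s_x = context_encoder(x, theta)
    s_y = target_encoder(y, xi)     # treated as a constant (stop-gradient)
    s_hat_y = predictor(s_x, c, phi)
    return distance(s_hat_y, s_y)

theta = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 3-dim input -> 2-dim embedding
xi = [row[:] for row in theta]               # target encoder starts as a copy
phi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # (2 + 1)-dim input -> 2-dim output

x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, -5.0]
loss = jepa_loss(x, y, c=0.5, theta=theta, xi=xi, phi=phi)
```

With these toy values the context and target differ only in the third input coordinate, which both encoders ignore, so `loss` evaluates to 0.0. A target whose embedding differs from the prediction gives a positive loss.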

EMA target encoder

Many JEPA-style systems use a target encoder updated by exponential moving average of the context encoder. This makes target features slower-moving and prevents the prediction target from changing too abruptly.

ξ ← τ ξ + (1 − τ) θ where 0 < τ < 1
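The update rule can be sketched as follows, with parameters stored as flat lists of floats. The function name `ema_update` is illustrative. The value τ = 0.5 is chosen only to make the numbers easy to follow; values close to 1 are common in practice.

```python
def ema_update(xi, theta, tau):
    """Return new target parameters: tau * xi + (1 - tau) * theta."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    return [tau * p_target + (1.0 - tau) * p_context
            for p_target, p_context in zip(xi, theta)]

theta = [1.0, 1.0]      # context-encoder parameters (held fixed for the demo)
xi = [0.0, 0.0]         # target-encoder parameters
for _ in range(3):
    xi = ema_update(xi, theta, tau=0.5)
```

After three updates each target parameter has moved from 0.0 to 0.875, that is, a fraction 1 − τ³ of the way toward the context value. A larger τ makes the target encoder trail the context encoder more slowly.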

Energy-based interpretation

JEPA can also be viewed as an energy-based model. Compatible context-target pairs should have low energy, while incompatible pairs should have higher energy. In non-contrastive JEPA, the main challenge is to avoid a trivial flat energy landscape where all inputs map to the same representation.

E(x, y, c) = D(g_φ(f_θ(x), c), f_ξ(y))

Training lowers E for compatible pairs.

Collapse problem: If every input maps to the same embedding, the prediction loss can become small but the representation is useless. JEPA therefore needs architectural asymmetry, target encoders, masking design, variance constraints, contrastive terms, or other regularization mechanisms.
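One simple way to watch for collapse is to track how much each embedding dimension varies across a batch. The sketch below is a diagnostic only, not a prevention mechanism; the two toy encoders, the threshold, and the function names are assumptions made for illustration.

```python
import statistics

def per_dimension_std(embeddings):
    """Standard deviation of each embedding dimension across a batch."""
    return [statistics.pstdev(dim) for dim in zip(*embeddings)]

def looks_collapsed(embeddings, threshold=1e-6):
    """True if every embedding dimension is (nearly) constant over the batch."""
    return all(s < threshold for s in per_dimension_std(embeddings))

def collapsed_encoder(x):
    return [0.5, -0.5]                      # ignores its input entirely

def healthy_encoder(x):
    return [x[0] + x[1], x[0] - x[1]]       # output depends on the input

batch = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
collapsed = [collapsed_encoder(x) for x in batch]
healthy = [healthy_encoder(x) for x in batch]

print(looks_collapsed(collapsed), looks_collapsed(healthy))
```

With the collapsed encoder, a predictor that always outputs the same constant vector would achieve zero prediction loss, which is exactly the trivial solution the regularization mechanisms are meant to rule out.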

4. Core Architecture

JEPA is not one fixed neural network. It is a training architecture that can be instantiated using Vision Transformers, video transformers, recurrent models, state-space models, trajectory encoders, multimodal encoders, or policy-conditioned predictors.

Context branch

Receives visible or available information. In I-JEPA, this may be a spatially distributed context block from an image. In V-JEPA, it may be visible tokens from a video.

Keywords: input representation, encoder.

Target branch

Receives the hidden or future target only during training. It produces the representation that the predictor must match.

Keywords: EMA teacher, stop-gradient.

Predictor

Maps context embeddings to target embeddings. It may receive positional mask tokens, time offsets, or action information.

Keywords: latent prediction, conditioning.

Loss

Compares predicted and actual target embeddings. The loss is applied in representation space, not raw input space.

Keywords: L1 / L2 loss, embedding space.

Important design choices

Design choice	Why it matters	Typical JEPA answer
What is masked?	Determines what the model must infer.	Large semantic image blocks or spatiotemporal video tubes / blocks.
What is predicted?	Determines whether the model learns low-level or high-level structure.	Target embeddings rather than raw pixels.
How is collapse prevented?	Non-contrastive objectives can have trivial solutions.	EMA target encoder, stop-gradient, predictor asymmetry, variance/covariance regularization, or contrastive terms.
How is the target encoded?	The quality of the target representation controls the learning signal.	Momentum teacher, contextualized target encoder, or discriminative target construction.
How is evaluation done?	Good latent features may not always be linearly separable.	Linear probing, attentive probing, fine-tuning, segmentation, detection, motion tasks, video QA, and planning.

5. JEPA Taxonomy

JEPA methods can be classified according to the modality, prediction target, temporal structure, and collapse-prevention strategy. This taxonomy helps connect I-JEPA, V-JEPA, C-JEPA, DMT-JEPA, and newer multimodal or reinforcement-learning variants.

5.1 Image JEPA

Image → Visible context block → Predict masked target embeddings

Image JEPA predicts representations of masked image regions from visible context regions. I-JEPA is the central example. Its core claim is that predicting target embeddings can produce highly semantic image representations without relying on hand-crafted data augmentations.
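A simplified sketch of block masking on a grid of image patches. It assumes a single rectangular target block and uses every remaining patch as context; the multi-block sampling described in the I-JEPA paper is more elaborate. The grid size, block size, and function names are illustrative.

```python
import random

def sample_block(grid_h, grid_w, block_h, block_w, rng):
    """Sample a rectangular block of patch indices on a grid_h x grid_w grid."""
    top = rng.randrange(grid_h - block_h + 1)
    left = rng.randrange(grid_w - block_w + 1)
    return {(r, c)
            for r in range(top, top + block_h)
            for c in range(left, left + block_w)}

def sample_context_and_target(grid_h, grid_w, block_h, block_w, seed=0):
    """Return (context, target) sets of patch indices that never overlap."""
    rng = random.Random(seed)
    target = sample_block(grid_h, grid_w, block_h, block_w, rng)
    all_patches = {(r, c) for r in range(grid_h) for c in range(grid_w)}
    context = all_patches - target      # the context never sees target patches
    return context, target

context, target = sample_context_and_target(14, 14, block_h=5, block_w=6)
```

Only the patches in `context` would be fed to the context encoder, while the predictor is asked for the embeddings at the positions in `target`. Larger blocks make it harder to solve the task by copying nearby texture.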

5.2 Video JEPA

Video clip → Visible spatiotemporal tokens → Predict missing / future video features

Video JEPA uses latent feature prediction over space and time. V-JEPA and V-JEPA 2 show that feature prediction can learn visual representations useful for motion understanding, action anticipation, video QA, and planning-oriented world models.
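For video, one common way to build a spatiotemporal mask is to hide the same spatial block in every frame, so the hidden content cannot simply be copied from a neighbouring frame. The sketch below assumes this tube-style masking; the function name and grid sizes are illustrative, and the masking used in the V-JEPA papers has more components.

```python
def tube_mask(num_frames, grid_h, grid_w, top, left, block_h, block_w):
    """Mask the same spatial block in every frame (a spatiotemporal tube).

    Returns a nested list mask[t][r][c] that is True where a token is hidden.
    """
    return [[[top <= r < top + block_h and left <= c < left + block_w
              for c in range(grid_w)]
             for r in range(grid_h)]
            for _ in range(num_frames)]

mask = tube_mask(num_frames=4, grid_h=6, grid_w=6,
                 top=1, left=2, block_h=3, block_w=2)

# Number of hidden tokens in each frame.
masked_per_frame = [sum(cell for row in frame for cell in row) for frame in mask]
```

Every frame hides the same 3 x 2 block, so `masked_per_frame` contains the value 6 four times.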

5.3 Hierarchical JEPA

Low-level features → Mid-level states → Abstract states → Long-horizon prediction

Hierarchical JEPA extends the idea to multiple levels of abstraction. Lower levels can predict short-term local structure, while higher levels can predict slower, more abstract, longer-horizon structure. This is central to LeCun's world-model vision.

5.4 Contrastive or regularized JEPA

JEPA loss + Variance + Covariance + Invariance / contrastive constraints

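Some work strengthens JEPA with explicit regularization. C-JEPA connects JEPA with VICReg-style variance, invariance, and covariance terms to address two reported limitations of I-JEPA: the EMA target alone may not prevent collapse, and the predictor may struggle to learn the mean of patch representations accurately.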
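The variance and covariance terms can be sketched as follows, using the definitions from VICReg: a hinge on the per-dimension standard deviation, and a penalty on off-diagonal covariances. How C-JEPA weights these terms and combines them with the JEPA loss is not shown here. The values of `gamma` and `eps` and the toy batches are assumptions.

```python
import math

def variance_term(embeddings, gamma=1.0, eps=1e-4):
    """Hinge penalty on dimensions whose batch std falls below gamma."""
    n = len(embeddings)
    dims = list(zip(*embeddings))
    total = 0.0
    for dim in dims:
        mean = sum(dim) / n
        var = sum((v - mean) ** 2 for v in dim) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / len(dims)

def covariance_term(embeddings):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n = len(embeddings)
    dims = list(zip(*embeddings))
    means = [sum(dim) / n for dim in dims]
    total = 0.0
    for i in range(len(dims)):
        for j in range(len(dims)):
            if i != j:
                cov = sum((a - means[i]) * (b - means[j])
                          for a, b in zip(dims[i], dims[j])) / (n - 1)
                total += cov ** 2
    return total / len(dims)

collapsed = [[1.0, 1.0]] * 4
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]

print(variance_term(collapsed), variance_term(spread), covariance_term(spread))
```

The collapsed batch is penalised by the variance term (about 0.99 with these settings), while the spread batch has enough variance and uncorrelated dimensions, so both of its terms are zero.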

5.5 Action-conditioned JEPA

Current latent state + Action → Future latent state → Planning

Action-conditioned JEPA predicts how latent states change under candidate actions. This connects JEPA to robotics and model-based reinforcement learning because actions can be selected by comparing predicted future states to goals.
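A minimal sketch of selecting an action by comparing predicted latent states to a goal. The predictor `predict_next` is a stand-in in which the action simply shifts the state; a real system would use a learned action-conditioned network. The L1 distance and the candidate list are illustrative choices.

```python
def predict_next(z, action):
    """Hypothetical action-conditioned predictor: the action shifts the state."""
    return [zi + ai for zi, ai in zip(z, action)]

def l1_distance(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def choose_action(z, z_goal, candidate_actions):
    """Pick the candidate whose predicted next latent state is closest to the goal."""
    return min(candidate_actions,
               key=lambda a: l1_distance(predict_next(z, a), z_goal))

z_now = [0.0, 0.0]
z_goal = [1.0, -1.0]        # e.g. the embedding of a goal image
candidates = [[1.0, 0.0], [0.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
best = choose_action(z_now, z_goal, candidates)
```

Here `best` is `[1.0, -1.0]`, the only candidate whose predicted state coincides with the goal. Planning over several steps would roll the predictor forward repeatedly and score whole action sequences instead of single actions.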

6. Literature Timeline and Positioning

JEPA sits at the intersection of self-supervised learning, energy-based modeling, masked modeling, non-contrastive representation learning, and world-model learning.

Year	Work	Main contribution	How it relates to JEPA
2022	LeCun, A Path Towards Autonomous Machine Intelligence	Proposes JEPA and hierarchical JEPA as non-generative predictive world-model architectures.	Conceptual foundation.
2022	data2vec	Predicts latent contextualized representations across speech, vision, and language.	Closely related latent-prediction self-supervised framework.
2023	I-JEPA	Predicts target image-block embeddings from context image-block embeddings.	First major image-based JEPA implementation.
2024	V-JEPA	Trains video models using feature prediction without labels, text, negatives, or reconstruction.	Extends JEPA to video and motion-centric tasks.
2024	DMT-JEPA	Builds discriminative masked targets from semantically similar neighboring patches.	Addresses local semantic limitations of embedding-space masked modeling.
2024	C-JEPA	Combines JEPA with VICReg-style regularization to improve stability and reduce collapse.	Studies limitations and collapse-prevention mechanisms.
2025	V-JEPA 2	Scales video JEPA and connects it to latent action-conditioned world models for robotics.	Important step toward planning and embodied AI.
2024+	TD-JEPA, VL-JEPA, TS-JEPA, T-JEPA, M3-JEPA	Extend JEPA-like latent prediction to RL, vision-language, time series, trajectories, and multimodal alignment.	Show JEPA as a broader design pattern.

Relation to other self-supervised families

Family	Training signal	Main strength	Main limitation
Contrastive SSL	Pull positive pairs together and push negative pairs apart.	Strong representation learning and collapse resistance.	May require negative samples, large batches, or carefully designed augmentations.
Non-contrastive joint embedding	Match two views without explicit negatives.	Simpler objective and strong invariance learning.	Must avoid collapse using asymmetry or regularization.
Masked autoencoding	Reconstruct missing pixels or tokens.	Simple and scalable.	May overemphasize low-level reconstruction detail.
Autoregressive modeling	Predict the next token, patch, or observation.	Powerful generative modeling.	Can be expensive and may model surface details rather than abstract dynamics.
JEPA	Predict target embeddings from context embeddings.	Focuses on predictable semantic structure in latent space.	Target quality, masking design, and collapse prevention are difficult.

7. Main Case Studies

7.1 I-JEPA: Image-based Joint-Embedding Predictive Architecture

I-JEPA predicts the representations of masked target blocks from visible context blocks within the same image. It is designed to learn semantic image representations without reconstructing pixels and without relying on hand-crafted image augmentations.

Image → Context block → ViT encoder → Predictor → Target block embedding

Core objective: predict latent representations of target image blocks.
Important design choice: target blocks should be large enough to encourage semantic prediction.
Architecture: commonly implemented with Vision Transformers.
Evaluation: classification, object counting, depth prediction, and other downstream visual tasks.

7.2 V-JEPA: Video feature prediction

V-JEPA trains video models using feature prediction as the only pretraining objective. Unlike many video learning systems, it does not require labels, text supervision, negative examples, pretrained image encoders, or pixel reconstruction.

Video → Visible tokens → Predict masked spatiotemporal features → Frozen evaluation

Why video matters: motion, persistence, object interaction, and temporal continuity are central to world modeling.
Key claim: predicting video features can produce representations useful for both appearance and motion tasks.
Evaluation issue: frozen representations may benefit from attentive probing rather than only average pooling.
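The difference between the two pooling strategies can be sketched as follows. This is a single-query, single-head simplification without learned projections; in an actual attentive probe the query would be a trained parameter. All names and values are illustrative.

```python
import math

def average_pool(tokens):
    """Mean of the token embeddings."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def attentive_pool(tokens, query):
    """Softmax-weighted mean of tokens, scored by dot product with a query."""
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    top = max(scores)                            # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    return [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(tokens[0]))]

tokens = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]    # frozen encoder outputs
uniform = attentive_pool(tokens, query=[0.0, 0.0])
focused = attentive_pool(tokens, query=[5.0, 5.0])
```

A zero query gives every token the same weight, so `uniform` matches average pooling up to rounding. A query aligned with the third token makes `focused` almost equal to that token, which shows how a probe can pick out information that averaging would dilute.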

7.3 V-JEPA 2: Toward world models and planning

V-JEPA 2 scales video pretraining and then post-trains an action-conditioned latent model with a small amount of robot interaction data. This connects JEPA to model-based control: candidate actions can be evaluated by predicting their future latent consequences.

Web-scale video pretraining → Latent visual state + Robot action → Predicted future state → Planning

Pretraining: action-free latent video prediction.
Post-training: action-conditioned latent prediction using robot trajectories.
Planning: choose actions whose predicted future latent state matches an image goal.
Limitation: long-horizon planning and language-specified goals remain open challenges.

7.4 C-JEPA and DMT-JEPA: Critiques and refinements

C-JEPA

C-JEPA argues that the EMA strategy in I-JEPA can be insufficient to prevent collapse and combines JEPA with VICReg-style variance, invariance, and covariance regularization.

Keywords: collapse, VICReg, regularization.

DMT-JEPA

DMT-JEPA argues that embedding-space masked modeling may weaken local semantic discrimination and proposes discriminative masked targets built from neighboring patches.

Keywords: local semantics, dense tasks, masked targets.

8. Applications of JEPA

JEPA is most useful when the goal is to learn abstract, predictive representations from unlabeled data. It is especially attractive for domains where raw reconstruction is expensive, ambiguous, or unnecessarily detailed.

Application area	Why JEPA is useful	Example use
Image representation learning	Learns semantic features without labels or pixel reconstruction.	Pretraining a Vision Transformer with I-JEPA and fine-tuning for classification or dense prediction.
Video understanding	Captures motion, temporal coherence, and object dynamics in latent space.	Action recognition, motion understanding, and video retrieval.
Robotics	Predicts future latent states under candidate actions.	Image-goal planning for manipulation tasks.
Model-based reinforcement learning	Provides a latent dynamics model for planning and reward optimization.	Policy-conditioned prediction of future state embeddings.
Multimodal learning	Can align image, video, language, and other modalities in embedding space.	Vision-language JEPA predicting text embeddings rather than generating tokens.
Time series and trajectories	Predicts missing or future latent segments without hand-crafted domain targets.	Trajectory similarity, time-series forecasting, and representation learning.

9. Challenges, Limitations, and Research Gaps

JEPA is powerful but not automatic. Its success depends heavily on target construction, representation quality, masking strategy, collapse prevention, architecture, and evaluation design.

Representation collapse

If all inputs map to the same embedding, prediction becomes easy but useless. EMA, stop-gradient, predictor asymmetry, contrastive constraints, and variance regularization are used to reduce this risk.

Target quality

The model can only learn what the target representation contains. Weak, noisy, or over-smoothed target embeddings may produce weak features.

Masking strategy

Too-easy masks encourage local shortcut learning. Too-hard masks may make prediction unstable. Semantic block size and context coverage matter.

Local semantics

High-level latent prediction may discard details needed for segmentation, detection, localization, or fine-grained spatial reasoning.

Evaluation difficulty

Latent representations may be useful but not linearly separable. Probing choice can strongly affect conclusions.

Long-horizon planning

Predicting farther into the future requires handling uncertainty, compounding errors, and multiple possible futures.
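Compounding error can be illustrated with a toy rollout. The sketch assumes true latent dynamics given by a rotation and a hypothetical learned model that is correct except for a 2% scale error per step. Both are invented for the demonstration.

```python
import math

def rotate(z, angle):
    """True latent dynamics for the demo: a norm-preserving rotation."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * z[0] - s * z[1], s * z[0] + c * z[1]]

def learned_step(z, angle, scale_error=0.02):
    """Hypothetical learned model: the right rotation, but 2% too large."""
    return [(1.0 + scale_error) * v for v in rotate(z, angle)]

def rollout_errors(z0, angle, horizon):
    """Distance between true and predicted states after each rollout step."""
    z_true, z_pred, errors = z0, z0, []
    for _ in range(horizon):
        z_true = rotate(z_true, angle)
        z_pred = learned_step(z_pred, angle)    # feeds on its own prediction
        errors.append(math.dist(z_true, z_pred))
    return errors

errors = rollout_errors([1.0, 0.0], angle=0.3, horizon=20)
print(f"error after 1 step: {errors[0]:.3f}, after 20 steps: {errors[-1]:.3f}")
```

The one-step error is 0.02, but because each prediction is fed back into the model, the error after k steps grows as 1.02^k − 1 and is many times larger by step 20. The example is deterministic, so it does not capture the separate problem of multiple possible futures.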

Important open questions

What is the best theoretical explanation of why JEPA learns semantic abstractions?
How can JEPA avoid collapse without relying on fragile implementation details?
How should target embeddings be designed for both high-level semantics and local dense prediction?
Can hierarchical JEPA reliably support long-horizon planning?
How should uncertainty and multiple possible futures be represented in latent prediction?
Can JEPA scale across modalities while maintaining controllability, grounding, and interpretability?
How can JEPA-based world models be evaluated beyond standard classification or probing metrics?

10. Conclusion

JEPA is best understood as a framework for learning by predicting useful latent structure. Its central move is to replace raw reconstruction with embedding prediction. This makes it attractive for semantic representation learning, video understanding, world modeling, and robotics.

The literature shows a clear progression: LeCun's conceptual JEPA and hierarchical JEPA proposal, I-JEPA for images, V-JEPA for video, V-JEPA 2 for world-model and planning-oriented scaling, and newer variants that address collapse, dense semantics, multimodal alignment, time series, trajectories, and reinforcement learning.

A natural order for presenting JEPA is therefore the one followed in this review: motivation → latent prediction mechanism → mathematical objective → architecture → taxonomy → key papers → applications → challenges. The key research tension is between abstraction and information preservation: JEPA must ignore irrelevant details while preserving enough structure to support recognition, reasoning, and action.

References and Key Papers

LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. ICML 2022. arXiv:2202.03555.
Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023. arXiv:2301.08243.
Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. arXiv:2404.08471.
Mo, S., & Yun, S. (2024). DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture. arXiv:2405.17995.
Mo, S., & Tong, S. (2024). Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning. arXiv:2410.19560.
Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., and collaborators. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985.
Lei, H., and collaborators. (2024). M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture. arXiv:2409.05929.
Li, L., and collaborators. (2024). T-JEPA: A Joint-Embedding Predictive Architecture for Trajectory Similarity Computation. arXiv:2406.12913.
Bagatella, M., and collaborators. (2025). TD-JEPA: Latent-predictive Representations for Zero-Shot Unsupervised Reinforcement Learning. arXiv:2510.00739.
Chen, D., and collaborators. (2025). VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language Modeling. arXiv:2512.10942.