1. Motivation: Why JEPA Exists
Modern self-supervised learning often asks a model to predict something hidden from the input. The central question is: what should the model predict? Pixel reconstruction, token prediction, and autoregressive generation can be powerful, but they may force the model to spend capacity on low-level details rather than semantic structure.
JEPA addresses this by predicting in a learned representation space. Instead of reconstructing missing pixels, future frames, or exact sensory details, a JEPA predicts the embedding of a target from the embedding of a context. The target embedding is intended to contain useful predictable structure while ignoring irrelevant or unpredictable details.
Why this matters
- Semantic abstraction: the model can focus on high-level structure instead of exact reconstruction.
- Efficiency: predicting embeddings can be cheaper than predicting pixels or long token sequences.
- World modeling: latent prediction is naturally connected to learning how the world evolves.
- Multimodal potential: the same idea can be adapted to images, video, speech, trajectories, language, and robotics.
- Planning: if future latent states can be predicted, an agent can evaluate candidate actions in latent space.
2. Basic Concept of JEPA
The classical JEPA setup contains a context, a target, encoders that map them into representation space, and a predictor that maps the context representation to the target representation. The model is trained so that the predicted target embedding is close to the actual target embedding.
| Element | Meaning | Role in JEPA |
|---|---|---|
| Context | The visible, current, or conditioning part of the data. | Provides information from which the missing or future target must be inferred. |
| Target | The hidden, future, masked, or related part of the data. | Provides the representation the model must predict. |
| Context encoder | A neural network that maps the context to a latent representation. | Learns semantic features useful for prediction. |
| Target encoder | A second encoder, often updated as an exponential moving average (EMA) of the context encoder. | Creates stable target embeddings for the prediction loss. |
| Predictor | A small network that maps context embeddings to target embeddings. | Encourages the encoder to represent predictable structure. |
| Latent loss | A distance between predicted and target embeddings. | Trains the system without pixel reconstruction or labels. |
JEPA versus reconstruction
In a pixel-reconstruction method, the model must reproduce detailed observations. In JEPA, the model only needs to predict the representation of the hidden or future content. This distinction is important because many pixel-level details are unpredictable, noisy, or irrelevant for downstream reasoning.
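The difference between the two loss spaces can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not a JEPA implementation: `pool_encoder` is a hand-written block-averaging function standing in for a learned encoder, and the 8x8 "scene" and noise level are arbitrary. Two noisy observations of the same scene differ noticeably in pixel space, while their pooled representations are much closer because the per-pixel noise averages out.

```python
import numpy as np

def pool_encoder(image, block=4):
    """Hypothetical stand-in encoder: average each non-overlapping block."""
    h, w = image.shape
    return image.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

rng = np.random.default_rng(0)

# An 8x8 scene made of four constant 4x4 blocks.
scene = np.kron(np.array([[0.0, 1.0], [1.0, 0.0]]), np.ones((4, 4)))

# Two observations of the same scene with independent pixel noise.
obs_a = scene + 0.1 * rng.standard_normal(scene.shape)
obs_b = scene + 0.1 * rng.standard_normal(scene.shape)

pixel_mse = np.mean((obs_a - obs_b) ** 2)
latent_mse = np.mean((pool_encoder(obs_a) - pool_encoder(obs_b)) ** 2)

print(f"pixel-space MSE:  {pixel_mse:.5f}")
print(f"latent-space MSE: {latent_mse:.5f}")
```

A learned encoder is of course not a fixed average, but the same intuition applies: a loss in representation space does not have to pay for detail that the representation discards.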
3. Mathematical Formulation
Let x be a context observation and y be a target observation. The two may be different image regions, video clips, time steps, trajectory segments, or modalities. JEPA maps both into a shared embedding space and trains a predictor to estimate the target embedding from the context embedding.
| Symbol | Meaning |
|---|---|
| x | Context input, such as visible image patches or earlier video frames. |
| y | Target input, such as masked patches, later frames, or another related view. |
| fθ | Context encoder with trainable parameters θ. |
| fξ | Target encoder with parameters ξ, often updated as an EMA teacher. |
| gφ | Predictor network with parameters φ. |
| c | Conditioning information such as mask position, time offset, action, or modality. |
| D | A distance function, commonly L1, L2, smooth L1, or cosine-style distance. |
| stopgrad | Stops gradient flow through the target branch to stabilize training. |
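Using the symbols in the table above, a generic JEPA objective can be written as follows. This is a sketch assembled from those definitions rather than the exact loss of any one paper; the names \hat{s}_y (predicted target embedding) and s_y (actual target embedding) are introduced here for convenience.

```latex
\hat{s}_y = g_\phi\big(f_\theta(x),\, c\big), \qquad
s_y = \operatorname{stopgrad}\big(f_\xi(y)\big), \qquad
\mathcal{L}(\theta, \phi) = D\big(\hat{s}_y,\, s_y\big)
```

Only θ and φ receive gradients from this loss; ξ is held fixed within each step because of the stop-gradient.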
EMA target encoder
Many JEPA-style systems use a target encoder updated by exponential moving average of the context encoder. This makes target features slower-moving and prevents the prediction target from changing too abruptly.
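In symbols, the update is ξ ← τ ξ + (1 − τ) θ, where τ is a momentum coefficient close to 1 (the symbol τ is an assumption of this note). A minimal sketch, assuming parameters are stored as a dictionary of NumPy arrays and using an illustrative momentum of 0.996:

```python
import numpy as np

def ema_update(target_params, context_params, momentum=0.996):
    """Return target parameters moved a small step toward the context parameters."""
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * context_params[name]
        for name in target_params
    }

# Toy parameters: the context encoder sits at 1, the target encoder starts at 0.
context = {"w": np.ones(3)}
target = {"w": np.zeros(3)}

for _ in range(100):
    target = ema_update(target, context)

# With a constant context encoder, the remaining gap shrinks by `momentum` per step,
# so each entry equals 1 - 0.996**100 after 100 updates.
print(target["w"])
```

The closer the momentum is to 1, the more slowly the target encoder follows the context encoder.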
Energy-based interpretation
JEPA can also be viewed as an energy-based model. Compatible context-target pairs should have low energy, while incompatible pairs should have higher energy. In non-contrastive JEPA, the main challenge is to avoid a trivial flat energy landscape where all inputs map to the same representation.
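The collapse problem can be made concrete with a toy energy function. In the sketch below, everything is hypothetical: the encoders are a random linear map and a constant map, the predictor is the identity, and the energy is a squared distance. The informative encoder assigns low energy to a compatible pair and higher energy to an unrelated pair, while the collapsed encoder assigns zero energy to every pair, which is the flat landscape described above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))  # hypothetical linear encoder weights

def informative_encoder(v):
    return W @ v                 # embedding depends on the input

def collapsed_encoder(v):
    return np.zeros(4)           # every input maps to the same point

def identity_predictor(s):
    return s                     # trivial predictor, enough for this demo

def energy(encode, predict, x, y):
    """Squared distance between predicted and actual target embeddings."""
    return float(np.sum((predict(encode(x)) - encode(y)) ** 2))

x = rng.standard_normal(8)
y_compatible = x + 0.01 * rng.standard_normal(8)  # near-duplicate of x
y_incompatible = rng.standard_normal(8)           # unrelated sample

for name, enc in [("informative", informative_encoder),
                  ("collapsed", collapsed_encoder)]:
    e_pos = energy(enc, identity_predictor, x, y_compatible)
    e_neg = energy(enc, identity_predictor, x, y_incompatible)
    print(f"{name}: compatible={e_pos:.4f}, incompatible={e_neg:.4f}")
```

The collapsed encoder minimizes the prediction loss perfectly, which is why the loss alone cannot distinguish a useful solution from a useless one.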
4. Core Architecture
JEPA is not one fixed neural network. It is a training architecture that can be instantiated using Vision Transformers, video transformers, recurrent models, state-space models, trajectory encoders, multimodal encoders, or policy-conditioned predictors.
Context branch
Receives visible or available information. In I-JEPA, this may be a spatially distributed context block from an image. In V-JEPA, it may be visible tokens from a video.
Keywords: input representation, encoder
Target branch
Receives the hidden or future target only during training. It produces the representation that the predictor must match.
Keywords: EMA teacher, stop-gradient
Predictor
Maps context embeddings to target embeddings. It may receive positional mask tokens, time offsets, or action information.
Keywords: latent prediction, conditioning
Loss
Compares predicted and actual target embeddings. The loss is applied in representation space, not raw input space.
Keywords: L1 / L2, embedding space
Important design choices
| Design choice | Why it matters | Typical JEPA answer |
|---|---|---|
| What is masked? | Determines what the model must infer. | Large semantic image blocks or spatiotemporal video tubes / blocks. |
| What is predicted? | Determines whether the model learns low-level or high-level structure. | Target embeddings rather than raw pixels. |
| How is collapse prevented? | Non-contrastive objectives can have trivial solutions. | EMA target encoder, stop-gradient, predictor asymmetry, variance/covariance regularization, or contrastive terms. |
| How is the target encoded? | The quality of the target representation controls the learning signal. | Momentum teacher, contextualized target encoder, or discriminative target construction. |
| How is evaluation done? | Good latent features may not always be linearly separable. | Linear probing, attentive probing, fine-tuning, segmentation, detection, motion tasks, video QA, and planning. |
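The four components described in this section (context branch, target branch, predictor, loss) can be wired together in a single forward pass. The sketch below is a simplification under stated assumptions: the encoders are single `tanh` layers rather than Vision Transformers, the predictor sees a mean-pooled context plus a positional token per target, and all names (`theta`, `xi`, `phi`, `pos`) are illustrative. It uses NumPy without automatic differentiation, so the stop-gradient on the target branch is implicit.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 16, 12, 8

# Hypothetical parameters; real systems use deep networks.
theta = rng.standard_normal((patch_dim, embed_dim)) * 0.1    # context encoder
xi = theta.copy()                                             # target encoder (EMA copy)
phi = rng.standard_normal((2 * embed_dim, embed_dim)) * 0.1   # predictor
pos = rng.standard_normal((num_patches, embed_dim)) * 0.1     # positional tokens

def encode(patches, weights):
    """Map each patch to an embedding with one nonlinear layer."""
    return np.tanh(patches @ weights)

def predict(context_emb, target_positions):
    """Predict one embedding per target position from the pooled context."""
    pooled = context_emb.mean(axis=0)
    inputs = np.concatenate(
        [np.tile(pooled, (len(target_positions), 1)), pos[target_positions]],
        axis=1,
    )
    return inputs @ phi

patches = rng.standard_normal((num_patches, patch_dim))
context_idx = np.arange(0, 10)    # visible patches
target_idx = np.arange(10, 16)    # hidden patches, seen only by the target branch

context_emb = encode(patches[context_idx], theta)   # context branch
target_emb = encode(patches[target_idx], xi)        # target branch, treated as constant
predicted = predict(context_emb, target_idx)        # predictor with position conditioning
loss = np.mean((predicted - target_emb) ** 2)       # L2-style loss in embedding space

print(predicted.shape, target_emb.shape, float(loss))
```

In a real training loop, gradients of `loss` would update `theta` and `phi` only, and `xi` would then be refreshed from `theta` by the EMA rule.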
5. JEPA Taxonomy
JEPA methods can be classified according to the modality, prediction target, temporal structure, and collapse-prevention strategy. This taxonomy helps connect I-JEPA, V-JEPA, C-JEPA, DMT-JEPA, and newer multimodal or reinforcement-learning variants.
5.1 Image JEPA
Image JEPA predicts representations of masked image regions from visible context regions. I-JEPA is the central example. Its core claim is that predicting target embeddings can produce highly semantic image representations without relying on hand-crafted data augmentations.
5.2 Video JEPA
Video JEPA uses latent feature prediction over space and time. V-JEPA and V-JEPA 2 show that feature prediction can learn visual representations useful for motion understanding, action anticipation, video QA, and planning-oriented world models.
5.3 Hierarchical JEPA
Hierarchical JEPA extends the idea to multiple levels of abstraction. Lower levels can predict short-term local structure, while higher levels can predict slower, more abstract, longer-horizon structure. This is central to LeCun's world-model vision.
5.4 Contrastive or regularized JEPA
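Some work strengthens JEPA with explicit regularization. C-JEPA connects JEPA with VICReg-style variance, invariance, and covariance terms to address two limitations it reports for I-JEPA: EMA alone may not prevent collapse, and the predictor may not accurately learn the mean of the patch representations.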
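The variance and covariance terms can be sketched directly on a batch of embeddings. The code below follows the general VICReg recipe (a hinge on the per-dimension standard deviation and a penalty on off-diagonal covariance); the function names, the threshold `gamma`, and the constant `eps` are assumptions of this sketch, and it is not the C-JEPA training code.

```python
import numpy as np

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty that is positive when a dimension's std falls below gamma."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_term(z):
    """Sum of squared off-diagonal covariances, divided by the embedding dimension."""
    n, d = z.shape
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

rng = np.random.default_rng(0)
healthy = rng.standard_normal((256, 4))   # spread-out embeddings
collapsed = np.ones((32, 4))              # every sample has the same embedding

print("variance penalty, healthy:  ", variance_term(healthy))
print("variance penalty, collapsed:", variance_term(collapsed))
print("covariance penalty, healthy:", covariance_term(healthy))
```

A collapsed batch has zero variance in every dimension, so the variance term is close to `gamma`; adding it to the prediction loss makes the collapsed solution costly.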
5.5 Action-conditioned JEPA
Action-conditioned JEPA predicts how latent states change under candidate actions. This connects JEPA to robotics and model-based reinforcement learning because actions can be selected by comparing predicted future states to goals.
6. Literature Timeline and Positioning
JEPA sits at the intersection of self-supervised learning, energy-based modeling, masked modeling, non-contrastive representation learning, and world-model learning.
| Year | Work | Main contribution | How it relates to JEPA |
|---|---|---|---|
| 2022 | LeCun, A Path Towards Autonomous Machine Intelligence | Proposes JEPA and hierarchical JEPA as non-generative predictive world-model architectures. | Conceptual foundation. |
| 2022 | data2vec | Predicts latent contextualized representations across speech, vision, and language. | Closely related latent-prediction self-supervised framework. |
| 2023 | I-JEPA | Predicts target image-block embeddings from context image-block embeddings. | First major image-based JEPA implementation. |
| 2024 | V-JEPA | Trains video models using feature prediction without labels, text, negatives, or reconstruction. | Extends JEPA to video and motion-centric tasks. |
| 2024 | DMT-JEPA | Builds discriminative masked targets from semantically similar neighboring patches. | Addresses local semantic limitations of embedding-space masked modeling. |
| 2024 | C-JEPA | Combines JEPA with VICReg-style regularization to improve stability and reduce collapse. | Studies limitations and collapse-prevention mechanisms. |
| 2025 | V-JEPA 2 | Scales video JEPA and connects it to latent action-conditioned world models for robotics. | Important step toward planning and embodied AI. |
| 2024+ | TD-JEPA, VL-JEPA, TS-JEPA, T-JEPA, M3-JEPA | Extend JEPA-like latent prediction to RL, vision-language, time series, trajectories, and multimodal alignment. | Show JEPA as a broader design pattern. |
Relation to other self-supervised families
| Family | Training signal | Main strength | Main limitation |
|---|---|---|---|
| Contrastive SSL | Pull positive pairs together and push negative pairs apart. | Strong representation learning and collapse resistance. | May require negative samples, large batches, or carefully designed augmentations. |
| Non-contrastive joint embedding | Match two views without explicit negatives. | Simpler objective and strong invariance learning. | Must avoid collapse using asymmetry or regularization. |
| Masked autoencoding | Reconstruct missing pixels or tokens. | Simple and scalable. | May overemphasize low-level reconstruction detail. |
| Autoregressive modeling | Predict the next token, patch, or observation. | Powerful generative modeling. | Can be expensive and may model surface details rather than abstract dynamics. |
| JEPA | Predict target embeddings from context embeddings. | Focuses on predictable semantic structure in latent space. | Target quality, masking design, and collapse prevention are difficult. |
7. Main Case Studies
7.1 I-JEPA: Image-based Joint-Embedding Predictive Architecture
I-JEPA predicts the representations of masked target blocks from visible context blocks within the same image. It is designed to learn semantic image representations without reconstructing pixels and without relying on hand-crafted image augmentations.
- Core objective: predict latent representations of target image blocks.
- Important design choice: target blocks should be large enough to encourage semantic prediction.
- Architecture: commonly implemented with Vision Transformers.
- Evaluation: classification, object counting, depth prediction, and other downstream visual tasks.
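The block-masking idea in the bullets above can be sketched on a patch grid. This is a simplified illustration, not the I-JEPA sampler: blocks here are square with a fixed scale, whereas the method samples scale and aspect ratio from ranges, and the values 0.2, 0.9, and four target blocks are illustrative assumptions. The sketch samples several target blocks, then a large context block from which all target patches are removed so the context cannot see what it must predict.

```python
import numpy as np

def sample_block(grid, scale, rng):
    """Sample a square block covering roughly `scale` of the patch grid."""
    side = max(1, int(round(grid * np.sqrt(scale))))
    top = rng.integers(0, grid - side + 1)
    left = rng.integers(0, grid - side + 1)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + side, left:left + side] = True
    return mask

rng = np.random.default_rng(0)
grid = 14  # a 224x224 image with 16x16 patches gives a 14x14 grid

# Several fairly large target blocks.
targets = [sample_block(grid, scale=0.2, rng=rng) for _ in range(4)]
target_mask = np.any(targets, axis=0)

# One large context block with every target patch removed.
context_mask = sample_block(grid, scale=0.9, rng=rng) & ~target_mask

print("target patches: ", int(target_mask.sum()))
print("context patches:", int(context_mask.sum()))
```

Shrinking `scale` for the targets makes the task closer to local texture completion, which is the shortcut the large-block design is meant to avoid.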
7.2 V-JEPA: Video feature prediction
V-JEPA trains video models using feature prediction as the only pretraining objective. Unlike many video learning systems, it does not require labels, text supervision, negative examples, pretrained image encoders, or pixel reconstruction.
- Why video matters: motion, persistence, object interaction, and temporal continuity are central to world modeling.
- Key claim: predicting video features can produce representations useful for both appearance and motion tasks.
- Evaluation issue: frozen representations may benefit from attentive probing rather than only average pooling.
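The probing distinction in the last bullet can be sketched as two ways of pooling frozen token embeddings before a classifier. The code is a minimal illustration under assumptions: a real attentive probe uses a learned cross-attention layer with its own projections, while `attentive_pool` here uses a single query vector and no projections, and the query is random rather than trained.

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def average_pool(tokens):
    """Give every token the same weight."""
    return tokens.mean(axis=0)

def attentive_pool(tokens, query):
    """Weight tokens by softmax similarity to a (normally learned) query vector."""
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    weights = softmax(scores)
    return weights @ tokens, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))   # six frozen token embeddings of width 4
query = rng.standard_normal(4)         # hypothetical query; learned in practice

pooled_avg = average_pool(tokens)
pooled_att, weights = attentive_pool(tokens, query)
uniform_att, _ = attentive_pool(tokens, np.zeros(4))  # zero query gives uniform weights

print("attention weights:", np.round(weights, 3))
```

Average pooling is the special case with uniform weights, so an attentive probe can only match or extend what average pooling reads out of the frozen features.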
7.3 V-JEPA 2: Toward world models and planning
V-JEPA 2 scales video pretraining and then post-trains an action-conditioned latent model with a small amount of robot interaction data (the paper reports under 62 hours of unlabeled robot video). This connects JEPA to model-based control: candidate actions can be evaluated by predicting their future latent consequences.
- Pretraining: action-free latent video prediction.
- Post-training: action-conditioned latent prediction using robot trajectories.
- Planning: choose actions whose predicted future latent state matches an image goal.
- Limitation: long-horizon planning and language-specified goals remain open challenges.
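The planning step in the bullets above can be sketched as scoring candidate actions by the distance between the predicted next latent state and a goal latent. Everything in this sketch is a stated assumption: the dynamics are a hand-written linear model rather than a learned predictor, the goal latent is produced by the same model instead of an image encoder, and candidates are scored exhaustively over a single step, which is simpler than the sampling-based, multi-step planners used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, action_dim = 4, 2

# Hypothetical linear latent dynamics: z_next = A z + B a.
A = np.eye(latent_dim)
B = rng.standard_normal((latent_dim, action_dim))

def predict_next(z, a):
    """Action-conditioned latent prediction (stand-in for a learned predictor)."""
    return A @ z + B @ a

def plan_one_step(z_now, z_goal, candidates):
    """Pick the candidate whose predicted latent is closest (L1) to the goal."""
    costs = [np.abs(predict_next(z_now, a) - z_goal).sum() for a in candidates]
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

z_now = np.zeros(latent_dim)
true_action = np.array([0.5, -0.25])
z_goal = predict_next(z_now, true_action)   # stands in for the encoded goal image

# Random candidate actions, plus the action that actually reaches the goal.
candidates = np.vstack([rng.uniform(-1, 1, size=(63, action_dim)), true_action])
best_action, best_cost = plan_one_step(z_now, z_goal, candidates)

print("chosen action:", best_action, "cost:", best_cost)
```

Planning over several steps repeats the prediction on its own outputs, which is where the compounding errors behind the long-horizon limitation come from.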
7.4 C-JEPA and DMT-JEPA: Critiques and refinements
C-JEPA
C-JEPA argues that the EMA strategy in I-JEPA can be insufficient to prevent collapse and combines JEPA with VICReg-style variance, invariance, and covariance regularization.
Keywords: collapse, VICReg, regularization
DMT-JEPA
DMT-JEPA argues that embedding-space masked modeling may weaken local semantic discrimination and proposes discriminative masked targets built from neighboring patches.
Keywords: local semantics, dense tasks, masked targets
8. Applications of JEPA
JEPA is most useful when the goal is to learn abstract, predictive representations from unlabeled data. It is especially attractive for domains where raw reconstruction is expensive, ambiguous, or unnecessarily detailed.
| Application area | Why JEPA is useful | Example use |
|---|---|---|
| Image representation learning | Learns semantic features without labels or pixel reconstruction. | Pretraining a Vision Transformer with I-JEPA and fine-tuning for classification or dense prediction. |
| Video understanding | Captures motion, temporal coherence, and object dynamics in latent space. | Action recognition, motion understanding, and video retrieval. |
| Robotics | Predicts future latent states under candidate actions. | Image-goal planning for manipulation tasks. |
| Model-based reinforcement learning | Provides a latent dynamics model for planning and reward optimization. | Policy-conditioned prediction of future state embeddings. |
| Multimodal learning | Can align image, video, language, and other modalities in embedding space. | Vision-language JEPA predicting text embeddings rather than generating tokens. |
| Time series and trajectories | Predicts missing or future latent segments without hand-crafted domain targets. | Trajectory similarity, time-series forecasting, and representation learning. |
9. Challenges, Limitations, and Research Gaps
JEPA is powerful, but good results are not automatic. Its success depends heavily on target construction, representation quality, masking strategy, collapse prevention, architecture, and evaluation design.
Representation collapse
If all inputs map to the same embedding, prediction becomes easy but useless. EMA, stop-gradient, predictor asymmetry, contrastive constraints, and variance regularization are used to reduce this risk.
Target quality
The model can only learn what the target representation contains. Weak, noisy, or over-smoothed target embeddings may produce weak features.
Masking strategy
Too-easy masks encourage local shortcut learning. Too-hard masks may make prediction unstable. Semantic block size and context coverage matter.
Local semantics
High-level latent prediction may discard details needed for segmentation, detection, localization, or fine-grained spatial reasoning.
Evaluation difficulty
Latent representations may be useful but not linearly separable. Probing choice can strongly affect conclusions.
Long-horizon planning
Predicting farther into the future requires handling uncertainty, compounding errors, and multiple possible futures.
Important open questions
- What is the best theoretical explanation of why JEPA learns semantic abstractions?
- How can JEPA avoid collapse without relying on fragile implementation details?
- How should target embeddings be designed for both high-level semantics and local dense prediction?
- Can hierarchical JEPA reliably support long-horizon planning?
- How should uncertainty and multiple possible futures be represented in latent prediction?
- Can JEPA scale across modalities while maintaining controllability, grounding, and interpretability?
- How can JEPA-based world models be evaluated beyond standard classification or probing metrics?
10. Conclusion
JEPA is best understood as a framework for learning by predicting useful latent structure. Its central move is to replace raw reconstruction with embedding prediction. This makes it attractive for semantic representation learning, video understanding, world modeling, and robotics.
The literature shows a clear progression: LeCun's conceptual JEPA and hierarchical JEPA proposal, I-JEPA for images, V-JEPA for video, V-JEPA 2 for world-model and planning-oriented scaling, and newer variants that address collapse, dense semantics, multimodal alignment, time series, trajectories, and reinforcement learning.
A natural order for explaining JEPA is therefore: motivation → latent prediction mechanism → mathematical objective → architecture → taxonomy → key papers → applications → challenges. The key research tension is between abstraction and information preservation: JEPA must ignore irrelevant details while preserving enough structure to support recognition, reasoning, and action.
References and Key Papers
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
- Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. ICML 2022. arXiv:2202.03555.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023. arXiv:2301.08243.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. arXiv:2404.08471.
- Mo, S., & Yun, S. (2024). DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture. arXiv:2405.17995.
- Mo, S., & Tong, S. (2024). Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning. arXiv:2410.19560.
- Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., and collaborators. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985.
- Lei, H., and collaborators. (2024). M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture. arXiv:2409.05929.
- Li, L., and collaborators. (2024). T-JEPA: A Joint-Embedding Predictive Architecture for Trajectory Similarity Computation. arXiv:2406.12913.
- Bagatella, M., and collaborators. (2025). TD-JEPA: Latent-predictive Representations for Zero-Shot Unsupervised Reinforcement Learning. arXiv:2510.00739.
- Chen, D., and collaborators. (2025). VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language Modeling. arXiv:2512.10942.