1. Motivation: Why Knowledge Distillation Exists
Modern deep learning models are often highly accurate but expensive to deploy. Large convolutional networks, Transformers, ensembles, and multimodal models may require substantial memory, computation, and energy. This creates a gap between research performance and real-world deployment.
Knowledge distillation addresses this gap by transferring the behavior of a large or strong model into a smaller, faster, or more efficient model. The objective is not necessarily to reproduce the teacher architecture, but to reproduce useful behavior: predictions, internal representations, attention patterns, or relational structure.
Why this matters
- Lower inference cost: the student can be faster and cheaper to run.
- Smaller memory footprint: compressed models are easier to deploy on edge devices.
- Real-time inference: smaller students can support low-latency systems.
- Ensemble compression: several teacher models can be compressed into one deployable student.
- Regularization: in some cases, distillation improves generalization even when the student is not smaller.
2. Basic Concept of Knowledge Distillation
The classical knowledge distillation setup contains two main roles: the teacher and the student. The teacher is usually a large, accurate, pretrained model. The student is usually smaller, cheaper, or easier to deploy. During training, the student learns from both the ground-truth labels and the teacher's predictions.
| Element | Meaning | Role in KD |
|---|---|---|
| Teacher | A large, accurate, pretrained model or ensemble. | Provides richer learning signals than ground-truth labels alone. |
| Student | A smaller or cheaper model trained to imitate the teacher. | Learns to approximate the teacher while remaining efficient. |
| Hard label | The ground-truth class label, usually one-hot. | Provides direct supervised learning. |
| Soft label | The teacher's probability distribution over classes. | Contains information about class similarity and uncertainty. |
| Dark knowledge | Information hidden in non-maximum class probabilities. | Helps the student understand relationships between classes. |
Hard labels versus soft labels
A hard label tells the model only the correct class. For example, an image may be labeled "cat." The teacher's soft distribution might instead assign cat 0.82, dog 0.11, fox 0.04, car 0.01, with the remaining mass spread over other classes. This is more informative because it shows that the teacher considers dog and fox more similar to cat than car is.
3. Mathematical Formulation
The standard KD objective combines ordinary supervised learning with a distillation term. The supervised term trains the student to match the ground-truth labels. The distillation term trains the student to match the teacher's softened output distribution.
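One common way to write this objective, following Hinton et al. (2015), is sketched here. The convention is an assumption: α weights the distillation term and (1 − α) the supervised term, and some papers swap the two. The symbols y (ground-truth label) and σ (softmax) are introduced only for this sketch. The factor T² is often included so that the gradient magnitude of the distillation term stays comparable as T changes.

```latex
L = (1 - \alpha)\,\mathrm{CE}\big(y,\ \sigma(z_s)\big)
  + \alpha\, T^{2}\,\mathrm{KL}\big(\sigma(z_t / T)\ \big\|\ \sigma(z_s / T)\big),
\qquad
\sigma(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}
```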
| Symbol | Meaning |
|---|---|
| L | Total training loss for the student. |
| CE | Cross-entropy loss between ground-truth labels and student predictions. |
| KL | Kullback–Leibler divergence between teacher and student probability distributions. |
| z_t | Teacher logits before softmax. |
| z_s | Student logits before softmax. |
| T | Temperature parameter that softens the probability distribution. |
| α | Weight controlling the balance between hard-label learning and teacher imitation. |
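The symbols in this table can be combined into a small, framework-free sketch of the loss for a single example. The weighting convention is an assumption, (1 − α)·CE + α·T²·KL, and the logits and the function names are hypothetical values chosen for illustration; a real training loop would compute this per batch in a deep learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(true_class, probs):
    """CE between a one-hot label (given as a class index) and predicted probabilities."""
    return -math.log(probs[true_class])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(student_logits, teacher_logits, true_class, T=4.0, alpha=0.5):
    """Assumed convention: (1 - alpha) * CE + alpha * T^2 * KL(teacher || student)."""
    ce = cross_entropy(true_class, softmax(student_logits))
    kl = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    return (1 - alpha) * ce + alpha * T * T * kl

# Hypothetical logits for a 3-class problem; class 0 is the ground truth.
teacher_logits = [5.0, 2.0, -1.0]
student_logits = [3.0, 0.5, 0.2]
print(f"KD loss: {kd_loss(student_logits, teacher_logits, true_class=0):.4f}")
```

Setting alpha to 0 recovers ordinary supervised training, and setting it to 1 trains the student on the teacher's softened distribution alone.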
Role of temperature
Temperature controls how smooth the teacher's output distribution becomes. A higher temperature makes the distribution less sharp, which exposes more information about secondary classes. This is important because the student can learn class relationships that are not visible in one-hot labels.
Low temperature
Produces sharper probabilities. The top class dominates and less secondary-class information is visible.
High temperature
Produces softer probabilities. More information about class similarity is exposed to the student.
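The effect of temperature can be checked numerically. The sketch below applies a temperature-scaled softmax to one set of hypothetical teacher logits and prints the resulting distribution and its entropy for several values of T; the class names and logit values are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

classes = ["cat", "dog", "fox", "car"]
teacher_logits = [6.0, 4.0, 3.0, 0.0]  # hypothetical values

for T in (1.0, 4.0, 10.0):
    probs = softmax(teacher_logits, T)
    row = ", ".join(f"{c} {p:.3f}" for c, p in zip(classes, probs))
    print(f"T={T:>4}: {row}  (entropy {entropy(probs):.3f})")
```

The ranking of the classes is the same at every temperature; only the amount of probability mass given to the secondary classes changes.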
4. Types of Knowledge Transferred
The earliest form of KD focused on the teacher's final output probabilities. Later work expanded the concept by distilling intermediate representations, attention maps, relational structures, and contrastive similarity information.
| Type | What is transferred? | Explanation | Representative direction |
|---|---|---|---|
| Response-based KD | Final logits or soft probabilities. | The student imitates the teacher's output distribution. | Classical KD. |
| Feature-based KD | Intermediate hidden representations. | The student learns from internal teacher layers, not only final outputs. | FitNets and hint-based learning. |
| Attention-based KD | Attention maps or activation patterns. | The student learns where the teacher focuses. | Attention transfer. |
| Relation-based KD | Relationships among samples or representations. | The student preserves teacher-defined distances, angles, or similarities. | Relational KD. |
| Contrastive KD | Representation-space similarity and separation. | The student learns which samples should be close or far in representation space. | Contrastive representation distillation. |
Why these distinctions matter
Different types of knowledge are useful in different situations. Response-based KD is simple and widely applicable, but it may miss internal reasoning patterns. Feature-based KD can be stronger when the student and teacher architectures are compatible. Relation-based and contrastive methods are useful when preserving the structure of the teacher's representation space is more important than matching individual logits.
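To make the contrast concrete, here is a minimal sketch of a feature-based loss and a relation-based loss on toy vectors. It is a simplification under stated assumptions: the projection matrix is fixed by hand (in hint-based methods it is learned), the relation loss uses a plain squared error on mean-normalized pairwise distances rather than the exact loss of any particular paper, and it assumes the points in a batch are not all identical. All names and values are hypothetical.

```python
import math

def hint_loss(student_feat, teacher_feat, projection):
    """Feature-based KD: project the student feature to the teacher's width,
    then take the mean squared error against the teacher feature."""
    projected = [sum(w * s for w, s in zip(row, student_feat)) for row in projection]
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_feat)) / len(teacher_feat)

def pairwise_distances(batch):
    """Euclidean distances between all pairs i < j in a batch of feature vectors."""
    return [math.dist(batch[i], batch[j])
            for i in range(len(batch)) for j in range(i + 1, len(batch))]

def relation_loss(student_batch, teacher_batch):
    """Relation-based KD: match mean-normalized pairwise distances, so the
    student and teacher feature widths do not need to agree."""
    ds = pairwise_distances(student_batch)
    dt = pairwise_distances(teacher_batch)
    mean_s = sum(ds) / len(ds)
    mean_t = sum(dt) / len(dt)
    return sum((a / mean_s - b / mean_t) ** 2 for a, b in zip(ds, dt)) / len(ds)

# Three samples: teacher features have width 4, student features width 2.
teacher_batch = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 2.0], [4.0, 4.0, 1.0, 0.0]]
student_batch = [[0.5, 0.0], [0.0, 0.5], [2.0, 2.0]]
# Hand-picked 4x2 projection from student width to teacher width (learned in practice).
projection = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, 1.0]]

print("hint loss (sample 0):", hint_loss(student_batch[0], teacher_batch[0], projection))
print("relation loss (batch):", relation_loss(student_batch, teacher_batch))
```

The hint loss needs a projection because the two feature widths differ, which is one reason feature-based KD is easier when the architectures are compatible. The relation loss compares only distances between samples, so it is indifferent to feature width and to a global rescaling of either representation.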
5. Teacher–Student Architecture Taxonomy
A clear way to explain the KD literature is to classify methods according to how many teachers and students are involved. This taxonomy also helps position modern methods such as Mamba-PKD.
5.1 One teacher → one student
This is the classical KD setup. A large teacher is trained first, then a compact student is trained to imitate the teacher. The student usually learns from a combination of hard labels and teacher soft labels.
- Best for: standard model compression.
- Benefit: simple and effective.
- Risk: if the capacity gap is too large, the student may fail to absorb the teacher's knowledge.
5.2 One teacher → multiple students
In this configuration, one teacher supervises several students. The students may have different sizes, different architectures, or different deployment targets. This can create a family of models for different hardware constraints.
- Best for: producing multiple compressed models from one strong teacher.
- Benefit: flexible deployment across devices.
- Risk: training several students can increase total training cost.
5.3 Multiple teachers → one student
Multi-teacher KD allows one student to learn from several teacher models. The teachers may be different architectures, trained on different data, or optimized for different aspects of a task. The student can inherit complementary knowledge while avoiding the inference cost of an ensemble.
- Best for: compressing ensembles or combining specialized teachers.
- Benefit: richer supervision than a single teacher.
- Risk: teachers may disagree, so teacher weighting or selection becomes important.
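A simple way to handle several teachers is to average their soft labels, optionally with per-teacher weights. The sketch below shows a uniform average and one hypothetical weighting rule that trusts each teacher in proportion to the probability it assigns to the ground-truth class. This rule is only an illustration of sample-wise weighting, not the method of any specific paper, and the teacher outputs are made-up values.

```python
def combine_teachers(teacher_probs, weights=None):
    """Weighted average of teacher distributions; weights are normalized to sum to 1."""
    if weights is None:
        weights = [1.0] * len(teacher_probs)
    total = sum(weights)
    weights = [w / total for w in weights]
    num_classes = len(teacher_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, teacher_probs))
            for c in range(num_classes)]

def confidence_weights(teacher_probs, true_class):
    """Hypothetical rule: weight each teacher by its probability for the true class."""
    return [probs[true_class] for probs in teacher_probs]

# Three hypothetical teachers that disagree on a 3-class example (true class 0).
teachers = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
]

uniform = combine_teachers(teachers)
weighted = combine_teachers(teachers, confidence_weights(teachers, true_class=0))

print("uniform target: ", [round(p, 3) for p in uniform])
print("weighted target:", [round(p, 3) for p in weighted])
```

The combined distribution then plays the role of the single teacher's soft label in the distillation term. Label-based weighting requires ground-truth labels for the distillation data, which is an extra assumption of this sketch.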
5.4 Multiple students learning together
In mutual or online knowledge distillation, there may be no fixed pretrained teacher. Instead, several students learn collaboratively and teach each other during training. Each model acts partly as a student and partly as a teacher.
- Best for: settings where no strong pretrained teacher exists.
- Benefit: collaborative learning can improve generalization.
- Risk: students can become too similar, reducing useful diversity.
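The idea of peers teaching each other can be sketched for two models and one example, in the style of Deep Mutual Learning: each peer minimizes its own cross-entropy plus a KL term that pulls its prediction toward the other peer's prediction. The logits are hypothetical, and the sketch computes loss values only; in actual training the peer's distribution is usually treated as a constant when each model is updated.

```python
import math

def softmax(logits):
    """Softmax over a list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_losses(logits_a, logits_b, true_class):
    """Per-peer loss: own cross-entropy plus KL from the peer's prediction."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = -math.log(p_a[true_class]) + kl(p_b, p_a)  # peer B acts as teacher for A
    loss_b = -math.log(p_b[true_class]) + kl(p_a, p_b)  # peer A acts as teacher for B
    return loss_a, loss_b

# Hypothetical logits from two peers on the same example (true class 0).
logits_a = [2.0, 1.0, 0.1]
logits_b = [1.5, 1.4, 0.2]
loss_a, loss_b = mutual_losses(logits_a, logits_b, true_class=0)
print(f"peer A loss: {loss_a:.4f}, peer B loss: {loss_b:.4f}")
```

When the two peers produce identical predictions the KL terms vanish, which is the numerical counterpart of the diversity risk noted above: peers that have collapsed to the same function no longer teach each other anything.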
5.5 Multiple teachers → multiple students
This is a many-to-many form of distillation. Several teachers provide knowledge to several students. This is useful in complex systems involving different architectures, tasks, deployment environments, or quantization levels.
- Best for: robust and scalable deployment ecosystems.
- Benefit: supports specialization and diversity.
- Risk: complex loss design, scheduling, and teacher-student coordination.
5.6 Progressive distillation
Progressive distillation reduces the gap between a large teacher and small students by using stages, assistants, or progressively more complex student models. Instead of forcing a tiny student to learn directly from a very large teacher, knowledge is transferred gradually.
- Best for: large teacher–small student capacity gaps.
- Benefit: smoother knowledge transfer.
- Risk: additional training stages and design choices.
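The staging idea can be illustrated with a small planning sketch that orders models by size and pairs each with the next smaller one, then compares the largest teacher-to-student size ratio with and without intermediate models. The parameter counts are hypothetical, and size ratio is used only as a rough proxy for the capacity gap; a smaller ratio per stage does not by itself guarantee better transfer.

```python
def distillation_chain(sizes):
    """Order models from largest to smallest and pair each one with the next,
    giving (teacher_size, student_size) for every stage."""
    ordered = sorted(sizes, reverse=True)
    return list(zip(ordered, ordered[1:]))

def largest_gap(stages):
    """Largest teacher/student size ratio over a list of stages."""
    return max(teacher / student for teacher, student in stages)

# Hypothetical parameter counts in millions: teacher, two assistants, final student.
sizes = [60.0, 2.0, 20.0, 6.0]

direct = [(max(sizes), min(sizes))]   # teacher distilled straight into the smallest model
staged = distillation_chain(sizes)    # teacher -> assistant -> assistant -> student

print("direct:", direct, "largest gap:", largest_gap(direct))
print("staged:", staged, "largest gap:", round(largest_gap(staged), 2))
```

Each stage in the chain is an ordinary one-teacher, one-student distillation run, so the added cost is one extra training run per intermediate model.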
6. Modern Extensions of Knowledge Distillation
KD has evolved from a simple compression technique into a broad family of training strategies. Several extensions are now important in the literature.
Teacher-assistant KD
Uses one or more intermediate models between the teacher and the final student. The purpose is to reduce the capacity gap and make knowledge transfer easier.
Keywords: capacity gap, multi-step
Self-distillation
A model teaches another model with the same or similar architecture. This shows that KD can act as a regularizer, not only as compression.
Keywords: same architecture, regularization
Online KD
Teacher and student are trained simultaneously. The teacher may be dynamically generated or represented by peer models during training.
Keywords: no pretrained teacher, collaborative
Multi-teacher KD
A student learns from several teachers. Modern methods may use adaptive weighting to decide which teacher is most useful for each sample.
Keywords: ensemble compression, teacher weighting
Progressive KD
Knowledge is transferred through stages or multiple students. This is useful when direct compression is too difficult.
Keywords: stages, scalable students
Architecture-aware KD
Distillation is adapted to the architecture of the student, such as CNNs, Transformers, quantized models, or Mamba/state-space models.
Keywords: CNN, Transformer, Mamba
7. Mamba-PKD Case Study
Mamba-PKD is a recent example of architecture-aware progressive knowledge distillation. It connects KD with Mamba-style selective state-space models for efficient image classification.
Paper information
| Field | Value |
|---|---|
| Title | Mamba-PKD: A Framework for Efficient and Scalable Model Compression in Image Classification |
| Authors | José Medina, Amnir Hadachi, Paul Honeine, Abdelaziz Bensrhair |
| Venue | Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing, SAC 2025 |
| Pages | 1296–1298 |
| DOI | 10.1145/3672608.3707887 |
Where it fits in the taxonomy
Mamba-PKD fits best under one teacher → multiple progressive students. It uses a teacher model and a set of progressively defined Mamba-based student models. The students are designed to provide different trade-offs between computational cost and classification accuracy.
Main idea
The paper combines Progressive Knowledge Distillation with Mamba blocks. Mamba is based on selective state-space modeling, which allows efficient sequence processing. In the image-classification context, images are represented as sequences of pixel values or patch-like inputs, and Mamba blocks are used inside the student models.
Why Mamba is relevant to KD
- Efficiency: Mamba-style models are designed for scalable sequence processing.
- Alternative to Transformers: state-space models can offer efficient long-range modeling without standard self-attention.
- Student architecture design: KD performance depends not only on the teacher but also on whether the student architecture can absorb the teacher's behavior.
- Progressive compression: multiple Mamba students can represent several levels of computational budget.
Reported experimental setting
| Dataset | Teacher accuracy | Student behavior | Interpretation |
|---|---|---|---|
| MNIST | About 98% | The progressive students can approach or match the teacher depending on student size. | Promising result on a simple image-classification dataset. |
| CIFAR-10 | About 87% | Larger students approach the teacher more closely than smaller students. | Shows the accuracy-efficiency trade-off more clearly. |
Critical interpretation
Mamba-PKD is important because it shows how KD can be combined with newer efficient architectures. However, the current evidence should be interpreted carefully. The reported experiments focus on relatively small image-classification datasets such as MNIST and CIFAR-10. For a stronger claim about general image-classification compression, experiments on larger datasets and comparisons with strong CNN, Transformer, and hybrid baselines would be useful.
8. Applications of Knowledge Distillation
KD is widely used wherever there is a need to convert an accurate but expensive model into a cheaper deployable one. It is especially useful when inference cost matters more than training cost.
| Application area | Why KD is useful | Example use |
|---|---|---|
| Computer vision | Compresses classifiers, object detectors, and segmentation models. | Distilling a large ResNet, ViT, or Mamba model into a smaller image classifier. |
| Natural language processing | Compresses large language models and pretrained Transformers. | Training a smaller BERT-like student from a large teacher model. |
| Speech recognition | Reduces acoustic model size and latency. | Distilling multiple acoustic teachers into a compact recognizer. |
| Edge AI | Supports deployment on mobile phones, embedded boards, and IoT devices. | Running student models under memory and energy constraints. |
| Medical AI | Accelerates inference while preserving diagnostic performance. | Compressing large imaging models for hospital deployment. |
| Autonomous systems | Reduces latency for real-time perception and decision-making. | Student models for object detection or scene understanding. |
| Ensemble compression | Replaces several teacher models with one student. | Compressing an ensemble into a single deployable network. |
9. Challenges and Research Gaps
Although KD is powerful, its success is not automatic. It depends on teacher quality, student capacity, loss design, architecture compatibility, data availability, and evaluation methodology.
Capacity gap
A very small student may not be able to imitate a very large teacher. Teacher assistants or progressive KD can help.
Teacher quality
A biased or poorly calibrated teacher can transfer mistakes to the student.
Loss-function design
Different losses transfer different information. Logit matching, feature matching, and relation matching are not equivalent.
Architecture mismatch
A CNN teacher, Transformer teacher, and Mamba student may represent information differently, making transfer harder.
Multi-teacher disagreement
When teachers disagree, the student needs a strategy for weighting or selecting teacher signals.
Evaluation fairness
Accuracy alone is insufficient. Parameters, FLOPs, latency, memory, energy, and hardware must also be reported.
Important open questions
- How can KD be made reliable when teacher and student architectures are very different?
- How should multiple teachers be weighted when they disagree?
- Can progressive KD scale beyond small datasets to large-scale image classification?
- How should KD be evaluated fairly across CNN, Transformer, and Mamba-style students?
- Can distillation preserve robustness, calibration, fairness, and uncertainty, not only accuracy?
10. Conclusion
Knowledge distillation is best understood as a general framework for transferring useful behavior from one or more knowledge sources into one or more efficient target models. The classical case is one teacher and one student, but the literature now includes multi-teacher, multi-student, mutual, self, online, assistant-based, and progressive forms of distillation.
A natural order for presenting the topic is therefore: motivation → basic mechanism → mathematical objective → types of knowledge → teacher–student configurations → modern extensions → applications → challenges.
Mamba-PKD is a useful modern case study because it demonstrates how KD can be adapted to new architecture families. It shows that the future of KD is not only about better losses, but also about designing student architectures that are naturally efficient, scalable, and suitable for deployment.
References and Key Papers
- Buciluă, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model Compression. KDD 2006.
- Ba, J., & Caruana, R. (2014). Do Deep Nets Really Need to be Deep? NeurIPS 2014.
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
- Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2015). FitNets: Hints for Thin Deep Nets. ICLR 2015.
- Zagoruyko, S., & Komodakis, N. (2017). Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. ICLR 2017.
- Zhang, Y., Xiang, T., Hospedales, T. M., & Lu, H. (2018). Deep Mutual Learning. CVPR 2018.
- Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., & Anandkumar, A. (2018). Born Again Neural Networks. ICML 2018.
- Lan, X., Zhu, X., & Gong, S. (2018). Knowledge Distillation by On-the-Fly Native Ensemble. NeurIPS 2018.
- Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., & Ghasemzadeh, H. (2020). Improved Knowledge Distillation via Teacher Assistant. AAAI 2020.
- Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge Distillation: A Survey. International Journal of Computer Vision.
- Medina, J., Hadachi, A., Honeine, P., & Bensrhair, A. (2025). Mamba-PKD: A Framework for Efficient and Scalable Model Compression in Image Classification. SAC 2025, 1296–1298. DOI: 10.1145/3672608.3707887.