Knowledge Distillation: Concept, Taxonomy, and Modern Extensions

1. Motivation: Why Knowledge Distillation Exists

Modern deep learning models are often highly accurate but expensive to deploy. Large convolutional networks, Transformers, ensembles, and multimodal models may require substantial memory, computation, and energy. This creates a gap between research performance and real-world deployment.

Knowledge distillation addresses this gap by transferring the behavior of a large or strong model into a smaller, faster, or more efficient model. The objective is not necessarily to reproduce the teacher architecture, but to reproduce useful behavior: predictions, internal representations, attention patterns, or relational structure.

Simple definition: Knowledge distillation is a model-compression and knowledge-transfer method in which a compact student model learns from a stronger teacher model so that it can achieve similar performance with lower computational cost.

Why this matters

Lower inference cost: the student can be faster and cheaper to run.
Smaller memory footprint: compressed models are easier to deploy on edge devices.
Real-time inference: smaller students can support low-latency systems.
Ensemble compression: several teacher models can be compressed into one deployable student.
Regularization: in some cases, distillation improves generalization even when the student is not smaller.

2. Basic Concept of Knowledge Distillation

The classical knowledge distillation setup contains two main roles: the teacher and the student. The teacher is usually a large, accurate, pretrained model. The student is usually smaller, cheaper, or easier to deploy. During training, the student learns from both the ground-truth labels and the teacher's predictions.

Teacher T → Soft targets → Student S

Element	Meaning	Role in KD
Teacher	A large, accurate, pretrained model or ensemble.	Provides richer learning signals than ground-truth labels alone.
Student	A smaller or cheaper model trained to imitate the teacher.	Learns to approximate the teacher while remaining efficient.
Hard label	The ground-truth class label, usually one-hot.	Provides direct supervised learning.
Soft label	The teacher's probability distribution over classes.	Contains information about class similarity and uncertainty.
Dark knowledge	Information hidden in non-maximum class probabilities.	Helps the student understand relationships between classes.

Hard labels versus soft labels

A hard label tells the model only the correct class. For example, an image may be labeled as "cat." A soft teacher distribution may say: cat 0.82, dog 0.11, fox 0.04, car 0.01, and so on. This is more informative because it shows that the teacher considers dog and fox to be more similar to cat than car is.

The key intuition is that KD does not only transfer the final answer; it transfers the teacher's pattern of uncertainty.
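
To make the contrast concrete, the following Python sketch compares the two kinds of target using the cat example above. It is only an illustration: the four listed probabilities are renormalized because the text elides the remaining classes, and the student prediction is a made-up vector, not a value from any experiment.

```python
import math

classes = ["cat", "dog", "fox", "car"]
hard_label = [1.0, 0.0, 0.0, 0.0]

# Teacher probabilities from the example, renormalized over the four
# listed classes (assumption: the elided classes are ignored).
raw = [0.82, 0.11, 0.04, 0.01]
soft_label = [p / sum(raw) for p in raw]


def entropy(p):
    """Shannon entropy in nats; terms with zero probability contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def cross_entropy(target, pred):
    """Cross-entropy of a prediction against a (hard or soft) target."""
    return -sum(t * math.log(q) for t, q in zip(target, pred) if t > 0.0)


# Hypothetical student that treats dog, fox, and car as equally likely.
student = [0.70, 0.10, 0.10, 0.10]

hard_loss = cross_entropy(hard_label, student)
soft_loss = cross_entropy(soft_label, student)

print("entropy of hard label:", entropy(hard_label))
print("entropy of soft label:", round(entropy(soft_label), 4))
print("loss vs hard label:", round(hard_loss, 4))
print("loss vs soft label:", round(soft_loss, 4))
```

The hard-label loss depends only on the probability the student assigns to cat, whereas the soft-label loss also depends on how the remaining probability is spread over dog, fox, and car.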

3. Mathematical Formulation

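The standard KD objective combines ordinary supervised learning with a distillation term. The supervised term trains the student to match the ground-truth labels. The distillation term trains the student to match the teacher's softened output distribution, and it is multiplied by T² because the gradients of the softened term scale as 1/T²; the factor keeps the relative contributions of the two terms roughly stable when T is changed.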

L = (1 − α) · CE(y, p_s) + α · T² · KL(softmax(z_t / T) || softmax(z_s / T))

Symbol	Meaning
L	Total training loss for the student.
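CE	Cross-entropy loss between ground-truth labels and student predictions.
y	Ground-truth label, usually one-hot.
p_s	Student prediction softmax(z_s), computed at temperature 1.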
KL	Kullback–Leibler divergence between teacher and student probability distributions.
z_t	Teacher logits before softmax.
z_s	Student logits before softmax.
T	Temperature parameter that softens the probability distribution.
α	Weight controlling the balance between hard-label learning and teacher imitation.
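
A minimal pure-Python sketch of this objective for a single example is given here. It assumes integer class labels, and the defaults alpha = 0.5 and temperature = 4.0 are illustrative choices rather than recommendations from the source; the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=4.0):
    """L = (1 - alpha) * CE(y, p_s) + alpha * T^2 * KL(teacher_T || student_T)."""
    # Supervised term: cross-entropy at temperature 1 against the true class.
    p_s = softmax(student_logits)
    ce = -math.log(p_s[label])
    # Distillation term: both distributions are softened with the same T.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kl = kl_divergence(soft_teacher, soft_student)
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl


example_loss = kd_loss([2.0, 1.0, 0.1], [3.0, 1.5, 0.2], label=0)
print(round(example_loss, 4))
```

Setting alpha to 0 recovers plain supervised training, and setting it to 1 trains the student on the teacher's softened outputs alone.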

Role of temperature

Temperature controls how smooth the teacher's output distribution becomes. A higher temperature makes the distribution less sharp, which exposes more information about secondary classes. This is important because the student can learn class relationships that are not visible in one-hot labels.

Low temperature

Produces sharper probabilities. The top class dominates and less secondary-class information is visible.

High temperature

Produces softer probabilities. More information about class similarity is exposed to the student.
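
The effect can be checked numerically. The sketch below applies a temperature-scaled softmax to one hypothetical set of teacher logits at three temperatures; the logit values are invented for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for four classes.
teacher_logits = [6.0, 3.0, 1.0, -2.0]

top_prob = {}
for t in (1.0, 4.0, 10.0):
    probs = softmax(teacher_logits, t)
    top_prob[t] = max(probs)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

As T grows, the largest probability shrinks and the secondary classes receive more mass, while the ranking of the classes stays the same.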

4. Types of Knowledge Transferred

The earliest form of KD focused on the teacher's final output probabilities. Later work expanded the concept by distilling intermediate representations, attention maps, relational structures, and contrastive similarity information.

Type	What is transferred?	Explanation	Representative direction
Response-based KD	Final logits or soft probabilities.	The student imitates the teacher's output distribution.	Classical KD.
Feature-based KD	Intermediate hidden representations.	The student learns from internal teacher layers, not only final outputs.	FitNets and hint-based learning.
Attention-based KD	Attention maps or activation patterns.	The student learns where the teacher focuses.	Attention transfer.
Relation-based KD	Relationships among samples or representations.	The student preserves teacher-defined distances, angles, or similarities.	Relational KD.
Contrastive KD	Representation-space similarity and separation.	The student learns which samples should be close or far in representation space.	Contrastive representation distillation.
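
As an illustration of the feature-based row, the sketch below computes a hint-style loss: the student's hidden vector is mapped to the teacher's width by a linear projection and compared with the teacher's hidden vector by mean squared error. The feature values and projection weights are hypothetical; in practice the projection would be learned jointly with the student.

```python
def project(features, weight):
    """Apply a linear map; weight has one row per teacher feature dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def hint_loss(student_features, teacher_features, weight):
    """Mean squared error between projected student and teacher features."""
    projected = project(student_features, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared) / len(teacher_features)


student_features = [0.5, -1.0]        # student hidden width 2 (made up)
teacher_features = [1.0, 0.0, -2.0]   # teacher hidden width 3 (made up)

# Hypothetical 3x2 projection; a real one would be a trained parameter.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]

loss = hint_loss(student_features, teacher_features, weight)
print(round(loss, 4))
```

The projection is what lets a narrow student be compared with a wider teacher layer; without it the two feature vectors would not have matching shapes.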

Why these distinctions matter

Different types of knowledge are useful in different situations. Response-based KD is simple and widely applicable because it needs only the teacher's outputs, but it ignores what the teacher computes in its intermediate layers. Feature-based KD can be stronger when the student and teacher architectures are compatible, meaning their intermediate features can be aligned directly or through a learned projection. Relation-based and contrastive methods are useful when preserving the structure of the teacher's representation space is more important than matching individual logits.
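
The relation-based case can be sketched as follows. The code compares mean-normalized pairwise distances within a batch of teacher embeddings and a batch of student embeddings. It is a simplified variant that uses a plain squared error, not the exact loss of any particular paper, and all embedding values are invented.

```python
import math


def pairwise_distances(embeddings):
    """Euclidean distance for every unordered pair of samples in the batch."""
    n = len(embeddings)
    return [math.dist(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]


def normalized(distances):
    """Divide by the mean distance so that overall scale does not matter.

    Assumes the samples are not all identical (mean distance > 0).
    """
    mean = sum(distances) / len(distances)
    return [d / mean for d in distances]


def relation_loss(student_emb, teacher_emb):
    """Mean squared difference between normalized distance structures."""
    d_s = normalized(pairwise_distances(student_emb))
    d_t = normalized(pairwise_distances(teacher_emb))
    return sum((a - b) ** 2 for a, b in zip(d_s, d_t)) / len(d_t)


teacher_emb = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]  # 3-d
same_shape = [(0.0, 0.0), (1.5, 0.0), (0.0, 2.0)]                  # 2-d, same geometry
collapsed = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]                   # 2-d, on a line

print(relation_loss(same_shape, teacher_emb))
print(relation_loss(collapsed, teacher_emb))
```

Because only distances between samples are compared, the student and teacher embeddings may have different dimensionality, which is one reason relational objectives tolerate architecture mismatch better than direct feature matching.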

5. Teacher–Student Architecture Taxonomy

A clear way to explain the KD literature is to classify methods according to how many teachers and students are involved. This taxonomy also helps position modern methods such as Mamba-PKD.

5.1 One teacher → one student

T → S

This is the classical KD setup. A large teacher is trained first, then a compact student is trained to imitate the teacher. The student usually learns from a combination of hard labels and teacher soft labels.

Best for: standard model compression.
Benefit: simple and effective.
Risk: if the capacity gap is too large, the student may fail to absorb the teacher's knowledge.

5.2 One teacher → multiple students

T → S₁, S₂, …, Sₙ

In this configuration, one teacher supervises several students. The students may have different sizes, different architectures, or different deployment targets. This can create a family of models for different hardware constraints.

Best for: producing multiple compressed models from one strong teacher.
Benefit: flexible deployment across devices.
Risk: training several students can increase total training cost.

5.3 Multiple teachers → one student

T₁, T₂, …, Tₘ → S

Multi-teacher KD allows one student to learn from several teacher models. The teachers may have different architectures, be trained on different data, or be optimized for different aspects of a task. The student can inherit complementary knowledge while avoiding the inference cost of an ensemble.

Best for: compressing ensembles or combining specialized teachers.
Benefit: richer supervision than a single teacher.
Risk: teachers may disagree, so teacher weighting or selection becomes important.
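
One simple way to handle several teachers is to blend their soft distributions into a single target. The sketch below shows a uniform average and a label-aware weighting in which each teacher is weighted by the probability it assigns to the true class. The weighting rule is a hypothetical heuristic chosen for illustration, not a method from the cited papers, and the probabilities are made up.

```python
def combine_teachers(teacher_probs, weights=None):
    """Weighted average of teacher distributions; uniform if no weights given."""
    m = len(teacher_probs)
    if weights is None:
        weights = [1.0] * m
    total = sum(weights)
    weights = [w / total for w in weights]
    n_classes = len(teacher_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, teacher_probs))
            for c in range(n_classes)]


def confidence_weights(teacher_probs, label):
    """Hypothetical heuristic: trust each teacher by its true-class probability."""
    return [p[label] for p in teacher_probs]


teacher_a = [0.7, 0.2, 0.1]   # made-up soft outputs over three classes
teacher_b = [0.3, 0.6, 0.1]   # disagrees with teacher_a on the top class

uniform_target = combine_teachers([teacher_a, teacher_b])
weighted_target = combine_teachers(
    [teacher_a, teacher_b],
    confidence_weights([teacher_a, teacher_b], label=0),
)
print([round(p, 3) for p in uniform_target])
print([round(p, 3) for p in weighted_target])
```

When the teachers disagree, the label-aware target leans toward the teacher that is right on this sample, whereas the uniform average treats both equally.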

5.4 Multiple students learning together

S₁ ↔ S₂ ↔ … ↔ Sₙ

In mutual or online knowledge distillation, there may be no fixed pretrained teacher. Instead, several students learn collaboratively and teach each other during training. Each model acts partly as a student and partly as a teacher.

Best for: settings where no strong pretrained teacher exists.
Benefit: collaborative learning can improve generalization.
Risk: students can become too similar, reducing useful diversity.
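
For two peers, the mutual setup can be sketched as a pair of losses in which each model adds a KL term toward the other's current prediction, in the spirit of deep mutual learning. The logits are hypothetical, and a real implementation would treat the peer's prediction as a constant when updating each model.

```python
import math


def softmax(logits):
    """Softmax shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_losses(logits_a, logits_b, label):
    """Return (loss_a, loss_b): own cross-entropy plus KL from the peer."""
    p_a = softmax(logits_a)
    p_b = softmax(logits_b)
    loss_a = -math.log(p_a[label]) + kl_divergence(p_b, p_a)
    loss_b = -math.log(p_b[label]) + kl_divergence(p_a, p_b)
    return loss_a, loss_b


# Hypothetical logits from two peer students on the same input.
loss_a, loss_b = mutual_losses([2.0, 0.5, 0.1], [1.0, 1.2, 0.3], label=0)
print(round(loss_a, 4), round(loss_b, 4))
```

If the two peers produce identical predictions, both KL terms vanish and each loss reduces to ordinary cross-entropy, which is the sense in which students that become too similar stop teaching each other.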

5.5 Multiple teachers → multiple students

T₁, T₂, …, Tₘ → S₁, S₂, …, Sₙ

This is a many-to-many form of distillation. Several teachers provide knowledge to several students. This is useful in complex systems involving different architectures, tasks, deployment environments, or quantization levels.

Best for: robust and scalable deployment ecosystems.
Benefit: supports specialization and diversity.
Risk: complex loss design, scheduling, and teacher-student coordination.

5.6 Progressive distillation

T → S₁ → S₂ → … → Sₙ

Progressive distillation reduces the gap between a large teacher and small students by using stages, assistants, or progressively more complex student models. Instead of forcing a tiny student to learn directly from a very large teacher, knowledge is transferred gradually.

Best for: large teacher–small student capacity gaps.
Benefit: smoother knowledge transfer.
Risk: additional training stages and design choices.
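
The staged structure can be sketched as a loop in which each distilled model becomes the teacher for the next stage. This is a schematic only: `run_chain` and the `distill` callback are hypothetical names, the callback here just records the pairing instead of training anything, and specific methods may order or combine the stages differently.

```python
def run_chain(models, distill):
    """Distill along a chain: models[0] teaches models[1], which teaches models[2], ...

    `distill(teacher, student)` is a caller-supplied training routine
    (hypothetical); this function only fixes the order of the stages.
    """
    stages = []
    teacher = models[0]
    for student in models[1:]:
        distill(teacher, student)
        stages.append((teacher, student))
        teacher = student  # the newly trained student teaches the next stage
    return stages


calls = []
stages = run_chain(
    ["teacher", "S1", "S2", "S3"],
    lambda t, s: calls.append(f"distill {t} -> {s}"),
)
for line in calls:
    print(line)
```

Each stage bridges a smaller capacity gap than a direct teacher-to-final-student transfer would, at the price of one extra training run per intermediate model.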

6. Modern Extensions of Knowledge Distillation

KD has evolved from a simple compression technique into a broad family of training strategies. Several extensions are now important in the literature.

Teacher-assistant KD

Uses one or more intermediate models between the teacher and the final student. The purpose is to reduce the capacity gap and make knowledge transfer easier.

Keywords: capacity gap, multi-step.

Self-distillation

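A model teaches another model with the same or similar architecture. Because the student is no smaller than the teacher, any gain comes from the training signal rather than from compression, which shows that KD can act as a regularizer and not only as a compression method.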

Keywords: same architecture, regularization.

Online KD

Teacher and student are trained simultaneously. The teacher may be dynamically generated or represented by peer models during training.

Keywords: no pretrained teacher, collaborative.

Multi-teacher KD

A student learns from several teachers. Modern methods may use adaptive weighting to decide which teacher is most useful for each sample.

Keywords: ensemble compression, teacher weighting.

Progressive KD

Knowledge is transferred through stages or multiple students. This is useful when direct compression is too difficult.

Keywords: stages, scalable students.

Architecture-aware KD

Distillation is adapted to the architecture of the student, such as CNNs, Transformers, quantized models, or Mamba/state-space models.

Keywords: CNN, Transformer, Mamba.

7. Mamba-PKD Case Study

Mamba-PKD is a recent example of architecture-aware progressive knowledge distillation. It connects KD with Mamba-style selective state-space models for efficient image classification.

Paper information

Title	Mamba-PKD: A Framework for Efficient and Scalable Model Compression in Image Classification
Authors	José Medina, Amnir Hadachi, Paul Honeine, Abdelaziz Bensrhair
Venue	Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing, SAC 2025
Pages	1296–1298
DOI	10.1145/3672608.3707887

Where it fits in the taxonomy

Mamba-PKD fits best under one teacher → multiple progressive students. It uses a teacher model and a set of progressively defined Mamba-based student models. The students are designed to provide different trade-offs between computational cost and classification accuracy.

Mamba teacher → S₁, S₂, S₃, ..., S₇

Main idea

The paper combines Progressive Knowledge Distillation with Mamba blocks. Mamba is based on selective state-space modeling, which allows efficient sequence processing. In the image-classification context, images are represented as sequences of pixel values or patch-like inputs, and Mamba blocks are used inside the student models.
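
To show what treating an image as a sequence can mean, the sketch below flattens a small grayscale image into a row-major sequence of non-overlapping square patches. This is a generic illustration under that assumption; the paper's actual preprocessing is not specified here and may differ. The function name is hypothetical.

```python
def image_to_patch_sequence(image, patch):
    """Split an H x W image (list of rows) into a sequence of flattened patches.

    Patches are non-overlapping, patch x patch in size, and ordered row-major.
    With patch=1 the result is simply the sequence of pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            sequence.append(token)
    return sequence


# Toy 4x4 image whose pixel values are 0..15 in reading order.
image = [[4 * r + c for c in range(4)] for r in range(4)]

patch_tokens = image_to_patch_sequence(image, patch=2)
pixel_tokens = image_to_patch_sequence(image, patch=1)
print(patch_tokens)
print(len(pixel_tokens))
```

The resulting list of tokens is the kind of one-dimensional input a sequence model consumes; larger patches give a shorter sequence with wider tokens.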

Why Mamba is relevant to KD

Efficiency: Mamba-style models are designed for scalable sequence processing.
Alternative to Transformers: state-space models can offer efficient long-range modeling without standard self-attention.
Student architecture design: KD performance depends not only on the teacher but also on whether the student architecture can absorb the teacher's behavior.
Progressive compression: multiple Mamba students can represent several levels of computational budget.

Reported experimental setting

Dataset	Teacher accuracy	Student behavior	Interpretation
MNIST	About 98%	The progressive students can approach or match the teacher depending on student size.	Promising result on a simple image-classification dataset.
CIFAR-10	About 87%	Larger students approach the teacher more closely than smaller students.	Shows the accuracy-efficiency trade-off more clearly.

How to position Mamba-PKD in a literature review: It is a modern example of progressive, architecture-aware knowledge distillation where the student family is built using Mamba/state-space blocks rather than conventional CNN or Transformer students.

Critical interpretation

Mamba-PKD is important because it shows how KD can be combined with newer efficient architectures. However, the current evidence should be interpreted carefully. The reported experiments focus on relatively small image-classification datasets such as MNIST and CIFAR-10. For a stronger claim about general image-classification compression, experiments on larger datasets and comparisons with strong CNN, Transformer, and hybrid baselines would be useful.

8. Applications of Knowledge Distillation

KD is widely used wherever there is a need to convert an accurate but expensive model into a cheaper deployable one. It is especially useful when inference cost matters more than training cost.

Application area	Why KD is useful	Example use
Computer vision	Compresses classifiers, object detectors, and segmentation models.	Distilling a large ResNet, ViT, or Mamba model into a smaller image classifier.
Natural language processing	Compresses large language models and pretrained Transformers.	Training a smaller BERT-like student from a large teacher model.
Speech recognition	Reduces acoustic model size and latency.	Distilling multiple acoustic teachers into a compact recognizer.
Edge AI	Supports deployment on mobile phones, embedded boards, and IoT devices.	Running student models under memory and energy constraints.
Medical AI	Accelerates inference while preserving diagnostic performance.	Compressing large imaging models for hospital deployment.
Autonomous systems	Reduces latency for real-time perception and decision-making.	Student models for object detection or scene understanding.
Ensemble compression	Replaces several teacher models with one student.	Compressing an ensemble into a single deployable network.

9. Challenges and Research Gaps

Although KD is powerful, it is not automatic. The success of distillation depends on teacher quality, student capacity, loss design, architecture compatibility, data availability, and evaluation methodology.

Capacity gap

A very small student may not be able to imitate a very large teacher. Teacher assistants or progressive KD can help.

Teacher quality

A biased or poorly calibrated teacher can transfer mistakes to the student.

Loss-function design

Different losses transfer different information. Logit matching, feature matching, and relation matching are not equivalent.

Architecture mismatch

A CNN teacher, Transformer teacher, and Mamba student may represent information differently, making transfer harder.

Multi-teacher disagreement

When teachers disagree, the student needs a strategy for weighting or selecting teacher signals.

Evaluation fairness

Accuracy alone is insufficient. Parameter count, FLOPs, latency, memory use, and energy should also be reported, together with the hardware on which they were measured.
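
As a small illustration of reporting cost alongside accuracy, the sketch below counts parameters and multiply-accumulate operations for two fully connected networks and derives a compression ratio. The layer sizes are hypothetical and are not taken from any paper cited here; latency, memory, and energy cannot be derived this way and would have to be measured on the target hardware.

```python
def dense_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


def dense_macs(layer_sizes):
    """Multiply-accumulate operations for one forward pass on one input."""
    return sum(n_in * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


teacher_layers = [784, 1200, 1200, 10]   # hypothetical teacher MLP
student_layers = [784, 64, 10]           # hypothetical student MLP

teacher_params = dense_params(teacher_layers)
student_params = dense_params(student_layers)
compression_ratio = teacher_params / student_params

print("teacher params:", teacher_params, "MACs:", dense_macs(teacher_layers))
print("student params:", student_params, "MACs:", dense_macs(student_layers))
print("compression ratio:", round(compression_ratio, 1))
```

Reporting these counts next to accuracy makes it possible to compare students that reach similar accuracy at very different costs.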

Important open questions

How can KD be made reliable when teacher and student architectures are very different?
How should multiple teachers be weighted when they disagree?
Can progressive KD scale beyond small datasets to large-scale image classification?
How should KD be evaluated fairly across CNN, Transformer, and Mamba-style students?
Can distillation preserve robustness, calibration, fairness, and uncertainty, not only accuracy?

10. Conclusion

Knowledge distillation is best understood as a general framework for transferring useful behavior from one or more knowledge sources into one or more efficient target models. The classical case is one teacher and one student, but the literature now includes multi-teacher, multi-student, mutual, self, online, assistant-based, and progressive forms of distillation.

A natural order for explaining KD is therefore: motivation → basic mechanism → mathematical objective → types of knowledge → teacher–student configurations → modern extensions → case study → applications → challenges.

Mamba-PKD is a useful modern case study because it demonstrates how KD can be adapted to new architecture families. It shows that the future of KD is not only about better losses, but also about designing student architectures that are naturally efficient, scalable, and suitable for deployment.

References and Key Papers

Buciluă, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model Compression. KDD 2006.
Ba, J., & Caruana, R. (2014). Do Deep Nets Really Need to be Deep? NeurIPS 2014.
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2015). FitNets: Hints for Thin Deep Nets. ICLR 2015.
Zagoruyko, S., & Komodakis, N. (2017). Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. ICLR 2017.
Zhang, Y., Xiang, T., Hospedales, T. M., & Lu, H. (2018). Deep Mutual Learning. CVPR 2018.
Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., & Anandkumar, A. (2018). Born Again Neural Networks. ICML 2018.
Lan, X., Zhu, X., & Gong, S. (2018). Knowledge Distillation by On-the-Fly Native Ensemble. NeurIPS 2018.
Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., & Ghasemzadeh, H. (2020). Improved Knowledge Distillation via Teacher Assistant. AAAI 2020.
Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge Distillation: A Survey. International Journal of Computer Vision.
Medina, J., Hadachi, A., Honeine, P., & Bensrhair, A. (2025). Mamba-PKD: A Framework for Efficient and Scalable Model Compression in Image Classification. SAC 2025, 1296–1298. DOI: 10.1145/3672608.3707887.