1. Motivation: Why Progressive Neural Networks Exist
Standard deep neural networks are usually trained for one task or fine-tuned from one task to another. This works well when only the final task matters, but it creates a major problem in continual learning: learning a new task can overwrite parameters that were important for previous tasks.
This problem is known as catastrophic forgetting. If a network is trained on Task A and then updated on Task B, the same weights are reused and modified. Unless special protection is used, performance on Task A can collapse even if Task B is learned successfully.
Why this matters
- Continual learning: agents often encounter tasks sequentially rather than all at once.
- Transfer learning: knowledge from previous tasks can accelerate learning on a new task.
- No access to old data: in many settings, old training data cannot be stored or replayed.
- Reinforcement learning: exploration can be expensive, so reusing prior skills can improve sample efficiency.
- Sim-to-real learning: knowledge learned in simulation can guide learning in the real world.
2. Basic Concept of Progressive Neural Networks
The core idea is to separate tasks into different columns. A column is a complete neural network or a major branch of a network. When Task 1 is learned, column 1 is trained normally. When Task 2 arrives, column 1 is frozen and column 2 is created. Column 2 receives lateral inputs from column 1. When Task 3 arrives, columns 1 and 2 are frozen, and column 3 receives lateral inputs from both previous columns.
Figure: each task is assigned its own column (Task 1 → Column 1, Task 2 → Column 2, Task 3 → Column 3, ..., Task K → Column K).
| Component | Meaning | Role in PNNs |
|---|---|---|
| Column | A task-specific neural network branch. | Stores the representation and policy/classifier for one task. |
| Frozen parameters | Weights from previously learned columns. | Prevent catastrophic forgetting by blocking updates to old tasks. |
| Lateral connections | Connections from old columns into the new column. | Allow old features to support learning on the new task. |
| Adapter layers | Projection or transformation layers on lateral connections. | Control dimensionality, scaling, and compatibility of transferred features. |
| Task identity | Information about which task is currently being solved. | Usually needed at inference time to choose the correct column or output head. |
Main intuition
Progressive Neural Networks avoid the central conflict of continual learning. They do not force a single parameter set to serve every task. Instead, each task receives new capacity, while previous task solutions are preserved and made available as feature sources.
3. Architecture: Columns, Freezing, and Lateral Transfer
A Progressive Neural Network with multiple tasks can be viewed as a growing grid of neural-network columns. Vertically, each column contains layers. Horizontally, columns are added over time. Lateral connections allow the new column to access hidden activations from previous columns at corresponding or nearby depths.
3.1 Single task
For the first task, the network behaves like a normal deep model. There are no previous columns, so no lateral transfer is available.
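As a sketch, assuming a column with two hidden layers, an input x, within-column weight matrices W, a nonlinearity f, and omitted bias terms (the symbols x, W, and f are illustrative and not fixed by this subsection), the single-column forward pass can be written as:

```latex
h_1^{(1)} = f\left(W_1^{(1)} x\right), \qquad
h_2^{(1)} = f\left(W_2^{(1)} h_1^{(1)}\right)
```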
In this notation, h₁⁽¹⁾ means the hidden activation at layer 1 of column 1, and h₂⁽¹⁾ means the hidden activation at layer 2 of column 1. More generally, hᵢ⁽ᵏ⁾ denotes the activation of layer i in task column k. The parenthesized superscript identifies the task column (it is not an exponent), while the subscript identifies the layer inside that column.
| Notation | Meaning | Interpretation in subsection 3.1 |
|---|---|---|
| h₁⁽¹⁾ | Activation of layer 1 in column 1. | The first hidden representation learned for Task 1. |
| h₂⁽¹⁾ | Activation of layer 2 in column 1. | A deeper hidden representation built from h₁⁽¹⁾. |
| hᵢ⁽ᵏ⁾ | Activation of layer i in column k. | General notation used later for any layer and any task column. |
3.2 Second task
For the second task, the first column remains frozen. The second column learns new within-column weights, while also learning how to use features coming from the first column.
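A minimal two-column sketch, under the same illustrative assumptions as a two-hidden-layer column (input x, nonlinearity f, no bias terms) and with a single lateral matrix U feeding layer 1 of the frozen column into layer 2 of the new column, might look like this:

```latex
h_1^{(2)} = f\left(W_1^{(2)} x\right), \qquad
h_2^{(2)} = f\left(W_2^{(2)} h_1^{(2)} + U_2^{(2:1)} h_1^{(1)}\right)
```

Here only the parameters carrying the superscript (2) or (2:1) are trained; h₁⁽¹⁾ is computed by the frozen first column and treated as a fixed input.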
3.3 Many tasks
For task K, all previous task columns can provide lateral features to the new column. This creates rich transfer opportunities, but also increases memory and computation.
Design choices
- Column width: Each new task can receive a full-width network, a smaller branch, or a dynamically sized module. (Tags: capacity, scalability)
- Lateral depth: Lateral connections may be inserted at every layer or only at selected layers. (Tags: feature reuse, efficiency)
- Adapter type: Adapters may be linear projections, nonlinear layers, or 1×1 convolutions in convolutional models. (Tags: projection, conditioning)
- Output structure: Each task may have its own classifier, policy head, or value-function head. (Tags: task-specific, inference)
4. Mathematical Formulation
Let T₁, T₂, ..., T_K be a sequence of K tasks. For each task k, the model creates a new column with parameters W⁽ᵏ⁾. The parameters of older columns W⁽¹⁾, ..., W⁽ᵏ⁻¹⁾ are frozen. Lateral parameters U⁽ᵏ:ʲ⁾, with j < k, connect each older column j to the current column k.
4.1 Standard layer equation
For layer i in column k, a simplified Progressive Neural Network layer can be written as:
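The equation implied by the symbol definitions in this subsection, reconstructed here with bias terms omitted as a simplifying assumption, is:

```latex
h_i^{(k)} = f\left( W_i^{(k)} h_{i-1}^{(k)} + \sum_{j<k} U_i^{(k:j)} h_{i-1}^{(j)} \right)
```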
| Symbol | Meaning |
|---|---|
| hᵢ⁽ᵏ⁾ | Activation of layer i in the current task column k. |
| f | Nonlinear activation function, such as ReLU. |
| Wᵢ⁽ᵏ⁾ | Within-column weight matrix for layer i of task k. |
| Uᵢ⁽ᵏ:ʲ⁾ | Lateral weight matrix from previous column j into current column k. |
| hᵢ₋₁⁽ʲ⁾ | Activation from layer i − 1 of an older frozen column j. |
| Σⱼ<ₖ | Sum over all previously learned task columns. |
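The layer defined by these symbols can be sketched in a few lines of NumPy. This is an illustrative example rather than a reference implementation: the function name `pnn_layer`, its argument names, the omission of biases, and the toy numbers are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU, one common choice for the nonlinearity f."""
    return np.maximum(z, 0.0)

def pnn_layer(h_prev_own, h_prev_frozen, W, U_list, f=relu):
    """Compute h_i^(k) = f(W h_{i-1}^(k) + sum_{j<k} U^(k:j) h_{i-1}^(j)).

    h_prev_own    : activation h_{i-1}^(k) of the current column
    h_prev_frozen : list of activations h_{i-1}^(j), one per older column j < k
    W             : within-column weight matrix W_i^(k)
    U_list        : lateral matrices U_i^(k:j), aligned with h_prev_frozen
    """
    z = W @ h_prev_own
    for U, h_j in zip(U_list, h_prev_frozen):
        z = z + U @ h_j  # lateral contribution from frozen column j
    return f(z)

# Toy example: current column k = 2 with a single older column j = 1.
W = np.array([[1.0, -1.0], [0.5, 0.5]])
U = np.eye(2)
h_own = np.array([2.0, 1.0])
h_old = np.array([0.5, -3.0])

# Pre-activation is [1.0, 1.5] + [0.5, -3.0] = [1.5, -1.5], so ReLU gives [1.5, 0.0].
h_new = pnn_layer(h_own, [h_old], W, [U])
print(h_new)
```

With empty lists for `h_prev_frozen` and `U_list`, the function reduces to an ordinary dense layer, which matches the single-task case of the first column.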
4.2 Objective for supervised learning
For a supervised task k with dataset Dₖ, the trainable parameters are the new column parameters and the lateral adapters into that column. Older columns are fixed.
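One way to write this objective, as a sketch, is shown below. The per-example loss ℓ (for instance cross-entropy), the prediction ŷ⁽ᵏ⁾ of column k, and the shorthand θ⁽ᵏ⁾ for the trainable parameters are notation assumed here for illustration.

```latex
\theta^{(k)} = \left\{ W^{(k)} \right\} \cup \left\{ U^{(k:j)} : j < k \right\},
\qquad
\min_{\theta^{(k)}} \; \frac{1}{|D_k|} \sum_{(x,\,y) \in D_k}
\ell\!\left( y,\; \hat{y}^{(k)}\!\left(x;\, \theta^{(k)},\, W^{(1)}, \dots, W^{(k-1)}\right) \right)
```

The older parameters W⁽¹⁾, ..., W⁽ᵏ⁻¹⁾ appear as arguments of the prediction but not under the minimization, which is what "frozen" means in this setting.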
4.3 Objective for reinforcement learning
In reinforcement learning, the output of a column may define a policy π⁽ᵏ⁾(a|s) and possibly a value function V⁽ᵏ⁾(s). The objective is to maximize expected discounted return for the current task.
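Written out in standard form, with the discount factor γ, the reward r_t at step t, and a trajectory τ sampled by following the policy (all three symbols are notation assumed here, not taken from the text), the objective can be sketched as:

```latex
J\!\left(\theta^{(k)}\right)
= \mathbb{E}_{\tau \sim \pi^{(k)}} \left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right],
\qquad
\theta^{(k)\,*} = \arg\max_{\theta^{(k)}} J\!\left(\theta^{(k)}\right)
```

As in the supervised case, only the parameters of column k and its lateral connections are optimized.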
4.4 Adapter formulation
The original Progressive Neural Network paper uses adapter modules to transform old-column activations before adding them to the new column. A simplified adapter-based form is:
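The form below follows the adapter equation as recalled from Rusu et al. (2016), rewritten with this document's f for the nonlinearity. It should be read as a sketch: V is a projection matrix that reduces the dimensionality of the concatenated old-column activations, and α is a learned scalar that rescales them.

```latex
h_i^{(k)} = f\left( W_i^{(k)} h_{i-1}^{(k)}
  + U_i^{(k:j)} \, f\!\left( V_i^{(k:j)} \, \alpha_{i-1}^{(<k)} \, h_{i-1}^{(<k)} \right) \right),
\qquad
h_{i-1}^{(<k)} = \left[ h_{i-1}^{(1)}, \dots, h_{i-1}^{(k-1)} \right]
```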
5. Training Procedure
Progressive Neural Networks follow a sequential training procedure. The network does not need simultaneous access to all tasks. It learns one task, freezes that task's column, then adds a new column for the next task.
Algorithmic view
| Step | Operation | Why it is important |
|---|---|---|
| Initialize new column | Add a fresh network branch for the new task. | Provides new capacity for task-specific learning. |
| Freeze previous columns | Prevent gradient updates to old task parameters. | Preserves old task performance. |
| Add lateral connections | Connect hidden layers of old columns to the new column. | Enables forward transfer. |
| Train current task | Optimize only new and lateral parameters. | Learns the current task while reusing previous features. |
| Freeze current column | After training, make the current column immutable. | Turns the learned task into a source for future tasks. |
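The steps in the table can be sketched with a small NumPy regression model. Everything here is a simplifying assumption made for illustration: the class names `Column` and `ProgressiveNet`, the one-hidden-layer columns, the lateral connections that feed old hidden features only into the new output layer, the squared-error loss, and the hand-written gradients.

```python
import numpy as np

class Column:
    """One column: hidden = tanh(X @ W1); the output adds lateral terms from older columns."""
    def __init__(self, n_in, n_hidden, n_prev, rng):
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))       # within-column, layer 1
        self.w2 = rng.normal(0.0, 0.5, n_hidden)                # within-column, output layer
        self.U = [np.zeros(n_hidden) for _ in range(n_prev)]    # laterals, one per older column

    def hidden(self, X):
        return np.tanh(X @ self.W1)

class ProgressiveNet:
    def __init__(self, n_in, n_hidden, seed=0):
        self.n_in, self.n_hidden = n_in, n_hidden
        self.rng = np.random.default_rng(seed)
        self.columns = []

    def predict(self, X, k):
        """Output of column k (0-based), using hidden features of columns j < k."""
        col = self.columns[k]
        y = col.hidden(X) @ col.w2
        for j, u in enumerate(col.U):
            y = y + self.columns[j].hidden(X) @ u
        return y

    def add_task(self, X, t, lr=0.05, steps=300):
        # Initialize a new column with lateral connections to all older columns.
        old = list(self.columns)
        col = Column(self.n_in, self.n_hidden, len(old), self.rng)
        # Older columns are frozen: their features are computed once and never updated.
        old_H = [c.hidden(X) for c in old]
        for _ in range(steps):
            H = col.hidden(X)
            y = H @ col.w2 + sum(Hj @ u for Hj, u in zip(old_H, col.U))
            dy = (y - t) / len(t)                  # gradient of 0.5 * mean squared error
            dH = np.outer(dy, col.w2)
            grad_W1 = X.T @ (dH * (1.0 - H ** 2))  # backprop through tanh
            grad_w2 = H.T @ dy
            grad_U = [Hj.T @ dy for Hj in old_H]
            # Only the new column and its lateral weights receive updates.
            col.W1 -= lr * grad_W1
            col.w2 -= lr * grad_w2
            col.U = [u - lr * g for u, g in zip(col.U, grad_U)]
        self.columns.append(col)  # from now on this column is treated as frozen
        return len(self.columns) - 1

# Two toy regression tasks on the same inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
task_a = np.sin(X[:, 0])
task_b = np.sin(X[:, 0]) + 0.5 * X[:, 1]

net = ProgressiveNet(n_in=3, n_hidden=8)
net.add_task(X, task_a)
pred_a_before = net.predict(X, 0).copy()
net.add_task(X, task_b)
pred_a_after = net.predict(X, 0)
```

Because training the second column never touches the first column's parameters, `pred_a_before` and `pred_a_after` are identical. In a framework with automatic differentiation, the same effect is usually obtained by excluding old parameters from the optimizer or disabling their gradients.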
What is trained and what is frozen?
- Frozen: All parameters inside columns associated with previous tasks. (Tags: no forgetting, old skills preserved)
- Trainable: The new task column and lateral/adaptation connections into that column. (Tags: new learning, forward transfer)
6. Transfer Analysis: How PNNs Reuse Knowledge
A major advantage of Progressive Neural Networks is that they make transfer more interpretable than ordinary fine-tuning. Since each previous task has its own frozen column, researchers can inspect how much the new task depends on old-column features.
6.1 Forward transfer
Forward transfer occurs when earlier tasks improve learning on later tasks. In PNNs, forward transfer happens through lateral connections. If old visual features, control policies, or intermediate abstractions are useful, the new column can exploit them.
6.2 No destructive interference
Because old columns are frozen, new gradients cannot damage old parameters. This avoids the destructive interference that occurs in normal sequential fine-tuning.
6.3 Average Perturbation Sensitivity
Average Perturbation Sensitivity estimates how important a feature or layer is by perturbing it and measuring the resulting performance drop. If perturbing an old-column feature severely damages the new task, the new task is relying on that transferred feature.
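The perturb-and-measure idea can be illustrated with a toy NumPy example. This is a simplified sketch, not the exact protocol of the original paper: the function name `perturbation_sensitivity`, the use of a loss increase as the performance measure, the fixed noise level, and the linear readout are all assumptions.

```python
import numpy as np

def perturbation_sensitivity(loss_fn, features, column, noise_std=1.0, n_trials=20, seed=0):
    """Average increase in loss when Gaussian noise is added to one column's features.

    loss_fn  : hypothetical callable mapping a list of per-column feature arrays to a scalar loss
    features : list of arrays, features[j] holding the activations of column j
    column   : index of the column whose features are perturbed
    """
    rng = np.random.default_rng(seed)
    baseline = loss_fn(features)
    increases = []
    for _ in range(n_trials):
        noisy = list(features)
        noisy[column] = features[column] + rng.normal(0.0, noise_std, features[column].shape)
        increases.append(loss_fn(noisy) - baseline)
    return float(np.mean(increases))

# Toy setup: two frozen source columns; the new task's readout ignores column 0
# (zero lateral weights) and relies entirely on column 1.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(32, 4)), rng.normal(size=(32, 4))]
u0 = np.zeros(4)
u1 = np.array([1.0, -2.0, 0.5, 0.0])
target = feats[1] @ u1

def loss_fn(fs):
    pred = fs[0] @ u0 + fs[1] @ u1
    return float(np.mean((pred - target) ** 2))

s0 = perturbation_sensitivity(loss_fn, feats, column=0)  # unused column: no effect
s1 = perturbation_sensitivity(loss_fn, feats, column=1)  # relied-upon column: large effect
```

A column whose features are not used shows zero sensitivity, while a column the readout depends on shows a clear loss increase, which is the qualitative signal the analysis looks for.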
6.4 Average Fisher Sensitivity
Average Fisher Sensitivity estimates how sensitive the policy is to changes in normalized hidden activations. It is useful in reinforcement-learning settings because it links representation importance to the policy distribution.
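As recalled from Rusu et al. (2016), and offered here only as a sketch whose exact normalization should be checked against the paper, the measure is built from a diagonal Fisher term of the policy π with respect to the normalized activations ĥ of layer i in column k, where m indexes a feature and ρ(s, a) is the state-action distribution:

```latex
\hat{F}_i^{(k)} = \mathbb{E}_{\rho(s,a)} \left[
  \frac{\partial \log \pi}{\partial \hat{h}_i^{(k)}}
  \left( \frac{\partial \log \pi}{\partial \hat{h}_i^{(k)}} \right)^{\!\top}
\right],
\qquad
\mathrm{AFS}(i, k, m) = \frac{\hat{F}_i^{(k)}(m, m)}{\sum_{k'} \hat{F}_i^{(k')}(m, m)}
```

Under this reading, a value of AFS near 1 for an old column means the policy at that layer depends mostly on transferred features, while a value near 0 means it depends mostly on newly learned ones.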
7. Applications of Progressive Neural Networks
Progressive Neural Networks were originally studied mainly in reinforcement learning and transfer learning, but the architectural idea applies more broadly to any sequential task-learning setting where old tasks must be preserved.
| Application area | Why PNNs are useful | Example use |
|---|---|---|
| Reinforcement learning | Old policies and representations can speed up learning of new environments. | Learning multiple Atari games or game variants sequentially. |
| Robotics | Skills learned in one environment can be reused in another. | Transferring from simulated robot training to real robot control. |
| Sim-to-real transfer | Simulation provides cheap experience, while real-world learning adapts with preserved simulation features. | Robot reaching or manipulation from pixel inputs. |
| Computer vision | Earlier visual feature extractors may support later recognition tasks. | Sequential image-classification tasks with task-specific heads. |
| Lifelong learning | The system accumulates task-specific modules over time. | Agents that learn a curriculum of related tasks. |
| Multi-domain learning | Each domain can receive a dedicated column while sharing useful representations laterally. | Different domains, sensors, or data distributions learned sequentially. |
Original experimental directions
- Atari: PNNs were evaluated on transfer across Atari games, showing both positive and negative transfer depending on the source-target pair. (Tag: reinforcement learning)
- Pong variants: Game variants allowed controlled analysis of how features transfer between related tasks. (Tag: task variants)
- 3D Labyrinth: Navigation tasks demonstrated transfer in visually rich reinforcement-learning environments. (Tag: navigation)
9. Challenges and Research Gaps
Progressive Neural Networks are conceptually simple and powerful, but they introduce important practical and theoretical challenges. Many later methods can be understood as attempts to preserve the benefits of PNNs while reducing their costs.
- Parameter growth: Adding a new column per task can make the model very large as the number of tasks increases. (Tags: memory, scalability)
- Lateral-connection cost: If every new column connects to all previous columns, lateral modules can grow rapidly. (Tags: O(K²), computation)
- Task identity requirement: At inference time, the system often needs to know which task column or output head to use. (Tags: task-incremental, routing)
- Negative transfer: Previous features can hurt the new task if they create a misleading inductive bias. (Tags: source selection, robustness)
- No backward transfer: Old tasks do not automatically improve from knowledge learned in later tasks because old columns are frozen. (Tags: backward transfer, asymmetry)
- Task-boundary assumption: PNNs work naturally when tasks arrive in clear stages, but less naturally in task-free continuous streams. (Tags: online learning, non-stationarity)
Important open questions
- How can PNN-like models scale to hundreds or thousands of tasks?
- Can the model automatically decide when a new column is needed?
- Can lateral transfer be sparsified so only useful source columns are connected?
- How can task identity be inferred automatically at inference time?
- Can later tasks improve older tasks without damaging their original performance?
- Can compression preserve the anti-forgetting guarantees of the original PNN design?
10. Limitations and Critical Interpretation
The main strength of Progressive Neural Networks is also their main weakness. By protecting old tasks with frozen columns, they avoid catastrophic forgetting. But by allocating new capacity for every task, they can become inefficient.
10.1 Memory and computation
If each task receives a full column, total parameters grow roughly linearly with the number of task columns. If dense lateral connections are used from every previous column to every new column, the lateral parameter count can grow approximately quadratically.
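The growth rates can be made concrete with a small counting sketch. It assumes that every column shares the same layer widths, that biases and adapters are ignored, and that each new column receives dense lateral matrices from every older column at every layer after the first; the function name `pnn_parameter_counts` is hypothetical.

```python
def pnn_parameter_counts(num_tasks, layer_widths):
    """Return (within_column_params, lateral_params) for a dense PNN.

    layer_widths : [input, hidden_1, ..., output], shared by all columns
    """
    # Weights inside a single column: one matrix per consecutive pair of layers.
    within_per_column = sum(a * b for a, b in zip(layer_widths[:-1], layer_widths[1:]))
    # Laterals from one older column into one newer column: layer i-1 of the old
    # column feeds layer i of the new column, for every layer after the first.
    lateral_per_pair = sum(a * b for a, b in zip(layer_widths[1:-1], layer_widths[2:]))
    within = num_tasks * within_per_column
    # Number of (older, newer) column pairs is K * (K - 1) / 2.
    lateral = (num_tasks * (num_tasks - 1) // 2) * lateral_per_pair
    return within, lateral

widths = [10, 20, 20, 5]
for k in (1, 2, 4, 10):
    print(k, pnn_parameter_counts(k, widths))
# 1 (700, 0)
# 2 (1400, 500)
# 4 (2800, 3000)
# 10 (7000, 22500)
```

In this toy configuration the lateral parameters already exceed the within-column parameters at four tasks, which illustrates why sparsifying or pruning lateral connections is a recurring theme in follow-up work.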
10.2 Inference-time routing
PNNs are easiest to use in task-incremental learning, where the task label is known at test time. In class-incremental or task-free settings, the model must also solve a routing problem: which column should handle the input?
10.3 Transfer is not always positive
Lateral features are useful only when previous tasks contain relevant information. If source and target tasks differ strongly, transfer may be weak or negative. This means that source-task selection and lateral-connection design are important.
10.4 No compact global representation
Because knowledge is distributed across task-specific columns, the network does not automatically merge all experience into one compact shared model. This motivated later work on progress-and-compress strategies.
When PNNs are appropriate
Good fit
- Moderate number of tasks.
- Task labels are known at inference.
- Old-task forgetting is unacceptable.
- Old data cannot be stored.
- Forward transfer is valuable.
Poor fit
- Very long task sequences.
- Strict memory or latency limits.
- Unknown task identity at inference.
- Task-free online streams.
- Need for strong backward transfer.
11. Conclusion
Progressive Neural Networks are best understood as an architectural solution to continual learning. They preserve old tasks by freezing old columns, learn new tasks by adding new columns, and enable forward transfer through lateral connections from previous representations.
Their major contribution is the clean separation between preservation and adaptation. Preservation is achieved by freezing old columns. Adaptation is achieved by training a new column and learning how to use lateral features. This makes PNNs highly resistant to catastrophic forgetting and useful for sequential transfer learning.
However, their scalability limitations are serious. Memory growth, computation growth, task-label requirements, and lack of backward transfer limit their use as a universal continual-learning solution. For this reason, PNNs are often treated as a foundational architecture that inspired later methods such as dynamic expansion, sparse routing, and progress-and-compress models.
References and Key Papers
- Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., & Hadsell, R. (2016). Progressive Neural Networks. arXiv:1606.04671. https://arxiv.org/abs/1606.04671
- Rusu, A. A., Večerík, M., Rothörl, T., Heess, N., Pascanu, R., & Hadsell, R. (2017). Sim-to-Real Robot Learning from Pixels with Progressive Nets. Conference on Robot Learning. https://proceedings.mlr.press/v78/rusu17a.html
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N. C., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., & Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1611835114
- Li, Z., & Hoiem, D. (2017). Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/1606.09282
- Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., & Wierstra, D. (2017). PathNet: Evolution Channels Gradient Descent in Super Neural Networks. arXiv:1701.08734. https://arxiv.org/abs/1701.08734
- Lopez-Paz, D., & Ranzato, M. (2017). Gradient Episodic Memory for Continual Learning. NeurIPS 2017. https://proceedings.neurips.cc/paper/2017/hash/f87522788a2be2d171666752f97ddebb-Abstract.html
- Yoon, J., Yang, E., Lee, J., & Hwang, S. J. (2018). Lifelong Learning with Dynamically Expandable Networks. ICLR 2018. https://arxiv.org/abs/1708.01547
- Schwarz, J., Czarnecki, W. M., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., & Hadsell, R. (2018). Progress & Compress: A scalable framework for continual learning. ICML 2018. https://arxiv.org/abs/1805.06370
- Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., & Wermter, S. (2019). Continual lifelong learning with neural networks: A review. Neural Networks. https://doi.org/10.1016/j.neunet.2019.01.012
- De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2021). A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2021.3057446
- Wang, L., Zhang, X., Su, H., & Zhu, J. (2024). A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://arxiv.org/abs/2302.00487