Understanding ResNet: How Deep Networks Became Trainable

Deep learning models have generally improved as they become deeper. The intuition is simple: more layers allow the model to learn more complex patterns and richer representations.

However, beyond a certain depth, adding layers starts to hurt performance instead of improving it. This observation motivated the work presented in Deep Residual Learning for Image Recognition (He et al., 2015), where the authors focused on understanding why deeper networks fail and how to address the issue effectively.

Why Increasing Depth Was Not Working

In theory, a deeper network should perform at least as well as a shallower one. If additional layers are not useful, the network could simply learn identity mappings and behave like a smaller model.

However, this does not happen in practice.

As networks become deeper:

  • Training error increases
  • Optimization becomes more difficult
  • Gradients weaken as they propagate through many layers

This is referred to as the degradation problem. Importantly, this is not caused by overfitting. Instead, it reflects the difficulty of optimizing deep networks, even when they have sufficient capacity.

The authors showed that deeper models can have higher training error than their shallower counterparts, which clearly indicates an optimization issue rather than a generalization problem.

How ResNet Solves This Problem

To address this, the authors proposed a simple reformulation of the learning objective.

Instead of directly learning a mapping H(x), the network learns a residual function:

F(x) = H(x) − x

which leads to:

H(x) = F(x) + x

This means the network learns how the output differs from the input, rather than learning the full transformation from scratch.
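As a toy illustration (not from the paper), here is a residual layer in NumPy, where the weight matrix W is a hypothetical stand-in for the block's stack of convolutional layers:

```python
import numpy as np

def residual_layer(x, W):
    """Toy residual layer computing H(x) = F(x) + x.

    Here F is a single linear map W @ x; in ResNet, F is a small
    stack of convolution / batch-norm / ReLU layers.
    """
    F_x = W @ x       # the residual function F(x)
    return F_x + x    # shortcut: add the input back

x = np.array([1.0, 2.0])
W = np.zeros((2, 2))          # if F learns nothing (all-zero weights)...
print(residual_layer(x, W))   # ...the block reduces to identity: [1. 2.]
```

Note the useful default this buys: when the all-zero solution is the right one, the block is exactly the identity, which is the hard-to-learn case for a plain stack of layers.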

To implement this idea, the architecture introduces shortcut (skip) connections, where the input is directly added to the output of a few layers.

The paper also discusses two types of shortcut connections:

  • Identity shortcuts (no extra parameters)
  • Projection shortcuts (used when dimensions change)

This design keeps the shortcut parameter-free whenever the input and output dimensions match, while projections provide the flexibility needed when they do not.
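A minimal sketch of the two shortcut types in NumPy; the fixed averaging matrix below is a hypothetical stand-in for the paper's learned 1×1 projection convolution:

```python
import numpy as np

def shortcut(x, out_dim):
    """Identity shortcut when dimensions match, projection otherwise.

    The projection weight here is a fixed matrix for illustration;
    in the paper it is a learned 1x1 convolution.
    """
    in_dim = x.shape[0]
    if in_dim == out_dim:
        return x                                    # identity: no extra parameters
    W_proj = np.ones((out_dim, in_dim)) / in_dim    # stand-in for the 1x1 conv
    return W_proj @ x                               # match the new dimension

x = np.array([1.0, 2.0, 3.0])
print(shortcut(x, 3).shape)   # (3,)  identity path
print(shortcut(x, 6).shape)   # (6,)  projection path
```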

Architecture

(Figure from the paper: the overall ResNet architecture and its residual blocks.)

The fundamental component of ResNet is the residual block.

Each block contains a sequence of convolutional layers along with a shortcut connection that adds the input directly to the output. This addition ensures that information is preserved across layers instead of being repeatedly transformed.

The architecture is built by stacking these residual blocks. The paper presents several variants, including ResNet-18, ResNet-34, ResNet-50, and deeper models.

In deeper variants such as ResNet-50, the authors use a bottleneck design, where 1×1 convolutions are introduced to reduce and then restore dimensions. This improves computational efficiency while allowing deeper networks to be constructed.
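The efficiency gain of the bottleneck can be checked with a quick parameter count. The channel sizes below (256 → 64 → 256) follow the paper's ResNet-50 blocks; biases and batch-norm parameters are ignored for simplicity:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

# Plain design: two 3x3 convs at 256 channels.
plain = 2 * conv_params(256, 256, 3)

# Bottleneck: 1x1 reduce to 64, 3x3 at 64, 1x1 restore to 256.
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(plain, bottleneck)   # 1179648 69632: roughly 17x fewer weights per block
```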

Why This Approach Improves Optimization

The residual formulation changes how the network behaves during training.

First, shortcut connections provide direct paths for gradients to flow backward through the network. This helps reduce the vanishing gradient problem and stabilizes training.

Second, learning residual functions is often easier than learning full mappings. If the desired mapping is close to identity, the network only needs to learn small deviations.

Third, the network can effectively preserve information across layers through identity mappings, which reduces the risk of degradation as depth increases.

The paper also shows that residual networks are easier to optimize compared to plain networks with the same number of layers.
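The gradient argument can be made concrete with a finite-difference check. For a residual step y = F(x) + x, the derivative is F′(x) + 1, so the identity path contributes a gradient of 1 even when the learned branch is nearly flat (a toy scalar sketch, not from the paper):

```python
def grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

F = lambda x: 0.001 * x ** 2      # a nearly flat residual branch
plain = lambda x: F(x)            # plain layer: y = F(x)
residual = lambda x: F(x) + x     # residual layer: y = F(x) + x

print(round(grad(plain, 1.0), 3))     # 0.002: gradient nearly vanishes
print(round(grad(residual, 1.0), 3))  # 1.002: shortcut keeps it near 1
```

Stacked over many layers, the plain gradient shrinks multiplicatively, while each shortcut preserves a direct additive path back to the input.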

How the Model Is Trained

The models in the paper are trained using standard techniques, including:

  • Stochastic Gradient Descent (SGD)
  • Batch Normalization
  • Data Augmentation

No special optimization tricks are required. The improvement in performance primarily comes from the architectural design rather than changes in the training process.

This highlights that the difficulty in training deep networks was largely due to how they were structured, not due to limitations in optimization algorithms.
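To see why the residual form cooperates with plain SGD, consider fitting a target mapping H(x) = 3x with a one-parameter residual model y = w·x + x, so the learnable part only has to reach F(x) = 2x. This is a toy sketch under those assumptions; the paper trains real networks with momentum SGD, batch normalization, and augmentation:

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.normal(size=32)
y_data = 3.0 * x_data          # target mapping H(x) = 3x

w, lr = 0.0, 0.1
for _ in range(200):
    pred = w * x_data + x_data                    # residual form: F(x) + x
    grad_w = 2 * np.mean((pred - y_data) * x_data)
    w -= lr * grad_w                              # plain SGD step

print(w)   # converges to ~2.0: only the deviation from identity is learned
```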

Results and Key Observations

The experiments in the paper clearly show that increasing depth alone does not guarantee better performance.

The authors compare plain networks with residual networks of the same depth and observe a clear difference in behavior. While plain networks become harder to train as they get deeper, residual networks continue to improve.

Some key observations from the paper are:

  • Deeper residual networks consistently perform better than their shallower counterparts
  • Plain networks (without shortcut connections) show higher training error as depth increases
  • Residual networks do not suffer from the degradation problem

These observations are important because they show that the issue with deep networks was not overfitting, but the difficulty of optimization.

Using this approach, the authors were able to successfully train very deep networks and achieve state-of-the-art performance on the ImageNet dataset. Their model also won the ILSVRC 2015 competition, which highlights how effective this idea is in practice.

The paper also demonstrates that residual learning is useful beyond classification. It can be applied to other tasks such as object detection, where it again leads to strong performance.

Overall, the results make it clear that residual learning solves a fundamental problem in deep network training and allows models to scale to much greater depths.

Final Thoughts

ResNet demonstrates that the main challenge in deep learning was not simply increasing model capacity, but making deep models trainable.

By introducing residual learning and shortcut connections, the paper provides a simple yet effective solution to the degradation problem. This allows networks to become significantly deeper while maintaining stable training behavior.

The idea of learning residual functions fundamentally changes how deep networks are designed and has had a lasting impact on the field.

References

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In CVPR, 2016.