Classic Networks: LeNet-5, AlexNet, VGG
- Why Look at Classic Networks?
- LeNet-5 (1998, Yann LeCun)
- AlexNet (2012, Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton)
- VGG Networks (2014, Visual Geometry Group, Oxford)
- Summary Table
- Final Thoughts
In the early stages of deep learning and computer vision, several foundational convolutional neural network (CNN) architectures shaped the field and enabled significant breakthroughs in image recognition. In this document, we explore three of the most historically significant and technically influential networks: LeNet-5, AlexNet, and VGG.
These architectures demonstrate the progression of CNN design from shallow, simple models to deeper, more powerful systems capable of scaling to large datasets like ImageNet.
Why Look at Classic Networks?
Understanding classic CNN architectures is essential for the following reasons:
- They introduce fundamental building blocks (e.g., convolutional layers, pooling layers, ReLU activation).
- They highlight challenges faced at different stages of deep learning evolution (e.g., overfitting, vanishing gradients).
- They provide insights into the design philosophy of modern deep architectures.
LeNet-5 (1998, Yann LeCun)
Overview
LeNet-5 was one of the earliest CNN models, designed to recognize handwritten digits (e.g., the MNIST dataset). It demonstrated that learned convolutional filters, combined with weight sharing, can achieve strong recognition accuracy with a small number of parameters.
Architecture

- Input: 32x32 grayscale image
- C1: Convolutional layer with 6 filters of size 5x5 → output: 28x28x6
- S2: Subsampling (average pooling) layer → output: 14x14x6
- C3: Convolutional layer with 16 filters of size 5x5 → output: 10x10x16
- S4: Subsampling layer → output: 5x5x16
- C5: Convolutional layer with 120 filters of size 5x5 (equivalent to a fully connected layer at this input size) → output: 120
- F6: Fully connected layer → output: 84
- Output: 10-class softmax layer
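The layer list above translates almost line for line into a small PyTorch module. The following is a minimal sketch rather than a faithful reproduction: the original paper used scaled tanh activations and trainable subsampling layers, which are approximated here with plain Tanh and average pooling, and the softmax is left to the loss function.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Minimal LeNet-5 sketch; layer sizes follow the list above."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),        # S2: 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),    # C3: 14x14x6 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),        # S4: 10x10x16 -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5),  # C5: 5x5x16 -> 1x1x120
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                 # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),         # 10-class output (softmax applied in the loss)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Sanity check on a dummy 32x32 grayscale batch
print(LeNet5()(torch.zeros(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```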
Parameters
LeNet-5 uses shared convolutional weights, which keeps it at roughly 60,000 trainable parameters, far fewer than a comparable fully connected network would require.
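For a sense of scale: C1 needs only 6 x (5 x 5 x 1) + 6 = 156 trainable parameters, whereas a fully connected layer mapping the 32x32 input to the same 28x28x6 output would need 1024 x 4704 ≈ 4.8 million weights.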
Insights
- Introduced the idea of local receptive fields, weight sharing, and subsampling.
- Excellent for small datasets but struggles with large-scale data due to its shallow depth.
AlexNet (2012, Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton)
Breakthrough
AlexNet marked the first major success of deep learning in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012), achieving a top-5 error of 15.3%, compared to 26.2% for the runner-up.
Architecture

- Input: 227x227x3 RGB image (the paper reports 224x224x3, but 227x227 is what makes the 11x11, stride-4 convolution produce the 55x55 output below)
- Conv1: 96 filters of 11x11, stride 4 → 55x55x96
- MaxPool1: 3x3, stride 2 → 27x27x96
- Conv2: 256 filters of 5x5, same padding → 27x27x256
- MaxPool2: 3x3, stride 2 → 13x13x256
- Conv3: 384 filters of 3x3 → 13x13x384
- Conv4: 384 filters of 3x3 → 13x13x384
- Conv5: 256 filters of 3x3 → 13x13x256
- MaxPool3: 3x3, stride 2 → 6x6x256
- FC6: Fully connected layer with 4096 neurons
- FC7: Fully connected layer with 4096 neurons
- FC8: 1000-way softmax layer
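A minimal PyTorch sketch of this stack is shown below. It is a modern single-GPU reading rather than the original implementation: the channel split across two GPUs and the local response normalization layers are omitted, and the softmax is left to the loss function.

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """Minimal single-GPU AlexNet sketch; sizes follow the list above."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),    # 227x227x3 -> 55x55x96
            nn.MaxPool2d(kernel_size=3, stride=2),                                 # -> 27x27x96
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),   # -> 27x27x256
            nn.MaxPool2d(kernel_size=3, stride=2),                                 # -> 13x13x256
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # -> 13x13x384
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # -> 13x13x384
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # -> 13x13x256
            nn.MaxPool2d(kernel_size=3, stride=2),                                 # -> 6x6x256
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # FC6
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),         # FC7
            nn.Linear(4096, num_classes),                                          # FC8 (softmax in the loss)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Sanity check on a dummy 227x227 RGB batch
print(AlexNet()(torch.zeros(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```

Note that the ReLU activations and the dropout in the fully connected layers, discussed below, already appear in this sketch.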
Key Innovations
- Used ReLU (Rectified Linear Unit) instead of sigmoid or tanh → faster training
- Applied dropout to the fully connected layers for regularization
- Split the model across two GPUs and trained them in parallel
Insights
- Showed that deep networks, trained on large datasets with GPUs, could decisively outperform traditional machine learning pipelines.
VGG Networks (2014, Visual Geometry Group, Oxford)
VGG emphasized simplicity and depth: using small 3x3 filters and stacking them deeply to capture complex patterns.
Architecture (VGG-16)

- Input: 224x224x3 RGB image
- Stack of 13 convolutional layers using 3x3 filters
- 5 max-pooling layers to reduce spatial dimensions
- 3 fully connected layers, the last of which is a 1000-way layer feeding a softmax for classification
Example:
- Conv3-64 → Conv3-64 → MaxPool
- Conv3-128 → Conv3-128 → MaxPool
- Conv3-256 → Conv3-256 → Conv3-256 → MaxPool
- Conv3-512 → Conv3-512 → Conv3-512 → MaxPool
- Conv3-512 → Conv3-512 → Conv3-512 → MaxPool
- FC-4096 → FC-4096 → FC-1000 → Softmax
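Because the design is so regular, VGG-16 can be written as a short configuration list plus a builder. Below is a minimal PyTorch sketch (no batch normalization or weight initialization, softmax left to the loss function).

```python
import torch
import torch.nn as nn

# VGG-16 convolutional configuration: numbers are output channels of 3x3
# convolutions, 'M' marks a 2x2 max-pooling layer.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_vgg16(num_classes: int = 1000) -> nn.Sequential:
    """Minimal VGG-16 sketch built from the configuration above."""
    layers, in_channels = [], 3
    for v in VGG16_CFG:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    layers += [
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),   # softmax applied in the loss
    ]
    return nn.Sequential(*layers)

model = make_vgg16()
print(model(torch.zeros(1, 3, 224, 224)).shape)    # torch.Size([1, 1000])
print(sum(p.numel() for p in model.parameters()))  # roughly 138 million parameters
```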
Characteristics
- Consistent use of 3x3 filters simplifies the design; two stacked 3x3 layers cover a 5x5 receptive field (three cover 7x7) with fewer parameters and more non-linearities, which makes deeper networks practical
- Requires significant memory and computation: VGG-16 has about 138 million parameters, most of them in the fully connected layers (see the breakdown below)
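The first fully connected layer alone accounts for 7 x 7 x 512 x 4096 + 4096 ≈ 102.8 million parameters, roughly three quarters of the network's total of about 138 million.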
Insights
- Demonstrated that depth is a key factor in improving CNN performance
- The architecture became a benchmark and inspired many follow-up models
Summary Table
| Model | Year | Input Size | Depth | Unique Aspects |
|---|---|---|---|---|
| LeNet-5 | 1998 | 32x32x1 | 7 | Local receptive fields, weight sharing, subsampling |
| AlexNet | 2012 | 227x227x3 | 8 | ReLU, dropout, GPU parallelism |
| VGG-16 | 2014 | 224x224x3 | 16 | Simplicity, 3x3 filters, depth |
Final Thoughts
These classic CNN architectures form the backbone of modern computer vision systems. Each contributed key architectural innovations that addressed specific challenges in training deep networks.
Understanding them allows us to appreciate the evolution of deep learning and to better design models suited for today's massive data and compute resources.