Revisiting a Classical Deep Learning Research Paper — AlexNet

One of the Most Influential Papers on Deep Convolutional Neural Networks

Ankan Sharma
Towards Dev

--

Geoffrey Hinton, Ilya Sutskever, and Alex Krizhevsky (Credit: Wired)

"ImageNet Classification with Deep Convolutional Neural Networks" was published in 2012 by Alex Krizhevsky, Ilya Sutskever, and the father of backpropagation, Geoffrey Hinton. You can download the paper from here.

Overview

AlexNet was trained on the ImageNet dataset to classify 1000 different classes of images. It has 5 convolution layers and 3 fully connected layers, with a 1000-class output.
Instead of the (then) traditional sigmoid or tanh activations, the authors used ReLU.
Local Response Normalization across channels was used for lateral inhibition of the unbounded ReLU activations. More on this later in the blog.
Dropout was used as a regularization technique. Further, the authors mention using overlapping pooling to make the model slightly harder to overfit.

Architecture

**Before jumping into the architecture, let me address an issue. According to the paper, the input size is 224x224. But if you do the math, an 11x11 kernel with a stride of 4 on a 224x224 input does not give the 55x55 output shown; a padding of 2 pixels on both sides would be needed to get 55x55.
So I did some digging and found that 227x227 is the input image size used in the Caffe implementation of AlexNet.
So I will describe the architecture with a 227x227 input image.

AlexNet architecture

Input — a 227x227x3 image (the size must be fixed, since fully connected layers are used at the end).
Output — 1000-class output.
The first two convolution blocks have a max-pooling layer and a local response normalization layer.
The next two are plain convolution blocks, while the last conv block also has a max-pooling layer.
Then come 3 fully connected layers, with the last layer's output equal to the number of classes. Dropout is used in the first two fully connected layers.

Formula to Calculate the Output Dimension of a Convolutional Layer

Formula to calculate the output dimension of a Conv layer:
Output = floor((Input − Kernel + 2 × Padding) / Stride) + 1

The same formula applies to max pooling too, with the kernel size being the size of the pooling window.
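As a quick check, here is a small helper (my own, not from the paper) that applies this formula to AlexNet's first two layers:

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output dimension = floor((size - kernel + 2*padding) / stride) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

print(conv_out(227, kernel=11, stride=4))  # 55 -> first conv layer
print(conv_out(55, kernel=3, stride=2))    # 27 -> first max-pooling layer
```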

Code Implementation in PyTorch

AlexNet in PyTorch
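Below is a minimal sketch of the architecture described above (227x227x3 input, 5 conv layers, 3 fully connected layers with 1000-class output). It follows the paper's layer sizes, but details such as the exact LRN placement may differ from the authors' original code.

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            # Conv block 1: 227x227x3 -> 55x55x96, then LRN and pooling -> 27x27x96
            nn.Conv2d(3, 96, kernel_size=11, stride=4),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv block 2: 27x27x96 -> 27x27x256, then LRN and pooling -> 13x13x256
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv blocks 3 and 4: plain convolutions, 13x13x384 each
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # Conv block 5: 13x13x256, then pooling -> 6x6x256
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)


if __name__ == "__main__":
    model = AlexNet()
    out = model(torch.randn(1, 3, 227, 227))
    print(out.shape)  # torch.Size([1, 1000])
```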

Loss Function and Optimizer

The authors used stochastic gradient descent with momentum (0.9) and weight decay (0.0005) on a batch size of 128. The learning rate was initialized at 0.01 and divided by 10 whenever the validation error stopped improving.
Simply put, the loss function is cross-entropy loss for multinomial classification.
Weights are initialized from a zero-mean Gaussian distribution with a standard deviation of 0.01. The biases of some layers are initialized to 1.
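A sketch of this training setup in PyTorch, reusing the AlexNet class above. The ReduceLROnPlateau scheduler and the initialization loop are my approximation of the paper's manual LR schedule and bias initialization, not the authors' exact code:

```python
import torch.nn as nn
import torch.optim as optim

model = AlexNet()                  # the model sketched above

criterion = nn.CrossEntropyLoss()  # multinomial (softmax) cross-entropy

# SGD with momentum 0.9 and weight decay 0.0005, initial learning rate 0.01
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)

# the paper divides the LR by 10 when validation error stops improving;
# ReduceLROnPlateau is one way to approximate that manual schedule
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

# paper-style weight init: zero-mean Gaussian with std 0.01
# (the paper sets the biases of some layers to 1 and the rest to 0)
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 0.0)
```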

PyTorch Specifics on the Cross-Entropy Loss Function
There are two ways to implement it in PyTorch —
a. Use LogSoftmax with NLL loss.
b. Use cross-entropy loss directly, without a softmax in the last layer.
** I used a Softmax in the last layer. If you want to keep it, take the log of the Softmax output and then use NLL loss.
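A small demonstration of the two options (plus the log-of-Softmax workaround) on dummy logits; all three give the same loss value:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 1000)           # raw scores from the last Linear layer
targets = torch.randint(0, 1000, (4,))  # ground-truth class indices

# Option (a): LogSoftmax + NLLLoss
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_a = nn.NLLLoss()(log_probs, targets)

# Option (b): CrossEntropyLoss applied directly to the logits (no softmax in the model)
loss_b = nn.CrossEntropyLoss()(logits, targets)

# If the model already ends in Softmax, take the log before NLLLoss
probs = nn.Softmax(dim=1)(logits)
loss_c = nn.NLLLoss()(torch.log(probs), targets)

print(torch.allclose(loss_a, loss_b), torch.allclose(loss_a, loss_c))  # True True
```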

Key Features of AlexNet

a. Rectified Linear Unit (ReLU) Activation Function
In the paper, the authors mention that the network with ReLU consistently learned faster than with saturating non-linearities like tanh.
ReLU outputs the input directly if it is positive, and zero otherwise.
Formula: f(x) = max(0, x)

ReLU (Credit: O'Reilly)
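For completeness, a one-line check in PyTorch (the tensor values are just illustrative):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(torch.relu(x))           # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(torch.clamp(x, min=0))   # equivalent: f(x) = max(0, x)
```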

b. Local Response Normalization
The authors used inter-channel Local Response Normalization. This type of normalization is carried out across channels at a given pixel position.

Inter-channel Local Response Normalization formula:
b^i(x, y) = a^i(x, y) / ( k + alpha * Σ (a^j(x, y))² )^beta, where the sum runs over j = max(0, i − n/2) to min(N − 1, i + n/2)

k, alpha, beta = hyperparameters
n = number of neighbouring kernel maps to normalize over
N = total number of kernel maps
a^i = activation at position (x, y) in the i-th kernel map (channel)

Inside the brackets, the squared activations at the same position are summed across "n" neighbouring kernel maps and scaled by alpha, and the whole term is raised to the power beta. Dividing by this dampens the activation of the i-th kernel map relative to its neighbouring channels.
"k" is there to avoid division by zero.

Values in Paper — k=2, n=5, alpha=0.0001, beta=0.75
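PyTorch ships this inter-channel scheme as nn.LocalResponseNorm; here is a minimal sketch using the paper's hyperparameters (the 96x55x55 input shape is just an example, matching the first conv layer's output):

```python
import torch
import torch.nn as nn

# size corresponds to n; k, alpha, beta match the values reported in the paper
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

x = torch.randn(1, 96, 55, 55)  # example activations after the first conv layer
y = lrn(x)
print(y.shape)                  # torch.Size([1, 96, 55, 55])
```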

Why was LRN used?
The idea behind LRN is to mimic the lateral inhibition of real biological neurons. It dampens the "flat" responses and enhances the "peak" responses among activations. This helps pass the important activations on to the next layer, and it also helps generalization because of the local competition among neighbouring channels.

** Nowadays this method is not used much, because there are better techniques like batch normalization, dropout, etc.

c. Overlapping Pooling
In general, we use non-overlapping pooling, i.e. the kernel size equals the stride. Basically, it means no pixel is covered by the pooling kernel more than once.

Normal pooling (top) and overlapping pooling (bottom). Credit: Morning Paper

In the paper, the authors used overlapping pooling, i.e. the stride is smaller than the kernel size, so the same pixel is covered by the pooling kernel multiple times. They mention that it made the model slightly less susceptible to overfitting.
Values in the paper — kernel size = 3, stride = 2
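In PyTorch this is just a MaxPool2d with stride smaller than the kernel size; the comparison below (on an example 96x55x55 tensor) is my own illustration:

```python
import torch
import torch.nn as nn

overlapping_pool = nn.MaxPool2d(kernel_size=3, stride=2)      # stride < kernel size, as in the paper
non_overlapping_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # kernel size == stride

x = torch.randn(1, 96, 55, 55)
print(overlapping_pool(x).shape)      # torch.Size([1, 96, 27, 27])
print(non_overlapping_pool(x).shape)  # torch.Size([1, 96, 27, 27])
```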

d. Dropout
It is a regularization technique that randomly shuts off neurons of a layer during the training phase, to prevent the network from overfitting.
A probability value determines the fraction of neurons that will be shut off.

Dropout (Credit: ResearchGate)
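A tiny illustration of how PyTorch's nn.Dropout behaves in training vs. evaluation mode, with p = 0.5 as in AlexNet's fully connected layers:

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)  # each neuron is dropped with probability 0.5 during training

x = torch.ones(1, 8)
dropout.train()
print(dropout(x))            # roughly half the entries zeroed, survivors scaled by 1/(1-p)
dropout.eval()
print(dropout(x))            # identity at inference time
```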

Miscellaneous

a. Data Augmentation
The authors used two kinds of data augmentation (see the sketch after this list) —
1. Random image translations (crops) and horizontal reflections.
2. Altering RGB pixel intensities (a PCA-based colour perturbation), which makes the model more invariant to changes in colour and illumination.
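A rough torchvision approximation of this pipeline; ColorJitter is a stand-in for the paper's PCA-based RGB perturbation, which torchvision does not provide directly, and the crop size follows the 227x227 convention used in this blog:

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize(256),                      # resize the shorter side to 256
    T.RandomCrop(227),                  # random translations via random crops
    T.RandomHorizontalFlip(p=0.5),      # horizontal reflections
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # approximate colour perturbation
    T.ToTensor(),
])
```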

b. Multi-GPU Training
The authors trained the model on two GPUs by splitting the kernels (channels) of each layer in half, one half per GPU. Another trick they employed was that the GPUs communicate with each other only in certain layers.
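Note that this is model parallelism rather than today's usual data parallelism. On a single device, the two-GPU kernel split is often approximated with grouped convolutions, as sketched below (my illustration, not the authors' code):

```python
import torch.nn as nn

# Each group sees only half of the input channels, mimicking one GPU's half of the kernels.
conv2_split = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

# On modern hardware, plain data parallelism is the usual way to use two GPUs instead:
# model = nn.DataParallel(AlexNet())  # replicates the whole model on each GPU
```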

This brings me to the end of the blog. In the future, I will write more on advanced CNN architectures like ResNet, InceptionNet, etc.

To read my blog on U-Net paper explanation, click here
“Stay Hungry, stay foolish” — Steve Jobs
