Category: AI
An Intuitive Explanation of Convolutional
Neural Networks
What are Convolutional Neural Networks and why are they important?
Convolutional Neural Networks (ConvNets or CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, apart from powering vision in robots and self-driving cars.
Figure 1: Source [1]
In Figure 1 above, a ConvNet is able to recognize scenes and suggest relevant captions, while Figure 2 shows an example of ConvNets being used to recognize everyday objects, humans and animals.
Figure 2: Source [2]
ConvNets, therefore, are an important tool for most machine
learning practitioners today. However, understanding ConvNets and
learning to use them for the first time can sometimes be an
intimidating experience.
If you are new to neural networks in general, I would recommend reading a short tutorial on Multi Layer Perceptrons to get an idea of how they work before proceeding.
The LeNet Architecture (1990s)
LeNet was one of the very first convolutional neural networks, and it helped propel the field of Deep Learning. This pioneering architecture by Yann LeCun was used mainly for character recognition tasks such as reading zip codes and digits.
Below, we will develop an intuition of how the LeNet architecture learns to recognize images. There have been several new architectures proposed in recent years that improve on LeNet, but they all use its main concepts and are easier to understand once you understand the original.
Figure 3: A simple ConvNet. Source [5]
The Convolutional Neural Network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird.
There are four main operations in the ConvNet shown in Figure 3 above:
 Convolution
 Non-Linearity (ReLU)
 Pooling or Sub-Sampling
 Classification (Fully Connected Layer)
These operations are the basic building blocks of every Convolutional Neural Network, so understanding how they work is an important step toward a sound understanding of ConvNets. We will try to build the intuition behind each of these operations below.
Images are a matrix of pixel values
Essentially, every image can be represented as a matrix of pixel values.
Figure 4: Every image is a matrix of pixel values. Source [6]
Channel is a conventional term used to refer to a certain component of an image. An image from a standard digital camera has three channels – red, green and blue – which you can imagine as three 2d matrices stacked over each other (one for each color), each holding pixel values in the range 0 to 255.
A grayscale image, on the other hand, has just one channel, so it can be represented by a single 2d matrix whose pixel values range from 0 (black) to 255 (white).
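As a concrete sketch of the two cases (NumPy is my assumption here; the post itself is code-free):

```python
import numpy as np

# A grayscale image: one channel, a single 2d matrix of intensities 0-255.
gray = np.array([[0, 255],
                 [128, 64]], dtype=np.uint8)
print(gray.shape)    # (2, 2): height x width

# A color image: three such matrices (red, green, blue) stacked together.
color = np.zeros((2, 2, 3), dtype=np.uint8)
color[:, :, 0] = 255            # set the red channel everywhere
print(color.shape)   # (2, 2, 3): height x width x channels
```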
The Convolution Step
ConvNets derive their name from the 'convolution' operator. The primary purpose of convolution in a ConvNet is to extract features from the input image; convolution preserves the spatial relationship between pixels by learning image features over small squares of input data.
As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255; the green matrix below is a special case where pixel values are only 0 and 1):
Also, consider another 3 x 3 matrix as shown below:
Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation in Figure 5 below:
Figure 5: The Convolution operation. The output matrix is called the Convolved Feature or Feature Map. Source [7]
Take a moment to understand how the computation above is being done. We slide the orange matrix over our original (green) image by 1 pixel (also called 'stride'), and for every position we compute an element-wise multiplication between the two matrices and add the results to get a single integer that forms one element of the output matrix (pink).
In CNN terminology, the 3 x 3 matrix is called a 'filter', 'kernel' or 'feature detector', and the matrix formed by sliding the filter over the image and computing the dot product is called the 'Convolved Feature', 'Activation Map' or 'Feature Map'. It is important to note that filters act as feature detectors on the original input image.
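This sliding-window computation can be sketched in a few lines (NumPy is an assumption on my part; the matrices below reproduce the ones shown in the Figure 5 animation and are meant as an illustration):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding), taking the
    element-wise product-and-sum at every position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1), dtype=int)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The 5x5 binary image and 3x3 filter from the animation in Figure 5.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```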
It is evident from the animation above that different values of the filter matrix will produce different Feature Maps for the same input image.
In the table below, we can see the effects of convolving the above image with different filters. Just by changing the numeric values of the filter matrix, we can perform operations such as edge detection, sharpening and blurring [8]; this means that different filters can detect different features from an image, for example edges or curves.
Another good way to understand the Convolution operation is by looking at the animation in Figure 6 below:
Figure 6: The Convolution Operation. Source [9]
A filter (with red outline) slides over the input image to produce a feature map; convolving a different filter over the same image produces a different feature map.
In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, filter size and the architecture of the network before training). The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.
The size of the Feature Map (Convolved Feature) is controlled by three parameters [4], which we need to decide before the convolution step is performed:

Depth:
Depth corresponds to the number of filters we use for the convolution operation. In the network shown in Figure 7, we are performing convolution of the original boat image using three distinct filters, thus producing three different feature maps as shown. You can think of these three feature maps as stacked 2d matrices, so, the ‘depth’ of the feature map would be three.
Figure 7

Stride:
Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2, then the filters jump 2 pixels at a time as we slide them around. Having a larger stride will produce smaller feature maps.

Zero Padding:
Sometimes it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to the bordering elements of our input image matrix. A nice property of zero padding is that it lets us control the size of the feature maps. Adding zero padding is also called wide convolution; not using zero padding is a narrow convolution. This is explained clearly in [14].
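Together, these three parameters determine the spatial size of the feature map via the standard formula (W - F + 2P) / S + 1, where W is the input size, F the filter size, S the stride and P the padding. The formula comes from the general CNN literature (e.g. [4]) rather than the text above; a small sketch:

```python
def feature_map_size(w, f, s=1, p=0):
    """Spatial size of a convolution output:
    w = input width/height, f = filter size, s = stride, p = zero padding."""
    return (w - f + 2 * p) // s + 1

print(feature_map_size(5, 3))        # 5x5 image, 3x3 filter -> 3
print(feature_map_size(5, 3, s=2))   # larger stride -> smaller map: 2
print(feature_map_size(5, 3, p=1))   # "wide" convolution keeps size: 5
```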
Introducing Non Linearity (ReLU)
An additional operation called ReLU is used after every Convolution operation in Figure 3 above. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by Output = max(0, Input).
Figure 8: The ReLU operation
ReLU is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero. Its purpose is to introduce non-linearity into our ConvNet, since most real-world data we would want the network to learn is non-linear, while convolution itself is a linear operation.
The ReLU operation can be understood clearly from Figure 9 below, which shows ReLU applied to one of the feature maps obtained in Figure 6 above. The output is also referred to as the 'Rectified' feature map.
Figure 9: ReLU operation.
Source [10]
Other non-linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.
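Element-wise ReLU is one line in NumPy (a minimal sketch with made-up values):

```python
import numpy as np

def relu(feature_map):
    """Element-wise ReLU: replace every negative value with zero."""
    return np.maximum(0, feature_map)

fm = np.array([[-3, 5],
               [ 2, -1]])
print(relu(fm))   # [[0 5]
                  #  [2 0]]
```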
The Pooling Step
Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum, etc.
In case of Max Pooling, we define a spatial neighborhood (for example, a 2 x 2 window) and take the largest element from the rectified feature map within that window. Instead of the largest element, we could also take the average (Average Pooling) or the sum of all elements in that window; in practice, Max Pooling has been shown to work better.
Figure 10 below shows an example of Max Pooling on a rectified feature map using a 2 x 2 window:
Figure 10: Max Pooling. Source [4]
We slide our 2 x 2 window by 2 cells (also called 'stride') and take the maximum value in each region. As shown in Figure 10, this reduces the dimensionality of our feature map.
In the network shown in Figure 11, the pooling operation is applied separately to each feature map (notice that, because of this, we get three output maps from three input maps).
Figure 11: Pooling applied to Rectified Feature Maps
Figure 12 shows the effect of Pooling on the rectified feature map we obtained after the ReLU operation in Figure 9 above.
Figure 12: Pooling. Source [10]
The function of Pooling is to progressively reduce the spatial size of the input representation [4]. In particular, pooling
 makes the input representations (feature dimension) smaller and more manageable
 reduces the number of parameters and computations in the network, therefore controlling overfitting [4]
 makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in the input will not change the output of Pooling, since we take the maximum / average value in a local neighborhood)
 helps us arrive at an almost scale invariant representation of our image (the exact term is "equivariant"). This is powerful because we can detect objects in an image no matter where they are located (read [18] and [19] for details).
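A minimal sketch of 2 x 2 max pooling with stride 2 (NumPy assumed; the input values below are illustrative, not taken from the figures):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum value in each size x size window, moving by `stride`."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

rectified = np.array([[1, 1, 2, 4],
                      [5, 6, 7, 8],
                      [3, 2, 1, 0],
                      [1, 2, 3, 4]])
print(max_pool(rectified))   # [[6 8]
                             #  [3 4]]
```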
Story so far
Figure 13
So far we have seen how Convolution, ReLU and Pooling work. It is important to understand that these layers are the basic building blocks of any CNN. Together, these layers extract useful features from the images, introduce non-linearity into the network, and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].
The output of the 2nd Pooling Layer acts as an input to the Fully Connected Layer, which we will discuss in the next section.
Fully Connected Layer
The Fully Connected layer is a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer. The term 'Fully Connected' implies that every neuron in the previous layer is connected to every neuron in the next layer.
The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features to classify the input image into various classes based on the training dataset.
Figure 14: Fully Connected Layer. Each node is connected to every node in the adjacent layer.
Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most of the features from the convolutional and pooling layers may be good for the classification task on their own, but combinations of those features might be even better [11].
The sum of output probabilities from the Fully Connected Layer is 1. This is ensured by using Softmax as the activation function in the output layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it into a vector of values between zero and one that sum to one.
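A minimal sketch of the Softmax function (the scores below are made-up values):

```python
import numpy as np

def softmax(scores):
    """Squash arbitrary real-valued scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)          # roughly [0.659 0.242 0.099]
print(probs.sum())    # 1.0
```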
Putting it all together – Training using Backpropagation
As discussed above, the Convolution + Pooling layers act as feature extractors from the input image, while the Fully Connected layer acts as a classifier.
Note that in Figure 15 below, since the input image is a boat, the target probability is 1 for the Boat class and 0 for the other three classes, i.e.
 Input Image = Boat
 Target Vector = [0, 0, 1, 0]
Figure 15: Training the ConvNet
The overall training process of the Convolutional Network may be summarized as follows:

Step 1:
We initialize all filters and parameters / weights with random values.

Step 2:
The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
 Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3].
 Since weights are randomly assigned for the first training example, output probabilities are also random.

Step 3:
Calculate the total error at the output layer (summation over all 4 classes):
Total Error = ∑ ½ (target probability – output probability)²
For the example above, the total error is ½ [(0 − 0.2)² + (0 − 0.4)² + (1 − 0.1)² + (0 − 0.3)²] = 0.55.


Step 4:
Use Backpropagation to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values / weights and parameter values to minimize the output error.
 The weights are adjusted in proportion to their contribution to the total error.
 When the same image is input again, the output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
 This means the network has learnt to classify this particular image correctly by adjusting its weights / filters so that the output error is reduced.
 Parameters like the number of filters, filter sizes and the architecture of the network are all fixed before Step 1 and do not change during the training process; only the values of the filter matrices and connection weights get updated.

Step 5:
Repeat steps 2–4 with all images in the training set.
The above steps train the ConvNet; this essentially means that all the weights and parameters of the network have been optimized to correctly classify images from the training set.
When a new (unseen) image is input into the ConvNet, the network goes through the forward propagation step and outputs a probability for each class, calculated using the weights optimized on the training set. If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into the correct categories.
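The shape of Steps 1–5 can be sketched in miniature. To be clear about what this is not: the toy below is not a ConvNet. A single random weight matrix stands in for all the filters and connection weights, a flat random vector stands in for the image, and the gradient of the Step 3 error is computed numerically instead of by backpropagation. It only illustrates the loop: forward pass, total error, gradient, update.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

def total_error(W, x, target):
    probs = softmax(W @ x)                      # forward pass
    return 0.5 * np.sum((target - probs) ** 2)  # the Total Error formula of Step 3

x = rng.random(8)                        # stand-in for a (flattened) image
target = np.array([0.0, 0.0, 1.0, 0.0])  # target vector for "boat"
W = rng.standard_normal((4, 8)) * 0.1    # Step 1: random weights

initial = total_error(W, x, target)
eps = 1e-5
for step in range(200):                  # Steps 2-5 over one "image"
    grad = np.zeros_like(W)              # Step 4: numerical gradient of the error
    for idx in np.ndindex(*W.shape):
        Wp = W.copy()
        Wp[idx] += eps
        grad[idx] = (total_error(Wp, x, target) - total_error(W, x, target)) / eps
    W -= 1.0 * grad                      # gradient descent update

print(initial, "->", total_error(W, x, target))  # the error shrinks toward 0
```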
Note 1: The steps above have been oversimplified and mathematical details have been avoided to provide intuition into the training process. See [12] and [15] for a mathematical treatment of backpropagation in ConvNets.
Note 2: In the example above we used two sets of alternating Convolution and Pooling layers. These operations can, however, be repeated any number of times in a single ConvNet, and a Pooling layer is not required after every Convolution layer.
Figure 16: Source [4]
Visualizing Convolutional Neural Networks
In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize. For example, in image classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers [14].
Figure 17: Learned features from a Convolutional
Deep Belief Network
Adam Harley created an excellent interactive visualization of a Convolutional Neural Network trained on the MNIST database of handwritten digits [13]. I highly recommend playing with it to understand the details of how a CNN works.
We will see below how the network works for an input '8'. Note that the visualization in Figure 18 does not show the ReLU operation separately.
Figure 18: Visualizing a ConvNet trained
on handwritten digits
The input image contains 1024 pixels (a 32 x 32 image), and the first Convolution layer (Convolution Layer 1) is formed by convolving six unique 5 x 5 (stride 1) filters with the input image. Using six different filters produces a feature map of depth six.
Convolution Layer 1 is followed by Pooling Layer 1, which does 2 x 2 max pooling (with stride 2) separately over each of the six feature maps in Convolution Layer 1.
Figure 19: Visualizing the Pooling Operation
Pooling Layer 1 is followed by sixteen 5 x 5 (stride 1) convolutional filters that perform the convolution operation. This is followed by Pooling Layer 2, which does 2 x 2 max pooling (with stride 2). These two layers use the same concepts described above.
We then have three fully-connected (FC) layers:
 120 neurons in the first FC layer
 100 neurons in the second FC layer
 10 neurons in the third FC layer, corresponding to the 10 digits (also called the Output layer)
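Walking the shapes through this network with the standard output-size formula (a sketch in plain Python; the layer sizes are the ones described above):

```python
def conv_out(w, f, s=1, p=0):
    """Spatial output size of a convolution or pooling window."""
    return (w - f + 2 * p) // s + 1

size, depth = 32, 1                   # 32 x 32 single-channel input
size, depth = conv_out(size, 5), 6    # Conv Layer 1: six 5x5 filters, stride 1
assert (size, depth) == (28, 6)
size = conv_out(size, 2, s=2)         # Pooling Layer 1: 2x2 max pool, stride 2
assert (size, depth) == (14, 6)
size, depth = conv_out(size, 5), 16   # Conv Layer 2: sixteen 5x5 filters
assert (size, depth) == (10, 16)
size = conv_out(size, 2, s=2)         # Pooling Layer 2: 2x2 max pool, stride 2
assert (size, depth) == (5, 16)
print(size * size * depth)            # 400 values feed the 120-neuron FC layer
```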
Notice how in Figure 20, each of the 10 nodes in the output layer is connected to all 100 nodes in the second Fully Connected layer (hence the name Fully Connected).
Also, note how the only bright node in the Output Layer corresponds to '8', meaning the network correctly classifies our handwritten digit (a brighter node denotes a higher output value, i.e. 8 has the highest probability among all digits).
Figure 20: Visualizing the Fully Connected Layers
A 3d version of the same visualization is also available.
Other ConvNet Architectures
Convolutional Neural Networks have been around since the early 1990s. We discussed LeNet above, which was one of the very first such architectures. Some other influential architectures are listed below:
 LeNet
(1990s):
Already covered in this article.
 1990s to
2012:
In the years from the late 1990s to the early 2010s, convolutional neural networks were in incubation. As more data and computing power became available, the tasks they could tackle became more and more interesting.
 AlexNet (2012)
–
In 2012, Alex Krizhevsky (and others) released AlexNet which was a deeper and much wider version of the LeNet and won by a large margin the difficult ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It was a significant breakthrough with respect to the previous approaches and the current widespread application of CNNs can be attributed to this work.
 ZF Net (2013)
–
The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters.
 GoogLeNet (2014)
–
The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M).
 VGGNet (2014)
–
The runner-up in ILSVRC 2014 was the network that became known as the VGGNet. Its main contribution was showing that the depth of the network (number of layers) is a critical component of good performance.

ResNets
(2015) –
The Residual Network developed by Kaiming He (and others) was the winner of ILSVRC 2015. ResNets are state-of-the-art Convolutional Neural Network models and the default choice for using ConvNets in practice (as of May 2016).

DenseNet
(August 2016) –
Recently published by Gao Huang (and others), the Densely Connected Convolutional Network has each layer directly connected to every other layer in a feed-forward fashion. The DenseNet has been shown to obtain significant improvements over previous state-of-the-art architectures on five highly competitive object recognition benchmark tasks. A Torch implementation is available.
Conclusion
In this post, I have tried to explain the main concepts behind Convolutional Neural Networks in simple terms.
This post was inspired in part by "Understanding Convolutional Neural Networks for NLP" [14].
All images and animations used in this post belong to their respective authors as listed in References section below.
References
1. Clarifai Home Page
2. Shaoqing Ren, et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 2015, arXiv:1506.01497
3. Neural Network Architectures, Eugenio Culurciello's blog
4. CS231n Convolutional Neural Networks for Visual Recognition, Stanford
5. Clarifai / Technology
6. Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks
7. Feature extraction using convolution, Stanford
8. Wikipedia article on Kernel (image processing)
9. Deep Learning Methods for Vision, CVPR 2012 Tutorial
10. Neural Networks by Rob Fergus, Machine Learning Summer School 2015
11. What do the fully connected layers do in CNNs?
12. Convolutional Neural Networks, Andrew Gibiansky
13. A. W. Harley, "An Interactive Node-Link Visualization of Convolutional Neural Networks", in ISVC, pages 867-877, 2015
14. Understanding Convolutional Neural Networks for NLP
15. Backpropagation in Convolutional Neural Networks
16. A Beginner's Guide To Understanding Convolutional Neural Networks
17. Vincent Dumoulin, et al., "A guide to convolution arithmetic for deep learning", 2015, arXiv:1603.07285
18. What is the difference between deep learning and usual machine learning?
19. How is a convolutional neural network able to learn invariant features?
20. A Taxonomy of Deep Convolutional Neural Nets for Computer Vision