Convolutional Neural Networks – Ep. 8 (Deep Learning SIMPLIFIED)

November 30, 2019 | By Stanley Isaacs


If there’s one deep net that has completely
dominated the machine vision space in recent years, it’s certainly the convolutional
neural net, or CNN. These nets are so influential that they’ve made Deep Learning one of the
hottest topics in AI today. But they can be tricky to understand, so let’s take a closer
look and see how they work. CNNs were pioneered by Yann LeCun of New York University, who also serves as the director of Facebook's AI group. It is currently believed that Facebook uses a CNN for its facial recognition software. A convolutional net has been the go-to solution
for machine vision projects in the last few years. Early in 2015, after a series of breakthroughs
by Microsoft, Google, and Baidu, a machine was able to beat a human at an object recognition
challenge for the first time in the history of AI. It’s hard to mention a CNN without touching
on the ImageNet challenge. ImageNet is a project that was inspired by the growing need for
high-quality data in the image processing space. Every year, the top Deep Learning teams
in the world compete with each other to create the best possible object recognition software.
Going back to 2012, when Geoff Hinton's team took first place in the challenge, every single winner has used a convolutional net as its model. This isn't surprising, since the error rate on image recognition tasks has dropped significantly with CNNs, as seen in this image.

Have you ever struggled while trying to learn
about CNNs? If so, please comment and share your experiences. We’ll keep our discussion of CNNs high level,
but if you’re inclined to learn about the math, be sure to check out Andrej Karpathy’s
amazing CS231n course notes on these nets. There are many component layers to a CNN,
and we will explain them one at a time. Let’s start with an analogy that will help describe
the first component, which is the "convolutional layer." Imagine that we have a wall, which will represent
a digital image. Also imagine that we have a series of flashlights shining at the wall,
creating a group of overlapping circles. The purpose of these flashlights is to seek out
a certain pattern in the image, like an edge or a color contrast for example. Each flashlight
looks for the exact same pattern as all the others, but they all search in a different
section of the image, defined by the fixed region created by the circle of light. When
combined together, the flashlights form what's called a filter. A filter is able to determine
if the given pattern occurs in the image, and in what regions. What you see in this
example is an 8×6 grid of lights, which is all considered to be one filter. Now let’s take a look from the top. In practice,
flashlights from multiple different filters will all be shining at the same spots in parallel,
simultaneously detecting a wide array of patterns. In this example, we have four filters all
shining at the wall, all looking for a different pattern. So this particular convolutional
layer is an 8×6×4, 3-dimensional grid of these flashlights. Now let's connect the dots of our explanation:
– Why is it called a convolutional net? The net uses the technical operation of convolution
to search for a particular pattern. While the exact definition of convolution is beyond
the scope of this video, to keep things simple, just think of it as the process of filtering
through the image for a specific pattern. One important note is that the weights and biases of this layer affect how this operation is performed: tweaking these numbers impacts the effectiveness of the filtering process. – Each flashlight represents a neuron in the
CNN. Typically, neurons in a layer simply activate, or fire; in the convolutional layer, by contrast, neurons perform this "convolution" operation. We're going to draw a box around
one set of flashlights to make things look a bit more organized. – Unlike the nets we’ve seen thus far where
every neuron in a layer is connected to every neuron in the adjacent layers, a CNN has the
flashlight structure. Each neuron is only connected to the input neurons it “shines”
upon. The neurons in a given filter share the same
weight and bias parameters. This means that, anywhere on the filter, a given neuron is
connected to the same number of input neurons and has the same weights and biases. This
is what allows the filter to look for the same pattern in different sections of the
image. By arranging these neurons in the same structure as the flashlight grid, we ensure
that the entire image is scanned. The next two layers that follow are ReLU and pooling, both of which help to build up the simple patterns discovered by the convolutional layer. Each node in the convolutional layer is connected to a node that fires, as in other nets. The activation used is called ReLU, or rectified linear unit. CNNs are trained using backpropagation, so the vanishing gradient is once again a potential issue. Because the derivative of ReLU is a constant 1 for any positive input, the gradient is held more or less constant at every layer of the net. So the ReLU activation allows the net to be properly trained, without harmful slowdowns in the crucial early layers. The pooling layer is used for dimensionality
reduction. CNNs tile multiple instances of convolutional layers and ReLU layers together in a sequence, in order to build more and more complex patterns. The problem with this is that the number of possible patterns becomes exceedingly large. By introducing pooling layers, we ensure that the net focuses on only the most relevant patterns discovered by convolution and ReLU. This helps limit both the memory and processing requirements
for running a CNN. Together, these three layers can discover
a host of complex patterns, but the net will have no understanding of what these patterns
mean. So a fully connected layer is attached to the end of the net in order to equip the
net with the ability to classify data samples. Let’s recap the major components of a CNN.
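To make the recap concrete, here is a minimal sketch of the three operations in plain NumPy — a toy illustration written for this post, not code from the video. One small filter (shared weights and bias) slides over a toy image, the responses pass through ReLU, and max pooling shrinks the result:

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Slide one filter over the image; the same weights (shared
    weights and bias) are applied at every position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]  # the region one "flashlight" shines on
            out[i, j] = np.sum(patch * kernel) + bias
    return out

def relu(x):
    """Rectified linear unit: negative responses are zeroed out."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Keep only the strongest response in each size x size block
    (dimensionality reduction)."""
    H, W = x.shape[0] - x.shape[0] % size, x.shape[1] - x.shape[1] % size
    return x[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

# A toy 8x8 "image" with a vertical edge down the middle.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# A simple vertical-edge filter: responds where dark meets bright.
edge_filter = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])

feature_map = relu(conv2d(image, edge_filter))
pooled = max_pool(feature_map)

print(feature_map.shape)  # (7, 7) -- one response per filter position
print(pooled.shape)       # (3, 3) -- reduced by 2x2 pooling
```

Note how the filter fires only along the vertical edge: the same weights are applied at every position, exactly like the grid of flashlights all looking for one pattern.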
A typical deep CNN has three sets of layers – a convolutional layer, ReLU, and pooling layers – all of which are repeated several times. These layers are followed by a few fully connected layers in order to support classification. Since CNNs are such deep nets, they typically need to be trained using server resources with GPUs. Despite the power of CNNs, these nets have
one drawback. Since they are a supervised learning method, they require a large set
of labelled data for training, which can be challenging to obtain in a real-world application.
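Before moving on, the recapped stack can be tied together in one hedged sketch: a conv/ReLU/pool block with several filters, flattened into a fully connected layer that produces class scores. The filter values, image size, and three-class head here are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, k):
    """Apply one filter at every position of the image."""
    kh, kw = k.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, s=2):
    H, W = x.shape[0] - x.shape[0] % s, x.shape[1] - x.shape[1] % s
    return x[:H, :W].reshape(H // s, s, W // s, s).max(axis=(1, 3))

# One conv/ReLU/pool block with 4 filters (random weights stand in
# for learned ones), then a fully connected layer mapping to 3 classes.
image = rng.random((12, 12))
filters = rng.standard_normal((4, 3, 3))   # 4 filters, 3x3 each

features = np.stack([max_pool(relu(conv2d(image, f))) for f in filters])
flat = features.ravel()                    # flatten for the FC layer

W_fc = rng.standard_normal((3, flat.size)) # fully connected weights
b_fc = np.zeros(3)
scores = W_fc @ flat + b_fc                # one score per class

print(features.shape)  # (4, 5, 5): 4 pooled feature maps
print(scores.shape)    # (3,): class scores for classification
```

In a trained net, the filter and fully connected weights would be learned by backpropagation from labelled examples — which is exactly why CNNs need so much labelled data.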
In the next video, we’ll shift our attention to another important deep learning model – the
Recurrent Net.