YouTube Transcript: Convolutional Neural Networks | Deep Learning Animated
Source: https://www.youtube.com/watch?v=UF8uR6Z6KLc
Hello everybody, welcome to this video on convolutional neural networks.
In this video, we'll go over how these networks process images using the
convolution operation, how it's implemented in practice, and other special features of
convolutional networks such as pooling. But first let's start with a quick
reminder of some basic concepts. Let's start by asking ourselves
what an image actually is. For instance, let's take the
example of this image of a hand-written digit. What you see as a single image is actually
a grid, composed of tiny cells called pixels. Each pixel has a value representing its intensity.
For instance a very bright pixel will have a high value, while a dark pixel will have a low value.
As you can see, an image is just a 2D array of values, also known as a matrix.
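To make this concrete, here is a minimal sketch of a grayscale image as a matrix in code (the pixel values are made up for illustration):

```python
import numpy as np

# A tiny 5x5 grayscale "image": each entry is a pixel intensity,
# here with 0 for black and 255 for white (values are illustrative).
image = np.array([
    [  0,   0,  50,   0,   0],
    [  0, 120, 255, 130,   0],
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
    [  0,  90, 255,  80,   0],
], dtype=np.uint8)

print(image.shape)  # (5, 5): a 2D array of values, i.e. a matrix
```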
A fully connected neural network is the simplest architecture in deep learning.
In a fully connected network, every input node is connected to every node in the next layer.
While this is very simple, it's also very inefficient for image processing.
Indeed, the number of parameters in the first layer is directly proportional to the size of the image. For instance, with our previous digit example, the first layer alone accounts for 7,840 weights, and this small network has more than 15,000 parameters, just for a 28 by 28 image.
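As a quick sanity check on these figures, here is the arithmetic, assuming the first layer has 10 neurons (the exact layer sizes are not stated in the transcript, so this is an assumption):

```python
# Weights in a fully connected layer = number of inputs x number of neurons.
inputs = 28 * 28                   # 784 pixels fed into the network
first_layer_weights = inputs * 10  # assumed 10 neurons -> 7,840 weights
print(first_layer_weights)         # 7840, matching the figure above
```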
It would be better to have a network whose number of parameters is independent of the input size. Let's see how filters can help us with this.
Filters are a very common tool in image processing.
But what exactly is a filter? Simply put, a filter is just another matrix, usually much smaller than the image. We usually denote the operation of applying a filter to an image with an asterisk symbol. Depending on what numbers you plug into this
matrix, you can make an image look blurry, detect its edges, or achieve all sorts of other effects.
To apply a filter to an image, we use a process called convolution.
Here is the mathematical equation describing the convolution between a filter k and an image f.
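In its standard discrete form, this reads as follows (the exact notation shown on screen may differ):

$$(f * k)(x, y) = \sum_{i}\sum_{j} f(x - i,\, y - j)\, k(i, j)$$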
While this equation may seem complicated at first, it's actually quite simple to understand.
Let's examine the computation of one pixel in practice.
Let's take our previous example of a 3 by 3 filter and a written digit.
For every pixel, we multiply its value and the values of its neighbors by the corresponding weights in the filter, and sum the results. To keep the pixel values in the initial range, we typically normalize this sum by the sum of the filter's weights (assuming they don't sum to zero).
We then simply repeat this operation for every single pixel in the image.
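Here is a minimal sketch of this procedure in Python with NumPy (stride 1 and no padding for now; both are introduced just below):

```python
import numpy as np

def convolve2d_naive(image, kernel):
    # Slide the kernel over the image, computing a weighted sum per pixel.
    # (No kernel flip, i.e. cross-correlation, which is what deep learning
    # libraries actually compute in practice.)
    k = kernel.shape[0]                      # assume a square k x k kernel
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))   # edges are skipped until we pad
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = image[y:y + k, x:x + k]
            out[y, x] = (window * kernel).sum()
    s = kernel.sum()
    return out / s if s != 0 else out        # normalize by the weights' sum
```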
The convolution operation for matrices usually has three main parameters:
the kernel size, the stride, and the padding. Let's see what these parameters stand for in practice, starting with the kernel size. Kernel size simply refers to the dimensions
of the filter used: the bigger the filter, the more pixels are involved in the computation.
Stride is the amount by which the filter moves at each step of the convolution, measured in pixels.
Here you can see the filter moving with a stride of 2.
Finally, we also need to compute the convolution on the edges of the image, where data is missing.
That's where padding comes into play. Padding refers to the values we place on the
edge of the image to enable the convolution, even though there is no actual data there.
The most common method is zero-padding, but any constant can be used.
Here you can see that the image is padded with ones instead.
Another very common padding policy is to copy the pixels at the border of the image outward, usually called replicate or edge padding (a close variant, reflect padding, mirrors the pixels just inside the border). Since the border pixels of our digit are zero, here it is equivalent to zero-padding.
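Together, these three parameters determine the size of the output; the usual formula is sketched below:

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    # Output size along one dimension of the image.
    return (n + 2 * padding - kernel_size) // stride + 1

# 28-pixel-wide image, 3x3 kernel, stride 1, 1 pixel of padding -> size kept
print(conv_output_size(28, kernel_size=3, stride=1, padding=1))  # 28
# The same settings with a stride of 2 halve the size
print(conv_output_size(28, kernel_size=3, stride=2, padding=1))  # 14
```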
Now that we have seen in detail how to perform the convolution operation,
we will go over the whole process once. For this example we will be using a
three by three mean filter, with a stride of one, and padding the image with zeros.
Intuitively, convolution is simply the operation of moving the filter over the image and calculating a weighted sum at each position. As you can see, the resulting image is noticeably blurrier than the original.
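The same blur can be reproduced with an off-the-shelf routine; here is a sketch using SciPy, with a random image standing in for the digit:

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)     # stand-in for the 28x28 digit image
mean_filter = np.ones((3, 3)) / 9  # 3x3 mean filter (weights sum to 1)

# mode='same' keeps the 28x28 size; boundary='fill' zero-pads the edges
blurred = convolve2d(image, mean_filter, mode='same',
                     boundary='fill', fillvalue=0)
print(blurred.shape)  # (28, 28)
```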
Now that we understand how to perform a convolution, how could we incorporate this mechanism into a neural network architecture?
The equivalent of a neuron in a fully connected layer is a filter in a convolutional network.
For instance if we want 3 neurons in a layer, we need 3 filters.
Imagine that each filter in this layer has the goal of identifying specific
features of the image, such as edges, textures or patterns.
Each of these filters will be convolved with the image.
As these filters work in parallel, each generates its own processed version of the image.
These are called feature maps, and they represent different interpretations of the input data.
Since we used three distinct filters in this example, our output becomes a three-channel image.
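In a framework such as PyTorch, such a layer is a one-liner; here is a sketch (the three filters are learned during training rather than hand-designed):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding=1)

image = torch.randn(1, 1, 28, 28)  # a batch of one 1-channel 28x28 image
feature_maps = conv(image)
print(feature_maps.shape)          # torch.Size([1, 3, 28, 28]): 3 feature maps
```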
Now that we have a 3-channel image, how are we going to apply the convolution in the next layer?
To process each channel of the image, we apply one filter per channel. Each filter is applied to its channel in the same way as in a regular convolution. Each of these convolutions produces its own image, which again amounts to a three-channel output. The key element is that at this step, we sum the outputs of these per-channel convolutions to get one single output channel. The convolution operation for a three-channel image can therefore be seen as convolving the image with a single 3D filter, which again gives one single feature map. Now if we want more 'neurons' per layer,
we can simply add a filter with the correct number of channels.
For instance, for three input channels, we need three-channel filters.
If we have three such filters in the layer, it will output three feature maps,
which can be viewed as a three-channel image. Now we can stack convolutional layers with as
many neurons as desired, as long as the shapes of the filters are adapted to the shape of the input.
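To make the channel-summing explicit, here is a NumPy sketch of that multi-channel convolution (using SciPy's convolve2d for each channel):

```python
import numpy as np
from scipy.signal import convolve2d

def multi_channel_conv(image, filters):
    # image: (C, H, W); filters: (O, C, k, k) -> output: (O, H, W).
    # Each output map is the SUM of the per-channel convolutions.
    out_maps = []
    for f in filters:  # one 3D filter per output 'neuron'
        per_channel = [convolve2d(image[c], f[c], mode='same')
                       for c in range(image.shape[0])]
        out_maps.append(np.sum(per_channel, axis=0))
    return np.stack(out_maps)

x = np.random.rand(3, 28, 28)          # a 3-channel input
w = np.random.rand(3, 3, 3, 3)         # 3 filters, each with 3 channels
print(multi_channel_conv(x, w).shape)  # (3, 28, 28): three feature maps
```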
We can now stack layers with any number of neurons, just like with a fully connected network.
The problem is that we can only model functions that can be expressed as compositions of convolutions, which are linear operations. With fully connected networks, we usually introduce non-linear activation layers to be able to model a wider class of functions. We do the same with CNNs: we can choose
an activation function from the many that already exist.
For instance, we could take a sigmoid, a hyperbolic tangent, or even a leaky ReLU, and why not something more exotic such as the exponential linear unit.
Some activation functions are better suited to certain architectures or certain tasks, but the
most common remains the rectified linear unit. By adding an activation function, such as ReLU,
after each convolution layer, we can achieve a more powerful convolutional network.
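Stacked in code, this looks like the following sketch (the layer sizes are chosen for illustration):

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 3, kernel_size=3, padding=1),  # 1 input channel -> 3 maps
    nn.ReLU(),                                  # non-linearity after each conv
    nn.Conv2d(3, 3, kernel_size=3, padding=1),  # 3 input channels -> 3 maps
    nn.ReLU(),
)
```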
However, this may not be sufficient for all applications.
For instance, we may want to output a vector or a scalar instead of a feature map.
In such cases, we could simply flatten the output of the network
and plug it into a fully connected layer. However, this would again require a significant
number of weights, similar to what we would need if we were to use only a fully connected network.
So, how could we reduce the dimension of the feature maps?
The process of reducing the dimension of the feature maps is called 'pooling'.
It is essentially a special case of convolution, where we set the stride and kernel size so that the size of the image is divided by two in each direction.
For average pooling, the kernel is just a mean filter.
Another common variant of pooling is max pooling, where the kernel is a max
filter. The idea is still to summarize the information in the original image.
As you can see, this produces an image that is half the size but contains roughly the same information. Pooling also enables the network to increase its 'receptive field', which is the part of the input data that affects the output of a specific neuron.
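A 2-by-2 pooling with a stride of 2 can be sketched in a few lines of NumPy (assuming even image dimensions):

```python
import numpy as np

def pool_2x2(x, mode="max"):
    # Halve each dimension by summarizing non-overlapping 2x2 blocks.
    h, w = x.shape                      # assumes h and w are even
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))  # max pooling
    return blocks.mean(axis=(1, 3))     # average pooling

x = np.random.rand(28, 28)
print(pool_2x2(x).shape)  # (14, 14)
```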
If you like the video so far, make sure you like and subscribe so you don't miss out on the next one! If you have a suggestion for a deep learning topic you would like me to animate, leave it in the comments below.
Here's our convolutional neural network as we built it earlier.
Now let's add pooling layers.
Our network now outputs 3 feature maps, each of which is only 7 pixels wide.
This means that if we were to add a fully connected layer to this
architecture as we did before, we would only need 1,470 parameters this time.
This is almost 5 times fewer parameters than without pooling!
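The arithmetic behind that figure, assuming a 10-neuron fully connected output layer (one per digit class; not stated explicitly in the transcript), is simple:

```python
# Two 2x2 poolings shrink 28 -> 14 -> 7, leaving 3 feature maps of 7x7.
flattened = 3 * 7 * 7        # 147 values after flattening
fc_weights = flattened * 10  # assumed 10 output neurons -> 1,470 parameters
print(fc_weights)            # 1470, matching the figure above
```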
With this basic convolutional network, we can now process images much more efficiently
than was possible with fully connected networks. To process this tiny 28 by 28 image with the fully
connected network, we need to learn more than 15,000 parameters, whereas with a convolutional
network we only need to learn 810 parameters. If we now take an image that is 256 pixels wide,
the number of parameters needed with the fully connected network
skyrockets to more than 1 million! On the other hand, the number of parameters of the convolutional network doesn't change, as it doesn't depend on the input size.
Thank you so much for watching this first video! If you enjoyed the video, please like and subscribe.