This content explains the concept of residual networks (ResNets) and their fundamental contribution to enabling the training of much deeper neural networks by addressing the vanishing gradient problem and signal degradation.
I want you to imagine approximating a
function parameterized by a deep neural network
in this example we're going to pass our
network an upsampled low resolution
input image and pass it through each of
the network layers
we want the network to output the input
image but now in a high resolution
a task commonly known as super resolution
unfortunately in practice after training
our network on high and low resolution
image pairs somehow our network is
spitting out images that are even worse
than our input
after putting all your
effort into a beautifully deep
architecture you are horrified to see
that instead of going down your training
loss shoots endlessly upwards
your classmates and colleagues can't help but laugh
this is seemingly counter-intuitive because
now the model has more parameters
now how can we address this and get your
loss going in the right direction
this problem partly comes down to the
fact that we have an input signal that
is being lost the deeper and deeper we
go into our network
as the signals pass through each of the
non-linear functions at each layer
look at what can happen to a training
signal even after being passed through a
single ReLU function the most popular
activation function for neural networks
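As a rough illustration of how a single ReLU can discard part of a signal, here is a minimal sketch in plain Python. The `relu` helper and the numbers in `signal` are made up for this example and are not taken from the video.

```python
def relu(values):
    """Apply ReLU element-wise: negative entries become 0, the rest pass through."""
    return [max(0.0, v) for v in values]


# A toy "signal" with both positive and negative entries (made-up numbers).
signal = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = relu(signal)

# Count how many entries were changed (zeroed out) by the activation.
lost = sum(1 for s, a in zip(signal, activated) if s != a)

print(activated)  # [0.0, 0.0, 0.0, 1.5, 3.0]
print(lost)       # 2
```

Once the negative entries have been replaced by zero, later layers have no way to recover their original values from this output alone.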
essentially you are asking the network to do
two things
one is to retain the input signal and
the second is to find out what needs to
be added to the input image to transform
it from a low to high resolution image
instead let's look at the problem from a
different angle
let's first subtract the low resolution
image from the high resolution image
this gives us what is known as a
residual image or the difference between
the two images
now let's rearrange this equation to get
our intended output on the right hand side
now given we already have the low
resolution image at training time let's
now just get our network to learn the
only bit we actually care about the residual
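Written out, the rearrangement might look like the following. The symbols are assumptions introduced here for illustration: x is the upsampled low resolution input, y is the high resolution target, r is the residual image, and F_theta is the network.

```latex
\begin{aligned}
r &= y - x \\
y &= x + r \\
\hat{y} &= x + F_\theta(x), \qquad F_\theta(x) \approx r
\end{aligned}
```

The last line is the residual framing: the network only has to predict r, and the input x is added back on afterwards.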
framing the problem in this way makes
the network's life easier as it doesn't
need to retain the entire input signal
this was the same intuition that
inspired the authors of the 2015 paper
Deep Residual Learning for Image Recognition
this paper is now considered seminal in
relation to deep learning with over
130,000 citations it is rare to run into a
model architecture in deep learning
today that doesn't utilize the
contributions from this paper in some fashion
in the previous example I gave you an
easy and intuitive introduction to
residuals let's have another look at the
layers of a neural network
I chose to present residual connections
to you using the example of super
resolution as it can be visualized very easily
by simply adding the input onto the
output we can instead learn the mapping
to the residual image as you can see here
however
this approach I've shown you so far has
two major problems when generalizing to
other tasks
the first problem is where we have a
task where the inputs and outputs don't
share the same dimensionality
for example in image classification
where you take an image input and map it
to a single class label how would you
meaningfully add the inputs and outputs
in this scenario
the second problem is how the input
signal is propagated throughout the network
let's consider the midpoint of our super
resolution network
at this point no matter what our input
or output is it is still easy for the
network to lose the training signal
this signal is an important piece of
information that would be useful for the
network to have access to
in order to remedy both these problems
we can add what are known as residual
connections all the way along our
network
this not only boosts input
signals all the way along the network
but also makes it easier to sum
inputs and outputs as feature
dimensionality is adjusted on the go
we can now also view the network as a
series of residual blocks instead of a
series of independent layers
most importantly now the network has the
option to not fully utilize all the
blocks since it is easy for each block
to output the identity function and take
no penalty in relation to the loss function
this opens the doors to training
extremely deep networks
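To make the "a block can output the identity" point concrete, here is a minimal sketch in plain Python. `residual_block` and `zero_branch` are hypothetical helpers written for this example; the sketch also leaves out the final activation that a real block applies after the addition.

```python
def residual_block(x, branch):
    """Return x + branch(x), element-wise, for a list of floats."""
    fx = branch(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    """Hypothetical learned branch whose output has been driven to zero."""
    return [0.0 for _ in x]


x = [0.2, -1.0, 3.5]

# Stack 50 blocks whose branches output zero: the input passes through untouched.
out = x
for _ in range(50):
    out = residual_block(out, zero_branch)

print(out == x)  # True
```

A block only has to push its branch towards zero to become a no-op, which is much easier than making a stack of plain layers reproduce its input exactly.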
now let's have a deeper look at the main
idea I introduced here
so what exactly was the ResNet block
they proposed in the original paper
let's go through it step by step
firstly we pass our inputs through a 3x3
convolutional layer with a stride of one
and padding of one
these parameters mean that our output
features will have the same
dimensionality as our input
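The claim that a 3x3 kernel with stride one and padding one preserves the spatial size can be checked with the standard convolution output-size formula. This is a small sketch; the function name `conv_output_size` and the example sizes are chosen here for illustration, and dilation is ignored.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# 3x3 kernel, stride 1, padding 1: the size is preserved.
print(conv_output_size(32))             # 32
print(conv_output_size(56))             # 56

# Without padding, the same kernel shrinks the feature map by two pixels.
print(conv_output_size(32, padding=0))  # 30
```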
we then apply batch norm to renormalize
these features and pass them through an
activation function such as ReLU
we then pass the features through a
second convolutional layer exactly the
same as the first and again followed by
a batch norm
at this stage we just have a normal
vanilla neural network so let's now add
a residual connection
we can do this by
simply adding the block's inputs onto
the current set of features
we do this element-wise as our inputs
and features share the same
dimensionality
remember this is only
because we have carefully chosen our
convolutional parameters
however for tasks such as image
classification we do actually want to
reduce our dimensionality throughout the
network more on that in a moment
finally we pass our features through a
final activation function
now that is
essentially it it really is quite a
simple idea
now let's have a quick look
at the official PyTorch code
implementation for a ResNet block's
forward pass and consolidate what we've
just learned
we start with an input
tensor x and save a copy of this as our
identity which we can use later
we then pass our input through a set of
convolutional batch norm and activation layers
downsampling the features if required
more on that in a moment when we
discuss dimension matching
we can then simply add our saved
identity features to our current set of
features in the network this is done
element-wise
finally we pass this through a final
activation function and return this as
the output of our residual block
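The order of operations described above can be sketched as follows. This is not the PyTorch code itself: it uses plain Python lists instead of tensors, and `block_forward`, `halve` and `no_op` are hypothetical stand-ins invented here so the control flow can run on its own.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def block_forward(x, conv1, bn1, conv2, bn2, downsample=None):
    """Forward pass of a residual block, following the order described above.

    All layer arguments are hypothetical callables mapping a list of floats
    to a list of floats; they stand in for real conv / batch norm layers.
    """
    identity = x                      # keep the block's input for later

    out = conv1(x)
    out = bn1(out)
    out = relu(out)

    out = conv2(out)
    out = bn2(out)

    if downsample is not None:        # only needed when dimensions differ
        identity = downsample(x)

    out = [o + i for o, i in zip(out, identity)]  # element-wise addition
    return relu(out)                  # final activation after the addition


# Toy stand-ins so the sketch runs: scale by 0.5, and a no-op "batch norm".
halve = lambda v: [0.5 * e for e in v]
no_op = lambda v: list(v)

y = block_forward([1.0, -2.0, 4.0], halve, no_op, halve, no_op)
print(y)  # [1.25, 0.0, 5.0]
```

In a real network the stand-ins would be replaced by actual convolution and batch norm layers, and `downsample` would be one of the shortcut options discussed later in the video.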
note that some of these choices are
arbitrary such as applying the
activation function after adding the
identity function
this is simply done because this is what
the authors found to give the best results
when performing a residual connection we
must ensure that the dimensions match
such that we can do element-wise addition
in the original paper they chose to
reduce dimensionality every few residual
blocks as their end goal is image
classification where you go from a high
dimensional input to a low dimensional
output
the authors decided to reduce
dimensions by halving the height and the
width of their current set of features
to keep the computational requirements
of each part of the network consistent
they also increase the number of
channels every time they halve the height
and the width
this leaves us with
potentially two scenarios of mismatched dimensions
firstly where the height and the width
don't match
and secondly where the channels don't
match
it could be either one of these or a
combination of the two
let's have
another look at the ResNet block and
understand how the network can
downsample features
let's have a look at
the first convolution which I told you
earlier had a stride of one and padding
of one to keep input feature
dimensionality the same as the output
the authors propose to downsample
features directly by occasionally
altering this convolutional layer to
have a stride of two
this produces
features with half the height and half
the width
when the authors downsampled in this
fashion they also doubled the number of
convolutional filters which in turn
doubles the number of channels in the
output features
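A quick arithmetic check of why halving height and width while doubling the channels keeps the per-layer cost roughly constant. The helper names and the sizes (a 56x56 feature map with 64 channels) are illustrative assumptions, and the cost model is a simple multiply-add count for a 3x3 convolution.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_multiply_adds(height, width, in_channels, out_channels, kernel=3):
    """Multiply-adds of one convolution, counted at its output resolution."""
    return height * width * in_channels * out_channels * kernel * kernel


h = w = 56   # illustrative feature map size
c = 64       # illustrative channel count

# A stride of two halves the height and the width.
h2 = conv_output_size(h, stride=2)
w2 = conv_output_size(w, stride=2)

# Compare a regular layer before the downsampling with a regular layer after it.
before = conv_multiply_adds(h, w, c, c)
after = conv_multiply_adds(h2, w2, 2 * c, 2 * c)

print(h2, w2)           # 28 28
print(before == after)  # True
```

Halving both spatial dimensions divides the number of output positions by four, while doubling both input and output channels multiplies the work per position by four, so the two effects cancel.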
this is where we have a problem with
dimension matching as our input that is
sent through our residual connection
does not have the same dimensions as the
features coming out of the block
let's now have a look at our input
features coming through our residual
connection and see what options are
available to us when addressing this
dimension mismatch
the authors proposed two solutions
firstly they propose to match the number
of feature channels by zero padding
this option has the benefit of
introducing no new parameters into the
model this is simply done by filling out
half the features with zeros
although no parameters are added we are
now wasting computation on meaningless
features full of zeros
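A minimal sketch of the zero padding option, using nested Python lists indexed as `[channel][row][column]` instead of tensors. The function name `zero_pad_shortcut` and the toy values are assumptions made for this example, and the stride of two is the spatial subsampling that the video describes a little later for both shortcut options.

```python
def zero_pad_shortcut(features, out_channels):
    """Parameter-free shortcut: keep every other pixel, then append zero channels.

    `features` is a nested list indexed as [channel][row][column].
    """
    # Stride of two: keep rows 0, 2, 4, ... and columns 0, 2, 4, ...
    subsampled = [[row[::2] for row in channel[::2]] for channel in features]

    height = len(subsampled[0])
    width = len(subsampled[0][0])
    missing = out_channels - len(subsampled)

    # The extra channels carry no information: they are all zeros.
    zeros = [[[0.0] * width for _ in range(height)] for _ in range(missing)]
    return subsampled + zeros


# One 4x4 channel with made-up values.
x = [[[1.0, 2.0, 3.0, 4.0],
      [5.0, 6.0, 7.0, 8.0],
      [9.0, 10.0, 11.0, 12.0],
      [13.0, 14.0, 15.0, 16.0]]]

y = zero_pad_shortcut(x, out_channels=2)
print(y[0])  # [[1.0, 3.0], [9.0, 11.0]]
print(y[1])  # [[0.0, 0.0], [0.0, 0.0]]
```

The second channel is exactly the "meaningless features full of zeros" mentioned above: it matches the shape but adds no information.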
the second solution is to match the
number of channels by passing over our
input features with a one by one convolution
this of course adds extra parameters
but it means that our output features only
contain real information
so for example if our input features
have three channels we would now have
six one by one convolutional filters to
double the number of channels in our
output space
for both options a stride
of two is again used
this means that our
output feature maps have half the width
half the height and double the channels
this means that they exactly match our
current features in the network
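The three-to-six channel example can be sketched as below, again with nested lists indexed `[channel][row][column]`. `conv1x1_shortcut` is a hypothetical helper and the weight values are arbitrary placeholders, not learned parameters; no bias term is included.

```python
def conv1x1_shortcut(features, weights, stride=2):
    """1x1 convolution shortcut on a nested list indexed [channel][row][column].

    `weights[o][i]` is the weight from input channel i to output channel o,
    so the layer has len(weights) * len(features) parameters (no bias here).
    """
    rows = range(0, len(features[0]), stride)
    cols = range(0, len(features[0][0]), stride)
    return [
        [[sum(w * features[i][r][c] for i, w in enumerate(filt)) for c in cols]
         for r in rows]
        for filt in weights
    ]


# 3 input channels of size 4x4; channel i is filled with the constant i + 1.
x = [[[float(i + 1)] * 4 for _ in range(4)] for i in range(3)]

# 6 filters double the channel count; the values are arbitrary placeholders.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
           [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

y = conv1x1_shortcut(x, weights)
print(len(y), len(y[0]), len(y[0][0]))  # 6 2 2
print(y[5][0][0])                       # 6.0
print(len(weights) * len(weights[0]))   # 18
```

Each output pixel is a weighted mix of the input channels at a single location, so every output channel carries real information, at the price of 18 extra weights in this toy case.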
essentially these two options are very
similar they both skip over every other
pixel in the input features
the main
difference is whether we zero-pad our
output or use the one by one
convolutional option to match the number
of channels
the authors found that the one by one
convolutional option led to the best results
hey guys it's Rupert thanks for watching
the video I hope you now have an
intuitive understanding of residual
networks the main idea is that you can
now train your networks deeper and
deeper whilst keeping training stable
please don't forget to hit that like and
subscribe button for more machine
learning videos and let me know in the
comment section down below what you want