Attention in Vision Models: An Introduction (NPTEL-NOC IITM)
Video Transcript
Having discussed RNNs last week, we'll now move to a very contemporary topic that tries to address some of the technical limitations of RNNs: attention models. Before we go into attention models, let's discuss the question that we left behind: what do you think will happen if you train a model on normal videos and do inference on a reversed video? I hope you had a chance to think about this.

It depends on the application or task. For certain activities, say differentiating walking from jumping, it could work to a certain extent even if you tested it on a reversed video. However, for certain other activities, say a sports action such as a tennis forehand, this may not be that trivial.

An interesting related problem in this context is known as finding the "arrow of time". There are a few interesting papers in this direction, where the task at hand is to find out whether a video is playing forward or backward. This can be trivial in some cases, but it can get complex in others. If you're interested, please read the paper "Learning and Using the Arrow of Time".
So far, we have seen that RNNs can be used to efficiently model sequential data, and that RNNs use backpropagation through time as the training method. Unfortunately, RNNs suffer from the vanishing and exploding gradient problems. To handle the exploding gradient problem one can use gradient clipping, and to handle the vanishing gradient problem one can use RNN variants such as LSTMs or GRUs. This was all good: we saw how to use these for handling sequential learning problems. But the question we ask now is: is this sufficient? Are there tasks where an RNN may not be able to solve the problem? Let's find out more about this.
Let's consider a couple of popular tasks where RNNs may be useful. One is the task of image captioning: given an image, one has to generate a sequence of words to make a caption that describes the activity or the scene in the image. Another example where RNNs are extremely useful is the task of neural machine translation, or what is known as NMT. It's what you see in the translation apps you may be using: you have a sentence given in a particular language, and you have to produce the equivalent sentence in a different language. Both of these are RNN tasks.

A standard approach to handling such tasks is this: given any input, which could be video, an image, audio, or text, you first pass the input through an encoder network, which gives you a representation of that input that we call the context vector. Given this context vector, you pass it through a decoder network, which gives you your final output text. These are known as encoder-decoder models, and they're extensively used in this context.
Now, let's take a brief detour to understand encoder-decoder models a bit more. A canonical instance of such encoder-decoder models is the autoencoder; in this case, the decoder tries to reproduce the input itself, and that's the reason why it is called an autoencoder. Not all encoder-decoder models need to be autoencoders; however, the conceptual framework of encoder-decoder models comes from autoencoders, which is why we're discussing them briefly before we come back to encoder-decoder models.
An autoencoder is a neural network architecture where you have an input vector, a network which we call the encoder network, then a context vector (also called the bottleneck layer), which is a representation of the input, and then a decoder layer or network which outputs a certain vector.

In an autoencoder, we set the target values to the inputs themselves, so you're asking the network to predict the input itself. What are we really trying to learn here? We're trying to learn a function f, parameterized by weights W and biases b, such that f(x; W, b) = x; in other words, we are trying to learn the identity function itself and predict an output x̂ which is close to x.

How would you learn such a network using backpropagation? What kind of loss function would you use? It would be a mean squared error, measuring the error between x and x̂, the reconstruction produced by the autoencoder. Then you can learn the weights of the network using backpropagation, as with any other feed-forward neural network.
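To make this concrete, here is a minimal sketch of an undercomplete autoencoder trained with a mean-squared-error reconstruction loss. This is only an illustrative example, not the architecture from the lecture; the layer sizes and the use of PyTorch are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

# A minimal autoencoder: the encoder compresses x to a bottleneck code,
# the decoder mirrors it back to the input dimension (sizes are assumed).
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim),
        )
        self.decoder = nn.Sequential(           # mirror of the encoder
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
criterion = nn.MSELoss()                        # reconstruction loss ||x - x_hat||^2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                         # a dummy batch of inputs
optimizer.zero_grad()
x_hat = model(x)
loss = criterion(x_hat, x)                      # the target is the input itself
loss.backward()                                 # backpropagation as in any feed-forward net
optimizer.step()
```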
Now, the encoder and the decoder need not be just one layer; you could have several layers in the encoder and, similarly, several layers in the decoder. In the autoencoder setting, traditionally, the decoder is a mirror architecture of the encoder: if you have a set of layers in the encoder with a certain number of dimensions (a number of hidden nodes in each layer), then the decoder mirrors the same architecture the other way, to ensure that you get an output of the same dimension as the input. That's when you can actually measure the mean squared error between the reconstruction and the input. However, while this is the case for an autoencoder, not all encoder-decoder models need to have such architectures; you can have a different architecture for the encoder and a different architecture for the decoder, depending on the task.
To understand a variant of the autoencoder: a popular one is known as the denoising autoencoder. In a denoising autoencoder, you take your input data and intentionally corrupt the input vector, for example by adding something like Gaussian noise, and you get a set of corrupted values x̂_1 to x̂_n. You now pass these through your encoder to get a representation, then through the decoder, and you finally try to reconstruct the original input itself. What is the loss function here? It would again be the mean squared error, but this time between your output and the original, uncorrupted input. What are we trying to do here? We are trying to ensure that the autoencoder generalizes well at the end of training, so that even if there is some noise in the input, the autoencoder is able to recover your original data.
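As a small illustration, here is a sketch of that denoising setup, under the assumption of additive Gaussian corruption and a tiny placeholder network; the corrupted input goes through the network, while the loss is computed against the clean input.

```python
import torch
import torch.nn as nn

# A tiny placeholder encoder-decoder (dimensions are assumptions).
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                      # clean inputs
x_noisy = x + 0.1 * torch.randn_like(x)      # corrupt with additive Gaussian noise

optimizer.zero_grad()
x_hat = model(x_noisy)                       # reconstruct from the corrupted input
loss = criterion(x_hat, x)                   # MSE against the clean, uncorrupted input
loss.backward()
optimizer.step()
```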
With that introduction to autoencoders, let's ask one question. In all the architectures that we saw so far with autoencoders, the hidden layers were always smaller in size (in dimension) when compared to the input layer. Is this always necessary? Can you go larger?

Autoencoders where the hidden layers have a smaller dimension than the input layer are called undercomplete autoencoders. You could say that such autoencoders learn a lower-dimensional representation, on a suitable manifold of the input data, from which the decoder can reconstruct your original input. On the other hand, if you had an autoencoder architecture where the hidden layer dimension is larger than your input, you would call it an overcomplete autoencoder. While technically this is possible, the limitation is that the autoencoder could blindly copy inputs into certain dimensions of that larger hidden layer and still be able to reconstruct, which means such an overcomplete autoencoder can learn trivial solutions which don't really give you useful performance: it may simply memorize the inputs and copy them back to the output layer.

Then the question is: are all autoencoders also dimensionality reduction methods, assuming we are talking about undercomplete autoencoders? Partially yes; largely speaking, autoencoders can be used as dimensionality reduction techniques. A follow-up question then is: can an autoencoder be considered similar to principal component analysis, which is a popular dimensionality reduction method? The answer is actually yes, but I'm going to leave this for you as homework: work out the connection to PCA.
Let's now come back to what we were talking about: one of the tasks for RNNs, namely neural machine translation, or NMT. These kinds of encoder-decoder models are also called sequence-to-sequence models, especially when the input is a sequence and the output is also a sequence. Suppose you had an input sentence which says "India got its independence from the British", and we now want to translate this English sentence to Hindi. What you would do is have an encoder network, a recurrent neural network (RNN), where each word of your input sentence is given at one time step of the RNN, and the final output of the RNN would be what we call a context vector. This context vector is fed into a decoder RNN, which gives you the translated Hindi sentence word by word, followed by an end-of-sentence token. This is what we saw as a many-to-many RNN last week.
Why aren't we giving an output at each time step of the encoder RNN? For a machine translation task, if you recall the recommended architecture, we said that it's wiser to read the full sentence and only then start giving the output of the translated sentence. Why so? Because different languages have different grammars and sentence constructions: the first word in English need not correspond to the first word in Hindi, and the Hindi sentence may not follow exactly the same sequence of words as the English one because of grammatical conventions. That's the reason why, in machine translation tasks, you generally read the entire input sentence, obtain a context vector, and then produce the entire translated output.

Similarly, if you considered the image captioning task, you would have an image, and in this case your encoder would be a CNN followed by, say, a fully connected network, out of which you get a representation or a context vector. This context vector goes to a decoder which outputs the caption, "a woman ... in the park", followed by an end-of-sentence token.

So what's the problem? This seems to work well; is there a problem at all? Let's analyze this a bit more closely.
In an RNN, the hidden states are responsible for storing relevant input information. So you could say that the hidden state at time step t, h_t, is a compressed form of all previous inputs: that hidden state represents some information from all the previous inputs which is required for processing in that state as well as in future states.

Now let's consider a longer sequence. If you consider language processing and a large paragraph, and your input is very long, can h_t, the hidden state at any time step, encode all of this information? Not really; you may be faced with the information bottleneck problem in this kind of context. So if you considered a sentence such as the one shown here, which has to be translated to German, can we guarantee that words seen at earlier time steps can be reproduced at later time steps? Remember, when you go from a language such as English to a language such as German, the positions of the verbs and the nouns may all change, and to reproduce this, a word that appears early in the English sentence may have to be produced much later in the German one. Is this possible? Unfortunately, RNNs don't work that well when you have such long sequences.

The same holds for image captioning and related problems such as visual question answering, which we will see later. If you had the image that we saw at the very beginning of this course and we asked the question "What is the name of the book?", the expected answer is "The name of the book is Lord of the Rings." The relevant information in a cluttered image may also need to be preserved in case there are follow-up dialogues.
A statistical way of understanding this is through what is known as the BLEU score. BLEU is a common performance metric used in NLP (natural language processing); it stands for "bilingual evaluation understudy". It's a metric for evaluating the quality of machine-translated text, and it's also used for other tasks such as image captioning, visual question answering, and so on. When one looks at the BLEU score as sentence length increases, one observes that, while ideally the score should stay high even for long sequences, in practice, once the sentence length goes beyond a threshold, the BLEU score starts falling. This means that encoder-decoder models where both the encoders and decoders are RNNs start failing in these cases, when the sequences are long by nature. If you'd like to know more about BLEU, you can look up this reference.
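If you want to try computing BLEU yourself, one convenient option is the implementation in NLTK; the reference and candidate sentences below are made up purely for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference translation(s) and a candidate machine translation
# (illustrative sentences, not taken from the lecture).
reference = [["india", "gained", "independence", "from", "the", "british"]]
candidate = ["india", "got", "independence", "from", "the", "british"]

# Smoothing avoids zero scores when some n-gram orders have no matches.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```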
So what is the solution to this problem? The solution which is extensively used today is what is known as attention, which is going to be the focus of this week's lectures.

What is attention? Intuitively speaking, given an image, if we had to ask the question "What is this boy doing?", the human way of answering would be to first identify the artifacts in the image and then pay attention to the relevant ones, in this case the boy and the activity the boy is associated with. Similarly, if you had an entire paragraph and you had to summarize it, you would probably look at certain parts of the paragraph and write them out in summarized form. So paying attention to parts of inputs, be they images or long sequences like text, is an important part of how humans process data.

Let's now see this in a sequence learning problem, in the traditional encoder-decoder model setting. This is once again the many-to-many RNN setting, similar to what we saw for neural machine translation. You have your inputs, then a context vector that comes out at the end of the inputs, and that context vector is fed to a decoder RNN which gives you the outputs y_1 to y_K. Now let's assume that the h_j are the hidden states of the encoder and the s_j are the hidden states of the decoder. So, what does attention do?
Attention suggests that instead of directly passing h_T, the last hidden state, to your decoder RNN, we construct a context vector which relies on all of the hidden states from the input. This creates a shortcut connection between the context vector c_t and the entire source input x.

How would you learn this context vector? We'll see that there are multiple different ways. Given this context vector, the decoder hidden state s_t is given by some function f of: s_{t-1}, the previous hidden state in the decoder; y_{t-1}, the output of the previous decoder time step (which can be given as input to the next time step as well); and c_t. That is, s_t = f(s_{t-1}, y_{t-1}, c_t). And what is this context vector? It is given by c_t = sum over j of alpha_{tj} * h_j, where the sum runs over all the time steps of your encoder RNN; so it's a weighted combination of all of the hidden-state representations in your encoder RNN.

How do you find the alpha_{tj}, the weights on the different inputs? A standard framework is to obtain alpha_{tj} as a softmax over some scoring function that captures the score between s_{t-1} and each of the hidden states in your encoder: alpha_{tj} = softmax_j(score(s_{t-1}, h_j)). Here s_{t-1} gives us the current context of the output, so we try to understand how well the current output context aligns with each of the inputs, and accordingly pay attention to specific parts of the inputs.

Now there's an open question: how do you compute this score between s_{t-1} and each of the h_j in the encoder RNN? Once we have a way of computing that score, we take a softmax over it with respect to all of the h_j; we do this for each h_j in the encoder RNN, and using that we compute the alpha_{tj}, and using the alpha_{tj} we compute the context vector. Once you have the context vector, you give the corresponding context vector as input to each time step of the decoder RNN.
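As a minimal sketch of this computation (NumPy, with made-up dimensions, and a plain dot product standing in for the scoring function discussed next): the attention weights are a softmax over the scores, and the context vector is the corresponding weighted sum of the encoder hidden states.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

T, d = 6, 8                          # encoder time steps, hidden-state dimension (assumed)
H = np.random.randn(T, d)            # encoder hidden states h_1 ... h_T
s_prev = np.random.randn(d)          # previous decoder state s_{t-1}

scores = H @ s_prev                  # score(s_{t-1}, h_j), here a simple dot product
alpha = softmax(scores)              # attention weights alpha_{tj}, summing to 1
c_t = alpha @ H                      # context vector c_t = sum_j alpha_{tj} * h_j
```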
How do you compute this score? There are a few different approaches in the literature at this time; we will review many of them over the lectures this week, but to give you a summary. You could have content-based attention, which takes s_t and h_i (a particular hidden state of your decoder RNN and a particular hidden state of the encoder RNN) and computes the cosine similarity between the two; that's one way of measuring the score. You could also learn weights to compute this alignment: take s_t and h_i, apply a learned set of weights W_a, take a tanh, and use another learned vector to get the score; this is a learned procedure to get your final score. One could also get alpha_{tj} as a softmax over a learned set of weights W_a applied to s_t. One could use a more general framework, s_t transposed times W_a times h_i, which is similar to the cosine in that it gives you a dot product, but with a learned set of weights in between that tells you how to compare the two vectors s_t and h_i. Remember, any W_a here is learned by the network to compute the score. Or you could simply use the dot product by itself, s_t transposed times h_i, which behaves similarly to content-based attention (the cosine and the dot product give similar values). Finally, there is a variant known as scaled dot-product attention, where you use the dot product between the two vectors s_t and h_i but scale it by the square root of n, the dimension of the hidden-state vectors.
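Here is a rough sketch of these scoring variants in NumPy. The learned matrices and vectors (W_a, v_a, and the concatenation projection) are shown as random placeholders; in a real model they would be trained along with the rest of the network.

```python
import numpy as np

d = 8                                              # hidden-state dimension (assumed)
s_t, h_i = np.random.randn(d), np.random.randn(d)  # decoder and encoder hidden states
W_a = np.random.randn(d, d)                        # learned weight matrix (placeholder)
v_a = np.random.randn(d)                           # learned vector for additive attention
W_cat = np.random.randn(d, 2 * d)                  # placeholder projection of [s_t; h_i]

# Content-based: cosine similarity between s_t and h_i
cosine = s_t @ h_i / (np.linalg.norm(s_t) * np.linalg.norm(h_i))

# Additive (learned): v_a^T tanh(W [s_t; h_i])
additive = v_a @ np.tanh(W_cat @ np.concatenate([s_t, h_i]))

# General: s_t^T W_a h_i
general = s_t @ W_a @ h_i

# Dot product: s_t^T h_i
dot = s_t @ h_i

# Scaled dot product: s_t^T h_i / sqrt(n), with n the hidden dimension
scaled_dot = dot / np.sqrt(d)
```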
What about spatial data? We saw how this is done for temporal data, where you had a sequence-to-sequence, many-to-many RNN. What if you had an image captioning task, that is, spatial data? In this case, your image would give you a certain representation s_0 out of the encoder network. Unfortunately, when you use a fully connected layer after the CNN, you lose spatial information in that vector. So instead of using the fully connected layer, we typically take the output of the convolutional layers themselves, which gives you a certain volume, let's say M x N x C. Now, if you consider one specific patch of this M x N x C volume, you can trace it back to a particular patch of the original image that was passed through the CNN. So if you look at one particular part of the depth volume of the output feature map, say a conv5 feature map, it corresponds to a certain patch in the input image.

This gives you spatial information. So what can we do? We take the feature map at the output of a certain convolutional layer and unroll it into 1 x 1 x C vectors: you have an M x N x C volume, so you can unroll it into M x N different vectors, each of dimension C, and then you can apply attention to get a context vector.
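A minimal sketch of that unrolling step (NumPy, with assumed feature-map dimensions): the M x N x C convolutional output is reshaped into M*N location vectors of dimension C, and attention weights over those locations give a spatially aware context vector.

```python
import numpy as np

M, N, C = 7, 7, 512                        # conv feature-map size (assumed, e.g. a conv5 output)
feature_map = np.random.randn(M, N, C)     # output volume of the last convolutional layer

locations = feature_map.reshape(M * N, C)  # M*N vectors, one per spatial patch of the image

s_prev = np.random.randn(C)                # current decoder state (e.g. from the caption RNN)

scores = locations @ s_prev                # score each image location against the decoder state
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                       # softmax: attention weights over image regions

c_t = alpha @ locations                    # context vector: weighted sum of location features
```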
In what way is this useful? This context vector can now be understood as paying attention to certain parts of the image while producing the output, because each of these bands, each of these sub-volumes highlighted in yellow, corresponds to certain parts of the input image. One can now apply the same weighted-attention concept: the alignment part of it can be implemented very much like what we saw on the previous slide, but now it represents different parts of the input image.

Another use of performing attention is that it gives you explainability of the final model. Why so? If you have, say, a machine translation task, then when a certain output word is generated by the decoder RNN, your attention model, or your context vector, tells you which part of the input you looked at while predicting that word, and that automatically tells you which words in your input sequence corresponded to a given word in your output. So in this case you can see that the phrase "European Economic Area" depended on "zone économique européenne"; that is highlighted by these white patches here. White means higher dependence, black means no dependence, and looking at this heat map gives you an understanding of how the model translated from one language to another. What about images and the image captioning task?
In this case too, you can use the same idea. Given an image, if the model is generating a caption, you can see that the model generates each word of the caption by looking at certain parts of the image. For example, when it says "a", it seems to be looking at a particular part of the image; when it says "a woman", it seems to be looking at a certain part of the image while the other object is also in relevance. If you keep going, you see that when it says the word "throwing", it seems to focus on the woman; for the word "frisbee", it actually seems to focus on the frisbee in the image; and for the word "park", it seems to focus on everything other than the woman and the child. This gives you an understanding of, and trust that, the model is looking at the right things while generating the output.
What kinds of attention can one have? One categorization is hard versus soft attention. What do these mean? In hard attention, you choose one part of the image as the only focus for giving a certain output; say, in image captioning, you look at only one patch of the image to be able to produce a word as output. This choice of a position can end up becoming a stochastic sampling problem, and hence one may not be able to backpropagate through such a hard attention mechanism, because that stochastic sampling step can be non-differentiable. We'll see this in more detail in the next lecture. On the other hand, one could have soft attention, where you do not choose a single part of the image but simply assign weights to every part of it; in this case you effectively have a reweighted image where each part carries a certain weight. Here your output turns out to be deterministic and differentiable, and hence you can use such an approach along with standard backpropagation.
Another categorization of attention is global versus local attention. In global attention, all the input positions are considered for attention, whereas in local attention, only a neighborhood window around the object of interest or the area of interest is chosen for attention. A third kind, which is very popular today, is known as self-attention, where the attention is not of a decoder RNN with respect to the encoder, or of an output RNN with respect to parts of an image, but of a part of a sequence with respect to another part of the same sequence. This is also known as intra-attention, and we'll see it in more detail in a later lecture this week.
Your homework for this lecture is to read the excellent blog post by Lilian Weng, "Attention? Attention!", hosted on GitHub. And one question that we left behind: is there a connection between an autoencoder and principal component analysis? Think about it, and we'll discuss this in the next lecture.