0:18 having discussed RNNs last week we'll
0:23 now move to a topic which is very
0:26 contemporary in terms of trying to
0:28 address some of the technical limitations
0:32 of RNNs in deep learning which is attention
0:35 models before we go into attention
0:38 models let's discuss the question that
0:42 we left behind which was what do you
0:45 think will happen if you train a model
0:49 on normal videos and do inference on a reversed
0:53 video hope you had a chance to think about
0:58 this it depends on the application or
1:01 task for certain activities
1:03 let's say you want to differentiate
1:06 walking from jumping it could work to a
1:09 certain extent even if you tested it on
1:12 a reversed video however for certain
1:14 other activities say a Sports Action
1:17 such as a tennis forehand this may not
1:19 be that
1:23 trivial an interesting related problem
1:27 in this context is known as finding the
1:31 arrow of time there are a few
1:35 interesting papers in this direction
1:39 where the task at hand is to find out
1:42 whether the video is forward or
1:45 backward this can be trivial in some
1:48 cases but this can get complex in some
1:51 cases if you're interested please read
1:55 this paper known as Learning and
1:57 Using the Arrow of Time if you'd like to know more
2:05 so far with rnns we saw that rnns can be
2:07 used to efficiently model sequential
2:11 data RNNs use backpropagation through
2:15 time as the training method RNNs
2:18 unfortunately suffer from the vanishing
2:21 and exploding gradients problems to
2:24 handle the exploding gradient problem
2:28 one can use gradient clipping and to
2:30 handle the vanishing gradients
2:34 problem one can use RNN variants such
2:37 as LSTMs or
2:40 GRUs this was good we saw how to use
2:42 these for
2:45 handling sequential learning problems
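As a quick aside on the recap above: in a framework like PyTorch (my choice here, not something the lecture specifies), gradient clipping is typically a one-line call between the backward pass and the optimizer step. Everything named below is a placeholder for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical placeholder model and data, only to make the snippet runnable.
model = nn.LSTM(input_size=32, hidden_size=64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 1, 32)            # (seq_len, batch, input_size)
output, _ = model(x)
loss = output.pow(2).mean()           # dummy loss just to produce gradients
loss.backward()

# Clip the global gradient norm before the update so exploding gradients
# cannot blow up the parameter step; max_norm is a tunable choice.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```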
2:49 but the question we ask now is this
2:54 sufficient are there tasks where an RNN may
2:57 not be able to solve the problem let's
3:01 find out more about this let's consider a
3:04 couple of popular tasks where RNN may be
3:09 useful one is the task of image
3:12 captioning given an image one has to
3:16 generate a sequence of words to make a
3:20 caption that describes the activity or
3:22 the scene in the
3:26 image another example where rnns are
3:29 extremely useful is the task of neural
3:31 machine translation or what is
3:35 known as NMT it's also what you see
3:38 on your translation apps that you may
3:41 be using where you have a
3:43 sentence given in a particular language
3:46 and then you have to give the equivalent
3:48 sentence in a different
3:52 language both of these are RNN
3:55 tasks a standard approach to handling
3:59 such tasks is given any input your input
4:03 could be video could be an image
4:06 could be audio or could be text you
4:10 first pass these inputs through an
4:13 encoder Network which gets you a
4:16 representation of that input which we
4:18 call the context
4:22 Vector given this context Vector you
4:25 pass this through a decoder Network
4:29 which gives you your final output text
4:31 these are known as encoder decoder
4:34 models and they're extensively used in
4:41 such contexts now let's take a brief detour to
4:45 understand encoder decoder models a bit
4:49 more the standard name for such encoder
4:52 decoder models is known as the auto
4:55 encoder although in this case the idea is
4:59 that the network is trying to encode the
5:01 input itself and that's the reason why
5:05 this is called an autoencoder not all
5:08 encoder decoder models need to be Auto
5:11 encoders however the conceptual
5:14 framework of encoder decoder models
5:17 comes from Auto encoders which is why
5:20 we're discussing this briefly before we
5:23 come back to encoder decoder
5:26 models an autoencoder is a neural
5:30 network architecture where you have
5:34 an input vector you have a network which
5:37 we call the encoder network and then
5:40 you have a context vector or we also
5:43 call that the bottleneck layer which is
5:46 a representation of the input and then
5:49 you have a decoder layer or a network
5:52 which outputs a certain
5:56 vector in an autoencoder we try to set
6:00 the target values to the inputs themselves
6:04 so you're asking the network to predict
6:07 the input itself so what are we really
6:09 trying to learn here we're trying to
6:14 learn a function f parameterized by some
6:19 weights W and biases b such that f(x) = x
6:21 rather we are trying to learn the
6:24 identity function itself and predict an
6:27 output x hat which is close to
6:31 X so how would you learn such a network
6:32 using back
6:34 propagation what kind of a loss function
6:37 would you use it would be a mean squared
6:39 error where you're trying to measure the
6:43 error between x and x hat which is the
6:46 Reconstruction of the autoencoder then
6:49 you can learn the weights in the network
6:52 using back propagation as with any other
6:54 feed forward neural
6:59 network
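To make the reconstruction objective concrete, here is a minimal autoencoder sketch in PyTorch. The framework, layer sizes, and random data are my own illustrative assumptions, not the lecture's code.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Under-complete autoencoder: input -> bottleneck (context vector) -> reconstruction."""
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, bottleneck_dim), nn.ReLU())
        self.decoder = nn.Linear(bottleneck_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)      # bottleneck representation of the input
        return self.decoder(z)   # reconstruction x_hat

model = AutoEncoder()
criterion = nn.MSELoss()                                   # error between x and x_hat
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 784)        # placeholder batch standing in for real inputs
x_hat = model(x)
loss = criterion(x_hat, x)      # the target is the input itself
loss.backward()                 # backpropagation as in any feed-forward network
optimizer.step()
```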
7:02 now the encoder and the decoder need not be just one layer you could
7:05 have several layers in the encoder
7:07 similarly several layers in the
7:10 decoder in the auto encoder setting
7:14 traditionally the decoder is a mirror
7:16 architecture of the encoder so if
7:18 you have a set of layers in the
7:20 encoder with a certain number of
7:22 dimensions number of hidden nodes in
7:25 each of these layers then the decoder
7:27 mirrors the same architecture the other
7:31 way to ensure that you can get an output
7:33 which is of the same Dimension as the
7:35 input that's when you can actually
7:38 measure the mean square error between
7:40 the Reconstruction and the
7:43 input however while this is the case for
7:47 an auto encoder not all encoder decoder
7:50 models need to have such architectures
7:52 you can have a different architecture
7:54 for an encoder and a different
7:57 architecture for a decoder depending on the task at hand
8:04 just to understand a variant of the
8:08 autoencoder a popular one is known as
8:10 the denoising autoencoder in a
8:14 denoising autoencoder you have your input
8:17 data you intentionally corrupt your
8:19 input vector for example you can add
8:21 something like Gaussian noise and you
8:25 would get a set of values X1 hat to xn
8:28 hat so those are your corrupted input
8:31 values now you pass this through your
8:34 encoder you get a representation then a
8:37 decoder and you finally try to
8:41 reconstruct the original input
8:44 itself what is the loss function here
8:46 the loss function here would again be
8:49 mean squared error this time it would be the
8:53 mean squared error between your output and
8:56 the original uncorrupted
8:59 input what are we trying to do here we
9:01 are trying to ensure that the auto
9:05 encoder can generalize well at
9:09 the end of training so that even
9:11 if there was some noise in the input the
9:14 auto encoder would be able to recover
9:17 your original data
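A minimal sketch of the denoising setup, assuming the AutoEncoder class from the earlier snippet; the Gaussian noise scale of 0.1 is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn

model = AutoEncoder()                       # the sketch class from the earlier snippet
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 784)                    # clean (placeholder) inputs
x_noisy = x + 0.1 * torch.randn_like(x)     # intentionally corrupt with Gaussian noise
x_hat = model(x_noisy)                      # reconstruct from the corrupted input
loss = criterion(x_hat, x)                  # loss against the ORIGINAL uncorrupted input
loss.backward()
optimizer.step()
```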
9:21 with that introduction to auto
9:23 encoders let's ask one
9:26 question in all the architectures that
9:30 we saw so far with Auto encoders
9:32 we saw that the hidden layers were
9:36 always smaller in size in dimension when
9:37 compared to the input
9:40 layer is this always
9:44 necessary can you go larger
9:48 autoencoders where the hidden layers
9:50 have a smaller dimension than the input
9:53 layer are called undercomplete
9:57 autoencoders so you can say that such
10:01 autoencoders learn a lower dimensional
10:04 representation on a suitable manifold of
10:07 input data from which if you use the
10:11 decoder you can reconstruct back your original
10:16 input on the other hand if you had an
10:19 autoencoder architecture where the
10:22 hidden layer Dimension is larger than
10:24 your input you would call such an
10:28 autoencoder an overcomplete
10:30 autoencoder while technically this is
10:33 possible the limitation here is that the
10:38 autoencoder could blindly copy certain
10:40 inputs to certain dimensions of
10:42 that hidden layer which is larger in
10:45 size and still be able to reconstruct
10:48 which means such an overcomplete
10:52 autoencoder can learn trivial Solutions
10:54 which don't really give you useful
10:57 performance they may simply memorize all
11:00 the inputs and just copy inputs back to
11:07 the output layer then the question is are all
11:09 autoencoders also dimensionality
11:11 reduction methods assuming we are
11:14 talking about undercomplete
11:17 autoencoders partially yes largely
11:21 speaking autoencoders can be used as
11:24 dimensionality reduction
11:28 techniques a follow-up question then is
11:33 can an autoencoder be considered
11:36 similar to principal component analysis
11:38 which is a popular dimensionality reduction
11:44 method the answer is actually yes again
11:46 but I'm going to leave this for you as
11:50 homework to work out the connection to
11:57 PCA let's now come back to what we were
12:00 talking about which was one of the
12:04 tasks of RNNs which is neural machine
12:07 translation or
12:10 nmt these kinds of encoder decoder
12:13 models are also called sequence to
12:15 sequence models especially when you have
12:19 an input to be a sequence and an output
12:21 also to be a
12:25 sequence so if you had an input sentence
12:28 which says India got its independence
12:29 from the
12:32 British let's say now that we want to
12:37 translate this English sentence to Hindi
12:40 what you would do now is you would have
12:43 an encoder Network which would be a
12:47 recurrent neural network an RNN where
12:50 each word of your input sentence is
12:53 given at one time step of the RNN and
12:56 the final output of the RNN would be
13:00 what we call a context vector
13:03 and this context Vector is fed into a
13:08 decoder RNN which gives you the output
13:10 which starts with
13:13 Bharat then the rest of the sentence ending with mili and
13:16 then you have an end of sentence
13:19 token this is what we saw as a many to
13:24 many RNN last week why aren't we giving
13:26 an output at each time step of the encoder
13:31 RNN for a machine translation task
13:34 if you recall the recommended
13:37 architecture we said that it's wiser to
13:41 read the full sentence and then start
13:44 giving the output of the translated
13:47 sentence why so because different
13:51 languages have different grammars and
13:54 sentence constructions so it may not
13:58 be correct for the first word in English
14:02 to be the first word in Hindi or the
14:05 Hindi sentence may not exactly follow
14:08 the same sequence of words in English
14:10 because of grammatical
14:13 rules so that's the reason why in
14:17 machine translation tasks you generally
14:20 have reading of the entire input
14:23 sentence you get a context vector and
14:25 then you start giving the entire
14:29 translated output similarly
14:32 if you considered the image captioning
14:36 task you would have an image and in this
14:39 case your encoder would be a CNN
14:41 followed by say a fully connected
14:44 Network out of which you get a
14:47 representation or a context vector and
14:50 this context Vector goes to a decoder
14:53 which outputs the caption a woman dot
14:57 dot dot say in the park end of
14:59 sentence what's the problem this seems
15:03 to work well is there a problem at all
15:06 let's Analyze This a bit more
15:10 closely so in an RNN the hidden states
15:13 are responsible for storing relevant
15:15 input information in
15:19 RNN so you could say that a hidden State
15:22 at time step t or
15:26 HT is a compressed form of all previous
15:30 inputs that hidden state represents some
15:34 information from all the previous inputs
15:36 which is required for processing in that
15:40 state as well as future
15:45 States now let's consider a longer
15:47 sequence if you considered language
15:51 processing and a large paragraph if your
15:55 input is very long can your HT the
15:59 hidden State at any time step encode all this
16:04 information not really you may be faced
16:07 with the information bottleneck problem
16:09 in this kind of a context so if you
16:12 considered a sentence such as this one
16:15 here which has to be translated to
16:20 German can we guarantee that words
16:23 seen at earlier time steps will be reproduced
16:27 at later time steps remember when you
16:30 go from a language such as English to
16:32 a language such as German the position
16:36 of the verbs the nouns may all change
16:39 and to reproduce this one may have to
16:42 get a word much earlier in the sentence
16:45 in English which may follow much later
16:49 in say the German language is this
16:53 possible unfortunately RNNs don't work
16:56 that well when you have such long sequences
17:01 similarly even if you had image
17:04 captioning and related problems such as
17:06 visual question answering which we will
17:09 see later so if you had this image that
17:11 we saw in the very beginning of this
17:14 course and if we asked the question what
17:17 is the name of the book the expected
17:21 answer is the name of the book is Lord
17:24 of the Rings the relevant information in
17:27 a cluttered image may also need to be
17:31 preserved in case there are follow-up
17:38 questions in a dialogue so a statistical way of
17:41 understanding this is through what is
17:45 known as BLEU score BLEU score is a
17:49 common performance metric used in NLP
17:52 natural language processing BLEU stands
17:56 for bilingual evaluation understudy
17:58 it's a metric for evaluating the
18:02 quality of machine translated text it's
18:05 also used for other tasks such as image
18:07 captioning visual question answering so
18:11 on and so forth and when one looks at
18:14 the BLEU score one observes that as the
18:16 sentence length
18:19 increases while the expected behavior
18:22 is that you should keep getting a high BLEU score
18:25 even after a certain sequence length
18:28 unfortunately as the sentence length
18:31 goes further beyond a threshold the BLEU
18:34 score starts falling
18:38 down which means using such encoder
18:41 decoder models where encoders are RNNs and
18:45 decoders are also RNNs starts failing in
18:49 these cases when the sequences are long
18:51 by nature if you'd like to know more
18:55 about BLEU you can see this entry
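For completeness, here is one way to compute a sentence-level BLEU score with NLTK; the library choice, the tokenized sentences, and the smoothing setting are all assumptions for illustration, not part of the lecture.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One (or more) tokenized reference translations and a candidate translation.
reference = [["india", "got", "its", "independence", "from", "the", "british"]]
candidate = ["india", "gained", "independence", "from", "the", "british"]

# Smoothing avoids zero scores on short sentences with missing higher-order n-grams.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```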
19:01 so what is the solution to this
19:04 problem the solution which is
19:06 extensively used today is what is known
19:09 as attention which is going to be the
19:13 focus of this week's
19:15 lectures so what is this
19:18 attention intuitively
19:22 speaking given an image if we had to ask
19:25 the question what is this boy
19:28 doing the human way of doing this would
19:32 be you first identify the artifacts
19:34 in the
19:37 image you pay attention to the relevant
19:40 artifacts in this case the boy and what
19:44 activity the boy is associated with
19:48 similarly if you had an entire paragraph
19:49 and you had to
19:52 summarize you would probably look at
19:55 certain parts of the paragraph and write
19:59 them out in a summarized form so paying
20:03 attention to parts of inputs be it
20:07 images or be it long sequences like text
20:11 is an important way of how humans process
20:17 data so let's now see this in a sequence
20:19 learning problem in the traditional
20:22 encoder decoder model setting so this is
20:26 once again the many to many RNN setting
20:28 similar to what we saw for neural machine
20:33 translation so you have your inputs then
20:36 you have a context Vector that comes out
20:39 at the end of the inputs that context
20:42 Vector is fed to a decoder RNN which
20:46 gives you the outputs y1 to YK now let's
20:50 assume that hjs are the hidden states of
20:54 the encoder and sjs are the hidden
20:57 states of the
21:00 decoder so what does attention do
21:03 attention suggests that instead of
21:07 directly passing hT which is the last
21:09 encoder hidden state to your decoder
21:14 RNN we instead have a context
21:17 Vector which relies on all of the Hidden
21:19 States from the
21:24 input this creates a shortcut connection
21:26 between this context Vector
21:31 CT and the entire Source input
21:34 X how would you learn this context
21:36 Vector we'll see there are multiple
21:40 different ways so given this context
21:44 vector the decoder hidden state st is
21:48 given by some function f of st-1
21:51 the previous hidden state in the decoder
21:55 yt-1 the output of the previous
21:57 time step in the decoder which could be given
22:00 as input to the next time step as well and
22:08 ct the context vector and what is this context vector this
22:13 context vector ct is given by a sum
22:16 over all the time steps j in your encoder
22:21 RNN of alpha tj times hj so it's a weighted
22:25 combination of all of your hidden state
22:29 representations in your encoder RNN
22:32 how do you find Alpha TJ how do you find
22:35 those weights of the different
22:38 inputs a standard framework for doing
22:42 this is Alpha TJ can be obtained as a
22:46 softmax over some scoring function that
22:50 captures the score between St minus one
22:53 and each of the Hidden States in your
22:58 encoder so St minus1 gives us a current
23:02 context of the output so we try to
23:05 understand what is the alignment of the
23:09 current context in the output with each
23:12 of the inputs and accordingly pay
23:16 attention to specific parts of the
23:19 inputs now there's an open question how
23:23 do you compute this score of St minus
23:27 one with each of the hjs in the encoder
23:30 RNN once we have a way of computing
23:35 that score we can take a softmax over
23:38 the score for hj with respect to all of the hjs so we
23:40 will do this for each of the hjs in the
23:44 encoder RNN and using that we can
23:47 compute your Alpha tjs and using Alpha
23:51 tjs we can compute the context Vector
23:53 once you get the context Vector you
23:55 would give the corresponding context
23:59 Vector as input to each time step of the decoder
24:05 RNN how do you compute this
24:07 score there are a few different
24:10 approaches in literature at this time we
24:12 will review many of them over the
24:15 lectures this week but to give you a
24:19 summary you could have a Content based
24:23 attention which tries to look at St and
24:27 hi so a particular hidden state in
24:30 your decoder RNN st and a particular
24:35 hidden state in an encoder RNN hi as a
24:37 cosine similarity between the two that's
24:40 one way of measuring the score you could
24:44 also learn weights to compute this
24:48 alignment so you can take St and hi
24:52 learn a set of weights Wa take a tanh and
24:56 use another vector to get the score so
24:58 this is a learning procedure to get your final
25:03 score one could also get alpha tj as a
25:07 softmax over a learned set of weights Wa
25:12 applied to st again one could also use a more
25:15 General framework where you have St
25:19 transpose hi which is similar to cosine
25:21 which will give you a DOT product but
25:24 you also have a learned set of Weights
25:27 in between which tells you how to
25:30 compare the two vectors St and hi
25:34 remember any W here are learned by the
25:37 network to compute the score or you
25:40 could simply use just a DOT product by
25:43 itself which would be similar to your
25:45 content based attention the cosine and
25:47 the dot product would give similar
25:50 values or there is a variant known as
25:53 the scaled dot product attention where
25:54 you use the dot product between the two
25:59 vectors st and hi but scale it by the
26:03 square root of n where n is the dimension
26:05 of the hidden state vectors
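To tie the scoring functions together, here is a small PyTorch sketch that computes several of the scores above for a single decoder step, turns one of them into the weights alpha_tj with a softmax, and forms the context vector. The tensor shapes and the randomly initialized weights are illustrative assumptions, not the lecture's code.

```python
import torch
import torch.nn.functional as F

T_enc, d = 10, 64                  # number of encoder time steps, hidden dimension
H = torch.randn(T_enc, d)          # encoder hidden states h_1 ... h_T
s_prev = torch.randn(d)            # previous decoder hidden state s_{t-1}

# Learned parameters (randomly initialized here, purely for illustration).
W_a = torch.randn(d, d)            # for the "general" learned bilinear score
W1, W2 = torch.randn(d, d), torch.randn(d, d)
v_a = torch.randn(d)               # for the additive (learned) score

score_dot      = H @ s_prev                                           # plain dot product
score_scaled   = (H @ s_prev) / (d ** 0.5)                            # scaled dot product
score_general  = H @ (W_a @ s_prev)                                   # learned bilinear form between s and h_j
score_content  = F.cosine_similarity(H, s_prev.unsqueeze(0), dim=1)   # content-based (cosine)
score_additive = torch.tanh(H @ W2.T + W1 @ s_prev) @ v_a             # tanh of learned weights, then an extra vector

# Softmax over encoder positions gives alpha_tj; the context vector c_t is
# the weighted combination of all encoder hidden states.
alpha = F.softmax(score_scaled, dim=0)      # shape (T_enc,)
context = alpha @ H                         # shape (d,)
```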
26:08 what about spatial data so we saw
26:11 how it is done for temporal data where
26:14 you had a sequence to sequence RNN a many
26:17 to many RNN what if you had an image
26:21 captioning task if you had spatial data
26:24 so in this case your image would give
26:27 you a certain representation s0 out
26:29 of the encoder
26:33 Network unfortunately when you use a
26:34 fully connected
26:38 layer after the CNN you lose spatial
26:41 information in that
26:43 Vector so instead of using the fully
26:47 connected layer we typically take the
26:49 output of the convolutional layers
26:51 themselves which would give you a
26:55 certain volume which let's say is M cross N
27:00 cross C now we know that if you considered
27:03 one specific patch of this volume M
27:06 cross n Cross C we know that you can
27:09 trace that back to a particular patch of
27:13 the original image which was passed
27:14 through a CNN so you know that the
27:17 output feature map say a con five
27:20 feature map if you looked at one
27:24 particular PA part of that depth volume
27:26 you would get a certain patch in the
27:28 input image
27:32 now this gives you spatial information
27:35 so what can we do we take this feature
27:38 map that we get at the output of a
27:41 certain convolutional layer we can
27:44 unroll them into 1 cross 1 cross C
27:46 vectors so you ideally have M cross N
27:49 cross C so you can unroll this into M times N different
27:54 vectors and then you can apply attention
27:58 to get a context vector
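A small sketch of that unrolling step in PyTorch; the feature-map size, the decoder state, and the choice of a scaled dot-product score are assumptions made just for this illustration.

```python
import torch
import torch.nn.functional as F

C, M, N = 512, 7, 7                               # channels and spatial size of a conv feature map
feature_map = torch.randn(C, M, N)                # e.g. the output of a late conv layer

# Unroll the M x N spatial grid into M*N vectors of dimension C; each vector
# traces back to a particular patch of the input image.
annotations = feature_map.reshape(C, M * N).T     # shape (M*N, C)

s_prev = torch.randn(C)                           # current decoder state (illustrative)
scores = annotations @ s_prev / (C ** 0.5)        # one score per image region
alpha = F.softmax(scores, dim=0)                  # attention weight per region
context = alpha @ annotations                     # context vector, shape (C,)
```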
28:01 in what way is this useful this context
28:05 Vector now can be understood as paying
28:08 attention to certain parts of the image
28:11 while giving the output because each of
28:14 these bands each of these sub volumes
28:17 here highlighted in yellow are certain
28:20 parts of the input image and one could
28:23 Now understand the same weighted
28:26 attention concept the alignment part of
28:29 it could be implemented very similar to
28:31 what we saw on the previous slide but
28:35 now this represents different parts of
28:37 the input
28:41 image another use of Performing
28:45 attention is it gives you explainability
28:47 of the final
28:51 model why so how so if you have say a
28:56 machine translation task you know that
28:58 when you generated a certain output
29:01 word from a decoder RNN your attention
29:05 model or your context Vector tells you
29:09 which part of the input you looked at
29:13 while predicting that word as the output
29:16 and that automatically tells you which
29:18 words in your input sequence
29:22 corresponded to a word in your
29:27 output so in this case you can see that
29:30 this particular sequence here European
29:34 economic area depended on zone économique
29:38 européenne so that is also highlighted by
29:41 these white patches here so white means
29:45 a higher dependence black means no
29:48 dependence and looking at this heat map
29:51 gives you an understanding of how the
29:55 model translated from one language to
29:58 another what about images and the image
30:02 captioning task in this case too you can
30:05 use the same idea given an image if the
30:09 model is generating a caption you can
30:12 see that the model generates each word
30:15 of the caption by looking at certain
30:18 parts of the image for example when it
30:22 says a it seems to be looking at a
30:24 particular part of the image when it
30:26 says a woman it seems to be looking at a
30:28 certain part of the image while the
30:30 other objects are also relevant and if
30:33 you keep going you see when it says the
30:36 word throwing it seems to be focusing on
30:39 the woman part of the image and if you
30:42 see the word frisbee it actually seems
30:44 to focus on the Frisbee in the image and
30:47 if you see the word park it seems to be
30:50 focusing on everything other than the
30:53 woman and the child this gives you an
30:56 understanding and trust that the model
30:59 is looking at the right things while
31:06 generating the output what are the kinds of attention
31:10 one can have you could consider having a
31:13 hard versus soft attention what do these
31:18 mean in hard attention you choose one
31:22 part of the image as the only focus for
31:24 giving a certain output let's say image
31:27 captioning you look at only one patch of
31:30 the image to be able to give a word as an
31:35 output so this choosing of a position
31:37 could end up becoming a stochastic
31:41 sampling problem and hence one may not
31:44 be able to backpropagate through
31:47 such a hard attention problem because
31:49 that stochastic sampling step
31:52 could be non-differentiable we'll see
31:55 this in more detail in the next lecture
31:58 on the other hand one could have soft
32:01 attention where you do not choose a
32:04 single part of the image but you simply
32:06 assign weights to every part of the
32:10 image in this case you are now going to
32:13 have a new image where each part of
32:17 the image has a certain weight in this
32:21 case your output turns out to be
32:23 deterministic differentiable and hence
32:26 you can use such an approach along with
32:29 standard back
32:32 propagation another categorization of
32:36 attention is Global versus local
32:40 attention in global attention all the
32:43 input positions are chosen for attention
32:47 whereas in local attention maybe only a
32:49 neighborhood window around the object of
32:52 Interest or the area of interest is
32:54 chosen for
32:58 attention a Third Kind which is very
33:02 popular today is known as self attention
33:06 where the attention is not of a decoder
33:09 RNN with respect to the
33:12 encoder or of an output RNN with respect to
33:16 parts of an image but is attention of
33:19 a part of a sequence with respect to
33:21 another part of the same
33:25 sequence this is known as self attention
33:29 or intra attention and we'll see this in
33:33 more detail in a later lecture this week
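As a small preview of what that later lecture will cover, here is a bare-bones self-attention sketch in PyTorch; the projections are randomly initialized and the shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

T, d = 6, 64                       # sequence length and hidden dimension
X = torch.randn(T, d)              # one sequence attending to itself

# Queries, keys, and values all come from the SAME sequence, which is what
# makes this "self" (intra) attention; the projections are random placeholders.
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
Q, K, V = X @ Wq, X @ Wk, X @ Wv

alpha = F.softmax(Q @ K.T / (d ** 0.5), dim=-1)   # each position attends to every position
output = alpha @ V                                # shape (T, d)
```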
33:36 your homework for this lecture is
33:39 to read this excellent blog by Lilian
33:42 Weng known as Attention? Attention! it's a
33:43 blog on
33:46 GitHub and one question that we left
33:51 behind which is is there a connection
33:53 between an auto encoder and principal component
33:58 analysis think about it and we'll
34:01 discuss this in the next
34:04 lecture references