0:03 hi there today we're looking at
0:06 attention is all you need by Google just
0:08 to declare I don't work for Google just
0:10 because we've been looking at Google
0:12 papers lately but it's just an
0:15 interesting paper and we're gonna see
0:18 what's the deal with it so basically
0:21 what the authors are saying is we should
0:27 kind of get away from RNNs
0:30 so traditionally what you would do and
0:33 these authors are particularly interested in
0:35 NLP natural language processing so
0:37 traditionally when you had like a
0:46 language task the cat eats the mouse and
0:50 you would like to translate this to say
0:56 any other language like let's say German
1:00 or whatever what you would do is you
1:03 would try to encode this sentence into a
1:05 representation and then decode it again
1:10 so somehow this sentence needs
1:13 to all go into say one vector and then
1:16 this one vector needs to somehow be
1:20 transformed into the target language so
1:22 these are traditionally called seq2seq
1:27 (sequence-to-sequence) tasks and they have been solved so far
1:30 using recurrent neural networks you
1:34 might know the LSTM networks that
1:36 are very popular for these tasks what
1:38 basically happens in an RNN is that you
1:42 go over the say source sentence here one
1:45 by one here you take the word the you
1:48 kind of encode it maybe with a word
1:51 vector if you know what that is so you turn
1:54 it into like a vector a word vector and
1:57 then you use a neural network to turn
2:00 this vector into what we call a hidden
3:06 state so this h0 is a hidden state you
2:12 then take the second token here cat you
2:13 again take it
2:16 world vector because need to represent
2:19 it with numbers somehow so you use word
2:23 vectors for that you turn this into you
2:25 put it through the same function so here
2:28 is what it's like a little easy for
2:30 encoder turn into the same function but
2:33 this time this hidden state also gets
2:36 plugged in here so the word vector did
2:38 instead you can actually think of having
2:42 like a started state here a start
2:45 usually people either learn this or just
2:47 initialize with zeros that kind of goes
2:49 in to the encoder function so it's
2:54 always really the same function and from
2:56 the previous hidden state and the
2:59 current word vector the encoder again
3:03 predicts another hidden state h1 and so
3:06 on so you take the next token you turn
3:09 it into a word vector you put it through
3:14 this thing the encoder function and of
3:16 course this is a lot more complicated in
3:18 an actual say an LSTM but that's the
3:21 basic principle behind it so you end
3:25 up with h2 and here you'd have h3 h4
3:29 and the last hidden state h4 here you
3:31 would use this in kind of exactly the
3:34 same fashion you plug it into like a
3:37 decoder and the decoder would
3:42 output you a word say die and it would also
3:48 output you a next hidden state h5
3:52 let's just go on with the
3:56 listing of the states and this h5
3:58 would again go into the decoder which
4:04 would output Katze and so on so that's how
4:06 you would decode basically these are
4:09 RNNs what they do is they kind of take
4:12 if you look on top here they take an
4:15 input a current input and they take the
4:19 last hidden state and they compute a new
4:21 hidden state in the case of the decoder
4:26 they take the hidden state and they take
4:26 kind of
4:30 the previous word
4:32 that you output you also feed this back
4:35 into the decoder and they will output
4:37 the next word which kind of makes sense so you
4:39 would guess that the hidden state kind
4:42 of encodes what the sentence means and
4:45 the last word that you output you need
4:49 this because maybe for grammar right you
4:51 know what you've just output so kind of
4:56 the next word should be based on that
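The encoder loop just described could be sketched like this. This is a toy NumPy sketch with made-up dimensions and random weights, not the actual LSTM mentioned in the video:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden/embedding size

words = ["the", "cat", "eats", "the", "mouse"]
vecs = {w: rng.standard_normal(d) for w in set(words)}  # toy word vectors

# one shared encoder function: (previous hidden state, word vector) -> new hidden state
W_h = 0.1 * rng.standard_normal((d, d))
W_x = 0.1 * rng.standard_normal((d, d))

def encoder_step(h_prev, x):
    return np.tanh(W_h @ h_prev + W_x @ x)

h = np.zeros(d)  # start state, often just zeros (sometimes learned)
hidden_states = []
for w in words:  # produces h0, h1, ... one per token
    h = encoder_step(h, vecs[w])
    hidden_states.append(h)
# the last hidden state (h4 in the video) is the single vector
# that has to summarize the whole sentence for the decoder
```

The decoder works analogously, consuming the last hidden state and its own previous output word at each step.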
4:58 of course you don't have to do
5:00 it exactly this way but that's kind of
5:06 what RNNs did so attention is
5:11 a mechanism here to basically increase
5:14 the performance of the RNNs so what
5:16 attention would do is in this
5:20 particular case if we look at the
5:24 decoder here if it's trying to predict
5:30 this word for cat then or the next word
5:35 here say here it wants the next word and
5:43 in essence the only the only h6 the only
5:45 information it really has is what the
5:48 last word was the German word for cat and
5:52 what the hidden state is so if we look
5:54 at what word it actually should output
5:56 in the input sentence it's this here
6:00 eats right and if we look at kind of the
6:05 the information flow that this word has
6:07 to travel so first it needs to encode
6:09 into a word vector it needs to go
6:10 through this encoder that's the same
6:13 function for all the words so nothing
6:15 specific that we learned about the word eats
6:17 here then it has to go through this
6:20 hidden state traverse again into another
6:22 step this hidden state because we have
6:25 two more tokens and then the next
6:27 state then it goes all the way to the
6:31 decoder where the first two words are
6:34 decoded and still this h6 this
6:36 hidden state somehow still needs to
6:40 retain the information that now the
6:44 word eats is kind of the word to
6:48 be translated and that the
6:50 decoder should find the German word for
6:56 that so that's of course a
6:58 very long path there's a lot of
7:01 transformations involved over all
7:03 of these hidden states and the hidden
7:05 states not only do they need to remember
7:07 this particular word but all of the
7:10 words and the order and so on and the
7:13 grammar okay the grammar you can
7:15 actually learn with the decoders
7:16 themselves but kind of the meaning and
7:19 the structure of the sentence so it's
7:22 very hard for an RNN to learn all of
7:24 this what we call long-range
7:28 dependencies and so naturally you
7:30 actually think well why can't we just
7:33 you know decode the first word to the
7:34 first word the second word to the second
7:37 word it actually works pretty well in
7:40 this example right like the cat
7:43 eats we just decode it one by
7:44 one but of course that's not how
7:46 translation works in translations the
7:49 sentences can become rearranged in the
7:51 target language one word can become
7:54 many words or it could even be an
7:57 entirely different expression so
7:59 attention is a mechanism by which this
8:01 decoder here in this step that we're
8:04 looking at can actually decide to go
8:07 back and look at particular parts of the
8:10 input specifically what it would do
8:12 in popular attention
8:15 mechanisms is that the decoder here
8:21 can decide to attend to the hidden
8:24 states of the input sentence what that
8:26 means is in this particular case we
8:28 would like to teach the decoder somehow
8:32 that AHA look here I need to pay close
8:36 attention to this step here because that
8:39 was the step when the word eats was just
8:42 encoded so it probably has a lot of
8:45 information about what I would like to
8:49 do right now namely translate this word
8:53 eats so this mechanism
8:56 if you look at the information flow it
8:58 simply it goes through this word vector
9:01 goes through one encoding step and then
9:03 is that hidden state and then the
9:06 decoder can look directly at that so the
9:08 the path length of information is much
9:10 shorter than going through all the
9:13 hidden states in a traditional way so
9:17 that's where attention helps and the way
9:19 that the decoder decides what to look at
9:23 is like a kind of an addressing scheme
9:25 you may know it from Neural Turing
9:31 Machines or other kinds
9:35 of neural algorithm papers so what the
9:37 decoder will do is in each step it would
9:42 output a bunch of keys oops sorry about
9:48 that that's my hand being trippy so what
9:51 it would output is a bunch of keys k1
9:58 through kn and what these keys
10:02 do is they index these
10:08 hidden states via a kind
10:12 of softmax architecture and we're gonna
10:14 look at this I think in the actual paper
10:16 we're discussing because it's gonna
10:19 become more clear but just
10:21 notice that the decoder here can decide
10:25 to attend to the input sentence and
10:27 kind of draw information directly from
10:30 there instead of having to go just through
10:33 the hidden state it's provided with
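What "attending to the hidden states of the input" boils down to can be sketched in a few lines. This is a generic dot-product attention sketch, not the exact mechanism of any one paper:

```python
import numpy as np

def attend(query, encoder_states):
    """Mix the encoder hidden states, weighted by how well each aligns with the query."""
    scores = encoder_states @ query        # one alignment score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over source positions
    return weights @ encoder_states        # weighted sum: the context the decoder reads

# a decoder query that aligns strongly with the 3rd encoder state
# retrieves (essentially) that state
states = np.eye(4)                         # 4 toy hidden states
context = attend(10.0 * states[2], states)
```

The decoder can thus read any source position directly, regardless of how far away it is.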
10:36 if we go to the paper here what do these
10:39 authors propose and the thing is they
10:42 ditch the RNNs they basically say
10:44 attention is all you need you don't need
10:46 the entire recurrence basically in
10:49 every step of the decoding
10:51 where you want to
10:53 produce the target sentence so in this
10:57 step in this step in this step you
11:00 basically don't need the recurrence
11:03 you can just kind of do attention over
11:07 everything and you
11:11 be fine namely what they do is they
11:14 propose this transformer architecture so
11:18 what does it do it has two parts
11:20 what's called an encoder and a decoder
11:27 but don't be confused because
11:29 this all happens at once this is not
11:32 an RNN it all happens at once over
11:35 the whole source sentence so if we again
11:40 have the cat oops that doesn't work as
11:44 easily let's just do this this is a source
11:46 sentence and then we also have a target
11:49 sentence where maybe we've produced two
11:52 words and we want to produce this third
11:56 word here so we would
12:00 feed the entire source sentence and also
12:03 the targets and as we produced so far to
12:05 this network namely the source sentence
12:09 would go into this part and the target
12:11 that we've produced so far would go into
12:14 this part and this is then all combined
12:20 and at the end we get an output here at
12:23 the output probabilities that kind of
12:25 tells us the probabilities for the next
12:28 word so we can choose the top
12:31 probability and then repeat the entire
12:34 process so basically every step in
12:38 production is one training sample every
12:39 step in producing a sentence here whereas before
12:42 with the RNNs the entire sentence
12:44 to sentence translation is one sample
12:46 because we need to backpropagate
12:47 through all of these RNN steps
12:50 because they all happen in
12:54 sequence here basically the output of one
12:58 single token is one sample and then the
12:59 computation is finished the backprop
13:02 happens through everything but only for this
13:05 one step there is no multi-step
13:09 backpropagation as in an RNN and this
13:13 is kind of a paradigm shift in sequence
13:15 processing because people were always
13:18 convinced that you kind of need these
13:20 recurrent things in order
13:24 to learn these dependencies
13:25 but here they basically say nah
13:28 we can just do attention over everything
13:30 and it will actually be fine if
13:34 we just do one-step projections so let's
13:37 go one by one so here we have an input
13:40 embedding and say an output embedding
13:43 these are symmetrical so basically
13:45 the tokens just get embedded with say
13:48 word vectors again then there's a
13:49 positional encoding this is kind of a
13:53 special thing because you now
13:56 lose the sequence nature of
13:57 your algorithm you kind of need to
14:01 encode where the words are that you push
14:02 through the network so the network kind
14:03 of goes aha this is a word at the
14:05 beginning of the sentence or this word is
14:07 towards the end of the sentence or
14:10 so that it can compare two words like which
14:11 one comes first
14:14 which one comes second and you do this
14:16 it's pretty easy for the networks if you
14:19 do it with kind of these trigonometric
14:22 function embeddings so if I draw you a
14:24 sine wave and then a sine wave
14:30 that is twice as fast and I draw you a
14:34 sine wave that is even faster maybe this
14:37 one actually goes one two three four
14:40 five doesn't matter you know what I mean
14:44 so I can encode the first word I can
14:47 encode the first position with all down
14:50 and then the second position is kind of
14:55 down down up and the third position is
14:59 kind of up down up and so on so this is
15:02 kind of a continuous way of binary
15:05 encoding of position so if I want to
15:07 compare two words I can just look at all
15:10 the scales of these waves and I know
15:13 one word is high here and
15:14 the other word is low here so they must
15:17 be pretty far apart like one must be at
15:19 the beginning and one must be at the end
15:23 if they happen to match in this long
15:25 slow wave and they also are both
15:29 kind of low in this wave then I can
15:32 look at this wave and think oh maybe
15:33 they're close together but here
15:35 I really get the information which one is
15:38 first and which is second so these are kind
15:40 of positional encodings they're not
15:45 critical to this algorithm but they just
15:47 encode where the words are which of
15:50 course is important and it gives
15:52 the network a significant boost in
15:56 performance but it's not
15:58 the meat of the thing the meat
16:02 the thing is that now that these
16:05 encoding is go into the network's they
16:09 simply do what they call tension here
16:13 attention here and attention here so
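The sine waves of different speeds described a moment ago are the paper's sinusoidal positional encodings; here is a small sketch (variable names are mine):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """One row per position: sin/cos pairs whose wavelengths grow geometrically,
    so each position gets a unique, comparable 'continuous binary' code."""
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]         # index of each sin/cos pair
    angles = pos / (10000 ** (2 * i / d_model))  # fast waves first, slow waves later
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
```

These rows are simply added to the word embeddings before the attention layers.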
16:16 there's kind of three kinds of attention
16:18 so basically the first attention on the
16:21 bottom left is simply attention as you
16:21 can see over the input sentence as I
16:27 told you before you need to take this
16:29 input sentence if you look over here and
16:32 you somehow need to encode it into a
16:38 hidden representation and this now looks
16:40 much more like the picture I drew here
16:41 in the picture I drew right at the
16:44 beginning in that all at once I kind of
16:46 put together this hidden representation
16:49 and all you do is use attention over
16:51 the input sequence which basically means
16:53 you kind of pick and choose which word
16:57 you look at more or less with the
16:59 bottom right the same thing happens for the output
17:00 sentence that you've produced so far
17:03 encoding it into kind of a hidden
17:06 state and then the third one on the top
17:13 right as I was saying the top
17:16 right is the most interesting part of
17:19 the attention mechanism here where
17:23 basically it unites the kind of encoder
17:25 part with the kind of decoder it
17:28 combines the source sentence with the
17:31 target sentence that you've produced so
17:37 far so as you can see here this is
17:43 slightly annoying but I'm just
17:46 gonna remove these kind of circles here
17:47 so
17:51 if you can see here there is an output
17:54 going from the part that encodes the
17:57 source sentence and it goes into this
18:00 multi-head attention there's two
18:02 connections and there's also one
18:06 connection coming from the encoded
18:12 output so far here and so there's three
18:15 connections going in going into this and
18:18 we're gonna take a look at what these
18:22 three connections are so the three
18:26 connections here basically are the keys
18:31 values and queries if you see here the
18:36 values and the keys are what is output
18:38 by the encoding part of the source
18:41 sentence and the query is output by the
18:45 encoding part of the target sentence and
18:48 there is not only one value key and
18:51 query there are many in this kind of
18:54 multi-head attention fashion so there
18:55 are just many of them instead of one but
18:59 you can think of them as just
19:02 kind of sets so the attention computed
19:05 here is what does it do so first of all
19:09 it calculates a dot product of the
19:13 keys and the queries then it does a
19:16 softmax over this and then it
19:17 multiplies it by the values so what does
19:22 this do if you dot product the keys
19:26 and the queries what you would get is so
19:29 as you know if you have two vectors
19:31 their dot product basically
19:35 gives you the angle between the vectors
19:38 and especially in high dimensions most
19:42 vectors are going to be at kind of a 90
19:45 degree angle you know how the Americans
19:49 doodle the little square
19:51 so most vectors are going to be not
19:53 aligned very well so their dot product
19:57 will kind of be zero-ish but if a key and
19:59 a query actually align with each
20:01 other like
20:04 if they point in the same direction
20:07 the dot product will actually be large
20:11 so what you can think of is that the
20:13 keys are kind of here the keys are just
20:19 a bunch of vectors in space and each key
20:22 has an associated value so for each key
20:26 there is a kind of table entry value 1
20:31 value 2 value 3 this is really
20:34 annoying if I do this over text right so
20:37 again here we have a bunch of keys
20:41 in space with a table of
20:44 values and each key here corresponds to
20:47 a value value 1 value 2 value 3
20:51 value 4 and so each key is associated
20:54 with one of these values and then when
20:57 we introduce a query what can it do so
21:01 query will be a vector like this and we
21:04 simply compute so this is q this is
21:06 the query we compute the dot product
21:12 with each of the keys and then we
21:14 compute a softmax over this which means
21:18 that one key will basically be selected
21:20 so in this case it would be probably
21:23 this blue key here that has the biggest
21:27 dot product with the query so this is
21:31 key 2 in this case and the
21:33 softmax so if you don't know what
21:35 a softmax is you have like x1
21:38 through xn some numbers then you
21:42 simply map each one of them to the
21:46 exponential function
21:49 but also each one of them you divide
21:54 by the sum over i of e to the
21:55 xi so basically this is a
21:58 renormalization you do the
22:00 exponential function of the numbers
22:02 which of course makes the kind of
22:05 big numbers even bigger so basically
22:08 what you end up with is that one of these
22:12 numbers x1 through xn will become very
22:14 big compared to the others
22:17 and then you renormalize so basically
22:18 one of them will be almost one and the
22:21 other ones will be almost zero
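In code, the softmax just described is (subtracting the max is purely for numerical stability; it doesn't change the result):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # exponentiate; big entries get much bigger
    return e / e.sum()         # renormalize so everything sums to 1

# the biggest entry ends up close to 1, the rest close to 0
probs = softmax(np.array([1.0, 6.0, 0.5]))
```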
22:23 it's simply the maximum function you can think
22:25 of in a differentiable way it
22:27 just wants to select the biggest
22:30 entry in this case here we kind of
22:32 select the key that aligns most with the
22:33 query which in this case would be key
22:36 2 and then when we multiply this
22:39 softmax thing with the values
22:46 so this inner product if
22:50 we multiply q with k2 as an inner
22:56 product and we take the softmax over it
22:59 the softmax i'm going to
23:00 draw it upwards here is going to
23:05 induce a distribution like this and if
23:07 we multiply this by the values it will
23:12 basically select value 2 so this is
23:15 kind of an indexing scheme into
23:19 this memory of values and this is what
23:22 the network then uses to compute further
23:25 things so you see the output here
23:28 goes into kind of more layers of the
23:32 neural network upwards
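Putting the pieces together, the key/value/query lookup just described is the paper's scaled dot-product attention, softmax(QK^T/sqrt(d))V; here is a minimal single-head sketch:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: queries address keys, and the resulting
    distribution picks out (a soft mixture of) the associated values."""
    scores = Q @ K.T / np.sqrt(K.shape[1])              # query-key dot products
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # softmax over the keys
    return w @ V                                        # weighted sum of values

# a query aligned with key 2 retrieves (almost exactly) value 2
K = np.array([[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]])  # three keys in space
V = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])      # their associated values
out = attention(np.array([[0.0, 10.0]]), K, V)
```

In multi-head attention this computation simply runs several times in parallel with different learned projections.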
23:34 so basically what does this mean
23:38 you can think of
23:46 this as basically the encoder of the
23:52 source sentence right here it
23:59 discovers interesting things about the
24:03 source sentence and it builds
24:07 key-value pairs and then the encoder of
24:10 the target sentence builds the queries
24:13 and together they give you kind of the
24:17 next signal so it means that
24:20 the network basically says here's a bunch
24:27 of things about the source sentence
24:29 that you might find interesting those are
24:35 the values and the keys are ways to
24:38 index the values so it says here's a
24:40 bunch of things that are interesting
24:42 which are the values and here is how you
24:44 would address these things which is the
24:48 keys and then the other part of the
24:51 network builds the queries it says I
24:55 would like to know certain things so
24:57 think of the values like attributes
25:01 like here is the name and the
25:03 height and the weight of a person
25:06 right and the keys are the
25:10 actual indexes like name height weight
25:13 and then the other part of the network
25:16 can decide what it wants I actually want
25:18 the name so my query is the name it will
25:21 be aligned with the key name and the
25:23 corresponding value would be the name of
25:25 the person you would like to describe
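The name/height/weight analogy can be made literal: attention is a soft dictionary lookup, where a query that aligns with the key for "height" retrieves (essentially) the height value. A toy sketch with made-up numbers:

```python
import numpy as np

labels = ["name_id", "height", "weight"]  # what each key stands for
K = np.eye(3)                             # one key vector per attribute
V = np.array([[7.0], [170.0], [60.0]])    # the stored attribute values (toy)

q = 20.0 * K[1]                           # a query asking strongly for "height"
w = np.exp(K @ q - np.max(K @ q))
w /= w.sum()                              # softmax: nearly all weight on "height"
answer = w @ V                            # close to 170, the attended value
```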
25:27 that's how kind of these networks work
25:30 together and I think it's
25:33 pretty ingenious it's not entirely new
25:34 because it has been done of course
25:36 before with all the differentiable
25:40 Turing machines and whatnot but it's
25:41 pretty cool that this actually works and
25:45 actually works kind of better than
25:50 RNNs if you simply do this so they
25:52 describe a bunch of other things here
25:56 I don't think they're too important
25:58 basically that the point they make about
26:01 this attention is that it reduces path
26:03 lengths and that's kind of the main
26:07 reason why it should work better with
26:10 this entire attention mechanism you
26:13 reduce the amount of computation steps
26:16 that information has to flow from one
26:19 point in the network to another and that's
26:21 what brings the major improvement
26:23 because all the computation steps can
26:26 make you lose information and you don't
26:28 want that you want short path lengths
26:31 and so that's what this method
26:34 achieves and they claim that's why it's
26:39 better and works so well so they have
26:41 experiments you can look at them they're
26:45 really good at everything of course
26:47 you always have state of the
26:53 art and I think I will conclude here if
26:55 you want to check it out yourself
26:58 they have extensive code on github where
26:59 you can build your own transformer
27:04 networks and with that have a nice day