This course provides a comprehensive, step-by-step guide to understanding and implementing advanced transformer architectures, building upon the foundational 2017 "Attention Is All You Need" paper with recent innovations to enhance accuracy, efficiency, and scalability.
Transformers have revolutionized the
field of machine learning, powering
breakthroughs in natural language
processing, computer vision, and beyond.
This beginner-friendly course guides you
step by step through the advancements
that make transformers more accurate,
efficient, and scalable than ever
before. As these models continue to
shape the future of AI, understanding
their inner workings and recent
innovations is essential for anyone
looking to stay relevant in the rapidly
evolving tech landscape. Immad Sadi
developed this course. Hi everyone, I hope you are all doing well. This is a follow-up course to the previous one that I created and shared here on freeCodeCamp, where I talked about how to train your first large language model. In that course, we used the transformer architecture that was introduced in the 2017 paper "Attention Is All You Need." Now, eight years have passed, and the transformer architecture has evolved quite a bit. That is why I created this course: to learn about the methods that were introduced in the past few years.
We are going to learn about different
positional encodings, different
attention mechanisms, and how to tweak a few things in order to improve the
efficiency and performance of the
transformer architecture. If you are
curious about what we are going to
achieve at the end of this course
without watching it till the end, here
is a quick summary. The curve you see on
top refers to the baseline model which
is basically the transformer model that
we created in the previous course that
was using the 2017 architecture. And now
after applying the different methods that you are going to learn in this course, such as multi-head latent attention, changes to layer normalization, and no dropout, you can see that the loss decreases further. Compared with the model that we created previously, we were able to reduce the loss by 11%, which is a lot, and this proves that these ideas that researchers have proposed do work. They also help reduce memory usage. You
will see that in some cases we reduce memory usage by 50%. In my case, I have an RTX 4070. Previously, the 2017 model was using roughly 7 GB of VRAM while training, but now, thanks to multi-head latent attention, the model uses only 3.5 GB of VRAM, which helps a lot if you want to increase the batch size. Also, the inference speed has increased a lot: previously I was getting just 100 tokens per second, and now I can get 400+ tokens per second, which is really great. We will see other things in this course as well. So, I hope you are excited about this one. Now, let's get started. Hi
everyone. After I uploaded my course on
how to train LLMs, I got some really
great feedback. Thank you so much. As
you can see from this slide, this course is titled "The Transformer Journey from 2017 to 2025." This is the introductory video, and let me explain why I wanted to create this course. In the previous course, "Train Your Own Language Model," I used techniques from the 2017 paper "Attention Is All You Need." Since then, researchers have made a lot of improvements to the transformer architecture. So, in this new course, which is a continuation of the first one, I will show you some of these newer techniques, and we will compare them to the original ones to see if they actually make a difference. At the end of the course, I will create two models: one using the 2017 architecture and
another one using the latest
improvements. Then we will compare their
results, especially the loss curves, and see if the newer model really performs better. In the upcoming video, I will try to compare the methods that are used to encode the position. In the first course, I used absolute positional
encoding. But there are other methods
such as relative positional encoding or
rotary positional encodings. So we are
going to see all of these methods and
compare them to see which one performs
the best. See you in the next video. Hi
everyone. In this video, we will compare
different ways to tell a transformer
model where each word appears in a
sentence. This is called positional
encoding. Why do we need positional
encoding at all? Let's look at an
example. Imagine we take this sentence
and break it into individual tokens like
this and turn each token into an
embedding. Here is the embedding vector for the word "hi". Without positional encoding, the same vector is used no matter where "hi" appears in the sentence. That's a problem because the self-attention layer needs to understand word order. That's why we need to add positional information. But how do we do that? There are a few ways to add positional information. The main types
are absolute positional encoding and
relative positional encoding. Let's go
back to our example. We have the
embedding tensor and the positional
encoding tensor. We simply add them before sending them to the self-attention layer. So how do we actually build this positional encoding tensor? This is what we are going to see in this video. Let's start with absolute positional encoding. There are two main types. The first one is learnable positional encoding. In this method, we add a special tensor of weights that is going to be learned during training, one row for each position. So here is the matrix of weights that is going to be added to the transformer model. As you can see, we have positions, and this can handle up to block size, so during training we are going to learn each position individually. And here you can see that we have the embedding size. So this is the shape of the matrix: the max length, which is block size, by embedding size. During training, the model learns what values work best for each position. At the end, we get a tensor, let's call it WF, meaning the final weights, which contains all the positional vectors. But this method has
a key limitation. It does not generalize
to longer sequences than what it was
trained on. And in this case, it's block
size. So if you decide to use
sentences that are longer than block
size, this method will not work. Another
issue is that each position is learned
independently. So the model does not
know that position five comes before
position 20 because there is no
relationship between them. If you shuffle the tokens in the input sequence, the model won't notice anything is wrong. That's a weakness of this approach.
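To make this concrete, here is a minimal sketch of learnable absolute positional encoding in PyTorch. The names (block_size, embedding_size, and the module itself) are illustrative, not the exact code from the course repository:

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Learnable absolute positional encoding: one trainable vector per position."""

    def __init__(self, block_size: int, embedding_size: int):
        super().__init__()
        # Weight matrix of shape (block_size, embedding_size), learned during training.
        self.position_embedding = nn.Embedding(block_size, embedding_size)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embedding_size), with seq_len <= block_size
        seq_len = token_embeddings.shape[1]
        positions = torch.arange(seq_len, device=token_embeddings.device)  # (seq_len,)
        # Add the positional vectors to the token embeddings (broadcast over the batch).
        return token_embeddings + self.position_embedding(positions)
```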
Next, we have sinusoidal positional encoding. Instead of learning position vectors, we use mathematical formulas: sine and cosine waves with different frequencies. As you can see, here are the formulas we use to generate them. This method was used in the original transformer paper. Because it is formula based, we don't need to learn anything, and we can handle inputs longer than what the model saw during training: we have just two formulas, we can just pass in the values, and we will get the positional encoding for that
specific position. Here is how this method works. We build a tensor shaped max length by embedding size, like what we have seen before. Let's just focus on the 20th position. We apply the formulas that I showed you in the previous slide for each dimension of the embedding. Here is what the wave plot looks like: the x-axis is the position and the y-axis is the embedding dimension. At position 20, we sample values from all the waves, and just like that, we create our positional vector. If you want to get the vector at position, let's say, 60, you come here, you intersect these waves, and you get the values. And just like that, you construct the positional vector at that specific position. That's it for absolute methods.
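As a reference, here is a small sketch of the sinusoidal formulas from the original paper, assuming the usual convention that sine is used for even embedding dimensions and cosine for odd ones (and an even embedding size):

```python
import torch

def sinusoidal_positional_encoding(max_length: int, embedding_size: int) -> torch.Tensor:
    """Build the (max_length, embedding_size) sinusoidal positional encoding table."""
    positions = torch.arange(max_length, dtype=torch.float32).unsqueeze(1)   # (max_length, 1)
    dims = torch.arange(0, embedding_size, 2, dtype=torch.float32)           # even dimension indices 2i
    # Frequencies: 1 / 10000^(2i / d), as in "Attention Is All You Need".
    inv_freq = 1.0 / (10_000 ** (dims / embedding_size))                     # (embedding_size / 2,)
    angles = positions * inv_freq                                            # (max_length, embedding_size / 2)
    pe = torch.zeros(max_length, embedding_size)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions get cosine
    return pe
```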
Let's move on to relative positional encoding. Instead of focusing on where each token is in the sentence, this method cares about how far apart tokens are from each other. It does not modify the embedding itself. Instead, it changes how self-attention works by including information about the distance between tokens. For example, if we are attending from the token "hi", another token might be one step away, and another token three steps away. This number is the distance between the token "hi" and the other token. You can see that the distance between "hi" and "hi" is zero because it's the same token, but the tokens that come after it have a positive distance. In the other case, let's take the second "hi": this one has a distance of zero to itself, the tokens that come before it have a negative distance, and the ones that come after it have a positive distance. What's cool is that nearby tokens have more influence and distant ones have less influence, just like how natural language works. So this is how relative positional encoding works, and as you saw, it's different from absolute positional encoding. Now let's talk
about rotary positional embedding (RoPE). This method combines the best of both worlds: it captures both absolute and relative positions. It does this by rotating token embeddings in space. Imagine our embedding space is two-dimensional, and let's say that the token "Imad" is located here. Now we add a token before it; let's say we added "hi". RoPE will rotate the token "Imad" by a fixed angle theta for each new token added before it. So if the token "Imad" is the second token in the sequence, RoPE will rotate it by one theta. If it is located at the fifth position, so 1, 2, 3, 4, 5, RoPE will rotate it by four theta, and theta is just the rotation angle. So this is how the method works. What's also amazing about RoPE is that it preserves
the relative angle between tokens. Let's look at this sentence: "My friend is Imad." The angle between the tokens "friend" and "Imad" represents their relationship. Now let's say we change that sentence to this one: "Who is your friend? My friend is Imad." Because we added tokens before the original sentence, we need to rotate the affected tokens, and as you can see, the angle between them is preserved. Just like that, RoPE preserves relative positions even if the sentence structure changes.
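Here is a tiny, self-contained demo of that idea, rotating two 2-D token vectors by their position index and checking that the angle between them is unchanged when both are shifted by the same offset. The vectors, positions, and theta are made up for illustration:

```python
import math
import torch

def rotate_2d(vec: torch.Tensor, position: int, theta: float = 0.1) -> torch.Tensor:
    """Rotate a 2-D vector by position * theta, as RoPE does for each embedding pair."""
    angle = position * theta
    rot = torch.tensor([[math.cos(angle), -math.sin(angle)],
                        [math.sin(angle),  math.cos(angle)]])
    return rot @ vec

friend = torch.tensor([1.0, 0.0])
imad = torch.tensor([0.6, 0.8])

def angle_between(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.acos(torch.dot(a, b) / (a.norm() * b.norm()))

# "friend" at position 3, "Imad" at position 5 ...
a1 = angle_between(rotate_2d(friend, 3), rotate_2d(imad, 5))
# ... and the same pair shifted by 4 positions (tokens added before the sentence).
a2 = angle_between(rotate_2d(friend, 7), rotate_2d(imad, 9))
print(a1, a2)  # the relative angle is preserved
```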
Here are some great resources I used to
learn these different positional
encoding methods. You have Jake's blog
on relative positional encoding. This
YouTube channel that explains rotary
positional encoding. I highly recommend
this video. It is really great.
Christopher's Hugging Face blog on
positional encoding and my own GitHub
repository. Here I try to keep updating
the resources file with the new links
that I find useful. Here it is if you
are wondering about it. So just click on
the resources and you'll find the useful
resources there. Now that we have seen
all the methods, let's test them. I will train a small model using the atlas dataset, and to save time I will only train for one epoch for each method. Here are the methods that we are going to test: no positional encoding, absolute, relative, sinusoidal, and rotary positional encoding. After each run, I will save the training and validation losses so we can compare them. Let's start with no positional encoding. This one is easy.
Let me show you the diagram. We just
remove the positional encoding layer
from our transformer. So here is the sentence that we convert into individual tokens; we get the word embedding, and previously we would compute the positional encoding and add the two tensors. In this method, we remove this part from the transformer architecture and do the training. So here I say: don't add positional encoding to the embedding tensor. This part should be removed. Let's
look at the code in VS Code. In the previous course, we used the script model.py to create our GPT class, the GPT language model. I have copied this script and created a new one that I called model no positional encoding, and let me show you the difference; it's not that hard to understand. Before, we had two embedding tables: one for the token embedding and the other one for the position embedding. In the forward pass, we take the input tokens that we got from the input sequence and pass them to the token embedding and positional embedding. After that, to get the input that will go into the blocks, we take the token embedding and add the positional embedding to it. Now look at model no positional encoding: I have removed the positional embedding table, and in the forward pass I take the input tokens, pass them to the embedding table, and that's my input. This is the only change: just remove the positional embedding and treat the token embedding as the input.
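As a rough sketch of that change (class and variable names are illustrative, not the exact repository code), the forward pass simply skips the position embedding:

```python
import torch
import torch.nn as nn

class NoPositionalEncodingModel(nn.Module):
    """Minimal sketch: token embeddings only, no positional information."""

    def __init__(self, vocab_size: int, embedding_size: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embedding_size)
        # Note: no position embedding table here.

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (batch, seq_len) token ids
        x = self.token_embedding(idx)   # (batch, seq_len, embedding_size)
        # Previously we would add position embeddings here; now x goes straight to the blocks.
        return x
```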
Okay. So after doing this, I have created a notebook to test it, this one: no positional encoding. Again, for this notebook I just took one of the pre-training notebooks from the previous course, and this is the only change: instead of importing the GPT language model class from model, I am importing it from this new script. I have also made sure to tweak these parameters so that I get a small model, because I will be doing a lot of experiments and I don't want this to take a lot of time. I want to test this on a small model just to confirm that these methods work; you will see at the end of this video that once we find the method that works well, we are going to increase the size of the model and use it. But for now, because we are doing a lot of experiments, we want to do them on a small model. I have also changed one
thing, which is the evaluation method. We have this estimate loss method that we call periodically during the training loop, but in the previous course I used to take random batches for evaluation. This time I have changed this method to use the same batches each time so we can track improvement clearly, and this is how I have done that. We have the number of evaluation batches, which is set to a thousand; you can change this value if you like, and the more the better. Here I have a get evaluation indices function, and it works both for training and validation. You can see that I compute this only once and then reuse it: the evaluation indices dictionary will contain the batches for training and validation, and we are getting them randomly, but only once. Later, when I call estimate loss and use the get batch for loss estimation function, I provide those indices to it. This allows me to get the same batches during evaluation, and it will show us whether the model is improving during training or not.
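Here is a minimal sketch of that fixed-batch evaluation idea, assuming the model's forward pass returns (logits, loss) like the GPT class from the previous course. The names and constants mirror the transcript but are illustrative, not the exact notebook code:

```python
import torch

block_size = 256          # context length (assumed)
batch_size = 32
eval_batches = 1000       # number of fixed evaluation batches

# Stand-ins for the real tokenized splits from the course notebook.
train_data = torch.randint(0, 50_000, (100_000,))
val_data = torch.randint(0, 50_000, (10_000,))

def get_evaluation_indices(data: torch.Tensor) -> torch.Tensor:
    """Sample starting indices once; reuse them at every evaluation call."""
    return torch.randint(len(data) - block_size - 1, (eval_batches, batch_size))

# Computed a single time, before the training loop starts.
evaluation_indices = {"train": get_evaluation_indices(train_data),
                      "val": get_evaluation_indices(val_data)}

def get_batch_for_loss_estimation(data: torch.Tensor, ix: torch.Tensor):
    """Build one (x, y) batch from a row of precomputed starting indices."""
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x, y

@torch.no_grad()
def estimate_loss(model) -> dict:
    """Evaluate on the same fixed batches every time, so the curves are comparable."""
    model.eval()
    out = {}
    for split, data in (("train", train_data), ("val", val_data)):
        losses = [model(*get_batch_for_loss_estimation(data, ix))[1].item()
                  for ix in evaluation_indices[split]]
        out[split] = sum(losses) / len(losses)
    model.train()
    return out
```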
Now let's go down. Okay, so here is the training loop. I have also added a learning rate scheduler; I have used cosine annealing with warm-up. Why do we use learning rate schedulers in general? Basically, without one, we set the learning rate to a fixed value during training, let's say 1e-4, and that value is used for the whole training run; it does not change. Learning rate schedulers allow you to change that value dynamically based on the number of epochs or the number of iterations. Here we are using a cosine annealing learning rate scheduler with warm-up. What does that mean? You can see that we compute the warm-up iterations: warm-up takes the learning rate and keeps increasing it until we reach this number of iterations, and later it starts decreasing. Here is an image that I found that explains it. This is the warm-up phase: the learning rate starts from a minimum value and goes up to a maximum value, then it starts decreasing following a cosine curve. And this is exactly what we have in the code: the warm-up iterations are the first phase of the scheduler, and after the warm-up phase we use the scheduler to decay the learning rate until it reaches the minimum learning rate value.
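A minimal sketch of such a schedule, assuming linear warm-up followed by cosine decay (the constants here are placeholders, not the course's exact settings):

```python
import math

max_lr = 1e-4          # peak learning rate after warm-up (assumed)
min_lr = 1e-5          # floor the cosine decay settles at (assumed)
warmup_iters = 1_000
max_iters = 20_000

def get_lr(it: int) -> float:
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    progress = (it - warmup_iters) / (max_iters - warmup_iters)   # goes from 0 to 1
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))           # goes from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)

# In the training loop, the current value is pushed into the optimizer:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(iteration)
```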
So yeah, this is something that I have added to the training loop, and these are the only changes that I have made; everything else stayed the same. I have already done the training and saved the training and validation losses, and now
let's go back to the slides to show you
what we got. I am back and here is the
plot. So you can see that here we have
two curves. So we have the training loss
and validation loss. And here I have
used no positional encoding. As you can see, the loss was going down but then it stagnated, which means that the model is underfitting. This is probably because the model is small, but we don't care about that here; we want to compare the methods. So this is the first one, and we have the graphs. Before moving on, I just want to mention the training time: for one epoch, the model took roughly 2 hours to train. The next method is
absolute positional encoding and we are
going to focus on the learnable version.
This is the one we used in the previous
course. So no code changes are needed.
You saw in VS code that we have a script
called model.py and we are going to use
that one. So I will not go to VS code. I
will directly show you the results and
compare this method to the no positional
encoding. So here we have two graphs.
One for the training loss and the other
one for the validation loss. The orange
line is absolute positional encoding and
the blue line is the no positional
encoding method. And as you can see in
both graphs, the absolute positional
encoding method performs better than no
positional encoding. And this was
expected because as I said adding more
information to the transformer helps it
to learn better and again the training
time. So this method also took roughly 2
hours and 10 minutes. So there is no
difference in training time. I am keeping track of the training time because it is something we need to take into consideration: if two methods give us the same performance but one takes more time to train, that will help us choose which method to keep. Now we are looking at sinusoidal
positional encoding. This is another way
to add positional information to tokens.
This method is pretty interesting. On
paper, it should perform just as well as
the learnable version, but it does not
have any learnable parameters. So, if it
performs like the previous method, which
is absolute positional encoding with
learnable parameters, this will be good
because it will save us parameters.
Let's see how this one is implemented in
code. Okay, so I will remove this script, and I have another one which is called model sinusoidal positional encoding. I will put it here and let me scroll until I find the GPT language model class. As you can see, we don't have the embedding table for positions, which means we removed those parameters from the model. But here I have added a positional encoding method, which will compute the positional encoding up to block size; let me go to that method. I need some space. Okay, great. So you
can see that here, create sinusoidal encoding will use the two formulas that I showed you in the slides. It uses the sine function and cosine function to compute the positional encodings, and the implementation that I have here basically just uses those functions. As you can see, here is that 10,000 value that I showed you in the slides. The sine wave is used for the even embedding dimensions and the cosine for the odd dimensions, and at the end we get a positional encoding tensor of shape one by max length by embedding size, and we compute these values only once. So when we instantiate a new instance of the GPT language model class, inside the constructor we call that method and store those positional encodings in a buffer that we call positional encoding, and later we just use them. This is the difference between sinusoidal positional encoding and absolute positional encoding with learned parameters: here we add zero parameters, but we compute the values beforehand so that we can use them later. Okay, so that is the sinusoidal encoding. Now, where do we use it? You can see that here in
the forward pass, we did the same thing again. We take the input tokens and pass them through the embedding table; this gives us the token embedding. Here I tried to add the shapes, because when creating a model this is the most important thing to look at: the shapes are very important. After that, we get the positional encoding. Where do we get it? Because we have it stored in the buffer, we have access to that tensor, and here we need to take just a slice. If we have a sequence that has just 10 tokens, we take just the first 10 vectors from that tensor; we don't want to go up to block size. So we get the positional encoding and add it to the token embedding tensor to get the input that will go to the next layers of the transformer.
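Here is a hedged sketch of how that could look inside the model: the encoding table is registered as a buffer in the constructor and sliced to the sequence length in the forward pass. It reuses the sinusoidal_positional_encoding helper sketched earlier; names are illustrative, not the exact course code:

```python
import torch
import torch.nn as nn

class SinusoidalGPT(nn.Module):
    """Sketch of the relevant parts: precomputed sinusoidal table, no learned positions."""

    def __init__(self, vocab_size: int, block_size: int, embedding_size: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embedding_size)
        # Precompute once; register_buffer keeps it on the right device without making it a parameter.
        pe = sinusoidal_positional_encoding(block_size, embedding_size)  # (block_size, embedding_size)
        self.register_buffer("positional_encoding", pe.unsqueeze(0))     # (1, block_size, embedding_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)                    # (B, T, embedding_size)
        pos_enc = self.positional_encoding[:, :T, :]           # slice only the first T positions
        return tok_emb + pos_enc                               # input to the transformer blocks
```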
After that, I have done the same thing: I have this notebook, improving the transformer with sinusoidal positional encoding. Inside, I made the model small, so the size is just 11 million parameters, I changed the way we evaluate the model, I added the learning rate scheduler, I ran the training, I saved the training and validation losses, and I created the graph to compare the three methods. So let's go to the slides and see if this one performed well or not. Here are the
graphs, and again we are comparing sinusoidal positional encoding to the other ones in both training and validation. Surprisingly, this method performed worse than both other methods on both training and validation. You can see that here the gap between sinusoidal and no positional encoding is a little bit small, but during training the gap is a little bit bigger. This means that using learnable parameters is better than just precomputing the values with the sine and cosine formulas that we have seen. Okay, let's move on. Before that, I always forget to mention the training time: this one took 2 hours and 10 minutes. Next up we have relative
positional encoding. There are a few
versions of this method, but I will show
you the one with learnable parameters.
Here is the idea. We define a range of relative distances with a parameter called max relative distance. Let's say this parameter is set to eight; then we will have two kinds of distances, negative distances and positive distances. This is what we have seen before: the tokens that come before the current token have negative distances, and the ones that come after it have positive distances. That range is called the number of buckets. Let me show you an example so that you can understand. Let's take this sentence.
Here we may have other tokens that come before, and here we may have tokens that come after. Let's say that max relative distance is set to four, and let's take the token "not" in this sentence. As I said, we will have positive distances going up to max relative distance, which in this case is four, and we have negative distances. This range is called the number of buckets, and it contains unique learned bias values. This is the range that we are interested in because, as I said before, tokens that are very close to the one we are looking at influence it more than far away tokens. Tokens that are very far away, with very big distances, should not affect the token very much. Now, what do we do with the tokens that are outside this range? These ones get shared biases. Which bias do we take? Let's take the tokens that are on the left: these use the first value from the unique learned bias values, and the ones on the far right use the last value from this range. This way the model does not learn absolute positions but learns how far apart two tokens are. Since this method is a
little bit tricky to implement, I
decided to make this diagram to help
explain how the model uses this method.
The diagram looks a little bit scary
because it's big, but I'll try to make
sure to explain each part individually
so that you can understand the full
picture. We start with a sentence. In
this case, the sentence is, "Hi Imad, how are you doing?" We split it into individual tokens. Here we have six tokens and one sentence. This is the input shape: one is the batch size and six is the number of tokens in the sequence. After doing this operation, we pass this tensor into the token embedding table, and we get a tensor of size 1 x 6 x 768. 768 is the embedding size, and here the block size is set to 1,024. So this is the input. Now we need to feed this tensor to the attention layer.
Here we have two individual parts: the first one is the multi-head attention layer, and the second one is the layer where we compute the relative bias that will be used afterwards. Let's focus on the multi-head attention layer. Because I don't want to show a lot of heads (the diagram would be too complex), I have decided to show just two heads; everything is explained in the first head, and for the second one I don't show much information because it's the same thing. When we take this input tensor and feed it to the head, we create two tensors of the same size: the key tensor and the query tensor. I said that we create tensors of the same size, but here you can see that the shape is not the same: we have 1 x 6 x 384. 384 is just this value divided by the number of heads; since we have two heads, we divide 768 by 2, which gives us 384. So we create these two tensors, K and Q. We transpose K to get this tensor, so the shape is 1 x 384 x 6. After that, we multiply these two tensors, and at the end we get a tensor of size 1 x 6 x 6. We do the same thing for the second head, and we stop here; we don't continue. We go back to the layer where we calculate the relative bias and do this. First, we create two position vectors, one for queries and the other one for keys, and here the size is six: we have one as the batch size and six as the number of tokens. We take these two tensors and use broadcasting to compute a 6x6 matrix. I think this is too small, so let me go back to Inkscape, where I created this diagram, and zoom in so that you can see this clearly. Okay. So
this is Inkscape and now I think you can
see clearly. I said we start with two tensors, one for queries and one for keys. We use broadcasting to compute a 6x6 matrix from these two tensors, and this one is called relative positions because this matrix holds the relative distances between every pair of tokens. The diagonal contains zeros, the upper triangle of the tensor contains positive distances, and the lower triangle contains negative distances. Now we take this relative positions tensor, and we shift it and clamp it. Here I am not showing a lot of detail, but don't worry, when we switch to VS Code I will show you everything that goes on under the hood. After clamping, we will have values between zero and a maximum value; that maximum value will be the number of buckets minus one. But as I said, I'll not go into details. Why are we clamping the tensor? Because we cannot use negative indices to get vectors from the embedding table; that would give us an error. So we need positive values to do that. We take the values from here, we get the vectors from the embedding table, and we create this tensor. You can see that the shape here is number of heads by 6x6. This is very important because if I
go back to the multi head attention
layer, you can see that each head's score matrix is 6x6, and because we have two heads, the shape should be number of heads, which is 2, by 6 by 6. So this matches what we have up here. Now we take each slice and add it to the output of each head. Let's take this slice as an example: we take this tensor, we add the two together, and the output shape should stay the same; it should be 6x6. After that we continue: from X we get the value tensor, we multiply these two together, and we get the final output of the head. We do the same thing for the second head.
We concatenate the results and as you
can see we are back and here outputs and
X have the same shape. So this is how
relative positional encoding is
implemented. I hope that this diagram
was helpful and now let's go to VS code
to see how to turn this into code. Here
is the same diagram. I'll keep it here
because I need to explain to you how
this works. And instead of model, I have
another script called model relative
positional encoding. Let me put this one
here. So I'll make this a little bit
smaller. I hope it's not too small. And
now let's go down. Here is our class. I
added a new parameter to the model, max relative distance. We pass it from the GPT language model class to the block; after that, inside the block, we pass it to the multi-head attention layer. Here we start implementing the relative positional encoding. First we start with the number of buckets. The number of buckets is two times the max relative distance. Why are we multiplying by two? Because we have positive and negative distances. And we add one to take into account the distance zero, because if you are looking at the same token the distance should be zero. So this is the range; in the slides I call these the unique bias values. After that, we create the relative attention bias embedding table with shape number of heads by number of buckets, and it's this one. Let me zoom in a little bit. You can see that this is exactly what I have here in the code: we have the number of heads and the number of buckets. Currently we haven't done any calculations, so let's back up here. I need to go
inside the forward pass. Here I start with the relative bias calculation: I get the relative bias, which is basically this yellow tensor or this green tensor. I have this compute relative position bias method that I am going to use, and here, let me zoom in again, we have two tensors, query positions and key positions. Here they are, and each one is a tensor of shape one by sequence length, or batch size by sequence length. So we have query and key positions. This is how we create the relative positions: the shape will be t by t. Here it was 6x6, but in general it should be sequence length by sequence length. And here is how we do that, just by using broadcasting: we take the key positions and subtract the query positions. If you don't understand this notation, don't worry, I will show you how it works. I will open my terminal here and type python. Let's
start from the beginning. Let's import
PyTorch. Here the sequence length is set
to four. This is just an example just to
show you how this works. I will also
create the key tensor and query tensor.
Let's look at them. Key positions: this is the content of key positions, and here are the query positions. Now let's run this code for relative position and look at it. As I have
mentioned here, the diagonal should have
zeros. That's what we have here. The
upper triangle contains positive
distances. Here they are. And the lower
triangle contains negative distances.
Now I said that after computing this
relative position tensor, we need to
shift it and clamp it. This is the first operation: shift the range to positive values between zero and two times max relative distance. And this is how we do it: we take relative position and add max relative distance to it. In this case, max relative distance is not defined; we would pass it when we create the model, but let's just add it here. Let's say that max relative distance is two, meaning we want to look at just two tokens before and after the token that we are looking at, and we are going to take that. So let's take the relative positions and add this max relative distance value to them. Ah, it tells me that relative indices is not defined; that's correct, let's create it: relative indices is equal to relative position plus max relative distance. Now let's look at both tensors, relative position and relative indices, and as you can see, things are shifted. Max relative distance is equal to two, so the diagonal, which before was zero, is now two; basically we added two to every value in this tensor. But we need to clamp it. This is the second operation, and here is how we do that. Let me run this, and as you can see, we no longer have negative values. Let me show you how relative indices looked before: you can see that in this position we had minus one; after clamping the tensor, we have zero. And here we had five; because the maximum valid value is the number of buckets minus one, which is four, this value was clamped to four.
Why are we doing this? Just to remind you: after that, we are going to take these values and use them to get the embedding vectors from the embedding table, and these are the positions in that table. The buckets go from zero up to the number of buckets minus one. If, for example, you try to get a vector at position minus one, you will get an error, and if you try to get a vector at a position higher than the maximum position, you will get another error. This is why we need to do the shift-and-clamp operation.
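Putting those steps together, here is a rough sketch of the relative-bias computation (shift, clamp, embedding lookup, permute). It illustrates the idea rather than reproducing the exact course code:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learned relative position bias, one bias value per (bucket, head)."""

    def __init__(self, num_heads: int, max_relative_distance: int):
        super().__init__()
        self.max_relative_distance = max_relative_distance
        num_buckets = 2 * max_relative_distance + 1          # negative, zero, and positive distances
        self.relative_attention_bias = nn.Embedding(num_buckets, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        positions = torch.arange(seq_len)
        # (T, T) matrix of distances: key position minus query position.
        relative_position = positions[None, :] - positions[:, None]
        # Shift to non-negative indices, then clamp distances outside the bucket range.
        relative_indices = relative_position + self.max_relative_distance
        relative_indices = relative_indices.clamp(0, 2 * self.max_relative_distance)
        bias = self.relative_attention_bias(relative_indices)   # (T, T, num_heads)
        return bias.permute(2, 0, 1)                             # (num_heads, T, T)

# Usage: bias = RelativePositionBias(num_heads=2, max_relative_distance=4)(seq_len=6)
```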
Okay, so now we have the relative indices. We use that embedding table to get the biases, and we permute the dimensions, because the lookup gives us t by t by number of heads, and as you can see from the output we need to change that to number of heads by t by t. We do that with the permute method; you can see that here, dimension two is basically the number of heads, we put it at the first position, and we put t by t at the last positions. After that, we return that bias, which means we return this tensor. So here it is: the
relative bias is this tensor. Now we
loop over the heads, and for each head we take one slice from this tensor; it's this arrow that goes here. We take one slice and give it to the head. Now let's go to the head class to show you how that works. Here is the head class. Inside the forward pass, we get the input, which is this X, and we get the head bias, which is one slice from this tensor. Now let's look at head one. You can see that we get key and query from the input X. We transpose the key tensor and multiply it, using matrix multiplication, by Q in order to get the weights. Here they are. Now, before continuing, we take that slice; it's here, we got it from the previous step. We take the weight tensor and add the head bias to it. Here we are using unsqueeze because the head bias is going to be t by t and we need to add the batch dimension so that broadcasting works. After doing this, we continue like before: we use masked_fill so that we only keep the lower triangle of the tensor, and after that we apply softmax, dropout, etc. Then we come to this part: we get the value tensor and multiply it by the weights in order to get the output from one head. After that, we just concatenate the results, and this is done in the multi-head attention class.
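For reference, here is a condensed sketch of a single causal attention head that accepts such a per-head bias; the hyperparameters and names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head that adds a relative position bias to its scores."""

    def __init__(self, embedding_size: int, head_size: int, block_size: int, dropout: float = 0.1):
        super().__init__()
        self.key = nn.Linear(embedding_size, head_size, bias=False)
        self.query = nn.Linear(embedding_size, head_size, bias=False)
        self.value = nn.Linear(embedding_size, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, head_bias: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)          # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5          # (B, T, T) attention scores
        wei = wei + head_bias.unsqueeze(0)                           # add the (T, T) relative bias
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf")) # causal mask: no future tokens
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                               # (B, T, head_size)
```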
So, at the end, let's go to the forward pass here. This is where we were: for each head, we take one slice from this tensor, we get the head output, which is this one, and we append everything; at the end, we concatenate those tensors to get the output. In this case, it will be of shape batch size by sequence length by embedding size, which is the same as the input tensor. So yeah, this is how you implement relative positional encoding. Again, I have created another notebook, let's look at it, relative positional encoding, this one, to use this method and get the training and validation losses. Now
let's look at the results in the slides.
Okay, so I have explained everything here. For relative positional encoding, again we have training and validation, and you can see that relative positional encoding performed really well on both. Relative positional encoding is this one, with the plus marker. But the downside is that it is slow, because, as you saw, there is a lot of calculation performed under the hood. It took 5 hours, and if you remember, the previous method took two hours to train. But it is worth it, because so far we got the best performance. Now we have
one more method before we end this video
which is rotary positional encoding. So
let's look at that. This method does not
add any new parameters. That's a good
start. Instead, it modifies the key and
query vectors by rotating them before
calculating the attention. I took the previous diagram and made it a little bit simpler, adjusting it to work for RoPE; I think this is a great way to explain how the method works under the hood instead of just showing you the code. So, let's start. How does this work? The only thing that you need to change is this: when you create the key and query tensors from the input, you rotate them before calculating the attention weights. So when we go to VS Code, we are going to focus on this part; don't worry about the other parts, just look at this one. You can see that this icon means we are going to rotate that tensor. Let's go to VS Code and search for model rope. Okay, let's go down to
the GPT language model class. You can see that I have added a new parameter called rope base frequency; it is equal to 10,000. I took this from the RoFormer paper, where they use it to compute the frequencies; we are going to look at this later. You can also see that I have added this rotary positional embedding class, which will compute those frequencies, and we pass it to the block. We will do the same thing as with relative positional encoding: later we pass it to the multi-head attention layer. But before looking at the block, let's understand what is happening inside this class. In the init function, you
should understand that the rotary positional encoding method treats pairs of numbers as complex numbers. This is why we loop from zero up to embedding size divided by two: we take two values from the embedding dimension and consider them as one complex number. Each frequency value is calculated like this: theta at index i is equal to 1 divided by the base frequency, which is 10,000, raised to the power 2i divided by the embedding size; you can change the base when you create the instance. This formula was taken directly from the RoFormer paper, and this is just the code to calculate it: as you can see, it's one divided by the denominator, which is calculated here. After getting the thetas, we multiply those values with the position indices. The position indices again go up to block size, and here we use torch.outer to get the frequencies. Now, if you don't understand what torch.outer does, let me again open the terminal, and here
I will create two tensors just to
demonstrate how that method works. Okay,
so we have tensor A and tensor B. Now I
am going to write torch.outer and pass A and B, and here is the output. Let's see how the first row was calculated: you take one and multiply it by these values, so 1 * 4 gives you four, 1 * 5 gives you five, etc. To compute the second row, we take two and multiply it by these values, so 2 * 4 gives you 8, 2 * 5 gives you 10, etc. This is how torch.outer works. Now let's go back to the code. Here we have two tensors: A and B in this example are the position indices and the thetas, and the two vectors don't need to have the same length. If I go back to A and, instead of 1, 2, 3, I just remove the three and run torch.outer again, you can see that it still works. Here we have the same thing: the position indices are of shape block size and the thetas are of shape embedding size divided by two. When we use torch.outer, we get a tensor of size block size by embedding size divided by two, and this gives us the frequencies, and this is what we
need. Now, from this, we calculate the complex numbers in polar form. This is the polar form, cos(m*theta) + i*sin(m*theta), and torch.polar will do that for us. We get the complex thetas, and this is what we want. The thetas are computed only once, when we create an instance of the GPT language model class, and this is similar to the sinusoidal positional encoding: there, we used the sine and cosine formulas to calculate the positional encodings only once. Here, we compute these frequencies, the values called thetas, for each position only once, and we have a method called get thetas to retrieve them later.
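Here is a hedged sketch of that precomputation, following the RoFormer convention. The class and method names are illustrative, and the dimension is written per head here; the course's class may compute it over the full embedding size instead:

```python
import torch
import torch.nn as nn

class RotaryPositionalEmbedding(nn.Module):
    """Precompute the complex rotation factors cos(m*theta) + i*sin(m*theta) up to block_size."""

    def __init__(self, head_size: int, block_size: int, base: float = 10_000.0):
        super().__init__()
        # theta_i = 1 / base^(2i / d): one frequency per pair of dimensions.
        i = torch.arange(0, head_size, 2, dtype=torch.float32)
        thetas = 1.0 / (base ** (i / head_size))                 # (head_size / 2,)
        positions = torch.arange(block_size, dtype=torch.float32)
        freqs = torch.outer(positions, thetas)                   # (block_size, head_size / 2)
        # Polar form, stored once as complex numbers in a buffer.
        self.register_buffer("freqs_complex", torch.polar(torch.ones_like(freqs), freqs))

    def get_thetas(self, seq_len: int) -> torch.Tensor:
        return self.freqs_complex[:seq_len]                      # (seq_len, head_size / 2)
```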
This was too much math, I know, but as I said, it was taken directly from the RoFormer paper, and I wanted to show you that it exists in this script so we can look at it. To understand it, try to make examples like this, with smaller numbers and smaller shapes, to see how the math works, because when creating any deep learning model, shapes are the thing that will cause you a lot of trouble. You need to either work this out with pen and paper or open the terminal and create the tensors manually. Seeing how the shapes work will help you understand any deep learning model that someone else implemented, and it will also help you create your own. This is the role of rotary positional
embedding. Let's go back to GPT language
model. I have it here, and then I pass it to the block class. Inside the block, we pass it again to the multi-head attention layer. Inside the head, we get the key and query tensors from the input X. Just to show you where we are in this diagram, let's make this a little bit smaller and zoom in: you can see that we are here, we got the key and query tensors, and now we need to rotate them. I have added this helper function, apply rotary positional embedding; it takes the tensors and the thetas, and it rotates those tensors. After that, the rest is the same: you get Q and K, you transpose K, you multiply it by Q, you get the attention scores, you apply the mask, you apply softmax and dropout, and finally you multiply this by V and you get the attention output at the end. So the rest is the same.
This is the interesting part: apply rotary positional embedding. This is a function that I have outside the class that does the rotation for us. The tensor X has the shape batch size by sequence length by embedding size. We take it and reshape it to the shape B by T by D/2 by 2. Why are we doing this? Because, remember, when we computed the thetas I told you that RoPE divides the embedding dimensions into pairs and considers each pair as a complex number. This is exactly what we have done: we have divided the embedding size by two and created a new dimension. This gives us x combined. We then convert it into complex numbers, and we get B by T by D/2; the last dimension goes away because we create the complex numbers from it. Again, if you don't understand how this works, you can come here and create a tensor, so let's do that. Let's look at the shape: it's 2 by 2, and I will unsqueeze it to add the batch dimension. So I need to type x = x.unsqueeze(0), and now x.shape should be 1 by 2 by 2. This is exactly what we have here. Now we need to reshape this, so let's take this. Let's look at x combined and its shape: it's 1 by 2 by 1 by 2. Now let's use this function and put it here. Let's look at the shape: it's 1 by 2 by D/2, which in this case is one, and as you can see, we have complex numbers; this is the real part and this is the imaginary part of that complex number. Okay, so now we have the thetas; they are here. The thetas are complex numbers, precomputed complex frequencies of shape T by D/2. You can see that here we are missing the batch dimension; this is why I used unsqueeze, and now the shape should be 1 by T by D/2. Now we need to apply the rotation by multiplying complex numbers, and we do this with just a simple multiplication. Here, x complex has this shape, the thetas have this shape, broadcasting works, and this gives us a tensor of the same shape, but it results in rotating that tensor in the complex plane. Now we need to go back to real numbers: torch.view_as_real gives us that last dimension back, and then we flatten the last two dimensions to combine them into one and get the embedding size back. So at the end, X out will have the shape B by T by D, and this is exactly what we have here: the input was 1 by 6 by 384, and after rotating the tensor it should stay the same. The input was B by T by D, so the output should be B by T by D, and that's it. This is how you implement RoPE in Python.
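Here is a hedged sketch of such a helper, assuming precomputed complex frequencies like the ones sketched above. The function name follows the transcript, but the details may differ from the actual script:

```python
import torch

def apply_rotary_positional_embedding(x: torch.Tensor, thetas: torch.Tensor) -> torch.Tensor:
    """Rotate x (B, T, D) using precomputed complex frequencies thetas (T, D/2)."""
    B, T, D = x.shape
    # Group the embedding into pairs: (B, T, D/2, 2), then view each pair as one complex number.
    x_combined = x.float().reshape(B, T, D // 2, 2)
    x_complex = torch.view_as_complex(x_combined)            # (B, T, D/2)
    # Add a batch dimension to the frequencies so broadcasting works.
    freqs = thetas[:T].unsqueeze(0)                           # (1, T, D/2)
    x_rotated = x_complex * freqs                             # rotation in the complex plane
    # Back to real numbers, then flatten the last two dimensions to recover D.
    x_out = torch.view_as_real(x_rotated).flatten(-2)         # (B, T, D)
    return x_out.type_as(x)

# Usage sketch: q = apply_rotary_positional_embedding(q, rope.get_thetas(T))
#               k = apply_rotary_positional_embedding(k, rope.get_thetas(T))
```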
Again, I have created another notebook to do the experiment. It's this one: improving transformer, rotary positional encoding. I'll make sure to clean everything up, because I have created a lot of files here; I'll clean them and then push everything to my GitHub repository.
In that notebook, again, I created the model. I used this script to instantiate the GPT language model class, and I have the results. So let's go back to the slides to show you what we got. Okay, this is the slide. Now I am comparing all the methods, including RoPE. As you can see, RoPE is this one; it's in yellow. It was close to the relative positional encoding method in terms of the loss value at the last iteration, but you can see that during validation it performed better. So both of these methods are really great: relative positional encoding and RoPE are the best. But since RoPE took 2 hours to train and, if you remember, relative positional encoding took 5 hours to train, RoPE is 2.5 times faster than relative positional encoding, and it also generalized really well, as shown by the lower validation loss. So we are going to consider RoPE the winner of the experiment I conducted in this video. It shows that positional encoding is very important: if we decide not to use positional encoding, we get this blue line, so the loss will be here, and the same goes for validation. Just by adding positional encoding, we improved the performance of the model by a big margin. You can see that the gap between no positional encoding and rotary is very big, and the same holds for validation. So yes, positional encoding is very important. We have a lot of methods, we compared them, and we ended up choosing rotary positional encoding because it is faster and it generalizes well in validation. This is the method that I am going to keep, and it is going to be used in the final version of the model in the final video. I hope you found this first video in the course helpful. It took a lot of time to make, so if you learned something, let me know, and see you in the next video. Hi everyone. In this video, we are going to focus on the attention layer. We are going to compare different methods that researchers proposed to improve the transformer architecture. I will make sure to show you the theory behind each method and how to implement it in code, and by the end, we will choose the method that achieves the lowest validation loss. If you forgot why we use the attention layer in a transformer, let me explain it one more time: attention is used to help the language model focus on the most relevant pieces of information in the input. In this video, we are going to compare the following methods: sparse attention, multi-head attention, grouped query attention, and linear, local, and latent attention. Here you can see that I have tried my best to link to the original papers where these methods were published. For example, here you have sparse attention, and sometimes I link multiple research papers that talk about that specific method. So if you want to go deeper, please make sure to click on these icons; they are all clickable and will take you directly to the research paper. Let's start with multi-head attention. This method was introduced in 2017, again in the famous "Attention Is All You Need" paper. Here is a screenshot of the first page of that paper; you can click on this image if you want to go directly to the paper and read more about it. MHA, which stands for multi-head attention, is the foundational attention mechanism used in that paper. Here is the transformer architecture diagram, and here is the attention layer. You can see that it comes after encoding the input sequence. In the "Attention Is All You Need" paper, positional encoding and word (token) embedding were used to encode the input sequence, and you can see that the attention layer comes after the encoding and before the feed-forward network. The role of MHA is to compute the attention scores, and it does this independently across multiple heads in parallel. The beauty of this technique is that we can divide the computation into multiple heads and perform it in parallel, which speeds up training a lot. But you might ask: why should we use multiple heads? Can't we just perform one big matrix multiplication instead of dividing it into smaller matrix multiplications? The answer is that when we use multiple heads, the model learns diverse representations because each head focuses on different aspects of the input. Maybe this head will focus on something and the other head will focus on something else, and this helps the transformer model generalize better.
Now let's zoom into the attention layer. Within each head, the input is projected into three matrices: the query matrix, the key matrix, and the value matrix. This diagram contains six heads, and the first head contains the three matrices I just mentioned. Before we go to the second point, this arrow indicates that each query vector is compared to all key vectors to measure similarity. When I say query vector, I mean a slice of the query matrix, because the query matrix contains multiple vectors. So we take one vector from the query and compare it to the key vectors to measure similarity, typically using dot products or other methods. These similarity measurements give us the attention score matrix, and here is how it looks. You can see that it is a square matrix, and here it is a full matrix, which means we attend in both directions: if I am at this token, I can look at the tokens that come before me and the tokens that come in the future. The problem is that we are using a decoder-only transformer for text generation, which is the task we are trying to perform, so we shouldn't look at the tokens that come in the future, because then the model would cheat. During training we should apply a mask to remove the tokens that come in the future. The cells colored in white contain zeros, which means the model cannot cheat because it doesn't have that information, and the cells colored in pink contain the actual attention scores. By using this trick, we ensure that the model learns to predict the next token instead of cheating.
Okay, so now here is our diagram. I am going to zoom into one head, but the calculation is similar for the other heads. Here is the formula that we use to compute the attention scores, and here is the diagram; these are exactly the steps we are going to follow when we implement multi-head attention in code. We have the input, and we project it into three matrices: query, key, and value. Let's look at the term inside the softmax function. We have Q, and we multiply it by the transpose of K: here is K, we transpose it, and we multiply it by Q. After multiplying these two matrices, we divide by the scaling term just to scale the numbers inside the matrix. I'm not showing that division here, just to keep the diagram simpler, but we will add it in the code later. After that we get this matrix, which is basically this term. Then we apply the masking, as I said, because we don't want to look at the tokens that come in the future. Then we multiply this with V and we get the output, which contains the attention-weighted values. This multiplication is performed on the first head, and we perform the same calculation on the remaining heads. At the end, we take the outputs from each head and concatenate them to get the full output of the attention layer.
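To make those steps concrete, here is a minimal sketch of a causal multi-head attention layer in PyTorch. The class and parameter names (n_embd, n_head, block_size) are my own illustration, not necessarily the ones used in the course repository, but the math follows the formula above: softmax(Q K^T / sqrt(head_size)) V with a lower-triangular mask.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalMultiHeadAttention(nn.Module):
    """Minimal multi-head attention with a causal mask (illustrative names)."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.head_size = n_embd // n_head
        # one projection produces Q, K and V for all heads at once
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
        # lower-triangular mask: position i may only attend to positions <= i
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)               # each (B, T, C)
        # split the channel dimension into heads: (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        # scaled dot-product: softmax(Q K^T / sqrt(head_size)) V
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_size)   # (B, n_head, T, T)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))  # hide future tokens
        att = self.dropout(F.softmax(att, dim=-1))
        y = att @ v                                          # (B, n_head, T, head_size)
        # concatenate the heads back into one channel dimension
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

Note that this fuses all heads into one tensor instead of looping over separate Head modules; the diagram shows the per-head view, but the computation is the same.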
In this video we are going to consider multi-head attention as our baseline. Here are the loss curves for both training and validation. Later in the video, we are going to compare the other methods to this baseline, because multi-head attention is the method we used in the first course, and now we want to try the other methods to see if they improve the performance of the transformer model.
The first method that we are going to compare against multi-head attention is multi-query attention. This method was introduced in the Fast Transformer Decoding paper; click on the image if you want to read more about that paper. Multi-query attention is a computationally efficient method. Why? Because it reduces memory usage by shrinking the KV cache. If you are not familiar with the term, the KV cache refers to the memory the key and value matrices take during training or inference. MQA uses one key and one value for all query heads. We saw this diagram when I was explaining multi-head attention: in each head we have three matrices, key, query, and value. In MQA, all queries share the same key and value matrices. This means that K and V are calculated only once and then shared between the query heads. It also means that the number of parameters decreases, which might in turn impact the model's performance in some cases. But the major benefit of MQA is inference speed: this method is way faster than MHA when generating tokens.
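As a rough illustration of why shrinking the KV cache matters, here is a back-of-the-envelope estimate. The model dimensions below are hypothetical, not measurements from the course; the point is only that caching a single shared K/V head instead of one per query head divides the cache size by the number of heads.

```python
def kv_cache_bytes(n_layers, n_kv_heads, seq_len, head_size, bytes_per_value=2):
    """Size of the key/value cache: 2 tensors (K and V) per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * seq_len * head_size * bytes_per_value

# hypothetical small model: 8 layers, 8 heads, head size 64, 2,048-token context
mha = kv_cache_bytes(n_layers=8, n_kv_heads=8, seq_len=2048, head_size=64)  # every head cached
mqa = kv_cache_bytes(n_layers=8, n_kv_heads=1, seq_len=2048, head_size=64)  # one shared K/V head
print(mha / 2**20, "MiB for MHA vs", mqa / 2**20, "MiB for MQA")  # 32.0 vs 4.0, an 8x smaller cache
```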
Let's look at the internals. Same as before, we have our diagram and the attention function, and here is the diagram that shows what goes on inside one head. Do you see a difference? The key and value matrices are outside the head. This means that they are calculated once and used inside each head. This is the only difference between MQA and MHA; the rest is the same. We compute the term inside the softmax function, we get this matrix, we apply masking, and after that we multiply it by V to get the attention output. This calculation is performed similarly on the other heads.
It's comparison time. We have the two graphs for training and validation. In both cases, MQA performed better than multi-head attention, which is surprising. The difference is not that big, but it is in favor of MQA. I say it is surprising because multi-query attention uses the same key and value matrices for all query heads, which means we have less diversity when training this model. But on the dataset I am using, it seems that MQA works better than MHA. Again, though, the difference is not that big.
I also mentioned that MQA is faster at inference than MHA, and here is the diagram I created to illustrate this. On the x-axis we have the number of tokens to generate: I tried generating from 100 up to 2,000 tokens for MHA and MQA. On the y-axis we have the inference time in seconds. You can see that MQA is seven times faster than MHA when generating 100 to 200 tokens. But as the number of tokens to generate increases, that benefit starts to shrink: at this point MQA is only six times faster than MHA, then the ratio decreases again to 1.7, and by the time we reach 2,000 tokens it becomes worse, roughly 0.8 times the speed of MHA, in other words slower.
In the Fast Transformer Decoding paper, Noam Shazeer, the researcher behind MQA, showed that this method was 12 times faster than MHA with a sequence length of 128 tokens. So this broadly matches what we see. He was using very powerful hardware, while I just have an RTX 4070, which is not that impressive, but still: he used 128 tokens, I used 100 tokens, and I got roughly seven times faster inference. The model here is also small, so if I increased the model size this value would change, but it matches what Shazeer observed when he was writing that paper. What is surprising is that MQA becomes worse when we increase the number of tokens to generate. Maybe my implementation is not optimized, but since the paper didn't compare inference speed at higher numbers of generated tokens, I can't say with 100% certainty whether this method only helps when we generate fewer tokens and performs poorly when we try to generate long sequences of text.
Now that we have compared the performance of MQA to MHA, let's see how to implement it in code. I am in VS Code and I have opened the diagram; this is exactly what I showed you in the slides. Let's keep it here because it will help us see what we are doing inside the script. I will make it smaller. Open the model multi-query attention script, and like in the previous video, everything stays the same except the part we are concerned with: here we are going to change the attention layer.
So here I have created this multi-query attention class, and inside the constructor we have the number of heads, the head size, and the matrices: the key, value, and query layers. We also have the rest, which we will come back to. In the forward pass, let me zoom in: we get the batch size, the sequence length, and the embedding size from the input. We compute the key and value matrices and after that we get the query. Everything is the same so far, but remember that K and V are shared between all query heads. After getting the query, you can see its shape: batch size by sequence length by (number of heads times head size). We reshape this to add the number-of-heads dimension, giving batch by number of heads by sequence length by head size; this represents all the query heads. But remember that we have only one key and one value head, so we also reshape K and V to add this dimension, except we set it to one, because in multi-query attention we have just one key and one value, and those are shared by the query heads. We are going to rely on broadcasting so that we can multiply them by Q. Let's see how to do this.
This is the only thing we need to do. When we compute the multiplication between the query and the key, PyTorch will use broadcasting to make sure the shapes match. If we performed this operation directly, the shapes wouldn't line up, because here we have (B, number of heads, ...) and here we have (B, 1, ...). PyTorch matches these two dimensions by repeating the key number-of-heads times, so we effectively get (B, number of heads, ...), which is why I said the key is repeated for each head. The rest stays the same: we multiply Q and K and divide by the scaling factor, the square root of the head size. After that, let me zoom in again: we perform the masking, apply the softmax function, apply dropout, and at the end multiply the masked weights matrix with V, which we computed earlier. Again, look at the shapes: the attention weights are (B, number of heads, T, T), but V is (B, 1, T, head size), so these two dimensions do not match. Luckily, PyTorch uses broadcasting to solve this for us: it duplicates V number-of-heads times so that the matrix multiplication can be computed, and at the end this is the shape we get. You can also see that we merged all heads into one class instead of dividing the work into multiple head modules and then concatenating the results of each head; by the end of this forward pass everything is already concatenated. We transpose the first and second dimensions so that we can merge number of heads and head size into one dimension, the number of channels, and after that we return the output. This is exactly what we need to change: just the attention layer, the rest stays the same.
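For reference, here is a condensed sketch of the forward pass just described: many query heads, a single shared key/value head, and broadcasting doing the repetition. The names are illustrative rather than copied from the course script, but the shapes in the comments follow the walkthrough.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Illustrative MQA: many query heads share a single key/value head."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        self.n_head = n_head
        self.head_size = n_embd // n_head
        self.query = nn.Linear(n_embd, n_embd)            # all query heads
        self.key = nn.Linear(n_embd, self.head_size)      # one shared key head
        self.value = nn.Linear(n_embd, self.head_size)    # one shared value head
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x).view(B, T, self.n_head, self.head_size).transpose(1, 2)  # (B, n_head, T, hs)
        k = self.key(x).view(B, T, 1, self.head_size).transpose(1, 2)              # (B, 1, T, hs)
        v = self.value(x).view(B, T, 1, self.head_size).transpose(1, 2)            # (B, 1, T, hs)
        # broadcasting repeats k across the head dimension: result is (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) * self.head_size**-0.5
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        y = att @ v                                        # v is broadcast over the heads as well
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # merge heads back into channels
        return self.proj(y)
```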
Like in the previous video, I have created a notebook to try this: it's the one called improving transformer multi-query attention. All I did there is import the GPT language model class from this new script that I created; the rest is the same. We divide the data into training and validation, run the training, and at the end I make sure to save the training and validation losses so that I can plot the curves and show them in the slides. Now let's go back to the slides to explain the next method.
Local attention is the next method we are going to focus on. It was mentioned in these two papers; here they are, and again you can click on them if you want to go deeper. Local attention works by limiting the attention span of each token to a fixed-size window. Here is the full attention scores matrix, and this is how it looks after applying a window size of three: in this case a token can attend to up to two positions in the past. If I am here, for example, I can only look at the two previous tokens. You can play with this value. This makes local attention efficient, but it's a bit tricky to get the most out of it; you need to do a lot of optimization work to benefit from this approach. The problem with limiting the attention span to a fixed window is that long-range dependencies are not captured, because the model focuses on the local context only. Also, like multi-query attention, this method might lead to some degradation in the performance of the model.
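A quick way to state the rule: with a window of size w, token i may attend to token j only when j is not in the future and is at most w-1 positions back. The tiny check below is just my own illustration of the window-of-three example, not code from the course.

```python
def can_attend(i, j, window_size=3):
    """Token i may look at token j only inside the causal sliding window."""
    return 0 <= i - j < window_size

# with window_size=3, token 5 can see positions 3, 4 and 5 (two past tokens plus itself)
print([j for j in range(8) if can_attend(5, j)])  # [3, 4, 5]
```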
Let's see what happens inside one head. We have our formula, and here is the diagram. It is similar to the previous ones; the only difference is here. When we apply the masking, we make sure that we also apply the window, so that the values in this lower triangle that fall outside the window are set to zero as well. So we will have a zero here and a zero here, and the values in the middle band contain the attention scores. Finally, we multiply this masked attention weights matrix with the V matrix to get the output.
Now let's compare local attention to the previous methods, multi-head attention and multi-query attention. Here are the two figures, and as you can see, local attention beats the other two methods on both training and validation. The difference between local attention and multi-query attention is minimal, but if we compare local attention to the baseline, which in this case is standard multi-head attention, you can see the gap start to widen. Here is the inference speed: local attention is not as fast as multi-query attention, and it's even worse than multi-head attention in the configuration I tested.
We also have other variants of this method. There is the dilated sliding window, which looks like this; there is the chunked sliding window, which looks like this; and we can also combine the two to get global plus sliding window. Here is how it looks: we still have the sliding window that I showed you earlier, but we add global attention for the tokens that should attend to all tokens, for example special tokens like the start-of-text and end-of-text tokens.
Okay, now that we have seen the results, let's see how to implement this in code. Here is the new script that I created, and here is the diagram that explains how local attention works. We have the three matrices, key, query, and value; here they are just linear layers, but we are going to use them in the forward pass. So let's start. First we have the input X, which we project into the key and query matrices using the key and query linear layers; the shape will be batch size by sequence length by head size. This time I kept the Head class, but like with the previous method, multi-query attention, we could combine everything inside the attention class. It's up to you to decide; it's similar either way. When you keep the Head class, at the end you need to concatenate the results, and if you remove the Head class you do everything at once, so instead of three dimensions you have four, where the first one is the batch size and the second one is the number of heads.
number of heads. Okay. So now we get uh these two matrices. We are going to
these two matrices. We are going to apply our formula. By now you should you
apply our formula. By now you should you should know it by heart. Now we are
should know it by heart. Now we are here. We have the full attention weights
here. We have the full attention weights that we got here. Now we need to apply
that we got here. Now we need to apply this sliding window masking. So how do
this sliding window masking. So how do we do this? Like what I did in the
we do this? Like what I did in the previous video, if you don't understand
previous video, if you don't understand how this is implemented, the easy the
how this is implemented, the easy the easiest way is to open the terminal. Let
easiest way is to open the terminal. Let me activate the environment and here
me activate the environment and here open a new Python session and start
open a new Python session and start experimenting with this. So for example
experimenting with this. So for example here if you don't understand okay let me
here if you don't understand okay let me remove this because I don't have enough
remove this because I don't have enough space. So if you don't understand for
space. So if you don't understand for example what is happening here how are
example what is happening here how are we generating this matrix you can come
we generating this matrix you can come here for example here I need to import
here for example here I need to import pytorch and here try to create small
pytorch and here try to create small examples for example let's set t24 and
examples for example let's set t24 and here I need so I will just take this I
here I need so I will just take this I will not take the device let's take
will not take the device let's take unsqueeze
unsqueeze okay let's look at row indices okay so
okay let's look at row indices okay so here we have four rows and one column
here we have four rows and one column and column indices should be similar but
and column indices should be similar but instead we will have four columns and
instead we will have four columns and one row column indices. Let's look at
one row column indices. Let's look at column indices. Okay, so as you can see
column indices. Okay, so as you can see we have one row and four columns. Okay,
we have one row and four columns. Okay, so why did we create these two tensors?
so why did we create these two tensors? Well, here is the first step. We are
Well, here is the first step. We are going to prevent the attention to future
going to prevent the attention to future tokens. Okay, so how do we do this? If I
tokens. Okay, so how do we do this? If I take this and again look at coausal mask
take this and again look at coausal mask as you can see here false means zero and
as you can see here false means zero and true means one. So you can see that by
true means one. So you can see that by doing this we were able to remove the
doing this we were able to remove the upper triangle from the uh so this is
upper triangle from the uh so this is the mask. We are going to multiply this
the mask. We are going to multiply this with our attention weights matrix in
with our attention weights matrix in order to remove the upper triangle. And
order to remove the upper triangle. And how did we do this? You can see that
how did we do this? You can see that here we have four rows and one column.
here we have four rows and one column. Here we have one column and four rows.
Here we have one column and four rows. And this is the beauty of Python. So we
And this is the beauty of Python. So we Python under the hood will use
Python under the hood will use broadcasting in order to duplicate one
broadcasting in order to duplicate one of these tensors so that we can perform
of these tensors so that we can perform this operation. And this because here we
this operation. And this because here we have uh here we have 4x 1 and here 1x4.
have uh here we have 4x 1 and here 1x4. This will be this will be multiplied
This will be this will be multiplied four times so that we get a 4x4 matrix.
four times so that we get a 4x4 matrix. If you don't do this, you will never
If you don't do this, you will never it's it's going to be hard for you to
it's it's going to be hard for you to understand this just by imagining it.
understand this just by imagining it. You need to open the terminal, open a
You need to open the terminal, open a Python session and thinker with these
Python session and thinker with these values that or with these expressions
values that or with these expressions that are in the forward path. This is
that are in the forward path. This is how you understand any model. Okay. So
how you understand any model. Okay. So that was the first step. This is what we
that was the first step. This is what we have done before. But now we need to
have done before. But now we need to apply that slide window. So here we have
apply that slide window. So here we have this parameter. uh in this case because
this parameter. uh in this case because we have a small matrix I will make sure
we have a small matrix I will make sure to have a small window size let's set it
to have a small window size let's set it to two and here let's see the second
to two and here let's see the second step so here we are going to create the
step so here we are going to create the local window mask so this restricts the
local window mask so this restricts the attention to a local window around the
attention to a local window around the current token okay this is what we have
current token okay this is what we have set and this is the formula that we need
set and this is the formula that we need to apply so let's just take it here make
to apply so let's just take it here make sure do not take self and here I need to
sure do not take self and here I need to add one. Now let's look at local mask.
add one. Now let's look at local mask. And as you can see we have everything is
And as you can see we have everything is set to true. So the upper triangle is
set to true. So the upper triangle is set to two but the lower triangle is set
set to two but the lower triangle is set to false. So this is the inverse of what
to false. So this is the inverse of what we had before. Now we just need to
we had before. Now we just need to combine the two and we are going to use
combine the two and we are going to use the and operator so that we multiply the
the and operator so that we multiply the two values. So true * false will give us
two values. So true * false will give us false. True * true or true and true will
false. True * true or true and true will give us true. So let's look at the final
give us true. So let's look at the final mask. Final mask. And as you can see
mask. Final mask. And as you can see here is the upper triangle. It is set to
here is the upper triangle. It is set to false. This is exactly what we used to
false. This is exactly what we used to do before. But now we applied the local
do before. But now we applied the local attention mask. And as you can see we
attention mask. And as you can see we have two values max in each row. So this
have two values max in each row. So this means that it worked. And after that we
means that it worked. And after that we just need to apply this mask to the
just need to apply this mask to the attention weights. And finally after
attention weights. And finally after that we use the soft max function and at
that we use the soft max function and at the end we multiply the weights with the
the end we multiply the weights with the V matrix. After that we get the output.
V matrix. After that we get the output. And remember this is just one head. So
And remember this is just one head. So now we need to go. So this this was
now we need to go. So this this was already done but I just wanted to show
already done but I just wanted to show you this. So here are the full list of
you this. So here are the full list of heads. Here in the forward pass in the
heads. Here in the forward pass in the attention class we are getting the
attention class we are getting the output for each head independently but
output for each head independently but we need to concatenate them. So as you
we need to concatenate them. So as you can see the the the shape at the end
can see the the the shape at the end will be batch size times time sequence
will be batch size times time sequence or the sequence length times the number
or the sequence length times the number of heads times head size. This is how it
of heads times head size. This is how it works and the rest is the same. Now I
works and the rest is the same. Now I have all I have also created another
have all I have also created another notebook where I have used this class
notebook where I have used this class and this is how I was able to show you
and this is how I was able to show you the comparison between local attention
the comparison between local attention and the other methods. I hope that you
and the other methods. I hope that you understood this method. Now let's move
understood this method. Now let's move to the next one. Grouped query attention
Grouped query attention is our next method. It was introduced in the paper Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints; click on the image if you want to read more about it. Grouped query attention is a generalization of multi-head attention and multi-query attention. This is the diagram for multi-head attention, and here is how it looks for GQA. In grouped query attention, we form groups: in this example we have six query heads and we have created two groups, and within each group the query heads share the same key and value. In multi-head attention, the number of groups is equal to the number of heads, which means each head has its own key, query, and value matrices. In multi-query attention, the number of groups is equal to one, because all queries share the same key and value. In GQA, we have the flexibility to choose that value instead of it being one or the number of heads. This means that grouped query attention offers an appealing trade-off between speed and performance: it falls between multi-head attention and multi-query attention, it is fast, and it also gives good results. When we set the number of groups to a value lower than the number of heads, GQA reduces the number of trainable parameters compared to multi-head attention.
Let's zoom in on the attention layer to see exactly what happens. This is how the diagram looks for GQA; I want you to focus on this part. We are inside the attention layer. Remember that each group shares one key and one value matrix. This means that we need to duplicate the K and V matrices multiple times to match the number of queries in one group. Take this as an example: in one group we have three queries but only one key and one value. If you try to multiply these matrices together, you will get an error because the shapes do not match, and I tried to depict that with this size change: you can see that the query block is bigger than V. So we need to duplicate K and V enough times so that they match the queries in size, and after that we can apply the formula: after duplicating K, we transpose it, multiply it by Q, apply the mask, and then multiply everything by V to get the output. This is the important thing to keep in mind when implementing this in code.
Are you ready to see the comparison? Here you go: the model with grouped query attention has destroyed the previous methods. Look at the gap between the loss curves, it's huge. This method performed really well on both training and validation. Here I want to put a big asterisk on this result, though. I used a specific dataset of Moroccan Darija, and I took 1,000 batches for both training and validation to draw these curves. Maybe I would have gotten different results if I had increased that value from 1,000 batches to 2,000 or 10,000; maybe these curves would have changed. Here I am mainly showing you how to implement these methods and how they work under the hood, and if we change the dataset we might get different results. Maybe I will rerun grouped query attention another time just to verify that this result is consistent, and update the graphs if something changes. As I said, everyone will get different curves and different loss values depending on the model size and the data they use. Let's also check the inference speed. Here is the diagram: GQA is fast like multi-query attention, but a little bit slower than it; you can see MQA in orange. The difference is not that big, but GQA is slower because in multi-query attention we have one group, while here we have more than one group, so we have more parameters than MQA; that's why it's a little bit slower.
Now let's go to VS Code to see how to implement this. I have created the script model grouped query attention to implement this technique. Here we have the grouped query attention class; let's see how it works. We have the classic parameters, number of embeddings, number of heads, and so on, but I also added the number of key-value heads, which is basically the number of groups. I have these comments because, as I said, GQA is a generalization of both MHA and MQA: if the number of KV heads is not specified, we set it to the number of heads, which falls back to MHA (but we shouldn't really do that; we should specify a different value); if the number of KV heads is equal to one, that means we are using MQA; otherwise it's GQA, so the number of KV heads should be greater than one and less than the number of heads. After storing these values, we get the head size and the number of queries per KV head. We saw in the slides that we had two groups with three queries in each group; this value stores that information.
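A minimal version of that constructor logic might look like the sketch below. Names such as num_kv_heads and queries_per_kv are illustrative, not necessarily the course's identifiers; the assertions encode the MHA / MQA / GQA cases just described.

```python
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """Illustrative constructor: num_kv_heads is the number of groups."""
    def __init__(self, n_embd, n_head, num_kv_heads=None, dropout=0.1):
        super().__init__()
        # num_kv_heads == n_head      -> plain multi-head attention
        # num_kv_heads == 1           -> multi-query attention
        # 1 < num_kv_heads < n_head   -> grouped-query attention
        num_kv_heads = num_kv_heads if num_kv_heads is not None else n_head
        assert n_embd % n_head == 0
        assert n_head % num_kv_heads == 0, "query heads must divide evenly into groups"
        self.n_head = n_head
        self.num_kv_heads = num_kv_heads
        self.head_size = n_embd // n_head
        self.queries_per_kv = n_head // num_kv_heads      # queries sharing one K/V head
        self.query = nn.Linear(n_embd, n_head * self.head_size)
        self.key = nn.Linear(n_embd, num_kv_heads * self.head_size)
        self.value = nn.Linear(n_embd, num_kv_heads * self.head_size)
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
```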
I will go directly to the forward pass so that we follow the diagram we have here on the right. From the input X we project to get the query, and here is the shape: batch size, sequence length, and the number of embeddings. After that we reshape the query to have the shape batch by number of query heads by sequence length by head size, and here is how we do that: it's simple, we take the values from the constructor and apply the transformation. Then we get the keys and values, and remember, K and V are smaller, because within each group all the queries share the same key and value matrices. We get the key matrix like this, using the key linear layer defined earlier, and the shape will be batch size by sequence length by (number of KV heads times head size), where the number of KV heads is the number of groups. So here we will have two key heads; I am referring to the example we saw in the slides, where we have six query heads and two key heads because the number of groups is set to two. We then reshape this again to separate the head size from the number of KV heads, and we do the same thing for V, because it is similar: we use the value matrix to project X and then reshape that tensor. Now comes the part that I highlighted in the slides. From here we need to duplicate K and V enough times to match the number of queries: in the example we have six queries and two key/value heads because we have two groups, so we need to repeat K and V three times each, multiplying 2 by 3 to get six key/value heads to match the six query heads. This is the role of repeat_kv, a function I defined: it takes one of these matrices, K or V, and repeats it number-of-queries-per-KV-head times, and since the number of queries per KV head is equal to three, 2 times 3 gives us six, which matches the number of queries.
queries. Okay. So let's go inside repeat KV to see what is happening. We have the
KV to see what is happening. We have the tensor and the repetition times. How
tensor and the repetition times. How many times do we want to repeat that
many times do we want to repeat that tensor? So if repetition times is equal
tensor? So if repetition times is equal to one, we are we are going to return
to one, we are we are going to return that tensor. There is nothing we we need
that tensor. There is nothing we we need to do. But if this is greater than one,
to do. But if this is greater than one, we have this lovely function that
we have this lovely function that PyTorch provides us. It's called the
PyTorch provides us. It's called the repeat interle. And this will take so
repeat interle. And this will take so you take the tensor, you call that
you take the tensor, you call that method and you tell it how many times
method and you tell it how many times you want to repeat it. And here we are
you want to repeat it. And here we are specifying the dimensions that we want
specifying the dimensions that we want to repeat. Okay. So here we have number
to repeat. Okay. So here we have number of KV heads and this is what we want to
of KV heads and this is what we want to repeat because number of KV heads is
repeat because number of KV heads is equal to two because we have two groups
equal to two because we have two groups but we want to multiply that to get six
but we want to multiply that to get six at the end. After that this should be
at the end. After that this should be the number of query heads. This is why
the number of query heads. This is why we targeted this dimension specifically
we targeted this dimension specifically because we want it to go from two to six
because we want it to go from two to six to the total number of queries that we
to the total number of queries that we have. So this is a really great function
have. So this is a really great function that simplifies things. If we didn't
that simplifies things. If we didn't have it, we had to do a lot of
have it, we had to do a lot of gymnastics to get this output. So yeah,
gymnastics to get this output. So yeah, this is what we have inside repeat KV
this is what we have inside repeat KV and we use it for both key and value
and we use it for both key and value matrices. So yeah, now everything is
matrices. So yeah, now everything is prepared. The rest is the same. So we
prepared. The rest is the same. So we apply the attention formula. We get the
apply the attention formula. We get the weights. We apply the mask soft max. We
weights. We apply the mask soft max. We add a little bit of dropout. We multiply
add a little bit of dropout. We multiply the masked matrix with V to get the
the masked matrix with V to get the final output which in this case is Y.
final output which in this case is Y. But remember since we have four
But remember since we have four dimensions, we need to merge the number
dimensions, we need to merge the number of query heads and head size. And this
of query heads and head size. And this is exactly what we are doing here in
is exactly what we are doing here in these lines. And finally we apply
these lines. And finally we apply another dropout and get the final output
another dropout and get the final output from this grouped query attention class.
from this grouped query attention class. Yeah. Uh and again here you see that we
Yeah. Uh and again here you see that we didn't use the head class. So we have
didn't use the head class. So we have merged everything into the attention
merged everything into the attention class. And as you can see sometimes I
class. And as you can see sometimes I show you how to use the head. So if you
show you how to use the head. So if you want to multi to perform the
want to multi to perform the calculations separately you can do that.
calculations separately you can do that. And sometimes I show you how to merge
And sometimes I show you how to merge the head inside the grouped attention
the head inside the grouped attention into the attention layer. And whenever
into the attention layer. And whenever you fuse the head, you always add
you fuse the head, you always add another dimension. When we don't have
another dimension. When we don't have the head class, we will have four
the head class, we will have four dimensions. But if we decide to add the
dimensions. But if we decide to add the head class, we will have three
head class, we will have three dimensions. And at the end, you
dimensions. And at the end, you concatenate the output from each head
concatenate the output from each head separately. And again uh after I have
separately. And again uh after I have created this I have made sure to create
created this I have made sure to create a notebook that I have run in order to
a notebook that I have run in order to get those results and you can find it
get those results and you can find it here. So let me search for it. It's this
here. So let me search for it. It's this one 923 improving transformer grouped
one 923 improving transformer grouped query attention. Now let's go back to
query attention. Now let's go back to the slides to learn about the next
the slides to learn about the next method. Linear attention is the method
method. Linear attention is the method that we are going to focus on now. Read
that we are going to focus on now. Read the following papers if you want to
the following papers if you want to understand this method deeply. Again,
understand this method deeply. Again, you can click on the images if you want
you can click on the images if you want to get a direct link to the papers. We
to get a direct link to the papers. We know that multi head attention scales
know that multi head attention scales badly with long sequences of text
badly with long sequences of text because of the squared complexity.
because of the squared complexity. Linear attention tries to solve this
Linear attention tries to solve this issue by using less memory in order to
issue by using less memory in order to be efficient. How does it do that? Well,
be efficient. How does it do that? Well, researchers found a way to approximate
researchers found a way to approximate the attention formula with this one. You
the attention formula with this one. You can find this approximate equation in
can find this approximate equation in the paper that I have highlighted in the
the paper that I have highlighted in the previous slide. And as you can see here
previous slide. And as you can see here is the the equation number five is
is the the equation number five is exactly what I have written here. Also
exactly what I have written here. Also researchers found that even though we
researchers found that even though we are doing this approximation which goes
are doing this approximation which goes from big O N squar to O basically
from big O N squar to O basically turning the complexity to be linear.
turning the complexity to be linear. They found that this method can give you
They found that this method can give you good performance even with this
good performance even with this approximation. Now let's see how the
approximation. Now let's see how the attention layer looks like. First we
attention layer looks like. First we have our linear attention formula and
have our linear attention formula and here is our diagram. This time the
here is our diagram. This time the diagram is different than the previous
diagram is different than the previous ones. First we have this pi function
ones. First we have this pi function that we apply to the query and key
that we apply to the query and key matrices. This gives us fq and f k. Also
matrices. This gives us fq and f k. Also if you look at the formula we can turn
if you look at the formula we can turn it into this. So si is basically this
it into this. So si is basically this summation and z i is this summation. So
summation and z i is this summation. So we can also compute these terms. So we
we can also compute these terms. So we have already computed VQ. It's this one.
have already computed VQ. It's this one. Now we can compute SI. So SI is
Now we can compute SI. So SI is basically VK * the transpose of V. So we
basically VK * the transpose of V. So we are going to take FK. Here it is. And we
are going to take FK. Here it is. And we are going to take V. Here I didn't show
are going to take V. Here I didn't show V transpose but we should transpose it
V transpose but we should transpose it before multiplying it by FK. This gives
before multiplying it by FK. This gives us SI. And Z I is basically FQ. FK. So Z
us SI. And Z I is basically FQ. FK. So Z I is FK. Now we need to multiply SI with
I is FK. Now we need to multiply SI with the transpose of F V FQ. Here it is. So
the transpose of F V FQ. Here it is. So SI * VQ transpose. And here basically we
SI * VQ transpose. And here basically we need to multiply Z I with the transpose
need to multiply Z I with the transpose of VQ. And here it is. So Z I multiplied
of VQ. And here it is. So Z I multiplied by FQ transpose. And after that we
by FQ transpose. And after that we divide the numerator with the
divide the numerator with the denominator with this operator. And at
denominator with this operator. And at the end we get the output which gives us
the end we get the output which gives us the attention weights. You will see that
the attention weights. You will see that this diagram will help us understand how
this diagram will help us understand how to implement this in code. It will be a
to implement this in code. It will be a direct translation from these steps that
direct translation from these steps that you see here into code. Now let's
you see here into code. Now let's compare linear attention with the
compare linear attention with the previous methods. Here is the graph for
previous methods. Here is the graph for the training loss and the graph for the
the training loss and the graph for the validation loss. Because these curves
validation loss. Because these curves are close to each other. Let me zoom in
are close to each other. Let me zoom in a little bit so that we can see clearly.
a little bit so that we can see clearly. Linear attention is this pink curve and
Linear attention is this pink curve and overall it's comparable to multi head
overall it's comparable to multi head attention as you so you can see here in
attention as you so you can see here in validation both imi and linear layer
validation both imi and linear layer linear attention are very close to each
linear attention are very close to each other. So even though it is comparable,
other. So even though it is comparable, it's not the best method in our
it's not the best method in our benchmarking. But you can see that we we
benchmarking. But you can see that we we have a method that works very fast and
have a method that works very fast and gives us the same performance as multi
gives us the same performance as multi head attention. Let's see the inference
head attention. Let's see the inference speed to verify if this is correct. So
speed to verify if this is correct. So linear attention is this green bar or
linear attention is this green bar or this light green bar. You can see here,
this light green bar. You can see here, let's go to 2,00 because here where we
let's go to 2,00 because here where we see the big difference. You can see that
see the big difference. You can see that linear attention is super fast and
linear attention is super fast and that's because it doesn't use a lot of
that's because it doesn't use a lot of memory. I have an RTX 4070 and I have 8
memory. I have an RTX 4070 and I have 8 GB of VRAM and MHI was using almost 7 GB
GB of VRAM and MHI was using almost 7 GB while linear attention was using only
while linear attention was using only three. This is why this method is super
three. This is why this method is super fast. Now, let me open VS Code in order
fast. Now, let me open VS Code in order to show you how to implement this in
to show you how to implement this in code. Here is the script that I have
code. Here is the script that I have created. It's called model linear
created. It's called model linear attention. We have the linear attention
attention. We have the linear attention class. And as always, let's go directly
class. And as always, let's go directly to the forward method because here we
to the forward method because here we have the implementation. So first what
have the implementation. So first what do we do? We project the input text into
do we do? We project the input text into the three matrices. This is what we have
the three matrices. This is what we have we are doing in the first step. And here
we are doing in the first step. And here you can see that we have the pi
you can see that we have the pi function. Let me zoom in. Okay, that
function. Let me zoom in. Okay, that that's better. Pi is defined like this.
that's better. Pi is defined like this. It's u + 1. If you are wondering what is
It's u + 1. If you are wondering what is this function, basically it's an
this function, basically it's an activation function. We have sigmoid
activation function. We have sigmoid tanho
tanho etc. And u is one of them. So this was
etc. And u is one of them. So this was the the activation function that the
the the activation function that the researchers have used in the research
researchers have used in the research paper. So we are using that also. But if
paper. So we are using that also. But if you want you can change this and use
you want you can change this and use other activation function. Now let's
other activation function. Now let's continue. We have the projections. Now
continue. We have the projections. Now we need to reshape because again here we
we need to reshape because again here we are fusing the head into the linear or
are fusing the head into the linear or into the attention layer. So we need to
into the attention layer. So we need to have four dimensions instead of just
have four dimensions instead of just three. So we are introducing the number
three. So we are introducing the number of heads at uh dimension. As you can see
of heads at uh dimension. As you can see this is this should be the final output.
this is this should be the final output. Batch size by number of heads by
Batch size by number of heads by sequence length by head size. We have
sequence length by head size. We have the projection. We have reshaped them.
the projection. We have reshaped them. Now we need to apply the fi function to
Now we need to apply the fi function to the query and the keys and this is what
the query and the keys and this is what we are doing here. So we take the query
we are doing here. So we take the query we take the key and we apply the feature
we take the key and we apply the feature map. This gives us 5q and 5k. Okay. So
map. This gives us 5q and 5k. Okay. So now we need to compute si and z i. So
now we need to compute si and z i. So here here is s cumulative. S
here here is s cumulative. S commumulative is basically SI and if you
commumulative is basically SI and if you SI is what is the multiplication of P K
SI is what is the multiplication of P K with V. So here is VK and here is V and
with V. So here is VK and here is V and here we are introducing new dimensions
here we are introducing new dimensions so that we can multiply these two
so that we can multiply these two matrices. So here at the end it should
matrices. So here at the end it should give us so here head size by one one by
give us so here head size by one one by head size. So at the end it should be B
head size. So at the end it should be B by number of head number of heads by T
by number of head number of heads by T by head size by head size and Z I is
by head size by head size and Z I is just FQ FK. So there is no
just FQ FK. So there is no multiplication needed here. Okay. Now we
multiplication needed here. Okay. Now we need to compute the numerator. So the
need to compute the numerator. So the numerator is basically SI * FQ
numerator is basically SI * FQ transpose. So here is FQ and here is SI.
transpose. So here is FQ and here is SI. That gives us the numerator. And here I
That gives us the numerator. And here I try to also show you the formulas so
try to also show you the formulas so that you don't get lost. Now the
that you don't get lost. Now the denominator is Z I * PQ transposed. So
denominator is Z I * PQ transposed. So here is Z and here is FQ. And we also
here is Z and here is FQ. And we also add an epsilon. Epsilon is a small
add an epsilon. Epsilon is a small value. I think here it is set to let's
value. I think here it is set to let's see it's set to 10 to the minus 6. We
see it's set to 10 to the minus 6. We are adding the epsilon because this is
are adding the epsilon because this is the denominator. Let's say that this
the denominator. Let's say that this term was equal to zero. We shouldn't
term was equal to zero. We shouldn't divide by zero because that will give us
divide by zero because that will give us infinity. So we add that small value to
infinity. So we add that small value to prevent that from happening. And at the
prevent that from happening. And at the end we get the attention weights which
end we get the attention weights which is the numerator which we get from here.
is the numerator which we get from here. We divide it by the denominator. And
We divide it by the denominator. And finally because we have four dimensions
finally because we have four dimensions we need to fuse those into just three so
we need to fuse those into just three so that we get batch size by sequence
that we get batch size by sequence length by the number of heads times the
length by the number of heads times the head size and finally we apply the
head size and finally we apply the projection and dropout to the output so
projection and dropout to the output so that we get the dimension that we are
that we get the dimension that we are looking for. Okay. So this is how you
looking for. Okay. So this is how you implement linear attention and if you
implement linear attention and if you are wondering I also have created a
are wondering I also have created a notebook to run it. It's 924 improving
notebook to run it. It's 924 improving transformer linear attention. Now let's
transformer linear attention. Now let's go to the slides to learn about the next
go to the slides to learn about the next method. Now let's talk about a paper
method. Now let's talk about a paper called big bird which introduced sparse
called big bird which introduced sparse attention. Click on the image if you
attention. Click on the image if you want to read more about the paper. Big
want to read more about the paper. Big Bird is designed to process large
Bird is designed to process large sequences of text without sacrificing
sequences of text without sacrificing the performance. Big Bird uses sparse
the performance. Big Bird uses sparse attention which is designed to reduce
attention which is designed to reduce the computational and memory complexity
the computational and memory complexity to be linear. We call this big O of N.
to be linear. We call this big O of N. Sparse attention is the sum of global
Sparse attention is the sum of global attention which looks like this. Random
attention which looks like this. Random attention and the sliding window
attention and the sliding window attention which we also call local
attention which we also call local attention. This gives us big bird which
attention. This gives us big bird which is sparse attention. We have seen that
is sparse attention. We have seen that when we used local attention we were
when we used local attention we were capturing only local dependencies but
capturing only local dependencies but big birds because it uses this mix of
big birds because it uses this mix of attentions let's call it like that it
attentions let's call it like that it captures both global and local
captures both global and local dependencies. Let's zoom into one head
dependencies. Let's zoom into one head to understand how to implement sparse
to understand how to implement sparse attention. You can see that the diagram
attention. You can see that the diagram is big but it's simple. Let's start by
is big but it's simple. Let's start by computing the weight matrix as we did
computing the weight matrix as we did before. First we take the input, we
before. First we take the input, we project it into the three matrices and
project it into the three matrices and then we compute what we have inside the
then we compute what we have inside the softmax function that will give us the
softmax function that will give us the row weights. Uh and now this is the
row weights. Uh and now this is the change. So we need to create a mask in
change. So we need to create a mask in order to multiply it with the waist
order to multiply it with the waist matrix. And this is what we do. Here we
matrix. And this is what we do. Here we have the local attention mask. You can
have the local attention mask. You can see that here we have zeros in the upper
see that here we have zeros in the upper triangle and the lower triangle. We have
triangle and the lower triangle. We have global attention which looks like this.
global attention which looks like this. And finally the random attention mask.
And finally the random attention mask. Here this symbol means or. So we are
Here this symbol means or. So we are going to take these masks. You can see
going to take these masks. You can see that the values that are colored in blue
that the values that are colored in blue or the cells that are colored in blue
or the cells that are colored in blue contain the number one or true and
contain the number one or true and outside that we have false. When we use
outside that we have false. When we use the or operation, we are going to fuse
the or operation, we are going to fuse these three masks into one mask. And as
these three masks into one mask. And as you can see, this gives us what we have
you can see, this gives us what we have seen in the previous slide, which is the
seen in the previous slide, which is the sparse attention. And you can see that
sparse attention. And you can see that here we have a mix of global, local, and
here we have a mix of global, local, and random masks. And here we also create
random masks. And here we also create the causal mask because we want to
the causal mask because we want to remove the future tokens from the the
remove the future tokens from the the mask. And here we use the and operation.
mask. And here we use the and operation. So and it's simple if you have true and
So and it's simple if you have true and two that gives you true. If you have
two that gives you true. If you have true and false that gives you false by
true and false that gives you false by using this operator we get this final
using this operator we get this final mask. And now we take the final mask we
mask. And now we take the final mask we multiply it with the weights matrix that
multiply it with the weights matrix that gives us the masked sensor that we can
gives us the masked sensor that we can then multiply with V in order to get the
then multiply with V in order to get the attention weights. Now let's compare
attention weights. Now let's compare sparse attention to the previous
sparse attention to the previous methods. Here are the two graphs again
methods. Here are the two graphs again because the curves are close to each
because the curves are close to each other. I'm going to zoom a little bit.
other. I'm going to zoom a little bit. Big bird is this light blue color. And
Big bird is this light blue color. And as you can see, it performed good. It
as you can see, it performed good. It the performance of sparse attention is
the performance of sparse attention is good. So it's close to local attention
good. So it's close to local attention in both training and validation. Now
in both training and validation. Now let's see the inference speed. But the
let's see the inference speed. But the problem is that big bird is slow because
problem is that big bird is slow because here we are combining multiple attention
here we are combining multiple attention mechanisms and that slows down this
mechanisms and that slows down this approach. I mean we can play with the
approach. I mean we can play with the hyperparameters in order to for example
hyperparameters in order to for example we you can play with how many random
we you can play with how many random cells you want to add um the size of the
cells you want to add um the size of the window but overall when you add all
window but overall when you add all those attention mechanisms you will get
those attention mechanisms you will get a slower solution. And because here if I
a slower solution. And because here if I go back here we we do a lot of
go back here we we do a lot of multiplications. So we have a lot of
multiplications. So we have a lot of matrices that we need to compute before
matrices that we need to compute before getting the final attention weights that
getting the final attention weights that also makes the approach slower. It
also makes the approach slower. It doesn't use a lot of memory which is
doesn't use a lot of memory which is good but I maybe the implementation
good but I maybe the implementation needs to be optimized. Now let me go
needs to be optimized. Now let me go back to VS code in order to show you how
back to VS code in order to show you how to implement this method. I have created
to implement this method. I have created this script which is called model big
this script which is called model big bird and we have the head class. So
bird and we have the head class. So let's go directly to the forward pass
let's go directly to the forward pass and let's see what we have. So first of
and let's see what we have. So first of all we compute the query and key
all we compute the query and key matrices. Then we multiply them together
matrices. Then we multiply them together in order to get the weights matrix. And
in order to get the weights matrix. And now we stop. So we go down. Let me zoom
now we stop. So we go down. Let me zoom in. And things might seem familiar to
in. And things might seem familiar to you because we have seen this before. In
you because we have seen this before. In order to create this mask, we create two
order to create this mask, we create two two vectors or two tensors, row indices
two vectors or two tensors, row indices and column indices. We use this
and column indices. We use this operation in order to get the cosal mask
operation in order to get the cosal mask which is this one. After that we are
which is this one. After that we are going to compute the local window mask.
going to compute the local window mask. We get it after performing these
We get it after performing these operations. And this is exactly what we
operations. And this is exactly what we have done in the part where I have
have done in the part where I have talked about local attention. After that
talked about local attention. After that we have global mask and here we need to
we have global mask and here we need to specify the number of global tokens
specify the number of global tokens because as you can see here for example
because as you can see here for example we chose just one token but we could we
we chose just one token but we could we could choose multiple tokens and after
could choose multiple tokens and after that we have random attention that will
that we have random attention that will give us this mask and then so we get
give us this mask and then so we get here random columns. Now we combine the
here random columns. Now we combine the local global and random. So here I
local global and random. So here I forgot to add and random components.
forgot to add and random components. Okay. So you can see I am using the or
Okay. So you can see I am using the or operator in order to combine those. And
operator in order to combine those. And finally when when those are combined I
finally when when those are combined I am using the and operator to multiply
am using the and operator to multiply the causal mask with the final mask that
the causal mask with the final mask that that gives me this which is this value
that gives me this which is this value this variable final mask. Now I apply
this variable final mask. Now I apply that to the weights matrix and then I
that to the weights matrix and then I multiply that with V in order to get the
multiply that with V in order to get the output. I went very quickly as I as I
output. I went very quickly as I as I showed you before. You should not get
showed you before. You should not get scared if you see a lot of code. If you
scared if you see a lot of code. If you do not understand always always open the
do not understand always always open the terminal. Let me create let me activate
terminal. Let me create let me activate the environment. Open a new Python
the environment. Open a new Python session and start playing with this. You
session and start playing with this. You should not get scared if you see a lot
should not get scared if you see a lot of code for example. Okay. So I am here
of code for example. Okay. So I am here uh let's create or let's import torch at
uh let's create or let's import torch at the beginning. Now let's create create
the beginning. Now let's create create small examples. There is no need to have
small examples. There is no need to have big matrices. So small examples always
big matrices. So small examples always help you understand the the concept. So
help you understand the the concept. So for example here I am creating a random
for example here I am creating a random input tensor. Here the batch size is set
input tensor. Here the batch size is set to one. The sequence length is six and
to one. The sequence length is six and the embedding dimension is 16. And now
the embedding dimension is 16. And now you can come here for example take this
you can come here for example take this paste it here. I'll look at B. Okay,
paste it here. I'll look at B. Okay, that's one etc. And you can verify. So
that's one etc. And you can verify. So after that I can come here take this. So
after that I can come here take this. So I need number of embedding. Let's set it
I need number of embedding. Let's set it to 16. Let's set the head size to be
to 16. Let's set the head size to be eight so that I get two two heads. And
eight so that I get two two heads. And now I need to import n. Let's get the
now I need to import n. Let's get the key. Okay, now I can go back and use
key. Okay, now I can go back and use this. So now I can create the key matrix
this. So now I can create the key matrix or the key tensor. Give it x and that
or the key tensor. Give it x and that gives me and I can that gives me the k
gives me and I can that gives me the k matrix and I can look at the shape.
matrix and I can look at the shape. Okay, so that makes sense. So K and QR
Okay, so that makes sense. So K and QR of shape B * T * head size and this is
of shape B * T * head size and this is what what I get. So this is the batch
what what I get. So this is the batch size. This is the sequence length and
size. This is the sequence length and this is the head size and this is
this is the head size and this is exactly what I have just set here. So
exactly what I have just set here. So head size is equal to eight. You can
head size is equal to eight. You can verify these just like that. It's hard
verify these just like that. It's hard to visualize these things in your head
to visualize these things in your head because there is there is a lot of code
because there is there is a lot of code and especially if you work with large
and especially if you work with large matrices that becomes very very
matrices that becomes very very unintuitive. But choosing small examples
unintuitive. But choosing small examples will help you understand everything.
will help you understand everything. Okay. So now you can you can do the same
Okay. So now you can you can do the same thing. Let's let me continue uh just a
thing. Let's let me continue uh just a little bit. So again I can go back since
little bit. So again I can go back since key and query are the same I can remove
key and query are the same I can remove this replace that with key with query I
this replace that with key with query I can get the query like this and again I
can get the query like this and again I can look at the shape and that gives me
can look at the shape and that gives me the same the same result. Now I can go
the same the same result. Now I can go down and compute the weights. So let's
down and compute the weights. So let's take this paste it here. Let's look at
take this paste it here. Let's look at weights. Let's get the shape and this
weights. Let's get the shape and this gives me 1x 6x6 which is exactly what I
gives me 1x 6x6 which is exactly what I have mentioned in the comments. So batch
have mentioned in the comments. So batch size by t byt. Um now I can for example
size by t byt. Um now I can for example take this. I just want to show you the
take this. I just want to show you the masks really quickly. Okay. So I uh I
masks really quickly. Okay. So I uh I don't need to have a device but let me
don't need to have a device but let me just set it to be CPU. Now let's go
just set it to be CPU. Now let's go back. Let's do the same for columns.
back. Let's do the same for columns. Okay. And now let's get the causal the
Okay. And now let's get the causal the causal mask. I know we have seen these
causal mask. I know we have seen these things but I just wanted to show you how
things but I just wanted to show you how to visualize everything. So you can see
to visualize everything. So you can see that the upper triangle is set to zeros
that the upper triangle is set to zeros which is what we want. This is exactly
which is what we want. This is exactly the definition of a coal mask. And again
the definition of a coal mask. And again if I come here let me set the window
if I come here let me set the window size to be two. And let's take these
size to be two. And let's take these conditions. Okay, here is the first one.
conditions. Okay, here is the first one. Here is the second one. And let's get
Here is the second one. And let's get the local attention mask. Now let's look
the local attention mask. Now let's look at it. Attention mask. And voila. You,
at it. Attention mask. And voila. You, as you can see, the upper triangle and
as you can see, the upper triangle and the lower triangles of this tensor are
the lower triangles of this tensor are set to zero. But here we have two values
set to zero. But here we have two values in each row that are set to true. So
in each row that are set to true. So this is the local mask. And here we also
this is the local mask. And here we also if you want to create the global mask as
if you want to create the global mask as I as I told you we need to set a number
I as I told you we need to set a number of global tokens let's set it to two
of global tokens let's set it to two because I don't the matrix is small so I
because I don't the matrix is small so I can take this put it here I can do this
can take this put it here I can do this for both query and key and let's look at
for both query and key and let's look at global attention mask you might find
global attention mask you might find this strange because we don't have a
this strange because we don't have a square um a square matrix but because
square um a square matrix but because here Python will Always always remember
here Python will Always always remember that Python uses broadcasting. So here
that Python uses broadcasting. So here you have 6x1 after that it will become
you have 6x1 after that it will become 6x6 and everything will work and also we
6x6 and everything will work and also we can do the same thing here for random
can do the same thing here for random attention mask. So I have the value now.
attention mask. So I have the value now. Okay. So here we have just created a
Okay. So here we have just created a matrix of zeros. But after that we are
matrix of zeros. But after that we are going to again specify the number of
going to again specify the number of random tokens that we want to have.
random tokens that we want to have. Let's set it to four. Okay. So this
Let's set it to four. Okay. So this gives me random columns. So this just
gives me random columns. So this just will tell me where should I put those.
will tell me where should I put those. Here I have the row selector. Sorry,
Here I have the row selector. Sorry, I'll just go through this very quickly
I'll just go through this very quickly just to show you the final output. And
just to show you the final output. And as you can see here, for example, we
as you can see here, for example, we have true. We have true in some other
have true. We have true in some other places. But as you can see, it's it's
places. But as you can see, it's it's random. And now I can combine the two or
random. And now I can combine the two or combine the three. So I can take this
combine the three. So I can take this and put it here. Combined mask. Just a
and put it here. Combined mask. Just a small trick. If you find it very
small trick. If you find it very difficult to see the trus and falses,
difficult to see the trus and falses, you can use the int method to convert
you can use the int method to convert that into integer. So as you can see, so
that into integer. So as you can see, so this is the combined mask. So we have
this is the combined mask. So we have the global mask and we have some random
the global mask and we have some random values also. Let's do the same thing for
values also. Let's do the same thing for random just to be able to see that. And
random just to be able to see that. And as you can see, so everything is random.
as you can see, so everything is random. What you should take from this is that
What you should take from this is that never get discouraged. You have the code
never get discouraged. You have the code in front of you. It's easy to use small
in front of you. It's easy to use small examples in order to understand how
examples in order to understand how things work and how are yeah how the the
things work and how are yeah how the the matrices looks like. And this is exactly
matrices looks like. And this is exactly why I I add these
why I I add these graphics just to show you how things are
graphics just to show you how things are implemented because you can easily find
implemented because you can easily find these steps in the code. So you can you
these steps in the code. So you can you can see that here we the or operation
can see that here we the or operation for example it's it's here. So you can
for example it's it's here. So you can easily find the assoc the step in the
easily find the assoc the step in the code and everything should be familiar
code and everything should be familiar to you and this will help you visualize
to you and this will help you visualize the attention layer very easily. Okay.
the attention layer very easily. Okay. So after getting the output as I said
So after getting the output as I said because this is we are using the head we
because this is we are using the head we need to go to the attention class and
need to go to the attention class and here we need to concatenate the output
here we need to concatenate the output of each individual head and again I have
of each individual head and again I have created let's see so where is this 925
created let's see so where is this 925 improving transformer big bird attention
improving transformer big bird attention this notebook I have used this in order
this notebook I have used this in order to import the GPT language model class
to import the GPT language model class from this script and in order to run the
from this script and in order to run the experiments we are near the end of this
experiments we are near the end of this video. We have one last attention method
video. We have one last attention method to look at which is multi head latent
to look at which is multi head latent attention. So let's go back to the
attention. So let's go back to the slides in order to understand how that
slides in order to understand how that one works. This is the final attention
one works. This is the final attention that we are going to look at. It is
that we are going to look at. It is called multi head latent attention and
called multi head latent attention and it was introduced in the deepseek v2
it was introduced in the deepseek v2 paper. I highly recommend reading this
paper. I highly recommend reading this paper because they have explained how
paper because they have explained how they created this method or this
they created this method or this attention mechanism in detail. They have
attention mechanism in detail. They have showed all the mathematical equations
showed all the mathematical equations and even they have a graph that explains
and even they have a graph that explains how it was how it works in under the
how it was how it works in under the hood. Click on the image if you want to
hood. Click on the image if you want to read more about it. MLA is an efficient
read more about it. MLA is an efficient method that compresses the KV cache. We
method that compresses the KV cache. We have talked about KV caching before. It
have talked about KV caching before. It means that you are going to store the
means that you are going to store the key and value matrices and this becomes
key and value matrices and this becomes a bottleneck especially during inference
a bottleneck especially during inference because if you are generating large
because if you are generating large sequences of text you will need a lot of
sequences of text you will need a lot of memory in order to store those matrices.
memory in order to store those matrices. So Deepseek with this method they have
So Deepseek with this method they have showed that they can reduce this KV
showed that they can reduce this KV cache by compressing it and the great
cache by compressing it and the great thing about this is that it does not
thing about this is that it does not sacrifice the performance and it
sacrifice the performance and it guarantees faster inference. We have
guarantees faster inference. We have seen that methods like multi-query
seen that methods like multi-query attention or grouped query attention
attention or grouped query attention also tries to lower the KV cache but the
also tries to lower the KV cache but the problem is that the performance might
problem is that the performance might degrade. But here with MLA it guarantees
degrade. But here with MLA it guarantees both things. MLA incorporates latent
both things. MLA incorporates latent representations into the attention
representations into the attention mechanism. We are going to see this in
mechanism. We are going to see this in the implementation. Instead of directly
the implementation. Instead of directly projecting the input X into the tree
projecting the input X into the tree matrices, we are going to project that
matrices, we are going to project that into a latent representation. And from
into a latent representation. And from that latent representation or we can
that latent representation or we can also call it latent embeddings we are
also call it latent embeddings we are going to generate our key query and
going to generate our key query and value matrices. MLA is very fast
value matrices. MLA is very fast compared to MHA. We are going to look at
compared to MHA. We are going to look at this in the inference speed test. Now
this in the inference speed test. Now let's see how this method is
let's see how this method is implemented. I want you to focus on
implemented. I want you to focus on these two matrices. So here as I told
these two matrices. So here as I told you instead of projecting X into the
you instead of projecting X into the three matrices key, query and value we
three matrices key, query and value we are going to have an intermediary step.
are going to have an intermediary step. This is called compressed query and this
This is called compressed query and this one is compressed key and values. From
one is compressed key and values. From these latent representations, we are
these latent representations, we are going to get the query key and value.
going to get the query key and value. And after that the rest will say the
And after that the rest will say the same. But this is the thing that was
same. But this is the thing that was added and this this is shared by the by
added and this this is shared by the by every head and this helps reduce the KV
every head and this helps reduce the KV caching because you only need to cache
caching because you only need to cache these two matrices and later you can
these two matrices and later you can generate these the the rest because from
generate these the the rest because from Q we from CQ we get Q and from CKV we
Q we from CQ we get Q and from CKV we get K and V. Now let's compare the
get K and V. Now let's compare the results. Again we have the graphs for
results. Again we have the graphs for train and validation and here multi-
train and validation and here multi- query sorry multi- head latent attention
query sorry multi- head latent attention is this curve. You can see that here in
is this curve. You can see that here in validation it is very close to local
validation it is very close to local attention. Again, take these results
attention. Again, take these results with a grain of salt because we might
with a grain of salt because we might get different results if we decide to
get different results if we decide to change the hyperparameters or how many
change the hyperparameters or how many batches we include in the in the
batches we include in the in the testing. And as you can see, we get per
testing. And as you can see, we get per we get results better than MHA even
we get results better than MHA even though we reduced the KV caching. And
though we reduced the KV caching. And now let's see the inference speed
now let's see the inference speed because this is very interesting. So MLA
because this is very interesting. So MLA is colored in black and as you can see
is colored in black and as you can see so let's go to 2,00 tokens because this
so let's go to 2,00 tokens because this is where the other methods struggled and
is where the other methods struggled and as you can see MLA as I told you is
as you can see MLA as I told you is really really fast. So we can generate a
really really fast. So we can generate a lot of tokens without sacrificing the
lot of tokens without sacrificing the performance. And here the one that was
performance. And here the one that was very close to it. It's linear attention
very close to it. It's linear attention because yeah it linear attention is also
because yeah it linear attention is also good because as we have seen it reduced
good because as we have seen it reduced the complexity to O instead of O squar
the complexity to O instead of O squar but MLA is the best. Now let's go to VS
but MLA is the best. Now let's go to VS code in order to see how to implement
code in order to see how to implement this. I have created this script model
this. I have created this script model multi- head latent attention and
multi- head latent attention and everything is implemented in this
everything is implemented in this deepseek MLA attention as I told you I
deepseek MLA attention as I told you I have been inspired by the research paper
have been inspired by the research paper that they have published there they have
that they have published there they have shared everything so that was a great
shared everything so that was a great resource and let me zoom in here because
resource and let me zoom in here because we will need this diagram okay let's get
we will need this diagram okay let's get started and here also I want to mention
started and here also I want to mention that these terms that we see here are
that these terms that we see here are basically the terms that they used in
basically the terms that they used in the research paper. Let me open that
the research paper. Let me open that research paper so that you can see these
research paper so that you can see these things with your eyes. Here is the
things with your eyes. Here is the research paper and here you can read uh
research paper and here you can read uh the abstract and the introduction and
the abstract and the introduction and also the architecture chapters because
also the architecture chapters because here they explained multi head latest
here they explained multi head latest attention and they have a good diagram
attention and they have a good diagram that helps you understand how it's
that helps you understand how it's implemented but I want to go to the
implemented but I want to go to the appendex because they have gathered all
appendex because they have gathered all the formulas here okay so now let me
the formulas here okay so now let me zoom in a little bit and as you can see
zoom in a little bit and as you can see here are the terms that I was talking
here are the terms that I was talking out. So these matrices that you see here
out. So these matrices that you see here for example W DQ, WUQ
for example W DQ, WUQ etc. are here. So you can see here is
etc. are here. So you can see here is the WDQ, WQ etc. And the terms such as
the WDQ, WQ etc. And the terms such as C, CQ and where is the other one? So you
C, CQ and where is the other one? So you can see CQ and CQV here is CQV. These
can see CQ and CQV here is CQV. These are the terms that I have me that I have
are the terms that I have me that I have showed in the diagram. Okay. So let's
showed in the diagram. Okay. So let's continue. I just wanted to show you this
continue. I just wanted to show you this because this is a little bit different
because this is a little bit different from the other scripts that I have
from the other scripts that I have created. I have I try to make sure to
created. I have I try to make sure to use descriptive names but since I here I
use descriptive names but since I here I was inspired by the research paper I try
was inspired by the research paper I try to stick to it as much as possible. Now
to stick to it as much as possible. Now let's go to the forward method. Again we
let's go to the forward method. Again we extract the batch size, sequence length
extract the batch size, sequence length and embedding from the inputs. Now let's
and embedding from the inputs. Now let's start. So here we need to get CQ. So I
start. So here we need to get CQ. So I here I called it compressed Q lacant and
here I called it compressed Q lacant and we get that from the WDQ and WDQ means
we get that from the WDQ and WDQ means we are down projecting and after that we
we are down projecting and after that we apply layer norm. This is what they have
apply layer norm. This is what they have applied in the research paper and after
applied in the research paper and after that after getting CQ we are going to
that after getting CQ we are going to project that back to UQ or sorry we are
project that back to UQ or sorry we are going to project that to to get the Q
going to project that to to get the Q matrix. And here if I go back to WQ, you
matrix. And here if I go back to WQ, you will see something interesting. We go
will see something interesting. We go from number of embedding to this
from number of embedding to this compressed dimension and WQ which is
compressed dimension and WQ which is here I called a projection will get back
here I called a projection will get back from Q compression dimension back to the
from Q compression dimension back to the number of embedding. So basically here
number of embedding. So basically here we are making the matrix small and after
we are making the matrix small and after that we go back to the big matrix and
that we go back to the big matrix and this is why this method is fast because
this is why this method is fast because here we have a small matrix that
here we have a small matrix that compresses the knowledge into a small
compresses the knowledge into a small space. Okay. So these are the two
space. Okay. So these are the two matrices that we have used in the first
matrices that we have used in the first place or yeah in the first step and yeah
place or yeah in the first step and yeah after using WQ we get the Q matrix here
after using WQ we get the Q matrix here I called it Qf final and we do the same
I called it Qf final and we do the same for CV. So we use this matrix d means
for CV. So we use this matrix d means down and U means up. So we are going
down and U means up. So we are going again if I inspect WD KV you will see
again if I inspect WD KV you will see that we go from number of embedding to
that we go from number of embedding to another small space. Okay, so that gives
another small space. Okay, so that gives us CKV which is this and from CKV we
us CKV which is this and from CKV we need to get the key and value matrices.
need to get the key and value matrices. This is exactly what we are doing here.
This is exactly what we are doing here. After applying layer normalization, we
After applying layer normalization, we use W key and WUV in order to get those
use W key and WUV in order to get those two matrices. And if you have noticed,
two matrices. And if you have noticed, we don't have the head class. So we need
we don't have the head class. So we need to introduce the number of heads by
to introduce the number of heads by mention. We got Qfinal. We try to
mention. We got Qfinal. We try to extract the the number of heads from the
extract the the number of heads from the C channel or from the C dimension. I
C channel or from the C dimension. I think we you should be familiar with
think we you should be familiar with this. We have done this multiple times.
this. We have done this multiple times. I want to show you something new that I
I want to show you something new that I haven't used in the previous scripts. Is
haven't used in the previous scripts. Is this function scaled.prouct attention.
this function scaled.prouct attention. PyTorch provides this method that
PyTorch provides this method that calculates the attention scores. So I
calculates the attention scores. So I just wanted to show you this because
just wanted to show you this because previously we were doing this manually.
previously we were doing this manually. You can see that it handles soft max
You can see that it handles soft max scaling and causal masking. So there is
scaling and causal masking. So there is no need to do that manually. You all you
no need to do that manually. You all you need to give it the query key and value
need to give it the query key and value matrices. If you want to give it a
matrices. If you want to give it a custom mask, you can do that. But
custom mask, you can do that. But because here we are setting is coal to
because here we are setting is coal to true that will be handled by PyTorch.
true that will be handled by PyTorch. But if you want to provide another mask
But if you want to provide another mask just set it here. And here we are
just set it here. And here we are providing the dropout probability. After
providing the dropout probability. After that we need to concatenate these two
that we need to concatenate these two dimensions in order to get just one. So
dimensions in order to get just one. So we are going to get the C channel again
we are going to get the C channel again or the C dimension. Here we have this WO
or the C dimension. Here we have this WO which basically will give us the output.
which basically will give us the output. So we are going to project the attention
So we are going to project the attention weights into another matrix that we call
weights into another matrix that we call output which is basically this one. And
output which is basically this one. And we are going to return it. So let me
we are going to return it. So let me close this file because I don't need it.
close this file because I don't need it. And I have also made sure to create a
And I have also made sure to create a notebook. So it's called 926 improving
notebook. So it's called 926 improving transformer multi head latent attention.
transformer multi head latent attention. So here I have imported the GPT language
So here I have imported the GPT language model class from this script and
model class from this script and everything stayed the same. We have
everything stayed the same. We have reached the end of this video. I really
reached the end of this video. I really hope that you have enjoyed it. Before I
hope that you have enjoyed it. Before I finish this video, I want to show you
finish this video, I want to show you basically what we have. The baseline was
basically what we have. The baseline was standard multi head attention. We have
standard multi head attention. We have used many attention mechanisms. We have
used many attention mechanisms. We have compared them to the standard multi head
compared them to the standard multi head attention and we have seen that all of
attention and we have seen that all of them performed really well compared to
them performed really well compared to standard multi head attention. Here you
standard multi head attention. Here you can see that I have picked multi head
can see that I have picked multi head latent attention and grouped query
latent attention and grouped query attention for two reasons. We have seen
attention for two reasons. We have seen that multi head latent attention was
that multi head latent attention was super fast compared to the other methods
super fast compared to the other methods and it used less memory overall and
and it used less memory overall and grouped query attention. You can see
grouped query attention. You can see that here the gap is very huge and this
that here the gap is very huge and this is on the validation set but it was
is on the validation set but it was slower compared to multi head attention.
slower compared to multi head attention. And here I think that for some reason
And here I think that for some reason groups query attention worked really
groups query attention worked really well because I have maybe used a small
well because I have maybe used a small number of batches in the evaluation. As
number of batches in the evaluation. As I said, I took a,000 batches from the
I said, I took a,000 batches from the training set and the validation set and
training set and the validation set and as an evaluation set. Maybe if I
as an evaluation set. Maybe if I increase that to let's say 5,000 or
increase that to let's say 5,000 or more, maybe this graph would have
more, maybe this graph would have changed. But for some reason, maybe
changed. But for some reason, maybe group query attention got lucky and we
group query attention got lucky and we got this huge difference. But I think I
got this huge difference. But I think I will stick with multi head latent
will stick with multi head latent attention because it's fast and it
attention because it's fast and it performed well. That's it for this
performed well. That's it for this video. See you in the next one. Hi
Hi everyone. In this video we are getting close to finishing the course. So far we have done a great job learning about the different attention mechanisms and the different ways to encode positions, which were the big pieces of the transformer architecture. That is why I am calling this section small refinements: we are going to run small experiments that test different normalization methods, different activation functions, whether dropout is necessary, and so on. Let's start.

First, we are going to try different activation functions in the feed-forward network. Here is a graph that shows different activation functions such as ReLU, GELU, sigmoid, and so on. In this video I will pick GELU and SwiGLU. If you are wondering why I chose these two activation functions, it is simply because training large language models is expensive and I don't want to try every activation function that exists; the list is very long. After that, we are going to play a little bit with normalization methods. So far we have been using layer norm, but we also have RMS norm and batch norm, although the latter is not used in LLMs. Then we are going to compare placing the normalization layer before and after the attention and feed-forward layers. We commonly refer to these as pre-layer norm, where, as you can see here, the normalization comes before the attention layer, and post-layer norm, where the normalization comes after it. We can also do the same for the feed-forward network, and we will see that in the coding section. Finally, we are going to ask one question: should we use dropout? In the previous code, meaning the baseline, we used dropout heavily, but this time we are going to remove it and see whether that improves performance or not. This is the plan we are going to follow. Now let's go through it.
In previous years, the most common practice in deep learning was to use ReLU for every problem. This is what ReLU looks like, in case you are wondering, but nowadays every LLM uses a different activation function. Why isn't ReLU used much anymore? The problem is that negative values are clamped to zero, which limits the representational capacity of the activation function. Researchers created many activation functions that address this issue. Leaky ReLU, for example, is one of them, and here is what it looks like. You can see that Leaky ReLU allows negative values to pass through to the next layer. The problem is that those negative values can grow very large, and we don't want that either. So are we done? Is this problem unsolvable? No, don't worry. We have another activation function called SELU, and it is not the only one that addresses this issue. You can see that instead of letting every negative value propagate, we control the interval: values inside this interval pass through, and beyond it the values are clamped. SwiGLU is the activation function used by the Llama model, and it is composed of two parts: Swish and GLU, the gated linear unit. Both of these are activation functions, and by combining them we get SwiGLU. If you are interested, here are the formulas for these activation functions. Swish has been shown to outperform ReLU in many applications, and GLU allows the network to focus on important features by either passing or blocking information.
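For reference, these are the formulations from the "GLU Variants Improve Transformer" paper (Shazeer, 2020), which is where SwiGLU comes from; the slide may use slightly different notation. W, V, b, and c are learnable parameters, and sigma is the sigmoid.

```latex
% Swish (also called SiLU when beta = 1), GLU, and their combination SwiGLU.
\mathrm{Swish}_{\beta}(x) = x \cdot \sigma(\beta x)
\mathrm{GLU}(x)           = \sigma(xW + b) \otimes (xV + c)
\mathrm{SwiGLU}(x)        = \mathrm{Swish}_{1}(xW + b) \otimes (xV + c)
```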
We have talked a little bit about the activation functions we are going to use in this video, so now let's look at the benchmark. We will consider ReLU as the baseline, because I used it in the multi-head attention video, and then evaluate GELU and SwiGLU against it. Here are the results. As you can see, SwiGLU is the winner: it converged quickly and achieved the lowest loss value. Now let's go to VS Code to see what I have changed. The model script is our baseline, and if I search for ReLU, you will see that I have used it in the feed-forward class. This is the only place where we use an activation function, and it sits between two linear layers. Here is the second script that I have created: in it I use MLA as the attention method, and I have replaced ReLU with GELU. As for SwiGLU, let me search for it. Yes, it's this one, let's go down. This one is a little harder to implement, but I took inspiration from the code provided by Meta. Here it is; I have tried to select the parts we are concerned with, and you can see that they also use it inside their feed-forward class. So let me go back to VS Code. Here it is. Remember that I mentioned that SwiGLU is the combination of Swish and GLU, the gated linear unit. You can see that we have this gate linear layer, plus two more linear layers that compress and decompress the inputs. I won't go into detail, because as I said I essentially took this implementation from the existing Llama code.
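If you want to see the idea in code, here is a minimal SwiGLU feed-forward block in the spirit of the Llama implementation. The class and argument names (SwiGLUFeedForward, dim, hidden_dim) are my own illustrative choices, not the exact names used in Meta's code or in the course scripts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block using SwiGLU: Swish(x W1) * (x W3), then W2."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # project back down to dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1; the element-wise product acts as the gate.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Quick shape check on random data.
ffn = SwiGLUFeedForward(dim=256, hidden_dim=1024)
print(ffn(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

The gate path (w1 followed by SiLU) decides how much of the value path (w3) gets through, which is exactly the passing-or-blocking behaviour described above.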
What interests us, though, is how to compare different activation functions and where to change them. Most activation functions are already implemented in PyTorch; the newer or more exotic ones that researchers come up with may not be there yet, but always check first whether they already exist inside the nn module. For example, if I search here you can see several flavors of ReLU, six of them, and I have talked about SELU and sigmoid; there are lots of activation functions available. So if you are wondering how to change the activation function, this is where to do it.
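As a tiny illustration, swapping the activation usually comes down to changing one line in the feed-forward block. The class below is a generic sketch with made-up dimensions, not the exact FeedForward class from the course scripts.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Two linear layers with a configurable activation in between."""

    def __init__(self, dim: int, hidden_dim: int, activation: nn.Module = nn.ReLU()):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            activation,            # swap nn.ReLU() for nn.GELU(), nn.SiLU(), ...
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

ffn_relu = FeedForward(256, 1024)                        # baseline
ffn_gelu = FeedForward(256, 1024, activation=nn.GELU())  # one-line change
```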
Now let's go back to the slides to talk about normalization methods. Normalization helps the network train quickly and stabilizes the training process, and it helps mitigate the problems of vanishing and exploding gradients. Let's focus on vanishing gradients. Here is a simple diagram of a neural network: we have the input and output layers, and in the middle we have the hidden layers. The arrows indicate a forward pass: we take the input, feed it to the hidden layers, and get the output. Vanishing gradients means that during backpropagation the gradients start big and shrink from layer to layer. The effect becomes noticeable when you train networks with a large number of layers. Say you have a network with 100 layers: by the time you reach the first layers, the gradient may be a tiny value, and that makes training very slow. Here is a meme that might help you understand vanishing gradients. It says "me using sigmoid and tanh activation functions" versus "the gradients": the gradients start small, keep fading, and by the end they are gone. Exploding gradients, on the other hand, are the opposite: the gradients start small but keep increasing from layer to layer. In any case, we don't want to deal with vanishing or exploding gradients, so it's great that normalization helps prevent these issues. I also want to emphasize that normalization adjusts the scale of the data without changing its shape. Here is a figure that illustrates this point. On the left, the values on the y-axis range between, let's say, 3 and 8, but on the right, after normalization, the range has changed to 0 to 1. Same story for the x-axis: it goes from roughly 25 to 70 before, to between 0 and 1 after, yet the shape did not change even though the scale of the data did.
There are many normalization methods: layer norm, batch norm, RMS norm, and so on. Layer norm is the method used in the original transformer introduced in the Attention Is All You Need paper. Here is the equation used by layer norm. This method normalizes the activations of each layer across the feature dimension. Here is a diagram that illustrates that: we have a tensor where N is the batch dimension, C is the feature dimension, and we might have other dimensions as well; for example, if you are dealing with images, you might also have the height and width. It depends on the data you are working with, but you can see that the normalization is done across the feature dimension, in this case C. Use this method if you can't use big mini-batch sizes. In our example we are training a large language model, maybe with limited resources, so we cannot train with very big batch sizes because we don't have a lot of memory; in that case a method that depends on batch statistics will not work well, and layer norm is the better fit. Batch norm, on the other hand, normalizes the activations across the batch dimension: instead of applying the normalization along the C dimension, we apply it along the batch dimension, denoted here by the letter N. Here is the equation used by batch norm. Batch normalization uses learnable parameters that allow the model to shift and scale the normalized activations: the normalized activations are the term in the middle, x minus mu divided by the square-root term, and the learnable parameters are gamma and beta. Finally, we have RMS norm. This method normalizes the activations based on the root mean square of the activations themselves, and here is its formula. Unlike layer norm, RMS norm does not center the activations before normalizing. You can see that back in the layer norm equation we have x minus mu, a term that centers the activations, but here there is no term that centers the activations before applying the normalization. RMS norm reduces computational complexity without sacrificing performance, which means training will be a bit faster but performance will not degrade. So, I have explained the three normalization methods.
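The slides show the equations; for reference, here are the standard forms of the three methods. The notation is mine and may differ slightly from the slides: mu and sigma squared are the mean and variance over the normalized dimension, epsilon is a small constant, and gamma and beta are learnable.

```latex
% Layer norm and batch norm center and scale; RMS norm only rescales by the RMS.
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta
\mathrm{BatchNorm}(x) = \gamma \odot \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^{2}_{\text{batch}} + \epsilon}} + \beta
\mathrm{RMSNorm}(x)   = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_{i}^{2} + \epsilon}}
```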
Now let's run the benchmark. Because I used layer norm in the previous script, from the previous course, I will consider it the baseline and compare RMS norm against it. Here are the results. I had to zoom in a lot to see which method did better. As you can see, layer norm performed better, by 0.07%. The difference is so small that, even though layer norm did beat RMS norm, it is not that interesting, and it confirms the point from the previous slide: RMS norm reduces computational complexity but does not degrade performance.

Now let's go to VS Code so I can show you where to change the script to use another normalization method. Here is the script we used in the previous course, and if I search for layer norm, you can see that we use it in several places; we define three layer norms. As we saw in the section about activation functions, PyTorch comes packed with many normalization methods, and you can access them through the nn module. Let me scroll down: we already saw layer norm, and if you want to use, for example, RMS norm, you can find it here; let's search for batch norm as well, and here they are. Batch norm is used a lot when you develop CNN models, which is why I didn't use it in this LLM course. Just to give you an idea: if you are looking for something, start by searching inside the nn module; you may find what you need there, and if you don't, you can implement it yourself. So this is the previous script we used; it uses layer norm, so there is nothing to change there. But I have also created another script where I implemented RMS norm. Here I used the same script I talked about before, the Llama script; that Python script is used for inference, and in these lines they implement RMS norm. Here is the RMSNorm class; I just copied it and pasted it inside my custom script. This is what it looks like, and it basically implements the formula I showed you in the slides. You might ask: why implement it yourself, doesn't it exist in PyTorch? Yes, it does exist, but I did not think PyTorch had RMS norm, which is why I went searching for it. I could just delete this and use nn.RMSNorm, and that would also have worked. Sometimes, though, you will come across new normalization methods that researchers have proposed and that are not in PyTorch yet; in that case you need to create a custom class and use it. And if I search, you can see that I have basically replaced every instance of nn.LayerNorm with my custom class; or, in this case, we could simply have used nn.RMSNorm and that would have worked as well.
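For completeness, here is a small RMSNorm module in the style of the Llama implementation; the exact class in the Llama repository, and nn.RMSNorm in recent PyTorch releases, may differ in details such as the default epsilon or dtype handling, so treat this as an illustrative sketch.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: scales x by its RMS, with no centering."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain (gamma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mean of squares over the feature dimension; note there is no
        # "x - mean" term, unlike layer norm.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

# Drop-in replacement for nn.LayerNorm(dim) in the transformer block.
norm = RMSNorm(256)
print(norm(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

Note that there is no mean subtraction anywhere, which is exactly the difference from layer norm discussed above.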
Now let's go back to the slides and talk about where to put the normalization layer. We can place normalization before or after any layer, and we use special names for this: pre-layer norm and post-layer norm. Here LN stands for layer norm, but if you use another normalization method, just substitute that method. We have seen these diagrams before: pre-normalization means that the normalization method of your choice comes before the attention layer, in this case, and post-normalization means that the normalization comes after it. Post-normalization can run into stability issues as the number of layers grows; it can achieve better final performance, but it is very hard to find the right hyperparameters. Pre-normalization, on the other hand, offers better training stability because it is less sensitive to hyperparameter choices, so even if you don't search for the optimal hyperparameters, you can get good results with this approach. Pre-normalization shines when the number of layers is large. Let's do the benchmark. We will consider pre-normalization the baseline and compare it to post-normalization. Here are the results. As you can see, post-normalization performed better than pre-normalization. In this case I have a small model without many layers, which is why post-normalization came out on top. I am just showing you the methods that exist and ways to implement them; the outcome depends on your case, and you might get different results with a bigger model or when training on a different dataset.
Now let's go to VS Code to see how to implement post-normalization in the model script. Here are the layer norms, and as you can see we are currently using pre-normalization: we apply normalization before the attention layer, so the output of the normalization layer becomes the input of the attention layer. Post-normalization simply changes the order in which these operations are performed. Let me show you what that looks like. I have created a script that changes the order of operations. Inside the Block class, or rather, before I explain, let me put the new script on the right and decrease the font size so we can see both versions. On the right, I added a comment just to note that we are using post-normalization here. We feed the input to the attention layer and apply normalization afterwards, and we do the same for the feed-forward network: we pass the input to the feed-forward layer, add its output back to the input, which gives us x, and then apply normalization. You can see the difference is small, but it worked in our case.
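Here is a rough sketch of that difference inside a transformer block's forward pass. The attention and feed-forward sub-layers are replaced by simple placeholders so the example runs on its own; the attribute names (attn, ffwd, ln1, ln2) are assumptions of mine, not necessarily the names used in the course script.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Transformer block with a switch between pre- and post-normalization."""

    def __init__(self, dim: int, post_norm: bool = False):
        super().__init__()
        self.attn = nn.Linear(dim, dim)   # placeholder for the attention sub-layer
        self.ffwd = nn.Linear(dim, dim)   # placeholder for the feed-forward network
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.post_norm = post_norm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.post_norm:
            # Post-norm: sub-layer, residual add, then normalize.
            x = self.ln1(x + self.attn(x))
            x = self.ln2(x + self.ffwd(x))
        else:
            # Pre-norm: normalize, sub-layer, then residual add.
            x = x + self.attn(self.ln1(x))
            x = x + self.ffwd(self.ln2(x))
        return x

x = torch.randn(2, 16, 256)
print(Block(256, post_norm=True)(x).shape)  # torch.Size([2, 16, 256])
```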
Now let's go back to the slides, because we have one last thing to talk about: dropout. Dropout is useful when you want to avoid overfitting. Here is a diagram that explains it. On the left we have the standard neural network, where every neuron is connected to all the neurons it can connect to. On the right we have the network after applying dropout: some neurons are deactivated, which means some connections are dropped. That is what dropout does to your neural network, and if you want to read more about it, I have made sure to link the original paper that introduced the idea. You should use dropout if you are training on a small dataset for several epochs, because iterating over the data multiple times can lead to overfitting. But if you go over your dataset only once, you will not overfit. LLMs train for one epoch because the dataset size is enormous, so dropout is not needed; the model never sees the data more than once. In this final benchmark, I compare training with dropout to training without it. Here are the results, and as you can see, because I trained the model for just one epoch, no dropout performed better than dropout. Now let's go to VS Code so I can show you what needs to change. Let me search for dropout. You can see that we define dropout in several places: inside the head class, inside the multi-head attention class, and in the feed-forward class. If you don't want to use dropout, simply remove these lines: wherever you find nn.Dropout, remove it, or, since in our case we stored the probability in a variable, remove that variable and every place it is used. After you do that, use the new script inside the notebook, train your model, and maybe this will help you get better results.
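A lower-effort alternative to deleting lines, and purely my suggestion rather than what the course script does, is to make the dropout probability a constructor argument and set it to zero:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Feed-forward block whose dropout can be disabled with dropout=0.0."""

    def __init__(self, dim: int, hidden_dim: int, dropout: float = 0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),  # with dropout=0.0 this layer is a no-op
        )

    def forward(self, x):
        return self.net(x)

ffn_with_dropout = FeedForward(256, 1024, dropout=0.2)  # small-data, multi-epoch regime
ffn_no_dropout = FeedForward(256, 1024, dropout=0.0)    # single-epoch LLM training
```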
We have arrived at the end of this video. I hope it was useful and informative for you. In the next video, we are going to compare the original transformer model with a model that uses the best methods: for positional encoding we will use rotary positional encoding; for attention, multi-head latent attention or grouped query attention; we will remove dropout, use post-normalization, and so on. In other words, we will assemble the best methods we found in the previous videos. The goal is to compare the original model with the best model built from those methods and to watch the loss curve keep decreasing. For example, we will add RoPE, the rotary positional embedding, and we should see the loss decrease; after that we will add multi-head latent attention, which should decrease the loss again, and we will keep going like this until we have implemented everything. At the end we will see whether implementing these methods gives us a big boost in performance. See you in the next video.
Hi everyone. In this video we are going to use everything we have learned in the past videos. We will put it all together to update the 2017 transformer architecture with the best ideas. We will go step by step, so you can see how each small change makes things better. At the end we will look at the old 2017 transformer architecture and compare it to the new one we built in this video. Like I said, we are going to build the best model using what we know, improving it bit by bit with small changes.

This picture shows the parts of the 2017 transformer architecture. First, I will change the multi-head attention part to multi-head latent attention. Then I will use rotary positional encoding to encode the positions. After that, instead of pre-normalization I will use post-normalization. In the feed-forward network, which I called the dense layer, I will replace the ReLU activation function with SwiGLU, and last I will remove dropout. So you can see we have five steps, and by the end we should see a big improvement in performance. After each step we will show a graph of the loss curves so we can track our progress.

Let's begin with step zero. This is just the basic 2017 transformer architecture: learnable positional encoding, multi-head attention, ReLU as the activation function, pre-normalization with layer norm, and dropout. In step one we replace multi-head attention with multi-head latent attention. Here are the loss graphs for both steps. As you can see, the loss went down from 4.84 to 4.66, a drop of 3.72%. That is a good start. In step two we use rotary positional encoding instead of the learnable positional encoding. This brings the loss down even more, to about 4.42; the total drop is now 8.57%, which really shows how good rotary positional encoding is. Step three is about switching from pre-normalization to post-normalization, which brings the loss down by 9.27% in total. In step four we swap ReLU for SwiGLU, which pushes the total loss reduction to 10.72%. Finally, in step five, we remove dropout, bringing our total loss reduction to 11.44%. Now let's clear the graph and show only step zero and the very last step. Here is the graph. As you can see, there is a big difference between the two loss curves. I am really happy to see this; it shows that all our hard work finding the best method for each part paid off. The loss in step zero, or phase zero, was around 4.84, and in the last step it went down to 4.28.
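If you wanted to capture the final recipe in one place, it might look like the configuration below. The dataclass and its field names are hypothetical, written by me for illustration, and are not taken from the course scripts.

```python
from dataclasses import dataclass

@dataclass
class BestModelConfig:
    """Hypothetical summary of the final set of choices (step five)."""
    positional_encoding: str = "rope"  # rotary instead of learnable
    attention: str = "mla"             # multi-head latent attention
    norm_placement: str = "post"       # post-normalization
    norm_type: str = "layernorm"       # layer norm edged out RMS norm here
    activation: str = "swiglu"         # replaces ReLU in the feed-forward network
    dropout: float = 0.0               # removed for single-epoch training

print(BestModelConfig())
```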
Now let's go to VS Code so I can show you the scripts I have created. I have opened the project in VS Code, and if you open the transformer folder you should see that I have added four scripts. Even though in the slides I mentioned five phases, there are only four scripts here, because we had already built multi-head latent attention before; we have it here, so there is no need to recreate it. What you see is basically me merging our old scripts into these final phases, and I have also created notebooks that use these scripts. If you open the notebooks folder, you will see them: these four, 951 through 954, and as you can see they use the four phases we have here. If you have made it to the end of this video, I want to say thank you for watching this course. This is the end of this video; I really hope you have learned a lot and enjoyed this journey with me. We have one more video left, in which I will wrap things up and give a quick summary of what we covered in this course. See you next time.
We have reached the end of the course. Congratulations, you have done an awesome job. Let's quickly go over what we have learned. We started our journey by looking at the original transformer from 2017; thanks to that architecture, we were able to create our first language model. This course was all about exploring the improvements and new ideas proposed between 2017 and 2025 that helped make the transformer even better. As you can see from this diagram, we tried a lot of ideas, and we had to learn a lot to get to the point where we improved the transformer architecture drastically. We observed that these new ideas significantly enhanced the transformer architecture across several aspects, such as memory usage, inference speed, and the quality of the results. We also noticed that each improvement lowered the model's loss, which means the model got better at understanding and predicting. We saw this in the previous video: applying just MLA reduced the loss a bit, and when we added RoPE, SwiGLU, no dropout, and so on, we kept reducing the loss further. And here is the takeaway: transformers are still changing and getting better, so if you want to stay on top of things, keep an eye on the latest research. Thanks a lot for watching the course.