Transformer networks, introduced in 2017, are a neural network architecture that uses an attention mechanism, loosely inspired by human attention, to process sequential data, and they are especially prominent in natural language processing tasks like translation.
Hello everyone, my name is Aruhi, and welcome to my channel. In today's video we will study transformer networks: we will understand in detail what transformer networks are, how they are used, and what their layer architecture is. We will go through the model architecture in detail and understand how it functions. Transformer networks were introduced in a paper released in 2017.
Let me start with an example. Suppose I wrote the sentence: 'I went to a park and saw a huge dog.' When you read this sentence, think for yourself: which words in this sentence did you focus on more? Most of your attention probably went to the words 'huge dog'. Now let me give you another example.
Suppose I said: 'I love to read books, but especially I love to read books related to computer science.' After listening to this entire sentence, you must have given more attention to some words in your mind. Which words were those? 'Books' and 'computer science', right? In that entire sentence, you gave these two pieces of information more importance, more focus, more attention. So this is the concept of attention here too.
So, what was done in transformer networks? Transformer networks were taught to mimic human attention. And how was this done? With the concept of attention — which is exactly why the 2017 paper is titled 'Attention Is All You Need'. The concept of human attention was mimicked inside transformers, and that mechanism is called attention. Now that we have understood the meaning of attention, let us look at the model architecture. This is the model architecture of the transformer network, and we will now try to understand it.
In this diagram, this part here is the encoder, and this part, from here to here, is the decoder. The input goes in here, to the encoder. Now let's understand these parts in detail, starting with the encoder.
What are the encoders inside the transformer network? In the transformer there are six encoders stacked one after another: encoder one, encoder two, encoder three, and so on up to encoder six. And inside every encoder there are two layers. Which two layers? The first is the self-attention layer, and the second is the feed-forward layer. You can see this in the architecture: the first encoder has a self-attention layer and a feed-forward layer; the second also has a self-attention layer and a feed-forward layer; the third encoder also has both these layers; and similarly, in every encoder you will find these same two layers.
Now let's see what the input of the encoder will be. I have already told you that there are six encoders and every encoder has these two layers; now let's talk about how the input to the encoder is prepared. Take an example: suppose I have the sentence 'I love reading books', and suppose our task today is to translate it from English to Hindi. Whatever sentence we give to the model, our transformer will convert it into Hindi and generate the output — here, it will translate 'I love reading books' into Hindi.
So what is the first step that will be performed on this input sentence? Tokenization. What is tokenization? Tokenization means dividing the sentence into tokens. After dividing, we get 'I', 'love', 'reading', and 'books'. So tokenization simply means dividing your sentence into small tokens — into words.
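To make this concrete, here is a minimal sketch in Python. This is my own illustration: I am using simple whitespace splitting, while real transformer pipelines typically use subword tokenizers such as WordPiece or BPE.

```python
# Minimal whitespace tokenization (illustrative only; real pipelines
# usually use subword tokenizers such as WordPiece or BPE).
sentence = "I love reading books"
tokens = sentence.split()
print(tokens)  # ['I', 'love', 'reading', 'books']
```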
Now what is step two? Step two brings in word embeddings. Remember what I am describing: the input that goes to the encoder — this encoder you see in the architecture — has all these steps performed on it, and only then is it ready to be given to the encoder. So first tokenization is done, then word embedding.
What does word embedding mean? In simple words, our algorithms don't understand words; they understand numbers. So we have to convert all these words into numbers. If you look at those numbers, they won't make much sense to you, but this is how our algorithms work: every word is assigned a unique numeric representation, and that happens through word embedding.
There are many word embedding models — you may have heard of Word2Vec, GloVe, or BERT; all of these can produce word embeddings. You can take any pre-trained word embedding model, run it on the tokens, and it will give you numbers for each word. In 'I love reading books' we have four words, and for each word a separate vector is created. For example, suppose the vector for 'I' has the values 0.1, 0.2, 0.3, 0.4. I am writing random values just to explain the concept; in reality the values could be anything. This vector represents 'I'. Now suppose the vector for 'love' is 0.9, 0.8, 0.7, 0.6. Similarly, there will be some embedding for 'reading' and some embedding for 'books' as well. So we have converted the words into vectors — that is the meaning of word embedding: every token that is created gets numeric values like these.
Now, notice the number of values in each vector. Here I wrote four values: 1, 2, 3, 4. Actually, there are not just four values; I wrote four here only so you can understand, which happens to match the number of words in the sentence. What actually happens is that the pre-trained word embedding model you use — BERT, Word2Vec, or any other — has a fixed dimension, and it generates that many values for every word. For example, the BERT-base model has a dimension of 768, which means it provides 768 values for every word. So in our case, 'I' would get 768 values — 0.1, 0.5, and so on — and 'love' would also get 768 values. Whatever the dimension of your word embedding model is, you get that many values for each word. But for today's example, I have used only four values to make the concept easier to understand. Okay, that much is clear.
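As a sketch of what this lookup produces, here is a toy version in Python with NumPy. The table below uses the made-up four-dimensional values from our example; a real pre-trained model (Word2Vec, GloVe, BERT) would supply learned vectors of its own dimension, e.g. 768 for BERT-base.

```python
import numpy as np

# Toy embedding table with made-up 4-dimensional values; a real
# pre-trained model supplies learned vectors (768-dim for BERT-base).
embedding_table = {
    "I":       np.array([0.1, 0.2, 0.3, 0.4]),
    "love":    np.array([0.9, 0.8, 0.7, 0.6]),
    "reading": np.array([0.5, 0.1, 0.9, 0.2]),  # illustrative values
    "books":   np.array([0.3, 0.7, 0.2, 0.8]),  # illustrative values
}

tokens = ["I", "love", "reading", "books"]
word_embeddings = np.stack([embedding_table[t] for t in tokens])
print(word_embeddings.shape)  # (4, 4): four tokens, four values each
```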
What steps have we performed so far? First, we did tokenization. Second, we did word embedding — and for word embedding you can use any model, such as BERT, and however many dimensions that model has, it will give you that many values for each word. So we have done that much work. Now, what is the next task after word embedding? Let's see.
Now, look here: 'I love reading books', and these were our word embeddings — that much is clear to us. What happens next? All these words will go to the encoder. I showed you the model architecture; you can look at it again. I told you how many encoders there are: six encoders, each with two layers. The input to the encoder is 'I love reading books', and all these words — 'I', 'love', 'reading', 'books' — go to the encoder at once, in one go. When all the words go to the encoder at the same time, the encoder will not know which word comes first in the sentence and which word comes later, because all the words arrive together. It will not be able to understand the sequence of the sentence: it could read it as 'I love reading books' or 'I love books reading' or anything else. It can read the sentence in any order, but then the information will not be correct, because it cannot find out the actual sequence of the sentence.
So what is the solution to this problem? For the solution, a concept called positional encoding was introduced. Why do we need positional encoding? Because all the words go to the encoder at once, and without it the encoder would not understand their sequence — which word should come first in the sentence and which word later. That is why positional encoding is used.
Now, what does positional encoding do? For every word, a separate positional vector is created, with the same number of values as the word embedding. Suppose for 'I' the positional vector has the values 0.1, 1.0, 0.0, 0.0 — again, I am telling you illustrative values — and similarly, positional vectors are created for all the words. So for every word, a vector like this is ready. Then, in positional encoding, we add this positional vector to the word embedding, and we get a combined embedding. If you add 0.1 to 0.1, you get 0.2; if you add 1.0 to 0.2, you get 1.2; and then you get 0.3 and 0.4. This embedding you have created is the combined embedding, and this combined embedding will be the input of the encoder. It is the same for every word: for every word we add its word embedding and its positional encoding, and we get a value like this, which becomes the input of your encoder.
Now, what is the benefit of this vector? Two things are captured in it. First, the meaning of the word is known — because we used a word embedding model, and those numbers, which don't mean much to us on their own, encode which word is being represented and what it means. Second, the positional encoding tells the position of the word: for 'I', this part tells what position the word has, and the embedding part tells what the word 'I' means. So the combined embedding carries both the position and the meaning of the word, and this becomes the input of the encoder. And we do the same for the other three words — for as many words as there are in the sentence.
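The original paper defines positional encoding with fixed sinusoids. Here is a minimal sketch of that formula in Python, added element-wise to our toy embeddings; the embedding values are the made-up ones from above, and I assume an even model dimension.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # (assumes d_model is even)
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / (10000 ** (i / d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

word_embeddings = np.array([[0.1, 0.2, 0.3, 0.4],   # 'I'      (toy values)
                            [0.9, 0.8, 0.7, 0.6],   # 'love'
                            [0.5, 0.1, 0.9, 0.2],   # 'reading'
                            [0.3, 0.7, 0.2, 0.8]])  # 'books'

# Element-wise addition gives the combined embedding fed to the encoder.
combined = word_embeddings + positional_encoding(seq_len=4, d_model=4)
```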
Now, if we look at this transformer network diagram: this is the concept of positional encoding that I have just explained, and this is the input embedding. What was the input embedding? What steps did we perform for it? First we performed tokenization, and right after tokenization, the second step we performed was word embedding. That is what this input embedding means — this word embedding. And what is this word embedding being added to? The positional encoding, exactly as I explained. If you look at this picture: this was our word embedding, this was our positional encoding, and our final vector is ready. This final vector becomes the input of the encoder. Now, here is our encoder.
What is inside this encoder? As I told you, there are two things in every encoder: one is the self-attention layer — this one — and the other is the feed-forward network. Now, I explained the concept of attention at the very beginning of today's class: attention means mimicking human attention, that is, focusing on the important things. That task is performed here. Multi-head attention, meaning self-attention, decides which words in our sentence 'I love reading books' have to be given more importance. That task is performed here, in the multi-head attention layer. I will explain the multi-head attention layer in detail shortly, but first let's go through the whole architecture of the encoder.
After that, we will also understand multi-head attention. First, let's take an overview of what is happening in the encoder. Inside every encoder there is a multi-head attention layer, and the second layer is the feed-forward layer. And you may have noticed that after the multi-head attention there is an add & norm layer, and after the feed-forward network there is another add & norm layer. So the layer architecture of every encoder looks like this. I will tell you shortly why we use the add & norm layer in both places, after the multi-head attention layer and after the feed-forward network. For now, just note that every encoder has a multi-head attention layer and a feed-forward layer, with an add & norm layer after each. And how many encoders are being used? Six encoders are used in the transformer.
Now, let's understand the concept of multi-head attention in detail, because this is the most important part of our transformer network. This is the multi-head attention layer, and we are working on the same example, 'I love reading books'. When this embedding was given as input to the multi-head attention layer — the combined embedding I created above — remember that the same thing has to be created for every word; I showed it for a single word, but it happens for every word. So the input to multi-head attention is this combined vector.
Now, what happens for every word? In the multi-head attention layer, three things are computed for every word. The word 'I' will have three, the word 'love' will have three, the word 'reading' will have three, and the word 'books' will have three. And what are those three? A query, a key, and a value for each word. So what I am telling you is that in the multi-head attention layer, every word in your sentence gets three vectors: a query, a key, and a value.
Now let's understand what the query, key, and value are. A query is like a word that is looking at the other words to see which one it should pay attention to. A key is like a word that is being looked at by the other words. And what remains is the value: the value means the information of that word. Don't worry, I will explain this with an example; I just wanted to give you a brief introduction to query, key, and value, and I have given you that.
Now we will continue with the same example. Our example sentence was 'I love to read books' — sorry, 'I love reading books'. So let's work on 'I love reading books'. The query, key, and value that we get for each word are computed with some maths; we are not studying that today, because the lecture would become very long and we have already covered a lot of concepts — we are still only on the first layer of the architecture. Today we are studying the rest of the things in detail, but the maths of this query, key, and value concept I will cover in a separate video. So, taking our example: for this sentence, I told you that every word will have its own separate query, its own separate key, and its own separate value.
To understand this, let us take the word 'I'. What we are doing now, with the help of the self-attention layer, is letting every word look at all the other words to see which of them should be given more attention — that is the job of the self-attention layer. What the self-attention layer tells us is which words we should give more importance to and which we should not. So the word 'I' will look at the different words in this sentence and decide which word it should give more importance to and which word less. Similarly, we then work on the word 'love': 'love' will see which of 'I', 'reading', and 'books' it should give more importance to. Similarly, the word 'reading' will see which of the other words it should give more importance to, and the word 'books' will see which of 'I', 'love', and 'reading' it should give more importance to — more attention.
So this is the task: each individual word, among the rest of the words in the sentence, will see which one it needs to give more attention to. For this, the concept of query, key, and value is used; it tells which other words each word should give more importance to. Suppose the example I am showing you is of just one word — I am telling you about 'I'. When we have the query, key, and value for each word, the self-attention layer then computes a similarity score. How is the similarity score computed? On the basis of the query of each word and the keys of all the other words. From this similarity score, we come to know which other words each word should give more attention to: for example, how much attention 'I' should give to 'love', 'reading', or 'books', and similarly how much attention 'love' should give to 'I', 'reading', or 'books'. So, on the basis of the similarity score, it is known which other words in the sentence each word should give more attention to — and the higher the score, the more attention it will give to that other word.
For example, if you look here: suppose the similarity score of 'I' and 'love' is 0.5, the similarity score of 'I' and 'reading' is 0.2, and the similarity score of 'I' and 'books' is 0.1. Which is the highest score here? The first one — which means the word 'I' will give the most attention to the word 'love'. That is the similarity score for the word 'I'; similarly, you can calculate the similarity score of every word with the rest of the words, and you will get to know which word will give more attention to which other word — for example, 'I' will give more attention to 'love'. So this is how the similarity score is calculated, and as I told you, it comes out on the basis of the query of each word and the keys of all the other words. This is how the self-attention layer works: it tells every word which other words in the sequence should be given more attention.
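Here is a minimal sketch of this mechanism — scaled dot-product self-attention — in NumPy. The projection matrices Wq, Wk, Wv are random here purely for illustration; in a real transformer they are learned during training, and multi-head attention runs several such attentions in parallel and concatenates the results.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # Every word gets a query, a key, and a value vector.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Similarity scores: each word's query dotted with every word's key,
    # scaled by sqrt(d_k) as in the paper.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns the scores into attention weights per word.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: attention-weighted mix of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))                # 4 tokens, d_model = 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)        # shape (4, 4)
```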
Okay, after that, if you look at the architecture, what have we understood so far? We have already understood this part; we have also understood positional encoding; and from this part we have come to know, on the basis of the query, key, and value, which other words each word should give attention to. After that, we have the addition and normalization (add & norm) layer. What does it do? You can see that data is also being brought here directly from before the layer. If you have read about residual nets, these are residual connections: whatever the old output is — the input to the current layer — we add it to the output of the current layer. Our current layer here is the multi-head attention layer, so we take the output of the multi-head attention layer and perform element-wise addition with our original input. With this, new information gets added to what we have, and the original information also remains — both kinds of information stay with us. That is how the addition works. After that, the normalization: we add normalization so that when we train, our data remains in the same range. Everywhere you see an add & norm layer — here, here, and here — the same task is being done: we add the output of the current layer to the previous output so that we get new information while the old information remains, and we use the normalization so that our data stays in the same range. So I will not explain this layer again and again.
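A sketch of this add & norm step: a residual addition of the sub-layer's input to its output, followed by layer normalization. I have omitted LayerNorm's learned scale and shift parameters for brevity; this is an illustration, not a full implementation.

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    # Residual connection: keep the original information and add the
    # new information produced by the current sub-layer.
    y = x + sublayer_out
    # Layer normalization: rescale each token's vector so the values
    # stay in a stable range during training.
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)
```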
After we have added the output of the current layer and the previous output, and our data has been normalized, the result becomes the input to the feed-forward network. After the feed-forward network, we again have an add & norm layer, which performs the same task. So this is the work of the encoder. I told you about a single encoder, and there are six like this: the output of the first encoder becomes the input of the second encoder, and in the same way all six encoders perform the task.
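Putting the pieces together, here is a schematic of one encoder layer and the stack of six. It reuses the add_and_norm and self_attention sketches (and the toy Wq, Wk, Wv weights) from above, plus a toy feed-forward network with random weights; the point is the layer ordering, not a faithful implementation.

```python
import numpy as np

rng2 = np.random.default_rng(1)
d_model, d_ff = 4, 16
W1 = rng2.normal(size=(d_model, d_ff))  # toy feed-forward weights
W2 = rng2.normal(size=(d_ff, d_model))

def feed_forward(x):
    # Position-wise feed-forward network: two linear layers with ReLU.
    return np.maximum(0, x @ W1) @ W2

def encoder_layer(x):
    # Sub-layer 1: self-attention, then add & norm.
    x = add_and_norm(x, self_attention(x, Wq, Wk, Wv))
    # Sub-layer 2: feed-forward network, then add & norm.
    return add_and_norm(x, feed_forward(x))

def encoder_stack(x, num_layers=6):
    # The output of each encoder becomes the input of the next.
    for _ in range(num_layers):
        x = encoder_layer(x)
    return x
```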
Okay, after that, if you look at this side of the diagram, this part here is our decoder. Just as we have six encoders, there are six decoders inside the transformer network. So what is the first layer of the decoder? It is masked multi-head attention. You can see that we had an attention layer in the encoder, and we also have attention layers in the decoder — but there are two types of attention in the decoder. One is multi-head attention, the same as in the encoder. The other is masked multi-head attention; this is new, and it exists only in the decoder. Now, the output of the encoder will become an input to the decoder, and the first layer of the decoder is this masked multi-head attention.
Now, what does masked multi-head attention mean? It means that our decoder will generate one word at a time. Remember what I told you at the beginning of the class: the task we are building this transformer network for is language translation. We had 'I love reading books', and we want to translate it into Hindi. So the decoder will generate one word at a time. Suppose the decoder generates the first word, the Hindi word for 'me'. That first generated word then becomes the input of the masked multi-head attention: on the basis of the words that have already been generated, masked multi-head attention helps decide what the next word will be, and the decoder generates only one word at a time. Now we need the second word, so once the input arrives here, all the decoder's layers run on it and the next word is generated: the input comes in, some output is created here, it goes through the model, and finally we get the next word. Suppose the next word corresponds to 'reading'; then that word also comes back here as input. On the basis of the words generated so far, our masked multi-head attention helps generate the third word; then the third word also comes back here, and on the basis of these three words our model tells the next word. So this is how the decoder works.
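The 'mask' here is a causal mask: each position is allowed to look only at the words generated so far, never at future positions. A minimal sketch of how such a mask is built, assuming the same attention scores as in the encoder sketch:

```python
import numpy as np

def causal_mask(seq_len):
    # Upper-triangular mask: position i must not attend to positions > i.
    # Masked entries get -inf so that softmax gives them zero weight.
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)

scores = np.zeros((4, 4))            # stand-in attention scores
masked_scores = scores + causal_mask(4)
# Row 0 may attend only to token 0, row 1 to tokens 0-1, and so on.
```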
What do we have to remember? A decoder is used to generate output, and we have six decoders. There are two types of attention layers in the decoder, one of them masked — and the masked one's only task is to generate the next word on the basis of the words that have been generated so far. After that there is an add & norm layer, which I have already explained. After that there is a multi-head attention layer; these layers work just like the encoder's attention layer worked, deciding which words should be given importance and what the sequence of words should be — which word should come first and which word after. And after that there is again an add & norm layer and a feed-forward layer.
And finally, at the very end, you see there is a softmax layer. What this softmax layer does is provide a number — a probability — for each word, and the word with the higher probability will be the next word in the sequence. So suppose, after the first word, the probability of 'reading' is 0.5 but the probability of 'like' is 0.4 — I am writing random values — and 'books' also has some probability. The word with the higher probability becomes the next word in the sequence. This is how it is decided: in the final layer, the softmax layer assigns a value, a probability, to each word, and the one with the higher value becomes the next word. This is how you get the output from the decoder.
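A sketch of this final step: softmax turns the decoder's raw scores over the vocabulary into probabilities, and with greedy decoding the highest-probability word is chosen as the next word. The vocabulary and scores below are made up for illustration.

```python
import numpy as np

vocab = ["reading", "like", "books", "me", "to"]   # toy vocabulary
logits = np.array([2.1, 0.4, 1.0, 0.2, 0.5])       # made-up decoder scores

# Softmax: convert the scores into a probability for each word.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding: the word with the highest probability is the next word.
next_word = vocab[int(np.argmax(probs))]
print(next_word)  # 'reading'
```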
These are the basics of transformers. I hope you understood this video, and if you found my content helpful, please like and subscribe to my channel. Thank you for watching.