Transformer networks, introduced in 2017, are a neural network architecture that uses an attention mechanism, loosely inspired by human attention, to process sequential data, and they are especially prominent in natural language processing tasks like translation.
Hello everyone, my name is Aruhi, and welcome to my channel. In today's video we will study transformer networks: we will understand in detail what transformer networks are, how they are used, and what their layer architecture is. We will go through the model architecture in detail and understand how it functions. Transformer networks were introduced in a paper released in 2017.
Let me start with an example. Suppose I wrote the sentence: 'I went to a park and saw a huge dog.' When you read this sentence, think for yourself: which words in this sentence did you focus on more? Most of your attention probably went to the words 'huge dog'. Now let me give you another example.
Suppose I said: 'I love to read books, but especially I love to read books related to computer science.' After listening to this entire sentence, you must have given more attention to some words in your mind. Which words were those? 'Books' and 'computer science', right? In that entire sentence, you gave these two pieces of information more importance, more focus, more attention. So this is the concept of attention here too.
So, what was done in transformer networks? Transformer networks were taught to mimic human attention. And how was this done? With the concept of attention — which is exactly why the 2017 paper is titled 'Attention Is All You Need'. The concept of human attention was mimicked inside transformers, and that mechanism is called attention. Now that we have understood the meaning of attention, let us look at the model architecture. This is the model architecture of the transformer network, and we will now try to understand it.
In this diagram, this part here is the encoder, and this part, from here to here, is the decoder. The input goes in here, to the encoder. Now let's understand these parts in detail, starting with the encoder.
What are the encoders inside the transformer network? In the transformer there are six encoders stacked one after another: encoder one, encoder two, encoder three, and so on up to encoder six. And inside every encoder there are two layers. Which two layers? The first is the self-attention layer, and the second is the feed-forward layer. You can see this in the architecture: the first encoder has a self-attention layer and a feed-forward layer; the second also has a self-attention layer and a feed-forward layer; the third encoder also has both these layers; and similarly, in every encoder you will find these same two layers.
Now let's see what the input of the encoder will be. I have already told you that there are six encoders and every encoder has these two layers; now let's talk about how the input to the encoder is prepared. Take an example: suppose I have the sentence 'I love reading books', and suppose our task today is to translate it from English to Hindi. Whatever sentence we give to the model, our transformer will convert it into Hindi and generate the output — here, it will translate 'I love reading books' into Hindi.
So what is the first step that will be performed on this input sentence? Tokenization. What is tokenization? Tokenization means dividing the sentence into tokens. After dividing, we get 'I', 'love', 'reading', and 'books'. So tokenization simply means dividing your sentence into small tokens — into words.
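To make this concrete, here is a minimal sketch in Python. This is my own illustration: I am using simple whitespace splitting, while real transformer pipelines typically use subword tokenizers such as WordPiece or BPE.

```python
# Minimal whitespace tokenization (illustrative only; real pipelines
# usually use subword tokenizers such as WordPiece or BPE).
sentence = "I love reading books"
tokens = sentence.split()
print(tokens)  # ['I', 'love', 'reading', 'books']
```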
Now what is step two? Step two brings in word embeddings. Remember what I am describing: the input that goes to the encoder — this encoder you see in the architecture — has all these steps performed on it, and only then is it ready to be given to the encoder. So first tokenization is done, then word embedding.
What does word embedding mean? In simple words, our algorithms don't understand words; they understand numbers. So we have to convert all these words into numbers. If you look at those numbers, they won't make much sense to you, but this is how our algorithms work: every word is assigned a unique numeric representation, and that happens through word embedding.
There are many word embedding models — you may have heard of Word2Vec, GloVe, or BERT; all of these can produce word embeddings. You can take any pre-trained word embedding model, run it on the tokens, and it will give you numbers for each word. In 'I love reading books' we have four words, and for each word a separate vector is created. For example, suppose the vector for 'I' has the values 0.1, 0.2, 0.3, 0.4. I am writing random values just to explain the concept; in reality the values could be anything. This vector represents 'I'. Now suppose the vector for 'love' is 0.9, 0.8, 0.7, 0.6. Similarly, there will be some embedding for 'reading' and some embedding for 'books' as well. So we have converted the words into vectors — that is the meaning of word embedding: every token that is created gets numeric values like these.
Now, notice the number of values in each vector. Here I wrote four values: 1, 2, 3, 4. Actually, there are not just four values; I wrote four here only so you can understand, which happens to match the number of words in the sentence. What actually happens is that the pre-trained word embedding model you use — BERT, Word2Vec, or any other — has a fixed dimension, and it generates that many values for every word. For example, the BERT-base model has a dimension of 768, which means it provides 768 values for every word. So in our case, 'I' would get 768 values — 0.1, 0.5, and so on — and 'love' would also get 768 values. Whatever the dimension of your word embedding model is, you get that many values for each word. But for today's example, I have used only four values to make the concept easier to understand. Okay, that much is clear.
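As a sketch of what this lookup produces, here is a toy version in Python with NumPy. The table below uses the made-up four-dimensional values from our example; a real pre-trained model (Word2Vec, GloVe, BERT) would supply learned vectors of its own dimension, e.g. 768 for BERT-base.

```python
import numpy as np

# Toy embedding table with made-up 4-dimensional values; a real
# pre-trained model supplies learned vectors (768-dim for BERT-base).
embedding_table = {
    "I":       np.array([0.1, 0.2, 0.3, 0.4]),
    "love":    np.array([0.9, 0.8, 0.7, 0.6]),
    "reading": np.array([0.5, 0.1, 0.9, 0.2]),  # illustrative values
    "books":   np.array([0.3, 0.7, 0.2, 0.8]),  # illustrative values
}

tokens = ["I", "love", "reading", "books"]
word_embeddings = np.stack([embedding_table[t] for t in tokens])
print(word_embeddings.shape)  # (4, 4): four tokens, four values each
```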
What steps have we performed so far? First, we did tokenization. Second, we did word embedding — and for word embedding you can use any model, such as BERT, and however many dimensions that model has, it will give you that many values for each word. So we have done that much work. Now, what is the next task after word embedding? Let's see.
Now, look here: 'I love reading books', and these were our word embeddings — that much is clear to us. What happens next? All these words will go to the encoder. I showed you the model architecture; you can look at it again. I told you how many encoders there are: six encoders, each with two layers. The input to the encoder is 'I love reading books', and all these words — 'I', 'love', 'reading', 'books' — go to the encoder at once, in one go. When all the words go to the encoder at the same time, the encoder will not know which word comes first in the sentence and which word comes later, because all the words arrive together. It will not be able to understand the sequence of the sentence: it could read it as 'I love reading books' or 'I love books reading' or anything else. It can read the sentence in any order, but then the information will not be correct, because it cannot find out the actual sequence of the sentence.
So what is the solution to this problem? For the solution, a concept called positional encoding was introduced. Why do we need positional encoding? Because all the words go to the encoder at once, and without it the encoder would not understand their sequence — which word should come first in the sentence and which word later. That is why positional encoding is used.
Now, what does positional encoding do? For every word, a separate positional vector is created, with the same number of values as the word embedding. Suppose for 'I' the positional vector has the values 0.1, 1.0, 0.0, 0.0 — again, I am telling you illustrative values — and similarly, positional vectors are created for all the words. So for every word, a vector like this is ready. Then, in positional encoding, we add this positional vector to the word embedding, and we get a combined embedding. If you add 0.1 to 0.1, you get 0.2; if you add 1.0 to 0.2, you get 1.2; and then you get 0.3 and 0.4. This embedding you have created is the combined embedding, and this combined embedding will be the input of the encoder. It is the same for every word: for every word we add its word embedding and its positional encoding, and we get a value like this, which becomes the input of your encoder.
Now, what is the benefit of this vector? Two things are captured in it. First, the meaning of the word is known — because we used a word embedding model, and those numbers, which don't mean much to us on their own, encode which word is being represented and what it means. Second, the positional encoding tells the position of the word: for 'I', this part tells what position the word has, and the embedding part tells what the word 'I' means. So the combined embedding carries both the position and the meaning of the word, and this becomes the input of the encoder. And we do the same for the other three words — for as many words as there are in the sentence.
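The original paper defines positional encoding with fixed sinusoids. Here is a minimal sketch of that formula in Python, added element-wise to our toy embeddings; the embedding values are the made-up ones from above, and I assume an even model dimension.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # (assumes d_model is even)
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / (10000 ** (i / d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

word_embeddings = np.array([[0.1, 0.2, 0.3, 0.4],   # 'I'      (toy values)
                            [0.9, 0.8, 0.7, 0.6],   # 'love'
                            [0.5, 0.1, 0.9, 0.2],   # 'reading'
                            [0.3, 0.7, 0.2, 0.8]])  # 'books'

# Element-wise addition gives the combined embedding fed to the encoder.
combined = word_embeddings + positional_encoding(seq_len=4, d_model=4)
```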
Now, if we look at this transformer network diagram: this is the concept of positional encoding that I have just explained, and this is the input embedding. What was the input embedding? What steps did we perform for it? First we performed tokenization, and right after tokenization, the second step we performed was word embedding. That is what this input embedding means — this word embedding. And what is this word embedding being added to? The positional encoding, exactly as I explained. If you look at this picture: this was our word embedding, this was our positional encoding, and our final vector is ready. This final vector becomes the input of the encoder. Now, here is our encoder.
What is inside this encoder? As I told you, there are two things in every encoder: one is the self-attention layer — this one — and the other is the feed-forward network. Now, I explained the concept of attention at the very beginning of today's class: attention means mimicking human attention, that is, focusing on the important things. That task is performed here. Multi-head attention, meaning self-attention, decides which words in our sentence 'I love reading books' have to be given more importance. That task is performed here, in the multi-head attention layer. I will explain the multi-head attention layer in detail shortly, but first let's go through the whole architecture of the encoder.
After that, we will also understand multi-head attention. First, let's take an overview of what is happening in the encoder. Inside every encoder there is a multi-head attention layer, and the second layer is the feed-forward layer. And you may have noticed that after the multi-head attention there is an add & norm layer, and after the feed-forward network there is another add & norm layer. So the layer architecture of every encoder looks like this. I will tell you shortly why we use the add & norm layer in both places, after the multi-head attention layer and after the feed-forward network. For now, just note that every encoder has a multi-head attention layer and a feed-forward layer, with an add & norm layer after each. And how many encoders are being used? Six encoders are used in the transformer.
Now, let's understand the concept of multi-head attention in detail, because this is the most important part of our transformer network. This is the multi-head attention layer, and we are working on the same example, 'I love reading books'. When this embedding was given as input to the multi-head attention layer — the combined embedding I created above — remember that the same thing has to be created for every word; I showed it for a single word, but it happens for every word. So the input to multi-head attention is this combined vector.
Now, what happens for every word? In the multi-head attention layer, three things are computed for every word. The word 'I' will have three, the word 'love' will have three, the word 'reading' will have three, and the word 'books' will have three. And what are those three? A query, a key, and a value for each word. So what I am telling you is that in the multi-head attention layer, every word in your sentence gets three vectors: a query, a key, and a value.
Now let's understand what the query, key, and value are. A query is like a word that is looking at the other words to see which one it should pay attention to. A key is like a word that is being looked at by the other words. And what remains is the value: the value means the information of that word. Don't worry, I will explain this with an example; I just wanted to give you a brief introduction to query, key, and value, and I have given you that.
Now we will continue with the same example. Our example sentence was 'I love to read books' — sorry, 'I love reading books'. So let's work on 'I love reading books'. The query, key, and value that we get for each word are computed with some maths; we are not studying that today, because the lecture would become very long and we have already covered a lot of concepts — we are still only on the first layer of the architecture. Today we are studying the rest of the things in detail, but the maths of this query, key, and value concept I will cover in a separate video. So, taking our example: for this sentence, I told you that every word will have its own separate query, its own separate key, and its own separate value.
To understand this, let us take the word 'I'. What we are doing now, with the help of the self-attention layer, is letting every word look at all the other words to see which of them should be given more attention — that is the job of the self-attention layer. What the self-attention layer tells us is which words we should give more importance to and which we should not. So the word 'I' will look at the different words in this sentence and decide which word it should give more importance to and which word less. Similarly, we then work on the word 'love': 'love' will see which of 'I', 'reading', and 'books' it should give more importance to. Similarly, the word 'reading' will see which of the other words it should give more importance to, and the word 'books' will see which of 'I', 'love', and 'reading' it should give more importance to — more attention.
So this is the task: each individual word, among the rest of the words in the sentence, will see which one it needs to give more attention to. For this, the concept of query, key, and value is used; it tells which other words each word should give more importance to. Suppose the example I am showing you is of just one word — I am telling you about 'I'. When we have the query, key, and value for each word, the self-attention layer then computes a similarity score. How is the similarity score computed? On the basis of the query of each word and the keys of all the other words. From this similarity score, we come to know which other words each word should give more attention to: for example, how much attention 'I' should give to 'love', 'reading', or 'books', and similarly how much attention 'love' should give to 'I', 'reading', or 'books'. So, on the basis of the similarity score, it is known which other words in the sentence each word should give more attention to — and the higher the score, the more attention it will give to that other word.
For example, if you look here: suppose the similarity score of 'I' and 'love' is 0.5, the similarity score of 'I' and 'reading' is 0.2, and the similarity score of 'I' and 'books' is 0.1. Which is the highest score here? The first one — which means the word 'I' will give the most attention to the word 'love'. That is the similarity score for the word 'I'; similarly, you can calculate the similarity score of every word with the rest of the words, and you will get to know which word will give more attention to which other word — for example, 'I' will give more attention to 'love'. So this is how the similarity score is calculated, and as I told you, it comes out on the basis of the query of each word and the keys of all the other words. This is how the self-attention layer works: it tells every word which other words in the sequence should be given more attention.
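Here is a minimal sketch of this mechanism — scaled dot-product self-attention — in NumPy. The projection matrices Wq, Wk, Wv are random here purely for illustration; in a real transformer they are learned during training, and multi-head attention runs several such attentions in parallel and concatenates the results.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # Every word gets a query, a key, and a value vector.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Similarity scores: each word's query dotted with every word's key,
    # scaled by sqrt(d_k) as in the paper.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns the scores into attention weights per word.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: attention-weighted mix of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))                # 4 tokens, d_model = 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)        # shape (4, 4)
```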
Okay, after that, if you look at the architecture, what have we understood so far? We have already understood this part; we have also understood positional encoding; and from this part we have come to know, on the basis of the query, key, and value, which other words each word should give attention to. After that, we have the addition and normalization (add & norm) layer. What does it do? You can see that data is also being brought here directly from before the layer. If you have read about residual nets, these are residual connections: whatever the old output is — the input to the current layer — we add it to the output of the current layer. Our current layer here is the multi-head attention layer, so we take the output of the multi-head attention layer and perform element-wise addition with our original input. With this, new information gets added to what we have, and the original information also remains — both kinds of information stay with us. That is how the addition works. After that, the normalization: we add normalization so that when we train, our data remains in the same range. Everywhere you see an add & norm layer — here, here, and here — the same task is being done: we add the output of the current layer to the previous output so that we get new information while the old information remains, and we use the normalization so that our data stays in the same range. So I will not explain this layer again and again.
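A sketch of this add & norm step: a residual addition of the sub-layer's input to its output, followed by layer normalization. I have omitted LayerNorm's learned scale and shift parameters for brevity; this is an illustration, not a full implementation.

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    # Residual connection: keep the original information and add the
    # new information produced by the current sub-layer.
    y = x + sublayer_out
    # Layer normalization: rescale each token's vector so the values
    # stay in a stable range during training.
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)
```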
After we have added the output of the current layer and the previous output, and our data has been normalized, the result becomes the input to the feed-forward network. After the feed-forward network, we again have an add & norm layer, which performs the same task. So this is the work of the encoder. I told you about a single encoder, and there are six like this: the output of the first encoder becomes the input of the second encoder, and in the same way all six encoders perform the task.
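Putting the pieces together, here is a schematic of one encoder layer and the stack of six. It reuses the add_and_norm and self_attention sketches (and the toy Wq, Wk, Wv weights) from above, plus a toy feed-forward network with random weights; the point is the layer ordering, not a faithful implementation.

```python
import numpy as np

rng2 = np.random.default_rng(1)
d_model, d_ff = 4, 16
W1 = rng2.normal(size=(d_model, d_ff))  # toy feed-forward weights
W2 = rng2.normal(size=(d_ff, d_model))

def feed_forward(x):
    # Position-wise feed-forward network: two linear layers with ReLU.
    return np.maximum(0, x @ W1) @ W2

def encoder_layer(x):
    # Sub-layer 1: self-attention, then add & norm.
    x = add_and_norm(x, self_attention(x, Wq, Wk, Wv))
    # Sub-layer 2: feed-forward network, then add & norm.
    return add_and_norm(x, feed_forward(x))

def encoder_stack(x, num_layers=6):
    # The output of each encoder becomes the input of the next.
    for _ in range(num_layers):
        x = encoder_layer(x)
    return x
```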
Okay, after that, if you look at this side of the diagram, this part here is our decoder. Just as we have six encoders, there are six decoders inside the transformer network. So what is the first layer of the decoder? It is masked multi-head attention. You can see that we had an attention layer in the encoder, and we also have attention layers in the decoder — but there are two types of attention in the decoder. One is multi-head attention, the same as in the encoder. The other is masked multi-head attention; this is new, and it exists only in the decoder. Now, the output of the encoder will become an input to the decoder, and the first layer of the decoder is this masked multi-head attention.
Now, what does masked multi-head attention mean? It means that our decoder will generate one word at a time. Remember what I told you at the beginning of the class: the task we are building this transformer network for is language translation. We had 'I love reading books', and we want to translate it into Hindi. So the decoder will generate one word at a time. Suppose the decoder generates the first word, the Hindi word for 'me'. That first generated word then becomes the input of the masked multi-head attention: on the basis of the words that have already been generated, masked multi-head attention helps decide what the next word will be, and the decoder generates only one word at a time. Now we need the second word, so once the input arrives here, all the decoder's layers run on it and the next word is generated: the input comes in, some output is created here, it goes through the model, and finally we get the next word. Suppose the next word corresponds to 'reading'; then that word also comes back here as input. On the basis of the words generated so far, our masked multi-head attention helps generate the third word; then the third word also comes back here, and on the basis of these three words our model tells the next word. So this is how the decoder works.
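The 'mask' here is a causal mask: each position is allowed to look only at the words generated so far, never at future positions. A minimal sketch of how such a mask is built, assuming the same attention scores as in the encoder sketch:

```python
import numpy as np

def causal_mask(seq_len):
    # Upper-triangular mask: position i must not attend to positions > i.
    # Masked entries get -inf so that softmax gives them zero weight.
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)

scores = np.zeros((4, 4))            # stand-in attention scores
masked_scores = scores + causal_mask(4)
# Row 0 may attend only to token 0, row 1 to tokens 0-1, and so on.
```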
What do we have to remember? A decoder is used to generate output, and we have six decoders. There are two types of attention layers in the decoder, one of them masked — and the masked one's only task is to generate the next word on the basis of the words that have been generated so far. After that there is an add & norm layer, which I have already explained. After that there is a multi-head attention layer; these layers work just like the encoder's attention layer worked, deciding which words should be given importance and what the sequence of words should be — which word should come first and which word after. And after that there is again an add & norm layer and a feed-forward layer.
And finally, at the very end, you see there is a softmax layer. What this softmax layer does is provide a number — a probability — for each word, and the word with the higher probability will be the next word in the sequence. So suppose, after the first word, the probability of 'reading' is 0.5 but the probability of 'like' is 0.4 — I am writing random values — and 'books' also has some probability. The word with the higher probability becomes the next word in the sequence. This is how it is decided: in the final layer, the softmax layer assigns a value, a probability, to each word, and the one with the higher value becomes the next word. This is how you get the output from the decoder.
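A sketch of this final step: softmax turns the decoder's raw scores over the vocabulary into probabilities, and with greedy decoding the highest-probability word is chosen as the next word. The vocabulary and scores below are made up for illustration.

```python
import numpy as np

vocab = ["reading", "like", "books", "me", "to"]   # toy vocabulary
logits = np.array([2.1, 0.4, 1.0, 0.2, 0.5])       # made-up decoder scores

# Softmax: convert the scores into a probability for each word.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding: the word with the highest probability is the next word.
next_word = vocab[int(np.argmax(probs))]
print(next_word)  # 'reading'
```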
These are the basics of transformers. I hope you understood this video, and if you found my content helpful, please like and subscribe to my channel. Thank you for watching.