0:02 Let's understand vision
0:04 transformers. We first divide the image
0:07 into sub images known as patches. This
0:09 image patch is nothing but the pixel
0:11 values of that area of the image. You
0:14 can see the pixel values of a single patch.
0:17 However, since this is an RGB image,
0:19 instead of a single 2D array we have
0:21 three channels for each patch. The
0:22 problem with these patches is that the
0:26 pixel values range from 0 to 255.
0:28 We simply normalize these values and now
0:30 the input image is ready to be fed into
0:33 the vision transformer. Remember, here we
0:35 are using a patch size of 8 by 8 with a
0:37 total of 64 patches for clear
0:40 visualization. However, in the actual
0:42 vision transformer paper, the patch
0:45 size is 16 by 16.
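Here is a minimal sketch of the patching and normalization step described above, assuming an illustrative 64 by 64 RGB image and the 8 by 8 patch size used in the visualization (the array shapes and variable names are mine, not from the video):

```python
import numpy as np

# Illustrative input: a 64x64 RGB image, so an 8x8 patch size gives 64 patches.
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
patch_size = 8

# Normalize the 0-255 pixel values to the 0-1 range.
image = image.astype(np.float32) / 255.0

# Cut the image into non-overlapping patches; each patch keeps its 3 channels.
h, w, c = image.shape
patches = image.reshape(h // patch_size, patch_size,
                        w // patch_size, patch_size, c)
patches = patches.transpose(0, 2, 1, 3, 4)                # patch grid: (8, 8, 8, 8, 3)
patches = patches.reshape(-1, patch_size, patch_size, c)  # (64, 8, 8, 3)
print(patches.shape)                                      # 64 patches of 8x8x3
```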
0:46 To convert it into a one-dimensional array, we need to
0:49 flatten all three normalized channels of
0:51 the patch and obtain a vector. We do
0:52 this for all the patches in the image
0:54 and obtain a linear vector for each
0:57 patch.
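Continuing the same sketch, each normalized patch is flattened across all three channels into one vector; the length 192 is just 8 x 8 x 3 from the illustrative shapes above:

```python
# Flatten every 8x8x3 patch into a single vector of length 8 * 8 * 3 = 192.
flat_patches = patches.reshape(patches.shape[0], -1)
print(flat_patches.shape)   # (64, 192): one linear vector per patch
```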
0:59 Let's rearrange these vectors for better visualization. Instead of using
1:01 these normalized pixel values directly
1:03 after flattening them, we transform them
1:06 into embedding vectors for each patch.
1:08 This process involves taking each input
1:10 and passing it through a neural network
1:12 to obtain an embedding vector one by one
1:15 for each patch. Now, these embedding
1:16 vectors can be treated as word embeddings.
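A sketch of this patch-embedding step, assuming the flattened 192-dimensional patches from above and an illustrative 128-dimensional embedding size; in the original ViT paper this is a single learned linear projection:

```python
import torch
import torch.nn as nn

num_patches, patch_dim, embed_dim = 64, 192, 128        # illustrative sizes
flat_patches = torch.rand(num_patches, patch_dim)       # flattened, normalized patches

# A learned linear layer maps each flattened patch to an embedding vector.
patch_embedding = nn.Linear(patch_dim, embed_dim)
embeddings = patch_embedding(flat_patches)              # (64, 128): one embedding per patch
```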
1:19 For the attention mechanism,
1:21 we take the embedding vector. We make
1:23 three copies of it to feed into the
1:26 query, key, and value matrices. We get the
1:29 output query, key and value vectors for
1:31 the attention layer. This is a simple
1:33 process. Here each embedding vector can
1:36 be processed in parallel.
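A minimal sketch of producing the query, key, and value vectors, again with the illustrative 128-dimensional embeddings; the three linear layers here stand in for the query, key, and value matrices:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 64, 128
embeddings = torch.rand(num_patches, embed_dim)   # one embedding vector per patch

# Three separate learned projections play the role of the query, key, and value matrices.
W_q = nn.Linear(embed_dim, embed_dim)
W_k = nn.Linear(embed_dim, embed_dim)
W_v = nn.Linear(embed_dim, embed_dim)

# The same embeddings are fed to all three, and every patch is processed in parallel.
Q = W_q(embeddings)   # (64, 128) query vectors
K = W_k(embeddings)   # (64, 128) key vectors
V = W_v(embeddings)   # (64, 128) value vectors
```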
1:38 However, this parallelism creates a problem. I mean,
1:40 how will the attention mechanism know
1:42 which patch it is processing? The first
1:45 patch, the 10th patch, or the last
1:47 patch? Just as in sentences, where the
1:49 position of a word is important for
1:51 understanding the meaning, and changing the
1:52 position can alter the meaning of the
1:55 sentence even with the same words, the
1:57 position of each patch is also important
1:59 for understanding the whole image. But how
2:01 do we feed position information to the
2:03 attention part of the transformer? For
2:05 this, we add positional encoding to the
2:07 embedding vector to incorporate position
2:09 information into the vector. After
2:11 adding position encoding, we get the
2:13 final vector to feed into the attention
2:16 block.
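A sketch of this step, assuming one learnable position vector per patch position (the ViT paper uses learned position embeddings; fixed sinusoidal encodings are another common choice):

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 64, 128
embeddings = torch.rand(num_patches, embed_dim)           # patch embeddings from before

# One learnable position vector per patch position (zero-initialized just for the sketch).
pos_embedding = nn.Parameter(torch.zeros(num_patches, embed_dim))

# Adding the position information gives the final vectors for the attention block.
x = embeddings + pos_embedding                            # (64, 128)
```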
2:18 Now the data seems ready for the attention block to develop relationships
2:20 between patches. We apply multiple
2:22 attention blocks and then obtain the
2:24 output. After these attention layers,
2:28 the final output is ready. At the end, we
2:30 apply a neural network with softmax for
2:32 classification. One way is to take the
2:34 last patch embedding and feed it to the
2:36 classification network just like we do
2:38 in text for next token generation. But
2:40 actually we add an extra learnable
2:43 embedding vector. The goal of this extra
2:45 embedding vector is to gather all the
2:46 important information from other patches
2:48 using the attention mechanism for
2:51 classification. Then at the end we feed
2:53 this vector to the classification layer
2:55 to get the softmax distribution for the
2:57 class label.
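A sketch of the extra learnable embedding (the class token) and the classification head, with an illustrative 10-class output; the attention blocks themselves are elided:

```python
import torch
import torch.nn as nn

num_patches, embed_dim, num_classes = 64, 128, 10        # illustrative sizes
patch_embeddings = torch.rand(num_patches, embed_dim)

# The extra learnable embedding vector is prepended to the patch embeddings and
# gathers information from all patches through the attention layers.
cls_token = nn.Parameter(torch.zeros(1, embed_dim))
x = torch.cat([cls_token, patch_embeddings], dim=0)      # (65, 128)

# ... x would pass through the attention blocks here ...

# At the end, only the class token's output vector is fed to the classification layer.
classifier = nn.Linear(embed_dim, num_classes)
logits = classifier(x[0])
probs = torch.softmax(logits, dim=-1)                    # softmax distribution over classes
```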
2:59 Now, what is the purpose of the attention block here? If we take the
3:01 first patch of the image, we calculate
3:03 the attention of the first patch with
3:06 all other patches. This way we can find
3:08 out how the first patch relates to other
3:10 patches in the image. The same is the
3:12 case with other patches. This helps the
3:14 model understand how different parts of
3:16 the image are related to each other,
3:17 enabling it to develop an overall
3:19 understanding of the information in the
3:22 image. Now, if we take the first and
3:23 last patch of the image, we know that
3:25 the attention between the first and last
3:27 patch is calculated in the attention
3:29 layer, just as it is for all the other
3:31 patches. So, we can say that this
3:33 attention mechanism in vision
3:35 transformers gives it a global receptive field.
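For reference, a sketch of the attention computation being described, reusing the illustrative query, key, and value shapes from earlier (single head, no batching); row 0 of the attention weights is exactly the attention of the first patch with every other patch, including the last one:

```python
import torch

num_patches, embed_dim = 64, 128
Q = torch.rand(num_patches, embed_dim)     # query vectors, one per patch
K = torch.rand(num_patches, embed_dim)     # key vectors
V = torch.rand(num_patches, embed_dim)     # value vectors

# Scaled dot-product attention: every patch attends to every other patch.
scores = Q @ K.T / embed_dim ** 0.5        # (64, 64) attention scores
weights = torch.softmax(scores, dim=-1)    # row i: how patch i attends to all patches
output = weights @ V                       # (64, 128)

# weights[0] holds the attention of the first patch with all other patches,
# which is what gives the vision transformer its global receptive field.
```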
3:37 In convolutional neural networks,
3:40 we have local receptive fields. So they
3:42 have a built-in inductive bias. In this
3:44 image, you can see the texture of the
3:46 cat because texture is a local feature.
3:48 Convolutional neural networks may
3:50 predict the class of the image using the
3:52 texture of the object. As explained
3:54 earlier, vision transformers with a
3:56 global receptive field focus more on
3:58 global features like the shape of the
4:01 object when classifying it. Remember
4:03 these images are not extracted from an
4:06 actual vision transformer or CNN. I just
4:07 used them to illustrate the difference
4:10 between CNNs and vision transformers.
4:12 This was the image we started with. We
4:15 have a fixed patch size of 16 by 16 because
4:17 increasing it to a higher value would
4:18 prevent vision transformers from gaining
4:21 a good understanding of the patch. So
4:24 the image size should be 128 by 128 for 64
4:28 patches. There will be a total of 4,096
4:30 attention values calculated at each
4:32 layer. If the image size is now 256
4:35 by 256, you can see the number of attention
4:37 values increases by a huge margin. Now the image
4:41 size is increased to 512. For an image size of
4:43 2048, you can see that the total number
4:45 of attention values calculated by each layer
4:47 increases, making this approach less
4:49 suitable for high-resolution images.
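A quick sketch of the arithmetic behind this scaling, with the patch size fixed at 16 by 16 and the image sizes mentioned above; each patch attends to every patch, so the count grows with the square of the number of patches:

```python
# Number of attention values per layer as the image size grows (patch size fixed at 16).
patch_size = 16
for image_size in [128, 256, 512, 2048]:
    num_patches = (image_size // patch_size) ** 2      # patches per image
    attention_values = num_patches ** 2                # every patch attends to every patch
    print(f"{image_size}x{image_size}: {num_patches} patches, {attention_values:,} attention values")

# 128x128:     64 patches,       4,096 attention values
# 256x256:    256 patches,      65,536 attention values
# 512x512:  1,024 patches,   1,048,576 attention values
# 2048x2048: 16,384 patches, 268,435,456 attention values
```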
4:51 But what do you think is an alternative to
4:52 this vision transformer? Vision
4:54 transformers cannot compete with
4:56 convolutional neural networks when there
4:58 is a small amount of data. Vision
5:00 transformers require large data sets to
5:01 perform well and compete with