YouTube Transcript:
Vision Transformer
Let's understand vision transformers. We first divide the image into sub-images known as patches. An image patch is nothing but the pixel values of that area of the image; you can see the pixel values of a single patch. However, since this is an RGB image, instead of a single 2D array we have three channels for each patch. The problem with these patches is that the pixel values range from 0 to 255.
We simply normalize these values, and now the input image is ready to be fed into the vision transformer. Remember, here we are using a patch size of 8 by 8 with a total of 64 patches for clear visualization; however, in the actual vision transformer paper, the patch size is 16 by 16. To convert a patch into a one-dimensional array, we flatten all three normalized channels of the patch and obtain a vector. We do this for all the patches in the image and obtain a linear vector for each patch.
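As a rough illustration of this step, here is a minimal NumPy sketch of splitting an image into patches, normalizing the pixel values, and flattening each patch. The 64x64 dummy image, the 8-pixel patch size, and all variable names are illustrative assumptions, not taken from the video.

```python
import numpy as np

patch_size = 8                      # the video uses 8x8; the ViT paper uses 16x16
# Dummy 64x64 RGB image standing in for the real input image.
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

image = image.astype(np.float32) / 255.0          # normalize pixel values to [0, 1]

h, w, c = image.shape
n_rows, n_cols = h // patch_size, w // patch_size  # 8 x 8 = 64 patches for a 64x64 image

patches = []
for i in range(n_rows):
    for j in range(n_cols):
        patch = image[i * patch_size:(i + 1) * patch_size,
                      j * patch_size:(j + 1) * patch_size, :]   # one (8, 8, 3) block
        patches.append(patch.reshape(-1))                       # flatten to a 192-dim vector

patches = np.stack(patches)         # (64, 192): one flat vector per patch
print(patches.shape)
```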
Let's rearrange these vectors for better visualization. Instead of using these normalized pixel values directly after flattening them, we transform them into embedding vectors, one for each patch. This process involves taking each flattened patch and passing it through a neural network to obtain an embedding vector, one by one for each patch. Now these embedding vectors can be treated like word embeddings.
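Here is a minimal sketch of that embedding step. In the ViT paper this "neural network" is a single learned linear projection; the flattened (64, 192) patches from the previous snippet are recreated as a random stand-in, and `W_embed`, `b_embed`, and the 128-dimensional embedding size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.random((64, 192))                 # stand-in for the flattened patches above
embed_dim = 128                                 # illustrative embedding size

W_embed = rng.normal(0, 0.02, size=(192, embed_dim))   # learnable projection weights
b_embed = np.zeros(embed_dim)

patch_embeddings = patches @ W_embed + b_embed  # (64, 128): one embedding per patch
print(patch_embeddings.shape)
```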
For the attention mechanism, we take the embedding vector and make three copies of it to feed into the query, key, and value matrices. We get the output query, key, and value vectors for the attention layer. This is a simple process, and each embedding vector can be processed in parallel.
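A minimal single-head self-attention sketch of this step, again in NumPy. The patch embeddings are a random stand-in for the previous snippet's output, and `W_q`, `W_k`, `W_v` are illustrative learned projection matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
patch_embeddings = rng.random((64, 128))   # stand-in for the patch embeddings above
d = patch_embeddings.shape[1]

W_q = rng.normal(0, 0.02, size=(d, d))     # learned query projection (illustrative)
W_k = rng.normal(0, 0.02, size=(d, d))     # learned key projection
W_v = rng.normal(0, 0.02, size=(d, d))     # learned value projection

Q = patch_embeddings @ W_q                 # one query vector per patch
K = patch_embeddings @ W_k                 # one key vector per patch
V = patch_embeddings @ W_v                 # one value vector per patch

scores = Q @ K.T / np.sqrt(d)              # (64, 64): every patch scores every patch
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension

attended = weights @ V                     # (64, 128): attention output per patch
print(attended.shape)
```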
However, this parallelism creates a problem: how will the attention mechanism know which patch it is processing, the first patch, the tenth patch, or the last patch? Just like in sentences, where the position of a word is important to understand the meaning and changing the position can alter the meaning of the sentence even with the same words, the position of each patch is also important to understand the whole image. But how do we feed position information to the attention part of the transformer? For this, we add a positional encoding to the embedding vector to incorporate position information into it. After adding the positional encoding, we get the final vector to feed into the attention block.
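A minimal sketch of this addition, assuming learnable positional embeddings as in the ViT paper; `pos_embed` is randomly initialized here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
patch_embeddings = rng.random((64, 128))   # stand-in for the patch embeddings above
n_patches, d = patch_embeddings.shape

pos_embed = rng.normal(0, 0.02, size=(n_patches, d))  # one learnable vector per position
tokens = patch_embeddings + pos_embed      # (64, 128): position-aware vectors for attention
```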
Now the data seems ready for the attention block to develop relationships between patches. We apply multiple attention blocks and then obtain the output. After these attention layers, the final output is ready. At the end, we apply a neural network with a softmax for classification. One way is to take the last patch embedding and feed it to the classification network, just like we do in text for next-token generation. But instead, we actually add an extra learnable embedding vector. The goal of this extra embedding vector is to gather all the important information from the other patches, using the attention mechanism, for classification. Then at the end we feed this vector to the classification layer to get the softmax distribution over the class labels.
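A minimal sketch of the extra learnable classification token and the final classification head; `cls_token`, `W_cls`, and the 10-class output are illustrative assumptions, and the stacked attention blocks are elided.

```python
import numpy as np

rng = np.random.default_rng(3)
tokens = rng.random((64, 128))             # stand-in for the position-aware patch vectors
d = tokens.shape[1]
num_classes = 10                           # illustrative number of classes

cls_token = rng.normal(0, 0.02, size=(1, d))              # extra learnable embedding vector
sequence = np.concatenate([cls_token, tokens], axis=0)    # (65, 128): [CLS] + 64 patches

# ...the sequence would pass through the stacked attention blocks here...
cls_output = sequence[0]                   # output at the [CLS] position gathers the image info

W_cls = rng.normal(0, 0.02, size=(d, num_classes))        # classification head weights
logits = cls_output @ W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax distribution over the class labels
print(probs)
```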
Now, what is the purpose of the attention block here? If we take the first patch of the image, we calculate the attention of the first patch with all the other patches. This way, we can find out how the first patch relates to the other patches in the image, and the same is the case for every other patch. This helps the model understand how different parts of the image are related to each other, enabling it to develop an overall understanding of the information in the image. Now, if we take the first and last patch of the image, we know that the attention between the first and last patch is calculated in the attention layer, and likewise for all the other patches. So we can say that the attention mechanism in vision transformers gives them a global receptive field.
In convolutional neural networks, we have local receptive fields, so they have a built-in inductive bias. In this image, you can see the texture of the cat, because texture is a local feature; convolutional neural networks may predict the class of the image using the texture of the object. As explained earlier, vision transformers, with their global receptive field, focus more on global features like the shape of the object when classifying it. Remember, these images are not extracted from an actual vision transformer or CNN; I just used them to illustrate the difference between CNNs and vision transformers.
This was the image we started with. We have a fixed patch size of 16 by 16, because increasing it to a higher value would prevent the vision transformer from gaining a good understanding of each patch. So the image size should be 128 by 128 to get 64 patches. Since every patch attends to every patch, there will be a total of 64 × 64 = 4,096 attention values calculated at each layer. If the image size is now 256 by 256, you can see the number of attention values increases by a huge margin, and it grows again when the size is increased to 512. For an image size of 2048, the total number of attention values calculated at each layer increases so much that this approach becomes less suitable for high-resolution images.
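A quick way to see this scaling: with a fixed 16x16 patch size and full self-attention, the number of attention values per layer is the number of patches squared, so it can be computed directly.

```python
# Quick arithmetic sketch: attention values per layer = (number of patches) ** 2.
patch_size = 16

for image_size in (128, 256, 512, 2048):
    n_patches = (image_size // patch_size) ** 2
    attention_values = n_patches ** 2
    print(f"{image_size}x{image_size}: {n_patches} patches, "
          f"{attention_values:,} attention values per layer")

# 128  ->     64 patches ->       4,096 attention values
# 256  ->    256 patches ->      65,536
# 512  ->  1,024 patches ->   1,048,576
# 2048 -> 16,384 patches -> 268,435,456
```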
But what do you think is an alternative to this vision transformer? Vision transformers cannot compete with convolutional neural networks when there is only a small amount of data; vision transformers require large datasets to perform well and compete with convolutional neural networks.