0:02 Let's understand vision
0:04 transformers. We first divide the image
0:07 into sub images known as patches. This
0:09 image patch is nothing but the pixel
0:11 values of that area of the image. You
0:14 can see the pixel values of a single patch.
0:17 However, since this is an RGB image,
0:19 instead of a single 2D array we have
0:21 three channels for each patch. The
0:22 problem with these patches is that the
0:26 pixel values range from 0 to 255.
0:28 We simply normalize these values and now
0:30 the input image is ready to be fed into
0:33 the vision transformer. Remember, here we
0:35 are using a patch size of 8 by 8 with a
0:37 total of 64 patches for clear
0:40 visualization. However, in the actual
0:42 vision transformer paper, the patch
0:45 size is 16 by 16.
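Here is a minimal sketch of the patching and normalization step described above, assuming an illustrative 64 by 64 RGB image and the 8 by 8 patch size used in the visualization (the array shapes and variable names are mine, not from the video):

```python
import numpy as np

# Illustrative input: a 64x64 RGB image, so an 8x8 patch size gives 64 patches.
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
patch_size = 8

# Normalize the 0-255 pixel values to the 0-1 range.
image = image.astype(np.float32) / 255.0

# Cut the image into non-overlapping patches; each patch keeps its 3 channels.
h, w, c = image.shape
patches = image.reshape(h // patch_size, patch_size,
                        w // patch_size, patch_size, c)
patches = patches.transpose(0, 2, 1, 3, 4)                # patch grid: (8, 8, 8, 8, 3)
patches = patches.reshape(-1, patch_size, patch_size, c)  # (64, 8, 8, 3)
print(patches.shape)                                      # 64 patches of 8x8x3
```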
0:46 To convert it into a one-dimensional array, we need to
0:49 flatten all three normalized channels of
0:51 the patch and obtain a vector. We do
0:52 this for all the patches in the image
0:54 and obtain a linear vector for each
0:57 patch.
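Continuing the same sketch, each normalized patch is flattened across all three channels into one vector; the length 192 is just 8 x 8 x 3 from the illustrative shapes above:

```python
# Flatten every 8x8x3 patch into a single vector of length 8 * 8 * 3 = 192.
flat_patches = patches.reshape(patches.shape[0], -1)
print(flat_patches.shape)   # (64, 192): one linear vector per patch
```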
0:59 Let's rearrange these vectors for better visualization. Instead of using
1:01 these normalized pixel values directly
1:03 after flattening them, we transform them
1:06 into embedding vectors for each patch.
1:08 This process involves taking each input
1:10 and passing it through a neural network
1:12 to obtain an embedding vector one by one
1:15 for each patch. Now, these embedding
1:16 vectors can be treated as word embeddings.
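A sketch of this patch-embedding step, assuming the flattened 192-dimensional patches from above and an illustrative 128-dimensional embedding size; in the original ViT paper this is a single learned linear projection:

```python
import torch
import torch.nn as nn

num_patches, patch_dim, embed_dim = 64, 192, 128        # illustrative sizes
flat_patches = torch.rand(num_patches, patch_dim)       # flattened, normalized patches

# A learned linear layer maps each flattened patch to an embedding vector.
patch_embedding = nn.Linear(patch_dim, embed_dim)
embeddings = patch_embedding(flat_patches)              # (64, 128): one embedding per patch
```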
1:19 For the attention mechanism,
1:21 we take the embedding vector. We make
1:23 three copies of it to feed into the
1:26 query, key, and value matrices. We get the
1:29 output query, key and value vectors for
1:31 the attention layer. This is a simple
1:33 process. Here each embedding vector can
1:36 be processed in parallel.
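A minimal sketch of producing the query, key, and value vectors, again with the illustrative 128-dimensional embeddings; the three linear layers here stand in for the query, key, and value matrices:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 64, 128
embeddings = torch.rand(num_patches, embed_dim)   # one embedding vector per patch

# Three separate learned projections play the role of the query, key, and value matrices.
W_q = nn.Linear(embed_dim, embed_dim)
W_k = nn.Linear(embed_dim, embed_dim)
W_v = nn.Linear(embed_dim, embed_dim)

# The same embeddings are fed to all three, and every patch is processed in parallel.
Q = W_q(embeddings)   # (64, 128) query vectors
K = W_k(embeddings)   # (64, 128) key vectors
V = W_v(embeddings)   # (64, 128) value vectors
```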
1:38 However, this parallelism creates a problem. I mean,
1:40 how will the attention mechanism know
1:42 which patch it is processing? The first
1:45 patch, the 10th patch, or the last
1:47 patch? Just as in sentences, where the
1:49 position of a word is important for
1:51 understanding the meaning, and changing the
1:52 position can alter the meaning of the
1:55 sentence even with the same words, the
1:57 position of each patch is also important
1:59 for understanding the whole image. But how
2:01 do we feed position information to the
2:03 attention part of the transformer? For
2:05 this, we add positional encoding to the
2:07 embedding vector to incorporate position
2:09 information into the vector. After
2:11 adding position encoding, we get the
2:13 final vector to feed into the attention
2:16 block.
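A sketch of this step, assuming one learnable position vector per patch position (the ViT paper uses learned position embeddings; fixed sinusoidal encodings are another common choice):

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 64, 128
embeddings = torch.rand(num_patches, embed_dim)           # patch embeddings from before

# One learnable position vector per patch position (zero-initialized just for the sketch).
pos_embedding = nn.Parameter(torch.zeros(num_patches, embed_dim))

# Adding the position information gives the final vectors for the attention block.
x = embeddings + pos_embedding                            # (64, 128)
```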
2:18 Now the data seems ready for the attention block to develop relationships
2:20 between patches. We apply multiple
2:22 attention blocks and then obtain the
2:24 output. After these attention layers,
2:28 the final output is ready. At the end, we
2:30 apply a neural network with softmax for
2:32 classification. One way is to take the
2:34 last patch embedding and feed it to the
2:36 classification network just like we do
2:38 in text for next token generation. But
2:40 actually we add an extra learnable
2:43 embedding vector. The goal of this extra
2:45 embedding vector is to gather all the
2:46 important information from other patches
2:48 using the attention mechanism for
2:51 classification. Then at the end we feed
2:53 this vector to the classification layer
2:55 to get the softmax distribution for the
2:57 class label.
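A sketch of the extra learnable embedding (the class token) and the classification head, with an illustrative 10-class output; the attention blocks themselves are elided:

```python
import torch
import torch.nn as nn

num_patches, embed_dim, num_classes = 64, 128, 10        # illustrative sizes
patch_embeddings = torch.rand(num_patches, embed_dim)

# The extra learnable embedding vector is prepended to the patch embeddings and
# gathers information from all patches through the attention layers.
cls_token = nn.Parameter(torch.zeros(1, embed_dim))
x = torch.cat([cls_token, patch_embeddings], dim=0)      # (65, 128)

# ... x would pass through the attention blocks here ...

# At the end, only the class token's output vector is fed to the classification layer.
classifier = nn.Linear(embed_dim, num_classes)
logits = classifier(x[0])
probs = torch.softmax(logits, dim=-1)                    # softmax distribution over classes
```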
2:59 Now, what is the purpose of the attention block here? If we take the
3:01 first patch of the image, we calculate
3:03 the attention of the first patch with
3:06 all other patches. This way we can find
3:08 out how the first patch relates to other
3:10 patches in the image. The same is the
3:12 case with other patches. This helps the
3:14 model understand how different parts of
3:16 the image are related to each other,
3:17 enabling it to develop an overall
3:19 understanding of the information in the
3:22 image. Now, if we take the first and
3:23 last patch of the image, we know that
3:25 the attention between the first and last
3:27 patch is calculated in the attention
3:29 layer, just as it is for all the other
3:31 patches. So, we can say that this
3:33 attention mechanism in vision
3:35 transformers gives it a global receptive field.
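For reference, a sketch of the attention computation being described, reusing the illustrative query, key, and value shapes from earlier (single head, no batching); row 0 of the attention weights is exactly the attention of the first patch with every other patch, including the last one:

```python
import torch

num_patches, embed_dim = 64, 128
Q = torch.rand(num_patches, embed_dim)     # query vectors, one per patch
K = torch.rand(num_patches, embed_dim)     # key vectors
V = torch.rand(num_patches, embed_dim)     # value vectors

# Scaled dot-product attention: every patch attends to every other patch.
scores = Q @ K.T / embed_dim ** 0.5        # (64, 64) attention scores
weights = torch.softmax(scores, dim=-1)    # row i: how patch i attends to all patches
output = weights @ V                       # (64, 128)

# weights[0] holds the attention of the first patch with all other patches,
# which is what gives the vision transformer its global receptive field.
```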
3:37 In convolutional neural networks,
3:40 we have local receptive fields. So they
3:42 have a built-in inductive bias. In this
3:44 image, you can see the texture of the
3:46 cat because texture is a local feature.
3:48 Convolutional neural networks may
3:50 predict the class of the image using the
3:52 texture of the object. As explained
3:54 earlier, vision transformers with a
3:56 global receptive field focus more on
3:58 global features like the shape of the
4:01 object when classifying it. Remember
4:03 these images are not extracted from an
4:06 actual vision transformer or CNN. I just
4:07 used them to illustrate the difference
4:10 between CNNs and vision transformers.
4:12 This was the image we started with. We
4:15 have a fixed patch size of 16 by 16 because
4:17 increasing it to a higher value would
4:18 prevent vision transformers from gaining
4:21 a good understanding of the patch. So
4:24 the image size should be 128 by 128 for 64
4:28 patches. There will be a total of 4,096
4:30 attention values calculated at each
4:32 layer. If the image size is now 256
4:35 by 256, you can see the number of attention
4:37 values increases by a huge margin. Now the image
4:41 size is increased to 512. For an image size of
4:43 2048, you can see that the total number
4:45 of attention values calculated by each layer
4:47 increases, making this approach less
4:49 suitable for high-resolution images.
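A quick sketch of the arithmetic behind this scaling, with the patch size fixed at 16 by 16 and the image sizes mentioned above; each patch attends to every patch, so the count grows with the square of the number of patches:

```python
# Number of attention values per layer as the image size grows (patch size fixed at 16).
patch_size = 16
for image_size in [128, 256, 512, 2048]:
    num_patches = (image_size // patch_size) ** 2      # patches per image
    attention_values = num_patches ** 2                # every patch attends to every patch
    print(f"{image_size}x{image_size}: {num_patches} patches, {attention_values:,} attention values")

# 128x128:     64 patches,       4,096 attention values
# 256x256:    256 patches,      65,536 attention values
# 512x512:  1,024 patches,   1,048,576 attention values
# 2048x2048: 16,384 patches, 268,435,456 attention values
```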
4:51 But what do you think is an alternative to
4:52 this vision transformer? Vision
4:54 transformers cannot compete with
4:56 convolutional neural networks when there
4:58 is a small amount of data. Vision
5:00 transformers require large data sets to
5:01 perform well and compete with