Attention in Vision Models: An Introduction (NPTEL-NOC IITM)
Video Transcript
Having discussed RNNs last week, we'll now move to a very contemporary topic that tries to address some of the technical limitations of RNNs: attention models. Before we go into attention models, let's discuss the question that we left behind: what do you think will happen if you train a model on normal videos and do inference on a reversed video? I hope you had a chance to think about this.

It depends on the application or task. For certain activities, say differentiating walking from jumping, it could work to a certain extent even if you tested it on a reversed video. However, for certain other activities, say a sports action such as a tennis forehand, this may not be that trivial.

An interesting related problem in this context is known as finding the "arrow of time". There are a few interesting papers in this direction, where the task at hand is to find out whether a video is playing forward or backward. This can be trivial in some cases, but it can get complex in others. If you're interested, please read the paper "Learning and Using the Arrow of Time".
So far, we have seen that RNNs can be used to efficiently model sequential data, and that RNNs use backpropagation through time as the training method. Unfortunately, RNNs suffer from the vanishing and exploding gradient problems. To handle the exploding gradient problem one can use gradient clipping, and to handle the vanishing gradient problem one can use RNN variants such as LSTMs or GRUs. This was all good: we saw how to use these for handling sequential learning problems. But the question we ask now is: is this sufficient? Are there tasks where an RNN may not be able to solve the problem? Let's find out more about this.
Let's consider a couple of popular tasks where RNNs may be useful. One is the task of image captioning: given an image, one has to generate a sequence of words to make a caption that describes the activity or the scene in the image. Another example where RNNs are extremely useful is the task of neural machine translation, or what is known as NMT. It's what you see in the translation apps you may be using: you have a sentence given in a particular language, and you have to produce the equivalent sentence in a different language. Both of these are RNN tasks.

A standard approach to handling such tasks is this: given any input, which could be video, an image, audio, or text, you first pass the input through an encoder network, which gives you a representation of that input that we call the context vector. Given this context vector, you pass it through a decoder network, which gives you your final output text. These are known as encoder-decoder models, and they're extensively used in this context.
Now, let's take a brief detour to understand encoder-decoder models a bit more. A canonical instance of such encoder-decoder models is the autoencoder; in this case, the decoder tries to reproduce the input itself, and that's the reason why it is called an autoencoder. Not all encoder-decoder models need to be autoencoders; however, the conceptual framework of encoder-decoder models comes from autoencoders, which is why we're discussing them briefly before we come back to encoder-decoder models.
An autoencoder is a neural network architecture where you have an input vector, a network which we call the encoder network, then a context vector (also called the bottleneck layer), which is a representation of the input, and then a decoder layer or network which outputs a certain vector.

In an autoencoder, we set the target values to the inputs themselves, so you're asking the network to predict the input itself. What are we really trying to learn here? We're trying to learn a function f, parameterized by weights W and biases b, such that f(x; W, b) = x; in other words, we are trying to learn the identity function itself and predict an output x̂ which is close to x.

How would you learn such a network using backpropagation? What kind of loss function would you use? It would be a mean squared error, measuring the error between x and x̂, the reconstruction produced by the autoencoder. Then you can learn the weights of the network using backpropagation, as with any other feed-forward neural network.
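To make this concrete, here is a minimal sketch of an undercomplete autoencoder trained with a mean-squared-error reconstruction loss. This is only an illustrative example, not the architecture from the lecture; the layer sizes and the use of PyTorch are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

# A minimal autoencoder: the encoder compresses x to a bottleneck code,
# the decoder mirrors it back to the input dimension (sizes are assumed).
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim),
        )
        self.decoder = nn.Sequential(           # mirror of the encoder
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
criterion = nn.MSELoss()                        # reconstruction loss ||x - x_hat||^2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                         # a dummy batch of inputs
optimizer.zero_grad()
x_hat = model(x)
loss = criterion(x_hat, x)                      # the target is the input itself
loss.backward()                                 # backpropagation as in any feed-forward net
optimizer.step()
```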
Now, the encoder and the decoder need not be just one layer; you could have several layers in the encoder and, similarly, several layers in the decoder. In the autoencoder setting, traditionally, the decoder is a mirror architecture of the encoder: if you have a set of layers in the encoder with a certain number of dimensions (a number of hidden nodes in each layer), then the decoder mirrors the same architecture the other way, to ensure that you get an output of the same dimension as the input. That's when you can actually measure the mean squared error between the reconstruction and the input. However, while this is the case for an autoencoder, not all encoder-decoder models need to have such architectures; you can have a different architecture for the encoder and a different architecture for the decoder, depending on the task.
To understand a variant of the autoencoder: a popular one is known as the denoising autoencoder. In a denoising autoencoder, you take your input data and intentionally corrupt the input vector, for example by adding something like Gaussian noise, and you get a set of corrupted values x̂_1 to x̂_n. You now pass these through your encoder to get a representation, then through the decoder, and you finally try to reconstruct the original input itself. What is the loss function here? It would again be the mean squared error, but this time between your output and the original, uncorrupted input. What are we trying to do here? We are trying to ensure that the autoencoder generalizes well at the end of training, so that even if there is some noise in the input, the autoencoder is able to recover your original data.
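As a small illustration, here is a sketch of that denoising setup, under the assumption of additive Gaussian corruption and a tiny placeholder network; the corrupted input goes through the network, while the loss is computed against the clean input.

```python
import torch
import torch.nn as nn

# A tiny placeholder encoder-decoder (dimensions are assumptions).
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                      # clean inputs
x_noisy = x + 0.1 * torch.randn_like(x)      # corrupt with additive Gaussian noise

optimizer.zero_grad()
x_hat = model(x_noisy)                       # reconstruct from the corrupted input
loss = criterion(x_hat, x)                   # MSE against the clean, uncorrupted input
loss.backward()
optimizer.step()
```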
With that introduction to autoencoders, let's ask one question. In all the architectures that we saw so far with autoencoders, the hidden layers were always smaller in size (in dimension) when compared to the input layer. Is this always necessary? Can you go larger?

Autoencoders where the hidden layers have a smaller dimension than the input layer are called undercomplete autoencoders. You could say that such autoencoders learn a lower-dimensional representation, on a suitable manifold of the input data, from which the decoder can reconstruct your original input. On the other hand, if you had an autoencoder architecture where the hidden layer dimension is larger than your input, you would call it an overcomplete autoencoder. While technically this is possible, the limitation is that the autoencoder could blindly copy inputs into certain dimensions of that larger hidden layer and still be able to reconstruct, which means such an overcomplete autoencoder can learn trivial solutions which don't really give you useful performance: it may simply memorize the inputs and copy them back to the output layer.

Then the question is: are all autoencoders also dimensionality reduction methods, assuming we are talking about undercomplete autoencoders? Partially yes; largely speaking, autoencoders can be used as dimensionality reduction techniques. A follow-up question then is: can an autoencoder be considered similar to principal component analysis, which is a popular dimensionality reduction method? The answer is actually yes, but I'm going to leave this for you as homework: work out the connection to PCA.
Let's now come back to what we were talking about: one of the tasks for RNNs, namely neural machine translation, or NMT. These kinds of encoder-decoder models are also called sequence-to-sequence models, especially when the input is a sequence and the output is also a sequence. Suppose you had an input sentence which says "India got its independence from the British", and we now want to translate this English sentence to Hindi. What you would do is have an encoder network, a recurrent neural network (RNN), where each word of your input sentence is given at one time step of the RNN, and the final output of the RNN would be what we call a context vector. This context vector is fed into a decoder RNN, which gives you the translated Hindi sentence word by word, followed by an end-of-sentence token. This is what we saw as a many-to-many RNN last week.
Why aren't we giving an output at each time step of the encoder RNN? For a machine translation task, if you recall the recommended architecture, we said that it's wiser to read the full sentence and only then start giving the output of the translated sentence. Why so? Because different languages have different grammars and sentence constructions: the first word in English need not correspond to the first word in Hindi, and the Hindi sentence may not follow exactly the same sequence of words as the English one because of grammatical conventions. That's the reason why, in machine translation tasks, you generally read the entire input sentence, obtain a context vector, and then produce the entire translated output.

Similarly, if you considered the image captioning task, you would have an image, and in this case your encoder would be a CNN followed by, say, a fully connected network, out of which you get a representation or a context vector. This context vector goes to a decoder which outputs the caption, "a woman ... in the park", followed by an end-of-sentence token.

So what's the problem? This seems to work well; is there a problem at all? Let's analyze this a bit more closely.
In an RNN, the hidden states are responsible for storing relevant input information. So you could say that the hidden state at time step t, h_t, is a compressed form of all previous inputs: that hidden state represents some information from all the previous inputs which is required for processing in that state as well as in future states.

Now let's consider a longer sequence. If you consider language processing and a large paragraph, and your input is very long, can h_t, the hidden state at any time step, encode all of this information? Not really; you may be faced with the information bottleneck problem in this kind of context. So if you considered a sentence such as the one shown here, which has to be translated to German, can we guarantee that words seen at earlier time steps can be reproduced at later time steps? Remember, when you go from a language such as English to a language such as German, the positions of the verbs and the nouns may all change, and to reproduce this, a word that appears early in the English sentence may have to be produced much later in the German one. Is this possible? Unfortunately, RNNs don't work that well when you have such long sequences.

The same holds for image captioning and related problems such as visual question answering, which we will see later. If you had the image that we saw at the very beginning of this course and we asked the question "What is the name of the book?", the expected answer is "The name of the book is Lord of the Rings." The relevant information in a cluttered image may also need to be preserved in case there are follow-up dialogues.
A statistical way of understanding this is through what is known as the BLEU score. BLEU is a common performance metric used in NLP (natural language processing); it stands for "bilingual evaluation understudy". It's a metric for evaluating the quality of machine-translated text, and it's also used for other tasks such as image captioning, visual question answering, and so on. When one looks at the BLEU score as sentence length increases, one observes that, while ideally the score should stay high even for long sequences, in practice, once the sentence length goes beyond a threshold, the BLEU score starts falling. This means that encoder-decoder models where both the encoders and decoders are RNNs start failing in these cases, when the sequences are long by nature. If you'd like to know more about BLEU, you can look up this reference.
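If you want to try computing BLEU yourself, one convenient option is the implementation in NLTK; the reference and candidate sentences below are made up purely for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference translation(s) and a candidate machine translation
# (illustrative sentences, not taken from the lecture).
reference = [["india", "gained", "independence", "from", "the", "british"]]
candidate = ["india", "got", "independence", "from", "the", "british"]

# Smoothing avoids zero scores when some n-gram orders have no matches.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```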
So what is the solution to this problem? The solution which is extensively used today is what is known as attention, which is going to be the focus of this week's lectures.

What is attention? Intuitively speaking, given an image, if we had to ask the question "What is this boy doing?", the human way of answering would be to first identify the artifacts in the image and then pay attention to the relevant ones, in this case the boy and the activity the boy is associated with. Similarly, if you had an entire paragraph and you had to summarize it, you would probably look at certain parts of the paragraph and write them out in summarized form. So paying attention to parts of inputs, be they images or long sequences like text, is an important part of how humans process data.

Let's now see this in a sequence learning problem, in the traditional encoder-decoder model setting. This is once again the many-to-many RNN setting, similar to what we saw for neural machine translation. You have your inputs, then a context vector that comes out at the end of the inputs, and that context vector is fed to a decoder RNN which gives you the outputs y_1 to y_K. Now let's assume that the h_j are the hidden states of the encoder and the s_j are the hidden states of the decoder. So, what does attention do?
Attention suggests that instead of directly passing h_T, the last hidden state, to your decoder RNN, we construct a context vector which relies on all of the hidden states from the input. This creates a shortcut connection between the context vector c_t and the entire source input x.

How would you learn this context vector? We'll see that there are multiple different ways. Given this context vector, the decoder hidden state s_t is given by some function f of: s_{t-1}, the previous hidden state in the decoder; y_{t-1}, the output of the previous decoder time step (which can be given as input to the next time step as well); and c_t. That is, s_t = f(s_{t-1}, y_{t-1}, c_t). And what is this context vector? It is given by c_t = sum over j of alpha_{tj} * h_j, where the sum runs over all the time steps of your encoder RNN; so it's a weighted combination of all of the hidden-state representations in your encoder RNN.

How do you find the alpha_{tj}, the weights on the different inputs? A standard framework is to obtain alpha_{tj} as a softmax over some scoring function that captures the score between s_{t-1} and each of the hidden states in your encoder: alpha_{tj} = softmax_j(score(s_{t-1}, h_j)). Here s_{t-1} gives us the current context of the output, so we try to understand how well the current output context aligns with each of the inputs, and accordingly pay attention to specific parts of the inputs.

Now there's an open question: how do you compute this score between s_{t-1} and each of the h_j in the encoder RNN? Once we have a way of computing that score, we take a softmax over it with respect to all of the h_j; we do this for each h_j in the encoder RNN, and using that we compute the alpha_{tj}, and using the alpha_{tj} we compute the context vector. Once you have the context vector, you give the corresponding context vector as input to each time step of the decoder RNN.
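As a minimal sketch of this computation (NumPy, with made-up dimensions, and a plain dot product standing in for the scoring function discussed next): the attention weights are a softmax over the scores, and the context vector is the corresponding weighted sum of the encoder hidden states.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

T, d = 6, 8                          # encoder time steps, hidden-state dimension (assumed)
H = np.random.randn(T, d)            # encoder hidden states h_1 ... h_T
s_prev = np.random.randn(d)          # previous decoder state s_{t-1}

scores = H @ s_prev                  # score(s_{t-1}, h_j), here a simple dot product
alpha = softmax(scores)              # attention weights alpha_{tj}, summing to 1
c_t = alpha @ H                      # context vector c_t = sum_j alpha_{tj} * h_j
```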
How do you compute this score? There are a few different approaches in the literature at this time; we will review many of them over the lectures this week, but to give you a summary. You could have content-based attention, which takes s_t and h_i (a particular hidden state of your decoder RNN and a particular hidden state of the encoder RNN) and computes the cosine similarity between the two; that's one way of measuring the score. You could also learn weights to compute this alignment: take s_t and h_i, apply a learned set of weights W_a, take a tanh, and use another learned vector to get the score; this is a learned procedure to get your final score. One could also get alpha_{tj} as a softmax over a learned set of weights W_a applied to s_t. One could use a more general framework, s_t transposed times W_a times h_i, which is similar to the cosine in that it gives you a dot product, but with a learned set of weights in between that tells you how to compare the two vectors s_t and h_i. Remember, any W_a here is learned by the network to compute the score. Or you could simply use the dot product by itself, s_t transposed times h_i, which behaves similarly to content-based attention (the cosine and the dot product give similar values). Finally, there is a variant known as scaled dot-product attention, where you use the dot product between the two vectors s_t and h_i but scale it by the square root of n, the dimension of the hidden-state vectors.
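Here is a rough sketch of these scoring variants in NumPy. The learned matrices and vectors (W_a, v_a, and the concatenation projection) are shown as random placeholders; in a real model they would be trained along with the rest of the network.

```python
import numpy as np

d = 8                                              # hidden-state dimension (assumed)
s_t, h_i = np.random.randn(d), np.random.randn(d)  # decoder and encoder hidden states
W_a = np.random.randn(d, d)                        # learned weight matrix (placeholder)
v_a = np.random.randn(d)                           # learned vector for additive attention
W_cat = np.random.randn(d, 2 * d)                  # placeholder projection of [s_t; h_i]

# Content-based: cosine similarity between s_t and h_i
cosine = s_t @ h_i / (np.linalg.norm(s_t) * np.linalg.norm(h_i))

# Additive (learned): v_a^T tanh(W [s_t; h_i])
additive = v_a @ np.tanh(W_cat @ np.concatenate([s_t, h_i]))

# General: s_t^T W_a h_i
general = s_t @ W_a @ h_i

# Dot product: s_t^T h_i
dot = s_t @ h_i

# Scaled dot product: s_t^T h_i / sqrt(n), with n the hidden dimension
scaled_dot = dot / np.sqrt(d)
```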
What about spatial data? We saw how this is done for temporal data, where you had a sequence-to-sequence, many-to-many RNN. What if you had an image captioning task, that is, spatial data? In this case, your image would give you a certain representation s_0 out of the encoder network. Unfortunately, when you use a fully connected layer after the CNN, you lose spatial information in that vector. So instead of using the fully connected layer, we typically take the output of the convolutional layers themselves, which gives you a certain volume, let's say M x N x C. Now, if you consider one specific patch of this M x N x C volume, you can trace it back to a particular patch of the original image that was passed through the CNN. So if you look at one particular part of the depth volume of the output feature map, say a conv5 feature map, it corresponds to a certain patch in the input image.

This gives you spatial information. So what can we do? We take the feature map at the output of a certain convolutional layer and unroll it into 1 x 1 x C vectors: you have an M x N x C volume, so you can unroll it into M x N different vectors, each of dimension C, and then you can apply attention to get a context vector.
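A minimal sketch of that unrolling step (NumPy, with assumed feature-map dimensions): the M x N x C convolutional output is reshaped into M*N location vectors of dimension C, and attention weights over those locations give a spatially aware context vector.

```python
import numpy as np

M, N, C = 7, 7, 512                        # conv feature-map size (assumed, e.g. a conv5 output)
feature_map = np.random.randn(M, N, C)     # output volume of the last convolutional layer

locations = feature_map.reshape(M * N, C)  # M*N vectors, one per spatial patch of the image

s_prev = np.random.randn(C)                # current decoder state (e.g. from the caption RNN)

scores = locations @ s_prev                # score each image location against the decoder state
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                       # softmax: attention weights over image regions

c_t = alpha @ locations                    # context vector: weighted sum of location features
```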
In what way is this useful? This context vector can now be understood as paying attention to certain parts of the image while producing the output, because each of these bands, each of these sub-volumes highlighted in yellow, corresponds to certain parts of the input image. One can now apply the same weighted-attention concept: the alignment part of it can be implemented very much like what we saw on the previous slide, but now it represents different parts of the input image.

Another use of performing attention is that it gives you explainability of the final model. Why so? If you have, say, a machine translation task, then when a certain output word is generated by the decoder RNN, your attention model, or your context vector, tells you which part of the input you looked at while predicting that word, and that automatically tells you which words in your input sequence corresponded to a given word in your output. So in this case you can see that the phrase "European Economic Area" depended on "zone économique européenne"; that is highlighted by these white patches here. White means higher dependence, black means no dependence, and looking at this heat map gives you an understanding of how the model translated from one language to another. What about images and the image captioning task?
In this case too, you can use the same idea. Given an image, if the model is generating a caption, you can see that the model generates each word of the caption by looking at certain parts of the image. For example, when it says "a", it seems to be looking at a particular part of the image; when it says "a woman", it seems to be looking at a certain part of the image while the other object is also in relevance. If you keep going, you see that when it says the word "throwing", it seems to focus on the woman; for the word "frisbee", it actually seems to focus on the frisbee in the image; and for the word "park", it seems to focus on everything other than the woman and the child. This gives you an understanding of, and trust that, the model is looking at the right things while generating the output.
What kinds of attention can one have? One categorization is hard versus soft attention. What do these mean? In hard attention, you choose one part of the image as the only focus for giving a certain output; say, in image captioning, you look at only one patch of the image to be able to produce a word as output. This choice of a position can end up becoming a stochastic sampling problem, and hence one may not be able to backpropagate through such a hard attention mechanism, because that stochastic sampling step can be non-differentiable. We'll see this in more detail in the next lecture. On the other hand, one could have soft attention, where you do not choose a single part of the image but simply assign weights to every part of it; in this case you effectively have a reweighted image where each part carries a certain weight. Here your output turns out to be deterministic and differentiable, and hence you can use such an approach along with standard backpropagation.
Another categorization of attention is global versus local attention. In global attention, all the input positions are considered for attention, whereas in local attention, only a neighborhood window around the object of interest or the area of interest is chosen for attention. A third kind, which is very popular today, is known as self-attention, where the attention is not of a decoder RNN with respect to the encoder, or of an output RNN with respect to parts of an image, but of a part of a sequence with respect to another part of the same sequence. This is also known as intra-attention, and we'll see it in more detail in a later lecture this week.
Your homework for this lecture is to read the excellent blog post by Lilian Weng, "Attention? Attention!", hosted on GitHub. And one question that we left behind: is there a connection between an autoencoder and principal component analysis? Think about it, and we'll discuss this in the next lecture.