YouTube Transcript: Value Function Based Methods
Summary
Core Theme
This content introduces fundamental concepts in reinforcement learning, focusing on value function-based methods for estimating expected payoffs (Q-values) and strategies for action selection, balancing exploitation of current knowledge with exploration of new possibilities.
So the first thing I want to talk about is what the textbook calls value function based methods. These are approaches where I don't know q*, but I'm going to try to form an estimate of the q* values. How can I do that? What is q*? The expected payoff. So if I choose each action, say, 10 times, I can take the average of the rewards I got, and that gives me an estimate for q*. We'll denote the estimates by Q_t. Whenever we attach an estimate to an action we'll use the notation Q, and whenever we attach an estimate of the payoffs to a state we'll denote it by V. V stands for value function, so that's easy to remember; don't ask me what Q stands for.
One of the problems with RL is that, because so many different influences came into it, there are at least three different sets of notations that people use in the literature, so you have to swap back and forth if you're reading papers in reinforcement learning. Somebody will say that the value function is denoted by J and that Q denotes the cost function. If they had used disjoint sets of symbols it would at least be easy for us, but they use the same symbols: one body of literature will call something the value function, another body will call it the payoff function, which essentially corresponds to the reward you are getting at every time step, and that will be denoted by Q. So there are all kinds of confusions. The set of notations I'm choosing is essentially the one adopted in the Sutton and Barto book. For people who are reading edition one: there has been an overhaul of notation in edition two, so to the extent possible I'll try to follow the edition two notation. But edition one is ingrained in me; I've been using the edition one notation for, God knows, 15 years or something, and suddenly they decided to change the notation in edition two. So by force of habit I might slip into edition one notation occasionally. It's in your own interest to read ahead so that you can correct me whenever I do that. Don't wait for me to teach and then go and read, or wait for the quizzes and then read the book; read ahead and come to class. This will certainly help you significantly in following the material better. I'm deliberately going slow today, warming you up, but there is a lot of material that we cover. Usually I cover the material in the book in about one and a half to two months, and the rest of the time I'll be doing material outside the book. And the book is big, so at some point I'll switch modes and start going fast, so reading ahead helps.
So we're going to have Q_t. Essentially what you can do is this: whenever I took action a, I take the reward I got at that time, keep adding those up, and divide by the number of times I took action a. That will give me my Q_t. How will we write this? Let's introduce some notation. I'll use 1{.} whenever I want an indicator function. Normally, what notation do people use for indicator functions? I or delta. People typically use delta or I, but I'll use 1 because we use I and delta for other things later on. So 1{a_tau = a} essentially means: it is one whenever the action taken at step tau is a, and otherwise it is zero. I multiply each reward by that indicator, and what should I divide by? The number of times I took action a:

Q_t(a) = [ sum over tau = 1..t of 1{a_tau = a} * r_tau ] / [ sum over tau = 1..t of 1{a_tau = a} ]

This gives me the average. Great, so once I have this I have an estimate for my q* values. Note that I use t on the left-hand side, so I can't use t as the running variable as well; that's why the sums run over tau. So now I have come to time t and I have these estimates for my q* values.
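As a small aside, here is a sketch (my own code, not from the lecture) of the sample-average estimate just written down. The function and variable names (sample_average_q, actions, rewards, n_actions) are mine, and the short reward history is made up for illustration.

```python
import numpy as np

def sample_average_q(actions, rewards, n_actions):
    """Sample-average estimate: Q_t(a) = sum of rewards where a was taken / count of a."""
    q = np.zeros(n_actions)
    for a in range(n_actions):
        mask = (actions == a)            # the indicator 1{a_tau = a}
        if mask.any():
            q[a] = rewards[mask].mean()  # average reward over the times a was chosen
    return q

# Example: 3 actions, a short history of choices and observed rewards
actions = np.array([0, 1, 0, 2, 0])
rewards = np.array([1.0, 0.0, 0.5, 2.0, 1.5])
print(sample_average_q(actions, rewards, 3))   # [1.0, 0.0, 2.0]
```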
How can I use that for selecting actions? Select the maximum. I could be what we call greedy. Greedy would be: I'm going to select

a_t = argmax over a of Q_t(a)

so I'll take the action that gives me the max estimate and use that as the action to perform at that time. Is that a good choice? What does this choice correspond to? Exploitation. This choice corresponds to exploitation, but I can't just exploit; I have to explore as well. I can't come up with an algorithm that only exploits. What do you think will happen? Let's say the first time I select an action I pull an arm and I get a reward of one. What will happen to that arm's Q? It will be one. What about my other Qs? Well, it depends on what you initialize them with. If you initialize them with zero they'll stay at zero; I could have initialized them with 100, or with minus 100, I don't know. Whatever you initialize them with, they'll stay there, but this one will become one. Now if I'm being greedy, what will happen? I'll pull this arm again; let's assume I get a one again, and I keep getting one every time I pull it. So what will happen is that I'll never ever sample any of the other arms. It could very well be that arm three has a reward of 50, but just because I happened to pull arm one the first time, I'll never go and pull arm three even though it has a much, much higher reward.
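Here is a minimal sketch (not from the lecture) of that failure mode: purely greedy selection with zero-initialized estimates locks onto the first arm it tries, even though another arm is far better. The bandit setup and reward values are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [1.0, 0.0, 50.0]          # the last arm is actually far better
totals = np.zeros(3)
counts = np.zeros(3)
q = np.zeros(3)                        # estimates initialized to zero

for t in range(1000):
    a = int(np.argmax(q))              # purely greedy: always exploit the current estimates
    r = rng.normal(true_means[a], 0.1)
    totals[a] += r
    counts[a] += 1
    q[a] = totals[a] / counts[a]       # sample-average estimate for the pulled arm

print(counts)                          # all 1000 pulls go to arm 0; arm 2 is never tried
```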
So we don't want to be completely exploitative; we need some kind of exploration policy. The simplest exploration policy, and I'm going to confuse you a little bit here, is called epsilon-greedy. It has nothing to do with the epsilon we saw earlier; that epsilon is completely different, it talks about a solution concept, and just because I have something called epsilon-greedy doesn't mean I'm going to give you some epsilon PAC guarantee. This is a completely different epsilon; in fact it was named epsilon-greedy before people even started thinking about bringing all these PAC optimality questions into RL. So what does epsilon-greedy mean? Any guesses? "Choosing within epsilon of the greedy solution?" Good guess, but as I said, it has nothing to do with that epsilon. What I do is this: with probability 1 - epsilon I'll be greedy, and with probability epsilon I'll explore, choosing uniformly from the set of actions. One thing you should note here is that if I choose uniformly from the action set A, I still have some probability of hitting that argmax action. So in fact the probability of me taking the greedy action is 1 - epsilon + epsilon/n, and the probability of me exploring is actually (n - 1) * epsilon / n. This is a small, nitpicking thing, but remember that the probability of exploration is not epsilon, because I'm picking uniformly at random. You could instead choose to say that I pick uniformly from A minus the argmax action; it doesn't make a difference, I'm just pointing out that there is a small difference. Just remember that.
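A small sketch of epsilon-greedy selection as just described (again my own code, not the lecturer's). Note that the explore branch samples uniformly over all actions, so the greedy arm can still be picked while exploring, which is exactly where the 1 - epsilon + epsilon/n comes from.

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng):
    """Pick argmax of q with prob 1 - epsilon, otherwise a uniformly random action."""
    n = len(q)
    if rng.random() < epsilon:
        return int(rng.integers(n))   # explore: uniform over all n actions
    return int(np.argmax(q))          # exploit: greedy w.r.t. current estimates

# Empirically, P(greedy arm) = 1 - eps + eps/n and P(any other arm) = eps/n.
rng = np.random.default_rng(0)
q = np.array([1.0, 0.2, 0.5])
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(10_000)]
print(np.bincount(picks, minlength=3) / len(picks))   # roughly [0.93, 0.03, 0.03]
```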
Epsilon-greedy is by far the most popular exploration strategy that people adopt. There's another one, which ironically I couldn't find in edition two of the book, that is also rather popular: it's called softmax (or maybe I flipped through edition two too quickly to notice it). The softmax exploration policy does something slightly cleverer. If you think about epsilon-greedy, what you would notice is that I have the best action, best according to my current estimates; whenever I say greedy or best, it is according to the current estimate, capital Q, not q*. The greedy action gets the bulk of the probability, and every other action has an equal probability of being picked during exploration. But if I figure out that some actions are clearly bad, I don't want to be picking them during exploration, or at least I want to reduce the probability of picking them. And if two actions look almost equally good, maybe a 0.001 difference, some small delta difference between the two, then in epsilon-greedy the one that is 0.001 higher will get the highest probability and the other one will have the same probability of being picked as all the other arms. But in reality, because the difference is so small, I really want to be able to compare these two actions more closely and figure out which one is better.
So what people did was come up with this softmax idea, where instead of giving the bulk of the probability to one action, I make the probability of picking each action proportional to the current estimate. Whatever the current Q is, I'll make the probability proportional to Q. Ideally it should be something like

P(a_t = a) = Q_t(a) / [ sum over b of Q_t(b) ]

that should be the probability with which, at step t, I pick action a. What is the problem with that? The Q values could have different signs: one of them might be minus one, another might be one. In fact I could even end up with a negative probability; if my Q_t(a) were minus one, I'd end up with minus something as the probability of picking that action, which doesn't make sense. So to accommodate all kinds of values, what do we do? Exponentiate:

P(a_t = a) = e^{Q_t(a)} / [ sum over b of e^{Q_t(b)} ]

Why is this called softmax? Because it is not greedy; it does not return the action with the highest Q value all the time, but if there is a big difference it will return that action most often. Suppose there are four actions, and action one has a value of, say, 19, action two has a value of 1.2, action three has a value of 7, and action four has a value of -0.3. Then most often this will return action one to you, because that has a significantly higher value. That's why it's called softmax. Sometimes people throw in another parameter called the temperature parameter, beta:

P(a_t = a) = e^{Q_t(a)/beta} / [ sum over b of e^{Q_t(b)/beta} ]

Now think about it: if beta is very, very high, what will happen to your softmax function? Relative to such a large value, everything gets squished down, so essentially you'll be behaving uniformly at random regardless of what the Q values are. If beta is, say, one million, then whether a value is 1.7 or 1.8 makes no difference; all of them look close to zero, so essentially I'll be picking all actions uniformly at random. That's why it's called temperature: if you're very, very hot, things move around at random. And then people typically cool it down; they make beta smaller and smaller, so that from a very high temperature, where you are essentially behaving at random regardless of the Q values, as you keep cooling it down the distribution becomes more and more peaked. If beta becomes closer and closer to zero, what happens? It becomes the max: as beta goes to zero this function goes to the max, so it becomes greedy. So it gives you an additional parameter for tuning. If you don't trust me you can work it out; you guys are smart enough to work that out. This gives you an additional parameter for tuning how sharply your probability should fall off. So that's called softmax.
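A sketch (mine, not the lecturer's) of softmax selection probabilities with a temperature beta, using the four example values mentioned above as illustrative Q estimates; the specific beta values are arbitrary choices to show the effect of cooling.

```python
import numpy as np

def softmax_probs(q, beta):
    """P(a) = exp(Q(a)/beta) / sum_b exp(Q(b)/beta); beta is the temperature."""
    z = np.asarray(q, dtype=float) / beta
    z -= z.max()                     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

q = np.array([19.0, 1.2, 7.0, -0.3])       # the four example values from the lecture
print(softmax_probs(q, beta=1.0))          # sharply peaked on the first action
print(softmax_probs(q, beta=100.0))        # very high temperature: close to uniform
print(softmax_probs(q, beta=0.1))          # low temperature: essentially greedy
```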
If you think about epsilon-greedy written in this notation, the probabilities will be

P(a_t = a*) = 1 - epsilon + epsilon/n, where a* is the greedy action according to the current estimate,
P(a_t = a) = epsilon/n for every a not equal to a*,

and likewise for softmax you have the probabilities above. So these are the two rather popular ways of selecting actions; in fact epsilon-greedy and softmax are also used in the full reinforcement learning problem for implementing the exploration policy. You can convince yourself very simply that epsilon-greedy will give you asymptotic correctness, provided something we'll come to in a moment. Can you convince yourself that it gives asymptotic correctness? You're always going to be taking all the actions, so eventually you'll converge to the true expectations; you'll eventually converge to q*, because you're taking all the actions all the time. But what is the problem? If I keep my epsilon fixed at some value, say 0.1, then 90% of the time I'm behaving optimally and 10% of the time I'm choosing at random. So what will happen? I will know the right q* values after a long time, but I'll still be behaving randomly 10% of the time. What asymptotic correctness says is that eventually I should only be behaving according to the right arm; I should only be making the right choice eventually. To achieve that, what should you do? You have to cool epsilon: epsilon has to keep decreasing. But you shouldn't decrease it too fast; if you decrease it too fast, you don't have the guarantee that you will get the correct q* values. There are ways in which you can balance this; in fact people show that if epsilon falls as 1/t, that's usually a sufficient rate. Unfortunately, that will actually slow down your convergence tremendously, because it becomes a truly asymptotic convergence: you have to run it till infinity to get convergence. So in practice what you can do is cool it in steps: set a high epsilon, run it for a while, then drop it a little bit, run it for a while, drop it a little bit, run it for a while, and so on. Figuring out how long "a while" should be is a tricky question and depends on a variety of factors, but you can do some kind of empirical estimation for the time being. Good.
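A quick sketch (not from the lecture) of the two epsilon schedules just contrasted: a 1/t decay versus cooling in steps. The schedule breakpoints and epsilon values are arbitrary, hypothetical choices for illustration.

```python
def epsilon_inverse_t(t, eps0=1.0):
    """Decay as eps0 / t: sufficient for asymptotic correctness but shrinks very fast."""
    return eps0 / max(t, 1)

def epsilon_stepwise(t, schedule=((0, 0.5), (1000, 0.2), (5000, 0.05), (20000, 0.01))):
    """Hold epsilon constant for a while, then drop it: a practical compromise."""
    eps = schedule[0][1]
    for start, value in schedule:
        if t >= start:
            eps = value                # use the last breakpoint we have passed
    return eps

for t in (1, 100, 2000, 10_000, 50_000):
    print(t, epsilon_inverse_t(t), epsilon_stepwise(t))
```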
Okay, so before I go on to regret optimality, there's one small thing I want to talk about. Here I gave you a way of estimating the Q_t's. What do you notice about it? I run experiments till time t, and then when I want to know what Q_t is, I look back and see how many times I have taken action a and take the average. That essentially means that at time step t I need to know all the actions I picked earlier and all the rewards I got for those actions; I need to remember this whole history. Is there a more efficient way of doing this? Yes: I can essentially do this as a running sum. How do I do that? I want Q_t(a), which is the average of some numbers, so it should be expressible in terms of Q_{t-1}(a). Let me introduce a little notation to make my life easier: the number of times I've taken action a so far I'll denote by n_a. Assuming that I actually took action a at this time step and got reward r_t, what will happen?

Q_t(a) = [ n_a * Q_{t-1}(a) + r_t ] / (n_a + 1)

So I can do this in an incremental fashion: this is (n_a / (n_a + 1)) * Q_{t-1}(a) + r_t / (n_a + 1), and I can just keep updating it. I don't need to remember all the rewards; all I need to know is how many times I've taken action a so far and what the previous estimate Q_{t-1}(a) was. If the action taken at this time was not a, what happens? The reward term doesn't enter and the counts cancel, so it's just the same thing; no change. Is there a different way we can write this? Since n_a / (n_a + 1) = 1 - 1/(n_a + 1), I can write

Q_t(a) = (1 - 1/(n_a + 1)) * Q_{t-1}(a) + (1/(n_a + 1)) * r_t

Now I can rearrange these quantities and get a better-looking expression:

Q_t(a) = Q_{t-1}(a) + (1/(n_a + 1)) * (r_t - Q_{t-1}(a))

This is a form of a general class of equations known as stochastic averaging equations. There's really no stochasticity that you need to worry about here yet, but this is called a stochastic averaging equation.
One way to think about it is that I have an old estimate, and to the old estimate I add an error term. What is the error term? r_t is my target: I should have predicted that the reward would be r_t, so that is my target, and Q_{t-1}(a) is my actual output, what I said the payoff would be; r_t is the actual payoff that occurred. So r_t - Q_{t-1}(a) is some kind of error term, and 1/(n_a + 1) is some kind of step size. What I'm doing is taking my old estimate and adding some alpha times an error. Many iterative algorithms that we have looked at, including some of the algorithms you looked at if you took ML, and many iterative algorithms that we'll look at later in the course, all have this form: old estimate plus some alpha times an error term. What is the thing you should note here; why is it called a stochastic averaging rule? What I'm really trying to find out, my real target, is the expected value of r_t; my real target is not r_t itself. So my actual error should have been E[r_t] - Q_{t-1}(a); that would have been my actual error. But since I don't know the expected value of r_t, what I'm using here is some kind of measurement of that expected value, and an unbiased measurement is a sample drawn from the distribution according to which you're taking the expectation. So r_t is a sample; if I took the average of all these samples, that would give me the expected value. Since I don't know the true expectation, I operate with these samples: I form an estimate of the target, or an estimate of the error, by using a sample instead of the true expectation. That's why we call these stochastic averaging rules. This is a very standard form that we'll keep seeing again and again in all the algorithms that we look at: current estimate plus a step size times (target minus current estimate). And this particular choice of alpha, 1/(n_a + 1), gives me the exact average; I made no approximations anywhere, so this gives exactly the same estimate as I would have obtained by storing all the values and taking the average every time. Great.
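A minimal sketch (mine) of the incremental rule, checking numerically that with step size 1/(n_a + 1) it matches the stored-history average exactly; the reward stream for a single arm is made up for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(0.5, 1.0, size=200)   # rewards observed each time action a is taken

q, n_a = 0.0, 0
for r in rewards:
    q = q + (r - q) / (n_a + 1)            # Q <- Q + alpha * (target - Q), alpha = 1/(n_a + 1)
    n_a += 1

print(q, rewards.mean())                   # identical up to floating-point error
```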
um yeah so we'll stop here today or maybe I can just do one more thing and then we can move on
to the other things in the next class uh so far we have assumed that your uh coins are
not changed right so you're assuming that the the uh the distributions from which the rewards
are sampled or the payoffs are sampled is Conant stationary okay constant has a different meaning
it's stationary stationary means it doesn't change with time okay so if it is so when I
say something doesn't change with time what does it mean it means the coin doesn't change with
time it doesn't mean it always comes up heads okay when I say something is stationary okay
it doesn't mean that it outcome is always the same okay it means that the distribution from
which I'm sampling the outcomes is always the same okay but sometimes you might have non-stationary
problems right so is ESS you keep tossing the coin the coin one side starts weing off
a little bit right so after a while it might start falling Tails more often than heads or
what will happen if the head wears off will start falling heads more often or tails more
often start falling heads more
often right because there'll be more metal on the head side so it'll be heavier so it
will tend to come at the bottom more often so if head starts wearing off okay it will
tend to come relatively more frequently on the top than it used to earlier anyway
fine so things will start changing over time right and so what do you do in such
cases so you need a way I mean there are technical ways of handling it I'm just giving
you a practical way of handling it you need a way to be able to track the changes over time right so
if I'm going to use this kind of an update so what will be the problem after I have pulled an arm say
times right this number will become very small it will be 1 by th000 + 1 like it'll be 1 by,
1 right so the impact of the new samples I'm drawing will become smaller and smaller and
smaller right suppose I pulled an arm thousand times okay thousand and first time I pull the
arm right the impact will be very very small think about it I'm averaging like a thousand
numbers any individual number is going to have a very small impact on the average right but this
is all fine if I'm sampling from a stationary distribution because I already drawn thousand
samples what is the thousand and first sample going to tell me right but if I'm drawing
from a changing distribution so the more recent samples are more important to me than the older
samples right things are changing if my coin is varying out over time then the more recent
samples are more important to me than the older samples right so what I do in such cases is uh
So what I do in such cases is use a constant step size:

Q_t(a) = Q_{t-1}(a) + alpha * (r_t - Q_{t-1}(a)), where alpha is less than one.

You can do a little bit of expanding out on this and convince yourself that as time passes, the older rewards get exponentially less weight. The first time I do this, the new reward gets a weight of alpha. The next time I get another reward, the new reward again gets a weight of alpha, while the previous reward's weight gets multiplied by (1 - alpha), so it becomes alpha(1 - alpha), then alpha(1 - alpha) squared the time after, and so on. So I'll be paying more and more attention to the newer rewards, and the older rewards keep getting decayed. All I did was a very simple change: instead of having 1/n, I made it a constant alpha. Making it a constant alpha actually allows me to pay more attention to recent rewards and less attention to older rewards, and depending on how quickly your rewards are changing you can tune the value of alpha.
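A sketch (again mine, not the lecturer's) contrasting the 1/n step size with a constant alpha when the underlying reward mean slowly drifts; the drift model, noise level, and alpha value are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
T, alpha = 2000, 0.1
true_mean = 0.0
q_avg, q_const, n = 0.0, 0.0, 0

for t in range(T):
    true_mean += 0.002                      # the "coin" slowly changes over time
    r = rng.normal(true_mean, 0.1)
    n += 1
    q_avg += (r - q_avg) / n                # sample average: weighs all history equally
    q_const += alpha * (r - q_const)        # constant alpha: recent rewards weigh more

print(true_mean, q_avg, q_const)            # q_const tracks the drift; q_avg lags far behind
```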
So what's the problem with keeping alpha fixed? If the problem is stationary, you will not converge: you're still paying more attention to the more recent rewards, so you'll keep oscillating around the true value. Whenever you get a one you'll go up, whenever you get a zero you'll come down, up, down, up, down, you'll keep going like that, because you're paying more attention to the recent rewards. But the advantage of keeping alpha constant is that if the problem happens to be non-stationary, you'll track it to some extent. There are other, more robust ways of handling non-stationary problems, but this is a very simple trick that you can use. More often than not we'll actually end up using a constant alpha in many of the algorithms that we look at: whenever we talk about theoretical convergence we'll always talk about decaying alpha, but whenever we look at practical implementations of these algorithms you'll almost always have a constant alpha. The problem with decaying alpha as 1/n is again the same thing I told you about decaying epsilon: the changes become small very fast. That's always a trade-off that you have to pay. Okay, good.