YouTube Transcript: Value Function Based Methods
Summary
Core Theme
This content introduces fundamental concepts in reinforcement learning, focusing on value function-based methods for estimating expected payoffs (Q-values) and strategies for action selection, balancing exploitation of current knowledge with exploration of new possibilities.
So the first thing I want to talk about is what the textbook calls value function based methods. These are approaches where I don't know q*, but I'm going to try to form an estimate of the q* values. How can I do that? What is q*? The expected payoff. So if I choose each action, say, 10 times, I can take the average of the rewards I got, and that gives me an estimate for q*. We'll denote the estimates by Q_t. Whenever we attach an estimate to an action we'll use the notation Q, and whenever we attach an estimate of the payoffs to a state we'll denote it by V. V stands for value function, so that's easy to remember; don't ask me what Q stands for.
One of the problems with RL is that, because so many different influences came into it, there are at least three different sets of notations that people use in the literature, so you have to swap back and forth if you're reading papers in reinforcement learning. Somebody will say that the value function is denoted by J and that Q denotes the cost function. If they had used disjoint sets of symbols it would at least be easy for us, but they use the same symbols: one body of literature will call something the value function, another body will call it the payoff function, which essentially corresponds to the reward you are getting at every time step, and that will be denoted by Q. So there are all kinds of confusions. The set of notations I'm choosing is essentially the one adopted in the Sutton and Barto book. For people who are reading edition one: there has been an overhaul of notation in edition two, so to the extent possible I'll try to follow the edition two notation. But edition one is ingrained in me; I've been using the edition one notation for, God knows, 15 years or something, and suddenly they decided to change the notation in edition two. So by force of habit I might slip into edition one notation occasionally. It's in your own interest to read ahead so that you can correct me whenever I do that. Don't wait for me to teach and then go and read, or wait for the quizzes and then read the book; read ahead and come to class. This will certainly help you significantly in following the material better. I'm deliberately going slow today, warming you up, but there is a lot of material that we cover. Usually I cover the material in the book in about one and a half to two months, and the rest of the time I'll be doing material outside the book. And the book is big, so at some point I'll switch modes and start going fast, so reading ahead helps.
So we're going to have Q_t. Essentially what you can do is this: whenever I took action a, I take the reward I got at that time, keep adding those up, and divide by the number of times I took action a. That will give me my Q_t. How will we write this? Let's introduce some notation. I'll use 1{.} whenever I want an indicator function. Normally, what notation do people use for indicator functions? I or delta. People typically use delta or I, but I'll use 1 because we use I and delta for other things later on. So 1{a_tau = a} essentially means: it is one whenever the action taken at step tau is a, and otherwise it is zero. I multiply each reward by that indicator, and what should I divide by? The number of times I took action a:

Q_t(a) = [ sum over tau = 1..t of 1{a_tau = a} * r_tau ] / [ sum over tau = 1..t of 1{a_tau = a} ]

This gives me the average. Great, so once I have this I have an estimate for my q* values. Note that I use t on the left-hand side, so I can't use t as the running variable as well; that's why the sums run over tau. So now I have come to time t and I have these estimates for my q* values.
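As a small aside, here is a sketch (my own code, not from the lecture) of the sample-average estimate just written down. The function and variable names (sample_average_q, actions, rewards, n_actions) are mine, and the short reward history is made up for illustration.

```python
import numpy as np

def sample_average_q(actions, rewards, n_actions):
    """Sample-average estimate: Q_t(a) = sum of rewards where a was taken / count of a."""
    q = np.zeros(n_actions)
    for a in range(n_actions):
        mask = (actions == a)            # the indicator 1{a_tau = a}
        if mask.any():
            q[a] = rewards[mask].mean()  # average reward over the times a was chosen
    return q

# Example: 3 actions, a short history of choices and observed rewards
actions = np.array([0, 1, 0, 2, 0])
rewards = np.array([1.0, 0.0, 0.5, 2.0, 1.5])
print(sample_average_q(actions, rewards, 3))   # [1.0, 0.0, 2.0]
```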
How can I use that for selecting actions? Select the maximum. I could be what we call greedy. Greedy would be: I'm going to select

a_t = argmax over a of Q_t(a)

so I'll take the action that gives me the max estimate and use that as the action to perform at that time. Is that a good choice? What does this choice correspond to? Exploitation. This choice corresponds to exploitation, but I can't just exploit; I have to explore as well. I can't come up with an algorithm that only exploits. What do you think will happen? Let's say the first time I select an action I pull an arm and I get a reward of one. What will happen to that arm's Q? It will be one. What about my other Qs? Well, it depends on what you initialize them with. If you initialize them with zero they'll stay at zero; I could have initialized them with 100, or with minus 100, I don't know. Whatever you initialize them with, they'll stay there, but this one will become one. Now if I'm being greedy, what will happen? I'll pull this arm again; let's assume I get a one again, and I keep getting one every time I pull it. So what will happen is that I'll never ever sample any of the other arms. It could very well be that arm three has a reward of 50, but just because I happened to pull arm one the first time, I'll never go and pull arm three even though it has a much, much higher reward.
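Here is a minimal sketch (not from the lecture) of that failure mode: purely greedy selection with zero-initialized estimates locks onto the first arm it tries, even though another arm is far better. The bandit setup and reward values are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [1.0, 0.0, 50.0]          # the last arm is actually far better
totals = np.zeros(3)
counts = np.zeros(3)
q = np.zeros(3)                        # estimates initialized to zero

for t in range(1000):
    a = int(np.argmax(q))              # purely greedy: always exploit the current estimates
    r = rng.normal(true_means[a], 0.1)
    totals[a] += r
    counts[a] += 1
    q[a] = totals[a] / counts[a]       # sample-average estimate for the pulled arm

print(counts)                          # all 1000 pulls go to arm 0; arm 2 is never tried
```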
So we don't want to be completely exploitative; we need some kind of exploration policy. The simplest exploration policy, and I'm going to confuse you a little bit here, is called epsilon-greedy. It has nothing to do with the epsilon we saw earlier; that epsilon is completely different, it talks about a solution concept, and just because I have something called epsilon-greedy doesn't mean I'm going to give you some epsilon PAC guarantee. This is a completely different epsilon; in fact it was named epsilon-greedy before people even started thinking about bringing all these PAC optimality questions into RL. So what does epsilon-greedy mean? Any guesses? "Choosing within epsilon of the greedy solution?" Good guess, but as I said, it has nothing to do with that epsilon. What I do is this: with probability 1 - epsilon I'll be greedy, and with probability epsilon I'll explore, choosing uniformly from the set of actions. One thing you should note here is that if I choose uniformly from the action set A, I still have some probability of hitting that argmax action. So in fact the probability of me taking the greedy action is 1 - epsilon + epsilon/n, and the probability of me exploring is actually (n - 1) * epsilon / n. This is a small, nitpicking thing, but remember that the probability of exploration is not epsilon, because I'm picking uniformly at random. You could instead choose to say that I pick uniformly from A minus the argmax action; it doesn't make a difference, I'm just pointing out that there is a small difference. Just remember that.
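A small sketch of epsilon-greedy selection as just described (again my own code, not the lecturer's). Note that the explore branch samples uniformly over all actions, so the greedy arm can still be picked while exploring, which is exactly where the 1 - epsilon + epsilon/n comes from.

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng):
    """Pick argmax of q with prob 1 - epsilon, otherwise a uniformly random action."""
    n = len(q)
    if rng.random() < epsilon:
        return int(rng.integers(n))   # explore: uniform over all n actions
    return int(np.argmax(q))          # exploit: greedy w.r.t. current estimates

# Empirically, P(greedy arm) = 1 - eps + eps/n and P(any other arm) = eps/n.
rng = np.random.default_rng(0)
q = np.array([1.0, 0.2, 0.5])
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(10_000)]
print(np.bincount(picks, minlength=3) / len(picks))   # roughly [0.93, 0.03, 0.03]
```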
Epsilon-greedy is by far the most popular exploration strategy that people adopt. There's another one, which ironically I couldn't find in edition two of the book, that is also rather popular: it's called softmax (or maybe I flipped through edition two too quickly to notice it). The softmax exploration policy does something slightly cleverer. If you think about epsilon-greedy, what you would notice is that I have the best action, best according to my current estimates; whenever I say greedy or best, it is according to the current estimate, capital Q, not q*. The greedy action gets the bulk of the probability, and every other action has an equal probability of being picked during exploration. But if I figure out that some actions are clearly bad, I don't want to be picking them during exploration, or at least I want to reduce the probability of picking them. And if two actions look almost equally good, maybe a 0.001 difference, some small delta difference between the two, then in epsilon-greedy the one that is 0.001 higher will get the highest probability and the other one will have the same probability of being picked as all the other arms. But in reality, because the difference is so small, I really want to be able to compare these two actions more closely and figure out which one is better.
So what people did was come up with this softmax idea, where instead of giving the bulk of the probability to one action, I make the probability of picking each action proportional to the current estimate. Whatever the current Q is, I'll make the probability proportional to Q. Ideally it should be something like

P(a_t = a) = Q_t(a) / [ sum over b of Q_t(b) ]

that should be the probability with which, at step t, I pick action a. What is the problem with that? The Q values could have different signs: one of them might be minus one, another might be one. In fact I could even end up with a negative probability; if my Q_t(a) were minus one, I'd end up with minus something as the probability of picking that action, which doesn't make sense. So to accommodate all kinds of values, what do we do? Exponentiate:

P(a_t = a) = e^{Q_t(a)} / [ sum over b of e^{Q_t(b)} ]

Why is this called softmax? Because it is not greedy; it does not return the action with the highest Q value all the time, but if there is a big difference it will return that action most often. Suppose there are four actions, and action one has a value of, say, 19, action two has a value of 1.2, action three has a value of 7, and action four has a value of -0.3. Then most often this will return action one to you, because that has a significantly higher value. That's why it's called softmax. Sometimes people throw in another parameter called the temperature parameter, beta:

P(a_t = a) = e^{Q_t(a)/beta} / [ sum over b of e^{Q_t(b)/beta} ]

Now think about it: if beta is very, very high, what will happen to your softmax function? Relative to such a large value, everything gets squished down, so essentially you'll be behaving uniformly at random regardless of what the Q values are. If beta is, say, one million, then whether a value is 1.7 or 1.8 makes no difference; all of them look close to zero, so essentially I'll be picking all actions uniformly at random. That's why it's called temperature: if you're very, very hot, things move around at random. And then people typically cool it down; they make beta smaller and smaller, so that from a very high temperature, where you are essentially behaving at random regardless of the Q values, as you keep cooling it down the distribution becomes more and more peaked. If beta becomes closer and closer to zero, what happens? It becomes the max: as beta goes to zero this function goes to the max, so it becomes greedy. So it gives you an additional parameter for tuning. If you don't trust me you can work it out; you guys are smart enough to work that out. This gives you an additional parameter for tuning how sharply your probability should fall off. So that's called softmax.
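A sketch (mine, not the lecturer's) of softmax selection probabilities with a temperature beta, using the four example values mentioned above as illustrative Q estimates; the specific beta values are arbitrary choices to show the effect of cooling.

```python
import numpy as np

def softmax_probs(q, beta):
    """P(a) = exp(Q(a)/beta) / sum_b exp(Q(b)/beta); beta is the temperature."""
    z = np.asarray(q, dtype=float) / beta
    z -= z.max()                     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

q = np.array([19.0, 1.2, 7.0, -0.3])       # the four example values from the lecture
print(softmax_probs(q, beta=1.0))          # sharply peaked on the first action
print(softmax_probs(q, beta=100.0))        # very high temperature: close to uniform
print(softmax_probs(q, beta=0.1))          # low temperature: essentially greedy
```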
If you think about epsilon-greedy written in this notation, the probabilities will be

P(a_t = a*) = 1 - epsilon + epsilon/n, where a* is the greedy action according to the current estimate,
P(a_t = a) = epsilon/n for every a not equal to a*,

and likewise for softmax you have the probabilities above. So these are the two rather popular ways of selecting actions; in fact epsilon-greedy and softmax are also used in the full reinforcement learning problem for implementing the exploration policy. You can convince yourself very simply that epsilon-greedy will give you asymptotic correctness, provided something we'll come to in a moment. Can you convince yourself that it gives asymptotic correctness? You're always going to be taking all the actions, so eventually you'll converge to the true expectations; you'll eventually converge to q*, because you're taking all the actions all the time. But what is the problem? If I keep my epsilon fixed at some value, say 0.1, then 90% of the time I'm behaving optimally and 10% of the time I'm choosing at random. So what will happen? I will know the right q* values after a long time, but I'll still be behaving randomly 10% of the time. What asymptotic correctness says is that eventually I should only be behaving according to the right arm; I should only be making the right choice eventually. To achieve that, what should you do? You have to cool epsilon: epsilon has to keep decreasing. But you shouldn't decrease it too fast; if you decrease it too fast, you don't have the guarantee that you will get the correct q* values. There are ways in which you can balance this; in fact people show that if epsilon falls as 1/t, that's usually a sufficient rate. Unfortunately, that will actually slow down your convergence tremendously, because it becomes a truly asymptotic convergence: you have to run it till infinity to get convergence. So in practice what you can do is cool it in steps: set a high epsilon, run it for a while, then drop it a little bit, run it for a while, drop it a little bit, run it for a while, and so on. Figuring out how long "a while" should be is a tricky question and depends on a variety of factors, but you can do some kind of empirical estimation for the time being. Good.
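A quick sketch (not from the lecture) of the two epsilon schedules just contrasted: a 1/t decay versus cooling in steps. The schedule breakpoints and epsilon values are arbitrary, hypothetical choices for illustration.

```python
def epsilon_inverse_t(t, eps0=1.0):
    """Decay as eps0 / t: sufficient for asymptotic correctness but shrinks very fast."""
    return eps0 / max(t, 1)

def epsilon_stepwise(t, schedule=((0, 0.5), (1000, 0.2), (5000, 0.05), (20000, 0.01))):
    """Hold epsilon constant for a while, then drop it: a practical compromise."""
    eps = schedule[0][1]
    for start, value in schedule:
        if t >= start:
            eps = value                # use the last breakpoint we have passed
    return eps

for t in (1, 100, 2000, 10_000, 50_000):
    print(t, epsilon_inverse_t(t), epsilon_stepwise(t))
```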
Okay, so before I go on to regret optimality, there's one small thing I want to talk about. Here I gave you a way of estimating the Q_t's. What do you notice about it? I run experiments till time t, and then when I want to know what Q_t is, I look back and see how many times I have taken action a and take the average. That essentially means that at time step t I need to know all the actions I picked earlier and all the rewards I got for those actions; I need to remember this whole history. Is there a more efficient way of doing this? Yes: I can essentially do this as a running sum. How do I do that? I want Q_t(a), which is the average of some numbers, so it should be expressible in terms of Q_{t-1}(a). Let me introduce a little notation to make my life easier: the number of times I've taken action a so far I'll denote by n_a. Assuming that I actually took action a at this time step and got reward r_t, what will happen?

Q_t(a) = [ n_a * Q_{t-1}(a) + r_t ] / (n_a + 1)

So I can do this in an incremental fashion: this is (n_a / (n_a + 1)) * Q_{t-1}(a) + r_t / (n_a + 1), and I can just keep updating it. I don't need to remember all the rewards; all I need to know is how many times I've taken action a so far and what the previous estimate Q_{t-1}(a) was. If the action taken at this time was not a, what happens? The reward term doesn't enter and the counts cancel, so it's just the same thing; no change. Is there a different way we can write this? Since n_a / (n_a + 1) = 1 - 1/(n_a + 1), I can write

Q_t(a) = (1 - 1/(n_a + 1)) * Q_{t-1}(a) + (1/(n_a + 1)) * r_t

Now I can rearrange these quantities and get a better-looking expression:

Q_t(a) = Q_{t-1}(a) + (1/(n_a + 1)) * (r_t - Q_{t-1}(a))

This is a form of a general class of equations known as stochastic averaging equations. There's really no stochasticity that you need to worry about here yet, but this is called a stochastic averaging equation.
One way to think about it is that I have an old estimate, and to the old estimate I add an error term. What is the error term? r_t is my target: I should have predicted that the reward would be r_t, so that is my target, and Q_{t-1}(a) is my actual output, what I said the payoff would be; r_t is the actual payoff that occurred. So r_t - Q_{t-1}(a) is some kind of error term, and 1/(n_a + 1) is some kind of step size. What I'm doing is taking my old estimate and adding some alpha times an error. Many iterative algorithms that we have looked at, including some of the algorithms you looked at if you took ML, and many iterative algorithms that we'll look at later in the course, all have this form: old estimate plus some alpha times an error term. What is the thing you should note here; why is it called a stochastic averaging rule? What I'm really trying to find out, my real target, is the expected value of r_t; my real target is not r_t itself. So my actual error should have been E[r_t] - Q_{t-1}(a); that would have been my actual error. But since I don't know the expected value of r_t, what I'm using here is some kind of measurement of that expected value, and an unbiased measurement is a sample drawn from the distribution according to which you're taking the expectation. So r_t is a sample; if I took the average of all these samples, that would give me the expected value. Since I don't know the true expectation, I operate with these samples: I form an estimate of the target, or an estimate of the error, by using a sample instead of the true expectation. That's why we call these stochastic averaging rules. This is a very standard form that we'll keep seeing again and again in all the algorithms that we look at: current estimate plus a step size times (target minus current estimate). And this particular choice of alpha, 1/(n_a + 1), gives me the exact average; I made no approximations anywhere, so this gives exactly the same estimate as I would have obtained by storing all the values and taking the average every time. Great.
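A minimal sketch (mine) of the incremental rule, checking numerically that with step size 1/(n_a + 1) it matches the stored-history average exactly; the reward stream for a single arm is made up for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(0.5, 1.0, size=200)   # rewards observed each time action a is taken

q, n_a = 0.0, 0
for r in rewards:
    q = q + (r - q) / (n_a + 1)            # Q <- Q + alpha * (target - Q), alpha = 1/(n_a + 1)
    n_a += 1

print(q, rewards.mean())                   # identical up to floating-point error
```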
um yeah so we'll stop here today or maybe I can just do one more thing and then we can move on
to the other things in the next class uh so far we have assumed that your uh coins are
not changed right so you're assuming that the the uh the distributions from which the rewards
are sampled or the payoffs are sampled is Conant stationary okay constant has a different meaning
it's stationary stationary means it doesn't change with time okay so if it is so when I
say something doesn't change with time what does it mean it means the coin doesn't change with
time it doesn't mean it always comes up heads okay when I say something is stationary okay
it doesn't mean that it outcome is always the same okay it means that the distribution from
which I'm sampling the outcomes is always the same okay but sometimes you might have non-stationary
problems right so is ESS you keep tossing the coin the coin one side starts weing off
a little bit right so after a while it might start falling Tails more often than heads or
what will happen if the head wears off will start falling heads more often or tails more
often start falling heads more
often right because there'll be more metal on the head side so it'll be heavier so it
will tend to come at the bottom more often so if head starts wearing off okay it will
tend to come relatively more frequently on the top than it used to earlier anyway
fine so things will start changing over time right and so what do you do in such
cases so you need a way I mean there are technical ways of handling it I'm just giving
you a practical way of handling it you need a way to be able to track the changes over time right so
if I'm going to use this kind of an update so what will be the problem after I have pulled an arm say
times right this number will become very small it will be 1 by th000 + 1 like it'll be 1 by,
1 right so the impact of the new samples I'm drawing will become smaller and smaller and
smaller right suppose I pulled an arm thousand times okay thousand and first time I pull the
arm right the impact will be very very small think about it I'm averaging like a thousand
numbers any individual number is going to have a very small impact on the average right but this
is all fine if I'm sampling from a stationary distribution because I already drawn thousand
samples what is the thousand and first sample going to tell me right but if I'm drawing
from a changing distribution so the more recent samples are more important to me than the older
samples right things are changing if my coin is varying out over time then the more recent
samples are more important to me than the older samples right so what I do in such cases is uh
So what I do in such cases is use a constant step size:

Q_t(a) = Q_{t-1}(a) + alpha * (r_t - Q_{t-1}(a)), where alpha is less than one.

You can do a little bit of expanding out on this and convince yourself that as time passes, the older rewards get exponentially less weight. The first time I do this, the new reward gets a weight of alpha. The next time I get another reward, the new reward again gets a weight of alpha, while the previous reward's weight gets multiplied by (1 - alpha), so it becomes alpha(1 - alpha), then alpha(1 - alpha) squared the time after, and so on. So I'll be paying more and more attention to the newer rewards, and the older rewards keep getting decayed. All I did was a very simple change: instead of having 1/n, I made it a constant alpha. Making it a constant alpha actually allows me to pay more attention to recent rewards and less attention to older rewards, and depending on how quickly your rewards are changing you can tune the value of alpha.
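A sketch (again mine, not the lecturer's) contrasting the 1/n step size with a constant alpha when the underlying reward mean slowly drifts; the drift model, noise level, and alpha value are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
T, alpha = 2000, 0.1
true_mean = 0.0
q_avg, q_const, n = 0.0, 0.0, 0

for t in range(T):
    true_mean += 0.002                      # the "coin" slowly changes over time
    r = rng.normal(true_mean, 0.1)
    n += 1
    q_avg += (r - q_avg) / n                # sample average: weighs all history equally
    q_const += alpha * (r - q_const)        # constant alpha: recent rewards weigh more

print(true_mean, q_avg, q_const)            # q_const tracks the drift; q_avg lags far behind
```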
So what's the problem with keeping alpha fixed? If the problem is stationary, you will not converge: you're still paying more attention to the more recent rewards, so you'll keep oscillating around the true value. Whenever you get a one you'll go up, whenever you get a zero you'll come down, up, down, up, down, you'll keep going like that, because you're paying more attention to the recent rewards. But the advantage of keeping alpha constant is that if the problem happens to be non-stationary, you'll track it to some extent. There are other, more robust ways of handling non-stationary problems, but this is a very simple trick that you can use. More often than not we'll actually end up using a constant alpha in many of the algorithms that we look at: whenever we talk about theoretical convergence we'll always talk about decaying alpha, but whenever we look at practical implementations of these algorithms you'll almost always have a constant alpha. The problem with decaying alpha as 1/n is again the same thing I told you about decaying epsilon: the changes become small very fast. That's always a trade-off that you have to pay. Okay, good.