Video Summary
Core Theme
This content explains the theoretical underpinnings of the UCB1 algorithm in reinforcement learning, focusing on its regret bound and the mathematical tools, specifically concentration inequalities like Chernoff bounds, used to derive it.
Video Transcript
So this is interesting, because we can state a nice theorem. This particular form of UCB is actually called UCB1, and in the literature, if you see UCB without any qualifier attached to it, it usually means UCB1. There are many, many variants of it, which we are not going to look at: UCB-Normal, UCB-Revisited, UCB-Improved, UCB2, and so on. Each one gives you slightly better theoretical results than the one I am going to write down, sometimes at the cost of exponentially harder analysis. UCB1 is fairly simple compared to the other algorithms that were proposed later, and I just want to give you a flavor of the kind of work that is done in this area, so I am not going to get into the other variants. We will certainly put all those papers up on Moodle; if people want to read them, they can.
Another thing about UCB1 is that it allows arbitrary reward distributions. Some of the other algorithms out there assume, say, Bernoulli rewards in order to show their results. The nice thing about UCB1 is that it handles arbitrary reward distributions.
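For context, here is a minimal sketch of the UCB1 rule being discussed, assuming the standard form with an empirical mean plus an exploration bonus of sqrt(2 ln t / n_i) and rewards in [0, 1]; the function and variable names are illustrative, not from the lecture.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1 sketch: play each arm once, then always pick the arm
    maximizing (empirical mean + sqrt(2 ln t / n_i))."""
    counts = [0] * n_arms          # n_i: times arm i has been pulled
    means = [0.0] * n_arms         # empirical mean reward of arm i
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1            # initialization: pull each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)              # reward assumed to lie in [0, 1]
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
    return counts, means

# Illustrative use with Bernoulli arms (hypothetical success probabilities):
probs = [0.3, 0.5, 0.7]
counts, means = ucb1(lambda i: float(random.random() < probs[i]), len(probs), 10_000)
print(counts, [round(m, 3) for m in means])
```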
right so where uh
so Capital Delta is Q star a star minus qar I essentially this means the loss that you'll
incur if if you play I the expected loss that you'll incur if you play i instead of playing
a star right you remember what a star is the max over a of qar a right so that is a star
AR Max a of Q star a is a star right so they always denote by a star the optimal action
So what does this tell you? The regret is upper bounded by 8 times the summation of (ln n) / Delta_i, with the summation running over all the non-optimal arms i. Why only the non-optimal arms? Because Delta_i is zero for the optimal arm, so I don't really want it to figure in that summation. Plus (1 + pi^2 / 3), which is a magic constant, times the summation over all arms j of Delta_j. Here I am being a little sloppy: I ought to exclude the optimal arm from this second summation as well, but I don't, because Delta_j is zero for it anyway. Symmetrically, just as I restrict the first summation, I should restrict the second, but I can afford to be sloppy and write it like this, because the extra terms I am adding are all zero.
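For reference, the bound just described can be typeset as follows, with Delta_i = Q*(a*) - Q*(i) as above and regret measured after n plays; this is only a restatement of what the lecturer says.

```latex
% UCB1 regret bound as stated in the lecture:
\mathbb{E}[\text{Regret}(n)] \;\le\; 8 \sum_{i \,:\, \Delta_i > 0} \frac{\ln n}{\Delta_i}
\;+\; \left(1 + \frac{\pi^2}{3}\right) \sum_{j} \Delta_j .
```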
Great. Okay, how did we get from here to there, or more importantly, how did we get this expression? Before we go on, I want to introduce a little bit of notation. T_i(n) is a random variable that represents the number of times I have played arm i in the first n trials. Does that make sense? Then I can write my regret after n trials as the summation over i of the expected value of T_i(n) times Delta_i. The regret that I expect to have after n time steps is obtained by taking each action, looking at the expected number of times I would have played that action up to that time, and multiplying it by the regret of playing that action. If I play the optimal action, the regret is zero; if I play some suboptimal action i, then I am adding Delta_i times the number of times I have played that suboptimal action. Correct? So this is essentially my definition of regret; I now have a formal way of writing it down. Is this clear? Any questions?
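As a tiny, made-up numerical illustration of this definition (the gaps and expected play counts below are hypothetical, not from the lecture):

```python
# Regret(n) = sum_i E[T_i(n)] * Delta_i; arm 0 is the optimal arm (Delta = 0).
deltas = [0.0, 0.2, 0.4]                  # gaps Delta_i = Q*(a*) - Q*(i)
expected_plays = [920.0, 55.0, 25.0]      # hypothetical E[T_i(n)] after n = 1000 plays

regret = sum(t * d for t, d in zip(expected_plays, deltas))
print(regret)   # 55*0.2 + 25*0.4 = 21.0; the optimal arm contributes nothing
```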
Okay, so one more piece of notation that I will use here: X_{i,n} is a random variable that denotes the reward I obtain for playing arm i at time n, for some arbitrary n. Since we are assuming everything is stationary, what will the expectation of this be? Are people following? What is X_{i,n}? The reward I get for playing arm i at time n. What is the expected value of that? Somebody say something; one of you put your hand up and say something, because it is very hard for me to make anything out in that murmur. Yes: Q*(i). Do people see why? Since we are assuming things are stationary, this expectation is really not going to depend on n, and so it becomes the expected reward I get for pulling arm i, which is essentially Q*(i). Okay, great. Anything else I need to do notation-wise?
Okay. So essentially, to show this result, what I am going to try to show is that this quantity is bounded: I am going to show that the expected value of T_j(n) is bounded by (8 / Delta_j^2) ln n plus some constant. Since I multiply by another Delta_j to get the regret, the first piece gives me the first term in the regret bound, and the constant gives me the second term.
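To get a feel for what this lemma promises, a small plug-in with a made-up gap Delta_j = 0.2: the bound on the expected number of pulls of a suboptimal arm grows only logarithmically with the horizon n.

```python
import math

delta_j = 0.2                              # hypothetical gap of a suboptimal arm j
for n in (1_000, 10_000, 100_000):
    bound = 8 * math.log(n) / delta_j**2   # 8 ln n / Delta_j^2 (ignoring the additive constant)
    print(n, round(bound, 1))              # grows like ln n: ~1381.6, ~1842.1, ~2302.6
```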
A student asks: can you explain the regret function again? Why are we multiplying the expected number of times an arm is played by some value Delta_i that is calculated at the end?

It is not calculated at the end; it is calculated using the true values, the Q* values. You don't know those values, but this is just a definition. Delta_i is the expected regret you are going to get for playing arm i, the expected loss you would accrue for playing arm i instead of the best arm. If I had played the best arm, I would get Q*(a*); if I play arm i, I am going to get Q*(i). So the loss I incur is Q*(a*) minus Q*(i), and this is what I am denoting by Delta_i: the expected loss accrued for playing arm i once. If you keep playing that arm repeatedly, the total loss is Delta_i times the number of times you play arm i, and I sum this over all the arms. One or more of the arms in the summation will be optimal (we don't know which), and for those terms Delta is zero, so their contribution to the regret is zero; all the other, suboptimal arms each contribute Delta_i to the regret.

Another student asks: why is the expected reward at time n independent of n?
Because I am assuming the reward distributions are stationary. Remember, I said you decide the reward by tossing a coin, but you never change the coin. Because you don't change the coin, it doesn't matter when I toss it: the probability of it coming up heads is the same whether I toss it the first time I pull the arm or the tenth time, and that probability is what determines the expectation, so the expectation is the same. One way to think about this expectation is the following. I run many experiments, every time starting from time zero, say millions of experiments. Each time I start from time zero, keep pulling arms for some number of time steps, then reset the whole system, start from time zero again, and pull for another, say, 100 time steps, and so on. When I run experiments like this, in some number of those million experiments I will have pulled arm three at the 10th time step. I take all the rewards I saw in those cases and average them; that is one way of thinking about what this expectation means. So now you understand: it doesn't matter whether I pulled arm three at the 10th time step or at the thousandth time step; if I take this expectation across the million trials, it will be the same. That is essentially what we are saying here.
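Here is a small simulation of that thought experiment, assuming (hypothetically) that arm three pays a Bernoulli reward with true mean 0.7: averaged across many independent runs, the reward at an early time step and at a late one are both approximately Q*(3).

```python
import random

random.seed(0)
p_arm3 = 0.7          # hypothetical true mean Q*(3) of arm three
runs = 200_000        # "millions" of independent experiments, scaled down

# Each run restarts from time zero; because the coin never changes, the reward for
# pulling arm three at step 10 and at step 1000 are draws from the same distribution.
at_step_10 = [float(random.random() < p_arm3) for _ in range(runs)]
at_step_1000 = [float(random.random() < p_arm3) for _ in range(runs)]

print(sum(at_step_10) / runs)     # ~0.7
print(sum(at_step_1000) / runs)   # ~0.7: the expectation does not depend on the time step
```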
Okay, is it clear? So what we are trying to show is this bound on E[T_j(n)]. If we are able to show it, then we are all done, as long as the constant here evaluates to 1 + pi^2 / 3; then our proof is complete. Now we have to go and start counting.
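To spell out the combination step the lecturer is describing: if the per-arm bound below holds with the constant 1 + pi^2/3, substituting it into the definition of regret recovers the theorem (the counting argument that establishes the per-arm bound is what the lecture turns to next).

```latex
% Suppose, for every suboptimal arm j,
\mathbb{E}[T_j(n)] \;\le\; \frac{8 \ln n}{\Delta_j^{2}} + \left(1 + \frac{\pi^2}{3}\right).
% Then, since Regret(n) = \sum_j \Delta_j \, \mathbb{E}[T_j(n)],
\text{Regret}(n) \;\le\; \sum_{j \,:\, \Delta_j > 0} \Delta_j
\left( \frac{8 \ln n}{\Delta_j^{2}} + 1 + \frac{\pi^2}{3} \right)
\;=\; 8 \sum_{j \,:\, \Delta_j > 0} \frac{\ln n}{\Delta_j}
\;+\; \left(1 + \frac{\pi^2}{3}\right) \sum_{j} \Delta_j .
```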
Okay, so here things start getting interesting. Do you want to complete the proof? Okay, so before I move on with the proof, I will need one more fact.
There is a whole set of results in probability theory called concentration inequalities, or concentration bounds, sometimes also called large-deviation bounds. These are essentially results that relate the true expectations of distributions to estimates of those expectations computed from samples. And not just expectations: for different kinds of statistics you have different bounds; there are bounds for expectations, bounds for variances, and so on. Essentially, the idea behind these bounds is to characterize the following: given a certain number of samples from which you are estimating a statistic (in our case the statistic we are interested in is the expectation), and given the true expectation, how quickly or how slowly does the estimate approach the true value as the number of samples increases? That is what we want to characterize. We are going to talk about one such bound, a very popular one that is applied all over the place. If you are going to do anything in machine learning, forget about RL, anything that has even remotely a shade of theoretical results in machine learning, it is a good idea to know more about concentration bounds; at the very least, you should know the Chernoff bounds, or Chernoff-Hoeffding bounds.
The form of the bound I am writing here is a very specialized, very narrow form, written specifically for the setting we are interested in; the Chernoff-Hoeffding bound is slightly more general than what I am stating now. You can look it up in your favorite online resource, Wikipedia or one of those things, and it will give you the more general statement.
Okay, so you can see the setup here. I am assuming that X_1 to X_n are random variables, each with range 0 to 1; if they are not in the range 0 to 1, you would have to do some kind of normalization in the bound, but for the time being let us just assume they are. I am also assuming that the expected value of X_t, given all the variables that came before it, is mu, and this holds for all t. Basically, it means that all the variables have mean mu: X_1 is a random variable with mean mu and range 0 to 1, X_2 is a random variable with mean mu and range 0 to 1, and so on. That is what we are assuming. I am also defining another random variable, S_n, which is the sum of X_1 to X_n divided by n. So what am I trying to estimate with S_n? Mu: I am trying to estimate the mean mu.

Now let us make this a little concrete in our case, so that it becomes easier to understand. X_1 to X_n are the random variables corresponding to the different times I have pulled the same arm. I have pulled arm j n times, and every time I pull the arm I get some reward in the range 0 to 1. Since I am assuming a stationary distribution, all these rewards have the same mean mu, and S_n is their average. This S_n is essentially my Q_j, if you think about it: my estimate Q_j is like one sample of the random variable S_n. Is it clear what I mean by saying Q_j is a sample of S_n? S_n is a random variable; Q_j is one realization of that random variable, because I have taken actual samples and averaged them. What I am going to tell you now is this: given that I have taken n samples, which in our case will be n_j samples, how can I relate Q_j with what? With Q*(j). So given that I have taken n_j samples, which plays the role of n here, how can I relate Q_j with Q*(j)? This is how the Hoeffding bound becomes applicable in our case. I will write down the expression; I am not going to prove it, since that would get a little more involved. I will write it down, and after that we can apply it to derive those bounds.
So what I have here now is: the probability that S_n is greater than or equal to mu plus some epsilon is at most e^(-2 epsilon^2 n).

A student asks why there is an 'a' in the exponent on the board. Oh, good catch: a denotes an action, and I don't want to confuse a-for-action with something else; this is the problem with translating on the fly. Okay, fine, it should be an n there, so the exponent reads -2 epsilon^2 n.

So what does this mean? It bounds the probability that my S_n is far away from my mu in one direction. Mu is the true mean and S_n is the estimated mean; the probability that S_n exceeds mu by at least epsilon is less than e^(-2 epsilon^2 n). So, the smaller the epsilon, what happens: is the probability smaller or larger? Larger. The smaller the epsilon, the larger this bound, so the larger the probability of making an error of that size. If I want to be really, really confident, then I need more and more samples: if I want my epsilon to be very small, then n has to be very large; only then do I get a high-probability guarantee. What about the other direction? The probability that my estimate falls at least epsilon below mu, that is, the probability that S_n is less than or equal to mu minus epsilon, is again bounded by the same expression.
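A quick empirical check of the one-sided bound, assuming (for illustration) n = 50 samples drawn as Bernoulli with mean 0.5 and epsilon = 0.1: the observed frequency of S_n exceeding mu + epsilon stays well below e^(-2 epsilon^2 n).

```python
import math
import random

random.seed(0)
mu, n, eps, trials = 0.5, 50, 0.1, 200_000

# Estimate P(S_n >= mu + eps) for S_n = average of n Bernoulli(mu) samples.
exceed = 0
for _ in range(trials):
    s_n = sum(random.random() < mu for _ in range(n)) / n
    exceed += (s_n >= mu + eps)

print(exceed / trials)               # empirical probability of a deviation of at least eps
print(math.exp(-2 * eps**2 * n))     # Hoeffding bound e^(-2 eps^2 n) ~ 0.368
```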