Video Summary
Core Theme
This content explains the theoretical underpinnings of the UCB1 algorithm in reinforcement learning, focusing on its regret bound and the mathematical tools, specifically concentration inequalities like Chernoff bounds, used to derive it.
Video Transcript
So this is interesting, because we can state a nice theorem. This particular form of UCB is actually called UCB1, and in the literature, if you see UCB without any qualifier attached to it, it usually means UCB1. There are many, many variants of it, which we are not going to look at: UCB-Normal, UCB-Revisited, UCB-Improved, UCB2, and so on. Each one gives you slightly better theoretical results than the one I am going to write down, sometimes at the cost of exponentially harder analysis. UCB1 is fairly simple compared to the other algorithms that were proposed later, and I just want to give you a flavor of the kind of work that is done in this area, so I am not going to get into the other variants. We will certainly put all those papers up on Moodle; if people want to read them, they can.
Another thing about UCB1 is that it allows arbitrary reward distributions. Some of the other algorithms out there assume, say, Bernoulli rewards in order to show their results. The nice thing about UCB1 is that it handles arbitrary reward distributions.
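For context, here is a minimal sketch of the UCB1 rule being discussed, assuming the standard form with an empirical mean plus an exploration bonus of sqrt(2 ln t / n_i) and rewards in [0, 1]; the function and variable names are illustrative, not from the lecture.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1 sketch: play each arm once, then always pick the arm
    maximizing (empirical mean + sqrt(2 ln t / n_i))."""
    counts = [0] * n_arms          # n_i: times arm i has been pulled
    means = [0.0] * n_arms         # empirical mean reward of arm i
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1            # initialization: pull each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)              # reward assumed to lie in [0, 1]
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
    return counts, means

# Illustrative use with Bernoulli arms (hypothetical success probabilities):
probs = [0.3, 0.5, 0.7]
counts, means = ucb1(lambda i: float(random.random() < probs[i]), len(probs), 10_000)
print(counts, [round(m, 3) for m in means])
```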
right so where uh
so Capital Delta is Q star a star minus qar I essentially this means the loss that you'll
incur if if you play I the expected loss that you'll incur if you play i instead of playing
a star right you remember what a star is the max over a of qar a right so that is a star
AR Max a of Q star a is a star right so they always denote by a star the optimal action
So what does this tell you? The regret is upper bounded by 8 times the summation of (ln n) / Delta_i, with the summation running over all the non-optimal arms i. Why only the non-optimal arms? Because Delta_i is zero for the optimal arm, so I don't really want it to figure in that summation. Plus (1 + pi^2 / 3), which is a magic constant, times the summation over all arms j of Delta_j. Here I am being a little sloppy: I ought to exclude the optimal arm from this second summation as well, but I don't, because Delta_j is zero for it anyway. Symmetrically, just as I restrict the first summation, I should restrict the second, but I can afford to be sloppy and write it like this, because the extra terms I am adding are all zero.
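For reference, the bound just described can be typeset as follows, with Delta_i = Q*(a*) - Q*(i) as above and regret measured after n plays; this is only a restatement of what the lecturer says.

```latex
% UCB1 regret bound as stated in the lecture:
\mathbb{E}[\text{Regret}(n)] \;\le\; 8 \sum_{i \,:\, \Delta_i > 0} \frac{\ln n}{\Delta_i}
\;+\; \left(1 + \frac{\pi^2}{3}\right) \sum_{j} \Delta_j .
```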
Great. Okay, how did we get from here to there, or more importantly, how did we get this expression? Before we go on, I want to introduce a little bit of notation. T_i(n) is a random variable that represents the number of times I have played arm i in the first n trials. Does that make sense? Then I can write my regret after n trials as the summation over i of the expected value of T_i(n) times Delta_i. The regret that I expect to have after n time steps is obtained by taking each action, looking at the expected number of times I would have played that action up to that time, and multiplying it by the regret of playing that action. If I play the optimal action, the regret is zero; if I play some suboptimal action i, then I am adding Delta_i times the number of times I have played that suboptimal action. Correct? So this is essentially my definition of regret; I now have a formal way of writing it down. Is this clear? Any questions?
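As a tiny, made-up numerical illustration of this definition (the gaps and expected play counts below are hypothetical, not from the lecture):

```python
# Regret(n) = sum_i E[T_i(n)] * Delta_i; arm 0 is the optimal arm (Delta = 0).
deltas = [0.0, 0.2, 0.4]                  # gaps Delta_i = Q*(a*) - Q*(i)
expected_plays = [920.0, 55.0, 25.0]      # hypothetical E[T_i(n)] after n = 1000 plays

regret = sum(t * d for t, d in zip(expected_plays, deltas))
print(regret)   # 55*0.2 + 25*0.4 = 21.0; the optimal arm contributes nothing
```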
Okay, so one more piece of notation that I will use here: X_{i,n} is a random variable that denotes the reward I obtain for playing arm i at time n, for some arbitrary n. Since we are assuming everything is stationary, what will the expectation of this be? Are people following? What is X_{i,n}? The reward I get for playing arm i at time n. What is the expected value of that? Somebody say something; one of you put your hand up and say something, because it is very hard for me to make anything out in that murmur. Yes: Q*(i). Do people see why? Since we are assuming things are stationary, this expectation is really not going to depend on n, and so it becomes the expected reward I get for pulling arm i, which is essentially Q*(i). Okay, great. Anything else I need to do notation-wise?
Okay. So essentially, to show this result, what I am going to try to show is that this quantity is bounded: I am going to show that the expected value of T_j(n) is bounded by (8 / Delta_j^2) ln n plus some constant. Since I multiply by another Delta_j to get the regret, the first piece gives me the first term in the regret bound, and the constant gives me the second term.
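To get a feel for what this lemma promises, a small plug-in with a made-up gap Delta_j = 0.2: the bound on the expected number of pulls of a suboptimal arm grows only logarithmically with the horizon n.

```python
import math

delta_j = 0.2                              # hypothetical gap of a suboptimal arm j
for n in (1_000, 10_000, 100_000):
    bound = 8 * math.log(n) / delta_j**2   # 8 ln n / Delta_j^2 (ignoring the additive constant)
    print(n, round(bound, 1))              # grows like ln n: ~1381.6, ~1842.1, ~2302.6
```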
A student asks: can you explain the regret function again? Why are we multiplying the expected number of times an arm is played by some value Delta_i that is calculated at the end?

It is not calculated at the end; it is calculated using the true values, the Q* values. You don't know those values, but this is just a definition. Delta_i is the expected regret you are going to get for playing arm i, the expected loss you would accrue for playing arm i instead of the best arm. If I had played the best arm, I would get Q*(a*); if I play arm i, I am going to get Q*(i). So the loss I incur is Q*(a*) minus Q*(i), and this is what I am denoting by Delta_i: the expected loss accrued for playing arm i once. If you keep playing that arm repeatedly, the total loss is Delta_i times the number of times you play arm i, and I sum this over all the arms. One or more of the arms in the summation will be optimal (we don't know which), and for those terms Delta is zero, so their contribution to the regret is zero; all the other, suboptimal arms each contribute Delta_i to the regret.

Another student asks: why is the expected reward at time n independent of n?
Because I am assuming the reward distributions are stationary. Remember, I said you decide the reward by tossing a coin, but you never change the coin. Because you don't change the coin, it doesn't matter when I toss it: the probability of it coming up heads is the same whether I toss it the first time I pull the arm or the tenth time, and that probability is what determines the expectation, so the expectation is the same. One way to think about this expectation is the following. I run many experiments, every time starting from time zero, say millions of experiments. Each time I start from time zero, keep pulling arms for some number of time steps, then reset the whole system, start from time zero again, and pull for another, say, 100 time steps, and so on. When I run experiments like this, in some number of those million experiments I will have pulled arm three at the 10th time step. I take all the rewards I saw in those cases and average them; that is one way of thinking about what this expectation means. So now you understand: it doesn't matter whether I pulled arm three at the 10th time step or at the thousandth time step; if I take this expectation across the million trials, it will be the same. That is essentially what we are saying here.
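Here is a small simulation of that thought experiment, assuming (hypothetically) that arm three pays a Bernoulli reward with true mean 0.7: averaged across many independent runs, the reward at an early time step and at a late one are both approximately Q*(3).

```python
import random

random.seed(0)
p_arm3 = 0.7          # hypothetical true mean Q*(3) of arm three
runs = 200_000        # "millions" of independent experiments, scaled down

# Each run restarts from time zero; because the coin never changes, the reward for
# pulling arm three at step 10 and at step 1000 are draws from the same distribution.
at_step_10 = [float(random.random() < p_arm3) for _ in range(runs)]
at_step_1000 = [float(random.random() < p_arm3) for _ in range(runs)]

print(sum(at_step_10) / runs)     # ~0.7
print(sum(at_step_1000) / runs)   # ~0.7: the expectation does not depend on the time step
```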
Okay, is it clear? So what we are trying to show is this bound on E[T_j(n)]. If we are able to show it, then we are all done, as long as the constant here evaluates to 1 + pi^2 / 3; then our proof is complete. Now we have to go and start counting.
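To spell out the combination step the lecturer is describing: if the per-arm bound below holds with the constant 1 + pi^2/3, substituting it into the definition of regret recovers the theorem (the counting argument that establishes the per-arm bound is what the lecture turns to next).

```latex
% Suppose, for every suboptimal arm j,
\mathbb{E}[T_j(n)] \;\le\; \frac{8 \ln n}{\Delta_j^{2}} + \left(1 + \frac{\pi^2}{3}\right).
% Then, since Regret(n) = \sum_j \Delta_j \, \mathbb{E}[T_j(n)],
\text{Regret}(n) \;\le\; \sum_{j \,:\, \Delta_j > 0} \Delta_j
\left( \frac{8 \ln n}{\Delta_j^{2}} + 1 + \frac{\pi^2}{3} \right)
\;=\; 8 \sum_{j \,:\, \Delta_j > 0} \frac{\ln n}{\Delta_j}
\;+\; \left(1 + \frac{\pi^2}{3}\right) \sum_{j} \Delta_j .
```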
Okay, so here things start getting interesting. Do you want to complete the proof? Okay, so before I move on with the proof, I will need one more fact.
There is a whole set of results in probability theory called concentration inequalities, or concentration bounds, sometimes also called large-deviation bounds. These are essentially results that relate the true expectations of distributions to estimates of those expectations computed from samples. And not just expectations: for different kinds of statistics you have different bounds; there are bounds for expectations, bounds for variances, and so on. Essentially, the idea behind these bounds is to characterize the following: given a certain number of samples from which you are estimating a statistic (in our case the statistic we are interested in is the expectation), and given the true expectation, how quickly or how slowly does the estimate approach the true value as the number of samples increases? That is what we want to characterize. We are going to talk about one such bound, a very popular one that is applied all over the place. If you are going to do anything in machine learning, forget about RL, anything that has even remotely a shade of theoretical results in machine learning, it is a good idea to know more about concentration bounds; at the very least, you should know the Chernoff bounds, or Chernoff-Hoeffding bounds.
The form of the bound I am writing here is a very specialized, very narrow form, written specifically for the setting we are interested in; the Chernoff-Hoeffding bound is slightly more general than what I am stating now. You can look it up in your favorite online resource, Wikipedia or one of those things, and it will give you the more general statement.
Okay, so you can see the setup here. I am assuming that X_1 to X_n are random variables, each with range 0 to 1; if they are not in the range 0 to 1, you would have to do some kind of normalization in the bound, but for the time being let us just assume they are. I am also assuming that the expected value of X_t, given all the variables that came before it, is mu, and this holds for all t. Basically, it means that all the variables have mean mu: X_1 is a random variable with mean mu and range 0 to 1, X_2 is a random variable with mean mu and range 0 to 1, and so on. That is what we are assuming. I am also defining another random variable, S_n, which is the sum of X_1 to X_n divided by n. So what am I trying to estimate with S_n? Mu: I am trying to estimate the mean mu.

Now let us make this a little concrete in our case, so that it becomes easier to understand. X_1 to X_n are the random variables corresponding to the different times I have pulled the same arm. I have pulled arm j n times, and every time I pull the arm I get some reward in the range 0 to 1. Since I am assuming a stationary distribution, all these rewards have the same mean mu, and S_n is their average. This S_n is essentially my Q_j, if you think about it: my estimate Q_j is like one sample of the random variable S_n. Is it clear what I mean by saying Q_j is a sample of S_n? S_n is a random variable; Q_j is one realization of that random variable, because I have taken actual samples and averaged them. What I am going to tell you now is this: given that I have taken n samples, which in our case will be n_j samples, how can I relate Q_j with what? With Q*(j). So given that I have taken n_j samples, which plays the role of n here, how can I relate Q_j with Q*(j)? This is how the Hoeffding bound becomes applicable in our case. I will write down the expression; I am not going to prove it, since that would get a little more involved. I will write it down, and after that we can apply it to derive those bounds.
So what I have here now is: the probability that S_n is greater than or equal to mu plus some epsilon is at most e^(-2 epsilon^2 n).

A student asks why there is an 'a' in the exponent on the board. Oh, good catch: a denotes an action, and I don't want to confuse a-for-action with something else; this is the problem with translating on the fly. Okay, fine, it should be an n there, so the exponent reads -2 epsilon^2 n.

So what does this mean? It bounds the probability that my S_n is far away from my mu in one direction. Mu is the true mean and S_n is the estimated mean; the probability that S_n exceeds mu by at least epsilon is less than e^(-2 epsilon^2 n). So, the smaller the epsilon, what happens: is the probability smaller or larger? Larger. The smaller the epsilon, the larger this bound, so the larger the probability of making an error of that size. If I want to be really, really confident, then I need more and more samples: if I want my epsilon to be very small, then n has to be very large; only then do I get a high-probability guarantee. What about the other direction? The probability that my estimate falls at least epsilon below mu, that is, the probability that S_n is less than or equal to mu minus epsilon, is again bounded by the same expression.
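A quick empirical check of the one-sided bound, assuming (for illustration) n = 50 samples drawn as Bernoulli with mean 0.5 and epsilon = 0.1: the observed frequency of S_n exceeding mu + epsilon stays well below e^(-2 epsilon^2 n).

```python
import math
import random

random.seed(0)
mu, n, eps, trials = 0.5, 50, 0.1, 200_000

# Estimate P(S_n >= mu + eps) for S_n = average of n Bernoulli(mu) samples.
exceed = 0
for _ in range(trials):
    s_n = sum(random.random() < mu for _ in range(n)) / n
    exceed += (s_n >= mu + eps)

print(exceed / trials)               # empirical probability of a deviation of at least eps
print(math.exp(-2 * eps**2 * n))     # Hoeffding bound e^(-2 eps^2 n) ~ 0.368
```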