YouTube Transcript: UCB 1
Core Theme
This content introduces the Upper Confidence Bound (UCB) algorithm as a popular and effective method for solving multi-armed bandit problems, focusing on balancing exploration and exploitation to minimize regret.
Video Transcript
So we're looking at multi-armed bandit problems. You've looked at a couple of ways of solving them. Essentially, you keep an estimate of the value function around; the value function is the expected payoff that you get for pulling an arm. You keep an estimate of that value function, and then you behave in some explorative fashion with respect to it, so either you do epsilon-greedy or you do softmax or something, and you get some kind of asymptotic guarantees. Correct?
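As a reminder of what that kind of exploration looks like in code, here is a minimal sketch of epsilon-greedy action selection over the current value estimates; the names (Q, epsilon) are illustrative, not from the lecture.

import random

def epsilon_greedy(Q, epsilon=0.1):
    # Q: list of current value estimates, one per arm.
    # With probability epsilon pick a random arm (explore); otherwise pick the
    # arm with the highest estimated value (exploit).
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda j: Q[j])

Softmax exploration would instead draw arm j with probability proportional to exp(Q[j] / tau) for some temperature tau.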
Yes? Okay, good. We also spoke about other forms of optimality. One of them was regret optimality. Regret optimality is about reducing the total loss, or equivalently increasing the total reward, that you get over the process of learning. The initial loss that you incur due to exploration is what you want to minimize, so that you come as close as possible to the optimal case as quickly as possible. That is essentially what regret is all about.
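In symbols (the standard definition, not written out in the lecture), the total regret after n pulls is the gap between always playing the best arm and what you actually collected:

\mathrm{Regret}(n) = n\, q_*(a^*) - \mathbb{E}\!\left[\sum_{t=1}^{n} r_t\right], \qquad a^* = \arg\max_a q_*(a),

where q_*(a) is the true expected payoff of arm a and a^* is the best arm.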
So we'll start by looking at one algorithm which, in some sense, has become the most popular bandit algorithm around right now, because it is so easy to implement and also gives you not-too-bad regret bounds. It's called the Upper Confidence Bound algorithm, or the UCB algorithm. I'll use slightly different notation today, not too different, and I'll try to translate on the fly to the notation that is used in the textbook. We will be giving you some papers to read; we'll take care of linking the UCB paper and the median elimination paper and other things on Moodle. The notation in the papers will be very different, and you'll probably have to spend half an hour just trying to understand it. But when I'm explaining in class, to the extent possible, I'll try to map it to the notation used in the book, so that you have one uniform set of things that you can keep track of throughout.
The changes are the following. I'm going to assume that there are capital K arms. Earlier I was assuming there were n arms, but the reason I want to change n is that we're now going to talk about a notion of time: there are going to be pulls at every time instant, and so on. This should be familiar to electrical engineers: if time is discrete you denote it by n, if time is continuous you denote it by t. I want to be able to denote discrete time, so I'll save n for discrete time and not use it for the number of arms; that's the reason I switch to K. So assume there are K arms, and as before, associated with each arm there is some arbitrary probability distribution which I don't know. Whenever I pull an arm, I get a sample drawn from that probability distribution. It could be Bernoulli, it could be Gaussian, it could be Poisson; we don't know what distribution it is, but according to that distribution we'll get a payoff. And what else did we assume last time? That there is an expectation for that distribution. Yes, we assumed that the distribution is stationary, it's not going to change with time, and that there is an expectation associated with that distribution, which you denote by q* of a. When you say q* of a, it is the expectation for pulling arm a. Is this clear?
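If the textbook being referenced is Sutton and Barto (an assumption on my part about which book is meant), the corresponding definition there is

q_*(a) = \mathbb{E}\,[\,R_t \mid A_t = a\,],

the expected reward given that arm a is pulled.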
So here is the UCB algorithm. Assume there are K arms. The initialization phase is: play each arm once. You have to pull every arm at least one time; you can't do better than that, agreed? If I never pull an arm, I don't know anything about that arm's value, so I need to pull each arm at least once. That's the least we can do, so we start off with that, and then we do the following in a loop. You remember what Q_j is? It's the estimated expected payoff, the average reward that we'll be maintaining; that's the value function for arm j. So Q_j is the estimate that I have at time n for the true mean of arm j. Essentially what I'm doing is taking whatever is the current estimate and adding this expression to it, and then I'm playing the arm that gives me the highest value for this total expression.
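The exact expression written on the board is not captured in the transcript; in the standard UCB1 algorithm of Auer, Cesa-Bianchi and Fischer, which this description appears to follow, the selection rule at time n is

a_n = \arg\max_{j \in \{1,\dots,K\}} \left[\, Q_j + \sqrt{\frac{2 \ln n}{n_j}} \,\right],

where n is the total number of pulls so far and n_j is the number of times arm j has been pulled.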
The way to think about this is the following; I'll come back to this and show things more precisely later. Let's assume that this is my Q of arm 1, on some scale where the x-axis is the arm index and the y-axis is the expected payoff. So that is Q1; then I have Q2, Q3 and Q4, say I have four arms. The idea behind the upper confidence bound is to say: hey, I'm not going to use just the estimated expectation so far. I've actually drawn many samples, and I can use the samples I have drawn so far to figure out the confidence with which I can say that this is the expectation. So you can think of putting some kind of a bound around this estimate: you can say, okay, this is the expectation I have, but the true value, q* of 1, is going to lie within that band. Intuitively you can see that the more times I have sampled arm one, the smaller this band is going to be: if I've taken a lot of samples, then I'm more confident about the expectation that I report. We know that as the number of samples tends to infinity, Q1 will converge to q* of 1. So the more samples I draw, the narrower this band of uncertainty will be. Likewise, I'll have a band like this for Q2, a band like this for Q3, and a band like this for Q4. So what arm am I suggesting that you take now?
No? I'm saying take arm one; that's essentially what UCB tells you here. Why do you think this kind of scenario might have happened? I might have chosen arm four a lot of times because it seems to have a higher value, so my uncertainty in the value of arm four has come down a lot. But arm one I have not taken enough times. Arm three, obviously, I also have not taken enough times, but I'm really not inclined to take arm three anymore, because given my current state of uncertainty, the probability that arm three's true value will be higher is very, very small. Even though my uncertainty about arm three is very high, I'm not inclined to take that arm, because the chance that it will truly be high is very small, while for arm one the chance that it might be better than arm four is fairly decent. This interval, I'm telling you, is some kind of bound that says that with very high probability the true value, q* of 1, will lie within this band. That means there is a chance that it will lie here, in which case it can be higher than Q4, so I'll take this arm. If I take it a few more times, let us say after some time my estimate moves up a little bit, but my bounds also shrink; now it might not look attractive compared to Q4, so I'll go back to taking arm four.
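One standard way to make "with very high probability" precise, which the lecture does not spell out at this point, is Hoeffding's inequality: for i.i.d. rewards bounded in [0, 1],

P\!\left(\lvert Q_j - q_*(j)\rvert \ge \varepsilon\right) \le 2 e^{-2 n_j \varepsilon^2},

so choosing \varepsilon = \sqrt{2 \ln n / n_j} as the width of the band makes the chance of the true mean falling outside it decay like n^{-4}.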
That's why you can see in this expression the total number of samples I have drawn of arm j: n_j is the number of times I have sampled arm j, so the larger n_j is, the smaller this expression becomes. The larger the number of samples I draw, the smaller this expression will become, and therefore this interval will keep coming down. What's that? Yes, that's more like a normalizing term; it is there in all the expressions, all the intervals that you compute, so relatively it is not going to matter, but you'll need it to show some results later. So the algorithm itself is very clear. The quantities here are: n is essentially the total number of times I have played any arm so far, and n_j is the number of times I have played arm j. So you have that pictorial description of what this algorithm is doing.
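Putting the pieces together, here is a minimal sketch of the whole loop in Python. It assumes the standard UCB1 bonus sqrt(2 ln n / N_j) and rewards in [0, 1]; pull_arm is a hypothetical stand-in for the unknown reward distributions, not something from the lecture.

import math

def ucb1(pull_arm, K, total_pulls):
    # pull_arm(j) returns a reward in [0, 1] for arm j (hypothetical stand-in).
    Q = [0.0] * K   # running estimates of each arm's expected payoff
    N = [0] * K     # number of times each arm has been pulled
    n = 0           # total number of pulls so far

    def update(j, r):
        nonlocal n
        n += 1
        N[j] += 1
        Q[j] += (r - Q[j]) / N[j]   # sample-average update, step size 1/N_j

    # Initialization: play each arm once.
    for j in range(K):
        update(j, pull_arm(j))

    # Main loop: deterministically pick the arm with the largest upper confidence bound.
    while n < total_pulls:
        j = max(range(K), key=lambda a: Q[a] + math.sqrt(2 * math.log(n) / N[a]))
        update(j, pull_arm(j))

    return Q, N

For instance, with three Bernoulli arms you could call ucb1(lambda j: float(random.random() < [0.3, 0.5, 0.7][j]), K=3, total_pulls=1000) after importing random; over time N should concentrate on the third arm.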
Great. So what is nice about this algorithm? It's a very, very simple algorithm, you saw that; there's nothing to it. All you need to do, apart from keeping track of Q_j, is keep track of N_j as well, which you are anyway doing if you are doing that incremental update. If you remember, last class we wrote the incremental update where the step size was 1/N_j, so the only additional overhead would be if you had been using a constant alpha for your updates: then you'd have to remember N_j in addition.
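For reference, the two bookkeeping schemes being contrasted are the sample-average update, which already maintains N_j,

Q_j \leftarrow Q_j + \frac{1}{N_j}\,(r - Q_j),

and the constant-step-size update, Q_j \leftarrow Q_j + \alpha\,(r - Q_j), which does not, so UCB would need N_j tracked separately.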
Otherwise it's exactly the same thing as we did for epsilon-greedy, except that instead of randomly deciding which action to take for exploration, you don't do any explicit exploration at all; the random number generator is gone now. This is a very deterministic algorithm. So what is so great about it? Yes? Don't we need some kind of normalizing on Q_j, because the other term might be a very small term even when Q might be a very large value? That doesn't matter... why would it matter? Because the expected payoffs can be something like 10? Ah, that way. Okay, fine. Yes, this particular form of the expression does assume that the rewards are bounded between 0 and 1; I'll come to that in a minute. You're right, you can rescale the rewards if you want to: as long as all the rewards are positive, you can rescale them to lie between zero and one, but if you have negative rewards you'll have to think about how you want to do this.