YouTube Transcript: UCB 1
Core Theme
This content introduces the Upper Confidence Bound (UCB) algorithm as a popular and effective method for solving multi-armed bandit problems, focusing on balancing exploration and exploitation to minimize regret.
Video Transcript
So we're looking at multi-armed bandit problems. You've looked at a couple of ways of solving them. Essentially, you keep an estimate of the value function around; the value function is the expected payoff that you get for pulling an arm. You keep an estimate of that value function, and then you behave in some explorative fashion with respect to it, so either you do epsilon-greedy or you do softmax or something, and you get some kind of asymptotic guarantees. Correct?
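As a reminder of what that kind of exploration looks like in code, here is a minimal sketch of epsilon-greedy action selection over the current value estimates; the names (Q, epsilon) are illustrative, not from the lecture.

import random

def epsilon_greedy(Q, epsilon=0.1):
    # Q: list of current value estimates, one per arm.
    # With probability epsilon pick a random arm (explore); otherwise pick the
    # arm with the highest estimated value (exploit).
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda j: Q[j])

Softmax exploration would instead draw arm j with probability proportional to exp(Q[j] / tau) for some temperature tau.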
Yes? Okay, good. We also spoke about other forms of optimality. One of them was regret optimality. Regret optimality is about reducing the total loss, or equivalently increasing the total reward, that you get over the process of learning. The initial loss that you incur due to exploration is what you want to minimize, so that you come as close as possible to the optimal case as quickly as possible. That is essentially what regret is all about.
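In symbols (the standard definition, not written out in the lecture), the total regret after n pulls is the gap between always playing the best arm and what you actually collected:

\mathrm{Regret}(n) = n\, q_*(a^*) - \mathbb{E}\!\left[\sum_{t=1}^{n} r_t\right], \qquad a^* = \arg\max_a q_*(a),

where q_*(a) is the true expected payoff of arm a and a^* is the best arm.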
So we'll start by looking at one algorithm which, in some sense, has become the most popular bandit algorithm around right now, because it is so easy to implement and also gives you not-too-bad regret bounds. It's called the Upper Confidence Bound algorithm, or the UCB algorithm. I'll use slightly different notation today, not too different, and I'll try to translate on the fly to the notation that is used in the textbook. We will be giving you some papers to read; we'll take care of linking the UCB paper and the median elimination paper and other things on Moodle. The notation in the papers will be very different, and you'll probably have to spend half an hour just trying to understand it. But when I'm explaining in class, to the extent possible, I'll try to map it to the notation used in the book, so that you have one uniform set of things that you can keep track of throughout.
The changes are the following. I'm going to assume that there are capital K arms. Earlier I was assuming there were n arms, but the reason I want to change n is that we're now going to talk about a notion of time: there are going to be pulls at every time instant, and so on. This should be familiar to electrical engineers: if time is discrete you denote it by n, if time is continuous you denote it by t. I want to be able to denote discrete time, so I'll save n for discrete time and not use it for the number of arms; that's the reason I switch to K. So assume there are K arms, and as before, associated with each arm there is some arbitrary probability distribution which I don't know. Whenever I pull an arm, I get a sample drawn from that probability distribution. It could be Bernoulli, it could be Gaussian, it could be Poisson; we don't know what distribution it is, but according to that distribution we'll get a payoff. And what else did we assume last time? That there is an expectation for that distribution. Yes, we assumed that the distribution is stationary, it's not going to change with time, and that there is an expectation associated with that distribution, which you denote by q* of a. When you say q* of a, it is the expectation for pulling arm a. Is this clear?
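If the textbook being referenced is Sutton and Barto (an assumption on my part about which book is meant), the corresponding definition there is

q_*(a) = \mathbb{E}\,[\,R_t \mid A_t = a\,],

the expected reward given that arm a is pulled.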
So here is the UCB algorithm. Assume there are K arms. The initialization phase is: play each arm once. You have to pull every arm at least one time; you can't do better than that, agreed? If I never pull an arm, I don't know anything about that arm's value, so I need to pull each arm at least once. That's the least we can do, so we start off with that, and then we do the following in a loop. You remember what Q_j is? It's the estimated expected payoff, the average reward that we'll be maintaining; that's the value function for arm j. So Q_j is the estimate that I have at time n for the true mean of arm j. Essentially what I'm doing is taking whatever is the current estimate and adding this expression to it, and then I'm playing the arm that gives me the highest value for this total expression.
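The exact expression written on the board is not captured in the transcript; in the standard UCB1 algorithm of Auer, Cesa-Bianchi and Fischer, which this description appears to follow, the selection rule at time n is

a_n = \arg\max_{j \in \{1,\dots,K\}} \left[\, Q_j + \sqrt{\frac{2 \ln n}{n_j}} \,\right],

where n is the total number of pulls so far and n_j is the number of times arm j has been pulled.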
The way to think about this is the following; I'll come back to this and show things more precisely later. Let's assume that this is my Q of arm 1, on some scale where the x-axis is the arm index and the y-axis is the expected payoff. So that is Q1; then I have Q2, Q3 and Q4, say I have four arms. The idea behind the upper confidence bound is to say: hey, I'm not going to use just the estimated expectation so far. I've actually drawn many samples, and I can use the samples I have drawn so far to figure out the confidence with which I can say that this is the expectation. So you can think of putting some kind of a bound around this estimate: you can say, okay, this is the expectation I have, but the true value, q* of 1, is going to lie within that band. Intuitively you can see that the more times I have sampled arm one, the smaller this band is going to be: if I've taken a lot of samples, then I'm more confident about the expectation that I report. We know that as the number of samples tends to infinity, Q1 will converge to q* of 1. So the more samples I draw, the narrower this band of uncertainty will be. Likewise, I'll have a band like this for Q2, a band like this for Q3, and a band like this for Q4. So what arm am I suggesting that you take now?
No? I'm saying take arm one; that's essentially what UCB tells you here. Why do you think this kind of scenario might have happened? I might have chosen arm four a lot of times because it seems to have a higher value, so my uncertainty in the value of arm four has come down a lot. But arm one I have not taken enough times. Arm three, obviously, I also have not taken enough times, but I'm really not inclined to take arm three anymore, because given my current state of uncertainty, the probability that arm three's true value will be higher is very, very small. Even though my uncertainty about arm three is very high, I'm not inclined to take that arm, because the chance that it will truly be high is very small, while for arm one the chance that it might be better than arm four is fairly decent. This interval, I'm telling you, is some kind of bound that says that with very high probability the true value, q* of 1, will lie within this band. That means there is a chance that it will lie here, in which case it can be higher than Q4, so I'll take this arm. If I take it a few more times, let us say after some time my estimate moves up a little bit, but my bounds also shrink; now it might not look attractive compared to Q4, so I'll go back to taking arm four.
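One standard way to make "with very high probability" precise, which the lecture does not spell out at this point, is Hoeffding's inequality: for i.i.d. rewards bounded in [0, 1],

P\!\left(\lvert Q_j - q_*(j)\rvert \ge \varepsilon\right) \le 2 e^{-2 n_j \varepsilon^2},

so choosing \varepsilon = \sqrt{2 \ln n / n_j} as the width of the band makes the chance of the true mean falling outside it decay like n^{-4}.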
That's why you can see in this expression the total number of samples I have drawn of arm j: n_j is the number of times I have sampled arm j, so the larger n_j is, the smaller this expression becomes. The larger the number of samples I draw, the smaller this expression will become, and therefore this interval will keep coming down. What's that? Yes, that's more like a normalizing term; it is there in all the expressions, all the intervals that you compute, so relatively it is not going to matter, but you'll need it to show some results later. So the algorithm itself is very clear. The quantities here are: n is essentially the total number of times I have played any arm so far, and n_j is the number of times I have played arm j. So you have that pictorial description of what this algorithm is doing.
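Putting the pieces together, here is a minimal sketch of the whole loop in Python. It assumes the standard UCB1 bonus sqrt(2 ln n / N_j) and rewards in [0, 1]; pull_arm is a hypothetical stand-in for the unknown reward distributions, not something from the lecture.

import math

def ucb1(pull_arm, K, total_pulls):
    # pull_arm(j) returns a reward in [0, 1] for arm j (hypothetical stand-in).
    Q = [0.0] * K   # running estimates of each arm's expected payoff
    N = [0] * K     # number of times each arm has been pulled
    n = 0           # total number of pulls so far

    def update(j, r):
        nonlocal n
        n += 1
        N[j] += 1
        Q[j] += (r - Q[j]) / N[j]   # sample-average update, step size 1/N_j

    # Initialization: play each arm once.
    for j in range(K):
        update(j, pull_arm(j))

    # Main loop: deterministically pick the arm with the largest upper confidence bound.
    while n < total_pulls:
        j = max(range(K), key=lambda a: Q[a] + math.sqrt(2 * math.log(n) / N[a]))
        update(j, pull_arm(j))

    return Q, N

For instance, with three Bernoulli arms you could call ucb1(lambda j: float(random.random() < [0.3, 0.5, 0.7][j]), K=3, total_pulls=1000) after importing random; over time N should concentrate on the third arm.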
Great. So what is nice about this algorithm? It's a very, very simple algorithm, you saw that; there's nothing to it. All you need to do, apart from keeping track of Q_j, is keep track of N_j as well, which you are anyway doing if you are doing that incremental update. If you remember, last class we wrote the incremental update where the step size was 1/N_j, so the only additional overhead would be if you had been using a constant alpha for your updates: then you'd have to remember N_j in addition.
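For reference, the two bookkeeping schemes being contrasted are the sample-average update, which already maintains N_j,

Q_j \leftarrow Q_j + \frac{1}{N_j}\,(r - Q_j),

and the constant-step-size update, Q_j \leftarrow Q_j + \alpha\,(r - Q_j), which does not, so UCB would need N_j tracked separately.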
Otherwise it's exactly the same thing as we did for epsilon-greedy, except that instead of randomly deciding which action to take for exploration, you don't do any explicit exploration at all; the random number generator is gone now. This is a very deterministic algorithm. So what is so great about it? Yes? Don't we need some kind of normalizing on Q_j, because the other term might be a very small term even when Q might be a very large value? That doesn't matter... why would it matter? Because the expected payoffs can be something like 10? Ah, that way. Okay, fine. Yes, this particular form of the expression does assume that the rewards are bounded between 0 and 1; I'll come to that in a minute. You're right, you can rescale the rewards if you want to: as long as all the rewards are positive, you can rescale them to lie between zero and one, but if you have negative rewards you'll have to think about how you want to do this.