Video Summary
Core Theme
This content explains the concept of "bandit problems," specifically multi-arm bandits, which are analogous to slot machines with multiple levers. It then delves into three primary frameworks for evaluating solutions to these problems: asymptotic correctness, regret optimality, and Probably Approximately Correct (PAC) optimality.
Video Transcript
So these kinds of problems: this is a very simplified version of what are called, in the statistics literature, bandit problems. Do you know why they are called bandit problems? You have been to enough lab meetings to know these things. Anyway, people know what a one-armed bandit is? A one-armed bandit is a slot machine. You know what slot machines are: you put a coin in, you pull a lever, and then what it does, basically, is steal your money. All the rest is mumbo jumbo; of course, all of you must realize that no casino is going to put up a machine that actually makes money for the customer, so in the long run the slot machine will steal your money, and you just keep pulling that thing. The lever that you pull is called the arm, and since the machine has one lever and it steals your money, it is called the one-armed bandit.
Here, the setting is very similar to a slot machine, except that instead of having one arm it has n arms. Each time you pull an arm you get some kind of payoff. And if you really want to complete the bandit analogy: every time you pull an arm you have to pay one rupee; sometimes you get back one rupee, sometimes you get back nothing, so that way you know it is certainly stealing your money. That is essentially why it is called a bandit problem. Sometimes it is also called a multi-armed bandit, because, well, it has multiple arms, and the dynamics are very similar to a slot machine. Now we know why I kept calling actions "arms": the literature typically talks about the arms of a bandit. But really, for us, we do not have to worry about the bandit connection; these are essentially just actions. Good, any questions so far?
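To make the setup concrete, here is a minimal sketch in Python of the n-armed bandit just described. The Bernoulli payoffs (pay one rupee, get back one rupee or nothing) and the class and method names are illustrative assumptions, not anything defined in the lecture.

```python
import random

class BernoulliBandit:
    """A hypothetical n-armed bandit: arm a pays 1 with probability means[a], else 0."""

    def __init__(self, means):
        self.means = list(means)      # true expected payoffs, hidden from the learner
        self.n_arms = len(self.means)

    def pull(self, arm):
        """Pull one arm and return its random payoff."""
        return 1.0 if random.random() < self.means[arm] else 0.0

# Example: a 5-armed bandit; the learner's job is to discover that arm 3 is best.
bandit = BernoulliBandit([0.2, 0.5, 0.4, 0.7, 0.1])
print(bandit.pull(0), bandit.pull(3))
```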
There are many, many ways in which you can solve this multi-armed bandit problem, but the crux is always that you have to balance exploration against exploitation. So I will talk about multiple solution concepts: what do we mean when we say we want to solve a multi-armed bandit problem?
One solution concept is asymptotic correctness. What do I mean by that? I do not put any bounds on you: here is this multi-armed bandit problem; give me a guarantee that eventually, as T tends to infinity, you will be selecting the arm that has the highest payoff. That is called asymptotic correctness, and it is one way of solving the problem. A lot of the older literature on bandit problems essentially concerned itself with asymptotic correctness. Of course, there were also some results on things like rates of convergence, how quickly you reach the guarantee and so on, but by and large the analysis was about asymptotic correctness: you come up with very simple algorithms and then show that asymptotically they converge to the right arm. So this is one kind of solution concept.
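As a concrete illustration of the kind of simple algorithm being alluded to (the lecture does not name one at this point), here is a sketch of epsilon-greedy selection with a decaying exploration rate; under standard conditions such a rule eventually settles on the highest-payoff arm. The class name and the 1/t decay schedule are assumptions made for the example.

```python
import random

class EpsilonGreedy:
    """Sketch of epsilon-greedy selection with a decaying exploration rate (epsilon_t = 1/t)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms        # how many times each arm has been pulled
        self.estimates = [0.0] * n_arms   # running estimates of each arm's expected payoff
        self.t = 0

    def select_arm(self):
        self.t += 1
        if random.random() < 1.0 / self.t:                 # explore with shrinking probability
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.estimates[a])  # exploit

    def update(self, arm, payoff):
        self.counts[arm] += 1
        # incremental sample-average update of the payoff estimate
        self.estimates[arm] += (payoff - self.estimates[arm]) / self.counts[arm]
```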
The second popular solution concept is known as regret optimality. Suppose I knew, from time zero, which arm is the best one; suppose I knew from the beginning which arm to pull, I kept pulling that arm over and over again, and I repeated this experiment many times. What would my expected payoff be? If you plot time against expected payoff, it is going to be some kind of flat line, and that is possibly the best expected payoff I can achieve, because I know from the beginning which is the right arm. Since I do not actually know this, I have to do some exploration to figure out which is the right arm, so my payoff curve will start lower and rise over time, eventually reaching that flat line as T goes to infinity. Now, the gap between the two is the reward I could have got if I had known the right answer from the beginning; it is, in some sense, a loss that I incurred because of my learning process. This is sometimes colourfully referred to as regret: if I had known this from the beginning, I could have done so much better. This is regret.
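In symbols (this is a standard way of writing what was just said, not a formula given in the lecture; Q*(a) denotes the true expected payoff of arm a, the notation used later in the PAC discussion), the cumulative regret after T pulls is the gap between always pulling the best arm a* and what the learner actually collected:

```latex
\mathrm{Regret}(T) \;=\; T\,Q^*(a^*) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} Q^*(a_t)\right],
\qquad a_t = \text{arm pulled at step } t.
```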
Another way of thinking about regret is that I am trying to maximize the total reward I obtain, not just the asymptotic payoff: even during learning I need to collect as much payoff as possible. Ideally, I would want the slope of my learning curve to be pretty steep; if the slope is very steep, the area between my curve and the optimal line, which is the regret, comes down. But what is the tradeoff? Typically, what you give up is that it takes a longer time to reach optimality, because to learn steeply you will usually be giving up some amount of exploration. There are some corner cases where you might actually miss out on important exploration because you are trying too hard to keep the regret small: I want to have very, very little regret, so I might skip certain key exploration that I should have done, and in some cases I will then never reach optimality, because I will have ignored certain important outcomes along the way. That is the trade-off you have to worry about; regret optimality is essentially looking at how steeply you learn at the outset. Of course, that does not mean I can be really bad in the long run. A curve that shoots up the y-axis very quickly may look as though it has small regret, but if it then flattens out below the optimal line, it keeps incurring a large constant regret at every step, and that is not acceptable either. So you have to balance the two: even though that curve rose quickly, it has still not come up to the optimal level, so you keep accumulating regret.
In fact, we know that no algorithm can guarantee that its regret grows more slowly than log T; the regret has to grow at least as log T. Suppose you have taken T time steps; the regret you have accumulated up to that point will be proportional to log T. As T becomes larger, the rate of growth becomes smaller and smaller, but that is the best rate of growth you can achieve. All you can play with is the constant: the growth will be some a times log T, and that a is what you can fiddle with, but the log T itself is non-negotiable. There are results showing that log T is a lower bound, so you cannot do better than log T in terms of regret. So if you think of the area above the learning curve and below the dotted optimal line, that area keeps growing as T becomes larger and larger, and the rate at which it grows will be at least log T. That is the result we have; I am not going to show you that result here, but we will talk about a couple of other things. Okay, is it clear? Does everyone understand what regret is? Good.
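One well-known algorithm whose regret does grow only logarithmically is UCB1 (Auer, Cesa-Bianchi and Fischer, 2002). It has not been introduced at this point in the lecture, so the sketch below is only a preview; it assumes a bandit object exposing n_arms and pull(arm) as in the earlier sketch.

```python
import math

def ucb1(bandit, horizon):
    """Sketch of UCB1: always pull the arm with the highest optimistic estimate."""
    counts = [0] * bandit.n_arms
    estimates = [0.0] * bandit.n_arms
    for t in range(1, horizon + 1):
        if t <= bandit.n_arms:
            arm = t - 1                                   # pull every arm once to initialise
        else:
            arm = max(
                range(bandit.n_arms),
                key=lambda a: estimates[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        payoff = bandit.pull(arm)
        counts[arm] += 1
        estimates[arm] += (payoff - estimates[arm]) / counts[arm]
    return estimates
```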
So the third thing I want to talk about is what is called PAC optimality. Or rather, I should not really call it PAC optimality; it is closer to PAC complexity. Here is a slightly tricky thing: PAC stands for Probably Approximately Correct.
Sometimes, in a very loose fashion, we tend to use these as interchangeable phrases: "he is probably right", "he is approximately right". But they are not interchangeable; they come from very different contexts and mean very different things. When we say somebody is probably right, that means he is either right or wrong: right with some probability, and wrong with some probability. When we say somebody is approximately right, he is almost surely not exactly right, but he is very close to being right. That is essentially what "approximately" means. It turns out that both of these concepts are applicable in the bandit setting.
So what do I mean when I say somebody is approximately right in the bandit setting? Remember what the goal is at the end of the day: I am supposed to give you back an arm, and that arm is supposed to have the highest expected payoff. When I say I am approximately right, it means the arm I return to you will be very close in payoff to the best possible arm. Suppose I return some arm a to you at the end of my algorithm; then Q*(a) will be very close to Q*(a*), where a* is the best arm. What is Q* again? The true expected payoff, which we do not know; the algorithm does not know Q*, but it returns an arm, and the guarantee is that the unknown Q* of the returned arm will be close to the Q* of the best arm. So that is the "approximately correct" part.
So what is the "probably correct" part here? Is it whether the returned arm is the optimal one? No: it is whether the answer is approximately correct or not, because we are only giving approximately correct guarantees. With a very high probability the returned arm is approximately correct; with some small probability the algorithm might give you an arm that is more than the allowed distance away from the best arm. That is probably approximately correct, which is what PAC means. "Probably correct" on its own would mean either optimal or not optimal, whereas probably approximately correct means that with some probability you are approximately correct and with some probability you are not.
Typically there are many ways in which people talk about this; one popular way of specifying it is called the (epsilon, delta)-PAC guarantee, where epsilon refers to the "approximately" part and delta refers to the "probably" part: with probability at least 1 minus delta, the arm I return to you will be within epsilon of the best arm. So this essentially means that the probability that Q*(a), for the arm a I return to you, is greater than Q*(a*) minus epsilon is at least 1 minus delta. There is another way of writing it, but that is not what the PAC framework guarantees. What is the difference between the first one and the other? The first is a relative guarantee, stated with respect to the best arm, and the other is an absolute guarantee; you can think of it either way, as an absolute guarantee or as a relative one, but this is essentially what PAC optimality means.
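Written out as a formula (this simply transcribes the verbal statement above, using the lecture's Q* notation), the (epsilon, delta)-PAC guarantee for a returned arm a is:

```latex
\Pr\!\left[\, Q^*(a) \;\ge\; Q^*(a^*) - \epsilon \,\right] \;\ge\; 1 - \delta .
```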
So, for a given epsilon and delta, if I can give you that guarantee, then you say the algorithm is PAC-optimal for that epsilon and delta. But what is the interesting part here? If you allow me to draw an infinite number of samples from the arms, I can always provide this guarantee: give me whatever epsilon and delta you want, I can just keep pulling arms, and at some point I can say, okay, now I have satisfied it. The optimality part comes in when you want to minimize the sample complexity: given an epsilon and a delta, what is the smallest number of arm pulls with which I can give you that (epsilon, delta)-PAC guarantee? That is essentially what we are looking at here: it is a sample-complexity question. The first concept was a correctness question, the second was a kind of rate-of-convergence question, and this one is a sample-complexity question. These are all slightly different notions of a solution; when I say you are solving a bandit problem, these are different notions of solutions, and the kinds of algorithms you come up with for addressing each of these questions would be pretty different.
would be pretty different so how do you know qar a star you don't know but I give you the
guarantee this is the if if you know qar a star then this will be as close as that so
how will we know when when to stop how will we know that will beep what will be number
of exactly so these are questions that we look at as we go along I have not told you what the
I'm just telling you what the solution concepts are right I not even told you how you actually
solve these problems right so when we look at those I'll tell you how to go about doing
this in fact it'll turn out that the algorithms themselves are very simple okay but to analyze
it to show that this kind of guarantee holds is where all the trick lies [Music] [Applause]