Video Summary
Core Theme
This content explains the concept of "bandit problems," specifically multi-arm bandits, which are analogous to slot machines with multiple levers. It then delves into three primary frameworks for evaluating solutions to these problems: asymptotic correctness, regret optimality, and Probably Approximately Correct (PAC) optimality.
Video Transcript
So these kinds of problems: this is a very simplified version of what are called, in the statistics literature, bandit problems. Do you know why they are called bandit problems? You have been to enough lab meetings to know these things. Anyway, people know what a one-armed bandit is? A one-armed bandit is a slot machine. You know what slot machines are: you put a coin in, you pull a lever, and then what it does, basically, is steal your money. All the rest is mumbo jumbo; of course, all of you must realize that no casino is going to put up a machine that actually makes money for the customer, so in the long run the slot machine will steal your money, and you just keep pulling that thing. The lever that you pull is called the arm, and since the machine has one lever and it steals your money, it is called the one-armed bandit.
Here, the setting is very similar to a slot machine, except that instead of having one arm it has n arms. Each time you pull an arm you get some kind of payoff. And if you really want to complete the bandit analogy: every time you pull an arm you have to pay one rupee; sometimes you get back one rupee, sometimes you get back nothing, so that way you know it is certainly stealing your money. That is essentially why it is called a bandit problem. Sometimes it is also called a multi-armed bandit, because, well, it has multiple arms, and the dynamics are very similar to a slot machine. Now we know why I kept calling actions "arms": the literature typically talks about the arms of a bandit. But really, for us, we do not have to worry about the bandit connection; these are essentially just actions. Good, any questions so far?
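To make the setup concrete, here is a minimal sketch in Python of the n-armed bandit just described. The Bernoulli payoffs (pay one rupee, get back one rupee or nothing) and the class and method names are illustrative assumptions, not anything defined in the lecture.

```python
import random

class BernoulliBandit:
    """A hypothetical n-armed bandit: arm a pays 1 with probability means[a], else 0."""

    def __init__(self, means):
        self.means = list(means)      # true expected payoffs, hidden from the learner
        self.n_arms = len(self.means)

    def pull(self, arm):
        """Pull one arm and return its random payoff."""
        return 1.0 if random.random() < self.means[arm] else 0.0

# Example: a 5-armed bandit; the learner's job is to discover that arm 3 is best.
bandit = BernoulliBandit([0.2, 0.5, 0.4, 0.7, 0.1])
print(bandit.pull(0), bandit.pull(3))
```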
There are many, many ways in which you can solve this multi-armed bandit problem, but the crux is always that you have to balance exploration against exploitation. So I will talk about multiple solution concepts: what do we mean when we say we want to solve a multi-armed bandit problem?
One solution concept is asymptotic correctness. What do I mean by that? I do not put any bounds on you: here is this multi-armed bandit problem; give me a guarantee that eventually, as T tends to infinity, you will be selecting the arm that has the highest payoff. That is called asymptotic correctness, and it is one way of solving the problem. A lot of the older literature on bandit problems essentially concerned itself with asymptotic correctness. Of course, there were also some results on things like rates of convergence, how quickly you reach the guarantee and so on, but by and large the analysis was about asymptotic correctness: you come up with very simple algorithms and then show that asymptotically they converge to the right arm. So this is one kind of solution concept.
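As a concrete illustration of the kind of simple algorithm being alluded to (the lecture does not name one at this point), here is a sketch of epsilon-greedy selection with a decaying exploration rate; under standard conditions such a rule eventually settles on the highest-payoff arm. The class name and the 1/t decay schedule are assumptions made for the example.

```python
import random

class EpsilonGreedy:
    """Sketch of epsilon-greedy selection with a decaying exploration rate (epsilon_t = 1/t)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms        # how many times each arm has been pulled
        self.estimates = [0.0] * n_arms   # running estimates of each arm's expected payoff
        self.t = 0

    def select_arm(self):
        self.t += 1
        if random.random() < 1.0 / self.t:                 # explore with shrinking probability
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.estimates[a])  # exploit

    def update(self, arm, payoff):
        self.counts[arm] += 1
        # incremental sample-average update of the payoff estimate
        self.estimates[arm] += (payoff - self.estimates[arm]) / self.counts[arm]
```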
The second popular solution concept is known as regret optimality. Suppose I knew, from time zero, which arm is the best one; suppose I knew from the beginning which arm to pull, I kept pulling that arm over and over again, and I repeated this experiment many times. What would my expected payoff be? If you plot time against expected payoff, it is going to be some kind of flat line, and that is possibly the best expected payoff I can achieve, because I know from the beginning which is the right arm. Since I do not actually know this, I have to do some exploration to figure out which is the right arm, so my payoff curve will start lower and rise over time, eventually reaching that flat line as T goes to infinity. Now, the gap between the two is the reward I could have got if I had known the right answer from the beginning; it is, in some sense, a loss that I incurred because of my learning process. This is sometimes colourfully referred to as regret: if I had known this from the beginning, I could have done so much better. This is regret.
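In symbols (this is a standard way of writing what was just said, not a formula given in the lecture; Q*(a) denotes the true expected payoff of arm a, the notation used later in the PAC discussion), the cumulative regret after T pulls is the gap between always pulling the best arm a* and what the learner actually collected:

```latex
\mathrm{Regret}(T) \;=\; T\,Q^*(a^*) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} Q^*(a_t)\right],
\qquad a_t = \text{arm pulled at step } t.
```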
Another way of thinking about regret is that I am trying to maximize the total reward I obtain, not just the asymptotic payoff: even during learning I need to collect as much payoff as possible. Ideally, I would want the slope of my learning curve to be pretty steep; if the slope is very steep, the area between my curve and the optimal line, which is the regret, comes down. But what is the tradeoff? Typically, what you give up is that it takes a longer time to reach optimality, because to learn steeply you will usually be giving up some amount of exploration. There are some corner cases where you might actually miss out on important exploration because you are trying too hard to keep the regret small: I want to have very, very little regret, so I might skip certain key exploration that I should have done, and in some cases I will then never reach optimality, because I will have ignored certain important outcomes along the way. That is the trade-off you have to worry about; regret optimality is essentially looking at how steeply you learn at the outset. Of course, that does not mean I can be really bad in the long run. A curve that shoots up the y-axis very quickly may look as though it has small regret, but if it then flattens out below the optimal line, it keeps incurring a large constant regret at every step, and that is not acceptable either. So you have to balance the two: even though that curve rose quickly, it has still not come up to the optimal level, so you keep accumulating regret.
In fact, we know that no algorithm can guarantee that its regret grows more slowly than log T; the regret has to grow at least as log T. Suppose you have taken T time steps; the regret you have accumulated up to that point will be proportional to log T. As T becomes larger, the rate of growth becomes smaller and smaller, but that is the best rate of growth you can achieve. All you can play with is the constant: the growth will be some a times log T, and that a is what you can fiddle with, but the log T itself is non-negotiable. There are results showing that log T is a lower bound, so you cannot do better than log T in terms of regret. So if you think of the area above the learning curve and below the dotted optimal line, that area keeps growing as T becomes larger and larger, and the rate at which it grows will be at least log T. That is the result we have; I am not going to show you that result here, but we will talk about a couple of other things. Okay, is it clear? Does everyone understand what regret is? Good.
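One well-known algorithm whose regret does grow only logarithmically is UCB1 (Auer, Cesa-Bianchi and Fischer, 2002). It has not been introduced at this point in the lecture, so the sketch below is only a preview; it assumes a bandit object exposing n_arms and pull(arm) as in the earlier sketch.

```python
import math

def ucb1(bandit, horizon):
    """Sketch of UCB1: always pull the arm with the highest optimistic estimate."""
    counts = [0] * bandit.n_arms
    estimates = [0.0] * bandit.n_arms
    for t in range(1, horizon + 1):
        if t <= bandit.n_arms:
            arm = t - 1                                   # pull every arm once to initialise
        else:
            arm = max(
                range(bandit.n_arms),
                key=lambda a: estimates[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        payoff = bandit.pull(arm)
        counts[arm] += 1
        estimates[arm] += (payoff - estimates[arm]) / counts[arm]
    return estimates
```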
So the third thing I want to talk about is what is called PAC optimality. Or rather, I should not really call it PAC optimality; it is closer to PAC complexity. Here is a slightly tricky thing: PAC stands for Probably Approximately Correct.
Sometimes, in a very loose fashion, we tend to use these as interchangeable phrases: "he is probably right", "he is approximately right". But they are not interchangeable; they come from very different contexts and mean very different things. When we say somebody is probably right, that means he is either right or wrong: right with some probability, and wrong with some probability. When we say somebody is approximately right, he is almost surely not exactly right, but he is very close to being right. That is essentially what "approximately" means. It turns out that both of these concepts are applicable in the bandit setting.
So what do I mean when I say somebody is approximately right in the bandit setting? Remember what the goal is at the end of the day: I am supposed to give you back an arm, and that arm is supposed to have the highest expected payoff. When I say I am approximately right, it means the arm I return to you will be very close in payoff to the best possible arm. Suppose I return some arm a to you at the end of my algorithm; then Q*(a) will be very close to Q*(a*), where a* is the best arm. What is Q* again? The true expected payoff, which we do not know; the algorithm does not know Q*, but it returns an arm, and the guarantee is that the unknown Q* of the returned arm will be close to the Q* of the best arm. So that is the "approximately correct" part.
So what is the "probably correct" part here? Is it whether the returned arm is the optimal one? No: it is whether the answer is approximately correct or not, because we are only giving approximately correct guarantees. With a very high probability the returned arm is approximately correct; with some small probability the algorithm might give you an arm that is more than the allowed distance away from the best arm. That is probably approximately correct, which is what PAC means. "Probably correct" on its own would mean either optimal or not optimal, whereas probably approximately correct means that with some probability you are approximately correct and with some probability you are not.
Typically there are many ways in which people talk about this; one popular way of specifying it is called the (epsilon, delta)-PAC guarantee, where epsilon refers to the "approximately" part and delta refers to the "probably" part: with probability at least 1 minus delta, the arm I return to you will be within epsilon of the best arm. So this essentially means that the probability that Q*(a), for the arm a I return to you, is greater than Q*(a*) minus epsilon is at least 1 minus delta. There is another way of writing it, but that is not what the PAC framework guarantees. What is the difference between the first one and the other? The first is a relative guarantee, stated with respect to the best arm, and the other is an absolute guarantee; you can think of it either way, as an absolute guarantee or as a relative one, but this is essentially what PAC optimality means.
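Written out as a formula (this simply transcribes the verbal statement above, using the lecture's Q* notation), the (epsilon, delta)-PAC guarantee for a returned arm a is:

```latex
\Pr\!\left[\, Q^*(a) \;\ge\; Q^*(a^*) - \epsilon \,\right] \;\ge\; 1 - \delta .
```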
So, for a given epsilon and delta, if I can give you that guarantee, then you say the algorithm is PAC-optimal for that epsilon and delta. But what is the interesting part here? If you allow me to draw an infinite number of samples from the arms, I can always provide this guarantee: give me whatever epsilon and delta you want, I can just keep pulling arms, and at some point I can say, okay, now I have satisfied it. The optimality part comes in when you want to minimize the sample complexity: given an epsilon and a delta, what is the smallest number of arm pulls with which I can give you that (epsilon, delta)-PAC guarantee? That is essentially what we are looking at here: it is a sample-complexity question. The first concept was a correctness question, the second was a kind of rate-of-convergence question, and this one is a sample-complexity question. These are all slightly different notions of a solution; when I say you are solving a bandit problem, these are different notions of solutions, and the kinds of algorithms you come up with for addressing each of these questions would be pretty different.
would be pretty different so how do you know qar a star you don't know but I give you the
guarantee this is the if if you know qar a star then this will be as close as that so
how will we know when when to stop how will we know that will beep what will be number
of exactly so these are questions that we look at as we go along I have not told you what the
I'm just telling you what the solution concepts are right I not even told you how you actually
solve these problems right so when we look at those I'll tell you how to go about doing
this in fact it'll turn out that the algorithms themselves are very simple okay but to analyze
it to show that this kind of guarantee holds is where all the trick lies [Music] [Applause]