Core Theme
This content introduces Thompson Sampling (also known as Posterior Sampling) as a popular and increasingly researched Bayesian approach to solving multi-armed bandit problems, offering potential advantages over traditional methods like UCB.
There's only one more thing I want to mention, at a very high level — I don't want to get into the proofs. The people who ran the course last year actually had a guest lecture from somebody who works in this field, and he talked about another technique that has been gaining a lot of popularity recently. He went through it in gory detail — he even did some of the proofs from one of the papers his group wrote. Unfortunately I don't have that guest lecturer this time, so I'll just introduce the ideas to you, and if you want you can read up on them later. In some sense, whatever we've done so far gives you the basic tools you need to understand these more complex algorithms — that's really the whole point of my doing this.
If you look at the textbook, it mentions all of these algorithms almost in passing: there's something like one paragraph saying this is how you do UCB, it doesn't talk about median elimination at all, and another method I'll talk about in the next class, called REINFORCE, is also mentioned only in passing. The book then says it will build on some of these ideas as and when required for the full reinforcement learning problem. So in some sense the book treats bandits as a necessary evil that you need to know about on your way to the full RL problem. Given the focus of the book, that's a valid way of treating bandit problems, but bandits is itself a very rich literature and a very active area of research for a lot of reinforcement learning people. That's one of the reasons I covered bandits in a little more depth than I used to a couple of years ago — back then I did all of bandits in a single class.
Anyway, the one thing I do want to talk about is Thompson sampling, which is gaining a lot of traction now. More descriptively, you could call it posterior sampling; depending on which literature you're reading, some people call it posterior sampling and some call it Thompson sampling. The basic idea is the following. I'm trying to solve a bandit problem, and we know that if I knew the parameters of the bandit — all the Q-stars — I could solve it very easily: just take the arm with the highest Q-star. So what I'm going to do in the Thompson sampling approach is take a somewhat Bayesian way of looking at things. I'm solving an unknown problem, but I'm going to make some assumption about the parameters of that unknown problem. Essentially, the assumption I'm making is about the Q-stars: I assume the true Q-stars come from some distribution, and I start with some guess for that distribution. So if I know my rewards are bounded between 0 and 1 — as we've been assuming in many of the cases — what would be an appropriate assumption for this Q-star distribution?
A Beta distribution. Do people remember the Beta distribution? Those of you who are in the ML course, at least, should. The Beta distribution is a distribution whose support is limited to [0, 1], and it can take all kinds of shapes. It's the prior you typically assume whenever you're trying to find parameters of a distribution that are themselves limited to between 0 and 1, so it's a very popular prior for exactly this situation. So I make some kind of assumption. Suppose I don't know anything about the underlying problem — then what will I assume? Uniform. You can assume a uniform distribution, saying: I know nothing, so my Q-stars can come from anywhere.
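To make that concrete, here is a minimal sketch (my own illustration, not from the lecture) of Beta priors as guesses for a Q-star when rewards are bounded in [0, 1]; Beta(1, 1) is exactly the uniform "I know nothing" prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(alpha, beta) is supported on [0, 1], so it is a natural prior for a
# Q-star when rewards are bounded in [0, 1].
priors = {
    "uniform (know nothing)": (1, 1),   # Beta(1, 1) is the uniform distribution
    "believe the arm is good": (5, 2),  # mass concentrated towards 1
    "believe the arm is bad": (2, 5),   # mass concentrated towards 0
}
for name, (a, b) in priors.items():
    draws = rng.beta(a, b, size=100_000)  # plausible Q-star values under this prior
    print(f"{name}: Beta({a},{b}), mean of sampled Q-stars = {draws.mean():.2f}")
```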
So now what do I do? One way of solving this problem is to ask: given my current knowledge about the parameters, which arm should I play so that my reward is optimized? One way of thinking about that is to consider every possible bandit, that is, every combination of Q-star values — say Q-star 1 is 0.3, Q-star 2 is 0.5, Q-star 3 is some other value, and so on. I take one such bandit problem and figure out the best arm to play for it, take another bandit problem and figure out its best arm, take yet another and do the same; and for each of these bandit problems I look at the probability that it is the true set of Q-stars according to my current knowledge about the problem. If I start with a uniform distribution, everything is equally likely — every arm could be optimal, every arm could be non-optimal — so I'd basically have to play at random. But it turns out that doing this computation can be very cumbersome. Just think about it: I'm asking you to figure out, for each combination of arm values, the probability that that combination is the true one, and depending on the nature of the probability distributions you're assuming, this computation can become very complex. The posterior sampling idea is a way of approximating this computation.
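To see why this is cumbersome, here is one way to write the exact computation (my notation, not the lecturer's). The probability that arm $a$ is the best arm, given everything observed so far, is an integral over every possible assignment of Q-stars, weighted by how likely that assignment is under the current belief:

$$
\Pr(a \text{ is optimal} \mid \text{history})
= \int \mathbf{1}\!\left[\, q_a = \max_{b} q_b \,\right]\,
  p(q_1, \dots, q_K \mid \text{history})\; dq_1 \cdots dq_K .
$$

This $K$-dimensional integral generally has no closed form, and it is exactly the computation that posterior sampling sidesteps by drawing a single sample instead.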
What I do is the following — it's very simple. I have some distribution over what my Q-star values might be, one distribution per arm, and I draw a sample from each of those distributions. On the board it looks something like this: Q-star 1 has a distribution shaped like this, that one is Q-star 2, this is Q-star 3, and — let me find another colour, since I'm running out of colours — there is Q-star 4. So these are the current distributions I have for arms 1, 2, 3, and 4; the horizontal axis is the reward and the vertical axis is the probability. I'm assuming that the true value of each Q-star lies somewhere in this range; it's most probably the value where the curve peaks, but I don't know yet — there's some uncertainty about what the values are.
Now I draw samples — remember, we talked about drawing samples from Gaussians and other distributions. I draw one sample for this arm, one for that arm, another, and another. So now I have a specific bandit problem in which Q-star 1 is this value, Q-star 2 is this, Q-star 3 is this, and Q-star 4 is this: I've taken my distribution and drawn one bandit problem from it. Now I solve that problem — I pull arm 3, because arm 3 gave me the best value in this sample. I pull arm 3 and look at the actual reward I get for it; say this is the actual reward that arm 3 gives me. Now I go back and update the distribution for arm 3 based on this reward. What do you think it should become? It will become narrower, because I'm essentially reinforcing my belief about what the most probable outcome is. So this narrower curve is my belief now; this is my new distribution. Then I sample again. The probability that I'll sample a high value for the green arm goes down now, because when I played the green arm I got a smaller value. I keep doing this until my distributions essentially converge to the true Q-stars, at which point, when I sample a problem from them, I'll most probably be sampling the true problem — the actual underlying problem I'm trying to solve. This is called Thompson sampling, or posterior sampling.
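Here is a minimal sketch of that loop for a Bernoulli bandit with Beta priors (the true arm values and the horizon below are made-up numbers for illustration). For a 0/1 reward r, the played arm's posterior moves from Beta(α, β) to Beta(α + r, β + 1 − r):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-armed Bernoulli bandit; these true Q-stars are unknown to the learner.
true_q = np.array([0.3, 0.5, 0.7, 0.4])
K = len(true_q)

# Beta(1, 1) = uniform prior on each arm's Q-star ("I know nothing").
alpha = np.ones(K)
beta = np.ones(K)

for _ in range(2000):
    # 1. Draw one sample per arm: this picks out a single concrete bandit
    #    problem from the current belief.
    sampled_q = rng.beta(alpha, beta)
    # 2. Solve that sampled problem: play the arm with the highest sampled value.
    arm = int(np.argmax(sampled_q))
    # 3. Observe the actual (0/1) reward for that arm.
    reward = rng.binomial(1, true_q[arm])
    # 4. Conjugate update of the played arm's posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward

# Posterior means should now sit close to the true Q-stars, with most of the
# pulls having gone to the best arm.
print(alpha / (alpha + beta))
```

The exploration comes for free here: an arm with a wide posterior occasionally produces a large sample and gets played, while an arm whose posterior has narrowed around a low value almost never wins the argmax again — which is the "posteriors become focused" behaviour mentioned in the discussion below.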
It's getting a lot of attention recently, mainly because people have been able to show that it can give you better regret bounds than UCB, and for that reason a lot of attention is being paid to Thompson sampling approaches. The analysis gets really hairy — in fact, until recently there was no analysis of the Thompson sampling approach at all — but in the last three or four years several papers have come up with new techniques for analyzing these kinds of algorithms, different from what we've been talking about so far.
[Student] Is this round-based? We only look at one sample at each time step...
Yes — and we're not eliminating any arms, that's what you're saying, right? Well, it turns out you don't have to: even without eliminating arms you get very good complexity, and typically what happens is that after some time the posteriors become so focused that the probability of actually selecting a bad arm essentially vanishes. So you don't really need a round-based computation there.
[Student] But after a few steps, could the elimination be done?
Yes, that's a possibility. The only tricky part is that once you do this kind of elimination, you have to come back and analyze it. You saw how tricky that became earlier: I gave you the right events to consider, but coming up with those events is the hard part. Once you have the events, bounding the probabilities is easy; knowing that this is the actual set of conditions you should consider in order to prove the result is what's difficult. So once I say, okay, here is a round-based Thompson sampling approach, then going and showing that it actually gives the guarantees you want becomes tricky. Plain Thompson sampling, without any rounds, itself resisted analysis for about a decade. So it's not immediately clear, but more interesting results keep coming in.
By the way, there's a very old class of algorithms for solving bandit problems called learning automata algorithms. You all know about finite state machines: there are a bunch of states, input symbols come in, and you jump around between the states. There's a class of finite state machines called variable-structure automata, and people have used variable-structure automata to solve bandit problems. I'm not going to get into the details, but it turns out that some of what these variable-structure automata algorithms do is posterior sampling — it turns out to be very similar. Earlier we only knew asymptotic convergence results for these learning automata algorithms, but very recently — around October or November of last year — people showed that they also satisfy regret bounds, the log T regret bound. So there's a lot of interesting work happening in this space.
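For reference (standard bandit notation, not something stated in the lecture): a "log T" regret bound says that the expected regret after $T$ pulls grows only logarithmically in $T$, typically in the gap-dependent form

$$
\mathbb{E}[R_T] \;\le\; \sum_{a:\,\Delta_a > 0} O\!\left(\frac{\log T}{\Delta_a}\right),
\qquad \Delta_a = \max_b q^*_b - q^*_a ,
$$

which is the same order that UCB-style algorithms achieve.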
And of course you're welcome to try a round-based Thompson sampling algorithm yourself — though given the rate at which work comes out in this space, just check arXiv first; there might already be something out there. Okay, let's stop here. In the next class we'll look at ways of solving the bandit problem that don't depend on estimating the value function. Everything we've looked at so far finds the function Q — capital Q. Next class we'll look at a class of algorithms that don't maintain the value function and instead directly try to learn the probabilities with which you should select the arms.