Core Theme
This content introduces Thompson Sampling (also known as Posterior Sampling) as a popular and increasingly researched Bayesian approach to solving multi-armed bandit problems, offering potential advantages over traditional methods like UCB.
There's only one more thing I want to mention, at a very high level — I don't want to get into the proofs. The people who ran the course last year actually had a guest lecture from somebody who works in this field, and he talked about another technique that has been gaining a lot of popularity recently. He went through it in gory detail — he even did some of the proofs from one of the papers his group wrote. Unfortunately I don't have that guest lecturer this time, so I'll just introduce the ideas to you, and if you want you can read up on them later. In some sense, whatever we've done so far gives you the basic tools you need to understand these more complex algorithms — that's really the whole point of my doing this.
If you look at the textbook, it mentions all of these algorithms almost in passing: there's something like one paragraph saying this is how you do UCB, it doesn't talk about median elimination at all, and another method I'll talk about in the next class, called REINFORCE, is also mentioned only in passing. The book then says it will build on some of these ideas as and when required for the full reinforcement learning problem. So in some sense the book treats bandits as a necessary evil that you need to know about on your way to the full RL problem. Given the focus of the book, that's a valid way of treating bandit problems, but bandits is itself a very rich literature and a very active area of research for a lot of reinforcement learning people. That's one of the reasons I covered bandits in a little more depth than I used to a couple of years ago — back then I did all of bandits in a single class.
Anyway, the one thing I do want to talk about is Thompson sampling, which is gaining a lot of traction now. More descriptively, you could call it posterior sampling; depending on which literature you're reading, some people call it posterior sampling and some call it Thompson sampling. The basic idea is the following. I'm trying to solve a bandit problem, and we know that if I knew the parameters of the bandit — all the Q-stars — I could solve it very easily: just take the arm with the highest Q-star. So what I'm going to do in the Thompson sampling approach is take a somewhat Bayesian way of looking at things. I'm solving an unknown problem, but I'm going to make some assumption about the parameters of that unknown problem. Essentially, the assumption I'm making is about the Q-stars: I assume the true Q-stars come from some distribution, and I start with some guess for that distribution. So if I know my rewards are bounded between 0 and 1 — as we've been assuming in many of the cases — what would be an appropriate assumption for this Q-star distribution?
A Beta distribution. Do people remember the Beta distribution? Those of you who are in the ML course, at least, should. The Beta distribution is a distribution whose support is limited to [0, 1], and it can take all kinds of shapes. It's the prior you typically assume whenever you're trying to find parameters of a distribution that are themselves limited to between 0 and 1, so it's a very popular prior for exactly this situation. So I make some kind of assumption. Suppose I don't know anything about the underlying problem — then what will I assume? Uniform. You can assume a uniform distribution, saying: I know nothing, so my Q-stars can come from anywhere.
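To make that concrete, here is a minimal sketch (my own illustration, not from the lecture) of Beta priors as guesses for a Q-star when rewards are bounded in [0, 1]; Beta(1, 1) is exactly the uniform "I know nothing" prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(alpha, beta) is supported on [0, 1], so it is a natural prior for a
# Q-star when rewards are bounded in [0, 1].
priors = {
    "uniform (know nothing)": (1, 1),   # Beta(1, 1) is the uniform distribution
    "believe the arm is good": (5, 2),  # mass concentrated towards 1
    "believe the arm is bad": (2, 5),   # mass concentrated towards 0
}
for name, (a, b) in priors.items():
    draws = rng.beta(a, b, size=100_000)  # plausible Q-star values under this prior
    print(f"{name}: Beta({a},{b}), mean of sampled Q-stars = {draws.mean():.2f}")
```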
So now what do I do? One way of solving this problem is to ask: given my current knowledge about the parameters, which arm should I play so that my reward is optimized? One way of thinking about that is to consider every possible bandit, that is, every combination of Q-star values — say Q-star 1 is 0.3, Q-star 2 is 0.5, Q-star 3 is some other value, and so on. I take one such bandit problem and figure out the best arm to play for it, take another bandit problem and figure out its best arm, take yet another and do the same; and for each of these bandit problems I look at the probability that it is the true set of Q-stars according to my current knowledge about the problem. If I start with a uniform distribution, everything is equally likely — every arm could be optimal, every arm could be non-optimal — so I'd basically have to play at random. But it turns out that doing this computation can be very cumbersome. Just think about it: I'm asking you to figure out, for each combination of arm values, the probability that that combination is the true one, and depending on the nature of the probability distributions you're assuming, this computation can become very complex. The posterior sampling idea is a way of approximating this computation.
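To see why this is cumbersome, here is one way to write the exact computation (my notation, not the lecturer's). The probability that arm $a$ is the best arm, given everything observed so far, is an integral over every possible assignment of Q-stars, weighted by how likely that assignment is under the current belief:

$$
\Pr(a \text{ is optimal} \mid \text{history})
= \int \mathbf{1}\!\left[\, q_a = \max_{b} q_b \,\right]\,
  p(q_1, \dots, q_K \mid \text{history})\; dq_1 \cdots dq_K .
$$

This $K$-dimensional integral generally has no closed form, and it is exactly the computation that posterior sampling sidesteps by drawing a single sample instead.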
What I do is the following — it's very simple. I have some distribution over what my Q-star values might be, one distribution per arm, and I draw a sample from each of those distributions. On the board it looks something like this: Q-star 1 has a distribution shaped like this, that one is Q-star 2, this is Q-star 3, and — let me find another colour, since I'm running out of colours — there is Q-star 4. So these are the current distributions I have for arms 1, 2, 3, and 4; the horizontal axis is the reward and the vertical axis is the probability. I'm assuming that the true value of each Q-star lies somewhere in this range; it's most probably the value where the curve peaks, but I don't know yet — there's some uncertainty about what the values are.
Now I draw samples — remember, we talked about drawing samples from Gaussians and other distributions. I draw one sample for this arm, one for that arm, another, and another. So now I have a specific bandit problem in which Q-star 1 is this value, Q-star 2 is this, Q-star 3 is this, and Q-star 4 is this: I've taken my distribution and drawn one bandit problem from it. Now I solve that problem — I pull arm 3, because arm 3 gave me the best value in this sample. I pull arm 3 and look at the actual reward I get for it; say this is the actual reward that arm 3 gives me. Now I go back and update the distribution for arm 3 based on this reward. What do you think it should become? It will become narrower, because I'm essentially reinforcing my belief about what the most probable outcome is. So this narrower curve is my belief now; this is my new distribution. Then I sample again. The probability that I'll sample a high value for the green arm goes down now, because when I played the green arm I got a smaller value. I keep doing this until my distributions essentially converge to the true Q-stars, at which point, when I sample a problem from them, I'll most probably be sampling the true problem — the actual underlying problem I'm trying to solve. This is called Thompson sampling, or posterior sampling.
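Here is a minimal sketch of that loop for a Bernoulli bandit with Beta priors (the true arm values and the horizon below are made-up numbers for illustration). For a 0/1 reward r, the played arm's posterior moves from Beta(α, β) to Beta(α + r, β + 1 − r):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-armed Bernoulli bandit; these true Q-stars are unknown to the learner.
true_q = np.array([0.3, 0.5, 0.7, 0.4])
K = len(true_q)

# Beta(1, 1) = uniform prior on each arm's Q-star ("I know nothing").
alpha = np.ones(K)
beta = np.ones(K)

for _ in range(2000):
    # 1. Draw one sample per arm: this picks out a single concrete bandit
    #    problem from the current belief.
    sampled_q = rng.beta(alpha, beta)
    # 2. Solve that sampled problem: play the arm with the highest sampled value.
    arm = int(np.argmax(sampled_q))
    # 3. Observe the actual (0/1) reward for that arm.
    reward = rng.binomial(1, true_q[arm])
    # 4. Conjugate update of the played arm's posterior.
    alpha[arm] += reward
    beta[arm] += 1 - reward

# Posterior means should now sit close to the true Q-stars, with most of the
# pulls having gone to the best arm.
print(alpha / (alpha + beta))
```

The exploration comes for free here: an arm with a wide posterior occasionally produces a large sample and gets played, while an arm whose posterior has narrowed around a low value almost never wins the argmax again — which is the "posteriors become focused" behaviour mentioned in the discussion below.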
It's getting a lot of attention recently, mainly because people have been able to show that it can give you better regret bounds than UCB, and for that reason a lot of attention is being paid to Thompson sampling approaches. The analysis gets really hairy — in fact, until recently there was no analysis of the Thompson sampling approach at all — but in the last three or four years several papers have come up with new techniques for analyzing these kinds of algorithms, different from what we've been talking about so far.
[Student] Is this round-based? We only look at one sample at each time step...
Yes — and we're not eliminating any arms, that's what you're saying, right? Well, it turns out you don't have to: even without eliminating arms you get very good complexity, and typically what happens is that after some time the posteriors become so focused that the probability of actually selecting a bad arm essentially vanishes. So you don't really need a round-based computation there.
[Student] But after a few steps, could the elimination be done?
Yes, that's a possibility. The only tricky part is that once you do this kind of elimination, you have to come back and analyze it. You saw how tricky that became earlier: I gave you the right events to consider, but coming up with those events is the hard part. Once you have the events, bounding the probabilities is easy; knowing that this is the actual set of conditions you should consider in order to prove the result is what's difficult. So once I say, okay, here is a round-based Thompson sampling approach, then going and showing that it actually gives the guarantees you want becomes tricky. Plain Thompson sampling, without any rounds, itself resisted analysis for about a decade. So it's not immediately clear, but more interesting results keep coming in.
By the way, there's a very old class of algorithms for solving bandit problems called learning automata algorithms. You all know about finite state machines: there are a bunch of states, input symbols come in, and you jump around between the states. There's a class of finite state machines called variable-structure automata, and people have used variable-structure automata to solve bandit problems. I'm not going to get into the details, but it turns out that some of what these variable-structure automata algorithms do is posterior sampling — it turns out to be very similar. Earlier we only knew asymptotic convergence results for these learning automata algorithms, but very recently — around October or November of last year — people showed that they also satisfy regret bounds, the log T regret bound. So there's a lot of interesting work happening in this space.
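For reference (standard bandit notation, not something stated in the lecture): a "log T" regret bound says that the expected regret after $T$ pulls grows only logarithmically in $T$, typically in the gap-dependent form

$$
\mathbb{E}[R_T] \;\le\; \sum_{a:\,\Delta_a > 0} O\!\left(\frac{\log T}{\Delta_a}\right),
\qquad \Delta_a = \max_b q^*_b - q^*_a ,
$$

which is the same order that UCB-style algorithms achieve.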
And of course you're welcome to try a round-based Thompson sampling algorithm yourself — though given the rate at which work comes out in this space, just check arXiv first; there might already be something out there. Okay, let's stop here. In the next class we'll look at ways of solving the bandit problem that don't depend on estimating the value function. Everything we've looked at so far finds the function Q — capital Q. Next class we'll look at a class of algorithms that don't maintain the value function and instead directly try to learn the probabilities with which you should select the arms.