YouTube Transcript: RL Framework and Applications
Summary
Core Theme
Reinforcement learning (RL) is a machine learning paradigm where an agent learns to make optimal sequential decisions by interacting with an environment and receiving scalar reward signals. The core challenge lies in balancing exploration of new actions with exploitation of known rewards to maximize long-term cumulative reward.
Video Transcript
You can do all kinds of fancy stuff with reinforcement learning. The crux of reinforcement learning is that the learning agent learns in close interaction with an environment. The environment could be the helicopter, it could be the bicycle, or it could be your backgammon board and your opponent; all of these could constitute the environment, a variety of different choices. You sense the state the environment is in, figure out what action you should take in response to that state, and apply the action back to the environment. This causes a change in the state. Now comes the kicker: you should not just choose actions that are beneficial in the current state; you should choose actions in such a way that they will put you in a state which is beneficial for you in the future. Just capturing your opponent's queen is not enough in chess: that might give you a high reward, but it might put you in a really bad position, and you don't want that. You really want to be looking at the entire sequence of decisions that you're going to have to make, and then try to behave optimally with respect to that.
So what do we mean by behaving optimally in this case? We're going to assume that the environment is giving you some kind of an evaluation: falling down hurts, capturing a piece maybe gives you a small plus 0.5, winning the game gives you, say, plus 100. But it's not the case that every time you make a move or execute an action you get a reward or an evaluation from the environment; it could be just zero, it could be nothing.
I should point out that this whole idea of having an evaluation come from the environment is just a mathematical convenience that we have here. In reality, if you think about biological systems that learn using reinforcement learning, all they are getting is their usual sensory inputs. There is some fraction of the brain that sits there and interprets some of those sensory inputs as rewards or punishments. You fall down and get hurt: that's still a sensory input coming from your skin. Somebody pats you on the back: that's still a sensory input coming from the skin. It's just another kind of input, and you could choose to interpret it as a reward, or as a collision with an obstacle (if something is brushing against my shoulder, let me move away), or you can take it as somebody patting my back, so I did something good. It's a matter of interpretation. So this whole thing about having a state signal and a separate evaluation coming from the environment is a fiction, created to have a cleaner mathematical model; in reality things are a lot messier, and you don't have such a clean separation.
And like I said, you have a stochastic environment, and you have delayed, noisy evaluation. The new term we have added today is "scalar". That is one of the assumptions of the classical reinforcement learning approaches: the reward is a scalar signal. We talked about getting hurt, getting food and so on; what will happen mathematically is that I'll convert all of that into some kind of a number on a scale. Getting hurt might be minus 100, getting food might be plus 5, winning the game might be plus 20, capturing a piece might be plus 0.5, or something like that. So I'm going to convert them to a scale, and now that I have a single number that represents the evaluation, the goal is to get as much of that quantity as possible over the long run. Makes sense? If you have questions or doubts, stop me and ask.
Is a scalar mathematically easier to optimize? Not necessarily; I'm just saying it's like a cost function, if you want to think about it in terms of control systems. This is like a cost, and I'm trying to optimize the cost. If the cost is vector valued, then I have to start trading off one direction of the vector against another: which component of the vector is more important? Then you get into all kinds of Pareto-optimality questions, and it's not really clear what exactly is optimal in such cases. In fact, later on we have a homework exercise for you to argue for and against scalar rewards, so you can put down your thoughts then.
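Just to make this loop concrete, here is a minimal sketch in Python of the interaction being described. The Environment and Agent classes, their methods, and the numbers are all invented for illustration; they are not from the lecture or from any real library.

```python
# Minimal sketch of the agent-environment loop: sense state, act, get a scalar reward.

class Environment:
    def reset(self):
        """Return the initial state."""
        return 0

    def step(self, action):
        """Apply the action; return (next_state, scalar_reward, done)."""
        next_state = action                        # placeholder dynamics
        reward = 1.0 if action == 1 else 0.0       # placeholder scalar evaluation
        done = True                                # one-step episode, for illustration only
        return next_state, reward, done


class Agent:
    def act(self, state):
        """Pick an action in response to the sensed state."""
        return 1                                   # placeholder policy


env, agent = Environment(), Agent()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(state)                      # sense the state, choose an action
    state, reward, done = env.step(action)         # the action changes the environment's state
    total_reward += reward                         # accumulate the scalar signal
print("return for this episode:", total_reward)    # the goal: maximize this over the long run
```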
Here again, let me emphasize: this is not supervised learning. In supervised learning, this is essentially what you're going to see: there is an input, there is an output that you're producing, and somebody gives you a target output, saying this is what you're supposed to produce. You compare the output you are producing to the target output, form some error signal, and use that error to train your agent: you can try to minimize the error, do gradient descent on the error, or a variety of other things to train the agent. Here I don't have a target. I do have to learn a mapping from the input to the output, but I don't have a target, and hence I can't form an error; therefore trial and error becomes essential. If I have errors, I can form gradients of the errors and move in the direction opposite to the gradient of the error, and that gives me some direction in which to change the parameters that constitute the agent (the agent is going to be described in some way). The error gives me a direction. But here I don't know a direction: I do something, I get one evaluation, and I don't even know whether the evaluation is good or bad.
Think of writing an exam. I don't tell you the right answer; I just tell you "three". So what do you do now? Are you happy with the answer? Should you change it? Should you change it in one direction or in the other direction? What makes it even trickier is that you don't even know how much the exam is out of. When I say three, it could be three out of three, it could be three out of 100, or it could be three out of 18, which was the class average in ML in the first place. It could be any of these, so you don't even know whether three is a good number or a bad number. You have to explore to figure out, first, whether you can get higher than three or whether three is the best; and second, if you can get higher than three, how you should change your parameters to get higher than three. I have to change my parameters a little bit that way, or a little bit this way. If I'm cycling, I have to push down a little harder on the pedal, or a little softer, to figure out whether I stay balanced for a longer period of time or not. Unless I try these things, I won't know; this is why you need the trial-and-error part. If I push down a little harder and I stay balanced, maybe I should try pushing down even harder next time; maybe that will make it better. And then there might be some point where I tip over, and I need to come back. These are things you have to try; unless you try them, you don't even know which direction you have to move in. So this is much more than just the psychological aspect of trial and error; there is also a mathematical reason: if I want to adapt my parameters, I need to know the gradient, and to get any sense of that I need to try things.
Where does the reward come in, in the previous picture? The reward is what gives you the evaluation of the output. In the supervised case, the error is the evaluation of the output: if the error is zero, your output is perfect. But there you have a way of gauging what the error is, because you have a target to which you can compare, and from that you get the error. In the reinforcement learning case, the evaluation is given to you directly as the evaluation of the output; it's not necessarily a comparison against a target value or anything, and you don't know how the evaluation was generated. You just get some number corresponding to the output. Maybe I should have put an arrow coming in from the top saying the evaluation comes in from there, but that's exactly where it's coming from: it's a substitute for the error signal, except that you don't know how the evaluation was formed. Of course, the minor way in which it differs from the error is that you typically tend to minimize error, but you tend to maximize evaluation.
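To make the contrast concrete, here is a small sketch; the linear model, the numbers, and the evaluate function are all invented for illustration. With a target you can form an error and follow its gradient; with only a scalar evaluation you have to perturb the parameter and compare evaluations, which is exactly the trial and error being described.

```python
# Supervised case: a target is available, so an error (and its gradient) exists.
w, x, target = 0.0, 2.0, 3.0
for _ in range(100):
    output = w * x
    error = output - target        # only possible because the target is known
    w -= 0.1 * error * x           # gradient descent on the squared error

# Evaluative (RL-style) case: only a scalar score of the output is available.
def evaluate(output):              # a black box; the learner does not know this formula
    return -(output - 3.0) ** 2

v, step = 0.0, 0.1
for _ in range(100):
    # No gradient is available, so try a small change and keep whichever side scores higher.
    if evaluate((v + step) * x) > evaluate(v * x):
        v += step
    else:
        v -= step
```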
It's also not unsupervised learning. In unsupervised learning you have some kind of input that goes to the agent, and the agent figures out what the patterns in the input are. Here you have some kind of an evaluation, and you are expected to produce an action in response to the input; it is not simply pattern discovery. You might want to find patterns in the input so that you know what the right response is, but that is not the primary goal, whereas in unsupervised learning the pattern discovery itself is the primary goal. That is the difference.
So here is one slide which I think is, in a way, the soul of reinforcement learning. It's called temporal difference learning. I'll explain it in a little more detail in a couple of slides, but here is the intuition. If you remember the Pavlov's dog experiment, what was the dog doing? It was predicting the outcome of the bell: if the bell rings, there is an outcome that is going to happen, namely food, and it was then reacting appropriately to that outcome. In most of reinforcement learning you're going to be predicting some kind of outcome: am I going to get a reward if I do this, or am I not? Am I going to win this game if I make this move, or am I not? I'm always trying to predict the outcome, and the outcome here is the amount of reward or punishment I'm going to get. That is essentially what I'm trying to predict at every point.
The intuition behind what is called temporal difference learning is the following. Consider the prediction that I make at time t+1 of what the eventual outcome will be. Let's say I'm playing a game and I say, I'm going to win, I'm very sure I'm going to win. I can say that with greater confidence when I'm closer to the end of the game than I can at the beginning. If all the pieces have just been set up and I sit there and say I'm going to win the game, it's most probably wishful thinking. But if you have played the game for thirty minutes or so and there are only five pieces left on the board, and now I say I'm going to win, that's a much more confident prediction than the one I made at the beginning. Taking this to the extreme: the prediction I make at t+1 is probably more accurate than the prediction I make at t. So if I want to improve the prediction I make at t, what can I do? I can go forward in time: let the clock tick over and see what prediction I make at time t+1 with the additional knowledge I have gained. I will have moved one step closer to the end of the game, so I know a little bit more about how the game is proceeding, and I can make a prediction about whether I'll win or lose; then I can go back and use this to modify the prediction I made at time t. Say at time t I think there is a probability of 0.6 of me winning the game; then I make a move and find out that I'm going to lose the game with very high probability. What I'll do is go back and reduce the probability of winning that I assigned at time t: instead of 0.6 I'll say, okay, maybe 0.55 or something. So the next time I come to the same state I was in at time t, I won't predict 0.6, I'll say 0.55. That's essentially the idea behind temporal difference learning. It has a whole lot of advantages, which we'll talk about a couple of slides down, but one thing is that it has had a significant impact in behavioural psychology and in neuroscience. It's widely accepted that animals actually use some form of temporal difference learning, and in fact there are specific models that have been proposed for temporal difference learning which seem to explain some of the neurotransmitter behaviours in the brain.
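As a minimal sketch of that update: the state names, the prediction of 0.5 at time t+1, and the step size are assumptions chosen only to reproduce the 0.6-to-0.55 example from the lecture.

```python
# TD-style update of a win-probability estimate.
# V maps states to the predicted probability of eventually winning.

V = {"state_t": 0.6, "state_t1": 0.5}    # predictions at time t and t+1 (illustrative numbers)
alpha = 0.5                               # step size: how far to move toward the newer prediction

# The prediction at t+1 is taken to be more accurate, so pull V(state_t) toward it.
V["state_t"] += alpha * (V["state_t1"] - V["state_t"])
print(V["state_t"])                       # 0.55: the next visit to state_t predicts 0.55, not 0.6
```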
Suppose we are playing a game and we have ten moves we can make, and out of the ten we select the best one; that is the usual scenario. So is the temporal difference here that we are selecting the move based on the next move? No, no. At this point I'll be making a prediction about the probability of winning, and it could be for each of the moves: if I make this move, what is the probability of winning; if I make that move, what is the probability of winning. Let's say I make move two, and then I see a new position; my opponent responds to it, and I decide, oh my God, this is a much worse move than I thought earlier. So what I'll do is change the prediction I make for move two in the previous state. The other moves will not be affected, because the only move I took was two; only about that move do I have additional information, and therefore I can go back and change the prediction for move two alone. You still have the ten moves; you're not changing any of that.
Are we going back and changing the entire thing again, or just changing the prediction? Just the prediction. It's not like taking back a bad move. In an ideal world you should be able to take back a bad move, except if it's a parent playing with a kid, I don't think those things are allowed. In fact, when I play with my son we have sometimes had to rewind all the way to the beginning, and it will probably be me asking for the rewind, not him, because he'll be drubbing me in some of those games. But otherwise you can't; you just change the prediction, so the next time you play the game you'll be better at it, not in that game. For that game, well, basically you either messed up or you did well, whatever it is.
Now, looking at RL, let's take a game like Tic Tac Toe. How many of you have played Tic Tac Toe? Good; oh, even you put your hand up, okay, good. So in Tic Tac Toe you have these board positions and you make different moves. What I have drawn here is called a game tree. I start off with the initial board, which is empty, and there are nine possible branches for the first move: for X's move there are nine possible moves, so I have nine possible branches, and then for each of these I have eight possible branches (I'm not sure which is the right TV, huh), and they keep going. What we are going to do is formulate this as a reinforcement learning problem. How will you do that? I have all these board positions; let us say X is the reinforcement learning agent and O is the opponent. Initially, given a blank board, I am allowed to choose one among nine actions. The state that I'm going to see is the placement of the X's and O's on the board, and the moves I make are the actions. So in the initial position I have nine actions. When I make a move, do I get any reward? Not really; there's no natural reward signal that you can give. Essentially the reward I'm going to get in this case is: at the end of the moves, if I win I get a one, and if I don't win I get a zero. What is going to happen is that I'm going to keep playing these games multiple times.
Now, there is a note here on the slide; what does that note say? You have to assume it's an imperfect opponent, otherwise there's no point in trying to learn Tic Tac Toe. Why? Because you'll always draw, and the way we have set up the game you're indifferent between drawing and losing, so you learn nothing. You won't even learn to draw; you'll just learn nothing, because you can never win, so you're never going to get a reward of one, and you'll just be playing randomly. So it's a bad idea. Let's assume instead that you have an opponent that is imperfect, that makes mistakes, so that you can actually learn to figure out where the opponent makes mistakes and learn to exploit those. So your states are going to be these board positions, as you can see from the game being played out at the top of the slide; the actions you take are in response to those board positions; and finally, at the end of the game, if you win you get a one and if you don't win you get a zero. Is that clear?
In cases like this, does it have to be a binary sort of reward system, or could you have a scale instead, with three values: zero if you lose, half a point if you draw, one if you win? Sure; you could even do other things, like plus one if you win and minus one if you lose. Then could you have a perfect opponent and still learn? You possibly could, but you would probably have to play lots and lots of games, because against a perfect opponent it's almost impossible to start getting any feedback in the beginning: you'll always be losing, so it's going to be hard for you to learn, but you'll eventually learn something. It will take a lot of games, but you'll eventually learn something.
So when we are learning, at a particular stage, the probability of winning, and we update previous states, are we storing information for each and every state that we have entered? Yes, you're storing information for each and every state you have entered. Then how will that be different from exploring the whole state space every time? After you've played, say, a thousand games or a million games, you will have explored a lot of states, and you'll have to store for each state the probability of winning at that point and so on; how is that different from exploring it again, given that you would then know the probability of winning from there? Hold on; I haven't even told you how we are going to solve it yet. Let me explain that, and then you can come back and ask these questions, if you still have them.
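Here is a sketch of that formulation; the nine-character string encoding of the board and the helper names are assumptions made purely for illustration.

```python
# Tic Tac Toe as an RL problem: states are board positions, actions are the empty
# squares X can play, and the only reward is 1 for a win, 0 otherwise.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def legal_actions(board):                  # board: 9-char string of 'X', 'O', '.'
    return [i for i, c in enumerate(board) if c == "."]

def play(board, square, mark):
    return board[:square] + mark + board[square + 1:]

def reward(board):                         # evaluation arrives only at the end of the game
    won = any(all(board[i] == "X" for i in line) for line in LINES)
    return 1.0 if won else 0.0

start = "........."                        # empty board: nine possible actions
print(legal_actions(start))                # [0, 1, ..., 8]
print(reward(play(play(play(start, 0, "X"), 1, "X"), 2, "X")))   # 1.0: X has completed a row
```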
Okay, great. The way we are going to try to solve this game is as follows. For every board position, I'm going to try to estimate the reward I'll get if I start from there and play the game to the end. Now, if you think about it, what will this reward connote? If I win from there I get a one; if I don't win from there I get a zero. So when I ask what reward I expect to get starting from this board position, it's essentially an average over multiple games: some games I'll win, some games I won't win. What will this expected reward represent after having played many, many games? The probability of winning. The reward is going to represent the probability of winning in this particular case. If the reward had not been one, if it had been something else, say plus five, it would have been some function of the probability of winning. If it had been plus one for winning, minus one for losing, and zero for a draw, it's something more complex; it's no longer the probability of winning, it's the gain I expect to get, something like the fraction of games I expect to win over the fraction of games I expect to lose. It becomes a little more complex, so there could be some interpretation for the value function, but in general it is just the expected reward that I'm going to get starting from a particular board position. That is what I'm trying to estimate.
Let us assume that I have such an expectation well defined for me. Now I come to a specific position, say this position here. How will I decide what my next move has to be? Whichever next state has the highest probability of winning. I just look ahead: if I put the X here, what is the probability of winning; if I put the X there, what is the probability of winning; if I put the X over there, what is the probability of winning. I do this for each one of them, figure out whichever has the highest probability of winning, and put the X there. That's how I'm going to use this function. Does it make sense? This is very important; this is something you should understand, because it is the crux of all reinforcement learning algorithms. I'm going to learn this function that tells me: if you are in this state and you play things out to the end, what is the expected payoff you will get, whether you call it reward or punishment or cost; what is the expected value you're going to get. And I want to behave according to this learned function: when I come to a state, I look ahead, figure out which of the next states has the highest expectation, and go to that state.
expectation what's the simplest way to learn the
expectation come to that St from that fraction of essentially you keep track of what happens
essentially keep track of the trajectory through the game tree right you play a game you go all the
way to the end right so you keep track of the TR trory and if you win right you go back along the
trajectory and update every state that you saw on the trajectory you update the probability of
winning right you just increase it a little bit or you come to the end of the game and you found
that you have not won right you go back along the trajectory decrease the probability of winning a
little bit right alternatively you can keep the history of all the games you have played so far
right at every after every game has been completed you can go back and compute the average probility
of winning across the entire history of all the games in which you saw that particular position
right makes sense that's easiest way of estimating this probability right but the problem with this
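A sketch of this wait-till-the-end style of estimation, keeping a running average of the final outcome for every position visited in a game; the data structures are assumptions for illustration, and the incremental form of the average already avoids storing the full history, which is the fix mentioned in a moment.

```python
# After each complete game, update every visited position with the final outcome
# (1 for a win, 0 otherwise) by keeping a running average per position.

from collections import defaultdict

visits = defaultdict(int)     # how many games passed through each position
V = defaultdict(float)        # current estimate of P(win) from each position

def update_from_game(trajectory, outcome):
    """trajectory: list of board positions seen in one game; outcome: 1.0 or 0.0."""
    for board in trajectory:
        visits[board] += 1
        # incremental form of the sample average, so no game history needs to be stored
        V[board] += (outcome - V[board]) / visits[board]
```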
But the problem with this is, first, that you have to wait till the game ends, and second, that you have to store the history of all the games you have played; all of these could be potential drawbacks. You can get around the history part by coming up with an incremental rule, but the main difficulty here is that you have to wait all the way to the end of the game before you can change anything along the way. Tic Tac Toe is easy: how many moves can you make in Tic Tac Toe? At best four; the fifth one is determined for you, so it's basically four choices that you make. That's easy enough to remember, so you can always wait till the end of the game and then make the updates. But what if it's a much more complex situation? What if you're playing chess? Maybe you can still wait till the end. What if you're cycling? Can you wait till the end there? Exactly, we don't know; it depends on where you're cycling. If you're learning to cycle inside IIT Madras, it's fine; if you're learning to cycle somewhere on Sardar Patel Road, you don't even want to think about what "the end" might be. So there are some tasks for which you would really like to learn along the way, and this is where TD learning comes in to help. I don't think I have that slide; anyway, I'm not using the fancy tool where I can draw on the projection, so let me see if I can do it here.
Suppose I have come to this position here, and I know at this point that the probability of winning from here is, say, 0.3. I came here by making a move from a previous position, where I had thought my probability of winning was, let us say, 0.6. But then I looked at my next states and found that the best one was 0.3, somehow, so I went there. Now, if the best I can do from here is 0.3, then my saying 0.6 back there means something is wrong, and I should probably decrease the probability of winning from that earlier position. Why could it have happened that I thought it was 0.6 there, but the best among the next states was 0.3? Maybe the other branch was fully explored and all of it got decreased to 0.4? The thing is, whenever I came through that earlier position before, maybe I won: it so happened that when I initially went through it and played the game out, I might actually have won some of those games, so I would have pushed its estimate up to 0.6. But it's possible for me to reach this later position by playing a different sequence of moves as well. For example, to come here, I could have put the X first here and then there, or the other way around; either way I could have reached this position. There are many combinations of moves by which I could have reached the same position. To reach here, there are different orders in which I could have placed the O's and the X's: here we are showing a specific order, where the O was first put here and then there, and the X was first put here and then there, but it could very well be that I put the X here first and the O there first and then the rest. There are multiple ways in which this position could have been reached. Sometimes when I played those other games I lost, and sometimes when I played these games I won. So, due to some random fluctuations, when I go through this specific path I have a higher estimate of winning, but when I went through the other paths I had a lower estimate. But we know that in Tic Tac Toe it really doesn't matter what path you took: once you reach that position, what is going to happen from there on is determined only by that position. So what I can do now is take this 0.3 and use it to pull that 0.6 down. I was very confident back there, I thought I'd win with probability 0.6, but the best probability I have from the next stage is 0.3; therefore I should not be so confident there.
How far down should I move it? Good point; that depends on how stochastic your game is. If your game has a lot of variability, then you don't want to commit completely to 0.3; you might want to say, let me move it a little bit towards 0.3. But if it's a more or less deterministic game, then you can say, sure, let me go all the way to 0.3. It depends on the nature of the game; we'll talk about these issues later. But yes, how far down to move is a tricky thing. And is calling it a tree misleading? Yes, it's misleading: it's usually called a game tree, but in this case it's really a game graph.
So, is that clear? This is an instance of temporal difference learning: using the next state's value like this to update the previous state is what is called temporal difference learning.
There's one other thing which I should mention here. Suppose I always take the move which I think is the best move right now. Let's talk about what happens. I start tabula rasa; I have never played Tic Tac Toe before. I play the game once, I get to the end, and I win. Now what do I do? I go back, whether I'm using temporal difference learning or waiting till the end and updating, whatever it is, and I change the value of all the moves I made in this particular game. So the next time I come to a board position, what am I going to do? I look at all possible moves; everything except the one that I played will have a value of zero, and the one that I played will have something slightly higher than zero, so I'm going to take that. In fact, how many of you have watched the movie Groundhog Day? It'll be like Groundhog Day: I'll be playing the same game again and again, because that's what happened to give me a win the first time around. But that might not be the best way to play. So I need to explore; I need to explore multiple options. I should not always be playing what I currently think is the best move; I need to do some amount of exploration so that I can figure out whether there are better moves than what I currently think is the best move. In Tic Tac Toe there is inherently some kind of noise if your opponent is random, but if your opponent is not random, if the opponent is also playing a fixed rule and you are playing greedily, then you'll be playing only a very small fraction of the game tree and you won't have explored the rest of the outcomes. So you have to do some amount of things at random, so that you learn more about the game.
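One standard way to mix in some random moves is epsilon-greedy selection. The lecture doesn't commit to a particular scheme here, so treat this as one illustrative choice, reusing the hypothetical helpers from the earlier sketches.

```python
import random

def epsilon_greedy_move(board, epsilon=0.1):
    # With small probability explore a random legal move; otherwise exploit the
    # current value estimates via the greedy one-step look-ahead from before.
    if random.random() < epsilon:
        return random.choice(legal_actions(board))
    return greedy_move(board)
```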
Now here is a question for you. When I'm estimating these probabilities of winning, let us say I've reached a position, I look down, and the action that gives me the highest probability of winning gives me, say, 0.8. But I want to explore, so I take an action that leads to another board position that has a probability of winning of, say, 0.4. Should I use this 0.4 to update the probability at the position I came from, or not? No? Why? Are you questioning the whole TD idea? One answer: that value was learned with a particular sequence, and now you're exploring, so you shouldn't update it while exploring; you should probably wait for the end, or just ignore the exploration. Okay, any other answer? You have to update it, because you're learning about that move; whether it is a good move or a bad move will be found out. I agree that I have to update the value of that move; the question was whether I update the value of winning from the previous board position. That 0.4 I'll have to change, certainly; but do I change the value at the previous position, which was the question. The 0.8 was the best probability of winning among the next states; say the previous position itself had, whatever it was, a probability of winning of 0.6. I look at the states below it, and the best probability of winning there is 0.8, but because I'm exploring I take an action that has a probability of winning of 0.4. The question is: do I go back and change the 0.6 towards the 0.4, or do I leave the 0.6 as it is? Unless the explored value goes over? No, that won't happen; I'm exploring, so it will necessarily be less than the 0.8: this one is 0.4, that one is 0.6.
So here is the question. One way of arguing about it is to say: hey, if I'm playing to win, I'll play the best action from here, and the best action says 0.8; therefore I should not penalize this position for the bad action, the 0.4 one, which I took only to learn more about the system. That's one way of thinking about it. Another way of arguing is to say: no, no, this is how I am actually behaving now, so the value should reflect the probability of winning under the current behaviour policy, not some other ideal policy; it should be about how I'm behaving currently, and therefore I should update it. So which one is correct, the first or the second? The first one? You don't know what the second one is? You would say the second one? Great; all of you can go write that in your homework answers. I think this is one of the homework questions as well. Are you looking at the questions? I know you're browsing and not paying attention to me, but can you look at the questions as well? You don't have network here? The ICSR network doesn't work? There should be a Wi-Fi access point called ICSR Studio or something. No? Okay, fine. Do you have the questions, the book, offline? Okay, fine. I do believe this is one of the questions; otherwise, make up a question and write an answer to that, and we'll make sure this becomes one of the questions. But this is something, like I said, that I'm going to ask you to think about with the whole Tic Tac Toe example, and many of these answers have relevance later on. In fact, there are two different algorithms: one does option one, and the other does option two. So there is no right answer or wrong answer; the answer is "it depends". These are different things you can think about here. So now I have told you about two different ways of learning with Tic Tac Toe: one is to wait till the end and figure out what the probabilities will be; the other is to keep adapting the estimates as you go along. In both cases you'll have to explore; that's the tricky part. In both cases you have to explore, otherwise you'll not learn about the entire game. This is where the explore-exploit issue comes in.
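For reference, the two options are commonly identified with two of the algorithms named later in this lecture: Q-learning updates toward the best next action even when an exploratory action was taken (option one), while Sarsa updates toward the action actually taken (option two). A sketch over a hypothetical action-value table Q, with made-up argument names and no discounting:

```python
# Q maps (state, action) pairs to estimated future reward; alpha is the step size.
# The dictionary representation and the default of 0.0 are assumptions for this sketch.

def q_learning_update(Q, s, a, r, s_next, actions_next, alpha=0.1):
    # Option one: back up the value of the best next action, ignoring the exploratory choice.
    target = r + max(Q.get((s_next, a2), 0.0) for a2 in actions_next)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1):
    # Option two: back up the action actually taken under the current behaviour policy.
    target = r + Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```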
Okay. So, how do you know at what point to stop exploring and start choosing the best? Great question. Different algorithms deal with it in different ways; it's one of the crucial questions you have to answer in RL, and it's called the explore-exploit dilemma. You have to explore to find out which is the best action, and you have to exploit whatever knowledge you have gathered, that is, act according to the best observations already made; that is called exploitation. So one of the key questions is: when do you know you have explored enough? Should I explore now, or should I exploit now? This is called the explore-exploit dilemma. There is a slightly simpler version of reinforcement learning called bandit problems, somewhat colourfully called bandit problems. Of course, he's an expert on bandit problems; you can ask him. Bandit problems encapsulate this explore-exploit dilemma. My God, a lot of people are turning and looking at you now.
This is the next chapter we're going to look at; probably in the next class I'll start plunging into bandit problems in more detail. Bandits ignore a whole bunch of other things, like delayed rewards, sequential decisions and so on. But even in the absence of all those complications, even if I say that your whole problem is that you have to take an action and you'll get a reward, and your goal is to pick the action that gives the highest reward, say I give you ten actions and you have to pick the one that gives you the highest reward, the problem is that you don't know the probability distributions from which these rewards are coming. So you'll have to do some exploration: I have to actually do every action at least once just to know what the reward will be, even if the rewards are deterministic. I can't say which is the best action before I've tried every action at least once. If it's deterministic, that's fine: I just try every action once and I know the payoffs. But if it's stochastic, I have to try every action multiple times. How many times? That depends on the variability of the distribution. These are subtle questions which we'll talk about as we look at the next chapter, the first real chapter in the book; there's an intro chapter which talks about Tic Tac Toe and a whole bunch of other things.
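A sketch of that bandit setting: a few actions with unknown stochastic rewards, estimated by trying each action repeatedly and keeping incremental sample averages. The reward distributions and the 10% exploration rate are invented for illustration.

```python
import random

true_means = [0.2, 0.5, 0.8]            # unknown to the agent; invented for this sketch
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

for t in range(1000):
    # Explore occasionally; otherwise exploit the current best estimate.
    if random.random() < 0.1:
        a = random.randrange(3)
    else:
        a = estimates.index(max(estimates))
    r = random.gauss(true_means[a], 1.0)              # noisy scalar reward for that action
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]    # incremental sample average

print(estimates.index(max(estimates)))                # with enough tries, usually action 2
```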
Yeah, we'll stop soon; I'm not sure I want to go through all of this today, so we'll come back and talk about all of these later. I thought I would otherwise not spend enough time talking about the general background of RL and its applications, so I put in a few more slides; I typically do this on the board, and I'll do it on the board again when I get to chapter 3. One thing I wanted to tell you is this: there are two classes of algorithms that we will be talking about. One of them is based on what's called dynamic programming. How many of you know what dynamic programming is? I expect at least 23 hands up, and I have the mechanical engineers putting their hands up, not the CS guys. Really, you don't know what dynamic programming is? Something wrong with your hand, then? Okay, fine. So what is the essential idea behind dynamic programming? I'll talk about this again, as usual, and we'll dwell on it in more detail later, but the essential idea behind dynamic programming is that you use some kind of repeated structure in the problem to solve the problem more efficiently. Suppose I have a solution that I can give for a problem of size n; then I'll try to see if I can use that for defining the solution for the problem of size n plus one. Some kind of repeated substructure in the problem: that's a very, very rough way of describing what dynamic programming is. For example, one way of thinking about dynamic programming here is that I have this game tree, and I look at the values of winning, the expectations of winning, from all of these successor states, and I use them to compute the value, the probability of winning, from this state. Think about it this way: if from here I'm going to take, say, n steps, how many steps would you expect to take from the next state? n minus one. So I look at the probability of winning when I have only n minus one steps left, estimate that first, and then use that solution for estimating the probability of winning when I have n steps left. That's the idea behind dynamic programming.
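A sketch of that repeated-substructure idea: with a known model (the transition probabilities and rewards below are invented), the expected value with n steps left is built from the already-computed values with n minus one steps left.

```python
# Backward induction: V[n][s] = expected value of state s with n steps left, built from V[n-1].
# The two-state model (P, R) and the horizon are purely illustrative.

P = {"s0": {"s0": 0.5, "s1": 0.5},        # transition probabilities from each state
     "s1": {"s0": 0.1, "s1": 0.9}}
R = {"s0": 0.0, "s1": 1.0}                # reward collected on leaving each state

horizon = 3
V = [{s: 0.0 for s in P}]                 # with 0 steps left, the value is 0
for n in range(1, horizon + 1):
    V.append({s: R[s] + sum(p * V[n - 1][s2] for s2, p in P[s].items()) for s in P})

print(V[horizon])                         # expected total reward over 3 steps from each state
```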
So what we will do is talk about how you use dynamic programming to solve reinforcement learning problems, and then we'll talk about various reinforcement learning methods, which are essentially online, approximate dynamic programming: they use dynamic programming ideas, but they don't have any knowledge of the system dynamics (the P and the R and the other things will become clear later). All you do then is, instead of having the entire outcome and using that for estimating the probability of winning here, you use just one step that you take through the tree: you use what happens in that one step to update the probability of winning here. And unlike dynamic programming, in reinforcement learning methods we'll be using samples that you get as you move through the state space. This is the TD method. In the other method I explained for Tic Tac Toe, what would you do? Your sample runs all the way to the end, and you use that to update. So those are two different ways of using samples.
So in this case, the update of the value of the state at time t is determined by the value of the state at time t+1, which means the value at t+1 should be computed first; and since that in turn depends on the state at t+2, that should be computed first as well. So won't this be the same thing as exploring all the way down and then computing? If you're doing it for the first time, the first time down the tree there will actually be no updates, because the value at t+1 will also be zero and the value at t will also be zero. The first time you go down there are no updates; but once you reach an end, from there you start updating the previous state, and the next time you go down the tree the updates keep propagating further up.
I don't understand how this is different from updating at the end, based on whether you won the game or lost the game. Whether I won or lost the game: there, I'm taking the exact outcome that happened in that particular trial, that particular game, and using it to change my probabilities. But here I'm not just taking the outcome of that particular game; I'm looking at the expected value of winning from the next board position. If I wait all the way to the end, say I won, and I take that and update the state at time t, then I'm updating it only with the fact of whether I won or not. But if I'm updating it from the state at t+1, remember that I could have reached that state in multiple ways before; when I do the update from the state at t+1, it's essentially the average accumulated over all the previous trials that I'll be using. So if I play all the way to the end and update, it will be with a one or a zero, whereas the value at t+1 could be anywhere between one and zero, depending on the probability of winning, and I'll be using that value for the update. That's the crucial difference.
Okay. So there are many different solution methods. There are what are called temporal difference methods, which include different algorithms: TD(lambda), Q-learning, Sarsa, actor-critic, and so on. Then there is a whole set of algorithms called policy search algorithms, and then there is dynamic programming. We will look at each of these classes of algorithms over the next few lectures. These are mostly from the textbook, except the policy search algorithms, which I will be covering from other material. And there are a whole bunch of applications for RL; they're all over the place, as you can see: optimal control, optimization, operations research, psychology, neuroscience. That's one of the reasons I was asking whether there is anyone from biotech, because biotech people do use reinforcement learning a lot, and usually there are one or two of them in the RL class. I'd be surprised; maybe the one biotech student who had registered according to the academic website just gave up in CS36 and went back, I don't know.
So here is the most recent hot thing that came out of RL: more game playing, and for a change it's not from IBM, it's from Google. The company that actually built the first of these arcade-game-playing engines was called DeepMind, and as soon as DeepMind built a successful engine, Google bought them. So now it's Google DeepMind, but it's a separate entity; it's not part of Google proper. Google DeepMind operates out of London, and they're doing all kinds of interesting stuff; many of the hot, very recent advances in reinforcement learning over the last year or so seem to be coming out of DeepMind. So what did they do? How many of you know about these Atari games? Everyone knows about Atari games; people are getting tired. Really, no one has played Pac-Man? How about Pong, Breakout, Space Invaders? Come on. Anyway, what happened was that a team at the University of Alberta put out what they call the ALE, the Arcade Learning Environment, which essentially allowed computers to play these Atari games. What the DeepMind folks came up with is a reinforcement learning agent that learned to play these games from scratch, just by looking at the screen. That's all the input it was getting: just the raw pixels from the screen were given as inputs to it. It used a very complex neural network, a deep network, and it is considered one of the hardest learning problems solved by a computer. I believe it's the only computational reinforcement learning paper ever to appear in Nature; it's usually very hard for people outside the natural sciences, including computer science folks, to publish in Nature. This was touted as a next step in trying to understand how humans process inputs, blah blah blah, all kinds of marketing jargon. But more important than anything else about this: it's reproducible. I told you about the helicopter (I think that's a warning sign for me to stop); that's basically Stanford and Berkeley, really two people, who got the helicopter to fly. I told you about the backgammon player; Gerry Tesauro is the one person who gets the backgammon player to work, partly because the input features he uses in there are proprietary, but partly because it's a very hard problem to solve. The amazing thing about this Atari game engine is that these guys have released the code: if you have powerful enough GPUs, you can set it up here, get it to play, and get a reasonably well-working engine that plays the Atari games. That's the amazing thing about it, that it's reproducible, as opposed to many of the other success stories we've had in the past. I do believe I have just one more slide after this, so let's see if this will work. Okay.
So, in case you have any doubts, the green one is the learning agent. No, no, this is just sped up for you to see; it's not like the game is progressing at that rate, but you can see the scores. Mind you, it was not given a reward; it is just given the screen. It never got a reward for winning or anything; you have to understand that the pixels at the top are the reward, and if you give it rewards explicitly it becomes cheating. What's that? Yeah, which is what they did: they did add a "game over" signal, which the RL purists consider cheating, but yes, they did add a game-over sign. So the longer you keep the game going, the better it is, basically. I think this is getting boring; it has learned. This one is Seaquest: you have to swim, then sink down, get some things, and come up. Seaquest is a game that it never learned to play really well; there are a few games like this. When they initially published the paper, I think they had something like 45 or 46 out of the 50 Atari games that they were able to play well, and I think in 43 of them it had better-than-human performance. I think the current state is that there is one game it doesn't play well, and they have better-than-human performance in something like 48 of those games, which is amazing if you think about it.
These are a lot of references, but we'll be using the Sutton and Barto textbook, the second edition; there is a pre-release copy which Rich Sutton is continuously updating on his web page. There is also a very short introduction to reinforcement learning in Tom Mitchell's book, and there's a chapter in Russell and Norvig. If you're interested in the neuroscience perspective on reinforcement learning, or the history of how behavioural psychology models evolved into temporal difference learning, you can look at Dayan and Abbott; it has one chapter on reinforcement learning, and you don't have to understand the neuroscience to understand the RL chapter, so don't get put off by the title of the book. A more mathematically grounded introduction to reinforcement learning would be Bertsekas and Tsitsiklis. Apart from these, we'll use one more book, which is Markov Decision Processes by Puterman; we'll use certain chapters from Puterman, and that's just for us to understand MDPs, Markov decision processes, better.