YouTube Transcript: RL Framework and Applications
Summary
Core Theme
Reinforcement learning (RL) is a machine learning paradigm where an agent learns to make optimal sequential decisions by interacting with an environment and receiving scalar reward signals. The core challenge lies in balancing exploration of new actions with exploitation of known rewards to maximize long-term cumulative reward.
Video Transcript
You can do all kinds of fancy stuff with reinforcement learning. The crux of reinforcement learning is that the learning agent learns in close interaction with an environment. The environment could be the helicopter, it could be the bicycle, or it could be your backgammon board and your opponent; all of these could constitute the environment, a variety of different choices. You sense the state the environment is in, figure out what action you should take in response to that state, and apply the action back to the environment. This causes a change in the state. Now comes the kicker: you should not just choose actions that are beneficial in the current state; you should choose actions in such a way that they will put you in a state which is beneficial for you in the future. Just capturing your opponent's queen is not enough in chess: that might give you a high reward, but it might put you in a really bad position, and you don't want that. You really want to be looking at the entire sequence of decisions that you're going to have to make, and then try to behave optimally with respect to that.
So what do we mean by behaving optimally in this case? We're going to assume that the environment is giving you some kind of an evaluation: falling down hurts, capturing a piece maybe gives you a small plus 0.5, winning the game gives you, say, plus 100. But it's not the case that every time you make a move or execute an action you get a reward or an evaluation from the environment; it could be just zero, it could be nothing.
I should point out that this whole idea of having an evaluation come from the environment is just a mathematical convenience that we have here. In reality, if you think about biological systems that learn using reinforcement learning, all they are getting is their usual sensory inputs. There is some fraction of the brain that sits there and interprets some of those sensory inputs as rewards or punishments. You fall down and get hurt: that's still a sensory input coming from your skin. Somebody pats you on the back: that's still a sensory input coming from the skin. It's just another kind of input, and you could choose to interpret it as a reward, or as a collision with an obstacle (if something is brushing against my shoulder, let me move away), or you can take it as somebody patting my back, so I did something good. It's a matter of interpretation. So this whole thing about having a state signal and a separate evaluation coming from the environment is a fiction, created to have a cleaner mathematical model; in reality things are a lot messier, and you don't have such a clean separation.
And like I said, you have a stochastic environment, and you have delayed, noisy evaluation. The new term we have added today is "scalar". That is one of the assumptions of the classical reinforcement learning approaches: the reward is a scalar signal. We talked about getting hurt, getting food and so on; what will happen mathematically is that I'll convert all of that into some kind of a number on a scale. Getting hurt might be minus 100, getting food might be plus 5, winning the game might be plus 20, capturing a piece might be plus 0.5, or something like that. So I'm going to convert them to a scale, and now that I have a single number that represents the evaluation, the goal is to get as much of that quantity as possible over the long run. Makes sense? If you have questions or doubts, stop me and ask.
Is a scalar mathematically easier to optimize? Not necessarily; I'm just saying it's like a cost function, if you want to think about it in terms of control systems. This is like a cost, and I'm trying to optimize the cost. If the cost is vector valued, then I have to start trading off one direction of the vector against another: which component of the vector is more important? Then you get into all kinds of Pareto-optimality questions, and it's not really clear what exactly is optimal in such cases. In fact, later on we have a homework exercise for you to argue for and against scalar rewards, so you can put down your thoughts then.
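Just to make this loop concrete, here is a minimal sketch in Python of the interaction being described. The Environment and Agent classes, their methods, and the numbers are all invented for illustration; they are not from the lecture or from any real library.

```python
# Minimal sketch of the agent-environment loop: sense state, act, get a scalar reward.

class Environment:
    def reset(self):
        """Return the initial state."""
        return 0

    def step(self, action):
        """Apply the action; return (next_state, scalar_reward, done)."""
        next_state = action                        # placeholder dynamics
        reward = 1.0 if action == 1 else 0.0       # placeholder scalar evaluation
        done = True                                # one-step episode, for illustration only
        return next_state, reward, done


class Agent:
    def act(self, state):
        """Pick an action in response to the sensed state."""
        return 1                                   # placeholder policy


env, agent = Environment(), Agent()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(state)                      # sense the state, choose an action
    state, reward, done = env.step(action)         # the action changes the environment's state
    total_reward += reward                         # accumulate the scalar signal
print("return for this episode:", total_reward)    # the goal: maximize this over the long run
```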
Here again, let me emphasize: this is not supervised learning. In supervised learning, this is essentially what you're going to see: there is an input, there is an output that you're producing, and somebody gives you a target output, saying this is what you're supposed to produce. You compare the output you are producing to the target output, form some error signal, and use that error to train your agent: you can try to minimize the error, do gradient descent on the error, or a variety of other things to train the agent. Here I don't have a target. I do have to learn a mapping from the input to the output, but I don't have a target, and hence I can't form an error; therefore trial and error becomes essential. If I have errors, I can form gradients of the errors and move in the direction opposite to the gradient of the error, and that gives me some direction in which to change the parameters that constitute the agent (the agent is going to be described in some way). The error gives me a direction. But here I don't know a direction: I do something, I get one evaluation, and I don't even know whether the evaluation is good or bad.
Think of writing an exam. I don't tell you the right answer; I just tell you "three". So what do you do now? Are you happy with the answer? Should you change it? Should you change it in one direction or in the other direction? What makes it even trickier is that you don't even know how much the exam is out of. When I say three, it could be three out of three, it could be three out of 100, or it could be three out of 18, which was the class average in ML in the first place. It could be any of these, so you don't even know whether three is a good number or a bad number. You have to explore to figure out, first, whether you can get higher than three or whether three is the best; and second, if you can get higher than three, how you should change your parameters to get higher than three. I have to change my parameters a little bit that way, or a little bit this way. If I'm cycling, I have to push down a little harder on the pedal, or a little softer, to figure out whether I stay balanced for a longer period of time or not. Unless I try these things, I won't know; this is why you need the trial-and-error part. If I push down a little harder and I stay balanced, maybe I should try pushing down even harder next time; maybe that will make it better. And then there might be some point where I tip over, and I need to come back. These are things you have to try; unless you try them, you don't even know which direction you have to move in. So this is much more than just the psychological aspect of trial and error; there is also a mathematical reason: if I want to adapt my parameters, I need to know the gradient, and to get any sense of that I need to try things.
Where does the reward come in, in the previous picture? The reward is what gives you the evaluation of the output. In the supervised case, the error is the evaluation of the output: if the error is zero, your output is perfect. But there you have a way of gauging what the error is, because you have a target to which you can compare, and from that you get the error. In the reinforcement learning case, the evaluation is given to you directly as the evaluation of the output; it's not necessarily a comparison against a target value or anything, and you don't know how the evaluation was generated. You just get some number corresponding to the output. Maybe I should have put an arrow coming in from the top saying the evaluation comes in from there, but that's exactly where it's coming from: it's a substitute for the error signal, except that you don't know how the evaluation was formed. Of course, the minor way in which it differs from the error is that you typically tend to minimize error, but you tend to maximize evaluation.
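To make the contrast concrete, here is a small sketch; the linear model, the numbers, and the evaluate function are all invented for illustration. With a target you can form an error and follow its gradient; with only a scalar evaluation you have to perturb the parameter and compare evaluations, which is exactly the trial and error being described.

```python
# Supervised case: a target is available, so an error (and its gradient) exists.
w, x, target = 0.0, 2.0, 3.0
for _ in range(100):
    output = w * x
    error = output - target        # only possible because the target is known
    w -= 0.1 * error * x           # gradient descent on the squared error

# Evaluative (RL-style) case: only a scalar score of the output is available.
def evaluate(output):              # a black box; the learner does not know this formula
    return -(output - 3.0) ** 2

v, step = 0.0, 0.1
for _ in range(100):
    # No gradient is available, so try a small change and keep whichever side scores higher.
    if evaluate((v + step) * x) > evaluate(v * x):
        v += step
    else:
        v -= step
```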
It's also not unsupervised learning. In unsupervised learning you have some kind of input that goes to the agent, and the agent figures out what the patterns in the input are. Here you have some kind of an evaluation, and you are expected to produce an action in response to the input; it is not simply pattern discovery. You might want to find patterns in the input so that you know what the right response is, but that is not the primary goal, whereas in unsupervised learning the pattern discovery itself is the primary goal. That is the difference.
So here is one slide which I think is, in a way, the soul of reinforcement learning. It's called temporal difference learning. I'll explain it in a little more detail in a couple of slides, but here is the intuition. If you remember the Pavlov's dog experiment, what was the dog doing? It was predicting the outcome of the bell: if the bell rings, there is an outcome that is going to happen, namely food, and it was then reacting appropriately to that outcome. In most of reinforcement learning you're going to be predicting some kind of outcome: am I going to get a reward if I do this, or am I not? Am I going to win this game if I make this move, or am I not? I'm always trying to predict the outcome, and the outcome here is the amount of reward or punishment I'm going to get. That is essentially what I'm trying to predict at every point.
The intuition behind what is called temporal difference learning is the following. Consider the prediction that I make at time t+1 of what the eventual outcome will be. Let's say I'm playing a game and I say, I'm going to win, I'm very sure I'm going to win. I can say that with greater confidence when I'm closer to the end of the game than I can at the beginning. If all the pieces have just been set up and I sit there and say I'm going to win the game, it's most probably wishful thinking. But if you have played the game for thirty minutes or so and there are only five pieces left on the board, and now I say I'm going to win, that's a much more confident prediction than the one I made at the beginning. Taking this to the extreme: the prediction I make at t+1 is probably more accurate than the prediction I make at t. So if I want to improve the prediction I make at t, what can I do? I can go forward in time: let the clock tick over and see what prediction I make at time t+1 with the additional knowledge I have gained. I will have moved one step closer to the end of the game, so I know a little bit more about how the game is proceeding, and I can make a prediction about whether I'll win or lose; then I can go back and use this to modify the prediction I made at time t. Say at time t I think there is a probability of 0.6 of me winning the game; then I make a move and find out that I'm going to lose the game with very high probability. What I'll do is go back and reduce the probability of winning that I assigned at time t: instead of 0.6 I'll say, okay, maybe 0.55 or something. So the next time I come to the same state I was in at time t, I won't predict 0.6, I'll say 0.55. That's essentially the idea behind temporal difference learning. It has a whole lot of advantages, which we'll talk about a couple of slides down, but one thing is that it has had a significant impact in behavioural psychology and in neuroscience. It's widely accepted that animals actually use some form of temporal difference learning, and in fact there are specific models that have been proposed for temporal difference learning which seem to explain some of the neurotransmitter behaviours in the brain.
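As a minimal sketch of that update: the state names, the prediction of 0.5 at time t+1, and the step size are assumptions chosen only to reproduce the 0.6-to-0.55 example from the lecture.

```python
# TD-style update of a win-probability estimate.
# V maps states to the predicted probability of eventually winning.

V = {"state_t": 0.6, "state_t1": 0.5}    # predictions at time t and t+1 (illustrative numbers)
alpha = 0.5                               # step size: how far to move toward the newer prediction

# The prediction at t+1 is taken to be more accurate, so pull V(state_t) toward it.
V["state_t"] += alpha * (V["state_t1"] - V["state_t"])
print(V["state_t"])                       # 0.55: the next visit to state_t predicts 0.55, not 0.6
```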
Suppose we are playing a game and we have ten moves we can make, and out of the ten we select the best one; that is the usual scenario. So is the temporal difference here that we are selecting the move based on the next move? No, no. At this point I'll be making a prediction about the probability of winning, and it could be for each of the moves: if I make this move, what is the probability of winning; if I make that move, what is the probability of winning. Let's say I make move two, and then I see a new position; my opponent responds to it, and I decide, oh my God, this is a much worse move than I thought earlier. So what I'll do is change the prediction I make for move two in the previous state. The other moves will not be affected, because the only move I took was two; only about that move do I have additional information, and therefore I can go back and change the prediction for move two alone. You still have the ten moves; you're not changing any of that.
Are we going back and changing the entire thing again, or just changing the prediction? Just the prediction. It's not like taking back a bad move. In an ideal world you should be able to take back a bad move, except if it's a parent playing with a kid, I don't think those things are allowed. In fact, when I play with my son we have sometimes had to rewind all the way to the beginning, and it will probably be me asking for the rewind, not him, because he'll be drubbing me in some of those games. But otherwise you can't; you just change the prediction, so the next time you play the game you'll be better at it, not in that game. For that game, well, basically you either messed up or you did well, whatever it is.
Now, looking at RL, let's take a game like Tic Tac Toe. How many of you have played Tic Tac Toe? Good; oh, even you put your hand up, okay, good. So in Tic Tac Toe you have these board positions and you make different moves. What I have drawn here is called a game tree. I start off with the initial board, which is empty, and there are nine possible branches for the first move: for X's move there are nine possible moves, so I have nine possible branches, and then for each of these I have eight possible branches (I'm not sure which is the right TV, huh), and they keep going. What we are going to do is formulate this as a reinforcement learning problem. How will you do that? I have all these board positions; let us say X is the reinforcement learning agent and O is the opponent. Initially, given a blank board, I am allowed to choose one among nine actions. The state that I'm going to see is the placement of the X's and O's on the board, and the moves I make are the actions. So in the initial position I have nine actions. When I make a move, do I get any reward? Not really; there's no natural reward signal that you can give. Essentially the reward I'm going to get in this case is: at the end of the moves, if I win I get a one, and if I don't win I get a zero. What is going to happen is that I'm going to keep playing these games multiple times.
Now, there is a note here on the slide; what does that note say? You have to assume it's an imperfect opponent, otherwise there's no point in trying to learn Tic Tac Toe. Why? Because you'll always draw, and the way we have set up the game you're indifferent between drawing and losing, so you learn nothing. You won't even learn to draw; you'll just learn nothing, because you can never win, so you're never going to get a reward of one, and you'll just be playing randomly. So it's a bad idea. Let's assume instead that you have an opponent that is imperfect, that makes mistakes, so that you can actually learn to figure out where the opponent makes mistakes and learn to exploit those. So your states are going to be these board positions, as you can see from the game being played out at the top of the slide; the actions you take are in response to those board positions; and finally, at the end of the game, if you win you get a one and if you don't win you get a zero. Is that clear?
In cases like this, does it have to be a binary sort of reward system, or could you have a scale instead, with three values: zero if you lose, half a point if you draw, one if you win? Sure; you could even do other things, like plus one if you win and minus one if you lose. Then could you have a perfect opponent and still learn? You possibly could, but you would probably have to play lots and lots of games, because against a perfect opponent it's almost impossible to start getting any feedback in the beginning: you'll always be losing, so it's going to be hard for you to learn, but you'll eventually learn something. It will take a lot of games, but you'll eventually learn something.
So when we are learning, at a particular stage, the probability of winning, and we update previous states, are we storing information for each and every state that we have entered? Yes, you're storing information for each and every state you have entered. Then how will that be different from exploring the whole state space every time? After you've played, say, a thousand games or a million games, you will have explored a lot of states, and you'll have to store for each state the probability of winning at that point and so on; how is that different from exploring it again, given that you would then know the probability of winning from there? Hold on; I haven't even told you how we are going to solve it yet. Let me explain that, and then you can come back and ask these questions, if you still have them.
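Here is a sketch of that formulation; the nine-character string encoding of the board and the helper names are assumptions made purely for illustration.

```python
# Tic Tac Toe as an RL problem: states are board positions, actions are the empty
# squares X can play, and the only reward is 1 for a win, 0 otherwise.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def legal_actions(board):                  # board: 9-char string of 'X', 'O', '.'
    return [i for i, c in enumerate(board) if c == "."]

def play(board, square, mark):
    return board[:square] + mark + board[square + 1:]

def reward(board):                         # evaluation arrives only at the end of the game
    won = any(all(board[i] == "X" for i in line) for line in LINES)
    return 1.0 if won else 0.0

start = "........."                        # empty board: nine possible actions
print(legal_actions(start))                # [0, 1, ..., 8]
print(reward(play(play(play(start, 0, "X"), 1, "X"), 2, "X")))   # 1.0: X has completed a row
```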
Okay, great. The way we are going to try to solve this game is as follows. For every board position, I'm going to try to estimate the reward I'll get if I start from there and play the game to the end. Now, if you think about it, what will this reward connote? If I win from there I get a one; if I don't win from there I get a zero. So when I ask what reward I expect to get starting from this board position, it's essentially an average over multiple games: some games I'll win, some games I won't win. What will this expected reward represent after having played many, many games? The probability of winning. The reward is going to represent the probability of winning in this particular case. If the reward had not been one, if it had been something else, say plus five, it would have been some function of the probability of winning. If it had been plus one for winning, minus one for losing, and zero for a draw, it's something more complex; it's no longer the probability of winning, it's the gain I expect to get, something like the fraction of games I expect to win over the fraction of games I expect to lose. It becomes a little more complex, so there could be some interpretation for the value function, but in general it is just the expected reward that I'm going to get starting from a particular board position. That is what I'm trying to estimate.
Let us assume that I have such an expectation well defined for me. Now I come to a specific position, say this position here. How will I decide what my next move has to be? Whichever next state has the highest probability of winning. I just look ahead: if I put the X here, what is the probability of winning; if I put the X there, what is the probability of winning; if I put the X over there, what is the probability of winning. I do this for each one of them, figure out whichever has the highest probability of winning, and put the X there. That's how I'm going to use this function. Does it make sense? This is very important; this is something you should understand, because it is the crux of all reinforcement learning algorithms. I'm going to learn this function that tells me: if you are in this state and you play things out to the end, what is the expected payoff you will get, whether you call it reward or punishment or cost; what is the expected value you're going to get. And I want to behave according to this learned function: when I come to a state, I look ahead, figure out which of the next states has the highest expectation, and go to that state.
expectation what's the simplest way to learn the
expectation come to that St from that fraction of essentially you keep track of what happens
essentially keep track of the trajectory through the game tree right you play a game you go all the
way to the end right so you keep track of the TR trory and if you win right you go back along the
trajectory and update every state that you saw on the trajectory you update the probability of
winning right you just increase it a little bit or you come to the end of the game and you found
that you have not won right you go back along the trajectory decrease the probability of winning a
little bit right alternatively you can keep the history of all the games you have played so far
right at every after every game has been completed you can go back and compute the average probility
of winning across the entire history of all the games in which you saw that particular position
right makes sense that's easiest way of estimating this probability right but the problem with this
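A sketch of this wait-till-the-end style of estimation, keeping a running average of the final outcome for every position visited in a game; the data structures are assumptions for illustration, and the incremental form of the average already avoids storing the full history, which is the fix mentioned in a moment.

```python
# After each complete game, update every visited position with the final outcome
# (1 for a win, 0 otherwise) by keeping a running average per position.

from collections import defaultdict

visits = defaultdict(int)     # how many games passed through each position
V = defaultdict(float)        # current estimate of P(win) from each position

def update_from_game(trajectory, outcome):
    """trajectory: list of board positions seen in one game; outcome: 1.0 or 0.0."""
    for board in trajectory:
        visits[board] += 1
        # incremental form of the sample average, so no game history needs to be stored
        V[board] += (outcome - V[board]) / visits[board]
```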
But the problem with this is, first, that you have to wait till the game ends, and second, that you have to store the history of all the games you have played; all of these could be potential drawbacks. You can get around the history part by coming up with an incremental rule, but the main difficulty here is that you have to wait all the way to the end of the game before you can change anything along the way. Tic Tac Toe is easy: how many moves can you make in Tic Tac Toe? At best four; the fifth one is determined for you, so it's basically four choices that you make. That's easy enough to remember, so you can always wait till the end of the game and then make the updates. But what if it's a much more complex situation? What if you're playing chess? Maybe you can still wait till the end. What if you're cycling? Can you wait till the end there? Exactly, we don't know; it depends on where you're cycling. If you're learning to cycle inside IIT Madras, it's fine; if you're learning to cycle somewhere on Sardar Patel Road, you don't even want to think about what "the end" might be. So there are some tasks for which you would really like to learn along the way, and this is where TD learning comes in to help. I don't think I have that slide; anyway, I'm not using the fancy tool where I can draw on the projection, so let me see if I can do it here.
Suppose I have come to this position here, and I know at this point that the probability of winning from here is, say, 0.3. I came here by making a move from a previous position, where I had thought my probability of winning was, let us say, 0.6. But then I looked at my next states and found that the best one was 0.3, somehow, so I went there. Now, if the best I can do from here is 0.3, then my saying 0.6 back there means something is wrong, and I should probably decrease the probability of winning from that earlier position. Why could it have happened that I thought it was 0.6 there, but the best among the next states was 0.3? Maybe the other branch was fully explored and all of it got decreased to 0.4? The thing is, whenever I came through that earlier position before, maybe I won: it so happened that when I initially went through it and played the game out, I might actually have won some of those games, so I would have pushed its estimate up to 0.6. But it's possible for me to reach this later position by playing a different sequence of moves as well. For example, to come here, I could have put the X first here and then there, or the other way around; either way I could have reached this position. There are many combinations of moves by which I could have reached the same position. To reach here, there are different orders in which I could have placed the O's and the X's: here we are showing a specific order, where the O was first put here and then there, and the X was first put here and then there, but it could very well be that I put the X here first and the O there first and then the rest. There are multiple ways in which this position could have been reached. Sometimes when I played those other games I lost, and sometimes when I played these games I won. So, due to some random fluctuations, when I go through this specific path I have a higher estimate of winning, but when I went through the other paths I had a lower estimate. But we know that in Tic Tac Toe it really doesn't matter what path you took: once you reach that position, what is going to happen from there on is determined only by that position. So what I can do now is take this 0.3 and use it to pull that 0.6 down. I was very confident back there, I thought I'd win with probability 0.6, but the best probability I have from the next stage is 0.3; therefore I should not be so confident there.
How far down should I move it? Good point; that depends on how stochastic your game is. If your game has a lot of variability, then you don't want to commit completely to 0.3; you might want to say, let me move it a little bit towards 0.3. But if it's a more or less deterministic game, then you can say, sure, let me go all the way to 0.3. It depends on the nature of the game; we'll talk about these issues later. But yes, how far down to move is a tricky thing. And is calling it a tree misleading? Yes, it's misleading: it's usually called a game tree, but in this case it's really a game graph.
So, is that clear? This is an instance of temporal difference learning: using the next state's value like this to update the previous state is what is called temporal difference learning.
There's one other thing which I should mention here. Suppose I always take the move which I think is the best move right now. Let's talk about what happens. I start tabula rasa; I have never played Tic Tac Toe before. I play the game once, I get to the end, and I win. Now what do I do? I go back, whether I'm using temporal difference learning or waiting till the end and updating, whatever it is, and I change the value of all the moves I made in this particular game. So the next time I come to a board position, what am I going to do? I look at all possible moves; everything except the one that I played will have a value of zero, and the one that I played will have something slightly higher than zero, so I'm going to take that. In fact, how many of you have watched the movie Groundhog Day? It'll be like Groundhog Day: I'll be playing the same game again and again, because that's what happened to give me a win the first time around. But that might not be the best way to play. So I need to explore; I need to explore multiple options. I should not always be playing what I currently think is the best move; I need to do some amount of exploration so that I can figure out whether there are better moves than what I currently think is the best move. In Tic Tac Toe there is inherently some kind of noise if your opponent is random, but if your opponent is not random, if the opponent is also playing a fixed rule and you are playing greedily, then you'll be playing only a very small fraction of the game tree and you won't have explored the rest of the outcomes. So you have to do some amount of things at random, so that you learn more about the game.
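One standard way to mix in some random moves is epsilon-greedy selection. The lecture doesn't commit to a particular scheme here, so treat this as one illustrative choice, reusing the hypothetical helpers from the earlier sketches.

```python
import random

def epsilon_greedy_move(board, epsilon=0.1):
    # With small probability explore a random legal move; otherwise exploit the
    # current value estimates via the greedy one-step look-ahead from before.
    if random.random() < epsilon:
        return random.choice(legal_actions(board))
    return greedy_move(board)
```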
Now here is a question for you. When I'm estimating these probabilities of winning, let us say I've reached a position, I look down, and the action that gives me the highest probability of winning gives me, say, 0.8. But I want to explore, so I take an action that leads to another board position that has a probability of winning of, say, 0.4. Should I use this 0.4 to update the probability at the position I came from, or not? No? Why? Are you questioning the whole TD idea? One answer: that value was learned with a particular sequence, and now you're exploring, so you shouldn't update it while exploring; you should probably wait for the end, or just ignore the exploration. Okay, any other answer? You have to update it, because you're learning about that move; whether it is a good move or a bad move will be found out. I agree that I have to update the value of that move; the question was whether I update the value of winning from the previous board position. That 0.4 I'll have to change, certainly; but do I change the value at the previous position, which was the question. The 0.8 was the best probability of winning among the next states; say the previous position itself had, whatever it was, a probability of winning of 0.6. I look at the states below it, and the best probability of winning there is 0.8, but because I'm exploring I take an action that has a probability of winning of 0.4. The question is: do I go back and change the 0.6 towards the 0.4, or do I leave the 0.6 as it is? Unless the explored value goes over? No, that won't happen; I'm exploring, so it will necessarily be less than the 0.8: this one is 0.4, that one is 0.6.
So here is the question. One way of arguing about it is to say: hey, if I'm playing to win, I'll play the best action from here, and the best action says 0.8; therefore I should not penalize this position for the bad action, the 0.4 one, which I took only to learn more about the system. That's one way of thinking about it. Another way of arguing is to say: no, no, this is how I am actually behaving now, so the value should reflect the probability of winning under the current behaviour policy, not some other ideal policy; it should be about how I'm behaving currently, and therefore I should update it. So which one is correct, the first or the second? The first one? You don't know what the second one is? You would say the second one? Great; all of you can go write that in your homework answers. I think this is one of the homework questions as well. Are you looking at the questions? I know you're browsing and not paying attention to me, but can you look at the questions as well? You don't have network here? The ICSR network doesn't work? There should be a Wi-Fi access point called ICSR Studio or something. No? Okay, fine. Do you have the questions, the book, offline? Okay, fine. I do believe this is one of the questions; otherwise, make up a question and write an answer to that, and we'll make sure this becomes one of the questions. But this is something, like I said, that I'm going to ask you to think about with the whole Tic Tac Toe example, and many of these answers have relevance later on. In fact, there are two different algorithms: one does option one, and the other does option two. So there is no right answer or wrong answer; the answer is "it depends". These are different things you can think about here. So now I have told you about two different ways of learning with Tic Tac Toe: one is to wait till the end and figure out what the probabilities will be; the other is to keep adapting the estimates as you go along. In both cases you'll have to explore; that's the tricky part. In both cases you have to explore, otherwise you'll not learn about the entire game. This is where the explore-exploit issue comes in.
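For reference, the two options are commonly identified with two of the algorithms named later in this lecture: Q-learning updates toward the best next action even when an exploratory action was taken (option one), while Sarsa updates toward the action actually taken (option two). A sketch over a hypothetical action-value table Q, with made-up argument names and no discounting:

```python
# Q maps (state, action) pairs to estimated future reward; alpha is the step size.
# The dictionary representation and the default of 0.0 are assumptions for this sketch.

def q_learning_update(Q, s, a, r, s_next, actions_next, alpha=0.1):
    # Option one: back up the value of the best next action, ignoring the exploratory choice.
    target = r + max(Q.get((s_next, a2), 0.0) for a2 in actions_next)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1):
    # Option two: back up the action actually taken under the current behaviour policy.
    target = r + Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```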
Okay. So, how do you know at what point to stop exploring and start choosing the best? Great question. Different algorithms deal with it in different ways; it's one of the crucial questions you have to answer in RL, and it's called the explore-exploit dilemma. You have to explore to find out which is the best action, and you have to exploit whatever knowledge you have gathered, that is, act according to the best observations already made; that is called exploitation. So one of the key questions is: when do you know you have explored enough? Should I explore now, or should I exploit now? This is called the explore-exploit dilemma. There is a slightly simpler version of reinforcement learning called bandit problems, somewhat colourfully called bandit problems. Of course, he's an expert on bandit problems; you can ask him. Bandit problems encapsulate this explore-exploit dilemma. My God, a lot of people are turning and looking at you now.
This is the next chapter we're going to look at; probably in the next class I'll start plunging into bandit problems in more detail. Bandits ignore a whole bunch of other things, like delayed rewards, sequential decisions and so on. But even in the absence of all those complications, even if I say that your whole problem is that you have to take an action and you'll get a reward, and your goal is to pick the action that gives the highest reward, say I give you ten actions and you have to pick the one that gives you the highest reward, the problem is that you don't know the probability distributions from which these rewards are coming. So you'll have to do some exploration: I have to actually do every action at least once just to know what the reward will be, even if the rewards are deterministic. I can't say which is the best action before I've tried every action at least once. If it's deterministic, that's fine: I just try every action once and I know the payoffs. But if it's stochastic, I have to try every action multiple times. How many times? That depends on the variability of the distribution. These are subtle questions which we'll talk about as we look at the next chapter, the first real chapter in the book; there's an intro chapter which talks about Tic Tac Toe and a whole bunch of other things.
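A sketch of that bandit setting: a few actions with unknown stochastic rewards, estimated by trying each action repeatedly and keeping incremental sample averages. The reward distributions and the 10% exploration rate are invented for illustration.

```python
import random

true_means = [0.2, 0.5, 0.8]            # unknown to the agent; invented for this sketch
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

for t in range(1000):
    # Explore occasionally; otherwise exploit the current best estimate.
    if random.random() < 0.1:
        a = random.randrange(3)
    else:
        a = estimates.index(max(estimates))
    r = random.gauss(true_means[a], 1.0)              # noisy scalar reward for that action
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]    # incremental sample average

print(estimates.index(max(estimates)))                # with enough tries, usually action 2
```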
Yeah, we'll stop soon; I'm not sure I want to go through all of this today, so we'll come back and talk about all of these later. I thought I would otherwise not spend enough time talking about the general background of RL and its applications, so I put in a few more slides; I typically do this on the board, and I'll do it on the board again when I get to chapter 3. One thing I wanted to tell you is this: there are two classes of algorithms that we will be talking about. One of them is based on what's called dynamic programming. How many of you know what dynamic programming is? I expect at least 23 hands up, and I have the mechanical engineers putting their hands up, not the CS guys. Really, you don't know what dynamic programming is? Something wrong with your hand, then? Okay, fine. So what is the essential idea behind dynamic programming? I'll talk about this again, as usual, and we'll dwell on it in more detail later, but the essential idea behind dynamic programming is that you use some kind of repeated structure in the problem to solve the problem more efficiently. Suppose I have a solution that I can give for a problem of size n; then I'll try to see if I can use that for defining the solution for the problem of size n plus one. Some kind of repeated substructure in the problem: that's a very, very rough way of describing what dynamic programming is. For example, one way of thinking about dynamic programming here is that I have this game tree, and I look at the values of winning, the expectations of winning, from all of these successor states, and I use them to compute the value, the probability of winning, from this state. Think about it this way: if from here I'm going to take, say, n steps, how many steps would you expect to take from the next state? n minus one. So I look at the probability of winning when I have only n minus one steps left, estimate that first, and then use that solution for estimating the probability of winning when I have n steps left. That's the idea behind dynamic programming.
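A sketch of that repeated-substructure idea: with a known model (the transition probabilities and rewards below are invented), the expected value with n steps left is built from the already-computed values with n minus one steps left.

```python
# Backward induction: V[n][s] = expected value of state s with n steps left, built from V[n-1].
# The two-state model (P, R) and the horizon are purely illustrative.

P = {"s0": {"s0": 0.5, "s1": 0.5},        # transition probabilities from each state
     "s1": {"s0": 0.1, "s1": 0.9}}
R = {"s0": 0.0, "s1": 1.0}                # reward collected on leaving each state

horizon = 3
V = [{s: 0.0 for s in P}]                 # with 0 steps left, the value is 0
for n in range(1, horizon + 1):
    V.append({s: R[s] + sum(p * V[n - 1][s2] for s2, p in P[s].items()) for s in P})

print(V[horizon])                         # expected total reward over 3 steps from each state
```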
So what we will do is talk about how you use dynamic programming to solve reinforcement learning problems, and then we'll talk about various reinforcement learning methods, which are essentially online, approximate dynamic programming: they use dynamic programming ideas, but they don't have any knowledge of the system dynamics (the P and the R and the other things will become clear later). All you do then is, instead of having the entire outcome and using that for estimating the probability of winning here, you use just one step that you take through the tree: you use what happens in that one step to update the probability of winning here. And unlike dynamic programming, in reinforcement learning methods we'll be using samples that you get as you move through the state space. This is the TD method. In the other method I explained for Tic Tac Toe, what would you do? Your sample runs all the way to the end, and you use that to update. So those are two different ways of using samples.
So in this case, the update of the value of the state at time t is determined by the value of the state at time t+1, which means the value at t+1 should be computed first; and since that in turn depends on the state at t+2, that should be computed first as well. So won't this be the same thing as exploring all the way down and then computing? If you're doing it for the first time, the first time down the tree there will actually be no updates, because the value at t+1 will also be zero and the value at t will also be zero. The first time you go down there are no updates; but once you reach an end, from there you start updating the previous state, and the next time you go down the tree the updates keep propagating further up.
I don't understand how this is different from updating at the end, based on whether you won the game or lost the game. Whether I won or lost the game: there, I'm taking the exact outcome that happened in that particular trial, that particular game, and using it to change my probabilities. But here I'm not just taking the outcome of that particular game; I'm looking at the expected value of winning from the next board position. If I wait all the way to the end, say I won, and I take that and update the state at time t, then I'm updating it only with the fact of whether I won or not. But if I'm updating it from the state at t+1, remember that I could have reached that state in multiple ways before; when I do the update from the state at t+1, it's essentially the average accumulated over all the previous trials that I'll be using. So if I play all the way to the end and update, it will be with a one or a zero, whereas the value at t+1 could be anywhere between one and zero, depending on the probability of winning, and I'll be using that value for the update. That's the crucial difference.
Okay. So there are many different solution methods. There are what are called temporal difference methods, which include different algorithms: TD(lambda), Q-learning, Sarsa, actor-critic, and so on. Then there is a whole set of algorithms called policy search algorithms, and then there is dynamic programming. We will look at each of these classes of algorithms over the next few lectures. These are mostly from the textbook, except the policy search algorithms, which I will be covering from other material. And there are a whole bunch of applications for RL; they're all over the place, as you can see: optimal control, optimization, operations research, psychology, neuroscience. That's one of the reasons I was asking whether there is anyone from biotech, because biotech people do use reinforcement learning a lot, and usually there are one or two of them in the RL class. I'd be surprised; maybe the one biotech student who had registered according to the academic website just gave up in CS36 and went back, I don't know.
So here is the most recent hot thing that came out of RL: more game playing, and for a change it's not from IBM, it's from Google. The company that actually built the first of these arcade-game-playing engines was called DeepMind, and as soon as DeepMind built a successful engine, Google bought them. So now it's Google DeepMind, but it's a separate entity; it's not part of Google proper. Google DeepMind operates out of London, and they're doing all kinds of interesting stuff; many of the hot, very recent advances in reinforcement learning over the last year or so seem to be coming out of DeepMind. So what did they do? How many of you know about these Atari games? Everyone knows about Atari games; people are getting tired. Really, no one has played Pac-Man? How about Pong, Breakout, Space Invaders? Come on. Anyway, what happened was that a team at the University of Alberta put out what they call the ALE, the Arcade Learning Environment, which essentially allowed computers to play these Atari games. What the DeepMind folks came up with is a reinforcement learning agent that learned to play these games from scratch, just by looking at the screen. That's all the input it was getting: just the raw pixels from the screen were given as inputs to it. It used a very complex neural network, a deep network, and it is considered one of the hardest learning problems solved by a computer. I believe it's the only computational reinforcement learning paper ever to appear in Nature; it's usually very hard for people outside the natural sciences, including computer science folks, to publish in Nature. This was touted as a next step in trying to understand how humans process inputs, blah blah blah, all kinds of marketing jargon. But more important than anything else about this: it's reproducible. I told you about the helicopter (I think that's a warning sign for me to stop); that's basically Stanford and Berkeley, really two people, who got the helicopter to fly. I told you about the backgammon player; Gerry Tesauro is the one person who gets the backgammon player to work, partly because the input features he uses in there are proprietary, but partly because it's a very hard problem to solve. The amazing thing about this Atari game engine is that these guys have released the code: if you have powerful enough GPUs, you can set it up here, get it to play, and get a reasonably well-working engine that plays the Atari games. That's the amazing thing about it, that it's reproducible, as opposed to many of the other success stories we've had in the past. I do believe I have just one more slide after this, so let's see if this will work. Okay.
So, in case you have any doubts, the green one is the learning agent. No, no, this is just sped up for you to see; it's not like the game is progressing at that rate, but you can see the scores. Mind you, it was not given a reward; it is just given the screen. It never got a reward for winning or anything; you have to understand that the pixels at the top are the reward, and if you give it rewards explicitly it becomes cheating. What's that? Yeah, which is what they did: they did add a "game over" signal, which the RL purists consider cheating, but yes, they did add a game-over sign. So the longer you keep the game going, the better it is, basically. I think this is getting boring; it has learned. This one is Seaquest: you have to swim, then sink down, get some things, and come up. Seaquest is a game that it never learned to play really well; there are a few games like this. When they initially published the paper, I think they had something like 45 or 46 out of the 50 Atari games that they were able to play well, and I think in 43 of them it had better-than-human performance. I think the current state is that there is one game it doesn't play well, and they have better-than-human performance in something like 48 of those games, which is amazing if you think about it.
These are a lot of references, but we'll be using the Sutton and Barto textbook, the second edition; there is a pre-release copy which Rich Sutton is continuously updating on his web page. There is also a very short introduction to reinforcement learning in Tom Mitchell's book, and there's a chapter in Russell and Norvig. If you're interested in the neuroscience perspective on reinforcement learning, or the history of how behavioural psychology models evolved into temporal difference learning, you can look at Dayan and Abbott; it has one chapter on reinforcement learning, and you don't have to understand the neuroscience to understand the RL chapter, so don't get put off by the title of the book. A more mathematically grounded introduction to reinforcement learning would be Bertsekas and Tsitsiklis. Apart from these, we'll use one more book, which is Markov Decision Processes by Puterman; we'll use certain chapters from Puterman, and that's just for us to understand MDPs, Markov decision processes, better.