Reinforcement learning is a machine learning framework that enables agents to learn optimal control strategies for interacting with complex environments through trial and error, driven by rewards and feedback.
Welcome back. So I'm really excited to do this lecture on reinforcement learning. I've been
wanting to do this for a long time. Those of you who know me know that I love control theory
and machine learning, and reinforcement learning is kind of at this sweet spot between these two super
important fields. Okay, so reinforcement learning is essentially a branch of machine learning that
deals with how to learn control strategies to interact with a complex environment. And one of
the ways I think about this, the way I'm going to define this, is that reinforcement learning
is a framework for learning how to interact with the environment from experience. This is
a very biologically inspired idea... this is what animals do. So through trial and error,
through experience, through positive and negative rewards and feedback, they learn how to interact
with their environment. OK good. So before I jump in I want to show some motivating videos. I really
like this one where reinforcement learning is used to learn how to walk in this artificial
environment. And there's a lot of papers like this where people use reinforcement learning as kind of
an optimization framework to learn how to control a complex system, in this case a bipedal walker,
often in a simulated environment. And this just looks really cool and it's a difficult control
problem. This is a really hard non-linear control problem. Now the goal would be to take what you
learn here and start to port that over into the real world to make better robots and better
actual physical agents that can interact with the world alongside us, to learn how to learn
like humans and animals do. So another video I love... this is my dog Mordecai and my wife
has trained him... this is a treat on his nose... to hold the treat on his nose until she says ok,
after which he can then grab the treat and eat it. This is not an easy trick to learn, and this again
goes to show you that anybody who's trained an animal, a dog or any other animal, has done some type
of reinforcement learning or reinforcement training. OK, and so that's actually where the word
reinforcement comes from: in animal systems and in human systems, you reinforce good behavior with
rewards like treats. Okay, and so that's kind of the whole name of the game here: learning a good
control strategy or a good set of actions through positive reinforcement. Good, so that's what we're
gonna talk about today. I'm gonna walk you through the framework. I want to disentangle two things:
there is a reinforcement learning framework, kind of the framework for how you learn to interact with
the environment, and then there's a hard optimization problem for how you actually optimize the
agent's actions or policies given that framework. Those are kind of the two pieces that we're going
to talk about today, and then in a future video I'm going to talk about deep reinforcement learning,
or reinforcement learning with modern techniques and deep neural networks, and some of the incredible
applications and performance that you can get out of those systems. Good, so also I'll point out you
can follow updates on these videos at @eigensteve on Twitter. Please like, please subscribe, hit the
bell so you get notifications, and comment below. Tell me what you want to see more of, tell me what
you like or don't like. Oftentimes people in the comments provide a lot of really important, useful
information that I might have left out of these videos, so I think it's also a big service to other
people watching these.
All right, so we're gonna jump in and we're gonna build this reinforcement learning framework from
the ground up, from scratch. At the heart of it, you start with an agent and an environment. I
actually like the name agent because it implies some agency: the agent gets to take actions to
interact with the environment. So in the first example, and I'm gonna have a few examples, we're
going to talk about a mouse in a maze. The agent is a mouse, the environment is a maze. The mouse
gets to measure its current state in the environment, so it measures that state s. Notice that it
doesn't measure the full state: the mouse does not have a top-down view of the whole maze, it just
knows where it is right now and where it was in the past. And then the mouse gets to take some
action a. It gets to make some decision about what to do next. Okay, so it could turn left, it could
turn right, or it could go forward in this case, and not until the very end of the maze does the
mouse actually get a reward r. So these rewards are very sparse, few and far between. In this case,
if it goes to the very end of the maze it might get a piece of cheese. Actually, my wife tells me
that when they do training experiments for rats, they really like fruit loops, and it looks adorable
because a fruit loop is gigantic to a mouse or to a rat. But the moral of the story here is that this
agent gets to make some decisions. It has control over its actions, so it has agency. It gets to
measure where it is in the environment, and occasionally, very occasionally, it gets rewards. And so
part of the goal of this system is to learn what actions actually caused it to get a reward or not.
Okay, and in the machine learning lingo this is in some sense called semi-supervised learning. If the
mouse got a reward at every single stage of the maze, if at every correct turn it got a piece of
cheese, that would just be regular supervised learning, and those rewards would be called labels:
they would tell you yes, you did the right thing, or no, you did not do the right thing. But because
the reward here is time-delayed, it comes at the very end of the game or very sporadically, and it's
not linked to every single individual action, we call that time-delayed label a reward, and this
becomes a semi-supervised learning problem. So it's still supervised in the sense that there is
supervisory feedback telling the agent what worked and what didn't, but it's not nearly as much
information as in classical supervised learning. And that's one of the major challenges of
reinforcement learning: these labels are extremely rare, and it's very hard to tell what actions gave
rise to actually getting that reward. So this is a much harder optimization problem, and it
oftentimes requires much more data and much more trial and error, and I'm going to talk about that.
Good. I also like to think about the game of chess or checkers or tic-tac-toe, basically games in
general, where there are some rules of the game and you get to make a finite set of actions to
interact with that environment. Now in the case of chess it's interesting, because the environment is
not just the rules of the game: there's also an adversarial opponent trying to beat you. So you're
trying to beat the opponent, you're trying to checkmate the other side, and they are trying to beat
you. And so that's really interesting: it's not just the rules of this game, there's an active player
on the other side in this environment. Good.
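The agent, environment, state, action, and reward loop from the mouse-in-a-maze example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CorridorMaze` class, its one-dimensional layout, and the `run_episode` and `random_policy` helpers are hypothetical names invented here, not anything shown in the lecture. The point is just that the reward stays at zero until the very last step.

```python
import random


class CorridorMaze:
    """Hypothetical 1-D maze: cells 0..length-1, with cheese in the last cell."""

    def __init__(self, length=6):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (step back) or +1 (step forward); the walls clip the move
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # sparse: a reward only at the very end
        return self.state, reward, done


def run_episode(env, policy, rng, max_steps=1000):
    """Roll out one episode and record (state, action, reward) at each step."""
    state = env.reset()
    history = []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        history.append((state, action, reward))
        state = next_state
        if done:
            break
    return history


def random_policy(state, rng):
    # The mouse knows nothing yet, so it picks a direction at random
    return rng.choice([-1, 1])


rng = random.Random(0)
history = run_episode(CorridorMaze(), random_policy, rng)
rewards = [r for _, _, r in history]
print(f"steps taken: {len(history)}, total reward: {sum(rewards)}")
```

Every entry in `rewards` except possibly the last is zero, which is the credit assignment difficulty in miniature: nothing in the record says which of the earlier moves helped.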
You also might be a terminator trying to rule the world or trying to learn how to walk. I actually
think it sounds kind of funny that in The Matrix, Neo is actually the agent, from a reinforcement
learning standpoint, trying to learn the rules of the Matrix, which is the environment. Okay, so
let's go back to the chess example, because I think the chess example really exemplifies a lot of the
issues with reinforcement learning, so we're going to use this as kind of our exemplar problem. At
the end of the day, the big challenge in reinforcement learning is to design a policy of what actions
to take given a state s to maximize my chance of getting a future reward. That's all that this agent
can do: decide on a policy. Now this is called a policy and not a control law for a lot of reasons,
partly because the environment is not deterministic, it's probabilistic, and so this policy is also
gonna be probabilistic. Okay, so my policy π, given a state and an action, basically tells me what is
my probability of taking action a given that I'm currently in state s. And again, this is
probabilistic because I might decide on playing a mixed strategy. In a normal control system, like
swinging up a pendulum on a cart, the rules never change: the system is always given by F = ma, and
so my control law also is deterministic and never changes. But in the game of chess maybe my
opponent's kind of random, or maybe I'm just learning how to play. So what I'm gonna do with my
policy is maybe 80% of the time I'm gonna move my pawn this way, but 20% of the time I'm gonna try
this other move, just in case my environment changes or just in case something different happens that
time. So you're gonna use a probabilistic policy to explore and optimize the rewards coming from your
environment. Good. And you get to take actions a. That's the whole point: once you have this policy
and you know what is the probability of taking an action given a state, then you just run that policy
and you see how many rewards you get. Good. And this all happens in time, so you take actions at time
step one, time step two, and so on and so forth. You measure the state at time one, time two, time
three, all the way up to s_t, and there are rewards that you could be getting at each of these
actions and each of these measurements. Now most of the time these are gonna be null or empty. You're
not gonna get any rewards until maybe the very end of the game of chess, but in principle you could
get rewards at some points along the way. And again, the game of chess is a really good example of
how hard this is, because you might create a policy of what you think is the right thing to do in
chess to beat your opponent, but you only get one reward at the very end of the game. Maybe I played
a great game of chess and I made one mistake and I lose the game. Do I throw away that whole sequence
of actions? How do you figure out what actions were good and what actions were bad? That's a very,
very hard optimization problem, and that's at the absolute heart of reinforcement learning. Okay, so
part of helping
design a good policy is understanding what is the value of being in a certain state s given that
policy π. So once I choose a policy, I can start to learn what is the value of each state of the
system, of each board position in chess for example, based on what is the expected reward I will get
in the future if I start at that state and I enact that policy. I'm gonna say that again, that's a
mouthful. So the value of a state s given a policy π is my expectation of how much reward I'll get in
the future if I start in that state and I enact that policy. And there's this gamma to the t, which
is a discount rate, and what this is saying is that I am slightly discounting my future rewards
compared to my immediate rewards. So gamma is a constant between zero and one that basically tells
you how much you favor getting a reward right now versus far in the future. And this is intimately
related to economic theory and psychology: generally people are more eager to get a reward now than
wait for a delayed reward much later. Okay, but the basic idea is that you can start to understand
what policies are good or bad based on what are good board positions, what are good value functions.
And this is kind of how a human would play. The set of all states of a chessboard is combinatorially
large, there's too many to count, you could never hold them all in your mind, but we start creating
rules of thumb of what are good board positions. So for example, if I take my opponent's queen but I
still have a queen, I probably have a better expected chance of winning and getting a reward. And so
you might just count the points on the board, and that would give you some proxy for the value of a
given state. That's one very rudimentary value function that you could use, and over time, as you
play and gain mastery, you might refine your value function and get a better idea of kind of what
matters in the game. Okay, and that's also then going to help you refine your policy to get to those
good states.
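In symbols, the verbal definition above reads roughly as V_π(s) = E[ Σ_t γ^t r_t | s_0 = s ]. As a rough sketch of what that expectation means in practice, the code below averages discounted returns over many simulated episodes. The corridor environment, the single-number policy (`policy_forward_prob`, the probability of stepping forward), and all function names are assumptions made for illustration, not part of the lecture.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode's reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


def sample_rewards(policy_forward_prob, rng, length=4, max_steps=200):
    """Rewards from one episode on a hypothetical corridor with cheese at the end."""
    state, rewards = 0, []
    for _ in range(max_steps):
        step = 1 if rng.random() < policy_forward_prob else -1
        state = min(max(state + step, 0), length - 1)
        reached_cheese = state == length - 1
        rewards.append(1.0 if reached_cheese else 0.0)
        if reached_cheese:
            break
    return rewards


def estimate_value(policy_forward_prob, gamma, episodes=2000, seed=0):
    """Monte Carlo estimate of the start state's value under a fixed policy."""
    rng = random.Random(seed)
    total = sum(
        discounted_return(sample_rewards(policy_forward_prob, rng), gamma)
        for _ in range(episodes)
    )
    return total / episodes


# A single delayed reward is worth less the later it arrives: this is 0.9**2
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))

# A policy that usually walks forward reaches the cheese sooner, so the start
# state is worth more under it than under a coin-flip policy.
print(estimate_value(0.8, gamma=0.9), estimate_value(0.5, gamma=0.9))
```

The same state has a different value under each policy, which is why the value is always stated "given a policy π".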
Good, so in this large framework, again this is the reinforcement learning framework, the goal is
then to optimize your policy to maximize your future rewards. So at the end of the day it's an
optimization problem to solve for π. Usually we think of our environment as not being fully
deterministic, like we often do in classical mechanics and classical control systems, and instead we
think of our environment as having a random or a stochastic component. These are called Markov
decision processes, MDPs, and what that means is that if I am in a state s now and I take an action a
now, there is some probability of me going to a new state s' at the next time step. I could go to
multiple different states, and it's kind of like you roll a die and you go to that next state. Okay,
so I actually think about backgammon. I think that's a great example of a game that has rules, it has
a deterministic element, but at every turn you're rolling dice, and that gives you this kind of
random Markov decision process. So there's a probability of going from my current state and action to
the next state s', and that again makes it hard to optimize these policies, and that's why these
policies have to be probabilistic in nature: because your environment is probabilistic in nature. So
the credit assignment problem I've mentioned before, it's this idea that because your rewards are
often very sparse and infrequent, it's very hard to tell what action sequence was actually
responsible for getting that reward. This issue was recognized as early as the 1960s by Minsky, and
it's been the central challenge in reinforcement learning for six decades. This is the problem that
people are still working on today: how to beat the credit assignment problem. And so a couple of key
words I think are important are dense versus sparse rewards. So again, the game of chess has very
sparse rewards. You only know if you win when you checkmate or when you are checkmated, and you don't
necessarily get concrete rewards at intermediate intervals. If you had denser rewards, if you were
playing with a more knowledgeable master and they were telling you move after move, oh, I wouldn't
move that because then I'll do this, or no, that's a really good move because that makes this
structure that's a really strong position, they would be giving you extra dense rewards and they
would be helping you learn faster. But in general, if you have sparse rewards then reinforcement
learning is very sample inefficient, to use machine learning terminology, meaning if I only got
sparse rewards I would have to play many, many, many times. I'd have to have tons of examples to
learn a good optimal policy given those sparse rewards. So sparse rewards and the credit assignment
problem make it very hard to learn through optimization what the right policy is, and that's related
to sample efficiency. So in general, what we do in a lot of systems is called reward shaping, where
even if you get an infrequent reward, an expert human might build a proxy reward so that you get more
dense intermediate rewards on the way to this final reward. And so an expert human would basically
guide the learning process by giving more dense rewards intermediately. That's called reward shaping.
Okay, good, so now we're going
to talk about how, again, the ultimate goal is to optimize this policy. There's lots and lots of
strategies for this optimization problem, and remember, I'm gonna go back and say almost all of
machine learning and all of control theory are optimization problems. You can pose these as hard,
nonlinear, nonconvex optimization problems, and then in the case of machine learning you solve them
with data, and in the case of control you solve them subject to the constraints of the dynamics. This
is no different. This is at the intersection of machine learning and control theory, and
reinforcement learning is again a big optimization problem within this framework. And so to optimize
this policy π(s, a) given measurements of your rewards, given this sporadic feedback, there are lots
of strategies. So there's dynamic programming: reinforcement learning and game theory kind of grew up
together, and dynamic programming is one of the optimization techniques. Monte Carlo is an old
strategy for optimizing these policies: just try a bunch of stuff kind of randomly. Temporal
difference is like an optimal balance between dynamic programming and Monte Carlo, so it kind of
finds the sweet spot of both of these, and it's model-free, so it doesn't require you to have any
model of the system. And it's related to the Bellman optimization. Bellman was one of the pioneers of
optimal control theory and also laid a lot of the foundations that are used in reinforcement learning
today. Now in the reinforcement learning problem, and this is true again for most of machine
learning, and control theory for that matter, there's this balance between exploration and
exploitation. So this policy π, usually we're gonna parameterize this, we're gonna have some
parameters, and we're gonna try to optimize those parameters to win the game, to get the rewards. Now
how much effort do I put into optimizing a strategy or exploiting a strategy, and how much effort do
I put into trying new things that might not work but might also give me better rewards, things I've
never tried before? How much effort am I going to use to explore good policies versus to exploit a
policy I think is the best one? This is always a challenge. I'm not going to talk too much about
this, I talk about this a lot in other videos, but this is a fundamental challenge in machine
learning and control theory, this exploration-exploitation balance, and it's a big problem in
reinforcement learning. Also, policy iteration: basically you set up a dynamical system where, based
on your rewards, you iteratively update the policy to make it better and better over time, based on
new information, based on better information from new rewards.
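One concrete version of this iterative update is the textbook dynamic-programming form of policy iteration, sketched below on a tiny Markov decision process. This sketch assumes the transition probabilities are known, which the model-free methods mentioned above do not require. The three-state MDP, its "careful" and "jump" actions, and every number in the table `P` are hypothetical values chosen only for illustration.

```python
GAMMA = 0.9
GOAL = 2  # terminal state; entering it pays a reward of 1
ACTIONS = ("careful", "jump")

# Hypothetical MDP: P[state][action] -> list of (probability, next_state, reward)
P = {
    0: {"careful": [(0.9, 1, 0.0), (0.1, 0, 0.0)],
        "jump": [(0.3, GOAL, 1.0), (0.7, 0, 0.0)]},
    1: {"careful": [(0.9, GOAL, 1.0), (0.1, 1, 0.0)],
        "jump": [(0.5, GOAL, 1.0), (0.5, 0, 0.0)]},
}


def action_value(state, action, V):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[state][action])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: sweep the states until the values stop changing."""
    V = {0: 0.0, 1: 0.0, GOAL: 0.0}
    while True:
        delta = 0.0
        for s in P:
            new_v = action_value(s, policy[s], V)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = evaluate(policy)
        improved = {s: max(ACTIONS, key=lambda a: action_value(s, a, V)) for s in P}
        if improved == policy:
            return policy, V
        policy = improved


policy, V = policy_iteration({0: "jump", 1: "jump"})
print(policy, V)
```

Starting from the all-"jump" policy, each round evaluates the current policy and then switches any state where another action looks better, stopping once nothing changes.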
That's policy iteration, and there are lots of strategies to do this, so I'm just going to name a
bunch of them. You can use simulated annealing, evolutionary optimization, gradient descent, and you
can use all of the modern tools in neural networks and machine learning: stochastic gradient descent,
Adam optimization. So a lot of really interesting new work is happening just in the last 10-15 years
using deep learning to optimize these policies. Okay, good, so I'll just give you some cool examples.
This is one of my favorite examples, learning how to catch a ball in a cup. This is a fun kids' game.
An expert human first gives the robot like one example to show that it's possible, and then after
imitation learning the robot gets to learn through trial and error. Notice that there's a white
screen and the cup is blue and the ball is red, so it's using visual information from a camera. And
after a few iterations it's actually getting pretty close. I think this isn't so different from how a
child would learn. You know, a kid's not going to learn this in two trials or three trials, it might
take a few dozen times before it actually gets close and then learns how to catch the ball in the
cup. After 45 trials it's getting very close, it bounced right off, and I don't know if it will get
it at 60. I think it barely misses, but it's getting very, very close. After 62, it's catching it in
the cup, and finally after a hundred trials this system has actually learned the rules of the game,
the physics of how to get that ball in the cup. Very simple robotic example, but it's also pretty
interesting. And this video is not that recent, but it's very interesting to show that it is possible
to learn a real physical system. This is another example I love. This is called the PILCO learner. I
encourage you to go read all about PILCO. In this case they're learning how to swing up and stabilize
a pendulum on a cart, and again they are using some combination of trial and error and a physical
model to learn how to do this very efficiently with very few samples. So you could learn how to do
this without learning a model or without having a model of the physics, Newton's laws, F = ma.
Actually I tried this: I downloaded some code in MATLAB and played around with this just to learn if
you could swing up a pendulum. It took like eight hours on my laptop and thousands and thousands and
thousands of trials, very sample inefficient, to learn the random control signal that gets you near
the upright position where you can start to stabilize it. So it's a lot of trial and error if you
don't have a model. So the PILCO learner in some sense is leveraging the fact that there is physics,
we do know physics, we do have models, to learn this much, much faster and much more efficiently,
with many fewer samples. And I forget which trial we're on, but after trial five or six or seven it
actually does learn how to get this thing up and stabilize. I think trial six is gonna get really,
really close. Let's see. All right, so notice that the human does have to get this thing back down to
zero. The human's guiding this process all along. All right, so in trial six it's gonna get really
close and almost do it, and I think in trial seven it's actually gonna figure it out. You're gonna
have to go watch the video to see what happens. Okay, so again, this is our framework for learning.
We're trying to
optimize this policy. One last thing I'll tell you about is Q-learning. Instead of just learning the
policy and the value function separately, in Q-learning you can kind of learn them both at the same
time. So there's this Q function. It's not just a function of the state s, it's a function of the
state and the action, and it tells you what is the quality of being in that state and taking that
action. So it kind of combines the value and the policy. You can almost think of it as like a value
function of the state and the action, assuming I do the smartest thing, the best thing, that I can in
the future. And so I'll walk you through what this could look like. The way that you update this
quality function is you take your old quality function, and then when you get a reward you basically
update it. Alpha is a learning rate, gamma again is the discount rate, and this max of Q in the
future basically says I'm assuming that I'm always doing the best thing I possibly can in the future.
And if I do the best thing in the future, what is the quality of my current state and action? So I'm
gonna say this again. This seems a little circular, but it's a really nice way of combining the
policy and the value into one function that you can learn, and again you can learn this with a deep
neural network nowadays. It says, given a state s and an action a, and assuming I do the best thing I
can in the future, what is the quality of being in that state and taking that action? And this is
really nice, because if you actually know the quality function, then once I find myself in a state s,
I just have to look across all of the actions a and pick the one with the best quality. So it's a
really nice way of choosing an action given this quality function. When I find myself at state s, I
just pick the action that gives me the best quality and I enact that action, and if I do that in the
future I will maximize my value, and that gives me a policy. So that's really
cool. Okay, I guess another thing I think is super interesting is hindsight replay. So again, when we
talk about the credit assignment problem and the sparse rewards problem, in that inverted pendulum
example that I ran in MATLAB, it took a really, really long time before this thing actually got near
the upright position where it could start getting rewards. And so what you do in hindsight replay is,
instead of throwing out all of the data that doesn't actually get you a reward, you say maybe my set
of actions would be good for a different reward. So maybe in that ball-in-a-cup example, if I didn't
get the ball in the cup, maybe the ball went over here. What I do is I look back and replay that
event and I say, well, maybe someday I'll actually want to do this other thing that I just did, so I
better remember that. And I would encode that that was a good action sequence for a different reward
structure, not the reward I want of getting the ball in the cup, but this other reward of getting the
ball over here. And then by learning how to get into these different states, you get a lot more
reinforcement, a lot more kind of artificial rewards, and you learn more about the physics and the
dynamics of the system, about this kind of enhanced value. And so hindsight replay has been an
absolutely critical advance in making these more data efficient and learning harder tasks that
involve a more complex state space. And it's much more what a human would do, right? Like maybe I'm
playing tennis and I mess up some aspect of hitting the ball and it goes in a different direction
than I thought, but maybe in the future that's exactly what I'm gonna want to do. And so I'm gonna
catalog that, and when I need to do that in the future I'm gonna use that information. So it's much
more reward efficient, sample efficient. Okay, good. So in this video I've talked about reinforcement
learning, which is a framework for learning from experience how to interact with the environment. In
the next lecture I'm going to talk about how to do this with neural networks and some of the really
exciting advances in the field. All right, thank you.
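The Q-learning update described in this lecture (old quality, learning rate alpha, discount rate gamma, and the max over future actions) can be written as Q_new(s, a) = Q_old(s, a) + α (r + γ max_a' Q(s', a') − Q_old(s, a)). The sketch below applies that update in a lookup table on a small corridor maze. The environment, the function names, the hyperparameter values, and the epsilon-greedy exploration rule are all assumptions chosen for illustration, not the setup used in the lecture.

```python
import random


def q_learning(length=6, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical corridor maze with cheese at the end."""
    rng = random.Random(seed)
    actions = (-1, 1)  # step back or step forward
    goal = length - 1
    Q = {(s, a): 0.0 for s in range(length) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Explore with probability epsilon, otherwise exploit the current Q
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s_next = min(max(s + a, 0), goal)
            r = 1.0 if s_next == goal else 0.0
            # Assume the best possible action is taken from the next state onward
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q


def greedy_policy(Q, length=6):
    """In each non-terminal state, pick the action with the highest quality."""
    return [max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(length - 1)]


Q = q_learning()
print(greedy_policy(Q))
```

Reading the policy off the table is the step described above: in each state, look across the actions and take the one with the best quality.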