YouTube transcript:
Reinforcement Learning: Machine Learning Meets Control Theory

Summary

Core Theme

Reinforcement learning is a machine learning framework that enables agents to learn optimal control strategies for interacting with complex environments through trial and error, driven by rewards and feedback.

Welcome back. So I'm really excited to do this lecture on reinforcement learning. I've been

wanting to do this for a long time. Those of you who know me know that I love control theory

and machine learning and reinforcement learning is kind of at this sweet spot between these two super

important fields. Okay, so reinforcement learning is essentially a branch of machine learning that

deals with how to learn control strategies to interact with a complex environment. And one of

the ways I think about this, the way I'm going to define this, is that reinforcement learning

is a framework for learning how to interact with the environment from experience. This is

a very biologically inspired idea... this is what animals do. So through trial and error,

through experience, through positive and negative rewards and feedback, they learn how to interact

with their environment. OK good. So before I jump in I want to show some motivating videos. I really

like this one where reinforcement learning is used to learn how to walk in this artificial

environment. And there's a lot of papers like this where people use reinforcement learning as kind of

an optimization framework to learn how to control a complex system, in this case a bipedal walker,

often in a simulated environment. And this just looks really cool and it's a difficult control

problem. This is a really hard non-linear control problem. Now the goal would be to take what you

learn here and start to port that over into the real world to make better robots and better

actual physical agents that can interact with the world alongside us, to learn how to learn

like humans and animals do. So another video I love... this is my dog Mordecai and my wife

has trained him... this is a treat on his nose... to hold the treat on his nose until she says ok,

after which he can then grab the treat and eat it. This is not an easy trick to learn, and again

this goes to show you that anybody who's trained an animal, a dog or any other animal,

has done some type of reinforcement learning or reinforcement training. OK, and so that's actually

where the word reinforcement comes from: in animal systems and in human systems, you

reinforce good behavior with rewards like treats. And so that's kind of the whole name of the

game here: learning a good control strategy, or a good set of actions, through positive reinforcement.

Good. So that's what we're gonna talk about today. I'm gonna walk you through the framework. I want

to disentangle two things: there is a reinforcement learning framework, the framework for how you learn

to interact with the environment, and then there's a hard optimization problem for how you actually

optimize the agent's actions or policies given that framework. Those are the two pieces

that we're going to talk about today. Then in a future video I'm going to talk about

deep reinforcement learning, or reinforcement learning with modern techniques and deep neural

networks, and some of the incredible applications and performance that you can get out of those

systems. Good. Also, I'll point out you can follow updates on these videos at @eigensteve

on Twitter. Please like, please subscribe, hit the bell so you get notifications, and comment below.

Tell me what you want to see more of, tell me what you like or don't like. Oftentimes

people in the comments provide a lot of really important, useful information that I might have

left out of these videos, so I think it's also a big service to other people watching these.

All right, so we're gonna jump in and build this reinforcement learning framework from

the ground up, from scratch. At the heart of it, you start with an agent and an environment.

I actually like the name agent because it implies some agency: the agent gets

to take actions to interact with the environment. In the first example, and I'm gonna have a few

examples, we're going to talk about a mouse in a maze. The agent is a mouse, the environment is

a maze. The mouse gets to measure its current state in the environment, so it measures that

state s. Notice that it doesn't measure the full state: the mouse does not have a top-down view

of the whole maze, it just knows where it is right now and where it was in the past. And then the

mouse gets to take some action a; it gets to make some decision about what to do next. So it

could turn left, it could turn right, or it could go forward in this case. And not until the very

end of the maze does the mouse actually get a reward r, so these rewards are very sparse, few

and far between. In this case, if it goes to the very end of the maze it might get a piece of

cheese. Actually, my wife tells me that when they do training experiments with rats, they really like

Froot Loops, and it looks adorable because a Froot Loop is gigantic to a mouse or to a rat. But the

moral of the story here is that this agent gets to make some decisions. It has control over its

actions, so it has agency in the environment. It gets to measure where it is in the environment, and

occasionally, very occasionally, it gets rewards. And so part of the goal of this

system is to learn what actions actually caused it to get a reward or not. OK, and this is in

some sense, in the machine learning lingo, what's called semi-supervised learning. If the mouse

got a reward at every single stage of the maze, if at every correct turn it got a piece of cheese,

that would just be regular supervised learning, and those rewards would be called labels: they would

tell you yes, you did the right thing, or no, you did not do the right thing. But because the reward here

is time-delayed, coming at the very end of the game or very sporadically, and it's not linked

to every single individual action, we call that time-delayed label a reward, and this becomes

a semi-supervised learning problem. It's still supervised in the sense that there is supervisory

feedback telling the agent what worked and what didn't, but it's not nearly as much information

as in classical supervised learning. One of the major challenges of reinforcement

learning is that these labels are extremely rare, and it's very hard to tell what actions gave

rise to actually getting that reward. So this is a much harder optimization problem, and it oftentimes

requires much more data and much more trial and error, and I'm going to talk about that. Good. I

also like to think about the game of chess, or checkers, or tic-tac-toe, basically games in general,

where there are some rules of the game and you get to choose from a finite set of

actions to interact with that environment. Now, in the case of chess it's interesting because

the environment is not just the rules of the game: there's also an adversarial opponent trying

to beat you. So you're trying to beat the opponent, you're trying to checkmate

the other side, and they are trying to beat you. What's really interesting is that beyond the rules of

this game, there's an active player on the other side in this environment. Good.
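
To make the agent and environment loop above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the corridor-shaped "maze", the names `step`, `random_policy`, and `run_episode`, and the reward of 1.0 in the last cell are all hypothetical choices, not something shown in the lecture. The agent sees only its current state s, picks an action a, and gets a reward r only at the very end.

```python
import random

# Hypothetical toy "maze": a corridor of cells 0..goal, with the cheese in the last cell.
ACTIONS = ("left", "right")

def step(state, action, goal):
    """Environment: return (next_state, reward) for one move in the corridor."""
    next_state = max(0, state - 1) if action == "left" else min(goal, state + 1)
    reward = 1.0 if next_state == goal else 0.0  # sparse: reward only at the very end
    return next_state, reward

def random_policy(state, rng):
    """Agent: choose an action from the current state only (no top-down view)."""
    return rng.choice(ACTIONS)

def run_episode(policy, goal=5, max_steps=200, seed=0):
    """Run the agent-environment loop and record (state, action, reward) triples."""
    rng = random.Random(seed)
    state, history = 0, []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward = step(state, action, goal)
        history.append((state, action, reward))
        state = next_state
        if state == goal:
            break
    return history

history = run_episode(random_policy)
total_reward = sum(r for _, _, r in history)
print(len(history), total_reward)
```

Every recorded reward except possibly the last one is zero, which is exactly the sparse-reward situation described for the mouse.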

You also might be a Terminator trying to rule the world or trying to learn how to walk. I actually think

it's kind of funny that in The Matrix, Neo is actually the agent, from a reinforcement

learning standpoint, trying to learn the rules of the Matrix, which is the environment. OK, so

let's go back to the chess example, because I think the chess example really exemplifies a lot of the

issues with reinforcement learning, so we're going to use this as our exemplar problem. At the

end of the day, the big challenge in reinforcement learning is to design a policy of what actions to

take given a state s to maximize my chance of getting a future reward. That's all that this

agent can do: decide on a policy. Now, this is called a policy and not a control law for a lot

of reasons, partly because the environment is not deterministic, it's probabilistic, and so this

policy is also gonna be probabilistic. So my policy π(s, a), given a state and an action, basically

tells me the probability of taking action a given that I'm currently in state s. And again,

this is probabilistic because I might decide on playing a mixed strategy. In a normal

control system, like swinging up a pendulum on a cart, the rules never change: the system is always

given by F = ma, and so my control law is also deterministic and never changes. But in the game of

chess, maybe my opponent's kind of random, or maybe I'm just learning how to play. So what I'm

gonna do with my policy is maybe 80% of the time I'm gonna move my pawn this way, but 20%

of the time I'm gonna try this other move, just in case my environment changes, or

just in case something different happens that time. So you're gonna use a probabilistic

policy to explore and optimize the rewards coming from your environment. Good. And you get to take

actions a. That's the whole point: once you have this policy and you know the

probability of taking an action given a state, you just run that policy and you see

how much reward you get. Good. And this all happens in time, so you take actions at time step one, time

step two, and so on and so forth. You measure the state at time one, time two, time three, all the way

up to s_t, and there are rewards that you could be getting at each of these actions and each of these

measurements. Now, most of the time these are gonna be null or empty: you're not gonna get any rewards

until maybe the very end of the game of chess, but in principle you could get rewards at some points

along the way. And again, the game of chess is a really good example of how hard this is, because

you might create a policy of what you think is the right thing to do in chess to beat your opponent,

but you only get one reward at the very end of the game. Maybe I played a great game of chess and I

made one mistake and I lost the game. Do I throw away that whole sequence of actions? How do you

figure out what actions were good and what actions were bad? That's a very, very hard optimization

problem, and that's at the absolute heart of reinforcement learning. OK, so part of helping

design a good policy is understanding the value of being in a certain state s given that

policy π. Once I choose a policy, I can start to learn the value of each state of

the system, of each board position in chess for example, based on the expected reward I

will get in the future if I start at that state and I enact that policy. I'm gonna say that again,

that's a mouthful: the value of a state s given a policy π is my expectation of how much reward

I'll get in the future if I start in that state and I enact that policy. And there's this γ to

the t, where γ is a discount rate, and what this is saying is that I am slightly discounting my future

rewards compared to my immediate rewards. So γ is a constant between zero and one that basically

tells you how much you favor getting a reward right now versus far in the future. And this

is intimately related to economic theory and psychology: generally, people are more

eager to get a reward now than to wait for a delayed reward much later. But the basic idea is that

you can start to understand this policy, and what policies are good or bad, based on what are good

board positions, what are good value functions. And this is kind of how a human would play.

The set of all states of a chessboard is combinatorially large; there are

too many to count, and you could never hold them all in your mind. But we start creating rules of thumb of

what are good board positions. For example, if I take my opponent's queen but I still have a queen,

I probably have a better expected chance of winning and getting a reward. And so you might

just count the points on the board, and that would give you some proxy for the value of a given state.

That's one very rudimentary value function that you could use. Over time, as you play and gain

mastery, you might refine your value function and get a better idea of what matters in the

game, and that's also then going to help you refine your policy to get to those good states.
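
The value function described above is usually written as V_π(s) = E[ Σ_t γ^t r_t | s_0 = s ]. As a rough sketch, assuming a toy corridor environment and a fixed policy that steps right with probability `p_right` (both hypothetical, chosen only for illustration), the expectation can be approximated by averaging discounted returns over many rollouts. The function names below are made up for this example.

```python
import random

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def sample_rewards(start, goal, p_right, rng, max_steps=100):
    """Roll out a fixed stochastic policy in a toy corridor; reward 1 on reaching `goal`."""
    state, rewards = start, []
    for _ in range(max_steps):
        state = min(goal, state + 1) if rng.random() < p_right else max(0, state - 1)
        rewards.append(1.0 if state == goal else 0.0)
        if state == goal:
            break
    return rewards

def estimate_value(start, goal=5, p_right=0.8, gamma=0.9, episodes=2000, seed=0):
    """Monte Carlo estimate of V_pi(start): average discounted return over rollouts."""
    rng = random.Random(seed)
    returns = [discounted_return(sample_rewards(start, goal, p_right, rng), gamma)
               for _ in range(episodes)]
    return sum(returns) / len(returns)

# A reward of 1 arriving two steps in the future is worth about 0.9**2 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
# States closer to the goal should come out with a higher estimated value.
print(estimate_value(start=4), estimate_value(start=1))
```

With γ closer to 1 the agent values distant rewards almost as much as immediate ones; with γ closer to 0 it becomes short-sighted.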

Good. So in this large framework, and again this is the reinforcement learning framework, the goal

is then to optimize your policy to maximize your future rewards. At the end of the day, it's an

optimization problem to solve for π. Usually we think of our environment as not being fully

deterministic, like we often do in classical mechanics and classical control systems. Instead we

think of our environment as having a random or stochastic component. These are

called Markov decision processes (MDPs), and what that means is that if we are in a state s now and

I take an action a now, there is some probability of me going to a new state s' at the next time step.

I could go to multiple different states, and it's kind of like you roll a die and you

go to that next state. I actually think about backgammon. I think that's a great example

of a game that has rules, it has a deterministic element, but at every turn you're rolling dice, and

that gives you this kind of random Markov decision process. So there's a probability of going from my

current state and action to the next state s', and that again makes it hard to optimize

these policies. That's why these policies have to be probabilistic in nature: because

your environment is probabilistic in nature. So, the credit assignment problem I've mentioned

before: it's this idea that because your rewards are often very sparse and infrequent, it's very

hard to tell what action sequence was actually responsible for getting that reward. This issue

was recognized as early as the 1960s by Minsky, and it's the central

challenge in reinforcement learning, and it has been for six decades. This is the problem that

people are still working on today: how to beat the credit assignment problem. A couple of

key words I think are important are dense versus sparse rewards. So again, the game of chess has

very sparse rewards: you only know if you win when you checkmate or when you are checkmated,

and you don't necessarily get concrete rewards at intermediate intervals. If you had denser rewards,

if you were playing with a more knowledgeable player, like a master, and they were telling

you move after move, "Oh, I wouldn't move that, because then I'll do this," or "No, that's a really

good move, because that makes this structure that's a really strong position," they would

be giving you extra dense rewards and they would be helping you learn faster. But in general, if you

have sparse rewards, then reinforcement learning is very sample inefficient, to use machine learning

terminology, meaning if I only got sparse rewards I would have to play many, many, many, many,

many times. I'd have to have tons of examples to learn a good, optimal policy given those

sparse rewards. So sparse rewards and the credit assignment problem make it very hard to learn

through optimization what the right policy is, and that's related to sample efficiency. So in general,

what we do in a lot of systems is called reward shaping, where even if you get an infrequent reward,

an expert human might build a proxy reward so that you get more dense intermediate rewards

on the way to this final reward. And so an expert human would basically guide the learning process

by giving more dense rewards intermediately. That's called reward shaping. OK, good. So now we're going

to talk about optimization. Again, the ultimate goal is to optimize this policy, and there are lots and lots

of strategies for this optimization problem. And remember, I'm gonna go back and say that almost all

of machine learning and all of control theory are optimization problems. You

can pose these as hard, nonlinear, nonconvex optimization problems, and then in the case of

machine learning you solve them with data; in the case of control you solve them subject

to the constraints of the dynamics. This is no different. This is at the intersection of machine

learning and control theory, and reinforcement learning is again a big optimization problem

within this framework. And so to optimize this policy π(s, a), given measurements of your rewards,

given this sporadic feedback, there are lots of strategies. There's dynamic programming:

reinforcement learning and game theory kind of grew up together, and dynamic programming is

one of the optimization techniques. Monte Carlo is an old strategy for optimizing these policies: just

try a bunch of stuff kind of randomly. Temporal difference learning is like a balance between

dynamic programming and Monte Carlo, so it kind of finds the sweet spot of both of these,

and it's model-free, so it doesn't require you to have any model of the system. And it's related

to the Bellman optimality equation. Bellman was one of the pioneers of optimal control theory

and also laid a lot of the foundations that are used in reinforcement learning today. Now, in the

reinforcement learning problem, and this is true again for most of machine learning, and

control theory for that matter, there's this balance between exploration and exploitation.

So this policy π, usually we're gonna parameterize it: we're gonna have some parameters, and we're gonna

try to optimize those parameters to win the game, to get the reward. Now, how much effort do

I put into optimizing a strategy, or exploiting a strategy, and how much effort do I put into trying new

things that might not work but might also give me better rewards, things I've never

tried before? How much effort am I going to use to explore good policies versus to

exploit a policy I think is the best one? This is always a challenge. I'm not going to

talk too much about this, I talk about this a lot in other videos, but this

exploration-exploitation balance is a fundamental challenge in machine learning and control theory, and it's

a big problem in reinforcement learning also. In policy iteration, you basically set up a

dynamical system where, based on your rewards, you iteratively update the policy to make it better

and better over time, based on better information from new rewards.
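
The idea of iteratively updating a parameterized policy based on rewards can be sketched very simply. The example below is an assumption-heavy illustration, not the method from the lecture: it uses a toy corridor environment, a policy with a single hypothetical parameter `p_right` (the probability of stepping right), and plain random hill climbing as the optimizer. It is not the textbook policy iteration algorithm that alternates policy evaluation and policy improvement.

```python
import random

def average_return(p_right, goal=5, gamma=0.9, episodes=200, max_steps=50, seed=0):
    """Score a policy: mean discounted reward for reaching `goal` in a toy corridor.

    The policy has one parameter, the probability of stepping right.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state = 0
        for t in range(max_steps):
            state = min(goal, state + 1) if rng.random() < p_right else max(0, state - 1)
            if state == goal:
                total += gamma ** t  # single sparse reward, discounted by arrival time
                break
    return total / episodes

def improve_policy(p_right=0.5, iterations=30, step_size=0.1, seed=1):
    """Hill climbing: perturb the parameter and keep the change only if the score improves."""
    rng = random.Random(seed)
    best_score = average_return(p_right)
    for _ in range(iterations):
        candidate = min(1.0, max(0.0, p_right + rng.uniform(-step_size, step_size)))
        score = average_return(candidate)
        if score > best_score:
            p_right, best_score = candidate, score
    return p_right, best_score

p, score = improve_policy()
print(p, score)
```

The random perturbation is the exploration part and keeping the best parameter found so far is the exploitation part; gradient-based or evolutionary optimizers replace the perturb-and-compare step with something smarter.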

That's policy iteration, and there are lots of strategies to do this, so I'm just going to name

a bunch of them. You can use simulated annealing, evolutionary optimization, gradient descent, and you

can use all of the modern tools in neural networks and machine learning: stochastic gradient descent,

Adam optimization. So a lot of really interesting new work has been happening just in the last 10 to 15 years

using deep learning to optimize these policies. OK, good. So I'll just give you some cool examples.

This is one of my favorite examples: learning how to catch a ball in a cup. This is a fun kids'

game. An expert human first gives the robot one example to show that it's possible,

and then after imitation learning the robot gets to learn through trial and error. Notice that there's a

white screen, the cup is blue, and the ball is red, so it's using visual information from a camera.

After a few iterations it's actually getting pretty close. I think this isn't so different from

how a child would learn. A kid's not going to learn this in two trials or three trials; it might

take a few dozen times before they actually get close and then learn how to catch the ball in

the cup. After 45 trials it's getting very close, it bounced right off, and I don't know if it will

get it at 60, I think it barely misses, but it's getting very, very close. After 62 it's catching it in

the cup, and finally after a hundred trials this system has actually learned the rules of the

game, the physics of how to get that ball in the cup. It's a very simple robotic example, but it's also

pretty interesting. This video is not that recent, but it's very interesting to show that

it is possible to learn on a real physical system. This is another example I love. This is called

the PILCO learner. I encourage you to go read all about PILCO. In this case they're learning

how to swing up and stabilize a pendulum on a cart, and again they are using some combination

of trial and error and a physical model to learn how to do this very efficiently, with

very few samples. You could learn how to do this without learning a model or without having a

model of the physics, Newton's laws, F = ma. Actually, I tried this: I downloaded

some code in MATLAB and played around with it, just to learn if you could swing up a pendulum. It

took like eight hours on my laptop and thousands and thousands and thousands of trials, very sample

inefficient, to learn the random control signal that gets you near the upright position

where you can start to stabilize it. So it's a lot of trial and error if you don't have a model. The

PILCO learner, in some sense, is leveraging the fact that there is physics, we do know physics,

we do have models, to learn this much, much faster and much more efficiently, with many fewer samples.

I forget which trial we're on, but after trial five or six or seven it actually does learn how to

get this thing up and stabilized. I think trial six is gonna get really, really close. Let's see. All right,

so notice that the human does have to get this thing back down to zero; the human's

guiding this process all along. All right, so in trial six it's gonna get really close and almost do it,

and I think in trial seven it's actually gonna figure it out. You're gonna have to go watch the

video to see what happens. OK, so again, this is our framework for learning: we're trying to

optimize this policy. One last thing I'll tell you about is Q-learning. Instead of just learning

the policy and the value function separately, in Q-learning you can kind of learn them both at the

same time. There's this Q function: it's not just a function of the state s, it's a function of

the state and the action, and it tells you the quality of being in that state and taking that

action. So it kind of combines the value and the policy. You can almost think of it as a value

function of the state and the action, assuming I do the smartest thing that I

can, the best thing, in the future. I'll walk you through what this could look like. The

way that you update this quality function is you take your old quality function, and then when you

get a reward you basically update it. Here α is a learning rate and γ again is the discount rate,

and this max of Q in the future basically says I'm assuming that I'm always doing

the best thing I possibly can in the future, and if I do the best thing in the future, what

is the quality of my current state and action? I'm gonna say this again, because this

seems a little circular, but it's a really nice way of combining the policy and the

value into one function that you can learn, and again you can learn this with a deep neural

network nowadays. It says: given a state s and an action a, and assuming I do the best thing I can

in the future, what is the quality of being in that state and taking that action? And this is really

nice because if you actually know the quality function, then once I find myself in a state s

I just have to look across all of the actions a and pick the one with the best

quality. So it's a really nice way of choosing an action given this quality function: when I find

myself at state s, I just pick the action that gives me the best quality and I enact that action,

and if I do that in the future I will maximize my value, and that gives me a policy. So that's really

cool. OK, I guess another thing I think is super interesting is hindsight replay. So again, when

we talk about the credit assignment problem and the sparse rewards problem, in that inverted

pendulum example that I ran in MATLAB it took a really, really long time before this thing actually

got near the upright position where it could start getting rewards. What you do in hindsight

replay is, instead of throwing out all of the data that doesn't actually get you a reward, you

say: maybe my set of actions would be good for a different reward.

So in that ball-in-a-cup example, if I didn't get the ball in the cup, maybe

the ball went over here. What I do is I look back and replay that event and I say, well, maybe someday

I'll actually want to do this other thing that I just did, so I'd better remember that. And I would

encode that that was a good action sequence for a different reward structure: not the reward I want

of getting the ball in the cup, but this other reward of getting the ball over here. Then by learning

how to get into these different states you get a lot more reinforcement, a lot more kind

of artificial rewards, and you learn more about the physics and the dynamics of the system, about this

kind of enhanced value function. And so hindsight replay has been an absolutely critical advance in

making these methods more data efficient and in learning harder tasks that involve a more complex state

space. And it's much more what a human would do, right? Maybe I'm playing tennis and I mess

up some aspect of hitting the ball and it goes in a different direction than

I thought, but maybe in the future that's exactly what I'm gonna want to do. So I'm gonna catalog

that, and when I need to do that in the future I'm gonna use that information. So

it's much more reward efficient, sample efficient. OK, good. So in this

video I've talked about reinforcement learning, which is a framework for learning from experience

how to interact with the environment. In the next lecture I'm going to talk about how to do

this with neural networks and some of the really exciting advances in the field. All right, thank you.
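
Going back to the Q-learning part of this lecture: the update described there, with learning rate α and discount rate γ, is commonly written as Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]. Here is a minimal tabular sketch in Python. The corridor environment, the epsilon-greedy exploration, and all parameter values are assumptions made for illustration, not details from the lecture.

```python
import random

def q_learning(goal=5, alpha=0.5, gamma=0.9, epsilon=0.2, episodes=500, seed=0):
    """Tabular Q-learning on a toy corridor; actions: 0 = left, 1 = right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(goal + 1)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Explore with probability epsilon, otherwise exploit the current estimate.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Assume the best available action is taken from the next state onward.
            best_future = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_future - q[state][action])
            state = next_state
            if done:
                break
    return q

q = q_learning()
# Reading off a policy: in each non-terminal state, pick the action with the best quality.
greedy_policy = ["left" if left > right else "right" for left, right in q[:-1]]
print(greedy_policy)
```

Once the table is learned, choosing an action is just a lookup across the actions for the current state, which is the "pick the one with the best quality" step from the lecture.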

https://www.youtube.com/watch?v=UF8uR6Z6KLc
