Introduction to RL | Reinforcement Learning | YouTubeToText
YouTube Transcript: Introduction to RL
Summary
Core Theme
Reinforcement learning (RL) is a distinct paradigm of machine learning focused on learning through trial and error and delayed feedback, contrasting with supervised and unsupervised learning, and has demonstrated success in complex control and decision-making tasks.
So we can finally get underway. This is CS 6700, Reinforcement Learning; if anyone is here by mistake looking for the planning class, you should still leave. And how many of you were in the machine learning course, just for me to get a sense? Okay, a large fraction of you were in ML.

This is a very different kind of learning from what we looked at in ML. In machine learning we looked at the familiar modes of learning from data: you were given a lot of data as training instances, and essentially you were trying to learn from those training instances what to do. There were different kinds of problems we looked at. One was the supervised learning problem, where we looked at classification and regression. The goal there was to learn a mapping from an input space to an output, which could be a categorical output, in which case it's called classification, or a continuous output, in which case it's called regression. If you haven't been in the ML class, don't worry about it; this is just to tell you that RL is not whatever you learned in the ML class. And if you haven't learned anything in the ML class, then you don't have anything to unlearn, so don't worry.

The second kind of learning we looked at was unsupervised learning, where there was really no output expected of you, and therefore no supervision. The goal was to find patterns in the input data: I give you a lot of data points, and you find out whether there are groupings of similar kinds of data points, whether you can divide them into segments. That kind of thing is called clustering. Or you are asked to figure out whether there are frequently repeating patterns in the data; that is called frequent pattern mining, and as a derived problem there is association rule mining, and so on.
People have heard me give this analogy many times before, but it is the most apt one: how did you learn to cycle? Was it supervised learning? Somebody who hasn't heard me and hasn't been in ML: how did you learn to cycle? Did somebody tell you how to cycle and you just followed their instructions? First of all, do you know how to cycle? Yes? Okay, how did you learn? You fell down a couple of times and that automatically made you cycle? You actually have to figure out how not to fall down; falling down alone is not enough, you have to try different things. It's really not supervised learning, however much you think it is. Now that I have given this talk many times, people are getting wise to it; earlier, when I used to ask this, people would say, "Of course it's supervised learning, my uncle was there holding me," or "my father was telling me what to do," and so on. At best, what did they tell you? "Hey, look out, don't fall down." That doesn't count as supervision.
Or "keep your body up": some kind of very vague instruction is what they were giving you. Supervised learning would mean that you get on the cycle and somebody tells you, "Now push down with your left foot with three pounds of pressure, and move your center of gravity three degrees to the right." Somebody has to give you exactly the control signals you have to send to your body in order for you to cycle; then that would be supervised learning. If somebody actually gave you supervision at that scale, you would probably never have learned to cycle, if you think about it, because it is such a complex dynamical system; if somebody gives you input at that level, you never learn to cycle.

So then people immediately flip and say it was unsupervised learning, because "of course nobody told me how to cycle, therefore it's unsupervised learning." If it were truly unsupervised learning, what should have happened is this: you watch hundreds of videos of people cycling, figure out the pattern of cycling that they follow, then get on a cycle and reproduce it. That is essentially what unsupervised learning would be: you just have a lot of data, you figure out what the patterns are from the data, and then you try to execute those patterns. That doesn't work. You can watch hours and hours of somebody on a flight simulator; you can't then go and fly a plane. You have to get on the cycle yourself and try things yourself. That's the crux here. How you learn to cycle is neither of the above: it's neither supervised nor unsupervised, it's a different paradigm.
The reason I always start my talks this way, not just in this class but in general when I talk about reinforcement learning, is that people always describe reinforcement learning as unsupervised learning, which really irks me. It is not unsupervised learning: just because you don't have a classification error or a class label doesn't make it unsupervised learning. It is a completely different form of learning. Reinforcement learning is essentially the mathematical formalization of this trial-and-error kind of learning. How do you learn from this kind of minimal feedback? Falling down hurts; or your mom stands there and claps when you finally manage to get on the cycle. That clapping is a kind of positive reinforcement, and falling down and getting hurt is a kind of negative feedback. How do you use just these kinds of minimal feedback to learn to cycle? That is essentially the crux of what reinforcement learning is about: trial and error. The goal is to learn about a system through interacting with the system. It is not something done completely offline; you have some notion of interaction with the system.
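That interaction can be pictured as a loop: the agent observes the system's state, picks an action, and receives back a reward and the next state. A minimal sketch in Python; the environment here is a made-up toy (a walk on a line), not anything from the lecture:

```python
def toy_env_step(state, action):
    """Hypothetical toy environment (not from the lecture): walk on a line;
    reaching position 3 pays reward 1 and ends the episode."""
    next_state = state + (1 if action == "right" else -1)
    done = next_state == 3
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def run_episode(policy, max_steps=20):
    """The basic RL interaction loop: observe the state, act, receive a reward
    and the next state, and repeat."""
    state, total_reward = 0, 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = toy_env_step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward

print(run_episode(lambda state: "right"))  # the always-right policy reaches the goal: 1.0
```

Everything the agent ever learns has to come through this loop; there is no separate training set handed to it up front.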
And you learn about the system through that interaction. Reinforcement learning was originally inspired by behavioral psychology. One of the earliest reinforcement systems studied was Pavlov's dog. How many of you know of the Pavlov's dog experiment? What is the Pavlov's dog experiment? [Student] Pavlov gave food to a dog, and whenever he gave the food the dog started salivating; he rang a bell in association with giving the food, and eventually, when he just rang the bell, the dog started expecting the food and salivating. Right, so that is called a conditioned reflex. When the dog looks at the food and starts salivating, that is a primary response, because there is a reason for it to salivate at the sight of food. Any idea why? Exactly: it is preparing to digest the food. Show it the food and it starts salivating in preparation for digestion. Now think about what happens when it hears the bell and salivates: is it preparing to digest the bell? When you ring the bell and then serve the food, the dog forms an association between the bell and the food, and later, when you just ring the bell without even serving the food, the dog starts salivating in anticipation of the food it expects to be delivered. Essentially the food is the payoff, the food is like a reward, and the dog has learned to form an association between a signal, in this case the bell, an input signal, and the reward it is going to get. This is called behavioral conditioning.
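This kind of association learning can be modeled very simply as learning to predict the reward that follows a signal. The sketch below is not from the lecture; it is a standard error-driven update (in the spirit of the classic Rescorla-Wagner model of conditioning), where the predicted value V of the bell moves toward the reward that actually follows it:

```python
def condition(trials, alpha=0.3):
    """Error-driven conditioning: after each bell-then-outcome trial, nudge
    the predicted value V of the bell toward the observed reward."""
    V = 0.0  # initial bell-food association: none
    for reward in trials:
        V += alpha * (reward - V)  # prediction-error update
    return V

# 20 pairings of bell followed by food (reward 1) strengthen the association;
# the prediction approaches 1.
print(round(condition([1.0] * 20), 3))
```

The same prediction-error signal, suitably generalized, is at the heart of the temporal-difference methods we will meet later in the course.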
Inspired by these kinds of experiments, and by more complex behavioral experiments on animals, people started to come up with different theories to explain how learning proceeds. In fact, some of the earliest reinforcement learning papers appeared in behavioral psychology journals; an early paper by Sutton and Barto appeared in a brain and behavioral sciences journal. Let me just go back; I need to say something about Sutton and Barto, now that there is a larger audience to tell it to. We are going to follow the textbook written by Rich Sutton and Andy Barto, but more importantly, they are also in a sense the co-founders of the modern field of reinforcement learning. In 1983 they wrote a paper, "Neuronlike adaptive elements that can solve difficult learning control problems," or something to that effect, and that essentially kickstarted the whole modern field of reinforcement learning. The concept of reinforcement learning, as I said, goes back to Pavlov and earlier; people have been talking about this kind of behavioral conditioning and learning for a long time, but the modern computational techniques that people use in reinforcement learning were started by Sutton and Barto.
So what is reinforcement learning? It is learning about stimuli, the inputs that are coming to you, and the actions you can take in response to them, only from rewards and punishments. You are not going to get anything else: food is a reward; falling down and scraping your hand is a punishment. You learn from these kinds of rewards and punishments alone; there is no detailed supervision available, and nobody tells you what response you should give to a specific input. Suppose you are playing a game; there are multiple ways in which you could learn to play it. You could learn to play chess by looking at a board position, then looking at a table that tells you, for this board position, this is the move you have to make, and then going and making that move. That is a kind of supervision you could get: it gives you a mapping from the input to the output, and essentially you learn to generalize from that. This is what we mean by detailed supervision. Another way of learning to play chess is this: you have an opponent, you sit in front of him, and you make a sequence of moves. At the end, if you win, you get a reward; somebody pays you, say, 10 rupees. If you lose, you have to pay the opponent 10 rupees. That's all that happens; that's all the feedback you are going to get: whether you gain the 10 rupees or lose the 10 rupees at the end of the game. Nobody tells you, "given this position, this is the move you should have made." That is what we mean by learning from rewards and punishments in the absence of detailed supervision. Is that clear?

The crucial component of this is trial-and-error learning. Since I don't know what the right thing to do is for a given input, I need to try multiple things to see what the outcome will be; I need to try different things to see whether I am going to get the reward or not. If I don't try different things, I am not going to be able to learn anything at all. I can give you more formal mathematical reasons for why we need all of this as we go on, but intuitively you can understand it as requiring exploration, so that you know what the right outcome is.
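You can see the need for exploration in even the simplest possible setting. In this made-up two-action example (the payoff numbers are invented for illustration), the only way the agent can discover which action is better is to actually try each one enough times and estimate its value from the observed rewards:

```python
import random

random.seed(1)
true_payoff = {"a": 0.2, "b": 0.8}  # hypothetical average rewards, unknown to the agent

def pull(action):
    """Stochastic reward: 1 with the action's payoff probability, else 0."""
    return 1.0 if random.random() < true_payoff[action] else 0.0

# Estimate each action's value by trying it repeatedly (pure exploration).
estimates = {}
for action in true_payoff:
    rewards = [pull(action) for _ in range(1000)]
    estimates[action] = sum(rewards) / len(rewards)

best = max(estimates, key=estimates.get)
print(best)  # with enough trials the estimates reliably identify "b" as better
```

An agent that never tried "b" would have no way of knowing it pays four times as much; that is the intuition the formal arguments later in the course will make precise.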
There are a bunch of other things that are characteristic of reinforcement learning problems. One of them is that the outcomes, the rewards and punishments based on which you are learning, can be fairly delayed in time; they need not be temporally close to the thing that caused them. Say you are playing a cricket match: you might drop a batsman, and he goes on to score 150 or something like that, and you lose the match at the end of the day. But the event that cost you the match is the dropped catch, which probably happened around the 12th over. Or the cause and effect could be even more convoluted. How many of you follow cricket? My god, really? It's losing popularity. Put your hands up, or I'm not going to give cricket examples.

So we talked about delayed rewards: the reward could come much later in time than the action that caused it. Going back to our cycling case: I might have done something stupid, or gone over a stone somewhere while cycling at very high speed; there might have been a small stone in the road, and that causes me to lose my balance. I try my level best to get the balance back, I might not manage it, and finally I fall down and get hurt. That doesn't mean that what caused the falling down was the last action I tried. I might have desperately tried to jump off the cycle or something, but that is not what caused the punishment; what caused the punishment happened a few seconds earlier, when I ran over the stone. So there can be this kind of temporal disconnect between what causes the reward or punishment and the actual reward or punishment, and it becomes a little tricky: how are you going to learn those associations? Quite often you are going to need a whole sequence of actions to obtain a reward; it is not going to be a one-shot thing.
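One standard way of handling this temporal disconnect, which we will develop formally later in the course, is to credit each time step with the discounted sum of all the rewards that follow it, so an action is judged by what eventually happens and not just by the immediate payoff. A small preview sketch:

```python
def discounted_returns(rewards, gamma=0.9):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    Computed backwards over the episode in one pass."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# A single punishment (-1) at the end, e.g. finally falling off the cycle:
# the earlier steps (like running over the stone) still receive discounted blame.
print(discounted_returns([0.0, 0.0, 0.0, -1.0]))
```

Notice that the step three moves before the fall still gets a return of about -0.73: the punishment propagates back to the actions that actually caused it.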
Again, going back to the chess example: you are not going to get a reward every time you move a piece on the board. You have to finish playing the game, and at the end, if you actually manage to win, you get a reward. So it is a sequence of actions, and therefore you need to learn some kind of association between the inputs you are seeing, in the chess case board positions, or in the cycling case how fast the cycle is moving, how unbalanced you feel, and so on, and the actions you take in response to them. The inputs you are getting are what we sometimes call states, and you learn associations between states and the actions you take in response to those states. That is essentially what you are doing when you solve a reinforcement learning problem, and these kinds of associations are known as policies. What you are essentially learning is a policy to behave in a world: you are learning a policy to play chess, or a policy to cycle. You are not just learning about individual actions. And all of this typically happens in a noisy, stochastic world, which makes things more challenging. These are the different characteristics of reinforcement learning problems, and we will be looking at all of them as we go along. I will not be explicitly talking about each and every one of these bullet points, but all the algorithms and methods we look at as we go along in this course will have these aspects as part of them.
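A policy, in the sense just described, is simply a mapping from states to actions. In the simplest tabular form it can literally be a lookup table, or a rule derived from learned action-value estimates; the states, actions, and numbers below are made up for illustration:

```python
# A policy as an explicit state -> action table (tiny hypothetical world).
table_policy = {"start": "right", "middle": "right", "near_goal": "up"}

def act(state):
    return table_policy[state]

# Equivalently, a greedy policy derived from learned action-value estimates Q(s, a):
# in each state, pick the action currently believed to be most valuable.
Q = {
    "start": {"left": 0.1, "right": 0.7},
    "middle": {"left": 0.0, "right": 0.9},
}

def greedy_policy(state):
    """Pick the action with the highest estimated value in this state."""
    return max(Q[state], key=Q[state].get)

print(act("start"), greedy_policy("middle"))  # right right
```

Learning a good policy, rather than memorizing individual moves, is exactly the "learning to cycle" and "learning to play chess" framing above.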
Reinforcement learning has been used fairly successfully in a wide variety of applications. You can see a helicopter there; that is not a cut-and-paste error, the helicopter is actually flying upside down. Groups at Stanford and Berkeley have used reinforcement learning to train a helicopter to fly and to do all kinds of things, not just fly upside down; the RL agent can do all kinds of tricks with the helicopter, and I'll show you a video in a minute. It is an amazing piece of work; it was considered the showpiece application for reinforcement learning. They got such a complex control system to work, and it could actually do things at a much finer level of control than a human being could. Well, it is after all a machine, so you would expect that, but the tricky part was how it learned to control this complex system without any human intervention.
In the middle I have a couple of games. Can you see that? It's a bit too small and narrow. That is a game called backgammon. How many of you know about backgammon? One, two... How many of you know Ludo? Okay, fine: backgammon is like a two-player Ludo. You throw the dice, you move pieces around, and you take them off the board. It is a fairly easy game to learn, but there are all kinds of strategies you can play with it, and it is also a hard game for computers, both because of the stochasticity and because of the large branching factor: at each point there are many, many combinations in which you could move the pieces around, and the dice roll adds additional complexity. People weren't really getting great results, and then Gerry Tesauro from IBM came up with something called Neurogammon, which was trained using supervised learning on a neural network. If he had done it recently it would have been called a deep learning backgammon player; since he did it back in the early '90s, it was just called a neural network backgammon player. It played really well for a computer program; it was essentially the best computer backgammon player at that point.
Then Tesauro heard about TD learning, heard about reinforcement learning, and decided to train a reinforcement learning agent to play backgammon. What he did was set up a reinforcement learning agent that played against another copy of itself, and let them play hundreds and hundreds, rather thousands and thousands, of games. Essentially, you train one copy for, say, a hundred games, then freeze it as the opponent and continue learning against it. So as you learn, you are playing against better and better opponents; your opponent is gradually improving as well. This is called self-play. He trained a backgammon player using self-play, and it came to the point where TD-Gammon, as he called it, was better than the best human backgammon player in the world at that time. They actually had a head-to-head challenge with the human champion. There is a world championship of backgammon; it is apparently very popular in the Middle East, and people actually hold world championships. So he challenged the human champion, which IBM seems to do a lot: they challenged Kasparov to matches and things like that, and Tesauro worked for IBM. You should realize that people who spend a lot of resources getting computers to play games will probably be working for IBM. So Tesauro's agent beat the world champion, and you had a reinforcement learning agent that was the best backgammon player in the world, not just the best computer player anymore; we could actually make that claim.
There is another game there, a snapshot from the game of Go. Who here has played Go? Oh, come on, there's at least one or two people who have played Go. Have people played Othello? That's also a very small number; isn't it one of those free games on Ubuntu? I thought everybody plays that at some point or other; well, you would rather play Othello than watch paint dry. Anyway, Go is like a more complex version of Othello. It is again a very hard game for computers to play because the branching factor is huge, and it is actually a miracle that humans play it at all, because the search trees and other things involved are really complex. This is one case which clearly illustrates that humans solve problems in a fundamentally different way from what we write down in our algorithms, because we seem to make all kinds of intuitive leaps in order to be able to play Go. There is a person, David Silver, who currently works for Google DeepMind; before that he spent some time with Gerry Tesauro at IBM, and at some point along the way he came up with a reinforcement learning agent called TD search that plays Go at a decent level. Still not master-level human performance, but it plays at a pretty decent level. What I am pointing out here is that on things that are typically hard for traditional computer algorithms, or even traditional machine learning approaches, RL has had good success.
Here is another example. If I'm supposed to point, I forget which screen I should use; they told me to use only one of the screens for pointing because it's hard for them to record the other one. Forget it. There are some robots on the bottom left of the screen. That is a snapshot from the UT Austin robot soccer team, called Austin Villa, and they use reinforcement learning to get their robots to execute really complex strategies, which is really cool. The nice thing about the robot soccer application is that they don't use reinforcement learning alone; they use a mix of different learning strategies, and also planning and so on, which is going on in the other studio. They use a mix of different kinds of AI and machine learning techniques to get a very competent agent that is very hard to beat, and they have been the champions, I think, for two or three years running now in the humanoid league. And again, there are hard control problems, things like how do I take a spot kick, for which they use reinforcement learning. That is a really hard balancing problem: you basically have to balance the robot on one leg and swing the other leg so that you can take the kick. It turns out to be a hard control problem, and they used RL to solve it.
Up on the top right is the application which is probably the one that actually makes money out of all these three: essentially using reinforcement learning to solve online learning problems. Online learning is a use case where I do not have the feedback available to me a priori; the feedback keeps coming piecemeal. For example, suppose I have news stories that need to be shown to people who come to my web page. When people come to the page, some editors will have picked, say, 20 stories for me, and from those 20 stories I have to figure out which ones to put up prominently. And what feedback am I going to get? Again, nobody tells me which stories a user is going to like; I cannot run a supervised learning algorithm here. The feedback I get is this: if the user clicks on the story, I get a reward; if the user does not click on the story, I do not get a reward. That is essentially all the feedback I am going to get; nobody tells me anything beforehand. So I have to try things out; I have to show different stories to figure out which one the user will click on, and I have very few attempts in which to do this, so how do I do it effectively? People have used a supervised approach for solving this, and it has worked fairly successfully, but reinforcement learning seems to be a much more natural way of modeling these problems. And not only in this kind of news story selection: people use reinforcement learning ideas in ad selection as well. How are the ads selected that you see on the sides when you go to Google or some other page? There might be some very basic economic criterion for selecting a slate of ads, say these 10 ads would probably give me the right payoff, and then you have to figure out which three of those 10 to put up there, and things like that; you could use a reinforcement learning solution for selecting those. Of course, there is this whole field called computational advertising, which is a lot more complex than what I just explained, but RL is a component of computational advertising as well.
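The news-story setting above is naturally modeled as a multi-armed bandit: showing a story is an action, a click is reward 1, no click is reward 0. A minimal epsilon-greedy sketch, where the click-through rates are made up and this is only an illustration of the idea, not any production system:

```python
import random

random.seed(42)
true_ctr = [0.02, 0.05, 0.10]  # hypothetical click-through rates for 3 stories

counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]  # running estimate of each story's click rate
epsilon = 0.1

for _ in range(20000):
    # Explore with probability epsilon, otherwise show the best-looking story.
    if random.random() < epsilon:
        story = random.randrange(3)
    else:
        story = max(range(3), key=lambda i: values[i])
    click = 1.0 if random.random() < true_ctr[story] else 0.0
    counts[story] += 1
    values[story] += (click - values[story]) / counts[story]  # incremental mean

print([round(v, 3) for v in values])  # estimates approach the true click rates
```

The epsilon of exploration is exactly the "I have to show different stories to figure out which one he will click on" part; the greedy choice is exploiting what has been learned so far.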
Okay, here is the video, courtesy of Andrew Ng's web page. Do people recognize the guy there? It's not a human-sized helicopter, but it's still fairly large. Amazing, huh? All of this is being learned by an RL agent. This goes on for a while, so we'll stop here.