Introduction to RL | Reinforcement Learning | YouTubeToText
YouTube Transcript: Introduction to RL
Summary
Core Theme
Reinforcement learning (RL) is a distinct paradigm of machine learning focused on learning through trial and error and delayed feedback, contrasting with supervised and unsupervised learning, and has demonstrated success in complex control and decision-making tasks.
So we can finally get underway. This is CS 6700, Reinforcement Learning; if anyone is here by mistake looking for the planning class, you should still leave. And how many of you were in the machine learning course, just for me to get a sense? Okay, a large fraction of you were in ML.

This is a very different kind of learning from what we looked at in ML. In machine learning we looked at the familiar modes of learning from data: you were given a lot of data as training instances, and essentially you were trying to learn from those training instances what to do. There were different kinds of problems we looked at. One was the supervised learning problem, where we looked at classification and regression. The goal there was to learn a mapping from an input space to an output, which could be a categorical output, in which case it's called classification, or a continuous output, in which case it's called regression. If you haven't been in the ML class, don't worry about it; this is just to tell you that RL is not whatever you learned in the ML class. And if you haven't learned anything in the ML class, then you don't have anything to unlearn, so don't worry.

The second kind of learning we looked at was unsupervised learning, where there was really no output expected of you, and therefore no supervision. The goal was to find patterns in the input data: I give you a lot of data points, and you find out whether there are groupings of similar kinds of data points, whether you can divide them into segments. That kind of thing is called clustering. Or you are asked to figure out whether there are frequently repeating patterns in the data; that is called frequent pattern mining, and as a derived problem there is association rule mining, and so on.
People have heard me give this analogy many times before, but it is the most apt one: how did you learn to cycle? Was it supervised learning? Somebody who hasn't heard me and hasn't been in ML: how did you learn to cycle? Did somebody tell you how to cycle and you just followed their instructions? First of all, do you know how to cycle? Yes? Okay, how did you learn? You fell down a couple of times and that automatically made you cycle? You actually have to figure out how not to fall down; falling down alone is not enough, you have to try different things. It's really not supervised learning, however much you think it is. Now that I have given this talk many times, people are getting wise to it; earlier, when I used to ask this, people would say, "Of course it's supervised learning, my uncle was there holding me," or "my father was telling me what to do," and so on. At best, what did they tell you? "Hey, look out, don't fall down." That doesn't count as supervision.
Or "keep your body up": some kind of very vague instruction is what they were giving you. Supervised learning would mean that you get on the cycle and somebody tells you, "Now push down with your left foot with three pounds of pressure, and move your center of gravity three degrees to the right." Somebody has to give you exactly the control signals you have to send to your body in order for you to cycle; then that would be supervised learning. If somebody actually gave you supervision at that scale, you would probably never have learned to cycle, if you think about it, because it is such a complex dynamical system; if somebody gives you input at that level, you never learn to cycle.

So then people immediately flip and say it was unsupervised learning, because "of course nobody told me how to cycle, therefore it's unsupervised learning." If it were truly unsupervised learning, what should have happened is this: you watch hundreds of videos of people cycling, figure out the pattern of cycling that they follow, then get on a cycle and reproduce it. That is essentially what unsupervised learning would be: you just have a lot of data, you figure out what the patterns are from the data, and then you try to execute those patterns. That doesn't work. You can watch hours and hours of somebody on a flight simulator; you can't then go and fly a plane. You have to get on the cycle yourself and try things yourself. That's the crux here. How you learn to cycle is neither of the above: it's neither supervised nor unsupervised, it's a different paradigm.
The reason I always start my talks this way, not just in this class but in general when I talk about reinforcement learning, is that people always describe reinforcement learning as unsupervised learning, which really irks me. It is not unsupervised learning: just because you don't have a classification error or a class label doesn't make it unsupervised learning. It is a completely different form of learning. Reinforcement learning is essentially the mathematical formalization of this trial-and-error kind of learning. How do you learn from this kind of minimal feedback? Falling down hurts; or your mom stands there and claps when you finally manage to get on the cycle. That clapping is a kind of positive reinforcement, and falling down and getting hurt is a kind of negative feedback. How do you use just these kinds of minimal feedback to learn to cycle? That is essentially the crux of what reinforcement learning is about: trial and error. The goal is to learn about a system through interacting with the system. It is not something done completely offline; you have some notion of interaction with the system.
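That interaction can be pictured as a loop: the agent observes the system's state, picks an action, and receives back a reward and the next state. A minimal sketch in Python; the environment here is a made-up toy (a walk on a line), not anything from the lecture:

```python
def toy_env_step(state, action):
    """Hypothetical toy environment (not from the lecture): walk on a line;
    reaching position 3 pays reward 1 and ends the episode."""
    next_state = state + (1 if action == "right" else -1)
    done = next_state == 3
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def run_episode(policy, max_steps=20):
    """The basic RL interaction loop: observe the state, act, receive a reward
    and the next state, and repeat."""
    state, total_reward = 0, 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = toy_env_step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward

print(run_episode(lambda state: "right"))  # the always-right policy reaches the goal: 1.0
```

Everything the agent ever learns has to come through this loop; there is no separate training set handed to it up front.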
And you learn about the system through that interaction. Reinforcement learning was originally inspired by behavioral psychology. One of the earliest reinforcement systems studied was Pavlov's dog. How many of you know of the Pavlov's dog experiment? What is the Pavlov's dog experiment? [Student] Pavlov gave food to a dog, and whenever he gave the food the dog started salivating; he rang a bell in association with giving the food, and eventually, when he just rang the bell, the dog started expecting the food and salivating. Right, so that is called a conditioned reflex. When the dog looks at the food and starts salivating, that is a primary response, because there is a reason for it to salivate at the sight of food. Any idea why? Exactly: it is preparing to digest the food. Show it the food and it starts salivating in preparation for digestion. Now think about what happens when it hears the bell and salivates: is it preparing to digest the bell? When you ring the bell and then serve the food, the dog forms an association between the bell and the food, and later, when you just ring the bell without even serving the food, the dog starts salivating in anticipation of the food it expects to be delivered. Essentially the food is the payoff, the food is like a reward, and the dog has learned to form an association between a signal, in this case the bell, an input signal, and the reward it is going to get. This is called behavioral conditioning.
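This kind of association learning can be modeled very simply as learning to predict the reward that follows a signal. The sketch below is not from the lecture; it is a standard error-driven update (in the spirit of the classic Rescorla-Wagner model of conditioning), where the predicted value V of the bell moves toward the reward that actually follows it:

```python
def condition(trials, alpha=0.3):
    """Error-driven conditioning: after each bell-then-outcome trial, nudge
    the predicted value V of the bell toward the observed reward."""
    V = 0.0  # initial bell-food association: none
    for reward in trials:
        V += alpha * (reward - V)  # prediction-error update
    return V

# 20 pairings of bell followed by food (reward 1) strengthen the association;
# the prediction approaches 1.
print(round(condition([1.0] * 20), 3))
```

The same prediction-error signal, suitably generalized, is at the heart of the temporal-difference methods we will meet later in the course.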
Inspired by these kinds of experiments, and by more complex behavioral experiments on animals, people started to come up with different theories to explain how learning proceeds. In fact, some of the earliest reinforcement learning papers appeared in behavioral psychology journals; an early paper by Sutton and Barto appeared in a brain and behavioral sciences journal. Let me just go back; I need to say something about Sutton and Barto, now that there is a larger audience to tell it to. We are going to follow the textbook written by Rich Sutton and Andy Barto, but more importantly, they are also in a sense the co-founders of the modern field of reinforcement learning. In 1983 they wrote a paper, "Neuronlike adaptive elements that can solve difficult learning control problems," or something to that effect, and that essentially kickstarted the whole modern field of reinforcement learning. The concept of reinforcement learning, as I said, goes back to Pavlov and earlier; people have been talking about this kind of behavioral conditioning and learning for a long time, but the modern computational techniques that people use in reinforcement learning were started by Sutton and Barto.
So what is reinforcement learning? It is learning about stimuli, the inputs that are coming to you, and the actions you can take in response to them, only from rewards and punishments. You are not going to get anything else: food is a reward; falling down and scraping your hand is a punishment. You learn from these kinds of rewards and punishments alone; there is no detailed supervision available, and nobody tells you what response you should give to a specific input. Suppose you are playing a game; there are multiple ways in which you could learn to play it. You could learn to play chess by looking at a board position, then looking at a table that tells you, for this board position, this is the move you have to make, and then going and making that move. That is a kind of supervision you could get: it gives you a mapping from the input to the output, and essentially you learn to generalize from that. This is what we mean by detailed supervision. Another way of learning to play chess is this: you have an opponent, you sit in front of him, and you make a sequence of moves. At the end, if you win, you get a reward; somebody pays you, say, 10 rupees. If you lose, you have to pay the opponent 10 rupees. That's all that happens; that's all the feedback you are going to get: whether you gain the 10 rupees or lose the 10 rupees at the end of the game. Nobody tells you, "given this position, this is the move you should have made." That is what we mean by learning from rewards and punishments in the absence of detailed supervision. Is that clear?

The crucial component of this is trial-and-error learning. Since I don't know what the right thing to do is for a given input, I need to try multiple things to see what the outcome will be; I need to try different things to see whether I am going to get the reward or not. If I don't try different things, I am not going to be able to learn anything at all. I can give you more formal mathematical reasons for why we need all of this as we go on, but intuitively you can understand it as requiring exploration, so that you know what the right outcome is.
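You can see the need for exploration in even the simplest possible setting. In this made-up two-action example (the payoff numbers are invented for illustration), the only way the agent can discover which action is better is to actually try each one enough times and estimate its value from the observed rewards:

```python
import random

random.seed(1)
true_payoff = {"a": 0.2, "b": 0.8}  # hypothetical average rewards, unknown to the agent

def pull(action):
    """Stochastic reward: 1 with the action's payoff probability, else 0."""
    return 1.0 if random.random() < true_payoff[action] else 0.0

# Estimate each action's value by trying it repeatedly (pure exploration).
estimates = {}
for action in true_payoff:
    rewards = [pull(action) for _ in range(1000)]
    estimates[action] = sum(rewards) / len(rewards)

best = max(estimates, key=estimates.get)
print(best)  # with enough trials the estimates reliably identify "b" as better
```

An agent that never tried "b" would have no way of knowing it pays four times as much; that is the intuition the formal arguments later in the course will make precise.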
There are a bunch of other things that are characteristic of reinforcement learning problems. One of them is that the outcomes, the rewards and punishments based on which you are learning, can be fairly delayed in time; they need not be temporally close to the thing that caused them. Say you are playing a cricket match: you might drop a batsman, and he goes on to score 150 or something like that, and you lose the match at the end of the day. But the event that cost you the match is the dropped catch, which probably happened around the 12th over. Or the cause and effect could be even more convoluted. How many of you follow cricket? My god, really? It's losing popularity. Put your hands up, or I'm not going to give cricket examples.

So we talked about delayed rewards: the reward could come much later in time than the action that caused it. Going back to our cycling case: I might have done something stupid, or gone over a stone somewhere while cycling at very high speed; there might have been a small stone in the road, and that causes me to lose my balance. I try my level best to get the balance back, I might not manage it, and finally I fall down and get hurt. That doesn't mean that what caused the falling down was the last action I tried. I might have desperately tried to jump off the cycle or something, but that is not what caused the punishment; what caused the punishment happened a few seconds earlier, when I ran over the stone. So there can be this kind of temporal disconnect between what causes the reward or punishment and the actual reward or punishment, and it becomes a little tricky: how are you going to learn those associations? Quite often you are going to need a whole sequence of actions to obtain a reward; it is not going to be a one-shot thing.
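One standard way of handling this temporal disconnect, which we will develop formally later in the course, is to credit each time step with the discounted sum of all the rewards that follow it, so an action is judged by what eventually happens and not just by the immediate payoff. A small preview sketch:

```python
def discounted_returns(rewards, gamma=0.9):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    Computed backwards over the episode in one pass."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# A single punishment (-1) at the end, e.g. finally falling off the cycle:
# the earlier steps (like running over the stone) still receive discounted blame.
print(discounted_returns([0.0, 0.0, 0.0, -1.0]))
```

Notice that the step three moves before the fall still gets a return of about -0.73: the punishment propagates back to the actions that actually caused it.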
Again, going back to the chess example: you are not going to get a reward every time you move a piece on the board. You have to finish playing the game, and at the end, if you actually manage to win, you get a reward. So it is a sequence of actions, and therefore you need to learn some kind of association between the inputs you are seeing, in the chess case board positions, or in the cycling case how fast the cycle is moving, how unbalanced you feel, and so on, and the actions you take in response to them. The inputs you are getting are what we sometimes call states, and you learn associations between states and the actions you take in response to those states. That is essentially what you are doing when you solve a reinforcement learning problem, and these kinds of associations are known as policies. What you are essentially learning is a policy to behave in a world: you are learning a policy to play chess, or a policy to cycle. You are not just learning about individual actions. And all of this typically happens in a noisy, stochastic world, which makes things more challenging. These are the different characteristics of reinforcement learning problems, and we will be looking at all of them as we go along. I will not be explicitly talking about each and every one of these bullet points, but all the algorithms and methods we look at as we go along in this course will have these aspects as part of them.
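A policy, in the sense just described, is simply a mapping from states to actions. In the simplest tabular form it can literally be a lookup table, or a rule derived from learned action-value estimates; the states, actions, and numbers below are made up for illustration:

```python
# A policy as an explicit state -> action table (tiny hypothetical world).
table_policy = {"start": "right", "middle": "right", "near_goal": "up"}

def act(state):
    return table_policy[state]

# Equivalently, a greedy policy derived from learned action-value estimates Q(s, a):
# in each state, pick the action currently believed to be most valuable.
Q = {
    "start": {"left": 0.1, "right": 0.7},
    "middle": {"left": 0.0, "right": 0.9},
}

def greedy_policy(state):
    """Pick the action with the highest estimated value in this state."""
    return max(Q[state], key=Q[state].get)

print(act("start"), greedy_policy("middle"))  # right right
```

Learning a good policy, rather than memorizing individual moves, is exactly the "learning to cycle" and "learning to play chess" framing above.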
Reinforcement learning has been used fairly successfully in a wide variety of applications. You can see a helicopter there; that is not a cut-and-paste error, the helicopter is actually flying upside down. Groups at Stanford and Berkeley have used reinforcement learning to train a helicopter to fly and to do all kinds of things, not just fly upside down; the RL agent can do all kinds of tricks with the helicopter, and I'll show you a video in a minute. It is an amazing piece of work; it was considered the showpiece application for reinforcement learning. They got such a complex control system to work, and it could actually do things at a much finer level of control than a human being could. Well, it is after all a machine, so you would expect that, but the tricky part was how it learned to control this complex system without any human intervention.
In the middle I have a couple of games. Can you see that? It's a bit too small and narrow. That is a game called backgammon. How many of you know about backgammon? One, two... How many of you know Ludo? Okay, fine: backgammon is like a two-player Ludo. You throw the dice, you move pieces around, and you take them off the board. It is a fairly easy game to learn, but there are all kinds of strategies you can play with it, and it is also a hard game for computers, both because of the stochasticity and because of the large branching factor: at each point there are many, many combinations in which you could move the pieces around, and the dice roll adds additional complexity. People weren't really getting great results, and then Gerry Tesauro from IBM came up with something called Neurogammon, which was trained using supervised learning on a neural network. If he had done it recently it would have been called a deep learning backgammon player; since he did it back in the early '90s, it was just called a neural network backgammon player. It played really well for a computer program; it was essentially the best computer backgammon player at that point.
Then Tesauro heard about TD learning, heard about reinforcement learning, and decided to train a reinforcement learning agent to play backgammon. What he did was set up a reinforcement learning agent that played against another copy of itself, and let them play hundreds and hundreds, rather thousands and thousands, of games. Essentially, you train one copy for, say, a hundred games, then freeze it as the opponent and continue learning against it. So as you learn, you are playing against better and better opponents; your opponent is gradually improving as well. This is called self-play. He trained a backgammon player using self-play, and it came to the point where TD-Gammon, as he called it, was better than the best human backgammon player in the world at that time. They actually had a head-to-head challenge with the human champion. There is a world championship of backgammon; it is apparently very popular in the Middle East, and people actually hold world championships. So he challenged the human champion, which IBM seems to do a lot: they challenged Kasparov to matches and things like that, and Tesauro worked for IBM. You should realize that people who spend a lot of resources getting computers to play games will probably be working for IBM. So Tesauro's agent beat the world champion, and you had a reinforcement learning agent that was the best backgammon player in the world, not just the best computer player anymore; we could actually make that claim.
There is another game there, a snapshot from the game of Go. Who here has played Go? Oh, come on, there's at least one or two people who have played Go. Have people played Othello? That's also a very small number; isn't it one of those free games on Ubuntu? I thought everybody plays that at some point or other; well, you would rather play Othello than watch paint dry. Anyway, Go is like a more complex version of Othello. It is again a very hard game for computers to play because the branching factor is huge, and it is actually a miracle that humans play it at all, because the search trees and other things involved are really complex. This is one case which clearly illustrates that humans solve problems in a fundamentally different way from what we write down in our algorithms, because we seem to make all kinds of intuitive leaps in order to be able to play Go. There is a person, David Silver, who currently works for Google DeepMind; before that he spent some time with Gerry Tesauro at IBM, and at some point along the way he came up with a reinforcement learning agent called TD search that plays Go at a decent level. Still not master-level human performance, but it plays at a pretty decent level. What I am pointing out here is that on things that are typically hard for traditional computer algorithms, or even traditional machine learning approaches, RL has had good success.
Here is another example. If I'm supposed to point, I forget which screen I should use; they told me to use only one of the screens for pointing because it's hard for them to record the other one. Forget it. There are some robots on the bottom left of the screen. That is a snapshot from the UT Austin robot soccer team, called Austin Villa, and they use reinforcement learning to get their robots to execute really complex strategies, which is really cool. The nice thing about the robot soccer application is that they don't use reinforcement learning alone; they use a mix of different learning strategies, and also planning and so on, which is going on in the other studio. They use a mix of different kinds of AI and machine learning techniques to get a very competent agent that is very hard to beat, and they have been the champions, I think, for two or three years running now in the humanoid league. And again, there are hard control problems, things like how do I take a spot kick, for which they use reinforcement learning. That is a really hard balancing problem: you basically have to balance the robot on one leg and swing the other leg so that you can take the kick. It turns out to be a hard control problem, and they used RL to solve it.
Up on the top right is the application which is probably the one that actually makes money out of all these three: essentially using reinforcement learning to solve online learning problems. Online learning is a use case where I do not have the feedback available to me a priori; the feedback keeps coming piecemeal. For example, suppose I have news stories that need to be shown to people who come to my web page. When people come to the page, some editors will have picked, say, 20 stories for me, and from those 20 stories I have to figure out which ones to put up prominently. And what feedback am I going to get? Again, nobody tells me which stories a user is going to like; I cannot run a supervised learning algorithm here. The feedback I get is this: if the user clicks on the story, I get a reward; if the user does not click on the story, I do not get a reward. That is essentially all the feedback I am going to get; nobody tells me anything beforehand. So I have to try things out; I have to show different stories to figure out which one the user will click on, and I have very few attempts in which to do this, so how do I do it effectively? People have used a supervised approach for solving this, and it has worked fairly successfully, but reinforcement learning seems to be a much more natural way of modeling these problems. And not only in this kind of news story selection: people use reinforcement learning ideas in ad selection as well. How are the ads selected that you see on the sides when you go to Google or some other page? There might be some very basic economic criterion for selecting a slate of ads, say these 10 ads would probably give me the right payoff, and then you have to figure out which three of those 10 to put up there, and things like that; you could use a reinforcement learning solution for selecting those. Of course, there is this whole field called computational advertising, which is a lot more complex than what I just explained, but RL is a component of computational advertising as well.
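The news-story setting above is naturally modeled as a multi-armed bandit: showing a story is an action, a click is reward 1, no click is reward 0. A minimal epsilon-greedy sketch, where the click-through rates are made up and this is only an illustration of the idea, not any production system:

```python
import random

random.seed(42)
true_ctr = [0.02, 0.05, 0.10]  # hypothetical click-through rates for 3 stories

counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]  # running estimate of each story's click rate
epsilon = 0.1

for _ in range(20000):
    # Explore with probability epsilon, otherwise show the best-looking story.
    if random.random() < epsilon:
        story = random.randrange(3)
    else:
        story = max(range(3), key=lambda i: values[i])
    click = 1.0 if random.random() < true_ctr[story] else 0.0
    counts[story] += 1
    values[story] += (click - values[story]) / counts[story]  # incremental mean

print([round(v, 3) for v in values])  # estimates approach the true click rates
```

The epsilon of exploration is exactly the "I have to show different stories to figure out which one he will click on" part; the greedy choice is exploiting what has been learned so far.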
Okay, here is the video, courtesy of Andrew Ng's web page. Do people recognize the guy there? It's not a human-sized helicopter, but it's still fairly large. Amazing, huh? All of this is being learned by an RL agent. This goes on for a while, so we'll stop here.