This tutorial explores the intersection of reinforcement learning (RL) and control theory, aiming to bridge the gap between theoretical RL advancements and practical applications in physical systems. It highlights the challenges and opportunities in applying RL to continuous-time, real-world control problems, emphasizing the need for robust, reliable, and safe learning systems.
all right everyone, if you can please get seated as quickly as possible, we're about to get started momentarily, in fact we're about to get started right now, I changed my mind.
So my name is Yisong Yue, I'll be chairing this tutorial session. It is my great pleasure to briefly introduce Ben Recht so he can get going. Ben is on the faculty of EECS at
Berkeley; he's done research on a range of topics, most recently at the intersection of learning and control, which is what he'll be talking about today.
Just one quick logistical note: if you have a question during the Q&A session of this tutorial, please ask it at the mics that have been set up for Q&A so that it will get recorded. And with that, let's welcome Ben Recht. Hey, hey, oh, this is on, great. Welcome to Stockholm
everybody. So where to begin? Today is kind of a culmination of, I guess, a six-month-long project trying to write up some thoughts about reinforcement learning, control, and machine learning. This has been a longer line of study in my group for the past few years, and so this is kind of the culmination tutorial, trying to figure out: what did we figure out from trying to read all the literature? I've got to preface it all by saying this will be very short and biased, so we're probably gonna fail to cite everybody's work, and I apologize; just bug me afterwards and we can add you to the citation list. It's a huge and growing field with a very long history, and what I want to do today is try to give you a quick tour through it from a very particular perspective, which is: if you had come up as someone who knew control theory, how would you actually go and jump into this reinforcement learning space? Now, why is that the perspective I take, other than the fact that it's my own background? Much of the success of reinforcement
learning, and it has been a tremendous success, has been in environments that are very controlled, like games, and we've defeated the hardest games that humans can play, notably Go. But as we then try to move from the situation of games to where we have to interact with the real world,
many new problems arise and notably the
problems that we really have to care
about are that these physical objects
are going to interact with people and we
already have these physical objects
based on machine learning systems
actually ending human lives and that's
scary this is something that machine
learning never had to grapple with in
its history until now. And so, now that our technology is at the forefront of
engineering and our technology is being
pushed into all these situations that
have to interact with people it means
that we have to step up and take new
responsibility over the outcomes of what
we do so in particular the kinds of
problems I want to look at today are
exactly those that interact with the
physical world and perhaps have this
kind of view that might come from a bit
more classical of a background so for me
what reinforcement learning is and this
is we can all debate the actual
definition but I think at its most broad
it's trying to understand how do you use
past data to enhance the future behavior
of a dynamical system and from that
perspective if you had been a controls
engineer you might say wait that's
control theory, right? So what is that distinction? In some sense that distinction mostly comes from which department you did your undergraduate degree in. If you were in a department which has an E in it, many of them, then reinforcement learning is a bit of a subset of control, and it's taught that way. On the other hand, from a computer science perspective, reinforcement learning is really the primal theory and control is this very particular special case. Taking our kind of prejudices aside, for
me control theory has always been
focused primarily on continuous spaces
and actually primarily on continuous
time I'm not going to talk about
continuous time today because computers
do compute things discretely, but we will be talking about continuous spaces, whereas
reinforcement learning starts with a
very discrete view of the world the
other thing that's primal in controls is models: models are things that we are supposed to derive from physics, and perhaps all we're gonna have to do is estimate a parameter or two; we never necessarily have to actually go in and learn a model. Loosely, in reinforcement learning, data is really the key, right?
everything in reinforcement learning is
how do you take data and just turn it
directly into actions and I guess the
other main difference between controls
and reinforcement learning right is that
controls gets published in very stuffy IEEE journals, and reinforcement
learning of course gets published in the
front page of the New York Times
sometimes all right so different in
their PR departments as well so today
what we want to do is try to unify these
two camps and we're gonna do that in a
funny way which is what we're gonna do
is we're gonna try to solve continuous
problems using the techniques from
reinforcement learning and we're gonna
see what insights that gives us, and we're gonna see what we can learn from both of these communities, and then I will pose an endless list of problems, of questions that I just don't know how to answer. So we're gonna take a baby step today, and the number of things I don't know is uncountably larger than the things I do. Okay, so as I mentioned
at the beginning the kinds of things I'm
interested in is trying to understand
what are the limits of learning systems
that interact with the physical
environment and I'm gonna kind of narrow
that down a little bit today to just
talk about how much do we have to
understand systems in order to do
control, and the foundations of this will be the following.
I'm a theorist at heart there will be
experiments that will appear in the talk
but at its foundation it's going to be
trying to merge the camps of view from
learning theory a view from control
theory and for me really the most
important thing here will be optimization, because at its core, optimization will kind of unite everything. Okay, so control theory: it's
really just the study of dynamical
systems that have inputs. You have a system; you can play with this input, which here is u, and it maybe will spit out some output y, but in particular it has some state x, and the x will evolve according to some update law. Okay, so the next state is a function of the current state and the current input, and today we're not going to bother with the output being anything but the state, so it's kind of the simplest setting. So x is the state, and it will always have dimension d (remind me if you see something else), and u is the input, and it will have dimension p; when these come up I'll try to remind us of what those dimensions are. So reinforcement learning kind of starts from the same place, right? It's the study of
discrete dynamical systems with inputs
and in this case the object that we use
to describe how the discrete system
evolves is an MDP a Markov decision
process so the Markov decision process
the probability of the next state is
simply a function of the current state
and the input and in some sense this is
what defines state state is that what I
need to predict the future and the main
difference between the Markov decision
process setting and the control setting
is that it's these are taking their
values in discrete space rather than in
a continuous space so there are D
discrete States P discrete inputs and
everything can be described now in terms
of tables rather than some kind of
continuous functions. Okay, so I'm almost not going to talk about MDPs at all after that introduction; this is just to make the connection. If you don't know much about MDPs, there are a variety of tutorials out there that I will recommend on the web page. But what's nice is that though this is usually where reinforcement learning starts, we're gonna start with this other model, the control one; we're gonna start with this model and see where we can go.
okay so let's do that okay so now
optimal control is just trying to solve
an optimization problem to make a
control system do something I have this
input u; I want to find the u that gives me some properties out of the x's. So here I've broken the input up into two: there's u, that's the thing I can control, and there's e, which is some signal I can't control; e is either an error or an exogenous input or whatever you would like to call it, it's something I can't control. And today I will always assume that e is stochastic, okay, it's always a random variable; you can make your life harder if you make it adversarial. Okay,
so now we have a state transition
function that's a function of the state
the input and the disturbance the cost
function is given incrementally over
time by some cost C T this is another
difference between reinforcement
learning control optimization especially
optimal control is always minimizing
costs and reinforcement learning always
maximizes rewards which is kind of a
glass half empty glass half-full kind of
perspective control theories are very
pessimistic people that's just kind of
the nature of the field so we're gonna
minimize costs. e_t is our noise, f_t is our
state transition function and here I'm
letting it change from time to time just
being as general as possible for most
simple settings people typically assume
that this is stationary so it's the same
F at every step okay so now here's a
really important thing
tau is the trajectory tau is essentially
everything I've seen from time zero up
till now
and the input to the system is just some
function of that trajectory okay so it's
now a function of the future it's just
function of all the data I've seen up to
now so the decision variable here is
this function pi which we call a policy
and a policy is simply something that
picks out an action based on everything
I've seen before okay so that's the
decision variable that's somewhat
complicated there are lots of different
ways to parameterize and write those
down, but that's the thing we're searching for: this policy. Oh, there it is; just hold that for one second: cost, transition, policy.
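For reference, the optimization problem being described here is, as I read the slide, roughly the following (a transcription in the talk's notation, not an exact copy of the slide):

```latex
\begin{aligned}
\underset{\pi}{\text{minimize}} \quad & \mathbb{E}_{e_t}\!\left[\sum_{t=0}^{T} c_t(x_t, u_t)\right] \\
\text{subject to} \quad & x_{t+1} = f_t(x_t, u_t, e_t),\\
& u_t = \pi_t(\tau_t), \qquad \tau_t = (x_0, u_0, \ldots, u_{t-1}, x_t).
\end{aligned}
```

Everything later in the talk, LQR included, is an instance of this template.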
let me do an example to motivate this
and it's very simple and that's trying
to go fly a quadrotor we want maybe say
want to get the quadrotor from point A
to point B and so we have to model this
and fortunately for us the models are
handed to us we don't necessarily have
to learn them and the simplest model
would be Newton's laws and Newton's laws
would just say that the position z is the old position plus the velocity, up to constants (we're just scaling everything so that's true); the velocity is going to
be the velocity plus the acceleration
that's how that updates just saying that
the derivative of velocity is
acceleration the derivative of position
is velocity and then F equals MA okay
and if I put those together that
actually gives me a state representation
and it's very nice it's linear in this
case, so the new state
is a linear function of the current
state and the input and now what should
we do for cost so for cost what we're
gonna do is this would be one if I'm not
on the blue dot and then zero if I am on
the blue dot and if I look at that it
seems that maybe you guys can solve this
but it seems like that might be hard to
optimize because it's very discrete so
what people do in controls is typically
come up with a surrogate cost so instead
of minimizing this discrete cost, I would
minimize the surrogate cost
like the sum of the squares and this is
just saying this will be larger than
zero if I'm not on the dot and will be
zero if I'm on the dot so it's capturing
something and this is very much like
what we do in machine learning we don't
minimize a zero one classification error
we minimize some kind of surrogate which
surrogate you use as a design question
and then the analysis goes in to proving
that that surrogate is the right thing
to do so it's the same in controls and
here we could even add maybe a quadratic
function of the input as well right and
so we can design a cost function and
then say well does this actual does this
do to solve the problem I initially
cared about okay
what's nice about what I just posed is
this as quadratic as linear constraints
you can write down the KKT conditions
and solve it with least squares or if
maybe you're a little lazy
you could call CVX or perhaps you could
run backpropagation on this as well
variety of different options that's
probably that might be slow it's also
interesting is that how this parameter R
here R is essentially some trade-off
between how much power I use and how
much and how quickly I go to zero and
here I'm just gonna graph two values of
R here's the solution paths for time and
position and again we're just trying to
get the position to 0 and the solution
path for time and the control action U
and what you see is that when R equals
10 it's clearly trying to minimize the
amount of control action that we apply
and it takes longer to get to 0, whereas for a smaller value of R, when R equals 1, now we go to 0 much faster but we exert a lot more force. And
what this is also showing is that there
isn't necessarily one cost function that
we want to optimize and controls which
makes it a lot more challenging than in
classification in classification and
regression usually there's one cost
function there's some error that we'd
like to minimize. In controls, when you're actually designing systems, there are lots of competing criteria at the same time: you may have a limited battery so you can't use too much force, and you might have a fixed time by which you really have to get to the origin, and you have to kind of account for all of these in the optimization design. So the cost function itself is not given; it's something that we choose as part of the design process, and it's something
that's typically iteratively refined
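As a concrete illustration of the "write down the quadratic program and call CVX" option from a few lines back, here is a minimal sketch, mine rather than the talk's, of the quadrotor example: a discretized double integrator with the quadratic surrogate cost and an R-weighted input penalty, solved with cvxpy. The time step, horizon, and initial condition are made-up illustrative values.

```python
import numpy as np
import cvxpy as cp

# Double integrator (position z, velocity v), discretized with step dt.
dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([[0.0],
              [dt]])

T = 50                      # horizon (illustrative)
x0 = np.array([1.0, 0.0])   # start one unit away from the target, at rest
R = 10.0                    # trade-off between control effort and convergence speed

X = cp.Variable((2, T + 1))
U = cp.Variable((1, T))

cost = 0
constraints = [X[:, 0] == x0]
for t in range(T):
    cost += cp.sum_squares(X[:, t + 1]) + R * cp.sum_squares(U[:, t])
    constraints.append(X[:, t + 1] == A @ X[:, t] + B @ U[:, t])

prob = cp.Problem(cp.Minimize(cost), constraints)
prob.solve()

print("optimal cost:", prob.value)
print("first action:", U.value[:, 0])
```

Sweeping R in this sketch reproduces the trade-off just described: large R gives gentle inputs and slow convergence to zero, small R gives aggressive inputs and fast convergence.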
okay so there's our quick example this
turns out to be a specific case of a
much more general problem which is
called the linear quadratic regulator
So the linear quadratic regulator is optimal control where my cost function is quadratic and my dynamical system is linear: there's linear dynamics, there's a quadratic cost, so it's linear quadratic, and we call
this lqr for short and I'm gonna refer
to this throughout the talk as a very
simple baseline that allows us to to
compare and contrast different
approaches to the general optimal
control problem okay so optimal control
back to the more general case: how would you solve it? I told you how to solve LQR; it turns out that the way you would solve LQR is sort of how
you would solve the general problem even
when F is nonlinear if everything is
known I told you the model I told you
the cost you can either do batch
optimization, which will probably be non-convex, but this kind of looks like a computation graph, so though I was joking about it, you can immediately write down a PyTorch graph and solve it this way, so just do backpropagation. You could also use a non-convex interior point solver; many of my colleagues who do this sort of thing will actually use software like IPOPT or SNOPT, and they will give good solutions. Or there's this general principle of dynamic programming, which you can't always do, but when you can, it's quite efficient:
it tends to be linear time in the time
horizon okay so if I knew F then I could
either run dynamic programming or batch optimization
optimization
if I don't know F then the question is
what do I do
Okay, so if I don't know F... oh man, fire code... if I don't know F, the question is what's the right thing to do, and actually this is really where we
start that was your quick introduction
to dynamical systems and control and now
we're back into machine learning
I don't know F I do know well I don't
care if you know C or not for this talk
but like I said the cost is usually
something you design so let's just
assume we know it but then the question
is what do I actually do if I don't know
the dynamics and so we're gonna try to
reinvent reinforcement learning to solve
that problem okay so let's do that so
let me motivate that when do we not know
the dynamics as soon as we move to
something slightly more complicated
there are a variety of systems where it's actually sometimes hard to get a full
dynamical model so let's motivate it
with what the reinforcement learning
people motivate their research with
which you know data center cooling I
love this problem I love this problem
mostly again because it shows that
control people my control theory friends
they have the worst PR departments in
the world because data center cooling is
you know the control theorist would call
it air conditioning and it's really hard
to get kind of like a flashy write-up in the New York Times if you just, you know, figure out how to solve an air conditioning problem with deep learning. So how would you
learn how to cool a data center or a
building a room like this so there are a
variety of different techniques one I
could build a finite element model of
the data center, so I have a full ordinary differential equation of everything, but that seems like overkill for a variety of reasons, especially because people reconfigure their data centers on a daily basis. I could do what the air conditioning people tell me to do, in which case you build a kind of much coarser model with heat sources and sinks, and this is actually what would typically be done in air conditioning; this would just be an ordinary differential equation model with temperatures and changes in heat. Or I could be really cavalier and say let's not do any modeling whatsoever and just try to find a function that maps
state to action okay so those are the
three possibilities. We could identify everything, and there are actually applications for that: in
particular this is a turbine and for
really high performance aircraft you
really do want to identify everything
because you're pushing up against the
limits of what's possible you can
identify something very coarse which
would be kind of what I was arguing is
probably what people actually do in
practice for air conditioning and this
kind of underlies what we call model
predictive control. They have people leaving? Is there no way they can stay? Okay, okay, okay, sorry. How far is it? Well, I don't know what I do at this point. Okay, all right. I apologize that everybody has to leave, I'm sorry, that's unfortunate. Right, so there went my one funny joke, unfortunately interrupted by the bouncer.
That will happen. Oh, Francis, it's on Facebook Live? Where's the link? Thank you. Okay, thank you Francis. So let's thank the general chair, Francis Bach, who has arranged for this to be on Facebook Live. Soon? Are you going to tweet it? It will be tweeted, it will be tweeted. Okay, good,
fine back to action we don't need we
don't need no stinking models that's
that's us right that's everybody who's
here that's the machine learning point
of view and right that's kind of I think
that's at the core what people do in
reinforcement learning but I want to
make an aside with this one because
while it seems crazy that this is what happens in reinforcement learning, it is also what we do in PID control. I don't know who has learned PID control? Some people? Anybody? Okay, good. What's PID? That's good, right. So what's PID control? PID control is kind of... it's this funny thing where 94, 95%... go ahead, you saw, we're good. All right, so we've got it worked out. So what is PID control? Probably 95% of the control systems in industrial production are PID controllers; now, that's a rough
surveys of thousands tens of thousands
of systems and found that what people
implement in practice is PID control and
PID control is incredibly simple
typically you have one signal that you would like to drive to zero, like you want to stay in a lane in a self-driving car and not crash into the divider on the 101 (I know it's a dark joke, but anyway). So what do you do to stay in the lane? You first assume that your vision system is perfect, okay; say the vision system is perfect, then what do we do? Then you have this error signal, which is the deviation from the center of the lane; that's some time series, the deviation from the center is going to be some error, and what I'm going to do is
come up with an input that's a
combination of the error the derivative
of the error and the integral of the
error so it's proportional integral
derivative three parameters and you
typically don't need a model to actually
set this up there are heuristics that
will that actually do very well in
practice that automatically tune these
parameters without a model and moreover
most people don't even use the
derivative term because it's very hard
to tune, it's very sensitive. So two parameters suffice for about 95% of control applications.
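To make the two-or-three-parameter point concrete, here is a bare-bones discrete-time PID loop; this is a generic sketch, with made-up gains and a stand-in scalar plant, and real implementations add refinements like anti-windup.

```python
def pid_controller(kp, ki, kd, dt):
    """Return a stateful PID update: error -> control input."""
    integral = 0.0
    prev_error = None

    def step(error):
        nonlocal integral, prev_error
        integral += error * dt
        derivative = 0.0 if prev_error is None else (error - prev_error) / dt
        prev_error = error
        return kp * error + ki * integral + kd * derivative

    return step

# Illustrative use: drive a scalar "lane deviation" toward zero.
controller = pid_controller(kp=1.0, ki=0.2, kd=0.0, dt=0.05)  # kd = 0 gives the common PI case
deviation = 0.5
for _ in range(200):
    u = controller(-deviation)   # error = reference (zero) minus measurement
    deviation += 0.05 * u        # stand-in for the true (unknown) lane dynamics
```

Setting kd to zero gives the two-parameter PI controller that, as noted above, covers the vast majority of industrial loops.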
Now, one question is: do we have to move beyond that? Maybe. But then, for this remaining 5% of really challenging things, where we'd like to add some learning, some control, something more sophisticated, how much of a jump do we have to make? Do you have to go from two parameters to twenty million parameters, or is there something in between? That's the question; that's what we would like to know. Okay, so with that motivated,
here's our learning to control problem.
I just wanted to put that back up to remind us where we are. So how might we actually... well, before we even talk about how to attack that, let's set this up as a bit more of a math problem. Okay, so let's actually define the parameters, and the parameters here are going to be the following: we can generate n trajectories of length T, okay, n trajectories of length T, and so the time horizon is T here. We can generate
n as arbitrarily many and then the
challenge is we want to build a policy
or a controller which have the smallest
error given that sampling budget and so
N and T here I'm just gonna weight equally; you don't have to actually do that: it may be more expensive to run lots of experiments, so that a long time horizon T is easier than multiple experiments, or it may be that multiple experiments are easier than a long time horizon. I'm just lumping them in as a product today, trying to make it as simple as possible,
and the question is what's the right
thing to do. That's my oracle; what's the right scheme to use? You can already see it's more complicated than what we typically do in optimization at ICML,
where we have just a function Oracle or
a derivative Oracle here we have
something a little bit more complicated
we could treat it just as giving us
function values because we can
accumulate the rewards but we have a lot
more structure under the hood and we
should try to take advantage of that
extra structure so our big question is
how many samples do we need to actually
solve that problem let's see what we can
do okay so to solve that problem I'm
going to first invoke my favorite design
principle which is the linearization
principle which basically says if you
have an algorithm and you think it
solves deep learning I want you to show
me what it does on linear regression
first; or, you know, if I had something that solves SAT, which is NP-complete, I want to see what happens on 2-SAT instances, just as a sanity check,
and this has basically been the last
four years of my research
and many of my colleagues' research, so I feel like this gives a lot of interesting insights: starting with linear models, seeing what the complicated method does on the simple case, and a lot of times you can actually back out good insights about what's happening in the more complicated nonlinear space by looking at the linear case first. Now, a lot of times what happens in the linear case doesn't inform the nonlinear case: it's not necessary that something that works on linear models works well on nonlinear models, but you'd like something that's going to work on nonlinear models to at least not be too bad on linear models. That's my axiom here. So I
already introduced this one for us the
simple problem the linear case I'm gonna
call lqr what's cool about lqr is it's
been studied forever it has a closed
form solution people understand the
stability people understand robustness
properties it's really just a very well
analyzed problem when we know A and B. And when we don't know A and B, well, now we can actually start to think about how this compares: we can compare ourselves at least to what would happen if we did know A and B. The hard part with optimal control, when this function is nonlinear and more general, is that even solving the control problem with perfect information might be hard. So here we're
kind of going to something where if I
have perfect information I can solve it
it's well studied I can do a lot of
different kinds of analyses and when I
move to this kind of case where we have
imperfect information hopefully we can
bound the gap okay so we're gonna use
we're gonna use lqr as a kind of our
foundation but we're gonna present
methods that work on general more
generally and so as far as I can tell
there are three general things that
might work and I'm going to lump names
that come from the MDP land onto these
three general things but there are
changes that have to be made when you
move from mdps to these continuous
models. Okay, so the first thing would be: fit a model to the data, and I'm going to call that model-based RL, even though model-based RL actually contains other things; for me, today, model-based RL just means do some experiments, fit a model, and use it somehow in the loop. Then there's model-free; model-free RL has basically two
sub branches and at least all the deep
RL talks I've seen break it down into
these two categories category one is
estimate the cost function itself, the cost function of the entire optimization, from data. So try to forget that there are constraints; since they are equality constraints, this is all actually just one big cost function, and perhaps I could use regression to tell me what that cost function is and estimate it from data, and you'll see why
dynamic programming becomes critical for
doing that and then the other one which
is the most cavalier is direct policy
search which just says let me just
directly try to optimize for the policy
without trying to fit anything else okay
so those are our three categories and
let's see where we can get. Here's model-based; I like model-based because this is the only one that fits on a slide, so it satisfies my simplicity criterion. Okay, so
the first thing we do is we collect
simulation data and based on simulation
data well we should have that the next X
is some function of the current X the
current input and the noise okay and
there will be some noise new that's
different than E right so this is kind
of our assumption roughly I put
squiggles here to beam I'm being
incredibly imprecise anytime there's
squiggles we're just going to take this
as close to true if it was actually true
what would you do if I wanted to find
the best Phi that satisfied the equation
I would solve a least squares problem
and maybe you could construct a more
complicated loss maybe you could
construct a more interesting loss but
that's probably what I would do as a first cut, which is fit the dynamics using supervised learning, which we already understand. Actually, I was just at COLT, and the nice thing about going to COLT is you realize we don't even understand supervised learning, so let me not even say that, but we understand it pretty
well we know a lot about supervised
learning so once you have a supervised
learning model, what you could do is solve an approximate problem: instead of
solving the problem that we wanted to
solve we can solve this approximation
and the approximation plugs in in some
capacity this model that you fit and
maybe models the noise that would be
accrued and so we have some
approximation now the goal is just how
far off is this approximation from what
we started with okay for people who've
done controls this is kind of the bread
and butter what the community has done
the first stage these first two lines
are called system identification for
controls we call it supervised learning
and then when you take the system the
system that's identified you can design
a control problem around the identified
system okay so that's kind of like what
probably is what most controls engineers
would have suggested that you do okay
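Here is a minimal sketch of that two-line recipe for the linear case: excite the system with random inputs, then fit A and B by least squares. The "true" system and noise level below are placeholders used only to generate example data; they are not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, T, n = 2, 1, 50, 20          # state dim, input dim, horizon, number of rollouts

# Placeholder "true" system used only to generate data for this example.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

X, X_next, U = [], [], []
for _ in range(n):
    x = np.zeros(d)
    for _ in range(T):
        u = rng.normal(size=p)                              # random excitation
        x_next = A_true @ x + B_true @ u + 0.01 * rng.normal(size=d)
        X.append(x); U.append(u); X_next.append(x_next)
        x = x_next

# Stack regressors [x_t, u_t] and solve the least squares problem Z @ [A B]^T ~ X_next.
Z = np.hstack([np.array(X), np.array(U)])                   # shape (n*T, d+p)
Theta, *_ = np.linalg.lstsq(Z, np.array(X_next), rcond=None)
A_hat, B_hat = Theta.T[:, :d], Theta.T[:, d:]
```

A natural next step is to design the controller (for example, LQR) as if A_hat and B_hat were the truth, which is the identify-then-control recipe just described.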
But let's be adventurous: we know that we've solved Go by other means. What else might we do? Okay, what else might we do? Approximate dynamic
how does approximate dynamic programming
work this is really the core of
classical reinforcement learning is
methods from approximate dynamic
programming things that you have heard
whether they be Q-learning or TD or policy iteration or value iteration... help me out, Gergely, what did I forget? Which one? DQN? DQN, yeah, that's a good one, DQN, thank you; they all fit into this box. DQN, that's a good one. So it's really classic, and it's not just classical RL: lots of contemporary
methods use this too I'm gonna just make
a quick disclaimer because I've
presented this a couple of times now if
you haven't seen this before I don't
think you're gonna understand it at the
end of my tutorial and in some sense
this is why RL has been tricky for a
long time it requires an investment of
time to understand the primitives and
it's really it's worth spending the time
but in the few slides I've prepared it
might not make sense, but it will be a taste of what happens. So here's my approach: what I'm gonna do is talk about dynamic programming and then say what it means to approximate the dynamic program. So let's rewrite our optimal control problem; again, assume everything is known for now (the machine learning will come in in a second). If we knew everything, we could define this value here to be
called the Q function I'm calling it Q 1
because it's from 1 to the final and it
basically says that if you start at
state X and your initial action is you
what's the minimal achievable
optimization value from those two
initial conditions and this is called a
q function or a value action function ok
So how do I compute the Q function? Well, if I started at the end, the
q function is easy it's just whatever
that final cost is that's clearly what
I'm going to accrue and then you could
get a recursive formula which just
follows from working from the back to
the front
and it has this form so that you add in
the new cost you have an update for the
next state and you have the Q function
from one time in the future so
oftentimes this is called cost to go ok
so this is the this is at time K what
the cost to go is from action U and
state X ok and you could derive this
again just by by dynamic programming
work from the back work your way forward
what's cool about these things all right
if I knew this Q function and I want to
solve this problem what do I do
well I know that this is the value taken
for any X and any u this will be the
value I get running the problem so what
I have to do is just minimize the Q
function with respect to U and that will
be the best action and that turns out to
be the optimal policy and this is why
people like these methods you get this
compact thing which is to minimize some
Q function subject to subject to the
state being XK now let's put my
optimization hat on for a second in
discrete land where I have tables that's
usually where we motivate Q learning in
that case the state-action pairs are just a table, and you fill in the value at every entry of the table, and if I want to minimize it I just look at the minimum value of a column. If I'm an optimizer, I note that this optimization problem is in no way necessarily easier than the one I started with; so even if I do the bookkeeping, it's not clear at all that I can solve the problem that I want to solve any more effectively than solving this original problem. So you have to kind of build that in: I have to be able to minimize this Q function well in order to have that. But
that's just like that's an aside in the
continuous case that's an extra
complication but in lqr magic happens
again people love lqr because magic
happens at every corner
you look so here's the cool magic here
let's imagine I decided that the final
cost is quadratic I'm just gonna assert
this is my optimization problem
everything's quadratic the final cost is
quadratic. Then when I do this calculation, which is minimizing Q with respect to the next action, the step into the future: well, Q is quadratic, f was linear, so this is a quadratic, and quadratics are closed under minimization, so I get a quadratic back. So I get some new matrix P_{t+1}, and this is my new state step into the future, and then some residual cost from the error. So, in order for this to work, the error has to be zero mean and uncorrelated with itself, so it just has to be white noise, independent from time to time (actually just uncorrelated), to have this formula work. Okay, so that's my formula,
and what I can actually write down
although it's absolutely horrendous: the formula for P. Now, where does this come from? It comes from minimizing a quadratic; whoever has minimized a quadratic before: this is a Schur complement, right, of some bigger matrix (I just didn't show you the bigger matrix which we did the Schur complement on). And the optimal action comes out from minimizing this function, and you could pretty much read off that this is the optimal action. It's very cool,
and this makes everything very simple: that means I could compute all this stuff offline. I mean, you can compute it offline to get these formulas; you'd have to derive them, and the derivations are not complicated (if I really wanted to torture my undergraduates I could assign it to them, but then they get mad when you do these kinds of things), but there really is nothing sophisticated here: it is just linear algebra and dynamic programming that gets us these formulas. And what's
particularly interesting is if you take
a limit as time goes to infinity and get
an incredibly long time horizon then you
get a really interesting property out so
let's go back to this one real quick one
thing that's cool is that the optimal
policy is a linear function of the state
so the best thing to do is multiply the
current state by a matrix for the lqr so
very simple policy turns out to be
optimal on the infinite time horizon the
optimal policy is to multiply by a fixed
matrix there's some matrix one fixed
when you compute in every time step all
you do is multiply by that matrix that
makes for a pretty simple controller and
the way you solve for it is you get
basically the you just take a limit of
the equations on the previous slide you
get this matrix which is KX and the way
that you get P is by solving this
equation which is nasty but it's doable
there is a cyber teen that will solve
this for you so life is good I don't
know if it's in tensorflow
but Google friends you know the D are
this thing is called a discreet
algebraic riccati equation we should we
should add that and the dynamic
programming here has a simple form because quadratics are a miracle, and this is also probably why we study it so much: quadratics happen to have all the properties you want to make this simple. What's also cool is that the solution is independent of the noise variance: your performance degrades the more noise you have, but your action doesn't change, so that's interesting. And finally, for finite time horizons, you could have just solved this with batch solvers,
and the last thing I'll point out and we
always overlook this in reinforcement
learning but let's just I want to point
this out the optimal policy here on a
fixed time horizon is not time invariant
so if you have a fixed time horizon the
optimal policy is not time invariant you
can search for the best time-invariant policy, and people do this a lot, but that is not the optimal policy, so it's always good to keep that in mind: the shorter your time horizon, the less optimal they are. So there are a lot of things you have to worry about as you kind of get to these terminal conditions.
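For concreteness, here is a sketch of those formulas in code: the finite-horizon backward Riccati recursion from the previous slides, plus the infinite-horizon gain obtained from the discrete algebraic Riccati equation via scipy. The A, B, Q, R below are placeholder matrices for illustration.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # placeholder dynamics
B = np.array([[0.0], [0.1]])
Q = np.eye(2)                             # state cost
R = np.array([[1.0]])                     # input cost

# Finite horizon: work backwards from a quadratic terminal cost.
T = 100
P = Q.copy()                              # asserting terminal P_T = Q
gains = []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal action is u_t = -K_t x_t
    P = Q + A.T @ P @ (A - B @ K)                        # Riccati recursion, one step back
    gains.append(K)
gains.reverse()                           # gains[t] is the time-varying gain for step t

# Infinite horizon: one fixed gain from the discrete algebraic Riccati equation (the DARE).
P_inf = solve_discrete_are(A, B, Q, R)
K_inf = np.linalg.solve(R + B.T @ P_inf @ B, B.T @ P_inf @ A)
```

Consistent with the point above, the time-varying gains agree with K_inf far from the terminal time and depart from it near the end of the horizon, and none of this depends on the noise variance.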
Okay, that's LQR, to teach everybody dynamic programming. Again, there's no machine learning yet; this is just classical controls. Where's machine learning? Here we go: machine learning is this stage. So this is the crazy idea, and I love it, and it uses one of my favorite things that's ever been invented in math, so it's really cool. We have this
recursive formula for the finite time
horizon on the infinite time horizon
well to make our lives easier we can
introduce something called the discount
factor which is some number that's less
than one that damps things off to
infinity, and this just makes my life easier: limits always exist once you introduce the discount factor. It's
saying way out in the future the cost is
exponentially small and so if I have a
cost that's being discounted and I have
a static stationary update now I have
these two bellman equations and you get
a fixed point operation which is that
the Q function is equal to some non
linear function of the Q function okay
So how do I solve something that has a nonlinear equality constraint like that? Well, I could say, roughly, that if I were to actually take that step out into the future, this equation should be true: the Q function at the current state and the current action should be the cost plus the min over actions of the Q function at the future state. And if I've observed x_k, and I've observed u_k, and I've observed x_{k+1}, I could just run
gradient descent and this is Q learning
Q learning is just saying I have this
fixed point equation I'm going to solve
that fixed point
using stochastic approximation in this
case; because it's not the gradient of anything, we have a fancier name, but the analysis is exactly what you would have done in the original Robbins-Monro paper, although it gets quite tricky for general Q functions. Okay, but it's cool: I got this simple update and I can just run this, and hopefully, if all my ducks are in a row, this will allow me to effectively learn a Q function just from data. Now,
again for the tabular case where all I
have are discrete states and discrete
actions then this is this is a very well
defined update when I have functions and
function approximation things get messy
I'm not gonna go through that today
because well I'll really be in pain but
again this is not to OS effort to adapt
this to that case as well okay alright
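For the tabular case just mentioned, the whole stochastic-approximation scheme fits in a few lines. This is a generic cost-minimizing Q-learning sketch; the environment step(x, u) -> (cost, next_state) is a stand-in black box, and the exploration scheme and step sizes are illustrative choices, not anything prescribed in the talk.

```python
import numpy as np

def q_learning(step, d, p, gamma=0.95, alpha=0.1, episodes=1000, T=100, seed=0):
    """Tabular Q-learning for a cost-minimization MDP with d states and p actions.

    `step(x, u)` is assumed to return (cost, next_state); it is the black-box
    environment, not something specified in the talk.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((d, p))
    for _ in range(episodes):
        x = rng.integers(d)
        for _ in range(T):
            # epsilon-greedy exploration around the current greedy (min-cost) action
            u = rng.integers(p) if rng.random() < 0.1 else int(np.argmin(Q[x]))
            cost, x_next = step(x, u)
            # stochastic approximation of the Bellman fixed point (costs, so a min)
            td_target = cost + gamma * Q[x_next].min()
            Q[x, u] += alpha * (td_target - Q[x, u])
            x = x_next
    return Q
```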
One last approach. So that was approximate dynamic programming; it's heavy: you do dynamic programming, you realize that in the infinite time horizon you have this nice fixed-point equation, and then I attacked that fixed-point equation using techniques from stochastic optimization, or in particular stochastic approximation, and that is, in theory land, pretty much what people
study when it comes to RL. Now, that was true until 1999, you know, because this caused pain and suffering in the 90s: all the analyses are hard, trying to get good guarantees was hard, and sometimes, when you introduce function approximation, when you move to the continuous space, you can't even prove these things converge to stationary points, can't prove they converge to anything. So in 1999 some folks published a paper at NIPS that proposed... actually, no, sorry: in 1992 somebody published something at NIPS that proposed an algorithm which I'm going to go through today, and then in 1999 it was analyzed and shown that under function approximation this works quite well, and
that's the idea of just being cavalier
and going directly at the policy okay so
I'm calling this, broadly, direct policy search. I'm gonna actually introduce it without any of this control business, and just introduce it as optimization, because you
can kind of get everything you want to
know about how to do policy search from
the idea of sampling to search ok so in
this case there's no dynamics let's take
everything out of it let's just try to
optimize: we want to minimize a function phi of a variable z, and I just want to
solve that minimization problem and now
let me do some sleight of hand to come
up with a crazy way to solve it so the
first thing I could do is instead of
minimizing over variables I can minimize
over probability distributions; there was a long time at NIPS where this is all we ever did, and if you're a variational person you love this stuff, right. It turns out these are completely equivalent, because I could always just put a delta function at the z I want; these are
completely equivalent things but
optimizing over probability
distributions is hard so what you
typically do is make an approximation
and optimize over parametric probability
distributions ok
so what I want to do is minimize the
expected value of this function subject
to a parametric form of the probability
distribution and I'm just gonna optimize
the parameters to get the smallest cost
and then I could sample from that
distribution and get Z's that should
have low cost at least that's the idea
ok that's that's roughly the idea and
what's cool and I think what people love
this is that Williams showed that you
can get a really cheap stochastic
gradient formula for this second problem
and it goes like this let's call that
function J of theta because Z is no
longer in the picture that now is just J
of theta because we integrated Z away
the gradient of J with respect to theta
is the expected value under this
distribution of the function value times
the gradient of the log of the
probability. That's actually not that complicated a formula, and we could derive it very quickly, but the key thing is: if I were to sample z from this distribution parametrized by theta, this thing inside the expected value I could just compute. If you know the probability distribution, I could hopefully compute the gradient of the log of the probability distribution,
and then you can multiply by the
function value. And so the algorithm... oh, okay, I'm gonna do this backwards: let me derive this formula first and then show how to run stochastic gradient. The derivation is super cute, which is another reason why I think people love this, and it's kind of the same trick that you use when you prove the Cramer-Rao bound, which is that if I
take the gradient of this function: the gradient is not going to do anything to z, it's only going to do something to the probability distribution, and then I could just multiply and divide by the probability distribution (anybody who's seen the proof of the Cramer-Rao bound has seen that trick), and that now is the gradient of the logarithm, and that's it. So this is now the integral of this function against the probability distribution, also known as the expected value of that function. So this gives
us an algorithm this gives us an
algorithm which is to sample Z plug in
this formula and treat that as a
stochastic gradient so that's it sample
Z compute this gradient approximation
this is an unbiased estimate of the
gradient of J and as I could use
stochastic gradient to solve this
problem, and this is called the REINFORCE algorithm, usually with all caps; I don't like the all caps, I think it was an acronym, but just take the name.
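Here is the algorithm just derived, written out for a Gaussian sampling distribution with mean theta; a minimal sketch assuming a fixed variance and a plain constant step size.

```python
import numpy as np

def reinforce(phi, dim, iters=2000, sigma=0.3, lr=0.01, seed=0):
    """Minimize E_{z ~ N(theta, sigma^2 I)}[phi(z)] with the score-function estimator."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(iters):
        z = theta + sigma * rng.normal(size=dim)     # sample z from p(.; theta)
        grad_log_p = (z - theta) / sigma**2          # gradient of log N(z; theta, sigma^2 I)
        g = phi(z) * grad_log_p                      # unbiased estimate of grad_theta J(theta)
        theta -= lr * g                              # stochastic gradient step
    return theta

# Illustrative use: phi only ever returns function values, never derivatives.
theta_hat = reinforce(lambda z: float(np.sum(z**2)), dim=5)
```

Note that no derivatives of phi appear anywhere, which is exactly the point made below about these being derivative-free methods; it is also why the variance of this estimator, discussed a bit later, matters so much in practice.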
okay so this sounds amazing
let's use this to solve something here
we go. So let's say I have an arbitrary function on the hypercube, minus 1 to 1, so it takes values in bits, and it's an arbitrary function. So I need a distribution to sample from the hypercube: how about just a Bernoulli distribution, every coordinate independent with different parameters? That gives me a parametric distribution like this, and so here's my algorithm for minimizing functions over the hypercube: I will evaluate the function and move that way. I'm using this example to point out something that you always have to keep in mind, even though it's not obvious, which is that I don't think this solves any problem; I
don't know what problem that solves
my guess is nothing or nothing
interesting I'd be curious to see which
class of functions this actually would
minimize effectively, because it's not using any information whatsoever about phi, and hence it kind of lumps simple functions like linear ones in with arbitrarily complicated functions like random ones; they're treated the same. And so maybe the
dynamics can make sense of that but
that's something you have to prove and
it's not at all clear that this is
actually going to be something sensible
to do okay I just want you to keep that
in mind as we now apply it to control oh
this one seems absurd but then we just
go and do it to control and it feels
like me you know is it actually working
why is this absurd well look we have to
take an expected value with respect to
this small class of probability
distributions it's not all probability
distributions it's a restricted class
and that might actually hamper what we
can do so by restricting to that class
we actually understand what that
restriction is costing us okay but let's
let's let's plow ahead let's plow ahead
and I'm gonna plow ahead and introduce
this is not what people typically use in
our L but it is what a lot of people
have been using in robotics and it's
called parameter perturbation so this is
the if I have this cost function this
would be so what are we gonna do we're
going to add a rent we have a policy the
policy is going to be a parametric
function has some parameter vector theta
what we're gonna do is just add a little
bit of noise to that parameter vector
and then estimate the value estimate how
good that perturbation is okay so that's
our random sampling we could integrate
over this, and it turns out that what this algorithm does is essentially equivalent to the following approach: if I write the cumulative cost as the sum, and let my perturbation be Gaussian, then what I would do is add a little bit of Gaussian to theta one way, add a little bit of Gaussian to theta the other way, and then it turns out that the log of the probability distribution is gonna give me an omega_i out here, so what this ends up being is a finite-difference approximation to the gradient in the span of these random omegas. So in this case this algorithm gives me something that maybe looks
somewhat simple accessible right I mean
what I've done is I have a function
I can't compute the derivatives of the
function but I can approximate them
using finite differences, and we're approximating the finite differences in, you know, some random subspace. This particular instantiation is never called REINFORCE; it was initially called random search, by Rastrigin in
the 60s it was discovered in
evolutionary algorithms sometime in the
70s, where it's usually called a (mu, lambda) evolution strategy (I can't remember exactly what mu and lambda correspond to here parameter-wise, but probably sigma and m); it was also invented by our friends in stochastic approximation, by Spall, and it
was also invented in the learning theory
community it's called bandit convex
optimization. It's usually my feeling that if you have an algorithm that was developed independently by four different communities, they're probably onto something, right?
That's just signal. Okay, so we're gonna come back to this one; I call it random search. I know a lot of people hate that, but I'm just gonna stick with what the Russians called it, because they get mad; don't cross the Russians, bad things happen.
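Spelled out, the parameter-perturbation / random-search update described above looks roughly like this two-point finite-difference scheme along random directions; J(theta) stands for whatever rollout-cost oracle you have, and the step size, smoothing radius, and number of directions are illustrative.

```python
import numpy as np

def random_search(J, dim, iters=500, sigma=0.1, lr=0.01, n_dirs=8, seed=0):
    """Two-point finite differences along random directions (Rastrigin-style random search)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(iters):
        g = np.zeros(dim)
        for _ in range(n_dirs):
            omega = rng.normal(size=dim)                          # random search direction
            delta = J(theta + sigma * omega) - J(theta - sigma * omega)
            g += delta / (2 * sigma) * omega                      # finite-difference estimate along omega
        theta -= lr * g / n_dirs                                  # average over directions, then step
    return theta
```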
Okay, so policy gradient is different. REINFORCE is usually used for this algorithm called policy gradient, and what policy gradient does is it actually just replaces a
deterministic policy with a randomized
policy but this is no different than
what we were doing before it's just
saying that the input u, instead of being a deterministic function, is now a
randomized function and in continuous
spaces it's worth noting I probably
should have written this on the slide
but in continuous spaces what typically
you do is you take a deterministic
policy and add noise to it okay so
that's not perturbing the parameters of
the policy is perturbing the outputs of
the policy and that's the main
difference between policy gradients
and random search and it's actually hard
to write down exactly what this formula
looks like but we can do it for lqr and
I think it's instructive: you could say, let's say I wanted to solve a finite-time-horizon LQR; I could assume that it has a stationary policy, and then I can perturb
the stationary policy by noise so at
every time step I add noise to the
action and then I could compute the cost
and this is the update so the update
uses the state it uses those noise
directions and it just evaluates the
cost. But note that it has no gradient information about the cost, and it has no gradient information about the matrices; in fact, it's completely independent of the dynamics.
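As a sketch of what that estimator looks like for LQR with a static gain and additive Gaussian action noise, here is one REINFORCE-style update; the matrices, initial state distribution, and step size are placeholders, and this is the plain estimator without the baselines and other variance-reduction tricks that practical implementations rely on.

```python
import numpy as np

def lqr_policy_gradient_step(K, A, B, Q, R, T=50, sigma=0.1, lr=1e-4, rng=None):
    """One REINFORCE update for the randomized policy u_t = -K x_t + sigma * w_t."""
    rng = rng or np.random.default_rng()
    d, p = A.shape[0], B.shape[1]
    x = rng.normal(size=d)                     # placeholder initial state distribution
    cost = 0.0
    score = np.zeros((p, d))
    for _ in range(T):
        w = rng.normal(size=p)                 # the injected action noise
        u = -K @ x + sigma * w
        cost += x @ Q @ x + u @ R @ u          # only the scalar cost is ever observed
        score += -np.outer(w, x) / sigma       # grad wrt K of log N(u_t; -K x_t, sigma^2 I)
        x = A @ x + B @ u                      # dynamics are used to simulate, never differentiated
    grad_hat = cost * score                    # score-function estimate of grad_K E[cost]
    return K - lr * grad_hat                   # one stochastic gradient (descent) step
```

Note that A and B appear only inside the simulated rollout; the update itself uses just the states, the noise directions, and the accumulated cost, matching the description above.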
so this is one of these things I always
find frustrating it is certainly a
gradient method the reinforce algorithm
is a stochastic gradient method, but for the functions we care about, which is minimizing the LQR cost, we only get function estimates; so it's a gradient algorithm on a function that we might not necessarily care about. Again, it's worth emphasizing that this is
essentially a derivative free algorithm
because we never compute derivatives of
any of these quantities that we care
about and particularly then it suffers
from a lot of the complexity results
that go along with zeroth order
optimization so again there are two
different categories here that we're
covering: there's policy gradient and there's random search. REINFORCE applied to
either of these problems does not depend
on the dynamics and I think that's
really cool and that's why people love
them not only does it not depend on the
dynamics it doesn't depend on you
knowing how to compute the cost function
as long as there's some Oracle handing
you the cost function so if you have
some kind of simulator or video game as
long as the score is returned you can
apply the algorithm and I think this is
also why they're exciting: I could just go. This is actually kind of what I feel really differentiates computer science from the other engineering schools on campus: at the end of your first lecture you're doing stuff, so why not just do this too, right? They're just much more fearless. Now the
question is does it work how does it
work can we analyze it how does it
compare to other techniques; that's the next step. And I would like to emphasize that they're both derivative-free algorithms, and I'm gonna keep saying that. Okay, all right, so there are some caveats to this stuff: REINFORCE is not a
magic algorithm you have to be careful
both about the approximation error and
another thing I didn't mention the
variance when I run stochastic gradient
descent you care about the variance of
that gradient and if the variance is
really large you will never converge so
those are two things you have to take into account: you want a small-variance estimator of the gradient, you want the approximation error to the true function to be small, and you have to be able to sample; it's a lot of burden on
you which I might call modeling honestly
so even though it's model free there's a
lot of modeling that kind of goes in
under the hood. And again, I'm gonna say this again: it is necessarily derivative-free. Okay, but it's easy, and that's what
makes it exciting and that's what lets
us plow ahead great that's great so
actually this is a perfect time let's
take a 10-minute break we'll reconvene I
guess an ad should I be actually could
be really precise because I have the
clock here but that's probably dumb you
guys don't have the clock so let's
reconvene at five to five faziz so the
we'll be live-streaming a seven so when
you come back from the break if it's
full going a seven exact same vein but
on video awesome
a seven except thank you again Francie's
alright and so yeah so let's come come
back at actually I would say let 10 till
let's do 10 till that would be good 454
let's get restarted hi yes no I got it we're good
we're good that's right so he sounds got it
that's right so he sounds got it just FYI there is an overflow room with
just FYI there is an overflow room with a live string in a seven and just down
a live string in a seven and just down the hall that's right and if you done
the hall that's right and if you done for tchen Utley if you can't find a seat
for tchen Utley if you can't find a seat you you have to go to the overflow room
you you have to go to the overflow room or else that guy's gonna come back I
or else that guy's gonna come back I don't mean cool all right let's do this
don't mean cool all right let's do this where we're we we left off at
where we're we we left off at everybody's favorite parks
everybody's favorite parks well no actually I don't think anybody
well no actually I don't think anybody has this favorite part we left off is
has this favorite part we left off is when we move into learning theory okay
when we move into learning theory okay so I want to do just a little bit of
Okay, so I want to do just a little bit of learning theory — not too much today — but this is actually something that we grapple with throughout RL, and in some sense half of the papers in reinforcement learning that I see discuss a notion of what's called sample complexity. Sample complexity means a very particular thing in learning theory; it means a slightly more nebulous thing in reinforcement learning, and I kind of just wanted to touch on that here. Again, this is the only other slide that has the MDP formulation, even though we're talking about the tabular case on and off. The idea here is that I have a discrete variable X — it's not a vector, X just takes the values 1 to d — u takes values 1 to p, and I have an MDP, and the question is: what's the relative complexity of these methods?

Something you should take away from the first half of the talk is that in model-based methods — at least from my narrow view of model-based — what we would do is take data and then solve essentially a system of equations. Right? We have measurements, each of these gives us equations, and then we have unknowns, which could be the model, could be the Q function, or could be the policy. And so we can now do parameter counting to get some rough rule of thumb as to which one may or may not be more efficient. So this is one way to maybe try to decide which of these methods is best.

So we have three algorithm classes. In the MDP case, for model-based we get the value of the state, and we're just going to use the value of the state to predict what the probability of the transition is; for approximate dynamic programming we get the value; and for policy search we also get a value. Each of these gives us one equation for every time step. Now, how many parameters do we need to identify in the most general case? Well, the number of numbers required to give me the state transitions is just d squared times p — that's just counting, because the transition probability has three arguments. The discrete world is nice: it's very easy to count things. For approximate dynamic programming we need a Q function; the Q function has a value for each state and action, so it has d times p parameters. And policy search — what does that do? Well, for every state it gives you an action, and so that would also have d times p parameters. This is not parametric; this is actually just trying to write down what those mappings might look like.

And so, the optimal error — if I were to write it down and guess what would be optimal, and again this is very naive, this is just chicken scratch on a piece of paper — roughly, these are the sample complexities. Sample complexities have parameters in the numerator and the number of samples in the denominator, and they suggest that ADP and policy search are probably more efficient than a very broad fit-a-model approach to model-based RL. That's rough. I mean, it looks like that should be the case: assuming we know nothing about the model transitions, it looks like that would probably be the hardest, and then ADP and policy search should be easier. But there is no algorithm that gets these complexities currently out there, and if you're into this kind of thing, there are all these papers by people who try to get these numbers as small as possible.
small as possible now let's now go to the continuous world
now let's now go to the continuous world and the continuous world we can do the
and the continuous world we can do the exact same thing and the story
exact same thing and the story completely changes it's a very different
completely changes it's a very different kind of setting when the continuous
kind of setting when the continuous world we get more samples per iteration
world we get more samples per iteration when we have a model let's think about
when we have a model let's think about the least squares problem that we solve
the least squares problem that we solve right we have this we observe the state
right we have this we observe the state there are D elements in this state and I
there are D elements in this state and I get a mapping from a model to the state
get a mapping from a model to the state so there are D equations per time step
so there are D equations per time step whereas approximate dynamic programming
whereas approximate dynamic programming and policy search are still only using
and policy search are still only using values moreover once we move to the
values moreover once we move to the continuous case we actually worked out
continuous case we actually worked out what these things were the Q function
what these things were the Q function the approximate dynamic programming has
the approximate dynamic programming has D plus P over two parameters versus our
D plus P over two parameters versus our D plus P choose two parameters that's a
D plus P choose two parameters that's a quadratic this D plus P squared the
quadratic this D plus P squared the policy has D times P parameters and the
policy has D times P parameters and the model based has d squared plus DP in in
model based has d squared plus DP in in for lqr right for l q RS d squared plus
for lqr right for l q RS d squared plus DP you have a square matrix for the
DP you have a square matrix for the state and you have a rectangular matrix
state and you have a rectangular matrix for the control and so if we put that
for the control and so if we put that together and we get our rough rule of
together and we get our rough rule of thumb it seems like the model-based
thumb it seems like the model-based approach might require about D plus P
approach might require about D plus P over T to the square root that should be
over T to the square root that should be roughly our error if if life was good
roughly our error if if life was good and everything worked out with parameter
and everything worked out with parameter counting
counting whereas approximate dynamic programming
whereas approximate dynamic programming my suffer a factor of dimension and
my suffer a factor of dimension and policy search might be somewhere into
policy search might be somewhere into depending on the size of the parameter
depending on the size of the parameter either about the same or maybe a little
either about the same or maybe a little worse right because the product should
worse right because the product should be bigger than the
be bigger than the some okay rough rule of thumb but it
some okay rough rule of thumb but it looks different
looks different and the question is in practice is it
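And the corresponding counts for the continuous, LQR-style case with d states and p inputs — my reconstruction, with the right-hand column giving the rough rule of thumb from the talk rather than proved rates:

```latex
% A model gives d equations per time step; value-based methods get one number.
\begin{align*}
\text{model } (A, B):                &\quad d^2 + d\,p \ \text{parameters}
      &&\Rightarrow\ \text{error} \sim \sqrt{(d+p)/T} \\
\text{quadratic } Q\text{-function}: &\quad \tbinom{d+p}{2} \approx (d+p)^2/2
      &&\Rightarrow\ \text{an extra factor of dimension} \\
\text{linear policy } u = Kx:        &\quad d\,p
      &&\Rightarrow\ \text{in between, since } d\,p \ge d + p
\end{align*}
```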
And the question is: in practice, does it actually end up looking different as well? Are these model-free methods making effective use of the information that's given to us? That's really the question we have to figure out. And of course I do want to point out, again, to everybody here, that there's a constant that is of course going to go out front, both in the MDP case and in the continuous case, and a lot of times — this one's for my theory friends in the room — the interesting stuff is actually happening in that constant, not in the rate. So it's just worth keeping in mind that maybe there's some other complexity that comes out front that's more important than parameters, and let's see if we can get at that as well.

And I guess we could also talk — here's my... I can't come to a machine learning conference without a deep learning slide, so this will be the only one. Deep reinforcement learning, at least from my perspective today, is just this: the only thing that would change, if you wanted to implement it, is that we would parameterize our Q function or our policy by a deep neural network, or maybe we make our model a deep neural network. Everything is going to be tricky to analyze in this case, but none of the algorithms will change, and again this is why the linear principle is going to apply, so that hopefully we can get some insights — because nothing changes moving to deep networks other than the function approximator. So anyway, that's for next year; maybe then we'll analyze deep networks. We're going to just stick with very shallow reinforcement learning for the rest of the day, and again the reason why is because the algorithms don't change.
because the algorithms don't change okay so let's go turn to my friend lqr and
so let's go turn to my friend lqr and start to look at sample complexity
start to look at sample complexity rather than trying to prove stuff let's
rather than trying to prove stuff let's run some experiments because I like
run some experiments because I like running experiments on these continuous
running experiments on these continuous problems and in particular let's run the
problems and in particular let's run the silly double integrator experiment in
silly double integrator experiment in controls we call this model a double
controls we call this model a double integrator and with lqr and so what's
integrator and with lqr and so what's this plot so policy gradient is in
this plot so policy gradient is in the error bars are over ten trials and
the error bars are over ten trials and it's kind of the men and the max over
it's kind of the men and the max over those ten trials of what value is
those ten trials of what value is returned after about thirty thousand
returned after about thirty thousand samples we get something that is close
samples we get something that is close to the optimal of the riccati equation
to the optimal of the riccati equation so a lot of variants to actually get
so a lot of variants to actually get this this was a fun interaction on
this this was a fun interaction on Twitter I couldn't actually get it to
Twitter I couldn't actually get it to work at all and then a colleague tweeted
work at all and then a colleague tweeted or actually I think was on reddit
or actually I think was on reddit somebody come up came up with a better
somebody come up came up with a better implementation my friend Pavel who did a
implementation my friend Pavel who did a great job here told me that we have to
great job here told me that we have to do is instead of running stochastic
do is instead of running stochastic gradient descent you have to run Adam so
gradient descent you have to run Adam so this actually requires not policy
this actually requires not policy gradient but Adam and I don't know why
gradient but Adam and I don't know why but that's fine so we added some extra
but that's fine so we added some extra stuff we're able to get it to converge
stuff we're able to get it to converge people like Adam that's cool and it does
people like Adam that's cool and it does things but so an interesting thing is
things but so an interesting thing is that just fitting this model not telling
that just fitting this model not telling you anything about the structure but
you anything about the structure but just fitting it as a black box but
just fitting it as a black box but linear model and then just running
linear model and then just running treating that as true and not doing
treating that as true and not doing anything just treating it as true is
anything just treating it as true is indistinguishable from this black line
indistinguishable from this black line like ten samples and you get the answer
like ten samples and you get the answer and similarly approximate dynamic
and similarly approximate dynamic programming I implemented what was
programming I implemented what was called least squares policy iteration
called least squares policy iteration which is by uh which is a nice paper by
which is by uh which is a nice paper by Lazarus and somebody obvious names I'm
Lazarus and somebody obvious names I'm forgetting but it's it's a very fairly
forgetting but it's it's a very fairly straightforward that also achieves this
straightforward that also achieves this line okay ten samples so there's this
line okay ten samples so there's this weird gap here where the policy gradient
weird gap here where the policy gradient as we said and I did the vanilla
as we said and I did the vanilla algorithm that I showed in the first
algorithm that I showed in the first half which was just you compute the cost
half which was just you compute the cost you multiply it by the noise times the
you multiply it by the noise times the state accumulated and it is kind of what
state accumulated and it is kind of what immediately falls out but I've never
immediately falls out but I've never been able to get it to do something
been able to get it to do something nicer even after tuning the parameters
nicer even after tuning the parameters of Adam of the sampling scheme and
of Adam of the sampling scheme and what-have-you and and this is weird
what-have-you and and this is weird right because policy gradient I think
right because policy gradient I think I'm told now as a strawman but I'm told
I'm told now as a strawman but I'm told that direct policy methods are supposed
that direct policy methods are supposed to work right I mean this is like
to work right I mean this is like apparently what open AI is going to use
apparently what open AI is going to use to defeat dota and it seems like if it
to defeat dota and it seems like if it can't solve a double integrator
can't solve a double integrator something weird is happening
something weird is happening and I everybody attributes this quote to
and I everybody attributes this quote to Carl Sagan that extraordinary claims
Carl Sagan that extraordinary claims require extraordinary evidence but it
require extraordinary evidence but it was borrowed by Lance Armstrong when he
was borrowed by Lance Armstrong when he was defending himself against
was defending himself against accusations of doping and so the
accusations of doping and so the important thing that even Carl Sagan and
important thing that even Carl Sagan and every good Beijing would say is that's
every good Beijing would say is that's true
true only if your prior is correct right if
only if your prior is correct right if you had the right you know you actually
you had the right you know you actually had the real prior the claim might not
had the real prior the claim might not seem quite so so extraordinary and so I
seem quite so so extraordinary and so I guess the thing I want to say about
guess the thing I want to say about these policy gradient methods is like if
these policy gradient methods is like if we look at what other but people who
we look at what other but people who implement them say there are red flags
implement them say there are red flags even in their own descriptions and in
even in their own descriptions and in particular that reinforcement learning
particular that reinforcement learning results are tricky to reproduce
results are tricky to reproduce performance is very noisy algorithms
performance is very noisy algorithms have many moving parts which allow for
have many moving parts which allow for subtle bugs and many papers don't report
subtle bugs and many papers don't report all the required tricks is this is a
all the required tricks is this is a quote from a blog by some of the folks
quote from a blog by some of the folks at open a eye on their DQ on baseline
at open a eye on their DQ on baseline they also say that reinforcement
they also say that reinforcement learning algorithms are challenging to
learning algorithms are challenging to implement correctly good results
implement correctly good results typically only come after fixing many
typically only come after fixing many seemingly trivial bugs I would like to
seemingly trivial bugs I would like to ask them what a bug is in this regard
ask them what a bug is in this regard and honestly I don't know who would
and honestly I don't know who would actually put that in a car and it's not
actually put that in a car and it's not just them there's actually been a lot of
just them there's actually been a lot of papers about this Joelle Pino gave a
papers about this Joelle Pino gave a great keynote talk at AI CLR I believe
great keynote talk at AI CLR I believe that video is still up about some work
that video is still up about some work that her group has been has been doing
that her group has been has been doing with her collaborators trying to get at
with her collaborators trying to get at kind of how sensitive these algorithms
kind of how sensitive these algorithms are to a variety of settings and in
are to a variety of settings and in particular there's this famous plot and
particular there's this famous plot and we're gonna go into this pot later which
we're gonna go into this pot later which shows that if you change the random seed
shows that if you change the random seed algorithms can seem to have incredibly
algorithms can seem to have incredibly different performance so five grand
different performance so five grand seeds look one way five other random
seeds look one way five other random seeds look some other way and you know
seeds look some other way and you know that leaves two weird stuff and I
that leaves two weird stuff and I actually think this this plot is the
actually think this this plot is the more scary one because this is three
more scary one because this is three different implementations of I think TRP
different implementations of I think TRP oh yeah of TRP oh by open AI and just
oh yeah of TRP oh by open AI and just three different git repos and in
three different git repos and in particular two of them the blue and the
particular two of them the blue and the orange
orange up it's the same repo just a different
up it's the same repo just a different release point so two different release
release point so two different release versions of the same repo have very
versions of the same repo have very different behaviors and obviously the
different behaviors and obviously the first ones gonna apply the second one
first ones gonna apply the second one because if you're sensitive to random
because if you're sensitive to random seed and you just change the order of
seed and you just change the order of operations and how you access the
operations and how you access the randomness you'll get different behavior
randomness you'll get different behavior and so that's a bit worrying so this
and so that's a bit worrying so this leads to something that looks like
leads to something that looks like reinforcement learning is incredibly
reinforcement learning is incredibly challenging to implement but now the
challenging to implement but now the question that I want to ask for the
question that I want to ask for the remainder of today is is does it have to
remainder of today is is does it have to be this way or is there some other way
be this way or is there some other way forward for these kinds of problems
forward for these kinds of problems we're allowed to use machine learning
we're allowed to use machine learning and in particular can we use models when
and in particular can we use models when we start doing problems which involved
we start doing problems which involved continuous control and does that work
continuous control and does that work better one thing I do want to point out
One thing I do want to point out before going to models, which I think is pretty interesting and worth highlighting, is that if you just run the random search algorithm — which, as was pointed out to me over the break, actually appears in Nemirovski and Yudin, so it's everywhere in the Russian literature; so we're going to stick with calling it random search — this algorithm, remember, perturbs the parameters rather than perturbing the actions. This actually achieves, in I guess 5,000 iterations, a pretty good policy. But still — 5,000 versus 10 is a huge gap. So even though it's a general-purpose algorithm, the thing you always want to ask yourself is: even if you have a general-purpose algorithm, how much are you paying for that generality? And if it involves a lot of variance and a lot of uncertainty, maybe it's not something you want to rely on.
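And for contrast, a sketch of the parameter-perturbation alternative: basic random search in the Nemirovski–Yudin / evolution-strategies style, using two-point estimates along random directions in the space of gains. Again, the horizon, perturbation scale, and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
horizon, nu, step, n_dirs = 20, 0.1, 1e-2, 4

def cost(K):
    """Deterministic finite-horizon LQR cost from a fixed initial condition."""
    x, total = np.array([1.0, 0.0]), 0.0
    for _ in range(horizon):
        u = K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return total

K = np.zeros((1, 2))
for _ in range(1000):
    grad = np.zeros_like(K)
    for _ in range(n_dirs):
        delta = rng.normal(size=K.shape)
        # Two-point (antithetic) estimate: perturb the *parameters*, not the actions.
        grad += (cost(K + nu * delta) - cost(K - nu * delta)) * delta
    g = grad / (2 * nu * n_dirs)
    g /= max(1.0, np.linalg.norm(g))     # keep the sketch numerically tame
    K -= step * g
print("random-search gain:", K)
```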
So, obviously, we're at five o'clock and I'm supposed to have another hour or so, and yeah, of course I have more content. So let me tell you some ideas about our friend model-based RL. There's this debate, even in the MDP case, about whether or not model-based RL is as good as these model-free methods under a variety of different settings, but for continuous control we already saw that the sample complexity conceivably is a little bit better, and moreover maybe we can even try to analyze a simple case and see if that's actually true. So the one thing that's hard here: that big supervised-learning box is easy, roughly well understood. The hard part is actually what control problem we solve, because just plugging in the truth — or plugging in an estimate and treating it as true — feels wrong. Right? Even in machine learning, this is kind of why we have all this stuff with confidence sets and a variety of things. So how can we take into account the fact that we know this model is not perfect — and maybe we can even estimate how imperfect the model is — can we actually take that into account in control? And the answer is yes. This was something where, if you went to a control talk in the 90s, you would have seen this slide in everybody's talk, and you probably still do if you go to one today. So here we have our system, which is evolving according to f; we're trying to design a controller, and what we do is estimate a model of f — alright, so f-hat is our estimate, it's not the true f. So now we have a system that's f-hat; we want to design a controller for the true f, but we only have f-hat. Well, the key idea is: what if I could estimate the uncertainty in f and treat that as another building block in my dynamical system, so that I could account for not just f-hat but every possible instantiation of my uncertainty about f? So I could use high-dimensional statistics to say not only how good this model is, but how far this model is from the truth — I can estimate how far my model is from the truth. So it's not just getting a point estimate, it's getting a confidence set. Once I have the confidence set, I can try to design a controller that works for every single function in the confidence set, and that is what we call robust control.
is what we call robust control okay so I'm calling this connected thing where
I'm calling this connected thing where we're coupling statistics with control
we're coupling statistics with control course id control and this is actually
course id control and this is actually something well that for whatever reason
something well that for whatever reason because the statistics wasn't there I
because the statistics wasn't there I didn't really see in the controls
didn't really see in the controls literature and the controls literature
literature and the controls literature they're much more into you take your
they're much more into you take your you you take the model that comes out
you you take the model that comes out and you treated a certainty equivalent
and you treated a certainty equivalent but you it's also possible controls
but you it's also possible controls people love putting in these extra
people love putting in these extra things about modeling errors to just add
things about modeling errors to just add those in and the other fancy present we
those in and the other fancy present we can estimate modeling errors and we can
can estimate modeling errors and we can estimate modeling errors of modeling
estimate modeling errors of modeling errors if we wanted to and we can have
errors if we wanted to and we can have Turtles all the way to Hough okay so let
Turtles all the way to Hough okay so let me describe how this might work and it's
me describe how this might work and it's actually kind of cute and now it's gonna
actually kind of cute and now it's gonna get a little bit more mathematical then
get a little bit more mathematical then what we've had so far but it's kind of
what we've had so far but it's kind of it it's again hopefully relatively
it it's again hopefully relatively simple for how this might work so what
simple for how this might work so what I'm gonna do my analysis of the one step
I'm gonna do my analysis of the one step of lqr just to show how this works and
of lqr just to show how this works and I'm kind of walk us through all the
I'm kind of walk us through all the different steps and all the different
different steps and all the different machinery that we'd need okay
machinery that we'd need okay so we're gonna say B is unknown and I
Okay. So we're going to say B is unknown, I have this cost function which is quadratic, and I have this equation which is linear — and B is unknown. So what can I do? Well, I can run experiments and collect some data. I'm going to have x_i equal to B times u_i, plus x_0, plus noise, and maybe the u_i would be Gaussian, and then I can estimate B by just solving least squares. Now, what we all know is that this will give me an estimate — this gives me an estimate basically for free, it's pretty cheap. What I think is less appreciated is that, one, if we assume properties of the noise, we can actually bound how small this error is with high probability — and the noise doesn't even have to be that restricted to get these kinds of epsilons out. And moreover, if I can just run simulations, I can estimate how sensitive I am using something like the bootstrap, and the bootstrap can give you a pretty good estimate of this epsilon without even having to work through the concentration inequalities. So whenever you build a model, you can test how accurate it is by bootstrapping. I imagine this is true for the deep case as well — I haven't thought about how you'd do the bootstrap with deep models, but the deadlines are far away, so ICML 2019, let's see.
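A minimal sketch of the kind of bootstrap being alluded to for the one-step problem: fit B by least squares, then resample synthetic data from the fitted model and the estimated noise level, and take a quantile of the refit errors as the epsilon. The sizes and the 95% quantile are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 3, 2, 50
B_true = rng.normal(size=(d, p))
x0 = rng.normal(size=d)

# Experiment: x_i = B u_i + x0 + noise, with Gaussian inputs u_i.
U = rng.normal(size=(n, p))
X = U @ B_true.T + x0 + 0.1 * rng.normal(size=(n, d))

def fit(U, Y):
    """Least-squares estimate of B from y_i = B u_i + noise."""
    theta, *_ = np.linalg.lstsq(U, Y, rcond=None)
    return theta.T                              # d x p estimate

Y = X - x0                                      # x0 is observed, so subtract it off
B_hat = fit(U, Y)
sigma_hat = np.std(Y - U @ B_hat.T)             # crude residual noise estimate

# Parametric bootstrap: simulate from the fitted model, refit, look at the spread.
errs = []
for _ in range(200):
    Y_boot = U @ B_hat.T + sigma_hat * rng.normal(size=(n, d))
    errs.append(np.linalg.norm(fit(U, Y_boot) - B_hat, 2))
eps = np.quantile(errs, 0.95)
print("bootstrap epsilon:", eps, " true error:", np.linalg.norm(B_hat - B_true, 2))
```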
Okay, so now here's the idea for how we would actually turn this back into the optimization problem. All we're going to do is note the very simple fact that if Delta-B is my error — that's B minus B-hat, where B-hat is my estimate — then the x here is equal to the x here, because I'm just replacing B with B-hat plus its error. This is very naive and silly, but what it lets me do is rewrite the problem as a robust optimization problem. I can have just the estimate in the constraint — this is what I had on the previous slide, the estimated model is my constraint, not the true one — but then what I'm going to do is write the effect of that error into the cost function. So the error gets pushed into the cost function, and once the error is in the cost function, I can take the worst case. Okay, so what I did was take the problem we started with, which was just minimizing a quadratic, and I took a square root because it made my life easier — as you'll see in a second, because I'm going to apply the triangle inequality, spoiler alert. So I had to take a square root, but once I did that, I now have just nominal constraints, which are easy, and a robust cost. Okay, so this is robust optimization, and it's robust in the sense that what we're trying to do is solve the control problem for every possible instance of that uncertainty. The trick — and this is actually the trick we used for LQR to analyze this — is that if you apply the triangle inequality and bound the maximum eigenvalue of Q, now I have a convex problem. The convex problem that gets introduced is a regularized version of the original problem, and if people know robust optimization, this happens all the time: uncertainty in the data can be translated into a penalized problem on the model, and that's exactly what happened here. El Ghaoui has really nice work on this kind of thing. Okay, so we just did the triangle inequality there.
just did the triangle equality there Renee do you have a question I will
Renee do you have a question I will repeat it
okay so Renee is that is asking the question which is good why do we have to
question which is good why do we have to do this decoupling a layering of
do this decoupling a layering of estimation and then control just a great
estimation and then control just a great question and I don't think we do but
question and I don't think we do but that's what we did
that's what we did so in our and the work that I'm going to
so in our and the work that I'm going to present that's what we did I don't
present that's what we did I don't necessarily think these steps have to be
necessarily think these steps have to be decoupled I think you would just have to
decoupled I think you would just have to come up with a different formulation
correct correct so what Renee is also saying is that this optimization problem
saying is that this optimization problem doesn't know anything about Q so we're
doesn't know anything about Q so we're estimating B without any knowledge of
estimating B without any knowledge of the of the costs we want to optimize and
the of the costs we want to optimize and that could be accruing a loss but I'm
that could be accruing a loss but I'm gonna punt on it because I don't want to
gonna punt on it because I don't want to I wanted to get a finite result the
I wanted to get a finite result the other thing that's nice for people who
other thing that's nice for people who care about these things is a
care about these things is a generalization bound on my model will
generalization bound on my model will turn into a generalization bound on my
turn into a generalization bound on my cost function and that's also what I
cost function and that's also what I really like so we have so this is now
really like so we have so this is now one of these weird things and again this
one of these weird things and again this is kind of addressing why not do them
is kind of addressing why not do them jointly is that we've developed a lot of
jointly is that we've developed a lot of techniques to figure out what the order
techniques to figure out what the order of epsilon should be given properties of
of epsilon should be given properties of the measurement process and what's cool
the measurement process and what's cool about this approach is that if you have
about this approach is that if you have those you have some model and you know
those you have some model and you know some things about its deviations those
some things about its deviations those immediately give us ways to bound the
immediately give us ways to bound the error of the optimization and that's
error of the optimization and that's that's kind of why I like this
that's kind of why I like this particular approach and I can hear
particular approach and I can hear myself through the wall that's weird
myself through the wall that's weird okay so hello everybody in a7 all right
okay so hello everybody in a7 all right now let's do this for linear systems
now let's do this for linear systems let's do this for lqr and I'm only gonna
let's do this for lqr and I'm only gonna do two slides it's gonna be very high
do two slides it's gonna be very high level and then we'll go back to
level and then we'll go back to experiments which are common more
experiments which are common more interesting but let's at least say what
interesting but let's at least say what is the theorem that we can prove and I
is the theorem that we can prove and I think that's kind of interesting that we
think that's kind of interesting that we could prove theorems about all of these
could prove theorems about all of these things for the lqr case
things for the lqr case so what you could do is you can run an
so what you could do is you can run an experiment 40 steps and then you could
experiment 40 steps and then you could ask them a a and B and it turns out that
ask them a a and B and it turns out that that actually is a weird optimization
that actually is a weird optimization problem because all of the X's depend on
problem because all of the X's depend on a and hence a depends on all the axes
a and hence a depends on all the axes everything is very coupled but with some
everything is very coupled but with some analysis you can show that this actually
analysis you can show that this actually gives you a optimal in the parameters
gives you a optimal in the parameters estimate of an D and you also have this
estimate of an D and you also have this funny matrix that enters in which is the
funny matrix that enters in which is the property of the system itself so like I
property of the system itself so like I said these constants actually do appear
said these constants actually do appear here and what they say is that if a
here and what they say is that if a system is easier to excite what this
system is easier to excite what this matrix lambda C is is basically how much
matrix lambda C is is basically how much signal do you see from a random input if
signal do you see from a random input if the signal is easy to excite this
the signal is easy to excite this denominator is large and the number of
denominator is large and the number of samples goes down that you would need so
samples goes down that you would need so signals that are easy to assist um's
signals that are easy to assist um's that are easy to excite are easier to
that are easy to excite are easier to estimate and with if we just assume the
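Schematically — suppressing logarithms, constants, and the failure probability, and writing Λ for the excitation matrix referred to as lambda_C — the estimation statement has the flavor of:

```latex
% After an experiment of length T with random inputs, the least-squares
% estimates satisfy, with high probability, something of the form
\max\!\big\{ \|\hat A - A\|,\ \|\hat B - B\| \big\}
  \;\lesssim\; \sqrt{\frac{d + p}{T \, \lambda_{\min}(\Lambda)}} ,
% so the better the inputs excite the system (larger \lambda_{\min}(\Lambda)),
% the fewer samples are needed.
```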
And if we just assume the worst case here, which would be when A equals 0 or is near 0, you get an optimal dependence, at least in the parameters, which I was hinting at many slides ago. Then we can use this triangle inequality technique, albeit in a more complicated way, to actually analyze the infinite-horizon version of LQR. It requires more bookkeeping because now it's an infinite problem — doing one step was relatively simple but already a little complicated; this requires more bookkeeping — but it is in a paper by Sarah Dean, Horia Mania, Nik Matni, myself, and Stephen Tu, and the references are at the end. It turns out that if you solve a relaxation of this robust optimization problem, you get that the error of the controller I produce — the cost minus the best thing you could ever do if you knew the model, divided by the best thing you could do if you knew the model, so a relative error — scales as some constants times the square root of (d + p) over T. So this was exactly what we hoped: by writing down what would maybe be the best case for model-based RL, that part we do get.
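In symbols, the shape of the guarantee (again schematic, hiding the logarithmic terms and the system-dependent constants described next) is:

```latex
% \hat K is the controller from the robust synthesis with the estimated model;
% J_\star is the optimal LQR cost with the true (A, B).
\frac{J(\hat K) - J_\star}{J_\star}
  \;\lesssim\; C(A, B, Q, R)\,\sqrt{\frac{d + p}{T}} ,
% valid once T is large enough that the right-hand side is small --
% which is also the sample threshold that guarantees the cost is finite at all.
```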
And then we get properties that depend on the actual system we're estimating. How large is the gain? If the gain is large, you need more samples. How sensitive is the closed loop? This gamma_CL is a very interesting quantity: it says, if you designed the optimal controller for the system and then blew on it, how much signal would you see? So the system is being stabilized by the optimal controller, and the question is how sensitive that closed loop is to perturbation — the more sensitive it is, the more samples you need. And again, the easier the system is to excite in an open-loop setting — meaning there's no controller, I'm just doing experiments — the fewer samples you need. So these quantities come in and show an interesting trade-off. What's also interesting here is that this also tells us when the cost is finite, which is crazy: we were doing experiments on a finite time horizon and extrapolating to an infinite time horizon. We're extrapolating out to say when this is actually finite — we're going to see in a second that it's not always easy to do that — and this is actually giving us a bound on how many samples we need to guarantee that we get finite cost. I'm going to describe in a second why finite cost is very important for practical considerations. Let's go back to my favorite example and examine why finite cost is important.
important okay so this is a very time model of the Google Data Center they
model of the Google Data Center they have some cooling rack that's some
have some cooling rack that's some servers and they put fans on them and
servers and they put fans on them and then the servers maybe shed heat to each
then the servers maybe shed heat to each other and this is not this is a really
other and this is not this is a really dumb model but maybe this would be like
dumb model but maybe this would be like what that model would look like where
what that model would look like where each server has some kind of activity
each server has some kind of activity that's heating it up and this number is
that's heating it up and this number is bigger than one so if I ran it open-loop
bigger than one so if I ran it open-loop longer and longer it's gonna go to
longer and longer it's gonna go to infinity that's called unstable this is
infinity that's called unstable this is unstable this is gonna blow up if I run
unstable this is gonna blow up if I run it without doing control and if it blows
it without doing control and if it blows up the farm blows up that's probably bad
now I have a fan at every location so I can cool it down individually but if I
can cool it down individually but if I want to get a real
want to get a real flashy result that I can get in the
flashy result that I can get in the newspaper what I really need to do is
newspaper what I really need to do is spend a little electricity as possible
spend a little electricity as possible because you need this p/e ratio that you
because you need this p/e ratio that you know that everybody makes a big deal
know that everybody makes a big deal about to be small so I'm gonna really
about to be small so I'm gonna really penalize doing too much effort with my
penalize doing too much effort with my control
control but that's weird now that introduces the
but that's weird now that introduces the tension that's very easy to clarify what
tension that's very easy to clarify what that tension is this system is called
that tension is this system is called unstable because it will go to infinity
unstable because it will go to infinity if you don't do something about it
if you don't do something about it but not but if I do least squares
but not but if I do least squares maybe I asked him eight this guy to be
maybe I asked him eight this guy to be 0.99 so if I did these squares and
0.99 so if I did these squares and that's 0.99 and I want to save energy as
that's 0.99 and I want to save energy as like maybe I won't control that one but
like maybe I won't control that one but it was actually bigger than one and so
it was actually bigger than one and so it may blow up and that's actually what
it may blow up and that's actually what happens in the simulation we do see that
happens in the simulation we do see that behavior I have three plots here one is
behavior I have three plots here one is this kind of course ID control where I
this kind of course ID control where I tell the true error between the
tell the true error between the estimated model and the real model and I
estimated model and the real model and I guess the blue curve the green curve is
guess the blue curve the green curve is where I estimate that using the
where I estimate that using the bootstrap and you see that there's some
bootstrap and you see that there's some degradation we did a very conservative
degradation we did a very conservative bootstrap here but it's not crazy bad
bootstrap here but it's not crazy bad the orange curve is what happens when I
the orange curve is what happens when I just plug in the model this is called
just plug in the model this is called certainty equivalence in control it's
certainty equivalence in control it's the naive thing to do take your point
the naive thing to do take your point estimate and pretend like it's true if
estimate and pretend like it's true if you take your point estimate and pretend
you take your point estimate and pretend like it's true you see it starts to
like it's true you see it starts to spike up here and there's no data down
spike up here and there's no data down here so the cost seems to be something
here so the cost seems to be something weird seems to be happening earlier if
weird seems to be happening earlier if we turn to the second plot what's
we turn to the second plot what's happening is this plot is the percentage
happening is this plot is the percentage of the time that when I run a simulation
of the time that when I run a simulation the model that's returned in closed loop
the model that's returned in closed loop is finite how often is my cost finite
is finite how often is my cost finite finite cost means hopefully I mean at
finite cost means hopefully I mean at least roughly finite cost means maybe
least roughly finite cost means maybe your data center doesn't catch on fire
your data center doesn't catch on fire infinite cost means definite fire and so
infinite cost means definite fire and so what we see is that after if you give it
what we see is that after if you give it the true cost sorry the true distance to
the true cost sorry the true distance to the model we actually get a nice curve
the model we actually get a nice curve in after about a hundred samples are
in after about a hundred samples are always returning nice stable behavior
always returning nice stable behavior when you use the bootstrap we're never
when you use the bootstrap we're never getting a good model up until about here
getting a good model up until about here but then we start to chase what the blue
but then we start to chase what the blue does and was cute
does and was cute is down here what's happening is our SDP
is down here what's happening is our SDP solver can't find a feasible solution
solver can't find a feasible solution that's also interesting and this is
that's also interesting and this is actually something we always take for
actually something we always take for granted if you can't find a feasible
granted if you can't find a feasible solution it might mean that you have a
solution it might mean that you have a bug it might mean that there's something
bug it might mean that there's something wrong with the model meaning that
wrong with the model meaning that perhaps there is no feasible solution
perhaps there is no feasible solution and this is why you don't want bugs in
and this is why you don't want bugs in your control design software because you
your control design software because you have to be able to distinguish between
have to be able to distinguish between witches which now the nominal control is
witches which now the nominal control is actually even out at 600 samples so well
actually even out at 600 samples so well beyond where we started about 10 percent
beyond where we started about 10 percent of the time is returning something
of the time is returning something unstable because because of this model
unstable because because of this model so the least squares estimate if you
so the least squares estimate if you just plug it in may yield in an unstable
just plug it in may yield in an unstable controller but the robust method yields
controller but the robust method yields a stable one okay so on this plot we
Okay, so on this plot we return to our friends, the model-free methods. Note that the x-axis is now ten times larger; we need more samples to make the comparison fair. This gray curve is random search (evolutionary strategies, bandit convex optimization, whatever you want to call it), so it's still about an order of magnitude off from the other methods, but it's doing something. The red curve is LSPI, and policy gradient is the blue curve, which under the stage lights is a little hard to see, but it traces out down here, and policy gradient has a hard time finding a feasible stabilizing solution even 50% of the time after 5,000 samples. Maybe we could tune it to be better. I tried; the Jupyter notebooks are linked off my blog, so you're welcome to try to do better. But something has to be done here to make these model-free methods outperform the model-based ones.
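For reference, here is a minimal sketch of the kind of random-search update on a linear policy that the gray curve refers to, roughly in the spirit of basic random search and evolutionary strategies. The step size, perturbation scale, number of directions, and the `rollout_reward` function are all hypothetical placeholders.

```python
import numpy as np

def random_search(rollout_reward, K0, step=0.02, noise=0.05, ndirs=8, iters=200, seed=0):
    """Derivative-free search over linear policies u = K x.

    rollout_reward(K) is assumed to run one or more rollouts with the policy K and
    return the total reward; the update below is a finite-difference-style gradient
    estimate along random directions, which is all this family of methods needs.
    """
    rng = np.random.default_rng(seed)
    K = K0.copy()
    for _ in range(iters):
        update = np.zeros_like(K)
        for _ in range(ndirs):
            delta = rng.normal(size=K.shape)
            r_plus = rollout_reward(K + noise * delta)
            r_minus = rollout_reward(K - noise * delta)
            update += (r_plus - r_minus) * delta
        K = K + (step / (2 * ndirs * noise)) * update
    return K
```

The differences between random search, evolutionary strategies, and bandit convex optimization are mostly in how the perturbations and step sizes are normalized, which is why they get lumped together here.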
So this is cool: we have this end-to-end bound. This is new work; we have an end-to-end bound that goes from estimation error directly to control error for LQR. And one question we were asking as we were digging through the literature was: why was this never done before? The biggest reason is that just getting the least squares estimate actually requires some heavy machinery. We needed to go to probability papers that had appeared on arXiv in the last year, so high-dimensional statistics. Basically, the difference between 1995 and 2018 is this growth in high-dimensional statistics and our ability to understand it and to analyze it, and that could now be plugged into tools from controls, maybe from the 90s, to try to link these things together. That said, even the controls result requires some new results for controls; in particular, it's building on something called system level synthesis that was developed by Nick Matni and his collaborators at Caltech, and that really was key to making this work.
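To fix ideas, the shape of the kind of end-to-end statement being described is roughly the following. This is only schematic: the norms, constants, failure probabilities, and the precise dependence on the system are all suppressed, so treat it as a sketch of the form of the bound rather than the theorem itself.

```latex
% Step 1 (estimation): after N experiments, with high probability the least-squares
% errors eps_A = ||A_hat - A|| and eps_B = ||B_hat - B|| scale like
\[
  \epsilon_A,\ \epsilon_B \;\lesssim\; \sqrt{\frac{n+m}{N}} .
\]
% Step 2 (robust synthesis): the controller \hat{K} built from (A_hat, B_hat, eps_A, eps_B)
% has a relative cost gap that degrades linearly in the estimation error:
\[
  \frac{J(\hat{K}) - J_\star}{J_\star}
  \;\lesssim\; C_{\mathrm{sys}}\,\bigl(\epsilon_A + \epsilon_B\bigr)
  \;\lesssim\; C_{\mathrm{sys}}\,\sqrt{\frac{n+m}{N}} ,
\]
% where C_sys is a system-dependent constant.
```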
And I think the thing that was really cool about this is that what we initially started to do was just look at the solution: the LQR solution comes out of this crazy equation I wrote down called the algebraic Riccati equation, and we tried to understand what happens to that solution as you perturb the parameters. We got very frustrated, and we spent about a year of frustrating time on understanding the perturbations. But if you somehow restrict yourself to always be robust, the analysis also ends up being easier. So it's safer and, for some reason (I mean, it actually kind of makes sense, because we're restricting our class), the analysis is easier. The surprising part is that it does look like this gets to something near optimal, and maybe that's the other reason why we're here.
What's really exciting is that there's been a lot of work, just in the last year, on this problem. There's all sorts of stuff I'm not going to talk about today, but there are multiple results at this conference about LQR, which is really cool. In particular, you can prove that policy gradient, if you start with a policy that doesn't blow up your data center, finds the optimal one. So as long as you can find one solution that is stable, that doesn't blow up, you can find the optimal one with policy gradient; if you don't know one to begin with, you may run into some trouble. There's also all sorts of other work adapting LQR, and what I'm not going to talk about today is adaptive LQR; something like 10 papers have just come out this year on this problem. I'll talk about that again at the end.
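As a side note on what "doesn't blow up" means concretely: the quantity these policy-gradient-for-LQR results optimize is the infinite-horizon cost of a fixed linear policy, which is finite exactly when the policy is stabilizing. Here is a minimal sketch of evaluating that cost, assuming known dynamics and white process noise; the function and variable names are mine, not from any particular paper or codebase.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost(A, B, Q, R, K, noise_var=1.0):
    """Infinite-horizon average cost of u_t = -K x_t for x_{t+1} = A x_t + B u_t + w_t.

    If A - B K is unstable (the "blows up" case), the cost is infinite; otherwise it
    equals noise_var * trace(P_K), where P_K solves the policy's Lyapunov equation.
    """
    A_cl = A - B @ K
    if max(abs(np.linalg.eigvals(A_cl))) >= 1.0:
        return np.inf
    P_K = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
    return noise_var * np.trace(P_K)
```

The convergence results being referenced say, roughly, that gradient steps on K keep this quantity finite and drive it to the optimum, provided the initial K already makes it finite.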
So it really is just starting; we're just really starting to understand this linear case, and I've had "simplest example" in quotation marks the entire time. What's amazing is how hard and complicated LQR is. LQR is not simple; it's just the simplest thing I knew how to write down, and analyzing it actually requires a lot of machinery and, in particular, a lot of new techniques. The one thing I'll say, if there are any theorists in the room, is that our bound required doing an analysis that avoids mixing-time arguments. This is just for the experts; if you don't know this stuff, I just want to point it out. Most people who analyze time series use an analysis that depends on mixing, and the mixing analyses turn this minus 1/2 into a plus 1/2, giving exactly the opposite behavior of what we see, and there are about 50 pages linked off the blog that do that. So again, this is the case of working in the most generality versus the specific case for the specific models we do have; there's a lot of interesting work just in time series analysis to try to push forward. Okay, that was for the experts.
Okay, so that's LQR. We're making progress; we're starting to understand the basics of the sample complexity. But I don't think anybody in this room is that interested in LQR. It is a workhorse. Honestly, something like 95% of the things that we do are PID control, and of the remaining 5%, 95% of those are model predictive control, and almost all model predictive control, which I will talk about in a couple of slides, is using LQR under the hood in one way or another. So it really is kind of a workhorse algorithm and it's worth understanding. But if we want to go do some complex robotics, perhaps we need something more sophisticated, and so a question is: does the converse of the linearization principle apply for these problems? All I'm going to do now is look at some experimental evidence; I have no answer. I'm going to look at one case study, and that'll be the end, and then we'll see where we go from here.
And so the case study is on this funny set of demos based in the MuJoCo simulator. So OpenAI, actually starting with the PhD work of John Schulman, and probably even earlier than that, starting with work that Sergey Levine had done, people started to use MuJoCo as a way of doing controls benchmarks for reinforcement learning algorithms. MuJoCo is a simulator: you get guys that look like this, you can make all sorts of various robots, you can execute the dynamics forward, and then from executing those dynamics forward you can try to optimize them. It was developed by Emo Todorov at the University of Washington.
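For orientation, evaluating a static linear policy on one of these benchmarks looks roughly like the following. This assumes the OpenAI Gym MuJoCo environments of that era are installed; the environment name and the old reset/step API shown here are version dependent, and the policy is just a random placeholder rather than a trained one.

```python
import gym
import numpy as np

env = gym.make("Humanoid-v2")            # environment names vary across gym/mujoco versions
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

# A static linear policy u = K s; here K is an untrained random placeholder.
K = 0.01 * np.random.randn(act_dim, obs_dim)

obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = np.clip(K @ obs, env.action_space.low, env.action_space.high)
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```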
So what we found is that this random search algorithm with linear controllers actually outperforms, on these MuJoCo demos, a variety of other approaches that have been tried, including natural policy gradient on linear controllers, natural policy gradient with some weird nonlinear controllers, TRPO, essentially everything we were looking at. The linear controllers themselves were able to get to these solutions much faster just using random search. In this case you want a larger number, and we were able to get larger rewards. But even though we get larger rewards (and I apologize, because there are lights here; let's just look at this for a second), even though we get larger rewards, this already points out something that I said at the beginning of the talk and want to say again now: one optimization problem probably does not capture everything we care about in control. In particular, this is the humanoid model, this last one. It is declared successful in the OpenAI Gym if the reward exceeds 6,000; as long as you exceed a reward of 6,000, you have a success. So this one has 6,000, and as you can see it's kind of weirdly hopping on one leg. It's a little hard to see, and I apologize that the video is a little washed out, but the dynamics at the foot are also doing something a little funny. And this one is the one that gets to 11,600, and again you've got to watch the feet, because the foot dynamics are absolutely not what a real robot could do; the robot would have fallen over a long time ago. So the problem here is that, at least for these models, they're too simplified to capture reality. In particular, the way they're modeling the contacts might not actually be accurate enough to really carry over, and moreover it's not at all clear what the high reward means; that's not necessarily a realistic gait.
The other thing, oh yeah, sorry, right, the other thing that would be worth considering is that, because random search is faster, we can start to actually evaluate random search on more random seeds. One of the weirdest things about reinforcement learning is the way that the random seed is treated like a hyperparameter, and that's bad. In some sense, I really feel like one thing we all have to think about (and if I can convince anybody to work on it, that would be great) is the right abstractions for how random seeds enter into machine learning. You want to fix the random seed for debugging, maybe; you actually don't know. But if you fix the random seed for debugging and you start tuning hyperparameters to random seeds, you might not actually be seeing the behavior that you want; you might be tuning to random seeds. So really you want to think about where the randomness enters, and think of that as a random number generator that you can't control and just take it out. But we have to think about how to do that as we plug it in; this is a complex software engineering question that I don't really know how to answer. That said, we looked at it, and what we saw is that if you just tune on three random seeds and then take that policy and apply it to other random seeds, you get huge variance. On all of these different problems, all of the different MuJoCo instances in the OpenAI Gym, even random search has incredibly high variance, meaning that tuning the parameters for a few random seeds does not necessarily mean it's going to work on another random seed. And so here, if you have, you know, 100 random seeds for the Walker, the median performance looks much different if you look at 100 random seeds than if you look at ten random seeds; it's completely different. And so this is a really tricky thing.
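A minimal sketch of the evaluation protocol this argues for: tune on a handful of seeds, then report the spread over many held-out seeds rather than a single number. The `train_policy` and `evaluate` functions are hypothetical placeholders for whatever learner and environment you happen to be studying.

```python
import numpy as np

def seed_robustness_report(train_policy, evaluate, tune_seeds=(0, 1, 2), n_eval_seeds=100):
    """Train on a few seeds, pick the best policy, then measure it across many unseen seeds."""
    policies = [train_policy(seed=s) for s in tune_seeds]
    best = max(policies,
               key=lambda pi: np.mean([evaluate(pi, seed=s) for s in tune_seeds]))

    scores = np.array([evaluate(best, seed=s) for s in range(1000, 1000 + n_eval_seeds)])
    return {
        "median": float(np.median(scores)),
        "interquartile_range": (float(np.percentile(scores, 25)),
                                float(np.percentile(scores, 75))),
        "worst": float(scores.min()),
    }
```

Reporting the median and spread over held-out seeds, rather than the best tuned run, is exactly the "100 seeds versus 10 seeds" distinction made above.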
maybe this is another reason to move away from pure model-based methods and
away from pure model-based methods and start to understand a little bit more
start to understand a little bit more about what's under the hood
about what's under the hood and so let me propose a way forward
and so let me propose a way forward again away from the pure model free and
again away from the pure model free and what I'm gonna say the way forward might
what I'm gonna say the way forward might again be to use models maybe and again
again be to use models maybe and again let's go back let's go back I've said
let's go back let's go back I've said this word several times let me define
this word several times let me define model predictive control model
model predictive control model predictive control is really very much
predictive control is really very much like q-learning
like q-learning they're kind of the same thing what
they're kind of the same thing what mantra predictive control does is it
mantra predictive control does is it estimates it it's solves for a cue
estimates it it's solves for a cue function over a time horizon H and often
function over a time horizon H and often times rather than learning Q it just
times rather than learning Q it just specifies some Q at the terminal cost
specifies some Q at the terminal cost now what's interesting about that is the
now what's interesting about that is the longer time horizon just think about it
longer time horizon just think about it the longer the time horizon if all I
the longer the time horizon if all I really care about is 10 steps and I was
really care about is 10 steps and I was time horizon for 20 steps that Q doesn't
time horizon for 20 steps that Q doesn't matter too much and if my model is
matter too much and if my model is pretty good that you can actually our
pretty good that you can actually our simulations suggest that you depend with
simulations suggest that you depend with an exponential decay on an accurate q
an exponential decay on an accurate q function so the Q function matter is
function so the Q function matter is less and less and in fact you can
less and less and in fact you can oftentimes just drop it and treat it as
oftentimes just drop it and treat it as zero at the end if you make the horizon
zero at the end if you make the horizon longer and so what you could do is you
longer and so what you could do is you could plan on this long horizon have
could plan on this long horizon have some kind of model for what the terminal
some kind of model for what the terminal cost looks like take one step and then
cost looks like take one step and then replant and even if you ignore the
replant and even if you ignore the dynamic programming this is an
dynamic programming this is an incredibly sensitive thing to do you
incredibly sensitive thing to do you plan for some time horizon you take one
plan for some time horizon you take one step and then you plan for a longer time
step and then you plan for a longer time horizon again and what that allows you
horizon again and what that allows you to do is at every step you're getting
to do is at every step you're getting feedback and because you're getting
feedback and because you're getting feedback you can correct for errors both
feedback you can correct for errors both that come from the environment and that
that come from the environment and that come from the model and so if you plan
come from the model and so if you plan on short time horizon this feedback can
on short time horizon this feedback can allow you to correct not only
allow you to correct not only disturbances but errors in your model
disturbances but errors in your model and I think would be really interesting
and I think would be really interesting to understand the difference between
to understand the difference between what you could do when you have a good
what you could do when you have a good model and what you could do when you
model and what you could do when you have an approximate model and just as an
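Here is a minimal sketch of the receding-horizon loop just described, for a linear model with quadratic costs and the terminal cost simply dropped to zero: a finite-horizon Riccati backward pass produces the plan, only the first input is applied, and then we replan from the measured state. The dynamics, horizon, and perturbation below are illustrative placeholders, not a system from the talk.

```python
import numpy as np

def mpc_step(A, B, Q, R, x, horizon=20, Q_terminal=None):
    """One receding-horizon step: plan `horizon` steps ahead, return only the first input."""
    n = A.shape[0]
    P = np.zeros((n, n)) if Q_terminal is None else Q_terminal  # terminal cost, often just zero
    K_first = None
    for _ in range(horizon):                                    # backward Riccati recursion
        K_first = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K_first)
    return -K_first @ x                                         # apply only the first action

# Hypothetical closed loop: plan with the model, step the *true* system, then replan.
A_model = np.array([[1.0, 0.1], [0.0, 1.0]])   # the (possibly approximate) model
B_model = np.array([[0.0], [0.1]])
A_true = A_model + 0.01                        # reality differs a little from the model
B_true = B_model
Q, R = np.eye(2), 0.1 * np.eye(1)

x = np.array([1.0, 0.0])
for t in range(50):
    u = mpc_step(A_model, B_model, Q, R, x)
    x = A_true @ x + B_true @ u                # feedback: the next plan starts from the real state
```

The point of the argument above is the last line: because every plan starts from the measured state, model error gets partially corrected at each step instead of compounding over the whole trajectory.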
And just as an example, here's an example of what walking looks like from the Todorov lab, who designed the MuJoCo simulator. This is the same humanoid; they did this in 2012, so this is well before we got excited about reinforcement learning, with a very simple cost function, and they even show it's robust to misspecification of the model. Using MPC, they're able to get the robot to, well, at least lurch. And Emo has videos showing that you can keep making this better if you just improve your cost function: if you iteratively refine your cost function, you can get better walking behavior. This one is from 2013; there's the robot doing some complex tasks. I think there's an external perturbation: he gets pulled, which is not modeled, and the thing is able to stabilize itself. What's cool about this one is that it could actually be executed in real time in 2013 with no GPUs. So in some sense, for these kinds of things, we have taken many steps backwards. Trying to figure out just what are the right ways to pose these problems, really understanding what's unknown here and what's hard to plan, and what the right problems are: those, I think, are actually things that we should be thinking about in machine learning.
so what I'd like to close with is this beautiful example by my colleagues in
beautiful example by my colleagues in mechanical engineering where they
mechanical engineering where they actually do learning but they learn a
actually do learning but they learn a cue function but they learn cue function
cue function but they learn cue function in the context of MPC and essentially
in the context of MPC and essentially what happens here this is a car that
what happens here this is a car that drives around a track and I also link to
drives around a track and I also link to the YouTube video you really got to
the YouTube video you really got to watch it and what the way that they
watch it and what the way that they learn the terminal cost function is they
learn the terminal cost function is they start with a controller that just does
start with a controller that just does Lane following PID from that that gives
Lane following PID from that that gives me values for lots of different states
me values for lots of different states that actually tells me roughly how much
that actually tells me roughly how much is the cost to go from a particular
is the cost to go from a particular state to the end and so what I can make
state to the end and so what I can make the cue function be is what value did I
the cue function be is what value did I see at this state with that action
see at this state with that action because I have a database and anywhere I
because I have a database and anywhere I didn't see data I make it infinity and
didn't see data I make it infinity and if you do that this allows you to
if you do that this allows you to explore you
explore you your model inside the time horizon but
your model inside the time horizon but get to a safe state at the end and this
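A minimal sketch of that terminal-cost idea: a table of previously visited states with their recorded cost to go, and infinity anywhere the table has no data. The discretization and the way cost to go is logged here are simplifications of mine, not the actual implementation from that work.

```python
import numpy as np

class LearnedTerminalCost:
    """Terminal cost for MPC built from logged trajectories of a safe baseline controller."""

    def __init__(self, resolution=0.1):
        self.resolution = resolution
        self.table = {}                      # discretized state -> best observed cost to go

    def _key(self, state):
        return tuple(np.round(np.asarray(state) / self.resolution).astype(int))

    def add_trajectory(self, states, stage_costs):
        """Record the cost to go for every state along one completed lap."""
        cost_to_go = np.cumsum(np.asarray(stage_costs)[::-1])[::-1]
        for s, c in zip(states, cost_to_go):
            k = self._key(s)
            self.table[k] = min(self.table.get(k, np.inf), c)

    def __call__(self, state):
        # Infinity outside the data: the planner must end its horizon in a state it has seen.
        return self.table.get(self._key(state), np.inf)
```

An MPC planner then uses this as the cost at the end of its horizon, which is what lets it push inside the horizon while still ending up in states it has already seen work.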
And this car, it's a remote-control car, is able to start performing better than a human controller after about ten laps, and then it starts zipping around. This is in a conference room in the mechanical engineering building, and it starts zipping around that room really fast after about 20 laps. So this is really cool: it is doing machine learning, it's basically doing Q-learning, but it's using a coarse model as part of the loop. And there are lots of things that are hard, that they have to push on, that are still worth doing, because the other thing that's interesting (and I honestly think this is where we need tons of machine learning) is that the faster that car drives, the more nonlinear the dynamics become: the tires start to slip, and now you have to learn. That's where learning really becomes important, because at that point you don't want to skid off the track. So there's a lot of interesting stuff to do as you push into higher and higher performance, which is where learning should come in. Cool.
So there are so many things left to do; I have an unbounded list, and this is what fit on a slide. I want to leave some time, so let me just at least highlight these theory questions. Are the results that we got for coarse-ID control of LQR optimal with respect to the problem instance parameters? Can we actually get tight upper and lower bounds for these problems? Can we get tight upper and lower bounds in what's called the adaptive setting? We've been looking (again, for people who work in reinforcement learning) at episodic reinforcement learning, where we're allowed to restart the problem. Can you work in the adaptive setting, where you just get one trial? There's actually a ton of work on this; if I had more time, I could go do a huge survey just about adaptive control, and trying to nail that down as well would be great. And then there are a variety of other things involving safe exploration and understanding how to incorporate nonlinearity; there's a seemingly endless number of things that are left to do. And since there are so many things left to do, I guess, kind of looping back to the beginning, it seems like reinforcement learning is this critical problem that we really have to start tackling, right? And, you know, I don't think that control theory itself is going to solve all the problems we want to solve with these tools. And so, it's a little funny, but I figure we just need a new name. I know it's tongue-in-cheek to call it actionable intelligence, but we need a new name because I think this has to be an inclusive effort. It can't just be deep reinforcement learning; it has to take into account things from controls engineering, and, what I think is also interesting, it has to take into account something even broader, because even the most mundane machine learning systems start to look like RL systems if you squint at them.
So consider recommendation. Recommendation was something I worked on decades ago; it seemed like a fun problem, a very innocuous problem. Who wouldn't want to just be recommended some movies to watch on the weekend, or the next heavy metal song that you really want to hear after a long day at ICML? But what we started to observe is that these recommendation systems, when put into feedback with people, start to make people do things that maybe aren't great; they start to bring out the worst in us. In particular, we know that the easiest way to maximize engagement is to make people angry. We know that YouTube produces all sorts of radicalizing behavior by just feeding people more and more stuff to keep them addicted to the site, and the reason that happens is that we keep retraining the model to maximize these metrics, which maybe aren't the right metrics to maximize. If we don't know how to build a cost function for walking, we definitely don't know how to build a cost function for whether people are getting valuable time out of products, so instead we just use time. And when we do that, basically, that's reinforcement learning. It feels like it's supervised learning, matrix completion, but if you keep retraining, you're doing machine learning with feedback, and now this is RL, or perhaps we're going to call it actionable intelligence. And so I think the future here is that all machine learning has actually become reinforcement learning since we started putting everything in production.
It interfaces with humans, it's going to be at the forefront of all of our infrastructure, and so it's imperative for all of us; we all have to start thinking about what's the right way to do this safely, what's the right way to do this reliably, and what's the right way to do this in a way that will be predictable without human intervention. And I think that's exciting. So I challenge everybody here: let's join in on this effort. It's certainly more fun than just thinking about least squares. The theory questions are wonderful, the outcome couldn't be more important, and I'm looking forward to collaborating with everybody on this. So thanks for your time.
we do have time for questions you have to go to the microphone I was told but
to go to the microphone I was told but while people go decide if they want to
while people go decide if they want to ask questions I do want to thank Sarah
ask questions I do want to thank Sarah irelia Horia Nick Max and Steven for
irelia Horia Nick Max and Steven for kind of working with me on the problems
kind of working with me on the problems that we had here and there's a huge list
that we had here and there's a huge list of acknowledgments if I had more time of
of acknowledgments if I had more time of other people I'd like to thank who
other people I'd like to thank who helped put this together please
helped put this together please how does the model predictive control
how does the model predictive control work on the quadrotor example we're
work on the quadrotor example we're showing the simple LT there simple lqr
showing the simple LT there simple lqr problem yeah it works great because you
problem yeah it works great because you could just truncate the time horizon so
could just truncate the time horizon so just just having it you don't even have
just just having it you don't even have to model the cue function to get that to
to model the cue function to get that to work Thanks
we can't take questions from next door thank you go ahead nice tutorial how
Nice tutorial. A general question: you mentioned in your talk that a lot of our work is difficult to reproduce, and we definitely see that as well. So I guess I have two questions. First, as a researcher, what should we do if someone has a fabulous result but I can't reproduce it? Second, as a community, what should we do?

Thank you, that's a great question, and I think it's one we definitely have to grapple with. Joelle, you know, actually sponsored a kind of reproducibility challenge, and it would be kind of crazy if we could scale that out; we have a lot of resources because of sponsors who have a lot of cloud computing. I think it's expensive to have the community replicate everything, but if we could try to think about ways to have all of these experimental results subject to some kind of replication, I think that would be really fantastic. I guess the one thing I want to point out that's interesting to me is that for standard, old machine learning, I actually feel like reproducibility has gotten much, much better than at any time I've ever experienced. Everybody puts things on GitHub, and more or less the GitHub code returns what you want; maybe it's not exactly what people told us it did, but for supervised learning, if an idea is good, it gets reproduced super quickly. The example I have in mind is the Transformer network for translation, which was reproduced before the code was even released. This is crazy, because this never used to happen: before the code was released, people had prototypes that seemed to work pretty well, and so that idea caught on. It could very well be that RL isn't at that stage yet, and we need some more mature thought about algorithms so that we can get to the point where someone can state an algorithm in a paper and we can all reproduce it at home. I feel like, in some sense, that's the bar we want to get to.
Go ahead.

That was a really amazing tutorial, thank you. I just wanted to ask... oh, okay, you can go first and make an announcement.

Oh hi, I'm Rosemary. I just wanted to say we do have the reproducibility workshop on the Saturday, so if anybody wants to come, feel free. Thanks.

Awesome, so we can continue that conversation there. Unfortunately I can't go, but next time. Go ahead.
go ahead I'm just gonna ask do you have any optimism about whether you're into
any optimism about whether you're into end bound on course idea would work for
end bound on course idea would work for mapping between simulated environments
mapping between simulated environments and real world implementation you asked
and real world implementation you asked that's a great question that's a great
that's a great question that's a great question I think this is one of these
question I think this is one of these hard things from machine learning where
hard things from machine learning where we're so used to not having to model
we're so used to not having to model anything and having great results and
anything and having great results and when the real world comes in we even
when the real world comes in we even know we do know and we've seen I mean
know we do know and we've seen I mean I'm always entertained by these things
I'm always entertained by these things that you can introduce adversarial
that you can introduce adversarial perturbations and break machine learning
perturbations and break machine learning systems this has been a very hot area of
systems this has been a very hot area of research turns out that they don't even
research turns out that they don't even have to be that adversarial if you show
have to be that adversarial if you show a video to an image detector it will get
a video to an image detector it will get mistakes every other frame so I guess
mistakes every other frame so I guess this is weird thing where's like I don't
this is weird thing where's like I don't know understanding how to transfer to
know understanding how to transfer to the physical world is very complicated I
the physical world is very complicated I don't think we've really solved it for
don't think we've really solved it for the vision problems however for for
the vision problems however for for robotics I mean for a lot of robotics it
robotics I mean for a lot of robotics it works right a lot of robotics if you
works right a lot of robotics if you model the physics well and you build a
model the physics well and you build a very specific simulator for the robot
very specific simulator for the robot you're building people have been able to
you're building people have been able to use simulators as part of the design
use simulators as part of the design pipeline it's definitely a mature
pipeline it's definitely a mature technology and it would be great to
technology and it would be great to really understand how to do that with
really understand how to do that with machine learning in the loop as well but
machine learning in the loop as well but I think that's I I don't I don't know
I think that's I I don't I don't know enough to yet comment about where
enough to yet comment about where what the possibility of what we need to
what the possibility of what we need to do to make that happen so I just had one
do to make that happen so I just had one question - you kind of had those C's
question - you kind of had those C's earlier and you weren't really you said
earlier and you weren't really you said that some of the theorists were really
that some of the theorists were really interested in actually the C not maybe
interested in actually the C not maybe the sample complexity but yeah
the sample complexity but yeah so there's this weird tension for doing
so there's this weird tension for doing this minimization where you're trying to
this minimization where you're trying to drive this state to zero but actually
drive this state to zero but actually also means that you have less excitation
also means that you have less excitation that's actually not less excitation yes
that's actually not less excitation yes the regressor the state so maybe you
the regressor the state so maybe you could give some comments on you know
could give some comments on you know that's one of the things an adaptive
that's one of the things an adaptive control we struggle with it's like if
control we struggle with it's like if you try to drive a system to be stable
you try to drive a system to be stable and safe you are stealing information
and safe you are stealing information for the ability to identify the system
for the ability to identify the system so you kind of alluded to that but
so you kind of alluded to that but that's also kind of like a major a major
that's also kind of like a major a major issue that's right
issue that's right that's that that's exactly right this is
that's that that's exactly right this is actually a real challenge in the
actually a real challenge in the adaptive setting and this is something
adaptive setting and this is something that it's always a little bit weird in
that it's always a little bit weird in the practice of reinforcement learning
the practice of reinforcement learning that you don't necessarily see the
that you don't necessarily see the difference between episodic and adaptive
difference between episodic and adaptive reinforcement learning in episodic you
reinforcement learning in episodic you get to constantly reset and constantly
get to constantly reset and constantly retry and then you get to do a variety
retry and then you get to do a variety of different control actions if you need
of different control actions if you need to have your system be safe and you
to have your system be safe and you really want to drive your system to be
really want to drive your system to be stable that's antithetical to exploring
stable that's antithetical to exploring to learn how to be better and in an
to learn how to be better and in an example I gave about cars it does seem
example I gave about cars it does seem like if you've never like what is the
like if you've never like what is the right algorithm to explore going faster
right algorithm to explore going faster if you don't really know what happens if
if you don't really know what happens if you push the accelerator down further it
you push the accelerator down further it so you have to assume something right
so you have to assume something right and I think this is what makes the
and I think this is what makes the adaptive control problem considerably
adaptive control problem considerably harder than the the non adaptive case so
harder than the the non adaptive case so it's something to be but this is why
it's something to be but this is why it's a great research problem and there
it's a great research problem and there are lots of folks working addressing
are lots of folks working addressing exactly bad and there's a lot of great
exactly bad and there's a lot of great work it actually just understanding
work it actually just understanding exploration not even from the
exploration not even from the theoretical perspective but
theoretical perspective but understanding how exploration and safety
understanding how exploration and safety interact that I've seen in particular
interact that I've seen in particular there was a paper by the groups of
there was a paper by the groups of boundary uh
boundary uh Angeles Oleg and andreas Kraus a-- which
Angeles Oleg and andreas Kraus a-- which actually tries to estimate exactly how
actually tries to estimate exactly how bad it would be to move away from things
bad it would be to move away from things that you are
that you are no we're safe which I thought was pretty
no we're safe which I thought was pretty interesting other questions
interesting other questions awesome I'm happy to stick around if you
awesome I'm happy to stick around if you want to ask things one on one we have
want to ask things one on one we have ten minutes thanks everybody for your
ten minutes thanks everybody for your time
time [Applause]