This tutorial explores the intersection of reinforcement learning (RL) and control theory, aiming to bridge the gap between theoretical RL advancements and practical applications in physical systems. It highlights the challenges and opportunities in applying RL to continuous-time, real-world control problems, emphasizing the need for robust, reliable, and safe learning systems.
all right everyone, if you can please get seated as quickly as possible, we're about to get started momentarily, in fact we're about to get started right now, I changed my mind.
So my name is Yisong Yue, I'll be chairing this tutorial session. It is my great pleasure to briefly introduce Ben Recht so he can get going. Ben is on the faculty of EECS at
Berkeley; he's done research on a range of topics, most recently at the intersection of learning and control, which is what he'll be talking about today.
Just one quick logistical note: if you have a question during the Q&A session of this tutorial, please ask it at the mics that have been set up for Q&A so that it will get recorded. And with that, let's welcome Ben Recht. Hey, hey, oh, this is on, great. Welcome to Stockholm
everybody. So where to begin? Today is kind of a culmination of, I guess, a six-month-long project trying to write up some thoughts about reinforcement learning, control, and machine learning. This has been a longer line of study in my group for the past few years, and so this is kind of the culmination tutorial, trying to figure out: what did we figure out from trying to read all the literature? I've got to preface it all by saying this will be very short and biased, so we're probably gonna fail to cite everybody's work, and I apologize; just bug me afterwards and we can add you to the citation list. It's a huge and growing field with a very long history, and what I want to do today is try to give you a quick tour through it from a very particular perspective, which is: if you had come up as someone who knew control theory, how would you actually go and jump into this reinforcement learning space? Now, why is that the perspective I take, other than the fact that it's my own background? Much of the success of reinforcement
learning, and it has been a tremendous success, has been in environments that are very controlled, like games, and we've defeated the hardest games that humans can play, notably Go. But as we then try to move from the situation of games to where we have to interact with the real world,
many new problems arise and notably the
problems that we really have to care
about are that these physical objects
are going to interact with people and we
already have these physical objects
based on machine learning systems
actually ending human lives and that's
scary this is something that machine
learning never had to grapple with in
its history until now. And so, now that our technology is at the forefront of
engineering and our technology is being
pushed into all these situations that
have to interact with people it means
that we have to step up and take new
responsibility over the outcomes of what
we do so in particular the kinds of
problems I want to look at today are
exactly those that interact with the
physical world and perhaps have this
kind of view that might come from a bit
more classical of a background so for me
what reinforcement learning is and this
is we can all debate the actual
definition but I think at its most broad
it's trying to understand how do you use
past data to enhance the future behavior
of a dynamical system and from that
perspective if you had been a controls
engineer you might say wait that's
control theory, right? So what is that distinction? In some sense that distinction mostly comes from which department you did your undergraduate degree in. If you were in a department which has an E in it, many of them, then reinforcement learning is a bit of a subset of control, and it's taught that way. On the other hand, from a computer science perspective, reinforcement learning is really the primal theory and control is this very particular special case. Taking our kind of prejudices aside, for
me control theory has always been
focused primarily on continuous spaces
and actually primarily on continuous
time I'm not going to talk about
continuous time today because computers
do compute things discretely, but we will be talking about continuous spaces, whereas
reinforcement learning starts with a
very discrete view of the world the
other thing that's primal in controls is models: models are things that we are supposed to derive from physics, and perhaps all we're gonna have to do is estimate a parameter or two; we never necessarily have to actually go in and learn a model. Loosely, in reinforcement learning, data is really the key, right?
everything in reinforcement learning is
how do you take data and just turn it
directly into actions and I guess the
other main difference between controls
and reinforcement learning right is that
controls gets published in very stuffy IEEE journals, and reinforcement
learning of course gets published in the
front page of the New York Times
sometimes all right so different in
their PR departments as well so today
what we want to do is try to unify these
two camps and we're gonna do that in a
funny way which is what we're gonna do
is we're gonna try to solve continuous
problems using the techniques from
reinforcement learning and we're gonna
see what insights that gives us, and we're gonna see what we can learn from both of these communities, and then I will pose an endless list of problems, of questions that I just don't know how to answer. So we're gonna take a baby step today, and the number of things I don't know is uncountably larger than the things I do. Okay, so as I mentioned
at the beginning the kinds of things I'm
interested in is trying to understand
what are the limits of learning systems
that interact with the physical
environment and I'm gonna kind of narrow
that down a little bit today to just
talk about how much do we have to
understand systems in order to do
control, and the foundations of this will be the following.
I'm a theorist at heart there will be
experiments that will appear in the talk
but at its foundation it's going to be
trying to merge the camps of view from
learning theory a view from control
theory and for me really the most
important thing here will be optimization, because at its core, optimization will kind of unite everything. Okay, so control theory: it's
really just the study of dynamical
systems that have inputs. You have a system; you can play with this input, which here is u, and it maybe will spit out some output y, but in particular it has some state x, and the x will evolve according to some update law. Okay, so the next state is a function of the current state and the current input, and today we're not going to bother with the output being anything but the state, so it's kind of the simplest setting. So x is the state, and it will always have dimension d (remind me if you see something else), and u is the input, and it will have dimension p; when these come up I'll try to remind us of what those dimensions are. So reinforcement learning kind of starts from the same place, right? It's the study of
discrete dynamical systems with inputs
and in this case the object that we use
to describe how the discrete system
evolves is an MDP a Markov decision
process so the Markov decision process
the probability of the next state is
simply a function of the current state
and the input and in some sense this is
what defines state state is that what I
need to predict the future and the main
difference between the Markov decision
process setting and the control setting
is that it's these are taking their
values in discrete space rather than in
a continuous space so there are D
discrete States P discrete inputs and
everything can be described now in terms
of tables rather than some kind of
continuous functions. Okay, so I'm almost not going to talk about MDPs at all after that introduction; this is just to make the connection. If you don't know much about MDPs, there are a variety of tutorials out there that I will recommend on the web page. But what's nice is that though this is usually where reinforcement learning starts, we're gonna start with this other model, the control one; we're gonna start with this model and see where we can go.
okay so let's do that okay so now
optimal control is just trying to solve
an optimization problem to make a
control system do something I have this
input u; I want to find the u that gives me some properties out of the x's. So here I've broken the input up into two: there's u, that's the thing I can control, and there's e, which is some signal I can't control; e is either an error or an exogenous input or whatever you would like to call it, it's something I can't control. And today I will always assume that e is stochastic, okay, it's always a random variable; you can make your life harder if you make it adversarial. Okay,
so now we have a state transition
function that's a function of the state
the input and the disturbance the cost
function is given incrementally over
time by some cost C T this is another
difference between reinforcement
learning control optimization especially
optimal control is always minimizing
costs and reinforcement learning always
maximizes rewards which is kind of a
glass half empty glass half-full kind of
perspective control theories are very
pessimistic people that's just kind of
the nature of the field so we're gonna
minimize costs. e_t is our noise, f_t is our
state transition function and here I'm
letting it change from time to time just
being as general as possible for most
simple settings people typically assume
that this is stationary so it's the same
F at every step okay so now here's a
really important thing
tau is the trajectory tau is essentially
everything I've seen from time zero up
till now
and the input to the system is just some
function of that trajectory okay so it's
now a function of the future it's just
function of all the data I've seen up to
now so the decision variable here is
this function pi which we call a policy
and a policy is simply something that
picks out an action based on everything
I've seen before okay so that's the
decision variable that's somewhat
complicated there are lots of different
ways to parameterize and write those
down, but that's the thing we're searching for: this policy. Oh, there it is; just hold that for one second: cost, transition, policy.
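For reference, the optimization problem being described here is, as I read the slide, roughly the following (a transcription in the talk's notation, not an exact copy of the slide):

```latex
\begin{aligned}
\underset{\pi}{\text{minimize}} \quad & \mathbb{E}_{e_t}\!\left[\sum_{t=0}^{T} c_t(x_t, u_t)\right] \\
\text{subject to} \quad & x_{t+1} = f_t(x_t, u_t, e_t),\\
& u_t = \pi_t(\tau_t), \qquad \tau_t = (x_0, u_0, \ldots, u_{t-1}, x_t).
\end{aligned}
```

Everything later in the talk, LQR included, is an instance of this template.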
let me do an example to motivate this
and it's very simple and that's trying
to go fly a quadrotor we want maybe say
want to get the quadrotor from point A
to point B and so we have to model this
and fortunately for us the models are
handed to us we don't necessarily have
to learn them and the simplest model
would be Newton's laws and Newton's laws
would just say that the position z is the old position plus the velocity, up to constants (we're just scaling everything so that's true); the velocity is going to
be the velocity plus the acceleration
that's how that updates just saying that
the derivative of velocity is
acceleration the derivative of position
is velocity and then F equals MA okay
and if I put those together that
actually gives me a state representation
and it's very nice it's linear in this
case, so the new state
is a linear function of the current
state and the input and now what should
we do for cost so for cost what we're
gonna do is this would be one if I'm not
on the blue dot and then zero if I am on
the blue dot and if I look at that it
seems that maybe you guys can solve this
but it seems like that might be hard to
optimize because it's very discrete so
what people do in controls is typically
come up with a surrogate cost so instead
of minimizing this discrete cost, I would
minimize the surrogate cost
like the sum of the squares and this is
just saying this will be larger than
zero if I'm not on the dot and will be
zero if I'm on the dot so it's capturing
something and this is very much like
what we do in machine learning we don't
minimize a zero one classification error
we minimize some kind of surrogate which
surrogate you use as a design question
and then the analysis goes in to proving
that that surrogate is the right thing
to do so it's the same in controls and
here we could even add maybe a quadratic
function of the input as well right and
so we can design a cost function and
then say well does this actual does this
do to solve the problem I initially
cared about okay
what's nice about what I just posed is
this as quadratic as linear constraints
you can write down the KKT conditions
and solve it with least squares or if
maybe you're a little lazy
you could call CVX or perhaps you could
run backpropagation on this as well
variety of different options that's
probably that might be slow it's also
interesting is that how this parameter R
here R is essentially some trade-off
between how much power I use and how
much and how quickly I go to zero and
here I'm just gonna graph two values of
R here's the solution paths for time and
position and again we're just trying to
get the position to 0 and the solution
path for time and the control action U
and what you see is that when R equals
10 it's clearly trying to minimize the
amount of control action that we apply
and it takes longer to get to 0, whereas for a smaller value of R, when R equals 1, now we go to 0 much faster but we exert a lot more force. And
what this is also showing is that there
isn't necessarily one cost function that
we want to optimize and controls which
makes it a lot more challenging than in
classification in classification and
regression usually there's one cost
function there's some error that we'd
like to minimize. In controls, when you're actually designing systems, there are lots of competing criteria at the same time: you may have a limited battery so you can't use too much force, and you might have a fixed time by which you really have to get to the origin, and you have to kind of account for all of these in the optimization design. So the cost function itself is not given; it's something that we choose as part of the design process, and it's something
that's typically iteratively refined
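As a concrete illustration of the "write down the quadratic program and call CVX" option from a few lines back, here is a minimal sketch, mine rather than the talk's, of the quadrotor example: a discretized double integrator with the quadratic surrogate cost and an R-weighted input penalty, solved with cvxpy. The time step, horizon, and initial condition are made-up illustrative values.

```python
import numpy as np
import cvxpy as cp

# Double integrator (position z, velocity v), discretized with step dt.
dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([[0.0],
              [dt]])

T = 50                      # horizon (illustrative)
x0 = np.array([1.0, 0.0])   # start one unit away from the target, at rest
R = 10.0                    # trade-off between control effort and convergence speed

X = cp.Variable((2, T + 1))
U = cp.Variable((1, T))

cost = 0
constraints = [X[:, 0] == x0]
for t in range(T):
    cost += cp.sum_squares(X[:, t + 1]) + R * cp.sum_squares(U[:, t])
    constraints.append(X[:, t + 1] == A @ X[:, t] + B @ U[:, t])

prob = cp.Problem(cp.Minimize(cost), constraints)
prob.solve()

print("optimal cost:", prob.value)
print("first action:", U.value[:, 0])
```

Sweeping R in this sketch reproduces the trade-off just described: large R gives gentle inputs and slow convergence to zero, small R gives aggressive inputs and fast convergence.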
okay so there's our quick example this
turns out to be a specific case of a
much more general problem which is
called the linear quadratic regulator
So the linear quadratic regulator is optimal control where my cost function is quadratic and my dynamical system is linear: there's linear dynamics, there's a quadratic cost, so it's linear quadratic, and we call
this lqr for short and I'm gonna refer
to this throughout the talk as a very
simple baseline that allows us to to
compare and contrast different
approaches to the general optimal
control problem okay so optimal control
back to the more general case: how would you solve it? I told you how to solve LQR; it turns out that the way you would solve LQR is sort of how
you would solve the general problem even
when F is nonlinear if everything is
known I told you the model I told you
the cost you can either do batch
optimization, which will probably be non-convex, but this kind of looks like a computation graph, so though I was joking about it, you can immediately write down a PyTorch graph and solve it this way, so just do backpropagation. You could also use a non-convex interior point solver; many of my colleagues who do this sort of thing will actually use software like IPOPT or SNOPT, and they will give good solutions. Or there's this general principle of dynamic programming, which you can't always do, but when you can, it's quite efficient:
it tends to be linear time in the time
horizon okay so if I knew F then I could
either run dynamic programming or batch optimization
optimization
if I don't know F then the question is
what do I do
Okay, so if I don't know F... oh man, fire code... if I don't know F, the question is what's the right thing to do, and actually this is really where we
start that was your quick introduction
to dynamical systems and control and now
we're back into machine learning
I don't know F I do know well I don't
care if you know C or not for this talk
but like I said the cost is usually
something you design so let's just
assume we know it but then the question
is what do I actually do if I don't know
the dynamics and so we're gonna try to
reinvent reinforcement learning to solve
that problem okay so let's do that so
let me motivate that when do we not know
the dynamics as soon as we move to
something slightly more complicated
there are a variety of systems where it's actually sometimes hard to get a full
dynamical model so let's motivate it
with what the reinforcement learning
people motivate their research with
which you know data center cooling I
love this problem I love this problem
mostly again because it shows that
control people my control theory friends
they have the worst PR departments in
the world because data center cooling is
you know the control theorist would call
it air conditioning and it's really hard
to get kind of like a flashy write-up in the New York Times if you just, you know, figure out how to solve an air conditioning problem with deep learning. So how would you
learn how to cool a data center or a
building a room like this so there are a
variety of different techniques one I
could build a finite element model of
the data center, so I have a full ordinary differential equation of everything, but that seems like overkill for a variety of reasons, especially because people reconfigure their data centers on a daily basis. I could do what the air conditioning people tell me to do, in which case you build a kind of much coarser model with heat sources and sinks, and this is actually what would typically be done in air conditioning; this would just be an ordinary differential equation model with temperatures and changes in heat. Or I could be really cavalier and say let's not do any modeling whatsoever and just try to find a function that maps
state to action okay so those are the
three possibilities. We could identify everything, and there are actually applications for that: in
particular this is a turbine and for
really high performance aircraft you
really do want to identify everything
because you're pushing up against the
limits of what's possible you can
identify something very coarse which
would be kind of what I was arguing is
probably what people actually do in
practice for air conditioning and this
kind of underlies what we call model
predictive control. They have people leaving? Is there no way they can stay? Okay, okay, okay, sorry. How far is it? Well, I don't know what I do at this point. Okay, all right. I apologize that everybody has to leave, I'm sorry, that's unfortunate. Right, so there went my one funny joke, unfortunately interrupted by the bouncer.
That will happen. Oh, Francis, it's on Facebook Live? Where's the link? Thank you. Okay, thank you Francis. So let's thank the general chair, Francis Bach, who has arranged for this to be on Facebook Live. Soon? Are you going to tweet it? It will be tweeted, it will be tweeted. Okay, good,
fine back to action we don't need we
don't need no stinking models that's
that's us right that's everybody who's
here that's the machine learning point
of view and right that's kind of I think
that's at the core what people do in
reinforcement learning but I want to
make an aside with this one because
while it seems crazy that this is what happens in reinforcement learning, it is also what we do in PID control. I don't know who has learned PID control? Some people? Anybody? Okay, good. What's PID? That's good, right. So what's PID control? PID control is kind of... it's this funny thing where 94, 95%... go ahead, you saw, we're good. All right, so we've got it worked out. So what is PID control? Probably 95% of the control systems in industrial production are PID controllers; now, that's a rough
surveys of thousands tens of thousands
of systems and found that what people
implement in practice is PID control and
PID control is incredibly simple
typically you have one signal that you would like to drive to zero, like you want to stay in a lane in a self-driving car and not crash into the divider on the 101 (I know it's a dark joke, but anyway). So what do you do to stay in the lane? You first assume that your vision system is perfect, okay; say the vision system is perfect, then what do we do? Then you have this error signal, which is the deviation from the center of the lane; that's some time series, the deviation from the center is going to be some error, and what I'm going to do is
come up with an input that's a
combination of the error the derivative
of the error and the integral of the
error so it's proportional integral
derivative three parameters and you
typically don't need a model to actually
set this up there are heuristics that
will that actually do very well in
practice that automatically tune these
parameters without a model and moreover
most people don't even use the
derivative term because it's very hard
to tune, it's very sensitive. So two parameters suffice for about 95% of control applications.
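To make the two-or-three-parameter point concrete, here is a bare-bones discrete-time PID loop; this is a generic sketch, with made-up gains and a stand-in scalar plant, and real implementations add refinements like anti-windup.

```python
def pid_controller(kp, ki, kd, dt):
    """Return a stateful PID update: error -> control input."""
    integral = 0.0
    prev_error = None

    def step(error):
        nonlocal integral, prev_error
        integral += error * dt
        derivative = 0.0 if prev_error is None else (error - prev_error) / dt
        prev_error = error
        return kp * error + ki * integral + kd * derivative

    return step

# Illustrative use: drive a scalar "lane deviation" toward zero.
controller = pid_controller(kp=1.0, ki=0.2, kd=0.0, dt=0.05)  # kd = 0 gives the common PI case
deviation = 0.5
for _ in range(200):
    u = controller(-deviation)   # error = reference (zero) minus measurement
    deviation += 0.05 * u        # stand-in for the true (unknown) lane dynamics
```

Setting kd to zero gives the two-parameter PI controller that, as noted above, covers the vast majority of industrial loops.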
Now, one question is: do we have to move beyond that? Maybe. But then, for this remaining 5% of really challenging things, where we'd like to add some learning, some control, something more sophisticated, how much of a jump do we have to make? Do you have to go from two parameters to twenty million parameters, or is there something in between? That's the question; that's what we would like to know. Okay, so with that motivated,
here's our learning to control problem.
I just wanted to put that back up to remind us where we are. So how might we actually... well, before we even talk about how to attack that, let's set this up as a bit more of a math problem. Okay, so let's actually define the parameters, and the parameters here are going to be the following: we can generate n trajectories of length T, okay, n trajectories of length T, and so the time horizon is T here. We can generate
n as arbitrarily many and then the
challenge is we want to build a policy
or a controller which have the smallest
error given that sampling budget and so
N and T here I'm just gonna weight equally; you don't have to actually do that: it may be more expensive to run lots of experiments, so that a long time horizon T is easier than multiple experiments, or it may be that multiple experiments are easier than a long time horizon. I'm just lumping them in as a product today, trying to make it as simple as possible,
and the question is what's the right
thing to do. That's my oracle; what's the right scheme to use? You can already see it's more complicated than what we typically do in optimization at ICML,
where we have just a function Oracle or
a derivative Oracle here we have
something a little bit more complicated
we could treat it just as giving us
function values because we can
accumulate the rewards but we have a lot
more structure under the hood and we
should try to take advantage of that
extra structure so our big question is
how many samples do we need to actually
solve that problem let's see what we can
do okay so to solve that problem I'm
going to first invoke my favorite design
principle which is the linearization
principle which basically says if you
have an algorithm and you think it
solves deep learning I want you to show
me what it does on linear regression
first; or, you know, if I had something that solves SAT, which is NP-complete, I want to see what happens on 2-SAT instances, just as a sanity check,
and this has basically been the last
four years of my research
and many of my colleagues' research, so I feel like this gives a lot of interesting insights: starting with linear models, seeing what the complicated method does on the simple case, and a lot of times you can actually back out good insights about what's happening in the more complicated nonlinear space by looking at the linear case first. Now, a lot of times what happens in the linear case doesn't inform the nonlinear case: it's not necessary that something that works on linear models works well on nonlinear models, but you'd like something that's going to work on nonlinear models to at least not be too bad on linear models. That's my axiom here. So I
already introduced this one for us the
simple problem the linear case I'm gonna
call lqr what's cool about lqr is it's
been studied forever it has a closed
form solution people understand the
stability people understand robustness
properties it's really just a very well
analyzed problem when we know A and B. And when we don't know A and B, well, now we can actually start to think about how this compares: we can compare ourselves at least to what would happen if we did know A and B. The hard part with optimal control, when this function is nonlinear and more general, is that even solving the control problem with perfect information might be hard. So here we're
kind of going to something where if I
have perfect information I can solve it
it's well studied I can do a lot of
different kinds of analyses and when I
move to this kind of case where we have
imperfect information hopefully we can
bound the gap okay so we're gonna use
we're gonna use lqr as a kind of our
foundation but we're gonna present
methods that work on general more
generally and so as far as I can tell
there are three general things that
might work and I'm going to lump names
that come from the MDP land onto these
three general things but there are
changes that have to be made when you
move from mdps to these continuous
models. Okay, so the first thing would be: fit a model to the data, and I'm going to call that model-based RL, even though model-based RL actually contains other things; for me, today, model-based RL just means do some experiments, fit a model, and use it somehow in the loop. Then there's model-free; model-free RL has basically two
sub branches and at least all the deep
RL talks I've seen break it down into
these two categories category one is
estimate the cost function itself, the cost function of the entire optimization, from data. So try to forget that there are constraints; since they are equality constraints, this is all actually just one big cost function, and perhaps I could use regression to tell me what that cost function is and estimate it from data, and you'll see why
dynamic programming becomes critical for
doing that and then the other one which
is the most cavalier is direct policy
search which just says let me just
directly try to optimize for the policy
without trying to fit anything else okay
so those are our three categories and
let's see where we can get. Here's model-based; I like model-based because this is the only one that fits on a slide, so it satisfies my simplicity criterion. Okay, so
the first thing we do is we collect
simulation data and based on simulation
data well we should have that the next X
is some function of the current X the
current input and the noise okay and
there will be some noise new that's
different than E right so this is kind
of our assumption roughly I put
squiggles here to beam I'm being
incredibly imprecise anytime there's
squiggles we're just going to take this
as close to true if it was actually true
what would you do if I wanted to find
the best Phi that satisfied the equation
I would solve a least squares problem
and maybe you could construct a more
complicated loss maybe you could
construct a more interesting loss but
that's probably what I would do as a first cut, which is fit the dynamics using supervised learning, which we already understand. Actually, I was just at COLT, and the nice thing about going to COLT is you realize we don't even understand supervised learning, so let me not even say that, but we understand it pretty
well we know a lot about supervised
learning so once you have a supervised
learning model, what you could do is solve an approximate problem: instead of
solving the problem that we wanted to
solve we can solve this approximation
and the approximation plugs in in some
capacity this model that you fit and
maybe models the noise that would be
accrued and so we have some
approximation now the goal is just how
far off is this approximation from what
we started with okay for people who've
done controls this is kind of the bread
and butter what the community has done
the first stage these first two lines
are called system identification for
controls we call it supervised learning
and then when you take the system the
system that's identified you can design
a control problem around the identified
system okay so that's kind of like what
probably is what most controls engineers
would have suggested that you do okay
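Here is a minimal sketch of that two-line recipe for the linear case: excite the system with random inputs, then fit A and B by least squares. The "true" system and noise level below are placeholders used only to generate example data; they are not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, T, n = 2, 1, 50, 20          # state dim, input dim, horizon, number of rollouts

# Placeholder "true" system used only to generate data for this example.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

X, X_next, U = [], [], []
for _ in range(n):
    x = np.zeros(d)
    for _ in range(T):
        u = rng.normal(size=p)                              # random excitation
        x_next = A_true @ x + B_true @ u + 0.01 * rng.normal(size=d)
        X.append(x); U.append(u); X_next.append(x_next)
        x = x_next

# Stack regressors [x_t, u_t] and solve the least squares problem Z @ [A B]^T ~ X_next.
Z = np.hstack([np.array(X), np.array(U)])                   # shape (n*T, d+p)
Theta, *_ = np.linalg.lstsq(Z, np.array(X_next), rcond=None)
A_hat, B_hat = Theta.T[:, :d], Theta.T[:, d:]
```

A natural next step is to design the controller (for example, LQR) as if A_hat and B_hat were the truth, which is the identify-then-control recipe just described.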
But let's be adventurous: we know that we've solved Go by other means. What else might we do? Okay, what else might we do? Approximate dynamic
how does approximate dynamic programming
work this is really the core of
classical reinforcement learning is
methods from approximate dynamic
programming things that you have heard
whether they be Q-learning or TD or policy iteration or value iteration... help me out, Gergely, what did I forget? Which one? DQN? DQN, yeah, that's a good one, DQN, thank you; they all fit into this box. DQN, that's a good one. So it's really classic, and it's not just classical RL: lots of contemporary
methods use this too I'm gonna just make
a quick disclaimer because I've
presented this a couple of times now if
you haven't seen this before I don't
think you're gonna understand it at the
end of my tutorial and in some sense
this is why RL has been tricky for a
long time it requires an investment of
time to understand the primitives and
it's really it's worth spending the time
but in the few slides I've prepared it
might not make sense, but it will be a taste of what happens. So here's my approach: what I'm gonna do is talk about dynamic programming and then say what it means to approximate the dynamic program. So let's rewrite our optimal control problem; again, assume everything is known for now (the machine learning will come in in a second). If we knew everything, we could define this value here to be
called the Q function I'm calling it Q 1
because it's from 1 to the final and it
basically says that if you start at
state X and your initial action is you
what's the minimal achievable
optimization value from those two
initial conditions and this is called a
q function or a value action function ok
So how do I compute the Q function? Well, if I started at the end, the
q function is easy it's just whatever
that final cost is that's clearly what
I'm going to accrue and then you could
get a recursive formula which just
follows from working from the back to
the front
and it has this form so that you add in
the new cost you have an update for the
next state and you have the Q function
from one time in the future so
oftentimes this is called cost to go ok
so this is the this is at time K what
the cost to go is from action U and
state X ok and you could derive this
again just by by dynamic programming
work from the back work your way forward
what's cool about these things all right
if I knew this Q function and I want to
solve this problem what do I do
well I know that this is the value taken
for any X and any u this will be the
value I get running the problem so what
I have to do is just minimize the Q
function with respect to U and that will
be the best action and that turns out to
be the optimal policy and this is why
people like these methods you get this
compact thing which is to minimize some
Q function subject to subject to the
state being XK now let's put my
optimization hat on for a second in
discrete land where I have tables that's
usually where we motivate Q learning in
that case the state-action pairs are just a table, and you fill in the value at every entry of the table, and if I want to minimize it I just look at the minimum value of a column. If I'm an optimizer, I note that this optimization problem is in no way necessarily easier than the one I started with; so even if I do the bookkeeping, it's not clear at all that I can solve the problem that I want to solve any more effectively than solving this original problem. So you have to kind of build that in: I have to be able to minimize this Q function well in order to have that. But
that's just like that's an aside in the
continuous case that's an extra
complication but in lqr magic happens
again people love lqr because magic
happens at every corner
you look so here's the cool magic here
let's imagine I decided that the final
cost is quadratic I'm just gonna assert
this is my optimization problem
everything's quadratic the final cost is
quadratic. Then when I do this calculation, which is minimizing Q with respect to the next action, the step into the future: well, Q is quadratic, f was linear, so this is a quadratic, and quadratics are closed under minimization, so I get a quadratic back. So I get some new matrix P_{t+1}, and this is my new state step into the future, and then some residual cost from the error. So, in order for this to work, the error has to be zero mean and uncorrelated with itself, so it just has to be white noise, independent from time to time (actually just uncorrelated), to have this formula work. Okay, so that's my formula,
and what I can actually write down
although it's absolutely horrendous: the formula for P. Now, where does this come from? It comes from minimizing a quadratic; whoever has minimized a quadratic before: this is a Schur complement, right, of some bigger matrix (I just didn't show you the bigger matrix which we did the Schur complement on). And the optimal action comes out from minimizing this function, and you could pretty much read off that this is the optimal action. It's very cool,
and this makes everything very simple: that means I could compute all this stuff offline. I mean, you can compute it offline to get these formulas; you'd have to derive them, and the derivations are not complicated (if I really wanted to torture my undergraduates I could assign it to them, but then they get mad when you do these kinds of things), but there really is nothing sophisticated here: it is just linear algebra and dynamic programming that gets us these formulas. And what's
particularly interesting is if you take
a limit as time goes to infinity and get
an incredibly long time horizon then you
get a really interesting property out so
let's go back to this one real quick one
thing that's cool is that the optimal
policy is a linear function of the state
so the best thing to do is multiply the
current state by a matrix for the lqr so
very simple policy turns out to be
optimal on the infinite time horizon the
optimal policy is to multiply by a fixed
matrix there's some matrix one fixed
when you compute in every time step all
you do is multiply by that matrix that
makes for a pretty simple controller and
the way you solve for it is you get
basically the you just take a limit of
the equations on the previous slide you
get this matrix which is KX and the way
that you get P is by solving this
equation which is nasty but it's doable
there is a cyber teen that will solve
this for you so life is good I don't
know if it's in tensorflow
but Google friends you know the D are
this thing is called a discreet
algebraic riccati equation we should we
should add that and the dynamic
programming here has a simple form because quadratics are a miracle, and this is also probably why we study it so much: quadratics happen to have all the properties you want to make this simple. What's also cool is that the solution is independent of the noise variance: your performance degrades the more noise you have, but your action doesn't change, so that's interesting. And finally, for finite time horizons, you could have just solved this with batch solvers,
and the last thing I'll point out and we
always overlook this in reinforcement
learning but let's just I want to point
this out the optimal policy here on a
fixed time horizon is not time invariant
so if you have a fixed time horizon the
optimal policy is not time invariant you
can search for the best time-invariant policy, and people do this a lot, but that is not the optimal policy, so it's always good to keep that in mind: the shorter your time horizon, the less optimal they are. So there are a lot of things you have to worry about as you kind of get to these terminal conditions.
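For concreteness, here is a sketch of those formulas in code: the finite-horizon backward Riccati recursion from the previous slides, plus the infinite-horizon gain obtained from the discrete algebraic Riccati equation via scipy. The A, B, Q, R below are placeholder matrices for illustration.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # placeholder dynamics
B = np.array([[0.0], [0.1]])
Q = np.eye(2)                             # state cost
R = np.array([[1.0]])                     # input cost

# Finite horizon: work backwards from a quadratic terminal cost.
T = 100
P = Q.copy()                              # asserting terminal P_T = Q
gains = []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal action is u_t = -K_t x_t
    P = Q + A.T @ P @ (A - B @ K)                        # Riccati recursion, one step back
    gains.append(K)
gains.reverse()                           # gains[t] is the time-varying gain for step t

# Infinite horizon: one fixed gain from the discrete algebraic Riccati equation (the DARE).
P_inf = solve_discrete_are(A, B, Q, R)
K_inf = np.linalg.solve(R + B.T @ P_inf @ B, B.T @ P_inf @ A)
```

Consistent with the point above, the time-varying gains agree with K_inf far from the terminal time and depart from it near the end of the horizon, and none of this depends on the noise variance.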
Okay, that's LQR, to teach everybody dynamic programming. Again, there's no machine learning yet; this is just classical controls. Where's machine learning? Here we go: machine learning is this stage. So this is the crazy idea, and I love it, and it uses one of my favorite things that's ever been invented in math, so it's really cool. We have this
recursive formula for the finite time
horizon on the infinite time horizon
well to make our lives easier we can
introduce something called the discount
factor which is some number that's less
than one that damps things off to
infinity, and this just makes my life easier: limits always exist once you introduce the discount factor. It's
saying way out in the future the cost is
exponentially small and so if I have a
cost that's being discounted and I have
a static stationary update now I have
these two bellman equations and you get
a fixed point operation which is that
the Q function is equal to some non
linear function of the Q function okay
So how do I solve something that has a nonlinear equality constraint like that? Well, I could say, roughly, that if I were to actually take that step out into the future, this equation should be true: the Q function at the current state and the current action should be the cost plus the min over actions of the Q function at the future state. And if I've observed x_k, and I've observed u_k, and I've observed x_{k+1}, I could just run
gradient descent and this is Q learning
Q learning is just saying I have this
fixed point equation I'm going to solve
that fixed point
using stochastic approximation in this
case; because it's not the gradient of anything, we have a fancier name, but the analysis is exactly what you would have done in the original Robbins-Monro paper, although it gets quite tricky for general Q functions. Okay, but it's cool: I got this simple update and I can just run this, and hopefully, if all my ducks are in a row, this will allow me to effectively learn a Q function just from data. Now,
again for the tabular case where all I
have are discrete states and discrete
actions then this is this is a very well
defined update when I have functions and
function approximation things get messy
I'm not gonna go through that today
because well I'll really be in pain but
again this is not to OS effort to adapt
this to that case as well okay alright
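For the tabular case just mentioned, the whole stochastic-approximation scheme fits in a few lines. This is a generic cost-minimizing Q-learning sketch; the environment step(x, u) -> (cost, next_state) is a stand-in black box, and the exploration scheme and step sizes are illustrative choices, not anything prescribed in the talk.

```python
import numpy as np

def q_learning(step, d, p, gamma=0.95, alpha=0.1, episodes=1000, T=100, seed=0):
    """Tabular Q-learning for a cost-minimization MDP with d states and p actions.

    `step(x, u)` is assumed to return (cost, next_state); it is the black-box
    environment, not something specified in the talk.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((d, p))
    for _ in range(episodes):
        x = rng.integers(d)
        for _ in range(T):
            # epsilon-greedy exploration around the current greedy (min-cost) action
            u = rng.integers(p) if rng.random() < 0.1 else int(np.argmin(Q[x]))
            cost, x_next = step(x, u)
            # stochastic approximation of the Bellman fixed point (costs, so a min)
            td_target = cost + gamma * Q[x_next].min()
            Q[x, u] += alpha * (td_target - Q[x, u])
            x = x_next
    return Q
```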
One last approach. So that was approximate dynamic programming; it's heavy: you do dynamic programming, you realize that in the infinite time horizon you have this nice fixed-point equation, and then I attacked that fixed-point equation using techniques from stochastic optimization, or in particular stochastic approximation, and that is, in theory land, pretty much what people
study when it comes to RL. Now, that was true until 1999, you know, because this caused pain and suffering in the 90s: all the analyses are hard, trying to get good guarantees was hard, and sometimes, when you introduce function approximation, when you move to the continuous space, you can't even prove these things converge to stationary points, can't prove they converge to anything. So in 1999 some folks published a paper at NIPS that proposed... actually, no, sorry: in 1992 somebody published something at NIPS that proposed an algorithm which I'm going to go through today, and then in 1999 it was analyzed and shown that under function approximation this works quite well, and
that's the idea of just being cavalier
and going directly at the policy okay so
I'm calling this, broadly, direct policy search. I'm gonna actually introduce it without any of this control business, and just introduce it as optimization, because you
can kind of get everything you want to
know about how to do policy search from
the idea of sampling to search ok so in
this case there's no dynamics let's take
everything out of it let's just try to
optimize: we want to minimize a function phi of a variable z, and I just want to
solve that minimization problem and now
let me do some sleight of hand to come
up with a crazy way to solve it so the
first thing I could do is instead of
minimizing over variables I can minimize
over probability distributions; there was a long time at NIPS where this is all we ever did, and if you're a variational person you love this stuff, right. It turns out these are completely equivalent, because I could always just put a delta function at the z I want; these are
completely equivalent things but
optimizing over probability
distributions is hard so what you
typically do is make an approximation
and optimize over parametric probability
distributions ok
so what I want to do is minimize the
expected value of this function subject
to a parametric form of the probability
distribution and I'm just gonna optimize
the parameters to get the smallest cost
and then I could sample from that
distribution and get Z's that should
have low cost at least that's the idea
ok that's that's roughly the idea and
what's cool and I think what people love
this is that Williams showed that you
can get a really cheap stochastic
gradient formula for this second problem
and it goes like this let's call that
function J of theta because Z is no
longer in the picture that now is just J
of theta because we integrated Z away
the gradient of J with respect to theta
is the expected value under this
distribution of the function value times
the gradient of the log of the
probability. That's actually not that complicated a formula, and we could derive it very quickly, but the key thing is: if I were to sample z from this distribution parametrized by theta, this thing inside the expected value I could just compute. If you know the probability distribution, I could hopefully compute the gradient of the log of the probability distribution,
and then you can multiply by the
function value. And so the algorithm... oh, okay, I'm gonna do this backwards: let me derive this formula first and then show how to run stochastic gradient. The derivation is super cute, which is another reason why I think people love this, and it's kind of the same trick that you use when you prove the Cramer-Rao bound, which is that if I
take the gradient of this function: the gradient is not going to do anything to z, it's only going to do something to the probability distribution, and then I could just multiply and divide by the probability distribution (anybody who's seen the proof of the Cramer-Rao bound has seen that trick), and that now is the gradient of the logarithm, and that's it. So this is now the integral of this function against the probability distribution, also known as the expected value of that function. So this gives
us an algorithm this gives us an
algorithm which is to sample Z plug in
this formula and treat that as a
stochastic gradient so that's it sample
Z compute this gradient approximation
this is an unbiased estimate of the
gradient of J and as I could use
stochastic gradient to solve this
problem, and this is called the REINFORCE algorithm, usually with all caps; I don't like the all caps, I think it was an acronym, but just take the name.
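Here is the algorithm just derived, written out for a Gaussian sampling distribution with mean theta; a minimal sketch assuming a fixed variance and a plain constant step size.

```python
import numpy as np

def reinforce(phi, dim, iters=2000, sigma=0.3, lr=0.01, seed=0):
    """Minimize E_{z ~ N(theta, sigma^2 I)}[phi(z)] with the score-function estimator."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(iters):
        z = theta + sigma * rng.normal(size=dim)     # sample z from p(.; theta)
        grad_log_p = (z - theta) / sigma**2          # gradient of log N(z; theta, sigma^2 I)
        g = phi(z) * grad_log_p                      # unbiased estimate of grad_theta J(theta)
        theta -= lr * g                              # stochastic gradient step
    return theta

# Illustrative use: phi only ever returns function values, never derivatives.
theta_hat = reinforce(lambda z: float(np.sum(z**2)), dim=5)
```

Note that no derivatives of phi appear anywhere, which is exactly the point made below about these being derivative-free methods; it is also why the variance of this estimator, discussed a bit later, matters so much in practice.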
okay so this sounds amazing
let's use this to solve something here
we go. So let's say I have an arbitrary function on the hypercube, minus 1 to 1, so it takes values in bits, and it's an arbitrary function. So I need a distribution to sample from the hypercube: how about just a Bernoulli distribution, every coordinate independent with different parameters? That gives me a parametric distribution like this, and so here's my algorithm for minimizing functions over the hypercube: I will evaluate the function and move that way. I'm using this example to point out something that you always have to keep in mind, even though it's not obvious, which is that I don't think this solves any problem; I
don't know what problem that solves
my guess is nothing or nothing
interesting I'd be curious to see which
class of functions this actually would
minimize effectively, because it's not using any information whatsoever about phi, and hence it kind of lumps simple functions like linear ones in with arbitrarily complicated functions like random ones; they're treated the same. And so maybe the
dynamics can make sense of that but
that's something you have to prove and
it's not at all clear that this is
actually going to be something sensible
to do okay I just want you to keep that
in mind as we now apply it to control oh
this one seems absurd but then we just
go and do it to control and it feels
like me you know is it actually working
why is this absurd well look we have to
take an expected value with respect to
this small class of probability
distributions it's not all probability
distributions it's a restricted class
and that might actually hamper what we
can do so by restricting to that class
we actually understand what that
restriction is costing us okay but let's
let's let's plow ahead let's plow ahead
and I'm gonna plow ahead and introduce
this is not what people typically use in
our L but it is what a lot of people
have been using in robotics and it's
called parameter perturbation so this is
the if I have this cost function this
would be so what are we gonna do we're
going to add a rent we have a policy the
policy is going to be a parametric
function has some parameter vector theta
what we're gonna do is just add a little
bit of noise to that parameter vector
and then estimate the value estimate how
good that perturbation is okay so that's
our random sampling we could integrate
over this, and it turns out that what this algorithm does is essentially equivalent to the following approach: if I write the cumulative cost as the sum, and let my perturbation be Gaussian, then what I would do is add a little bit of Gaussian to theta one way, add a little bit of Gaussian to theta the other way, and then it turns out that the log of the probability distribution is gonna give me an omega_i out here, so what this ends up being is a finite-difference approximation to the gradient in the span of these random omegas. So in this case this algorithm gives me something that maybe looks
somewhat simple accessible right I mean
what I've done is I have a function
I can't compute the derivatives of the
function but I can approximate them
using finite differences, and we're approximating the finite differences in, you know, some random subspace. This particular instantiation is never called REINFORCE; it was initially called random search, by Rastrigin in
the 60s it was discovered in
evolutionary algorithms sometime in the
70s, where it's usually called a (mu, lambda) evolution strategy (I can't remember exactly what mu and lambda correspond to here parameter-wise, but probably sigma and m); it was also invented by our friends in stochastic approximation, by Spall, and it
was also invented in the learning theory
community it's called bandit convex
optimization. It's usually my feeling that if you have an algorithm that was developed independently by four different communities, they're probably onto something, right?
That's just signal. Okay, so we're gonna come back to this one; I call it random search. I know a lot of people hate that, but I'm just gonna stick with what the Russians called it, because they get mad; don't cross the Russians, bad things happen.
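Spelled out, the parameter-perturbation / random-search update described above looks roughly like this two-point finite-difference scheme along random directions; J(theta) stands for whatever rollout-cost oracle you have, and the step size, smoothing radius, and number of directions are illustrative.

```python
import numpy as np

def random_search(J, dim, iters=500, sigma=0.1, lr=0.01, n_dirs=8, seed=0):
    """Two-point finite differences along random directions (Rastrigin-style random search)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(iters):
        g = np.zeros(dim)
        for _ in range(n_dirs):
            omega = rng.normal(size=dim)                          # random search direction
            delta = J(theta + sigma * omega) - J(theta - sigma * omega)
            g += delta / (2 * sigma) * omega                      # finite-difference estimate along omega
        theta -= lr * g / n_dirs                                  # average over directions, then step
    return theta
```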
Okay, so policy gradient is different. REINFORCE is usually used for this algorithm called policy gradient, and what policy gradient does is it actually just replaces a
deterministic policy with a randomized
policy but this is no different than
what we were doing before it's just
saying that the input u, instead of being a deterministic function, is now a
randomized function and in continuous
spaces it's worth noting I probably
should have written this on the slide
but in continuous spaces what typically
you do is you take a deterministic
policy and add noise to it okay so
that's not perturbing the parameters of
the policy is perturbing the outputs of
the policy and that's the main
difference between policy gradients
and random search and it's actually hard
to write down exactly what this formula
looks like but we can do it for lqr and
I think it's instructive: you could say, let's say I wanted to solve a finite-time-horizon LQR; I could assume that it has a stationary policy, and then I can perturb
the stationary policy by noise so at
every time step I add noise to the
action and then I could compute the cost
and this is the update so the update
uses the state it uses those noise
directions and it just evaluates the
cost. But note that it has no gradient information about the cost, and it has no gradient information about the matrices; in fact, it's completely independent of the dynamics.
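As a sketch of what that estimator looks like for LQR with a static gain and additive Gaussian action noise, here is one REINFORCE-style update; the matrices, initial state distribution, and step size are placeholders, and this is the plain estimator without the baselines and other variance-reduction tricks that practical implementations rely on.

```python
import numpy as np

def lqr_policy_gradient_step(K, A, B, Q, R, T=50, sigma=0.1, lr=1e-4, rng=None):
    """One REINFORCE update for the randomized policy u_t = -K x_t + sigma * w_t."""
    rng = rng or np.random.default_rng()
    d, p = A.shape[0], B.shape[1]
    x = rng.normal(size=d)                     # placeholder initial state distribution
    cost = 0.0
    score = np.zeros((p, d))
    for _ in range(T):
        w = rng.normal(size=p)                 # the injected action noise
        u = -K @ x + sigma * w
        cost += x @ Q @ x + u @ R @ u          # only the scalar cost is ever observed
        score += -np.outer(w, x) / sigma       # grad wrt K of log N(u_t; -K x_t, sigma^2 I)
        x = A @ x + B @ u                      # dynamics are used to simulate, never differentiated
    grad_hat = cost * score                    # score-function estimate of grad_K E[cost]
    return K - lr * grad_hat                   # one stochastic gradient (descent) step
```

Note that A and B appear only inside the simulated rollout; the update itself uses just the states, the noise directions, and the accumulated cost, matching the description above.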
so this is one of these things I always
find frustrating it is certainly a
gradient method the reinforce algorithm
is a stochastic gradient method, but for the functions we care about, which is minimizing the LQR cost, we only get function estimates; so it's a gradient algorithm on a function that we might not necessarily care about. Again, it's worth emphasizing that this is
essentially a derivative free algorithm
because we never compute derivatives of
any of these quantities that we care
about and particularly then it suffers
from a lot of the complexity results
that go along with zeroth order
optimization so again there are two
different categories here that we're
covering: there's policy gradient and there's random search. REINFORCE applied to
either of these problems does not depend
on the dynamics and I think that's
really cool and that's why people love
them not only does it not depend on the
dynamics it doesn't depend on you
knowing how to compute the cost function
as long as there's some Oracle handing
you the cost function so if you have
some kind of simulator or video game as
long as the score is returned you can
apply the algorithm and I think this is
also why they're exciting: I could just go. This is actually kind of what I feel really differentiates computer science from the other engineering schools on campus: at the end of your first lecture you're doing stuff, so why not just do this too, right? They're just much more fearless. Now the
question is does it work how does it
work can we analyze it how does it
compare to other techniques; that's the next step. And I would like to emphasize that they're both derivative-free algorithms, and I'm gonna keep saying that. Okay, all right, so there are some caveats to this stuff: REINFORCE is not a
magic algorithm you have to be careful
both about the approximation error and
another thing I didn't mention the
variance when I run stochastic gradient
descent you care about the variance of
that gradient and if the variance is
really large you will never converge so
those are two things you have to take into account: you want a small-variance estimator of the gradient, you want the approximation error to the true function to be small, and you have to be able to sample; it's a lot of burden on
you which I might call modeling honestly
so even though it's model free there's a
lot of modeling that kind of goes in
under the hood. And again, I'm gonna say this again: it is necessarily derivative-free. Okay, but it's easy, and that's what
makes it exciting and that's what lets
us plow ahead great that's great so
actually this is a perfect time let's
take a 10-minute break we'll reconvene I
guess an ad should I be actually could
be really precise because I have the
clock here but that's probably dumb you
guys don't have the clock so let's
reconvene at five to five faziz so the
we'll be live-streaming a seven so when
you come back from the break if it's
full going a seven exact same vein but
on video awesome
a seven except thank you again Francie's
alright and so yeah so let's come come
back at actually I would say let 10 till
let's do 10 till that would be good 454
let's get restarted hi yes no I got it we're good
we're good that's right so he sounds got it
that's right so he sounds got it just FYI there is an overflow room with
just FYI there is an overflow room with a live string in a seven and just down
a live string in a seven and just down the hall that's right and if you done
the hall that's right and if you done for tchen Utley if you can't find a seat
for tchen Utley if you can't find a seat you you have to go to the overflow room
you you have to go to the overflow room or else that guy's gonna come back I
or else that guy's gonna come back I don't mean cool all right let's do this
don't mean cool all right let's do this where we're we we left off at
where we're we we left off at everybody's favorite parks
everybody's favorite parks well no actually I don't think anybody
well no actually I don't think anybody has this favorite part we left off is
has this favorite part we left off is when we move into learning theory okay
when we move into learning theory okay so I want to do just a little bit of
Okay, so I want to do just a little bit of learning theory — not too much today — but this is actually something that we grapple with throughout RL, and in some sense half of the papers in reinforcement learning that I see discuss a notion of what's called sample complexity. Sample complexity means a very particular thing in learning theory; it means a slightly more nebulous thing in reinforcement learning, and I kind of just wanted to touch on that here. Again, this is the only other slide that has the MDP formulation, even though we're talking about the tabular case on and off. The idea here is that I have a discrete variable X — it's not a vector, X just takes the values 1 to d — u takes values 1 to p, and I have an MDP, and the question is: what's the relative complexity of these methods?

Something you should take away from the first half of the talk is that in model-based methods — at least from my narrow view of model-based — what we would do is take data and then solve essentially a system of equations. Right? We have measurements, each of these gives us equations, and then we have unknowns, which could be the model, could be the Q function, or could be the policy. And so we can now do parameter counting to get some rough rule of thumb as to which one may or may not be more efficient. So this is one way to maybe try to decide which of these methods is best.

So we have three algorithm classes. In the MDP case, for model-based we get the value of the state, and we're just going to use the value of the state to predict what the probability of the transition is; for approximate dynamic programming we get the value; and for policy search we also get a value. Each of these gives us one equation for every time step. Now, how many parameters do we need to identify in the most general case? Well, the number of numbers required to give me the state transitions is just d squared times p — that's just counting, because the transition probability has three arguments. The discrete world is nice: it's very easy to count things. For approximate dynamic programming we need a Q function; the Q function has a value for each state and action, so it has d times p parameters. And policy search — what does that do? Well, for every state it gives you an action, and so that would also have d times p parameters. This is not parametric; this is actually just trying to write down what those mappings might look like.

And so, the optimal error — if I were to write it down and guess what would be optimal, and again this is very naive, this is just chicken scratch on a piece of paper — roughly, these are the sample complexities. Sample complexities have parameters in the numerator and the number of samples in the denominator, and they suggest that ADP and policy search are probably more efficient than a very broad fit-a-model approach to model-based RL. That's rough. I mean, it looks like that should be the case: assuming we know nothing about the model transitions, it looks like that would probably be the hardest, and then ADP and policy search should be easier. But there is no algorithm that gets these complexities currently out there, and if you're into this kind of thing, there are all these papers by people who try to get these numbers as small as possible.
small as possible now let's now go to the continuous world
now let's now go to the continuous world and the continuous world we can do the
and the continuous world we can do the exact same thing and the story
exact same thing and the story completely changes it's a very different
completely changes it's a very different kind of setting when the continuous
kind of setting when the continuous world we get more samples per iteration
world we get more samples per iteration when we have a model let's think about
when we have a model let's think about the least squares problem that we solve
the least squares problem that we solve right we have this we observe the state
right we have this we observe the state there are D elements in this state and I
there are D elements in this state and I get a mapping from a model to the state
get a mapping from a model to the state so there are D equations per time step
so there are D equations per time step whereas approximate dynamic programming
whereas approximate dynamic programming and policy search are still only using
and policy search are still only using values moreover once we move to the
values moreover once we move to the continuous case we actually worked out
continuous case we actually worked out what these things were the Q function
what these things were the Q function the approximate dynamic programming has
the approximate dynamic programming has D plus P over two parameters versus our
D plus P over two parameters versus our D plus P choose two parameters that's a
D plus P choose two parameters that's a quadratic this D plus P squared the
quadratic this D plus P squared the policy has D times P parameters and the
policy has D times P parameters and the model based has d squared plus DP in in
model based has d squared plus DP in in for lqr right for l q RS d squared plus
for lqr right for l q RS d squared plus DP you have a square matrix for the
DP you have a square matrix for the state and you have a rectangular matrix
state and you have a rectangular matrix for the control and so if we put that
for the control and so if we put that together and we get our rough rule of
together and we get our rough rule of thumb it seems like the model-based
thumb it seems like the model-based approach might require about D plus P
approach might require about D plus P over T to the square root that should be
over T to the square root that should be roughly our error if if life was good
roughly our error if if life was good and everything worked out with parameter
and everything worked out with parameter counting
counting whereas approximate dynamic programming
whereas approximate dynamic programming my suffer a factor of dimension and
my suffer a factor of dimension and policy search might be somewhere into
policy search might be somewhere into depending on the size of the parameter
depending on the size of the parameter either about the same or maybe a little
either about the same or maybe a little worse right because the product should
worse right because the product should be bigger than the
be bigger than the some okay rough rule of thumb but it
some okay rough rule of thumb but it looks different
looks different and the question is in practice is it
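And the corresponding counts for the continuous, LQR-style case with d states and p inputs — my reconstruction, with the right-hand column giving the rough rule of thumb from the talk rather than proved rates:

```latex
% A model gives d equations per time step; value-based methods get one number.
\begin{align*}
\text{model } (A, B):                &\quad d^2 + d\,p \ \text{parameters}
      &&\Rightarrow\ \text{error} \sim \sqrt{(d+p)/T} \\
\text{quadratic } Q\text{-function}: &\quad \tbinom{d+p}{2} \approx (d+p)^2/2
      &&\Rightarrow\ \text{an extra factor of dimension} \\
\text{linear policy } u = Kx:        &\quad d\,p
      &&\Rightarrow\ \text{in between, since } d\,p \ge d + p
\end{align*}
```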
And the question is: in practice, does it actually end up looking different as well? Are these model-free methods making effective use of the information that's given to us? That's really the question we have to figure out. And of course I do want to point out, again, to everybody here, that there's a constant that is of course going to go out front, both in the MDP case and in the continuous case, and a lot of times — this one's for my theory friends in the room — the interesting stuff is actually happening in that constant, not in the rate. So it's just worth keeping in mind that maybe there's some other complexity that comes out front that's more important than parameters, and let's see if we can get at that as well.

And I guess we could also talk — here's my... I can't come to a machine learning conference without a deep learning slide, so this will be the only one. Deep reinforcement learning, at least from my perspective today, is just this: the only thing that would change, if you wanted to implement it, is that we would parameterize our Q function or our policy by a deep neural network, or maybe we make our model a deep neural network. Everything is going to be tricky to analyze in this case, but none of the algorithms will change, and again this is why the linear principle is going to apply, so that hopefully we can get some insights — because nothing changes moving to deep networks other than the function approximator. So anyway, that's for next year; maybe then we'll analyze deep networks. We're going to just stick with very shallow reinforcement learning for the rest of the day, and again the reason why is because the algorithms don't change.
because the algorithms don't change okay so let's go turn to my friend lqr and
so let's go turn to my friend lqr and start to look at sample complexity
start to look at sample complexity rather than trying to prove stuff let's
rather than trying to prove stuff let's run some experiments because I like
run some experiments because I like running experiments on these continuous
running experiments on these continuous problems and in particular let's run the
problems and in particular let's run the silly double integrator experiment in
silly double integrator experiment in controls we call this model a double
controls we call this model a double integrator and with lqr and so what's
integrator and with lqr and so what's this plot so policy gradient is in
this plot so policy gradient is in the error bars are over ten trials and
the error bars are over ten trials and it's kind of the men and the max over
it's kind of the men and the max over those ten trials of what value is
those ten trials of what value is returned after about thirty thousand
returned after about thirty thousand samples we get something that is close
samples we get something that is close to the optimal of the riccati equation
to the optimal of the riccati equation so a lot of variants to actually get
so a lot of variants to actually get this this was a fun interaction on
this this was a fun interaction on Twitter I couldn't actually get it to
Twitter I couldn't actually get it to work at all and then a colleague tweeted
work at all and then a colleague tweeted or actually I think was on reddit
or actually I think was on reddit somebody come up came up with a better
somebody come up came up with a better implementation my friend Pavel who did a
implementation my friend Pavel who did a great job here told me that we have to
great job here told me that we have to do is instead of running stochastic
do is instead of running stochastic gradient descent you have to run Adam so
gradient descent you have to run Adam so this actually requires not policy
this actually requires not policy gradient but Adam and I don't know why
gradient but Adam and I don't know why but that's fine so we added some extra
but that's fine so we added some extra stuff we're able to get it to converge
stuff we're able to get it to converge people like Adam that's cool and it does
people like Adam that's cool and it does things but so an interesting thing is
things but so an interesting thing is that just fitting this model not telling
that just fitting this model not telling you anything about the structure but
you anything about the structure but just fitting it as a black box but
just fitting it as a black box but linear model and then just running
linear model and then just running treating that as true and not doing
treating that as true and not doing anything just treating it as true is
anything just treating it as true is indistinguishable from this black line
indistinguishable from this black line like ten samples and you get the answer
like ten samples and you get the answer and similarly approximate dynamic
and similarly approximate dynamic programming I implemented what was
programming I implemented what was called least squares policy iteration
called least squares policy iteration which is by uh which is a nice paper by
which is by uh which is a nice paper by Lazarus and somebody obvious names I'm
Lazarus and somebody obvious names I'm forgetting but it's it's a very fairly
forgetting but it's it's a very fairly straightforward that also achieves this
straightforward that also achieves this line okay ten samples so there's this
line okay ten samples so there's this weird gap here where the policy gradient
weird gap here where the policy gradient as we said and I did the vanilla
as we said and I did the vanilla algorithm that I showed in the first
algorithm that I showed in the first half which was just you compute the cost
half which was just you compute the cost you multiply it by the noise times the
you multiply it by the noise times the state accumulated and it is kind of what
state accumulated and it is kind of what immediately falls out but I've never
immediately falls out but I've never been able to get it to do something
been able to get it to do something nicer even after tuning the parameters
nicer even after tuning the parameters of Adam of the sampling scheme and
of Adam of the sampling scheme and what-have-you and and this is weird
what-have-you and and this is weird right because policy gradient I think
right because policy gradient I think I'm told now as a strawman but I'm told
I'm told now as a strawman but I'm told that direct policy methods are supposed
that direct policy methods are supposed to work right I mean this is like
to work right I mean this is like apparently what open AI is going to use
apparently what open AI is going to use to defeat dota and it seems like if it
to defeat dota and it seems like if it can't solve a double integrator
can't solve a double integrator something weird is happening
something weird is happening and I everybody attributes this quote to
and I everybody attributes this quote to Carl Sagan that extraordinary claims
Carl Sagan that extraordinary claims require extraordinary evidence but it
require extraordinary evidence but it was borrowed by Lance Armstrong when he
was borrowed by Lance Armstrong when he was defending himself against
was defending himself against accusations of doping and so the
accusations of doping and so the important thing that even Carl Sagan and
important thing that even Carl Sagan and every good Beijing would say is that's
every good Beijing would say is that's true
true only if your prior is correct right if
only if your prior is correct right if you had the right you know you actually
you had the right you know you actually had the real prior the claim might not
had the real prior the claim might not seem quite so so extraordinary and so I
seem quite so so extraordinary and so I guess the thing I want to say about
guess the thing I want to say about these policy gradient methods is like if
these policy gradient methods is like if we look at what other but people who
we look at what other but people who implement them say there are red flags
implement them say there are red flags even in their own descriptions and in
even in their own descriptions and in particular that reinforcement learning
particular that reinforcement learning results are tricky to reproduce
results are tricky to reproduce performance is very noisy algorithms
performance is very noisy algorithms have many moving parts which allow for
have many moving parts which allow for subtle bugs and many papers don't report
subtle bugs and many papers don't report all the required tricks is this is a
all the required tricks is this is a quote from a blog by some of the folks
quote from a blog by some of the folks at open a eye on their DQ on baseline
at open a eye on their DQ on baseline they also say that reinforcement
they also say that reinforcement learning algorithms are challenging to
learning algorithms are challenging to implement correctly good results
implement correctly good results typically only come after fixing many
typically only come after fixing many seemingly trivial bugs I would like to
seemingly trivial bugs I would like to ask them what a bug is in this regard
ask them what a bug is in this regard and honestly I don't know who would
and honestly I don't know who would actually put that in a car and it's not
actually put that in a car and it's not just them there's actually been a lot of
just them there's actually been a lot of papers about this Joelle Pino gave a
papers about this Joelle Pino gave a great keynote talk at AI CLR I believe
great keynote talk at AI CLR I believe that video is still up about some work
that video is still up about some work that her group has been has been doing
that her group has been has been doing with her collaborators trying to get at
with her collaborators trying to get at kind of how sensitive these algorithms
kind of how sensitive these algorithms are to a variety of settings and in
are to a variety of settings and in particular there's this famous plot and
particular there's this famous plot and we're gonna go into this pot later which
we're gonna go into this pot later which shows that if you change the random seed
shows that if you change the random seed algorithms can seem to have incredibly
algorithms can seem to have incredibly different performance so five grand
different performance so five grand seeds look one way five other random
seeds look one way five other random seeds look some other way and you know
seeds look some other way and you know that leaves two weird stuff and I
that leaves two weird stuff and I actually think this this plot is the
actually think this this plot is the more scary one because this is three
more scary one because this is three different implementations of I think TRP
different implementations of I think TRP oh yeah of TRP oh by open AI and just
oh yeah of TRP oh by open AI and just three different git repos and in
three different git repos and in particular two of them the blue and the
particular two of them the blue and the orange
orange up it's the same repo just a different
up it's the same repo just a different release point so two different release
release point so two different release versions of the same repo have very
versions of the same repo have very different behaviors and obviously the
different behaviors and obviously the first ones gonna apply the second one
first ones gonna apply the second one because if you're sensitive to random
because if you're sensitive to random seed and you just change the order of
seed and you just change the order of operations and how you access the
operations and how you access the randomness you'll get different behavior
randomness you'll get different behavior and so that's a bit worrying so this
and so that's a bit worrying so this leads to something that looks like
leads to something that looks like reinforcement learning is incredibly
reinforcement learning is incredibly challenging to implement but now the
challenging to implement but now the question that I want to ask for the
question that I want to ask for the remainder of today is is does it have to
remainder of today is is does it have to be this way or is there some other way
be this way or is there some other way forward for these kinds of problems
forward for these kinds of problems we're allowed to use machine learning
we're allowed to use machine learning and in particular can we use models when
and in particular can we use models when we start doing problems which involved
we start doing problems which involved continuous control and does that work
continuous control and does that work better one thing I do want to point out
One thing I do want to point out before going to models, which I think is pretty interesting and worth highlighting, is that if you just run the random search algorithm — which, as was pointed out to me over the break, actually appears in Nemirovski and Yudin, so it's everywhere in the Russian literature; so we're going to stick with calling it random search — this algorithm, remember, perturbs the parameters rather than perturbing the actions. This actually achieves, in I guess 5,000 iterations, a pretty good policy. But still — 5,000 versus 10 is a huge gap. So even though it's a general-purpose algorithm, the thing you always want to ask yourself is: even if you have a general-purpose algorithm, how much are you paying for that generality? And if it involves a lot of variance and a lot of uncertainty, maybe it's not something you want to rely on.
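And for contrast, a sketch of the parameter-perturbation alternative: basic random search in the Nemirovski–Yudin / evolution-strategies style, using two-point estimates along random directions in the space of gains. Again, the horizon, perturbation scale, and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
horizon, nu, step, n_dirs = 20, 0.1, 1e-2, 4

def cost(K):
    """Deterministic finite-horizon LQR cost from a fixed initial condition."""
    x, total = np.array([1.0, 0.0]), 0.0
    for _ in range(horizon):
        u = K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return total

K = np.zeros((1, 2))
for _ in range(1000):
    grad = np.zeros_like(K)
    for _ in range(n_dirs):
        delta = rng.normal(size=K.shape)
        # Two-point (antithetic) estimate: perturb the *parameters*, not the actions.
        grad += (cost(K + nu * delta) - cost(K - nu * delta)) * delta
    g = grad / (2 * nu * n_dirs)
    g /= max(1.0, np.linalg.norm(g))     # keep the sketch numerically tame
    K -= step * g
print("random-search gain:", K)
```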
So, obviously, we're at five o'clock and I'm supposed to have another hour or so, and yeah, of course I have more content. So let me tell you some ideas about our friend model-based RL. There's this debate, even in the MDP case, about whether or not model-based RL is as good as these model-free methods under a variety of different settings, but for continuous control we already saw that the sample complexity conceivably is a little bit better, and moreover maybe we can even try to analyze a simple case and see if that's actually true. So the one thing that's hard here: that big supervised-learning box is easy, roughly well understood. The hard part is actually what control problem we solve, because just plugging in the truth — or plugging in an estimate and treating it as true — feels wrong. Right? Even in machine learning, this is kind of why we have all this stuff with confidence sets and a variety of things. So how can we take into account the fact that we know this model is not perfect — and maybe we can even estimate how imperfect the model is — can we actually take that into account in control? And the answer is yes. This was something where, if you went to a control talk in the 90s, you would have seen this slide in everybody's talk, and you probably still do if you go to one today. So here we have our system, which is evolving according to f; we're trying to design a controller, and what we do is estimate a model of f — alright, so f-hat is our estimate, it's not the true f. So now we have a system that's f-hat; we want to design a controller for the true f, but we only have f-hat. Well, the key idea is: what if I could estimate the uncertainty in f and treat that as another building block in my dynamical system, so that I could account for not just f-hat but every possible instantiation of my uncertainty about f? So I could use high-dimensional statistics to say not only how good this model is, but how far this model is from the truth — I can estimate how far my model is from the truth. So it's not just getting a point estimate, it's getting a confidence set. Once I have the confidence set, I can try to design a controller that works for every single function in the confidence set, and that is what we call robust control.
is what we call robust control okay so I'm calling this connected thing where
I'm calling this connected thing where we're coupling statistics with control
we're coupling statistics with control course id control and this is actually
course id control and this is actually something well that for whatever reason
something well that for whatever reason because the statistics wasn't there I
because the statistics wasn't there I didn't really see in the controls
didn't really see in the controls literature and the controls literature
literature and the controls literature they're much more into you take your
they're much more into you take your you you take the model that comes out
you you take the model that comes out and you treated a certainty equivalent
and you treated a certainty equivalent but you it's also possible controls
but you it's also possible controls people love putting in these extra
people love putting in these extra things about modeling errors to just add
things about modeling errors to just add those in and the other fancy present we
those in and the other fancy present we can estimate modeling errors and we can
can estimate modeling errors and we can estimate modeling errors of modeling
estimate modeling errors of modeling errors if we wanted to and we can have
errors if we wanted to and we can have Turtles all the way to Hough okay so let
Turtles all the way to Hough okay so let me describe how this might work and it's
me describe how this might work and it's actually kind of cute and now it's gonna
actually kind of cute and now it's gonna get a little bit more mathematical then
get a little bit more mathematical then what we've had so far but it's kind of
what we've had so far but it's kind of it it's again hopefully relatively
it it's again hopefully relatively simple for how this might work so what
simple for how this might work so what I'm gonna do my analysis of the one step
I'm gonna do my analysis of the one step of lqr just to show how this works and
of lqr just to show how this works and I'm kind of walk us through all the
I'm kind of walk us through all the different steps and all the different
different steps and all the different machinery that we'd need okay
machinery that we'd need okay so we're gonna say B is unknown and I
Okay. So we're going to say B is unknown, I have this cost function which is quadratic, and I have this equation which is linear — and B is unknown. So what can I do? Well, I can run experiments and collect some data. I'm going to have x_i equal to B times u_i, plus x_0, plus noise, and maybe the u_i would be Gaussian, and then I can estimate B by just solving least squares. Now, what we all know is that this will give me an estimate — this gives me an estimate basically for free, it's pretty cheap. What I think is less appreciated is that, one, if we assume properties of the noise, we can actually bound how small this error is with high probability — and the noise doesn't even have to be that restricted to get these kinds of epsilons out. And moreover, if I can just run simulations, I can estimate how sensitive I am using something like the bootstrap, and the bootstrap can give you a pretty good estimate of this epsilon without even having to work through the concentration inequalities. So whenever you build a model, you can test how accurate it is by bootstrapping. I imagine this is true for the deep case as well — I haven't thought about how you'd do the bootstrap with deep models, but the deadlines are far away, so ICML 2019, let's see.
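A minimal sketch of the kind of bootstrap being alluded to for the one-step problem: fit B by least squares, then resample synthetic data from the fitted model and the estimated noise level, and take a quantile of the refit errors as the epsilon. The sizes and the 95% quantile are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 3, 2, 50
B_true = rng.normal(size=(d, p))
x0 = rng.normal(size=d)

# Experiment: x_i = B u_i + x0 + noise, with Gaussian inputs u_i.
U = rng.normal(size=(n, p))
X = U @ B_true.T + x0 + 0.1 * rng.normal(size=(n, d))

def fit(U, Y):
    """Least-squares estimate of B from y_i = B u_i + noise."""
    theta, *_ = np.linalg.lstsq(U, Y, rcond=None)
    return theta.T                              # d x p estimate

Y = X - x0                                      # x0 is observed, so subtract it off
B_hat = fit(U, Y)
sigma_hat = np.std(Y - U @ B_hat.T)             # crude residual noise estimate

# Parametric bootstrap: simulate from the fitted model, refit, look at the spread.
errs = []
for _ in range(200):
    Y_boot = U @ B_hat.T + sigma_hat * rng.normal(size=(n, d))
    errs.append(np.linalg.norm(fit(U, Y_boot) - B_hat, 2))
eps = np.quantile(errs, 0.95)
print("bootstrap epsilon:", eps, " true error:", np.linalg.norm(B_hat - B_true, 2))
```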
Okay, so now here's the idea for how we would actually turn this back into the optimization problem. All we're going to do is note the very simple fact that if Delta-B is my error — that's B minus B-hat, where B-hat is my estimate — then the x here is equal to the x here, because I'm just replacing B with B-hat plus its error. This is very naive and silly, but what it lets me do is rewrite the problem as a robust optimization problem. I can have just the estimate in the constraint — this is what I had on the previous slide, the estimated model is my constraint, not the true one — but then what I'm going to do is write the effect of that error into the cost function. So the error gets pushed into the cost function, and once the error is in the cost function, I can take the worst case. Okay, so what I did was take the problem we started with, which was just minimizing a quadratic, and I took a square root because it made my life easier — as you'll see in a second, because I'm going to apply the triangle inequality, spoiler alert. So I had to take a square root, but once I did that, I now have just nominal constraints, which are easy, and a robust cost. Okay, so this is robust optimization, and it's robust in the sense that what we're trying to do is solve the control problem for every possible instance of that uncertainty. The trick — and this is actually the trick we used for LQR to analyze this — is that if you apply the triangle inequality and bound the maximum eigenvalue of Q, now I have a convex problem. The convex problem that gets introduced is a regularized version of the original problem, and if people know robust optimization, this happens all the time: uncertainty in the data can be translated into a penalized problem on the model, and that's exactly what happened here. El Ghaoui has really nice work on this kind of thing. Okay, so we just did the triangle inequality there.
just did the triangle equality there Renee do you have a question I will
Renee do you have a question I will repeat it
okay so Renee is that is asking the question which is good why do we have to
question which is good why do we have to do this decoupling a layering of
do this decoupling a layering of estimation and then control just a great
estimation and then control just a great question and I don't think we do but
question and I don't think we do but that's what we did
that's what we did so in our and the work that I'm going to
so in our and the work that I'm going to present that's what we did I don't
present that's what we did I don't necessarily think these steps have to be
necessarily think these steps have to be decoupled I think you would just have to
decoupled I think you would just have to come up with a different formulation
correct correct so what Renee is also saying is that this optimization problem
saying is that this optimization problem doesn't know anything about Q so we're
doesn't know anything about Q so we're estimating B without any knowledge of
estimating B without any knowledge of the of the costs we want to optimize and
the of the costs we want to optimize and that could be accruing a loss but I'm
that could be accruing a loss but I'm gonna punt on it because I don't want to
gonna punt on it because I don't want to I wanted to get a finite result the
I wanted to get a finite result the other thing that's nice for people who
other thing that's nice for people who care about these things is a
care about these things is a generalization bound on my model will
generalization bound on my model will turn into a generalization bound on my
turn into a generalization bound on my cost function and that's also what I
cost function and that's also what I really like so we have so this is now
really like so we have so this is now one of these weird things and again this
one of these weird things and again this is kind of addressing why not do them
is kind of addressing why not do them jointly is that we've developed a lot of
jointly is that we've developed a lot of techniques to figure out what the order
techniques to figure out what the order of epsilon should be given properties of
of epsilon should be given properties of the measurement process and what's cool
the measurement process and what's cool about this approach is that if you have
about this approach is that if you have those you have some model and you know
those you have some model and you know some things about its deviations those
some things about its deviations those immediately give us ways to bound the
immediately give us ways to bound the error of the optimization and that's
error of the optimization and that's that's kind of why I like this
that's kind of why I like this particular approach and I can hear
particular approach and I can hear myself through the wall that's weird
myself through the wall that's weird okay so hello everybody in a7 all right
okay so hello everybody in a7 all right now let's do this for linear systems
now let's do this for linear systems let's do this for lqr and I'm only gonna
let's do this for lqr and I'm only gonna do two slides it's gonna be very high
do two slides it's gonna be very high level and then we'll go back to
level and then we'll go back to experiments which are common more
experiments which are common more interesting but let's at least say what
interesting but let's at least say what is the theorem that we can prove and I
is the theorem that we can prove and I think that's kind of interesting that we
think that's kind of interesting that we could prove theorems about all of these
could prove theorems about all of these things for the lqr case
things for the lqr case so what you could do is you can run an
so what you could do is you can run an experiment 40 steps and then you could
experiment 40 steps and then you could ask them a a and B and it turns out that
ask them a a and B and it turns out that that actually is a weird optimization
that actually is a weird optimization problem because all of the X's depend on
problem because all of the X's depend on a and hence a depends on all the axes
a and hence a depends on all the axes everything is very coupled but with some
everything is very coupled but with some analysis you can show that this actually
analysis you can show that this actually gives you a optimal in the parameters
gives you a optimal in the parameters estimate of an D and you also have this
estimate of an D and you also have this funny matrix that enters in which is the
funny matrix that enters in which is the property of the system itself so like I
property of the system itself so like I said these constants actually do appear
said these constants actually do appear here and what they say is that if a
here and what they say is that if a system is easier to excite what this
system is easier to excite what this matrix lambda C is is basically how much
matrix lambda C is is basically how much signal do you see from a random input if
signal do you see from a random input if the signal is easy to excite this
the signal is easy to excite this denominator is large and the number of
denominator is large and the number of samples goes down that you would need so
samples goes down that you would need so signals that are easy to assist um's
signals that are easy to assist um's that are easy to excite are easier to
that are easy to excite are easier to estimate and with if we just assume the
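Schematically — suppressing logarithms, constants, and the failure probability, and writing Λ for the excitation matrix referred to as lambda_C — the estimation statement has the flavor of:

```latex
% After an experiment of length T with random inputs, the least-squares
% estimates satisfy, with high probability, something of the form
\max\!\big\{ \|\hat A - A\|,\ \|\hat B - B\| \big\}
  \;\lesssim\; \sqrt{\frac{d + p}{T \, \lambda_{\min}(\Lambda)}} ,
% so the better the inputs excite the system (larger \lambda_{\min}(\Lambda)),
% the fewer samples are needed.
```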
And if we just assume the worst case here, which would be when A equals 0 or is near 0, you get an optimal dependence, at least in the parameters, which I was hinting at many slides ago. Then we can use this triangle inequality technique, albeit in a more complicated way, to actually analyze the infinite-horizon version of LQR. It requires more bookkeeping because now it's an infinite problem — doing one step was relatively simple but already a little complicated; this requires more bookkeeping — but it is in a paper by Sarah Dean, Horia Mania, Nik Matni, myself, and Stephen Tu, and the references are at the end. It turns out that if you solve a relaxation of this robust optimization problem, you get that the error of the controller I produce — the cost minus the best thing you could ever do if you knew the model, divided by the best thing you could do if you knew the model, so a relative error — scales as some constants times the square root of (d + p) over T. So this was exactly what we hoped: by writing down what would maybe be the best case for model-based RL, that part we do get.
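In symbols, the shape of the guarantee (again schematic, hiding the logarithmic terms and the system-dependent constants described next) is:

```latex
% \hat K is the controller from the robust synthesis with the estimated model;
% J_\star is the optimal LQR cost with the true (A, B).
\frac{J(\hat K) - J_\star}{J_\star}
  \;\lesssim\; C(A, B, Q, R)\,\sqrt{\frac{d + p}{T}} ,
% valid once T is large enough that the right-hand side is small --
% which is also the sample threshold that guarantees the cost is finite at all.
```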
And then we get properties that depend on the actual system we're estimating. How large is the gain? If the gain is large, you need more samples. How sensitive is the closed loop? This gamma_CL is a very interesting quantity: it says, if you designed the optimal controller for the system and then blew on it, how much signal would you see? So the system is being stabilized by the optimal controller, and the question is how sensitive that closed loop is to perturbation — the more sensitive it is, the more samples you need. And again, the easier the system is to excite in an open-loop setting — meaning there's no controller, I'm just doing experiments — the fewer samples you need. So these quantities come in and show an interesting trade-off. What's also interesting here is that this also tells us when the cost is finite, which is crazy: we were doing experiments on a finite time horizon and extrapolating to an infinite time horizon. We're extrapolating out to say when this is actually finite — we're going to see in a second that it's not always easy to do that — and this is actually giving us a bound on how many samples we need to guarantee that we get finite cost. I'm going to describe in a second why finite cost is very important for practical considerations. Let's go back to my favorite example and examine why finite cost is important.
important okay so this is a very time model of the Google Data Center they
model of the Google Data Center they have some cooling rack that's some
have some cooling rack that's some servers and they put fans on them and
servers and they put fans on them and then the servers maybe shed heat to each
then the servers maybe shed heat to each other and this is not this is a really
other and this is not this is a really dumb model but maybe this would be like
dumb model but maybe this would be like what that model would look like where
what that model would look like where each server has some kind of activity
each server has some kind of activity that's heating it up and this number is
that's heating it up and this number is bigger than one so if I ran it open-loop
bigger than one so if I ran it open-loop longer and longer it's gonna go to
longer and longer it's gonna go to infinity that's called unstable this is
infinity that's called unstable this is unstable this is gonna blow up if I run
unstable this is gonna blow up if I run it without doing control and if it blows
it without doing control and if it blows up the farm blows up that's probably bad
now I have a fan at every location so I can cool it down individually but if I
can cool it down individually but if I want to get a real
want to get a real flashy result that I can get in the
flashy result that I can get in the newspaper what I really need to do is
newspaper what I really need to do is spend a little electricity as possible
spend a little electricity as possible because you need this p/e ratio that you
because you need this p/e ratio that you know that everybody makes a big deal
know that everybody makes a big deal about to be small so I'm gonna really
about to be small so I'm gonna really penalize doing too much effort with my
penalize doing too much effort with my control
control but that's weird now that introduces the
but that's weird now that introduces the tension that's very easy to clarify what
tension that's very easy to clarify what that tension is this system is called
that tension is this system is called unstable because it will go to infinity
unstable because it will go to infinity if you don't do something about it
if you don't do something about it but not but if I do least squares
but not but if I do least squares maybe I asked him eight this guy to be
maybe I asked him eight this guy to be 0.99 so if I did these squares and
0.99 so if I did these squares and that's 0.99 and I want to save energy as
that's 0.99 and I want to save energy as like maybe I won't control that one but
like maybe I won't control that one but it was actually bigger than one and so
it was actually bigger than one and so it may blow up and that's actually what
it may blow up and that's actually what happens in the simulation we do see that
happens in the simulation we do see that behavior I have three plots here one is
behavior I have three plots here one is this kind of course ID control where I
this kind of course ID control where I tell the true error between the
tell the true error between the estimated model and the real model and I
estimated model and the real model and I guess the blue curve the green curve is
guess the blue curve the green curve is where I estimate that using the
where I estimate that using the bootstrap and you see that there's some
bootstrap and you see that there's some degradation we did a very conservative
degradation we did a very conservative bootstrap here but it's not crazy bad
bootstrap here but it's not crazy bad the orange curve is what happens when I
the orange curve is what happens when I just plug in the model this is called
just plug in the model this is called certainty equivalence in control it's
certainty equivalence in control it's the naive thing to do take your point
the naive thing to do take your point estimate and pretend like it's true if
estimate and pretend like it's true if you take your point estimate and pretend
you take your point estimate and pretend like it's true you see it starts to
like it's true you see it starts to spike up here and there's no data down
spike up here and there's no data down here so the cost seems to be something
here so the cost seems to be something weird seems to be happening earlier if
weird seems to be happening earlier if we turn to the second plot what's
we turn to the second plot what's happening is this plot is the percentage
happening is this plot is the percentage of the time that when I run a simulation
of the time that when I run a simulation the model that's returned in closed loop
the model that's returned in closed loop is finite how often is my cost finite
is finite how often is my cost finite finite cost means hopefully I mean at
finite cost means hopefully I mean at least roughly finite cost means maybe
least roughly finite cost means maybe your data center doesn't catch on fire
your data center doesn't catch on fire infinite cost means definite fire and so
infinite cost means definite fire and so what we see is that after if you give it
what we see is that after if you give it the true cost sorry the true distance to
the true cost sorry the true distance to the model we actually get a nice curve
the model we actually get a nice curve in after about a hundred samples are
in after about a hundred samples are always returning nice stable behavior
always returning nice stable behavior when you use the bootstrap we're never
when you use the bootstrap we're never getting a good model up until about here
getting a good model up until about here but then we start to chase what the blue
but then we start to chase what the blue does and was cute
does and was cute is down here what's happening is our SDP
is down here what's happening is our SDP solver can't find a feasible solution
solver can't find a feasible solution that's also interesting and this is
that's also interesting and this is actually something we always take for
actually something we always take for granted if you can't find a feasible
granted if you can't find a feasible solution it might mean that you have a
solution it might mean that you have a bug it might mean that there's something
bug it might mean that there's something wrong with the model meaning that
wrong with the model meaning that perhaps there is no feasible solution
perhaps there is no feasible solution and this is why you don't want bugs in
and this is why you don't want bugs in your control design software because you
your control design software because you have to be able to distinguish between
have to be able to distinguish between witches which now the nominal control is
witches which now the nominal control is actually even out at 600 samples so well
actually even out at 600 samples so well beyond where we started about 10 percent
beyond where we started about 10 percent of the time is returning something
of the time is returning something unstable because because of this model
unstable because because of this model so the least squares estimate if you
so the least squares estimate if you just plug it in may yield in an unstable
just plug it in may yield in an unstable controller but the robust method yields
controller but the robust method yields a stable one okay so on this plot we
Okay, so on this plot we return to our friends, the model-free methods. Note that the x-axis is now ten times larger; we need more samples to make the comparison fair. This gray curve is random search (evolutionary strategies, bandit convex optimization, whatever you want to call it), so it's still about an order of magnitude off from the other methods, but it's doing something. The red curve is LSPI, and policy gradient is the blue curve, which under the stage lights is a little hard to see, but it traces out down here, and policy gradient has a hard time finding a feasible stabilizing solution even 50% of the time after 5,000 samples. Maybe we could tune it to be better. I tried; the Jupyter notebooks are linked off my blog, so you're welcome to try to do better. But something has to be done here to make these model-free methods outperform the model-based ones.
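For reference, here is a minimal sketch of the kind of random-search update on a linear policy that the gray curve refers to, roughly in the spirit of basic random search and evolutionary strategies. The step size, perturbation scale, number of directions, and the `rollout_reward` function are all hypothetical placeholders.

```python
import numpy as np

def random_search(rollout_reward, K0, step=0.02, noise=0.05, ndirs=8, iters=200, seed=0):
    """Derivative-free search over linear policies u = K x.

    rollout_reward(K) is assumed to run one or more rollouts with the policy K and
    return the total reward; the update below is a finite-difference-style gradient
    estimate along random directions, which is all this family of methods needs.
    """
    rng = np.random.default_rng(seed)
    K = K0.copy()
    for _ in range(iters):
        update = np.zeros_like(K)
        for _ in range(ndirs):
            delta = rng.normal(size=K.shape)
            r_plus = rollout_reward(K + noise * delta)
            r_minus = rollout_reward(K - noise * delta)
            update += (r_plus - r_minus) * delta
        K = K + (step / (2 * ndirs * noise)) * update
    return K
```

The differences between random search, evolutionary strategies, and bandit convex optimization are mostly in how the perturbations and step sizes are normalized, which is why they get lumped together here.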
So this is cool: we have this end-to-end bound. This is new work; we have an end-to-end bound that goes from estimation error directly to control error for LQR. And one question we were asking as we were digging through the literature was: why was this never done before? The biggest reason is that just getting the least squares estimate actually requires some heavy machinery. We needed to go to probability papers that had appeared on arXiv in the last year, so high-dimensional statistics. Basically, the difference between 1995 and 2018 is this growth in high-dimensional statistics and our ability to understand it and to analyze it, and that could now be plugged into tools from controls, maybe from the 90s, to try to link these things together. That said, even the controls result requires some new results for controls; in particular, it's building on something called system level synthesis that was developed by Nick Matni and his collaborators at Caltech, and that really was key to making this work.
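To fix ideas, the shape of the kind of end-to-end statement being described is roughly the following. This is only schematic: the norms, constants, failure probabilities, and the precise dependence on the system are all suppressed, so treat it as a sketch of the form of the bound rather than the theorem itself.

```latex
% Step 1 (estimation): after N experiments, with high probability the least-squares
% errors eps_A = ||A_hat - A|| and eps_B = ||B_hat - B|| scale like
\[
  \epsilon_A,\ \epsilon_B \;\lesssim\; \sqrt{\frac{n+m}{N}} .
\]
% Step 2 (robust synthesis): the controller \hat{K} built from (A_hat, B_hat, eps_A, eps_B)
% has a relative cost gap that degrades linearly in the estimation error:
\[
  \frac{J(\hat{K}) - J_\star}{J_\star}
  \;\lesssim\; C_{\mathrm{sys}}\,\bigl(\epsilon_A + \epsilon_B\bigr)
  \;\lesssim\; C_{\mathrm{sys}}\,\sqrt{\frac{n+m}{N}} ,
\]
% where C_sys is a system-dependent constant.
```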
And I think the thing that was really cool about this is that what we initially started to do was just look at the solution: the LQR solution comes out of this crazy equation I wrote down called the algebraic Riccati equation, and we tried to understand what happens to that solution as you perturb the parameters. We got very frustrated, and we spent about a year of frustrating time on understanding the perturbations. But if you somehow restrict yourself to always be robust, the analysis also ends up being easier. So it's safer and, for some reason (I mean, it actually kind of makes sense, because we're restricting our class), the analysis is easier. The surprising part is that it does look like this gets to something near optimal, and maybe that's the other reason why we're here.
What's really exciting is that there's been a lot of work, just in the last year, on this problem. There's all sorts of stuff I'm not going to talk about today, but there are multiple results at this conference about LQR, which is really cool. In particular, you can prove that policy gradient, if you start with a policy that doesn't blow up your data center, finds the optimal one. So as long as you can find one solution that is stable, that doesn't blow up, you can find the optimal one with policy gradient; if you don't know one to begin with, you may run into some trouble. There's also all sorts of other work adapting LQR, and what I'm not going to talk about today is adaptive LQR; something like 10 papers have just come out this year on this problem. I'll talk about that again at the end.
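As a side note on what "doesn't blow up" means concretely: the quantity these policy-gradient-for-LQR results optimize is the infinite-horizon cost of a fixed linear policy, which is finite exactly when the policy is stabilizing. Here is a minimal sketch of evaluating that cost, assuming known dynamics and white process noise; the function and variable names are mine, not from any particular paper or codebase.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost(A, B, Q, R, K, noise_var=1.0):
    """Infinite-horizon average cost of u_t = -K x_t for x_{t+1} = A x_t + B u_t + w_t.

    If A - B K is unstable (the "blows up" case), the cost is infinite; otherwise it
    equals noise_var * trace(P_K), where P_K solves the policy's Lyapunov equation.
    """
    A_cl = A - B @ K
    if max(abs(np.linalg.eigvals(A_cl))) >= 1.0:
        return np.inf
    P_K = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
    return noise_var * np.trace(P_K)
```

The convergence results being referenced say, roughly, that gradient steps on K keep this quantity finite and drive it to the optimum, provided the initial K already makes it finite.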
So it really is just starting; we're just really starting to understand this linear case, and I've had "simplest example" in quotation marks the entire time. What's amazing is how hard and complicated LQR is. LQR is not simple; it's just the simplest thing I knew how to write down, and analyzing it actually requires a lot of machinery and, in particular, a lot of new techniques. The one thing I'll say, if there are any theorists in the room, is that our bound required doing an analysis that avoids mixing-time arguments. This is just for the experts; if you don't know this stuff, I just want to point it out. Most people who analyze time series use an analysis that depends on mixing, and the mixing analyses turn this minus 1/2 into a plus 1/2, giving exactly the opposite behavior of what we see, and there are about 50 pages linked off the blog that do that. So again, this is the case of working in the most generality versus the specific case for the specific models we do have; there's a lot of interesting work just in time series analysis to try to push forward. Okay, that was for the experts.
Okay, so that's LQR. We're making progress; we're starting to understand the basics of the sample complexity. But I don't think anybody in this room is that interested in LQR. It is a workhorse. Honestly, something like 95% of the things that we do are PID control, and of the remaining 5%, 95% of those are model predictive control, and almost all model predictive control, which I will talk about in a couple of slides, is using LQR under the hood in one way or another. So it really is kind of a workhorse algorithm and it's worth understanding. But if we want to go do some complex robotics, perhaps we need something more sophisticated, and so a question is: does the converse of the linearization principle apply for these problems? All I'm going to do now is look at some experimental evidence; I have no answer. I'm going to look at one case study, and that'll be the end, and then we'll see where we go from here.
And so the case study is on this funny set of demos based in the MuJoCo simulator. So OpenAI, actually starting with the PhD work of John Schulman, and probably even earlier than that, starting with work that Sergey Levine had done, people started to use MuJoCo as a way of doing controls benchmarks for reinforcement learning algorithms. MuJoCo is a simulator: you get guys that look like this, you can make all sorts of various robots, you can execute the dynamics forward, and then from executing those dynamics forward you can try to optimize them. It was developed by Emo Todorov at the University of Washington.
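For orientation, evaluating a static linear policy on one of these benchmarks looks roughly like the following. This assumes the OpenAI Gym MuJoCo environments of that era are installed; the environment name and the old reset/step API shown here are version dependent, and the policy is just a random placeholder rather than a trained one.

```python
import gym
import numpy as np

env = gym.make("Humanoid-v2")            # environment names vary across gym/mujoco versions
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

# A static linear policy u = K s; here K is an untrained random placeholder.
K = 0.01 * np.random.randn(act_dim, obs_dim)

obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = np.clip(K @ obs, env.action_space.low, env.action_space.high)
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```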
So what we found is that this random search algorithm with linear controllers actually outperforms, on these MuJoCo demos, a variety of other approaches that have been tried, including natural policy gradient on linear controllers, natural policy gradient with some weird nonlinear controllers, TRPO, essentially everything we were looking at. The linear controllers themselves were able to get to these solutions much faster just using random search. In this case you want a larger number, and we were able to get larger rewards. But even though we get larger rewards (and I apologize, because there are lights here; let's just look at this for a second), even though we get larger rewards, this already points out something that I said at the beginning of the talk and want to say again now: one optimization problem probably does not capture everything we care about in control. In particular, this is the humanoid model, this last one. It is declared successful in the OpenAI Gym if the reward exceeds 6,000; as long as you exceed a reward of 6,000, you have a success. So this one has 6,000, and as you can see it's kind of weirdly hopping on one leg. It's a little hard to see, and I apologize that the video is a little washed out, but the dynamics at the foot are also doing something a little funny. And this one is the one that gets to 11,600, and again you've got to watch the feet, because the foot dynamics are absolutely not what a real robot could do; the robot would have fallen over a long time ago. So the problem here is that, at least for these models, they're too simplified to capture reality. In particular, the way they're modeling the contacts might not actually be accurate enough to really carry over, and moreover it's not at all clear what the high reward means; that's not necessarily a realistic gait.
The other thing, oh yeah, sorry, right, the other thing that would be worth considering is that, because random search is faster, we can start to actually evaluate random search on more random seeds. One of the weirdest things about reinforcement learning is the way that the random seed is treated like a hyperparameter, and that's bad. In some sense, I really feel like one thing we all have to think about (and if I can convince anybody to work on it, that would be great) is the right abstractions for how random seeds enter into machine learning. You want to fix the random seed for debugging, maybe; you actually don't know. But if you fix the random seed for debugging and you start tuning hyperparameters to random seeds, you might not actually be seeing the behavior that you want; you might be tuning to random seeds. So really you want to think about where the randomness enters, and think of that as a random number generator that you can't control and just take it out. But we have to think about how to do that as we plug it in; this is a complex software engineering question that I don't really know how to answer. That said, we looked at it, and what we saw is that if you just tune on three random seeds and then take that policy and apply it to other random seeds, you get huge variance. On all of these different problems, all of the different MuJoCo instances in the OpenAI Gym, even random search has incredibly high variance, meaning that tuning the parameters for a few random seeds does not necessarily mean it's going to work on another random seed. And so here, if you have, you know, 100 random seeds for the Walker, the median performance looks much different if you look at 100 random seeds than if you look at ten random seeds; it's completely different. And so this is a really tricky thing.
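A minimal sketch of the evaluation protocol this argues for: tune on a handful of seeds, then report the spread over many held-out seeds rather than a single number. The `train_policy` and `evaluate` functions are hypothetical placeholders for whatever learner and environment you happen to be studying.

```python
import numpy as np

def seed_robustness_report(train_policy, evaluate, tune_seeds=(0, 1, 2), n_eval_seeds=100):
    """Train on a few seeds, pick the best policy, then measure it across many unseen seeds."""
    policies = [train_policy(seed=s) for s in tune_seeds]
    best = max(policies,
               key=lambda pi: np.mean([evaluate(pi, seed=s) for s in tune_seeds]))

    scores = np.array([evaluate(best, seed=s) for s in range(1000, 1000 + n_eval_seeds)])
    return {
        "median": float(np.median(scores)),
        "interquartile_range": (float(np.percentile(scores, 25)),
                                float(np.percentile(scores, 75))),
        "worst": float(scores.min()),
    }
```

Reporting the median and spread over held-out seeds, rather than the best tuned run, is exactly the "100 seeds versus 10 seeds" distinction made above.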
maybe this is another reason to move away from pure model-based methods and
away from pure model-based methods and start to understand a little bit more
start to understand a little bit more about what's under the hood
about what's under the hood and so let me propose a way forward
and so let me propose a way forward again away from the pure model free and
again away from the pure model free and what I'm gonna say the way forward might
what I'm gonna say the way forward might again be to use models maybe and again
again be to use models maybe and again let's go back let's go back I've said
let's go back let's go back I've said this word several times let me define
this word several times let me define model predictive control model
model predictive control model predictive control is really very much
predictive control is really very much like q-learning
like q-learning they're kind of the same thing what
they're kind of the same thing what mantra predictive control does is it
mantra predictive control does is it estimates it it's solves for a cue
estimates it it's solves for a cue function over a time horizon H and often
function over a time horizon H and often times rather than learning Q it just
times rather than learning Q it just specifies some Q at the terminal cost
specifies some Q at the terminal cost now what's interesting about that is the
now what's interesting about that is the longer time horizon just think about it
longer time horizon just think about it the longer the time horizon if all I
the longer the time horizon if all I really care about is 10 steps and I was
really care about is 10 steps and I was time horizon for 20 steps that Q doesn't
time horizon for 20 steps that Q doesn't matter too much and if my model is
matter too much and if my model is pretty good that you can actually our
pretty good that you can actually our simulations suggest that you depend with
simulations suggest that you depend with an exponential decay on an accurate q
an exponential decay on an accurate q function so the Q function matter is
function so the Q function matter is less and less and in fact you can
less and less and in fact you can oftentimes just drop it and treat it as
oftentimes just drop it and treat it as zero at the end if you make the horizon
zero at the end if you make the horizon longer and so what you could do is you
longer and so what you could do is you could plan on this long horizon have
could plan on this long horizon have some kind of model for what the terminal
some kind of model for what the terminal cost looks like take one step and then
cost looks like take one step and then replant and even if you ignore the
replant and even if you ignore the dynamic programming this is an
dynamic programming this is an incredibly sensitive thing to do you
incredibly sensitive thing to do you plan for some time horizon you take one
plan for some time horizon you take one step and then you plan for a longer time
step and then you plan for a longer time horizon again and what that allows you
horizon again and what that allows you to do is at every step you're getting
to do is at every step you're getting feedback and because you're getting
feedback and because you're getting feedback you can correct for errors both
feedback you can correct for errors both that come from the environment and that
that come from the environment and that come from the model and so if you plan
come from the model and so if you plan on short time horizon this feedback can
on short time horizon this feedback can allow you to correct not only
allow you to correct not only disturbances but errors in your model
disturbances but errors in your model and I think would be really interesting
and I think would be really interesting to understand the difference between
to understand the difference between what you could do when you have a good
what you could do when you have a good model and what you could do when you
model and what you could do when you have an approximate model and just as an
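Here is a minimal sketch of the receding-horizon loop just described, for a linear model with quadratic costs and the terminal cost simply dropped to zero: a finite-horizon Riccati backward pass produces the plan, only the first input is applied, and then we replan from the measured state. The dynamics, horizon, and perturbation below are illustrative placeholders, not a system from the talk.

```python
import numpy as np

def mpc_step(A, B, Q, R, x, horizon=20, Q_terminal=None):
    """One receding-horizon step: plan `horizon` steps ahead, return only the first input."""
    n = A.shape[0]
    P = np.zeros((n, n)) if Q_terminal is None else Q_terminal  # terminal cost, often just zero
    K_first = None
    for _ in range(horizon):                                    # backward Riccati recursion
        K_first = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K_first)
    return -K_first @ x                                         # apply only the first action

# Hypothetical closed loop: plan with the model, step the *true* system, then replan.
A_model = np.array([[1.0, 0.1], [0.0, 1.0]])   # the (possibly approximate) model
B_model = np.array([[0.0], [0.1]])
A_true = A_model + 0.01                        # reality differs a little from the model
B_true = B_model
Q, R = np.eye(2), 0.1 * np.eye(1)

x = np.array([1.0, 0.0])
for t in range(50):
    u = mpc_step(A_model, B_model, Q, R, x)
    x = A_true @ x + B_true @ u                # feedback: the next plan starts from the real state
```

The point of the argument above is the last line: because every plan starts from the measured state, model error gets partially corrected at each step instead of compounding over the whole trajectory.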
And just as an example, here's an example of what walking looks like from the Todorov lab, who designed the MuJoCo simulator. This is the same humanoid; they did this in 2012, so this is well before we got excited about reinforcement learning, with a very simple cost function, and they even show it's robust to misspecification of the model. Using MPC, they're able to get the robot to, well, at least lurch. And Emo has videos showing that you can keep making this better if you just improve your cost function: if you iteratively refine your cost function, you can get better walking behavior. This one is from 2013; there's the robot doing some complex tasks. I think there's an external perturbation: he gets pulled, which is not modeled, and the thing is able to stabilize itself. What's cool about this one is that it could actually be executed in real time in 2013 with no GPUs. So in some sense, for these kinds of things, we have taken many steps backwards. Trying to figure out just what are the right ways to pose these problems, really understanding what's unknown here and what's hard to plan, and what the right problems are: those, I think, are actually things that we should be thinking about in machine learning.
so what I'd like to close with is this beautiful example by my colleagues in
beautiful example by my colleagues in mechanical engineering where they
mechanical engineering where they actually do learning but they learn a
actually do learning but they learn a cue function but they learn cue function
cue function but they learn cue function in the context of MPC and essentially
in the context of MPC and essentially what happens here this is a car that
what happens here this is a car that drives around a track and I also link to
drives around a track and I also link to the YouTube video you really got to
the YouTube video you really got to watch it and what the way that they
watch it and what the way that they learn the terminal cost function is they
learn the terminal cost function is they start with a controller that just does
start with a controller that just does Lane following PID from that that gives
Lane following PID from that that gives me values for lots of different states
me values for lots of different states that actually tells me roughly how much
that actually tells me roughly how much is the cost to go from a particular
is the cost to go from a particular state to the end and so what I can make
state to the end and so what I can make the cue function be is what value did I
the cue function be is what value did I see at this state with that action
see at this state with that action because I have a database and anywhere I
because I have a database and anywhere I didn't see data I make it infinity and
didn't see data I make it infinity and if you do that this allows you to
if you do that this allows you to explore you
explore you your model inside the time horizon but
your model inside the time horizon but get to a safe state at the end and this
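A minimal sketch of that terminal-cost idea: a table of previously visited states with their recorded cost to go, and infinity anywhere the table has no data. The discretization and the way cost to go is logged here are simplifications of mine, not the actual implementation from that work.

```python
import numpy as np

class LearnedTerminalCost:
    """Terminal cost for MPC built from logged trajectories of a safe baseline controller."""

    def __init__(self, resolution=0.1):
        self.resolution = resolution
        self.table = {}                      # discretized state -> best observed cost to go

    def _key(self, state):
        return tuple(np.round(np.asarray(state) / self.resolution).astype(int))

    def add_trajectory(self, states, stage_costs):
        """Record the cost to go for every state along one completed lap."""
        cost_to_go = np.cumsum(np.asarray(stage_costs)[::-1])[::-1]
        for s, c in zip(states, cost_to_go):
            k = self._key(s)
            self.table[k] = min(self.table.get(k, np.inf), c)

    def __call__(self, state):
        # Infinity outside the data: the planner must end its horizon in a state it has seen.
        return self.table.get(self._key(state), np.inf)
```

An MPC planner then uses this as the cost at the end of its horizon, which is what lets it push inside the horizon while still ending up in states it has already seen work.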
And this car, it's a remote-control car, is able to start performing better than a human controller after about ten laps, and then it starts zipping around. This is in a conference room in the mechanical engineering building, and it starts zipping around that room really fast after about 20 laps. So this is really cool: it is doing machine learning, it's basically doing Q-learning, but it's using a coarse model as part of the loop. And there are lots of things that are hard, that they have to push on, that are still worth doing, because the other thing that's interesting (and I honestly think this is where we need tons of machine learning) is that the faster that car drives, the more nonlinear the dynamics become: the tires start to slip, and now you have to learn. That's where learning really becomes important, because at that point you don't want to skid off the track. So there's a lot of interesting stuff to do as you push into higher and higher performance, which is where learning should come in. Cool.
So there are so many things left to do; I have an unbounded list, and this is what fit on a slide. I want to leave some time, so let me just at least highlight these theory questions. Are the results that we got for coarse-ID control of LQR optimal with respect to the problem instance parameters? Can we actually get tight upper and lower bounds for these problems? Can we get tight upper and lower bounds in what's called the adaptive setting? We've been looking (again, for people who work in reinforcement learning) at episodic reinforcement learning, where we're allowed to restart the problem. Can you work in the adaptive setting, where you just get one trial? There's actually a ton of work on this; if I had more time, I could go do a huge survey just about adaptive control, and trying to nail that down as well would be great. And then there are a variety of other things involving safe exploration and understanding how to incorporate nonlinearity; there's a seemingly endless number of things that are left to do. And since there are so many things left to do, I guess, kind of looping back to the beginning, it seems like reinforcement learning is this critical problem that we really have to start tackling, right? And, you know, I don't think that control theory itself is going to solve all the problems we want to solve with these tools. And so, it's a little funny, but I figure we just need a new name. I know it's tongue-in-cheek to call it actionable intelligence, but we need a new name because I think this has to be an inclusive effort. It can't just be deep reinforcement learning; it has to take into account things from controls engineering, and, what I think is also interesting, it has to take into account something even broader, because even the most mundane machine learning systems start to look like RL systems if you squint at them.
So consider recommendation. Recommendation was something I worked on decades ago; it seemed like a fun problem, a very innocuous problem. Who wouldn't want to just be recommended some movies to watch on the weekend, or the next heavy metal song that you really want to hear after a long day at ICML? But what we started to observe is that these recommendation systems, when put into feedback with people, start to make people do things that maybe aren't great; they start to bring out the worst in us. In particular, we know that the easiest way to maximize engagement is to make people angry. We know that YouTube produces all sorts of radicalizing behavior by just feeding people more and more stuff to keep them addicted to the site, and the reason that happens is that we keep retraining the model to maximize these metrics, which maybe aren't the right metrics to maximize. If we don't know how to build a cost function for walking, we definitely don't know how to build a cost function for whether people are getting valuable time out of products, so instead we just use time. And when we do that, basically, that's reinforcement learning. It feels like it's supervised learning, matrix completion, but if you keep retraining, you're doing machine learning with feedback, and now this is RL, or perhaps we're going to call it actionable intelligence. And so I think the future here is that all machine learning has actually become reinforcement learning since we started putting everything in production.
It interfaces with humans, it's going to be at the forefront of all of our infrastructure, and so it's imperative for all of us; we all have to start thinking about what's the right way to do this safely, what's the right way to do this reliably, and what's the right way to do this in a way that will be predictable without human intervention. And I think that's exciting. So I challenge everybody here: let's join in on this effort. It's certainly more fun than just thinking about least squares. The theory questions are wonderful, the outcome couldn't be more important, and I'm looking forward to collaborating with everybody on this. So thanks for your time.
we do have time for questions you have to go to the microphone I was told but
to go to the microphone I was told but while people go decide if they want to
while people go decide if they want to ask questions I do want to thank Sarah
ask questions I do want to thank Sarah irelia Horia Nick Max and Steven for
irelia Horia Nick Max and Steven for kind of working with me on the problems
kind of working with me on the problems that we had here and there's a huge list
that we had here and there's a huge list of acknowledgments if I had more time of
of acknowledgments if I had more time of other people I'd like to thank who
other people I'd like to thank who helped put this together please
helped put this together please how does the model predictive control
how does the model predictive control work on the quadrotor example we're
work on the quadrotor example we're showing the simple LT there simple lqr
showing the simple LT there simple lqr problem yeah it works great because you
problem yeah it works great because you could just truncate the time horizon so
could just truncate the time horizon so just just having it you don't even have
just just having it you don't even have to model the cue function to get that to
to model the cue function to get that to work Thanks
we can't take questions from next door thank you go ahead nice tutorial how
Nice tutorial. A general question: you mentioned in your talk that a lot of our work is difficult to reproduce, and we definitely see that as well. So I guess I have two questions. First, as a researcher, what should we do if someone has a fabulous result but I can't reproduce it? Second, as a community, what should we do?

Thank you, that's a great question, and I think it's one we definitely have to grapple with. Joelle, you know, actually sponsored a kind of reproducibility challenge, and it would be kind of crazy if we could scale that out; we have a lot of resources because of sponsors who have a lot of cloud computing. I think it's expensive to have the community replicate everything, but if we could try to think about ways to have all of these experimental results subject to some kind of replication, I think that would be really fantastic. I guess the one thing I want to point out that's interesting to me is that for standard, old machine learning, I actually feel like reproducibility has gotten much, much better than at any time I've ever experienced. Everybody puts things on GitHub, and more or less the GitHub code returns what you want; maybe it's not exactly what people told us it did, but for supervised learning, if an idea is good, it gets reproduced super quickly. The example I have in mind is the Transformer network for translation, which was reproduced before the code was even released. This is crazy, because this never used to happen: before the code was released, people had prototypes that seemed to work pretty well, and so that idea caught on. It could very well be that RL isn't at that stage yet, and we need some more mature thought about algorithms so that we can get to the point where someone can state an algorithm in a paper and we can all reproduce it at home. I feel like, in some sense, that's the bar we want to get to.
Go ahead.

That was a really amazing tutorial, thank you. I just wanted to ask... oh, okay, you can go first and make an announcement.

Oh hi, I'm Rosemary. I just wanted to say we do have the reproducibility workshop on the Saturday, so if anybody wants to come, feel free. Thanks.

Awesome, so we can continue that conversation there. Unfortunately I can't go, but next time. Go ahead.
go ahead I'm just gonna ask do you have any optimism about whether you're into
any optimism about whether you're into end bound on course idea would work for
end bound on course idea would work for mapping between simulated environments
mapping between simulated environments and real world implementation you asked
and real world implementation you asked that's a great question that's a great
that's a great question that's a great question I think this is one of these
question I think this is one of these hard things from machine learning where
hard things from machine learning where we're so used to not having to model
we're so used to not having to model anything and having great results and
anything and having great results and when the real world comes in we even
when the real world comes in we even know we do know and we've seen I mean
know we do know and we've seen I mean I'm always entertained by these things
I'm always entertained by these things that you can introduce adversarial
that you can introduce adversarial perturbations and break machine learning
perturbations and break machine learning systems this has been a very hot area of
systems this has been a very hot area of research turns out that they don't even
research turns out that they don't even have to be that adversarial if you show
have to be that adversarial if you show a video to an image detector it will get
a video to an image detector it will get mistakes every other frame so I guess
mistakes every other frame so I guess this is weird thing where's like I don't
this is weird thing where's like I don't know understanding how to transfer to
know understanding how to transfer to the physical world is very complicated I
the physical world is very complicated I don't think we've really solved it for
don't think we've really solved it for the vision problems however for for
the vision problems however for for robotics I mean for a lot of robotics it
robotics I mean for a lot of robotics it works right a lot of robotics if you
works right a lot of robotics if you model the physics well and you build a
model the physics well and you build a very specific simulator for the robot
very specific simulator for the robot you're building people have been able to
you're building people have been able to use simulators as part of the design
use simulators as part of the design pipeline it's definitely a mature
pipeline it's definitely a mature technology and it would be great to
technology and it would be great to really understand how to do that with
really understand how to do that with machine learning in the loop as well but
machine learning in the loop as well but I think that's I I don't I don't know
I think that's I I don't I don't know enough to yet comment about where
enough to yet comment about where what the possibility of what we need to
what the possibility of what we need to do to make that happen so I just had one
do to make that happen so I just had one question - you kind of had those C's
question - you kind of had those C's earlier and you weren't really you said
earlier and you weren't really you said that some of the theorists were really
that some of the theorists were really interested in actually the C not maybe
interested in actually the C not maybe the sample complexity but yeah
the sample complexity but yeah so there's this weird tension for doing
so there's this weird tension for doing this minimization where you're trying to
this minimization where you're trying to drive this state to zero but actually
drive this state to zero but actually also means that you have less excitation
also means that you have less excitation that's actually not less excitation yes
that's actually not less excitation yes the regressor the state so maybe you
the regressor the state so maybe you could give some comments on you know
could give some comments on you know that's one of the things an adaptive
that's one of the things an adaptive control we struggle with it's like if
control we struggle with it's like if you try to drive a system to be stable
you try to drive a system to be stable and safe you are stealing information
and safe you are stealing information for the ability to identify the system
for the ability to identify the system so you kind of alluded to that but
so you kind of alluded to that but that's also kind of like a major a major
that's also kind of like a major a major issue that's right
issue that's right that's that that's exactly right this is
that's that that's exactly right this is actually a real challenge in the
actually a real challenge in the adaptive setting and this is something
adaptive setting and this is something that it's always a little bit weird in
that it's always a little bit weird in the practice of reinforcement learning
the practice of reinforcement learning that you don't necessarily see the
that you don't necessarily see the difference between episodic and adaptive
difference between episodic and adaptive reinforcement learning in episodic you
reinforcement learning in episodic you get to constantly reset and constantly
get to constantly reset and constantly retry and then you get to do a variety
retry and then you get to do a variety of different control actions if you need
of different control actions if you need to have your system be safe and you
to have your system be safe and you really want to drive your system to be
really want to drive your system to be stable that's antithetical to exploring
stable that's antithetical to exploring to learn how to be better and in an
to learn how to be better and in an example I gave about cars it does seem
example I gave about cars it does seem like if you've never like what is the
like if you've never like what is the right algorithm to explore going faster
right algorithm to explore going faster if you don't really know what happens if
if you don't really know what happens if you push the accelerator down further it
you push the accelerator down further it so you have to assume something right
so you have to assume something right and I think this is what makes the
and I think this is what makes the adaptive control problem considerably
adaptive control problem considerably harder than the the non adaptive case so
harder than the the non adaptive case so it's something to be but this is why
it's something to be but this is why it's a great research problem and there
it's a great research problem and there are lots of folks working addressing
are lots of folks working addressing exactly bad and there's a lot of great
exactly bad and there's a lot of great work it actually just understanding
work it actually just understanding exploration not even from the
exploration not even from the theoretical perspective but
theoretical perspective but understanding how exploration and safety
understanding how exploration and safety interact that I've seen in particular
interact that I've seen in particular there was a paper by the groups of
there was a paper by the groups of boundary uh
boundary uh Angeles Oleg and andreas Kraus a-- which
Angeles Oleg and andreas Kraus a-- which actually tries to estimate exactly how
actually tries to estimate exactly how bad it would be to move away from things
bad it would be to move away from things that you are
that you are no we're safe which I thought was pretty
no we're safe which I thought was pretty interesting other questions
interesting other questions awesome I'm happy to stick around if you
awesome I'm happy to stick around if you want to ask things one on one we have
want to ask things one on one we have ten minutes thanks everybody for your
ten minutes thanks everybody for your time
time [Applause]