This lecture introduces supervised learning for behavior imitation, framing policy learning as a supervised task where an agent learns to map observations to actions based on expert demonstrations. It highlights the fundamental differences between standard supervised learning and sequential decision-making, particularly the violation of the IID assumption.
Hi, welcome to lecture two of CS285. Today we're going to talk about supervised learning of behaviors.

Let's start with a little bit of terminology and notation. We're going to see a lot of terminology in this lecture for denoting policies that we're going to be learning from data. We're not going to talk about reinforcement learning just yet; we're going to talk about supervised learning methods for learning policies, but we'll get started with a lot of the same terminology that we'll use in the rest of the course.
Typically, if you want to represent your policy, you have to represent a mapping from whatever the agent observes to its actions. Now, this is not such a strange object. For those of you that are familiar with supervised learning, you can think of this in much the same way that you represent, for example, an image classifier: an image classifier maps from inputs x to outputs y, and a policy maps from observations o to outputs a. Other than changing the names of the symbols, in principle things haven't actually changed all that much. In the same way that you might train an image classifier that looks at a picture and outputs the label of that picture, you could train a policy that looks at an observation and outputs an action; same principle. We're going to use the letter π to denote the policy and the subscript θ to denote the parameters of that policy, which might be, for example, the weights in a neural network.
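To make the classifier analogy concrete, here is a minimal sketch (my own illustration, not from the lecture) of a discrete-action policy π_θ(a | o) represented exactly like a small classifier; the observation dimension, number of actions, and layer sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """pi_theta(a | o): maps an observation to a distribution over discrete actions,
    structurally the same as a small classifier. All sizes are illustrative."""
    def __init__(self, obs_dim=32, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one logit per action
        )

    def forward(self, obs):
        logits = self.net(obs)
        # Return the policy's distribution over actions for this observation.
        return torch.distributions.Categorical(logits=logits)
```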
Now, typically in a control problem, in a decision-making problem, things exist in the context of a temporal process. At this instant in time I might look at the image from my camera and make a decision about what to do, and then at the next instant in time I might see a different image and make a different decision. So typically we will write both the inputs and the outputs with a subscript lowercase t to denote the time step. For almost all of the discussion in this course we're going to operate in discrete time, meaning that you can think of t as an integer that starts at zero and is incremented with every time step. Of course, in a real physical system t might correspond to some continuous notion of time: for example, t = 0 might be zero milliseconds into the control process, t = 1 might be 200 milliseconds, t = 2 maybe 400 milliseconds, and so on.
Now, in an actual sequential decision-making process, the action that you choose will of course affect the observations that you see in the future, and your actions are not going to be image labels like they are, for example, in the standard image classification task; they're going to be decisions, decisions that have bearing on future outcomes. So instead of predicting whether the picture is a picture of a tiger, you might predict a choice of action, like run away, or ignore it, or do something else.

But this doesn't really change the representation of the policy. If you have a discrete action space, you would still represent the policy in basically the same exact way that you represent an image classifier whose inputs are images.
You could also have continuous action spaces, and in that case perhaps the output would not be a discrete label; maybe it would be the parameters of a continuous distribution. A very common choice here is to represent the distribution over a as a Gaussian distribution, which means that the policy would output the mean and the covariance for that Gaussian, but there are many other choices as well.
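As a rough sketch of that common choice (again my own illustration; the sizes and the state-independent log standard deviation parameterization are assumptions), a Gaussian policy for continuous actions might look like this:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a | o) for continuous actions: a network outputs the mean of a
    Gaussian, and a learned log standard deviation gives a diagonal covariance.
    All sizes here are illustrative placeholders."""
    def __init__(self, obs_dim=32, act_dim=2):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        # Distribution over continuous actions given the observation.
        return torch.distributions.Normal(mean, self.log_std.exp())
```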
So, to recap our terminology: we're going to have observations, which we denote with the letter o and the subscript t to denote that it's the observation at time step t. Our output will be actions, which we denote with the letter a and a subscript t. And our goal will be to learn policies, which in the most general sense are going to be distributions over a given o.
Now, something I want to note here, because this is sometimes a source of confusion: a policy needs to provide us with an action to take. In the most general case, policies are distributions, meaning that they assign a probability to all of the possible actions given a particular observation. Of course, a policy could be deterministic, meaning that it prescribes a single action for a given observation, but that's a special case of a distribution: it's just a distribution that assigns a probability of one to something and a probability of zero to everything else. So in most cases we will actually talk about stochastic policies, policies that specify a distribution over actions, but keep in mind that this is fully general in the sense that deterministic policies are simply a special case of these distributions. It's very convenient to talk about distributions here for the same reason that we tend to talk about distributions in supervised learning. In supervised learning, when classifying images, perhaps you only really want to predict one label for a given image, but you might still learn a distribution over labels and then just take the most likely output, and that makes training these things a lot more convenient. It's the same way with decision making and control: training these policies as probability distributions is often much more convenient, even if in the end you only take a single action.
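As a small hypothetical illustration of that point (the logits below are made up, not from the lecture): you can train the policy as a distribution and then either sample from it or simply take the most likely action.

```python
import torch

# Hypothetical action logits produced by a policy network for one observation.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]])
dist = torch.distributions.Categorical(logits=logits)

stochastic_action = dist.sample()                  # a_t ~ pi_theta(a | o_t)
deterministic_action = dist.probs.argmax(dim=-1)   # special case: most likely action
```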
Now, one more term that we have to introduce, and here we're going to start getting to some of the idiosyncrasies of sequential decision making, is the notion of a state. The state is going to be denoted with the letter s, also with the subscript t. A state is in general a distinct thing from the observation. Understanding this distinction will be very important for certain types of reinforcement learning algorithms. It's not so important for today's lecture, because for imitation learning we often don't need to make this distinction, although even here it'll be important when we try to understand the theoretical underpinnings of some of these imitation learning methods.
Sometimes when we learn policies, we'll write policies as distributions over a given s rather than a given o. I will try to point out when this is happening and why, but to understand the difference between these two objects, let's talk about the difference between states and observations, and then we'll come back to this. Typically we'll refer to policies that are conditioned on the full state as fully observed policies, as opposed to policies conditioned on an observation, which might contain only partial information. So what do I mean by this?
Well, let's say that you are observing a picture of a cheetah chasing a gazelle, and you need to make some decision about what to do in this situation. Now, the picture consists of pixels, recordings from a camera. You know that underneath those pixels there are actual physical events taking place: maybe the cheetah has a position and a velocity, and so does the gazelle. But the input, technically, is just an array of pixels. That's the observation. The state is what produced that observation, and the state is a concise and complete physical description of the world. So if you knew the positions and velocities, and maybe the mental state of the cheetah and the gazelle, you could figure out what they're going to do next. The observation sometimes contains everything you need to infer the state, but not necessarily. For example, maybe there's a car driving in front and you don't see the cheetah. The cheetah is still there; the state hasn't changed just because it's not visible, but the observation might have changed. So in general, it might not be possible to perfectly infer the current state s_t from the current observation o_t.
Whereas going the other way, from s_t to o_t, is always possible by definition of what a state is, because a state always encodes all the information you need to produce the observation. If it helps, you can think about it this way: if you imagine this was a simulation, s_t might be the entire state of the computer's memory, encoding the full state of the simulator, whereas the observation is just an image that is rendered out, based on that state, onto the computer screen. Going from the observation back to the state might not be possible if some things are occluded or otherwise not visible.
Now, if we want to make this a little bit more precise, we can describe it in the language of probabilistic graphical models. In this language, we can draw a graphical model that represents the relationship between states, observations, and actions. For those of you that took some course that covers Bayes nets, this will look familiar. For those of you that haven't, roughly speaking, in these pictures the edges denote dependence relationships: if there's an edge, the variable is not independent of its parents, and the missing edges encode conditional independencies. I won't get into the details of how to understand probabilistic graphical models, so if you haven't covered this material, this part won't entirely make sense to you, but hopefully the verbal explanation of the relationship between these variables will still make sense.
So the policy π_θ is, at least for the partially observed case, a relationship between o and a: it gives the conditional distribution over a given o. The state is what determines how you transition to the next state: the state and action together give a probability distribution over the next state, p(s_{t+1} | s_t, a_t), which is sometimes referred to as the transition probabilities, or the dynamics. You can think of this as basically the physics of the underlying world. When we write down equations of motion in physics, we don't write down equations describing how image pixels move around; we write down equations about how rigid bodies move, and things like that. So that's referring to s, the state: the position and velocity of the cheetah. The cheetah might transition to a different position based on its current velocity, and maybe based on how hungry it is and what it's trying to do, and that's all captured in the state.
Then, something to note about the state is that the state s_3 here is conditionally independent of the state s_1 if you know the state s_2. Let me say that again, because that might have been a little bit unclear: if you know the state s_2 and you need to figure out the state s_3, then s_1 doesn't give you any additional information. That means that s_3 is conditionally independent of s_1 given s_2. This is what is referred to as the Markov property, and it's one of the most fundamental defining features of a state. Essentially, if you know the state now, then the state in the past does not matter to you, because you know everything about the state of the world. And that actually makes sense if you think back to the analogy about the computer simulator: if you know the full state of the memory of the computer, that's all you really need in order to predict future states, because the past memory of the computer doesn't matter. The computer is only going to be advancing its simulation based on what's in memory now; the computer itself has no access to its memory in the past, only its memory now. So it makes sense that the future is independent of the past, given the present.
So this is referred to as the Markov property, and it's very, very important. The Markov property essentially defines what it means to be a state: a state is that which captures everything you need to know to predict the future without knowing the past. That doesn't mean that the future is perfectly predictable; the future might still be random, there might be stochasticity, but knowing the past doesn't help you predict it any better.
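To make the structure concrete, here's a minimal sketch (my own illustration, not from the lecture) of a Markovian simulator: the next state depends only on the current state and action, and the observation is produced from the state but may drop information.

```python
import numpy as np

def step(state, action, rng):
    """p(s_{t+1} | s_t, a_t): the next state depends only on the current state
    and action (the Markov property), possibly with some added randomness."""
    return state + action + 0.1 * rng.standard_normal(state.shape)

def observe(state):
    """o_t is produced from s_t (always possible), but may lose information:
    here only the first coordinate is exposed, so s_t can't be recovered from o_t."""
    return state[:1]

rng = np.random.default_rng(0)
s = np.zeros(2)
for t in range(3):
    a = np.ones(2)        # a dummy action a_t
    o = observe(s)        # the agent only ever sees o_t
    s = step(s, a, rng)   # the world evolves from the full state s_t
```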
Okay, so just to finish this discussion: hopefully the distinction is now clear between policies that operate on observations, π_θ(a_t | o_t), and policies that operate on states, π_θ(a_t | s_t). Some algorithms, especially some of the later reinforcement learning algorithms we'll describe, can only learn policies that operate on states, meaning that they require the input to the policy to satisfy the Markov property, to fully encode the entire state of the system. Some algorithms will not require this; some algorithms will be perfectly happy to operate on partial observations that are perhaps insufficient to infer the state.
I'll try to make this distinction every time I present an algorithm, but I will warn you right now that reinforcement learning practitioners and researchers have a very bad habit of confounding o and s. Sometimes people will refer to o as s; they'll say, this is my state, when in fact they mean, this is my observation. Sometimes it's vice versa, and sometimes they'll make the distinction very unclear, switching back and forth between observations and states. So this confusion happens often. If everything is going well, this confusion is benign, because it typically arises for algorithms where it doesn't matter whether the input is a state or an observation, so it's kind of okay to mix them. I'll try not to mix them, but sometimes I'll fall into old habits and mix them anyway, in which case I'll do my best to tell you. But be warned that o and s get mixed up a lot; if you want to be fully rigorous and fully correct, this slide explains the difference.
As an aside on notation: in this class we use the standard reinforcement learning notation, where s denotes states and a denotes actions. This kind of terminology goes back to the study of dynamic programming, which was pioneered in the '50s and '60s, principally in the United States, by folks like Richard Bellman, and I believe the s/a notation was actually first used in his work, although I could be wrong about that.
Those of you that have more of a controls or robotics background might be familiar with a different notation which means exactly the same thing. If you've seen the symbol x used to denote the state, such as the configuration of a robot or a control system, and the symbol u used to denote the action, don't be concerned: it means exactly the same thing. This kind of notation is more commonly used in controls, and a lot of it goes back to the study of optimal control and optimization, much of which was actually pioneered in the Soviet Union by people like Lev Pontryagin. Much like the word "action" begins with the letter a, the Russian word for control begins with the letter u, so that's why we have u. As for x, well, it's just a commonly used variable in algebra.
Okay, so that's the setup. Now let's actually talk about imitation, the main topic of today's discussion. Our goal will be to learn policies, which are distributions over a given o, and to do this using supervised learning algorithms.
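Putting the terminology together, here's a minimal behavioral-cloning-style sketch (my own illustration, assuming we already have a dataset of expert observation-action pairs; all sizes and data below are placeholders): we fit π_θ(a | o) to the expert's actions by maximum likelihood, exactly as in supervised learning.

```python
import torch
import torch.nn as nn

# Placeholder expert demonstrations: observations o_t and the expert's actions a_t.
expert_obs = torch.randn(1000, 32)
expert_acts = torch.randint(0, 4, (1000,))

# A small discrete-action policy network, as in the earlier sketches.
policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    dist = torch.distributions.Categorical(logits=policy(expert_obs))
    # Supervised objective: maximize the log-likelihood of the expert's actions
    # under pi_theta(a | o).
    loss = -dist.log_prob(expert_acts).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```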