Current large language models (LLMs) are insufficient for achieving human-level AI: their autoregressive, token-by-token generation lacks common sense and a true understanding of the physical world. Future AI development should focus on building systems with mental models, planning capabilities, and more robust inference mechanisms, inspired by biological learning.
I'd like to welcome our second and final plenary to the stage. Up next is Yann LeCun. He's Chief AI Scientist at Meta and a professor at NYU. Yann was the founding director of Meta's AI research lab and of the NYU Center for Data Science, I should say. He works primarily in a number of fields: machine learning, computer vision, mobile robotics, and computational neuroscience. In 2019, Yann won the prestigious ACM Turing Award for his work on AI, and he's of course a member of the US National Academies and the French Académie des Sciences. A warm welcome to you, Yann; good to have you. [Applause]
Thank you very much. A real pleasure to be here; last time must have been before COVID or something.
Okay. There's going to be a little bit of a connection with what Bernard just talked about, and with what Michael Jordan told you about earlier today. So, as a matter of fact, we do need human-level AI, and it's not just because it's an interesting scientific question; it's also a product need. We are going to be wearing smart devices like smart glasses and things of that type in the future, and in those smart devices we'll be able to access AI assistants that will be with us at all times. We'll be interacting with them either through voice or through EMG, electromyography. The glasses will eventually have displays, although currently they don't.
And we need those systems to have human-level intelligence, because that's what we're most familiar interacting with: we're familiar with interacting with other humans, we're familiar with the level of intelligence we expect from a human, and it would be easier to interact with systems that have similar forms of intelligence. So those ubiquitous assistants are going to mediate all of our interactions with the digital world, and that's why we need them to be easy to use for a wide population that is not necessarily familiar with using technology.
Okay, but the problem is that machine learning sucks, compared to what we observe in humans and animals. We don't really have the techniques that would allow us to build machines with the same type of learning ability, common sense, and understanding of the physical world. Animals and humans have background knowledge that allows them to learn new tasks extremely quickly, understand how the world works, and reason and plan, and that's based on what we call common sense, which is not a very well-defined concept. And our behavior, and the behavior of animals, is driven by objectives, essentially. So I'm going to argue that the type of AI systems we have at the moment, the ones almost everybody is playing with, do not have the right characteristics for what we want.
The reason is that they basically produce one token after another, autoregressively. You have a sequence of tokens, which are subword units, but it doesn't matter what they are: a sequence of symbols. And you have a predictor, repeated over the sequence, that basically takes a window of previous tokens and predicts the next token. The way you train those systems is to put the sequence at the input (and I apologize for the display; I'm going to try changing the resolution of the screen). So the way those things are trained is: you take a sequence and train the system to just reproduce its input on its output, and because it has a causal structure, it cannot cheat and use a particular input to predict itself; it can only look at the symbols to its left. That's called a causal architecture. It's very efficient; this is what people call a GPT, a generative pre-trained transformer, but you don't have to put transformers in it. It could be anything; it's just a causal architecture (and I'm afraid I haven't fixed the flashing). Once trained, you can use the system to generate text by autoregressively producing a token, shifting it into the input, producing the second token, shifting that in, and so on. That's autoregressive prediction, not a new concept at all, obviously.
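To make the setup concrete, here is a minimal sketch in PyTorch (my illustration, not anything from the talk's slides): a toy causal model trained on next-token prediction, then used autoregressively. The model size, vocabulary, and hyperparameters are arbitrary stand-ins, and positional encodings are omitted for brevity, so this toy net is deliberately crude.

```python
import torch
import torch.nn as nn

class ToyCausalLM(nn.Module):
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):  # tokens: (batch, T)
        # Causal mask: position t may only attend to positions <= t,
        # so no token can "cheat" by looking at itself or to its right.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.trunk(self.emb(tokens), mask=mask))

model = ToyCausalLM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
seq = torch.randint(0, 256, (8, 33))  # stand-in training sequences

# Training: the target at each position is simply the next token of the input.
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), seq[:, 1:].reshape(-1))
loss.backward()
opt.step()

# Generation: produce a token, shift it into the input, repeat.
toks = seq[:1, :4]
with torch.no_grad():
    for _ in range(16):
        nxt = torch.multinomial(model(toks)[:, -1].softmax(dim=-1), 1)
        toks = torch.cat([toks, nxt], dim=1)
```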
And there's an issue with this, which is that the process is basically divergent. Every time you produce a token, there is some chance that the token is not within the set of reasonable answers and takes you outside that set, and if that happens there is no way to fix it afterwards. If you assume there is some probability of a wrong token being generated, and that the errors are independent, which of course they're not, then you get exponential divergence, which is why we have hallucination issues with those models.
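Here's the back-of-the-envelope version of that argument in a few lines of Python (the per-token error rates are made up for illustration, and the independence assumption is, as just noted, false):

```python
# If each token independently has probability e of leaving the set of
# acceptable answers, a length-n answer stays acceptable with
# probability (1 - e)**n, which decays exponentially in n.
for e in (0.01, 0.05):
    for n in (10, 100, 1000):
        print(f"e={e}, n={n}: P(still acceptable) = {(1 - e) ** n:.3g}")
```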
But we're missing something really big, because never mind reproducing human intelligence: we can't even reproduce cat intelligence or rat intelligence, let alone dog intelligence. They can do amazing feats; they understand the physical world. Any house cat can plan highly complex actions, and they have causal models of the world; some of them know how to open doors and taps and things of that type. And in humans, a 10-year-old can clear the dinner table and fill up the dishwasher without learning, zero-shot, the first time you ask her to do it. Any 17-year-old can learn to drive a car in 20 hours of practice. But we still don't have robots that can act like a cat; we don't have domestic robots that can clear the dinner table; and we don't have Level 5 self-driving cars, despite having hundreds of thousands, if not millions, of hours of supervised training data. So that tells you we're missing something really big. Yet we have systems that can pass the bar exam, do math problems, and prove theorems, but no domestic robots. So we keep bumping into this paradox, Moravec's paradox: the things we take for granted because humans and animals can do them, we think they're not complicated, but they're actually very complicated; and the stuff we think is uniquely human, like manipulating and generating language, playing chess, go, or poker, and producing poetry, turns out to be relatively easy.
Okay, and perhaps the reason for this is a very simple calculation. A typical LLM nowadays is trained on on the order of 30 trillion tokens, 3 × 10^13 tokens. That's roughly 2 × 10^13 words. Each token is about three bytes, so the data volume is roughly 10^14 bytes. It would take any of us almost half a million years to read through all that material; it's basically all the publicly available text on the internet. Now consider a human child. A four-year-old has been awake a total of 16,000 hours, which, by the way, is only 30 minutes of YouTube uploads. We have 2 million optic nerve fibers, each of which carries about one byte per second, maybe a bit less, but it doesn't matter: the data volume is about 10^14 bytes. In four years, a four-year-old child has seen as much data as the biggest LLM, in the form of visual perception; for blind children it's touch, with the same kind of bandwidth. That tells you a number of things, and the first is that we're never going to get to human-level intelligence by just training on text. It's just not happening.
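The arithmetic behind those figures, written out (the reading speed, hours per day, and words-per-token ratio are my assumptions; the rest are the talk's numbers):

```python
tokens = 30e12            # ~30 trillion training tokens
llm_bytes = tokens * 3    # ~3 bytes/token -> ~1e14 bytes

awake_hours = 16_000      # a four-year-old's total waking time
fibers = 2e6              # optic nerve fibers, ~1 byte/s each
child_bytes = fibers * 1.0 * awake_hours * 3600   # ~1.15e14 bytes

wpm = 250                 # assumed reading speed, 8 hours/day
reading_years = (tokens * 0.75) / wpm / 60 / 8 / 365   # ~0.75 words/token
print(f"LLM: {llm_bytes:.1e} B, child: {child_bytes:.1e} B, "
      f"reading time: {reading_years:,.0f} years")   # ~half a million years
```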
That's despite what some people who have a vested interest in it happening are telling us, that we're going to reach PhD-level intelligence by next year. It's just not happening. We might have PhD-level performance in some subfield, on some problems like chess playing, and more of them over time, as long as we train those systems specifically for those problems. As Bernard was explaining with the visual illusions, there are a lot of problems of this type: you pose a problem to an LLM, and if the problem is a standard puzzle, the answer will be regurgitated in a few seconds; but if you change the statement of the problem a little bit, the system will still produce the same answer it gave before, because it has no real mental model of what goes on in the puzzle. So how do human infants learn how the world works? Infants accumulate a huge amount of background knowledge about the world in the first few months of life.
Notions like object permanence, solidity, rigidity, the natural categories of objects: before children understand language, they understand the difference between a table and a chair; that develops naturally. And they understand intuitive physics, notions like gravity and inertia, around the age of nine months. So it takes a long time: mostly observation until about four months, because babies don't really have any influence on the world before that, and then interaction. But the amount of interaction required is astonishingly small.
So if we want AI systems that can eventually reach human level, and it might take a while, we call this advanced machine intelligence at Meta. We don't like the term AGI, artificial general intelligence; the reason is that human intelligence is actually quite specialized, so calling it AGI is kind of a misnomer. So we call it AMI, which we actually pronounce "ami", which means friend in French. We need systems that learn world models from sensory input, basically mental models of how the world works that you can manipulate in your mind, learning physics from video, say; systems that have persistent memory; systems that can plan actions, possibly hierarchically, so as to fulfill an objective; systems that can reason; and systems that are controllable and safe by design, not by fine-tuning, which is the case for LLMs today.

Now, the only way I know to build systems of this type is to change the type of inference that current AI systems perform. Right now, the way an LLM performs inference is by running through a fixed number of layers of a neural net, a transformer, producing a token, injecting that token back into the input, and running through the fixed number of layers again. The problem with this is that whether you ask a simple question or a complex question, and ask the system to answer yes or no, like "does two plus two equal four?" or "does P equal NP?", it will spend the exact same amount of computation on the two questions. So people have been cheating, with the chain-of-thought trick: you have the system produce more tokens, so that it spends more computation answering the question. But that's kind of a hack.
of a hack the way um a lot of inference
in statistics for example that's going
to make Mike happy actually um the way
inference works is is not that way in uh
In classical AI in statistics uh in
structure prediction a lot of different
domains the way it works is that you
have a function that measures the degree
of compatibility or incompatibility
between your observation and a proposed
output and then the inference process
consist in finding the value of an
output that minimizes this
incompatibility measure okay let's call
it an energy function so you have an
energy function okay represented by the
square box here on the right um when it
doesn't disappear and and the system
just do performs optimization for doing
inference now if the inference uh
problem is more difficult the system
will just spend more time performing
inference in other words they will think
about complex problems for longer than
simple ones for which the answer is pretty
pretty
And this is really a very classical thing to do. Classical AI is all about reasoning and search, and therefore optimization; pretty much any computational problem can be reduced to an optimization problem or a search problem, essentially. It's also very classical in probabilistic modeling, in probabilistic graphical models and things of that type. This type of inference is more akin to what psychologists call System 2 in the human mind. System 2 is when you think about what action or sequence of actions you're going to take before you take them: you think about something before doing it. System 1 is when you can do the thing without thinking about it; it has become sort of subconscious. LLMs are System 1; what I'm proposing is System 2. And the appropriate, somewhat theoretical framework to explain this is energy-based models, which I'm not going to have time to get into in much detail. Basically, you capture the dependency between variables, say observations x and outputs y, through an energy function that takes low values when x and y are compatible and larger values when x and y are not compatible. You don't want to just compute y from x, as we just saw; you want an energy function that measures the degree of incompatibility, and then, given an x, you find a y that has low energy for that x. So now let's go a little bit into the details of how this type of architecture can be built, and how it relates to thinking or planning.
So a system would look like this. You get an observation from the world; it goes through a perception module that produces an estimate of the state of the world. Of course, the state of the world is not completely observable, so you may have to combine this with the content of a memory, which contains your idea of the parts of the state of the world you don't currently perceive. The combination of the two goes into a world model. So what is a world model? Given the current estimate of the state of the world, which lives in an abstract representation space, and given an action sequence that you imagine taking, the world model predicts the resulting state of the world that will occur after you take that sequence of actions. That's what a world model is. If I tell you: imagine a cube floating in the air in front of you; now rotate this cube by 90 degrees around a vertical axis; what does it look like? It's very easy for you to run that mental model. (Hopefully this slide will be more stable: 50 Hz, not 60 Hz... that doesn't look good either. I think we're going to have human-level intelligence before we have projectors that work.)
So, if we have this world model, which is able to predict the result of a sequence of actions, we can feed its prediction to an objective: a task objective that measures to what extent the predicted final state satisfies a goal we set for ourselves. It's just a cost function. We can also set some guardrail objectives; think of them as constraints that need to be satisfied for the system to behave in a safe manner. Those guardrails are explicitly implemented, and the way the system proceeds is by optimization: it looks for an action sequence that minimizes the task objective and the guardrail objectives, at runtime. We're not talking about learning here; we're just talking about inference. And that will guarantee the safety of the system, because the guardrails guarantee safety, and there is no way to jailbreak the system by giving it a prompt that makes it escape its guardrail objectives: the guardrail objectives are just hardwired. They might be trained, but then hardwired.
Now, for a sequence of actions, you should probably use a single world model that you apply repeatedly over multiple time steps: given the first action, it predicts the next state; given the second action, the state after that; and so on. You can have guardrail costs and task objectives along the trajectory. I'm not specifying which optimization algorithm to use; it doesn't really matter for this discussion. If the world happens not to be completely deterministic and predictable, the world model may need latent variables to account for all the things about the world that we do not observe and that make our predictions inexact.
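Here's what that planning-by-optimization loop might look like, as a toy sketch: the world model below is a stand-in lambda, the guardrail is an arbitrary "stay in a ball" constraint of my own choosing, and plain gradient descent stands in for whatever optimizer one would actually use.

```python
import torch

def plan(world_model, s0, goal, horizon=10, steps=200, lr=0.05):
    """world_model(s, a) -> next abstract state; all costs differentiable."""
    actions = torch.zeros(horizon, 4, requires_grad=True)  # 4-dim actions, say
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):                 # runtime optimization: inference,
        opt.zero_grad()                    # not learning
        s, guard = s0, 0.0
        for a in actions:                  # roll the same model forward
            s = world_model(s, a)
            guard = guard + torch.relu(s.norm() - 10.0)  # guardrail cost
        task = ((s - goal) ** 2).sum()     # task objective on the final state
        (task + 100.0 * guard).backward()  # guardrails weighted heavily
        opt.step()
    return actions.detach()

# Usage with a stand-in dynamics model (a real one would be trained):
wm = lambda s, a: s + 0.1 * a.sum() * torch.ones_like(s)
a_seq = plan(wm, s0=torch.zeros(3), goal=torch.ones(3))
```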
Ultimately, what we want is a system that can plan hierarchically: something with several levels of abstraction, such that at the low level we plan low-level actions, basically muscle control, while at a high level we plan abstract macro-actions, where the world model predicts over longer time steps but in a representation space that is more abstract and therefore contains fewer details. If I'm sitting in my office at NYU and I decide to go to Paris, I can decompose that task into two subtasks: go to the airport and catch a plane. Now I have a subgoal: going to the airport. I'm in New York City, so going to the airport consists of going down to the street and hailing a taxi. How do I go down to the street? Well, I need to get to the elevator, push the button, go down, and walk out of the building. How do I get to the elevator? I need to stand up from my chair, pick up my bag, open the door, walk to the elevator, and avoid all the obstacles. At some point I get to a level where I don't need to plan; I can just take the actions. We do this type of hierarchical planning absolutely all the time, and I can tell you we have no idea how to do this with learning machines.
Almost every robot does hierarchical planning, but the representations at every level of the hierarchy are handcrafted. What we need is to train an architecture, perhaps of the type I'm describing here, so that it can learn abstract representations, not just of the state of the world, but also the world models that predict what's going to happen, and abstract actions at every level of abstraction, so we can do this hierarchical planning. Animals do this; humans do this very well. We're completely incapable of doing it with machine learning today. If you're starting a PhD: great years ahead.
So, with all those reflections, about three years ago I wrote a long paper where I explained where I think AI research should be focusing. This was before the whole ChatGPT craze, and I haven't changed my mind since; ChatGPT hasn't changed anything. We were working on LLMs before that, so we knew what was coming anyway. This is the paper, "A Path Towards Autonomous Machine Intelligence", which we now call advanced machine intelligence, because "autonomous" just scares people. It's on OpenReview, not on arXiv, and there are various versions of this talk that I've given in various ways.
Okay. So a very natural idea for getting systems to understand how the world works is to use the same process that we used to train systems for natural language, and apply it to, say, video. If a system is capable of predicting what's going to happen in a video, you show it a short segment of video and ask it to predict what happens next, presumably it will have understood the underlying structure of the world; training it to make that prediction might actually cause the system to understand that underlying structure. It works for text because predicting words is relatively simple. Why is predicting words simple? Because there is only a finite number of possible words, and certainly a finite number of possible tokens. So while we can't predict exactly which word will follow another word, or which word is missing in a text, we can produce a probability distribution, or a score, for every possible word in the dictionary. We cannot do this for images or video frames: we do not have good ways of representing distributions over video frames. Every attempt to do this basically bumps into mathematical intractabilities.
You could try to get around the problem using statistics and the math invented by physicists, variational inference and all that stuff, but in fact it's better to throw away the entire idea of probabilistic modeling and just say: I want to learn an energy function that tells me whether my output is compatible with my input, and I don't care whether that energy function is the negative log of some distribution. The reason we need to do this, of course, is that we cannot predict exactly what's going to happen in the world. There is a whole set of possible things that may happen, and if we train a system to predict a single frame, it's not going to do a good job. So the solution to that problem is a new architecture I call the joint embedding predictive architecture, or JEPA. And that's because generative architectures simply do not work for producing videos. You may have seen video generation systems that produce pretty amazing stuff; there are a lot of hacks behind them, and they don't really understand physics. They don't need to: they just need to predict pretty pictures; they don't need an accurate model of the world.
Okay, so here's what the JEPA is. The idea is that you run both the observation and the output, which is the next observation, through an encoder, so that the prediction does not consist in predicting pixels, but in predicting an abstract representation of what goes on in the video, or in whatever the data is. Let's compare the two architectures. On the left, generative architectures: you run x, the observation, through an encoder and perhaps a predictor or decoder, and you make a direct prediction of y. On the right, the JEPA architecture: you run both x and y through encoders, which may be identical or different, and you predict the representation of y from the representation of x in this abstract space. This causes the system to learn an encoder that eliminates all the stuff you cannot predict. And this is really what we do. If I observe the left part of this room and pan the camera towards the right, there is no way any video prediction system, including humans, can predict what every one of you looks like, or the texture on the wall, or the texture of the hardwood floor. There are a lot of things we simply cannot predict, so instead of insisting on a probabilistic prediction of stuff we cannot predict, let's just not predict it: learn a representation in which all of those details are essentially eliminated, so that the prediction is much simpler. It may still need to be non-deterministic, but at least we've simplified the problem.
There are various flavors of these JEPAs, which I'm not going to go into: some have latent variables, some are action-conditioned. I'm going to talk about the action-conditioned ones, because they're the most interesting: they really are world models. You have an encoder; x is the current observation, the current state of the world; you feed an action that you imagine taking to a predictor; and the predictor, which is a world model, predicts the representation of the next state of the world. And that's how you can do planning.
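In code, the core of the idea is just a few lines. This is my toy rendition, with arbitrary encoder and predictor shapes, not FAIR's implementation:

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(),
                    nn.Linear(128, 32))   # encoder into an abstract space
pred = nn.Sequential(nn.Linear(32 + 2, 64), nn.ReLU(),
                     nn.Linear(64, 32))   # predictor, conditioned on a 2-d action

x = torch.randn(16, 64, 64)   # current observation (e.g. a frame)
y = torch.randn(16, 64, 64)   # next observation
act = torch.randn(16, 2)      # action taken between the two

sx, sy = enc(x), enc(y)
sy_hat = pred(torch.cat([sx, act], dim=-1))   # predict Enc(y), not pixels
pred_loss = ((sy_hat - sy) ** 2).mean()
# Warning: minimizing pred_loss alone has a collapsed solution (a constant
# encoder output makes the error zero). Training needs a contrastive or
# regularized term -- see the VICReg-style sketch further down.
```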
Okay, so we need to train those systems, and figuring out how to train these JEPA architectures turns out not to be completely trivial. The cost function in the JEPA architecture measures the divergence between the representation of y and the predicted representation of y; we need it to be low on the training data, but we also need it to be large outside the training set. So for this kind of energy function, with its contours of equal energy, we need to make sure the energy is high outside the manifold of the data, and I know only two classes of methods for that. One set of methods is called contrastive: you take data points, the dark blue dots here, and push their energy down; then you generate other points, the flashing green dots, and push their energy up. The problem with contrastive methods is that they don't scale very well in high dimension: if your space of y has too many dimensions, you need to push up in lots of different places, and it doesn't work so well; you need a lot of contrastive samples. The other set is called regularized methods: they use a regularizer on the energy so as to minimize the volume of space that can take low energy. That leads to two different types of learning procedure: a contrastive one, where you generate contrastive points and push their energy up through some loss function, and a regularized one, where a regularizer sort of shrink-wraps the manifold of the data, making sure the energy is higher outside.
There are a number of techniques for doing this; I'll describe just a handful. The way we started testing them, maybe five or six years ago, was to train them to learn representations of images. You take one image, corrupt or transform it in some way, run the original image and the corrupted version through identical encoders, and train a predictor to predict the representation of the original image from the corrupted one. Once you're done training, you remove the predictor and use the representation at the output of the encoder as the input to something simple, like a linear classifier, that you train supervised, so as to verify that the learned representations are good. This idea is very old: it goes back to the early 1990s and what we used to call Siamese networks, with some more recent work on those joint embedding architectures; adding the predictor is more recent.
SimCLR, which is from Google, is a contrastive method derived from Siamese nets, but again, the dimension is restricted. The regularized methods work the following way: you have some sort of estimate of the information content coming out of the encoders, and you need to prevent the encoders from collapsing. There's a trivial solution when training a JEPA architecture, where the encoder basically ignores the input and produces a constant output, so the prediction error is zero all the time. Obviously that's a collapsed solution, and not an interesting one. You need to prevent the system from collapsing, which is the regularization I was talking about earlier, and an indirect way of doing it is to maintain the information content coming out of the encoder.
Okay, so you're going to have a training objective function which is a negative information content, if you like, because in machine learning we minimize, we don't maximize. One way to do this is to take the representation vectors that come out of the encoder over a batch of samples and make sure they contain information. How do you do that? You take the matrix of representation vectors, compute the product of that matrix with its transpose, which gives you a covariance matrix, and you try to make that covariance matrix equal to the identity. Now, there's bad news with this: it approximates the information content by making very strong assumptions about the nature of the dependencies between the variables, and in fact it's an upper bound on the information content. So we're pushing the bound up while crossing our fingers that the actual information content, which is below it, follows. It's slightly irregular, theoretically, but it works.
All right, so again, you have a matrix coming out of your encoder: one row per sample, and each column is a separate variable. We're going to try to make each variable individually informative, by preventing the variance of the variable from going to zero, forcing it to be one, for example. And then we're going to decorrelate the variables from each other, which means computing the covariance matrix, the matrix transposed multiplied by itself, and making the resulting covariance matrix as close to the identity matrix as possible. There are other methods that try to make the samples orthogonal rather than the variables; those are sample-contrasting methods, but they don't work in high dimension and they require large batches. So we have a method of this type called VICReg, which stands for variance-invariance-covariance regularization, and it has particular loss functions for this covariance matrix. Similar methods have been proposed by Yi Ma and his team, called MCR², and another by some colleagues from NYU, coming from neuroscience, called MMCR.
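Here's a sketch of that variance/covariance regularizer, loosely in the spirit of VICReg as described above, but not its exact losses or weightings (those details are in the paper):

```python
import torch

def variance_covariance_reg(z, gamma=1.0, eps=1e-4):
    """z: (batch, dim) embeddings from the encoder."""
    z = z - z.mean(dim=0)                      # center each variable
    # Variance term: keep each dimension's std from collapsing below gamma.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()
    # Covariance term: decorrelate dimensions by pushing the off-diagonal
    # entries of the (dim x dim) covariance matrix toward zero.
    cov = (z.T @ z) / (z.size(0) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.size(1)
    return var_loss + cov_loss

# Combined with the JEPA prediction loss from the earlier sketch:
# total = pred_loss + variance_covariance_reg(sx) + variance_covariance_reg(sy)
```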
So that's one set of methods. I really like those methods, and they work really well; I expect to see more of them in the future. But there's another set of methods that, to some extent, has been slightly more successful over the last couple of years, and those are based on distillation. Again you have two encoders, it's still a joint embedding predictive architecture, and they kind of share the same weights, but not really: the encoder on the right gets a version of the weights of the encoder on the left, obtained through an exponential moving average. So basically you force the encoder on the right to change its weights more slowly than the one on the left, and for some reason that prevents collapse. There's some theoretical work on this, in fact one paper that was just finished, but it's a little bit mysterious why it works, and frankly I'm a little uncomfortable with this method. But we have to accept the fact that it actually works, if you're careful. You know, real engineers build things without necessarily knowing why they work; that's good engineering. And the usual joke in France, which everybody here should learn, is that when students who come out of École Polytechnique build something, it doesn't work, but they can tell you why. Sorry about that; I didn't study here, as you can tell.
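The mechanical core of those distillation methods, the exponential-moving-average target encoder, fits in a few lines. This is a generic sketch of the trick; tau and the toy networks are placeholders:

```python
import copy
import torch
import torch.nn as nn

online = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad_(False)        # no gradients flow into the slow encoder

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    # target <- tau * target + (1 - tau) * online, after each optimizer step:
    # the target's weights trail the online encoder's, changing more slowly.
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.mul_(tau).add_(po, alpha=1 - tau)

# Training loop shape: predict target's embedding of y from online's
# embedding of x, backprop only through the online branch, then call
# ema_update(online, target).
```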
Okay, let me skip ahead a little in the interest of time, because we wasted a bit of it. There's a particular way of implementing this idea with distillation, called I-JEPA, and another one called DINO, which I'll mostly skip. DINO is at V2, and people are working on V3; it's a method produced by some of my colleagues at FAIR Paris, a team led by Maxime Oquab. And then there's a slightly different version, V-JEPA, also by FAIR people, in Montreal and Paris mostly. So there's no need for negative samples, and those systems learn generic features that you can then use for any downstream task, and the features are really good. This works really well; I'm not going to bore you with the details because I don't have time.
More recently, we worked on a version of this for video. This is a system that takes a chunk of 16 frames from a video and runs them through an encoder; then it corrupts those 16 frames by masking parts of them, runs the corrupted version through the same encoder, and trains a predictor to predict the representation of the full video from the partially masked or corrupted one. Again, this is a group of researchers at FAIR in Paris and Montreal, and it works really well, in the sense that you learn features you can then feed to a system that classifies actions in videos, and you get really good results with these methods. Again, I'm not going to bore you with the details.
But here is a really interesting thing, a paper that we just submitted. If you show that system videos where something really strange happens, the system is actually capable of telling you: my prediction error is going through the roof; there is something strange going on in that window. You take a video and slide a 16-frame window over it, measuring the prediction error of the system, and if something really strange happens, like an object spontaneously disappearing or changing shape, the prediction error shoots up. What that tells you is that the system, despite its simplicity, has learned some level of common sense: it can tell you when something really strange is happening in the world. There are lots of experiments showing this in various contexts, for various types of intuitive physics, but I'm going to skip to the latest work.
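That sliding-window surprise measurement might look something like this; this is my paraphrase of the procedure, and the encoder and predictor below are trivial stand-ins for the trained video-JEPA modules:

```python
import torch

def surprise_scores(video, encoder, predictor, win=16):
    """video: (T, C, H, W). One prediction-error score per window position."""
    scores = []
    for t in range(video.size(0) - win):
        ctx = encoder(video[t : t + win])             # what the model saw
        actual = encoder(video[t + 1 : t + win + 1])  # what actually happened
        scores.append(((predictor(ctx) - actual) ** 2).mean().item())
    return scores  # a spike suggests a physically implausible event

# Trivial stand-ins so the sketch runs; the real ones are trained networks.
enc = lambda clip: clip.mean(dim=(0, 2, 3))
prd = lambda z: z
print(surprise_scores(torch.randn(64, 3, 32, 32), enc, prd)[:3])
```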
The DINO world model. This is using DINO features, and then training a predictor on top of them which is action-conditioned, so that it's a world model we can use for planning. This is a paper that's on arXiv; there's also a website you can look at, and the URL is at the top here.
So basically, you train a predictor using a picture of the world that you run through a DINO encoder, plus an action that, say, a robot takes. You get the next image from the world, run it through the DINO encoder, and train your predictor to predict what's going to happen given the action that was taken. Very simple. To do planning, you observe an initial state and run it through the DINO encoder; then you run your world model over multiple time steps with imagined actions. You have a target state, represented by a target image, for example; you run it through the encoder, and you compute the distance in representation space between the predicted state and the state representing the target image. Planning then consists in finding, through optimization, a sequence of actions that minimizes that cost, at runtime, at inference time.
You know, people are excited about test-time computation and so on, as if it were something new. This is completely classical in optimal control; it's called model predictive control, and it's been around for about as long as I have. The first papers on planning with models of this type, using optimization, are from the early '60s; the ones that actually learn the model are more recent, from the '70s, from France actually, in a system called IDCOM. Some people in optimal control might know about it. It's a very simple concept, and it works amazingly well. So let me skip to the video.
Okay, so let's say you have this little T shape, and you want to push it into a particular position. You know which position it has to go to because you provide an image of that position, run it through the encoder, and that gives you a target state in representation space. Let me play that video again. At the top, you see what actually happens in the real world when the planned sequence of actions is executed; at the bottom, you see the system's internal mental prediction for the sequence of actions it was planning. That's run through a decoder that produces a pictorial representation of the internal state, but the decoder is trained separately; there's no image generation in the loop.
Let me skip to the more interesting one. Here, the initial state is a bunch of blue chips randomly thrown on the floor, and the target state is at the top. What you see are the actions that result from planning, and the robot accomplishing those actions. The dynamics of this environment are actually fairly complicated, because the blue chips interact with each other; the system has learned all of this just by observing a bunch of (state, action, next state) examples. And this works in a lot of situations: arms, moving through mazes, pushing a T around, and things like that.
Okay, and I'm not sure where I left off; we've applied a similar idea to navigation, but in the interest of time I'll just summarize. These are basically sequences of video where a frame is taken at one time, then the robot moves, and through odometry you know by how much it has moved, which gives you the next frame. So you train a system to predict what the world is going to look like if you take a particular motion action, and then you can tell the system: navigate to that point, and it will do it, avoiding obstacles on the way. This is very new work. But let me go to the conclusion.
I have a number of recommendations. Abandon generative models, the most popular approach today, the one everybody is working on: stop working on it and work on JEPAs, which are not generative models; they predict in representation space. Abandon probabilistic models, because that's intractable; use energy-based models. Mike and I have had something like a 20-year contentious discussion about this. Abandon contrastive methods in favor of those regularized methods. And abandon reinforcement learning, which I've been saying for a long time; we know it's inefficient. Use reinforcement learning only as a last resort, when your model or your cost function is inaccurate. If you are interested in human-level AI, just don't work on LLMs; there's no point. In fact, if you are in academia, don't work on LLMs at all, because you're in competition with hundreds of people with tens of thousands of GPUs; there's nothing you can bring to the table. Do something else. There are a number of problems to solve: training those things with large-scale data; planning algorithms, which are still kind of inefficient, so if you're into optimization and applied math, it's a great area; JEPAs with latent variables; planning under uncertainty; hierarchical planning, which is completely unsolved; learning cost modules, because you probably can't build most of them by hand, you need to learn them; and then there are issues of exploration, and so on.
So, in the future we'll have universal virtual assistants. They'll be with us at all times, and they will mediate all our interactions with the digital world. We cannot afford to have those systems come from a handful of companies on the west coast of the US, or from China, which means the platforms on top of which we build those systems need to be open source and widely available. They are expensive to train, but once you have a foundation model, fine-tuning it for a particular application is relatively cheap, and a lot of people can afford to do it. So the platforms need to be shared. They need to speak all the world's languages, understand all the world's cultures, all the value systems, all the centers of interest. No single entity in the world can train a foundation model of this type; it will probably have to be done in a collaborative or distributed fashion, which again is work for applied mathematicians interested in distributed algorithms for large-scale optimization. So open-source AI platforms are necessary. The danger I see, in Europe and in other places, is that geopolitical rivalry will entice governments to make the release of open-source models illegal, because they are under the impression that a country will stay ahead if it keeps its science secret. That would be a huge mistake: when you do research in secret, you fall behind; that's inevitable. What will happen is that the rest of the world will go up and overtake you. That's currently what's happening: the open-source models are slowly but surely overtaking the proprietary ones.