0:02 I'd like to welcome our second and
0:07 final plenary to the stage. Up
0:10 next is Yann LeCun. He's the chief AI
0:14 scientist at Meta and a professor at
0:17 NYU. Now, Yann was the founding director of
0:20 Meta AI and of the NYU Center, I should
0:23 say, for Data Science. He works
0:25 primarily in a number of fields: machine
0:27 learning, computer vision, mobile
0:31 robotics, and computational neuroscience.
0:35 In 2019, Yann won the prestigious ACM
0:37 Turing Award for his work on AI, and
0:40 he's of course a member of the US
0:43 National Academies and the French Académie des
0:45 sciences. A warm welcome to you, Yann, good to
0:46 have you. [Applause]
0:54 Thank you very much, a real pleasure
0:57 to be here. Last time must have been
1:01 before COVID or something.
1:03 Okay, there's going to be some
1:05 connection, a little bit, with what
1:08 Bernard just talked about, and what
1:10 I'm going to talk about is all the stuff
1:13 that Michael Jordan earlier today told you
1:21 on. So, as a matter of fact, we do need
1:22 human level
1:25 AI, and it's not just because it's an
1:27 interesting scientific question; it's
1:30 also sort of a product need. We are
1:34 going to be wearing smart devices
1:36 like smart glasses and things of that
1:40 type in the future, and in those smart
1:44 devices we'll be able to access AI
1:46 assistants that will be with us at all
1:48 times, and we'll be interacting with them
1:51 either through voice or through
1:56 electromyography (EMG). The
1:58 glasses will eventually have displays,
2:01 although currently they don't.
2:05 And we need those systems to have
2:07 human level intelligence, because that's
2:10 what we're the most familiar with
2:12 interacting with: we're familiar with
2:15 interacting with other humans, we are
2:17 familiar with the level of intelligence
2:21 that we expect in a human, and it
2:24 would be easier to
2:26 interact with systems that have
2:27 similar forms of
2:30 intelligence. So those
2:32 ubiquitous assistants are going to
2:34 mediate all of our interactions with the
2:38 digital world, and that's why
2:40 we need them to be easy
2:43 to use for a wide population that is not
2:46 necessarily familiar with using
2:48 technology. Okay, but the problem is,
2:52 machine learning sucks compared to what
2:54 we observe in humans and animals; we
2:56 don't really have the techniques that
3:00 would allow us to build machines that
3:03 have the same type of
3:07 learning abilities and common sense
3:10 and understanding of the physical world
3:13 So animals and humans have
3:15 background knowledge that allows them to
3:19 learn new tasks extremely quickly,
3:22 understand how the world works, be
3:25 able to reason and plan, and that's based
3:27 on what we call common sense. It's not a
3:30 very well-defined concept, and
3:33 our behavior and the behaviors of animals
3:36 are driven by objectives,
3:38 essentially.
3:43 So I'm going to argue that the type of
3:45 AI systems that we have at the
3:48 moment, or that almost everybody is
3:52 playing with, do not have the right
3:55 characteristics for what we
3:57 want.
4:01 And the reason is, they basically
4:05 produce one token after the other,
4:07 autoregressively. So you have a
4:10 sequence of tokens, which are subword units,
4:11 but it doesn't matter what they are, a
4:14 sequence of symbols, and then you have a
4:16 predictor that is repeated over the
4:18 sequence, that basically takes a
4:20 window of previous tokens and predicts
4:22 the next
4:24 token. And the way you train those
4:26 systems is that you put the sequence at
4:28 the input, and I really apologize
4:31 for this I'm going to perhaps
4:33 change the
4:36 resolution of the
4:38 screen so
5:30 hopefully all right um so
5:33 the way those things are trained
5:35 is you take a sequence and you basically
5:36 train the system to just reproduce its
5:38 input on its output, and because it has a
5:41 causal structure, it cannot cheat and
5:44 use a particular input to predict itself;
5:45 it has to only look at the symbols that
5:46 are to the left of it. That's called a causal
5:51 architecture. So that's very efficient;
5:53 this is, you know, what people call
5:55 a GPT, a general-purpose transformer, but
5:56 you don't have to put transformers in it,
6:00 this could be anything, it's just a causal
6:03 architecture. And I'm afraid I haven't
6:06 fixed the flashing. Anyway, once you've
6:08 trained those systems,
6:10 you can use them to generate text
6:12 by just autoregressively producing a
6:14 token, shifting it into the input, and
6:16 then producing the second token, shifting
6:19 that in, etc. That's autoregressive prediction,
6:22 not a new concept at all, obviously.
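The loop just described can be sketched in a few lines of Python (my own toy illustration, not code from the talk; the `predictor` below is an arbitrary stand-in rule, not a trained transformer):

```python
# Toy sketch of autoregressive generation: a predictor looks at a
# window of previous tokens, emits the next token, and that token is
# shifted into the input for the next step.

def predictor(window):
    # Stand-in for a trained model: an arbitrary deterministic rule.
    return (sum(window) + 1) % 5

def generate(prompt, n_tokens, window_size=3):
    seq = list(prompt)
    for _ in range(n_tokens):
        window = seq[-window_size:]      # only look at tokens to the left
        seq.append(predictor(window))    # shift the new token in
    return seq

print(generate([1, 2], 5))  # → [1, 2, 4, 3, 0, 3, 2]
```

In a real LLM the predictor is a transformer trained with the causal masking described above, but the generation loop itself is the same.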
6:24 there's an issue with this which is that
6:26 um the
6:30 U the that process is basically
6:32 Divergent every time you produce a token
6:34 there is some chance that the token is
6:37 not within the set of reasonable answers
6:39 and take you outside a set of reasonable
6:41 answers and if it does that there is no
6:44 way to fix it afterwards um and if there
6:45 is if you assume there is some
6:48 probability for that you know wrong
6:50 token uh for wrong tokens to be
6:52 generated and the errors are independent
6:54 which of course they're not um then you
6:57 get exponential Divergence uh which is
6:59 why you know we have with those models hallucination
7:01 hallucination
7:04 issues um but we're missing something
7:06 really big, because never
7:07 mind trying to reproduce human
7:09 intelligence, we can't even reproduce cat
7:11 intelligence or rat intelligence, let
7:13 alone dog intelligence. They can do
7:14 amazing feats; they understand the
7:18 physical world. You know, any house
7:21 cat can plan very highly complex
7:24 actions, and they have causal models
7:26 of the world; some of them know how to
7:29 open doors and taps and things of
7:32 that type. And in humans, you know, a
7:35 10-year-old can clear up the dinner
7:37 table and fill up the dishwasher without
7:38 training, zero-shot, the first time you
7:40 ask a 10-year-old to do it; yeah, she
7:43 will do it. Any 17-year-old can learn to
7:45 drive a car in 20 hours of practice. But
7:47 we still don't have robots that can act
7:50 like a cat, we don't have domestic robots
7:51 that can clear up the dinner table, and
7:54 we don't have level-five self-driving
7:56 cars, despite the fact that we have
7:58 hundreds of thousands, if not millions, of
8:01 hours of supervised training data. Okay, so
8:03 that tells you we're missing something really
8:06 big. Yet we have systems that can pass
8:09 the bar exam, do math problems, prove
8:13 theorems,
8:13 but no domestic robots. So we keep
8:15 bumping into this paradox, called Moravec's
8:17 paradox: things that we take for
8:20 granted, because humans and animals
8:21 can do them, we think are not complicated,
8:23 when they're actually very complicated; and the
8:25 stuff that we think is uniquely human,
8:26 like manipulating and generating
8:28 language, playing chess, playing go,
8:30 playing poker,
8:33 producing poetry and this kind of stuff,
8:35 turns out to be
8:37 relatively easy. And perhaps the reason
8:40 for this is this very simple calculation.
8:42 A typical LLM nowadays is trained on
8:46 the order of 30 trillion tokens, 3 × 10^13
8:48 tokens.
8:51 That's 2 × 10^13 words,
8:54 roughly; each token is about three bytes,
8:57 so the data volume is roughly 10^14
8:59 bytes.
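That arithmetic can be checked in a few lines (the token and byte counts are the figures from the talk; the reading speed is a hypothetical assumption of mine):

```python
# Quick sanity check of the data-volume figures.
tokens = 30e12               # ~30 trillion training tokens
bytes_per_token = 3
print(f"data volume ~ {tokens * bytes_per_token:.0e} bytes")  # ~1e14 bytes

words = 2e13                 # roughly two-thirds of a word per token
words_per_minute = 200       # hypothetical adult reading speed
years = words / words_per_minute / 60 / 24 / 365
print(f"reading time ~ {years:,.0f} years")  # a few hundred thousand years
```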
9:01 It would take any of us almost
9:04 half a million years to read through all
9:06 that material; it's basically all the
9:07 publicly available text on the
9:11 internet. Now consider a human child: a
9:13 four-year-old has been awake a total of
9:16 16,000 hours, which, by the way, is only 30
9:18 minutes of YouTube
9:20 uploads. We have 2 million optic
9:23 nerve fibers, each of which carries about
9:25 1 byte per second, maybe a bit less, but it
9:28 doesn't matter. So the data volume is
9:31 about 10^14 bytes in four years: a
9:34 four-year-old child has seen as much
9:37 data as the biggest LLM, in the form of
9:40 visual perception; and for blind children
9:43 it's touch, it's the same kind of
9:47 bandwidth. That tells you a number of
9:49 things: we're never going to get to human
9:50 level intelligence by just training on
9:53 text. It's just not
9:56 happening, despite what, you know, some
9:58 people who have a vested interest in
9:59 this happening are telling us; we're not
10:01 going to reach, you know, PhD-level
10:03 intelligence by next year, it's just not
10:05 happening. We might have PhD level in
10:11 some subfield, in some area, some
10:13 problems like chess playing, you know,
10:17 and more of them, as long as we train
10:19 those systems specifically for those
10:23 problems. As Bernard was explaining
10:26 with the visual illusions, there are a
10:27 lot of problems of this type: when you
10:29 formulate a problem, you pose a problem to
10:32 an LLM, and if the problem is kind of a
10:34 standard puzzle, the answer will be
10:36 regurgitated in just a few seconds; if
10:38 you change the statement of the problem
10:40 a little bit, the system will still
10:41 produce the same answer that it had
10:43 before, because it has no real mental
10:46 model of what goes on in the
10:52 puzzle. So how do human infants learn
10:55 how the world works? You know, infants
10:56 accumulate a huge amount of background
10:58 knowledge about the world in the first
11:00 few months of life:
11:04 notions like object permanence,
11:07 solidity, rigidity, natural categories of
11:09 objects. Before children understand
11:11 language, they do understand the
11:12 difference between a table and a
11:15 chair; that kind of develops
11:18 naturally. And they understand intuitive
11:20 physics notions, like gravity, inertia, and
11:22 things of that type, around the age of nine
11:26 months. So it takes a long time,
11:29 observation mostly until four months,
11:31 because babies don't really have any
11:33 influence on the world before
11:39 that, and then through interactions;
11:40 but the amount of interaction that's
11:42 required is astonishingly
11:44 small.
11:49 So if we want AI systems that can
11:51 eventually reach human level, and it might
11:54 take a while, we call this advanced
11:56 machine intelligence at Meta. We don't
11:58 like the term AGI, artificial general
11:59 intelligence, the reason being that
12:01 human intelligence is actually quite
12:04 specialized, and so calling it AGI is
12:05 kind of a
12:08 misnomer. So we call this AMI; we
12:10 actually pronounce it "ami," which means
12:14 friend in French. So we need systems
12:16 that learn world models from sensory
12:18 input, basically mental models of how the
12:20 world works that you can manipulate in
12:23 your mind, learning intuitive physics from
12:25 video, let's say; systems that have
12:25 video let's say systems that have
12:28 persistent memory systems that can plan
12:30 actions uh possibly
12:32 hierarchically so as to fulfill an
12:34 objective and systems that can
12:36 reason um and then systems that are
12:40 controllable and safe by Design not by
12:42 fine-tuning which is the the case for
12:45 llms now the only way I know to build
12:47 systems of this type is to change the
12:52 type of of inference um that um current
12:55 uh AI systems perform so right now the
12:59 way an llm uh performs inference is by
13:01 running through a fixed number of layers
13:03 of anet a transformer then producing a
13:05 token injecting that token on the input
13:06 and then running through a fixed number
13:08 of layers again and the problem with
13:11 this is that if you ask a simple
13:13 question or complex question and you ask
13:16 the system to answer by yes or no like
13:20 does 2 and two equal four yes or no or
13:22 does p equal NP yes or no it's going to
13:24 spend the exact same amount of
13:25 computation to answer those two
13:27 questions. So people have been kind of
13:29 cheating, telling the system
13:31 to explain, you know, the chain-of-thought
13:33 trick: you basically have the
13:35 system produce more tokens, so it is
13:36 going to spend more computation
13:37 answering the question, but that's kind
13:42 of a hack. The way a lot of inference
13:43 works, in statistics for example (that's going
13:46 to make Mike happy, actually), the way
13:49 inference works in
13:52 classical AI, in statistics, in
13:54 structured prediction, a lot of different
13:57 domains, the way it works is that you
13:59 have a function that measures the degree
14:01 of compatibility or incompatibility
14:03 between your observation and a proposed
14:05 output, and then the inference process
14:08 consists in finding the value of an
14:10 output that minimizes this
14:13 incompatibility measure. Okay, let's call
14:14 it an energy function. So you have an
14:17 energy function, represented by the
14:20 square box here on the right, when it
14:24 doesn't disappear, and the system
14:27 just performs optimization for doing
14:29 inference. Now, if the inference
14:31 problem is more difficult, the system
14:32 will just spend more time performing
14:34 inference; in other words, it will think
14:37 about complex problems for longer than
14:39 simple ones, for which the answer is pretty
14:42 obvious. And this is really a very
14:44 classical thing to do:
14:47 classical AI is all about reasoning and
14:50 search, and therefore optimization;
14:53 pretty much any computational problem
14:55 can be reduced to an optimization problem,
14:57 essentially, or a search problem. It's
15:00 also very classical in probabilistic
15:01 modeling, like probabilistic graphical
15:05 models and things of that type. So this
15:07 type of inference would be more akin to
15:10 what psychologists call system two, in the
15:13 sort of human mind, if you want. System
15:18 two is when you think about what action
15:19 or sequence of actions you're going to
15:22 take before you take them; you
15:23 think about something before doing it.
15:25 And system one is when you can do
15:26 the thing without thinking about it; you
15:29 know, it becomes sort of subconscious. So
15:32 LLMs are system one; what I'm proposing
15:35 is system two. And then the
15:38 appropriate sort of semi-theoretical
15:41 framework to explain this is energy-based
15:43 models, which I'm not going to have
15:45 time to get into in too much detail, but
15:46 basically you capture the dependency
15:49 between variables, let's say observations
15:53 X and outputs Y, through an energy
15:56 function that takes low values when
15:58 X and Y are compatible, and larger
16:00 values when X and Y are not compatible.
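As a minimal sketch of that idea (my own toy with a made-up one-dimensional energy, not anything from the talk): define an energy E(x, y) that is low when y is compatible with x, and implement inference as a search for the y that minimizes it.

```python
import numpy as np

def energy(x, y):
    # Hypothetical energy: y is "compatible" with x when y*y is close
    # to x, so low energy means y is near the square root of x.
    return (y * y - x) ** 2

def infer(x, candidates):
    # Inference = optimization: find the candidate y with the lowest
    # energy for this x, instead of computing y from x directly.
    return min(candidates, key=lambda y: energy(x, y))

ys = np.linspace(0.0, 5.0, 5001)   # candidate outputs
print(float(infer(9.0, ys)))       # ≈ 3.0, since 3 * 3 = 9
```

A harder inference problem simply means more search over y: the same machinery spends more compute on questions whose energy landscape is harder to minimize.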
16:02 You don't want to just compute Y from X;
16:05 as we just saw, you just want an energy
16:07 function that measures the degree of
16:09 incompatibility, and then, given
16:12 an X, find a Y that has low energy for that
16:18 X. Okay, so now let's go a little bit into
16:20 the details of how this type of
16:23 architecture can be built, and
16:27 how it kind of relates to
16:29 thinking or planning.
16:32 So a system would look like this:
16:34 you get an observation from the world;
16:35 it goes through a perception module that
16:38 produces an estimate of the state of
16:40 the world. But of course the state of the
16:41 world is not completely observable, so
16:43 you may have to combine this with
16:46 the content of a memory that
16:47 contains your idea of
16:49 the state of the world that you don't
16:53 currently perceive. And the combination of those
16:56 two goes into a world model. So what is a
16:59 world model? A world model is: given a
17:01 current estimate of the state of the
17:04 world, which is in an abstract
17:07 representation space, and given an action
17:09 sequence that you imagine
17:13 taking, your world model predicts the
17:15 resulting state of the world that
17:18 will occur after you take that
17:20 sequence of actions. Okay, that's what a
17:22 world model is. If I tell you: imagine a
17:25 cube floating in the air in front of you;
17:27 okay, now rotate this cube by 90 degrees
17:29 around a vertical axis;
17:31 what does it look like? It's very easy
17:33 for you to kind of use this mental model.
18:36 hopefully all
18:39 right, let's hope this will be more
18:41 stable.
18:46 Okay, 50 hertz, not 60
18:51 hertz. Okay, so what you can do now is
18:53 feed... okay, hang on.
19:14 Okay, this doesn't look like it was a good...
19:50 Nice. Okay, I think we're going to have
19:52 human level intelligence before we have
19:59 ...works. Okay, so if we have this
20:03 world model, which is able to predict the
20:05 result of a sequence of
20:08 actions, we can feed it to an
20:10 objective, which is a task objective that
20:12 measures to what extent the predicted
20:15 final state satisfies a goal that we
20:18 set for ourselves; it's just a cost
20:20 function. And we can also set some
20:23 guardrail objectives; think of them as
20:25 constraints that need to be satisfied
20:28 for the system to behave in a safe
20:30 manner. So those guardrails will be
20:33 explicitly implemented, and the way the
20:35 system proceeds is by optimization: it's
20:37 looking for an action sequence that
20:41 minimizes the task objective and the
20:44 guardrail objectives, at runtime, okay?
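That runtime optimization can be sketched as a tiny model-predictive-control-style loop (a toy of my own, assuming one-dimensional states and hand-written costs; nothing here is Meta's code): roll the world model forward over candidate action sequences and keep the one that minimizes the task cost plus the guardrail cost.

```python
import itertools

def world_model(state, action):
    return state + action                 # hypothetical dynamics

def task_cost(state, goal=10.0):
    return (state - goal) ** 2            # distance of the final state to the goal

def guardrail_cost(action, limit=3.0):
    return 1e6 if abs(action) > limit else 0.0   # hard safety penalty

def plan(state, horizon=4, choices=(-3, -1, 0, 1, 3)):
    best, best_cost = None, float("inf")
    for seq in itertools.product(choices, repeat=horizon):
        s, cost = state, 0.0
        for a in seq:                     # roll the world model forward
            cost += guardrail_cost(a)
            s = world_model(s, a)
        cost += task_cost(s)              # score the predicted end state
        if cost < best_cost:
            best, best_cost = seq, cost
    return best

print(plan(0.0))  # a sequence of safe actions reaching the goal
```

The exhaustive search is just for clarity; a gradient-based or sampling-based optimizer over the action sequence would play the same role.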
20:45 We're not talking about learning here,
20:47 we're just talking about
20:50 inference. And that will guarantee the
20:52 safety of the system, because the
20:53 guardrails guarantee safety, and there
20:55 is no way you can jailbreak that system
20:58 by giving it a prompt that will, you know,
20:59 have it escape its guardrail
21:01 objectives; the guardrail objectives would be just
21:05 hardwired. They might be trained, but
21:08 hardwired. Now, a sequence of actions
21:10 should probably use a single world model
21:13 that you use repeatedly over
21:15 multiple time steps. Okay, so you have
21:17 one model: you feed it the first action, it
21:18 predicts the next state; then the second
21:20 action, and it predicts the second next state.
21:23 You can have guardrail costs and
21:26 task objectives along the
21:29 trajectory, without specifying what
21:31 optimization algorithm we use; it
21:32 doesn't really matter for the discussion
21:36 that we have. If the world happens not
21:37 to be completely deterministic and
21:40 predictable, the world model may need to
21:42 have latent variables to account for all
21:43 the things about the world that we do
21:47 not observe and that, you know, make
21:50 our predictions basically inexact. And
21:52 ultimately, what we want is a system that
21:54 can plan hierarchically, so something
21:56 that may have several levels of
22:00 abstraction, in such a way that at the
22:02 low level we plan low-level actions,
22:04 basically muscle control, but at a high
22:08 level we can plan abstract macro-actions,
22:10 where the world model predicts at longer
22:12 time steps but in a representation space
22:14 that is more abstract and therefore
22:16 contains fewer details. So if I'm
22:19 sitting in my office at NYU and I
22:22 decide to go to Paris, I can decompose
22:24 that task into two subtasks: go to the
22:25 airport and catch a
22:27 plane. Okay, now I have a subgoal: going
22:29 to the airport.
22:31 I'm in New York City, so going to the
22:33 airport consists in going down to the
22:35 street and hailing a taxi. How do I go down
22:38 to the street? Well, I need to get to
22:41 the elevator, push the button, go down, go
22:42 out of the building. How do I go to the
22:44 elevator? Well, I need to stand up from my
22:46 chair, pick up my bag, open the door, walk
22:49 to the elevator, avoid all the obstacles,
22:50 and then at some point I get to a level
22:52 where I don't need to plan, I can just
22:56 take the actions. But we do
22:57 this type of hierarchical planning
23:00 absolutely all the time, and I tell you,
23:01 we have no idea how to do this with learning
23:05 machines. Almost every robot does
23:07 hierarchical planning, but the
23:09 representations at every level of the
23:11 hierarchy are
23:14 handcrafted. What we need is to train an
23:15 architecture, perhaps of the type that
23:18 I'm describing here, so that it can learn
23:20 abstract representations, not just
23:23 of the state of the world but also
23:24 predictions, world models that predict
23:27 what's going to happen, but also abstract
23:29 actions at several levels of abstraction, so we
23:31 can do this hierarchical planning.
23:34 Animals do this;
23:38 humans do this very well. We're
23:41 completely incapable of doing this with machines
23:44 today. If you're starting a PhD, great
23:54 years.
23:54 So, with all those reflections, about
23:56 3 years ago I wrote a long paper where I
23:58 kind of explained where I
24:01 think AI research should be focusing.
24:03 This was before the whole ChatGPT
24:05 craze; I haven't changed my mind about
24:07 this, ChatGPT hasn't changed anything, we
24:10 were working on LLMs before that, so we knew what
24:12 was coming anyway. This is the paper,
24:14 "A Path Towards Autonomous Machine
24:16 Intelligence," which we now call advanced
24:18 machine intelligence, because "autonomous"
24:20 just scares people. It's on
24:22 OpenReview, it's not on
24:24 arXiv, and there are various versions of
24:26 this talk that I've given in various
24:28 ways. Okay, so a very natural idea for
24:30 getting systems to understand how the
24:35 world works is using the same process that
24:37 we used to
24:40 train systems for natural
24:41 language, and applying this to, let's say,
24:44 video. Okay, if a system is capable of
24:45 predicting what's going to happen in a
24:47 video, you show it a short segment of
24:49 video and you ask it to predict what's
24:50 going to happen next, presumably it would
24:54 have understood the underlying structure
24:57 of the world, and so training it to
24:59 make that prediction might actually
25:00 cause the system to understand the
25:02 underlying structure of the
25:05 world. It works for
25:07 text, because predicting words is
25:10 relatively simple. Why is predicting
25:12 words simple? Because there is
25:14 only a finite number of possible words,
25:16 certainly a finite number of possible
25:18 tokens, and so we can't predict exactly
25:21 which word will follow another word, or
25:23 what word is missing in the text,
25:24 but we can produce a probability
25:26 distribution, or score, for every possible
25:29 word in the dictionary. We cannot do this
25:33 for images or video frames; we do not
25:34 have good ways of representing
25:35 distributions over video
25:39 frames. Every attempt to do this
25:41 basically bumps into mathematical
25:46 intractabilities. And so you could try to get
25:48 around the problem using, you know,
25:50 statistics and the math that was
25:53 invented by physicists, you know, variational
25:56 inference and all that stuff, but in fact
25:57 it's better to just throw away the
25:59 entire idea of doing probabilistic
26:01 modeling and just say: I just want
26:03 to learn this energy function that tells
26:05 me whether my output is compatible with
26:07 my input, and I don't care if this energy
26:10 function is the negative log of some
26:12 distribution. And the reason we
26:13 need to do this, of course, is because we
26:15 cannot predict exactly what's going to
26:17 happen in the world; there is a whole set
26:19 of possible things that may happen, and
26:21 if we train a system to just predict one
26:24 frame, it's not going to do a good job.
26:26 So the solution to that problem is
26:28 a new architecture I call the joint embedding
26:30 predictive architecture, or
26:32 JEPA. And that's because generative
26:36 architectures simply do not work for
26:39 producing videos. You may have seen video
26:41 generation systems that produce pretty
26:43 amazing stuff; there are a lot of hacks
26:45 behind them,
26:47 and they don't really understand
26:49 physics. They don't need to; they just
26:51 need to predict pretty pictures, they
26:52 don't need to actually have any kind of
26:54 accurate model of the world.
26:57 Here's what the JEPA is. The idea is that
27:00 you run both the observation and the
27:03 output, which is the next observation,
27:06 into an encoder, so that the prediction
27:10 does not consist in predicting pixels
27:12 but basically in predicting an abstract
27:14 representation of what goes on in the
27:18 video, or anything. Okay, so let's
27:20 compare those two architectures. On the
27:22 left you have generative
27:25 architectures: you run X, the observation,
27:26 through an encoder, and perhaps a predictor
27:29 or decoder, and you make a prediction for
27:32 Y; okay, that's straightforward
27:34 prediction. And then on the right, the
27:36 JEPA architecture: you run both X and Y
27:38 through encoders, which may be identical or
27:42 different, and then you predict the
27:43 representation of Y from the
27:45 representation of X, in this abstract
27:48 space. What this will cause the system to do is
27:51 basically learn an encoder that
27:53 eliminates all the stuff you cannot
27:55 predict, and this is really what we do.
28:00 observe the left part of this room here
28:02 and I kind of pan the camera towards the
28:04 right there's no way any video
28:06 prediction system including humans can
28:08 predict what every one of you looks like
28:10 or predict the texture on the wall or
28:13 the texture of the wood U on the on the
28:15 hardwood floor um there's a lot of
28:17 things that we just simply cannot
28:19 predict and so instead of insisting that
28:21 we should make a probabilistic
28:23 prediction about stuff that we cannot
28:26 predict let's just not predict it learn
28:28 a representation in which all of those
28:30 details are essentially eliminated so
28:32 that the prediction is much simpler it
28:36 may still we need to be uh non-
28:38 deterministic but at least we simplify
28:40 the problem so there's various flavors
28:42 of those JEPAs, which I'm not going to go
28:44 into: some of which have latent variables,
28:46 some of which are action-conditioned.
28:47 So I'm going to talk about
28:50 the action-conditioned ones, because that's
28:51 the most interesting case, because
28:53 they really are world models. So
28:55 you have an encoder; X is the current state
28:58 of the world, or the current observation;
29:00 you feed an action,
29:02 which you imagine taking, into a predictor,
29:04 and the predictor, which
29:06 is a world model, predicts the
29:08 representation of the next state of the
29:11 world. And that's how you can do
29:14 planning. Okay, so we need
29:15 to train those systems, and we need to
29:16 figure out how to train those JEPA
29:18 architectures, and it turns out not to be
29:22 completely trivial, because you need to
29:24 train the cost function in this JEPA
29:27 architecture that measures the
29:29 divergence between the representation of
29:31 Y and the predicted representation of Y.
29:35 We need this to be low on the training
29:37 data, but we also need it to be large
29:40 outside the training set. Okay, so this is,
29:42 you know, this kind of energy function
29:45 here that has kind of contours of
29:48 equal energy; we need to make sure
29:50 the energy is high outside of the
29:53 manifold of data, and I only know two
29:55 classes of methods for this. One set of
29:57 methods is called contrastive: it consists
30:00 in having data points, which are
30:03 those dark blue dots, pushing
30:05 down the energy of those, and then
30:07 generating, you know, those flashing green
30:09 dots and pushing their energy up. The
30:11 problem with this type of method, contrastive
30:13 methods, is that they don't scale very
30:15 well in high dimension: if you have too
30:17 many dimensions in your space of Y,
30:18 you're going to need to push up in lots
30:22 of different places, and it doesn't
30:23 work so well; you need a lot of
30:26 contrastive samples for this to work.
30:28 There's another set of methods
30:29 called regularized methods, and what they
30:32 do is they use a regularizer on the
30:36 energy, so as to minimize the volume of
30:39 space that can take low energy. Okay, so
30:40 that leads to two
30:42 different types of learning procedure:
30:44 one learning procedure which is
30:45 contrastive, where you need to generate those
30:47 contrastive points and then push their
30:49 energy up through some loss function, and the
30:51 other one has some regularizer that is
30:54 going to sort of shrink-wrap the
30:57 manifold of data, so as to make sure
30:59 that the energy is high outside. So
31:00 there's a number of techniques to do
31:03 this. I'll describe just a
31:06 handful, and the way we started
31:10 testing them several years ago, maybe
31:15 five, six years ago, was to train them
31:17 to learn representations of images: so
31:20 you take one image, you corrupt it or
31:22 transform it in some way, and you run
31:24 the original image and the corrupted
31:27 version through identical encoders, and you
31:28 train a predictor to predict the
31:29 representation of the original image
31:32 from the corrupted one. Once you're done
31:35 training the system, you remove the
31:37 predictor and you use the representation
31:39 at the output of the encoder as input to
31:43 something simple, like a linear classifier or
31:44 something of that type, that you train
31:47 supervised, so as to verify that the
31:49 representations that are learned are
31:50 good. And this idea is very old: it goes
31:54 back to the 1990s, and things like
31:57 what we used to call Siamese networks, and some
31:58 more recent work on those joint
32:00 embedding architectures; and then adding
32:02 the predictor is more
32:06 recent. SimCLR, which is from
32:08 Google, is a contrastive method derived
32:10 from Siamese
32:12 nets, but again the dimension is
32:16 restricted so the regularized method uh
32:18 worked the following way you try to
32:20 estimate have some sort of estimate of
32:22 the information content coming out of
32:25 the encoders and what you need to do is
32:27 prevent the encoder from collapsing this
32:30 a trivial solution of training a a
32:32 Jeeter architecture where the encoder
32:34 basically ignores the input produces a
32:35 constant output and another the
32:37 prodction error is zero all the time
32:40 okay and obviously that's a collapsed
32:42 solution that is uh not interesting so
32:43 you need a system you need to prevent
32:47 the system from collapsing and which is
32:48 the regularization method I was talking
32:50 about earlier and an indirect way of
32:53 doing this is maintain the information
32:56 content coming out of the
32:58 encoder Okay so so you're going to have
33:01 a training objective function which is a
33:03 negative information content if you want
33:05 because we minimize in machine learning
33:06 we don't
33:09 maximize uh one way to do this is to
33:12 basically take the
33:15 um vectors representation vectors that
33:18 come out of the encoder over a batch of
33:21 samples um and make sure they contain
33:23 information how can you do this you
33:26 can take that matrix of representation
33:29 vectors and compute the product of that
33:32 matrix by its transpose you get a covariance
33:34 matrix and you try to make that covariance
33:36 matrix equal to the
33:39 identity um
33:41 so there's bad news with this which is
33:43 that this
33:45 basically approximates the information
33:47 content by making very strong
33:50 assumptions about the nature of the
33:51 dependencies between the variables and
33:53 in fact it's an upper bound on the
33:55 information content and we're pushing it
33:57 up crossing our fingers that the actual
33:59 information content which is below is
34:03 going to follow okay so it's slightly
34:07 irregular theoretically but it
34:10 works all right so again uh you have a
34:12 matrix coming out of your encoder it's
34:15 got a number of samples um and each
34:17 vector is a separate variable what we're
34:19 going to try to do is
34:23 make each variable individually
34:24 informative so we're going to try to
34:27 prevent the variance of the variables
34:29 from going to zero force it to be one
34:31 for example and then we're going to
34:33 decorrelate the variables with each
34:34 other and that means computing the
34:37 covariance matrix of this matrix its
34:39 transpose multiplied by itself and then
34:42 trying to make the resulting covariance matrix as
34:45 close to the identity matrix as
34:49 possible um there are other methods that
34:53 try to make the samples orthogonal
34:56 not the variables um and those
34:58 are sample-contrastive
35:00 methods um but they don't work in high
35:02 dimension and they require large
35:05 batches uh so we have a method of
35:07 this type called VICReg which means
35:08 variance-invariance-covariance
35:10 regularization and it's got particular
35:12 loss functions for this covariance matrix um
35:14 there have been kind of similar methods
35:18 proposed by Yi Ma and his team called
35:21 MCR squared and then another method by
35:25 some colleagues from NYU called
35:28 MMCR from neuroscience
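The variance and covariance regularization just described can be sketched in a few lines of NumPy. This is a simplified illustration, not the actual VICReg loss (which also has an invariance term between two embeddings and tunable weights); the function name and constants here are invented for the sketch.

```python
import numpy as np

def vicreg_style_regularizer(z, eps=1e-4):
    """Toy variance/covariance penalty on a batch of representations.

    z: (batch, dim) matrix of embeddings from the encoder.
    Small when every dimension has variance around 1 and the
    dimensions are decorrelated (covariance matrix close to identity).
    """
    n, d = z.shape
    z = z - z.mean(axis=0)                    # center each variable

    # Variance term: hinge pushing each dimension's std toward >= 1,
    # which prevents collapse to a constant output.
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, 1.0 - std))

    # Covariance term: penalize off-diagonal entries of the
    # covariance matrix z^T z / (n - 1), i.e. decorrelate variables.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / d

    return var_loss + cov_loss
```

A collapsed encoder (constant output) is heavily penalized, while decorrelated unit-variance embeddings score near zero.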
35:30 so that's one set of methods and I
35:31 really like those methods I think
35:33 they work really well and I expect to
35:35 see more of them in the future but there
35:37 is another set of methods that to some
35:39 extent has been slightly more successful
35:41 over the last couple years and those are
35:43 based on distillation so again you have
35:45 two encoders it's still a joint embedding
35:46 predictive architecture you have two
35:48 encoders they kind of share the same
35:50 weights but not really so the encoder on
35:53 the right uh gets a version of the
35:55 weights of the encoder on the left that are
36:00 obtained through an exponential moving
36:02 average okay a moving average so
36:05 basically you force the encoder on the
36:07 right to uh change its weights more
36:09 slowly than the one on the left and for
36:12 some reason that prevents collapse
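The exponential-moving-average trick can be sketched as follows; `ema_update` is an illustrative name, and real implementations apply an update like this per tensor after every optimizer step.

```python
import numpy as np

def ema_update(target_weights, online_weights, decay=0.996):
    """Update the slow 'target' encoder as an exponential moving
    average of the 'online' encoder trained by gradient descent.
    The target thus changes its weights more slowly than the online
    encoder, which is what empirically prevents collapse."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_weights, online_weights)]
```

With `decay` close to 1, the target trails the online encoder slowly; `decay=0` would make the two encoders identical and collapse-prone again.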
36:14 there's some theoretical work on this um
36:16 in fact one that was just
36:18 finished um but it's a little
36:21 bit mysterious why this works and
36:22 frankly I'm a little uncomfortable with
36:25 this method but we have
36:27 to accept the fact that it actually
36:30 works if you're
36:34 careful um you know real engineers
36:36 build things without necessarily knowing
36:39 why they work that's good
36:41 engineering and then the usual joke in
36:43 France that everybody here should should
36:46 learn is that students that come out of
36:48 École Polytechnique when they build
36:49 something it doesn't work but they can
36:51 tell you
36:54 why sorry about that
36:59 um I didn't study here you can tell um
37:02 okay let me uh skip ahead a
37:04 little bit in the interest of time because
37:07 we wasted a bit of time um okay so
37:09 there's a particular way of implementing
37:11 this idea of distillation called I-JEPA there's
37:15 another one called DINO or DINOv2
37:18 uh which I skipped a little bit um and
37:22 um so DINO is at V2 people are working
37:25 on V3 this is a method produced by
37:27 some of my colleagues at FAIR
37:28 Paris
37:32 um a team led by Maxime Oquab um and
37:34 then a slightly different version um
37:36 called
37:40 I-JEPA by also FAIR people in Montreal
37:44 and Paris mostly so no need for negative
37:46 samples there and those kinds of
37:48 systems learn generic features
37:50 that you can then use for any
37:51 downstream task and the features are
37:54 really good um so this works really well
37:55 I'm not going to bore you with details
37:57 because I don't have time uh more
37:58 recently we worked on a version of this
38:00 for video so this is a system that takes
38:03 a chunk of 16 frames from a video you
38:05 take those 16 frames
38:06 run them through an encoder and then you
38:08 corrupt those 16 frames by masking some
38:11 parts run them through the same encoder
38:13 and then train a predictor to predict
38:15 the representation of the full video
38:18 from the one that is partially masked or
38:22 corrupted um so again this
38:25 is a group of researchers at FAIR in
38:27 Paris and Montreal
38:28 um and this works really well in the
38:30 sense that uh you learn features that
38:33 you can then feed to a system that can
38:35 classify actions in videos and you get
38:37 really good results with
38:39 these methods again I'm not going
38:40 to bore you with details but here is a
38:42 really interesting thing this is a paper
38:45 that we just submitted um if you show
38:46 that
38:50 system videos where something really
38:52 strange
38:54 happens that system actually is capable
38:55 of telling you my prediction error is
38:57 going through the roof there is
38:58 something strange going on in that
39:00 window so you take a
39:03 video and you take a 16-frame
39:05 window you slide it over the video and
39:08 you measure the prediction error of the
39:10 system and if something really strange
39:12 happens like an object spontaneously
39:13 disappearing or changing
39:17 shape um the prediction error shoots up
39:19 so what that tells you is that that
39:21 system despite its simplicity has
39:23 learned some level of common sense it
39:24 can tell you if something really strange
39:26 in the world is
39:28 happening um
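The sliding-window surprise measurement can be sketched like this; `predict_error` stands in for the trained video model's prediction error in representation space, and the toy error function and "teleporting object" data below are invented for illustration.

```python
import numpy as np

def surprise_scores(frames, predict_error, window=16):
    """Slide a fixed-size window over a video and record the model's
    prediction error on each window; a spike flags a physically
    implausible event inside that window."""
    n = len(frames) - window + 1
    return np.array([predict_error(frames[t:t + window]) for t in range(n)])

# Toy video: an "object" jumps discontinuously at frame 25.
frames = np.zeros((40, 2))
frames[25:] = 5.0

# Toy stand-in error: windows mixing both regimes are "surprising".
scores = surprise_scores(frames, lambda w: float(w.var()))
```

Windows lying entirely before the event score zero, while any window spanning the discontinuity spikes, which is the behavior described above.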
39:30 lots of experiments to show this in
39:32 various contexts for various types of
39:33 intuitive physics but I'm
39:38 going to skip to this latest work
39:42 DINO-WM the DINO world model um so this is using
39:43 DINO features and then training a
39:45 predictor on top of it which is action
39:47 conditioned so that it's a world model
39:48 that we can use for
39:50 planning um and this is a paper that
39:52 is on archive there's a website also
39:54 that you can look at the URL
39:57 is at the top here
40:01 so basically uh you train a predictor
40:03 you take a picture of the world and
40:04 run it through a DINO
40:07 encoder and then an action that maybe a
40:13 robot takes then you get the next
40:16 image from
40:18 the world run it through the DINO encoder
40:19 and then train your predictor to just
40:20 predict what's going to happen given the
40:24 action that was taken okay very simple
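As an illustration of the action-conditioned predictor, here is a toy version in which the encoder is the identity and the dynamics are linear, so the predictor can be fit by least squares; everything here (the made-up dynamics s' = s + a, the names) is invented for the sketch — the real system uses a frozen DINO encoder and a learned deep predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs):
    # stand-in for a frozen pretrained encoder (identity here)
    return obs

# Observed transitions from a toy environment with dynamics s' = s + a.
S = rng.standard_normal((500, 2))       # states
A = rng.standard_normal((500, 2))       # actions taken
S_next = S + A                          # resulting next states

# Train the predictor: next representation from [representation, action].
X = np.hstack([encode(S), A])
W, *_ = np.linalg.lstsq(X, encode(S_next), rcond=None)

def predict(z, a):
    """World model: predicted next representation given state z, action a."""
    return np.concatenate([z, a]) @ W
```

On this noiseless linear data the least-squares fit recovers the dynamics exactly, so predicted next states match the true ones.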
40:27 to do planning um you observe an initial
40:30 state run it through the DINO encoder then run
40:32 your world model for multiple time steps
40:36 with imagined actions um then you have a
40:38 target state which is represented by a
40:40 target image for example you run it through
40:41 the encoder and then you compute the
40:44 distance in state space between the
40:47 predicted state and the state
40:49 representing the target
40:52 image and the planning consists in just
40:54 through optimization finding a sequence
40:56 of actions that minimizes that cost at
40:58 runtime okay at inference time you know
41:01 people are excited about
41:04 um you know test-time computation and
41:05 blah blah blah as if it were something
41:08 new this is completely classical in
41:09 optimal control this is called model
41:11 predictive control it's been around
41:13 with us
41:15 for about the same time that I've been
41:20 around all right um the first papers
41:23 on planning using models
41:25 of this type using optimization are from
41:27 the early 60s um the ones that
41:29 actually learned the model are more
41:31 recent they're more from the 70s from
41:32 France
41:37 actually um it's called edcom um some
41:39 people in optimal control might know
41:43 about this um but you know it's a very
41:45 simple concept and this works amazingly well
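A minimal random-shooting version of model predictive control, matching the description above: roll imagined action sequences through the world model and keep the sequence whose predicted final state is closest to the target. Real planners would use gradient-based optimization or CEM rather than pure random sampling, and the world model below (actions simply displace the state) is a made-up toy.

```python
import numpy as np

def plan(z0, z_target, predict, horizon=5, candidates=1000, seed=0):
    """Pick, among random candidate action sequences, the one whose
    imagined rollout ends closest (in representation space) to the
    target state."""
    rng = np.random.default_rng(seed)
    best_actions, best_cost = None, np.inf
    for _ in range(candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, z0.shape[0]))
        z = z0
        for a in actions:                 # imagined rollout through the model
            z = predict(z, a)
        cost = float(np.linalg.norm(z - z_target))
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

# Toy world model where each action displaces the state by itself.
actions, cost = plan(np.zeros(2), np.array([1.0, 1.0]),
                     predict=lambda z, a: z + a)
```

Only the planner is re-run at each step in real model-predictive control: execute the first planned action, observe the new state, and re-plan.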
41:48 so let me skip to the video
41:50 because okay so let's say you have this
41:53 little T shape and you want to push
41:56 it into a particular position and so
41:58 you know which position it has to go to
41:59 because you take an image of that
42:01 position run it through the encoder and that gives
42:03 you a target state in representation
42:08 space um let me play that video
42:11 again okay so at the top you see what
42:13 actually happens in the real world when
42:14 you take a sequence of actions that is
42:16 planned and what you see at the bottom
42:19 is the internal mental prediction of
42:21 the sequence of
42:23 actions the system was planning and this
42:24 is run through a decoder that produces a
42:26 pictorial representation of the internal
42:28 state but that decoder is trained separately
42:31 there's no image generation um let me
42:33 skip to the more interesting one so here
42:35 is one where you have an initial state
42:38 which is a bunch of blue chips
42:42 randomly thrown on the floor and the
42:43 target state is at the top and what you
42:46 see here are the actions that
42:49 resulted from planning and the robot
42:51 accomplishing those actions the
42:52 dynamics of this environment is actually
42:54 fairly complicated because those blue
42:55 chips kind of interact with each other
42:58 and everything um the system has
42:59 just learned this through
43:02 observing a bunch of state action
43:05 next state triples um and this works in a lot of
43:07 situations for arms and moving
43:09 through mazes and pushing a T around
43:13 and things like that so
43:16 um okay and I'm not sure where I came
43:19 back um we've applied a similar
43:21 idea to navigation but in the interest of time
43:25 I'm just going to skip um so this is
43:27 basically sequences of videos
43:31 where a frame is taken at one time
43:33 and then the robot moves and
43:34 through odometry you know by how much
43:36 the robot has moved you get the next
43:37 frame and so you just train a system to
43:38 predict what the world is going to look
43:41 like if you take a particular motion
43:43 action and what you can do next is you
43:46 can tell a system like you know navigate
43:49 to that point um and it will
43:52 do it and avoid obstacles on
43:55 the way this is very recent
43:59 work but let me go to the conclusion so
44:02 I have a number of
44:03 recommendations abandon generative
44:06 models the most popular method today
44:07 that everybody is working on stop
44:09 working on them work on JEPAs those
44:11 are not generative models they predict
44:14 in representation space abandon
44:16 probabilistic models because it's
44:19 intractable use energy-based
44:23 models uh Michael and I have had like a 20
44:26 year contentious discussion about this
44:29 um abandon contrastive methods in favor
44:30 of those regularized methods abandon
44:32 reinforcement learning but that I've
44:35 been saying for a long time we know it's
44:38 inefficient um you have to use
44:39 reinforcement learning really as a last
44:42 resort when your model is inaccurate or
44:45 your cost function is inaccurate um
44:46 but if you are interested in human-level
44:48 AI just don't work on LLMs there's no
44:50 point I mean in fact if you are in
44:53 academia don't work on LLMs because you're
44:55 in competition with like hundreds of
44:58 people with tens of thousands of gpus
44:59 like there's nothing you can bring to
45:03 the table do something else um there's a
45:05 number of problems to solve training
45:06 those things with large-scale
45:08 data and so on planning algorithms
45:09 are kind of inefficient we have to come
45:12 up with better methods so if you are
45:14 into optimization or applied math it's
45:17 great um JEPAs with latent variables
45:19 planning under uncertainty hierarchical
45:21 planning which is completely unsolved um
45:23 learning cost modules because probably
45:24 most of them you can't build by hand you
45:26 need to learn them and then there are
45:29 issues of exploration etc okay so in the
45:31 future we'll have Universal virtual
45:33 assistants they'll be with us at all
45:34 times they will mediate all our
45:37 interaction with the digital world we
45:39 cannot afford to have those systems come
45:41 from a handful of companies from the
45:44 west coast of the US or China uh which
45:46 means the platforms on top of which we
45:47 build those systems need to be open
45:49 source and widely available they are
45:52 expensive to train but once you have a
45:54 foundation model fine-tuning it for a
45:55 particular application is relatively
45:57 cheap and a lot of people can afford to do
46:00 this so the platforms need to be shared
46:02 they need to speak all the the world
46:04 languages understand all the world's
46:06 cultures all the value systems all the
46:09 centers of Interest no single entity in
46:11 the world can train a foundational model
46:13 of this type this probably will have to
46:15 be done in a collaborative fashion or
46:18 distributed fashion again some work for
46:19 Applied mathematicians who are
46:21 interested in distributed algorithms for
46:22 large scale
46:26 optimization um and so open source AI
46:27 platforms are necessary
46:30 the danger I see um in Europe and in
46:34 other places is that geopolitical
46:38 rivalry will entice governments to
46:40 basically make the release of Open
46:42 source models illegal because they are
46:44 under the impression that a country will
46:47 stay ahead if it keeps its science
46:49 secret that would be a huge
46:51 mistake when you do research in secret
46:53 you fall behind that's
46:55 inevitable what will happen is that the
46:57 rest of the world will go up and will
46:59 overtake you that's currently
47:02 what's happening the open source models
47:04 are
47:07 overtaking uh slowly but surely the
47:09 proprietary ones