0:02 I'd like to welcome our second and
0:07 final plenary to the stage. Up
0:10 next is Yann LeCun. He's the chief AI
0:14 scientist at Meta and a professor at
0:17 NYU. Now, Yann was the founding director of
0:20 Meta AI and of the NYU Center, I should
0:23 say, for Data Science. He works
0:25 primarily in a number of fields: machine
0:27 learning, computer vision, mobile
0:31 robotics, and computational neuroscience.
0:35 In 2019, Yann won the prestigious ACM
0:37 Turing Award for his work on AI, and
0:40 he's of course a member of the US
0:43 National Academies and the French Académie des
0:45 sciences. A warm welcome to you, Yann, good to
0:46 have you. [Applause]
0:54 Thank you very much, a real pleasure
0:57 to be here. Last time must have been
1:01 before COVID or something.
1:03 Okay, there's going to be some
1:05 connection, a little bit, with what
1:08 Bernard just talked about, and what
1:10 I'm going to talk about is all the stuff
1:13 that Michael Jordan earlier today told you
1:21 on. So, as a matter of fact, we do need
1:22 human level
1:25 AI, and it's not just because it's an
1:27 interesting scientific question; it's
1:30 also sort of a product need. We are
1:34 going to be wearing smart devices
1:36 like smart glasses and things of that
1:40 type in the future, and in those smart
1:44 devices we'll be able to access AI
1:46 assistants that will be with us at all
1:48 times, and we'll be interacting with them
1:51 either through voice or through
1:56 electromyography (EMG). The
1:58 glasses will eventually have displays,
2:01 although currently they don't.
2:05 And we need those systems to have
2:07 human level intelligence, because that's
2:10 what we're the most familiar with
2:12 interacting with: we're familiar with
2:15 interacting with other humans, we are
2:17 familiar with the level of intelligence
2:21 that we expect in a human, and it
2:24 would be easier to
2:26 interact with systems that have
2:27 similar forms of
2:30 intelligence. So those
2:32 ubiquitous assistants are going to
2:34 mediate all of our interactions with the
2:38 digital world, and that's why
2:40 we need them to be easy
2:43 to use for a wide population that is not
2:46 necessarily familiar with using
2:48 technology. Okay, but the problem is,
2:52 machine learning sucks compared to what
2:54 we observe in humans and animals; we
2:56 don't really have the techniques that
3:00 would allow us to build machines that
3:03 have the same type of
3:07 learning abilities and common sense
3:10 and understanding of the physical world
3:13 So animals and humans have
3:15 background knowledge that allows them to
3:19 learn new tasks extremely quickly,
3:22 understand how the world works, be
3:25 able to reason and plan, and that's based
3:27 on what we call common sense. It's not a
3:30 very well-defined concept, and
3:33 our behavior and the behaviors of animals
3:36 are driven by objectives,
3:38 essentially.
3:43 So I'm going to argue that the type of
3:45 AI systems that we have at the
3:48 moment, or that almost everybody is
3:52 playing with, do not have the right
3:55 characteristics for what we
3:57 want.
4:01 And the reason is, they basically
4:05 produce one token after the other,
4:07 autoregressively. So you have a
4:10 sequence of tokens, which are subword units,
4:11 but it doesn't matter what they are, a
4:14 sequence of symbols, and then you have a
4:16 predictor that is repeated over the
4:18 sequence, that basically takes a
4:20 window of previous tokens and predicts
4:22 the next
4:24 token. And the way you train those
4:26 systems is that you put the sequence at
4:28 the input, and I really apologize
4:31 for this I'm going to perhaps
4:33 change the
4:36 resolution of the
4:38 screen so
5:30 hopefully all right um so
5:33 the way those things are trained
5:35 is you take a sequence and you basically
5:36 train the system to just reproduce its
5:38 input on its output, and because it has a
5:41 causal structure, it cannot cheat and
5:44 use a particular input to predict itself;
5:45 it has to only look at the symbols that
5:46 are to the left of it. That's called a causal
5:51 architecture. So that's very efficient;
5:53 this is, you know, what people call
5:55 a GPT, a general-purpose transformer, but
5:56 you don't have to put transformers in it,
6:00 this could be anything, it's just a causal
6:03 architecture. And I'm afraid I haven't
6:06 fixed the flashing. Anyway, once you've
6:08 trained those systems,
6:10 you can use them to generate text
6:12 by just autoregressively producing a
6:14 token, shifting it into the input, and
6:16 then producing the second token, shifting
6:19 that in, etc. That's autoregressive prediction,
6:22 not a new concept at all, obviously.
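The loop just described can be sketched in a few lines of Python (my own toy illustration, not code from the talk; the `predictor` below is an arbitrary stand-in rule, not a trained transformer):

```python
# Toy sketch of autoregressive generation: a predictor looks at a
# window of previous tokens, emits the next token, and that token is
# shifted into the input for the next step.

def predictor(window):
    # Stand-in for a trained model: an arbitrary deterministic rule.
    return (sum(window) + 1) % 5

def generate(prompt, n_tokens, window_size=3):
    seq = list(prompt)
    for _ in range(n_tokens):
        window = seq[-window_size:]      # only look at tokens to the left
        seq.append(predictor(window))    # shift the new token in
    return seq

print(generate([1, 2], 5))  # → [1, 2, 4, 3, 0, 3, 2]
```

In a real LLM the predictor is a transformer trained with the causal masking described above, but the generation loop itself is the same.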
6:24 there's an issue with this which is that
6:26 um the
6:30 U the that process is basically
6:32 Divergent every time you produce a token
6:34 there is some chance that the token is
6:37 not within the set of reasonable answers
6:39 and take you outside a set of reasonable
6:41 answers and if it does that there is no
6:44 way to fix it afterwards um and if there
6:45 is if you assume there is some
6:48 probability for that you know wrong
6:50 token uh for wrong tokens to be
6:52 generated and the errors are independent
6:54 which of course they're not um then you
6:57 get exponential Divergence uh which is
6:59 why you know we have with those models hallucination
7:01 hallucination
7:04 issues um but we're missing something
7:06 really big, because never
7:07 mind trying to reproduce human
7:09 intelligence, we can't even reproduce cat
7:11 intelligence or rat intelligence, let
7:13 alone dog intelligence. They can do
7:14 amazing feats; they understand the
7:18 physical world. You know, any house
7:21 cat can plan very highly complex
7:24 actions, and they have causal models
7:26 of the world; some of them know how to
7:29 open doors and taps and things of
7:32 that type. And in humans, you know, a
7:35 10-year-old can clear up the dinner
7:37 table and fill up the dishwasher without
7:38 training, zero-shot, the first time you
7:40 ask a 10-year-old to do it; yeah, she
7:43 will do it. Any 17-year-old can learn to
7:45 drive a car in 20 hours of practice. But
7:47 we still don't have robots that can act
7:50 like a cat, we don't have domestic robots
7:51 that can clear up the dinner table, and
7:54 we don't have level-five self-driving
7:56 cars, despite the fact that we have
7:58 hundreds of thousands, if not millions, of
8:01 hours of supervised training data. Okay, so
8:03 that tells you we're missing something really
8:06 big. Yet we have systems that can pass
8:09 the bar exam, do math problems, prove
8:13 theorems,
8:13 but no domestic robots. So we keep
8:15 bumping into this paradox, called Moravec's
8:17 paradox: things that we take for
8:20 granted, because humans and animals
8:21 can do them, we think are not complicated,
8:23 when they're actually very complicated; and the
8:25 stuff that we think is uniquely human,
8:26 like manipulating and generating
8:28 language, playing chess, playing go,
8:30 playing poker,
8:33 producing poetry and this kind of stuff,
8:35 turns out to be
8:37 relatively easy. And perhaps the reason
8:40 for this is this very simple calculation.
8:42 A typical LLM nowadays is trained on
8:46 the order of 30 trillion tokens, 3 × 10^13
8:48 tokens.
8:51 That's 2 × 10^13 words,
8:54 roughly; each token is about three bytes,
8:57 so the data volume is roughly 10^14
8:59 bytes.
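That arithmetic can be checked in a few lines (the token and byte counts are the figures from the talk; the reading speed is a hypothetical assumption of mine):

```python
# Quick sanity check of the data-volume figures.
tokens = 30e12               # ~30 trillion training tokens
bytes_per_token = 3
print(f"data volume ~ {tokens * bytes_per_token:.0e} bytes")  # ~1e14 bytes

words = 2e13                 # roughly two-thirds of a word per token
words_per_minute = 200       # hypothetical adult reading speed
years = words / words_per_minute / 60 / 24 / 365
print(f"reading time ~ {years:,.0f} years")  # a few hundred thousand years
```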
9:01 It would take any of us almost
9:04 half a million years to read through all
9:06 that material; it's basically all the
9:07 publicly available text on the
9:11 internet. Now consider a human child: a
9:13 four-year-old has been awake a total of
9:16 16,000 hours, which, by the way, is only 30
9:18 minutes of YouTube
9:20 uploads. We have 2 million optic
9:23 nerve fibers, each of which carries about
9:25 1 byte per second, maybe a bit less, but it
9:28 doesn't matter. So the data volume is
9:31 about 10^14 bytes in four years: a
9:34 four-year-old child has seen as much
9:37 data as the biggest LLM, in the form of
9:40 visual perception; and for blind children
9:43 it's touch, it's the same kind of
9:47 bandwidth. That tells you a number of
9:49 things: we're never going to get to human
9:50 level intelligence by just training on
9:53 text. It's just not
9:56 happening, despite what, you know, some
9:58 people who have a vested interest in
9:59 this happening are telling us; we're not
10:01 going to reach, you know, PhD-level
10:03 intelligence by next year, it's just not
10:05 happening. We might have PhD level in
10:11 some subfield, in some area, some
10:13 problems like chess playing, you know,
10:17 and more of them, as long as we train
10:19 those systems specifically for those
10:23 problems. As Bernard was explaining
10:26 with the visual illusions, there are a
10:27 lot of problems of this type: when you
10:29 formulate a problem, you pose a problem to
10:32 an LLM, and if the problem is kind of a
10:34 standard puzzle, the answer will be
10:36 regurgitated in just a few seconds; if
10:38 you change the statement of the problem
10:40 a little bit, the system will still
10:41 produce the same answer that it had
10:43 before, because it has no real mental
10:46 model of what goes on in the
10:52 puzzle. So how do human infants learn
10:55 how the world works? You know, infants
10:56 accumulate a huge amount of background
10:58 knowledge about the world in the first
11:00 few months of life:
11:04 notions like object permanence,
11:07 solidity, rigidity, natural categories of
11:09 objects. Before children understand
11:11 language, they do understand the
11:12 difference between a table and a
11:15 chair; that kind of develops
11:18 naturally. And they understand intuitive
11:20 physics notions, like gravity, inertia, and
11:22 things of that type, around the age of nine
11:26 months. So it takes a long time,
11:29 observation mostly until four months,
11:31 because babies don't really have any
11:33 influence on the world before
11:39 that, and then through interactions;
11:40 but the amount of interaction that's
11:42 required is astonishingly
11:44 small.
11:49 So if we want AI systems that can
11:51 eventually reach human level, and it might
11:54 take a while, we call this advanced
11:56 machine intelligence at Meta. We don't
11:58 like the term AGI, artificial general
11:59 intelligence, the reason being that
12:01 human intelligence is actually quite
12:04 specialized, and so calling it AGI is
12:05 kind of a
12:08 misnomer. So we call this AMI; we
12:10 actually pronounce it "ami," which means
12:14 friend in French. So we need systems
12:16 that learn world models from sensory
12:18 input, basically mental models of how the
12:20 world works that you can manipulate in
12:23 your mind, learning intuitive physics from
12:25 video, let's say; systems that have
12:25 video let's say systems that have
12:28 persistent memory systems that can plan
12:30 actions uh possibly
12:32 hierarchically so as to fulfill an
12:34 objective and systems that can
12:36 reason um and then systems that are
12:40 controllable and safe by Design not by
12:42 fine-tuning which is the the case for
12:45 llms now the only way I know to build
12:47 systems of this type is to change the
12:52 type of of inference um that um current
12:55 uh AI systems perform so right now the
12:59 way an llm uh performs inference is by
13:01 running through a fixed number of layers
13:03 of anet a transformer then producing a
13:05 token injecting that token on the input
13:06 and then running through a fixed number
13:08 of layers again and the problem with
13:11 this is that if you ask a simple
13:13 question or complex question and you ask
13:16 the system to answer by yes or no like
13:20 does 2 and two equal four yes or no or
13:22 does p equal NP yes or no it's going to
13:24 spend the exact same amount of
13:25 computation to answer those two
13:27 questions. So people have been kind of
13:29 cheating, telling the system
13:31 to explain, you know, the chain-of-thought
13:33 trick: you basically have the
13:35 system produce more tokens, so it is
13:36 going to spend more computation
13:37 answering the question, but that's kind
13:42 of a hack. The way a lot of inference
13:43 works, in statistics for example (that's going
13:46 to make Mike happy, actually), the way
13:49 inference works in
13:52 classical AI, in statistics, in
13:54 structured prediction, a lot of different
13:57 domains, the way it works is that you
13:59 have a function that measures the degree
14:01 of compatibility or incompatibility
14:03 between your observation and a proposed
14:05 output, and then the inference process
14:08 consists in finding the value of an
14:10 output that minimizes this
14:13 incompatibility measure. Okay, let's call
14:14 it an energy function. So you have an
14:17 energy function, represented by the
14:20 square box here on the right, when it
14:24 doesn't disappear, and the system
14:27 just performs optimization for doing
14:29 inference. Now, if the inference
14:31 problem is more difficult, the system
14:32 will just spend more time performing
14:34 inference; in other words, it will think
14:37 about complex problems for longer than
14:39 simple ones, for which the answer is pretty
14:42 obvious. And this is really a very
14:44 classical thing to do:
14:47 classical AI is all about reasoning and
14:50 search, and therefore optimization;
14:53 pretty much any computational problem
14:55 can be reduced to an optimization problem,
14:57 essentially, or a search problem. It's
15:00 also very classical in probabilistic
15:01 modeling, like probabilistic graphical
15:05 models and things of that type. So this
15:07 type of inference would be more akin to
15:10 what psychologists call system two, in the
15:13 sort of human mind, if you want. System
15:18 two is when you think about what action
15:19 or sequence of actions you're going to
15:22 take before you take them; you
15:23 think about something before doing it.
15:25 And system one is when you can do
15:26 the thing without thinking about it; you
15:29 know, it becomes sort of subconscious. So
15:32 LLMs are system one; what I'm proposing
15:35 is system two. And then the
15:38 appropriate sort of semi-theoretical
15:41 framework to explain this is energy-based
15:43 models, which I'm not going to have
15:45 time to get into in too much detail, but
15:46 basically you capture the dependency
15:49 between variables, let's say observations
15:53 X and outputs Y, through an energy
15:56 function that takes low values when
15:58 X and Y are compatible, and larger
16:00 values when X and Y are not compatible.
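As a minimal sketch of that idea (my own toy with a made-up one-dimensional energy, not anything from the talk): define an energy E(x, y) that is low when y is compatible with x, and implement inference as a search for the y that minimizes it.

```python
import numpy as np

def energy(x, y):
    # Hypothetical energy: y is "compatible" with x when y*y is close
    # to x, so low energy means y is near the square root of x.
    return (y * y - x) ** 2

def infer(x, candidates):
    # Inference = optimization: find the candidate y with the lowest
    # energy for this x, instead of computing y from x directly.
    return min(candidates, key=lambda y: energy(x, y))

ys = np.linspace(0.0, 5.0, 5001)   # candidate outputs
print(float(infer(9.0, ys)))       # ≈ 3.0, since 3 * 3 = 9
```

A harder inference problem simply means more search over y: the same machinery spends more compute on questions whose energy landscape is harder to minimize.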
16:02 You don't want to just compute Y from X;
16:05 as we just saw, you just want an energy
16:07 function that measures the degree of
16:09 incompatibility, and then, given
16:12 an X, find a Y that has low energy for that
16:18 X. Okay, so now let's go a little bit into
16:20 the details of how this type of
16:23 architecture can be built, and
16:27 how it kind of relates to
16:29 thinking or planning.
16:32 So a system would look like this:
16:34 you get an observation from the world;
16:35 it goes through a perception module that
16:38 produces an estimate of the state of
16:40 the world. But of course the state of the
16:41 world is not completely observable, so
16:43 you may have to combine this with
16:46 the content of a memory that
16:47 contains your idea of
16:49 the state of the world that you don't
16:53 currently perceive. And the combination of those
16:56 two goes into a world model. So what is a
16:59 world model? A world model is: given a
17:01 current estimate of the state of the
17:04 world, which is in an abstract
17:07 representation space, and given an action
17:09 sequence that you imagine
17:13 taking, your world model predicts the
17:15 resulting state of the world that
17:18 will occur after you take that
17:20 sequence of actions. Okay, that's what a
17:22 world model is. If I tell you: imagine a
17:25 cube floating in the air in front of you;
17:27 okay, now rotate this cube by 90 degrees
17:29 around a vertical axis;
17:31 what does it look like? It's very easy
17:33 for you to kind of use this mental model.
18:36 hopefully all
18:39 right, let's hope this will be more
18:41 stable.
18:46 Okay, 50 hertz, not 60
18:51 hertz. Okay, so what you can do now is
18:53 feed... okay, hang on.
19:14 Okay, this doesn't look like it was a good...
19:50 Nice. Okay, I think we're going to have
19:52 human level intelligence before we have
19:59 ...works. Okay, so if we have this
20:03 world model, which is able to predict the
20:05 result of a sequence of
20:08 actions, we can feed it to an
20:10 objective, which is a task objective that
20:12 measures to what extent the predicted
20:15 final state satisfies a goal that we
20:18 set for ourselves; it's just a cost
20:20 function. And we can also set some
20:23 guardrail objectives; think of them as
20:25 constraints that need to be satisfied
20:28 for the system to behave in a safe
20:30 manner. So those guardrails will be
20:33 explicitly implemented, and the way the
20:35 system proceeds is by optimization: it's
20:37 looking for an action sequence that
20:41 minimizes the task objective and the
20:44 guardrail objectives, at runtime, okay?
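That runtime optimization can be sketched as a tiny model-predictive-control-style loop (a toy of my own, assuming one-dimensional states and hand-written costs; nothing here is Meta's code): roll the world model forward over candidate action sequences and keep the one that minimizes the task cost plus the guardrail cost.

```python
import itertools

def world_model(state, action):
    return state + action                 # hypothetical dynamics

def task_cost(state, goal=10.0):
    return (state - goal) ** 2            # distance of the final state to the goal

def guardrail_cost(action, limit=3.0):
    return 1e6 if abs(action) > limit else 0.0   # hard safety penalty

def plan(state, horizon=4, choices=(-3, -1, 0, 1, 3)):
    best, best_cost = None, float("inf")
    for seq in itertools.product(choices, repeat=horizon):
        s, cost = state, 0.0
        for a in seq:                     # roll the world model forward
            cost += guardrail_cost(a)
            s = world_model(s, a)
        cost += task_cost(s)              # score the predicted end state
        if cost < best_cost:
            best, best_cost = seq, cost
    return best

print(plan(0.0))  # a sequence of safe actions reaching the goal
```

The exhaustive search is just for clarity; a gradient-based or sampling-based optimizer over the action sequence would play the same role.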
20:45 We're not talking about learning here,
20:47 we're just talking about
20:50 inference. And that will guarantee the
20:52 safety of the system, because the
20:53 guardrails guarantee safety, and there
20:55 is no way you can jailbreak that system
20:58 by giving it a prompt that will, you know,
20:59 have it escape its guardrail
21:01 objectives; the guardrail objectives would be just
21:05 hardwired. They might be trained, but
21:08 hardwired. Now, a sequence of actions
21:10 should probably use a single world model
21:13 that you use repeatedly over
21:15 multiple time steps. Okay, so you have
21:17 one model: you feed it the first action, it
21:18 predicts the next state; then the second
21:20 action, and it predicts the second next state.
21:23 You can have guardrail costs and
21:26 task objectives along the
21:29 trajectory, without specifying what
21:31 optimization algorithm we use; it
21:32 doesn't really matter for the discussion
21:36 that we have. If the world happens not
21:37 to be completely deterministic and
21:40 predictable, the world model may need to
21:42 have latent variables to account for all
21:43 the things about the world that we do
21:47 not observe and that, you know, make
21:50 our predictions basically inexact. And
21:52 ultimately, what we want is a system that
21:54 can plan hierarchically, so something
21:56 that may have several levels of
22:00 abstraction, in such a way that at the
22:02 low level we plan low-level actions,
22:04 basically muscle control, but at a high
22:08 level we can plan abstract macro-actions,
22:10 where the world model predicts at longer
22:12 time steps but in a representation space
22:14 that is more abstract and therefore
22:16 contains fewer details. So if I'm
22:19 sitting in my office at NYU and I
22:22 decide to go to Paris, I can decompose
22:24 that task into two subtasks: go to the
22:25 airport and catch a
22:27 plane. Okay, now I have a subgoal: going
22:29 to the airport.
22:31 I'm in New York City, so going to the
22:33 airport consists in going down to the
22:35 street and hailing a taxi. How do I go down
22:38 to the street? Well, I need to get to
22:41 the elevator, push the button, go down, go
22:42 out of the building. How do I go to the
22:44 elevator? Well, I need to stand up from my
22:46 chair, pick up my bag, open the door, walk
22:49 to the elevator, avoid all the obstacles,
22:50 and then at some point I get to a level
22:52 where I don't need to plan, I can just
22:56 take the actions. But we do
22:57 this type of hierarchical planning
23:00 absolutely all the time, and I tell you,
23:01 we have no idea how to do this with learning
23:05 machines. Almost every robot does
23:07 hierarchical planning, but the
23:09 representations at every level of the
23:11 hierarchy are
23:14 handcrafted. What we need is to train an
23:15 architecture, perhaps of the type that
23:18 I'm describing here, so that it can learn
23:20 abstract representations, not just
23:23 of the state of the world but also
23:24 predictions, world models that predict
23:27 what's going to happen, but also abstract
23:29 actions at several levels of abstraction, so we
23:31 can do this hierarchical planning.
23:34 Animals do this;
23:38 humans do this very well. We're
23:41 completely incapable of doing this with machines
23:44 today. If you're starting a PhD, great
23:54 years.
23:54 So, with all those reflections, about
23:56 3 years ago I wrote a long paper where I
23:58 kind of explained where I
24:01 think AI research should be focusing.
24:03 This was before the whole ChatGPT
24:05 craze; I haven't changed my mind about
24:07 this, ChatGPT hasn't changed anything, we
24:10 were working on LLMs before that, so we knew what
24:12 was coming anyway. This is the paper,
24:14 "A Path Towards Autonomous Machine
24:16 Intelligence," which we now call advanced
24:18 machine intelligence, because "autonomous"
24:20 just scares people. It's on
24:22 OpenReview, it's not on
24:24 arXiv, and there are various versions of
24:26 this talk that I've given in various
24:28 ways. Okay, so a very natural idea for
24:30 getting systems to understand how the
24:35 world works is using the same process that
24:37 we used to
24:40 train systems for natural
24:41 language, and applying this to, let's say,
24:44 video. Okay, if a system is capable of
24:45 predicting what's going to happen in a
24:47 video, you show it a short segment of
24:49 video and you ask it to predict what's
24:50 going to happen next, presumably it would
24:54 have understood the underlying structure
24:57 of the world, and so training it to
24:59 make that prediction might actually
25:00 cause the system to understand the
25:02 underlying structure of the
25:05 world. It works for
25:07 text, because predicting words is
25:10 relatively simple. Why is predicting
25:12 words simple? Because there is
25:14 only a finite number of possible words,
25:16 certainly a finite number of possible
25:18 tokens, and so we can't predict exactly
25:21 which word will follow another word, or
25:23 what word is missing in the text,
25:24 but we can produce a probability
25:26 distribution, or score, for every possible
25:29 word in the dictionary. We cannot do this
25:33 for images or video frames; we do not
25:34 have good ways of representing
25:35 distributions over video
25:39 frames. Every attempt to do this
25:41 basically bumps into mathematical
25:46 intractabilities. And so you could try to get
25:48 around the problem using, you know,
25:50 statistics and the math that was
25:53 invented by physicists, you know, variational
25:56 inference and all that stuff, but in fact
25:57 it's better to just throw away the
25:59 entire idea of doing probabilistic
26:01 modeling and just say: I just want
26:03 to learn this energy function that tells
26:05 me whether my output is compatible with
26:07 my input, and I don't care if this energy
26:10 function is the negative log of some
26:12 distribution. And the reason we
26:13 need to do this, of course, is because we
26:15 cannot predict exactly what's going to
26:17 happen in the world; there is a whole set
26:19 of possible things that may happen, and
26:21 if we train a system to just predict one
26:24 frame, it's not going to do a good job.
26:26 So the solution to that problem is
26:28 a new architecture I call the joint embedding
26:30 predictive architecture, or
26:32 JEPA. And that's because generative
26:36 architectures simply do not work for
26:39 producing videos. You may have seen video
26:41 generation systems that produce pretty
26:43 amazing stuff; there are a lot of hacks
26:45 behind them,
26:47 and they don't really understand
26:49 physics. They don't need to; they just
26:51 need to predict pretty pictures, they
26:52 don't need to actually have any kind of
26:54 accurate model of the world.
26:57 Here's what the JEPA is. The idea is that
27:00 you run both the observation and the
27:03 output, which is the next observation,
27:06 into an encoder, so that the prediction
27:10 does not consist in predicting pixels
27:12 but basically in predicting an abstract
27:14 representation of what goes on in the
27:18 video, or anything. Okay, so let's
27:20 compare those two architectures. On the
27:22 left you have generative
27:25 architectures: you run X, the observation,
27:26 through an encoder, and perhaps a predictor
27:29 or decoder, and you make a prediction for
27:32 Y; okay, that's straightforward
27:34 prediction. And then on the right, the
27:36 JEPA architecture: you run both X and Y
27:38 through encoders, which may be identical or
27:42 different, and then you predict the
27:43 representation of Y from the
27:45 representation of X, in this abstract
27:48 space. What this will cause the system to do is
27:51 basically learn an encoder that
27:53 eliminates all the stuff you cannot
27:55 predict, and this is really what we do.
28:00 observe the left part of this room here
28:02 and I kind of pan the camera towards the
28:04 right there's no way any video
28:06 prediction system including humans can
28:08 predict what every one of you looks like
28:10 or predict the texture on the wall or
28:13 the texture of the wood U on the on the
28:15 hardwood floor um there's a lot of
28:17 things that we just simply cannot
28:19 predict and so instead of insisting that
28:21 we should make a probabilistic
28:23 prediction about stuff that we cannot
28:26 predict let's just not predict it learn
28:28 a representation in which all of those
28:30 details are essentially eliminated so
28:32 that the prediction is much simpler it
28:36 may still we need to be uh non-
28:38 deterministic but at least we simplify
28:40 the problem so there's various flavors
28:42 of those JEPAs, which I'm not going to go
28:44 into: some of which have latent variables,
28:46 some of which are action-conditioned.
28:47 So I'm going to talk about
28:50 the action-conditioned ones, because that's
28:51 the most interesting case, because
28:53 they really are world models. So
28:55 you have an encoder; X is the current state
28:58 of the world, or the current observation;
29:00 you feed an action,
29:02 which you imagine taking, into a predictor,
29:04 and the predictor, which
29:06 is a world model, predicts the
29:08 representation of the next state of the
29:11 world. And that's how you can do
29:14 planning. Okay, so we need
29:15 to train those systems, and we need to
29:16 figure out how to train those JEPA
29:18 architectures, and it turns out not to be
29:22 completely trivial, because you need to
29:24 train the cost function in this JEPA
29:27 architecture that measures the
29:29 divergence between the representation of
29:31 Y and the predicted representation of Y.
29:35 We need this to be low on the training
29:37 data, but we also need it to be large
29:40 outside the training set. Okay, so this is,
29:42 you know, this kind of energy function
29:45 here that has kind of contours of
29:48 equal energy; we need to make sure
29:50 the energy is high outside of the
29:53 manifold of data, and I only know two
29:55 classes of methods for this. One set of
29:57 methods is called contrastive: it consists
30:00 in having data points, which are
30:03 those dark blue dots, pushing
30:05 down the energy of those, and then
30:07 generating, you know, those flashing green
30:09 dots and pushing their energy up. The
30:11 problem with this type of method, contrastive
30:13 methods, is that they don't scale very
30:15 well in high dimension: if you have too
30:17 many dimensions in your space of Y,
30:18 you're going to need to push up in lots
30:22 of different places, and it doesn't
30:23 work so well; you need a lot of
30:26 contrastive samples for this to work.
30:28 There's another set of methods
30:29 called regularized methods, and what they
30:32 do is they use a regularizer on the
30:36 energy, so as to minimize the volume of
30:39 space that can take low energy. Okay, so
30:40 that leads to two
30:42 different types of learning procedure:
30:44 one learning procedure which is
30:45 contrastive, where you need to generate those
30:47 contrastive points and then push their
30:49 energy up through some loss function, and the
30:51 other one has some regularizer that is
30:54 going to sort of shrink-wrap the
30:57 manifold of data, so as to make sure
30:59 that the energy is high outside. So
31:00 there's a number of techniques to do
31:03 this. I'll describe just a
31:06 handful, and the way we started
31:10 testing them several years ago, maybe
31:15 five, six years ago, was to train them
31:17 to learn representations of images: so
31:20 you take one image, you corrupt it or
31:22 transform it in some way, and you run
31:24 the original image and the corrupted
31:27 version through identical encoders, and you
31:28 train a predictor to predict the
31:29 representation of the original image
31:32 from the corrupted one. Once you're done
31:35 training the system, you remove the
31:37 predictor and you use the representation
31:39 at the output of the encoder as input to
31:43 something simple, like a linear classifier or
31:44 something of that type, that you train
31:47 supervised, so as to verify that the
31:49 representations that are learned are
31:50 good. And this idea is very old: it goes
31:54 back to the 1990s, and things like
31:57 what we used to call Siamese networks, and some
31:58 more recent work on those joint
32:00 embedding architectures; and then adding
32:02 the predictor is more
32:06 recent. SimCLR, which is from
32:08 Google, is a contrastive method derived
32:10 from Siamese
32:12 nets, but again the dimension is
32:16 restricted so the regularized method uh
32:18 worked the following way you try to
32:20 estimate have some sort of estimate of
32:22 the information content coming out of
32:25 the encoders and what you need to do is
32:27 prevent the encoder from collapsing this
32:30 a trivial solution of training a a
32:32 Jeeter architecture where the encoder
32:34 basically ignores the input produces a
32:35 constant output and another the
32:37 prodction error is zero all the time
32:40 okay and obviously that's a collapsed
32:42 solution that is uh not interesting so
32:43 you need a system you need to prevent
32:47 the system from collapsing and which is
32:48 the regularization method I was talking
32:50 about earlier and an indirect way of
32:53 doing this is maintain the information
32:56 content coming out of the
32:58 encoder Okay so so you're going to have
33:01 a training objective function which is a
33:03 negative information content if you want
33:05 because we minimize in machine learning
33:06 we don't
33:09 maximize uh one way to do this is to
33:12 basically take the
33:15 um vectors representation vectors that
33:18 come out of the encoder over a batch of
33:21 samples um and make sure they contain
33:23 information how can you do this you
33:26 can take that matrix of representation
33:29 vectors and compute the product of that
33:32 matrix by its transpose you get a covariance
33:34 matrix and you try to make that covariance
33:36 matrix equal to the
33:39 identity um
33:41 so there's bad news with this which is
33:43 that this
33:45 basically approximates the information
33:47 content by making very strong
33:50 assumptions about the nature of the
33:51 dependencies between the variables and
33:53 in fact it's an upper bound on the
33:55 information content and we're pushing it
33:57 up crossing our fingers that the actual
33:59 information content which is below is
34:03 going to follow okay so it's slightly
34:07 irregular theoretically but it
34:10 works all right so again uh you have a
34:12 matrix coming out of your encoder it's
34:15 got a number of samples um and each
34:17 vector is a separate variable what we're
34:19 going to try to do is
34:23 make each variable individually
34:24 informative so we're going to try to
34:27 prevent the variance of the variables
34:29 from going to zero force it to be one
34:31 for example and then we're going to
34:33 decorrelate the variables with each
34:34 other and that means computing the
34:37 covariance matrix of this matrix its
34:39 transpose multiplied by itself and then
34:42 trying to make the resulting covariance matrix as
34:45 close to the identity matrix as
34:49 possible um there are other methods that
34:53 try to make the samples orthogonal
34:56 not the variables um and those
34:58 are sample-contrastive
35:00 methods um but they don't work in high
35:02 dimension and they require large
35:05 batches uh so we have a method of
35:07 this type called VICReg which means
35:08 variance-invariance-covariance
35:10 regularization and it's got particular
35:12 loss functions for this covariance matrix um
35:14 there have been kind of similar methods
35:18 proposed by Yi Ma and his team called
35:21 MCR squared and then another method by
35:25 some colleagues from NYU called
35:28 MMCR from neuroscience
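The variance and covariance regularization just described can be sketched in a few lines of NumPy. This is a simplified illustration, not the actual VICReg loss (which also has an invariance term between two embeddings and tunable weights); the function name and constants here are invented for the sketch.

```python
import numpy as np

def vicreg_style_regularizer(z, eps=1e-4):
    """Toy variance/covariance penalty on a batch of representations.

    z: (batch, dim) matrix of embeddings from the encoder.
    Small when every dimension has variance around 1 and the
    dimensions are decorrelated (covariance matrix close to identity).
    """
    n, d = z.shape
    z = z - z.mean(axis=0)                    # center each variable

    # Variance term: hinge pushing each dimension's std toward >= 1,
    # which prevents collapse to a constant output.
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, 1.0 - std))

    # Covariance term: penalize off-diagonal entries of the
    # covariance matrix z^T z / (n - 1), i.e. decorrelate variables.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / d

    return var_loss + cov_loss
```

A collapsed encoder (constant output) is heavily penalized, while decorrelated unit-variance embeddings score near zero.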
35:30 so that's one set of methods and I
35:31 really like those methods I think
35:33 they work really well and I expect to
35:35 see more of them in the future but there
35:37 is another set of methods that to some
35:39 extent has been slightly more successful
35:41 over the last couple years and those are
35:43 based on distillation so again you have
35:45 two encoders it's still a joint embedding
35:46 predictive architecture you have two
35:48 encoders they kind of share the same
35:50 weights but not really so the encoder on
35:53 the right uh gets a version of the
35:55 weights of the encoder on the left that are
36:00 obtained through an exponential moving
36:02 average okay a moving average so
36:05 basically you force the encoder on the
36:07 right to uh change its weights more
36:09 slowly than the one on the left and for
36:12 some reason that prevents collapse
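The exponential-moving-average trick can be sketched as follows; `ema_update` is an illustrative name, and real implementations apply an update like this per tensor after every optimizer step.

```python
import numpy as np

def ema_update(target_weights, online_weights, decay=0.996):
    """Update the slow 'target' encoder as an exponential moving
    average of the 'online' encoder trained by gradient descent.
    The target thus changes its weights more slowly than the online
    encoder, which is what empirically prevents collapse."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_weights, online_weights)]
```

With `decay` close to 1, the target trails the online encoder slowly; `decay=0` would make the two encoders identical and collapse-prone again.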
36:14 there's some theoretical work on this um
36:16 in fact one that was just
36:18 finished um but it's a little
36:21 bit mysterious why this works and
36:22 frankly I'm a little uncomfortable with
36:25 this method but we have
36:27 to accept the fact that it actually
36:30 works if you're
36:34 careful um you know real engineers
36:36 build things without necessarily knowing
36:39 why they work that's good
36:41 engineering and then the usual joke in
36:43 France that everybody here should should
36:46 learn is that students that come out of
36:48 École Polytechnique when they build
36:49 something it doesn't work but they can
36:51 tell you
36:54 why sorry about that
36:59 um I didn't study here you can tell um
37:02 okay let me uh skip ahead a
37:04 little bit in the interest of time because
37:07 we wasted a bit of time um okay so
37:09 there's a particular way of implementing
37:11 this idea of distillation called I-JEPA there's
37:15 another one called DINO or DINOv2
37:18 uh which I skipped a little bit um and
37:22 um so DINO is at V2 people are working
37:25 on V3 this is a method produced by
37:27 some of my colleagues at FAIR
37:28 Paris
37:32 um a team led by Maxime Oquab um and
37:34 then a slightly different version um
37:36 called
37:40 I-JEPA by also FAIR people in Montreal
37:44 and Paris mostly so no need for negative
37:46 samples there and those kinds of
37:48 systems learn generic features
37:50 that you can then use for any
37:51 downstream task and the features are
37:54 really good um so this works really well
37:55 I'm not going to bore you with details
37:57 because I don't have time uh more
37:58 recently we worked on a version of this
38:00 for video so this is a system that takes
38:03 a chunk of 16 frames from a video you
38:05 take those 16 frames
38:06 run them through an encoder and then you
38:08 corrupt those 16 frames by masking some
38:11 parts run them through the same encoder
38:13 and then train a predictor to predict
38:15 the representation of the full video
38:18 from the one that is partially masked or
38:22 corrupted um so again this
38:25 is a group of researchers at FAIR in
38:27 Paris and Montreal
38:28 um and this works really well in the
38:30 sense that uh you learn features that
38:33 you can then feed to a system that can
38:35 classify actions in videos and you get
38:37 really good results with
38:39 these methods again I'm not going
38:40 to bore you with details but here is a
38:42 really interesting thing this is a paper
38:45 that we just submitted um if you show
38:46 that
38:50 system videos where something really
38:52 strange
38:54 happens that system actually is capable
38:55 of telling you my prediction error is
38:57 going through the roof there is
38:58 something strange going on in that
39:00 window so you take a
39:03 video and you take a 16-frame
39:05 window you slide it over the video and
39:08 you measure the prediction error of the
39:10 system and if something really strange
39:12 happens like an object spontaneously
39:13 disappearing or changing
39:17 shape um the prediction error shoots up
39:19 so what that tells you is that that
39:21 system despite its simplicity has
39:23 learned some level of common sense it
39:24 can tell you if something really strange
39:26 in the world is
39:28 happening um
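The sliding-window surprise measurement can be sketched like this; `predict_error` stands in for the trained video model's prediction error in representation space, and the toy error function and "teleporting object" data below are invented for illustration.

```python
import numpy as np

def surprise_scores(frames, predict_error, window=16):
    """Slide a fixed-size window over a video and record the model's
    prediction error on each window; a spike flags a physically
    implausible event inside that window."""
    n = len(frames) - window + 1
    return np.array([predict_error(frames[t:t + window]) for t in range(n)])

# Toy video: an "object" jumps discontinuously at frame 25.
frames = np.zeros((40, 2))
frames[25:] = 5.0

# Toy stand-in error: windows mixing both regimes are "surprising".
scores = surprise_scores(frames, lambda w: float(w.var()))
```

Windows lying entirely before the event score zero, while any window spanning the discontinuity spikes, which is the behavior described above.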
39:30 lots of experiments to show this in
39:32 various contexts for various types of
39:33 intuitive physics but I'm
39:38 going to skip to this latest work
39:42 DINO-WM the DINO world model um so this is using
39:43 DINO features and then training a
39:45 predictor on top of it which is action
39:47 conditioned so that it's a world model
39:48 that we can use for
39:50 planning um and this is a paper that
39:52 is on archive there's a website also
39:54 that you can look at the URL
39:57 is at the top here
40:01 so basically uh you train a predictor
40:03 you take a picture of the world and
40:04 run it through a DINO
40:07 encoder and then an action that maybe a
40:13 robot takes then you get the next
40:16 image from
40:18 the world run it through the DINO encoder
40:19 and then train your predictor to just
40:20 predict what's going to happen given the
40:24 action that was taken okay very simple
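As an illustration of the action-conditioned predictor, here is a toy version in which the encoder is the identity and the dynamics are linear, so the predictor can be fit by least squares; everything here (the made-up dynamics s' = s + a, the names) is invented for the sketch — the real system uses a frozen DINO encoder and a learned deep predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs):
    # stand-in for a frozen pretrained encoder (identity here)
    return obs

# Observed transitions from a toy environment with dynamics s' = s + a.
S = rng.standard_normal((500, 2))       # states
A = rng.standard_normal((500, 2))       # actions taken
S_next = S + A                          # resulting next states

# Train the predictor: next representation from [representation, action].
X = np.hstack([encode(S), A])
W, *_ = np.linalg.lstsq(X, encode(S_next), rcond=None)

def predict(z, a):
    """World model: predicted next representation given state z, action a."""
    return np.concatenate([z, a]) @ W
```

On this noiseless linear data the least-squares fit recovers the dynamics exactly, so predicted next states match the true ones.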
40:27 to do planning um you observe an initial
40:30 state run it through the DINO encoder then run
40:32 your world model for multiple time steps
40:36 with imagined actions um then you have a
40:38 target state which is represented by a
40:40 target image for example you run it through
40:41 the encoder and then you compute the
40:44 distance in state space between the
40:47 predicted state and the state
40:49 representing the target
40:52 image and the planning consists in just
40:54 through optimization finding a sequence
40:56 of actions that minimizes that cost at
40:58 runtime okay at inference time you know
41:01 people are excited about
41:04 um you know test-time computation and
41:05 blah blah blah as if it were something
41:08 new this is completely classical in
41:09 optimal control this is called model
41:11 predictive control it's been around
41:13 with us
41:15 for about the same time that I've been
41:20 around all right um the first papers
41:23 on planning using models
41:25 of this type using optimization are from
41:27 the early 60s um the ones that
41:29 actually learned the model are more
41:31 recent they're more from the 70s from
41:32 France
41:37 actually um it's called edcom um some
41:39 people in optimal control might know
41:43 about this um but you know it's a very
41:45 simple concept and this works amazingly well
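A minimal random-shooting version of model predictive control, matching the description above: roll imagined action sequences through the world model and keep the sequence whose predicted final state is closest to the target. Real planners would use gradient-based optimization or CEM rather than pure random sampling, and the world model below (actions simply displace the state) is a made-up toy.

```python
import numpy as np

def plan(z0, z_target, predict, horizon=5, candidates=1000, seed=0):
    """Pick, among random candidate action sequences, the one whose
    imagined rollout ends closest (in representation space) to the
    target state."""
    rng = np.random.default_rng(seed)
    best_actions, best_cost = None, np.inf
    for _ in range(candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, z0.shape[0]))
        z = z0
        for a in actions:                 # imagined rollout through the model
            z = predict(z, a)
        cost = float(np.linalg.norm(z - z_target))
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

# Toy world model where each action displaces the state by itself.
actions, cost = plan(np.zeros(2), np.array([1.0, 1.0]),
                     predict=lambda z, a: z + a)
```

Only the planner is re-run at each step in real model-predictive control: execute the first planned action, observe the new state, and re-plan.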
41:48 so let me skip to the video
41:50 because okay so let's say you have this
41:53 little T shape and you want to push
41:56 it into a particular position and so
41:58 you know which position it has to go to
41:59 because you take an image of that
42:01 position run it through the encoder and that gives
42:03 you a target state in representation
42:08 space um let me play that video
42:11 again okay so at the top you see what
42:13 actually happens in the real world when
42:14 you take a sequence of actions that is
42:16 planned and what you see at the bottom
42:19 is the internal mental prediction of
42:21 the sequence of
42:23 actions the system was planning and this
42:24 is run through a decoder that produces a
42:26 pictorial representation of the internal
42:28 state but that decoder is trained separately
42:31 there's no image generation um let me
42:33 skip to the more interesting one so here
42:35 is one where you have an initial state
42:38 which is a bunch of blue chips
42:42 randomly thrown on the floor and the
42:43 target state is at the top and what you
42:46 see here are the actions that
42:49 resulted from planning and the robot
42:51 accomplishing those actions the
42:52 dynamics of this environment is actually
42:54 fairly complicated because those blue
42:55 chips kind of interact with each other
42:58 and everything um the system has
42:59 just learned this through
43:02 observing a bunch of state action
43:05 next state triples um and this works in a lot of
43:07 situations for arms and moving
43:09 through mazes and pushing a T around
43:13 and things like that so
43:16 um okay and I'm not sure where I came
43:19 back um we've applied a similar
43:21 idea to navigation but in the interest of time
43:25 I'm just going to skip um so this is
43:27 basically sequences of videos
43:31 where a frame is taken at one time
43:33 and then the robot moves and
43:34 through odometry you know by how much
43:36 the robot has moved you get the next
43:37 frame and so you just train a system to
43:38 predict what the world is going to look
43:41 like if you take a particular motion
43:43 action and what you can do next is you
43:46 can tell a system like you know navigate
43:49 to that point um and it will
43:52 do it and avoid obstacles on
43:55 the way this is very recent
43:59 work but let me go to the conclusion so
44:02 I have a number of
44:03 recommendations abandon generative
44:06 models the most popular method today
44:07 that everybody is working on stop
44:09 working on them work on JEPAs those
44:11 are not generative models they predict
44:14 in representation space abandon
44:16 probabilistic models because it's
44:19 intractable use energy-based
44:23 models uh Michael and I have had like a 20
44:26 year contentious discussion about this
44:29 um abandon contrastive methods in favor
44:30 of those regularized methods abandon
44:32 reinforcement learning but that I've
44:35 been saying for a long time we know it's
44:38 inefficient um you have to use
44:39 reinforcement learning really as a last
44:42 resort when your model is inaccurate or
44:45 your cost function is inaccurate um
44:46 but if you are interested in human-level
44:48 AI just don't work on LLMs there's no
44:50 point I mean in fact if you are in
44:53 academia don't work on LLMs because you're
44:55 in competition with like hundreds of
44:58 people with tens of thousands of gpus
44:59 like there's nothing you can bring to
45:03 the table do something else um there's a
45:05 number of problems to solve training
45:06 those things with large-scale
45:08 data and so on planning algorithms
45:09 are kind of inefficient we have to come
45:12 up with better methods so if you are
45:14 into optimization or applied math it's
45:17 great um JEPAs with latent variables
45:19 planning under uncertainty hierarchical
45:21 planning which is completely unsolved um
45:23 learning cost modules because probably
45:24 most of them you can't build by hand you
45:26 need to learn them and then there are
45:29 issues of exploration etc okay so in the
45:31 future we'll have Universal virtual
45:33 assistants they'll be with us at all
45:34 times they will mediate all our
45:37 interaction with the digital world we
45:39 cannot afford to have those systems come
45:41 from a handful of companies from the
45:44 west coast of the US or China uh which
45:46 means the platforms on top of which we
45:47 build those systems need to be open
45:49 source and widely available they are
45:52 expensive to train but once you have a
45:54 foundation model fine-tuning it for a
45:55 particular application is relatively
45:57 cheap and a lot of people can afford to do
46:00 this so the platforms need to be shared
46:02 they need to speak all the the world
46:04 languages understand all the world's
46:06 cultures all the value systems all the
46:09 centers of Interest no single entity in
46:11 the world can train a foundational model
46:13 of this type this probably will have to
46:15 be done in a collaborative fashion or
46:18 distributed fashion again some work for
46:19 Applied mathematicians who are
46:21 interested in distributed algorithms for
46:22 large scale
46:26 optimization um and so open source AI
46:27 platforms are necessary
46:30 the danger I see um in Europe and in
46:34 other places is that geopolitical
46:38 rivalry will entice governments to
46:40 basically make the release of Open
46:42 source models illegal because they are
46:44 under the impression that a country will
46:47 stay ahead if it keeps its science
46:49 secret that would be a huge
46:51 mistake when you do research in secret
46:53 you fall behind that's
46:55 inevitable what will happen is that the
46:57 rest of the world will go up and will
46:59 overtake you that's currently
47:02 what's happening the open source models
47:04 are
47:07 overtaking uh slowly but surely the
47:09 proprietary ones