YouTube Transcript:
RI Seminar: Dieter Fox: Where's RobotGPT?

Summary

Core Theme

The core theme of this content is the exploration of how generative AI, particularly large language models (LLMs) and advanced simulation techniques, can be leveraged to significantly advance robot manipulation capabilities, bridging the gap between current limitations and the ambitious goal of "robot GPT."

we start right on time because the room

has already been filled

before um it's a great privilege to have

Dieter Fox here and uh Dieter is

a household name in robotics I think

probably most of us have uh read or have

have his book with us he has done some

of the foundational work in robotics

I'll not go into what all he has done uh

think probably these are the foundations

of like robotics 101 that we read in

every robotics class and of course Dieter has

won all the awards I see the list but I

don't feel like speaking because it

doesn't leave any award in or any kind

of fellow IEEE fellow ACM fellow etc

but maybe a bit of trivia about Dieter

so uh despite like uh being one of the

supreme figures in robotics Dieter is very

down to earth I met him as a PhD

student in a conference and he was just

hanging out I was like oh this seems

like Dieter Fox who I've only seen in photos and I

went there and he was gracious enough

to give 30 minutes of his time in a busy

conference so I highly appreciate

everyone who see around in a conference

him he's he's too nice to decline

drink I and I don't think that's changed

and so with that I hand it over to

Dieter to see what he has today thank you thank

you thank you so much

and thanks for the generous introduction

it's always so great to be back I I did

my uh postdoc here 25 years ago and I can

imagine some of you weren't even maybe

born yet oh my gosh that's very scary

but uh I keep on coming back and it's

always exciting to see all the new work

going on uh see the new faces and also

this morning all the new buildings being

built it's a very exciting time for

CMU and um I can imagine many of you

came maybe just because of the title

because where's RobotGPT that seems to be

a question that we as roboticists are

getting asked like continuously right

now right where everybody sees these

dramatic progress in language models in

Vision models and uh we roboticists are

still having trouble kind of picking up

whatever a water bottle or something

like that um so and uh I keep getting

this question pretty often as well so

just like many of you so I've been

thinking a little bit about this and

today I want to kind of give um a view

of some of my thoughts on that topic but

also um specifically for example we've

been doing a lot at Nvidia on simulation

and I want to um focus a little bit on

what kind of role I think simulation can

play in this in this whole Space um so

as as many of you might know I'm I'm

kind of sharing my role I'm half-time at

the University of Washington in Seattle

faculty and I'm also leading the

robotics research team at Nvidia uh most

mostly located in Seattle so the work

I'm presenting is kind of a mix of both

of these works and a lot of it is also

collaborative between the two Labs but

of course um most of it has also been

done uh with great interns that we had

over the

years so let me start with um kind of

the very high level connection uh

between let's say robot manipulation and

uh the high level topic of generative

AI I think we all agree that we're

seeing really dramatic progress in the

capabilities of I'll just I just call it

I'm going to be very sloppy with the

wordings here like I'll just call it all

generative AI which are all these kind

of models that we're seeing for example

large language models think about ChatGPT

GPT-4 um vision language models that uh

really have a what you might call almost

a deep understanding of of images um but

also purely generative models that are

models that for example diffusion based

models that generate videos images and

um I I think most of you uh would assume

share the view that this progress has

been uh really virtually impossible to

predict right like even people in these

areas when they saw these first um

especially starting with ChatGPT the

progress um has been has been dramatic

right and and people just thought that

can't be and and by now actually as a

roboticist we always used to say well

these models they they're not really

relevant to us because they don't do 3D

and they don't do true planning and

reasoning and we all know there are

limitations to what they can do but uh

the the more I see this progress the

more I'm wondering are we maybe just

missing something on our end as well

right like could it be that if we can

generate some kind of data that these

models are being trained on but um more

relevant to robotics um I personally

don't feel like I can predict what the

limit of these models will be as we're

moving on and so now if you think about

what really made these models so capable

I think it's a combination of on the one

hand um especially the LLMs

they have just huge amounts of training

data right like all the text on the

internet trillions of tokens and words

is the training data and then in a sense

um if if you have so much data in your

training set then most of these tasks

that we throw at them that we call open

world reasoning are kind of in

distribution of your data set so they

don't even need to truly go beyond the

kind of things they've seen in the data

right so um I think that's one of the

the key contributions here is having uh

that data set and these large data sets

available at the same time of course on

the model architectures like especially

very large Transformer models with now

trillions of parameters or even the open

source models have hundreds of billions

of parameters um also are just capable

of learning the more data you give them

the better they get right and um that is

also something that really came up with

this with this uh data scaling and then

capability scaling so there is still

uh with these data also not so

much overfitting actually

going on um and then actually from a

roboticist's perspective it seems like the

training objective for many of these

models is just very simple right it's

kind of behavior cloning which means for

the language models you put in text a

sequence of text and then it just spits

out the next word right so in robotics

we do something actually very similar

when we do behavior cloning imitation

learning where you get demonstrations of

sequences of sensor data or state

information and controls and then what

the model just learns is based on a

history of those what should be the next

control that the robot should generate
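
[Editor's sketch, not from the talk: behavior cloning as described here is plain supervised learning on demonstration pairs. The toy below assumes a 1-D system, a hypothetical proportional "expert", and a history length of one (the current state only). The names `expert_control`, `collect_demos` and `fit_linear_policy` are illustrative, and a real system would use a neural network over sensor histories rather than a linear fit.]

```python
import random


def expert_control(position, target=0.0, gain=0.8):
    """Hypothetical expert: a proportional controller steering toward the target."""
    return gain * (target - position)


def collect_demos(num_episodes=20, horizon=15, seed=0):
    """Roll out the expert and record (state, control) pairs, like teleoperated demos."""
    rng = random.Random(seed)
    demos = []
    for _ in range(num_episodes):
        position = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            control = expert_control(position)
            demos.append((position, control))
            position += control  # toy dynamics: the control is a displacement
    return demos


def fit_linear_policy(demos):
    """Least-squares fit of control = w * state + b (supervised learning on the demos)."""
    n = len(demos)
    mean_x = sum(x for x, _ in demos) / n
    mean_u = sum(u for _, u in demos) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in demos)
    cov_xu = sum((x - mean_x) * (u - mean_u) for x, u in demos)
    w = cov_xu / var_x
    b = mean_u - w * mean_x
    return w, b


demos = collect_demos()
w, b = fit_linear_policy(demos)


def cloned_policy(position):
    """The cloned policy: predicts the next control from the current state."""
    return w * position + b
```

[The cloned policy only sees states the expert visited; nothing in the fit guarantees sensible behavior outside that range, which is the generalization issue raised later in the talk.]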

so it seems like very nicely matched to

supervised learning also that we might

want to do in

robotics now uh so far I think it's fair

to say that we really haven't seen the

kind of abilities in the robotics domain

especially also as it comes to I'm going

to focus today on robot manipulation and

uh here being at CMU of course it's

Moravec he saw it all coming right he

always said that um this paradox that

the things that we always feel like are

difficult um for computers

uh such as playing chess and go like the

Holy Grail of AI these are the really

difficult Parts but it turns out that

those are the relatively easy parts for

computers what's really hard is the

stuff that we humans don't even think

about like uh physical interaction in

the world right like picking up objects

moving around the things that we do

subconsciously uh are the the kind of

things that turn out to be really

difficult also for computers that

control uh these robots and I think of

course in addition to the fact that

robotics is just difficult which is kind

of an excuse we can always use uh uh I

think the other one is really we just

don't have the data yet that um people

had access to for their language and

vision models so that is from my

perspective the the big open question

right um so my hypothesis for today is I

would claim and I'm not 100% fully

behind all of this I I'll I'll I'll tell

you a bit more about it but if we could

generate really really large scale data

sets right that are suitable for

Behavior cloning for robotics kind of

like the language the text data sets for

language models then if we combine that

with these really capable gen AI models

be it diffusion models Transformer

models then I think we can really try to

teach robots these broadly applicable

manipulation skills and doing so just by

let's say Behavior cloning okay the

caveat here is I'm not trying to claim

that that will be the end of Robotics

and everything is going to be solved

right or anything like General AI but I

think if we could do do something like

that at least the robots that are

trained that way would be at a

capability level that is way beyond the

capability level that we're seeing right

now I think right now we're still very

limited and even with behavior cloning I

would hope that we can get them to a

level up such that then follow-up

techniques like reinforcement learning

and things like that can actually work

well so the big open question of course

for this is where are we going to get

that kind of data from right so it has

to be kind of temporal data it has to be

uh data annotated with actions and

things like

that so uh we we're all seeing uh many

many videos these days also on success

stories where people use Behavior

cloning for manipulation already um

and the underlying technique

is a lot of the time kind of diffusion

policies um and these techniques um

are often trained on human

teleoperation where they show the robot what

to do and they give it multiple

demonstrations and then the robot

autonomously replays those

demonstrations right um uh and the

capabilities are uh very very exciting

what you can see sometimes right what

these robots can do there tasks related

to cooking and things like that I think

one key limitation that we're seeing

actually at this point with all these

demonstrations is if you think about the

vast space of things that we want a

robot to do that all of these are kind

of teeny tiny point solutions in

that space right which means um they are

all these demonstrations and different

tasks they are isolated kind of

demonstrations and we're not seeing the

kind of interpolation or cross task um

uh generalizability that we would

actually like to see and that we need to

see in order to really see the robot

capabilities of the Next

Generation right so many of these

demonstrations um if you look Beyond

just what's shown in the videos but you

look bit more at the details then often

it turns out that for example whatever

if the robot picks up an object um it

looks really impressive in the video but

it turns out that that object is not

allowed to move at all which means if

you move the object that robot is not

going to be able to pick it up anymore

and for humans that is totally

counterintuitive right if you think if I

can pick up this object and it moves 10

centimeters to the left I should be able to

pick it up too but many of these

Behavior cloning kind of techniques at

this point the way they are being

trained on relatively small data sets um

don't really generalize very well uh

there's other examples for example where

uh if the table height in your test

environment is different than the table

height in your training environment then

your robot might not be able to pick up

the objects anymore so there's still

clearly limitations and um I think these

limitations are mainly also just because

of the the the scarcity of training data

that we that we have so now what are

different kind of ways that we can use

to generate data for Behavior cloning on

the one hand

uh one exciting direction I actually

think is using videos observing humans

doing these tasks uh like learning from

YouTube videos or like Ego4D kind of

data sets where um we also show how

humans perform tasks in the real world

um using also egocentric video for

example and there's a lot of exciting

work going on where then for example you

can track the human hands and you can

use that as guidance

either as a as a reward function or as a

high level policy so that your robot can

replicate that at this point my sense is

that the gap between those videos and

what a real robot would do in the real

world is still a bit too large to

actually succeed but but I I think

that's actually a very promising

Direction uh the other example is what I

also showed on the previous slide is

real world demonstrations I'm sure all

of you have heard about like the Google

arm farm and uh the data collection went

for example into the RT-2 model training

um more recently there have been cross-institution

attempts at combining different data

sets collected at different institutions

like Open X-Embodiment and then uh the

DROID data set is a very recent one uh I

think these are very good for

pre-training at this point but my

experience is that for example with the

uh Open X-Embodiment is that the

data sets are still too separate from

each other and it's really hard to kind

of learn skills that go across these

different data sets and things like that

so there's still a bit of a way to go

the DROID data set is more focusing on a

single platform and a very controlled

setup and there's a higher chance of

actually combining all this data into a

single model in a meaningful way but I

think overall um all these data sets

still if you look at individual skills

they tend to overfit to very specific

tasks so um the question is how far can

we scale that right if we look at uh a lot

of the humanoid companies right now that

get like hundreds of millions of

dollars of funding they will actually

spend a lot of that money I assume also

on generating teleoperated training data so

um I think that'll be very exciting and

and it's from my perspective it's a

truly open question of how far they can

can go with this um one question of

course for the academic research

Community is how are we going to get our

hands on that data set right because of

course it's not necessarily in the

interest of these companies to share it

with everybody um another alternative to

this kind of very expensive kind of data

generation is simulation various

different environments various different

skills um the advantage of simulation

data environment is of course once it's

set up and everything it's it's

relatively cheap and you don't have to

be a robotics expert to to run these

experiments uh one key problem with

these data sets of course is the

sim-to-real gap that for many of them the physics

don't really work quite as well as they

should so that what you learn in Sim for

example transfers to the real world and

another open question is really asset

generation like how do we populate these

simulation environments with the

right kind of assets and

tasks and uh I will talk a bit more

especially about this Sim to real

Gap so in the next section I want to um

give you some examples of the work that

we've done especially also at Nvidia on

kind of training manipulation

capabilities in simulation so that they

transfer to the real world the simple thing you can do is for

example you want to teach a robot how to

grasp objects right let me just give

you this one example here where you have

a 3D model of the object and you want to

generate a data set a labeled data set

that says which are good ways for

picking up this object right let's say the

phone here for example um what you can

do is you can do very simple rejection

sampling on this which means you have a

3D asset of the object and you just

randomly guess possible ways for how the

gripper could be relative to it and then

you feed that into your physics simulator

and tell the physics simulator okay

close the gripper and move up and

down and for some of the grasps that were

random good guesses it's going to work

but for many of the grasps it doesn't work

and what you're going to do is you're

just going to retain the ones for which

the object remained in the gripper right
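
[Editor's sketch, not from the talk: the sample-simulate-keep loop described here, in miniature. The physics simulator is replaced by a crude geometric check, which is an assumption made only to keep the example self-contained. A real pipeline would close the gripper in a physics engine, move it, and test whether the object is still held. `simulate_grasp` and `sample_grasps` are hypothetical names.]

```python
import random


def simulate_grasp(offset, opening, half_width=0.03, max_opening=0.08):
    """Stand-in for the physics simulator (a geometric check, not real physics).

    The object is a box of width 2 * half_width centred at 0 along the closing
    axis. The grasp counts as a success if the open fingers straddle the box and
    the required opening does not exceed what the gripper can do.
    """
    left = offset - opening / 2
    right = offset + opening / 2
    return opening <= max_opening and left < -half_width and right > half_width


def sample_grasps(num_samples=5000, seed=0):
    """Rejection sampling: guess grasps at random, keep only the ones that succeed."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_samples):
        offset = rng.uniform(-0.1, 0.1)   # gripper centre relative to the object (m)
        opening = rng.uniform(0.0, 0.1)   # finger opening (m)
        if simulate_grasp(offset, opening):
            kept.append((offset, opening))
    return kept


grasps = sample_grasps()
success_fraction = len(grasps) / 5000
```

[Most random guesses are rejected, which is why a fast, parallel simulator matters for building a large labeled grasp set this way.]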

so that's actually a very simple way of

uh generating now for this object and

this kind of gripper a data set that

contains all the kind of promising grasps

for picking this object up okay uh you

can do this for different combinations

of grippers and objects and here we did

this for example for a subset of the

ShapeNet data set uh where we have

almost 9,000 objects and we ran them all

through this parallel physics simulation

in order to figure out the good grasps

and now what we have is we have a data

set of annotated objects with grasps

right um the next thing we can do now is

and I'm not going to go into any of of

the technical details but we had a a

line of work where we can then use that

data set in order to train a deep

network initially it was

like a variational autoencoder um the

most recent work here is M2T2 as it's

called here which is like a Transformer model

but the idea is that you can now in

simulation you can just Generate random

scenes let's say tabletop scenes or

scenes with drawers you can put these

objects in the scene

and you can label also the scene with

the grasps that were successful on these

objects so the input and and then what

you can do is sorry you have this scene

and you can render it as a point Cloud

for example okay so in simulation you're

not using the shape models themselves

but you render it as a point Cloud

because that is what the robot will

observe in the real world and then you

train a deep network that takes as

input a point cloud and the output of

the network is oh what are all the

possible ways in in which I could grasp

these objects in the scene okay like you

can see this Illustrated here on the on

the left side down there and in addition

in this specific case you could also

train it to say if the robot has an

object in a certain configuration in the

hand what are the possible ways in which

it could place it in the scene okay and

it turns out that that works uh

surprisingly well especially on point

clouds uh Point clouds transfer pretty

well from SIM to real one other aspect

that we might want to look at then is

for example um object segmentation and

and this is work that we started kind of

in parallel of course nowadays you might

use SAM Segment Anything for this which

was trained on a huge amount of real world

data but here just want to show that we

can also do actually capable

segmentation training purely in

simulation so this is just some example

where we where we randomly generate

scenes with objects and the objects can

come from different data sets and then

render them and the nice thing about

simulation is that you can get this

segmentation and everything for free
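
[Editor's sketch, not from the talk: why segmentation labels come "for free" in simulation. Because the generator places every object itself, the per-pixel instance mask is known exactly without any human annotation. The 2-D rectangle "renderer" below is a deliberately tiny stand-in for a real renderer, and `generate_scene` is a hypothetical name.]

```python
import random


def generate_scene(width=32, height=24, num_objects=3, seed=0):
    """Place random axis-aligned rectangles and return (image, mask).

    image holds a per-pixel intensity; mask holds the instance id of the object
    visible at that pixel (0 = background). The mask is exact by construction.
    """
    rng = random.Random(seed)
    image = [[0.0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for instance_id in range(1, num_objects + 1):
        w, h = rng.randint(4, 10), rng.randint(4, 8)
        x0, y0 = rng.randint(0, width - w), rng.randint(0, height - h)
        intensity = rng.uniform(0.2, 1.0)
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                image[y][x] = intensity   # later objects occlude earlier ones
                mask[y][x] = instance_id
    return image, mask


image, mask = generate_scene()
visible_ids = {pixel for row in mask for pixel in row}
```

[Each (image, mask) pair is a ready-made training example, and changing the seed or the object set yields as many as needed.]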

right we also uh in this line of work

that's called object Seeker um we can

train a network that then says I give

you an image of an object and you should

detect the segment in the scene that

belongs to that corresponds to that

image okay and uh that is purely trained

in simulation and let me just show you

an example for how that then works so

here is on the upper left you see that's

kind of the query view where you say

okay I have a have a picture of that

that pot uh on the next image up there

in the middle you see that's the view

from an external camera and uh the upper

right is the view from the gripper

camera on the robot which is just up

here you can see that that's the scene

and so the idea is that what the model

takes as input it takes that query View

and the the the image itself and then

uses the query view to segment out the

object that this corresponds to in this

case it's this green mask that we can

automatically then generate

and then we feed that together with this

grasp Network that was used to generate

this grasp and then the robot can pick

grasps and also in this case even let's

say the there was a network that was

trained um to do Collision checking

because uh Collision checking in some

settings for example if you have

occlusions and things like that using

just the point Cloud might not be as

robust as for example training a deep

Network to do Collision checking for you

okay here's

now here's another scene so again here

in this case we just give it the image

of the fruit

snacks you can automatically segment it

out and then generate the grasps for

that all the components everything is trained in

simulation and the nice thing is let's

say compared to some of the

earlier work I was hinting at is

that of course in these simulations you

can nicely randomize environmental

parameters right like the size of

drawers or the height of countertops and

things like that so the system then

becomes robust to that as

well um wait there was one y and then

once you can pick up this object this is

work with the last here it's called

ProgPrompt where then uh that's one way now

to connect these kind of let's say

low-level capabilities with this vision

language model this was with uh GPT

where the idea is we have the language

we want the language model to generate

python code that the robot can execute

right and here's just one example where

the input to the model for the prompting

is we first give it an example of what

the code structure should look like for

example from actions import so we can

say these are the actions the robot can

actually execute
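
[Editor's sketch, not from the talk: the prompt layout described in this part of the talk (an import line listing the robot's actions, the objects in the scene, an example function, then the header of the function to be completed) could be assembled roughly like this. Every action name, object name and function name below is hypothetical, and no language model is called.]

```python
def build_prompt(available_actions, scene_objects, example_task, example_steps, new_task):
    """Assemble a Python-style prompt for a language model (all names hypothetical)."""
    lines = [f"from actions import {', '.join(available_actions)}", ""]
    lines.append(f"objects = {scene_objects!r}")
    lines.append("")
    lines.append(f"def {example_task}():")
    lines.extend(f"    {step}" for step in example_steps)
    lines.append("")
    lines.append(f"def {new_task}():")  # the model is asked to complete this body
    return "\n".join(lines)


prompt = build_prompt(
    available_actions=["grab", "put_in", "find"],
    scene_objects=["banana", "bottle", "plate", "box", "garbage_can"],
    example_task="throw_away_banana",
    example_steps=[
        "find('banana')",
        "grab('banana')",
        "put_in('banana', 'garbage_can')",
    ],
    new_task="sort_fruits_on_plate_and_bottles_in_box",
)
print(prompt)
```

[Restricting the import line to actions the robot can execute, and the object list to what the vision model detected, is what keeps the generated program grounded in the current scene.]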

um then we have a function for example

that says throw away

banana and also importantly the object

so we tell the code or the LLM

what are the objects that are accessible

right now in the scene and then we want

to use that to define a function and

then in this context for example if you

look at the at the left scene the key

question is always how do we enable the

robot to do this what's called open

world reasoning right that's the key

reason why we use LLMs because they have

these capabilities to reason about things

that are

not uh pre-trained on a classified set

of objects right so for example you

might then say sort the fruits on the

plate and the bottles in the Box um we

can now take that

sentence generate a function that we

would like to have specified in Python

then we run a vision model

to detect the objects in the scene so

this is now also input to the

LLM and then we tell the LLM kind of

okay now give us the individual steps of

that right and then uh we can just

execute those on the real robot okay I'm

not making claims here that this is

doing really complicated planning or

things like that but it's kind of what

you might consider a very loose

connection between VLMs and robotics

where you have very specific robotic

skills and the VLM then can call these

right and of course we all know all the

limitations to this kind of work still

um where it's about hallucination and

things like that but it's just an

example of how you can combine these kinds of things

now all right so um so far the

manipulation setup was mostly uh kind of

object pick and place so relatively

simple from a physics simulation

perspective another area that we've been

looking at is what we call

contact-rich industrial tasks okay um it turns

out that uh most of these hard contact-rich

tasks in Industry are still being done

by humans because robots are just not

necessarily for example flexible enough

to to do these or um do them especially

when the environment changes and things

like that and NIST for example this is not

the most recent version but they for

example came up with a taskboard to

Benchmark these capabilities like you

can see here are these different assets

and they need to be inserted into

something um and it turns out that all

all these tasks have been done actually

typically on real robots and um even The

Benchmark environment was like a real

physical Benchmark and the problem was

that it's really hard to actually

simulate that I must admit I thought

come on that can't be hard right it's

like really well specified objects you

have good CAD models for them how hard

can it be to to simulate sticking a peg

in a hole or something like that it turns

out it's not trivial to get it uh to

work well um so for example when we did

this work the first one was called Factory

at the time the state-of-the-art was

something like this where they were able

to simulate threading a a nut onto a

bolt okay but the problem was first of

all the the margins were not quite the

margins that we would see in real

physical nut bolt setups and also the

simulation was 350 times slower than

real time which from a learning

perspective sometimes defeats the purpose

of simulation because the reason to do

simulation or one reason to do simulation is

just that you can do very fast

simulation faster than in the real world
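
[Editor's sketch, not from the talk: a back-of-the-envelope comparison using the two figures quoted in this part of the talk, a single environment running 350 times slower than real time versus about a thousand environments running in real time in parallel. The 1000 hours of experience is an arbitrary illustrative budget, and `wall_clock_hours` is a hypothetical helper.]

```python
def wall_clock_hours(experience_hours, realtime_factor, num_parallel=1):
    """Wall-clock time needed to collect a given amount of simulated experience.

    realtime_factor > 1 means faster than real time, < 1 means slower.
    """
    return experience_hours / (realtime_factor * num_parallel)


# One environment running 350x slower than real time.
slow = wall_clock_hours(1000, realtime_factor=1 / 350)

# A thousand environments, each running in real time, in parallel.
fast = wall_clock_hours(1000, realtime_factor=1.0, num_parallel=1000)

speedup = slow / fast  # ratio of the two wall-clock times
```

[Under these assumptions the same experience budget goes from roughly 350,000 wall-clock hours to about one, which is the difference between reinforcement learning being impractical and being routine.]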

right so um but that was kind of the

state of the art and then we worked at

Nvidia with uh people also from the

physics team and um they did some magic

uh tricks on that because they hadn't

considered that use case beforehand so

we said well that's actually important

for these industrial tasks so they then

came up with a simulation that can now

do of course uh doing some magic on on

the GPU and some optimizations so now

they were able to actually simulate a

thousand of them in real time in

parallel so beforehand it was a single one

350 times slower and now we can do a thousand

in parallel right the nice thing is once

you can do that um you can start doing

learning in this context right

because you can do reinforcement

learning for that and then you can for

example start training a policy for

doing this task in

simulation okay um here's just one

example that is actually the most recent

work in this line where now in the

AutoMate work um Yash and his team

and Bing she's an intern with us

working on that where they defined a

hundred of these insertion tasks in

simulation and then uh typically what

we do is we train using PPO

individual um policies for the

different tasks some of them are kind of

threading some of them are insertion

tasks um and the policies first are

trained typically let's say for each

individual task you have a different

what we call state-based policy which

means you give it access to the internal

state of the simulator the exact pose of

objects and things like that and then we

use that as training

data to distill it into a policy that

operates for example from point cloud

data and I'm not talking necessarily

about Point Cloud observations but Point

Cloud representations of the assets um
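
[Editor's sketch, not from the talk: teacher-student distillation in miniature. A "teacher" with privileged access to the exact simulator state labels data, and a "student" that only sees a point-cloud-like input is fit to reproduce the teacher's actions. The 1-D peg offset, the small jitter on the points, the centroid feature and the linear student are all simplifying assumptions. The talk's teacher policies are trained with reinforcement learning and the students are neural networks.]

```python
import random


def teacher_policy(true_offset, gain=0.5):
    """State-based teacher: sees the exact peg-to-hole offset from the simulator."""
    return -gain * true_offset


def asset_point_cloud(true_offset, rng, num_points=50, jitter=0.002):
    """Student input: 1-D points of the asset placed at its pose, with small jitter."""
    return [true_offset + rng.gauss(0.0, jitter) for _ in range(num_points)]


def centroid(points):
    return sum(points) / len(points)


def distill(num_samples=500, seed=0):
    """Label student inputs with teacher actions, then fit action = w * centroid + b."""
    rng = random.Random(seed)
    features, actions = [], []
    for _ in range(num_samples):
        offset = rng.uniform(-0.05, 0.05)
        features.append(centroid(asset_point_cloud(offset, rng)))
        actions.append(teacher_policy(offset))  # the teacher supplies the label
    n = len(features)
    mean_f = sum(features) / n
    mean_a = sum(actions) / n
    var_f = sum((f - mean_f) ** 2 for f in features)
    cov = sum((f - mean_f) * (a - mean_a) for f, a in zip(features, actions))
    w = cov / var_f
    return w, mean_a - w * mean_f


w, b = distill()


def student_policy(points):
    """Distilled policy: acts from the point cloud alone, without simulator state."""
    return w * centroid(points) + b
```

[The student never needs the simulator's internal state at run time, which is what makes it deployable on a real robot.]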

and then we're getting on on these

policies on the individual policies for

example we're getting very nice success

rate both in simulation and in the real

world so on the right side is what these

policies now do in the real world we do

zero-shot sim-to-real transfer on those so

it's actually working um very well now I

think uh one thing is and and then

instead of just doing these individual

policies of course nowadays you can then

use the data to try to distill it into a

generalist policy that can do it

independent of the individual assets and

it's not quite where I think we would

like it to be ultimately so in this case

for example we were able to train a

single policy on 20 different tasks

different assets and the success rate if

you look at the numbers here

surprisingly in Sim actually drops by

10% but the real world doesn't even drop

that that far but still um what you

would like to see is of course that you

can generate enough tasks and then um

distill this into a single policy so

that you get really cross task uh uh

benefits right so that for example you

get a policy overall that's better than

any of these individual pre-trained

policies and also ultimately of course

you want to have a policy that can work

on unknown assets as well we we're just

getting there right now and starting to

look into that okay I have a question

yeah yeah yeah so did you first the

first question is did you randomize the

physics parameters and the second is is

the policy you distill to a recurrent

policy or it's a yeah it's a recurrent policy

and uh I'm pretty sure they did not for example

fine tune any physics parameters per

task or something like that there was

just once if possible the fine-tuning

and then of course you do in addition to

make it robust you do some randomization

on the physics parameters as

well how here

is Cam on the or mle that's what said so

in this case actually the point cloud is

not for the sake of let's say

um like what pulkit for example did for

the for the object it's more like a

point Cloud because for example if you

if you want to have um state based

policies for different assets the

problem is you can't learn a single

state-based policy across assets

because the state which means the

position of the object doesn't convey

actually Which object you're holding in

your hand which means you can't learn a

policy that adapts to that so in that

case we replace the state by um

extracting a representation from the

point Cloud still in simulator from the

point Cloud that represents the asset

itself it's not coming from let's say a

camera observation also right so

at test time we have I don't know I

think it might even be one camera and

with the camera then we do object

detection because we still assume we

have the asset right and then use that

as the initial pose but then generate

the point Cloud again that goes in the

policy but you can you can well imagine

that so you you theet to the and then

you the the asset we still assume that

we have the asset even in the real world

and we take the asset Point Cloud but

we're using the pose estimate to place

the point Cloud relative to the gripper
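
[Editor's sketch, not from the talk: placing a known asset's point cloud using a pose estimate, as described in this answer. For brevity the pose is reduced to a yaw angle plus a translation, whereas a real system would use a full 6-DoF pose. The asset points and pose values are made up, and `place_point_cloud` is a hypothetical name.]

```python
import math


def place_point_cloud(asset_points, yaw, translation):
    """Rigidly transform canonical asset points (x, y, z): rotate about z, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        (c * x - s * y + tx, s * x + c * y + ty, z + tz)
        for x, y, z in asset_points
    ]


# Canonical point cloud of a hypothetical asset, e.g. sampled from its CAD model.
asset_points = [(0.01, 0.0, 0.0), (0.0, 0.01, 0.0), (0.0, 0.0, 0.02)]

# Pose estimate of the asset relative to the gripper (made-up values).
estimated_yaw = math.pi / 2
estimated_translation = (0.0, 0.0, 0.05)

policy_input = place_point_cloud(asset_points, estimated_yaw, estimated_translation)
```

[Because the points come from the asset model and not from a camera, the policy input looks the same in simulation and on the real robot. Only the pose estimate carries real-world error.]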

but the relatively obvious next step is

to learn all of that just purely based

on observed point

clouds can you comment on the accuracy

of the state estimation at time or I'll

estimation uh in three slides two slides

yes um

oh another step now is for example tactile

simulation um so this is some experiments

we did with the GelSight it has

kind of it generates a tactile image on

on on this pad that is uh between the

grippers or on the front of each of

the gripper and on the left side I'll

show you here for example this is a

simulation roll out of a policy where in

the lower left of course you see the the

GelSight image that we can generate

in very fast in real time or faster than

real time um in simulation so now again

we can start doing simulation training

um using for example tactile information

as well and we're also looking at using

force feedback and things like that as

well um and then you can do of course

these kind of experiments nicely where

here's kind of the training success

using PPO on that where um in green we

have the training curve and success rate

that we would get if we give the policy

the ground truth position of the peg

which in reality of course you never

have if we now perturb that estimate

with some noise that the system doesn't

have then actually um PPO doesn't succeed

at all it's kind of the gray curve that

is the flat line at zero um and and then

these other three the blue orange and

purple curve they show training uh

results um if you use either the a

gripper a camera that is placed on the

wrist of the robot right because that

gives you information about the object

where it is relative in the gripper this

would be the the purple curve and then

if we use the tactile image we this is

the orange curve which is very similar

but actually combining the tactile image

and the wrist camera image gives us a

blue curve which is better than those so

it kind of indicates that um of course

combining these different data sources

into a single policy uh improves

performance um that again the nice thing

about the simulation is we can start

making all these kind of experiments

right and really Benchmark different

things against each other uh we also now

getting this to transfer to the real

world which means we can with zero shot

we can train a policy with tactile um

feedback and then it works in the real

world and it's just to highlight that

yeah it works in the dark if you have a

all right so so this is kind of for this

line of work for let's say more

industrial kind of manipulation task but

I think ultimately even our robots right

they should be able to plug something

into a power outlet or something like

that or USB and things like that so I

think all this kind of can lead to these

kind of capabilities as well in the open

world uh Beyond let's say these more

static simulation tasks uh of course I'm

sure many of you have seen this line of

work the DeXtreme where this was Ankur

Handa and his collaborators uh did

inhand rotation of objects and in this

case the interesting aspect was um I

don't want to spend too much to explain

the task but it has to rotate this

object to a certain configuration uh and

here he trained actually a state-based

policy and in order to successfully

execute that state-based policy in the

real world he also trained a key Point

detector for the cube for example that

then can be applied in the real world to

give you the state so and that keypoint

detector was actually robust enough that

we get also um sim-to-real transfer zero

shot now briefly one example um on uh

the notion of pose estimation um it turns

out uh it I believe in many settings 6D

object pose estimation doesn't make

actually that much sense because many

objects like uh you don't have a model

for them and I think uh it's kind of an

an artificial bottleneck but for example

you can imagine in industrial settings

or so where you might have actually

access to 3D models of your assets and

things like that pose estimation can

still be uh extremely helpful and this

just some work I want to highlight that

is Bowen Wen did this um at Nvidia and

it's coming up at CVPR this year it's

called FoundationPose of course

Foundation uh the idea is um let's

assume you have a 3D model of your

object and um textured like the one up

here and this is for um kind of local

pose estimation you so you assume you

can detect object with a bounding box or

something like that but the goal here is

to estimate the 3D position and

orientation of the object with high

Precision okay so you have some kind of

rough initialization and then you try to

estimate the pose and if you can do that

you also want to do tracking over time

by just initializing your estimator from

the previous time step so the idea here

is this render and compare kind of

approach which is something we've also

done before with techniques like

DeepIM or MegaPose um so the input is a

rendered view onto the object where the

rendered view comes from your current

estimate for where the object is and it

is also kind of a part of the image

where the object is in this case it's

the uh is it Cheez-It I guess everything is

either Cheez-It box or mustard or so

nowadays and then this is the input to

the network and then the network is just

trained uh details don't really matter

but the network is trained to give as

output a local translation and rotation

of the object pose such that the

rendered object matches the observed

object closer right so it's kind of like

for those of you who know from depth

cameras or ICP kind of techniques this

is doing something like learned ICP but

in the full image depth space okay and

then the idea is you get a refined pose

estimate and then you can repeat that

process by rendering your object at your

refined pose and let your uh your deep

network uh suggest another predicted

pose um and again we we we've done this

before but now actually this is really

kind of a really level up in the in the

capabilities um and the key trick is was

really also again on just data scale

right so Bowen trained this purely in

simulation we just have like the 40,000

objects from Objaverse and then also

from the Google Scanned Objects 1,000 um

then also we use some additional um uh

texturing to have more variability on

the object texturing using for example

an object and then an llm and it says

and the llm might say oh the wine glass

should be green or something like that

and then you have it's called um

TexFusion that can generate then a texture

on the object and you get more variety on

your your data and then of course in

your training data you do a lot of um

randomization on the lighting and

everything to make it robust and uh the

interesting aspect here is first uh this

initial estimation if you don't

have the pose relatively close yet but

you have to try different uh positions

takes a second but then tracking can be

done at 30 Hz with this network and

it's currently it's state-of-the-art on

on on many of these kind of data sets um

that measure kind of 6D pose estimation

and tracking um and the key thing here

is that this network is trained um since

it's trained on all these objects it can

do zero shot object tracking which means

many of these previous Works they assume

that you can train your network on the

object that you want to track in this

case the object is just an input to the

network okay I'll just give you some

examples here so on the left observation

on the right side you see kind of a

rendering of the

object where the system detects it on

side so it's actually extremely stable

and robust and again it's very very

precise and also let me note again that

these objects have not been in the

training set of of the network right so

it's truly kind of zero-shot new object

and it can do that that's where the name

FoundationPose comes from of course
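
The render-and-compare loop described above (render at the current estimate, predict a local translation and rotation, repeat, then track by starting from the previous frame's pose) can be sketched in a toy 2D setting. This is only an illustration under stated assumptions, not the FoundationPose implementation: `render` simply transforms model points, and `predict_update` is a hypothetical closed-form stand-in for the trained network that assumes known point correspondences. All function names, the model points, and the damping factor are made up for this sketch.

```python
import numpy as np

# Toy "3D model": 2D points centered at the origin (assumption for this sketch).
MODEL = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, -1.0]])

def render(model_pts, pose):
    """Stand-in renderer: place the model points at pose = (theta, tx, ty)."""
    theta, tx, ty = pose
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return model_pts @ rot.T + np.array([tx, ty])

def predict_update(rendered, observed, step=0.5):
    """Hypothetical stand-in for the trained network: returns a damped local
    rotation and translation that moves the rendering toward the observation."""
    mu_r, mu_o = rendered.mean(axis=0), observed.mean(axis=0)
    a, b = rendered - mu_r, observed - mu_o
    # Least-squares 2D rotation angle that aligns a to b.
    dtheta = np.arctan2(np.sum(a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0]),
                        np.sum(a * b))
    return step * dtheta, step * (mu_o - mu_r)

def refine(model_pts, observed, pose, iters=20):
    """Render at the current estimate, compare, apply the update, repeat."""
    theta, t = pose[0], np.array(pose[1:], dtype=float)
    for _ in range(iters):
        rendered = render(model_pts, (theta, t[0], t[1]))
        dtheta, dt = predict_update(rendered, observed)
        theta, t = theta + dtheta, t + dt
    return (theta, t[0], t[1])

def track(model_pts, frames, init_pose, iters_per_frame=5):
    """Tracking: initialize each frame from the previous frame's estimate."""
    poses, pose = [], init_pose
    for observed in frames:
        pose = refine(model_pts, observed, pose, iters=iters_per_frame)
        poses.append(pose)
    return poses

true_pose = (0.6, 1.0, -0.5)
observed = render(MODEL, true_pose)
estimate = refine(MODEL, observed, (0.0, 0.0, 0.0))

# A slowly moving object, tracked with only a few iterations per frame.
moving = [(0.6 + 0.05 * k, 1.0 + 0.02 * k, -0.5) for k in range(1, 6)]
tracked = track(MODEL, [render(MODEL, p) for p in moving], estimate)
```

The split mirrors the talk: a slower initial refinement from a rough guess, then cheap per-frame updates because each frame starts close to the answer.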

okay uh one more example here on the

left side because for industrial stuff

these kind of objects like shiny metal

pieces and things like that actually

very relevant right for many of the

tasks that you want your robot to do and

we just uh yeah have it here in front of

the camera and you can see that tracking

Works actually very very well without

any additional training yeah can you

include RGB-D or this is

RGB-D yeah I'm tting as

well and here on the right side this is

I think where they built the model just

uh maybe even with something like

BundleSDF or so but then you can see that the

tracking is actually even the lighting

conditions are not great but the

tracking is very very

good so why I'm showing you that is just

to highlight that that sim-to-real doesn't

only work with point cloud kind of data

right or depth data but it's also

getting nowadays better and better at

RGB kind of sim-to-real there was a question

there yeah that piece that piece of metal has

just like no texture right other than

just like a couple uniform holes so do

you think it's actually using much of

the color data or do you think a lot of

it is just using the depth because you

give it you gave it both right I think

it's using I would say it's using a

combination you can see there's for

example also still maybe holes like that

um and I mean that's a nice thing versus

for example key Point based kind of

techniques right they always rely on on

visual key points that you can detect

and this system that is all just

end-to-end trained uh doesn't require that

who knows what it does internally right

so I don't have a a clear answer in this

um and it's very robust also as you can

see actually with respect to um

right how do you you first a model out

of it yeah so the idea is with these

object that's different actually you can

look at the at the there's different

setups one is for example in the left is

where we actually have a CAD model

because for many of these industrial

parts you have the CAD model that we can

then use readily and for some of those

you can either even you can train it to

do pose estimation just based on multiple views

onto the object it doesn't actually

require a full 6D pose of the

object if you can generate this from

nering or something

that yeah totally yeah yeah yeah yeah

yeah yeah yeah yeah I think this might

even be what they did in this case

okay all right so now that uh well what

I tried to convince you of is that

simulation can work pretty well right

for at least this set of tasks that we

looked at I'm not saying simulation can

solve everything but we're getting

reasonably good sim-to-real uh transfer

results both on the on the physics and

on the um appearance of things um now

where we want to go with this because so

far all I showed you is kind of um

individual little projects right where

we set up the training specifically for

this and and and the assets and

everything and now want to just describe

like where I think we can go with more

like setting up a larger framework for

how to do that and it's kind of like a a

Sim based robot training pipeline right

uh where the idea is there's like in my

view there's kind of three key steps

that we always have to do if we want to

uh train these things in simulation one

is first of all if you have a certain

task or so you generate random

assets and scenes right that represent

whatever application domain you you you

want to worry about then the next thing

is we need to be able to generate

certain tasks in this environment and

also um could be rewards depending on uh

how you generate your solutions for

example in the industrial setting we

used reinforcement learning so you have

to set up your rewards um and the next

step then is for example you go into

these environments and then because

you're in simulation you take advantage

of the privileged information that the

simulator gives you for example right so

for example I know exactly where the

objects are I have the pose estimation

I have the perfect shape estimation that

is a key advantage that I have being in

Sim and then it turns out that many of

these techniques actually work very well

they can solve tasks in Sim that we

cannot yet solve in the real world so

that's a key trick and then what we do

is we just generate many tasks we use

techniques like task and motion planning

or reinforcement learning to solve these

tasks and now what we can do is we can

use these task solutions to do Behavior

cloning on them right where the key

trick now is to say we take all these

demonstrations which is policy rollouts of

the demonstrations and then render these

rollouts with the sensor information

that my real robot will have access to

and then I can do Behavior cloning and

we're exactly in this kind of world

where we want to be right uh and just to

be very clear about this I don't believe

that we will be able to do everything

just in simulation right so clearly

adding and combining that with real

world demonstration data or even or even

video data and things like that will be

crucial on the long

run and then of course building the what

we're looking into just building the

computer infrastructure for doing all of

this where do you store the data what

formats do you store the data in we're

using this USD representation uh for

these simulation environments and then

also like how do you train these models
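
The three steps just listed (random scenes, solutions from privileged simulator state, behavior cloning on rollouts rendered with the robot's sensors) can be sketched end to end in a toy 2D reaching task. Everything here is an assumption for illustration: the "scene" is two random points, the expert is a simple proportional controller standing in for task and motion planning or reinforcement learning, the "rendered" observation is a noisy relative position, and the cloned policy is linear.

```python
import numpy as np

rng = np.random.default_rng(0)
SENSOR_NOISE = 0.01  # assumed noise level of the simulated robot sensor

def sample_scene():
    """Step 1: random scene, here gripper and object positions in 2D."""
    return rng.uniform(-1, 1, size=2), rng.uniform(-1, 1, size=2)

def expert_action(gripper, obj):
    """Step 2: expert that uses privileged simulator state (exact positions)."""
    return 0.5 * (obj - gripper)

def render_observation(gripper, obj):
    """What the real robot would get: a noisy relative object position."""
    return (obj - gripper) + rng.normal(0.0, SENSOR_NOISE, size=2)

def collect_demonstrations(num_scenes=200, horizon=10):
    """Roll out the expert and record (sensor observation, expert action)."""
    observations, actions = [], []
    for _ in range(num_scenes):
        gripper, obj = sample_scene()
        for _ in range(horizon):
            action = expert_action(gripper, obj)
            observations.append(render_observation(gripper, obj))
            actions.append(action)
            gripper = gripper + action
    return np.array(observations), np.array(actions)

def behavior_cloning(observations, actions):
    """Step 3: fit a linear policy action = observation @ W by least squares."""
    W, *_ = np.linalg.lstsq(observations, actions, rcond=None)
    return W

def rollout(W, horizon=20):
    """Run the cloned policy from sensor observations only."""
    gripper, obj = sample_scene()
    for _ in range(horizon):
        gripper = gripper + render_observation(gripper, obj) @ W
    return np.linalg.norm(obj - gripper)

observations, actions = collect_demonstrations()
W = behavior_cloning(observations, actions)
final_error = rollout(W)
```

The point of the sketch is the data flow: the expert never sees sensor data and the cloned policy never sees privileged state, which is the trick the pipeline relies on.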

so let me just give you an example for

uh what I have in mind here so for

example on the scene generation if you

want to train a robot now to do more of

these indoor let's say kitchen tasks

then um we have a project that is

looking at this uh programmatically

procedurally generating these kind of

synthetic scenes of kitchen environments the

key uh thing to notice is that these are

articulated scenes and also so that they

work with a physics simulator that's of

course always the important part right

um and then because we're in simulation

if you can do it with one you can then

randomize over that as well um it turns

out that the variety that you can come

up with these kind of techniques might

still be limited but at least now we

have a large set of

environments right that uh we can run in

our full physics

simulator um one question is then uh how

can we go beyond just these procedurally

generated environments and of course um

we can leverage generative AI for for

going beyond the simple assets and for

example Katerina is doing some really

cool work in in that domain we're

looking at for example two projects

right now where for instance you might

want to generate assets like cabinets

and drawers and things like that and you

can imagine we train a system where

the llm says something like okay the

cabinet has four shelves and two drawers

and then from that actually you go to a

URDF description of that cabinet you learn

to go there put that out and then you

actually have also shape models for the

individual components of that and then

out you might get like fully functional

uh shape assets that you can um then use

in your simulator another line of work

my student Zoey did this um it's called

URDFormer where what she did is the

following she said okay I I can generate

in this upper row I can generate let's

say simple procedurally simple assets

for like doors and drawers and cabinets

and things like that she then then uh

figured out a way to

use and I'll point the paper to use

stable diffusion so that you can render

now actually pretty nice realistically

looking views onto these objects with a

lot of variety in them the key trick is

that these actually rendered images are

consistent with the urdf model with

respect to handles and things like that

right um so that was one tricky piece
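
To make the cabinet example above concrete (an LLM saying "four shelves and two drawers" and a URDF coming out), here is a minimal sketch that turns such a structured description into URDF text. It is not the method from either project mentioned: the `spec` dictionary format, the function names, and all dimensions are hypothetical, and the box geometry is a crude placeholder rather than a learned shape model.

```python
import xml.etree.ElementTree as ET

def add_box_link(robot, name, size):
    """Add a link whose visual and collision geometry is a single box."""
    link = ET.SubElement(robot, "link", name=name)
    for tag in ("visual", "collision"):
        geometry = ET.SubElement(ET.SubElement(link, tag), "geometry")
        ET.SubElement(geometry, "box", size=" ".join(f"{s:.3f}" for s in size))

def add_joint(robot, name, joint_type, child, z):
    """Attach a child link to the cabinet body at height z."""
    joint = ET.SubElement(robot, "joint", name=name, type=joint_type)
    ET.SubElement(joint, "parent", link="body")
    ET.SubElement(joint, "child", link=child)
    ET.SubElement(joint, "origin", xyz=f"0 0 {z:.3f}", rpy="0 0 0")
    return joint

def cabinet_urdf(spec, width=0.6, depth=0.4, slot=0.25):
    """Build URDF text from a hypothetical spec like {'shelves': 4, 'drawers': 2}."""
    robot = ET.Element("robot", name="cabinet")
    total = spec["drawers"] + spec["shelves"]
    # Thin back panel as the root link (placeholder geometry).
    add_box_link(robot, "body", (0.02, width, slot * total))
    # Drawers slide out along x on prismatic joints.
    for i in range(spec["drawers"]):
        add_box_link(robot, f"drawer_{i}", (depth, width, slot))
        joint = add_joint(robot, f"drawer_{i}_slide", "prismatic",
                          f"drawer_{i}", i * slot)
        ET.SubElement(joint, "axis", xyz="1 0 0")
        ET.SubElement(joint, "limit", lower="0", upper=f"{0.8 * depth:.3f}",
                      effort="50", velocity="0.5")
    # Shelves are rigidly mounted above the drawers.
    for i in range(spec["shelves"]):
        add_box_link(robot, f"shelf_{i}", (depth, width, 0.02))
        add_joint(robot, f"shelf_{i}_mount", "fixed", f"shelf_{i}",
                  (spec["drawers"] + i) * slot)
    return ET.tostring(robot, encoding="unicode")

urdf_text = cabinet_urdf({"shelves": 4, "drawers": 2})
```

The articulation lives entirely in the joint types, which is why a description at the level of "how many drawers, how many shelves" is already enough to get something a physics simulator can load.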

but now what you can do is you can train

a deep Network that goes the other way

around so for example the idea here is

you can download an internet image of a

kitchen and you can train a Transformer

model that then goes to generate a URDF of

that kitchen with all the drawers and

doors right so um and also you can do

this it works even much better of course

for individual assets like whatever

fridges and stuff like that but the

idea is now um that you can download

many many images and just convert them

into these assets that you can now feed

into your simulator and that means that

these simulation environments will have

far more variability and diversity than

whatever you can come up with manually right so

this is not perfectly done then of

course you can run your moment in these

environments and do uh training in

them so the next step is if we have

these randomized assets and scenes um uh

The Next Step would be generating tasks

and rewards so one way to do this is

kind of do it manually where you place

objects in them and then you might say

hey I want uh the objects to be in the

drawer or put or set the table and

things like that um another way is of

course what we're looking at and again

Katerina is doing some really cool work

in that domain is using llms to generate

tasks automatically so you can tell the

llm for example okay these are the objects

in front of the robot just suggest some

things a robot can do and it turns out

that they are surprisingly good at that

at the same time you have to have

techniques for filtering out all the

noise of course that they generate but

overall I think it's a very promising

Direction so now that we have these

scenes with assets and tasks um uh

currently what we're looking at is using

task and motion planning in order to

generate demonstrations for how to solve

these tasks so that's a whole area um

that I'm sure many of you are familiar

with is kind of especially for robot

manipulation so it's a planning

technique that works both at the let's

say abstract Action level planning but

also at the continuous and physics level

planning um so for example they they

reason about about um discrete state of

the environment for example if a door is

open or closed so if the robot is

holding an object or not they have

actions they have preconditions for

actions just the classical kind of

planning kind of precondition kind of

style effects of actions but then they

also reason about for example um if the

robot wants to pick up this can I find a

continuous robot Motion in order to move

the gripper there and can I place the

object at a different location and

things like that so there's various uh

very capable systems out there we use

the one from Caelan Garrett is called

PDDLStream and um uh the key trick here is

again these task and motion planning

systems they're not that great in the

real world yet because they require

access to the real world but the beauty

is we are in simulated environment so we

have access to everything we need and

that's why these TAMP systems are pretty

good let me give you just one example

here so for example in this case the

task is for the robot to put the the tea

kettle on the stove somewhere and hold

two teacups I know it's not the the most

exciting task but it's just illustrating

going okay so we just specify the high

level goal and then the planner does all

the motion generation and everything for

us okay and once we have that you can

imagine that if we could generate

thousands and thousands and thousands of

these things and generate the

demonstrations we can now render the

demonstrations with the kind of views that a

real robot would get right we're just

simulating for example the wrist camera

views we we get access to the state and everything

um and then use that as the training

yeah change we mention the cup is filled

with water here could you louder I

second yeah what the thing of the

manipulator change if we instead of just

saying pick up the two cups pick up the

two cups fill with

something oh if you if there's liquid in

there yeah so if there's liquid in the

cup oh well that depends whether your

task and motion planner has that model

right at this point so I guess uh you

can imagine actually llm might be able

to say oh be sure to hold it upright

they're actually capable of this kind of

reasoning right but in the task and

motion planning system so you need to

explicitly have these kind of things

modeled at this point
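
The discrete half of task and motion planning described above (symbolic state such as door open or closed, actions with preconditions and effects) can be sketched as a small breadth-first planner. This is not PDDLStream: the facts, action names, and the `motion_feasible` hook are all hypothetical, and the hook always returns True where a real system would call a motion planner to check that a continuous path exists.

```python
from collections import deque

# Each action: (name, preconditions, add effects, delete effects).
ACTIONS = [
    ("open_cabinet", {"cabinet_closed", "hand_empty"},
     {"cabinet_open"}, {"cabinet_closed"}),
    ("pick_kettle", {"kettle_in_cabinet", "cabinet_open", "hand_empty"},
     {"holding_kettle"}, {"kettle_in_cabinet", "hand_empty"}),
    ("place_kettle_on_stove", {"holding_kettle"},
     {"kettle_on_stove", "hand_empty"}, {"holding_kettle"}),
    ("close_cabinet", {"cabinet_open", "hand_empty"},
     {"cabinet_closed"}, {"cabinet_open"}),
]

def motion_feasible(action_name, state):
    """Placeholder for the continuous side of the problem. A real system
    would search for a collision-free arm motion here; this stub says yes."""
    return True

def plan(init, goal):
    """Breadth-first search over symbolic states; returns a shortest plan."""
    start = frozenset(init)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for name, pre, add, delete in ACTIONS:
            if pre <= state and motion_feasible(name, state):
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None

init = {"cabinet_closed", "hand_empty", "kettle_in_cabinet"}
goal = {"kettle_on_stove", "cabinet_closed"}
steps = plan(init, goal)
```

Only the high-level goal is specified, as in the tea kettle example. It also shows why the liquid question matters: a constraint like "keep the cup upright" has no effect unless someone adds it as a fact or a feasibility check.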

yeah one very quick H am I running out

of time soon um I just want to give a

quick plug in you've seen in the

previous animation that the motion of

the arm was kind of a little bit

convoluted right so um we Dev I don't

know whether people have heard of cuRobo

which is a tool we just uh developed uh

this Bala did this was an ICRA paper last

year but for example with the standard

motion generation planners you can see

here on the upper left they often

generate demonstrations even in

simulation that I kind of generate this

motion as you can see here that is kind

of very convoluted let's say right it's

not a very natural path the problem with

the sampling based paths is also that

they're very hard to learn from for

Behavior cloning because there's too

much variability in them here you can

see another example if you use this

bidirectional RRT planner for

example and what Bala developed with cuRobo

is a technique that is fully optimized for

GPU and it generates actually plans so

it it combines um RRT-based sampling

generation with optimization everything

so the package is on GitHub you should

you should download it it's um much faster

than than the alternative techniques

right now so it's faster it generates

much shorter paths as well better paths

and another key thing is it actually

works with Point cloud data right which

is one of the key limitations many of

these other techniques have let me just

give you one example here here's uh just

a demonstration of the the speed of the

motion generation so whenever it reaches

the previous goal Point indicated by the

planning okay so it's very fast in doing

that on the right hand side you can also

do then kind of this reactive motion

framework take and we're going to we're

integrating that of course with our

demonstration generation as well so that

the demonstration that you give for your

robot for the training uh uh becoming

more consistent and fast we can generate

them faster as

well and then finally once we generate

the data all we have to do is behavior

cloning and there are now as we know

there's just so many techniques out

there right um um diffusion different

kinds of diffusion policies uh we have

techniques that use Transformer 3D

Transformer kind of techniques you could

imagine techniques like Perceiver-Actor

um that some of you might have heard of

or RVT is a recent technique Katerina did

some nice work also on Act3D

diffusion policy learning but there's

many techniques out there so um but the

idea is once we have that data set we

can train them on the data set and

evaluate how well they do right here's

just one example this is work Murtaza is he

is he here oh no he said he might not be

able to but for Murtaza here he did this as

an during his internship with Nvidia

where we actually were able to set up

all these environments with full task

and motion planning okay so there's no

manual human teleoperation data

generation but it's all task and motion

planning and use that then for Behavior

cloning training of of a transformer

model and he's he got very good results

but now moving forward imagine doing

something like this with the kind of

data that we could generate in these

kitchen environments and at a much much

much larger scale okay that is kind of

where we are where we're moving right

now so now I'm coming to the very last

section so uh so now that we have all

this data let's say the question is how

can we reuse it or how should we connect

it for example to these large language

models right um uh because what the

language models really provide to us is

this open world kind of reasoning right

they they can provide at least some kind

of guide guidance in settings that we

haven't seen in any kind of robot

training data or things like that right

the question is how do we integrate our

robotics models with these large

language models right on the one hand

side is uh one way we could do this is

the one on the left which I would call

loose integration all right which means

we have our vision language models you

take your favorite whatever

GPT-4V and you just use this to give

guidance to the robot so it's becoming

your it's uh scene understanding and

planner and then we as roboticist we

train some skills for example pick up or

place and things like that and then the

interface between that is just that the

vision language model just calls the

right skill that is kind of what I

showed when I showed this ProgPrompt

work it's kind of at that level right

where we predefine exactly the kind of

skills that robot needs and uh

independently we're just using the VLM to

to work with those I think ultimately

that that's just going to be too brittle

and also um the notion is that it's it's

easy to come up with a small set of in

skills like pick place open close but

overall I think the the things we want

these robots to do are not always easy

to classify into these discrete set of

skills so another extreme is we just say

well we'll have a big Vision language

action model right that takes in so on

the left side it's let's say text and

images because these models are not

trained on robot data but on the right

hand side maybe we can use the robot

data to train end-to-end really full

vision language action model that takes

as input images any information about

the robot State history and things like

that and they literally output controls

right this is kind of the RT-2 kind of

model right where they actually really

output Delta action for the for the

robot manipulator
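
For a model like this to output controls through a language-model vocabulary, one common approach is to discretize each action dimension into bins and emit bin indices as tokens. The sketch below shows that idea only. The bin count, the normalized action range, and the function names are assumptions for illustration, not details taken from the talk.

```python
import numpy as np

NUM_BINS = 256          # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized range of each delta action

def actions_to_tokens(action):
    """Map continuous delta actions to integer bin indices in [0, NUM_BINS - 1]."""
    clipped = np.clip(action, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)  # now in [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def tokens_to_actions(tokens):
    """Map bin indices back to the center of each bin."""
    centers = (np.asarray(tokens) + 0.5) / NUM_BINS
    return LOW + centers * (HIGH - LOW)

rng = np.random.default_rng(0)
delta_action = rng.uniform(LOW, HIGH, size=7)  # e.g. 6-DoF delta pose plus gripper
tokens = actions_to_tokens(delta_action)
decoded = tokens_to_actions(tokens)
max_error = np.max(np.abs(decoded - delta_action))
```

The round-trip error is bounded by half a bin width, so the resolution of the controls is set directly by how many tokens are reserved for actions.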

um I must I don't know maybe right um it

seems from a robotics perspective we

often like kind of hierarchical kind of

approaches but um I'll go on the next

slide I want to talk about but that is

one approach and I think there's

something in the in between where we

actually not just using these Vision

language models but we're fine tuning

these vision language action models together

with robot data so that they become

better but the output is not necessarily

a discrete set of skills that we train

with our robots independently but the

output is some kind of tokenized

interface that we train and align with

robot skills that we learn so and then

what we have on the low level is maybe

more like a policy that could be a

diffusion policy or your favorite

Transformer based policy uh that then we

train for the robot to execute right so

it's kind of between these two extremes and

um I'll I'll leave it to you or maybe to

the discussion on which of those uh you

you is your favorite um so now why or

how could we even train these Vision

language action models right these not

in a sense like not just taking these

models as the vision and language people

give them to us but how can we improve

on them right um because the insight is

why I think there's actually hope is the following

so these llms we know they were trained

on these huge data sets and they are

yeah extremely capable for the kind of

tasks they were trained on um then there are

many vision models out there now that

were also trained to provide very robust

uh representation of image data right

and um examples you know DINO CLIP masked

autoencoder kind of models so the key

trick of these models is that they were

trained on very weak supervision that

you can generate on huge data sets and

the representations that they generate are

because of the size of the data set are

actually very very capable let's put it

this way right and they showed that you

can very quickly adapt them on

downstream tasks so the open-source

community if you now want to train

vision language models not just language

models or vision models is what they do

for example what the the LLaVA uh work

does is the following they have the

language model as the backbone because the

language model is kind of the reasoning

engine of all of that it's the one that

is trained on huge amounts of data right

and then they take the image model that

also provides actually very good very

capable representation and in a first

stage they used a set of training

data such that they just align the image

embedding with the language embedding

right so that your image tokens

and language tokens live in the same

embedding space or semantic embedding

space right and so what they do is again

they first align the image embedding

with the language embeddings and then

they have a data set on which they train

the whole system end to end using fine

tuning techniques right and fine tuning

you know there's various techniques for

fine tuning low rank adaptation and

things like that but the key trick is

that um this overall training requires

far less data than what went into these

individual modalities that were being

trained right
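
Low rank adaptation, mentioned above as one of the fine tuning techniques, helps explain why this stage needs far less data and compute: the pretrained weight stays frozen and only a small low-rank correction is trained. The sketch below shows the forward pass and the parameter count. The layer sizes, rank, and scaling value are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8.0  # assumed sizes for this sketch

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init

def lora_forward(x, W, A, B):
    """y = x W^T + (alpha / rank) * x A^T B^T, with W kept frozen."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
base_out = x @ W.T
adapted_out = lora_forward(x, W, A, B)

full_params = W.size            # parameters touched by full fine tuning
lora_params = A.size + B.size   # parameters touched by the low-rank update
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained one exactly, and training only has to learn the correction.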

trained right and there's now community that is looking at these what's called

that is looking at these what's called multimodal large language models this is

multimodal large language models this is taken from a recent survey if you want

taken from a recent survey if you want to so it's it's a very nice paper that

to so it's it's a very nice paper that summarizes kind of what's going on and

summarizes kind of what's going on and this is just expanding on that idea

this is just expanding on that idea where we take the input are these

where we take the input are these individual modalities like images video

individual modalities like images video audio and then uh we first project them

audio and then uh we first project them into a language space right just like

into a language space right just like what the lava model is doing uh and you

what the lava model is doing uh and you can use many different kind of backbones

can use many different kind of backbones for that it doesn't have to be one

for that it doesn't have to be one specific one and then on the output side

specific one and then on the output side um they also then train it to Output

um they also then train it to Output text like for images for example but now

text like for images for example but now the most recent work is also moving into

the most recent work is also moving into even on the output we're connecting it

even on the output we're connecting it to diffusion kind of models so that on

to diffusion kind of models so that on the output is not just text but you can

the output is not just text but you can generate videos on the output side and

generate videos on the output side and they do that through a similar alignment

they do that through a similar alignment process right so if we now come from a

So if we now come from a robotics perspective: if we add, let's say, robotics data to this kind of mix, then maybe there's hope that we can learn capable models without needing the kind of Internet-scale data that the language people have at their disposal. That's why I'm hopeful that we can make some real progress in this domain.

And with that I want to summarize briefly, in the interest of time. We've seen huge progress, but we're still not there yet. There are various ways in which simulation can help, and I'll leave the details to the discussion. Another aspect that we really have to look into is benchmarking, because right now a lot of the work is showing videos rather than really showing the capabilities of these models. I think simulation can make quite some contributions to that. And going back to RobotGPT: I think we're still not there yet, clearly still a way to go. But generating large demonstration data in simulation, combining that with real demonstrations, and then mixing this with all these other existing models to train end-to-end systems is, from my perspective, a very promising way to go.
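One simple way to picture "combining simulated and real demonstrations" is a batch sampler with a fixed mixing ratio. The sketch below is an assumption made for illustration, not the speaker's training recipe: `mixed_batch` and `sim_fraction` are hypothetical names, and the demonstrations are placeholder tuples rather than real trajectories.

```python
import random

def mixed_batch(sim_demos, real_demos, batch_size, sim_fraction, rng):
    """Draw one training batch with a fixed share of simulated demonstrations."""
    n_sim = round(batch_size * sim_fraction)
    n_real = batch_size - n_sim
    # Sample with replacement so a small real dataset can still fill its share.
    batch = rng.choices(sim_demos, k=n_sim) + rng.choices(real_demos, k=n_real)
    rng.shuffle(batch)
    return batch

# Placeholder data: many simulated demonstrations, few real ones.
sim_demos = [("sim", i) for i in range(1000)]
real_demos = [("real", i) for i in range(20)]

rng = random.Random(0)
batch = mixed_batch(sim_demos, real_demos, batch_size=8, sim_fraction=0.75, rng=rng)
n_sim_in_batch = sum(1 for source, _ in batch if source == "sim")
print(n_sim_in_batch, len(batch) - n_sim_in_batch)  # 6 2
```

The value of `sim_fraction` is exactly the kind of open question raised in the talk; the 0.75 used here is arbitrary.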

And there are many, many questions, of course, on how specifically to do this. How do we get the geometric reasoning capabilities to be better? What kind of data should we generate? How much do we need? I don't think anybody really has the answer to that question. And how does the integration of language and robotic action work? And of course, in the end, it's clear that we will need real-world fine-tuning and learning, and reinforcement learning, for these robots to be really capable in the real world. So this pure behavior cloning is not going to be the end of the story.

And with that I'd like to thank you for your attention, and of course all the people that were behind this work at NVIDIA and UW, and all the interns. Thank you.

[Applause]

I went over. We are slightly over time; let's take a couple of questions, and those who want to leave can.

Thank you for the great talk. I think you showed exciting solutions from sim [to real]. What about coming in the other direction, like real to sim: how do we close the loop, like, [use] some real data to improve a simulator?

Yeah, that's a great question. Right now we see a lot of work on training models for automatic 3D asset generation. My experience is that many of the assets being generated are not quite good enough for physics simulation. They often look very good, and it's very exciting, but I don't think they're quite good enough for physics sim. That's why, for example, we looked at this work with URDFormer, where we're explicitly generating URDFs and very simple shape models, but at least ones that work with the physics simulation.

And there's another interesting aspect, with respect to benchmarking, that is related to this question. If you want to use simulation, for example, to benchmark capabilities, how do you make sure that the insights you're gaining from the simulation benchmarking are insights that transfer to the real world? We can [talk] about some details of that, but there are some really interesting questions related to that as well.

Thanks. I'm just curious: when you mention the end-to-end vision-language-action model, the vision-language model is actually Transformer-based and the [...] robot model is a [diffusion] model. How do you combine the two models together?

Yeah, so the robot part doesn't have to be diffusion. You can imagine various ways of doing this. One is that you could discretize your output action space when you're using a Transformer. And of course you can also use a diffusion process on the output side of your trained Transformer, right? So the fact that it's a Transformer model doesn't mean that it can't also be a diffusion process that you use on the output side.
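The "discretize your output action space" option mentioned above can be sketched as uniform binning: each continuous action dimension becomes an integer token that a Transformer can predict like a word. This is a minimal sketch under assumptions; the function names, the range [-1, 1], and the choice of 256 bins are illustrative and not taken from the talk.

```python
def action_to_tokens(action, low, high, n_bins):
    """Map each continuous action value to an integer bin index in [0, n_bins - 1]."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)  # clip to the valid range
        idx = int((a - low) / (high - low) * n_bins)
        tokens.append(min(idx, n_bins - 1))  # the upper edge falls into the last bin
    return tokens

def tokens_to_action(tokens, low, high, n_bins):
    """Map bin indices back to continuous values at the bin centers."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]

# Hypothetical 4-dimensional action, normalized to [-1, 1].
action = [-1.0, 0.0, 0.5, 1.0]
tokens = action_to_tokens(action, low=-1.0, high=1.0, n_bins=256)
recovered = tokens_to_action(tokens, low=-1.0, high=1.0, n_bins=256)
print(tokens)  # [0, 128, 192, 255]
```

The round trip loses at most half a bin width per dimension, which is the price of treating control as token prediction; a diffusion head on the output side avoids this quantization but needs an iterative sampling step.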

So you're referring to this slide, right? And even this simple step, like outputting control, is a really tricky question. At what level do you do control? At what frequency do you do control? Is it in the robot frame of reference? Is it in the camera frame of reference? For all these kinds of questions, I think there are hints where people say what works better than other approaches, but there's still a lot of work to be done, and for that we need to thoroughly investigate these questions.
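To make the frame-of-reference question concrete: the same commanded end-effector motion has different coordinates depending on whether it is expressed in the camera frame or the robot base frame. The sketch below assumes a known rotation between the two frames; the 90-degree rotation about the z-axis and all variable names are made up for the example.

```python
import math

def rotate(R, v):
    """Apply a 3x3 rotation matrix R to a 3-vector v."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(R):
    """Transpose a 3x3 matrix; for a rotation this is also its inverse."""
    return [list(col) for col in zip(*R)]

# Hypothetical extrinsics: the camera frame is the base frame rotated
# by 90 degrees about the base z-axis.
theta = math.pi / 2
R_base_cam = [
    [math.cos(theta), -math.sin(theta), 0.0],
    [math.sin(theta),  math.cos(theta), 0.0],
    [0.0,              0.0,             1.0],
]

# A policy that outputs "move 10 cm along x" in the camera frame...
delta_cam = [0.1, 0.0, 0.0]
# ...is asking for a move along y in the robot base frame.
delta_base = rotate(R_base_cam, delta_cam)
print([round(x, 6) for x in delta_base])

# Going back from base frame to camera frame uses the inverse rotation.
delta_cam_again = rotate(transpose(R_base_cam), delta_base)
```

A policy trained on camera-frame actions therefore depends on the camera extrinsics at execution time, while a base-frame policy does not; which choice learns better is one of the open questions raised above.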

Which kind of microchip do you think is [best] for edge computing? Which kind of GPU or...

Oh, of course... No, but, I mean, of course, for a lot of this perception-driven kind of robotics you will need some kind of computer on the robot for reactivity, but a lot of the compute you can maybe offload, either to the cloud or to some desktop compute. And I'll leave it to others to judge at this point what...

I think, last question.

To kind of piggyback on this question about the control stuff: you alluded to the control signals, you know, could be extremely diverse, right? You can have them at different levels of granularity and frequency. And I mean, maybe this is kind of my bias as a human: it's hard for me to interpret control signals visually, and easy for me to interpret images. But it kind of appears to me that control signals are a much less smooth kind of signal to learn than images and language. So I guess the question is: do you expect, or maybe have you seen any evidence or any conjecture, whether these things that are so super successful in vision and language will ever scale, at the same level of data, to that, like, you know, multi-hierarchy, completely abstract, multi-embodiment...

Yeah, that gets back to this notion of the things that are so easy to us, right, like this spatial, geometric reasoning. I think at this point that leads to one of the questions I had: how can we get these vision-language models to become much better at geometric reasoning, 3D kind of reasoning? Because one of the problems, I guess, is that the language data doesn't provide the details, and neither does a lot of the vision data. Now if you think about models like Sora that can generate scenes, there's something 3D in there, right, but it's not quite the quality that we need for these robots.

At this point I think maybe we can actually get pretty far, but we will need, I think, a lot of data. But to train these 3D capabilities, that is where I think simulation can really shine, because we can generate a lot of this variability. So we did, for example, one line of work, MπNets, where we trained a policy for this motion [trajectory] generation for a manipulator, right? Like, you have a cabinet or something and some obstacles, and you just want to reach a point without colliding, and you do planning in this joint space, and I showed these cuRobo techniques. So what we did there is, for example, we generated a data set with, I don't know, let's say a million examples of reaching, and then we were able to train a network that can actually compile that into a policy that just takes in the point cloud and a goal, and it can move the arm there pretty smoothly, and also in a way that is not just local, right? It learned some global kind of collision-avoidance notion. And I think that's similar for us humans: for us it's really easy. But I would actually imagine that many of the models that we're seeing right now, for example the RT models and things like that, would not be able to solve that task, because they're mostly doing just free-space reaching, right? Even the simple thing of reaching around obstacles and stuff like that is something that the current models don't have. But I think again, with sufficient data, I would hope it might come out of it. It seems so easy, right?

so I question [Applause]

https://www.youtube.com/watch?v=UF8uR6Z6KLc
