The core theme of this content is the exploration of how generative AI, particularly large language models (LLMs) and advanced simulation techniques, can be leveraged to significantly advance robot manipulation capabilities, bridging the gap between current limitations and the ambitious goal of "robot GPT."
we start right on time because the room
has already been filled
before um it's a great privilege to have
Dieter Fox here and uh Dieter is
a household name in robotics I think
probably most of us have uh read or have
have his book with us he has done some
of the foundational work in robotics
I'll not go into what all he has done uh
think probably these are the foundations
of like robotics 101 that we read in
every robotics class and of course Dieter has
won all the awards I see the list but I
don't feel like reading it out because it
doesn't leave out any award or any kind
of fellowship IEEE fellow ACM fellow etc
but maybe a bit of trivia about Dieter
so uh despite like uh being one of the
supreme figures in robotics Dieter is very
down to earth I met him as a PhD
student in a conference and he was just
hanging out I was like oh this seems
like Fox I've only seen in photos and I
spent there and he was gracious enough
to give 30 minutes of his time in a busy
conference so I highly appreciate
everyone who see around in a conference
him he's he's too nice to decline
drink I and I don't think that's changed
and so with that I hand it over to
Dieter to see what we have today thank you thank
you thank you so much
and thanks for the generous introduction
it's always so great to be back I I did
my uh postdoc here 25 years ago and I can
imagine some of you weren't even maybe
born yet oh my gosh that's very scary
but uh I keep on coming back and it's
always exciting to see all the new work
going on uh see the new faces and also
this morning all the new buildings being
built it's a very exciting time for
CMU and um I can imagine many of you
came maybe just because of the title
because where is robot GPT that seems to be
a question that we as roboticists are
getting asked like continuously right
now right where everybody sees this
dramatic progress in language models in
Vision models and uh we roboticists are
still having trouble kind of picking up
whatever a water bottle or something
like that um so and uh I keep getting
this question pretty often as well so
just like many of you so I've been
thinking a little bit about this and
today I want to kind of give um a view
of some of my thoughts on that topic but
also um specifically for example we've
been doing a lot at Nvidia on simulation
and I want to um focus a little bit on
what kind of role I think simulation can
play in this in this whole Space um so
as as many of you might know I'm I'm
kind of sharing my role I'm half-time at
the University of Washington in Seattle
faculty and I'm also leading the
robotics research team at Nvidia uh most
mostly located in Seattle so the work
I'm presenting is kind of a mix of both
of these works and a lot of it is also
collaborative between the two Labs but
of course um most of it has also been
done uh with great interns that we had
over the
years so let me start with um kind of
the very high level connection uh
between let's say robot manipulation and
uh the high level topic of generative
AI I think we all agree that we're
seeing really dramatic progress in the
capabilities of I'll just I just call it
I'm going to be very sloppy with the
wordings here like I'll just call it all
generative AI which are all these kind
of models that we're seeing for example
large language models think about ChatGPT
GPT-4 um vision language models that uh
really have a what you might call almost
a deep understanding of of images um but
also purely generative models
for example diffusion based
models that generate videos images and
um I think most of you I would assume
share the view that this progress has
been uh really virtually impossible to
predict right like even people in these
areas when they saw these first um
especially starting with ChatGPT the
progress um has been dramatic
right and people just thought that
can't be and by now actually as a
roboticist we always used to say well
these models they they're not really
relevant to us because they don't do 3D
and they don't do true planning and
reasoning and we all know there are
limitations to what they can do but uh
the the more I see this progress the
more I'm wondering are we maybe just
missing something on our end as well
right like could it be that if we can
generate some kind of data that these
models are being trained on but um more
relevant to robotics um I personally
don't feel like I can predict what the
limit of these models will be as we're
moving on and so now if you think about
what really made these models so capable
I think it's a combination of on the one
hand um especially the LLMs
they have just huge amounts of training
data right like all the text on the
internet trillions of tokens and words
is the training data and then in a sense
um if if you have so much data in your
training set then most of these tasks
that we throw at them that we call open
world reasoning are kind of in
distribution of your data set so they
don't even need to truly go beyond the
kind of things they've seen in the data
right so um I think that's one of the
the key contributors here is having uh
that data set and this large amount of data
available at the same time of course
the model architectures like especially
very large Transformer models with now
trillions of parameters or even the open
source models have hundreds of billions
of parameters um also are just capable
of learning the more data you give them
the better they get right and um that is
also something that really came up with
this with this uh data scaling and then
capability scaling so with this much
data there is also not so
much overfitting actually
going on um and then actually from a
roboticist's perspective it seems like the
training objective for many of these
models is just very simple right it's
kind of behavior cloning which means for
the language models you put in text a
sequence of text and then it just spits
out the next word right so in robotics
we do something actually very similar
when we do behavior cloning or imitation
learning where you get demonstrations of
sequences of sensor data or state
information and controls and then what
the model just learns is based on a
history of those what should be the next
control that the robot should generate
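
The analogy described above (a history of states in, the next control out) can be sketched as plain supervised regression. This is only a toy illustration, not the models discussed in the talk: the linear policy, the 1-D expert and the names `make_windows`, `fit_bc_policy` and `act` are all assumptions made for the example.

```python
import numpy as np

def make_windows(states, controls, history=3):
    """Pair each window of `history` states with the control issued at its last state."""
    X, y = [], []
    for t in range(history, len(states) + 1):
        X.append(np.concatenate(states[t - history:t]))
        y.append(controls[t - 1])
    return np.array(X), np.array(y)

def fit_bc_policy(X, y):
    """Least-squares fit of a linear policy: the 'predict the next control' objective."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return W

def act(W, window):
    """Predict the next control from a flattened history window."""
    return np.append(window, 1.0) @ W

# Toy demonstration data (hypothetical expert): steer a 1-D state toward zero.
rng = np.random.default_rng(0)
states, controls = [], []
s = np.array([1.0])
for _ in range(50):
    u = -0.5 * s                      # expert control
    states.append(s.copy())
    controls.append(u.copy())
    s = s + u + 0.01 * rng.normal(size=1)

X, y = make_windows(states, controls, history=3)
W = fit_bc_policy(X, y)
print(act(W, np.array([0.4, 0.2, 0.1])))  # control predicted for an unseen history
```

A language model is trained with the same recipe, with tokens in place of states and the next token in place of the control.
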
so it seems like very nicely matched to
supervised learning also that we might
want to do in
robotics now uh so far I think it's fair
to say that we really haven't seen the
kind of abilities in the robotics domain
especially also as it comes to I'm going
to focus today on robot manipulation and
uh here being at CMU of course it's
Moravec he saw it all coming right he
always said that um this paradox that
the things that we always feel like are
difficult um for computers
uh such as playing chess and go like the
Holy Grail of AI these are the really
difficult Parts but it turns out that
those are the relatively easy parts for
computers what's really hard is the
stuff that we humans don't even think
about like uh physical interaction in
the world right like picking up objects
moving around the things that we do
subconsciously uh are the the kind of
things that turn out to be really
difficult also for computers that
control uh these robots and I think of
course in addition to the fact that
robotics is just difficult which is kind
of an excuse we can always use uh uh I
think the other one is really we just
don't have the data yet that um people
had access to for their language and
vision models so that is from my
perspective the the big open question
right um so my hypothesis for today is I
would claim and I'm not 100% fully
behind all of this I I'll I'll I'll tell
you a bit more about it but if we could
generate really really large scale data
sets right that are suitable for
Behavior cloning for robotics kind of
like the language the text data sets for
language models then if we combine that
with these really capable gen AI models
be it diffusion models transformer
models then I think we can really try to
teach robots these broadly applicable
manipulation skills and doing so just by
let's say Behavior cloning okay the
caveat here is I'm not trying to claim
that that will be the end of Robotics
and everything is going to be solved
right or anything like General AI but I
think if we could do do something like
that at least the robots that are
trained that way would be at a
capability level that is way beyond the
capability level that we're seeing right
now I think right now we're still very
limited and even with behavior cloning I
would hope that we can get them to a
level up such that then follow-up
techniques like reinforcement learning
and things like that can actually work
well so the big open question of course
is for this is where are we going to get
that kind of data from right so it has
to be kind of temporal data it has to be
uh data annotated with actions and
things like
that so uh we we're all seeing uh many
many videos these days also on success
stories where people use Behavior
cloning for manipulation already um
and the underlying technique
is a lot of kind of using diffusion
policies um and these techniques um
are often trained on human
teleoperation where they show the robot what
to do and they give it multiple
demonstrations and then the robot
autonomously replays those
demonstrations right um uh and the
capabilities are uh very very exciting
what you can see sometimes right what
these robots can do there tasks related
to cooking and things like that I think
one key limitation that we're seeing
actually at this point with all these
demonstrations is if you think about the
vast space of things that we want a
robot to do all of these are kind
of teeny tiny point solutions in
that space right which means um they are
all these demonstrations and different
tasks they are isolated kind of
demonstrations and we're not seeing the
kind of interpolation or cross task um
uh generalizability that we would
actually like to see and that we need to
see in order to really see the robot
capabilities of the Next
Generation right so many of these
demonstrations um if you look Beyond
just what's shown in the videos but you
look a bit more at the details then often
it turns out that for example whatever
if the robot picks up an object um it
looks really impressive in the video but
it turns out that that object is not
allowed to move at all which means if
you move the object that robot is not
going to be able to pick it up anymore
and for humans that is totally
counterintuitive right if you think if I
can pick up this object and it moves 10
centimeters to the left I should be able to
pick it up too but many of these
Behavior cloning kind of techniques at
this point the way they are being
trained on relatively small data sets um
don't really generalize very well uh
there's other examples for example where
uh if the table height in your test
environment is different than the table
height in your training environment then
your robot might not be able to pick up
the objects anymore so there's still
clearly limitations and um I think these
limitations are mainly also just because
of the the the scarcity of training data
that we that we have so now what are
different kind of ways that we can use
to generate data for Behavior cloning on
the one hand side
uh one exciting directions I actually
think is using videos observing humans
doing these tasks uh like learning from
YouTube videos or like ego 4D kind of
data sets where um we also show how
humans perform tasks in the real world
um using also egocentric video for
example and there's a lot of exciting
work going on where then for example you
can track the human hands and you can
use that as guidance
either as a reward function or as a
high level policy so that your robot can
replicate that at this point my sense is
that the gap between those videos and
what a real robot would do in the real
world is still a bit too large to
actually succeed but but I I think
that's actually a very promising
Direction uh the other example is what I
also showed on the previous slide is
real world demonstrations I'm sure all
of you have heard about like the Google
arm farm and uh the data collected went
for example into the RT-2 model training
um more recently there have been cross-institution
attempts at combining different data
sets collected at different institutions
like Open X-Embodiment and then uh the
DROID data set is a very recent one uh I
think these are very good for
pre-training at this point but my
experience is that for example with the
uh the Open X-Embodiment is that the
data sets are still too separate from
each other and it's really hard to kind
of learn skills that go across these
different data sets and things like that
so there's still a bit of a way to go
the DROID data set is more focused on a
single platform and a very controlled
setup and there's a higher chance of
actually combining all this data into a
single model in a meaningful way but I
think overall um all these data sets
still if you look at individual skills
they tend to overfit to very specific
tasks so um the question is how far can
we scale that right if we look at uh a lot
of the humanoid companies right now that
get like hundreds of millions of
dollars of funding they will actually
spend a lot of that money I assume also
on generating teleoperated training data so
um I think that'll be very exciting and
and it's from my perspective it's a
truly open question of how far they can
can go with this um one question of
course for the academic research
Community is how are we going to get our
hands on that data set right because of
course it's not necessarily in the
interest of these companies to share it
with everybody um another alternative to
this kind of very expensive kind of data
generation is simulation various
different environments various different
skills um the advantage of simulation
data environment is of course once it's
set up and everything it's it's
relatively cheap and you don't have to
be a robotics expert to to run these
experiments uh one key problem with
these data sets of course is the sim to
real gap that for many of them the physics
doesn't really work quite as well as it
should so that what you learn in sim for
example transfers to the real world and
another open question is really asset
generation like how do we populate these
simulation environments with the
right kind of assets and
tasks and uh I will talk a bit more
especially about this Sim to real
Gap so in the next section I want to um
give you some examples of the work that
we've done especially also at Nvidia on
kind of training manipulation
capabilities in simulation so that they
transfer to the real world the simple thing you can do is for
example you want to teach a robot how to
grasp objects right let me just give
you this one example here where you have
a 3D model of the object and you want to
generate a data set a labeled data set
that says which are good ways for
picking up this object right let's say the
phone here for example um what you can
do is you can do very simple rejection
sampling on this which means you have a
3D asset of the object and you just
randomly guess possible ways for how the
gripper could be relative to it and then
you feed that into your physics simulator
and tell the physics simulator okay
close the gripper and move up and
down and for some of the grasps that were
good random guesses it's going to work
but for many of the grasps it doesn't work
and what you're going to do is you're
just going to retain the ones for which
the object remained in the gripper right
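
The rejection-sampling loop just described can be sketched in a few lines. This is a minimal sketch under stated assumptions: `simulate_grasp` is a hypothetical stand-in for the physics simulator (its thresholds are made up), and the pose is reduced to an offset and an angle instead of a full 6-DoF gripper pose.

```python
import random

def sample_gripper_pose(rng):
    """Randomly guess a gripper pose relative to the object (toy 2-parameter version)."""
    return {
        "offset": rng.uniform(-0.10, 0.10),  # metres from the object centre
        "angle": rng.uniform(-90.0, 90.0),   # degrees from the object's main axis
    }

def simulate_grasp(pose):
    """Hypothetical stand-in for the physics simulator: close the gripper, move up
    and down, and report whether the object stayed in the gripper."""
    return abs(pose["offset"]) < 0.03 and abs(pose["angle"]) < 30.0

def rejection_sample_grasps(num_samples, seed=0):
    """Keep only the randomly guessed poses for which the simulated grasp held."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_samples):
        pose = sample_gripper_pose(rng)
        if simulate_grasp(pose):
            kept.append(pose)
    return kept

grasps = rejection_sample_grasps(1000)
print(f"kept {len(grasps)} of 1000 sampled grasps")
```

Each retained pose becomes one positive label for this object and gripper combination; in the real pipeline the many trials would run in parallel inside the simulator.
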
so that's actually a very simple way of
uh generating now for this object and
this kind of gripper a data set that
contains all the kind of promising grasps
for picking this object up okay uh you
can do this for different combinations
of grippers and objects and here we did
this for example for a subset of the
ShapeNet data set uh where we have
almost 9,000 objects and we ran them all
through this parallel physics simulation
in order to figure out the good grasps
and now what we have is we have a data
set of annotated objects with grasps
right um the next thing we can do now is
and I'm not going to go into any of of
the technical details but we had a a
line of work where we can then use that
data set in order to train a deep
network initially it was
like a variational autoencoder um the
most recent work here is M2T2 as it's
called here which is like a transformer model
but the idea is that you can now in
simulation you can just generate random
scenes let's say tabletop scenes or
scenes with drawers you can put these
objects in the scene
and you can label also the scene with
the grasps that were successful on these
objects and then what
you can do is you have this scene
and you can render it as a point Cloud
for example okay so in simulation you're
not using the shape models themselves
but you render it as a point Cloud
because that is what the robot will
observe in the real world and then you
train a deep network that takes as
input a point cloud and the output of
the network is oh what are all the
possible ways in in which I could grasp
these objects in the scene okay like you
can see this Illustrated here on the on
the left side down there and in addition
in this specific case you could also
train it to say if the robot has an
object in a certain configuration in the
hand what are the possible ways in which
it could place it in the scene okay and
it turns out that that works uh
surprisingly well especially on point
clouds uh Point clouds transfer pretty
well from SIM to real one other aspect
that we might want to look at then is
for example um object segmentation and
and this is work that we started kind of
in parallel of course nowadays you might
use SAM Segment Anything for this which
was trained on a huge amount of real world
data but here just want to show that we
can also do actually capable
segmentation training purely in
simulation so this is just some example
where we where we randomly generate
scenes with objects and the objects can
come from different data sets and then
render them and the nice thing about
simulation is that you can get this
segmentation and everything for free
right we also uh in this line of work
that's called object Seeker um we can
train a network that then says I give
you an image of an object and you should
detect the segment in the scene that
corresponds to that
image okay and uh that is purely trained
in simulation and let me just show you
an example for how that then works so
here is on the upper left you see that's
kind of the query view where you say
okay I have a picture of that
pot uh on the next image up there
in the middle you see that's the view
from an external camera and uh the upper
right is the view from the gripper
camera on the robot which is just up
here you can see that that's the scene
and so the idea is that what the model
takes as input is that query view
and the image itself and then it
uses the query view to segment out the
object that this corresponds to in this
case it's a green mask that we can
automatically then generate
and then we feed that together with this
grasp Network that was used to generate
this grasp and then the robot can pick
grasps and also in this case even let's
say the there was a network that was
trained um to do Collision checking
because uh Collision checking in some
settings for example if you have
occlusions and things like that using
just the point Cloud might not be as
robust as for example training a deep
Network to do Collision checking for you
okay here's
now here's another scene so again here
in this case we just give it the image
of the fruit
snacks you can automatically segment it
out and then generate the grasps for
that all the components everything is trained in
simulation and the nice thing is let's
say compared to some of the
earlier work I was hinting at is
that of course in these simulations you
can nicely randomize environmental
parameters right like the size of
drawers or the height of countertops and
things like that so the system then
becomes robust to that as
well um wait there was one y and then
once you can pick up this object this is
work with the last here it's called
ProgPrompt where then uh that's one way now
to connect these kind of let's say
low-level capabilities with this vision
language model this was with uh GPT
where the idea is we have the language
we want the language model to generate
python code that the robot can execute
right and here's just one example where
the input to the model for the prompting
is we first give it an example of what
the code structure should look like for
example from actions import so we can
say these are the actions the robot can
actually execute
um then we have a function for example
that says throw away
banana and also importantly the object
so we tell the code or the the the llm
what are the objects that are accessible
right now in the scene and then we want
to use that to define a function and
then in this context for example if you
look at the at the left scene the key
question is always how do we enable the
robot to do this what's called open
world reasoning right that's the key
reason why we use LLMs because they have
this capability to reason about things
that are
not uh pre-trained on a classified set
of objects right so for example you
might then say sort the fruits on the
plate and the bottles in the Box um we
can now take that
sentence generate a function that we
would like to have specified in Python
then we run a vision model
to detect the objects in the scene so
this is now also input to the
LLM and then we tell the LLM kind of
okay now give us the individual steps of
that right and then uh we can just
execute those on the real robot okay I'm
not making claims here that this is
doing really complicated planning or
things like that but it's kind of what
you might consider a very loose
connection between VLMs and robotics
where you have very specific robotic
skills and the VLM then can call these
right and of course we all know all the
limitations to this kind of work still
um where it's about hallucination and
things like that but it's just an
example of how you can combine these kinds of components
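
The prompt structure described in this part of the talk (an import line listing the available actions, the objects in the scene, an example function, then the header of the new function for the model to complete) might be assembled roughly as follows. This is a sketch only: the action names `grab`, `put_on` and `put_in`, the object list and the helper `build_prompt` are hypothetical, and no language model is actually called here.

```python
def build_prompt(actions, objects, example, task_name):
    """Assemble a code-style prompt that a language model is asked to complete."""
    lines = [
        f"from actions import {', '.join(actions)}",  # what the robot can execute
        "",
        f"objects = {objects!r}",                     # what a vision model detected
        "",
        example.strip(),                              # one worked example function
        "",
        f"def {task_name}():",                        # the model continues from here
    ]
    return "\n".join(lines)

# Hypothetical example task, in the spirit of "throw away banana".
EXAMPLE = '''
def throw_away_banana():
    grab("banana")
    put_in("banana", "garbage can")
'''

prompt = build_prompt(
    actions=["grab", "put_on", "put_in"],
    objects=["apple", "banana", "bottle", "plate", "box"],
    example=EXAMPLE,
    task_name="sort_fruits_on_plate_and_bottles_in_box",
)
print(prompt)
```

Whatever body the model generates for the last function would then be executed step by step against the robot's low-level skills.
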
now all right so um so far the
manipulation setup was mostly uh kind of
object pick and place so relatively
simple from a physics simulation
perspective another area that we've been
looking at is what we call
contact-rich industrial tasks okay um it turns
out that uh most of these hard contact-rich
tasks in Industry are still being done
by humans because robots are just not
necessarily for example flexible enough
to to do these or um do them especially
when the environment changes and things
like that and NIST for example this is not
the most recent version but they for
example came up with a taskboard to
Benchmark these capabilities like you
can see here are these different assets
and they need to be inserted into
something um and it turns out that all
all these tasks have been done actually
typically on real robots and um even The
Benchmark environment was like a real
physical Benchmark and the problem was
that it's really hard to actually
simulate that I must admit I thought
come on that can't be hard right it's
like really well specified objects you
have good CAD models for them how hard
can it be to simulate sticking a peg
in a hole or something like that it turns
out it's not trivial to get it uh to
work well um so for example when we did
this work the first one was called Factory
at the time the state of the art was
something like this where they were able
to simulate threading a a nut onto a
bolt okay but the problem was first of
all the the margins were not quite the
margins that we would see in real
physical nut bolt setups and also the
simulation was 350 times slower than
real world which from a learning
perspective sometimes defeats the purpose
of simulation because the reason to do
simulation or one reason to do simulation is
just that you can do very fast
simulation faster than in the real world
right so um but that was kind of the
state of the art and then we worked at
Nvidia with uh with people also from the
physics team and um they did some magic
uh tricks on that because they hadn't
considered that use case beforehand so
we said well that's actually important
for these industrial task so they then
came up with a simulation that can now
do of course uh doing some magic on on
the GPU and some optimizations so now
they were able to actually simulate a
thousand of them in real time in
parallel so beforehand it was a single
350 times slower and now we can do a thousand
in parallel right the nice thing is once
you can do that um you can start doing
learning in this context right
because you can do reinforcement
learning for that and then you can for
example start training a policy for
doing this task in
simulation okay um here's just one
example that is actually the most recent
work in this line where now in the
AutoMate work um Yash and his team and
Bingjie she's an intern with us
working on that they defined a
hundred of these insertion tasks in
simulation and then uh typically what
we do is we train using PPO
individual um policies for the
different tasks some of them are kind of
threading some of them are insertion
tasks um and the policies first are
trained typically let's say for each
individual task you have a different
what we call state-based policy which
means you give it access to the internal
state of the simulator the exact pose of
objects and things like that and then we
distill that using it as training
data into a policy that
operates for example from point cloud
data and and I'm not talking necessarily
about Point Cloud observations but Point
Cloud representations of the assets um
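
The teacher-to-student distillation step described here can be illustrated with a deliberately small example. Everything in it is an assumption made for illustration: the linear "teacher", the random 64-point asset, the centroid feature and the least-squares student stand in for the RL-trained state-based policies and learned point-cloud encoders mentioned in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical asset: 64 points sampled around the origin, then centred.
ASSET = rng.normal(scale=0.02, size=(64, 3))
ASSET -= ASSET.mean(axis=0)

def teacher_policy(state):
    """Stand-in for a state-based RL policy with access to the exact object position."""
    return -0.5 * state

def asset_cloud_at(position, noise=0.005):
    """Asset point cloud placed at a noisily estimated object position."""
    return ASSET + position + noise * rng.normal(size=3)

def featurize(cloud):
    """Hand-written feature (centroid plus bias); a real student learns its encoder."""
    return np.append(cloud.mean(axis=0), 1.0)

# 1) Query the teacher with privileged simulator state to collect action labels.
states = rng.uniform(-1.0, 1.0, size=(500, 3))
targets = np.array([teacher_policy(s) for s in states])

# 2) Record what the student would receive as input in the same situations.
features = np.array([featurize(asset_cloud_at(s)) for s in states])

# 3) Distill: supervised regression of the student's output onto the teacher's actions.
W, *_ = np.linalg.lstsq(features, targets, rcond=None)

def student_policy(cloud):
    """Student acts from the point-cloud representation only."""
    return featurize(cloud) @ W

test_state = np.array([0.3, -0.2, 0.5])
print(teacher_policy(test_state), student_policy(asset_cloud_at(test_state)))
```

The student never sees the simulator state at execution time, only the point-cloud representation, which is what makes it usable outside the simulator.
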
and then we're getting on on these
policies on the individual policies for
example we're getting very nice success
rate both in simulation and in the real
world so on the right side is what these
policies now do in the real world we do
zero shot sim-to-real transfer on those so
it's actually working um very well now I
think uh one thing is and and then
instead of just doing these individual
policies of course nowadays you can then
use the data to try to distill it into a
generalist policy that can do it
independent of the individual assets and
it's not quite where I think we would
like it to be ultimately so in this case
for example we were able to train a
single policy on 20 different tasks
different assets and the success rate if
you look at the numbers here
surprisingly in Sim actually drops by
10% but the real world doesn't even drop
that that far but still um what you
would like to see is of course that you
can generate enough tasks and then um
distill this into a single policy so
that you get really cross task uh uh
benefits right so that for example you
get a policy overall that's better than
any of these individual pre-trained
policies and also ultimately of course
you want to have a policy that can work
on unknown assets as well we we're just
getting there right now and starting to
look into that okay I have a question
yeah yeah yeah so the
first question is did you randomize the
physics parameters and the second is is
the policy you distill to a recurrent
policy or it's a yeah it's a recurrent
policy
and uh I'm
pretty sure that they did not for example
fine-tune any physics parameters per
task or something like that there was
just once if possible the fine-tuning
and then of course you do in addition to
make it robust you do some randomization
on the physics parameters as
well how here
is Cam on the or mle that's what said so
in this case actually the point cloud is
not for the sake of let's say
um like what pulkit for example did for
the for the object it's more like a
point Cloud because for example if you
if you want to have um state based
policies for different assets the
problem is you can't learn a single
state-based policy across assets
because the state which means the
position of the object doesn't convey
actually Which object you're holding in
your hand which means you can't learn a
policy that adapts to that so in that
case we replace the state by um
extracting a representation from the
point Cloud still in simulator from the
point Cloud that represents the asset
itself it's not coming from let's say a
camera observation also right so
at test time we have I don't know I
think it might even be one camera and
with the camera then we do object
detection because we still assume we
have the asset right and then use that
as the initial pose but then generate
the point Cloud again that goes in the
policy but you can you can well imagine
that so you you theet to the and then
you the the asset we still assume that
we have the asset even in the real world
and we take the asset Point Cloud but
we're using the pose estimate to place
the point Cloud relative to the gripper
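
Placing a known asset point cloud with an estimated pose is a single rigid transform. The sketch below assumes the pose is given as a rotation matrix and a translation in the gripper frame; the function name `place_asset_cloud` and the numbers are made up for illustration.

```python
import numpy as np

def place_asset_cloud(asset_points, rotation, translation):
    """Apply a rigid pose (R, t) to an (N, 3) asset point cloud: p' = R p + t."""
    return asset_points @ rotation.T + translation

# Hypothetical pose estimate: 90 degrees about z, then 10 cm along x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.10, 0.0, 0.0])

# Two points of a made-up asset model, in the asset's own frame (metres).
asset = np.array([[0.05, 0.0, 0.00],
                  [0.00, 0.0, 0.02]])

placed = place_asset_cloud(asset, R, t)
print(placed)
```

The policy then receives `placed`, so any error in the pose estimate shows up directly as a shifted or rotated cloud.
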
but the relatively obvious next step is
to learn all of that just purely based
on observed point
clouds can you comment on the accuracy
of the state estimation at test time
I'll get to estimation uh in three slides two slides
yes um
oh another step now is for example tactile
simulation um so this is some experiments
we did with the GelSight it has
kind of it generates a tactile image on
on on this pad that is uh between the
grippers or on the front of each of
the grippers and on the left side I'll
show you here for example this is a
simulation roll out of a policy where in
the lower left of course you see the the
gels side image that we can generate
in very fast in real time or faster than
real time um in simulation so now again
we can start doing simulation training
um using for example tactile information
as well and we're also looking at using
force feedback and things like that as
well um and then you can do of course
these kind of experiments nicely where
here's kind of the training success
using PPO on that where um in green we
have the training curve and success rate
that we would get if we give the policy
the ground truth position of the peg
which in reality of course you never
have if we now perturb that estimate
with some noise that the system doesn't
have then actually um PPO doesn't succeed
at all it's kind of the gray curve that
is the flat line at zero um and and then
these other three the blue orange and
purple curve they show training uh
results um if you use either a
camera that is placed on the
wrist of the robot right because that
gives you information about the object
where it is relative to the gripper this
would be the purple curve and then
if we use the tactile image this is
the orange curve which is very similar
but actually combining the tactile image
and the wrist camera image gives us a
blue curve which is better than those so
it kind of indicates that um of course
combining these different data sources
into a single policy uh improves
performance um again the nice thing
about simulation is we can start
making all these kind of experiments
right and really Benchmark different
things against each other uh we're also now
getting this to transfer to the real
world which means with zero shot
we can train a policy with tactile um
feedback and then it works in the real
world and it's just to highlight that
yeah it works in the dark if you have a
all right so so this is kind of for this
line of work for let's say more
industrial kind of manipulation task but
I think ultimately even our robots right
they should be able to plug something
into a power outlet or something like
that or USB and things like that so I
think all this kind of can lead to these
kind of capabilities as well in the open
world uh Beyond let's say these more
static simulation tasks uh of course I'm
sure many of you have seen this line of
work DeXtreme where Ankur
Handa and his collaborators did
in-hand rotation of objects and in this
case the interesting aspect was um I
don't want to spend too much to explain
the task but it has to rotate this
object to a certain configuration uh and
here he trained actually a state-based
policy and in order to successfully
execute that state-based policy in the
real world he also trained a keypoint
detector for the cube for example that
then can be applied in the real world to
give you the state so and that keypoint
detector was actually robust enough that
we also get sim-to-real transfer zero
shot now briefly one example on
the notion of pose estimation it turns
out I believe in many settings 6D
object pose estimation doesn't make
actually that much sense because many
objects like uh you don't have a model
for them and I think it's kind of
an artificial bottleneck but for example
you can imagine in industrial settings
or so where you might have actually
access to 3D models of your assets and
things like that pose estimation can
still be extremely helpful and this is
just some work I want to highlight that
Bowen Wen did at NVIDIA and
it's coming up at CVPR this year it's
called FoundationPose of course
foundation uh the idea is let's
assume you have a 3D model of your
object and um textured like the one up
here and this is for um kind of local
pose estimation so you assume you
can detect object with a bounding box or
something like that but the goal here is
to estimate the 3D position and
orientation of the object with high
Precision okay so you have some kind of
rough initialization and then you try to
estimate the pose and if you can do that
you also want to do tracking over time
by just initializing your estimator from
the previous time step so the idea here
is this render and compare kind of
approach which is something we've also
done before with techniques like
DeepIM or MegaPose so the input is a
rendered view onto the object where the
rendered view comes from your current
estimate for where the object is and it
is also kind of a part of the image
where the object is in this case it's
the is it the Cheez-It box I guess everything is
either Cheez-It box or mustard or so
nowadays and then this is the input to
the network and then the network is just
trained uh details don't really matter
but the network is trained to give as
output a local translation and rotation
of the object pose such that the
rendered object matches the observed
object closer right so it's kind of like
for those of you who know from depth
cameras or ICP kind of techniques this
is doing something like learned ICP but
in the full image depth space okay and
then the idea is you get a refined pose
estimate and then you can repeat that
process by rendering your object at your
refined pose and let your deep
network suggest another predicted
pose and again we've done this
before but now actually this is really
kind of a level up in the
capabilities and the key trick was
really also again on just data scale
right so Bowen trained this purely in
simulation we just have like the 40,000
objects from Objaverse and then also
from the Google Scanned Objects 1,000
then also we use some additional
texturing to have more variability on
the object texturing using for example
an object and then an LLM
and the LLM might say oh the wine glass
should be green or something like that
and then you have it's called
TexFusion that can then generate a texture
on the object and you get more variety in
your data and then of course in
your training data you do a lot of um
randomization on the lighting and
everything to make it robust and uh the
interesting aspect here is first this
initial estimation if you don't
have the pose relatively close yet but
you have to try different positions
takes a second but then tracking can be
done at 30 Hz with this network and
currently it's state-of-the-art on
many of these kind of data sets
that measure kind of 6D pose estimation
and tracking um and the key thing here
is that this network is trained um since
it's trained on all these objects it can
do zero shot object tracking which means
many of these previous Works they assume
that you can train your network on the
object that you want to track in this
case the object is just an input to the
network okay I'll just give you some
examples here so on the left observation
on the right side you see kind of a
rendering of the
object where the system detects it on
side so it's actually extremely stable
and robust and again it's very very
precise and also let me note again that
these objects have not been in the
training set of of the network right so
it's truly kind of zero-shot new object
and it can do that that's where the name
FoundationPose comes from of course
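
The render-and-compare loop described above (render the model at the current pose estimate, compare against the observation, predict a local pose update, repeat) can be sketched roughly as follows. This is a toy under stated assumptions and not FoundationPose code: the "renderer" only transforms 3D model points, and `kabsch_update` is a hypothetical stand-in for the trained network, using a closed-form point alignment with known correspondences in place of a learned prediction from images.

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def render(model_points, R, t):
    """Toy 'renderer': place the 3D model points at the current pose estimate."""
    return model_points @ R.T + t

def kabsch_update(rendered, observed):
    """Stand-in for the learned refiner (an assumption for illustration):
    closed-form rigid update (dR, dt) moving rendered points onto observed ones."""
    mu_r, mu_o = rendered.mean(axis=0), observed.mean(axis=0)
    H = (rendered - mu_r).T @ (observed - mu_o)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    dR = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    dt = mu_o - dR @ mu_r
    return dR, dt

def refine_pose(model_points, observed, R, t, predict_update, n_iters=3):
    """Render at the current estimate, predict a local update, compose, repeat."""
    for _ in range(n_iters):
        rendered = render(model_points, R, t)
        dR, dt = predict_update(rendered, observed)
        R, t = dR @ R, dR @ t + dt  # compose the update with the current pose
    return R, t

def track(model_points, observations, R0, t0, predict_update):
    """Tracking: initialize each frame's refinement from the previous estimate."""
    R, t, poses = R0, t0, []
    for observed in observations:
        R, t = refine_pose(model_points, observed, R, t, predict_update, n_iters=1)
        poses.append((R, t))
    return poses
```

In this sketch the object model is an input to `refine_pose` rather than something baked into the refiner, which mirrors the point that the network is not trained on the specific object being tracked.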
okay uh one more example here on the
left side because for industrial stuff
these kind of objects like shiny metal
pieces and things like that actually
very relevant right for many of the
tasks that you want your robot to do and
we just uh yeah have it here in front of
the camera and you can see that tracking
Works actually very very well without
any additional training yeah can you
include RGB-D or is this
RGB-D yeah I'm tting as
well and here on the right side this is
I think where they built the model just
maybe even with something like BundleSDF
or so but then you can see that the
tracking is actually even the lighting
conditions are not great but the
tracking is very very
good so why I'm showing you that is just
to highlight that sim-to-real doesn't
only work with point cloud kind of data
right or depth data but it's also
getting nowadays better and better at
RGB kind of sim-to-real there was a question
there yeah that piece of metal has
just like no texture right other than
just like a couple uniform holes so do
you think it's actually using much of
the color data or do you think a lot of
it is just using the depth because you
give it you gave it both right I think
it's using I would say it's using a
combination you can see there's for
example also still maybe holes like that
um and I mean that's a nice thing versus
for example keypoint-based kind of
techniques right they always rely on
visual keypoints that you can detect
and this system that is all just
trained end to end doesn't require that
who knows what it does internally right
so I don't have a a clear answer in this
um and it's very robust also as you can
see actually with respect to um
right how do you first get a model out
of it yeah so the idea is with these
object that's different actually you can
look at the at the there's different
setups one is for example in the left is
where we actually have a CAD model
because for many of these industrial
parts you have the CAD model that we can
then use readily and for some of those
you can even train it to
do pose estimation just based on multiple views
onto the object it doesn't actually
require a full 6D pose of the
object if you can generate this from
NeRF or something like
that yeah totally yeah yeah yeah yeah
yeah yeah yeah yeah I think this might
even be what they did in this case
okay all right so now that uh well what
I tried to convince you of is that
simulation can work pretty well right
for at least this set of tasks that we
looked at I'm not saying simulation can
solve everything but we're getting
reasonably good sim-to-real transfer
results both on the on the physics and
on the um appearance of things um now
where we want to go with this because so
far all I showed you is kind of um
individual little projects right where
we set up the training specifically for
this and and and the assets and
everything and now want to just describe
like where I think we can go with more
like setting up a larger framework for
how to do that and it's kind of like a a
Sim based robot training pipeline right
uh where the idea is there's like in my
view there's kind of three key steps
that we always have to do if we want to
uh train these things in simulation one
is first of all if you have a certain
task or so you generate randomized
assets and scenes right that represent
whatever application domain you you you
want to worry about then the next thing
is we need to be able to generate
certain tasks in this environment and
also um could be rewards depending on uh
how you generate your solutions for
example in the industrial setting we
used reinforcement learning so you have
to set up your rewards um and the next
step then is for example you go into
these environments and then because
you're in simulation you take advantage
of the privileged information that the
simulator gives you for example right so
for example I know exactly where the
objects are I have the pose estimation
I have the perfect shape estimation that
is a key advantage that I have being in
Sim and then it turns out that many of
these techniques actually work very well
they can solve tasks in Sim that we
cannot yet solve in the real world so
that's a key trick and then what we do
is we just generate many tasks we use
techniques like task and motion planning
or reinforcement learning to solve these
tasks and now what we can do is we can
use these task solutions to do Behavior
cloning on them right where the key
trick now is to say we take all these
demonstrations which are policy rollouts of
the demonstrations and then render these
rollouts with the sensor information
that my real robot will have access to
and then I can do Behavior cloning and
we're exactly in this kind of world
where we want to be right uh and just to
be very clear about this I don't believe
that we will be able to do everything
just in simulation right so clearly
adding and combining that with real
world demonstration data or even
video data and things like that will be
crucial in the long
run and then of course what
we're looking into is just building the
compute infrastructure for doing all of
this where do you store the data what
formats do you store the data in we're
using this USD representation uh for
these simulation environments and then
also like how do you train these models
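
The pipeline just described (randomized scenes, solutions computed with privileged simulator state, rollouts re-rendered with the sensing a real robot would have, then behavior cloning) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the world is one-dimensional, the "expert" is a proportional controller standing in for task and motion planning or reinforcement learning, the "rendering" is just a noisy position reading, and the policy is linear.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_scene():
    """Randomized 'scene': 1D positions of an object and the gripper."""
    return {"object": rng.uniform(-1, 1), "gripper": rng.uniform(-1, 1)}

def expert_action(scene, gain=0.5):
    """Solve the reaching task with privileged state (exact object position)."""
    return gain * (scene["object"] - scene["gripper"])

def render_observation(scene, noise_std=0.01):
    """Re-render the rollout with what a real robot would sense:
    a noisy object estimate plus its own gripper position."""
    noisy_object = scene["object"] + rng.normal(0.0, noise_std)
    return np.array([noisy_object, scene["gripper"]])

def build_dataset(n):
    """Pair sensor-level observations with the privileged expert's actions."""
    scenes = [generate_scene() for _ in range(n)]
    observations = np.stack([render_observation(s) for s in scenes])
    actions = np.array([expert_action(s) for s in scenes])
    return observations, actions

def behavior_cloning(observations, actions):
    """Fit a linear policy action = obs @ w by least squares."""
    w, *_ = np.linalg.lstsq(observations, actions, rcond=None)
    return w

observations, actions = build_dataset(2000)
policy_weights = behavior_cloning(observations, actions)
```

The point of the sketch is only the data flow: the expert never sees the noisy observation, and the cloned policy never sees the privileged state.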
so let me just give you an example for
uh what I have in mind here so for
example on the scene generation if you
want to train a robot now to do more of
these indoor let's say kitchen tasks
then um we have a project that is
looking at this uh programmatically
procedurally generating these kind of
synthetic scenes of kitchen environments the
key thing to notice is that these are
articulated scenes and also so that they
work with a physics simulator that's of
course always the important part right
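
As a rough illustration of what procedurally generating an articulated asset can look like, here is a sketch that emits a minimal URDF-style description of a cabinet whose drawers slide on prismatic joints. The function names, dimensions, and joint limits are illustrative assumptions and not the project's actual generator; a usable asset would also need visual geometry, collision shapes, and inertial properties before a physics simulator could do anything with it.

```python
import random
import xml.etree.ElementTree as ET

def cabinet_urdf(n_drawers, drawer_height=0.2, depth=0.5):
    """Build a cabinet body with n_drawers drawers, each on a prismatic joint."""
    robot = ET.Element("robot", name="cabinet")
    ET.SubElement(robot, "link", name="body")
    for i in range(n_drawers):
        link_name = f"drawer_{i}"
        ET.SubElement(robot, "link", name=link_name)
        joint = ET.SubElement(robot, "joint", name=f"{link_name}_slide", type="prismatic")
        ET.SubElement(joint, "parent", link="body")
        ET.SubElement(joint, "child", link=link_name)
        # Stack the drawers vertically; each one slides out along x.
        ET.SubElement(joint, "origin", xyz=f"0 0 {i * drawer_height:.3f}", rpy="0 0 0")
        ET.SubElement(joint, "axis", xyz="1 0 0")
        ET.SubElement(joint, "limit", lower="0", upper=f"{0.8 * depth:.3f}",
                      effort="50", velocity="0.5")
    return ET.tostring(robot, encoding="unicode")

def random_cabinet(rng):
    """Randomize over the asset parameters once a single asset can be generated."""
    return cabinet_urdf(
        n_drawers=rng.randint(1, 4),
        drawer_height=rng.uniform(0.15, 0.3),
        depth=rng.uniform(0.4, 0.6),
    )

scene_assets = [random_cabinet(random.Random(seed)) for seed in range(3)]
```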
um and then because we're in simulation
if you can do it with one you can then
randomize over that as well um it turns
out that the variety that you can come
up with these kind of techniques might
still be limited but at least now we
have a large set of
environments right that uh we can run in
our full physics
simulator um one question is then uh how
can we go beyond just these procedurally
generated environments and of course um
we can leverage generative AI for for
going beyond the simple assets and for
example Katerina is doing some really
cool work in in that domain we're
looking at for example two projects
right now where for instance you might
want to generate assets like cabinets
and drawers and things like that and you
can imagine we train a system where
the LLM says something like okay the
cabinet has four shelves and two drawers
and then from that actually you go to a
URDF description of that cabinet you learn
to go there put that out and then you
actually have also shape models for the
individual components of that and then
out you might get like fully functional
uh shape assets that you can um then use
in your simulator another line of work
my student Zoey did this it's called
URDFormer where what she did is the
following she said okay I I can generate
in this upper row I can generate let's
say simple procedurally simple assets
for like doors and drawers and cabinets
and things like that she then
figured out a way to
use and I'll point you to the paper to use
stable diffusion so that you can render
now actually pretty nice realistically
looking views onto these objects with a
lot of variety in them the key trick is
that these actually rendered images are
consistent with the URDF model with
respect to handles and things like that
right um so that was one tricky piece
but now what you can do is you can train
a deep Network that goes the other way
around so for example the idea here is
you can download an internet image of a
kitchen and you can train a transformer
model that then goes to generate a URDF of
that kitchen with all the drawers and
doors right so um and also you can do
this it works even much better of course
for individual assets like whatever
fridges and stuff like that but the
idea is now um that you can download
many many images and just convert them
into these assets that you can now feed
into your simulator and that means that
this simulation environments will have
far more variability and diversity than
whatever you can come up with manually right so
once this is done then of
course you can run your robot in these
environments and do uh training in
them so the next step is if we have
these randomized assets and scenes
the next step would be generating tasks
and rewards so one way to do this is
kind of do it manually where you place
objects in them and then you might say
hey I want the objects to be in the
drawer or set the table and
things like that um another way is of
course what we're looking at and again
Katerina is doing some really cool work
in that domain is using LLMs to generate
tasks automatically so you can tell the
LLM for example okay these are the objects
in front of the robot just suggest some
things a robot can do and it turns out
that they are surprisingly good at that
at the same time you have to have
techniques for filtering out all the
noise of course that they generate but
overall I think it's a very promising
Direction so now that we have these
scenes with assets and tasks um uh
currently what we're looking at is using
task and motion planning in order to
generate demonstrations for how to solve
these tasks so that's a whole area um
that I'm sure many of you are familiar
with is kind of especially for robot
manipulation so it's a planning
technique that works both at the let's
say abstract action level planning but
also at the continuous and physics level
planning so for example they
reason about the discrete state of
the environment for example if a door is
open or closed or if the robot is
holding an object or not they have
actions they have preconditions for
actions and effects of actions just the
classical planning kind of
style but then they
also reason about for example um if the
robot wants to pick up this can I find a
continuous robot Motion in order to move
the gripper there and can I place the
object at a different location and
things like that so there's various uh
very capable systems out there we use
the one from Caelan Garrett it's called
PDDLStream and the key trick here is
again these task and motion planning
systems they're not that great in the
real world yet because they require
access to the full state of the world but the beauty
is we are in a simulated environment so we
have access to everything we need and
that's why these TAMP systems are pretty
good let me give you just one example
here so for example in this case the
task is for the robot to put the the tea
kettle on the stove somewhere and hold
two teacups I know it's not the the most
exciting task but it's just illustrating
going okay so we just specify the high
level goal and then the planner does all
the motion generation and everything for
us okay and once we have that you can
imagine that if we could generate
thousands and thousands and thousands of
these things and generate the
demonstrations we can now render the
demonstrations with the kind of views that a
real robot would get right we're just
simulating for example the wrist camera
views we we get access to the state and everything
um and then use that as the training
yeah change we mention the cup is filled
with water here could you louder I
second yeah what the thing of the
manipulator change if we instead of just
saying pick up the two cups pick up the
two cups fill with
something oh if you if there's liquid in
there yeah so if there's liquid in the
cup oh well that depends whether your
task and motion planner has that model
right at this point so I guess uh you
can imagine actually llm might be able
to say oh be sure to hold it upright
they're actually capable of this kind of
reasoning right but in the task and
motion planning system so you need to
explicitly have these kind of things
modeled at this point
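
To make the "explicitly modeled" point concrete, here is a minimal sketch of the discrete layer of such a planner: actions with preconditions and effects, searched breadth first. It is not the PDDLStream API, and the fact and action names are invented for illustration. A real task and motion planner would additionally check that each step has a feasible continuous motion. The liquid case only works here because a `cup_full` fact and an upright-only placement action were written into the model by hand.

```python
from collections import deque
from typing import NamedTuple

class Action(NamedTuple):
    name: str
    pre: frozenset     # facts that must hold before the action
    add: frozenset     # facts the action makes true
    delete: frozenset  # facts the action makes false

F = frozenset
ACTIONS = [
    Action("pick_cup", F({"hand_empty", "cup_on_counter"}),
           F({"holding_cup"}), F({"hand_empty", "cup_on_counter"})),
    # Only allowed when the model knows the cup is empty.
    Action("place_any_orientation", F({"holding_cup", "cup_empty"}),
           F({"cup_on_table", "hand_empty"}), F({"holding_cup"})),
    Action("place_upright", F({"holding_cup"}),
           F({"cup_on_table", "hand_empty"}), F({"holding_cup"})),
]

def plan(start, goal, actions):
    """Breadth-first search over discrete states; returns action names or None."""
    start, goal = frozenset(start), frozenset(goal)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action in actions:
            if action.pre <= state:
                nxt = (state - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None

empty_plan = plan({"hand_empty", "cup_on_counter", "cup_empty"}, {"cup_on_table"}, ACTIONS)
full_plan = plan({"hand_empty", "cup_on_counter", "cup_full"}, {"cup_on_table"}, ACTIONS)
```

If the `cup_full` and `cup_empty` facts were left out of the model, the planner would have no way to prefer the upright placement, which is the limitation raised in the answer above.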
yeah one very quick am I running out
of time soon I just want to give a
quick plug you've seen in the
previous animation that the motion of
the arm was kind of a little bit
convoluted right so I don't
know whether people have heard of cuRobo
which is a tool we just developed
Bala did this it was an ICRA paper last
year but for example with the standard
motion generation planners you can see
here on the upper left they often
generate demonstrations even in
simulation that kind of generate this
motion as you can see here that is kind
of very convoluted let's say right it's
not a very natural path the problem with
the sampling based paths is also that
they're very hard to learn from for
Behavior cloning because there's too
much variability in them here you can
see another example if you use this
bidirectional RRT planner for
example and what Bala developed with cuRobo
is a technique that is fully optimized for
GPU and it generates actually plans so
it combines RRT-based sampling
generation with optimization and everything
so the package is on GitHub you should
download it and it's much faster
than the alternative techniques
right now so it's faster it generates
much shorter paths as well better paths
and another key thing is it actually
works with Point cloud data right which
is one of the key limitations many of
these other techniques have let me just
give you one example here here's uh just
a demonstration of the the speed of the
motion generation so whenever it reaches
the previous goal Point indicated by the
planning okay so it's very fast in doing
that on the right hand side you can also
do then kind of this reactive motion
framework and we're
integrating that of course with our
demonstration generation as well so that
the demonstrations that you give your
robot for the training are becoming
more consistent and we can generate
them faster as
well and then finally once we generate
the data all we have to do is behavior
cloning and there are now as we know
there's just so many techniques out
there right different
kinds of diffusion policies we have
techniques that use 3D
transformer kind of techniques you could
imagine techniques like Perceiver-Actor
that some of you might have heard of
or RVT is a recent technique Katerina did
some nice work also on Act3D
diffusion policy learning but there's
many techniques out there so um but the
idea is once we have that data set we
can train them on the data set and
evaluate how well they do right here's
just one example this is work by Murtaza is he
here oh no he said he might not be
able to but Murtaza here he did this
during his internship with NVIDIA
where we actually were able to set up
all these environments with full task
and motion planning okay so there's no
manual human teleoperation data
generation but it's all task and motion
planning and use that then for Behavior
cloning training of of a transformer
model and he got very good results
but now moving forward imagine doing
something like this with the kind of
data that we could generate in these
kitchen environments and at a much much
much larger scale okay that is kind of
where we are where we're moving right
now so now I'm coming to the very last
section so uh so now that we have all
this data let's say the question is how
can we reuse it or how should we connect
it for example to these large language
models right um uh because what the
language models really provide to us is
this open world kind of reasoning right
they they can provide at least some kind
of guidance in settings that we
haven't seen in any kind of robot
training data or things like that right
the question is how do we integrate our
robotics models with these large
language models right on the one hand
one way we could do this is
the one on the left which I would call
loose integration all right which means
we have our vision language models you
take your favorite whatever GPT-4V
and you just use this to give
guidance to the robot so it's becoming
your scene understanding and
planner and then we as roboticists we
train some skills for example pick or
place and things like that and then the
interface between that is just that the
vision language model just calls the
right skill that is kind of what I
showed when I showed this ProgPrompt
work it's kind of at that level right
where we predefine exactly the kind of
skills that robot needs and uh
independently we're just using the VLM to
to work with those I think ultimately
that that's just going to be too brittle
and also the notion is that it's
easy to come up with a small set of
skills like pick place open close but
overall I think the the things we want
these robots to do are not always easy
to classify into these discrete set of
skills so another extreme is we just say
well we'll have a big Vision language
action model right that takes in so on
the left side it's let's say text and
images because these models are not
trained on robot data but on the right
hand side maybe we can use the robot
data to train end to end a really full
vision language action model that takes
as input images any information about
the robot state history and things like
that and they literally output controls
right this is kind of the RT-2 kind of
model right where they actually really
output delta actions for the
robot manipulator
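
For contrast with a model that outputs controls directly, the loose integration described earlier can be sketched as a planner that only emits calls into a fixed skill library. This is an illustrative sketch, not any specific system: `fake_vlm_planner` is a hypothetical stand-in that returns a canned plan where a real vision language model would be queried, and the skills just write to a log where a trained policy would run.

```python
from typing import Callable, Dict, List, Tuple

SkillCall = Tuple[str, str]  # (skill name, target object)

def make_skill_library(log: List[str]) -> Dict[str, Callable[[str], None]]:
    """Predefined skills; each would wrap a separately trained robot policy."""
    def make(name: str) -> Callable[[str], None]:
        return lambda target: log.append(f"{name}({target})")
    return {name: make(name) for name in ("pick", "place", "open", "close")}

def fake_vlm_planner(instruction: str) -> List[SkillCall]:
    """Hypothetical stand-in for a vision language model: returns a canned plan."""
    if "drawer" in instruction:
        return [("open", "drawer"), ("pick", "cup"),
                ("place", "drawer"), ("close", "drawer")]
    return []

def execute(plan: List[SkillCall], skills: Dict[str, Callable[[str], None]]) -> None:
    """The whole interface: look up each named skill and call it."""
    for name, target in plan:
        if name not in skills:
            # Anything outside the predefined set cannot be executed.
            raise KeyError(f"no trained skill named {name!r}")
        skills[name](target)

log: List[str] = []
execute(fake_vlm_planner("put the cup in the drawer"), make_skill_library(log))
```

The `KeyError` branch is where this kind of interface tends to break down: a step such as pouring or wiping has no entry in the skill set, however well the planner reasons about it.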
um I must I don't know maybe right um it
seems from a robotics perspective we
often like kind of hierarchical kind of
approaches but um I'll go on the next
slide I want to talk about but that is
one approach and I think there's
something in between where we're
actually not just using these vision
language models but we're fine tuning
these vision language action models together
with robot data so that they become
better but the output is not necessarily
a discrete set of skills that we train
with our robots independently but the
output is some kind of tokenized
interface that we train and align with
robot skills that we learn so and then
what we have on the low level is maybe
more like a policy that could be a
diffusion policy or your favorite
Transformer based policy uh that then we
train for the robot to execute right so
it's kind of between these two extremes and
I'll leave it to you or maybe to
the discussion which of those
is your favorite so now why or
how could we even train these Vision
language action models right these not
in a sense like not just taking these
models as the vision and language people
give them to us but how can we improve
on them right because the insight
why I think there's actually hope is the
following
so these llms we know they were trained
on these huge data sets and they are
extremely capable for the kind of
tasks they were trained on then there are
many vision models out there now that
were also trained to provide very robust
representations of image data right
and examples you know DINO CLIP masked
autoencoder kind of models so the key
trick of these models is that they were
trained with very weak supervision that
you can generate on huge data sets and
the representations they generate are
because of the size of the data set
actually very very capable let's put it
this way right and they showed that you
can very quickly adapt them on
downstream tasks so the open-source
community if you now want to train
vision language models not just language
models or vision models is what they do
for example what the LLaVA work
does is the following they have the
language model as the backbone because the
language model is kind of the reasoning
engine of all of that it's the one that
is trained on huge amounts of data right
and then they take the image model that
also provides actually very good very
capable representations and in a
first stage they used a set of training
data such that they just align the image
embedding with the language embedding
right so that your image tokens
and language tokens live in the same
embedding space or semantic embedding
space right and so what they do is again
they first align the image embedding
with the language embeddings and then
they have a data set on which they train
the whole system end to end using fine
tuning techniques right and for fine tuning
you know there's various techniques
low rank adaptation and
things like that but the key trick is
that this overall training requires
far less data than what went into these
individual modalities that were being
trained right and there's now a community
that is looking at these what's called
multimodal large language models this is
taken from a recent survey if you want
to so it's a very nice paper that
summarizes kind of what's going on and
this is just expanding on that idea
where the inputs are these
individual modalities like images video
audio and then we first project them
into a language space right just like
what the LLaVA model is doing and you
can use many different kinds of backbones
for that it doesn't have to be one
specific one and then on the output side
they also then train it to output
text like for images for example but now
the most recent work is also moving into
even on the output connecting it
to diffusion kind of models so that
the output is not just text but you can
generate videos on the output side and
they do that through a similar alignment
process right so if we now come from a
robotics perspective if we now add
let's say robotics data to this kind of
mix then maybe there's a hope that we
can learn capable models without needing
the kind of internet scale data that
the language people have at their
disposal so that's why I'm hopeful
that we can make some real progress here
in this domain and with that I
And with that I want to summarize briefly. So, okay, in the interest of time:
we've seen huge progress, and we're still not there yet. I think there are
various ways in which simulation can help, and I'll leave the details here to
the discussion. I think another aspect that we really have to look into is
benchmarking, because right now a lot of the work is kind of showing videos
rather than really showing the capabilities of these models. I think simulation
can make quite some contributions to that. And going back to robot GPT, I think
we're still not there yet, clearly still a way to go, but I think generating
large demonstration data in simulation, combining that with real
demonstrations, and then mixing this with all these other existing models to
train end-to-end systems is, from my perspective, a very promising way to go.
And there are many, many questions, of course, on how to specifically do this.
How do we get geometric reasoning capabilities set up better? What kind of data
should we generate? How much do we need? I don't think anybody really has the
answer to this question of how much we need. And how does the integration of
the language and the robotic action work? And of course, in the end it's clear
that we will need real-world fine-tuning and learning and reinforcement
learning for these robots to be really capable in the real world, so this pure
behavior cloning is not going to be the end of the story. And with that I'd
like to thank you for your attention, and of course all the people that were
part of this work at NVIDIA and UW, and all the interns. Thank you.
[Applause]
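The data recipe in the summary, combining demonstrations generated in simulation with real demonstrations, can be sketched as a batch sampler that draws a fixed share of each batch from the real data. This is only an illustration under stated assumptions: the function name `mixed_batches`, the 25% real share, and the toy datasets are hypothetical and are not taken from the talk, which leaves open how much data of each kind is needed.

```python
import random


def mixed_batches(sim_demos, real_demos, batch_size, real_fraction, seed=0):
    """Yield batches with a fixed share of real demonstrations (hypothetical helper)."""
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    n_sim = batch_size - n_real
    while True:
        # Sample with replacement so a small real dataset can still fill its share.
        batch = rng.choices(real_demos, k=n_real) + rng.choices(sim_demos, k=n_sim)
        rng.shuffle(batch)
        yield batch


# Toy stand-ins: many simulated demonstrations, few real ones.
sim_demos = [("sim", i) for i in range(1000)]
real_demos = [("real", i) for i in range(20)]

# The 0.25 ratio is an arbitrary assumption for illustration.
batch = next(mixed_batches(sim_demos, real_demos, batch_size=32, real_fraction=0.25))
```

Sampling the real demonstrations with replacement means the scarce real data is revisited far more often than any individual simulated example, which is one simple way to keep it from being drowned out.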
I went over. We are slightly over time. We'll take a couple of questions, and
those who want to leave can.
Thank you for the great talk. I think you showed exciting solutions from sim to
real. Can you comment on the other direction, like real-to-sim: how do we close
the loop, like, use some real data to improve a simulator?
Yeah, that's a great question. Right now we see a lot of work on training
models for automatic 3D asset generation. My experience is that many of the
assets that are being generated are not quite good enough for physics
simulation. They often look very good, and it's very exciting, but I don't
think they're quite good enough for physics sim. That's why, for example, we
looked at this work with URDFormer, where we're explicitly generating URDFs and
very simple shape models, but at least some that work with the physics
simulation.
work with the physics simulation um
um and there there's another interesting
and there there's another interesting aspect also with respect to to
aspect also with respect to to benchmarking actually that is related to
benchmarking actually that is related to this question right so if you want to
this question right so if you want to use simulation for example to Benchmark
use simulation for example to Benchmark capabilities how do you make sure that
capabilities how do you make sure that um the Insight that you're gaining from
um the Insight that you're gaining from the simulation benchmarking that they
the simulation benchmarking that they are inside that that transferred to the
are inside that that transferred to the real world and there's we can s about
real world and there's we can s about some details about that but there is uh
some details about that but there is uh some really interesting questions
some really interesting questions related to that as well
Thanks. I'm just curious: when you mentioned the end-to-end
vision-language-action model, the vision-language model is actually
Transformer-based and the trained robot model is a diffusion model. How do you
combine the two models together?
Yeah, so the robot model doesn't have to be diffusion. You can imagine various
ways of doing this. One is, you could discretize your output action space,
actually, when you're using a Transformer. And then of course you can also use
a diffusion process on the output side of your trained Transformer, right? So
the fact that it's a Transformer model doesn't mean that it can't also be a
diffusion process that you use on the output side.
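The first option mentioned in this answer, discretizing the output action space so that a Transformer can emit actions as tokens, can be sketched with uniform binning. This is a minimal illustration, not the speaker's implementation: the function names, the choice of 256 bins, and the [-1, 1] action range are all assumptions made for the example.

```python
def discretize(action, low, high, n_bins=256):
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)          # clip to the valid range
        frac = (a - lo) / (hi - lo)      # normalized position in [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize(tokens, low, high, n_bins=256):
    """Map bin indices back to the center of each bin."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 3-D end-effector delta, each dimension bounded to [-1, 1].
low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
action = [0.0, -1.0, 1.0]

tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
```

Each token can then be predicted like a word in a vocabulary. The cost is a quantization error of at most half a bin width per dimension, which is one reason a continuous output head such as a diffusion process is an alternative.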
So you're referring to this slide, right? And even this simple step, like
outputting control, is a really tricky question. At what level do you do
control? At what frequency do you do control? Is it in the robot frame of
reference? Is it in the camera frame of reference? All these kinds of
questions, I think, are still open. There are hints where people say what works
better than other approaches, but there's still a lot of work to be done, and
for that we need to just thoroughly investigate these questions.
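To make the frame-of-reference question concrete: the same commanded motion has different coordinates depending on whether it is expressed in the camera frame or the robot base frame, and the two are related by the camera's mounting rotation. The sketch below assumes a hypothetical forward-looking camera (camera x right, y down, z forward) on a robot whose base frame has x forward, y left, z up. The matrix and function names are illustrative only, and only translational deltas are handled.

```python
# Columns are the camera axes written in base-frame coordinates (assumed mounting):
# camera x (right) = base -y, camera y (down) = base -z, camera z (forward) = base +x.
R_BASE_CAM = [
    [0, 0, 1],
    [-1, 0, 0],
    [0, -1, 0],
]


def camera_delta_to_base(delta_cam, rotation=R_BASE_CAM):
    """Rotate a translational action delta from the camera frame into the base frame."""
    return [sum(rotation[i][j] * delta_cam[j] for j in range(3)) for i in range(3)]


# 5 cm along the camera's optical axis is 5 cm forward for the robot.
forward = camera_delta_to_base([0.0, 0.0, 0.05])

# 10 cm to the camera's right is 10 cm along the robot's negative y axis.
right = camera_delta_to_base([0.1, 0.0, 0.0])
```

A policy trained to output camera-frame deltas needs this conversion, and therefore a calibrated camera pose, before the command reaches the controller, whereas a policy trained in the base frame has to learn the viewpoint dependence implicitly.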
Which kind of microchip do you think is for edge computing? Which kind of GPU,
or...
Oh, of course. No, but I mean, anyway, for a lot of this perception-driven kind
of robotics you will need some kind of computer on the robot for reactivity,
but a lot of the compute you can maybe offload either to the cloud or to some
desktop compute, and I'll leave it to others to judge at this point what...
I think, last question.
To kind of piggyback on this question about the control stuff: you alluded to
the fact that the control signals could be extremely diverse, right? You can
have them at different levels of granularity and frequency. And, I mean, maybe
this is kind of my bias as a human: it's hard for me to interpret control
signals visually and easy for me to interpret images, but it kind of appears to
me that control signals are a much less smooth kind of signal to learn than
images and language. So I guess the question is: do you expect, or maybe have
you seen any evidence or any conjecture, whether these things that are so super
successful in vision and language will ever kind of scale, at the same level of
data, to that, like, multi-hierarchy, completely abstract, multi-embodiment...
Yeah, that gets back to this notion of the things that are so easy to us,
right, like this spatial geometric reasoning. That leads to one of the
questions I had: how can we get these vision-language models to become much
better at geometric reasoning, 3D kind of reasoning? Because one of the
problems, I guess, is that the language data doesn't provide the details, and
neither does a lot of the vision data. Now, if you think about models like Sora
that can generate scenes, there's something 3D in there, right, but it's not
quite the quality that we need for these robots. At this point I think maybe we
can actually get pretty far, but we will need, I think, a lot of data. But
maybe to train these 3D capabilities, that is where I think simulation can
really shine, because we can generate a lot of this variability.
So we did, for example, one line of work, MπNets, where we trained a policy for
this motion generation for a manipulator. You have a cabinet or something and
some obstacles, and you just want to reach a point without colliding, and you
do planning in this joint space, and I showed these cuRobo techniques. So what
we did there is, for example, we generated a dataset with, I don't know, let's
say a million examples of reaching, and then we were able to train a network
that actually can compile that into a policy that just takes in the point cloud
and a goal, and it can kind of move the arm there pretty smoothly, and also in
a way that is not just local, right? It kind of learned some global kind of
collision-avoidance notion.
And I think that's similar for us humans: for us it's really easy. But I would
actually imagine that many of the models that we're seeing right now, for
example the RT models and things like that, would not be able to solve that
task, because they're mostly doing just free-space reaching, right? Even the
simple thing of reaching around obstacles and stuff like that is something that
the current models don't have. But I think, again, with sufficient data I would
hope it might come out of it. It seems so easy, right?
so I question [Applause]