This presentation details the development of a low-latency thread pool and job system to optimize a physics engine for a new game, focusing on overcoming the challenges of parallelizing sequential constraint solving.
Mind Map
Click to expand
Click to explore the full interactive mind map
All right, welcome back everyone and uh
thank you Casey for the previous talk.
Amazing detail and amazing amount of
research you must have done on that.
This talk is hopefully going to be a
little bit shorter so we get lunch. Um
so Sam reached out to me a couple of
weeks ago and asked if I wanted to
present something and I came to think
about this work I did about half a year
ago on optimizing and parallelizing the
constraint solver for our upcoming game.
And as it turned out, uh, this talk
isn't going to be that much about the
constraint solver or how to paralyze it.
It actually is going to be more about
the low latency thread pools and job
system that I had to write to make it
all run efficiently on modern hardware.
So, but I hope you you'll enjoy it.
Anyway, you're going to see some physics
movies at least.
Uh, so just some background on me. I'm
Dennis Gustafson and I am the original
creator of the game Tearown, which is a
Voxil uh sandbox heist game that came
out in 2020.
I was also the only programmer on the
project up until the initial release and
now we're a bit more. Um it it features
a um custom u software rate tracing
voxal rate tracing engine which is
pretty special and also a a custom
physics engine that operates and does
collision detection directly on voxil
volumes um and and a bunch of of other
stuff. If you haven't played it um ask
me afterwards. I have a bunch of codes
uh to give out if you want to try. So
for the last year and a half, I haven't
actually been uh involved in the further
development of tearown even though the
development continues.
Uh I've been um making prototypes and uh
started to implement a new engine for
our uh an upcoming game and that's the
the one I'm going to uh talk about
today. So it's it's an engine that is uh
relatively similar to Terodon. We wanted
to take all the things we learned from
tear down
um and um remove a lot of limitations
and restrictions on the old engine
um and um and see how far we can push it
basically. So one of the things we we
did was to um we rewrote the graphics to
use parts of the hardware rate tracing
pipeline is still tracing voxels at the
very lowest end. Um, but by doing that
we could up the number of dynamic
objects dramatically because the old
engine worked in conjunction with the
CPU and the GPU because I was open using
OpenGL at the time because it didn't
know better. So I didn't have access to
the hardware rate tracing but now that
we do we do it in Vulcan. I also had to
rewrite the physics to feed this new
graphics engine with lots more objects.
uh and
doing that uh most of the work went into
making everything run very parallel
because in this time that passed since
the development of tear down uh desktop
PCs and gaming PCs got a lot more
threads. So that's really the key to
getting uh more performance.
So a little background on physics
engines even though it's not the really
the topic of this talk could be good to know
know
an overview at least of how they work.
So you have a bunch of objects
and what you typically do is split the
the workload up into two phases. So you
have first first phase is collision
detection which means you're going to
look for pairs of objects that are close
or overlapping and figure out some kind
of contact points where they collide.
And uh that part is uh ridiculously
parallel. It's has no data dependencies.
You can just split up all the pairs and
do them parallel.
There are no um no locks or anything you
need to do that. So very easy. The other
part when you have this list of constraints
constraints
um is the constraint solver and that
takes the dynamic state which is the
velocity and angular velocity of all the
bodies and it tries to advance time into
the next dynamic state for each body uh
while taking all these constraints into
consideration. And the way Terodon does
that in almost every game out there
right now uh is to do that using a
method called projected Gausidel
or uh sequential impulse also another
name for the same thing. Um and the way
it works is very simple actually it you
just u sequentially
satisfy all these constraints. It's like
a relaxation method. So you satisfy the
first one, compute an impulse, apply it
to uh body A and B and then you progress
through the next one and do
do
the same thing which applies impulse to
B and C and then the third one and the
fourth one and as you can imagine this
has to be done in sequence and that's
what makes it absolutely terrible to
paralyze because you have to do the
first you have to complete the first
thing before you can progress to the
next because you want the result uh from
the impulse up here to be applied to be
where you measure how much of an impulse
you want to apply uh on the second
contact and that makes a huge difference
in terms of convergence for the solver
if you do it that way.
Um, so you go through all these and as
you can imagine when you when you solve
the first one it's going to be solved
perfectly but already when you solve the
second one it's going to like disturb
the solution on the first one. So now
it's not going to be satisfied anymore.
So when you done all the context you're
going to have to go back and do it one
more time and then one more time and one
more time and you do this like
iteratively. And a typical number that
games use is like four, five, six,
seven, eight times in that range.
Uh, terodon uses eight. This new and
uses a six
and um,
yeah intuitively
very hard to paralyze because of this
this sequential thing.
So what Terodon does and also most other
games or physics engines is to uh,
instead of trying to
split the way we solve constraints is to
split objects in the world. So if we
have a bunch of objects over there and a
bunch of objects here and they don't
touch those can be solved as separate
simulation islands uh on multiple
threads and
uh finding those groups is is also
relatively cheap. Um
Um
I included a little screenshot here of
what it might look like and also a
profiling graph down here. We're going
to be looking at at these quite a lot
throughout the talk. So, if you haven't
seen them, it's like a um a frame
profiler or like a timeline profiler.
Time moves from left to right. Up here
you see the main thread, what it's doing
in a hierarchical way. And down here, I
included um these are the worker
threads. So, it's using a thread pool.
workers are told
um by the main thread to start high
level jobs and then all the threads when
they're occupied they show in green and
when they're waiting for something they
uh show in black.
So each line here should be thought of
as a worker thread. And then it also
time progresses.
It this thread here does something here
then it waits a little bit and then it
works and it waits and it works. And
it's all told by the by the main thread
when to do that.
And if we look at how it would be split up,
up,
uh we can just intuitively see these
simulation islands here. they would all
generate tasks that go
into the the job system. And as long as
we have enough of these islands and
they're roughly the same size, this is a
a pretty good way to do it. But unfortunately,
unfortunately,
in Terown and quite a lot of games, it
looks more like this. you have maybe
some small islands here,
but if the player is given the chance
and a lot of tools, they will do that
and and that's going to be one one
single island and it's going to be take
a long long time to solve that on one
single thread while the rest of the
machine is basically doing nothing
except for waiting on that. So this has
been the ever since initial release of
teardown. This has been the biggest
Um and this is a common problem not just
for Terodon but for any physics engine.
So another guy who thought about this a
lot is Erin Tado, author of Box 2D who
most of you probably already are
familiar with. He worked on this problem
last year u and released
uh his findings in um box 2D version 3
that came out a bit more than a year
ago. And he also wrote about this on his blog.
blog.
And Erin is a really smart guy and
really nice guy. So I don't want to I
just want to give him the credit for
this. I'm just using the same thing. So
if you want to look up the details, look
at his blog. It's amazing.
And I just want to quickly describe how
it works.
It's a a um technique called graph
coloring. And it's a way to split up
these large islands into
um chunks that can be solved in
parallel. And it's also a very intuitive method.
method.
You go over all the contact points and
you find you sweep over all of them and
you mark the ones that actually can be
And in order to be solved in parallel,
they cannot affect the same bodies. But
there's going to be a lot of a lot of
constraints that affect uh different
bodies, totally separate ones.
Uh and in this case we can just see that
these three affect different pairs of
bodies. That means we can solve them in
any order or we can solve them even in
parallel and it's going to be exactly
the same result. And
And
when we're done we're not we won't have
covered all the context. We cannot
include this one for instance because
it's already being operated on by the
the other blue ones over here. Um
if we just tried to solve all of them at
the same time, we would since they
operate on the same bodies, we would
have a lot of data races and would also
be even though we might get some result
from that uh it also uh wouldn't be the
same. It wouldn't be deterministic. it
wouldn't be the same every time we run
it, which is also a property that we want.
Um, so when we do that, we have those
three, then we have four remaining
points. What we do is to sweep over
those four, not considering the other
that's already solved, and then we find um
um
mark the ones
that can be solved in parallel. Note
that those red ones, they cannot be
solved in parallel with the blue ones.
We have to finish all the blue ones
first and then we can start on the red
ones and then we just do that over and
over uh until
we covered all the constraints.
This is a very simplified case of
course. Um in a real scenario with maybe
thousands of bodies or or and tens of
thousands of constraints uh that that's
going to be a pretty large number of
constraints that goes into each uh
color. And a typical number for that
could be
maybe 10 11 12 colors that's needed to
cover almost all constraints. And
usually there's a small remainder. We
don't want too many colors. So if we
have a very small number of colors left,
we can just put them in a special bin
and go over them sequentially. That's
going to be, as we'll see later, faster
than than making a separate color for each.
each.
And then when these are fed, when we're
going to solve these, and we're going to
process these after splitting them up,
we start with the first color,
throw that at the thread pool, and solve
all those constraints as parallel as we can.
can.
And then we're going to proceed to the
next one.
uh six times because we wanted to do six iterations.
And in between each color
when these threads have been working on
it before we can start the next one, we
have to wait for all the the threads to
finish. So that's that's a full barrier
sync across all the threads in the
thread thread pool. And we're going to
have to do that between every color in
every iteration. And as I mentioned, the
number of colors are usually 10, 11, 12.
And then we have a few additional jobs
every iteration for integration and
other things. Um, so we end up with
close to or around 100 barriers per uh
per step or per time that we execute the solver.
solver.
And given that this is a game that wants
to update at 60 uh frames per second
um and we also have other things to do.
We have a collision detection and
rendering and whatnot. Uh that gives us
just a few milliseconds to complete
that. And doing this amount of
synchronization in such short amount of
time is something that um
synchronization primitives uh aren't
very good at.
So, in order to measure how they
perform, I had put together this little
test scene.
It's a level exported from Terodon to
the new engine. Uh, I spawned a lot of
uh dynamic objects.
And um it's actually 5,000 of them. I
made so that they're all active and
simulated at all times. And they're all
or most of them are in a big pile. So,
it's it's a tricky case
for the physics engine to handle.
And um the machine I'm running this on
is an Intel Core i9.
Um 14th generation Raptor Lake
900F. I'm not sure what what those
numbers really mean, but it has eight P
cores or performance cores as they're
called now. Uh with hyperthreading, so
two threads per core.
and 16 efficiency cores uh summing up to
a total of 32 threads. Now whether we
should count the hyperthreads or not
could be discussed but there actually 24
and as a baseline if we run this single
threadedly u we get
the number here 82 milliseconds for the
whole scene update
um which is obviously not anywhere near
real time. Um, but the number we're
going to look at throughout the
presentation is actually one iteration
of the solver marked here. So the thing
that we do six times, I take an average
of that and that's the the number I look
at because
that's the thing we're trying to
paralyze. So it makes sense to measure
And if I start out with a traditional or
what I call a traditional thread pool,
it's um actually very close to the
implementation that runs in the current
tear down or tear down the game as it is today.
today.
It uses just standard synchronization
primitives. So it has a shared task
queue that fills up every time there's a
job coming in. there are multiple
workers uh pulling jobs from this task q
using a standard mutx
and then there are a couple of condition
variables one for if it happens so that
the main thread uh has to wait for the
threads to finish the main thread also
pulls tasks off the queue but sometimes
it finishes there's nothing no more
tasks to take but it has to wait for the
threads to finish it sleeps on this
condition variable
and And the same thing when the workers
run out of work um and there's no more
tasks to be done, they sleep on another
condition until the main thread says
here's a new job. Then they can start again.
again.
Um and it looks like this.
And you can already see this is kind of
what we this looks a bit weird, right?
We have very dense profile on the left
and a very sparse on the right. And this
kind of coincides with what we said
before, collision detection on the left,
very easy to paralyze, no data dependencies.
dependencies.
Solver on the right, uh very hard to
paralyze and lots of synchronization.
So if we just measure the time there,
we're at uh 17
17
times better than the single thread
performance on the collision detection
which is uh really good I would say
given that we have
I don't know how how much you know about
performance and efficiency cores. I'm
also I'm not a not a hardware person. So
I'm not the one to tell you exactly how
it works. But the efficiency cores from
what I've come to understand they they
sit there are four cores in a cluster.
They share an L2 cache and they also
have a clock speed that is lower than
the performance core. And that in total
makes them run at roughly half the speed
of a performance core from what I come
to understand.
And since we have 16 of them and eight
performance cores, we should get around
16. And 17 is so that that must be
pretty good, right?
If we look at the solver,
we're at 1.4.
That is like objectively terrible for
having 31 threads on a problem. So,
um, I reran this with different numbers
of threads just to see what happens. And
what's going on,
I it's a little bit hard to explain, but
if I run at eight threads, I should have
had the graph. I didn't actually do all
the number uh numbers because it's a lot
of manual work. Uh, but I tried it and
the one that runs fastest is the one
with eight threads. Um
Um
now we can observe something something
interesting here and that's the
collision detection which paralyzes
really well. It sits at 7.6x
the single single thread performance
expected because it's very close to the
theoretical maximum of eight. Um whereas
the solver which was with 31 threads was
at 1.4 is now at 4.7.
So one part of the code
is slower and another part of code is faster.
faster.
That's kind of hard to work with, right?
Uh so in total if we just look at the
the scene update which is 20.5 in this one
one um
um
it's is 13.3 in this one. So in total
measuring the whole frame time, this one
is actually faster. It's faster to run
just eight threads
than 31. And the reason I chose 31,
maybe I should mention is that when you
have a high thread count like this, I
usually leave one for the operating
system and background tasks. Found that
to be good practice just so you don't
get hiccups. Um
how should we tackle this? It's should
we just And why is it eight? E eight I
think is because we have eight
performance cores. That would make
sense. Um but what should we do about
it? And
if we go back and look at this one and
zoom in on one iteration that should be
able to to give us some hints and then
we can notice
two things. One is that in between. So
these are the the different highlevel
jobs. These like a vertical green fuzzy bars
bars
in between these jobs.
There's there's a lot of just black
space where none of the threads are
doing anything. Uh also not the main
thread which is kind of weird because
the whole computer is just sitting there
for almost a 100 microsconds doing nothing.
nothing.
Uh and the other thing we can observe is
that even when a a worker thread is running,
running,
it's it's a lot of gaps in between
actually processing tasks
and that's likely because there's a lot
of contention. We have a single shared
task queue and a lot of workers trying
to pull from the same queue. So we're
going to have contention from all the
threads trying to access the same resource.
resource. Um
Um
so I want us to think about these two
things as two separate problems even
though they're probably related.
Uh so one to try and alleviate this I
rewrote the uh job system from using
standard synchronization primitives to
using mostly lockless programming and
atomics. I don't know how how many in
here have used or are familiar with
lockless programming.
So that's more than half. Great. Um it's
a way to to spin on atomic variables to
wait for something instead of using uh
like an operating system primitive. The
operating system primitives actually
also use that a little bit. they spin a
little bit in user space usually before
uh actually doing task switch and going
into kernel but uh that's true for for
mutex but from what I come to understand
that is not true for a condition
variable so uh and it's also hard to
know exactly because it's not really
um
so we have an atomic spin that spins
uh waiting. So So when there is congestion
congestion
pulling these tasks,
the atomic spin lock will just sit there
and go, is it available? Is it
available? Is it available until it
becomes available and then it's
immediately going to return. So there
won't ever be u a a task switch or like
the the operating system blocking that
thread or anything.
We also have an atomic counter for
completed tasks so that if the main
thread is the first one to finish
um it's not going to sleep on a
condition variable anymore. It's also
going to sleep on this uh atomic waiting
for the number of completed tasks to be
uh all the tasks.
And I in this version I still have the
condition variable for the workers because
because
we don't want to spin all the threads in
the machine forever because that's going
to consume a lot of energy.
The computer is going to be very loud.
It's going to be very warm and it's also
not a very nice thing to do towards
other programs. We might get punished by
the Windows scheduler for doing that. Uh
so we we don't want to overuse this
opportunity to to spin too much. But
let's let's look at what this looks like.
like.
So on the left side, collision detection
is roughly the same. I think that is
within the error margin. Savis is
probably the same which is expected. It
was already fast before. It's not going
to be magically much faster if it's
already efficient. Um, and the parallel
solve on the right side is now at 3.8
times faster than single threaded. So,
we're a bit closer to to the eighth
thread version. Not quite as good as the
eighth thread version with the previous
one, uh, but much better than the 1.4
that we had before. And in if we look at
the scene update 10, um, that's actually
faster. So by using lockless we could
we could um make the whole frame time
now running faster with 31 threads as
compared to to eight threads that we had
before. So by using we actually get a
benefit from using the whole machine so
and if we then zoom in again
it now looks like this.
And it's worth mentioning here that the
this distance here
uh is is now 880 microsconds whereas in
the previous slide it was 2,200
microsconds. So even though it looks
like it's approximately the same
uh black gap in between jobs, it's
actually much faster because it just
scaled up.
Uh, but we can see that once the threads
actually start working on something,
they're actually quite efficient. There
aren't they're almost not noticeable
these little black things where they try
to get the next task.
So, we solved one problem and we made
one problem better,
but it's still looks like something is a
little bit off. They're doing more
waiting than working still.
And I think the solution here is is
probably obvious to everyone since we
still have that last condition variable
in there that we don't want to remove
because the computer we don't want to
waste or like waste all that energy
having it spinning all the time.
What we could do is to have like two
versions of the thread pool. run that
spins indefinitely
even even when the job is finished. You
can just spin from here to there.
Um, but we just want to do that while
we're in the solver because we know in
the solver that when we when we finish
the job here,
all the threads will hang on that
condition variable. And then on the main thread,
thread,
five microsconds later, it's going to
say, "Okay, time to wake up everyone."
So then they kind of make this switch
like go to sleep, wake up, go to sleep,
wake up. Um, which is very inefficient. Um,
Um,
but before we we look at that, I just
want to make an uh mention an
observation that I made. And I'm not a
hardware person, so I'm I'm not
uh I don't know what I'm talking about
on this one, but I but I just want to
mention it because I I couldn't unsee it
once once I saw it. When I've been
staring at these graphs, I see that it's usually
usually
when the main thread says start working,
there are a number of threads that start
early. And then there are kind of a
gradual uh period before everyone
started. And the time it takes
from the main thread says start until
all of them started
is roughly like 30 milliseconds or 30 microsconds.
microsconds.
And if we just look at these that start early,
early,
you can see that it's one, two, three,
four, five, six, seven.
These are a little hard to measure
because they're so small. One, two,
three, four, five, six, seven. 1, two,
three, four, five, six, seven. You start
to see a pattern. It's like all almost
always it's seven threads that start
early and then the rest come way later.
And um my theory is that those are the
performance cores and that's also why
the eight core version is faster. I
can't really back this up, but that
that's just my theory.
Um but now let's try to remove that last
condition variable. So the only
modification I made was to have a
boolean on the whole uh thread pool that
basically says uh some if we are in this
elevated performance mode or low latency
mode or spin mode as I call it. If we
are we just spin on an atomic while
waiting for the next job.
um or that boolean to to go to the the
normal state, then we also break that spin.
spin.
And looking at the results,
we're now up to 9.1x on the on the
solver side. Uh, collision detection
also went a little bit better up to 19,
but I think that could be
it could also be uh just random noise.
I'm not sure. Um,
and overall we're down to 6.6
milliseconds for the whole simulation time.
time.
Um, from 82. So I don't know what that
would be. It's like a total of maybe 12 13x.
13x.
Uh, and if we zoom in as we always do on
one iteration, we can now see that this
is starting to look pretty reasonable.
It's actually doing what it's told. So
if the main thread says, uh, can you
start this batch? All the threads are
are pretty quick. They start at roughly
the same time. Um,
also another reminder that this is again
scaled up just so you can see it better.
It's now 330 microsconds across.
So the time it takes to start all the
threads here, it went from um 30 mill 30
microsconds in the previous one down to
just you can't really see the legend
here, but it's approximately two
microsconds uh in this version.
And once the the last worker stops working,
working,
the main thread immediately continues
without sleeping. and starts the next
one and then it just goes in sync with
the workers.
Uh the only real gap we have is this
one, but that's the serial part. That's
these the special bin of constraints
that didn't get colored that we solve in
sequence. It's been there in in all the
graphs, but it's been so small that we
didn't see it. And now we can finally
see it. I see that as a good thing. Now,
now the whole the whole graph finally
makes sense. Um,
so as I mentioned a couple of times now,
the scale of this is different in all
the slides and I thought it would be
interesting to just lay them out um and
see them side by side.
So I just scaled the image actually to
to have the same
the same time so to speak. And now it
gets pretty obvious. It's the same
amount of it should be the same amount
of green in all these graphs. It's just
the the amount of black gaps where the
So that's that. And
I also put
actually put them in a diagram like this
where I grouped them a little
differently. I have the single threaded
on the left. I did these three meth
methods, the synchronization primitives,
the lockless or partially lockless uh in
the blue and then these spinning mode
lockless on the right. And then I
measured the whole things with eight
threads and with 31 threads to compare
as we can see with a with the sync
primitives much faster with eight threads.
threads.
um with the lock list still faster with
eight threads but it's like at least
approaching the same range. Uh and if we
just spin in between jobs the 31 threads
outperforms uh the eighth thread one by
quite far.
This by the way is just for the solver.
If we would have included the
uh collision detection also that would
be an even more dramatic difference
because of the the collision detection.
Uh, I redid the same thing with AMD. I
have a Ryzen 7 laptop that has eight Sen
4 cores and it uses AMD's version of
hyperthreading called simultaneous
multi-threading. It's basically the same
thing. Uh, so 16 hardware threads. Um,
so instead of 31, I'm using 15 threads
on the right side.
And I've I thought it was quite
interesting that the the overall trend
here or the overall um
how they differentiate is roughly the
same even though it's a completely
different architecture. So with the sync
primitives eight is faster than 15 with
lockless about the same and with green
we do get a benefit from using also the
Um
I do have a couple more slides if we
have time. Yeah. >> Um,
>> Um,
just some closing notes on having access
to a low latency thread pool like this
has uh affected the way I think about
parallel programming a little bit which
I thought would be interesting to share
at a conference like this.
Um, when you have
regular task systems or job systems,
thread pools, whatever you call them,
you the most common way to
interface with that, to my knowledge, is
to have some kind of task that you
define. And it could look if you use C++
it could be like a a base class a task
and then you override a run function a
run method and then some state that this
task operates on. Uh if you use C it
could be a function pointer and a
strruct. It could look a little
differently. And then at some place in
the code
you would build a task graph. So you
would assemble these tasks. You would
allocate memory for them, put them
somewhere, uh copy the state, add them
to the scheduleuler, insert these
barriers, and sometimes you have
dependencies between the tasks instead
of manually
setting the the barriers and then all
that happens implicitly.
Um and then once you built that
you execute the whole thing and then
that runs and then implicitly it just
calls all these callbacks in the right
order on from different threads and just
magically it resolves uh this graph.
I never ever liked this style of
programming. I think it's horrible. I
think it's very different from how I
normally think about programming. I if I
sit down and write like a prototype for
something, I would never dream of using
this paradigm for doing it. I would just
write it like from top to bottom like
sequential code. And
And
usually this is necessary because you
want these dependencies kind of stated
in that graph. But what I noticed from
my own behavior is that when having
access to a low latency thread pool like
that, I've started to use a very old
concept in parallel programming
much more called the parallel 4, which
is which is very simple. All it does it
loops over
a number of things and it does them in
parallel. That's all it does. And then
it does a
then it waits for all of them to finish
before it proceeds.
So all the measurements you saw in the
previous slides were actually done in
this way. It's just
um it's not exactly this code. It's
simplified a little bit but but you have
this outer loop. First we prepare the
graph coloring to compute these um what
can be what should be solved together
and then we make an outer for loop of
six iterations and for each iteration we
go through all the colors and then we
just solve all of them in parallel uh do
that for all the colors then we solve
these uh sequential batch only on the
main thread and then we do another
parallel piece of work
with the static contacts Um, and it's
just very easy. It's very similar to the intuitive
intuitive
prototype version of this algorithm, right?
right?
And um
the only thing that I really needed to
do to make this run efficiently was to
to have the to set the thread pool in
this elevated state. And I do that with
a little scope variable. So when it
enters that it elevates the state
and if there's something to do on the
main thread it's also beneficial to do that
that
when you enter the function. So in this
case I say I tell the workers to go to
this elevated state
and then I know it's going to take 30 40
micros for that before all the threads
are ready and alert.
Uh so have some time to do the graph
coloring while waiting for that and and
then you can just work with with the
worker threads. Let the worker threads
and the main thread uh work in a much
more synchronized fashion if you don't
have to pay the price of this latency.
Uh that's proved for me at least to be a
vi viable way to to work with parallel programming.
um saves a lot of boilerplate code. You
don't have to allocate memory for these
tasks. You don't have to copy the state.
You don't have to come up with names for
these tasks and all that tedious work
that that is not so pleasant. Um so for
me that has been
um quite eye opening or as like a new
new way that's sort of an old new way
that's viable again.
Uh so conclusion um I
I
I think this to the the takeaway here is
not to do exactly what I did. I think
the takeaway here should be that all
these graphs we've looked at
having access to looking at the way the
threads communicate in a visual way I
think for me has been has been very very
useful and I don't think I would have
come to these conclusions without seeing
all these gaps in the timelines and how
the threads uh collaborate. So I can
highly recommend if you if you're not
visualizing your profiling data in that
way or collecting the profiling data
that makes it possible.
I would highly recommend doing that. Um
and then depending on your workload, you
may want to do something similar or
something or not do it or uh something
completely different. Um but then at
least you're making an informed decision.
decision.
Um yeah, for me changing the way
synchronizing threads made a a huge
difference in performance.
And that's that's the end of the talk. I
just want to leave this with a little
physics movie. And uh if you have any
questions or you want to reach out, you
can see my contact information down
>> Could be
could be around 10,000 maybe. I don't I
don't know. Maybe between five and
So how do you think about the batch size
in the parallel floor? So you like don't
want to assign just a single conraint to
like as the unit cost.
>> Yeah, it's a it's a good question.
>> And then you have like static versus
dynamic scheduling.
>> Yeah. Um it's a good question. I
it's a little you you have to again
measure I think to see what's a good
batch size. I think it's 32 in my case.
Uh I tend to this got very specific very
fast but but the the number of tasks I generate
generate
I think is
four times could have be could be six
times I'm not entirely sure but
somewhere around that range the number
of workers. So I have a good number of
tasks to choose from to not have so much
static scheduling.
>> So big round of applause for Dennis one
more time.
>> Is this still on? Okay. Okay.
>> So my first question is regarding the
profiler you mentioned you needed to
build this from scratch, right? Is this
your your work that you had library that
you actually could use? What was
>> uh this was my my own custom profiling?
Yes. Uh I have a a history with
profilers. I I like them. I use them a
lot. Uh but there is a really good free
one called Tracy that you could use if
you don't want to write your own. It's
actually not that much work to write
your own. You just have to be very
careful how you measure performance
because you can't if you want to measure
performance of something lockless. You
you can't have a mutx in your profiler
because that kind of mess up the whole
timeline. Um so it's um it has to be a
lockless profiler also. But as long as
you have that
um collecting the data is really simple.
you just record all the um I just call
query performance counter and
note the results and um and then at the
end of the frame or when when I
visualize this I kind of sort out all
that and see which thing should go where
and so it doesn't it doesn't affect the
thing while measuring too much.
>> Awesome. So I have a question. This was
all CPU stuff. Yes. Right. And
everybody's crazy about GPUs in terms of
rendering. So it does it makes sense to
offload some of that work during the
physics either the collision detection
or the actual solving part to GPU and
use the resources there.
>> You certainly can and it has been done
been done. But I I'm not a big fan of
the idea for two reasons. And one being
that the GPU is pretty busy doing
rendering, you probably want it to focus
on that. Um, and the other one being the latency
latency
or well, if you want to put physics on
the GPU, you either want very low
latency and high bandwidth between the
CPU and GPU and have them work in sync.
So the CPU can say start crunching
physics or just the solver or whatever
is on the GPU and then get the result
back. But it's going to be rather
inefficient because the GPU is going to
be busy rendering stuff. So it's not
available just for random requests. Um
or you put you put all of it on the GPU
and just tie it into the to the physics
or the graphics. Um
>> have you tried that? No, because the
result of the physics is also very tied
to game play. So when you you you want
the result of the physics on the CPU and
you want it usually immediately after
you're done with it and in our case for
tear down it's it's it's not really
possible because the physics could also
result in things breaking and you want
to create new objects and and putting
all of that on the GPU. You're going to
end up with pretty much the whole game
on the GPU and that that that would be a nightmare.
nightmare. >> Okay.
>> Okay.
I sometimes talk to game developers and
when I talk when we talk about physics
they are always complaining and you
spent your entire life on physics and
physics engines and my question is
is this the skill issue on the developer side
side
or is this like inherently complex? How
how can a person get into physics and
learn how to do it properly?
>> I think you just answered the first
question with the second part like that
there aren't really any a whole lot of
good resources for learning physics.
There used to be the bullet forums, but
there's not that much activity these
days. But like 10 15 years ago, that was
a very active forum where people helped
each other. And we haven't seen much of
that lately. So, I I don't know at this
point. It's not particularly complicated
compared to anything else in game
development, I think. Um, but it's um it
maybe doesn't have the same mass appeal
as graphics for you. I mean, it would be
hard to make like a shader toy for
physics. I don't really know how what
that would look like. So, um getting
people into it takes maybe a little bit
more of an investment before it starts
working. That could be one.
>> Right. So from your talk, we started
with like mutex and synchronizing
threads and then we ended up on the uh
spin uh spinning lockless implementation.
implementation.
Is this does this implementation have
some sort of downsides? Like are there
situations where shouldn't go this way
because it wouldn't work or is it just
always better? Well, the the most
obvious downside is that it can consume
more power. It will require more energy,
but I I think that is only true in some
situations given how much shorter time
the solver runs.
Uh, I don't I'm I'm not convinced that
the total amount of power would would
actually be less if if it went to sleep
and woke up like a 100 times instead. Um,
Um,
but if the but spinning on on spinning
on on um atomics like this is not
something that is generally recommended.
If you look at the whole system, like if
you're running the game and a lot of
other things,
that will probably not work very well
because because the game is going to if
you get like a a context switch in the
middle of the solver while a lot of
threads are waiting spinning. Um,
Um,
then they're going to keep spinning
until if we say for some reason the main
thread would be swapped out while the
workers are spinning waiting for more
work, they would spin a whole time slice
before the the main thread gets swapped
back in. I think in in practice though,
I don't really see how or why that would
happen. But if you if you run a lot of
things on the machine, uh it's going to
start to become really inefficient.
All right, I think we have time for one
question from the audience.
>> We have we have okay time.
>> Okay, time to
>> All right, so let's open up the floor
for questions from the audience. Who
would like to ask the question? Yes.
>> Um so this is more of a general physics
programming question. So if you're if
you're applying the solver to multiple
bodies, um I would imagine that you
start at a different point in the chain
when you're applying impulses that it
would could potentially lead to a
slightly different solution, right? Like
it's order dependent. Um so what what
kind of general like uh conventions are
there for deciding what order to start
in because I would imagine those would
lead to slightly different results.
>> Yeah, that that's
entirely correct. when that's also one
of the reasons we want to have
determinism. It's not only for
networking purposes or anything. It's
it's because of stability. If we solve
the constraints in the same order every
time, we're going to have a result that
is slowly converging to to the correct
solution. Whereas if we if we we solve
them at in random order every time, um
we're going to we're going to end up
with disturbances that just that it
can't compensate for appropriately. So
we're going to end up with jittering.
And um
there's also I don't know if that was
the question you were asking. There's
also been in the past there's been
experimentation with
reordering constraints. Say you have a
stack of objects. If you want to solve
it from bottom to top order and things
like that in order to increase stability
for stacking, that's also been done. But
I I would say that that's a minor thing
compared to the the general stability.
>> If I could just clarify the question,
the question was like what kind of
conventions are there for deciding what
order to go in? That makes sense. Like
is there is there particular conventions
that for deciding what order to
actually? Yeah, apart from these if you
if for stacking for inance if you want
to solve from top to bottom that's
generally preferable but I I would say
in practice what I do and what I I know
other physics programmers also do is
that it it there's some kind of
consensus that it doesn't matter that
much which order you solve it as long as
it's consistent.
So what I do uh I just do a a radix
sort. Uh so I sort them per body and
it's it's incredibly fast and I do it
actually on the main thread before the
solver starts. But it's uh it's it's um
at least with 32 threads it it doesn't
really take up that much time on the profiler.
>> All right, next question. You mentioned
there was potentially some sort of
penalty from the operating system if you
don't go to sleep. Is that a particular
thing on particular operating systems?
>> It's just something I observed on
Windows. I don't know if it's
there's some just certain um cases or if
it's if it's always but if I if I tend
to if I make an error so that they're
always spinning. I noticed that the
uh the operating system I don't know if
the the punishment is in is intentional
because this process is doing something
bad or if it is uh just the time slicing
that that makes it makes it get swapped
out more often in the middle of
something. It could be either of those I guess.
John,
>> did you ever experiment with work
stealing? Because you only have one work
here, right?
>> Yeah. Yeah, it's a excellent question.
It's actually one I should have put on a
slide for future work because work
stealing is is generally a better
algorithm where you don't have just one
single queue of uh tasks, but you have
one task Q per worker and then they
steal work from each other in the case
they run out. So you have like a static
batching that turns into dynamic
batching once um all the work is done so
to speak. Um and I I want to try it. I
haven't tried it yet. I also noticed in
the profiler that these black gaps in
between jobs with a lockless uh are so
small that the the potential speed up is
is is pretty small. I think in my case
Do you think that there is a way to
overlap these iterations
because you you will be touching stuff
that's not going to change during the
previous iteration?
>> If that was the case, they would have
ended up in
in the previous color,
>> you know, like the iterations like when
you're done with some of the
>> Oh, I see. I see. Um
potentially there there might be some
opportunity for optimization in the
overlap going from one iteration to the
my intuition is that would be pretty
small but but it's it's a good idea
should should maybe look into it. Sam.
Sam.
>> Yes. I guess so. I I wonder you you
describe that on the main thread you do
things that I presume touch all n of the
contact points it sounded like. So is it
I'm trying to get an idea for how much
more expensive the salt each constraint
solve is compared to just iterating the
like because you for example said you
sort them. Right.
>> Right. Are you talking about a memory
>> just why is it expensive? Why can you go
over all n contact points on the main
thread that you can't solve? Like how
much more expensive?
>> Yeah, it's a lot of computations to
solve each uh constraint. So that
there's a significant amount of
operations that needs to be carried out.
Um, so it it's yeah, but you you're
right that that going over and touching
the memory on the main thread for
everyone that that could if you have a
lot of constraints
that could potentially be be a bit
expensive. The total memory for all the
constraints, even if you have 10,000,
they're still all going to fit
maybe not in the L2 cache, but at least
in the L3 cache. So, so it's not going
to be a major bottleneck when it comes
to to memory.
>> And and just a quick followup then are
there variations in how complex your
solver could be? Like for example, is
tear down solver a lot more complicated
than what somebody else might do or
>> I don't think so. They us the the
solvers at least from what I know are
relatively similar as long as they
follow this projected Gaussidel or
sequential in impulse. Uh they could be
implemented a little differently. The
new one uh that we saw here is actually
a a bit different from from the one in
Terodon on a number of uh in a number of ways
ways
but it's um the differences are rather
>> Ryan
>> so um I'm not a good physics programmer
so let me know if this doesn't make any
sense but with uh so with the sequence
of iterations you have like let's just
say you have blue color, red color,
green color, and then you the second
iteration you do blue
uh green red whatever order I said. Um
you do that uh many subsequent times.
Does is there a dependency between like
blue number two and red number one if
that makes sense? So like could you do
all the blues first and all the reds and
all the greens or can you not slice it?
>> No, you you have to solve all the colors
in one iterations before you can start
the next one. That's it's it's almost
like the iteration is just a
continuation of this sequential thing.
So you have to finish all of them before
you can start the next one. >> Okay.
>> Okay.
>> The iteration is just basically just to
make the the step small enough so that
you have fine enough correction, right?
That's the only reason we do iteration.
>> Uh the reason there are iterations in
the solver is to increase accuracy. Uh
because it converges towards the solution. the more iterations we take,
solution. the more iterations we take, the closer we get to the solution. Um,
the closer we get to the solution. Um, >> and how did it happen that is exactly
>> and how did it happen that is exactly six? How did you
six? How did you >> uh It's just experimentation like what
>> uh It's just experimentation like what you need for the game.
you need for the game. >> Um,
>> Um, >> will this be stable from game to game or
>> will this be stable from game to game or >> from what I know most games use four or
>> from what I know most games use four or eight there? There probably some games
eight there? There probably some games using a bit more, some a bit less. Some
using a bit more, some a bit less. Some some physics engines let you set that
some physics engines let you set that per body. So if you have an a simulation
per body. So if you have an a simulation island with all uh all have a low
island with all uh all have a low iteration count then it's not going to
iteration count then it's not going to spend so much resources simulating that
spend so much resources simulating that island. Uh so it can look a little
island. Uh so it can look a little different but um but it it's a it's a
different but um but it it's a it's a performance quality trade-off.
All right. >> Yeah. So I have a question this is more
>> Yeah. So I have a question this is more general physics engine optimization and
general physics engine optimization and that's do you do a thing that for
that's do you do a thing that for example you measure uh for example you
example you measure uh for example you see if an object is really far from
see if an object is really far from player camera you do less physics
player camera you do less physics iterations on it or if you see that the
iterations on it or if you see that the object is really stressing the physics
object is really stressing the physics engine you just like silently try to do
engine you just like silently try to do less iterations on it
less iterations on it usually no I don't do that at least I
usually no I don't do that at least I don't know if if someone else is doing
don't know if if someone else is doing it I think it's pretty uncommon because
it I think it's pretty uncommon because generally
generally If you have things that are scrambling
If you have things that are scrambling around somewhere trying to settle, if
around somewhere trying to settle, if you turn down the iteration count, the
you turn down the iteration count, the simulation quality is going to go down,
simulation quality is going to go down, which means they're going to be um more
which means they're going to be um more shaky and jittery, which it's going to
shaky and jittery, which it's going to keep them alive for longer time. So when
keep them alive for longer time. So when something settles, you can take them out
something settles, you can take them out of the simulation. So at a larger scale,
of the simulation. So at a larger scale, you might actually waste time by by
you might actually waste time by by doing that.
Are there yes sq >> have you thought about uh um keep
>> have you thought about uh um keep storing in the objects if they are
storing in the objects if they are solved so that you say the first power
solved so that you say the first power is one value of the second and then you
is one value of the second and then you can check hey I'm in the second wave
can check hey I'm in the second wave have both of these contacts been solved
have both of these contacts been solved in the previous and therefore you don't
in the previous and therefore you don't have gaps where you have to sort of
have gaps where you have to sort of collect everything
collect everything >> do you mean dur for each iteration you
>> do you mean dur for each iteration you would do this to see if they need
would do this to see if they need further solving or I'm not I'm not not
further solving or I'm not I'm not not sure.
sure. >> Yeah. So um if you have a contact in the
>> Yeah. So um if you have a contact in the what would you say the second color
what would you say the second color right
right >> and that contact can only be solved when
>> and that contact can only be solved when the the first iteration is done but it
the the first iteration is done but it doesn't have the whole first iteration
doesn't have the whole first iteration doesn't have to be done only those two
doesn't have to be done only those two objects.
objects. >> I see what you mean. Yeah. No I have not
>> I see what you mean. Yeah. No I have not experimented with that. I think that
experimented with that. I think that would be a very complicated
would be a very complicated thing to implement unless you want to do
thing to implement unless you want to do it because then you want to have atomics
it because then you want to have atomics for that checking. Um
for that checking. Um it's an interesting idea. I I think it
it's an interesting idea. I I think it would be a lot of locking and quite
would be a lot of locking and quite expensive but but it's it could be worth
expensive but but it's it could be worth checking.
>> All right. >> Dumb question. If there are no good
>> Dumb question. If there are no good resources, when can we get your book?
resources, when can we get your book? [Laughter]
[Laughter] Uh, I don't think I'm I'm not planning
Uh, I don't think I'm I'm not planning on writing a book, but but I I agree
on writing a book, but but I I agree that there should be more physics
that there should be more physics resources. There's been a couple of
resources. There's been a couple of initiatives
initiatives to do more. The best I know is Erin Ko's
to do more. The best I know is Erin Ko's blog. It It's very good. And also he if
blog. It It's very good. And also he if you're interested in the solver
you're interested in the solver specifically, he released just last year
specifically, he released just last year a project called uh solver 2D. It's on
a project called uh solver 2D. It's on the same GitHub account as Box 2D. That
the same GitHub account as Box 2D. That is a simplified implementation of I
is a simplified implementation of I don't know 10 11 different solvers. And
don't know 10 11 different solvers. And the code is very easy to read. It's like
the code is very easy to read. It's like written just to compare the solvers
written just to compare the solvers qualitatively. So it's it's not written
qualitatively. So it's it's not written for performance and that's a good
for performance and that's a good resource if you want to learn what
resource if you want to learn what different solvers do.
different solvers do. >> But besides Erin Kato's blog, there's
>> But besides Erin Kato's blog, there's also Voxagon.se
also Voxagon.se blog by Dennis himself. So while we wait
blog by Dennis himself. So while we wait for his book, you can go to this website
for his book, you can go to this website voxagon.se
voxagon.se Swedish Swedish domain I presume. And
Swedish Swedish domain I presume. And Dennis has amazing blog posts, very
Dennis has amazing blog posts, very detailed, very nicely written. You can
detailed, very nicely written. You can go a lot. You can learn a lot from these
go a lot. You can learn a lot from these blogs. So go check out vogsagon.s.
M. Yes. So you said that bad things would happen if you have more than 10
would happen if you have more than 10 colors around like that. Is it the same
colors around like that. Is it the same uh as you did with sequences with trial
uh as you did with sequences with trial trial and error iterations or like there
trial and error iterations or like there is some dependence on the number of
is some dependence on the number of objects like how to deter the number of
objects like how to deter the number of colors?
colors? >> Yeah. The reason I want to keep the
>> Yeah. The reason I want to keep the number of colors I want to cap the
number of colors I want to cap the number of colors is
number of colors is is to not have because each new color
is to not have because each new color requires a full synchronization. So
requires a full synchronization. So there's going to be some time lost and
there's going to be some time lost and if you don't have enough work in each
if you don't have enough work in each color, it's just going to be waste at
color, it's just going to be waste at that point. So it yeah, it's it's a
that point. So it yeah, it's it's a little bit of trial and error to find
little bit of trial and error to find that sweet spot. I think in my code I
that sweet spot. I think in my code I actually have 16 as the cap, but it
actually have 16 as the cap, but it usually never reaches that. It's it's
usually never reaches that. It's it's finished by the time it it gets to 10,
finished by the time it it gets to 10, 11, 12. And then I have another um cap
11, 12. And then I have another um cap at how when there are only fewer than a
at how when there are only fewer than a certain number of constraints left, I
certain number of constraints left, I just put them in this overflow
just put them in this overflow sequential bin. And I think that number
sequential bin. And I think that number is is maybe 32 or something something
is is maybe 32 or something something that that can be solved very efficiently
that that can be solved very efficiently in sequence.
in sequence. I just don't bother um doing any
I just don't bother um doing any labeling for I just quickly want to
labeling for I just quickly want to proceed to the actual solver.
proceed to the actual solver. Yes.
Yes. >> I guess how did using SIMD uh
>> I guess how did using SIMD uh instructions factor into the
instructions factor into the optimization? For example, was this
optimization? For example, was this important or was having?
important or was having? >> It's a good question. You if you go to
>> It's a good question. You if you go to Erin's blog, he's been experimenting
Erin's blog, he's been experimenting with that. So, he's actually doing graph
with that. So, he's actually doing graph coloring uh also for the purpose of
coloring uh also for the purpose of SIMD. So he he solves parallel
SIMD. So he he solves parallel uh on bunch of different threads and on
uh on bunch of different threads and on each thread he solves constraints with
each thread he solves constraints with graph coloring also on different lanes
graph coloring also on different lanes with SIMD which is pretty cool. The new
with SIMD which is pretty cool. The new box 2D is amazingly fast.
>> All right, thank you so much. This was so many questions. Round of applause.
so many questions. Round of applause. Dennis Gustson.
Dennis Gustson. >> Thank you.
Click on any text or timestamp to jump to that moment in the video
Share:
Most transcripts ready in under 5 seconds
One-Click Copy125+ LanguagesSearch ContentJump to Timestamps
Paste YouTube URL
Enter any YouTube video link to get the full transcript
Transcript Extraction Form
Most transcripts ready in under 5 seconds
Get Our Chrome Extension
Get transcripts instantly without leaving YouTube. Install our Chrome extension for one-click access to any video's transcript directly on the watch page.