YouTube Transcript:
Dennis Gustafsson – Parallelizing the physics solver – BSC 2025

Skip watching entire videos - get the full transcript, search for keywords, and copy with one click.

AutoDub

Understand YouTube Foreign Videos

Immersive YouTube Dubbing in English

Break language barriers, embrace global quality content

Use for Free

Video Transcript

Video Summary

Summary

Core Theme

This presentation details the development of a low-latency thread pool and job system to optimize a physics engine for a new game, focusing on overcoming the challenges of parallelizing sequential constraint solving.

Mind Map

Click to expand

Click to explore the full interactive mind map

All right, welcome back everyone and uh

thank you Casey for the previous talk.

Amazing detail and amazing amount of

research you must have done on that.

This talk is hopefully going to be a

little bit shorter so we get lunch. Um

so Sam reached out to me a couple of

weeks ago and asked if I wanted to

present something and I came to think

about this work I did about half a year

ago on optimizing and parallelizing the

constraint solver for our upcoming game.

And as it turned out, uh, this talk

isn't going to be that much about the

constraint solver or how to paralyze it.

It actually is going to be more about

the low latency thread pools and job

system that I had to write to make it

all run efficiently on modern hardware.

So, but I hope you you'll enjoy it.

Anyway, you're going to see some physics

movies at least.

Uh, so just some background on me. I'm

Dennis Gustafson and I am the original

creator of the game Tearown, which is a

Voxil uh sandbox heist game that came

out in 2020.

I was also the only programmer on the

project up until the initial release and

now we're a bit more. Um it it features

a um custom u software rate tracing

voxal rate tracing engine which is

pretty special and also a a custom

physics engine that operates and does

collision detection directly on voxil

volumes um and and a bunch of of other

stuff. If you haven't played it um ask

me afterwards. I have a bunch of codes

uh to give out if you want to try. So

for the last year and a half, I haven't

actually been uh involved in the further

development of tearown even though the

development continues.

Uh I've been um making prototypes and uh

started to implement a new engine for

our uh an upcoming game and that's the

the one I'm going to uh talk about

today. So it's it's an engine that is uh

relatively similar to Terodon. We wanted

to take all the things we learned from

tear down

um and um remove a lot of limitations

and restrictions on the old engine

um and um and see how far we can push it

basically. So one of the things we we

did was to um we rewrote the graphics to

use parts of the hardware rate tracing

pipeline is still tracing voxels at the

very lowest end. Um, but by doing that

we could up the number of dynamic

objects dramatically because the old

engine worked in conjunction with the

CPU and the GPU because I was open using

OpenGL at the time because it didn't

know better. So I didn't have access to

the hardware rate tracing but now that

we do we do it in Vulcan. I also had to

rewrite the physics to feed this new

graphics engine with lots more objects.

uh and

doing that uh most of the work went into

making everything run very parallel

because in this time that passed since

the development of tear down uh desktop

PCs and gaming PCs got a lot more

threads. So that's really the key to

getting uh more performance.

So a little background on physics

engines even though it's not the really

the topic of this talk could be good to know

know

an overview at least of how they work.

So you have a bunch of objects

and what you typically do is split the

the workload up into two phases. So you

have first first phase is collision

detection which means you're going to

look for pairs of objects that are close

or overlapping and figure out some kind

of contact points where they collide.

And uh that part is uh ridiculously

parallel. It's has no data dependencies.

You can just split up all the pairs and

do them parallel.

There are no um no locks or anything you

need to do that. So very easy. The other

part when you have this list of constraints

constraints

um is the constraint solver and that

takes the dynamic state which is the

velocity and angular velocity of all the

bodies and it tries to advance time into

the next dynamic state for each body uh

while taking all these constraints into

consideration. And the way Terodon does

that in almost every game out there

right now uh is to do that using a

method called projected Gausidel

or uh sequential impulse also another

name for the same thing. Um and the way

it works is very simple actually it you

just u sequentially

satisfy all these constraints. It's like

a relaxation method. So you satisfy the

first one, compute an impulse, apply it

to uh body A and B and then you progress

through the next one and do

the same thing which applies impulse to

B and C and then the third one and the

fourth one and as you can imagine this

has to be done in sequence and that's

what makes it absolutely terrible to

paralyze because you have to do the

first you have to complete the first

thing before you can progress to the

next because you want the result uh from

the impulse up here to be applied to be

where you measure how much of an impulse

you want to apply uh on the second

contact and that makes a huge difference

in terms of convergence for the solver

if you do it that way.

Um, so you go through all these and as

you can imagine when you when you solve

the first one it's going to be solved

perfectly but already when you solve the

second one it's going to like disturb

the solution on the first one. So now

it's not going to be satisfied anymore.

So when you done all the context you're

going to have to go back and do it one

more time and then one more time and one

more time and you do this like

iteratively. And a typical number that

games use is like four, five, six,

seven, eight times in that range.

Uh, terodon uses eight. This new and

uses a six

and um,

yeah intuitively

very hard to paralyze because of this

this sequential thing.

So what Terodon does and also most other

games or physics engines is to uh,

instead of trying to

split the way we solve constraints is to

split objects in the world. So if we

have a bunch of objects over there and a

bunch of objects here and they don't

touch those can be solved as separate

simulation islands uh on multiple

threads and

uh finding those groups is is also

relatively cheap. Um

I included a little screenshot here of

what it might look like and also a

profiling graph down here. We're going

to be looking at at these quite a lot

throughout the talk. So, if you haven't

seen them, it's like a um a frame

profiler or like a timeline profiler.

Time moves from left to right. Up here

you see the main thread, what it's doing

in a hierarchical way. And down here, I

included um these are the worker

threads. So, it's using a thread pool.

workers are told

um by the main thread to start high

level jobs and then all the threads when

they're occupied they show in green and

when they're waiting for something they

uh show in black.

So each line here should be thought of

as a worker thread. And then it also

time progresses.

It this thread here does something here

then it waits a little bit and then it

works and it waits and it works. And

it's all told by the by the main thread

when to do that.

And if we look at how it would be split up,

up,

uh we can just intuitively see these

simulation islands here. they would all

generate tasks that go

into the the job system. And as long as

we have enough of these islands and

they're roughly the same size, this is a

a pretty good way to do it. But unfortunately,

unfortunately,

in Terown and quite a lot of games, it

looks more like this. you have maybe

some small islands here,

but if the player is given the chance

and a lot of tools, they will do that

and and that's going to be one one

single island and it's going to be take

a long long time to solve that on one

single thread while the rest of the

machine is basically doing nothing

except for waiting on that. So this has

been the ever since initial release of

teardown. This has been the biggest

Um and this is a common problem not just

for Terodon but for any physics engine.

So another guy who thought about this a

lot is Erin Tado, author of Box 2D who

most of you probably already are

familiar with. He worked on this problem

last year u and released

uh his findings in um box 2D version 3

that came out a bit more than a year

ago. And he also wrote about this on his blog.

blog.

And Erin is a really smart guy and

really nice guy. So I don't want to I

just want to give him the credit for

this. I'm just using the same thing. So

if you want to look up the details, look

at his blog. It's amazing.

And I just want to quickly describe how

it works.

It's a a um technique called graph

coloring. And it's a way to split up

these large islands into

um chunks that can be solved in

parallel. And it's also a very intuitive method.

method.

You go over all the contact points and

you find you sweep over all of them and

you mark the ones that actually can be

And in order to be solved in parallel,

they cannot affect the same bodies. But

there's going to be a lot of a lot of

constraints that affect uh different

bodies, totally separate ones.

Uh and in this case we can just see that

these three affect different pairs of

bodies. That means we can solve them in

any order or we can solve them even in

parallel and it's going to be exactly

the same result. And

And

when we're done we're not we won't have

covered all the context. We cannot

include this one for instance because

it's already being operated on by the

the other blue ones over here. Um

if we just tried to solve all of them at

the same time, we would since they

operate on the same bodies, we would

have a lot of data races and would also

be even though we might get some result

from that uh it also uh wouldn't be the

same. It wouldn't be deterministic. it

wouldn't be the same every time we run

it, which is also a property that we want.

Um, so when we do that, we have those

three, then we have four remaining

points. What we do is to sweep over

those four, not considering the other

that's already solved, and then we find um

mark the ones

that can be solved in parallel. Note

that those red ones, they cannot be

solved in parallel with the blue ones.

We have to finish all the blue ones

first and then we can start on the red

ones and then we just do that over and

over uh until

we covered all the constraints.

This is a very simplified case of

course. Um in a real scenario with maybe

thousands of bodies or or and tens of

thousands of constraints uh that that's

going to be a pretty large number of

constraints that goes into each uh

color. And a typical number for that

could be

maybe 10 11 12 colors that's needed to

cover almost all constraints. And

usually there's a small remainder. We

don't want too many colors. So if we

have a very small number of colors left,

we can just put them in a special bin

and go over them sequentially. That's

going to be, as we'll see later, faster

than than making a separate color for each.

each.

And then when these are fed, when we're

going to solve these, and we're going to

process these after splitting them up,

we start with the first color,

throw that at the thread pool, and solve

all those constraints as parallel as we can.

can.

And then we're going to proceed to the

next one.

uh six times because we wanted to do six iterations.

And in between each color

when these threads have been working on

it before we can start the next one, we

have to wait for all the the threads to

finish. So that's that's a full barrier

sync across all the threads in the

thread thread pool. And we're going to

have to do that between every color in

every iteration. And as I mentioned, the

number of colors are usually 10, 11, 12.

And then we have a few additional jobs

every iteration for integration and

other things. Um, so we end up with

close to or around 100 barriers per uh

per step or per time that we execute the solver.

solver.

And given that this is a game that wants

to update at 60 uh frames per second

um and we also have other things to do.

We have a collision detection and

rendering and whatnot. Uh that gives us

just a few milliseconds to complete

that. And doing this amount of

synchronization in such short amount of

time is something that um

synchronization primitives uh aren't

very good at.

So, in order to measure how they

perform, I had put together this little

test scene.

It's a level exported from Terodon to

the new engine. Uh, I spawned a lot of

uh dynamic objects.

And um it's actually 5,000 of them. I

made so that they're all active and

simulated at all times. And they're all

or most of them are in a big pile. So,

it's it's a tricky case

for the physics engine to handle.

And um the machine I'm running this on

is an Intel Core i9.

Um 14th generation Raptor Lake

900F. I'm not sure what what those

numbers really mean, but it has eight P

cores or performance cores as they're

called now. Uh with hyperthreading, so

two threads per core.

and 16 efficiency cores uh summing up to

a total of 32 threads. Now whether we

should count the hyperthreads or not

could be discussed but there actually 24

and as a baseline if we run this single

threadedly u we get

the number here 82 milliseconds for the

whole scene update

um which is obviously not anywhere near

real time. Um, but the number we're

going to look at throughout the

presentation is actually one iteration

of the solver marked here. So the thing

that we do six times, I take an average

of that and that's the the number I look

at because

that's the thing we're trying to

paralyze. So it makes sense to measure

And if I start out with a traditional or

what I call a traditional thread pool,

it's um actually very close to the

implementation that runs in the current

tear down or tear down the game as it is today.

today.

It uses just standard synchronization

primitives. So it has a shared task

queue that fills up every time there's a

job coming in. there are multiple

workers uh pulling jobs from this task q

using a standard mutx

and then there are a couple of condition

variables one for if it happens so that

the main thread uh has to wait for the

threads to finish the main thread also

pulls tasks off the queue but sometimes

it finishes there's nothing no more

tasks to take but it has to wait for the

threads to finish it sleeps on this

condition variable

and And the same thing when the workers

run out of work um and there's no more

tasks to be done, they sleep on another

condition until the main thread says

here's a new job. Then they can start again.

again.

Um and it looks like this.

And you can already see this is kind of

what we this looks a bit weird, right?

We have very dense profile on the left

and a very sparse on the right. And this

kind of coincides with what we said

before, collision detection on the left,

very easy to paralyze, no data dependencies.

dependencies.

Solver on the right, uh very hard to

paralyze and lots of synchronization.

So if we just measure the time there,

we're at uh 17

times better than the single thread

performance on the collision detection

which is uh really good I would say

given that we have

I don't know how how much you know about

performance and efficiency cores. I'm

also I'm not a not a hardware person. So

I'm not the one to tell you exactly how

it works. But the efficiency cores from

what I've come to understand they they

sit there are four cores in a cluster.

They share an L2 cache and they also

have a clock speed that is lower than

the performance core. And that in total

makes them run at roughly half the speed

of a performance core from what I come

to understand.

And since we have 16 of them and eight

performance cores, we should get around

16. And 17 is so that that must be

pretty good, right?

If we look at the solver,

we're at 1.4.

That is like objectively terrible for

having 31 threads on a problem. So,

um, I reran this with different numbers

of threads just to see what happens. And

what's going on,

I it's a little bit hard to explain, but

if I run at eight threads, I should have

had the graph. I didn't actually do all

the number uh numbers because it's a lot

of manual work. Uh, but I tried it and

the one that runs fastest is the one

with eight threads. Um

now we can observe something something

interesting here and that's the

collision detection which paralyzes

really well. It sits at 7.6x

the single single thread performance

expected because it's very close to the

theoretical maximum of eight. Um whereas

the solver which was with 31 threads was

at 1.4 is now at 4.7.

So one part of the code

is slower and another part of code is faster.

faster.

That's kind of hard to work with, right?

Uh so in total if we just look at the

the scene update which is 20.5 in this one

one um

it's is 13.3 in this one. So in total

measuring the whole frame time, this one

is actually faster. It's faster to run

just eight threads

than 31. And the reason I chose 31,

maybe I should mention is that when you

have a high thread count like this, I

usually leave one for the operating

system and background tasks. Found that

to be good practice just so you don't

get hiccups. Um

how should we tackle this? It's should

we just And why is it eight? E eight I

think is because we have eight

performance cores. That would make

sense. Um but what should we do about

it? And

if we go back and look at this one and

zoom in on one iteration that should be

able to to give us some hints and then

we can notice

two things. One is that in between. So

these are the the different highlevel

jobs. These like a vertical green fuzzy bars

bars

in between these jobs.

There's there's a lot of just black

space where none of the threads are

doing anything. Uh also not the main

thread which is kind of weird because

the whole computer is just sitting there

for almost a 100 microsconds doing nothing.

nothing.

Uh and the other thing we can observe is

that even when a a worker thread is running,

running,

it's it's a lot of gaps in between

actually processing tasks

and that's likely because there's a lot

of contention. We have a single shared

task queue and a lot of workers trying

to pull from the same queue. So we're

going to have contention from all the

threads trying to access the same resource.

resource. Um

so I want us to think about these two

things as two separate problems even

though they're probably related.

Uh so one to try and alleviate this I

rewrote the uh job system from using

standard synchronization primitives to

using mostly lockless programming and

atomics. I don't know how how many in

here have used or are familiar with

lockless programming.

So that's more than half. Great. Um it's

a way to to spin on atomic variables to

wait for something instead of using uh

like an operating system primitive. The

operating system primitives actually

also use that a little bit. they spin a

little bit in user space usually before

uh actually doing task switch and going

into kernel but uh that's true for for

mutex but from what I come to understand

that is not true for a condition

variable so uh and it's also hard to

know exactly because it's not really

so we have an atomic spin that spins

uh waiting. So So when there is congestion

congestion

pulling these tasks,

the atomic spin lock will just sit there

and go, is it available? Is it

available? Is it available until it

becomes available and then it's

immediately going to return. So there

won't ever be u a a task switch or like

the the operating system blocking that

thread or anything.

We also have an atomic counter for

completed tasks so that if the main

thread is the first one to finish

um it's not going to sleep on a

condition variable anymore. It's also

going to sleep on this uh atomic waiting

for the number of completed tasks to be

uh all the tasks.

And I in this version I still have the

condition variable for the workers because

because

we don't want to spin all the threads in

the machine forever because that's going

to consume a lot of energy.

The computer is going to be very loud.

It's going to be very warm and it's also

not a very nice thing to do towards

other programs. We might get punished by

the Windows scheduler for doing that. Uh

so we we don't want to overuse this

opportunity to to spin too much. But

let's let's look at what this looks like.

like.

So on the left side, collision detection

is roughly the same. I think that is

within the error margin. Savis is

probably the same which is expected. It

was already fast before. It's not going

to be magically much faster if it's

already efficient. Um, and the parallel

solve on the right side is now at 3.8

times faster than single threaded. So,

we're a bit closer to to the eighth

thread version. Not quite as good as the

eighth thread version with the previous

one, uh, but much better than the 1.4

that we had before. And in if we look at

the scene update 10, um, that's actually

faster. So by using lockless we could

we could um make the whole frame time

now running faster with 31 threads as

compared to to eight threads that we had

before. So by using we actually get a

benefit from using the whole machine so

and if we then zoom in again

it now looks like this.

And it's worth mentioning here that the

this distance here

uh is is now 880 microsconds whereas in

the previous slide it was 2,200

microsconds. So even though it looks

like it's approximately the same

uh black gap in between jobs, it's

actually much faster because it just

scaled up.

Uh, but we can see that once the threads

actually start working on something,

they're actually quite efficient. There

aren't they're almost not noticeable

these little black things where they try

to get the next task.

So, we solved one problem and we made

one problem better,

but it's still looks like something is a

little bit off. They're doing more

waiting than working still.

And I think the solution here is is

probably obvious to everyone since we

still have that last condition variable

in there that we don't want to remove

because the computer we don't want to

waste or like waste all that energy

having it spinning all the time.

What we could do is to have like two

versions of the thread pool. run that

spins indefinitely

even even when the job is finished. You

can just spin from here to there.

Um, but we just want to do that while

we're in the solver because we know in

the solver that when we when we finish

the job here,

all the threads will hang on that

condition variable. And then on the main thread,

thread,

five microsconds later, it's going to

say, "Okay, time to wake up everyone."

So then they kind of make this switch

like go to sleep, wake up, go to sleep,

wake up. Um, which is very inefficient. Um,

Um,

but before we we look at that, I just

want to make an uh mention an

observation that I made. And I'm not a

hardware person, so I'm I'm not

uh I don't know what I'm talking about

on this one, but I but I just want to

mention it because I I couldn't unsee it

once once I saw it. When I've been

staring at these graphs, I see that it's usually

usually

when the main thread says start working,

there are a number of threads that start

early. And then there are kind of a

gradual uh period before everyone

started. And the time it takes

from the main thread says start until

all of them started

is roughly like 30 milliseconds or 30 microsconds.

microsconds.

And if we just look at these that start early,

early,

you can see that it's one, two, three,

four, five, six, seven.

These are a little hard to measure

because they're so small. One, two,

three, four, five, six, seven. 1, two,

three, four, five, six, seven. You start

to see a pattern. It's like all almost

always it's seven threads that start

early and then the rest come way later.

And um my theory is that those are the

performance cores and that's also why

the eight core version is faster. I

can't really back this up, but that

that's just my theory.

Um but now let's try to remove that last

condition variable. So the only

modification I made was to have a

boolean on the whole uh thread pool that

basically says uh some if we are in this

elevated performance mode or low latency

mode or spin mode as I call it. If we

are we just spin on an atomic while

waiting for the next job.

um or that boolean to to go to the the

normal state, then we also break that spin.

spin.

And looking at the results,

we're now up to 9.1x on the on the

solver side. Uh, collision detection

also went a little bit better up to 19,

but I think that could be

it could also be uh just random noise.

I'm not sure. Um,

and overall we're down to 6.6

milliseconds for the whole simulation time.

time.

Um, from 82. So I don't know what that

would be. It's like a total of maybe 12 13x.

13x.

Uh, and if we zoom in as we always do on

one iteration, we can now see that this

is starting to look pretty reasonable.

It's actually doing what it's told. So

if the main thread says, uh, can you

start this batch? All the threads are

are pretty quick. They start at roughly

the same time. Um,

also another reminder that this is again

scaled up just so you can see it better.

It's now 330 microsconds across.

So the time it takes to start all the

threads here, it went from um 30 mill 30

microsconds in the previous one down to

just you can't really see the legend

here, but it's approximately two

microsconds uh in this version.

And once the the last worker stops working,

working,

the main thread immediately continues

without sleeping. and starts the next

one and then it just goes in sync with

the workers.

Uh the only real gap we have is this

one, but that's the serial part. That's

these the special bin of constraints

that didn't get colored that we solve in

sequence. It's been there in in all the

graphs, but it's been so small that we

didn't see it. And now we can finally

see it. I see that as a good thing. Now,

now the whole the whole graph finally

makes sense. Um,

so as I mentioned a couple of times now,

the scale of this is different in all

the slides and I thought it would be

interesting to just lay them out um and

see them side by side.

So I just scaled the image actually to

to have the same

the same time so to speak. And now it

gets pretty obvious. It's the same

amount of it should be the same amount

of green in all these graphs. It's just

the the amount of black gaps where the

So that's that. And

I also put

actually put them in a diagram like this

where I grouped them a little

differently. I have the single threaded

on the left. I did these three meth

methods, the synchronization primitives,

the lockless or partially lockless uh in

the blue and then these spinning mode

lockless on the right. And then I

measured the whole things with eight

threads and with 31 threads to compare

as we can see with a with the sync

primitives much faster with eight threads.

threads.

um with the lock list still faster with

eight threads but it's like at least

approaching the same range. Uh and if we

just spin in between jobs the 31 threads

outperforms uh the eighth thread one by

quite far.

This by the way is just for the solver.

If we would have included the

uh collision detection also that would

be an even more dramatic difference

because of the the collision detection.

Uh, I redid the same thing with AMD. I

have a Ryzen 7 laptop that has eight Sen

4 cores and it uses AMD's version of

hyperthreading called simultaneous

multi-threading. It's basically the same

thing. Uh, so 16 hardware threads. Um,

so instead of 31, I'm using 15 threads

on the right side.

And I've I thought it was quite

interesting that the the overall trend

here or the overall um

how they differentiate is roughly the

same even though it's a completely

different architecture. So with the sync

primitives eight is faster than 15 with

lockless about the same and with green

we do get a benefit from using also the

I do have a couple more slides if we

have time. Yeah. >> Um,

>> Um,

just some closing notes on having access

to a low latency thread pool like this

has uh affected the way I think about

parallel programming a little bit which

I thought would be interesting to share

at a conference like this.

Um, when you have

regular task systems or job systems,

thread pools, whatever you call them,

you the most common way to

interface with that, to my knowledge, is

to have some kind of task that you

define. And it could look if you use C++

it could be like a a base class a task

and then you override a run function a

run method and then some state that this

task operates on. Uh if you use C it

could be a function pointer and a

strruct. It could look a little

differently. And then at some place in

the code

you would build a task graph. So you

would assemble these tasks. You would

allocate memory for them, put them

somewhere, uh copy the state, add them

to the scheduleuler, insert these

barriers, and sometimes you have

dependencies between the tasks instead

of manually

setting the the barriers and then all

that happens implicitly.

Um and then once you built that

you execute the whole thing and then

that runs and then implicitly it just

calls all these callbacks in the right

order on from different threads and just

magically it resolves uh this graph.

I never ever liked this style of

programming. I think it's horrible. I

think it's very different from how I

normally think about programming. I if I

sit down and write like a prototype for

something, I would never dream of using

this paradigm for doing it. I would just

write it like from top to bottom like

sequential code. And

And

usually this is necessary because you

want these dependencies kind of stated

in that graph. But what I noticed from

my own behavior is that when having

access to a low latency thread pool like

that, I've started to use a very old

concept in parallel programming

much more called the parallel 4, which

is which is very simple. All it does it

loops over

a number of things and it does them in

parallel. That's all it does. And then

it does a

then it waits for all of them to finish

before it proceeds.

So all the measurements you saw in the

previous slides were actually done in

this way. It's just

um it's not exactly this code. It's

simplified a little bit but but you have

this outer loop. First we prepare the

graph coloring to compute these um what

can be what should be solved together

and then we make an outer for loop of

six iterations and for each iteration we

go through all the colors and then we

just solve all of them in parallel uh do

that for all the colors then we solve

these uh sequential batch only on the

main thread and then we do another

parallel piece of work

with the static contacts Um, and it's

just very easy. It's very similar to the intuitive

intuitive

prototype version of this algorithm, right?

right?

And um

the only thing that I really needed to

do to make this run efficiently was to

to have the to set the thread pool in

this elevated state. And I do that with

a little scope variable. So when it

enters that it elevates the state

and if there's something to do on the

main thread it's also beneficial to do that

that

when you enter the function. So in this

case I say I tell the workers to go to

this elevated state

and then I know it's going to take 30 40

micros for that before all the threads

are ready and alert.

Uh so have some time to do the graph

coloring while waiting for that and and

then you can just work with with the

worker threads. Let the worker threads

and the main thread uh work in a much

more synchronized fashion if you don't

have to pay the price of this latency.

Uh that's proved for me at least to be a

vi viable way to to work with parallel programming.

um saves a lot of boilerplate code. You

don't have to allocate memory for these

tasks. You don't have to copy the state.

You don't have to come up with names for

these tasks and all that tedious work

that that is not so pleasant. Um so for

me that has been

um quite eye opening or as like a new

new way that's sort of an old new way

that's viable again.

Uh so conclusion um I

I think this to the the takeaway here is

not to do exactly what I did. I think

the takeaway here should be that all

these graphs we've looked at

having access to looking at the way the

threads communicate in a visual way I

think for me has been has been very very

useful and I don't think I would have

come to these conclusions without seeing

all these gaps in the timelines and how

the threads uh collaborate. So I can

highly recommend if you if you're not

visualizing your profiling data in that

way or collecting the profiling data

that makes it possible.

I would highly recommend doing that. Um

and then depending on your workload, you

may want to do something similar or

something or not do it or uh something

completely different. Um but then at

least you're making an informed decision.

decision.

Um yeah, for me changing the way

synchronizing threads made a a huge

difference in performance.

And that's that's the end of the talk. I

just want to leave this with a little

physics movie. And uh if you have any

questions or you want to reach out, you

can see my contact information down

>> Could be

could be around 10,000 maybe. I don't I

don't know. Maybe between five and

So how do you think about the batch size

in the parallel floor? So you like don't

want to assign just a single conraint to

like as the unit cost.

>> Yeah, it's a it's a good question.

>> And then you have like static versus

dynamic scheduling.

>> Yeah. Um it's a good question. I

it's a little you you have to again

measure I think to see what's a good

batch size. I think it's 32 in my case.

Uh I tend to this got very specific very

fast but but the the number of tasks I generate

generate

I think is

four times could have be could be six

times I'm not entirely sure but

somewhere around that range the number

of workers. So I have a good number of

tasks to choose from to not have so much

static scheduling.

>> So big round of applause for Dennis one

more time.

>> Is this still on? Okay. Okay.

>> So my first question is regarding the

profiler you mentioned you needed to

build this from scratch, right? Is this

your your work that you had library that

you actually could use? What was

>> uh this was my my own custom profiling?

Yes. Uh I have a a history with

profilers. I I like them. I use them a

lot. Uh but there is a really good free

one called Tracy that you could use if

you don't want to write your own. It's

actually not that much work to write

your own. You just have to be very

careful how you measure performance

because you can't if you want to measure

performance of something lockless. You

you can't have a mutx in your profiler

because that kind of mess up the whole

timeline. Um so it's um it has to be a

lockless profiler also. But as long as

you have that

um collecting the data is really simple.

you just record all the um I just call

query performance counter and

note the results and um and then at the

end of the frame or when when I

visualize this I kind of sort out all

that and see which thing should go where

and so it doesn't it doesn't affect the

thing while measuring too much.

>> Awesome. So I have a question. This was

all CPU stuff. Yes. Right. And

everybody's crazy about GPUs in terms of

rendering. So it does it makes sense to

offload some of that work during the

physics either the collision detection

or the actual solving part to GPU and

use the resources there.

>> You certainly can and it has been done

been done. But I I'm not a big fan of

the idea for two reasons. And one being

that the GPU is pretty busy doing

rendering, you probably want it to focus

on that. Um, and the other one being the latency

latency

or well, if you want to put physics on

the GPU, you either want very low

latency and high bandwidth between the

CPU and GPU and have them work in sync.

So the CPU can say start crunching

physics or just the solver or whatever

is on the GPU and then get the result

back. But it's going to be rather

inefficient because the GPU is going to

be busy rendering stuff. So it's not

available just for random requests. Um

or you put you put all of it on the GPU

and just tie it into the to the physics

or the graphics. Um

>> have you tried that? No, because the

result of the physics is also very tied

to game play. So when you you you want

the result of the physics on the CPU and

you want it usually immediately after

you're done with it and in our case for

tear down it's it's it's not really

possible because the physics could also

result in things breaking and you want

to create new objects and and putting

all of that on the GPU. You're going to

end up with pretty much the whole game

on the GPU and that that that would be a nightmare.

nightmare. >> Okay.

>> Okay.

I sometimes talk to game developers and

when I talk when we talk about physics

they are always complaining and you

spent your entire life on physics and

physics engines and my question is

is this the skill issue on the developer side

side

or is this like inherently complex? How

how can a person get into physics and

learn how to do it properly?

>> I think you just answered the first

question with the second part like that

there aren't really any a whole lot of

good resources for learning physics.

There used to be the bullet forums, but

there's not that much activity these

days. But like 10 15 years ago, that was

a very active forum where people helped

each other. And we haven't seen much of

that lately. So, I I don't know at this

point. It's not particularly complicated

compared to anything else in game

development, I think. Um, but it's um it

maybe doesn't have the same mass appeal

as graphics for you. I mean, it would be

hard to make like a shader toy for

physics. I don't really know how what

that would look like. So, um getting

people into it takes maybe a little bit

more of an investment before it starts

working. That could be one.

>> Right. So from your talk, we started

with like mutex and synchronizing

threads and then we ended up on the uh

spin uh spinning lockless implementation.

implementation.

Is this does this implementation have

some sort of downsides? Like are there

situations where shouldn't go this way

because it wouldn't work or is it just

always better? Well, the the most

obvious downside is that it can consume

more power. It will require more energy,

but I I think that is only true in some

situations given how much shorter time

the solver runs.

Uh, I don't I'm I'm not convinced that

the total amount of power would would

actually be less if if it went to sleep

and woke up like a 100 times instead. Um,

Um,

but if the but spinning on on spinning

on on um atomics like this is not

something that is generally recommended.

If you look at the whole system, like if

you're running the game and a lot of

other things,

that will probably not work very well

because because the game is going to if

you get like a a context switch in the

middle of the solver while a lot of

threads are waiting spinning. Um,

Um,

then they're going to keep spinning

until if we say for some reason the main

thread would be swapped out while the

workers are spinning waiting for more

work, they would spin a whole time slice

before the the main thread gets swapped

back in. I think in in practice though,

I don't really see how or why that would

happen. But if you if you run a lot of

things on the machine, uh it's going to

start to become really inefficient.

All right, I think we have time for one

question from the audience.

>> We have we have okay time.

>> Okay, time to

>> All right, so let's open up the floor

for questions from the audience. Who

would like to ask the question? Yes.

>> Um so this is more of a general physics

programming question. So if you're if

you're applying the solver to multiple

bodies, um I would imagine that you

start at a different point in the chain

when you're applying impulses that it

would could potentially lead to a

slightly different solution, right? Like

it's order dependent. Um so what what

kind of general like uh conventions are

there for deciding what order to start

in because I would imagine those would

lead to slightly different results.

>> Yeah, that that's

entirely correct. when that's also one

of the reasons we want to have

determinism. It's not only for

networking purposes or anything. It's

it's because of stability. If we solve

the constraints in the same order every

time, we're going to have a result that

is slowly converging to to the correct

solution. Whereas if we if we we solve

them at in random order every time, um

we're going to we're going to end up

with disturbances that just that it

can't compensate for appropriately. So

we're going to end up with jittering.

And um

there's also I don't know if that was

the question you were asking. There's

also been in the past there's been

experimentation with

reordering constraints. Say you have a

stack of objects. If you want to solve

it from bottom to top order and things

like that in order to increase stability

for stacking, that's also been done. But

I I would say that that's a minor thing

compared to the the general stability.

>> If I could just clarify the question,

the question was like what kind of

conventions are there for deciding what

order to go in? That makes sense. Like

is there is there particular conventions

that for deciding what order to

actually? Yeah, apart from these if you

if for stacking for inance if you want

to solve from top to bottom that's

generally preferable but I I would say

in practice what I do and what I I know

other physics programmers also do is

that it it there's some kind of

consensus that it doesn't matter that

much which order you solve it as long as

it's consistent.

So what I do uh I just do a a radix

sort. Uh so I sort them per body and

it's it's incredibly fast and I do it

actually on the main thread before the

solver starts. But it's uh it's it's um

at least with 32 threads it it doesn't

really take up that much time on the profiler.

>> All right, next question. You mentioned

there was potentially some sort of

penalty from the operating system if you

don't go to sleep. Is that a particular

thing on particular operating systems?

>> It's just something I observed on

Windows. I don't know if it's

there's some just certain um cases or if

it's if it's always but if I if I tend

to if I make an error so that they're

always spinning. I noticed that the

uh the operating system I don't know if

the the punishment is in is intentional

because this process is doing something

bad or if it is uh just the time slicing

that that makes it makes it get swapped

out more often in the middle of

something. It could be either of those I guess.

John,

>> did you ever experiment with work

stealing? Because you only have one work

here, right?

>> Yeah. Yeah, it's a excellent question.

It's actually one I should have put on a

slide for future work because work

stealing is is generally a better

algorithm where you don't have just one

single queue of uh tasks, but you have

one task Q per worker and then they

steal work from each other in the case

they run out. So you have like a static

batching that turns into dynamic

batching once um all the work is done so

to speak. Um and I I want to try it. I

haven't tried it yet. I also noticed in

the profiler that these black gaps in

between jobs with a lockless uh are so

small that the the potential speed up is

is is pretty small. I think in my case

Do you think that there is a way to

overlap these iterations

because you you will be touching stuff

that's not going to change during the

previous iteration?

>> If that was the case, they would have

ended up in

in the previous color,

>> you know, like the iterations like when

you're done with some of the

>> Oh, I see. I see. Um

potentially there there might be some

opportunity for optimization in the

overlap going from one iteration to the

my intuition is that would be pretty

small but but it's it's a good idea

should should maybe look into it. Sam.

Sam.

>> Yes. I guess so. I I wonder you you

describe that on the main thread you do

things that I presume touch all n of the

contact points it sounded like. So is it

I'm trying to get an idea for how much

more expensive the salt each constraint

solve is compared to just iterating the

like because you for example said you

sort them. Right.

>> Right. Are you talking about a memory

>> just why is it expensive? Why can you go

over all n contact points on the main

thread that you can't solve? Like how

much more expensive?

>> Yeah, it's a lot of computations to

solve each uh constraint. So that

there's a significant amount of

operations that needs to be carried out.

Um, so it it's yeah, but you you're

right that that going over and touching

the memory on the main thread for

everyone that that could if you have a

lot of constraints

that could potentially be be a bit

expensive. The total memory for all the

constraints, even if you have 10,000,

they're still all going to fit

maybe not in the L2 cache, but at least

in the L3 cache. So, so it's not going

to be a major bottleneck when it comes

to to memory.

>> And and just a quick followup then are

there variations in how complex your

solver could be? Like for example, is

tear down solver a lot more complicated

than what somebody else might do or

>> I don't think so. They us the the

solvers at least from what I know are

relatively similar as long as they

follow this projected Gaussidel or

sequential in impulse. Uh they could be

implemented a little differently. The

new one uh that we saw here is actually

a a bit different from from the one in

Terodon on a number of uh in a number of ways

ways

but it's um the differences are rather

>> Ryan

>> so um I'm not a good physics programmer

so let me know if this doesn't make any

sense but with uh so with the sequence

of iterations you have like let's just

say you have blue color, red color,

green color, and then you the second

iteration you do blue

uh green red whatever order I said. Um

you do that uh many subsequent times.

Does is there a dependency between like

blue number two and red number one if

that makes sense? So like could you do

all the blues first and all the reds and

all the greens or can you not slice it?

>> No, you you have to solve all the colors

in one iterations before you can start

the next one. That's it's it's almost

like the iteration is just a

continuation of this sequential thing.

So you have to finish all of them before

you can start the next one. >> Okay.

>> Okay.

>> The iteration is just basically just to

make the the step small enough so that

you have fine enough correction, right?

That's the only reason we do iteration.

>> Uh the reason there are iterations in

the solver is to increase accuracy. Uh

because it converges towards the solution. the more iterations we take,

solution. the more iterations we take, the closer we get to the solution. Um,

the closer we get to the solution. Um, >> and how did it happen that is exactly

>> and how did it happen that is exactly six? How did you

six? How did you >> uh It's just experimentation like what

>> uh It's just experimentation like what you need for the game.

you need for the game. >> Um,

>> Um, >> will this be stable from game to game or

>> will this be stable from game to game or >> from what I know most games use four or

>> from what I know most games use four or eight there? There probably some games

eight there? There probably some games using a bit more, some a bit less. Some

using a bit more, some a bit less. Some some physics engines let you set that

some physics engines let you set that per body. So if you have an a simulation

per body. So if you have an a simulation island with all uh all have a low

island with all uh all have a low iteration count then it's not going to

iteration count then it's not going to spend so much resources simulating that

spend so much resources simulating that island. Uh so it can look a little

island. Uh so it can look a little different but um but it it's a it's a

different but um but it it's a it's a performance quality trade-off.

All right. >> Yeah. So I have a question this is more

>> Yeah. So I have a question this is more general physics engine optimization and

general physics engine optimization and that's do you do a thing that for

that's do you do a thing that for example you measure uh for example you

example you measure uh for example you see if an object is really far from

see if an object is really far from player camera you do less physics

player camera you do less physics iterations on it or if you see that the

iterations on it or if you see that the object is really stressing the physics

object is really stressing the physics engine you just like silently try to do

engine you just like silently try to do less iterations on it

less iterations on it usually no I don't do that at least I

usually no I don't do that at least I don't know if if someone else is doing

don't know if if someone else is doing it I think it's pretty uncommon because

it I think it's pretty uncommon because generally

generally If you have things that are scrambling

If you have things that are scrambling around somewhere trying to settle, if

around somewhere trying to settle, if you turn down the iteration count, the

you turn down the iteration count, the simulation quality is going to go down,

simulation quality is going to go down, which means they're going to be um more

which means they're going to be um more shaky and jittery, which it's going to

shaky and jittery, which it's going to keep them alive for longer time. So when

keep them alive for longer time. So when something settles, you can take them out

something settles, you can take them out of the simulation. So at a larger scale,

of the simulation. So at a larger scale, you might actually waste time by by

you might actually waste time by by doing that.

Are there yes sq >> have you thought about uh um keep

>> have you thought about uh um keep storing in the objects if they are

storing in the objects if they are solved so that you say the first power

solved so that you say the first power is one value of the second and then you

is one value of the second and then you can check hey I'm in the second wave

can check hey I'm in the second wave have both of these contacts been solved

have both of these contacts been solved in the previous and therefore you don't

in the previous and therefore you don't have gaps where you have to sort of

have gaps where you have to sort of collect everything

collect everything >> do you mean dur for each iteration you

>> do you mean dur for each iteration you would do this to see if they need

would do this to see if they need further solving or I'm not I'm not not

further solving or I'm not I'm not not sure.

sure. >> Yeah. So um if you have a contact in the

>> Yeah. So um if you have a contact in the what would you say the second color

what would you say the second color right

right >> and that contact can only be solved when

>> and that contact can only be solved when the the first iteration is done but it

the the first iteration is done but it doesn't have the whole first iteration

doesn't have the whole first iteration doesn't have to be done only those two

doesn't have to be done only those two objects.

objects. >> I see what you mean. Yeah. No I have not

>> I see what you mean. Yeah. No I have not experimented with that. I think that

experimented with that. I think that would be a very complicated

would be a very complicated thing to implement unless you want to do

thing to implement unless you want to do it because then you want to have atomics

it because then you want to have atomics for that checking. Um

for that checking. Um it's an interesting idea. I I think it

it's an interesting idea. I I think it would be a lot of locking and quite

would be a lot of locking and quite expensive but but it's it could be worth

expensive but but it's it could be worth checking.

>> All right. >> Dumb question. If there are no good

>> Dumb question. If there are no good resources, when can we get your book?

resources, when can we get your book? [Laughter]

[Laughter] Uh, I don't think I'm I'm not planning

Uh, I don't think I'm I'm not planning on writing a book, but but I I agree

on writing a book, but but I I agree that there should be more physics

that there should be more physics resources. There's been a couple of

resources. There's been a couple of initiatives

initiatives to do more. The best I know is Erin Ko's

to do more. The best I know is Erin Ko's blog. It It's very good. And also he if

blog. It It's very good. And also he if you're interested in the solver

you're interested in the solver specifically, he released just last year

specifically, he released just last year a project called uh solver 2D. It's on

a project called uh solver 2D. It's on the same GitHub account as Box 2D. That

the same GitHub account as Box 2D. That is a simplified implementation of I

is a simplified implementation of I don't know 10 11 different solvers. And

don't know 10 11 different solvers. And the code is very easy to read. It's like

the code is very easy to read. It's like written just to compare the solvers

written just to compare the solvers qualitatively. So it's it's not written

qualitatively. So it's it's not written for performance and that's a good

for performance and that's a good resource if you want to learn what

resource if you want to learn what different solvers do.

different solvers do. >> But besides Erin Kato's blog, there's

>> But besides Erin Kato's blog, there's also Voxagon.se

also Voxagon.se blog by Dennis himself. So while we wait

blog by Dennis himself. So while we wait for his book, you can go to this website

for his book, you can go to this website voxagon.se

voxagon.se Swedish Swedish domain I presume. And

Swedish Swedish domain I presume. And Dennis has amazing blog posts, very

Dennis has amazing blog posts, very detailed, very nicely written. You can

detailed, very nicely written. You can go a lot. You can learn a lot from these

go a lot. You can learn a lot from these blogs. So go check out vogsagon.s.

M. Yes. So you said that bad things would happen if you have more than 10

would happen if you have more than 10 colors around like that. Is it the same

colors around like that. Is it the same uh as you did with sequences with trial

uh as you did with sequences with trial trial and error iterations or like there

trial and error iterations or like there is some dependence on the number of

is some dependence on the number of objects like how to deter the number of

objects like how to deter the number of colors?

colors? >> Yeah. The reason I want to keep the

>> Yeah. The reason I want to keep the number of colors I want to cap the

number of colors I want to cap the number of colors is

number of colors is is to not have because each new color

is to not have because each new color requires a full synchronization. So

requires a full synchronization. So there's going to be some time lost and

there's going to be some time lost and if you don't have enough work in each

if you don't have enough work in each color, it's just going to be waste at

color, it's just going to be waste at that point. So it yeah, it's it's a

that point. So it yeah, it's it's a little bit of trial and error to find

little bit of trial and error to find that sweet spot. I think in my code I

that sweet spot. I think in my code I actually have 16 as the cap, but it

actually have 16 as the cap, but it usually never reaches that. It's it's

usually never reaches that. It's it's finished by the time it it gets to 10,

finished by the time it it gets to 10, 11, 12. And then I have another um cap

11, 12. And then I have another um cap at how when there are only fewer than a

at how when there are only fewer than a certain number of constraints left, I

certain number of constraints left, I just put them in this overflow

just put them in this overflow sequential bin. And I think that number

sequential bin. And I think that number is is maybe 32 or something something

is is maybe 32 or something something that that can be solved very efficiently

that that can be solved very efficiently in sequence.

in sequence. I just don't bother um doing any

I just don't bother um doing any labeling for I just quickly want to

labeling for I just quickly want to proceed to the actual solver.

proceed to the actual solver. Yes.

Yes. >> I guess how did using SIMD uh

>> I guess how did using SIMD uh instructions factor into the

instructions factor into the optimization? For example, was this

optimization? For example, was this important or was having?

important or was having? >> It's a good question. You if you go to

>> It's a good question. You if you go to Erin's blog, he's been experimenting

Erin's blog, he's been experimenting with that. So, he's actually doing graph

with that. So, he's actually doing graph coloring uh also for the purpose of

coloring uh also for the purpose of SIMD. So he he solves parallel

SIMD. So he he solves parallel uh on bunch of different threads and on

uh on bunch of different threads and on each thread he solves constraints with

each thread he solves constraints with graph coloring also on different lanes

graph coloring also on different lanes with SIMD which is pretty cool. The new

with SIMD which is pretty cool. The new box 2D is amazingly fast.

>> All right, thank you so much. This was so many questions. Round of applause.

so many questions. Round of applause. Dennis Gustson.

Dennis Gustson. >> Thank you.

Click on any text or timestamp to jump to that moment in the video

Most transcripts ready in under 5 seconds

One-Click Copy125+ LanguagesSearch ContentJump to Timestamps

Paste YouTube URL

Enter any YouTube video link to get the full transcript

Most transcripts ready in under 5 seconds

Get Our Chrome Extension

Get transcripts instantly without leaving YouTube. Install our Chrome extension for one-click access to any video's transcript directly on the watch page.

Add to Chrome — Free

Works with YouTube, Coursera, Udemy and more educational platforms

Get Instant Transcripts: Just Edit the Domain in Your Address Bar!

YouTube

←

→

↻

https://www.youtube.com/watch?v=UF8uR6Z6KLc

YoutubeToText

←

→

↻

https://youtubetotext.net/watch?v=UF8uR6Z6KLc

YouTube TranscriptPreparing your results…

YouTube Transcript:Dennis Gustafsson – Parallelizing the physics solver – BSC 2025