Apple did what NVIDIA wouldn't. | jakkuh
Summary
Core Theme
This content explores the groundbreaking ability to run massive AI language models locally on a cluster of four Mac Studios, leveraging new software and hardware advancements like RDMA over Thunderbolt and Exo 1.0, offering a powerful and cost-effective alternative to cloud-based AI services.
What have I got myself into this time?
Whether you're for or against AI, it's
hard to deny that at least on the
hardware and engineering side of things
that it's super interesting. I mean,
sure, you can pay for a subscription to
OpenAI or Google with Gemini for the privilege of spoon-feeding them your
data in exchange for some AI goodness,
but you can also be like all the cool kids on the playground (hell, even PewDiePie is doing it these days) and run your AI models locally. Yes, even the
computer or phone that you're watching
this video on can run a large language
model. But you know what can AI better?
Not one, not two, but precisely four
Macintoshes. Good for a combined total of 1.5 terabytes of unified memory to run the
beefiest and absolute baddest large
language models out there without having
to give Daddy Altman any of our data. If
you're a hardware nerd or you're into
AI, get subscribed. Duh. But you've also
probably seen this before. But I promise
you, there is something huge that's new.
In fact, two things. This Mac Studio
cluster should be multiple times faster
than it was even just yesterday. All
while still consuming less than around a
half of a North American power circuit
and while realistically costing multiple
times less than a solution that can run
models in the same class as this. It's
basically in a class of its own given
the price and the capabilities at this
point. It's kind of ridiculous. First is
Apple's sneaky release of RDMA over
Thunderbolt in the recent macOS 26.2
beta. You probably wouldn't think much
of it, but that is going to enable most
of the performance increases we're
seeing today. And second, the public
release of Exo 1.0. That's the software
you can use to take four Mac Studios
like this and turn them into the AI
cluster of your freaking dreams. But
we're getting a bit ahead of ourselves
here. See, if you've tried to run local
LLMs and you don't have hardware this
crazy, you've probably noticed that they
seem, well, for lack of a better word,
kind of dumb or at the very least less
intelligent than when you talk to ChatGPT or Gemini. Don't get me wrong, those
small models have their uses. You can't
exactly fit a Mac Studio amount of
performance or memory for that matter in
a security camera or a smartphone. It
makes sense that they can't compete with
the big services because the models that
you're using when you work with ChatGPT
are in the range of hundreds or
thousands of gigabytes. So large that
they need to run in data centers that
cost as much as the GDP of a small
European country to build and
realistically probably use more energy
than those European countries too. But
you can also run models of that size on
these. Sorry, I'm just making space for
my supercluster. I hope you don't mind.
Who did this brother? You see, Apple
designed their M series of silicon with
unified memory that's shared between
both the CPU and graphics cores, which
enabled them on the launch of the M3
Ultra chips that are in these Mac
Studios to release a SKU with a
whopping 512
GB of RAM, which is half of the units we
have here. The other half have 256 gigs.
Both of which were enough to run some
insane AI models. The kinds of models
that are one-shotting code challenges,
and they're only getting better, which
is exactly why you, software engineer or
total noob, should be checking out
boot.dev, who sponsored this portion of
today's video. They've cracked the code,
if you will, on making programming fun
and engaging to learn by basically
turning it into a video game. I know it
sounds crazy, but imagine this. You're
learning to do back-end web development
by learning Python and SQL and Go, and
at the same time, you're earning XP.
You're leveling up and you're completing
quests. Heck, you're even doing boss
fights all while learning. They have
courses for developers of all experience
levels with a curriculum that goes super
in-depth while still being easy and
engaging to follow. And their courses
even teach you the tools of the trade
while you're working through them. So,
you learn how to use Linux and Git and
Docker. It's kind of crazy, but it's not
as crazy as the fact that they give you
all the course material, the lessons,
the video tutorials, and the starter
code on their website for free. So, you
can start to learn how to build a real
web project without any risk to your
wallet. And if their interactive
learning style works for you, like it's
been working for me, then you should use
code jakkuh to get 25% off an annual plan
that gives you all of those fantastic
interactivity features like hands-on
coding, AI assistance, progress
tracking, all the gamification for all
of their courses at the link down in the
description. It's genuinely freaking
sweet. Love those guys. Like the 70
billion parameter Llama 3.3 model in
full FP16. Let's see how fast that goes
across one of our Mac studios. Can you
write me a thousand-word story about how
cool BMWs are? Include discussion of oil
leaks. Oh god. Took about 2 seconds for
the initial response and we're running
at a rate of about five tokens per
second, which is not the crazy fastest
thing out there, but it's not an easy to
run model at 150 GB on its own. But now
we've got models like DeepSeek, where they're 700 GB, or Mistral 3 Large and Kimi K2 in excess of a terabyte in
their full weights, which is crazy. But
we can handle that with this setup.
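As a rough sanity check on those sizes: a model's weight footprint is roughly its parameter count times the bytes stored per parameter, and quantization just shrinks the bytes per parameter. A minimal sketch in Python, using the parameter counts mentioned in the video (the roughly 1-trillion figure for Kimi K2 is an assumption on my part, and real quants land a bit higher because some layers stay at full precision):

```python
def model_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Weight-only footprint: parameters x bytes per parameter (decimal GB)."""
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

print(model_size_gb(70, 16))    # Llama 3.3 70B in FP16      -> ~140 GB
print(model_size_gb(671, 8))    # DeepSeek V3.1 in 8-bit     -> ~671 GB
print(model_size_gb(1000, 4))   # ~1T-param Kimi K2 in 4-bit -> ~500 GB

# Cluster memory: two 512 GB Mac Studios plus two 256 GB ones.
print(2 * 512 + 2 * 256)        # 1536 GB, i.e. the ~1.5 TB of unified memory
```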
Right now, I've got 10 Gb Ethernet
plugged into all of them, and I'm going
to boot up Exo on the rest, and it
should automatically form a cluster for
us. And then we'll see how much faster
we can make this go just on the same
model. And then we'll try one of the
bigger ones. Die, my child. Now, if we
start Exo on the rest of our Macs, now
we have a cluster, baby. Even though we
don't need the extra memory from the
other machines, we can make use of the
compute on the other machines and have
that same 70 billion parameter model run
way faster, at least in theory. Now,
before we try all four of them, I'm
going to start by loading that same
Llama 3.3 model on two of our machines
and see how much of a speed up that
yields on its own. And then we can go
into rocking it on all four. All right,
write me a thousand-word story about
leaky BMWs. Is it going to go any
faster? That is effectively the same
speed. 2 seconds to start and five
tokens a second. That's kind of weird,
huh? We are running it across both of
them. You can see that we've got 100
watts of power draw on each Mac Studio
and the model is spread across the
memory of both of them. But what if we
try and run it on all four? Well, we're
still talking about five tokens a
second. What's the problem? Now, you can
imagine the data of our model as a book.
And when we load that book across the
cluster, each machine gets its own set
of chapters, but it doesn't necessarily
have access to the entire book. That
means each computer has to calculate its
response to your prompt with the
chapters it has access to and then pass
that response or the partial response on
to the next computer in the cluster
before the next computer can even do
anything. It's just like a relay race
where you run your section but you have
to hand off the baton before the next
person can start. Except for our
situation, on each lap that the cluster
completes, we only get one output token,
which you can imagine is around one
letter in the output. In our case, the
racetrack is essentially the networking
that we're connecting all these machines
with. And you can see right now we're
just using 10 gig Ethernet with a 10 Gb UniFi Pro XG8 switch, which I'll have linked down in the description along with all the other hardware we showed off today. That 10 Gb networking is plenty of bandwidth to transfer what is essentially part of a letter. Our
hypothetical relay runners here are so
fast that forcing them to talk over
standard Ethernet is like the networking
equivalent of forcing them to go through
airport security every time they need to
pass on that baton. So you can imagine
when you're asking it to write a
thousand-word story, for each letter
within those thousand words, our cluster
here is effectively going through TSA at
least four times. That works out to a
metric shitload of completely random
searches, and it's why, in the enterprise, they use InfiniBand, which has much lower latency.
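To make the relay-race picture concrete, here is a minimal sketch of single-request pipeline sharding. The numbers are purely illustrative assumptions, not measurements; the point is just that the compute per token doesn't shrink when you add machines, while every network hop gets paid on every single token:

```python
def pipeline_tokens_per_second(num_machines: int,
                               compute_ms_per_token: float,
                               hop_latency_ms: float) -> float:
    # All the layers still run for every token, just spread across machines,
    # so total compute time stays the same; each hop is pure added overhead.
    per_token_ms = compute_ms_per_token + num_machines * hop_latency_ms
    return 1000.0 / per_token_ms

# Hypothetical figures: ~200 ms of compute per token (about 5 tok/s on one
# box), with a slow hop versus a near-zero-latency one.
print(pipeline_tokens_per_second(1, 200, 0))    # ~5.0 tok/s on one machine
print(pipeline_tokens_per_second(4, 200, 10))   # ~4.2 tok/s: slow hops hurt
print(pipeline_tokens_per_second(4, 200, 0.1))  # ~5.0 tok/s: hops barely matter
```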
Now, you could already do better than this simply by using Thunderbolt cables to directly connect all the Macs together, rather than using Ethernet with the switch; it's just a little bit lower latency on Apple's stack. Or we can bypass this hypothetical security problem entirely by using RDMA. It's kind of the same deal: we're going to hook up all of
our Macs directly with Thunderbolt, but
then we need to drop into the recovery
menu of our Macs and actually enable it.
Now, you do need the macOS 26.2 beta,
and you have to be using a machine that
has Thunderbolt 5. That includes
anything M4 Pro or better, as well as
the M3 Ultra. Now, I'm going to hook
this up exactly as it says in the guide,
which honestly is a little strange. How
does this even work? So, uh, this guy.
Um, how am I missing a cable? 1, 2, 3... 5.
Where's the sixth spaghette? They're all
connected like that. Woohoo. Now, with three Thunderbolt 5 connections on each of these systems, we're talking about an aggregate connectivity from each of them of around 240 Gbit both ways. But we're not going to make use of anywhere near that.
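That 240 Gbit figure is just the per-port bandwidth added up: Thunderbolt 5 is nominally 80 Gbps in each direction per port, and each Mac Studio here has three ports cabled to its peers.

```python
ports_per_machine = 3   # Thunderbolt 5 ports cabled to the other Macs
gbps_per_port = 80      # nominal Thunderbolt 5 bandwidth, each direction

print(ports_per_machine * gbps_per_port)  # 240 Gbit/s aggregate, each way
```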
The real story here is that once we have RDMA enabled, the latency cuts down by 99%, which is huge. Hey, Arlo, do you like fast AI? Oh, boop. Got them. Now you see we can take the same Llama 3.3 70 billion parameter model. Select MLX RDMA
and tensor sharding. Yeah, Arlo. Hey, I
don't even know if the RDMA is going to
work right off the bat. Who knows? Maybe
I cabled it up wrong. The time to beat
is five tokens a second. Write me a
200-word story. Holy... Nine tokens a
second. That's almost double the speed
we had running on just one. I'm a little
bit in shock. That was a bit of a
dumpster fire getting that all going.
Okay, why don't we try something bigger
like the Kimi K2 Instruct model in a 4-bit quant. Your what? My quant. 540 GB. For a baseline, I'm going to start
with the standard pipeline MLX ring
sharding so that we get a good idea of
what it would be like before we use
RDMA. Once we know how fast that goes,
we'll try it with RDMA. This is so
freaking cool, man. I'm excited to try
the new Mistral models for coding
because they apparently are like nice.
>> I suspect it's not going to go too fast,
but we'll give her a shot. Oh, yeah.
Okay, look at that. This is over
Ethernet right now, but it's only
transmitting like 20 megabit, 25
megabit. That's it. And uh we're getting
around 22 tokens per second. I'm Uncle
Jim and I hate oil stains the way cats... Now, let's try RDMA and see how fast it
actually goes. Only then it crashed, it
seemed like. Fast forward, oh my god,
like 12 hours of troubleshooting and
messing about trying to get this setup
to work as I was thinking it should
work. And it seems like we're finally
there. It wasn't just my changes. They
actually have made something like 40
different version updates in the last 2
days just to get this working a little
bit more stable. We're definitely at the
frontier of this technology. There
doesn't really exist another solution
for this, as Jeff Geerling so nicely
highlighted in his Framework Desktop
video. I also found out that my Macs were named differently than they asked; they were expecting them to be named as the cabling image said. In the future, that's a limitation that should disappear, but since this is definitely tech demo territory, there are some weird quirks with it like that. I
also am not able to upload any of my own
custom models into this setup. It's very
much a set of models that they have
pre-curated and determined work properly
with this setup. They also have to be
models that are in the MLX format for
this to work properly on Apple devices
with RDMA. However, now we can see how fast this goes. You can fit so many AIs in this bad girl right here, I tell you what. And same with this one, and that one over there, and this one over there.
As I showed you guys before, one of
these Mac Studios with Llama 3.3 70
billion parameters in the FP16 weight
that I've been using takes about two to
two and a half seconds to start and goes
around five tokens per second for
inferencing. However, when we run it on
two of our Mac Studios with tensor
sharding over RDMA, well, that jumps up
to more like nine tokens a second.
However, it's worth noting that that is
a dense model and that means we get
really good scaling even though it seems
kind of counterintuitive. And there you
have it. We're up to just over 1 second
to get started and around 15 and a half
tokens per second. 3.25 times faster.
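Using the rough figures quoted here (they're the video's numbers, not fresh measurements), the dense-model scaling works out to roughly 3x across four machines, which is on the order of 75 to 80 percent per-machine efficiency:

```python
single    = 5.0    # tok/s, one Mac Studio, Llama 3.3 70B FP16
two_rdma  = 9.0    # tok/s, two machines, tensor sharding over RDMA
four_rdma = 15.5   # tok/s, all four machines

print(two_rdma / single)               # ~1.8x on two machines
print(four_rdma / single)              # ~3.1x on four machines
print(four_rdma / single / 4 * 100)    # ~78% per-machine scaling efficiency
```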
How about Kimi K2? I talked about that a bunch. Well, in the Instruct version of it in a 4-bit quant, well, you can't
actually run this one on just one
machine. Even at four bits, it's nearly
600 GB. So, our baseline is going to be
pipelining the model over two of the
machines. That's essentially the same
results we saw before when we were on
just Ethernet. And that yielded about 25
tokens a second. And then when we split
it across all four machines, nearly 35
tokens per second. The scaling is
nowhere near as good as a dense model,
but it is still a large performance
increase. And it also shows the time to first token going from around six and a half seconds down to one and a half.
That's the kind of latency your users or
whatever have to wait for before they
even start to see a response. The
problem is those models move around a
lot in their total parameter pool which
means they do a lot of smaller
calculations at least relative to a
dense model and that adds a lot of
overhead at least in the current
software setup. Now, the Exo guys say
they are working on optimizing this, but
I expect based on what they briefed me
with that the results we see on those
are not going to be anywhere near as
good. But it's kind of hard to complain
because what's your alternative at this
point? I guess you could buy like eight
Nvidia DGX Spark units, but then it
would draw way more power. You would
only have the same amount of VRAM as one
of these Mac Studios, and it would cost $32,000.
>> Let me show you what I mean. Starting
with DeepSeek 3.1 in an 8-bit quant
running on two of these Macs. For some
reason, it doesn't have the option to
run this model pipelined across multiple machines just for the VRAM improvement, not for
any performance improvement. So, I don't
have a comparison point less than this.
But, we can at least try two versus four
in this configuration and see what that
looks like. Here's what I'm talking
about. DeepSeek V3.1 with 671 billion parameters total only has 37 billion parameters actually active when you're inferencing. It's only like 5%.
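For what it's worth, that active fraction checks out against the quoted parameter counts:

```python
total_params_b  = 671   # DeepSeek V3.1 total parameters, in billions
active_params_b = 37    # parameters active per token, in billions

print(active_params_b / total_params_b * 100)  # ~5.5% of the model per token
```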
It took 3 1/2 seconds for us to get the
first token and it's going at around 20
tokens per second. Now let's boot it up
on all four. See what I mean? Instead of
our pretty linear scaling that we were
seeing before, we went from getting 20
tokens a second to getting 24 tokens a
second. Sure, the time to first token
dumped a lot from three and a half down
to under two, but you can see how much
less efficient it is to parallelize
these mixture of expert models than it
is to parallelize a dense model, at
least for now. So clearly, the
performance is pretty awesome. And with
future optimization, in theory, we'll be
able to get even further on the same
hardware again, which is so cool to
think about. However, what do you use
this for, right? Because if you're using
those fancy OpenAI models or Google models or Claude or whatever, you can use them with Claude Code. You can use them in Cursor, you can use them in VS
Code. However, we don't really have that
infrastructure for this setup.
Fortunately, Exo does provide an OpenAI-compatible API that you can interact with your models through directly, even when you have multiple loaded. Now, that was something that didn't work when I first tried this, and after their bajillion updates it's working a lot better. So you can have more than one model loaded at the same time.
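Because the endpoint is OpenAI-compatible, any standard OpenAI client should in principle be able to talk to the cluster. Here's a minimal sketch using the openai Python package; the base URL, port, and model name are placeholders I made up, not Exo's actual defaults, so check the Exo docs for the real values:

```python
from openai import OpenAI

# Point a stock OpenAI client at the local cluster instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # local servers typically ignore this
)

response = client.chat.completions.create(
    model="llama-3.3-70b",  # placeholder: whatever model the cluster has loaded
    messages=[{"role": "user", "content": "Write me a 200-word story about leaky BMWs."}],
)
print(response.choices[0].message.content)
```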
However, I tried a bunch of different AI coding agents, and none of them seemed to talk to it properly. OpenCode was the closest I got, where it actually was giving me good responses. However, the tool use just, like, didn't work at all. The Exo guys told me that they're working on that or whatever. I mean, you could at least use it in Jan, like, "Hey,
generate me a website." And I mean, it
it does. It even says you can just copy
paste this, bro. We can just take this
and paste it into a website and then
open it up and subscribe to my YouTube
channel. Duh. Wait a... Why did it pick Ludwig? What the... Oh, I wasn't subscribed to Ludwig. Ah, but I don't want it to, like,
go directly into a file or read files
and have that context and it's just not
quite there yet. You know what I'm
interested to see though is how much
power it's actually drawing during this.
So, let's ask it to write a much longer
story. It's drawing around 115 to around
125 watts. Put that all together and
we're at around 480 watts for a mixture
of experts model where it's really not
making perfect use of all that hardware.
Let's switch back to Llama for a second
and see what the power draw is like when
we're running that where it actually
like really can make good use of the
hardware. So much... Oh, it failed. What?
>> It's delusional.
>> They're using some like very beta
features that Apple added to sync the
CPU and GPU when running an MLX model
like this. And apparently it's not the
most stable thing in the universe. So,
I'm going to close Exo on all the
machines, start back up again, and then
we'll try. Oh, that's... that's watting up
the hoo-ha. That's a weird thing to say.
With dense Llama 3.3 70 billion, our
power draw is more like 600 watts.
That's actually less than one H200 Nvidia GPU, which can do around 50 tokens a second at FP8 instead of FP16. So, half the size of the model.
So, why don't we try that across our
cluster? Now, for reference, an H200 is
somewhere in the neighborhood of $30,000
US. Yep. 26 tokens a second. So, we're
just about half. It's just really cool
to be able to see what we can do with
existing hardware and a little or
frankly probably a lot of software
optimization. This is the same hardware
you could have bought all the way back
in March of this year and now it's
getting this much performance uplift.
It's really, really freaking cool. I'm
not going to lie and say there isn't a
part of me that goes, I wish the reason
we were pushing for optimizations like
this wasn't for something that so many
people use for bad stuff. But I've been
trying to reconcile in my head that just
because there are bad things that people
do with AI doesn't mean there aren't also good things. My first thought when they
offered me the opportunity to try this
setup out was, "Dude, I should use this
to replace my Alexa and build like
freaking the most insane home voice
assistant ever." But unfortunately, the
apps that are out there for connecting
Home Assistant to an OpenAI compatible
API just didn't seem to work for me. So,
I don't know. How about get subscribed? Because I've got these for like two more
months and that means I can spend some
time making that work and then maybe
I'll just have to buy a couple. I mean,
I don't need to run a 700 gigabyte model
for my voice assistant, but... And while you're down there, don't forget to check
out boot.dev, who sponsored this video.
It's a fantastic tool to gamify learning
how to code. You can also hit the like
button while you're down there and
comment. Let me know what you would do
with this much absolutely balling compute.