Apple did what NVIDIA wouldn't. | jakkuh
Summary
Core Theme
This content explores the groundbreaking ability to run massive AI language models locally on a cluster of four Mac Studios, leveraging new software and hardware advancements like RDMA over Thunderbolt and Exo 1.0, offering a powerful and cost-effective alternative to cloud-based AI services.
What have I got myself into this time?
Whether you're for or against AI, it's
hard to deny that at least on the
hardware and engineering side of things
that it's super interesting. I mean,
sure, you can pay for a subscription to
OpenAI or Google with Gemini for the privilege of spoon-feeding them your
data in exchange for some AI goodness,
but you can also be like all the cool kids on the playground (hell, even PewDiePie is doing it these days) and run your AI models locally. Yes, even the
computer or phone that you're watching
this video on can run a large language
model. But you know what can AI better?
Not one, not two, but precisely four
Macintoshes. Good for a combined total of 1.5 terabytes of unified memory to run the
beefiest and absolute baddest large
language models out there without having
to give Daddy Altman any of our data. If
you're a hardware nerd or you're into
AI, get subscribed. Duh. But you've also
probably seen this before. But I promise
you, there is something huge that's new.
In fact, two things. This Mac Studio
cluster should be multiple times faster
than it was even just yesterday. All
while still consuming less than around a
half of a North American power circuit
and while realistically costing multiple
times less than a solution that can run
models in the same class as this. It's
basically in a class of its own given
the price and the capabilities at this
point. It's kind of ridiculous. First is
Apple's sneaky release of RDMA over
Thunderbolt in the recent macOS 26.2
beta. You probably wouldn't think much
of it, but that is going to enable most
of the performance increases we're
seeing today. And second, the public
release of Exo 1.0. That's the software
you can use to take four Mac Studios
like this and turn them into the AI
cluster of your freaking dreams. But
we're getting a bit ahead of ourselves
here. See, if you've tried to run local
LLMs and you don't have hardware this
crazy, you've probably noticed that they
seem, well, for lack of a better word,
kind of dumb or at the very least less
intelligent than when you talk to ChatGPT or Gemini. Don't get me wrong, those
small models have their uses. You can't
exactly fit a Mac Studio amount of
performance or memory for that matter in
a security camera or a smartphone. It
makes sense that they can't compete with
the big services because the models that
you're using when you work with ChatGPT
are in the range of hundreds or
thousands of gigabytes. So large that
they need to run in data centers that
cost as much as the GDP of a small
European country to build and
realistically probably use more energy
than those European countries too. But
you can also run models of that size on
these. Sorry, I'm just making space for
my supercluster. I hope you don't mind.
Who did this brother? You see, Apple
designed their M series of silicon with
unified memory that's shared between
both the CPU and graphics cores, which
enabled them on the launch of the M3
Ultra chips that are in these Mac
Studios to release a SKU with a
whopping 512
GB of RAM, which is half of the units we
have here. The other half have 256 gigs.
Both of which were enough to run some
insane AI models. The kinds of models
that are one-shotting code challenges,
and they're only getting better, which
is exactly why you, software engineer or
total noob, should be checking out
boot.dev, who sponsored this portion of
today's video. They've cracked the code,
if you will, on making programming fun
and engaging to learn by basically
turning it into a video game. I know it
sounds crazy, but imagine this. You're
learning to do back-end web development
by learning Python and SQL and Go, and
at the same time, you're earning XP.
You're leveling up and you're completing
quests. Heck, you're even doing boss
fights all while learning. They have
courses for developers of all experience
levels with a curriculum that goes super
in-depth while still being easy and
engaging to follow. And their courses
even teach you the tools of the trade
while you're working through them. So,
you learn how to use Linux and Git and
Docker. It's kind of crazy, but it's not
as crazy as the fact that they give you
all the course material, the lessons,
the video tutorials, and the starter
code on their website for free. So, you
can start to learn how to build a real
web project without any risk to your
wallet. And if their interactive
learning style works for you, like it's
been working for me, then you should use
code jakkuh to get 25% off an annual plan
that gives you all of those fantastic
interactivity features like hands-on
coding, AI assistance, progress
tracking, all the gamification for all
of their courses at the link down in the
description. It's genuinely freaking
sweet. Love those guys. Like the 70
billion parameter Llama 3.3 model in
full FP16. Let's see how fast that goes
across one of our Mac studios. Can you
write me a thousand-word story about how
cool BMWs are? Include discussion of oil
leaks. Oh god. Took about 2 seconds for
the initial response and we're running
at a rate of about five tokens per
second, which is not the crazy fastest
thing out there, but it's not an easy to
run model at 150 GB on its own. But now
we've got models like DeepSeek, where they're 700 GB, or Mistral 3 Large and Kimi K2 in excess of a terabyte in
their full weights, which is crazy. But
we can handle that with this setup.
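As a rough sanity check on those sizes: a model's weight footprint is roughly its parameter count times the bytes stored per parameter, and quantization just shrinks the bytes per parameter. A minimal sketch in Python, using the parameter counts mentioned in the video (the roughly 1-trillion figure for Kimi K2 is an assumption on my part, and real quants land a bit higher because some layers stay at full precision):

```python
def model_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Weight-only footprint: parameters x bytes per parameter (decimal GB)."""
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

print(model_size_gb(70, 16))    # Llama 3.3 70B in FP16      -> ~140 GB
print(model_size_gb(671, 8))    # DeepSeek V3.1 in 8-bit     -> ~671 GB
print(model_size_gb(1000, 4))   # ~1T-param Kimi K2 in 4-bit -> ~500 GB

# Cluster memory: two 512 GB Mac Studios plus two 256 GB ones.
print(2 * 512 + 2 * 256)        # 1536 GB, i.e. the ~1.5 TB of unified memory
```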
Right now, I've got 10 Gb Ethernet
plugged into all of them, and I'm going
to boot up Exo on the rest, and it
should automatically form a cluster for
us. And then we'll see how much faster
we can make this go just on the same
model. And then we'll try one of the
bigger ones. Die, my child. Now, if we
start Exo on the rest of our Macs, now
we have a cluster, baby. Even though we
don't need the extra memory from the
other machines, we can make use of the
compute on the other machines and have
that same 70 billion parameter model run
way faster, at least in theory. Now,
before we try all four of them, I'm
going to start by loading that same
Llama 3.3 model on two of our machines
and see how much of a speed up that
yields on its own. And then we can go
into rocking it on all four. All right,
write me a thousand-word story about
leaky BMWs. Is it going to go any
faster? That is effectively the same
speed. 2 seconds to start and five
tokens a second. That's kind of weird,
huh? We are running it across both of
them. You can see that we've got 100
watts of power draw on each Mac Studio
and the model is spread across the
memory of both of them. But what if we
try and run it on all four? Well, we're
still talking about five tokens a
second. What's the problem? Now, you can
imagine the data of our model as a book.
And when we load that book across the
cluster, each machine gets its own set
of chapters, but it doesn't necessarily
have access to the entire book. That
means each computer has to calculate its
response to your prompt with the
chapters it has access to and then pass
that response or the partial response on
to the next computer in the cluster
before the next computer can even do
anything. It's just like a relay race
where you run your section but you have
to hand off the baton before the next
person can start. Except for our
situation, on each lap that the cluster
completes, we only get one output token,
which you can imagine is around one
letter in the output. In our case, the
racetrack is essentially the networking
that we're connecting all these machines
with. And you can see right now we're
just using 10 gig Ethernet with a 10 Gb UniFi Pro XG8 switch, which I'll have linked down in the description along with all the other hardware we showed off today. That 10 Gb networking is plenty of bandwidth to transfer what is essentially part of a letter. Our
hypothetical relay runners here are so
fast that forcing them to talk over
standard Ethernet is like the networking
equivalent of forcing them to go through
airport security every time they need to
pass on that baton. So you can imagine
when you're asking it to write a
thousand-word story, for each letter
within those thousand words, our cluster
here is effectively going through TSA at
least four times. That works out to a
metric shitload of completely random
searches, and it's why, in the enterprise, they use InfiniBand, which has much lower latency.
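To make the relay-race picture concrete, here is a minimal sketch of single-request pipeline sharding. The numbers are purely illustrative assumptions, not measurements; the point is just that the compute per token doesn't shrink when you add machines, while every network hop gets paid on every single token:

```python
def pipeline_tokens_per_second(num_machines: int,
                               compute_ms_per_token: float,
                               hop_latency_ms: float) -> float:
    # All the layers still run for every token, just spread across machines,
    # so total compute time stays the same; each hop is pure added overhead.
    per_token_ms = compute_ms_per_token + num_machines * hop_latency_ms
    return 1000.0 / per_token_ms

# Hypothetical figures: ~200 ms of compute per token (about 5 tok/s on one
# box), with a slow hop versus a near-zero-latency one.
print(pipeline_tokens_per_second(1, 200, 0))    # ~5.0 tok/s on one machine
print(pipeline_tokens_per_second(4, 200, 10))   # ~4.2 tok/s: slow hops hurt
print(pipeline_tokens_per_second(4, 200, 0.1))  # ~5.0 tok/s: hops barely matter
```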
Now, you could already do better than this simply by using Thunderbolt cables to directly connect all the Macs together, rather than using Ethernet with the switch; it's just a little bit lower latency on Apple's stack. Or we can bypass this hypothetical security problem entirely by using RDMA. It's kind of the same deal: we're going to hook up all of
our Macs directly with Thunderbolt, but
then we need to drop into the recovery
menu of our Macs and actually enable it.
Now, you do need the macOS 26.2 beta,
and you have to be using a machine that
has Thunderbolt 5. That includes
anything M4 Pro or better, as well as
the M3 Ultra. Now, I'm going to hook
this up exactly as it says in the guide,
which honestly is a little strange. How
does this even work? So, uh, this guy.
Um, how am I missing a cable? 1, 2, 3... 5.
Where's the sixth spaghette? They're all
connected like that. Woohoo. Now, with three Thunderbolt 5 connections on each of these systems, we're talking about an aggregate connectivity from each of them of around 240 Gbit both ways. But we're not going to make use of anywhere near that.
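That 240 Gbit figure is just the per-port bandwidth added up: Thunderbolt 5 is nominally 80 Gbps in each direction per port, and each Mac Studio here has three ports cabled to its peers.

```python
ports_per_machine = 3   # Thunderbolt 5 ports cabled to the other Macs
gbps_per_port = 80      # nominal Thunderbolt 5 bandwidth, each direction

print(ports_per_machine * gbps_per_port)  # 240 Gbit/s aggregate, each way
```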
The real story here is that once we have RDMA enabled, the latency cuts down by 99%, which is huge. Hey, Arlo, do you like fast AI? Oh, boop. Got them. Now you see we can take the same Llama 3.3 70 billion parameter model. Select MLX RDMA
and tensor sharding. Yeah, Arlo. Hey, I
don't even know if the RDMA is going to
work right off the bat. Who knows? Maybe
I cabled it up wrong. The time to beat
is five tokens a second. Write me a
200-word story. Holy... Nine tokens a
second. That's almost double the speed
we had running on just one. I'm a little
bit in shock. That was a bit of a
dumpster fire getting that all going.
Okay, why don't we try something bigger
like the Kimi K2 Instruct model in a 4-bit quant. Your what? My quant. 540 GB. For a baseline, I'm going to start
with the standard pipeline MLX ring
sharding so that we get a good idea of
what it would be like before we use
RDMA. Once we know how fast that goes,
we'll try it with RDMA. This is so
freaking cool, man. I'm excited to try
the new Mistral models for coding
because they apparently are like nice.
>> I suspect it's not going to go too fast,
but we'll give her a shot. Oh, yeah.
Okay, look at that. This is over
Ethernet right now, but it's only
transmitting like 20 megabit, 25
megabit. That's it. And uh we're getting
around 22 tokens per second. I'm Uncle
Jim and I hate oil stains the way cats... Now, let's try RDMA and see how fast it
actually goes. Only then it crashed, it
seemed like. Fast forward, oh my god,
like 12 hours of troubleshooting and
messing about trying to get this setup
to work as I was thinking it should
work. And it seems like we're finally
there. It wasn't just my changes. They
actually have made something like 40
different version updates in the last 2
days just to get this working a little
bit more stable. We're definitely at the
frontier of this technology. There
doesn't really exist another solution
for this, as Jeff Geerling so nicely
highlighted in his Framework Desktop
video. I also found out that my Macs were named differently than they asked; they were expecting them to be named as the cabling image said. In the future, that's a limitation that should disappear, but since this is definitely tech demo territory, there are some weird quirks with it like that. I
also am not able to upload any of my own
custom models into this setup. It's very
much a set of models that they have
pre-curated and determined work properly
with this setup. They also have to be
models that are in the MLX format for
this to work properly on Apple devices
with RDMA. However, now we can see how fast this goes. You can fit so many AIs in this bad girl right here, I tell you what. And same with this one, and that one over there, and this one over there.
As I showed you guys before, one of
these Mac Studios with Llama 3.3 70
billion parameters in the FP16 weight
that I've been using takes about two to
two and a half seconds to start and goes
around five tokens per second for
inferencing. However, when we run it on
two of our Mac Studios with tensor
sharding over RDMA, well, that jumps up
to more like nine tokens a second.
However, it's worth noting that that is
a dense model and that means we get
really good scaling even though it seems
kind of counterintuitive. And there you
have it. We're up to just over 1 second
to get started and around 15 and a half
tokens per second. 3.25 times faster.
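Using the rough figures quoted here (they're the video's numbers, not fresh measurements), the dense-model scaling works out to roughly 3x across four machines, which is on the order of 75 to 80 percent per-machine efficiency:

```python
single    = 5.0    # tok/s, one Mac Studio, Llama 3.3 70B FP16
two_rdma  = 9.0    # tok/s, two machines, tensor sharding over RDMA
four_rdma = 15.5   # tok/s, all four machines

print(two_rdma / single)               # ~1.8x on two machines
print(four_rdma / single)              # ~3.1x on four machines
print(four_rdma / single / 4 * 100)    # ~78% per-machine scaling efficiency
```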
How about Kimi K2? I talked about that a bunch. Well, in the Instruct version of it in a 4-bit quant, well, you can't
actually run this one on just one
machine. Even at four bits, it's nearly
600 GB. So, our baseline is going to be
pipelining the model over two of the
machines. That's essentially the same
results we saw before when we were on
just Ethernet. And that yielded about 25
tokens a second. And then when we split
it across all four machines, nearly 35
tokens per second. The scaling is
nowhere near as good as a dense model,
but it is still a large performance
increase. And it also shows the time to first token going from around six and a half seconds down to one and a half.
That's the kind of latency your users or
whatever have to wait for before they
even start to see a response. The
problem is those models move around a
lot in their total parameter pool which
means they do a lot of smaller
calculations at least relative to a
dense model and that adds a lot of
overhead at least in the current
software setup. Now, the Exo guys say
they are working on optimizing this, but
I expect based on what they briefed me
with that the results we see on those
are not going to be anywhere near as
good. But it's kind of hard to complain
because what's your alternative at this
point? I guess you could buy like eight
Nvidia DGX Spark units, but then it
would draw way more power. You would
only have the same amount of VRAM as one
of these Mac Studios, and it would cost $32,000.
>> Let me show you what I mean. Starting
with DeepSeek 3.1 in an 8-bit quant
running on two of these Macs. For some
reason, it doesn't have the option to
run this model pipelined across multiple machines just for the VRAM improvement, not for
any performance improvement. So, I don't
have a comparison point less than this.
But, we can at least try two versus four
in this configuration and see what that
looks like. Here's what I'm talking
about. DeepSeek V3.1 with 671 billion parameters total only has 37 billion parameters actually active when you're inferencing. It's only like 5%.
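For what it's worth, that active fraction checks out against the quoted parameter counts:

```python
total_params_b  = 671   # DeepSeek V3.1 total parameters, in billions
active_params_b = 37    # parameters active per token, in billions

print(active_params_b / total_params_b * 100)  # ~5.5% of the model per token
```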
It took 3 1/2 seconds for us to get the
first token and it's going at around 20
tokens per second. Now let's boot it up
on all four. See what I mean? Instead of
our pretty linear scaling that we were
seeing before, we went from getting 20
tokens a second to getting 24 tokens a
second. Sure, the time to first token
dumped a lot from three and a half down
to under two, but you can see how much
less efficient it is to parallelize
these mixture of expert models than it
is to parallelize a dense model, at
least for now. So clearly, the
performance is pretty awesome. And with
future optimization, in theory, we'll be
able to get even further on the same
hardware again, which is so cool to
think about. However, what do you use
this for, right? Because if you're using
those fancy OpenAI models or Google models or Claude or whatever, you can use them with Claude Code. You can use them in Cursor, you can use them in VS
Code. However, we don't really have that
infrastructure for this setup.
Fortunately, Exo does provide an OpenAI-compatible API that you can interact with your models through directly, even when you have multiple loaded. Now, that was something that didn't work when I first tried this, and after their bajillion updates it's working a lot better. So you can have more than one model loaded at the same time.
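Because the endpoint is OpenAI-compatible, any standard OpenAI client should in principle be able to talk to the cluster. Here's a minimal sketch using the openai Python package; the base URL, port, and model name are placeholders I made up, not Exo's actual defaults, so check the Exo docs for the real values:

```python
from openai import OpenAI

# Point a stock OpenAI client at the local cluster instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # local servers typically ignore this
)

response = client.chat.completions.create(
    model="llama-3.3-70b",  # placeholder: whatever model the cluster has loaded
    messages=[{"role": "user", "content": "Write me a 200-word story about leaky BMWs."}],
)
print(response.choices[0].message.content)
```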
However, I tried a bunch of different AI coding agents, and none of them seemed to talk to it properly. OpenCode was the closest I got, where it actually was giving me good responses. However, the tool use just, like, didn't work at all. The Exo guys told me that they're working on that or whatever. I mean, you could at least use it in Jan, like, "Hey,
generate me a website." And I mean, it
it does. It even says you can just copy
paste this, bro. We can just take this
and paste it into a website and then
open it up and subscribe to my YouTube
channel. Duh. Wait a... Why did it pick Ludwig? What the... Oh, I wasn't subscribed to Ludwig. Ah, but I don't want it to, like,
go directly into a file or read files
and have that context and it's just not
quite there yet. You know what I'm
interested to see though is how much
power it's actually drawing during this.
So, let's ask it to write a much longer
story. It's drawing around 115 to around
125 watts. Put that all together and
we're at around 480 watts for a mixture
of experts model where it's really not
making perfect use of all that hardware.
Let's switch back to Llama for a second
and see what the power draw is like when
we're running that where it actually
like really can make good use of the
hardware. So much... Oh, it failed. What?
>> It's delusional.
>> They're using some like very beta
features that Apple added to sync the
CPU and GPU when running an MLX model
like this. And apparently it's not the
most stable thing in the universe. So,
I'm going to close Exo on all the
machines, start back up again, and then
we'll try. Oh, that's... that's watting up
the hoo-ha. That's a weird thing to say.
With dense Llama 3.3 70 billion, our
power draw is more like 600 watts.
That's actually less than one H200 Nvidia GPU, which can do around 50 tokens a second at FP8 instead of FP16. So, half the size of the model.
So, why don't we try that across our
cluster? Now, for reference, an H200 is
somewhere in the neighborhood of $30,000
US. Yep. 26 tokens a second. So, we're
just about half. It's just really cool
to be able to see what we can do with
existing hardware and a little or
frankly probably a lot of software
optimization. This is the same hardware
you could have bought all the way back
in March of this year and now it's
getting this much performance uplift.
It's really, really freaking cool. I'm
not going to lie and say there isn't a
part of me that goes, I wish the reason
we were pushing for optimizations like
this wasn't for something that so many
people use for bad stuff. But I've been
trying to reconcile in my head that just
because there are bad things that people
do with AI doesn't mean there aren't also good things. My first thought when they
offered me the opportunity to try this
setup out was, "Dude, I should use this
to replace my Alexa and build like
freaking the most insane home voice
assistant ever." But unfortunately, the
apps that are out there for connecting
Home Assistant to an OpenAI compatible
API just didn't seem to work for me. So,
I don't know. How about get subscribed? Because I've got these for like two more
months and that means I can spend some
time making that work and then maybe
I'll just have to buy a couple. I mean,
I don't need to run a 700 gigabyte model
for my voice assistant, but... And while you're down there, don't forget to check
out boot.dev, who sponsored this video.
It's a fantastic tool to gamify learning
how to code. You can also hit the like
button while you're down there and
comment. Let me know what you would do
with this much absolutely balling compute.