I got a desktop supercomputer? | NVIDIA DGX Spark overview | Tim Carambat
Video Transcript
Hey everybody, Timothy Carambat, creator and founder of AnythingLLM. And today I'm going to do a different style of video.
Now, usually the videos I do are all
about anything LLM or just like AI tech
in general. And you know, I'll like run
some models, do some tests, highlight a
cool feature that we built, something
like that. Today, I actually want to do
a little bit of a different review and
just see how that goes, honestly, cuz
why not change it up? So, for my first
video of this kind of format, uh, we're
going to start off with a bang. I
recently got access to Nvidia DGX Spark.
Um, it's right here.
So, I've had access to this for about a
week and a half now, and I've actually
been using it as my daily driver. So,
you know, my day-to-day job, right, is
making anything LLM better for you.
Doing that, I usually do it on a
MacBook, but I have a whole bunch of
other computers, so I can test it on
everything. And now I have a DGX Spark, which is actually running DGX OS, a version of Ubuntu 24, so it's very familiar. It feels a lot like a Mac if you've used Ubuntu before, and someone's going to get mad at me for saying that, I'm sure. But I want to jump into a review of this. No BS, we're just going to get to it. So, as I said, for the last week and a half or so I've been actually using this as my daily driver.
Personally, I'm impressed. It's a lot of fun and it's just really cool to use, because even as someone who has an NVIDIA GeForce RTX 5090, this is still cool. It's supplemental to that, not a replacement. I'm going to get into that more in the video, but let's talk about the unboxing right now. So, when you get it, this thing comes in a pretty hefty box and you just slide it on up. All the chargers and stuff are on the bottom, but what you get is something that looks like this. So
immediately you're going to want to pick
this thing up. And you're going to
notice that while it feels very sturdy,
it is actually pretty light. It's 1.2 kg, or, if you're in freedom units, a shave over 2 1/2 lb. Dimensionally, it's 150 x 150 x 50 mm, which is about 6 by 6 by 2 inches, again in freedom units. The first thing I noticed about this was the color. It might be hard to pick up on camera, and I'm not sure if that's even focusing, but it is this nice gold color, and there are two immediate things that come to mind from my childhood: one was the gold Game Boy Color, and the other was, I think, the Nintendo 64 Zelda Ocarina of Time cartridge, which was about the same kind of color. There's no sparkle in it or anything like that, but it's just such a cool color. So, the first time I think this got mentioned was actually at CES this year, where they talked about something called Project Digits. Well, they renamed it, and it's now called the DGX Spark. And you're obviously wondering, because it is NVIDIA, what is in this thing? This is not a hardware review channel, so I'm just going to give you the high-level stuff. In here is the NVIDIA
GB10 Grace Blackwell Superchip. This is a unified memory kind of system, and there is 128 GB of LPDDR5X memory in here. This particular model has 4 TB of storage, which is plenty. And of course you have a 20-core ARM-based CPU, which is really great because of power draw concerns; this thing sips power. I wish I had more metrics on that. I didn't have an ammeter, but it is ARM-based, and I do know it is drawing less power from the stats that I can collect. Of that 128 GB of unified memory, I believe 96 GB of it can be allocated specifically to VRAM, although I don't know if that can be unlocked; I'm sure someone will find a way. And when it comes to memory bandwidth, it's 273 GB per second. This actually allows you to run models up to 200B, depending on the quantization, obviously. And then what you really get is about one petaflop of FP4 AI performance. If you aren't an AI model nerd, you don't care what that means. But if you're an AI model nerd, you probably care about what this is.
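To put those numbers in perspective, here's a rough back-of-the-envelope sketch in Python. The 96 GB budget comes from the allocation mentioned above, and the bits-per-weight values are illustrative assumptions, not measured limits:

```python
# Rough capacity estimate: how many parameters fit in a given memory budget
# at a given quantization level. KV cache, activations, and runtime overhead
# are ignored, so treat these as upper bounds, not guarantees.

def max_params_billions(memory_gb: float, bits_per_weight: float) -> float:
    bytes_per_weight = bits_per_weight / 8
    return memory_gb / bytes_per_weight  # GB / (bytes per weight) = billions of weights

budget_gb = 96  # portion of the 128 GB unified memory reportedly usable as VRAM

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{max_params_billions(budget_gb, bits):.0f}B parameters")

# 16-bit: ~48B, 8-bit: ~96B, 4-bit: ~192B, which lines up with the
# "models up to 200B depending on the quantization" claim above.
```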
As I mentioned before, it comes with Ubuntu 24 LTS with a lot of preloaded software, the stuff that you need to build and run AI tools or run fine-tuning jobs. A lot of the default stuff's already in here. So, you've got the NVIDIA Container Toolkit, you've got nvidia-smi, basically any tool that you would need. All of that is just pre-installed, which is so nice because you don't have to install it at all.
So, what are we going to showcase today with this computer that was built from the ground up to build and run AI tools like AnythingLLM, but also for fine-tuning and all these other things? Today specifically, first I'm going to show you around the OS; it's very familiar if you've used Ubuntu. Then we're going to actually use AnythingLLM and some other tools that can run natively on this hardware to show some models running, benchmark them, and get an idea of tokens per second. And then I'd also like to show a pretty realistic fine-tuning example, where we're going to use a midsize model like Gemma 3 4B to make a fine-tune for some specific use case.
There are two ways to run this. When you get the manual for your DGX Spark, it's going to give you two configurations. You can use it basically as a desktop: you plug in an HDMI cable and whatever your other peripherals are, and you just use it like a computer. It has a whole setup process; if you've used Ubuntu, it's pretty much exactly like that. But then there's another mode, which I think is also interesting, where you can use it as a networked device. So you could have it centralized in your office or in your house and use it as a dedicated compute machine for AI workloads, which is the next thing I'd like to get to: AI workloads specifically.
There have been a lot of criticisms, or I don't even want to say a lot, and I also don't even want to call them criticisms. It's just people, I guess, talking about this on Reddit, saying things like it's supposed to replace a Mac mini. This is not that. This is an additional compute resource that you can use to free up whatever you're using already. So, like, I have a GPU on my computer; I can continue to use that and offload work to this dedicated device. People have home labs with Mac minis strung together. This is not a Mac mini personal computer replacement, but it is for the home lab use case where people have been chaining them together. In fact, you can stack two of these on top of each other, and there's a big connection port in the back that you can use to chain them together. And so you can actually get double the output, which is really cool; you can run really large models at actually a good quant. And of course, because this is the DGX kind of OS, if you do, for whatever reason, have access to a $350,000 H100 server, the code you write on this for your apps or whatever jobs you're running, you can actually just use on a server. As you can tell from the background of my video, I do not have one of those servers. If you're building a home lab dedicated to AI workloads, which I see all the time on r/LocalLLaMA, this is a reasonable device. Now, it depends on what your price range is, but I've seen some really expensive home lab setups, and I think this is actually in a reasonable price range.
And on that note, I do want to say that
the one I have, which is very clearly
labeled here, is early access. So, the
stuff that I'm getting, the results that
I'm getting, uh, could be better, could
be worse. They might just be different.
Um, but just something to highlight
there. And I just want to take a quick
little sidebar. Uh, for those of you who
don't know, my background is actually
mechanical engineering. Before I got
into the whole founder software thing, I
was a mechanical engineer. And this
thing has just a couple interesting
design highlights. And I'm going to
actually pull in the zoom here. Uh so we
can go over these kind of details. So
looking at the front of the device, uh
you can notice that, you know, there's a
little bit of these polished kind of
areas right here that also expose some
vents. But one thing you may have
noticed is this very interesting
material choice on the front of the
device. This looks like some kind of open-cell metal foam, probably aluminum, but it's actually an air intake, and it's just a really interesting material choice and design decision in general. I personally really like it. It's also not rough to the touch; it doesn't have any burrs or snags, so you can handle it pretty comfortably. And I imagine stacking two of them would look really cool. The bottom of the device is really nothing you wouldn't expect.
So, of course, you've got your kind of
grip here to keep it from sliding around
on surfaces as well as an additional air
intake. And on the back, we get that same open-cell metal foam finish again, but this is also the exhaust; you can feel air coming out of there. You have your power button, and my first complaint about this device is that it has no power light. You have no idea if this thing is running. So what I've been doing is putting my hand in front of the front vent to feel for any kind of suction, or just putting my ear up to the device and listening for the whirring sound. You have USB-C ports: the first one is for your power, and then you have three additional ones. I personally have no USB-C peripherals, so I had to buy some converters off Amazon for about $6. And then you've got your standard HDMI port, and you have your Ethernet port. It comes with Wi-Fi and Bluetooth. These are the specialized ports that can be used to stack two DGXs together. Now, for the next part of this video, we're going to get into the software side of things. We're going to run GPT-OSS 120B.
Yes, the big one. Then, of course, we're going to jump into that simple fine-tuning use case just to get an idea of the times for that. So when you first boot up your DGX Spark, you're going to be greeted with a screen that probably looks a lot like this. And you'll notice it looks and feels like Ubuntu, because it is Ubuntu. So if you're familiar with Ubuntu, you're already familiar with this, except it comes preloaded with some additional software as well as tools, and that's the really nice part. So, for example, you've got your regular stuff like your system monitor and calendar, you've got LibreOffice, that kind of stuff. You've also got this DGX dashboard, and you also have the NVIDIA AI Workbench, which is really nice, because if you do a lot of Jupyter notebook stuff, or data science, or even training models, the NVIDIA AI Workbench is a great tool for that. It just comes with your basic software; VLC is already included, and of course it comes with some cool backgrounds. This entire UI should feel very familiar.
Some of the very useful tools it comes with: nvcc is already installed, nvidia-smi already works, and you can see the driver version we're on; we're on the GB10-supported CUDA version 13. And then, of course, if you're interested in your GPU stats, nvtop is also already present. So there's just a lot you can see and do in here by default, without having to set up any additional software. I think people know that setting up all the CUDA libraries and toolkits and the stuff you need is always just another step to take.
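If you want to sanity-check what's already there yourself, a quick script like this does it; it's just a sketch that shells out to the CLI tools named above (nvcc, nvidia-smi, nvtop), nothing DGX-specific:

```python
# Quick sanity check that the preinstalled NVIDIA tooling is on the PATH.
# Purely illustrative: it only calls the CLI tools mentioned above.
import shutil
import subprocess

for tool in ("nvcc", "nvidia-smi", "nvtop"):
    path = shutil.which(tool)
    print(f"{tool}: {'found at ' + path if path else 'NOT FOUND'}")

# Print the driver / CUDA version line from nvidia-smi if it is available.
if shutil.which("nvidia-smi"):
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "CUDA Version" in line:
            print(line.strip())
```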
For the next part of this video, we're actually going to play around with some actual models. I already have Ollama installed, I'm on 12.5, and I actually already have some models installed; I have GPT-OSS 120B installed right now. Of course, you can always just run ollama run gpt-oss:120b, send it a simple message, and get some tokens back. But sometimes you'll want a little bit more verbosity, so you can say hello again and maybe get some stats this time. You can see we're sitting at around a 30 tokens per second rate. And chatting with a model through a CLI is fun and useful, sure.
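If you'd rather pull those numbers programmatically than eyeball the CLI output, here's a minimal sketch against the local Ollama HTTP API, assuming Ollama's default port and that the model is pulled under the gpt-oss:120b tag:

```python
# Minimal sketch: ask the local Ollama server for a completion and compute
# generation speed from the eval stats in its response.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gpt-oss:120b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=600,
)
data = resp.json()

tokens = data["eval_count"]            # tokens generated
seconds = data["eval_duration"] / 1e9  # reported in nanoseconds
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")
```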
But what most people do is use a tool like AnythingLLM, where you can set up workspaces, get access to agent tools, build your own tools in a flow builder, write your own code, use MCPs, search the web, and generate charts. You can do a lot just in AnythingLLM. And of course, we're going to do all of this by just hooking up to the Ollama instance that's already running and using that GPT-OSS 120B. AnythingLLM comes with its own internal Ollama if you don't have Ollama installed on your system, but you get the same experience no matter what. And so I
think one of the easiest things to show
is there's this website that has a bunch
of CSV files that are just good, you
know, sample CSV files to kind of show
proof of concept. CSVs are probably the most popular format people use AnythingLLM with, right alongside PDFs, for people who want these models to help them be productive. And so there's a fun one in here called the De Niro CSV, which has the Rotten Tomatoes ratings of Robert De Niro's movies. There are 87 records in it, so it probably isn't all of his movies, but it's definitely enough of them. So, what we can do in AnythingLLM is simply open up the downloads folder, drag and drop that file in, and just ask GPT-OSS 120B to analyze this data set. And if you've used GPT-OSS at all, you'll know that it loves tables. And so what we're going to see here is GPT-OSS work through all the data, formatting it in a way that makes sense and asking follow-up questions. These kinds of things are all part of the model's capabilities.
And so that is the analysis it gave us. It gave us a bunch of quick takeaways. We could ask follow-up questions, but I really don't know what else I would follow up with. I mean, there's even analysis as to why scores may have dropped and why some movies were poorly received. Obviously, I think everybody here knows that Taxi Driver and Goodfellas are amazing, and we definitely don't talk about those two movies. And more importantly, we are still sitting at that 30 tokens per second rate, so we're getting consistent performance across large outputs.
And of course, inside of AnythingLLM, we're able to do a lot more here: we can modify the system prompt, and we even have prompt variables where you can add dynamic data. There's a whole bunch you can do in this tool. And having a really capable model that can also run incredibly fast to do actual productive work is really, really nice. You can imagine putting this somewhere in your office or somewhere in your home and having it be your centralized inference service; that's really a reality with a device like this. If you are interested in AnythingLLM Desktop and what I showed you, you can always download it for free on your device today, and we do have an open-source repo that is available and MIT licensed. But this video is not about AnythingLLM; it's about the DGX Spark. So, in the next part of the
video, I'm going to talk about fine-tuning your own custom model on this hardware. I do want to preface this part of the video: not only is this the nerdiest part, because it is going to involve code and we're going to be in a Jupyter notebook, but this is also really where the DGX Spark shows its unique capabilities, because this is the stuff it was built for. What we're going to do is I have a Jupyter notebook already up, but first I want to talk about Unsloth. Unsloth are the people who made this notebook. Unsloth is an open-source project.
You can find them on GitHub. They'll be
linked in the description. And they have
built a custom training framework and even custom kernels for a bunch of different types of GPUs, all focused on tuning models in a more memory-efficient way. Even though we are on a DGX Spark, which has a lot of resources available, Unsloth has made it possible to tune models on even lower-end hardware. But since we have really good hardware,
we're going to actually use a midsize
model. So the model we're going to be
working with today is actually going to be
the Gemma 3 4B instruct model. This is going to be a use-case fine-tune, where we are going to do something with this model to make it better for what we specifically need. Just going over the high level of what this whole notebook is supposed to be doing for us: we are going to go through it step by step, and I will share a link to this exact file so you can run this fine-tuning as well. But the detail that really matters here is the data set. There is a data set here of basically IT support tickets, where a user is complaining about something, or a bug has occurred, or there was an issue, and then there is a recommendation to resolve it. Now, this particular data set is, I'd say, pretty generic, right? Maybe there's specific verbiage and processes outlined in this data set, but in real life you probably have some kind of use case, repetitive input/output pattern, or standard operating procedure that models generally don't capture. And the only alternative is to pollute them with a lot of context, which then gives you less room to put tokens toward the output. When you do fine-tuning on these models, you can have a model that is inherently just smarter about a specific domain.
This data set is just a very basic data set; I believe there are 500-something examples in it. You don't actually even need that much data. Here's one with a lot more, with 100,000, but it's more of a general Q&A, ChatGPT-answer kind of data set: explain a ternary operator, develop a lesson plan. These are questions that every model's base tuning is pretty decent with, I'd say very decent with nowadays. So training on that data set is not really going to move the needle for us or give us a different output than what we would expect. To get started, I already have
this Jupyter notebook running, and I'm running it on this DGX, obviously. So the first thing to do is to actually load the model in. And at the end of this, we're even going to have the opportunity to output the result as a GGUF, so you can export that file, load it anywhere you can run a GGUF, and run this fine-tuned model. We're not going to do that in this video, but I am going to show you the process and how it is legitimately one line of code.
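For reference, here's roughly what that loading step looks like. This is a sketch based on Unsloth's usual API; the exact model identifier, sequence length, and LoRA settings are assumptions, not something pinned down in the video:

```python
# Load Gemma 3 4B instruct through Unsloth. Model name and settings here are
# illustrative; adjust to whatever the notebook you downloaded actually uses.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",  # assumed identifier for the instruct model
    max_seq_length=2048,
    load_in_4bit=True,                   # 4-bit loading keeps memory usage low
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```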
And now that the model is loaded, we can
just keep going through these steps.
And then we're going to load this data set. And the most important part of a data set is formatting that data properly, so I've already gone ahead and written a little bit of code to do that. Here's an example of an input snippet, which is where the user and the agent have a discussion. And then what we want from our output is an analysis. What we want is to see this kind of output, where we begin with an analysis and then have a recommendation. And maybe in our internal system where this model runs, those headers mean something, and this is just the formatted output from the text conversation.
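That formatting code looks something like the sketch below. The field names (conversation, analysis, recommendation) are placeholders for whatever the actual data set uses; the point is just turning each record into a chat-style example rendered with the model's own chat template:

```python
# Turn each support-ticket record into a chat example: the user/agent
# conversation becomes the prompt, and the "Analysis / Recommendation"
# text becomes the target response. Field names here are hypothetical;
# `dataset` and `tokenizer` come from the earlier loading steps.
def to_chat_example(record):
    return {
        "conversations": [
            {"role": "user", "content": record["conversation"]},
            {
                "role": "assistant",
                "content": f"Analysis: {record['analysis']}\n"
                           f"Recommendation: {record['recommendation']}",
            },
        ]
    }

def to_text(example):
    # Render with the chat template so training matches inference formatting.
    return {"text": tokenizer.apply_chat_template(example["conversations"], tokenize=False)}

dataset = dataset.map(to_chat_example).map(to_text)
```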
So now, the first thing is to get the trainer ready. The trainer is basically taking our data set, tokenizing it, and getting us ready. Then we're going to tell our training process that we're only training on the responses, because that's the part of the data we actually care about.
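As a sketch, the trainer setup and the responses-only masking look roughly like this. The hyperparameters are illustrative, and the chat-template marker strings are assumptions (Unsloth's docs list the exact values per model family):

```python
# Set up the supervised fine-tuning trainer, then mask the loss so only the
# assistant responses contribute to training. `model`, `tokenizer`, and
# `dataset` come from the earlier cells.
from trl import SFTConfig, SFTTrainer
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)

# Only compute loss on the assistant turns (marker strings assumed for Gemma).
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
```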
And then let's just make sure that things look all right, which I've already gone through; these things look okay. But the main thing first: what does Gemma 3 even respond with to a basic scenario?
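To check that, a quick generation with the untrained model looks something like this; the prompt text is a stand-in for the scenario used in the notebook:

```python
# Ask the base (not yet fine-tuned) model about a sample ticket to see how it
# answers before training. The scenario text here is a placeholder.
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch Unsloth into inference mode

messages = [{"role": "user", "content": "User reports a black screen after the latest update."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```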
And you'll see that it kind of gives this generic answer, saying, "Oh, that's good information. Let's narrow it down. Uninstall the current version," and it's talking about Zoom, which, by the way, this scenario sample doesn't mention at all. So this is just the model hallucinating, essentially. You can see that it doesn't really give an answer that is specific to our desired output, especially with no analysis block. So let's train the model. Now,
this can take some time. Normally, you could use a service by Google called Colab, which lets you run these kinds of scripts on NVIDIA T4 GPUs. I can tell you from experience, because I've already done this in the cloud, that this particular training job, the exact code I'm running here, takes about 17 minutes to run on average on a T4; I've run it about four or five times just to get a good idea. So I'm going to let this run, and we'll see what the stats are to train on the NVIDIA DGX Spark.
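The actual training call and the timing readout are one more short cell, roughly like this:

```python
# Run the fine-tune and report how long it took. train() returns a TrainOutput
# whose metrics include the total runtime in seconds.
stats = trainer.train()

runtime_s = stats.metrics["train_runtime"]
print(f"Training took {runtime_s / 60:.1f} minutes")
```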
Okay, and you can now see that all 60 of our training steps for one epoch have run. That was on 504 pretty chunky samples. And of course, I cut through the video a little bit just to save time, but the one thing you can't cut is the actual time it took. As I said, on an NVIDIA T4 in Colab this would take about 17 minutes, but here it only took 4.3 minutes, which is awesome. That's an incredible time save, and it also lets us iterate much faster, which is great because when it comes to fine-tuning, you really don't know what you're going to get until you do it. So iteration speed is extremely important for anyone who's fine-tuned models.
But now we should be able to see: do we get the output we expect? So I took the same scenario that we had in the previous block, where we were experiencing a black screen, and we now get the model giving us the actual kind of output we want. We have an analysis block and a recommendation. Now, this specific format is not only from our data set, but so is the content. And I think that's the important part, because a lot of people would say, "Oh, why don't you just have a system prompt that says break it out like this?" It's not only the format. Sometimes it's the output, and most times it's actually both. For anyone who's fine-tuned models, that's a pretty obvious thing. But obviously, there's more than one scenario, so we can do this again on a different scenario that we haven't tested on before. And you can see that we still get that same format and that same kind of output. Now, of course, if we don't have direct output that we were trained on, the model is going to try to answer the question anyway, but still within its new domain expertise and while keeping that format.
So now we have a model that works the way we want. We took Gemma 3 4B, which is a great model for its base and gives an okay answer when we use it totally untrained, straight from Google. But when we apply just this 500-sample data set, we're able to get the answers the way we want, and now we have our own version of Gemma 3. Now, of course, you're going to want to take that and put it in different places; obviously, this specific model is not very useful if it's stuck in this notebook. And so to do this, you can just call model.save_pretrained_gguf. You can do it at F16, Q4, Q8, Q5, or a whole bunch at the same time. If you know this is a good model but you're going to want to offer it at different quality levels, great, you can do that. You can save all of those as fully compiled GGUFs and then take them and put them in your software of choice. And that software of choice could very well be AnythingLLM; we have a way to import GGUFs if you want.
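That one-liner export, in Unsloth terms, looks roughly like this; the directory name and the particular quantization choices are just examples:

```python
# Export the fine-tuned model as GGUF files at one or more quantization levels,
# ready to load into llama.cpp-based tools (AnythingLLM, Ollama, etc.).
model.save_pretrained_gguf(
    "gemma-3-4b-support-tickets",                   # output directory (example name)
    tokenizer,
    quantization_method=["f16", "q8_0", "q4_k_m"],  # pick one or several
)
```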
That's the software overview and demo for the DGX and, specifically, this fine-tuning use case. It is definitely worth saying, though, that this is the version NVIDIA will be selling from their website. From what I understand, there are other OEMs out there who are going to be utilizing this GB10 chipset but in their own OEM form factors. So just know that this is not the only way this device and its hardware may be presented to you if you're interested in them; I'm sure, like we see with graphics cards, there will be different form factors and packages and all of that stuff. So hopefully this
demo of the early access NVIDIA DGX Spark was useful and gave you some insight into some practical uses. I mean, we did run AnythingLLM just to get some benchmarks on some models, and we did do some fine-tuning just to showcase that there are performance improvements there. It's just a nice dedicated device. I think what I'm going to do personally with this device is set it up probably at my house first, but then maybe move it into an office, just to have a dedicated, centralized, but still local, LAN-based inference service that I can use for whatever I want. Most tools allow you to just slap in an endpoint, like if you're using something like Continue.dev, or the Void editor, or even one of the Claude Code-style tools out there that let you put in your own inference URL to run your own coding models. I'd probably just load this up with a coding model, honestly, and use it to save some money on Cursor or whatever it is you might be using. But that's it for now.
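As a concrete sketch of what "slap in an endpoint" means: Ollama exposes an OpenAI-compatible API, so anything that speaks that protocol can be pointed at the Spark. The host name and model tag below are assumptions for illustration:

```python
# Point an OpenAI-compatible client at the DGX Spark's Ollama server instead of
# a paid API. Host name and model tag are placeholders for your own setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://dgx-spark.local:11434/v1",  # the Spark's Ollama endpoint (example host)
    api_key="not-needed",                        # Ollama ignores the key; the client requires one
)

reply = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(reply.choices[0].message.content)
```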
If you're interested in AnythingLLM, our open-source project, you can star us on GitHub. If you want to use the desktop app, it's free to use and free to download. If all you have is a phone, we actually now have a phone app, AnythingLLM Mobile for Android; you can download it, run small language models, and still get utility out of them on device. And if you do have something like this, you can hook your phone up to use this endpoint instead and get a really powerful experience just on mobile. So, whatever you want to do, I think this is a good fit. But thanks for watching. Bye.