YouTube Transcript: ‘Thinking is Hard’: Jensen Huang Explains How Nvidia Is Rewiring the Future of Intelligence | AI14 | DWS News
Summary
Core Theme
The AI industry has entered a new phase of rapid advancement and adoption, driven by breakthroughs in AI model training and reasoning capabilities, leading to a significant increase in computational demand and the emergence of a self-sustaining "virtuous cycle" of AI development and deployment.
Something fairly profound happened this year, actually. If you look at the beginning of the year, everybody had some attitude about AI. That attitude was generally: this is going to be big, it's going to be the future. And somehow, a few months ago, it kicked into turbocharge. And the reason for that is several things.

The first is that, in the last couple of years, we have figured out how to make AI much, much smarter, rather than just pre-training it. Pre-training basically says: let's take all of the information that humans have ever created and give it to the AI to learn from. It's essentially memorization and generalization.
It's not unlike going to school back when we were kids. That first stage of learning, pre-training, was never meant to be the end, just as preschool was never meant to be the end of education. Pre-training, like preschool, simply teaches you the basic skills of intelligence, so that you can understand how to learn everything else. Without vocabulary, without an understanding of language, of how to communicate and how to think, it's impossible to learn everything else.

The next is post-training.
Post-training, after pre-training, is teaching you skills: skills to solve problems, to break problems down, to reason about them; how to solve math problems, how to code, how to think about these problems step by step, how to use first-principles reasoning. And then, after that, is where computation really kicks in.

As you know, many of us went to school, in my case decades ago, but ever since, I've learned more and thought about more, and the reason is that we're constantly grounding ourselves in new knowledge, constantly doing research, and constantly thinking. Thinking is really what intelligence is all about.

And so now we have three fundamental technology skills, these three technologies: pre-training, which still requires an enormous amount of computation; post-training, which uses even more computation; and now thinking, which puts an incredible computational load on the infrastructure, because it's thinking on our behalf for every single human. So the amount of computation necessary for AI to think, for inference, is really quite extraordinary.
Now, I used to hear people say that inference is easy, that NVIDIA should do training: Nvidia is really good at this, so they're going to do training, and inference is easy. How could thinking be easy? Regurgitating memorized content is easy. Regurgitating the multiplication table is easy. Thinking is hard. That is why these three scales, these three new scaling laws, all of them at full steam, have put so much pressure on the amount of computation.

Now, another thing has happened as a result of these three scaling laws. We get smarter models, and these smarter models need more compute. But when you get smarter models, you get more intelligence.
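To put rough numbers on that last point, here is a minimal back-of-the-envelope sketch in Python. Every constant is an illustrative assumption, not a figure from the talk; it uses the common rule of thumb that generating one token costs roughly 2 x parameters FLOPs, to show how thinking (long chains of hidden reasoning tokens) multiplies the compute behind every single query.

```python
# Back-of-the-envelope: why "thinking" explodes inference compute.
# All constants are illustrative assumptions, not figures from the talk.

PARAMS = 1e12                  # hypothetical 1-trillion-parameter model
FLOPS_PER_TOKEN = 2 * PARAMS   # rule of thumb: ~2*N FLOPs per generated token

def query_flops(answer_tokens: int, thinking_tokens: int = 0) -> float:
    """FLOPs for one query: visible answer plus hidden reasoning tokens."""
    return (answer_tokens + thinking_tokens) * FLOPS_PER_TOKEN

plain = query_flops(answer_tokens=300)                             # regurgitation
reasoned = query_flops(answer_tokens=300, thinking_tokens=10_000)  # step-by-step thinking

print(f"plain answer:  {plain:.2e} FLOPs")
print(f"with thinking: {reasoned:.2e} FLOPs ({reasoned / plain:.0f}x)")
```

Multiply that per-query cost by every user and every query, and the pressure on the infrastructure he describes follows directly.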
...as if anything happens. I want to be the Jazz kid. I'm sure it's fine. Probably just lunch. My stomach. Was that me?
And so, where was I? The smarter your models are, the more people use them. The model is now more grounded. It's able to reason. It's able to solve problems it never learned how to solve before, because it can do research: go learn about it, come back, break it down, reason about how to answer your question, how to solve your problem, and go off and solve it. The amount of thinking is making the models more intelligent. The more intelligent it is, the more people use it. And the more intelligent it is, the more computation is necessary. But here's what happened.
This last year, the AI industry turned a corner, meaning that the AI models are now smart enough that they're worthy, worthy to pay for. NVIDIA pays for every license of Cursor, and we gladly do it, because Cursor is helping a several-hundred-thousand-dollar employee, a software engineer or AI researcher, be many, many times more productive. So of course we'd be more than happy to do that. These AI models have become good enough that they are worth paying for: Cursor, ElevenLabs, Synthesia, Abridge, OpenEvidence, the list goes on. Of course OpenAI; of course Claude. These models are now so good that people are paying for them. And because people are paying for it and using more of it, and every time they use more of it you need more compute, we now have two exponentials.
These two exponentials: one is the exponential compute requirement of the three scaling laws. The second is that the smarter it is, the more people use it, and the more people use it, the more computing it needs. Two exponentials are now putting pressure on the world's computing resources, at exactly the time when, as I told you earlier, Moore's law has largely ended. So the question is: what do we do? If these two exponential demands keep growing and we don't find a way to drive the cost down, then this positive feedback system, this circular feedback system, essentially called the virtuous cycle, will stall.
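As a toy illustration of that squeeze, here is a minimal sketch with invented growth rates: demand that compounds along two exponentials against supply that grows only about 50% more transistors every two years. None of the rates are real figures; the point is the widening gap that co-design has to close.

```python
# Toy model of the squeeze: demand compounds on two exponentials,
# while transistor counts grow only ~50% every two years.
# All growth rates are invented for illustration.

YEARS = 6
per_model_compute = 1.0  # exponential 1: compute per model (scaling laws)
usage = 1.0              # exponential 2: usage grows as models get smarter
supply = 1.0             # transistor-only scaling

for year in range(1, YEARS + 1):
    per_model_compute *= 2.0   # assume compute per model doubles yearly
    usage *= 2.0               # assume usage doubles yearly
    supply *= 1.5 ** 0.5       # ~50% every 2 years, ~22% per year
    demand = per_model_compute * usage
    print(f"year {year}: demand {demand:7.0f}x vs chip supply {supply:4.1f}x")
```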
That cycle is essential for almost any industry, essential for any platform industry. It was essential for Nvidia. We have now reached the virtuous cycle of CUDA.
The more applications people create, the more valuable CUDA is. The more valuable CUDA is, the more CUDA computers are purchased. And the more CUDA computers are purchased, the more developers want to create applications for it. That virtuous cycle for Nvidia has now been achieved, after 30 years. And 15 years later, we've achieved it for AI as well. AI has now reached the virtuous cycle. The more you use it, because the AI is smart and we pay for it, the more profit is generated. The more profit is generated, the more compute is put onto the grid. The more compute is put into AI factories, the smarter the AI becomes. The smarter it is, the more people use it, the more applications use it, the more problems we can solve. This virtuous cycle is now spinning.

What we need to do is drive the cost down tremendously, so that, one, the user experience is better: when you prompt the AI, it responds to you much faster. And two, so that we keep this virtuous cycle going by driving its cost down, so that it can get smarter, so that more people use it, and so on. That virtuous cycle is now spinning. But how do we do that when Moore's law has really reached its limit? Well, the answer is called co-design.
You can't just design chips and hope that whatever runs on top of them is going to go faster. The best you can do in designing chips is add, I don't know, 50% more transistors every couple of years. TSMC is an incredible company, and we can just keep adding more transistors. However, that's all in percentages, not exponentials. We need to compound exponentials to keep this virtuous cycle going, and that takes extreme co-design. Nvidia is the only company in the world today that literally starts from a blank sheet of paper and can think about new fundamental computer architecture, new chips, new systems, new software, new model architectures, and new applications all at the same time. Many of the people in this room are here because you're different parts of that stack, working with Nvidia.
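A minimal sketch of why co-design compounds where chip design alone does not. The per-layer gains below are invented for illustration; the point is that modest multiplicative improvements at each layer of the stack multiply into a generational jump far beyond transistor scaling.

```python
# Why extreme co-design compounds: multiplicative gains across the stack.
# Every per-layer speedup below is invented for illustration.

layer_gains = {
    "chip (more transistors)":      1.5,  # roughly what process scaling alone gives
    "system (rack-scale scale-up)": 1.8,
    "software (kernels, stack)":    1.6,
    "model architecture (MoE)":     1.7,
}

total = 1.0
for layer, gain in layer_gains.items():
    total *= gain
    print(f"{layer:30s} x{gain:.1f} -> cumulative x{total:.1f}")

# Transistors alone: x1.5 per generation.
# Compounded across the stack: ~x7.3 per generation.
```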
We fundamentally rearchitect everything from the ground up, and then, because AI is such a large problem, we scale it up. We created a whole computer, for the first time a computer that has scaled up into an entire rack: that's one computer, one GPU. And then we scale it out by inventing a new AI Ethernet technology we call Spectrum-X Ethernet. Everybody will say Ethernet is Ethernet. It is hardly that: Spectrum-X Ethernet is designed for AI performance, and that's the reason it's so successful. And even that's not big enough. We'll fill this entire room with AI supercomputers and GPUs. That's still not big enough, because the number of applications and the number of users for AI is continuing to grow exponentially. So we connect multiple of these data centers together, and we call that scale-across: Spectrum-XGS, gigascale.

By doing so, we do co-design at such an enormous, such an extreme level that the performance benefits are shocking. Not 50% better each generation, not 25% better each generation, but much, much more. This is the most extreme co-designed computer we've ever made, and quite frankly, the most extreme made in modern times. Since the IBM System/360, I don't think a computer has been reinvented from the ground up like this, ever. This system was incredibly hard to create. I'll show you the benefits in just a second. But essentially, what we've done, what we've created, is otherwise...
Hey Janine, you can come out. You have to meet me, like, halfway. All right. So, this is kind of like Captain America's shield.
So, NVLink 72. If we were to create one giant chip, one giant GPU, this is what it would look like. This is the level of wafer-scale processing we would have to do. It's incredible. All of these chips are now put into one giant rack. Did I do that, or did somebody else do that? Into that one giant rack. You know, sometimes I don't feel like... This one giant rack makes all of these chips work together as one. It's actually completely incredible. And I'll show you the benefits of that. The way it looks is this. So, thanks, Janine.
I like this. All right, ladies and gentlemen... I got it. In the future, next time, I'm just... It's like when you're at home and you can't reach the remote, and you just go like this, and somebody brings it to you. Yeah, same idea. It never happens to me. I'm just dreaming about it. I'm just saying.
Okay. So, anyhow, this is what we created in the past: NVLink 8. Now, these models are so gigantic. The way we solve that is to turn this gigantic model into a whole bunch of experts. It's a little bit like a team. These experts are good at certain types of problems, and we collect a whole bunch of experts together. So this giant multi-trillion-parameter AI model has all these different experts, and we put all these different experts on GPUs. Now, this is NVLink 72. We can put all of the chips into one giant fabric, and every single expert can talk to every other. The primary expert can talk to all of the distributed workers, with all of the necessary context and prompts, the batches of tokens that we have to send to all of the experts. Whichever experts are selected to produce the answer then go off and try to respond, and they do that layer after layer after layer. Sometimes there are eight experts, sometimes 16, sometimes 64, sometimes 256. But the point is that there are more and more and more experts. Well, here, with NVLink 72, we have 72 GPUs. And because of that, we can put four experts on one GPU.
The most important thing each GPU needs to do is generate tokens, and that is gated by the amount of bandwidth you have in HBM memory. Here, we have one GPU generating thinking for four experts. Versus over there: because each one of those computers can only hold eight GPUs, we have to put 32 experts on one GPU. So that one GPU has to think for 32 experts, versus this system, where each GPU only has to think for four. And because of that, the speed difference is incredible.
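A minimal sketch of the expert-placement arithmetic being described here, assuming a hypothetical 256-expert mixture-of-experts model (the largest count he mentions): the number of experts each GPU must host, and therefore the HBM-bandwidth pressure on each GPU, drops as the scale-up domain grows from 8 GPUs to 72.

```python
import math

# Expert placement for a mixture-of-experts (MoE) model.
# 256 experts is a hypothetical model size used for illustration.
TOTAL_EXPERTS = 256

def experts_per_gpu(num_gpus: int) -> int:
    """Experts each GPU must host when the model is sharded across num_gpus."""
    return math.ceil(TOTAL_EXPERTS / num_gpus)

nvlink8 = experts_per_gpu(8)    # 8-GPU scale-up domain   -> 32 experts per GPU
nvlink72 = experts_per_gpu(72)  # 72-GPU rack-scale domain -> 4 experts per GPU

print(f"NVLink 8:  {nvlink8} experts/GPU")
print(f"NVLink 72: {nvlink72} experts/GPU "
      f"({nvlink8 // nvlink72}x less thinking per GPU per token)")
```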
And this just came out. This is a benchmark done by SemiAnalysis. They do a really, really thorough job, and they benchmarked all of the GPUs that are benchmarkable, and it turns out that's not that many. If you look at the list of GPUs you can actually benchmark, it's something like 90% Nvidia. Okay, so we're comparing ourselves to ourselves. But the second-best GPU in the world is the H200, and it runs all the workloads. Grace Blackwell, per GPU, is 10 times the performance.
Now, how do you get 10 times the performance when it's only twice the number of transistors? Well, the answer is extreme co-design. By understanding the nature of future AI models, and by thinking across that entire stack, we can create architectures for the future. This is a big deal: it says we can now respond a lot faster. But this next one is an even bigger deal. Look at this. It says that the lowest-cost tokens in the world are generated by Grace Blackwell NVLink 72, the most expensive computer. On the one hand, GB200 is the most expensive computer. On the other hand, its token-generation capability is so great that it produces them at the lowest cost, because the tokens per second divided by the total cost of ownership of Grace Blackwell is so good. It is the lowest-cost way to generate tokens. By doing so, by delivering incredible performance, 10 times the performance, and 10 times lower cost, that virtuous cycle can continue.
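The cost claim reduces to a simple ratio: dollars of total cost of ownership per token produced. Here is a minimal sketch with invented throughput and TCO numbers (not SemiAnalysis figures) showing how a system can be the most expensive in absolute terms and still produce the cheapest tokens, so long as its throughput grows faster than its price.

```python
# Cost per token = total cost of ownership / total tokens produced.
# All throughput and TCO numbers below are invented for illustration.

def cost_per_million_tokens(tco_dollars: float, tokens_per_second: float,
                            lifetime_seconds: float) -> float:
    """Dollars per million tokens over the system's operating lifetime."""
    total_tokens = tokens_per_second * lifetime_seconds
    return tco_dollars / total_tokens * 1e6

FOUR_YEARS = 4 * 365 * 24 * 3600  # assumed operating lifetime, in seconds

# Hypothetical: system B costs 3x more but generates 10x the tokens per second.
a = cost_per_million_tokens(tco_dollars=1_000_000, tokens_per_second=10_000,
                            lifetime_seconds=FOUR_YEARS)
b = cost_per_million_tokens(tco_dollars=3_000_000, tokens_per_second=100_000,
                            lifetime_seconds=FOUR_YEARS)

print(f"system A: ${a:.2f} per million tokens")
print(f"system B: ${b:.2f} per million tokens (pricier machine, cheaper tokens)")
```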
Which then brings me to this one. I just saw this literally yesterday. This is the CSP capex. People are asking me about capex these days, and this is a good way to look at it. In fact, this is the capex of the top six CSPs: Amazon, CoreWeave, Google, Meta, Microsoft, and Oracle. Together, these CSPs are going to invest this much in capex. And I would tell you the timing couldn't be better, and the reason is that we now have Grace Blackwell NVLink 72 in full volume production, with the supply chain everywhere in the world manufacturing it. So we can now deliver this new architecture to all of them, so that the capex is invested in computers that deliver the best TCO. Now, underneath this there are two things going on. When you look at this, it's actually fairly extraordinary, and it's fairly extraordinary anyhow, but what's happening underneath is this: there are two platform shifts happening at the same time.
One platform shift is going from general-purpose computing to accelerated computing. Remember, accelerated computing, as I mentioned before, does data processing, image processing, computer graphics; it does computation of all kinds. It runs SQL, it runs Spark. You tell us what you need to have run, and I'm fairly certain we have an amazing library for you. You could be a data center trying to make masks to manufacture semiconductors: we have a great library for you. And so underneath, irrespective of AI, the world is moving from general-purpose computing to accelerated computing. In fact, many of the CSPs already had services here long before AI. Remember, they were invented in the era of machine learning: classical machine-learning algorithms like XGBoost, the data frames used for recommender systems, collaborative filtering, content filtering. All of those technologies were created in the old days of general-purpose computing. Even those algorithms, even those architectures, are now better with accelerated computing. So even without AI, the world's CSPs are going to invest in acceleration. Nvidia's GPU is the only GPU that can do all of that plus AI. An ASIC might be able to do AI, but it can't do any of the others.
Nvidia can do all of that, which explains why it is so safe to lean into Nvidia's architecture. We have now reached our virtuous cycle, our inflection point. And this is quite extraordinary. I have many partners in the room, and all of you are part of our supply chain. I know how hard you guys are working, and I want to thank all of you for how hard you are working. Thank you.