The increasing cost and capability of advanced AI models necessitate a shift towards efficient token usage and smart consumption habits to maximize value and avoid unnecessary expenses.
The next generation of models is likely
to drop in the next one to two months.
I'm talking about Claude Mythos. I'm talking about whatever ChatGPT drops next. I'm talking about the next Gemini
model. They will be more expensive, a
lot more expensive because they're all
trained on much more expensive chips,
the GB300 series from Nvidia, and it's
just going to get more expensive from
there. The intelligence we're going to
get, the ambient compute all around us
that is essentially free intelligence is
going to be the dumber models. That's
just how it is. If you want to use
cutting edge models, you have got to
stop burning tokens and blaming the
model. And that is the theme for this
video. If you're wondering how much token usage you have, how expensive your AI is, whether you're using too many tokens, or how you can even measure and improve that, this is for you. And that is going to be one of the most valuable skills on the planet, by the way, because you do not
want to be in a position where you are
putting down $250,000 a year. That's a real number Jensen Huang gave in a real interview for what he expects an individual engineer to spend per year on tokens.
You don't want to be the person spending
250 grand on tokens you don't have to be
spending on. You want to be smart. And I
am going to give you a specific example.
This is a real-life example; a real person I know gave me permission to use it. I
recently saw a production AI pipeline
that ingests multiple long-form
conversations per user, runs an analysis
across dozens of dimensions and
generates a fully personalized output
all on the most expensive models that
money can buy. Not because the person
wants to use expensive models, but
because he tested it and what he found
was that the better models produce the
results he needs for this business. The cost per user? Less than a quarter: under 25 cents per user for all of that. Most of
us are spending more than we need to on
AI and this is a video about that. You
can be really smart, use really good
cutting edge AI and you can be
intelligent with your token usage and
not spend a ton of money. If you want to
know what that's like, keep on watching
because we're going to get into specific
strategies and I'm going to show you
what I built so that we can actually
make this easier for everybody so it's
not just a guessing game anymore. The
takeaway is that Frontier AI can be
absurdly cheap when you know what you're
doing. Essentially, the models are not
expensive. It's your habits that cost a
lot. And with Claude usage limits dominating everything in the last week, I think it's worth having that
conversation. So, let's get to it. I've
made the case we can use our models
better. What are the specific habits we
can change? I want to name specific
habits that I have seen in conversations
with others, looking over shoulders,
reading GitHub repos, listening to
conversations online. These are specific
examples that are patterns I see over
and over again. And the first one is for the rookies, the folks who are new to the cutting edge. You know what you bleed out on in tokens? You bleed out on document ingestion. This one drives me crazy because it's just so easy to fix. A brand-new Claude Desktop user might drag
in three PDFs into a conversation that
might be 1500 words each, which is just
4,500 words of text. It's not that long.
And they say, "Summarize these," and Claude processes the raw PDFs with all
the formatting overhead that goes with
that, the headers, the footers, the
embedded fonts, the layout metadata, and
the entire binary structure gets encoded
as tokens. And so 4,500 words of content can become 100,000-plus tokens if you're not careful. All you
have to do to avoid that is just think
in terms of markdown. If you just ask Claude, or frankly go to any number of free services on the internet, and say "please convert this to markdown," it will just do it. It takes about 10 seconds.
And then you have a very clean set of content that's between 4,000 and 6,000 tokens. That's a roughly 20x saving on context. And this waste just
compounds, right? Because once those
100,000 tokens are in your conversation
history, they bounce back and forth and
bounce back and forth. And this is how you fill up your context window, and you wonder how other people get so much done. Please, please, please: if you're
new to AI or if you've never thought
about it, think about the file formats you're throwing in, because so many of these
file formats are designed to be human
readable. They're not designed to be AI
readable. Think about the token
efficiency of these file formats. And if
you're wondering, well, how do I convert
to markdown? I built something for you
because all you have to do is just ingest a file, hit transform, and it converts it into markdown. That's it. And we have a
number of file types. We're adding more
from the community all the time. It's
part of the open brain ecosystem. It's
just a plugin you can put in and it will
just convert it to markdown. But that's
not the only way. You can tell Claude to
do it directly. You can also just
directly do it on the internet with any
of a number of free web services.
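To get a feel for the numbers, a rough rule of thumb is that English text runs about four characters per token; real tokenizers vary, so treat this as a sanity check, not a measurement. The 20x bloat factor for a raw PDF below is an illustrative assumption, not a measured constant:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Real BPE tokenizers vary, so this is only a back-of-envelope check."""
    return max(1, len(text) // 4)

clean_chars = 4_500 * 6            # ~4,500 words at ~6 chars/word incl. spaces
raw_pdf_chars = clean_chars * 20   # raw PDF bloat (illustrative 20x assumption)

print(estimate_tokens("x" * clean_chars))    # ~6,750 tokens for clean markdown
print(estimate_tokens("x" * raw_pdf_chars))  # ~135,000 tokens for the raw file
```

Even with a crude estimator like this, the order-of-magnitude gap between clean text and raw binary formats is obvious before you ever paste anything into a chat.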
Markdown conversion should not be gated; it's super easy to do. PDFs are designed to preserve everything in the original document, so tokenizing one preserves everything too. If you want to reason about the style of the PDF, fine, keep it. But 99% of the time, all you care about is the text. You just want it in markdown. Please, please, please
think about your file formats. Next big
mistake that people make, and this one
comes a little bit after people tend to
convert to markdown and start to
understand how some of these initial
documents work. Please do not sprawl
your conversations. If you're doing 20, 30, 40 turns in a conversation, no AI was reinforcement-learned, trained, or designed to handle that kind of sprawl. All you're doing is shrinking the share of the context where the original instructions live. And yes,
the models are getting better and better
and better at anchoring on and
remembering those original instructions
even when they go through compression.
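The waste here compounds mechanically: every turn resends the entire history as input, so total input tokens grow roughly quadratically with turn count. A quick sketch of that arithmetic (the 500 tokens-per-turn figure is just an illustrative assumption):

```python
def total_input_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """Each turn resends the whole history, so n turns adding ~t tokens
    each cost about t * (1 + 2 + ... + n) = t * n * (n + 1) / 2 in total."""
    return tokens_per_turn * turns * (turns + 1) // 2

print(total_input_tokens(5))   # 7,500 input tokens
print(total_input_tokens(30))  # 232,500 -- 6x the turns, ~31x the tokens
```

That superlinear growth is why splitting one sprawling thread into a few short, focused ones is cheaper even though you repeat a little setup each time.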
But why make them suffer? Why make
yourself suffer by filling up the
context window with cruft? Why waste
tokens? Why not just ask for what you
want upfront? And if you're going to have an evolving exchange, clearly mark it at the top: "Our goal here is to evolve and reach a conclusion together." And then you have
a light conversation that goes 20 or 30
turns and say, "Thank you. I've got a
conclusion. Please summarize this." And
then you go and do real work. I see so many people trying to mix modes together, but AI is increasingly designed for single-turn, heavy-lift work. In that context, you need to do the thinking in advance and bring it to the table, and if you need to think with AI, that should be in a separate chat, a separate conversation. It
might even be a separate model. It might
be three separate models and you're
bringing all of that in. I do that all
the time. I'm like, okay, I want to look
through what communities are thinking
about AI on X. I'm going to go to Grok for that. Or I'm going to look at what earnings reports are saying about the state of AI and capital investment, and pipe that through ChatGPT thinking mode and
get a bunch of reports out on that. Or
I'm going to go through Perplexity research and get a bunch of reports out on that. Now I'm going to go and have a
look at what some major blog posts have
to say about a particular AI topic. I'll
just go to Claude Opus 4.6. We'll do a
targeted web search. We'll go back
through. We'll make sure we understand
what we're looking at. None of that is
intended to be a single answer, right?
These are all evolving conversations.
Once I get what I want out of each of
these individual threads, I can pull
them together and say, "Okay, now I have
a piece of work to do. Now I have
something I actually need done and I
have all the context needed." So you
should have two modes here. You should
have a mode where you are trying to
gather information and a mode where you
are trying to focus and get work done.
Do not mix the two together. That is how
you burn tokens. That is how you confuse
the AI. Your objective when you want the
AI to do real work should be to be so
clear that the AI needs to do nothing
else and it just goes and gets the work
done and comes back. It should be that
clear. If you are an intermediate user and you're thinking, "I know this stuff, Nate," let me give you a tip you may not know. If you're adding lots of plugins to your ChatGPT or your Claude instance, you are paying a tax every time you start a conversation, because in the background
those are going to be loaded in and
they're going to start to fill the
context window. I know someone who
shared with me that they are over 50,000
tokens in on a context window before
they type the first word because they
actually load that many plugins and
connectors. You don't need that much.
You know what that's like? That is like walking into a fully functional tool
workshop and the first thing you do
instead of leaving the tools on the
walls is you go and get all the tools
off and you lay them out on the
workbench and you say, "Okay, now we're
going to do, I don't know, we're going
to do something. We're going to make a
bench." Do you need all 200 tools in the
workshop to make the bench? No. You
probably need the right five. Think
about that the next time you have an
approach to tooling. Because so many of us hear about this new plugin, this new
connector, someone hypes it up, we say
we need to add it, and we don't realize
it's a silent tax for the rest of time.
Every time we have a conversation, and
it just adds that little bit, it adds a
thousand tokens, it adds 2,000 tokens,
whatever it does, and it just adds it
always. Do you want to pay that for the
model? Maybe you should think more
strategically about which plugins and
connectors are really adding value for you, because they can be tremendously valuable. But make sure you
know which ones you really want because
if you don't, then you're going to be
looking at dozens of plugins that you
don't really need that are supposed to
add value, but just add a bunch of
cruft, a bunch of junk into your context
window and confuse the model and keep it
from doing good work and maybe confuse
it as to which tools it's supposed to
use. Now, I'm saving the most expensive
and the most advanced users for last
because this is where the leverage lies.
If you are an advanced user, if you are
someone who's like, "Send me to the
GitHub repo. I can just do this myself.
Let me install OpenClaw on my Mac Mini.
I'm okay managing the gateway. I can be
secure." This is for you. You have the
most leverage of anybody out there in
terms of how many tokens you use. And
typically speaking, your mistakes are
the most expensive ones because if you
screw up, you're screwing up at a level
of hundreds of thousands or millions of
tokens, maybe more. And the reason why
is simple. You are doing bigger projects
with AI. And when you do big projects
with AI, your ability to leverage AI
effectively becomes one of the most
critical things you can do to manage ROI
and cost on a particular project. It is
a job skill at that level. If you're
technical enough to go to a GitHub, you
have a job skill to manage tokens
efficiently. And you cannot pass that
off to somebody else. That is not going
to be somebody else's full-time job at
an org. All of us are going to have to
learn to manage our tokens. Now, if you are the person responsible for the system prompt on an agent and you
haven't pruned it in the last couple of
weeks, what are you doing? If you haven't gone line by line and said, "A hundred of these lines have been here since 3.5 and I don't need them anymore," if you're loading an entire repo into the context window just because it seemed to work two generations ago and you never tested it, that's just irresponsible. You need to
be in a position where you are actually
allowing the gains in model intelligence
to lean out your context window. If you
want to look at the larger trend that we
see in AI today, it is that we needed to
frontload and be really specific about a
lot of context for dumber models in
2025. And now that it's 2026, as the
models get more intelligent, we can lean
out the context window initially because
we can trust the model to retrieve
better. So take that seriously. That is something practical you can do to get ready for Claude Mythos. Don't sleep on it. Again, if you're technical, these are million-token decisions we're talking about,
especially if you're running this agent
over and over again. It adds up. Let me
give you a specific example that is
based on the original beginner example
with PDFs to show you the tangible
difference in cost, right? And this is
something that should cascade all the
way across. If you don't believe me,
this is real. Let's say you feed raw
PDFs into context: 100,000 tokens versus 5K like we talked
about. Let's say it's a conversation
sprawl that takes 30 turns. I've seen
these like this is very realistic. And
let's say you use Opus 4.6 for
everything, including formatting and proofreading, and you're
making something over a 5 hour session
where you're talking back and forth. You
might be spending roughly 800,000 to a
million input tokens with maybe 150,000
to 200,000 of output tokens including
thinking. At $5 in and $25 out per million, you're spending $8 to $10 worth of compute. You might say, "I can tolerate that," or "I've got the unlimited plan, I don't care." But I want you to look at the difference, because anytime you start to get serious with AI, you need to see the difference. We talk about not being wasteful with artificial intelligence; this is being wasteful. You want to save water, you want to save energy? Don't waste your tokens. Now take a clean session doing the same work: convert documents to markdown first, start fresh conversations every 10 to 15 turns, use Opus for reasoning, Sonnet for execution, and Haiku for polish, and scope the context to what's needed. Over the same period of time, you get the same result for 100,000 to 150,000 input tokens, a lot less, and maybe 50,000 to 80,000 output tokens. Blend that across the models, and instead of costing $8 to $10 in compute, you spend a buck for the same output. In other words, an 8 to 10x reduction in cost. Now scale it: that sloppy user is burning 40 to 50 bucks in compute a week, and the clean user is burning five to seven bucks a week. Across a 10-person team on an API, that's 2,000 bucks a month versus 250 bucks a month for the exact same result.
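You can check that arithmetic yourself. A minimal sketch, using the $5/$25-per-million Opus-class prices discussed above; the blended rate for the clean, mixed-model session is an illustrative assumption:

```python
def session_cost(input_tokens: int, output_tokens: int,
                 in_price: float = 5.00, out_price: float = 25.00) -> float:
    """Dollar cost at per-million-token prices (defaults: $5 in / $25 out)."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Sloppy session: raw PDFs, 30-turn sprawl, top model for everything.
print(session_cost(1_000_000, 200_000))          # about $10
# Clean session: markdown inputs, fresh threads, cheaper models mixed in
# (a $3/$15 blended rate is an assumption for illustration).
print(session_cost(150_000, 80_000, 3.0, 15.0))  # about $1.65
```

Swap in your own provider's current prices; the point is to make the ratio between sloppy and clean habits visible rather than guessed at.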
For subscription users, it's the
difference between hitting your limit
daily and then forgetting that limits
exist because you just are so
productive. Now, if you think this isn't
serious, I want you to think about the
cost structure for Mythos for a minute. Mythos is rumored to be by far Anthropic's most expensive model. I think very strongly that by April or May we are going to have a new class of pricing well above the $5/$25 range for tokens, maybe 10x that. Imagine a world where you pay 10x what Opus costs now. Opus today is $5 in, $25 out; what if it's $50 in, $250 out? Well, now things start to get serious. Now that 8 or 10x reduction on an individual's daily work becomes something you can actually measure and think about as a business, and imagine how big that gets when you scale across a dev team. The mistakes you're making today were tolerable because models were priced cheaply; the cutting-edge intelligence you want will come out more expensive.
And I don't know the exact price, right?
I'm not saying it's 50 and 250. I'm
giving you a thought exercise. It might
be 10 and 50 instead. It's still the
same point. The point is the model that
you want is going to cost more. And as
models cost more, your mistakes scale.
Your mistakes scale with the price of
intelligence. And make no mistake, the
models will keep getting better. Every
quarter, every release, the trajectory
is unambiguous. People who tell you the
models are plateauing are lying. They
are lying to you. The models are getting better, much faster than people think. I do occasionally see people insisting that the models aren't getting better. It's not true by any measure out there.
And the people that I see insisting on
it, I think they're insisting on it
partly because they don't want to face
the world as it will exist when AI is
this good and continuing to accelerate
this fast. It's scary, right? But we should face it, and we can all work through it together.
All right. I have built a stupid button.
That is my contribution to this discourse: a stupid button so you can check whether you are using your context incorrectly. I want
to save you money. I want to save you
hundreds of dollars. Please do not be
stupid with your tokens. You know, if you care about it, don't waste the water. Don't waste the electricity. If
you just care about the bottom line,
also don't waste your bucks, right? We
should probably care about all of those
things. If you want to know what's in Nate's stupid button, it's really simple. There are six questions I'm helping you answer. Number one, do you
feed Claude raw PDFs and images when all
you need is text? Is there something you
are doing that is grossly inefficient as
far as tokens go? By the way, screenshots are terribly inefficient. It
would be much, much better just to copy
and paste text. Convert to markdown
always. Claude can do it really, really
fast for you. Why not? Question two.
When was the last time you started a
fresh conversation? Are you one of those
people that keeps a conversation going
forever? I swear the number of people
who keep their conversations going
forever is highly correlated to the
number of people who start experiencing
symptoms of LLM psychosis. Why? Because
models drift over time. They were never
intended for that long a conversation.
If you're having a long-running conversation, you're just in strange territory. When was the last time you
started a fresh conversation? And why is
that? Again, every time you take a turn
in a conversation, you see it as sending one line back. But Claude or ChatGPT or Gemini reads it as sending the entire conversation back. And if you're wondering, is this just a Claude thing? Nate's talking about Claude a lot. No: it's ChatGPT, it's Gemini, it's Llama, it's Qwen, it's any LLM you're using. This is how LLMs work. Don't waste
it. Question three, are you using the
most expensive model for everything? Are
you using Opus? Are you using 5.4 on pro
mode? Whatever your choice is, are you
picking the most expensive model and
just blindly using it regardless when
the cheaper model may work better? This
is especially important if you have
production workloads, but it's also true
for all of us. Like, if you're doing
something that's a simple formatting
task, don't depend on Opus for it. Don't
depend on 5.4 for it. Use the models for
what they're designed for. Don't bring a
Ferrari to the grocery store. Question
four, do you know what's loading in
context before you even type? You can
actually find this out. If you're in Claude Code, you can run /context and look at the number of things that are loading. If you don't know what that means, you can go to your ChatGPT or your Claude and see how many connectors you have available and how many you've loaded up. You could be loading
tens of thousands of tokens that you're
not really aware of and not really
using. If you enabled Google Drive months ago and you never use it, you just thought it was cool on the day it launched. Why keep it? Just drop it.
There are so many examples like that
where we see something cool, we add it,
and we forget it's there. It's like a
barnacle on a ship. It's going to slow
you down. It's going to burn tokens. You
don't need to have it. Audit. Audit your
plugins. It matters. Next question. API builders, are you caching stable context so you don't re-pay for it? Prompt caching can give you a 90% discount on repeated content. Cache hits on Opus cost 50 cents per million versus $5 per million standard. It makes a difference.
Do not sit there and ignore prompt
caching. Take it seriously. If your
system prompt, your tool definitions,
your reference documents aren't cached,
what are you doing? This is not advanced
stuff in 2026. You should just be doing
it. Now the last question. The stupid button test for this is a real button, by the way; I really built a stupid button. How are you handling web search?
Are you letting Claude do web research
the expensive way? People don't realize
this, but if you call Perplexity for a search, it tends to be much cheaper in tokens than searching with Claude natively. Now,
Claude is addressing this. There are
lots of ways to do claude search. You
can actually use Claude to navigate through a browser. You can also search directly in the terminal, which spins up a service in the background. Or you can call something like an MCP connector for Perplexity. All different options you can use. This is broadly true, not just for Claude: it's true for ChatGPT, for Gemini, etc., because MCP is magic.
But if you are trying to do search, the
larger point is that you should be doing
search as cheaply as possible. If you
just want quick results that are token
efficient, it may be worth it to take
the time to spin up an MCP and just have
a dedicated service that just returns
the search results. That's what I have found experimentally with Perplexity and Claude: Perplexity tends to burn something like 10,000 to 50,000 fewer tokens per search, which is not a small number if you're doing complex searches, and it tends to be five times faster, with structured citations. So this is not meant to be a Perplexity plug; it's just a token management plug. Try it for
yourself. But I got to say I like
faster. I like citations. I like fewer tokens. Over a research-heavy session, a plug-in like that can save you a lot on the token side. And that's a larger
call out. Like if you have ways to look
at your token usage and to diagnose it,
you're going to be smarter about it. And
that's the whole point of the stupid
button is like let's not fly blind here.
Let's look at our actual token usage and
let's actually make some good choices
and let's optimize it. Now what's in
this stupid button? Number one, there is
a prompt. If you've never done this, if
you're like, "What is an MCP server?" We
got a prompt for you, right? A prompt
you can run against your recent
conversations that actually identifies
the specific dumb things you
specifically are doing. Like it will see
which documents you're feeding raw. It will see your conversation sprawl. It
will look at model misuse. It will look
at redundant context loading. It looks
at your actual patterns and it will tell
you what to fix first. So that's the
easy version, right? Anyone can use it.
Any plan, no setup required. Number two,
a skill. This is an invocable skill that audits your Claude Code or desktop environment, or any other environment; it could be ChatGPT, etc., since skills are translatable. It measures your per-session token overhead. It will flag system prompt
load. It will check your plug-in and
your skill loading. It will give you a
before and after before you make
changes. Think of it this way: you kind of need a gas gauge for your tokens, and gee, wouldn't it be nice to have one? So, it's like the gas gauge skill. Number
three, we built some guardrails. These sit directly on your knowledge store. If you're an open brain person,
brain person, which is something we've
been doing as a community, it will sit
right on your open brain and you will
stop burning tokens on input, which is a
nice touch, right? Automatic markdown
conversion for documents that are
hitting the store. Index-first retrieval instead of just dump-and-search. Context scoping that enables a sort of minimum viable context for the query.
This is where token management stops
just being a personal discipline and it
becomes infrastructure that starts to
maintain itself. And I think I'm really
excited to see how the community
continues to build on this because open
brain is open source and we'll keep
evolving it and improving it. But I
wanted to make sure that we had rails
that ensured we have responsible token
usage for the open brain community. So
look, I'm going to close by talking
briefly about agents and context because
agents burn hundreds of millions of
tokens in some cases. We don't want to
leave them out. How do we think about
context management for agents? And I'm
going to give you five commandments. I
call it the keep it simple stupid
commandments for agents. Number one,
index your references. Right? If an
agent is getting raw documents instead of relevant chunks, you've already failed. The entire point of retrieval is
to scope what the model sees to what it
needs. Dumping a full document set into
the window on every agent call is wildly
irresponsible. You can't do that just to
give the agent context. Don't make the
agent do work it doesn't need to do.
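As a toy illustration of what "scope what the model sees" means, here is a minimal sketch that ranks chunks by word overlap with the query. A real system would use an embedding index, but the scoping principle is identical; the example chunks are made up:

```python
def top_chunks(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by word overlap with the query; return the top k.
    A crude stand-in for embedding retrieval: the agent receives only
    the relevant slice, never the whole document set."""
    query_words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(query_words & set(c.lower().split())),
                  reverse=True)[:k]

knowledge = [
    "billing: invoices are generated on the first of each month",
    "auth: sessions expire after 24 hours of inactivity",
    "deploy: the staging cluster mirrors the production config",
]
print(top_chunks("when are invoices generated", knowledge))
```

Even this crude scorer sends one relevant chunk instead of three documents; with a real index the savings scale to entire repos.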
Number two, please prepare your context
for consumption. Pre-process, pre-summarize, pre-chunk it. A reference document should arrive in an agent's context ready to be used, not waiting to be read and processed. If the
model's first several thousand tokens of
reasoning are just spent dealing with
the crappy pre-processing you did,
you're not being a responsible agent
builder. Number three, this is something
we've mentioned before. I'm calling it
out in the context of agents because
it's so important for agent workflows.
Please, please, please cache your stable context. System prompts, tool definitions, persona instructions, reference material, anything that is stable should all be cached, at a 90% discount on cache hits. This is the lowest-effort, highest-impact optimization you have on the table. If you're making thousands of agent calls a day and you're not caching, you're just pouring money down the drain.
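For Anthropic's API specifically, caching is opt-in via `cache_control` markers on stable content blocks. The sketch below shows the rough shape of such a request; verify the field names against the current API reference before relying on them, and note that the model name and prompts are placeholders:

```python
STABLE_SYSTEM_PROMPT = "You are a code-review agent. (long, stable instructions...)"

# Rough shape of an Anthropic-style Messages request with prompt caching.
# Content up to the cache_control marker is treated as a cacheable prefix;
# repeat calls that reuse it pay the discounted cache-hit rate.
payload = {
    "model": "claude-opus-4",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Review this diff: ..."}
    ],
}
print(payload["system"][0]["cache_control"])
```

The design point: keep everything stable (system prompt, tools, reference docs) in the cached prefix, and let only the per-call user message vary after it.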
Number four, scope every agent's context to the minimum it needs. Right? A
planning agent does not need your full
codebase. Don't give it the full
codebase. An editing agent doesn't need
your project roadmap. Don't give it the
project roadmap. You get the idea,
right? Passing everything to every agent is architectural laziness, and it has real costs, both in tokens burned and frankly in degraded agent performance. Models perform worse when they're drowning in irrelevant context.
the way, if you're like, I'm not sure
what the agent will need. Aren't the
smarter agents supposed to find it? The
answer is yes. But you will only do that
efficiently if you give them a
searchable repo that is pre-processed so
they can go and get only the relevant
slice of context. So take the time to do
it right. Number five, measure what you
burn. If you don't know your per-call token cost, you're just guessing instead of optimizing. Right? Please
instrument your agent calls. Track your
input tokens. Track your output tokens.
Track your overall model mix and your
cost ratio. You cannot improve what you
do not measure. And most teams building
agentic systems are thinking a lot about
whether they are semantically correct,
not whether they're functionally
correct. There's a big difference. And
they're thinking a lot about optimizing
their system prompt. They're not
thinking a ton about their model cost
because most of the time the model cost
is not what makes the project live or
die. And I get it: in this era, in 2025 and early 2026, with the costs we have today and the urgency from executives to build, the $12-per-run cost or whatever it turns out to be is not going to make or break the ship. But plan for a world
where the models are more expensive.
Plan for a world where you have to scale
up. Plan for a world where you have to
be responsible and instrument. Now,
stepping back, there's a cultural
problem we need to acknowledge behind
all of this. At some point in the last
few months, burning tokens has become a
badge of honor. And I get it. There is a
degree to which you need to be burning
tokens in order to do meaningful work in
the age of AI. None of this is to say
that I expect token consumption to go
down. It won't. You need to be ready to
burn those tokens. This is not an ask
that you not do that. This is an ask
that you do it efficiently. And so when
Jensen sits there on stage and says
$250,000 in token costs per developer, and everyone is shocked or rolls their eyes or whatever the reaction is, my reaction is: I hope it's 250 grand in
smart token costs. The individual dollar amount doesn't matter to Jensen; he's got cash in the bank. What matters is
whether the tokens were used well. It's
whether it's smart tokens. So begin to think to yourself: yes, I need to be maxing out my Claude. There are people who go into withdrawal when they don't get to use their Claude. I know people like that: "Ah, I went to a movie and I couldn't use my Claude for a few hours. I feel like I missed out on my token limit." Touch
some grass. It's going to be okay. But
use your tokens well. Be efficient with
your token usage. Know what you're
spending it on. Don't spend it on silly
stuff. Don't spend it on the PDFs that
you have to convert. Actually spend it
on meaningful work. And that is ultimately a human problem. We need to be bold and audacious; these models are really good at stuff. So let's get more bold, more audacious, and think bigger about what we can aim them at. Because if we can be more efficient, we can do a whole lot more cool and creative stuff with those tokens. That's the whole point.