0:01 The next generation of models is likely
0:03 to drop in the next one to two months.
0:04 I'm talking about Claude Mythos. I'm
0:06 talking about whatever ChatGPT drops
0:07 next. I'm talking about the next Gemini
0:10 model. They will be more expensive, a
0:11 lot more expensive because they're all
0:13 trained on much more expensive chips,
0:16 the GB300 series from Nvidia, and it's
0:17 just going to get more expensive from
0:18 there. The intelligence we're going to
0:21 get, the ambient compute all around us
0:22 that is essentially free intelligence is
0:24 going to be the dumber models. That's
0:26 just how it is. If you want to use
0:27 cutting edge models, you have got to
0:29 stop burning tokens and blaming the
0:31 model. And that is the theme for this
0:32 video. If you're in a position where
0:34 you're wondering how much token usage
0:37 you have or how expensive your AI is or
0:38 whether you're using too many tokens for
0:41 your AI or how you can even measure
0:43 that, or how you can get better at it,
0:44 that is what this is. And that is going to be
0:46 one of the most valuable skills on the
0:48 planet, by the way, because you do not
0:49 want to be in a position where you are
0:52 spending $250,000 a year, a real number
0:54 that Jensen Huang gave in a real interview
0:56 for what he expects an actual individual
0:58 engineer to spend in a year on tokens.
1:00 You don't want to be the person spending
1:02 250 grand on tokens you don't have to be
1:05 spending. You want to be smart. And I
1:07 am going to give you a specific example.
1:09 This is a real-life example. A real person
1:11 I know gave me permission to use this. I
1:14 recently saw a production AI pipeline
1:16 that ingests multiple long-form
1:19 conversations per user, runs an analysis
1:21 across dozens of dimensions and
1:23 generates a fully personalized output
1:25 all on the most expensive models that
1:27 money can buy. Not because the person
1:28 wants to use expensive models, but
1:30 because he tested it and what he found
1:32 was that the better models produce the
1:34 results he needs for this business. The
1:36 cost per user? Less than a quarter, less
1:39 than 25 cents per user for that. Most of
1:43 us are spending more than we need to on
1:46 AI and this is a video about that. You
1:48 can be really smart, use really good
1:50 cutting edge AI and you can be
1:52 intelligent with your token usage and
1:54 not spend a ton of money. If you want to
1:56 know what that's like, keep on watching
1:57 because we're going to get into specific
1:59 strategies and I'm going to show you
2:01 what I built so that we can actually
2:03 make this easier for everybody so it's
2:05 not just a guessing game anymore. The
2:07 takeaway is that Frontier AI can be
2:10 absurdly cheap when you know what you're
2:12 doing. Essentially, the models are not
2:14 expensive. It's your habits that cost a
2:15 lot. And with Claude usage limits
2:17 dominating everything in the last week,
2:19 I think it's worth having that
2:21 conversation. So, let's get to it. I've
2:22 made the case we can use our models
2:24 better. What are the specific habits we
2:26 can change? I want to name specific
2:28 habits that I have seen in conversations
2:31 with others, looking over shoulders,
2:33 reading GitHub repos, listening to
2:35 conversations online. These are specific
2:37 examples that are patterns I see over
2:39 and over again. And the first one is the
2:40 rookies. The folks who are new to
2:42 cutting edge, you know what you bleed
2:44 out on in tokens? You bleed out on
2:46 document ingestion. This one drives me
2:49 crazy because it's so so easy to fix. A
2:51 brand new Claude Desktop user might drag
2:54 in three PDFs into a conversation that
2:56 might be 1500 words each, which is just
2:59 4,500 words of text. It's not that long.
3:01 And they say, "Summarize these." and
3:04 Claude processes the raw PDFs with all
3:05 the formatting overhead that goes with
3:07 that, the headers, the footers, the
3:09 embedded fonts, the layout metadata, and
3:11 the entire binary structure gets encoded
3:14 as tokens. And so the 4,500 words of
3:16 content can become 100,000-plus
3:18 tokens if you're not careful. All you
3:20 have to do to avoid that is just think
3:24 in terms of markdown. If you just ask
3:26 Claude, or frankly go to any of a number
3:27 of services on the internet that are
3:29 free, and say, "Please convert to markdown,"
3:32 it will just do it, right? It will just
3:35 take 10 seconds and convert to markdown.
3:37 And then you have a very clean set of
3:39 content that's between 4,000 and 6,000
3:42 tokens. And that's like a 20x savings
3:43 on context. And this waste just
3:45 compounds, right? Because once those
3:47 100,000 tokens are in your conversation
3:49 history, they bounce back and forth and
3:50 bounce back and forth. And this is how
3:52 you fill up your token window. and you
3:53 wonder how other people get so much
3:55 done. Please, please, please, if you're
3:58 new to AI or if you've never thought
4:00 about it, think about the file formats
4:01 you're throwing in, because so many of these
4:04 file formats are designed to be human
4:05 readable. They're not designed to be AI
4:08 readable. Think about the token
4:11 efficiency of these file formats. And if
4:13 you're wondering, well, how do I convert
4:15 to markdown? I built something for you
4:18 because all you have to do is just
4:21 ingest a file. You hit transform and
4:23 it just converts it into
4:25 markdown. That's it. And we have a
4:26 number of file types. We're adding more
4:27 from the community all the time. It's
4:29 part of the open brain ecosystem. It's
4:31 just a plugin you can put in and it will
4:32 just convert it to markdown. But that's
4:34 not the only way. You can tell Claude to
4:36 do it directly. You can also just
4:38 directly do it on the internet with any
4:40 of a number of free web services.
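To see why this matters, here's rough back-of-envelope arithmetic using the numbers above. The ~1.33 tokens-per-word figure is a common rule of thumb for clean English text, and the 100,000-token raw-PDF case is the worst-case estimate from the earlier example, so treat this as a sketch, not exact billing math:

```python
# Rough back-of-envelope for the PDF-vs-markdown savings described above.
# Assumes ~1.33 tokens per word for clean English text (a common rule of thumb);
# the 100,000-token raw-PDF figure is the worst case described earlier.

def estimate_clean_tokens(word_count: int, tokens_per_word: float = 1.33) -> int:
    """Approximate token count for plain markdown text."""
    return round(word_count * tokens_per_word)

words = 3 * 1500                       # three PDFs, ~1,500 words each
clean = estimate_clean_tokens(words)   # roughly 6,000 tokens as markdown
raw_pdf = 100_000                      # tokens when the binary structure gets encoded

print(f"markdown: ~{clean:,} tokens")
print(f"raw PDF:  ~{raw_pdf:,} tokens ({raw_pdf / clean:.0f}x more)")
```

Same text, roughly a 17-20x difference in what the model has to carry on every single turn afterward.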
4:42 Markdown conversion should not be gated.
4:45 It's just super easy to do. Tokenizers
4:47 are designed to preserve everything in
4:49 the original text. If you wanted to
4:53 reason about the style of the PDF, fine,
4:56 keep it. But 99% of the time, all you
4:58 care about is the text. You just want it
5:00 in markdown. Please, please, please
5:03 think about your file formats. Next big
5:04 mistake that people make, and this one
5:06 tends to come a little after people learn to
5:08 convert to markdown and start to
5:09 understand how some of these initial
5:12 documents work. Please do not sprawl
5:14 your conversations. If you're doing
5:17 20, 30, 40 turns in a conversation, no
5:19 AI was reinforcement-learned, trained,
5:22 or designed to handle that kind of
5:24 sprawl. All you're doing is shrinking
5:27 the share of the context where the
5:29 original instructions live. And yes,
5:31 the models are getting better and better
5:32 and better at anchoring on and
5:34 remembering those original instructions
5:36 even when they go through compression.
5:38 But why make them suffer? Why make
5:41 yourself suffer by filling up the
5:43 context window with cruft? Why waste
5:46 tokens? Why not just ask for what you
5:48 want upfront? And if you're going to
5:50 have an evolving exchange or evolving
5:53 conversation, clearly mark it at the top:
5:55 our goal here is to evolve and reach
5:58 a conclusion together. And then you have
6:00 a light conversation that goes 20 or 30
6:02 turns and say, "Thank you. I've got a
6:03 conclusion. Please summarize this." And
6:05 then you go and do real work. I see so
6:07 many people trying to mix together
6:09 modes, but AI is more and more designed
6:11 for single-turn, heavy-lift work,
6:14 and in that context you need to
6:15 do the thinking in advance and bring
6:16 that to the table. And if you need to
6:18 think with AI, that should be in a
6:20 separate chat, a separate conversation. It
6:22 might even be a separate model. It might
6:23 be three separate models and you're
6:25 bringing all of that in. I do that all
6:27 the time. I'm like, okay, I want to look
6:28 through what communities are thinking
6:30 about AI on X. I'm going to go to Grok
6:32 for that or I'm going to go through and
6:33 look at what earnings reports are saying
6:34 about the state of AI and capital
6:36 investment. I'm going to go and pipe
6:39 that through ChatGPT thinking mode and
6:40 get a bunch of reports out on that. Or
6:41 I'm going to go through Perplexity
6:43 research and get a bunch of reports out
6:45 on that. Now I'm going to go and have a
6:47 look at what some major blog posts have
6:49 to say about a particular AI topic. I'll
6:51 just go to Claude Opus 4.6. We'll do a
6:53 targeted web search. We'll go back
6:54 through. We'll make sure we understand
6:56 what we're looking at. None of that is
7:00 intended to be a single answer, right?
7:02 These are all evolving conversations.
7:03 Once I get what I want out of each of
7:05 these individual threads, I can pull
7:07 them together and say, "Okay, now I have
7:09 a piece of work to do. Now I have
7:11 something I actually need done and I
7:13 have all the context needed." So you
7:15 should have two modes here. You should
7:17 have a mode where you are trying to
7:19 gather information and a mode where you
7:21 are trying to focus and get work done.
7:22 Do not mix the two together. That is how
7:25 you burn tokens. That is how you confuse
7:27 the AI. Your objective when you want the
7:30 AI to do real work should be to be so
7:33 clear that the AI needs to do nothing
7:35 else and it just goes and gets the work
7:38 done and comes back. It should be that
7:39 clear. If you are an intermediate user
7:41 and you are like, I know this stuff,
7:43 Nate, well, let me give you another tip
7:44 you may not know. The people who are
7:46 adding lots of plugins to their ChatGPT
7:48 or their Claude instances: you are paying
7:50 a tax every time you start a
7:51 conversation because in the background,
7:53 those are going to be loaded in and
7:54 they're going to start to fill the
7:56 context window. I know someone who
7:59 shared with me that they are over 50,000
8:02 tokens in on a context window before
8:04 they type the first word because they
8:06 actually load that many plugins and
8:08 connectors. You don't need that much.
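That tax is easy to put a number on. A sketch using the 50,000-token anecdote above, where the 20 conversations a day and the $5-per-million input price (mentioned later in this video) are illustrative assumptions:

```python
# Sketch of the "silent tax": tokens preloaded by plugins and connectors are
# billed on every conversation before you type a word. The 50,000-token
# overhead is the anecdote above; 20 conversations/day and $5 per million
# input tokens are illustrative assumptions, not fixed prices.

def monthly_overhead_cost(overhead_tokens: int,
                          conversations_per_day: int = 20,
                          price_per_million: float = 5.0,
                          days: int = 30) -> float:
    """Dollars per month spent purely on preloaded plugin context."""
    tokens = overhead_tokens * conversations_per_day * days
    return tokens / 1_000_000 * price_per_million

print(f"${monthly_overhead_cost(50_000):.2f}/month")   # $150.00/month of pure overhead
print(f"${monthly_overhead_cost(5_000):.2f}/month")    # $15.00/month with a pruned set
```

Same work, same model; the only difference is how many tools are lying on the workbench before you start.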
8:10 You know what that's like? That is like
8:14 walking in to a fully functional tool
8:15 workshop and the first thing you do
8:17 instead of leaving the tools on the
8:18 walls is you go and get all the tools
8:20 off and you lay them out on the
8:21 workbench and you say, "Okay, now we're
8:22 going to do, I don't know, we're going
8:23 to do something. We're going to make a
8:25 bench." Do you need all 200 tools in the
8:27 workshop to make the bench? No. You
8:28 probably need the right five. Think
8:31 about that the next time you have an
8:33 approach to tooling. Because so many of
8:36 us do this: we hear about this new
8:37 plugin, we hear about this new
8:39 connector, someone hypes it up, we say
8:41 we need to add it, and we don't realize
8:44 it's a silent tax for the rest of time.
8:46 Every time we have a conversation, and
8:47 it just adds that little bit, it adds a
8:49 thousand tokens, it adds 2,000 tokens,
8:50 whatever it does, and it just adds it
8:52 always. Do you want to pay that for the
8:54 model? Maybe you should think more
8:56 strategically about which plugins and
8:57 connectors are really adding value for
8:59 you, because they can. Like, they can be
9:01 tremendously valuable, but make sure you
9:03 know which ones you really want because
9:04 if you don't, then you're going to be
9:06 looking at dozens of plugins that you
9:08 don't really need that are supposed to
9:10 add value, but just add a bunch of
9:13 cruft, a bunch of junk into your context
9:15 window and confuse the model and keep it
9:16 from doing good work and maybe confuse
9:18 it as to which tools it's supposed to
9:20 use. Now, I'm saving the most expensive
9:22 and the most advanced users for last
9:24 because this is where the leverage lies.
9:26 If you are an advanced user, if you are
9:27 someone who's like, "Send me to the
9:30 GitHub repo. I can just do this myself.
9:32 Let me install OpenClaw on my Mac Mini.
9:34 I'm okay managing the gateway. I can be
9:37 secure." This is for you. You have the
9:39 most leverage of anybody out there in
9:41 terms of how many tokens you use. And
9:43 typically speaking, your mistakes are
9:45 the most expensive ones because if you
9:47 screw up, you're screwing up at a level
9:48 of hundreds of thousands or millions of
9:50 tokens, maybe more. And the reason why
9:53 is simple. You are doing bigger projects
9:54 with AI. And when you do big projects
9:57 with AI, your ability to leverage AI
10:00 effectively becomes one of the most
10:02 critical things you can do to manage ROI
10:04 and cost on a particular project. It is
10:06 a job skill at that level. If you're
10:08 technical enough to go to a GitHub repo, you
10:10 have a job skill to manage tokens
10:12 efficiently. And you cannot pass that
10:13 off to somebody else. That is not going
10:15 to be somebody else's full-time job at
10:16 an org. All of us are going to have to
10:18 learn to manage our tokens. Well, if you
10:20 are sitting there and you are
10:22 the person who is responsible for the
10:23 system prompt on an agent and you
10:25 haven't pruned it in the last couple of
10:27 weeks, what are you doing? If you
10:28 haven't sat there and gone line by line
10:30 and said, you know what, a hundred of
10:32 these lines I don't need anymore because
10:34 they've been here since 3.5 and
10:35 I don't need them now. If you're sitting
10:37 there and you're like, I don't know why
10:39 we're loading this entire repo into the
10:40 context window. We just do it all the
10:41 time and it seemed to work two
10:43 generations ago but we never tested it.
10:45 That's just irresponsible. You need to
10:47 be in a position where you are actually
10:50 allowing the gains in model intelligence
10:52 to lean out your context window. If you
10:55 want to look at the larger trend that we
10:58 see in AI today, it is that we needed to
11:00 frontload and be really specific about a
11:03 lot of context for dumber models in
11:05 2025. And now that it's 2026, as the
11:08 models get more intelligent, we can lean
11:09 out the context window initially because
11:12 we can trust the model to retrieve
11:15 better. So take that seriously. That is
11:16 something you can do that is practical
11:19 to get ready for Claude Mythos. Don't
11:21 sleep on it. This is again if you're
11:22 technical, these are million token
11:24 decisions we're talking about,
11:25 especially if you're running this agent
11:27 over and over again. It adds up. Let me
11:28 give you a specific example that is
11:30 based on the original beginner example
11:32 with PDFs to show you the tangible
11:34 difference in cost, right? And this is
11:35 something that should cascade all the
11:37 way across. If you don't believe me,
11:39 this is real. Let's say you feed raw
11:41 PDFs into context. Let's say it's
11:43 100,000 tokens versus 5K like we talked
11:45 about. Let's say it's a conversation
11:47 sprawl that takes 30 turns. I've seen
11:49 these like this is very realistic. And
11:51 let's say you use Opus 4.6 for
11:53 everything including formatting,
11:54 including proofreading, and you're
11:56 making something over a 5 hour session
11:57 where you're talking back and forth. You
11:59 might be spending roughly 800,000 to a
12:02 million input tokens with maybe 150,000
12:05 to 200,000 of output tokens including
12:08 thinking. $5 in and $25 out per million.
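That session math as a quick sketch, using the $5-in/$25-out per-million figures just quoted:

```python
# The sloppy-session math above, sketched out. Prices are the video's quoted
# Opus-class figures: $5 per million input tokens, $25 per million output.

def session_cost(input_tokens: int, output_tokens: int,
                 in_price: float = 5.0, out_price: float = 25.0) -> float:
    """Dollar cost of one session at per-million-token prices."""
    return (input_tokens / 1_000_000 * in_price
            + output_tokens / 1_000_000 * out_price)

# Sloppy: raw PDFs, 30-turn sprawl, the big model for everything.
print(session_cost(1_000_000, 200_000))   # roughly $10 of compute
# Clean: markdown docs, fresh sessions, scoped context -- still at the big
# model's prices; blending in cheaper models for execution drops it further.
print(session_cost(150_000, 80_000))      # roughly $2.75 of compute
```

The gap between the two calls is pure habit, not model quality.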
12:10 you're spending $8 to $10 worth of
12:12 compute, which you might say, you know
12:14 what, I can tolerate that, or I got
12:15 the unlimited plan, or I don't care,
12:17 whatever. But I want you to look at the
12:19 difference, because anytime you start to
12:21 get serious with AI, you need to see the
12:22 difference. We talk about not being
12:24 wasteful with artificial intelligence;
12:26 this is being wasteful. You want to save
12:28 water, you want to save energy? Don't
12:30 waste your tokens. Clean session, same
12:33 work: convert documents to markdown first,
12:35 start fresh conversations every 10 to 15
12:38 turns, use Opus for reasoning and Sonnet
12:40 for execution and Haiku for polish,
12:42 and scope the context to what's needed.
12:45 And over the same period of time, you get
12:48 the same result for 100 to 150,000 input
12:51 tokens, a lot less, and maybe 50 to 80,000
12:53 output tokens. You blend that across those
12:55 models, and instead of costing $8 to $10 in
12:57 compute, you spend a buck and you get the
12:59 same result. In other words, you got an 8
13:01 to 10x reduction in cost. Now scale it,
13:03 right? That sloppy user is burning 40 to
13:05 50 bucks in compute a week, and the clean
13:08 user is burning five to seven bucks a week.
13:10 Across a 10-person team on an API,
13:12 that's 2,000 bucks a month versus 250
13:14 bucks a month for the exact same result.
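Scaled to a team, the arithmetic looks like this. The $45 and $6 weekly figures are midpoints of the ranges above, and the team size is the video's 10-person example:

```python
# Scaling the weekly per-person numbers above to a team, as rough arithmetic.
# $45/week (sloppy) and $6/week (clean) are midpoints of the quoted ranges.

def monthly_team_cost(weekly_per_person: float, team_size: int = 10,
                      weeks_per_month: float = 4.33) -> float:
    """Approximate monthly API spend for a whole team."""
    return weekly_per_person * weeks_per_month * team_size

print(monthly_team_cost(45))   # roughly $2,000 a month
print(monthly_team_cost(6))    # roughly $250 a month
```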
13:15 For subscription users, it's the
13:17 difference between hitting your limit
13:18 daily and then forgetting that limits
13:20 exist because you're just so
13:21 productive. Now, if you think this isn't
13:23 serious, I want you to think about the
13:24 cost structure for Mythos for a minute.
13:26 Mythos is rumored to be by far
13:28 Anthropic's most expensive model. I
13:30 think very strongly by April or May we
13:32 are going to have a new class of pricing
13:37 well above the $5/$25 range for tokens,
13:39 maybe into 10x that. Right? Imagine a world
13:42 where you are at 10x what Opus costs now: $5
13:45 in, $25 out for Opus. What if it's $50 in,
13:47 $250 out? Well, now things start
13:50 to get serious. Now that 8 or 10x
13:53 reduction on individual work for a day
13:54 becomes something that you can actually
13:56 measure and think about as a business.
13:58 And you imagine how big that gets when
14:00 you start to work across a dev team, the
14:01 mistakes you're making today were
14:03 tolerable because models were priced
14:06 cheaply. When the cutting-edge intelligence
14:08 you want comes out, it will be more expensive.
14:10 And I don't know the exact price, right?
14:12 I'm not saying it's 50 and 250. I'm
14:14 giving you a thought exercise. It might
14:16 be 10 and 50 instead. It's still the
14:19 same point. The point is the model that
14:21 you want is going to cost more. And as
14:24 models cost more, your mistakes scale.
14:26 Your mistakes scale with the price of
14:27 intelligence. And make no mistake, the
14:29 models will keep getting better. Every
14:31 quarter, every release, the trajectory
14:33 is unambiguous. People who tell you the
14:35 models are plateauing are lying. They
14:37 are lying to you. The models are getting
14:39 better, fast. And I do see,
14:40 occasionally, that people are insisting
14:42 that the models aren't getting better.
14:43 It's not true by any measure out there.
14:45 And the people that I see insisting on
14:46 it, I think they're insisting on it
14:49 partly because they don't want to face
14:52 the world as it will exist when AI is
14:54 this good and continuing to accelerate
14:56 this fast. It's scary, right? But we
14:57 should face it and we can all work
14:58 through it together.
15:00 >> All right. I have built a stupid button.
15:02 That is my contribution to this
15:04 discourse. I am building a stupid button
15:07 so you can check and see if you are
15:09 using your context incorrectly. I want
15:10 to save you money. I want to save you
15:12 hundreds of dollars. Please do not be
15:14 stupid with your tokens. You know, if
15:16 you care about it, don't waste the
15:18 water. Don't waste the electricity. If
15:19 you just care about the bottom line,
15:21 also don't waste your bucks, right? We
15:22 should probably care about all of those
15:24 things. If you want to know like what's
15:25 in Nate's stupid button, it's really
15:27 simple. There's six questions that I'm
15:29 helping you answer. Number one, do you
15:32 feed Claude raw PDFs and images when all
15:34 you need is text? Is there something you
15:36 are doing that is grossly inefficient as
15:38 far as tokens go? By the way,
15:40 screenshots, terribly inefficient. It
15:41 would be much, much better just to copy
15:44 and paste text. Convert to markdown
15:46 always. Claude can do it really, really
15:49 fast for you. Why not? Question two.
15:51 When was the last time you started a
15:52 fresh conversation? Are you one of those
15:54 people that keeps a conversation going
15:57 forever? I swear the number of people
15:58 who keep their conversations going
16:00 forever is highly correlated to the
16:02 number of people who start experiencing
16:04 symptoms of LLM psychosis. Why? Because
16:05 models drift over time. They were never
16:07 intended for that long a conversation.
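One way to operationalize the fresh-conversation habit is a simple threshold check before each turn. The 15-turn and 50,000-token budgets here are illustrative defaults, not hard limits from any provider:

```python
# Hypothetical helper for the restart habit: summarize and open a fresh chat
# once a conversation passes a turn count or a context budget. The 15-turn
# and 50,000-token thresholds are illustrative, not provider limits.

def should_restart(turns: int, history_tokens: int,
                   max_turns: int = 15, token_budget: int = 50_000) -> bool:
    """True when it's time to ask for a summary and start a fresh chat."""
    return turns >= max_turns or history_tokens >= token_budget

print(should_restart(8, 12_000))    # False: still in a healthy range
print(should_restart(22, 30_000))   # True: sprawl territory, summarize and restart
```

When it returns True, ask for a summary, carry that summary into a new conversation, and leave the old history behind.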
16:08 If you're having a long-running
16:10 conversation, you're just in strange
16:12 territory. When was the last time you
16:14 started a fresh conversation? And why is
16:17 that? Again, every time you take a turn
16:18 in a conversation, you see it as
16:21 sending one line back. But Claude or
16:24 ChatGPT or Gemini reads it as sending
16:26 the entire conversation back. And if
16:27 you're wondering, is this something
16:28 that's just for Claude? Nate's talking
16:30 about Claude a lot. No, it's for ChatGPT,
16:32 it's for Gemini, it's for Llama,
16:34 it's for any LLM you're using. It's for
16:36 Qwen. This is how LLMs work. Don't waste
16:39 it. Question three, are you using the
16:41 most expensive model for everything? Are
16:43 you using Opus? Are you using 5.4 on pro
16:46 mode? Whatever your choice is, are you
16:49 picking the most expensive model and
16:51 just blindly using it regardless when
16:53 the cheaper model may work better? This
16:55 is especially important if you have
16:57 production workloads, but it's also true
16:59 for all of us. Like, if you're doing
17:00 something that's a simple formatting
17:03 task, don't depend on Opus for it. Don't
17:05 depend on 5.4 for it. Use the models for
17:07 what they're designed for. Don't bring a
17:09 Ferrari to the grocery store. Question
17:11 four, do you know what's loading in
17:13 context before you even type? You can
17:14 actually find this out. You can run
17:17 /context in Claude Code, by the way,
17:18 to look at the number of things
17:20 that are loading. That's if you're in Claude
17:22 Code. If you don't know what that means,
17:24 you can go to your ChatGPT or your
17:25 Claude. You can see how many connectors
17:27 you have available. You can see how many
17:30 you've loaded up. You could be loading
17:32 tens of thousands of tokens that you're
17:33 not really aware of and not really
17:36 using. If you enabled Google Drive months
17:38 ago and you never, ever use Google
17:39 Drive, you just thought it was cool on
17:42 the day it launched. Why keep it? Just drop it.
17:44 There are so many examples like that
17:45 where we see something cool, we add it,
17:47 and we forget it's there. It's like a
17:48 barnacle on a ship. It's going to slow
17:50 you down. It's going to burn tokens. You
17:53 don't need to have it. Audit. Audit your
17:56 plugins. It matters. Next question. API
17:59 builders, are you caching stable context
18:01 so you don't pay full price to reuse it? Prompt caching
18:03 can give you a 90% discount on repeated
18:06 content. Right? Cache hits on Opus cost
18:08 50 cents per million versus $5 per
18:11 million standard. It makes a difference.
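Here's that caching math as a sketch. The 90% discount and the $0.50-vs-$5 figures are the ones just quoted; the 1.25x premium on the initial cache write matches Anthropic's documented pricing for its short-lived cache, but treat the exact multipliers as assumptions:

```python
# Sketch of the prompt-caching math above: cache hits at $0.50 per million
# input tokens vs $5 per million standard. The 1.25x premium on the initial
# cache write reflects Anthropic's documented pricing, but treat the exact
# figures as assumptions that vary by provider and cache duration.

def input_cost(calls: int, prompt_tokens: int, cached: bool,
               base: float = 5.0, hit: float = 0.5,
               write_premium: float = 1.25) -> float:
    """Input-side cost in dollars for `calls` requests sharing one stable prompt."""
    per_m = prompt_tokens / 1_000_000
    if not cached:
        return calls * per_m * base
    # One cache write at a premium, then every later call is a cheap hit.
    return per_m * base * write_premium + (calls - 1) * per_m * hit

uncached = input_cost(1_000, 10_000, cached=False)   # $50.00
cached = input_cost(1_000, 10_000, cached=True)      # about $5.06
print(f"${uncached:.2f} vs ${cached:.2f} ({1 - cached / uncached:.0%} saved)")
```

For a stable 10,000-token system prompt hit a thousand times a day, that is roughly the 90% discount in practice.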
18:13 Do not sit there and ignore prompt
18:15 caching. Take it seriously. If your
18:16 system prompt, your tool definitions,
18:18 your reference documents aren't cached,
18:20 what are you doing? This is not advanced
18:22 stuff in 2026. You should just be doing
18:24 it. And the last question in the stupid
18:26 button test (this is a real button,
18:28 by the way, I really built a stupid
18:29 button): how are you handling web search?
18:31 Are you letting Claude do web research
18:33 the expensive way? People don't realize
18:36 this, but if you call perplexity for a
18:39 search, it tends to be much more token
18:41 cheap than using claude natively. Now,
18:42 Claude is addressing this. There are
18:44 lots of ways to do claude search. You
18:47 can actually use Claude to navigate
18:50 through a browser. You can also directly
18:51 search in the terminal and it will spin
18:53 up something in the background that's a
18:54 service and you can call something in
18:56 like an MCP connector for Perplexity.
18:58 All different options you can use. This
19:00 is broadly true. It's not just true for
19:01 Claude. It's true for ChatGPT, it's true
19:04 for Gemini, etc., because MCP is magic.
19:07 But if you are trying to do search, the
19:09 larger point is that you should be doing
19:11 search as cheaply as possible. If you
19:13 just want quick results that are token
19:16 efficient, it may be worth it to take
19:18 the time to spin up an MCP and just have
19:20 a dedicated service that just returns
19:22 the search results. That's what I have
19:23 found experimentally with Perplexity and
19:26 Claude: Perplexity tends to burn
19:30 something like 10,000 to 50,000 fewer tokens
19:32 per search, which is not a small number
19:34 if you're doing complex search, and it
19:37 tends to be five times faster, and it has
19:39 structured citations. So this is not
19:40 meant to be a Perplexity plug. It's just
19:42 a token management plug. Try it for
19:45 yourself. But I've got to say, I like
19:47 faster. I like citations. I like fewer
19:50 tokens. Over a research-heavy session,
19:51 a plug-in like that can save you a lot
19:53 on the token side. And that's a larger
19:57 call out. Like if you have ways to look
19:59 at your token usage and to diagnose it,
20:01 you're going to be smarter about it. And
20:02 that's the whole point of the stupid
20:05 button is like let's not fly blind here.
20:07 Let's look at our actual token usage and
20:08 let's actually make some good choices
20:11 and let's optimize it. Now what's in
20:13 this stupid button? Number one, there is
20:15 a prompt. If you've never done this, if
20:17 you're like, "What is an MCP server?" We
20:18 got a prompt for you, right? A prompt
20:20 you can run against your recent
20:21 conversations that actually identifies
20:23 the specific dumb things you
20:25 specifically are doing. Like it will see
20:26 which documents you're feeding raw. It
20:28 will see your conversation sprawl. It
20:29 will look at model misuse. It will look
20:32 at redundant context loading. It looks
20:34 at your actual patterns and it will tell
20:35 you what to fix first. So that's the
20:37 easy version, right? Anyone can use it.
20:40 Any plan, no setup required. Number two,
20:43 a skill. This is an invocable skill that
20:45 audits your Claude Code or your desktop
20:47 environment or any other environment. It
20:49 could be ChatGPT, etc.
20:51 Skills are also translatable and it
20:53 measures your per session token
20:55 overhead. It will flag system prompt
20:56 load. It will check your plug-in and
20:58 your skill loading. It will give you a
21:00 before and after before you make
21:02 changes. Think of it as like you kind of
21:04 need a gas tank for your tokens and gee,
21:05 wouldn't it be nice to have one, right?
21:07 So, it's like the gas tank skill. Number
21:10 three, we built some guardrails. So
21:12 guardrails will sit directly on your
21:13 knowledge store. So if you're an open
21:14 brain person, which is something we've
21:16 been doing as a community, it will sit
21:18 right on your open brain and you will
21:22 stop burning tokens on input, which is a
21:23 nice touch, right? Automatic markdown
21:25 conversion for documents that are
21:27 hitting the store. Index-first retrieval
21:30 instead of just dump-and-search.
21:32 Context scoping that enables a sort of
21:34 minimum viable context for the query.
21:36 This is where token management stops
21:38 just being a personal discipline and it
21:39 becomes infrastructure that starts to
21:41 maintain itself. And I think I'm really
21:43 excited to see how the community
21:44 continues to build on this because open
21:46 brain is open source and we'll keep
21:47 evolving it and improving it. But I
21:49 wanted to make sure that we had rails
21:50 that ensured we have responsible token
21:52 usage for the open brain community. So
21:54 look, I'm going to close by talking
21:56 briefly about agents and context because
21:57 agents burn hundreds of millions of
21:59 tokens in some cases. We don't want to
22:01 leave them out. How do we think about
22:03 context management for agents? And I'm
22:05 going to give you five commandments. I
22:06 call it the keep it simple stupid
22:09 commandments for agents. Number one,
22:11 index your references. Right? If an
22:13 agent is getting raw documents instead
22:14 of relevant chunks, you've already
22:17 failed. The entire point of retrieval is
22:20 to scope what the model sees to what it
22:22 needs. Dumping a full document set into
22:26 the window on every agent call is wildly
22:27 irresponsible. You can't do that just to
22:29 give the agent context. Don't make the
22:31 agent do work it doesn't need to do.
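A minimal sketch of what "relevant chunks" means in practice: score chunks by keyword overlap with the query and hand the agent only the top few, instead of the full document set. Real systems would use embeddings or a proper index, and the sample chunks below are invented for illustration:

```python
# Minimal sketch of "index your references": rank chunks by keyword overlap
# with the query and return only the top k, instead of dumping every document
# into the agent's window. Real retrieval would use embeddings or an index;
# the sample chunks are invented for illustration.

def top_chunks(query, chunks, k=3):
    """Return the k chunks sharing the most words with the query."""
    terms = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(terms & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

docs = [
    "billing: why an invoice retry can fail and how to fix it",
    "auth flow: token refresh and session expiry",
    "deployment guide: blue-green rollouts and rollbacks",
]
print(top_chunks("why did the invoice retry fail", docs, k=1))  # the billing chunk only
```

The agent sees one relevant chunk instead of the whole corpus, which is the entire point of retrieval.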
22:33 Number two, please prepare your context
22:35 for consumption. Pre-process,
22:37 pre-summarize, pre-chunk it. A
22:40 reference document should arrive in an
22:42 agent's context, ready to be used, not
22:46 ready to be read or processed. If the
22:48 model's first several thousand tokens of
22:49 reasoning are just spent dealing with
22:52 the crappy pre-processing you did,
22:54 you're not being a responsible agent
22:55 builder. Number three, this is something
22:56 we've mentioned before. I'm calling it
22:58 out in the context of agents because
23:00 it's so important for agent workflows.
23:03 Please, please, please cache your stable
23:04 context. System prompts, tool
23:06 definitions, persona instructions,
23:08 reference material, anything that is
23:11 stable, all should be cached at a 90%
23:13 discount on cache hits. This is the
23:15 lowest effort, highest impact
23:17 optimization that you have on the table.
23:18 If you're making thousands of agent
23:20 calls a day and you're not caching, it's
23:21 just pouring money down the drain.
23:25 Number four, scope every agent's context
23:27 to the minimum it needs. Right? A
23:29 planning agent does not need your full
23:30 codebase. Don't give it the full
23:32 codebase. An editing agent doesn't need
23:34 your project roadmap. Don't give it the
23:35 project roadmap. You get the idea,
23:37 right? Passing everything to every agent
23:40 is architectural laziness and it has
23:42 real costs, both in tokens burned and
23:45 frankly in degraded agent performance.
23:46 Models perform worse when they're
23:48 drowning in irrelevant context. And by
23:49 the way, if you're like, I'm not sure
23:51 what the agent will need. Aren't the
23:53 smarter agents supposed to find it? The
23:55 answer is yes. But you will only do that
23:57 efficiently if you give them a
24:00 searchable repo that is pre-processed so
24:02 they can go and get only the relevant
24:04 slice of context. So take the time to do
24:06 it right. Number five, measure what you
24:09 burn. If you don't know your per call
24:11 token cost, you're just optimizing
24:14 without any information. Right? Please
24:16 instrument your agent calls. Track your
24:18 input tokens. Track your output tokens.
24:20 Track your overall model mix and your
24:22 cost ratio. You cannot improve what you
24:25 do not measure. And most teams building
24:28 agentic systems are thinking a lot about
24:30 whether they are semantically correct,
24:31 not whether they're functionally
24:32 correct. There's a big difference. And
24:34 they're thinking a lot about optimizing
24:36 their system prompt. They're not
24:39 thinking a ton about their model cost
24:41 because most of the time the model cost
24:43 is not what makes the project live or
24:45 die. And I get that in this age in 2025,
24:48 early 2026, with the costs we have today
24:50 and the urgency from executives to build,
24:53 the $12-per-run cost or whatever it's
24:54 going to be is not going to make or
24:56 break the ship. But plan for a world
24:58 where the models are more expensive.
25:00 Plan for a world where you have to scale
25:02 up. Plan for a world where you have to
25:04 be responsible and instrument. Now,
25:06 stepping back, there's a cultural
25:07 problem we need to acknowledge behind
25:10 all of this. At some point in the last
25:12 few months, burning tokens has become a
25:15 badge of honor. And I get it. There is a
25:17 degree to which you need to be burning
25:20 tokens in order to do meaningful work in
25:21 the age of AI. None of this is to say
25:24 that I expect token consumption to go
25:27 down. It won't. You need to be ready to
25:29 burn those tokens. This is not an ask
25:30 that you not do that. This is an ask
25:32 that you do it efficiently. And so when
25:34 Jensen sits there on stage and says
25:36 $250,000 in token costs per developer
25:38 and everyone like is shocked or rolls
25:40 their eyes or whatever the reaction is,
25:42 my reaction is I hope it's 250 grand in
25:44 smart token costs. It's not the
25:46 individual dollar amount for Jensen
25:48 because he's got cash in the bank. It's
25:49 whether the tokens were used well. It's
25:52 whether it's smart tokens. So begin to
25:54 think to yourself, yes, I need to be
25:56 maxing out my Claude. There are people
25:57 who, like, go into withdrawal when they
25:58 don't get to use their Claude. I know
26:00 people like that who are like, "Ah, I
26:02 went to a movie and I couldn't use my
26:03 Claude for a few hours. I feel like I
26:05 missed out on my token limit." Touch
26:07 some grass. It's going to be okay. But
26:09 use your tokens well. Be efficient with
26:10 your token usage. Know what you're
26:12 spending it on. Don't spend it on silly
26:14 stuff. Don't spend it on the PDFs that
26:16 you have to convert. Actually spend it
26:17 on meaningful work. And that is
26:19 something that is a human problem. We
26:21 need to be bold and audacious. These
26:23 models are really good at stuff. So,
26:25 let's get more bold, more audacious, and
26:26 think bigger about what we can aim them
26:28 at. Because if we can be more efficient,
26:30 we can do a whole lot more cool and
26:32 creative stuff with those tokens. That's