AI's Memory Wall: Why Compute Grew 60,000x But Memory Only 100x (PLUS My 8 Principles to Fix) | AI News & Strategy Daily | Nate B Jones
Summary
Core Theme
The core theme is that AI memory is a critical, worsening problem ("memory wall") due to fundamental architectural limitations, not just hardware constraints, requiring a shift from passive accumulation to deliberate architectural design for effective long-term AI interaction.
Memory is perhaps the biggest unsolved
problem in AI and it is one of the only
problems in AI that is getting worse,
not better. As we get better and better
and better at intelligence, we get worse
at memory, relatively speaking. In fact,
there's a name for it in the model maker
community. It's called the memory wall.
We are not improving the hardware chip
capabilities of our memory systems
nearly as fast as we are improving the
ability of those chips to infer or
compute words or do LLM inference. That
generates a growing gap between our
intelligence capabilities and our memory
capabilities. Don't worry, we won't stay
at the hardware level for long. I want
to go through with you the core issues
that we see as builders, as users of AI,
as designers of AI systems. What is the
root of the memory problems we
experience? If we're at a systems design
level, if we're at a usage level, if
we are even using ChatGPT, why are
memory problems so sticky and hard to
untangle? Why have we not seen better
solutions in the market? I think there
are good reasons for that. And then once
we go through those root causes, how can
we start to think about solving them?
How can we think about solving them as
users? How can we think about solving
them as builders? So, I'm going to go
through six root causes and then we're
going to flip the script and I'm going
to go through eight principles for
building a solution because I want you
to walk away from this and I want you to
feel empowered to actually design better
memory systems. I don't want you to wait
around for someone in Silicon Valley to
make a pitch and get funded for this.
You can design your own solution here.
So the key thing to keep in mind through
this whole conversation is that AI
systems are stateless by design but
useful intelligence requires state. So
every conversation is stateless meaning
it starts from zero. The model has
parametric knowledge which the weights
we talk about in a model right but it
doesn't have episodic memory. It does
not remember what happened to you. And
I'm sorry, but the 10 or 11 sentences or
the very lossy memory that ChatGPT
has right now or the ability to search
conversations that Claude has right now
is not good enough for that. You have to
reconstruct your context every single
time. This is not a bug actually. It is
an intentional architecture. It is a
design for statelessness because the
model makers want the model to be
maximally useful at solving the next
problem, the problem in front of you.
And they cannot presume that state
matters. It doesn't always matter. So
the promise of memory features is that
vendors are going to be able to
magically solve this by making the
system stateful in ways that are useful
to you. But this creates a whole host of
new problems because statefulness is not
the same for all of us. What should it
remember? Is it passive accumulation? Is
it active curation? How long should it
remember? Is it persistent forever? Is
it stale ever? Does it drop off after 30
days? When do you retrieve it? Do you
retrieve it when it's relevant, sort of
like Claude does? Do you retrieve it all
the time and potentially it's noisy in
the context window? How do you update
it? This is one of the biggest problems
with LLMs. People tell me they'll put
their wiki into a retrieval augmented
generation system and I'm like, when was
the last time you updated your wiki? If
it's not updated, how do you overwrite
it? How do you append data to it? How do
you change data? These are not
implementation details. They are
fundamental questions about what memory
is and its purpose when we do work.
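To make those questions concrete, here is a minimal Python sketch of a memory record where each design question becomes an explicit field. The field names (scope, expires_at, retrieval_policy, superseded_by) are illustrative assumptions, not any vendor's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

# Illustrative only: each design question becomes an explicit field or rule somewhere.
@dataclass
class MemoryRecord:
    content: str                             # what should it remember?
    scope: str                               # "personal" | "project" | "session" -- who/what it applies to
    created_at: datetime = field(default_factory=datetime.now)
    updated_at: datetime = field(default_factory=datetime.now)
    expires_at: Optional[datetime] = None    # how long should it remember? None = persistent
    retrieval_policy: str = "on_relevance"   # when do you retrieve it? e.g. "always" | "on_relevance"
    superseded_by: Optional[str] = None      # how do you update or overwrite it?

    def is_stale(self, now: Optional[datetime] = None) -> bool:
        now = now or datetime.now()
        return self.expires_at is not None and now > self.expires_at


# Example: a project-scoped fact that drops off after 30 days
note = MemoryRecord(
    content="Draft v3 of the launch brief supersedes v2.",
    scope="project",
    expires_at=datetime.now() + timedelta(days=30),
)
print(note.is_stale())  # False today; True after 30 days
```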
Memory matters because we humans are
able to quickly and fluidly negotiate
between stateless brainstorming things
that are like wild and we don't need to
use a lot of our past memory and very
stateful work. LLMs are not good at
that. Loading that context is very very
hard right now. So why is this so
persistent? We've talked a little bit
about how the promise is hard to
fulfill, but what are some of the root
causes that make it hard for vendors to
do this? Number one, the relevance
problem is one of the gnarliest unsolved
problems out there. What's relevant
actually changes based on the task that
you're doing. Are you planning? Are you
executing? The phase of your work. Are
you just exploring? Are you refining
your work? The scope you're in, right?
Is it a personal or is it a project? I
know someone who is in the healthcare
industry. And they have to be very
careful because if they were to ever ask
for health advice then the memory
retrieval within ChatGPT would pull up
work stuff and they are afraid in the
same context if they pull up a work
thing that their personal health data
will leak in because it will all look
like health data. So the scope matters.
What has changed since the last time you
talked? The state delta is what we would
call that. If you come back and you say
this is a new version, does it really
understand that's a new version or not?
Semantic similarity, which is what
retrieval-augmented generation depends
on is just a proxy. It is a proxy for
relevance. It is not a true solution.
Finding similar documents works until
you need to find the document where we
decided X and that's very specific. Or
ignore everything about client A right
now but pay attention to clients B, C,
and D. Or please only pay attention to
what we've decided since October 12th.
These are all things that we humans can
understand and execute on when we go and
manually retrieve information. But the
AI using semantic search, it's just not
the right tool for that job. There's no
general algorithm for relevance. There's
no magic relevance solve that the AI can
depend on. You need to use human
judgment about task context. And that
means requiring very complicated
architectures to accomplish a specific
memory task, not just better embeddings
in a RAG memory system. And that, by the
way, is one of the big reasons why these
like one-stop shop vendors often
struggle with real implementations.
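Here is a rough sketch of what that looks like in practice, with an invented three-document corpus and a deliberately naive word-overlap scorer standing in for embeddings. The point is that "the document where we decided X," "ignore client A," and "since October 12th" are metadata filters layered on top of similarity, not something similarity can express on its own.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    text: str
    client: str
    kind: str          # e.g. "decision", "note", "draft"
    decided_on: date

# Invented corpus for illustration
docs = [
    Doc("We decided to ship the API in beta.", "client_b", "decision", date(2024, 10, 20)),
    Doc("Brainstorm notes about pricing.",     "client_a", "note",     date(2024, 10, 1)),
    Doc("We decided to drop the legacy SKU.",  "client_c", "decision", date(2024, 9, 30)),
]

def similarity(query: str, text: str) -> float:
    # Naive stand-in for an embedding model: shared-word overlap.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(query, exclude_clients=(), only_kind=None, since=None, top_k=3):
    # Hard metadata filters first -- similarity alone cannot express these.
    pool = [d for d in docs
            if d.client not in exclude_clients
            and (only_kind is None or d.kind == only_kind)
            and (since is None or d.decided_on >= since)]
    return sorted(pool, key=lambda d: similarity(query, d.text), reverse=True)[:top_k]

# "Find the document where we decided X, since October 12th, ignoring client A"
for d in retrieve("what did we decide about the API",
                  exclude_clients={"client_a"}, only_kind="decision",
                  since=date(2024, 10, 12)):
    print(d.text)
```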
Number two, the persistence precision
trade-off is a massive issue with memory
systems. If you store everything,
retrieval becomes very noisy and it
becomes very expensive. You jam up your
context window. If you store
selectively, you're going to lose
information that you need later. If you
let the system decide what to keep, it
optimizes often for something that you
didn't ask it to. Maybe it optimizes for
recency. Maybe it optimizes for
frequency. Maybe it optimizes for
statistical saliency versus actual
importance. And if you wonder what
statistical saliency is, have you ever
tried having an argument with ChatGPT
or Claude or Gemini about the fact that
it's emphasizing the wrong thing in
something it's writing? That is saliency.
That's a saliency defect. Human memory is
actually, funnily enough, very good at
this through the technology of
forgetting. We use incredibly lossy
compression with emotional and
importance weighting. And so we've
actually done studies on human memory.
And it turns out that you can with
practice get better and better and
better at recalling specific things. But
if you choose not to recall something
that happened to you, you're just going
to lose it. And what's interesting is it
seems to be a database keys issue for
us. I realize someone in the comments is
going to be a neuroscientist and rightly
take me to task. But my understanding of the
reading is that you have to be able to
remember the equivalent of a database
key to retrieve the memory. And if you
can do that, the memory becomes
accessible again. But your short-term
memory, so to speak, in humans is very
lossy. And so you lose the database keys
if you can't persist them with intent,
if you don't intend to remember them.
And that is why fundamentally your
childhood memories can be very
accessible. But what happened last
Thursday? You're sitting there and
you're like, did we eat out or did we
not eat out? Which day did we go
to the movies? Right? It's not because
you have a profound issue with memory.
It's because your brain is desperately
compressing information to make it
useful to you and has dumped out those
database keys. And when you go to the
effort of remembering, you're literally
retrieving the database keys to get the
memory back. Forgetting is a useful
technology for us. That's the point of
that. AI systems don't have any of that.
They either accumulate or they purge,
but they do not decay. And what I'm
talking about when I'm like, did I go to
the movie? Oh, yeah. It was the movie.
Who was that character? Oh, now I'm
recovering the key and I'm able to
get it back. The memory has decayed into
a lossy approximation in the memory key,
but I can recover it if I put effort
into it. We have nothing like that in
AI. That is a uniquely human technology
and it's funny but we have to think
about forgetting as a technology when we
talk about memory. Number three, the
single context window assumption.
Vendors often try to solve memory by
making context windows bigger. But
volume is not the issue. The structure
is the problem. A million token context
window is not a usable million token
context window if it's full of unsorted
context. That is worse than a tightly
curated 10,000 tokens. The model still has to
find what matters, parse the
relevance, ignore the noise. You have
not solved the problem by expanding the
context window. You have simply made
your problem more expensive. Sometimes
substantially more expensive. I know
people who make calls and they don't
budget the calls and they're like, "Why
is my API bill high?" I'm like, your API
bill is high because you're stuffing the
context window and you're just kind of
trying to throw queries against it. It
does not work well and it also is very
expensive. The real solution requires
multiple context streams with different
life cycles and retrieval patterns. It
is hard. You have to design it. It
breaks the mental model of just talk to
the AI. That is why there is no
one-size-fits-all solution. Issue number
four is the portability problem. Every
single vendor builds proprietary memory
layers because they think in their pitch
deck that memory is a moat. I get it. It
makes sense on a pitch deck. ChatGPT
memory, Claude recall, Cursor memory banks.
These are not inherently interoperable.
Users will invest time building up
memory in a given system. And the model
makers like that because it makes the
switching cost real and you can't port
what ChatGPT knows about me to Claude,
and your memory is locked in, and so on.
The problem here is a problem of the
commons. This behavior set from vendors
and model makers and tool builders
encourages users to leave memory to the
tool rather than encouraging them to
build a proper context library. And I
get it from a product design perspective
because like how many users are going to
really build a product context library?
But if we reframe it and we say
portability is a first class problem,
users should be inherently able to be
multi-model. I think that's more
relevant. And maybe from a consumer
standpoint, you don't care because you
have 800 million users in ChatGPT. It
dwarfs everything else, etc. One, that's
not entirely true, because Gemini is, I
think, closing in on half a billion now.
But the other reason is that from a
business perspective, you have to be
multi-model. It is a liability to be
single-model. And so if
you're building business memory systems,
you must solve the portability problem.
And the issue is any given vendor is not
incentivized to make that truly portable
either. They want to make that
proprietary to them. And then you have
the same bottleneck, but now you're on a
vendor who may not be as well funded as
the model maker. And so it becomes a
house of cards. Number five, the passive
accumulation fallacy. Most memory
features assume you just use your AI
normally and it will figure out what to
remember. That is the default mental
model of users. And so that's the
assumption that memory features build
around. But this fails because the
system cannot distinguish a preference
from a fact. It cannot easily tell
project specific from evergreen context.
I've often seen that mixed up. It
doesn't automatically know when old
information is stale. If you've ever
wondered why ChatGPT or Claude or
Perplexity comes back and talks about
old AI models as if they are active
today, that is the same issue. They
can't tell when old information is stale
and it optimizes for continuity. It does
not optimize for correctness. This is
the keep the conversation going issue.
Useful memory fundamentally requires
active curation. You have to decide what
to keep, what to update, and what to
discard. And that is work. And so
vendors promise passive solutions
because active curation, they are told,
does not scale as a product. I think we
have to start by framing that problem
better because it turns out passive
accumulation doesn't solve for it
either. And this is still a big enough
problem that it costs us billions of
dollars at the enterprise level and it's
extremely frustrating for users both
personally and professionally. The
answer cannot be there is no answer or
we'll fake the answer. Finally, number
six on the root cause side, and then we're
going to get to the solve. It'll feel
better. Memory is actually multiple
problems. And that's part of why it's so
hard. I hope you're getting that idea,
right? When people say AI memory, what
they really mean is any number of
preferences, how I like things done.
That could be a key value that's
persistent. They could mean facts.
What's true about particular things or
entities. That can be structured, and it
might need updates. They might mean
knowledge, right? Domain expertise. And
that can be parametric, right? That can
be embedded in weights, but it might not
be, and then what do you do? It can
be episodic. So it could be
conversational, temporal, ephemeral
knowledge. It can also be procedural.
Have we solved this before? Right? If
episodic memory is what we've discussed
in the past, procedural memory is how we
solved this problem in the past. And
those are also different things. And so
you have exemplars there, you have
successes and failures in procedural
memory. Every single memory type needs
different system design to handle
storage, retrieval, and update patterns.
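A minimal sketch of that taxonomy, with illustrative (not prescriptive) defaults for how each memory type might map to storage, lifecycle, and update pattern:

```python
from enum import Enum, auto

class MemoryType(Enum):
    PREFERENCE = auto()   # how I like things done
    FACT       = auto()   # what's true about an entity; needs updates
    KNOWLEDGE  = auto()   # domain expertise; may be parametric or retrieved
    EPISODIC   = auto()   # what we discussed; conversational, temporal
    PROCEDURAL = auto()   # how we solved this before; exemplars, successes, failures

# Illustrative defaults only -- the real choices depend on your system.
DESIGN = {
    MemoryType.PREFERENCE: dict(store="key-value",        lifecycle="permanent",        update="overwrite"),
    MemoryType.FACT:       dict(store="relational",       lifecycle="until superseded", update="versioned"),
    MemoryType.KNOWLEDGE:  dict(store="vector index",     lifecycle="slow-changing",    update="re-embed"),
    MemoryType.EPISODIC:   dict(store="event log",        lifecycle="decays / expires", update="append-only"),
    MemoryType.PROCEDURAL: dict(store="tagged exemplars", lifecycle="curated",          update="append + prune"),
}

for mtype, d in DESIGN.items():
    print(f"{mtype.name:10s} -> store={d['store']}, lifecycle={d['lifecycle']}, update={d['update']}")
```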
And if you feel like you're getting a
headache here, you're not alone. This is
why we don't have a good solve. And this
is why I want to lay out in the next
section principles for solve. But it
starts with being honest about the
problem. Treating this problem as one
problem guarantees you are going to
solve none of the real problems well.
And that is why we have memory as a
persistent issue, in fact a worsening
issue in the AI community. Vendors
fundamentally are treating this as a
solve for infrastructure and not a solve
for architecture. And so bigger windows
and better embeddings and cross-chat
search scale, but they don't solve
structurally. And users keep expecting
passive solutions because they're
frankly sold passive solutions. There's
an expectations issue here. "Just
remember what matters" is not something
that you can expect to work. But we're
told that it will work. So if memory
requires architecture and users want
magic, the gap between what's promised,
what's delivered, and what's needed has
never been bigger. We have a memory wall
of our own beyond the chip level in how
we design our systems. And it won't get
solved if we solve the wrong problem. So let's
say you've gone through all of this and
you want to solve memory correctly. I am
going to give you principles that work
whether you are using the chat as a
sort of power user at home and you want
to build something yourself because this
absolutely works for that or whether you
are designing larger systems because it
turns out that the principles for memory
are fractal because the problem is
fractal. We have the same kinds of
memory issues when we are power users
individually in a chat as we do when we
are designing agentic systems. So the
principles that work. Number one,
there's going to be eight of these.
Settle in. It's going to be fun. Memory
is an architecture. Memory is an
architecture. It is not a feature. You
cannot wait for vendors to solve this. I
think you get this idea. We won't spend
too long here. Every tool will have
memory capabilities, but if you leave it
to tools, they will solve different
slices. You need principles that work
across all of them. And you need to
architect memory as a standalone that
works across your whole tool set.
Principle two, you should separate by
life cycle, not by convenience. So as an
example, you need to separate personal
preferences which can be permanent from
project facts which can be temporary and
those should be separated from session
state which can be ephemeral or
conversation state. Mixing different
life cycle states, mixing permanent with
temporary with ephemeral, just breaks
memory. The discipline lies in keeping
these apart cleanly. And again, this
works if you're in chat. It works if
you're designing agentic systems. If
you have a permanent personal
preference, it is possible. It is as
simple as a very disciplined system chat
update where you go into the sort of
system rules and the system prompt for
ChatGPT and you say, "This is what you
need to know about me. These are my
personal preferences." And model makers
are starting to make that more exposed
because they want that. But they don't
tell you how to use that properly. And
when I observe how people actually use
that "tell me about yourself," it is
absolutely a mix of personal preferences
and ephemeral stuff and project facts
because no one has taught them to use it
better. And if you're designing agentic
systems, it gets more complex, but it's
the same separation of concerns. You
have to separate out what are the
permanent facts in the situation here.
What are project specific facts and what
is session state. Principle number
three, you need to match storage to
query pattern. So that means you're
going to need multiple stores because
different questions require different
retrieval. Now in the chat situation
that I gave you, ChatGPT can retrieve
the memory if it's a system prompt kind
of a thing, and it just calls it into the
context window. It's super simple, and
most people would never think of it as
memory, but that's what it is. If
you're designing an agentic system, it
is understanding the difference between,
for example, what is my style, which
could be a key value because it's a
written style of some sort. What is the
client ID, which should be structured
data or relational data, what similar
work have we done, which could be
semantic or vector storage data, and
what did we do last time, which should
be event logs. Those are four different
types of data, right? You have key value
data, structured data, semantic data,
event logs. Trying to do all of these in
one storage pattern is going to fail.
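Here is a toy sketch of that routing, with four stand-in stores, one per query type. The names and data shapes are illustrative assumptions, and the "semantic" search is stubbed with keyword matching rather than a real vector index.

```python
# A rough sketch of "match storage to the query pattern": four toy stores,
# one per query type from above. Names and shapes are illustrative, not a framework.
from collections import defaultdict

style      = {"tone": "direct", "format": "short paragraphs"}        # key-value: "what is my style?"
clients    = {"client_42": {"name": "Acme", "tier": "enterprise"}}   # structured/relational: "what is the client ID?"
exemplars  = ["Q3 pricing one-pager", "API launch brief"]            # semantic/vector (stubbed): "what similar work have we done?"
event_log  = defaultdict(list)                                       # event log: "what did we do last time?"
event_log["client_42"].append({"ts": "2024-10-20", "event": "sent draft v2"})

def answer(query_type: str, key=None, query_text=None):
    if query_type == "style":
        return style                                   # exact key-value lookup
    if query_type == "client":
        return clients[key]                            # exact structured lookup
    if query_type == "similar_work":
        # stand-in for a vector search over embeddings
        return [e for e in exemplars
                if query_text and any(w in e.lower() for w in query_text.lower().split())]
    if query_type == "last_time":
        return event_log[key][-1]                      # most recent event for this key
    raise ValueError(f"unknown query type: {query_type}")

print(answer("style"))
print(answer("client", key="client_42"))
print(answer("similar_work", query_text="pricing work"))
print(answer("last_time", key="client_42"))
```

The exact backends matter less than the fact that each query type hits a store whose access pattern actually fits it.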
And that is why when people say, "We
have our data lake and it's going to be
a RAG." I'm like, why? Why is it going
to be a RAG? Have you heard the word RAG
repeated a hundred times like a magic
spell for memory? It does not work that
way. You need to match storage to the
query pattern. Otherwise, you just have
a very expensive data dump. Principle
number four, mode-aware context beats
volume hands down. And so more context
is not better context. Planning
conversations need breadth like they
need to have space for alternatives.
They need to have space for comparables.
Brainstorming conversations are similar
to planning conversations. You need to
be able to range. Execution
conversations. Execution workflows in
agentic situations. They need precision.
They need precise constraints. Retrieval
strategy needs to match your task type.
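One way to picture mode-aware context is as an explicit retrieval configuration chosen by task type. This is a sketch with made-up numbers, not tuned values: planning and brainstorming get breadth, execution gets a tight similarity floor plus pinned constraints.

```python
# Illustrative only: "mode-aware context" as an explicit retrieval config,
# chosen by task type instead of one-size-fits-all settings.
MODES = {
    "planning":      dict(top_k=12, similarity_floor=0.3, enforce_constraints=False),
    "brainstorming": dict(top_k=15, similarity_floor=0.2, enforce_constraints=False),
    "execution":     dict(top_k=4,  similarity_floor=0.7, enforce_constraints=True),
}

def build_context(task_type: str, candidates: list[tuple[str, float]]) -> list[str]:
    cfg = MODES[task_type]
    kept = [text for text, score in candidates if score >= cfg["similarity_floor"]]
    kept = kept[: cfg["top_k"]]
    if cfg["enforce_constraints"]:
        # execution mode: pin the hard constraints first, before anything fuzzy
        kept = ["CONSTRAINTS: ship date 2024-11-01; budget $40k (example values)"] + kept
    return kept

candidates = [("old pricing discussion", 0.45), ("current launch spec", 0.9), ("loose analogy", 0.25)]
print(build_context("brainstorming", candidates))  # broad: keeps almost everything
print(build_context("execution", candidates))      # narrow: constraints + only the precise match
```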
You cannot just sit there and think to
yourself, okay, I'm going to have a
brainstorming conversation and it's
going to be incredibly precise and just
hope that it works. This is why I talk
about prompting so much. Effectively,
what is prompting doing? It is giving
context that is mode-aware to an AI so
that it can be in the right mode. And
that's super effective for chat users.
But guess what? If you're designing
agentic systems, it is your
responsibility to architect mode
awareness into the system so that it is
aware that this is an execution
environment and that precision matters
and that it is audited and evaluated on
precision. Principle number five, you
need to build portable as a first class
object. You need to build portable and
not platform dependent. Your memory
layer needs to survive vendor changes.
It needs to survive tool changes. It
needs to survive model changes. If ChatGPT
changes their pricing, if Claude
adds a feature, your context library
should be retrievable regardless. And
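There is no standard way to do this, but a minimal sketch of a portable context library is just plain JSON or Markdown files on disk that you own and that any tool or model can read. The folder layout and field names here are illustrative assumptions.

```python
# A minimal sketch of a portable context library: plain JSON on disk,
# owned by you, readable by any tool or model. File names and layout are illustrative.
import json
from pathlib import Path

LIBRARY = Path("context_library")           # lives outside any one vendor's product

def save_record(category: str, name: str, content: str, tags=()):
    folder = LIBRARY / category             # e.g. "preferences", "projects", "exemplars"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / f"{name}.json").write_text(json.dumps(
        {"name": name, "content": content, "tags": list(tags)}, indent=2))

def export_for_prompt(category: str) -> str:
    # Assemble a vendor-neutral block of text you can paste or inject into any model.
    parts = []
    for path in sorted((LIBRARY / category).glob("*.json")):
        rec = json.loads(path.read_text())
        parts.append(f"### {rec['name']}\n{rec['content']}")
    return "\n\n".join(parts)

save_record("preferences", "writing_style", "Short paragraphs, no hedging, cite sources.", tags=["evergreen"])
save_record("projects", "launch_brief", "Ship the beta API on 2024-11-01; budget is fixed.", tags=["q4"])
print(export_for_prompt("preferences"))
```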
that is something that almost nobody can
say right now. And the people who are
doing it tend to be designing very large
scale agentic AI systems at the
enterprise level. But this is a lesson
that we all need to take with us. I
think it is a best practice. It is sort
of like keeping a go bag next to the
door in case you need to get out in case
of I don't know something happens to
your house. You need to have something
that is portable that carries relevant
memory that you can use to have
productive conversations with another
AI. I fully admit there is not an
out-of-the-box solution for this. There are
people who are power users who configure
Obsidian to do this, right, as a
note-taking app, and they tie it into AI
and it becomes a portable,
platform-independent way of handling this. There
are people who use Notion for this. The
common trait is that
they are obsessed with making sure the
memory is configured correctly for them
and the AI has to come in and be queried
correctly or called correctly to engage
with a piece of the memory that matters.
Whether that is a key value piece like
what's my style or a semantic search
like what similar work have we done
together. A good data structure accounts
for that. Principle number six,
compression is curation. Do not upload
40 pages hoping the AI extracts what
matters. I see people do this when they
overload the context window and they ask
for an analysis of a report. You need to
do the compression work. You need to
either in a separate LLM call or in your
own work, write the brief, identify the
key facts that matter and state the
constraints. This is where judgment
lives. And if you don't delegate it, you
will be happier with the precision and
context awareness of the response.
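A sketch of that compression step, assuming a placeholder call_llm function standing in for whatever model API you use: the model drafts the extraction, but the brief that enters context is short, structured, and confirmed by a human before it is stored.

```python
# A sketch of "compression is curation": the model can help extract, but the brief
# that enters context is short, structured, and confirmed by a human.
# `call_llm` is a placeholder for whatever model API you use -- not a real client.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Placeholder: imagine this returns the model's draft extraction of the 40-page report.
    return "- Key fact: churn rose 4% in Q3\n- Constraint: budget fixed at $40k\n- Key fact: two vendors shortlisted"

@dataclass
class Brief:
    key_facts: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    confirmed_by_human: bool = False    # judgment stays with you

def draft_brief(report_text: str) -> Brief:
    raw = call_llm(f"Extract key facts and constraints as bullets:\n{report_text[:4000]}")
    brief = Brief()
    for line in raw.splitlines():
        line = line.lstrip("- ").strip()
        (brief.constraints if line.lower().startswith("constraint") else brief.key_facts).append(line)
    return brief

brief = draft_brief("...40 pages of report text...")
# You read it, correct anything wrong, then mark it confirmed -- only then does it enter memory.
brief.confirmed_by_human = True
print(brief)
```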
Memory is bound up in how we humans
touch the work. There are ways to use AI
to amplify and expand your judgment. You
can use a precise prompt to extract
information in a structured way from 40
pages of data and then in a separate
sort of piece of work figure out what to
do with that data. But it remains on you
to make sure that the facts are correct,
that the constraints are real, and that
the precision work you're asking AI to
do with that data is the correct
precision work. The judgment in
compression is human judgment. It may be
human judgment that you amplify with AI,
but it remains human judgment. Principle
number seven, retrieval needs
verification. So semantic search will
recall well but fail on specifics,
right? It will recall topics and themes
well. You need to pair fuzzy retrieval
techniques like RAG search with exact
verification where facts must be
correct. You should have a two-stage
retrieval call path, right? Recall
candidates and then verify against some
kind of ground truth. This is especially
important in situations where you have
policy or you have financial facts or
legal facts that you need to validate.
Something like this is exactly why there
was a very prominent fine levied
against a major consulting firm in the
last two weeks. I think the fine came to
close to half a million dollars because
they could not verify facts around court
cases in a document that they prepared
and they hallucinated them and they
didn't catch them. Retrieval failed.
Retrieval failed. And because the LLM is
designed to keep the conversation going,
it just inserted something plausible and
nobody caught it. You need to be able to
verify retrieval against ground truth.
Now, if it's a small task, that might be
the human at the other end of the chat,
right? It just is a step that needs
doing. If it's a large agentic system,
it is the exact same fractal principle,
but you need to do it in an automatic
way using an AI agent for evals.
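A rough sketch of that two-stage path, using made-up case numbers and a toy registry as the ground truth: fuzzy recall first, then exact verification before any specific claim is allowed into the answer.

```python
# A rough sketch of two-stage retrieval: fuzzy recall first, then exact verification
# against a ground-truth table before a specific claim is allowed through.
import re

ground_truth_cases = {"2023-CV-1182", "2024-CV-0417"}   # illustrative authoritative registry

recalled_passages = [
    "In 2024-CV-0417 the court allowed the amended filing.",
    "A similar outcome was reached in 2021-CV-9999.",     # plausible-looking but not in the registry
]

def verify(passages):
    verified, flagged = [], []
    for p in passages:
        cited = set(re.findall(r"\d{4}-CV-\d{4}", p))
        if cited and not cited.issubset(ground_truth_cases):
            flagged.append((p, cited - ground_truth_cases))   # do not let this into the answer
        else:
            verified.append(p)
    return verified, flagged

ok, bad = verify(recalled_passages)
print("verified:", ok)
print("needs human/agent review:", bad)
```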
Principle number eight, memory compounds
through structure. So random
accumulation actually does not compound.
It just creates noise. Just adding stuff
doesn't compound. If we just added
memories randomly the way we experience
them in life and we had no lossiness, no
forgetting ability, we would not be able
to function as people. Forgetting is a
technology for us. In the same way that
forgetting is a technology for us,
structured memory is a technology for
LLM systems. So evergreen context goes
one place, versioned prompts go another
place, tagged exemplars go another
place. And at a small scale, yes, you
can do this. People are doing this with
Obsidian, with notion, with other
systems as individuals. And yes, you can
scale this as a business. Same same
principle. You let each interaction
build without degradation if you have
structured memory. Otherwise, you just
have random accumulation. Otherwise, you
have the pile of transcripts you never
got to, and you're like, well, this is
data. We're logging it. It's probably
good. It's just going to be random
accumulation. It creates noise. You're
not going to have structured memory.
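A small sketch of what structured memory can mean in practice, with illustrative in-memory stores: evergreen context is overwritten, prompts are versioned, exemplars are tagged, so each new item is filed rather than piled.

```python
# Illustrative sketch of "structured memory compounds": each new item is filed by type,
# versioned or tagged, instead of landing in an undifferentiated pile of transcripts.
evergreen = {}     # stable context: who we are, standing preferences
prompts   = {}     # versioned prompts: name -> list of versions, newest last
exemplars = []     # tagged examples of finished work

def add_evergreen(key, text):
    evergreen[key] = text                          # overwrite: there is one current truth

def add_prompt(name, text):
    prompts.setdefault(name, []).append(text)      # append: keep the version history

def add_exemplar(text, tags):
    exemplars.append({"text": text, "tags": set(tags)})

def exemplars_by_tag(tag):
    return [e["text"] for e in exemplars if tag in e["tags"]]

add_evergreen("voice", "Plain language, short sentences.")
add_prompt("weekly_report", "Summarize wins, risks, next steps.")
add_prompt("weekly_report", "Summarize wins, risks, next steps; flag anything blocked > 3 days.")
add_exemplar("Q3 board update that landed well", tags=["report", "finance"])

print(prompts["weekly_report"][-1])    # latest version, earlier ones still recoverable
print(exemplars_by_tag("report"))
```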
These are the principles that work. They
work whether you are a power user with
ChatGPT or a developer building agentic
systems. Frankly, they are guideposts
for you if you are evaluating vendors in
the memory space. These are tool
agnostic principles. They're designed to
scale with complexity and they're
designed to give you keys that solve the
memory problem because they make
context persist reliably without the
brittleness that we see with current
AI systems. And so my challenge to you
as we wrap up this video, we've gone
through root causes. We've gone through
why memory is a hard problem. We've gone
through eight principles for how to
solve for this memory issue. Please take
memory seriously. The reason it matters
now is because if you solve memory now,
you have an agentic AI edge. These
systems are going to get cheaper and
more powerful, but you can't assume
they're magically going to solve for
memory. As I said at the beginning,
there's a chip level issue here. It is a
hard, hard problem. If they don't
magically solve for it, if you take
responsibility for memory and build it
yourself in the way that works for you,
you are starting the timer earlier than
everybody else around you on getting
memory that is functional across a
long-term engagement with AI. Because
you have to start to think, we're in
year two of the AI revolution. Wouldn't
it be great to have memory that goes
back to year two when you are
working with AI systems in 10 years, in
15 years, in 20 years? Everybody else is
going to have memory that started much
later and they're going to lose that
discipline, that acceleration, that
ability to manage deep work over time
that AI is going to be capable of with
proper memory structures. So there is a
moment here for you to think about and
put in place a memory structure that
works. Don't lose the opportunity. This
is a complex one, but it's on
you and me and all of us together to
build memory systems that handle our own
needs, whether that's personal needs or
professional needs. I know you can do
it. Drop in the comments how you're
doing it, because I think we should all crowdsource this.