AI's Memory Wall: Why Compute Grew 60,000x But Memory Only 100x (PLUS My 8 Principles to Fix) | AI News & Strategy Daily | Nate B Jones
Summary
Core Theme
The core theme is that AI memory is a critical, worsening problem ("memory wall") due to fundamental architectural limitations, not just hardware constraints, requiring a shift from passive accumulation to deliberate architectural design for effective long-term AI interaction.
Memory is perhaps the biggest unsolved
problem in AI and it is one of the only
problems in AI that is getting worse,
not better. As we get better and better
and better at intelligence, we get worse
at memory, relatively speaking. In fact,
there's a name for it in the model maker
community. It's called the memory wall.
We are not improving the hardware chip
capabilities of our memory systems
nearly as fast as we are improving the
ability of those chips to infer or
compute words or do LLM inference. That
generates a growing gap between our
intelligence capabilities and our memory
capabilities. Don't worry, we won't stay
at the hardware level for long. I want
to go through with you the core issues
that we see as builders, as users of AI,
as designers of AI systems. What is the
root of the memory problems we
experience? If we're at a systems design
level, if we're at a usage level, if
we are even using ChatGPT, why are
memory problems so sticky and hard to
untangle? Why have we not seen better
solutions in the market? I think there
are good reasons for that. And then once
we go through those root causes, how can
we start to think about solving them?
How can we think about solving them as
users? How can we think about solving
them as builders? So, I'm going to go
through six root causes and then we're
going to flip the script and I'm going
to go through eight principles for
building a solution because I want you
to walk away from this and I want you to
feel empowered to actually design better
memory systems. I don't want you to wait
around for someone in Silicon Valley to
make a pitch and get funded for this.
You can design your own solution here.
So the key thing to keep in mind through
this whole conversation is that AI
systems are stateless by design but
useful intelligence requires state. So
every conversation is stateless meaning
it starts from zero. The model has
parametric knowledge which the weights
we talk about in a model right but it
doesn't have episodic memory. It does
not remember what happened to you. And
I'm sorry, but the 10 or 11 sentences or
the very lossy memory that ChatGPT
has right now or the ability to search
conversations that Claude has right now
is not good enough for that. You have to
reconstruct your context every single
time. This is not a bug actually. It is
an intentional architecture. It is a
design for statelessness because the
model makers want the model to be
maximally useful at solving the next
problem, the problem in front of you.
And they cannot presume that state
matters. It doesn't always matter. So
the promise of memory features is that
vendors are going to be able to
magically solve this by making the
system stateful in ways that are useful
to you. But this creates a whole host of
new problems because statefulness is not
the same for all of us. What should it
remember? Is it passive accumulation? Is
it active curation? How long should it
remember? Is it persistent forever? Is
it stale ever? Does it drop off after 30
days? When do you retrieve it? Do you
retrieve it when it's relevant, sort of
like Claude does? Do you retrieve it all
the time and potentially it's noisy in
the context window? How do you update
it? This is one of the biggest problems
with LLMs. People tell me they'll put
their wiki into a retrieval augmented
generation system and I'm like, when was
the last time you updated your wiki? If
it's not updated, how do you overwrite
it? How do you append data to it? How do
you change data? These are not
implementation details. They are
fundamental questions about what memory
is and its purpose when we do work.
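To make those questions concrete, here is a minimal Python sketch of a memory record where each design question becomes an explicit field. The field names (scope, expires_at, retrieval_policy, superseded_by) are illustrative assumptions, not any vendor's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

# Illustrative only: each design question becomes an explicit field or rule somewhere.
@dataclass
class MemoryRecord:
    content: str                             # what should it remember?
    scope: str                               # "personal" | "project" | "session" -- who/what it applies to
    created_at: datetime = field(default_factory=datetime.now)
    updated_at: datetime = field(default_factory=datetime.now)
    expires_at: Optional[datetime] = None    # how long should it remember? None = persistent
    retrieval_policy: str = "on_relevance"   # when do you retrieve it? e.g. "always" | "on_relevance"
    superseded_by: Optional[str] = None      # how do you update or overwrite it?

    def is_stale(self, now: Optional[datetime] = None) -> bool:
        now = now or datetime.now()
        return self.expires_at is not None and now > self.expires_at


# Example: a project-scoped fact that drops off after 30 days
note = MemoryRecord(
    content="Draft v3 of the launch brief supersedes v2.",
    scope="project",
    expires_at=datetime.now() + timedelta(days=30),
)
print(note.is_stale())  # False today; True after 30 days
```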
Memory matters because we humans are
able to quickly and fluidly negotiate
between stateless brainstorming things
that are like wild and we don't need to
use a lot of our past memory and very
stateful work. LLMs are not good at
that. Loading that context is very very
hard right now. So why is this so
persistent? We've talked a little bit
about how the promise is hard to
fulfill, but what are some of the root
causes that make it hard for vendors to
do this? Number one, the relevance
problem is one of the gnarliest unsolved
problems out there. What's relevant
actually changes based on the task that
you're doing. Are you planning? Are you
executing? The phase of your work. Are
you just exploring? Are you refining
your work? The scope you're in, right?
Is it a personal or is it a project? I
know someone who is in the healthcare
industry. And they have to be very
careful because if they were to ever ask
for health advice then the memory
retrieval within ChatGPT would pull up
work stuff and they are afraid in the
same context if they pull up a work
thing that their personal health data
will leak in because it will all look
like health data. So the scope matters.
What has changed since the last time you
talked? The state delta is what we would
call that. If you come back and you say
this is a new version, does it really
understand that's a new version or not?
Semantic similarity, which is what
retrieval-augmented generation depends
on is just a proxy. It is a proxy for
relevance. It is not a true solution.
Finding similar documents works until
you need to find the document where we
decided X and that's very specific. Or
ignore everything about client A right
now but pay attention to clients B, C,
and D. Or please only pay attention to
what we've decided since October 12th.
These are all things that we humans can
understand and execute on when we go and
manually retrieve information. But the
AI using semantic search, it's just not
the right tool for that job. There's no
general algorithm for relevance. There's
no magic relevance solve that the AI can
depend on. You need to use human
judgment about task context. And that
means requiring very complicated
architectures to accomplish a specific
memory task, not just better embeddings
in a RAG memory system. And that, by the
way, is one of the big reasons why these
like one-stop shop vendors often
struggle with real implementations.
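Here is a rough sketch of what that looks like in practice, with an invented three-document corpus and a deliberately naive word-overlap scorer standing in for embeddings. The point is that "the document where we decided X," "ignore client A," and "since October 12th" are metadata filters layered on top of similarity, not something similarity can express on its own.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    text: str
    client: str
    kind: str          # e.g. "decision", "note", "draft"
    decided_on: date

# Invented corpus for illustration
docs = [
    Doc("We decided to ship the API in beta.", "client_b", "decision", date(2024, 10, 20)),
    Doc("Brainstorm notes about pricing.",     "client_a", "note",     date(2024, 10, 1)),
    Doc("We decided to drop the legacy SKU.",  "client_c", "decision", date(2024, 9, 30)),
]

def similarity(query: str, text: str) -> float:
    # Naive stand-in for an embedding model: shared-word overlap.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(query, exclude_clients=(), only_kind=None, since=None, top_k=3):
    # Hard metadata filters first -- similarity alone cannot express these.
    pool = [d for d in docs
            if d.client not in exclude_clients
            and (only_kind is None or d.kind == only_kind)
            and (since is None or d.decided_on >= since)]
    return sorted(pool, key=lambda d: similarity(query, d.text), reverse=True)[:top_k]

# "Find the document where we decided X, since October 12th, ignoring client A"
for d in retrieve("what did we decide about the API",
                  exclude_clients={"client_a"}, only_kind="decision",
                  since=date(2024, 10, 12)):
    print(d.text)
```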
Number two, the persistence precision
trade-off is a massive issue with memory
systems. If you store everything,
retrieval becomes very noisy and it
becomes very expensive. You jam up your
context window. If you store
selectively, you're going to lose
information that you need later. If you
let the system decide what to keep, it
optimizes often for something that you
didn't ask it to. Maybe it optimizes for
recency. Maybe it optimizes for
frequency. Maybe it optimizes for
statistical saliency versus actual
importance. And if you wonder what
statistical saliency is, have you ever
tried having an argument with ChatGPT
or Claude or Gemini about the fact that
it's emphasizing the wrong thing in
something it's writing? That is saliency.
That's a saliency defect. Human memory is
actually, funnily enough, very good at
this through the technology of
forgetting. We use incredibly lossy
compression with emotional and
importance weighting. And so we've
actually done studies on human memory.
And it turns out that you can with
practice get better and better and
better at recalling specific things. But
if you choose not to recall something
that happened to you, you're just going
to lose it. And what's interesting is it
seems to be a database keys issue for
us. I realize someone in the comments is
going to be a neuroscientist and rightly
take me to task. But my understanding of the
reading is that you have to be able to
remember the equivalent of a database
key to retrieve the memory. And if you
can do that, the memory becomes
accessible again. But your short-term
memory, so to speak, in humans is very
lossy. And so you lose the database keys
if you can't persist them with intent,
if you don't intend to remember them.
And that is why fundamentally your
childhood memories can be very
accessible. But what happened last
Thursday? You're sitting there and
you're like, did we eat out or did we
not eat out? Which day did we go
to the movies? Right? It's not because
you have a profound issue with memory.
It's because your brain is desperately
compressing information to make it
useful to you and has dumped out those
database keys. And when you go to the
effort of remembering, you're literally
retrieving the database keys to get the
memory back. Forgetting is a useful
technology for us. That's the point of
that. AI systems don't have any of that.
They either accumulate or they purge,
but they do not decay. And what I'm
talking about when I'm like, did I go to
the movie? Oh, yeah. It was the movie.
Who was that character? Oh, now I'm
recovering the key and I'm able to
get it back. The memory has decayed into
a lossy approximation in the memory key,
but I can recover it if I put effort
into it. We have nothing like that in
AI. That is a uniquely human technology
and it's funny but we have to think
about forgetting as a technology when we
talk about memory. Number three, the
single context window assumption.
Vendors often try to solve memory by
making context windows bigger. But
volume is not the issue. The structure
is the problem. A million token context
window is not a usable million token
context window if it's full of unsorted
context. That is worse than a tightly
curated 10,000 tokens. The model still has to
find what matters, parse the
relevance, ignore the noise. You have
not solved the problem by expanding the
context window. You have simply made
your problem more expensive. Sometimes
substantially more expensive. I know
people who make calls and they don't
budget the calls and they're like, "Why
is my API bill high?" I'm like, your API
bill is high because you're stuffing the
context window and you're just kind of
trying to throw queries against it. It
does not work well and it also is very
expensive. The real solution requires
multiple context streams with different
life cycles and retrieval patterns. It
is hard. You have to design it. It
breaks the mental model of just talk to
the AI. That is why there is no
one-size-fits-all solution. Issue number
four is the portability problem. Every
single vendor builds proprietary memory
layers because they think in their pitch
deck that memory is a moat. I get it. It
makes sense on a pitch deck. ChatGPT
memory, Claude recall, Cursor memory banks.
These are not inherently interoperable.
Users will invest time building up
memory in a given system. And the model
makers like that because it makes the
switching cost real and you can't port
what ChatGPT knows about me to Claude,
and your memory is locked in, and so on.
The problem here is a problem of the
commons. This behavior set from vendors
and model makers and tool builders
encourages users to leave memory to the
tool rather than encouraging them to
build a proper context library. And I
get it from a product design perspective
because like how many users are going to
really build a product context library?
But if we reframe it and we say
portability is a first class problem,
users should be inherently able to be
multi-model. I think that's more
relevant. And maybe from a consumer
standpoint, you don't care because you
have 800 million users in ChatGPT. It
dwarfs everything else, etc. One, that's
not entirely true, because Gemini is, I
think, closing in on half a billion now.
But the other reason is that from a
business perspective, you have to be
multi-model. It is a liability to be
single-model. And so if
you're building business memory systems,
you must solve the portability problem.
And the issue is any given vendor is not
incentivized to make that truly portable
either. They want to make that
proprietary to them. And then you have
the same bottleneck, but now you're on a
vendor who may not be as well funded as
the model maker. And so it becomes a
house of cards. Number five, the passive
accumulation fallacy. Most memory
features assume you just use your AI
normally and it will figure out what to
remember. That is the default mental
model of users. And so that's the
assumption that memory features build
around. But this fails because the
system cannot distinguish a preference
from a fact. It cannot easily tell
project specific from evergreen context.
I've often seen that mixed up. It
doesn't automatically know when old
information is stale. If you've ever
wondered why ChatGPT or Claude or
Perplexity comes back and talks about
old AI models as if they are active
today, that is the same issue. They
can't tell when old information is stale
and it optimizes for continuity. It does
not optimize for correctness. This is
the keep the conversation going issue.
Useful memory fundamentally requires
active curation. You have to decide what
to keep, what to update, and what to
discard. And that is work. And so
vendors promise passive solutions
because active curation, they are told,
does not scale as a product. I think we
have to start by framing that problem
better because it turns out passive
accumulation doesn't solve for it
either. And this is still a big enough
problem that it costs us billions of
dollars at the enterprise level and it's
extremely frustrating for users both
personally and professionally. The
answer cannot be there is no answer or
we'll fake the answer. Finally, number
six on the root cause side, and then we're
going to get to the solve. It'll feel
better. Memory is actually multiple
problems. And that's part of why it's so
hard. I hope you're getting that idea,
right? When people say AI memory, what
they really mean is any number of
preferences, how I like things done.
That could be a key value that's
persistent. They could mean facts.
What's true about particular things or
entities. That can be structured, and it
might need updates. They might mean
knowledge, right? Domain expertise. And
that can be parametric, right? That can
be embedded in weights, but it might not
be, and then what do you do? It can
be episodic. So it could be
conversational, temporal, ephemeral
knowledge. It can also be procedural.
Have we solved this before? Right? If
episodic memory is what we've discussed
in the past, procedural memory is how we
solved this problem in the past. And
those are also different things. And so
you have exemplars there, you have
successes and failures in procedural
memory. Every single memory type needs
different system design to handle
storage, retrieval, and update patterns.
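A minimal sketch of that taxonomy, with illustrative (not prescriptive) defaults for how each memory type might map to storage, lifecycle, and update pattern:

```python
from enum import Enum, auto

class MemoryType(Enum):
    PREFERENCE = auto()   # how I like things done
    FACT       = auto()   # what's true about an entity; needs updates
    KNOWLEDGE  = auto()   # domain expertise; may be parametric or retrieved
    EPISODIC   = auto()   # what we discussed; conversational, temporal
    PROCEDURAL = auto()   # how we solved this before; exemplars, successes, failures

# Illustrative defaults only -- the real choices depend on your system.
DESIGN = {
    MemoryType.PREFERENCE: dict(store="key-value",        lifecycle="permanent",        update="overwrite"),
    MemoryType.FACT:       dict(store="relational",       lifecycle="until superseded", update="versioned"),
    MemoryType.KNOWLEDGE:  dict(store="vector index",     lifecycle="slow-changing",    update="re-embed"),
    MemoryType.EPISODIC:   dict(store="event log",        lifecycle="decays / expires", update="append-only"),
    MemoryType.PROCEDURAL: dict(store="tagged exemplars", lifecycle="curated",          update="append + prune"),
}

for mtype, d in DESIGN.items():
    print(f"{mtype.name:10s} -> store={d['store']}, lifecycle={d['lifecycle']}, update={d['update']}")
```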
And if you feel like you're getting a
headache here, you're not alone. This is
why we don't have a good solve. And this
is why I want to lay out in the next
section principles for solve. But it
starts with being honest about the
problem. Treating this problem as one
problem guarantees you are going to
solve none of the real problems well.
And that is why we have memory as a
persistent issue, in fact a worsening
issue in the AI community. Vendors
fundamentally are treating this as a
solve for infrastructure and not a solve
for architecture. And so bigger windows
and better embeddings and cross-chat
search scale, but they don't solve
structurally. And users keep expecting
passive solutions because they're
frankly sold passive solutions. There's
an expectations issue here. "Just
remember what matters" is not something
that you can expect to work. But we're
told that it will work. So if memory
requires architecture and users want
magic, the gap between what's promised,
what's delivered, and what's needed has
never been bigger. We have a memory wall
of our own beyond the chip level in how
we design our systems. And it won't get
solved if we solve the wrong problem. So let's
say you've gone through all of this and
you want to solve memory correctly. I am
going to give you principles that work
whether you are using the chat as a
sort of power user at home and you want
to build something yourself because this
absolutely works for that or whether you
are designing larger systems because it
turns out that the principles for memory
are fractal because the problem is
fractal. We have the same kinds of
memory issues when we are power users
individually in a chat as we do when we
are designing agentic systems. So the
principles that work. Number one,
there's going to be eight of these.
Settle in. It's going to be fun. Memory
is an architecture. Memory is an
architecture. It is not a feature. You
cannot wait for vendors to solve this. I
think you get this idea. We won't spend
too long here. Every tool will have
memory capabilities, but if you leave it
to tools, they will solve different
slices. You need principles that work
across all of them. And you need to
architect memory as a standalone that
works across your whole tool set.
Principle two, you should separate by
life cycle, not by convenience. So as an
example, you need to separate personal
preferences which can be permanent from
project facts which can be temporary and
those should be separated from session
state which can be ephemeral or
conversation state. Mixing different
life cycle states, mixing permanent with
temporary with ephemeral, just breaks
memory. The discipline lies in keeping
these apart cleanly. And again, this
works if you're in chat. It works if
you're designing agentic systems. If
you have a permanent personal
preference, it is possible. It is as
simple as a very disciplined system chat
update where you go into the sort of
system rules and the system prompt for
ChatGPT and you say, "This is what you
need to know about me. These are my
personal preferences." And model makers
are starting to make that more exposed
because they want that. But they don't
tell you how to use that properly. And
when I observe how people actually use
that "tell me about yourself," it is
absolutely a mix of personal preferences
and ephemeral stuff and project facts
because no one has taught them to use it
better. And if you're designing agentic
systems, it gets more complex, but it's
the same separation of concerns. You
have to separate out what are the
permanent facts in the situation here.
What are project specific facts and what
is session state. Principle number
three, you need to match storage to
query pattern. So that means you're
going to need multiple stores because
different questions require different
retrieval. Now in the chat situation
that I gave you, ChatGPT can retrieve
the memory if it's a system prompt kind
of a thing, and it just calls it into the
context window. It's super simple, and
most people would never think of it as
memory, but that's what it is. If
you're designing an agentic system, it
is understanding the difference between,
for example, what is my style, which
could be a key value because it's a
written style of some sort. What is the
client ID, which should be structured
data or relational data, what similar
work have we done, which could be
semantic or vector storage data, and
what did we do last time, which should
be event logs. Those are four different
types of data, right? You have key value
data, structured data, semantic data,
event logs. Trying to do all of these in
one storage pattern is going to fail.
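Here is a toy sketch of that routing, with four stand-in stores, one per query type. The names and data shapes are illustrative assumptions, and the "semantic" search is stubbed with keyword matching rather than a real vector index.

```python
# A rough sketch of "match storage to the query pattern": four toy stores,
# one per query type from above. Names and shapes are illustrative, not a framework.
from collections import defaultdict

style      = {"tone": "direct", "format": "short paragraphs"}        # key-value: "what is my style?"
clients    = {"client_42": {"name": "Acme", "tier": "enterprise"}}   # structured/relational: "what is the client ID?"
exemplars  = ["Q3 pricing one-pager", "API launch brief"]            # semantic/vector (stubbed): "what similar work have we done?"
event_log  = defaultdict(list)                                       # event log: "what did we do last time?"
event_log["client_42"].append({"ts": "2024-10-20", "event": "sent draft v2"})

def answer(query_type: str, key=None, query_text=None):
    if query_type == "style":
        return style                                   # exact key-value lookup
    if query_type == "client":
        return clients[key]                            # exact structured lookup
    if query_type == "similar_work":
        # stand-in for a vector search over embeddings
        return [e for e in exemplars
                if query_text and any(w in e.lower() for w in query_text.lower().split())]
    if query_type == "last_time":
        return event_log[key][-1]                      # most recent event for this key
    raise ValueError(f"unknown query type: {query_type}")

print(answer("style"))
print(answer("client", key="client_42"))
print(answer("similar_work", query_text="pricing work"))
print(answer("last_time", key="client_42"))
```

The exact backends matter less than the fact that each query type hits a store whose access pattern actually fits it.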
And that is why when people say, "We
have our data lake and it's going to be
a RAG." I'm like, why? Why is it going
to be a RAG? Have you heard the word RAG
repeated a hundred times like a magic
spell for memory? It does not work that
way. You need to match storage to the
query pattern. Otherwise, you just have
a very expensive data dump. Principle
number four, mode-aware context beats
volume hands down. And so more context
is not better context. Planning
conversations need breadth like they
need to have space for alternatives.
They need to have space for comparables.
Brainstorming conversations are similar
to planning conversations. You need to
be able to range. Execution
conversations. Execution workflows in
agentic situations. They need precision.
They need precise constraints. Retrieval
strategy needs to match your task type.
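One way to picture mode-aware context is as an explicit retrieval configuration chosen by task type. This is a sketch with made-up numbers, not tuned values: planning and brainstorming get breadth, execution gets a tight similarity floor plus pinned constraints.

```python
# Illustrative only: "mode-aware context" as an explicit retrieval config,
# chosen by task type instead of one-size-fits-all settings.
MODES = {
    "planning":      dict(top_k=12, similarity_floor=0.3, enforce_constraints=False),
    "brainstorming": dict(top_k=15, similarity_floor=0.2, enforce_constraints=False),
    "execution":     dict(top_k=4,  similarity_floor=0.7, enforce_constraints=True),
}

def build_context(task_type: str, candidates: list[tuple[str, float]]) -> list[str]:
    cfg = MODES[task_type]
    kept = [text for text, score in candidates if score >= cfg["similarity_floor"]]
    kept = kept[: cfg["top_k"]]
    if cfg["enforce_constraints"]:
        # execution mode: pin the hard constraints first, before anything fuzzy
        kept = ["CONSTRAINTS: ship date 2024-11-01; budget $40k (example values)"] + kept
    return kept

candidates = [("old pricing discussion", 0.45), ("current launch spec", 0.9), ("loose analogy", 0.25)]
print(build_context("brainstorming", candidates))  # broad: keeps almost everything
print(build_context("execution", candidates))      # narrow: constraints + only the precise match
```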
You cannot just sit there and think to
yourself, okay, I'm going to have a
brainstorming conversation and it's
going to be incredibly precise and just
hope that it works. This is why I talk
about prompting so much. Effectively,
what is prompting doing? It is giving
context that is mode-aware to an AI so
that it can be in the right mode. And
that's super effective for chat users.
But guess what? If you're designing
agentic systems, it is your
responsibility to architect mode
awareness into the system so that it is
aware that this is an execution
environment and that precision matters
and that it is audited and evaluated on
precision. Principle number five, you
need to build portable as a first class
object. You need to build portable and
not platform dependent. Your memory
layer needs to survive vendor changes.
It needs to survive tool changes. It
needs to survive model changes. If ChatGPT
changes their pricing, if Claude
adds a feature, your context library
should be retrievable regardless. And
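There is no standard way to do this, but a minimal sketch of a portable context library is just plain JSON or Markdown files on disk that you own and that any tool or model can read. The folder layout and field names here are illustrative assumptions.

```python
# A minimal sketch of a portable context library: plain JSON on disk,
# owned by you, readable by any tool or model. File names and layout are illustrative.
import json
from pathlib import Path

LIBRARY = Path("context_library")           # lives outside any one vendor's product

def save_record(category: str, name: str, content: str, tags=()):
    folder = LIBRARY / category             # e.g. "preferences", "projects", "exemplars"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / f"{name}.json").write_text(json.dumps(
        {"name": name, "content": content, "tags": list(tags)}, indent=2))

def export_for_prompt(category: str) -> str:
    # Assemble a vendor-neutral block of text you can paste or inject into any model.
    parts = []
    for path in sorted((LIBRARY / category).glob("*.json")):
        rec = json.loads(path.read_text())
        parts.append(f"### {rec['name']}\n{rec['content']}")
    return "\n\n".join(parts)

save_record("preferences", "writing_style", "Short paragraphs, no hedging, cite sources.", tags=["evergreen"])
save_record("projects", "launch_brief", "Ship the beta API on 2024-11-01; budget is fixed.", tags=["q4"])
print(export_for_prompt("preferences"))
```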
that is something that almost nobody can
say right now. And the people who are
doing it tend to be designing very large
scale agentic AI systems at the
enterprise level. But this is a lesson
that we all need to take with us. I
think it is a best practice. It is sort
of like keeping a go bag next to the
door in case you need to get out in case
of I don't know something happens to
your house. You need to have something
that is portable that carries relevant
memory that you can use to have
productive conversations with another
AI. I fully admit there is not an
out-of-the-box solution for this. There are
people who are power users who configure
Obsidian to do this, right, as a
note-taking app, and they tie it into AI
and it becomes a portable,
platform-independent way of handling this. There
are people who use Notion for this. The
common trait is that
they are obsessed with making sure the
memory is configured correctly for them
and the AI has to come in and be queried
correctly or called correctly to engage
with a piece of the memory that matters.
Whether that is a key value piece like
what's my style or a semantic search
like what similar work have we done
together. A good data structure accounts
for that. Principle number six,
compression is curation. Do not upload
40 pages hoping the AI extracts what
matters. I see people do this when they
overload the context window and they ask
for an analysis of a report. You need to
do the compression work. You need to
either in a separate LLM call or in your
own work, write the brief, identify the
key facts that matter and state the
constraints. This is where judgment
lives. And if you don't delegate it, you
will be happier with the precision and
context awareness of the response.
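A sketch of that compression step, assuming a placeholder call_llm function standing in for whatever model API you use: the model drafts the extraction, but the brief that enters context is short, structured, and confirmed by a human before it is stored.

```python
# A sketch of "compression is curation": the model can help extract, but the brief
# that enters context is short, structured, and confirmed by a human.
# `call_llm` is a placeholder for whatever model API you use -- not a real client.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Placeholder: imagine this returns the model's draft extraction of the 40-page report.
    return "- Key fact: churn rose 4% in Q3\n- Constraint: budget fixed at $40k\n- Key fact: two vendors shortlisted"

@dataclass
class Brief:
    key_facts: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    confirmed_by_human: bool = False    # judgment stays with you

def draft_brief(report_text: str) -> Brief:
    raw = call_llm(f"Extract key facts and constraints as bullets:\n{report_text[:4000]}")
    brief = Brief()
    for line in raw.splitlines():
        line = line.lstrip("- ").strip()
        (brief.constraints if line.lower().startswith("constraint") else brief.key_facts).append(line)
    return brief

brief = draft_brief("...40 pages of report text...")
# You read it, correct anything wrong, then mark it confirmed -- only then does it enter memory.
brief.confirmed_by_human = True
print(brief)
```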
Memory is bound up in how we humans
touch the work. There are ways to use AI
to amplify and expand your judgment. You
can use a precise prompt to extract
information in a structured way from 40
pages of data and then in a separate
sort of piece of work figure out what to
do with that data. But it remains on you
to make sure that the facts are correct,
that the constraints are real, and that
the precision work you're asking AI to
do with that data is the correct
precision work. The judgment in
compression is human judgment. It may be
human judgment that you amplify with AI,
but it remains human judgment. Principle
number seven, retrieval needs
verification. So semantic search will
recall well but fail on specifics,
right? It will recall topics and themes
well. You need to pair fuzzy retrieval
techniques like RAG search with exact
verification where facts must be
correct. You should have a two-stage
retrieval call path, right? Recall
candidates and then verify against some
kind of ground truth. This is especially
important in situations where you have
policy or you have financial facts or
legal facts that you need to validate.
Something like this is exactly why there
was a very prominent fine levied
against a major consulting firm in the
last two weeks. I think the fine came to
close to half a million dollars because
they could not verify facts around court
cases in a document that they prepared
and they hallucinated them and they
didn't catch them. Retrieval failed.
Retrieval failed. And because the LLM is
designed to keep the conversation going,
it just inserted something plausible and
nobody caught it. You need to be able to
verify retrieval against ground truth.
Now, if it's a small task, that might be
the human at the other end of the chat,
right? It just is a step that needs
doing. If it's a large agentic system,
it is the exact same fractal principle,
but you need to do it in an automatic
way using an AI agent for evals.
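A rough sketch of that two-stage path, using made-up case numbers and a toy registry as the ground truth: fuzzy recall first, then exact verification before any specific claim is allowed into the answer.

```python
# A rough sketch of two-stage retrieval: fuzzy recall first, then exact verification
# against a ground-truth table before a specific claim is allowed through.
import re

ground_truth_cases = {"2023-CV-1182", "2024-CV-0417"}   # illustrative authoritative registry

recalled_passages = [
    "In 2024-CV-0417 the court allowed the amended filing.",
    "A similar outcome was reached in 2021-CV-9999.",     # plausible-looking but not in the registry
]

def verify(passages):
    verified, flagged = [], []
    for p in passages:
        cited = set(re.findall(r"\d{4}-CV-\d{4}", p))
        if cited and not cited.issubset(ground_truth_cases):
            flagged.append((p, cited - ground_truth_cases))   # do not let this into the answer
        else:
            verified.append(p)
    return verified, flagged

ok, bad = verify(recalled_passages)
print("verified:", ok)
print("needs human/agent review:", bad)
```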
Principle number eight, memory compounds
through structure. So random
accumulation actually does not compound.
It just creates noise. Just adding stuff
doesn't compound. If we just added
memories randomly the way we experience
them in life and we had no lossiness, no
forgetting ability, we would not be able
to function as people. Forgetting is a
technology for us. In the same way that
forgetting is a technology for us,
structured memory is a technology for
LLM systems. So evergreen context goes
one place, versioned prompts go another
place, tagged exemplars go another
place. And at a small scale, yes, you
can do this. People are doing this with
Obsidian, with notion, with other
systems as individuals. And yes, you can
scale this as a business. Same same
principle. You let each interaction
build without degradation if you have
structured memory. Otherwise, you just
have random accumulation. Otherwise, you
have the pile of transcripts you never
got to, and you're like, well, this is
data. We're logging it. It's probably
good. It's just going to be random
accumulation. It creates noise. You're
not going to have structured memory.
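A small sketch of what structured memory can mean in practice, with illustrative in-memory stores: evergreen context is overwritten, prompts are versioned, exemplars are tagged, so each new item is filed rather than piled.

```python
# Illustrative sketch of "structured memory compounds": each new item is filed by type,
# versioned or tagged, instead of landing in an undifferentiated pile of transcripts.
evergreen = {}     # stable context: who we are, standing preferences
prompts   = {}     # versioned prompts: name -> list of versions, newest last
exemplars = []     # tagged examples of finished work

def add_evergreen(key, text):
    evergreen[key] = text                          # overwrite: there is one current truth

def add_prompt(name, text):
    prompts.setdefault(name, []).append(text)      # append: keep the version history

def add_exemplar(text, tags):
    exemplars.append({"text": text, "tags": set(tags)})

def exemplars_by_tag(tag):
    return [e["text"] for e in exemplars if tag in e["tags"]]

add_evergreen("voice", "Plain language, short sentences.")
add_prompt("weekly_report", "Summarize wins, risks, next steps.")
add_prompt("weekly_report", "Summarize wins, risks, next steps; flag anything blocked > 3 days.")
add_exemplar("Q3 board update that landed well", tags=["report", "finance"])

print(prompts["weekly_report"][-1])    # latest version, earlier ones still recoverable
print(exemplars_by_tag("report"))
```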
These are the principles that work. They
work whether you are a power user with
ChatGPT or a developer building agentic
systems. Frankly, they are guideposts
for you if you are evaluating vendors in
the memory space. These are tool
agnostic principles. They're designed to
scale with complexity and they're
designed to give you keys that solve the
memory problem because they make
context persist reliably without the
brittleness that we see with current
AI systems. And so my challenge to you
as we wrap up this video, we've gone
through root causes. We've gone through
why memory is a hard problem. We've gone
through eight principles for how to
solve for this memory issue. Please take
memory seriously. The reason it matters
now is because if you solve memory now,
you have an agentic AI edge. These
systems are going to get cheaper and
more powerful, but you can't assume
they're magically going to solve for
memory. As I said at the beginning,
there's a chip level issue here. It is a
hard, hard problem. If they don't
magically solve for it, if you take
responsibility for memory and build it
yourself in the way that works for you,
you are starting the timer earlier than
everybody else around you on getting
memory that is functional across a
long-term engagement with AI. Because
you have to start to think, we're in
year two of the AI revolution. Wouldn't
it be great to have memory that goes
back to year two when you are
working with AI systems in 10 years, in
15 years, in 20 years? Everybody else is
going to have memory that started much
later and they're going to lose that
discipline, that acceleration, that
ability to manage deep work over time
that AI is going to be capable of with
proper memory structures. So there is a
moment here for you to think about and
put in place a memory structure that
works. Don't lose the opportunity. This
is a complex one, but it's on
you and me and all of us together to
build memory systems that handle our own
needs, whether that's personal needs or
professional needs. I know you can do
it. Drop in the comments how you're
doing it, because I think we should all crowdsource this.