0:02 been amazed by how much better these big
0:04 language models get as they scale up. It
0:06 feels like magic sometimes, like they
0:08 suddenly just know more. Oh, absolutely.
0:09 It really does feel that way. You see
0:12 these huge models tackling tasks that
0:14 seemed impossible just a short while ago
0:16 and with, well, surprising skill. And
0:18 that's where you come in, listener. We
0:19 know you're smart. You're curious about
0:21 what's really going on inside these
0:22 systems, but maybe you don't have time
0:24 to wade through dense research papers.
0:26 Right. Right. So, think of this as your
0:28 express lane to understanding a really
0:31 key, uh, cutting-edge idea behind these AI
0:33 improvements. Exactly. Today, we're
0:35 diving into this really interesting link
0:38 between how models internally represent
0:40 information, this concept called um
0:43 representation superposition and those
0:44 predictable improvement patterns we see
0:46 as they get bigger, the neural scaling
0:48 laws. Okay, our main source here is a
0:50 really compelling paper, "Superposition
0:52 Yields Robust Neural Scaling," plus some
0:54 extra notes to round things out. It
0:56 gives us both the empirical side, what
0:58 they observed, and a theoretical angle.
1:00 So, we know bigger models work better.
1:01 That's kind of the baseline observation.
1:03 But today, we're really digging into the
1:04 why, focusing on this superposition
1:07 idea. Our mission basically is to unpack
1:10 the mechanics behind LLM scaling. And
1:12 honestly, this research has some angles
1:13 that really made me think. Okay, let's
1:15 get into it. So, the uh the big
1:17 observation driving all this LLM
1:18 progress is pretty simple. Empirically
1:21 speaking, bigger models just do better.
1:22 More parameters, more data, more
1:24 compute. Right? And you consistently see
1:26 better performance across all sorts of
1:28 tasks. Language understanding,
1:30 reasoning, math, coding, you name it.
1:32 And it's not just a little bit better.
1:34 It improves in a structured way, right?
1:35 That's the scaling law part. Precisely.
1:38 It's not random improvement. These gains
1:39 follow patterns which we call neural
1:42 scaling laws. Often the model's error,
1:44 the loss, goes down following a power
1:46 law as the model size increases. Like a
1:48 predictable recipe. Add more
1:50 ingredients, get a predictably better
1:51 result. Yeah, something like that. It's
1:53 remarkably consistent. But that's the
1:55 puzzle, isn't it? Why do these simple
1:58 almost universal patterns pop out of
2:00 systems that are just incredibly
2:02 complex, millions, billions of
2:04 parameters? That is the core question
2:06 researchers have been wrestling with. Uh
2:07 there have been several ideas. Some
2:08 thought maybe larger models are just
2:11 better function approximators, you know,
2:12 capturing the complex math behind the
2:14 data or the structure of the data
2:17 itself, right? Or that. Others focused
2:19 on the idea that bigger models can
2:21 learn, let's say, more sophisticated
2:23 internal concepts or skills, better
2:25 representations. But those earlier
2:26 explanations weren't perfect, were they?
2:28 They had limitations. They did. One
2:30 issue was sensitivity. Some theories
2:32 seemed very dependent on the exact type
2:34 of data used for training, change the
2:36 data distribution slightly, and the
2:38 predicted scaling might change quite a
2:39 bit. Ah, okay. And it wasn't always
2:42 totally clear how these abstract ideas
2:44 actually, you know, manifested inside a
2:46 real LLM.
2:48 Got it. So useful starting points but
2:50 maybe missing a key piece. And that
2:52 missing piece might be this idea of
2:54 superposition. That's where this newer
2:56 research comes in. Yeah. The argument is
2:58 that how these models represent the vast
3:00 amounts of information they learn is
3:02 actually a critical bottleneck and that
3:04 leads directly to superposition.
3:06 Superposition. Okay. It sounds complex.
3:07 What does it mean in this LLM context?
3:10 Not quantum physics I assume. Ha. No,
3:11 not quantum physics here. In this
3:14 context, superposition means the model
3:17 manages to represent uh significantly
3:19 more features or concepts than you'd
3:20 expect just by looking at the number of
3:22 neurons in its hidden layers, the
3:24 dimensionality or width. Okay, think of
3:27 it like um packing for a long trip with
3:29 only a small suitcase. You have to get
3:31 really clever about folding, rolling,
3:32 maybe overlapping items to fit
3:34 everything in. That's a great analogy.
3:37 So, the LLM is sort of compressing or
3:39 overlapping knowledge representations to
3:40 fit more in. That's the core idea.
3:42 Earlier work kind of assumes something
3:44 maybe called weak superposition. A
3:45 little bit of overlap perhaps, but not a
3:47 defining feature, right? But this paper
3:50 argues that actual LLMs operate in a
3:52 strong superposition regime. The
3:54 overlap isn't just present. It's
3:56 significant and it's potentially why
3:58 they scale so well. That's a really key
4:00 innovative claim here. Okay.
4:02 Interesting. So to study this properly,
4:03 they couldn't just use a giant LLM,
4:05 right? They built a simplified model.
4:07 Exactly. They built a toy model. The
4:09 beauty of a toy model is stripping away
4:10 complexity to isolate the core
4:12 principles you want to study. Here it
4:14 was superposition and data structure
4:17 effects on scaling. Makes sense. So this
4:18 toy model was built around two main
4:20 ideas. One, it had more features it
4:22 needed to learn than dimensions
4:24 available to represent them. And two,
4:26 these features appeared with different
4:28 frequencies in the data. Some common,
4:30 some rare. Like learning words. Yeah.
4:31 Some you use all the time, others almost
4:34 never. Precisely. The model learned by
4:35 trying to reconstruct data made from
4:37 these hidden features. It has a key
4:40 part, a weight matrix W, where each row,
4:42 say W_i, is the model's internal code for
4:45 a specific feature. And crucially, they
4:47 could control if superposition
4:48 happened in this model. Yes, they could
4:50 compare scenarios. One was no
4:52 superposition. The model only learns the
4:54 most frequent features cleanly, no
4:56 overlap, like giving each piece of
4:58 furniture its own room. Okay. Okay. The
5:00 other was superposition where it tries
5:02 to represent more features, even the
5:04 rarer ones, but the representations
5:06 overlap. That's the clever packing
5:08 analogy again. Got it. And how did they
5:10 toggle that switch between weak and
5:12 strong superposition? They used a
5:13 standard machine learning technique
5:15 called weight decay. You can think of it
5:17 as a kind of regularization pressure
5:19 that encourages the model to be
5:20 efficient with its representation
5:23 weights. Ah, okay. In the toy model,
5:24 tweaking the amount of weight decay
5:26 acted like a knob. Low decay allowed or
5:28 even encouraged strong superposition,
5:30 lots of features overlapping. High decay
5:33 pushed it towards weak superposition,
5:34 fewer features represented, less
5:36 overlap. Okay, so they have this toy
5:37 model. They can control the
5:39 superposition. What did they find? How did
5:41 the model's learning, its loss reduction,
5:43 change with size in these two regimes?
5:45 Right. So in the weak superposition
5:47 case, they found the scaling, how fast
5:49 the error dropped as the model got
5:51 bigger was really sensitive to the
5:53 feature frequencies in the data. If the
5:54 frequencies follow a power law, the
5:56 error tended to scale as a power law
5:58 too. So performance improvement was
6:00 directly tied to the data's statistical
6:02 structure in that case. Exactly. But
6:04 then in the strong superposition
6:05 regime, things got really interesting
6:08 and this is a core innovation. The error
6:10 started scaling almost perfectly as
6:11 inversely proportional to the model
6:14 dimension. A robust power law exponent
6:16 close to one. Wow. Okay. And the crucial
6:18 part this held true across a wide range
6:20 of different feature frequency
6:23 distributions. The scaling became robust
6:24 largely independent of the data
6:26 specifics.
6:28 That's huge. So strong superposition
6:30 seems to unlock this very reliable
6:32 predictable scaling almost regardless of
6:34 what exact data it's learning. Why?
6:36 What's the explanation? They offer a
6:38 really elegant geometric explanation. It
6:39 comes down to the properties of
6:41 high-dimensional spaces. When you force
6:44 many vectors representing features into
6:46 a lower dimensional space, the model's
6:47 hidden layer, right, the interference
6:49 between them, the squared overlaps,
6:51 naturally starts to scale inversely with
6:53 the dimension of that space. It's a
6:55 mathematical consequence of packing
6:57 things tightly. That's the aha moment.
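To make the geometric claim concrete, here is a minimal sketch, assuming random Gaussian directions can stand in for learned feature vectors. That stand-in is an assumption for illustration; the paper's trained weights are not reproduced here, and the helper names `normalize`, `mean_squared_overlap`, and `random_rows` are hypothetical. For random unit vectors, the expected squared dot product between two of them is 1/m, so packing many more vectors than dimensions should give a mean squared overlap that shrinks roughly like 1/m as the dimension grows.

```python
import math
import random


def normalize(row):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]


def mean_squared_overlap(rows):
    """Average squared dot product over all distinct pairs of normalized rows."""
    unit = [normalize(r) for r in rows]
    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            dot = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += dot * dot
            pairs += 1
    return total / pairs


def random_rows(n_features, m, seed=0):
    """Hypothetical stand-in for a weight matrix: n_features Gaussian rows in m dims."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n_features)]


if __name__ == "__main__":
    n_features = 200  # many more features than dimensions
    for m in (8, 16, 32, 64):
        msq = mean_squared_overlap(random_rows(n_features, m))
        print(f"m={m:3d}  mean squared overlap={msq:.4f}  1/m={1 / m:.4f}")
```

The same `mean_squared_overlap` function could in principle be pointed at the rows of a real weight matrix instead of `random_rows`, which is closer in spirit to the measurement on actual models discussed in this episode.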
6:58 Then the act of squeezing things
7:00 together via superposition inherently
7:02 leads to this predictable scaling. That
7:03 seems to be the core insight. Yeah, it
7:05 suggests that the way LLMs have to
7:08 operate cramming vast knowledge into
7:10 limited dimensions naturally puts them
7:13 into this strong superposition regime
7:15 where scaling becomes robust. Okay, the
7:17 toy model makes a strong case, but does
7:19 this hold up in real LLMs? Did they find
7:21 evidence of strong superposition in
7:24 models like GPT or OPT? Yes, that was
7:25 the critical next step. They analyzed
7:28 several real open-source LLM families,
7:31 OPT, GPT-2, Qwen, Pythia, and indeed they
7:33 found evidence consistent with strong
7:35 superposition. How did they check that?
7:37 They looked closely at the weight matrix
7:38 in the part of the model that predicts
7:41 the next word, the language model head.
7:43 The patterns they saw there lined up
7:44 with what strong superposition would
7:46 predict. And did the performance scaling
7:48 match too? Did the real LLMs improve
7:50 with size in the way the toy model
7:51 suggested they should under strong
7:54 superposition? Remarkably, yes.
7:56 Quantitatively, the loss curves of these
7:59 LLMs, how their error decreased with size,
8:00 closely matched the predictions from the
8:02 strong superposition regime in the toy
8:04 model. It even aligned well with
8:06 established laws like the Chinchilla
8:07 scaling law. That's a really strong
8:09 validation, then. The toy model seems
8:11 to capture something fundamental. How
8:13 did they directly measure the overlap or
8:14 packing in the real models? They
8:16 calculated the mean squared overlaps
8:18 between the normalized rows of those
8:21 weight matrices. Essentially, how much
8:22 the representations for different
8:24 concepts interfered with each other on
8:26 average and they found it roughly
8:29 followed a 1/m scaling where m is the
8:31 model's hidden dimension just like the
8:33 geometric theory predicted. Wow. This
8:34 provides a direct empirical link from
8:36 the abstract theory to the internals of
8:39 actual LLMs. They also noted that token
8:41 frequencies in language data follow a
8:43 power law with an exponent near one
8:45 which fits the data conditions where
8:47 strong superposition yields robust
8:49 scaling. This is really building a
8:51 compelling picture. Okay, let's dive a
8:53 bit deeper into some of the more uh
8:55 academic details and the specific
8:56 innovations from the paper. You
8:58 mentioned a fraction of represented
9:01 features earlier. Yes, φ_{1/2} in the toy
9:03 model. This measured the proportion of
9:05 features whose internal representation
9:08 vector had a norm, a strength, greater
9:10 than 0.5. They found weight decay
9:12 strongly influenced this fraction. Okay.
9:14 And interestingly the feature norms
9:16 tended to cluster either strong or weak
9:19 a bimodal distribution, as seen in their
9:21 figure 3A. So weight decay wasn't just
9:23 on/off for superposition. It was tuning
9:24 which features the model was really
9:26 paying attention to. So weight decay
9:28 wasn't just an on/off switch for
9:29 superposition. It was tuning which features
9:32 get strongly represented. And what about
9:34 understanding the errors, the loss, more
9:36 precisely, especially in the weak
9:39 superposition case? Right. For weak
9:41 superposition, they found that the loss was
9:44 well described by the expected number of
9:46 activated but unlearned features.
9:49 Basically, error comes from encountering
9:51 frequent features the model hasn't
9:53 properly learned yet. Equation four
9:55 formalizes this and it matched
9:58 experiments well. That's an innovation
10:00 directly linking loss to unlearned
10:02 features.
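The weak-superposition picture can be illustrated with a small calculation. This is a sketch under stated assumptions, not the paper's equation four: suppose feature i appears with frequency proportional to i^(-α), the model cleanly learns only the m most frequent features, and the loss is proportional to the total frequency of everything it has not learned. For α > 1 and far more features than dimensions, that leftover mass falls off roughly like m^(-(α-1)), so the scaling exponent depends directly on the data. The names `unlearned_mass` and `loglog_slope` are hypothetical helpers.

```python
import math


def unlearned_mass(alpha, n_features, m):
    """Total (unnormalized) frequency i**-alpha of features ranked below the top m."""
    return sum(i ** -alpha for i in range(m + 1, n_features + 1))


def loglog_slope(alpha, n_features, m_small, m_large):
    """Slope of log(unlearned mass) versus log(m) between two model sizes."""
    y_small = unlearned_mass(alpha, n_features, m_small)
    y_large = unlearned_mass(alpha, n_features, m_large)
    return (math.log(y_large) - math.log(y_small)) / (
        math.log(m_large) - math.log(m_small)
    )


if __name__ == "__main__":
    n_features = 200_000  # assumed to be much larger than any m tried below
    for alpha in (1.5, 2.0, 2.5):
        slope = loglog_slope(alpha, n_features, 10, 1000)
        print(f"alpha={alpha}: slope={slope:.3f}  -(alpha-1)={-(alpha - 1):.3f}")
```

Changing `alpha` changes the measured slope, which is the data sensitivity the hosts describe for the weak regime.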
10:03 Makes sense. If you haven't learned
10:05 this, you'll make errors often. What
10:06 about strong superposition? If it's
10:08 trying to learn everything, where does
10:10 the error come from? Interference
10:12 overlaps. Because everything is packed
10:13 so tightly, even representing one
10:15 feature correctly involves some
10:17 unavoidable overlap with others causing
10:19 small errors in reconstruction. The
10:21 jostling furniture analogy again.
10:22 Exactly. They even derived a theoretical
10:25 lower bound on the maximum overlap, kappa,
10:27 showing how it scales with dimension m,
10:28 equation 5. And they distinguished
10:31 between strongly and weakly represented
10:32 features even within strong
10:34 superposition? Yes, features with
10:36 representation norms above one were
10:38 strongly represented generally having
10:39 smaller overlaps closer to that
10:42 theoretical minimum. Those with norms
10:43 below one were weakly represented and
10:46 had larger overlaps. Figures 6A and 6B
10:48 show this. Okay. Okay. And this led to
10:50 another key finding about the scaling
10:52 exponent itself. Yes. A really neat
10:55 empirical rule they found a mats. Here a
10:57 is the exponent for how loss scales with
10:59 model dement. And amma describes how
11:01 feature frequencies decay. What's the
11:03 significance of that rule? The big deal
11:05 is that for many realistic data
11:07 distributions a range of ammy values.
11:10 This formula gives a one. It means law
11:12 scales like 1 meter. This reinforces the
11:14 robustness. the scaling law becomes
11:16 largely independent of the specific data
11:18 statistics as long as you're in strong
11:21 superposition. That's a major insight.
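One common way to read a scaling exponent off a set of (dimension, loss) measurements is a straight-line fit in log-log space. The sketch below assumes a pure power law with no constant floor, and the numbers in the usage example are synthetic values made up for illustration, not results from the paper. The function name `fit_power_law_exponent` is hypothetical.

```python
import math


def fit_power_law_exponent(ms, losses):
    """Least-squares slope of log(loss) vs log(m); returns a in loss ~ m**-a."""
    xs = [math.log(m) for m in ms]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var


if __name__ == "__main__":
    # Synthetic losses that follow 3/m exactly; not measurements from any model.
    ms = [16, 32, 64, 128, 256]
    losses = [3.0 / m for m in ms]
    print(fit_power_law_exponent(ms, losses))
```

If the measured loss also contains a part that does not shrink with model size, that part would need to be subtracted or fit jointly before taking logs, otherwise the estimated exponent comes out too small.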
11:23 So again, strong superposition forces
11:25 this consistent 1/m scaling.
11:27 Powerful. Did other factors change this
11:29 like how often features appear? The
11:30 activation density. They checked that
11:32 too with a parameter E for activation
11:34 density. It affected the overall level
11:36 of the loss, the magnitude, but not
11:39 really the scaling exponent. So the 1/m
11:41 relationship holds regardless of
11:43 data sparsity, roughly speaking. And
11:45 tying it all back to real LLMs again,
11:47 they fit the loss curves. Yes, they
11:49 proposed a formula for LLM loss that
11:51 includes a term scaling as 1/m,
11:53 reflecting strong superposition, plus a
11:55 constant offset, loss independent of
11:57 model size. This formula fit the actual
11:59 loss curves of LLMs very well. Equation
12:02 8, figure 8 P. And the fitted exponent
12:04 was near one? Yes, the fitted exponent was
12:05 close to one, further supporting the
12:07 whole framework. They also connected the
12:09 dots between model dimension m and total
12:11 parameters n noting that empirically n
12:13 grows roughly like m to the power of
12:16 2.5. This links their 1/m scaling for
12:18 loss versus dimension to the observed
12:20 Chinchilla scaling law for loss versus total
12:22 parameters. Okay, so bringing all these
12:24 deep insights together. What does this
12:25 mean for the future? How might this
12:27 change how we build LLMs? Well, one
12:29 implication is that maybe just making
12:31 models bigger and bigger isn't the only
12:32 path forward or maybe it's becoming less
12:35 efficient. If strong superposition is
12:38 key to robust scaling, then maybe we
12:39 should focus on enhancing
12:41 superposition? Exactly. Could we design
12:42 architectures or training methods that
12:44 explicitly encourage strong
12:46 superposition? That might lead to smaller,
12:48 more efficient models that still achieve
12:50 high performance? Getting more bang for
12:51 your buck essentially. That's a
12:53 fascinating direction. Getting smarter,
12:55 not just bigger. Precisely. The paper
12:57 even nods towards some recent ideas like
13:00 the nGPT architecture or focus optimization
13:02 that might implicitly be doing something
13:04 like this. And there are open questions
13:05 too, like what happens in truly massive
13:07 models? Does some other bottleneck take
13:09 over? And how does superposition relate
13:12 to those amazing emergent abilities we
13:14 see? Lots to explore still. Okay, let's try
13:16 to wrap this up for our listener who's
13:18 followed along this deep dive. What are
13:20 the absolute key takeaways? I think the
13:22 main thing is that representation
13:24 superposition isn't just some obscure
13:26 detail. It seems to be a core mechanism
13:28 underpinning why LLMs scale so
13:31 predictably well. This idea of
13:32 efficiently packing features
13:33 geometrically
13:35 leads to robust improvement. The
13:37 innovation is really in understanding
13:39 why scaling works through the
13:41 superposition lens. And that aha moment for
13:43 me was the robustness, that consistent
13:46 1/m loss scaling popping up under
13:48 strong superposition almost regardless
13:50 of the data specifics and matching real
13:52 LLMs. Yeah, that's pretty striking. So
13:54 for you the listener, grasping these
13:56 principles gives you a real shortcut to
13:58 understanding a fundamental driver of
14:00 LLM progress without needing to drown in
14:02 all the technical weeds. You've got a
14:03 key piece of the puzzle now for why
14:06 bigger often means better. So maybe a
14:08 final thought to leave you with. Could
14:10 the next big leap in AI come not from
14:12 brute force scaling, but from clever
14:14 tricks to enhance superposition?
14:16 Finding smarter ways to pack that
14:18 knowledge? Definitely something to mull
14:20 over regarding where AI development is
14:21 headed. It certainly shifts the
14:23 perspective. And hopefully insights like
14:25 these, understanding superposition better,
14:27 will help us build not just more capable
14:29 AI but perhaps more efficient and