0:02 90% of Claude Code was written by Claude
0:04 Code. Codex is releasing features
0:07 entirely written by Codex. And yet most
0:10 developers using AI empirically get
0:12 slower, at least at first. The gap
0:13 between these two facts is where the
0:15 future of software lives. Imagine
0:18 hearing this at work. Code must not be
0:20 written by humans. Code must not be even
0:23 reviewed by humans. Those are the first
0:24 two principles of a real production
software team called StrongDM and their
0:30 software factory. They're just three
0:32 engineers. No one writes code. No one
0:35 reviews code. The system is a set of AI
0:37 agents orchestrated by markdown
0:39 specification files. The system is
0:41 designed to take a specification, build
0:43 the software, test the software against
0:45 real behavior scenarios, and
0:48 independently ship it. All the humans do
0:50 is write the specs and evaluate the
0:53 outcomes. The machines do absolutely
0:55 everything in between. As I was saying,
0:58 meanwhile, 90%. And yes, it's true. Over
1:00 at Anthropic, 90% of Claude Code's
1:02 codebase was written by Claude Code
1:04 itself. Boris Cherny, who leads the
1:06 Claude Code project at Anthropic, hasn't
1:08 personally written code in months. And
1:10 Anthropic's leadership is now estimating
1:12 that functionally 100%, the entirety of
1:14 code produced at the company, is AI
1:17 generated. And yet at the same time, in
1:19 the same industry, with us here on the
1:22 same planet, a rigorous 2025 randomized
1:24 controlled trial by METR found that
1:27 experienced open-source developers using
1:32 AI tools took 19% longer to complete
1:34 tasks than developers working without
1:36 them. There is a mystery here. They're
1:38 not going faster, they're going slower.
1:40 And here's the part that should really
1:42 unsettle you. Those developers are bad
1:45 at estimation. They believed AI had made
1:48 them 24% faster. They were wrong not
1:50 just about the direction but about the
1:53 magnitude of the change. Just a few teams
2:02 around tech are running truly lights out
2:04 software factories. The rest of the
2:06 industry tends to get measurably slower
2:08 while convincing themselves and everyone
2:09 around them with press releases that
2:11 they're speeding up. The distance
2:14 between these two realities is the most
2:16 important gap in tech right now and
2:19 almost nobody is talking honestly about
2:21 it and what it takes to cross it. That
2:22 is what this video is about. Dan
2:25 Shapiro, the CEO over at Glowforge and
2:26 the veteran of multiple companies built
2:28 on the boundary between software and
2:30 physical products, just published a
2:32 framework earlier this year in 2026 that
2:35 maps where the industry stands. He calls
2:37 it the five levels of vibe coding. And
2:38 the name is deliberately informal
2:40 because the underlying reality is what
2:43 matters. Level zero is what he calls
2:45 spicy autocomplete. You type the code,
2:48 the AI suggests the next line. You
2:50 accept or reject. This is GitHub Copilot
2:52 in its original format. Just a faster
2:54 tab key. The human is really writing the
2:56 software here. And the AI is just
2:57 reducing the keystrokes and the effort
3:00 your fingers have. Level one is coding
3:02 intern. You hand the AI a discrete, well
3:05 scoped task, like write this function,
3:15 build this component, or refactor this
3:17 module. You then review, as the human,
3:19 everything that comes back. The AI
3:21 handles the tasks. The human handles the
3:22 architecture, the judgment and the
3:24 integration. Do you see the pattern
3:25 here? Do you see how the human is
3:27 stepping back more and more through
3:29 these levels? Let's keep going. Level
3:31 two is the junior developer. The AI
3:33 handles multifile changes. It can
3:35 navigate a codebase. It can understand
3:36 dependencies. It can build features that
3:39 span modules. You're reviewing more
3:41 complicated output, but you as a human
3:42 are still reading all of the code.
3:45 Shapiro estimates that 90% of developers
3:48 who say they are AI native are operating
3:49 at this level. And I think from what
3:51 I've seen, he's right. Software
3:53 developers who operate here think
3:55 they're farther along than they are.
3:57 Let's move on. Level three, the
3:59 developer is now the manager. This is
4:01 where the relationship starts to flip.
4:02 This is where it gets interesting.
4:04 You're now not writing code and having
4:06 the AI help. You're simply directing the
4:08 AI and you're reviewing what it
4:11 produces. Your day is reading,
4:12 approving, and rejecting, but at the
4:17 feature level, at the PR level. The
4:18 model is doing the implementation. The
4:21 model is submitting PRs for your review.
4:23 You have to have the judgment. Almost
4:26 everybody tops out here right now. Most
4:27 developers, Shapiro says, hit that
4:30 ceiling at level three because they are
4:33 struggling with the psychological
4:35 difficulty of letting go of the code.
4:37 But there are more levels. And this is
4:39 where it gets spicy and exciting. Level
4:41 four is the developer as the product
4:44 manager. You write a specification, you
4:46 leave, you come back hours later and
4:48 check whether the tests pass. You're not
4:50 really reading the code anymore. You're
4:52 just evaluating the outcomes. The code
4:54 is a black box. You care whether it
4:56 works, but because you have written your
4:59 eval so completely, you don't have to
5:01 worry too much about how it's written if
5:03 it passes. This requires a level of
5:06 trust both in the system and in your
5:08 ability to write specs. And that quality
5:10 of spec writing almost nobody has
5:13 developed well yet. Level five, the dark
5:16 factory. This is effectively a black box
5:18 that turns specs into software. It is
5:20 where the industry is going. No human
5:23 writes the code. No human even reviews
5:26 the code. The factory runs autonomously
5:29 with the lights off. Specification goes
5:32 in, working software comes out. And you
5:34 know, Shapiro is correct. Almost nobody
5:36 on the planet operates at this level.
5:38 The rest of the industry is mostly
5:40 between level one and level three, and
5:42 most of them are treating AI kind of
5:44 like a junior developer. I like this
5:46 framework because it gives us really
5:48 honest language for a conversation
5:50 that's been drowning in hype. When a
5:52 vendor tells you their tool writes code
5:55 for you, they often mean level one. When
5:57 a startup says they're doing agentic
5:59 software development, they often mean
6:01 level two or three. But when StrongDM
6:03 says their code must not be written by
6:06 humans, they really do mean level five,
6:08 the dark factory, and they actually
6:11 operate there. The gap between marketing
6:13 language and operating reality is
6:16 enormous. And collapsing that gap into
6:18 what is actually going on on the ground
6:21 requires changes that go way beyond
6:24 picking a better AI tool. So many people
6:26 look at this problem and think this is a
6:28 tool problem. It's not a tool problem.
6:31 It's a people problem. So what does
6:34 level five software development actually
6:37 look like? I think StrongDM's software
6:38 factory is the most thoroughly
6:40 documented example of level five in
6:42 production. Simon Willison, one of the
6:44 most careful and credible observers in
6:46 the developer tooling space, calls
6:49 StrongDM's software factory, quote, "The
6:51 most ambitious form of AI assisted
6:53 software development that I've seen
6:55 yet." The details are really worth
6:57 digging into here because they reveal
6:59 what it looks like to run a dark factory
7:02 for software on today's agents. And as
7:05 we have this discussion, I want you to
7:07 keep in mind that for most of us
7:09 listening, we are getting to time
7:12 travel. We are seeing how a bold vision
7:14 for the future can be translated into
7:16 reality with today's agents and today's
7:19 agent harnesses. It is only going to get
7:22 easier as we go into 2026 which is one
7:25 of the reasons I think this is going to
7:27 be a massive center of gravity for
7:29 future agentic software development
7:31 practices. We are all going to level
7:34 five. So what does StrongDM do? The
7:36 team is three people. Justin McCarthy,
7:39 CTO, Jay Taylor, and Nan Chowan. They've
7:41 been running the factory since July of
7:44 last year, actually. And the inflection
7:46 point they identify is Claude 3.5
7:49 Sonnet, which shipped actually in the
7:52 fall of 2024. That's when long horizon
7:54 agentic coding started compounding
7:56 correctness more than compounding
7:58 errors. Give them credit for thinking
8:00 ahead. Almost no one was thinking in
8:03 terms of dark factories that far back.
8:06 But they found that 3.5 Sonnet could
8:09 sustain coherent work across sessions
8:11 long enough that the output was reliable
8:14 and it wasn't just a flash in the pan.
8:16 It wasn't just demo worthy and so they
8:18 built around it. The factory runs on an
8:19 open-source coding agent called
8:22 Attractor. The repo is just three
8:24 markdown specification files and that's
8:27 it. That's the agent. The specifications
8:29 describe what the software should do.
8:31 The agent reads them. It writes the code
8:33 and it tests it. And here's where it
8:35 gets really interesting and where most
8:37 people's mental model really starts to
8:40 break down. StrongDM doesn't actually
8:42 use traditional software tests. They use
8:44 what they call scenarios. And the
8:46 distinction is important. Tests
8:48 typically live inside the codebase. The
8:50 AI agent can read them, which means the
8:53 AI agent can intentionally or not
8:55 optimize for passing the tests rather
8:58 than building correct software. It's the
9:00 same problem as teaching to the test in
9:02 education. You can get perfect scores
9:04 and shallow understanding. Scenarios are
9:06 different. Scenarios live outside the
9:08 codebase. They're behavioral
9:10 specifications that describe what the
9:12 software should do from an external
9:15 perspective, stored separately so the
9:16 agent cannot see them during
9:19 development. They function as a holdout
9:21 set. The same concept that machine
9:23 learning practitioners use to prevent
9:25 overfitting. The agent builds the
9:27 software and the scenarios evaluate
9:30 whether the software actually works. The
9:32 agent never sees the evaluation
9:34 criteria. It can't game the system. This
9:36 is really a new idea in software
9:38 development and I don't see it
9:40 implemented very frequently yet. But it
9:42 solves a problem that nobody was
9:44 thinking about when all the code was
9:46 written by humans. When humans write
9:48 code, we don't tend to worry about the
9:50 developer gaming their own test suite
9:52 unless incentives are really, really
9:54 skewed at that organization and then you
9:57 have bigger problems. When AI writes the
10:00 code, optimizing for test passage is the
10:02 default behavior unless you deliberately
10:04 architect around it. And it's one of the
10:07 most important differences to really
10:09 understand as you start to think about
10:11 AI as a code builder. StrongDM
10:14 architected around that with external
10:16 scenarios. The other major piece of the
10:18 architecture is what StrongDM calls
10:21 their digital twin universe. Behavioral
10:24 clones of every external service the
10:26 software interacts with: a simulated
10:29 Okta, a simulated Jira, a simulated
10:31 Slack, Google Docs, Google Drive, Google
10:34 Sheets. The AI agents develop against
10:36 these digital twins, which means they
10:38 can run full integration testing
10:41 scenarios without ever touching real
10:44 production systems, real APIs, or real
10:46 data. It's a complete simulated
10:48 environment purpose-built for autonomous
10:50 software development. And the output is
10:53 real. CXDB, their AI context store, has
10:55 16,000 lines of Rust, nine and a half
10:58 thousand lines of Go, and 700 lines of
11:00 TypeScript. It's shipped, it's in
11:01 production, it works, it's real
11:03 software, and it's built by agents end
11:04 to end. And then the metric that tells
11:07 you how seriously they take it. They say
11:10 if you haven't spent $1,000 per human
11:12 engineer per day, your software factory has room
11:15 for improvement. I think they're right.
11:17 That's not a joke. $1,000 per engineer
11:20 per day enables AI agents to run at a
11:23 volume that makes the cost of compute
11:25 meaningful if you are giving them a
11:27 mission to build software that has real
11:30 scale and real utility in production use
11:32 cases and it's often still cheaper than
11:34 the humans they're replacing. Let's hop
11:36 over and look at what the hyperscalers
11:39 are doing. The self-referential loop has
11:41 taken hold at both Anthropic and OpenAI,
11:43 and it's stranger than the hype
11:46 might make it sound. Codex 5.3 is the
11:47 first frontier AI model that was
11:50 instrumental in creating itself. And
11:51 that's not a metaphor. Earlier builds of
11:53 Codex would analyze training logs,
11:55 would flag failing tests, and might
11:58 suggest fixes to training scripts. But
12:01 this model shipped as a direct product
12:04 of its own predecessors coding labor.
12:07 OpenAI reported a 25% speed improvement
12:11 and 93% fewer wasted tokens in the
12:14 effort to build Codex 5.3. And those
12:16 improvements came in part from the model
12:19 identifying its own inefficiencies
12:21 during the build process. Isn't that
12:22 wild? Claude Code is doing something
12:25 similar. 90% of the code in Claude Code,
12:27 including the tool itself, was built by
12:29 Claude Code, and that number is rapidly
12:31 converging toward 100%.
12:34 Boris Cherny isn't joking when he talks
12:35 about not writing code in the last few
12:37 months. He's simply saying his role has
12:40 shifted to specification, to direction,
12:43 to judgment. Anthropic estimates that all
12:45 of the company is moving to entirely AI
12:48 generated code about now. Everyone at
12:51 Anthropic is architecting and the
12:52 machines are implementing. And the
12:55 downstream numbers tell the same story.
12:57 When I made a video on Cowork and
12:59 talked about how it was written in 10
13:02 days by four engineers, what I want you
13:04 to remember is it wasn't just four
13:06 engineers hyper-typing so that they could
13:08 get that out super fast and write every
13:11 line by hand. No, no, no. They were
13:14 directing machines to build the code for
13:16 Cowork. And that's why it was so fast.
13:19 4% of public commits on GitHub are now
13:21 directly authored by Claude Code, a
13:23 number that Anthropic thinks will exceed
13:25 20% by the end of this year. I think
13:27 they're probably right. Claude Code by
13:30 itself has hit a billion dollar run rate
13:33 just 6 months since launch. This is all
13:36 real today in February of 2026. The
13:38 tools are building themselves. They're
13:40 improving themselves. They're
13:42 enabling us to go faster at improving
13:44 themselves and that means the next
13:46 generation is going to be faster and
13:47 better than it would have been otherwise
13:49 and we're going to keep compounding. The
13:53 feedback loop on AI has closed and the
13:55 question is not whether we're going to
13:57 start using AI to improve AI. The
13:59 question is how fast that loop is going
14:02 to accelerate and what it means for the
14:04 40 or 50 million of us around the world
14:05 who currently build software for a
14:08 living. This is true for vendors as much
14:10 as it's true for software developers.
14:11 And I don't think we talk about that
14:13 enough because the gap between what's
14:15 possible at the frontier in February of
14:18 2026 and what tends to happen in
14:20 practice and what vendors want to sell
14:23 has never been wider. That METR study, a
14:24 randomized controlled trial, by the way,
14:27 not a survey, found that open source
14:29 developers using AI coding tools
14:32 completed their tasks 19% slower. We
14:33 talked about that, right? The
14:34 researchers controlled for task
14:36 difficulty. They controlled for
14:38 developer experience. They controlled
14:40 even for tool familiarity and none of it
14:42 mattered. AI made even experienced
14:45 developers slower. Why, in a world where
14:48 Cowork can ship that fast? Why? Because
14:50 the workflow disruption outweighed the
14:53 generation speed. Developers spent time
14:56 evaluating AI suggestions, correcting
14:58 almost right code, context switching
15:00 between their own mental model and the
15:02 model's output, and debugging really
15:04 subtle errors introduced by generated
15:06 code that looked correct but weren't.
15:09 46% of developers in broader surveys say
15:11 they don't fully trust AI generated
15:13 code. These guys aren't Luddites, right?
15:15 This is experienced engineers running
15:18 into a consistent problem. The AI is
15:19 fast, but it lacks the
15:22 reliability to be trusted without what they
15:25 view as vital human review. And this
15:28 irony is the J curve that adoption
15:30 researchers keep identifying. When you
15:33 bolt an AI coding assistant onto an
15:36 existing workflow, productivity dips
15:38 before it gets better. It goes down like
15:40 the bottom of a J. Sometimes for a
15:42 while, sometimes for months. And the dip
15:44 happens because the tool changes the
15:46 workflow, but the workflow has not been
15:49 redesigned around the tool explicitly.
15:51 And so you're kind of running a new
15:54 engine on an old transmission. The gears
15:55 are going to grind. Most organizations
15:57 are sitting in the bottom of that J
15:59 curve right now. And many of them are
16:02 interpreting the dip as evidence that AI
16:04 tools don't work, that the vendors did
16:06 not tell them the truth. What is really
16:08 evidence that their workflows haven't
16:11 adapted gets read as evidence that AI is
16:13 hype and not real. I think GitHub
16:15 Copilot might be the clearest
16:17 illustration of this. It has 20 million
16:20 users, 42% market share among AI coding
16:22 tools, apparently. Uh, and lab studies
16:25 show 55% faster code completion on
16:28 isolated tasks. I'm sure that makes the
16:30 people driving GitHub Copilot happy in
16:32 their slide decks. But in production,
16:35 the story is much more complicated.
16:36 There are larger pull requests. There
16:38 are higher review costs. There are more
16:40 security vulnerabilities introduced by
16:43 generated code. And developers are
16:44 wrestling with how to do it well. One
16:46 senior engineer put it really sharply.
16:49 Copilot makes writing code cheaper but
16:51 owning it more expensive. And that is
16:52 actually a very common sentiment I've
16:54 heard across a lot of engineers in the
16:56 industry, not just for Copilot but for
16:58 AI generated code in general. The
17:00 organizations that are seeing
17:02 significant, call it 25 to 30% or more,
17:05 productivity gains with AI are not the
17:08 ones that just installed Copilot, had a
17:10 one-day seminar, and called it done.
17:12 They're the ones that thought carefully,
17:14 went back to the whiteboard, and
17:16 redesigned their entire development
17:19 workflow around AI capabilities:
17:20 changing how they write their specs,
17:22 changing how they review their code,
17:24 changing what they expect from junior
17:26 versus senior engineers, changing their
17:28 CI/CD pipelines to catch the new
17:30 category of errors that AI generated
17:33 code introduces. End-to-end process
17:35 transformation. It's not about tool
17:37 adoption. And end-to-end transformation
17:40 is hard. Sometimes it's politically
17:42 contentious. It's expensive. It's slow
17:44 and most companies don't have the
17:46 stomach for it. Which is why most
17:48 companies are stuck at the bottom of the
17:50 J curve. Which is why the gap between
17:53 frontier teams and everyone else is not
17:55 just widening, it's accelerating
17:57 rapidly. Because those teams on the edge
17:59 that are running dark factories, they
18:01 are positioned to gain the most as
18:05 tools like Opus 4.6 and Codex 5.3
18:08 enable widespread agentic powers for
18:10 every software engineer on the planet.
18:12 95% of those software engineers don't
18:14 know what to do with that. It's the ones
18:15 that are actually operating at level
18:18 four, level five that truly get the
18:20 multiplicative value of these tools. So
18:22 if this is a politically contentious
18:24 problem, if this is not just a tool
18:26 problem but a people problem, we need to
18:29 look at the nature of our software
18:31 organizations. Most software
18:33 organizations were designed to
18:36 facilitate people building software.
18:38 Every process, every ceremony, every
18:41 role. They exist because humans building
18:44 software in teams need coordination
18:46 structures. Stand-up meetings exist
18:47 because developers working on the same
18:50 codebase, they've got to synchronize every
18:52 single day. Sprint planning exists
18:54 because humans can only hold a certain
18:56 number of tasks in working memory and
18:58 then they need a regular cadence to
19:00 reprioritize. Code review exists because
19:02 humans make mistakes that other humans
19:05 can catch. QA teams exist because the
19:07 people who build software, they can't
19:09 evaluate it objectively. You get the
19:12 idea. Every one of these structures is a
19:14 response to a human limitation. And when
19:16 the human is no longer the one writing
19:19 the code, the structures aren't
19:22 necessary anymore, they're friction. So what does
19:24 sprint planning look like when the
19:26 implementation happens in hours, not
19:28 weeks? What does code review look like
19:31 when no human wrote the code and no
19:34 human can really review the diff that AI
19:35 produced in 20 minutes because it's
19:37 going to produce another one in 20 more
19:39 minutes? So what does a QA team do when
19:42 the AI already tested against scenarios
19:43 it was never shown? StrongDM's
19:46 three-person team doesn't have sprints.
19:48 They don't have standups. They don't
19:50 have a Jira board. They write specs and
19:53 they evaluate outcomes. That is it.
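That spec-in, outcomes-out loop, with the scenarios held outside the codebase as a holdout set, can be sketched roughly like this. This is my own toy illustration, not StrongDM's actual Attractor code; every name in it is hypothetical, and the "agent" is faked for the sake of a runnable example:

```python
# Toy sketch of a dark-factory loop (all names hypothetical).
# The agent sees only the spec; behavioral scenarios live outside
# the repo and act as a holdout set the agent can never read.
from dataclasses import dataclass
from typing import Callable

Software = Callable[[int], int]

@dataclass
class Scenario:
    name: str
    check: Callable[[Software], bool]  # external, behavior-level check

def build_from_spec(spec: str) -> Software:
    # Stand-in for the coding agent. A real factory would run an LLM
    # agent here; we pretend it built a doubling function.
    return lambda x: x * 2

def run_factory(spec: str, holdout: list[Scenario]) -> bool:
    software = build_from_spec(spec)                # build: spec only
    return all(s.check(software) for s in holdout)  # evaluate: humans' holdout

# Humans write the spec and the scenarios; machines do everything in between.
holdout = [
    Scenario("doubles positives", lambda f: f(3) == 6),
    Scenario("doubles zero", lambda f: f(0) == 0),
]
print(run_factory("Double the input integer.", holdout))  # True
```

Because the scenarios never enter the agent's context, passing them indicates the behavior is right rather than that the tests were gamed.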
19:55 The entire coordination layer that
19:57 constitutes the operating system of a
19:59 modern software organization, the layer
20:02 that most managers spend 60% of their
20:05 time maintaining, is just deleted. It
20:07 does not exist. Not because it was
20:09 eliminated as a cost-saving measure, but
20:12 because it no longer serves a purpose.
20:13 This is the structural shift that's
20:16 harder to see than the tech shift, and
20:18 it might matter more. The question is
20:19 becoming what happens to the
20:21 organizational structures that were
20:24 built for a world where humans write
20:26 code? What happens to the engineering
20:28 manager whose primary value is
20:31 coordination? What happens to the scrum
20:32 master, the release manager, the
20:34 technical program manager whose job is
20:38 to make sure a dozen teams ship on time?
20:39 Look, those roles don't disappear
20:42 overnight, but the center of gravity is
20:44 shifting. The engineering manager's
20:48 value is moving from "coordinate the team
20:50 building the feature" to "define the
20:52 specification clearly enough that agents
20:54 build the feature." The program manager's
20:57 value is moving from "track dependencies
20:59 between human teams" to "architect the
21:01 pipeline of specs that flow through the
21:03 factory." The skills that matter are
21:06 shifting very rapidly from coordination
21:08 to articulation. From making sure people
21:10 are rowing in the same direction to
21:12 making sure the direction is described
21:14 precisely enough that machines can go do
21:16 it. And oh, by the way, for engineering
21:18 managers, there's an extra challenge.
21:20 How do you coach your engineers to do
21:22 the same thing? It's a people challenge.
21:24 If you think this is a trivial shift,
21:26 you have never tried to write a
21:28 specification detailed enough for an AI
21:30 agent to implement it correctly without
21:32 human intervention. And you've certainly
21:34 never sat down and tried to coach an
21:35 engineer to do the same. It is a
21:38 different skill. It requires the kind of
21:40 rigorous systems thinking that most
21:42 organizations have never needed from
21:44 most of their people because the humans
21:45 on the other end of the spec could fill
21:48 in the gaps with judgment, with context,
21:49 with a slack message that says, "Did you
21:52 mean X or Y?" The machines don't have
21:54 that layer of human context. They build
21:56 what you described. If what you
21:58 described was ambiguous, you're going to
22:00 get software that fills in the gaps with
22:02 software guesses, not customer-centric
22:04 guesses. The bottleneck has moved from
22:07 implementation speed to spec quality.
22:10 And spec quality is a function of how
22:12 deeply you understand the system, your
22:15 customer, and your problem. That kind of
22:17 understanding has always been the
22:19 scarcest resource in software
22:20 engineering. The dark factory doesn't
22:22 reduce the demand for that. It just
22:25 makes the demand an absolute law. It
22:28 becomes the only thing that matters.
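To make the ambiguity point concrete, here is a toy example of my own (not from the video): two implementations that both satisfy a vague spec, "discount orders over $100 by 10%", and the kind of external scenario that separates the customer's intent from a plausible guess.

```python
# Vague spec: "discount orders over $100 by 10%".
# Does "over" include exactly $100? Two defensible readings:

def discount_a(total: float) -> float:
    # Reading 1: strictly greater than 100
    return round(total * 0.9, 2) if total > 100 else total

def discount_b(total: float) -> float:
    # Reading 2: 100 or more
    return round(total * 0.9, 2) if total >= 100 else total

# Both pass the cases the spec obviously implies...
assert discount_a(150.0) == discount_b(150.0) == 135.0
assert discount_a(50.0) == discount_b(50.0) == 50.0

# ...but an external scenario that captures intent ("a $100 order
# pays full price") tells them apart:
assert discount_a(100.0) == 100.0  # matches the customer's intent
assert discount_b(100.0) == 90.0   # a software guess, wrong here
```

An agent handed only the vague sentence has no way to know which reading was meant; the precision has to live in the spec or in the scenarios.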
22:30 Now, let's be honest. Everything that I
22:32 have just talked about assumes you're
22:34 building from scratch. Most of the
22:36 software economy is not built from
22:39 scratch. The vast majority of enterprise
22:41 software is brownfield. It's existing
22:43 systems. It's accumulated over years,
22:45 over decades. It's running in
22:47 production, serving real users, carrying
22:50 real revenue. CRUD applications that
22:52 process business transactions. Monoliths
22:54 that have grown organically through 15
22:56 years of feature additions. CI/CD
22:58 pipelines tuned to the quirks of a
23:00 specific codebase and a specific team's
23:02 workflow. Config management that exists
23:04 in the heads of the three people who've
23:05 been at the company long enough to
23:07 remember why that one environment
23:09 variable is set to that one value. You
23:11 know who you are. You cannot dark
23:13 factory your way through a legacy
23:15 system. You cannot just pretend that you
23:17 can bolt that on. It doesn't work that
23:19 way. The specification for that does not
23:22 exist. The tests, if there are any, cover
23:24 30% of your existing codebase, and the
23:26 other 70% runs on institutional
23:29 knowledge and tribal lore and someone
23:31 who shows up once a week in a polo shirt
23:33 and knows where all the skeletons are
23:35 buried in the code. The system is the
23:38 specification. It's the only complete
23:40 description of what the software does
23:42 because no one ever wrote down the
23:44 thousand implicit decisions that
23:47 accumulated over a decade or more of
23:49 patches, of hot fixes, of temporary
23:51 workarounds that of course became
23:54 permanent. This is the truth about the
23:57 interstitial states that lie along this
23:59 continuum toward more autonomous
24:01 software development. For most
24:04 organizations, the path is not to start
24:06 with "deploy an agent that writes code."
24:08 It starts with "let's develop a
24:11 specification for what your
24:14 existing software actually does."
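One concrete way to start that work is characterization (sometimes called golden-master) testing: record what the running system does today and treat the recording as the first draft of the spec. A minimal sketch of my own, with a hypothetical legacy_price routine standing in for the real system:

```python
import json

def legacy_price(qty: int) -> float:
    # Stand-in for a legacy routine whose rules were never written down.
    price = qty * 9.99
    if qty >= 10:
        price *= 0.95  # an undocumented bulk discount someone added years ago
    return round(price, 2)

def record_characterization(inputs: list[int]) -> str:
    # Capture current behavior as data: this JSON *is* the draft spec,
    # stored outside the codebase as a future holdout set.
    return json.dumps({str(q): legacy_price(q) for q in inputs})

def replay(spec_json: str, candidate) -> bool:
    # Any rewrite (human- or agent-built) must reproduce recorded behavior.
    spec = json.loads(spec_json)
    return all(candidate(int(q)) == expected for q, expected in spec.items())

spec = record_characterization([1, 5, 10, 20])
print(replay(spec, legacy_price))  # True: the system matches its own recording
```

The recording is only as good as the inputs you sample, which is exactly where the edge-case knowledge of the people who maintain the system comes in.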
24:17 And that specification work, that reverse
24:19 engineering of the implicit knowledge
24:22 embedded in a running system, is very
24:25 difficult, and it's deeply human work. It
24:27 requires the engineer who knows why the
24:29 billing module has the one edge case for
24:31 Canadian customers. It requires the
24:34 architect who remembers which
24:36 microservice it was that got carved out of the
24:38 monolith under duress during the 2021
24:39 outage and has been maintained that way
24:41 ever since. It requires the product
24:44 person who can explain what the software
24:46 actually does for real users versus what
24:49 the PRD says it does. Domain expertise,
24:51 ruthless honesty, customer
24:54 understanding, systems thinking: exactly
24:57 the human capabilities that matter even
25:00 more in the dark factory era, not less.
25:02 Look, the migration path is different
25:04 for every business, but it starts to
25:07 look something like this. First, you use
25:09 your AI as much as you can at say level
25:11 two or level three to accelerate the
25:14 work your developers are already doing,
25:16 writing new features, fixing bugs,
25:18 refactoring modules. This is where most
25:20 organizations are at now, and it's where
25:23 the J-curve productivity dip
25:26 happened. You should expect that.
25:29 Second, you start using AI to document
25:32 what your system really does, generating
25:34 specs directly from the code, building
25:36 scenario suites that capture real
25:38 existing behavior, creating the holdout
25:40 sets that a future dark factory will
25:43 need. Third, you redesign your CI/CD
25:45 pipeline to handle AI-generated code at
25:47 volume: different testing strategies,
25:49 different review processes, different
25:53 deployment gates. Fourth, you begin
25:55 to shift new development to level
25:57 four or five autonomous agent patterns
26:00 while maintaining the legacy system in
26:02 parallel. That path takes time. Anyone
26:04 telling you otherwise is selling you
26:06 something. The organizations that will
26:08 get there the fastest aren't necessarily
26:10 the ones that bought the fanciest vendor
26:13 tools. They're the ones who can write
26:15 the best and most honest specs about
26:17 their code, who have the deepest domain
26:19 understanding, who have the discipline
26:21 to invest in the boring, unglamorous
26:24 work of documenting what their systems
26:26 really do, and in supporting
26:29 their people to scale up in the ways
26:31 that will support this new dark factory
26:33 era. I cannot give you a clear timeline
26:36 here. For some organizations, this is
26:38 looking like a multi-year transition,
26:39 and I don't want to hide the ball on
26:41 that. Some are going faster and it's
26:43 looking like multi-month. It will depend,
26:45 frankly, on the stomach you have for
26:47 organizational pain. And that brings me
26:49 to the talent reckoning. Junior
26:52 developer employment is dropping 9 to
26:55 10% within six quarters of widespread AI
26:56 coding tool adoption, according to a
26:59 2025 Harvard study. Anyone out there at
27:00 the start of their career is nodding
27:01 along and saying it's actually worse
27:04 than that. In the UK, graduate tech
27:08 roles fell 46% in 2024 with a further
27:11 53% drop projected by 2026. In the US,
27:13 junior developer job postings have
27:16 declined by 67%.
27:18 Simply put, the junior developer
27:20 pipeline is starting to collapse, and
27:22 the implications go far beyond the
27:24 people who cannot find entry-level jobs,
27:26 although that is bad enough and it's a
27:28 real issue. The career ladder in
27:30 software engineering has always worked
27:34 like this. Juniors learn by doing. They
27:35 write simple features. They fix small
27:38 bugs. They absorb the codebase through
27:40 immersion. Seniors review the work and
27:42 mentor them and catch their mistakes.
27:44 Over five to seven years, a junior becomes
27:47 a senior through accumulated experience.
27:50 The system is frankly an apprenticeship
27:52 model wearing enterprise clothing. AI
27:54 breaks that model at the bottom. If AI
27:56 handles the simple features and the
27:58 small bug fixes, the work that juniors
28:01 lean on, where do the juniors learn? If
28:03 AI reviews code faster and more
28:05 thoroughly than a senior engineer doing
28:07 a PR review, where does the mentorship
28:09 start to happen? The career ladder is
28:11 getting hollowed out from underneath.
28:13 Seniors at the top, AI at the bottom,
28:14 and a thinning middle where learning
28:16 used to happen. So, the pipeline is
28:19 starting to break. And yet, we need more
28:21 excellent engineers than we have ever
28:24 needed before, not fewer engineers. I've
28:26 said this before. I do not believe in
28:28 the death of software engineering. We
28:31 need better engineers. The bar is rising
28:34 and it's rising toward exactly the
28:36 skills that have always been the hardest
28:38 to develop and the hardest to hire for.
28:41 The junior of 2026 needs the systems
28:43 design understanding that was expected
28:46 of a mid-level engineer in 2020. Not
28:48 because the entry-level work necessarily
28:50 got harder, but because the entry-level
28:53 work got automated and the remaining
28:55 work requires deeper judgment. And you
28:57 don't need someone who can write a CRUD
28:58 endpoint anymore. Right? The AI will
29:00 handle that in a few minutes. You need
29:01 someone who can look at a system
29:04 architecture and identify where it will
29:06 break under load, where the security
29:08 model has gaps, where the user
29:09 experience falls apart at the edge
29:11 cases, and where the business logic
29:13 encodes assumptions that are about to
29:15 become wrong. And if you think as a
29:17 junior that you can use AI to patch
29:19 those gaps, I've got news for you. The
29:22 seniors are using AI to do that and they
29:24 have the intuition over the top. So you
29:26 need systems thinking, you need customer
29:28 intuition. You need the ability to hold
29:31 a whole product in your head and reason
29:33 about how those pieces interact. You
29:34 need the ability to write a
29:36 specification clearly enough that an
29:38 autonomous agent can implement it
29:40 correctly, which requires understanding
29:42 the problem deeply enough to anticipate
29:45 the questions the agent does not know to
29:47 ask. Those skills have always separated
29:49 really great engineers from merely
29:51 adequate ones. The difference now is
29:53 that adequate is no longer a viable
29:56 career position regardless of seniority
29:58 because adequate is what the models do.
30:00 Anthropic's hiring has already shifted.
30:02 OpenAI's hiring has already shifted.
30:04 Hiring is shifting across the industry
30:06 and it's shifting toward generalists
30:08 over specialists. People who can think
30:11 across domains rather than people who
30:13 are expert in one really narrow tech
30:14 stack. The logic is super
30:16 straightforward, right? When the AI
30:19 handles the implementation, the human's
30:21 value is in understanding the problem
30:22 space broadly enough to direct
30:25 implementation correctly. A specialist
30:26 who knows everything about Kubernetes
30:28 but can't reason about the product
30:30 implications of an architectural
30:33 decision is way way less valuable than a
30:35 generalist who understands the systems,
30:36 the users, and the business constraints
30:39 even if they can't hand-configure a pod.
30:41 Some orgs are moving toward what amounts
30:43 to a medical residency model for their
30:45 junior engineers. Simulated environments
30:47 where early career developers learn by
30:49 working alongside AI systems, reviewing
30:51 AI output, and developing judgment about
30:53 what's correct and what's subtly wrong.
30:56 It is not the same
30:58 thing as learning by writing code from
31:00 scratch. I don't want to pretend it is,
31:02 but it might be better training for a
31:04 world where the job is directing and
31:06 evaluating AI output rather than
31:08 producing code from a blank editor. I
31:10 will also call out, as I've called out
31:12 before, there are organizations
31:15 preferentially hiring juniors right now,
31:17 despite the pipeline collapsing,
31:20 precisely because the juniors they are
31:22 looking for provide an AI native
31:24 injection of fresh blood into an
31:27 engineering org where most of the
31:29 developers started their careers long
31:32 before ChatGPT launched in 2022. In
31:34 that world, having people who are AI
31:36 native from the get-go can be a huge
31:38 accelerating factor. And that points to
31:40 one of the things that is a plus for
31:43 juniors coming in. Lean into the AI if
31:45 you're a junior. Lean into your
31:48 generalist capabilities. Lean into how
31:50 quickly you can learn. Show that you can
31:53 pick up a problem set and solve it in a
31:56 few minutes with AI across a really wide
31:58 range of use cases. Gartner is
32:00 projecting that 80% of software
32:02 engineers will need to upskill in
32:05 AI-assisted dev tools by 2027. They're
32:09 estimating wrong; it's going to be 100%. The number
32:11 is not the point. The question isn't
32:13 whether the skills need to change. We
32:15 all know they will. It's whether we in
32:18 the industry can develop the training
32:20 infrastructure quickly enough to keep
32:22 pace with the capability change. Because
32:24 I've got to be honest with you, if
32:27 you're a software engineer and the last
32:30 model you touched was released in
32:33 January of 2026, you are out of date.
32:35 You need a February model. And that is
32:36 going to keep being true all the way
32:38 through this year and into next year.
32:40 And whether the organizations that
32:43 depend on software can tolerate a period
32:45 where the talent pipeline is being built
32:48 and rebuilt like this on a monthly basis
32:51 is a big question because you have to
32:54 invest in your people more to get them
32:56 through this period of transition. So
32:58 what does the shape of a new org look
33:01 like when we look at AI native startups?
33:02 How are they different from these
33:05 traditional orgs? Cursor, the AI-native
33:07 code editor, is past half a billion
33:09 dollars in annual recurring revenue and
33:12 it has, at last count, a few dozen
33:14 employees. It's operating at
33:16 roughly three and a half million in
33:18 revenue per employee in a world where
33:22 the average SaaS company is generating
33:25 $600,000 per employee. Midjourney is
33:26 similar: generating half a billion in
33:28 revenue with around a hundred
33:31 people, a little more depending on
33:32 who's counting.
33:34 Lovable is well into the
33:37 multiple hundreds of millions in ARR in
33:39 just a few months and their team is
33:42 scaling, but far behind the
33:43 revenue growth they're
33:45 experiencing. They are also seeing that
33:47 multi-million dollar revenue per
33:50 employee world. The top 10 AI native
33:52 startups are averaging three and change
33:55 million in revenue per employee which is
33:57 between five and six times the SaaS
34:00 average. This is happening enough that
34:02 it is not an outlier. This is the
34:05 template for an AI native org. So what
34:07 does that org look like? If you have 15
34:08 people generating a hundred
34:10 million a year, which we've seen in
34:12 multiple cases in 2025, what does that
34:14 look like? It does not look like a
34:16 traditional software company. It does
34:18 not have a traditional engineering team,
34:20 a traditional product team, a QA team, a
34:23 DevOps team. It looks like a small group
34:26 of people who are exceptionally good at
34:28 understanding what users need, who are
34:30 exceptional at translating that into
34:32 clear spec, and who are directing AI
34:34 systems that handle that implementation.
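The per-employee math behind those figures can be sketched directly. A quick illustration, noting that the headcounts and ARR numbers here are rough public estimates from the talk, not audited figures:

```python
# Revenue-per-employee sketch. The ARR and headcount values below are
# illustrative assumptions matching the talk's rough numbers, not exact data.

SAAS_AVG_PER_EMPLOYEE = 600_000  # average SaaS revenue per employee cited above

def revenue_per_employee(arr: float, headcount: int) -> float:
    """Annual recurring revenue divided by total headcount."""
    return arr / headcount

# Hypothetical AI-native org: $500M ARR with roughly 140 people.
per_head = revenue_per_employee(500_000_000, 140)
multiple = per_head / SAAS_AVG_PER_EMPLOYEE

print(f"${per_head / 1e6:.1f}M per employee, {multiple:.1f}x the SaaS average")
# -> $3.6M per employee, 6.0x the SaaS average
```

The point of the division is the multiple, not the exact headcount: even at double the assumed staff, the ratio stays well above the traditional SaaS baseline.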
34:37 The org chart is flattening radically.
34:39 The layers of coordination that exist to
34:41 manage hundreds of engineers building a
34:43 product can be deleted when the
34:45 engineering is done by agents. The
34:47 middle management layer is going to
34:48 either evolve into something
34:50 fundamentally different at these big
34:52 companies or it's going to cease to
34:55 exist entirely. The only people who
34:58 remain are the ones whose judgment
35:00 cannot be automated. The ones who know
35:02 what to build for whom and why, and who
35:06 have excellent AI sense. Sort of like
35:08 horse sense: a rider has a feel for
35:10 the horse and can direct it where
35:11 they want to go.
35:13 You'll need people who have that sense
35:15 with artificial intelligence. And yes,
35:18 it is a learned skill. The restructuring
35:20 that is going to happen as more and more
35:23 companies move toward that cursor model
35:25 of operating, even if they never
35:27 completely get there, that restructuring
35:30 is real. It's going to happen. It's
35:32 going to be very painful for specific
35:34 people in specific roles: the middle
35:36 management layer, the junior developer
35:38 whose entry-level work is getting
35:40 automated first, the QA engineers who
35:43 just run manual test passes, the release
35:46 manager whose entire value is just
35:49 coordination. Those kinds of roles are
35:51 going to have to transform or they're
35:53 just going to disappear. And for people
35:57 in those roles, you need to find ways to
36:02 move toward developing with AI and
36:04 rewriting your entire workflow around
36:07 agents.
36:08 That is going to look different
36:10 depending on your stack, your manager's
36:13 budget for token spend, and your
36:16 appetite to learn. But you need to lean
36:18 that way as quickly as you can for your
36:21 own career's sake. I want to leave you
36:24 with one thing that gets lost in every
36:27 conversation about AI and jobs. We have
36:30 never found a ceiling on the demand for
36:32 software and we have never found a
36:34 ceiling on the demand for intelligence.
36:36 Every time the cost of computing has
36:40 dropped, from mainframes to PCs, from PCs
36:43 to cloud, from cloud to serverless, the
36:44 total amount of software the world
36:48 produced did not stay flat. It exploded.
36:50 New categories of software that were
36:52 economically impossible at the old cost
36:54 structure became viable and then
36:56 ubiquitous and then essential. The cloud
36:58 didn't just make existing software
37:01 cheaper to run. It created SAS, mobile
37:03 apps, streaming, real-time analytics,
37:05 and a hundred other categories that
37:07 could not exist when you had to buy a
37:09 rack of servers to ship something. I
37:12 think the same dynamic applies now and
37:15 it applies at a scale that dwarfs every
37:17 previous transition. Every company in
37:20 every industry needs software. Most of
37:22 them, like a regional hospital or a
37:24 mid-market manufacturer or a family
37:26 logistics company, can't afford to
37:28 build what they need at current labor
37:30 costs. A custom inventory system
37:32 traditionally could cost a half a
37:34 million or more and take over a year. A
37:36 patient portal integration might cost a
37:38 third of a million. You get the idea.
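Those figures make the economics easy to sketch. A back-of-the-envelope calculation, where the traditional costs are the talk's rough estimates and the tenfold reduction is the talk's "order of magnitude or more" claim treated as an assumed factor, not a measured one:

```python
# Back-of-the-envelope cost sketch. Both the traditional price tags and the
# reduction factor are assumptions taken from the talk's rough figures.

traditional_costs = {
    "custom inventory system": 500_000,
    "patient portal integration": 333_000,
}

COST_DROP_FACTOR = 10  # "an order of magnitude or more", assumed as 10x

for project, cost in traditional_costs.items():
    ai_era_cost = cost / COST_DROP_FACTOR
    print(f"{project}: ${cost:,} -> roughly ${ai_era_cost:,.0f}")
```

At a tenth of the old price, projects that were out of reach for a mid-market buyer start landing inside ordinary departmental budgets, which is the mechanism behind the demand expansion described next.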
37:40 These companies tend to make do with
37:43 spreadsheets today. But we are dropping
37:46 the cost of software production by an
37:48 order of magnitude or more. And now that
37:52 unmet need is becoming addressable. Not
37:55 theoretically, right now. You can serve markets
37:57 that traditional software companies
38:00 could never afford to enter. The total
38:02 addressable market for software is
38:05 exploding. Now this can sound like a
38:06 very comfortable rebuttal to people
38:08 struggling with the pain of jobs
38:10 disappearing. It is not the same thing.
38:12 Just saying the market is getting bigger
38:15 doesn't fix it. But it is a structural
38:17 observation about what happens as
38:20 intelligence gets cheaper. The demand is
38:23 going to go up, not down. We watched
38:25 this happen with compute, with storage,
38:27 with bandwidth, with every resource
38:29 that's ever gotten dramatically cheaper.
38:32 Demand has never saturated. The
38:34 constraint has always moved to the next
38:35 bottleneck. And in this case, the
38:37 bottleneck is judgment: knowing what to build and
38:40 for whom. The people who thrive in this
38:42 world are going to be the ones who were
38:44 always the hardest to replace. The ones
38:47 who understand customers deeply, who
38:49 think in systems, who can hold ambiguity
38:52 and make decisions under uncertainty,
38:54 who can articulate what needs to exist
38:56 before it exists at all. The dark
38:58 factory does not replace those people.
39:00 It amplifies them. It
39:02 turns a great product thinker with five
39:05 engineers into a great product thinker
39:07 with unlimited engineering capacity. The
39:10 constraint moves from can we build it to
39:12 should we build it and should we build
39:14 it has always been the harder and more
39:16 interesting question. I don't have a
39:18 silver bullet to magically resolve this
39:20 but I have to tell you that we must
39:22 confront the tension or we are being
39:26 dishonest. The dark factory is real. It
39:29 is not hype. It actually works. A small
39:30 number of teams around the world are
39:33 producing software without any humans
39:35 writing or reviewing code. They are
39:39 shipping production code that
39:41 improves with every single model
39:43 generation. The tools are building
39:46 themselves. The feedback loop is closed.
39:48 And those teams are going faster and
39:51 faster and faster and faster. And yet
39:52 most companies aren't there. They're
39:54 stuck at level two. They're getting
39:56 measurably slower with AI tools they
39:58 believe are making them faster. They're
40:01 wrong, and they're running organizational structures
40:03 designed for a world where humans do all
40:06 of the implementation work. Both of
40:08 these things are true at the same time.
40:10 The frontier is farther ahead than
40:13 almost anyone wants to admit and the
40:15 middle is farther behind than the
40:17 frontier teams like to talk about. The
40:20 distance between them isn't a technology
40:23 gap. It's a people gap. It's a culture
40:25 gap. It's an organizational gap. It's a
40:29 willingness to change gap that no tool
40:31 and no vendor can close. The enterprises
40:34 that get across this distance are not
40:37 the ones that buy the best coding tool.
40:39 They're the ones that do the very hard,
40:41 very slow, very unglamorous work of
40:44 documenting what their systems do, of
40:45 rebuilding their org charts and their
40:48 people around the skill of judgment
40:50 instead of the skill of coordination.
40:52 And they are organizations who invest in
40:55 the kind of talent that understands
40:58 systems and customers deeply enough to
41:00 direct machines to build anything that
41:02 should be built. And those orgs need to
41:04 be honest enough with themselves to
41:06 admit that this change will not happen
41:08 as fast as they want it to because
41:11 people change slowly. The dark factory
41:14 does not need more engineers, but it
41:16 desperately needs better ones. And
41:18 better means something different than it
41:20 did a few years ago. It means people who
41:22 can think clearly about what should
41:24 exist, describe it precisely enough that
41:26 machines can build it and who can
41:29 evaluate whether what got built actually
41:32 serves the real humans it was built for.
41:34 This has always been the hard part of
41:36 software engineering. We just used to
41:39 let the implementation complexity hide
41:41 how few people were actually good at it.
41:43 The machines have now stripped away that
41:45 camouflage, and we're all about to find
41:48 out how good we are at building
41:50 software. I hope this video has helped
41:52 you make sense of the enormous gap
41:54 between the dark factories of automated
41:57 software production and the way most of
41:59 us are building software today. Best of
42:01 luck navigating that transition. I wrote
42:04 up a ton of exercises and a ton of
42:06 resources over on the Substack if you'd
42:07 like to dig in further. This tends to be
42:09 something where people want to learn
42:10 more, so I wanted to give you as much as
42:13 I could. Have fun, enjoy, and I'll see