Does AI Actually Boost Developer Productivity? (100k Devs Study) - Yegor Denisov-Blanch, Stanford
In January of this year, Mark Zuckerberg said that he was going to replace all of the mid-level engineers at Meta with AI by the end of the year. I think Mark was a bit optimistic, and he was probably acting like a good CEO would, to inspire a vision and probably also to keep the Facebook stock price up. But what Mark also did was create a lot of trouble for CTOs worldwide. Why?
Because after Mark said that, almost every single CEO in the world turned to their CTO and said, "Hey, Mark says he's going to replace all of his developers with AI. Where are we on that journey?" And the answer probably was, honestly, not very far, and we're not sure we're going to do that. I personally don't think AI is going to replace developers entirely, at least not this year, let alone at Meta. I do think AI increases developer productivity, but there are also cases in which it decreases developer productivity. So using AI for coding is not a one-size-fits-all solution, and there are cases in which it shouldn't be used.
For the past three years we've been running one of the larger studies on software engineering productivity at Stanford, and we've done this in both a time-series and a cross-sectional way. Time series meaning that even if a participant joins in 2025, we get access to their git history, so we can see trends across time: we can see COVID, we can see AI, and all the other things that happened. Cross-sectional because we have more than 600 companies participating: enterprises, mid-sized companies, and startups. This means we have more than 100,000 software engineers in our data set right now, dozens of millions of commits, and billions of lines of code. Most importantly, most of this data is from private repositories.
This is important because if you use a public repo to measure someone's productivity, that public repo is not self-contained: someone could be working on that repo on the weekend, or only once in a while. Whereas with a private repo, it's much more self-contained and much easier to measure the productivity of a team or an individual.
Late last year there was a huge controversy around ghost engineers. This came from the same research group, our research group, and Elon Musk was kind enough to retweet us. What we found is that roughly 10% of the software engineers in our data set at the time, about 50,000, were what we called ghost engineers: people who collect a paycheck but basically do no work. That was very surprising for some people, and very unsurprising for others.
Some of the people on this research team include, for example, Simon from industry. He was CTO at a unicorn, which he exited; he had a team of about 700 developers, and as CTO he was always the last person to know when something was up with his engineering team. So he thought, okay, how can I change this? Myself, I've been at Stanford since 2022, and I focus on what I call data-driven decision-making in software engineering; in a past life I was looking after digital transformation for a large company with thousands of engineers. Also part of the team is Professor Kosinski at Stanford, whose research focuses on human behavior in a digital environment; he was the Cambridge Analytica whistleblower back in the day, if you recall that.
Today we're going to be talking about three things. We're going to start off with the limitations of existing studies that seek to quantify the impact of AI on developer productivity. Then we're going to showcase our methodology. And lastly, we're going to spend most of the time looking at the results: what is the impact of AI on dev productivity, and what are the ways we can slice and dice these results to make them useful?
There's lots of research being done on this topic, but a lot of it is led by vendors who are themselves trying to sell you their own AI coding tools, so there's sometimes a bit of a conflict of interest. The three biggest limitations I see are these. First, a lot of these studies revolve around commits, PRs, and tasks: hey, we completed more commits, more PRs, the time between commits decreased. The problem is that task size varies, so delivering more commits does not necessarily mean more productivity. In fact, what we found very often is that by using AI, you're introducing new tasks that are bug fixes for the code the AI just wrote, in which case you're kind of spinning your wheels in place.
Secondly, there's a bunch of studies that say: we grabbed a bunch of developers, split them into two groups, gave one group AI, and didn't give it to the other. What usually happens there is that these are greenfield tasks, where participants are asked to build something from scratch with zero context. There, of course, AI decimates the non-AI group, but that's because AI is just really good at greenfield, boilerplate code. Most of software engineering, though, isn't greenfield and isn't boilerplate: there's usually an existing codebase, and there are usually dependencies. So these studies can't be applied too well to those situations either.
Then we also have surveys, which we found to be an ineffective predictor of productivity. We ran a small experiment with 43 developers, asking every developer to evaluate themselves relative to the global mean or median in five-percentile buckets from 0 to 100, and then compared that to their measured productivity (we'll get into what that means later). What we found is that asking someone how productive they think they are is almost as good as flipping a coin: there's very little correlation. People misjudged their productivity by about 30 percentile points, and only one in three people estimated their productivity within one quartile. I think surveys are great; they're valuable for surfacing morale and other issues that cannot be derived from metrics. But surveys shouldn't be used to measure developer productivity, much less the impact of AI on developer productivity. You can use them to see how happy people are using AI, I suppose.
Great. Now let's dive into our methodology. In an ideal world, you would have an engineer who writes code, and this code would be evaluated by a panel of 10 or 15 experts who, separately and without knowing each other's answers, evaluate that code on quality, maintainability, output, how long it would take, how good it is: a bucket of questions. Then you aggregate those results, and we found two things. The first is that this panel actually agrees with itself: it turns out one engineering expert agrees with another engineering expert when they're evaluating the same piece of code in front of them. Secondly, and probably most importantly, you can use this panel to predict reality. The problem is that this approach is very slow, not scalable, and expensive. So we built a model that essentially automates it: it correlates well with the panel, and it's fast, scalable, and affordable.
The way it works is that it plugs into git, and the model analyzes the source code changes of every commit and quantifies them along these dimensions. Since every commit has a unique author, a unique SHA, and a unique timestamp, you can then understand the productivity of a team as the functionality of the code they delivered across time: not the lines of code, not the number of commits, but what that code actually does. You can then put this in a dashboard, overlay it across time, and get something like this.
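The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual system: `score_commit` is a hypothetical stand-in for the panel-calibrated model, and the log format assumes `git log --pretty=format:%H|%ae|%at` (SHA, author email, Unix timestamp).

```python
from collections import defaultdict
from datetime import datetime, timezone

def score_commit(sha: str) -> float:
    # Hypothetical placeholder: stands in for the model that scores each
    # commit's diff on quality, maintainability, and implied effort.
    return 1.0

def aggregate(log_lines):
    """Sum per-commit scores into a per-(author, YYYY-MM) output series.

    Each line has the shape '<sha>|<author-email>|<unix-timestamp>',
    i.e. the output of: git log --pretty=format:%H|%ae|%at
    """
    series = defaultdict(float)
    for line in log_lines:
        sha, author, ts = line.split("|")
        month = datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%Y-%m")
        series[(author, month)] += score_commit(sha)
    return dict(series)
```

Because author, SHA, and timestamp uniquely identify each commit, the same series can be re-sliced by team or time window without re-scoring anything.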
Great. So now let's dive into some of
our results.
Here, September is when this company implemented AI. This is a team of about 120 developers, and they were piloting whether they wanted to use AI in their regular workflow. Every bar is the sum total of the output delivered in that month using our methodology, not lines of code. In green is added functionality; in gray, removed functionality; in blue, refactoring; and in orange, rework. Rework versus refactoring: both alter existing code, but rework alters code that's much more recent, meaning it's wasteful. Refactoring could be wasteful or not.
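The rework-versus-refactoring distinction could be operationalized along these lines. The 21-day cutoff is purely an assumed illustration; the talk only says rework touches "much more recent" code and doesn't give the study's actual threshold.

```python
from datetime import datetime, timedelta

# Assumed cutoff for "much more recent" code -- illustrative, not the
# study's real parameter.
REWORK_WINDOW = timedelta(days=21)

def classify_change(commit_time: datetime, original_line_time: datetime) -> str:
    """Label a modification of an existing line.

    'rework'      -> the modified line is young, so the change likely
                     patches code that was just shipped (wasteful churn).
    'refactoring' -> the modified line is old established code.
    """
    age = commit_time - original_line_time
    return "rework" if age <= REWORK_WINDOW else "refactoring"
```

In practice the age of each modified line would come from `git blame` on the parent commit.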
From the get-go, you see that by implementing AI, you get a bunch more rework. What happens is that you feel like you're delivering more code, because there's more volume of code being written, more commits, more stuff being pushed, but not all of it is actually useful. To be clear, based on this chart and overall, there is a productivity boost of about 15 to 20%. But a lot of the gains you're seeing are this kind of rework, which is a bit misleading.
If I could summarize it into one chart, with many caveats, it would be something like this. With AI coding, you increase your output by roughly 30 to 40%: you're delivering more code. However, you have to go back and fix some of the bugs that code introduced, clean up the mess the AI made, which in turn gives you an average productivity gain across all industries and sectors of roughly 15 to 20%. There's a lot of nuance here, which we're going to see in just a second.
Here we have two violin charts plotting the distributions of the productivity gains from using AI. The y-axis is the gain; note that it starts at minus 20% and goes up from there. Four pieces of data are shown: in blue, low-complexity tasks; in red, high-complexity tasks; the chart on the left is greenfield tasks, the chart on the right is brownfield tasks. Right from the get-go, the first conclusion is that AI performs better on simpler coding tasks. That's good; it's borne out by the data. The second thing we see is that for low-complexity greenfield tasks, the distribution is more elongated and higher on average. Keep in mind that this is for enterprise settings. It doesn't apply to personal projects or vibe-coding something for yourself from scratch; the improvements there would be much bigger. This is for real-world company settings.
The third thing we see is that high-complexity tasks are not only lower than low-complexity ones on average in terms of the distribution, but in some cases AI is more likely to decrease an engineer's productivity on them. This decrease could have many causes; the underlying reasons are still not super clear to us, but that's what we see in the data.
If we translate this into a chart that's a bit more digestible: the columns show the average or median gain, and the line represents the interquartile range, so the bottom of the line is the 25th percentile and the top is roughly the 75th. Here it's very clear that we have more gains from low-complexity tasks, fewer gains from high-complexity tasks, and that in brownfield projects it's harder to leverage AI to increase productivity.
If there's a slide you could show to your leadership team, it could be this one, or it could also be this one. Here we have a matrix, really simplifying things (reality is a bit more complicated than this): on one axis, task complexity, low and high; on the other, project maturity, greenfield versus brownfield. Low complexity, greenfield: 30 to 40% gains from AI. High complexity but greenfield: more modest gains, 10 to 15%. Brownfield and low complexity: pretty good, 15 to 20%. And most importantly, high-complexity brownfield tasks: 0 to 10%. These are indicative guidelines based on what we see in the data. I forgot to mention that this slide has a sample size of 136 teams across 27 companies, so it's pretty representative.
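The four cells of that matrix can be captured as a small lookup table. This simply encodes the ranges reported on the slide, nothing more; the bucket names are illustrative.

```python
# Reported AI productivity gain ranges (in percent) by task complexity
# and project maturity, as stated in the talk (n = 136 teams, 27 companies).
GAINS = {
    ("low", "greenfield"):  (30, 40),
    ("high", "greenfield"): (10, 15),
    ("low", "brownfield"):  (15, 20),
    ("high", "brownfield"): (0, 10),
}

def expected_gain(complexity: str, maturity: str) -> tuple:
    """Return the (min%, max%) gain range for a task bucket."""
    return GAINS[(complexity, maturity)]
```

A team planning an AI rollout could use something like this to prioritize the low-complexity greenfield work where the expected payoff is largest.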
Here we have a similar matrix, except at the bottom we have language popularity.
In the low bucket we have examples such as COBOL, Haskell, and Elixir, really obscure stuff, and in high we have things like Python, Java, JavaScript, and TypeScript. What we see is that AI doesn't really help, even with low-complexity tasks, for low-popularity languages. It can help a bit, but it's not terribly useful, and what ends up happening is that people just don't use it: if it's only helpful two times out of five, you're not going to use it very often.
What's funny, or interesting, is that for low-popularity languages and complex tasks, AI can actually decrease productivity: it's so bad at coding in COBOL or Haskell or Elixir that it just makes you slower. Granted, this doesn't come up often; it may be 5 or 10% of global development work, if that. Most development work is probably somewhere in the high-popularity part of the chart, and there you have gains of roughly 20% for low complexity and 10 to 15% for high complexity.
Now, moving into something a bit more theoretical, less empirically proven, but more like what we're seeing in the data. This is an illustrative chart with the productivity gain from AI on the y-axis and a logarithmic scale of codebase size on the x-axis, from 1,000 lines of code to 10 million. We see that as codebase size increases, the gains you get from AI decrease sharply. Most codebases nowadays are bigger than a thousand lines of code, unless you're a YC startup or something spun up a couple of months ago. There are three reasons for this. First, context window limitations; we're going to see in a second how performance decreases even within large context windows. Second, the signal-to-noise ratio confuses the model, if you will. And third, larger codebases have more dependencies and more domain-specific logic.
Borrowing from a paper called NoLiMa, which scores LLM performance on a scale of 0 to 100, you see that as context length increases from 1,000 to 32,000 tokens, performance decreases. We see all these models here; for example, Gemini 1.5 Pro has a context window of 2 million tokens, and you might think, whoa, I can just throw my entire codebase into it and it's going to retrieve and code perfectly. But what we see is that even at 32,000 tokens, it's already showing a drop in performance from about 90% to about 50%. So what's going to happen when you move from 32k to 64k or 128k tokens? You're going to see really, really poor performance there.
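A back-of-the-envelope check along these lines can tell you whether "throw the whole codebase in" is even plausible for your repo. Both numbers here are rough assumptions: 4 characters per token is a common heuristic, not a real tokenizer, and the 32k-token degradation point is where the NoLiMa results discussed above already show quality falling off.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4        # rough heuristic, not an actual tokenizer
DEGRADATION_POINT = 32_000  # tokens; performance already drops well before 2M

def estimate_tokens(root: str, exts=(".py", ".js", ".ts", ".java")) -> int:
    """Crudely estimate the token count of a codebase: total chars / 4."""
    chars = sum(
        len(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.suffix in exts
    )
    return chars // CHARS_PER_TOKEN

def fits_context(root: str) -> bool:
    """True if the codebase plausibly sits below the point where
    long-context performance starts to degrade."""
    return estimate_tokens(root) <= DEGRADATION_POINT
```

For anything that fails this check, some form of retrieval or scoping is needed rather than pasting the whole repo into the prompt.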
In short, AI does increase developer productivity, and you should use it in most cases, but it doesn't increase productivity all the time or equally for everyone. It depends on things like task complexity, codebase maturity, language popularity, codebase size, and context length.
Thank you so much for listening. If you'd like to learn more about our research, you can access our research portal at softwareengineeringproductivity.stanford.edu. You can also reach me by email or LinkedIn; I'm super happy to talk about this topic at any time. Thank you so much.