Does AI Actually Boost Developer Productivity? (100k Devs Study) - Yegor Denisov-Blanch, Stanford
In January of this year, Mark Zuckerberg said that he was going to replace all of the mid-level engineers at Meta with AI by the end of the year. I think Mark was a bit optimistic, and he was probably acting like a good CEO would, to inspire a vision and probably also to keep the Facebook stock price up. But what Mark also did was create a lot of trouble for CTOs worldwide. Why?
Because after Mark said that, almost every single CEO in the world turned to their CTO and said, "Hey, Mark says he's going to replace all of his developers with AI. Where are we on that journey?" And the answer probably was, honestly, not very far, and we're not sure we're going to do that. I personally don't think AI is going to replace developers entirely, at least not this year, let alone at Meta. I do think AI increases developer productivity, but there are also cases in which it decreases developer productivity. So using AI for coding is not a one-size-fits-all solution, and there are cases in which it shouldn't be used.
For the past three years we've been running one of the larger studies on software engineering productivity at Stanford, and we've done this in both a time-series and a cross-sectional way. Time series meaning that even if a participant joins in 2025, we get access to their git history, so we can see trends across time: we can see COVID, we can see AI, and all the other things that happened. Cross-sectional because we have more than 600 companies participating: enterprises, mid-sized companies, and startups. This means we have more than 100,000 software engineers in our data set right now, dozens of millions of commits, and billions of lines of code. Most importantly, most of this data is from private repositories.
This is important because if you use a public repo to measure someone's productivity, that public repo is not self-contained: someone could be working on that repo on the weekend, or only once in a while. Whereas with a private repo, it's much more self-contained and much easier to measure the productivity of a team or an individual.
Late last year there was a huge controversy around ghost engineers. This came from the same research group, our research group, and Elon Musk was kind enough to retweet us. What we found is that roughly 10% of the software engineers in our data set at the time, about 50,000, were what we called ghost engineers: people who collect a paycheck but basically do no work. That was very surprising for some people, and very unsurprising for others.
Some of the people on this research team include, for example, Simon from industry. He was CTO at a unicorn, which he exited; he had a team of about 700 developers, and as CTO he was always the last person to know when something was up with his engineering team. So he thought, okay, how can I change this? Myself, I've been at Stanford since 2022, and I focus on what I call data-driven decision-making in software engineering; in a past life I was looking after digital transformation for a large company with thousands of engineers. Also part of the team is Professor Kosinski at Stanford, whose research focuses on human behavior in a digital environment; he was the Cambridge Analytica whistleblower back in the day, if you recall that.
Today we're going to be talking about three things. We're going to start off with the limitations of existing studies that seek to quantify the impact of AI on developer productivity. Then we're going to showcase our methodology. And lastly, we're going to spend most of the time looking at the results: what is the impact of AI on dev productivity, and what are the ways we can slice and dice these results to make them useful?
There's lots of research being done on this topic, but a lot of it is led by vendors who are themselves trying to sell you their own AI coding tools, so there's sometimes a bit of a conflict of interest. The three biggest limitations I see are these. First, a lot of these studies revolve around commits, PRs, and tasks: hey, we completed more commits, more PRs, the time between commits decreased. The problem is that task size varies, so delivering more commits does not necessarily mean more productivity. In fact, what we found very often is that by using AI, you're introducing new tasks that are bug fixes for the code the AI just wrote, in which case you're kind of spinning your wheels in place.
Secondly, there's a bunch of studies that say: we grabbed a bunch of developers, split them into two groups, gave one group AI, and didn't give it to the other. What usually happens there is that these are greenfield tasks, where participants are asked to build something from scratch with zero context. There, of course, AI decimates the non-AI group, but that's because AI is just really good at greenfield, boilerplate code. Most of software engineering, though, isn't greenfield and isn't boilerplate: there's usually an existing codebase, and there are usually dependencies. So these studies can't be applied too well to those situations either.
Then we also have surveys, which we found to be an ineffective predictor of productivity. We ran a small experiment with 43 developers, asking every developer to evaluate themselves relative to the global mean or median in five-percentile buckets from 0 to 100, and then compared that to their measured productivity (we'll get into what that means later). What we found is that asking someone how productive they think they are is almost as good as flipping a coin: there's very little correlation. People misjudged their productivity by about 30 percentile points, and only one in three people estimated their productivity within one quartile. I think surveys are great; they're valuable for surfacing morale and other issues that cannot be derived from metrics. But surveys shouldn't be used to measure developer productivity, much less the impact of AI on developer productivity. You can use them to see how happy people are using AI, I suppose.
Great. Now let's dive into our methodology. In an ideal world, you would have an engineer who writes code, and this code would be evaluated by a panel of 10 or 15 experts who, separately and without knowing each other's answers, evaluate that code on quality, maintainability, output, how long it would take, how good it is: a bucket of questions. Then you aggregate those results, and we found two things. The first is that this panel actually agrees with itself: it turns out one engineering expert agrees with another engineering expert when they're evaluating the same piece of code in front of them. Secondly, and probably most importantly, you can use this panel to predict reality. The problem is that this approach is very slow, not scalable, and expensive. So we built a model that essentially automates it: it correlates well with the panel, and it's fast, scalable, and affordable.
The way it works is that it plugs into git, and the model analyzes the source code changes of every commit and quantifies them along these dimensions. Since every commit has a unique author, a unique SHA, and a unique timestamp, you can then understand the productivity of a team as the functionality of the code they delivered across time: not the lines of code, not the number of commits, but what that code actually does. You can then put this in a dashboard, overlay it across time, and get something like this.
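The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual system: `score_commit` is a hypothetical stand-in for the panel-calibrated model, and the log format assumes `git log --pretty=format:%H|%ae|%at` (SHA, author email, Unix timestamp).

```python
from collections import defaultdict
from datetime import datetime, timezone

def score_commit(sha: str) -> float:
    # Hypothetical placeholder: stands in for the model that scores each
    # commit's diff on quality, maintainability, and implied effort.
    return 1.0

def aggregate(log_lines):
    """Sum per-commit scores into a per-(author, YYYY-MM) output series.

    Each line has the shape '<sha>|<author-email>|<unix-timestamp>',
    i.e. the output of: git log --pretty=format:%H|%ae|%at
    """
    series = defaultdict(float)
    for line in log_lines:
        sha, author, ts = line.split("|")
        month = datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%Y-%m")
        series[(author, month)] += score_commit(sha)
    return dict(series)
```

Because author, SHA, and timestamp uniquely identify each commit, the same series can be re-sliced by team or time window without re-scoring anything.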
Great. So now let's dive into some of
our results.
Here, September is when this company implemented AI. This is a team of about 120 developers, and they were piloting whether they wanted to use AI in their regular workflow. Every bar is the sum total of the output delivered in that month using our methodology, not lines of code. In green is added functionality; in gray, removed functionality; in blue, refactoring; and in orange, rework. Rework versus refactoring: both alter existing code, but rework alters code that's much more recent, meaning it's wasteful. Refactoring could be wasteful or not.
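The rework-versus-refactoring distinction could be operationalized along these lines. The 21-day cutoff is purely an assumed illustration; the talk only says rework touches "much more recent" code and doesn't give the study's actual threshold.

```python
from datetime import datetime, timedelta

# Assumed cutoff for "much more recent" code -- illustrative, not the
# study's real parameter.
REWORK_WINDOW = timedelta(days=21)

def classify_change(commit_time: datetime, original_line_time: datetime) -> str:
    """Label a modification of an existing line.

    'rework'      -> the modified line is young, so the change likely
                     patches code that was just shipped (wasteful churn).
    'refactoring' -> the modified line is old established code.
    """
    age = commit_time - original_line_time
    return "rework" if age <= REWORK_WINDOW else "refactoring"
```

In practice the age of each modified line would come from `git blame` on the parent commit.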
From the get-go, you see that by implementing AI, you get a bunch more rework. What happens is that you feel like you're delivering more code, because there's more volume of code being written, more commits, more stuff being pushed, but not all of it is actually useful. To be clear, based on this chart and overall, there is a productivity boost of about 15 to 20%. But a lot of the gains you're seeing are this kind of rework, which is a bit misleading.
If I could summarize it into one chart, with many caveats, it would be something like this. With AI coding, you increase your output by roughly 30 to 40%: you're delivering more code. However, you have to go back and fix some of the bugs that code introduced, clean up the mess the AI made, which in turn gives you an average productivity gain across all industries and sectors of roughly 15 to 20%. There's a lot of nuance here, which we're going to see in just a second.
Here we have two violin charts plotting the distributions of the productivity gains from using AI. The y-axis is the gain; note that it starts at minus 20% and goes up from there. Four pieces of data are shown: in blue, low-complexity tasks; in red, high-complexity tasks; the chart on the left is greenfield tasks, the chart on the right is brownfield tasks. Right from the get-go, the first conclusion is that AI performs better on simpler coding tasks. That's good; it's borne out by the data. The second thing we see is that for low-complexity greenfield tasks, the distribution is more elongated and higher on average. Keep in mind that this is for enterprise settings. It doesn't apply to personal projects or vibe-coding something for yourself from scratch; the improvements there would be much bigger. This is for real-world company settings.
The third thing we see is that high-complexity tasks are not only lower than low-complexity ones on average in terms of the distribution, but in some cases AI is more likely to decrease an engineer's productivity on them. This decrease could have many causes; the underlying reasons are still not super clear to us, but that's what we see in the data.
If we translate this into a chart that's a bit more digestible: the columns show the average or median gain, and the line represents the interquartile range, so the bottom of the line is the 25th percentile and the top is roughly the 75th. Here it's very clear that we have more gains from low-complexity tasks, fewer gains from high-complexity tasks, and that in brownfield projects it's harder to leverage AI to increase productivity.
If there's a slide you could show to your leadership team, it could be this one, or it could also be this one. Here we have a matrix, really simplifying things (reality is a bit more complicated than this): on one axis, task complexity, low and high; on the other, project maturity, greenfield versus brownfield. Low complexity, greenfield: 30 to 40% gains from AI. High complexity but greenfield: more modest gains, 10 to 15%. Brownfield and low complexity: pretty good, 15 to 20%. And most importantly, high-complexity brownfield tasks: 0 to 10%. These are indicative guidelines based on what we see in the data. I forgot to mention that this slide has a sample size of 136 teams across 27 companies, so it's pretty representative.
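The four cells of that matrix can be captured as a small lookup table. This simply encodes the ranges reported on the slide, nothing more; the bucket names are illustrative.

```python
# Reported AI productivity gain ranges (in percent) by task complexity
# and project maturity, as stated in the talk (n = 136 teams, 27 companies).
GAINS = {
    ("low", "greenfield"):  (30, 40),
    ("high", "greenfield"): (10, 15),
    ("low", "brownfield"):  (15, 20),
    ("high", "brownfield"): (0, 10),
}

def expected_gain(complexity: str, maturity: str) -> tuple:
    """Return the (min%, max%) gain range for a task bucket."""
    return GAINS[(complexity, maturity)]
```

A team planning an AI rollout could use something like this to prioritize the low-complexity greenfield work where the expected payoff is largest.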
Here we have a similar matrix, except at the bottom we have language popularity.
In the low bucket we have examples such as COBOL, Haskell, and Elixir, really obscure stuff, and in high we have things like Python, Java, JavaScript, and TypeScript. What we see is that AI doesn't really help, even with low-complexity tasks, for low-popularity languages. It can help a bit, but it's not terribly useful, and what ends up happening is that people just don't use it: if it's only helpful two times out of five, you're not going to use it very often.
What's funny, or interesting, is that for low-popularity languages and complex tasks, AI can actually decrease productivity: it's so bad at coding in COBOL or Haskell or Elixir that it just makes you slower. Granted, this doesn't come up often; it may be 5 or 10% of global development work, if that. Most development work is probably somewhere in the high-popularity part of the chart, and there you have gains of roughly 20% for low complexity and 10 to 15% for high complexity.
Now, moving into something a bit more theoretical, less empirically proven, but more like what we're seeing in the data. This is an illustrative chart with the productivity gain from AI on the y-axis and a logarithmic scale of codebase size on the x-axis, from 1,000 lines of code to 10 million. We see that as codebase size increases, the gains you get from AI decrease sharply. Most codebases nowadays are bigger than a thousand lines of code, unless you're a YC startup or something spun up a couple of months ago. There are three reasons for this. First, context window limitations; we're going to see in a second how performance decreases even within large context windows. Second, the signal-to-noise ratio confuses the model, if you will. And third, larger codebases have more dependencies and more domain-specific logic.
Borrowing from a paper called NoLiMa, which scores LLM performance on a scale of 0 to 100, you see that as context length increases from 1,000 to 32,000 tokens, performance decreases. We see all these models here; for example, Gemini 1.5 Pro has a context window of 2 million tokens, and you might think, whoa, I can just throw my entire codebase into it and it's going to retrieve and code perfectly. But what we see is that even at 32,000 tokens, it's already showing a drop in performance from about 90% to about 50%. So what's going to happen when you move from 32k to 64k or 128k tokens? You're going to see really, really poor performance there.
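A back-of-the-envelope check along these lines can tell you whether "throw the whole codebase in" is even plausible for your repo. Both numbers here are rough assumptions: 4 characters per token is a common heuristic, not a real tokenizer, and the 32k-token degradation point is where the NoLiMa results discussed above already show quality falling off.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4        # rough heuristic, not an actual tokenizer
DEGRADATION_POINT = 32_000  # tokens; performance already drops well before 2M

def estimate_tokens(root: str, exts=(".py", ".js", ".ts", ".java")) -> int:
    """Crudely estimate the token count of a codebase: total chars / 4."""
    chars = sum(
        len(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.suffix in exts
    )
    return chars // CHARS_PER_TOKEN

def fits_context(root: str) -> bool:
    """True if the codebase plausibly sits below the point where
    long-context performance starts to degrade."""
    return estimate_tokens(root) <= DEGRADATION_POINT
```

For anything that fails this check, some form of retrieval or scoping is needed rather than pasting the whole repo into the prompt.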
In short, AI does increase developer productivity, and you should use it in most cases, but it doesn't increase productivity all the time or equally for everyone. It depends on things like task complexity, codebase maturity, language popularity, codebase size, and context length.
Thank you so much for listening. If you'd like to learn more about our research, you can access our research portal at softwareengineeringproductivity.stanford.edu. You can also reach me by email or LinkedIn; I'm super happy to talk about this topic at any time. Thank you so much.