Core Theme
A junior engineer's experience highlights that while clean code is important, it is insufficient for production readiness. True system success relies on a holistic approach encompassing infrastructure, observability, and an operational culture that prioritizes resilience over mere code elegance.
I watched a senior engineer with 8 years
of experience build really good code,
good patterns, solid test coverage,
looked great on paper. Then I watched it
crash and burn in production at one-tenth
the scale of the legacy code it was
supposed to replace. I was the mid-level
on the team who saw the problems early,
tried to raise concerns three different
times, and nobody listened. This is how I
learned that beautiful code is
completely worthless. It's 2018. I'm
working with a mid-sized e-commerce
company. They do about 50 million in
annual recurring revenue. Seven backend
engineers, three maintaining the legacy
PHP system, four of us building the
future. I was on the future team. Well,
management brings in Steven, 8 years of
experience, who recently gave a talk called
"Why Code Quality Is Your Competitive
Advantage." I'm only 2 years in. I'm
excited to learn from someone like this.
First team meeting, Steven's on the
whiteboard for hours showing
architecture, repository patterns, event
sourcing, everything you'd need,
all the clean code book stuff. And I'm
thinking, this is really impressive.
This is how you're supposed to build
systems. But there's something in how he
talks about it. This confidence that
borders on dismissiveness. He waves his
hand at the legacy system, saying, "This
is exactly why we need this migration to
happen now. Clean code is what matters."
I'm cautiously optimistic. The patterns
make sense, but something feels off
about ignoring everything else. What I
didn't realize was that Steven was
optimizing for the wrong backend
pillar, and I was about to watch it play
out in slow motion. Steven's first PR
comes in. The code is clean. It's nicely
structured. It has proper dependency
injection. It follows everything you'd
want it to follow. It got a few
comments, mostly positive. Two
raised caching, but Steven
replied that we didn't
want too much premature optimization.
Then Steven assigns a few of us, me and a couple
of other people, to study the legacy PHP
system, to understand why
we're migrating. I open it. The legacy
system is really messy. One of the files
has 4,847
lines. It has no classes, SQL
concatenation, magic numbers, pure
spaghetti. Steven also looked at it, and
in standup, he mentioned that it was
worse than he thought. But I was curious,
because it was currently running properly.
Of course, the code quality could be
better, but it worked. So I dug
deeper and I started to find things
buried in the configuration files and
scattered through the deployment
scripts: the whole hidden infrastructure
of this legacy application. It had
three-tier caching with a 91% cache hit
rate. It also had 47 database indexes
on the orders table alone, partitioning
by month, three read replicas, and
then I found custom metrics for clients,
240 to be exact, tracked across every
step of checkout, error rates by payment
method, latency percentiles, and
real-time dashboards. It had everything
you needed in case a disaster happened.
So, I went back and looked at the Black
Friday 2017 data, because it was the year
before: 85,000 concurrent
users, 77,000 orders, a 246-millisecond
average response time, and 99.4% uptime. I'm
staring at these numbers thinking like,
"Wow, this is really impressive." So, I
grab coffee with Mike. He's the legacy
team lead. 15 years at the company. I
show him the numbers. I remember telling
him, "Mike, the code's a mess, but these
numbers are really good. How did you do
it?" I still remember it today because it
felt funny: him leaning back,
taking a sip of his
coffee, and saying, "Code quality is
just one dimension, Eric.
Infrastructure, database design,
observability, those matter, too. Maybe
more when you're under load than you
think." That still sticks with me today.
Code's just one dimension. We give so
much credit and so much effort toward
code, but it's only one dimension. It's
only one pillar of building software.
So, I bring it up in our planning. I
said, "Hey, Steven, I
looked at the Black Friday metrics and
they handled 85,000 concurrent users.
Should we be looking at more caching and
observability to match what they
were doing in production today?" And I
remember Steven barely even
looking at me. "That's just compensating
for bad code, Eric. Clean architecture
doesn't need all that infrastructure.
Trust me." And I'm thinking, well, the
data doesn't lie, right? But he has 8
years. I have 2 years. Maybe I'm missing
something. This was the first time I
raised a concern and I was shut down
almost instantly. Over the next 2
months, we start building out Steven's
vision. The code is very clean. I'm
going to give Steven all the credit for
that. It has clean structures,
the patterns are great,
everything's testable. It's very
satisfying. But that question keeps
nagging me: will it
scale? While Steven's perfecting the
code, infrastructure is delegated to a
DevOps engineer who's never scaled
before. That DevOps engineer
knew what they were doing, but not
when it came to scaling for
users. And on top of it, we barely used
caching in staging for this new system.
Our cache hit rate was only 23%; the
legacy app's was 91%. And for observability,
we had basic logs, but no custom metrics, no query
visibility. So, I do something I've
never done before. I start documenting
everything. I create a document called
"Production Readiness Concerns," or
something like that, and I list them
out: our cache hit rate is 23%, the
legacy system's was 91%. We have no
query monitoring; the production system
has a ton of it. We have no
sustained load testing. We're missing
composite indexes. I spent a little bit
of time adding links to all the legacy
metrics. And I added some suggestions
and I sent it over to Steven. His
response was what you'd expect, again:
"Thanks for thinking about
this, Eric, but we don't want to
overengineer the first MVP. We'll
address issues if they show up in beta."
But beta was October. Black Friday is
November. So, let's go ahead and fast
forward to the week before beta. We
load test 1,000 concurrent users. It works.
No issues at all. And you know why?
Because we're testing with 1,000 users,
not the 85,000 users that was on Black
Friday last year. And I bring this up
again that we need to test with more
people. Steven looks at me and the rest
of the team. "Eric, I appreciate you
being thorough, but we can't let
perfection be the enemy of good enough."
Which is something I believe in. I
always talk about good enough, but good
enough means good enough for what you're
trying to do, not just shipping code.
The other engineers nod. It sounds
reasonable when someone says it, whether
they're right in the context or not.
Everyone's confident, so I
stay quiet. Two times now. Two times
I've raised concerns. Two times I was
told not to worry. I didn't know it yet,
but I was running out of chances. Fast
forward to the beta launch day. We
flipped the feature flag:
new checkout enabled for 5% of the
users. I'm at my desk with
everybody else watching. We only have
three metrics. That's all we have. CPU
15%, looks good. Memory 38%, okay. Error
rate 0%, exactly what we want. But
I still can't shake the thought of the
240 custom metrics the other system
had that we're not looking at. Were we
looking at connection pool
utilization? No. The cache hit rate? No.
Query performance? No. After
launch, the first order is complete. It
was fast, 142 milliseconds. But 30
minutes later, we got a customer support
message in the #incidents Slack channel:
"Getting reports of slow checkout. Is
something wrong?" I remember Steven and
me looking at the dashboard. We look at
the numbers. He types in Slack,
"Probably just perception. Numbers
look fine." But then it happens again
and again. And then we finally get the
dreaded database error:
connection pool exhausted. I refresh the
dashboard. Error rate is climbing. Support
tickets start coming in. And here's the
thing: we couldn't see any of the issues,
because the only metrics behind that
dashboard were CPU and
memory. We couldn't see the queries or
the database connections. Our beautiful,
beautiful repository pattern was lazy
loading everything. We had to roll back
our modern application to the legacy
app. The rollback takes 47 minutes. At
the time, it was the longest 47 minutes
of my career. And here's the thing, it
wasn't even Black Friday. It wasn't even
the big day of the 85,000 concurrent
users that I had in the back of my mind.
And this was just beta. We probably had
a little over a thousand users. We load
tested at about the maximum we
thought would happen that day, and we
were still that close to crashing. And that's
not even close to the 85,000 we had on
Black Friday the year before. So, I take
a breath. I think there are four pillars
to a production system. We nailed one,
code quality, and even that not all
that well. We ignored the other three.
Let's go over each pillar. The first pillar
is code quality. Yes, our code was
clean, beautiful, even. And that does
matter. Clean code is easier to
maintain, easier to onboard to,
easier to reason about. Those are real
benefits of having clean code. But clean
code doesn't automatically mean scalable
code. Doesn't automatically mean you're
building the right thing for the
product. Let me show you what I mean.
Our repository pattern was a textbook
example of good design. It had single
responsibility, dependency injection. It
was testable. Everything the clean code
book tells you to do. Here's what it
looked like conceptually. We had a
product repository that handles
products. We had a category repository
that handled categories. We had a user
repository which you probably can guess
what it did. It handled users. They were
separated. They were clean. They were
easy to test. But this is what happened.
We called the get product details for a
single product page. First, we fetch the
product. One query makes sense. Then the
product object has a category property.
We access it. We lazy load the second
query. That category has a parent
category. We access it. Third query. The
product has reviews. Now, we can say it
has like 10 reviews for this example,
but if it had 10 reviews, that's 10 more
queries. We were falling into the N+1
trap. In the legacy system, the ugly
4,847-line file had two queries: one
with a massive join that got the
product, its category, and the parent
category, and one that batch-fetched
all the reviews. Is it ugly?
Absolutely. Can you maintain it easily?
Probably not. But can you see every
query? Does it have the metrics to get
you by? Yes. Our clean abstractions made
the expensive operations invisible. And
here's the thing that really bothers me.
We never would have caught this in code
review. Look at it from a code
review perspective. Is the code modular?
Yes, check. Is it testable? Yes. Does it
follow SOLID principles? Yes. Approved,
ship it. No one was looking at the
queries, because the queries were hidden
behind the abstraction. Nobody asked how
many queries this generates, because you can't see
it. The abstraction hid it from us.
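The trap is easy to reproduce once you make the queries visible. Here's a minimal sketch of it; the table names, schema, and query counter are all illustrative, not the real system's. Lazy loading turns one product page into fourteen queries, while the legacy-style join-plus-batch approach does the same work in two:

```python
# Hypothetical sketch of the N+1 trap a repository pattern can hide.
# Schema and names are illustrative, not the real system's.
import sqlite3

class CountingConnection:
    """Wraps a sqlite3 connection and counts every query executed."""
    def __init__(self, conn):
        self.conn = conn
        self.queries = 0
    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
    CREATE TABLE reviews (id INTEGER PRIMARY KEY, product_id INTEGER, body TEXT);
    INSERT INTO categories VALUES (1, 'Electronics', NULL), (2, 'Phones', 1);
    INSERT INTO products VALUES (1, 'Phone X', 2);
""")
conn.executemany("INSERT INTO reviews (product_id, body) VALUES (1, ?)",
                 [(f"review {i}",) for i in range(10)])
db = CountingConnection(conn)

# Lazy-loading style: every attribute access becomes its own query.
product = db.execute("SELECT * FROM products WHERE id = 1").fetchone()
category = db.execute("SELECT * FROM categories WHERE id = ?", (product[2],)).fetchone()
parent = db.execute("SELECT * FROM categories WHERE id = ?", (category[2],)).fetchone()
review_ids = [r[0] for r in db.execute("SELECT id FROM reviews WHERE product_id = 1")]
for rid in review_ids:  # one query per review: the N+1 trap
    db.execute("SELECT * FROM reviews WHERE id = ?", (rid,))
lazy_queries = db.queries  # 1 product + 1 category + 1 parent + 1 id list + 10 reviews = 14

# Legacy style: one big join plus one batch fetch.
db.queries = 0
db.execute("""
    SELECT p.*, c.name, pc.name
    FROM products p
    JOIN categories c ON c.id = p.category_id
    LEFT JOIN categories pc ON pc.id = c.parent_id
    WHERE p.id = 1
""").fetchone()
db.execute("SELECT * FROM reviews WHERE product_id = 1").fetchall()
batched_queries = db.queries  # 2

print(lazy_queries, batched_queries)  # 14 2
```

A query counter like this in an integration test would have caught the regression that code review, by design, couldn't see.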
Steven was working on the queries. We
just called them through our repository.
And if you ask, "Eric, why didn't you
catch it when you were testing?" Well,
that's because we were using mocks.
We genuinely never tested the one thing
that matters, and that's behavior under
real production load. And it's not just
queries, it's the memory, too. Now,
pillar two is the infrastructure and
scaling strategy. This is where we
really, really failed as a team. And I
own some of this, too. I saw the
infrastructure gap, and I didn't push
hard enough. The legacy system had
three-tier caching: application, Redis,
and CDN. It had database partitioning by
month, three read replicas, custom
connection pooling. We built our code first and
delegated infrastructure to a DevOps
engineer who'd never scaled an
e-commerce store before. That should
have been a red flag in itself, but we
just kept moving forward. We had minimal
caching, 23%, one database instance.
Steven said something in planning
that I can't get out of my mind:
that clean architecture won't
need all that infrastructure. I
mentioned it earlier in the story, and
that's simply not how it works.
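The gap between a 23% and a 91% cache hit rate is easy to put numbers on. Here's a back-of-envelope sketch; the latencies (2 ms for a cache hit, 80 ms for a database round trip) are assumed figures for illustration, not measurements from the story:

```python
# Back-of-envelope effective read latency with caching.
# CACHE_MS and DB_MS are assumptions for illustration;
# the 23% and 91% hit rates are from the story.
CACHE_MS = 2.0   # assumed latency of a cache hit
DB_MS = 80.0     # assumed latency of a database round trip

def effective_latency_ms(hit_rate):
    # A hit costs the cache lookup; a miss costs the lookup plus the DB trip.
    return hit_rate * CACHE_MS + (1 - hit_rate) * (CACHE_MS + DB_MS)

new_system = effective_latency_ms(0.23)   # ~63.6 ms average per read
legacy = effective_latency_ms(0.91)       # ~9.2 ms average per read
print(f"23% hit rate: {new_system:.1f} ms, 91% hit rate: {legacy:.1f} ms")
```

Under these assumptions the average read is roughly seven times slower at a 23% hit rate, and every one of those misses lands on the single database instance.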
Infrastructure doesn't compensate for
bad code; infrastructure is the
foundation for the code. Under load,
infrastructure beats whatever you're
trying to do in your architecture almost
every single time. Because when you have
85,000 concurrent users, even the most
perfect optimized query takes time. Even
if you get the query down to 5
milliseconds, multiply that by 85,000,
you need more infrastructure. There's a
reason all these giant tech companies
have crazy infrastructures. You need
caching to handle repeated requests. You
need connection pooling to manage
database connections. You need read
replicas to distribute load. This isn't
optional. This is literally the
baseline. So here's the question you
need to ask yourself every time you
build a system. How and when will you
need to scale to 10 times the
traffic? And if the answer is soon, can
your database handle it? Can your cache
handle it? Can your network handle it?
If you can't answer those questions,
your system is not ready to be scaled.
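That "5 milliseconds times 85,000" point can be made concrete with rough capacity math. The queries-per-user rate and the pool size below are assumptions for illustration, not numbers from the story:

```python
# Rough capacity math for the "5 ms query at 85,000 users" point.
# The per-user query rate and pool size are assumptions for illustration.
CONCURRENT_USERS = 85_000
QUERIES_PER_USER_PER_SEC = 0.5   # assumed: one query every 2 seconds per user
QUERY_MS = 5.0                   # the "perfectly optimized" query

total_qps = CONCURRENT_USERS * QUERIES_PER_USER_PER_SEC   # 42,500 qps
qps_per_connection = 1000.0 / QUERY_MS                    # 200 qps per connection
connections_needed = total_qps / qps_per_connection       # 212.5 connections

print(f"{total_qps:.0f} qps needs about {connections_needed:.0f} busy DB connections")
# A default pool of, say, 20 connections saturates at 4,000 qps,
# roughly a tenth of this load. That's "connection pool exhausted".
```

Even with a perfect 5 ms query, the arithmetic demands pooling, replicas, and caching long before Black Friday traffic arrives.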
If your system's not ready to be
scaled because you don't have the
users yet, that's fine. But if you have the
users, you need to figure it out. Now,
pillar three is observability. This one
hurts the most because we could have
caught this. The legacy system had 240
metrics, like I mentioned earlier. We
built three. When things started going
wrong, we were completely blind. We
could see error rates climbing, but we
couldn't see why. We couldn't see which
queries were slow, which endpoints were
struggling, where the bottlenecks were.
Simply put, you can't fix what you can't
see. Observability isn't just logging.
Logs tell you what happened. That's
valuable, but metrics tell you why.
Here's what we should have implemented
from day one: query performance metrics,
resource utilization metrics, and
business metrics on how everything
was actually working.
The legacy system had all of this.
That's how Mike and his team knew they
could handle Black Friday. They watched
the metrics climb and knew exactly
where their ceilings were and how far
the metrics could go. We didn't
know our ceiling until we hit our head
on it. And here's the thing, it's not
about building a monitoring system from
scratch. You can use tools that already
exist: Datadog, New
Relic, it doesn't matter. Just pick one.
Instrument your code. Expose the metrics
that matter. Observability is the
difference between responding to a
fire and preventing one. And as a
backend engineer, you always want to
prevent it. Now, pillar four is the
operational culture. This is the part
that nobody teaches you, and it's by far
one of the most important ones. Culture.
The culture you're in kills more systems
than bad code ever will. And here's what
I mean. I raised concerns three times.
One, in planning when I showed the Black
Friday metrics and asked about caching.
Two, in the code review, I documented
production readiness concerns. Three,
before beta, I questioned the load test
scope. Each time I was told the same
thing, don't overengineer. We'll fix
issues when they come up. They were
real "trust me, bro" moments. Those
kinds of messages shut down discussion,
because Steven had 8 years of experience
and I only had two. His confidence was
treated as expertise. But confidence and
expertise aren't the same thing. Steven
was confident in clean code because he
had seen it work. But he'd never
scaled an application to 85,000
concurrent users. And neither have I.
Neither had anyone on our modern team.
The only person was Mike. And we weren't
even talking to Mike. We had a culture
that rewarded elegance over resilience
that treated production readiness as a
nice to have instead of a requirement.
And that culture always comes from the
top. When leadership rewards beautiful
demos over boring reliability, you get
what we got. Real engineering isn't just
about writing code. It's about designing
systems that can survive contact with
reality. If you've ever been on a team
where clean code got prioritized over
operational reality, let me know in the
comments. And if you want me to break
down each one of those
pillars a little more, let me know.
I'll make videos on them. Thanks for watching.