0:01 I watched a senior engineer with 8 years
0:04 of experience build really good code,
0:06 good patterns, solid test coverage,
0:08 looked great on paper. Then I watched it
0:10 crash and burn in production at one-tenth
0:12 of the scale of the legacy code it was
0:14 supposed to replace. I was the mid-level
0:16 on the team who saw the problems early,
0:18 tried to raise concerns three different
0:20 times and nobody listened. This is how I
0:22 learned that beautiful code is
0:24 completely worthless. It's 2018. I'm
0:26 working with a mid-sized e-commerce
0:29 company. They do about 50 million in
0:31 annual recurring revenue. Seven backend
0:33 engineers, three maintaining the legacy
0:35 PHP system, four of us building the
0:38 future. I was on the future team. Well,
0:40 management brings in Steven, 8 years of
0:42 experience, recently gave a talk called
0:44 "Why Code Quality Is Your Competitive
0:46 Advantage." I'm only 2 years in. I'm
0:48 excited to learn from someone like this.
0:49 First team meeting, Steven's on the
0:51 whiteboard for hours showing
0:53 architecture, repository patterns, event
0:55 sourcing, and everything that you need,
0:57 all the Clean Code book stuff. And I'm
0:59 thinking, this is really impressive.
1:01 This is how you're supposed to build
1:03 systems. But there's something in how he
1:06 talks about it. This confidence that
1:08 borders on dismissiveness. He waves his
1:10 hand at the legacy system, saying, "This
1:13 is exactly why we need this migration to
1:15 happen now. Clean code is what matters."
1:18 I'm cautiously optimistic. The patterns
1:21 make sense, but something feels off
1:23 about ignoring everything else. What I
1:25 didn't realize was that Steven was
1:28 optimizing for the wrong backend
1:30 pillar and I was about to watch it play
1:33 out in slow motion. Steven's first PR
1:36 comes in. The code is clean. It's nicely
1:38 structured. It has proper dependency
1:39 injection. It follows everything you'd
1:41 want it to follow. It got a few
1:42 comments, mostly positive. Two comments
1:45 asked about caching, but he
1:46 replied and said, you know, we don't
1:48 want too much premature optimization.
1:50 Then Steven assigns me and a couple of
1:53 other people to study the legacy PHP
1:54 system, you know, to understand why
1:57 we're migrating. I open it. The legacy
1:59 system is really messy. One of the files
2:01 has 4,847
2:02 lines. It has no classes, SQL
2:04 concatenation, magic numbers, pure
2:06 spaghetti. Steven also looked at it, and
2:08 in standup, he mentioned that it was
2:10 worse than he thought. But I'm curious
2:12 because it's currently running properly.
2:14 Of course, it could be better, you know,
2:16 code quality, but it worked. So I dug
2:18 deeper and I started to find things
2:19 buried in the configuration files,
2:20 scattered through the deployment
2:23 scripts: the whole hidden infrastructure
2:25 of this legacy application. It has
2:27 three-tier caching and the cache hit rate
2:30 is 91%. It also has 47 database indexes
2:32 on the orders table alone, partitioning
2:35 by month, three read replicas, and
2:37 then I found custom metrics for clients,
2:40 240 to be exact, tracked across every
2:42 step of checkout, error rates by payment
2:44 method, latency percentiles, and
2:46 real-time dashboards. It had everything
2:48 you needed in case a disaster happened.
2:50 So, I went back and looked at the Black
2:52 Friday 2017 data because it was the year
2:54 before, and it had 85,000 concurrent
2:57 users, 77,000 orders, 246 milliseconds
3:01 on average with an uptime of 99.4%. I'm
3:02 staring at these numbers thinking like,
3:05 "Wow, this is really impressive." So, I
3:07 grab coffee with Mike. He's the legacy
3:09 team lead. 15 years at the company. I
3:11 show him the numbers. I remember telling
3:13 him, "Mike, the code's a mess, but these
3:15 numbers are really good. How did you do
3:17 it?" I still remember it today because
3:19 it felt funny. I just remember
3:20 him leaning back, taking a sip of his
3:22 coffee, and saying, "Code quality is
3:23 just one dimension, Eric.
3:25 Infrastructure, database design,
3:27 observability, those matter, too. Maybe
3:29 more when you're under load than you
3:30 think." That still sticks with me today.
3:33 Code's just one dimension. We give so
3:35 much credit and so much effort toward
3:37 code, but it's only one dimension. It's
3:39 only one pillar of building software.
3:41 So, I bring it up in our planning. I
3:42 said like, "Hey, Steven, you know, I
3:44 looked at the Black Friday metrics and
3:47 they handled 85,000 concurrent users.
3:49 Should we be looking at more caching and
3:51 observability to kind of match what they
3:53 were doing in production today?" And I
3:54 remember Steven barely even
3:56 looking at me. "That's just compensating
3:58 for bad code, Eric. Clean architecture
4:00 doesn't need all that infrastructure.
4:03 Trust me." And I'm thinking, well, the
4:05 data doesn't lie, right? But he has 8
4:07 years. I have 2 years. Maybe I'm missing
4:08 something. This was the first time I
4:10 raised a concern and I was shut down
4:12 almost instantly. Over the next 2
4:14 months, we start building out Steven's
4:16 vision. The code is very clean. I'm
4:17 going to give Steven all the credit for
4:18 that. It's clean. It has clean
4:19 structures. You know, the patterns are
4:21 great. Everything's testable. It's very
4:23 satisfying. But that question keeps
4:24 nagging me, you know, like will it
4:26 scale? While Steven's perfecting the
4:28 code, infrastructure is delegated to a
4:30 DevOps engineer who's never scaled a system
4:31 before. That DevOps engineer really
4:33 knew what they were doing, just not
4:35 from the standpoint of how to scale to
4:38 that many users. And on top of it, we barely used
4:41 caching in staging for this new system.
4:44 The cache hit rate was only 23%. The legacy app had
4:47 91%. And observability, we had basic
4:49 logs, but no custom metrics, no query
4:51 visibility. So, I do something I've
4:53 never done before. I start documenting
4:54 everything. I create a document called
4:57 something like "Production Readiness Concerns,"
4:58 or whatever I was thinking at
5:00 the time. And I list them out. The
5:02 cache hit rate: we have 23%, the legacy
5:03 system was 91. We have no
5:06 query monitoring. The production one has
5:08 a ton of query monitoring. We have no
5:10 sustained load testing. We're missing
5:12 composite indexes. I spent a little bit
5:14 of time adding links to all the legacy
5:16 metrics. And I added some suggestions
5:18 and I sent it over to Steven. His
5:19 response was what you'd expect, again:
5:20 "Thanks for thinking about
5:22 this, Eric, but we don't want to
5:24 overengineer the first MVP. We'll
5:26 address issues if they show up in beta."
5:28 But beta was October. Black Friday is
5:29 November. So, let's go ahead and fast
5:32 forward to the week before beta. We
5:35 load test with 1,000 concurrent users. It works.
5:37 No issues at all. And you know why?
5:40 Because we're testing with 1,000 users,
5:43 not the 85,000 users that were on Black
5:45 Friday last year. And I bring this up
5:46 again that we need to test with more
5:48 people. Steven looks at me and the rest
5:50 of the team. "Eric, I appreciate you
5:52 being thorough, but we can't let
5:54 perfection be the enemy of good enough,"
5:55 which is something I believe in. I
5:57 always talk about good enough, but good
5:59 enough means good enough for what you're
6:02 trying to do, not just shipping code.
6:03 But other engineers nod. It sounds
6:05 reasonable when someone says it, whether
6:07 they're right in the context or not.
6:08 Everyone else is confident, so I
6:10 stay quiet. Two times now. Two times
6:13 I've raised concerns. Two times I was
6:14 told not to worry. I didn't know it yet,
6:16 but I was running out of chances. Fast
6:18 forward to the beta launch day.
6:20 We flipped the feature flag.
6:21 New checkout enabled for 5% of the
6:23 users. I'm at my desk with
6:24 everybody else watching. We only have
6:26 three metrics. That's all we have. CPU
6:30 15%, looks good. Memory 38%, okay. Error
6:33 rate 0%. It's exactly what we want. But
6:35 I still can't shake the thought of the
6:38 244 custom metrics that the other system
6:39 had that we're not looking at. Are we
6:40 looking at connection pool
6:43 utilization? No. The cache hit rate? No.
6:44 The query performance? No. After
6:46 launch, the first order is complete. It
6:48 was fast, 142 milliseconds. But 30
6:50 minutes later, we got a customer support
6:53 message in a Slack channel, something like
6:54 #incidents: "Getting reports of slow
6:56 checkout. Is something wrong?" I remember
6:58 me and Steven looking at the dashboard.
6:59 We look at the numbers. He types in
7:01 Slack, "Probably just perception. Numbers
7:03 look fine." But then it happens again and
7:04 again. And then we finally get the
7:06 dreaded database connection error:
7:08 connection pool exhausted. I refresh the
7:10 dashboard. Error rate is climbing. Support
7:12 tickets start coming in. And here's the thing:
7:15 we couldn't see any of the issues, because
7:17 we hadn't added any metrics beyond
7:18 the dashboard showing the CPU and the
7:20 memory. We couldn't see the queries or
7:22 the database connections. Our beautiful,
7:25 beautiful repository pattern was lazy
7:27 loading everything. We had to roll back
7:28 our modern application to the legacy
7:30 app. The rollback takes 47 minutes. At
7:32 the time, it was the longest 47 minutes
7:34 of my career. And here's the thing, it
7:36 wasn't even Black Friday. It wasn't even
7:39 the big day of the 85,000 concurrent
7:41 users that I had in the back of my mind.
7:42 And this was just beta. We probably had
7:45 a little over a thousand users. We load tested
7:46 at probably the maximum that we
7:48 thought would happen on that day, but we
7:50 were so close to crashing. And that's
7:53 not even close to the 85,000 we had on
7:55 Black Friday the year before. So, I take
7:58 a breath. I think there are four pillars
8:00 to a production system. We nailed one.
8:01 Code quality. Didn't really nail it all
8:03 that well. We ignored three. Let's go over
8:04 each pillar. There's the first pillar,
8:06 which is code quality. Yes, our code was
8:08 clean, beautiful, even. And that does
8:10 matter. Clean code is easier to
8:11 maintain, easier to onboard to,
8:13 easier to reason about. Those are real
8:16 benefits of having clean code. But clean
8:18 code doesn't automatically mean scalable
8:20 code. Doesn't automatically mean you're
8:21 building the right thing for the
8:22 product. Let me show you what I mean.
8:24 Our repository pattern was a textbook
8:26 example of good design. It had single
8:28 responsibility, dependency injection. It
8:30 was testable. Everything the clean code
8:31 book tells you to do. Here's what it
8:33 looked like conceptually. We had a
8:35 product repository that handled
8:36 products. We had a category repository
8:38 that handled categories. We had a user
8:39 repository which you probably can guess
8:41 what it did. It handled users. They were
8:44 separated. They were clean. They were
8:46 easy to test. But this is what happened.
8:48 We called get product details for a
8:50 single product page. First, we fetch the
8:52 product. One query makes sense. Then the
8:54 product object has a category property.
8:56 We access it. We lazy load the second
8:58 query. That category has a parent
9:01 category. We access it. Third query. The
9:02 product has reviews. Say it
9:04 has 10 reviews for this example.
9:06 If it has 10 reviews, that's 10 more
9:09 queries. We were falling into the N+1
9:11 trap. In the legacy system, the ugly
9:13 4,847
9:15 line file had two queries. One query
9:16 with a massive join that gets the
9:19 product, the category, and the parent category,
9:21 and one query that batch fetched all the
9:22 reviews. Is it ugly?
9:25 Absolutely. Can you maintain it easily?
9:27 Probably not. But can you see every
9:29 query? Does it have the metrics to get
9:33 you by? Yes. Our clean abstractions made
9:36 the expensive operations invisible. And
9:38 here's the thing that really bothers me.
9:40 We never would have caught this in code
9:41 review. Look at this from a code
9:43 review process. Is the code modular?
9:46 Yes. Check. Is it testable? Yes. Does it
9:48 follow SOLID principles? Yes. Approved.
9:50 Ship it. No one was looking at the
9:51 queries because the queries were hidden
9:53 behind the repositories. Nobody asked how many queries
9:54 does this generate because you can't see
9:57 it. The abstraction hid it from us.
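A minimal sketch of the query-count gap described above, assuming a hypothetical three-table schema (products, categories, reviews) in an in-memory SQLite database. The schema, the `CountingConnection` wrapper, and both functions are illustrative assumptions, not code from either system. The lazy version here also spends one query listing review ids before loading each review, so its total depends on how the data layer loads collections.

```python
import sqlite3


class CountingConnection:
    """Wraps a sqlite3 connection and counts every query sent through it."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(n_reviews=10):
    """Build a tiny hypothetical catalog: one product, two categories, n reviews."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
        CREATE TABLE reviews (id INTEGER PRIMARY KEY, product_id INTEGER, body TEXT);
        INSERT INTO categories VALUES (1, 'Electronics', NULL), (2, 'Headphones', 1);
        INSERT INTO products VALUES (1, 'Studio headphones', 2);
    """)
    conn.executemany(
        "INSERT INTO reviews (product_id, body) VALUES (1, ?)",
        [(f"review {i}",) for i in range(n_reviews)],
    )
    return CountingConnection(conn)


def product_details_lazy(db, product_id):
    """Each property access triggers its own query, as a lazy-loading layer would."""
    product = db.query(
        "SELECT id, name, category_id FROM products WHERE id = ?", (product_id,))[0]
    category = db.query(
        "SELECT id, name, parent_id FROM categories WHERE id = ?", (product[2],))[0]
    parent = db.query(
        "SELECT id, name FROM categories WHERE id = ?", (category[2],))[0]
    review_ids = db.query(
        "SELECT id FROM reviews WHERE product_id = ? ORDER BY id", (product_id,))
    # One extra query per review: this loop is the "N" in N+1.
    reviews = [db.query("SELECT body FROM reviews WHERE id = ?", (rid,))[0][0]
               for (rid,) in review_ids]
    return product[1], category[1], parent[1], reviews


def product_details_joined(db, product_id):
    """Two queries total: one join for the product tree, one batch for reviews."""
    row = db.query(
        """SELECT p.name, c.name, pc.name
           FROM products p
           JOIN categories c ON c.id = p.category_id
           LEFT JOIN categories pc ON pc.id = c.parent_id
           WHERE p.id = ?""", (product_id,))[0]
    reviews = [body for (body,) in db.query(
        "SELECT body FROM reviews WHERE product_id = ? ORDER BY id", (product_id,))]
    return row[0], row[1], row[2], reviews


lazy_db, joined_db = make_db(), make_db()
product_details_lazy(lazy_db, 1)
product_details_joined(joined_db, 1)
print("lazy queries:", lazy_db.count, "joined queries:", joined_db.count)
```

Both functions return the same data. Counting queries per request in a test like this is one way to make the cost visible that a review of the repository code alone would miss.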
9:58 Steven was working on the queries. We
10:01 just called them through our repository.
10:02 And if you ask, "Eric, why didn't you
10:03 catch it when you were testing?" Well,
10:05 that's because we were testing against mocks.
10:06 We genuinely never tested the one thing
10:08 that matters, and that's behavior under
10:10 real production load. And it's not just
10:12 queries, it's the memory, too. Now,
10:14 pillar two is the infrastructure and
10:17 scaling strategy. This is where we
10:19 really, really failed as a team. And I
10:20 own some of this, too. I saw the
10:22 infrastructure gap, and I didn't push
10:24 hard enough. The legacy system had
10:27 three-tier caching: application, Redis, and
10:29 CDN. It had database partitioning by month,
10:32 three read replicas, custom connection
10:34 pooling. We built our code first and
10:36 delegated infrastructure to a DevOps
10:37 engineer who's never scaled an
10:38 e-commerce store before. That should
10:40 have been a red flag in itself, but we
10:42 just kept moving forward. We had minimal
10:45 caching, 23%, one database instance.
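To put the 23% versus 91% hit rates in perspective, here is a rough back-of-the-envelope sketch. It assumes every request is cacheable, every cache miss costs exactly one database read, and both systems see the same traffic. Those are simplifying assumptions, and `db_load_factor` is a hypothetical helper name.

```python
def db_load_factor(hit_rate: float) -> float:
    """Fraction of requests that miss the cache and reach the database."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return 1.0 - hit_rate


new_miss = db_load_factor(0.23)      # new system: 23% cache hit rate
legacy_miss = db_load_factor(0.91)   # legacy system: 91% cache hit rate
ratio = new_miss / legacy_miss

print(f"new system: {new_miss:.0%} of requests reach the database")
print(f"legacy system: {legacy_miss:.0%} of requests reach the database")
print(f"database load ratio at equal traffic: {ratio:.1f}x")
```

Under those assumptions the new system sends roughly eight to nine times as many reads to the database for the same traffic, and it does so against a single database instance instead of three read replicas.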
10:47 Steven said something in the planning
10:49 that I can't get out of my mind,
10:50 and it's that clean architecture won't
10:52 need all that infrastructure. I mentioned
10:54 that earlier in the story, and that's
10:56 just simply not how it works.
10:58 Infrastructure doesn't compensate for
11:00 bad code. Infrastructure is the
11:03 foundation for the code. Under load,
11:04 infrastructure beats whatever you're
11:06 trying to do in your architecture almost
11:08 every single time. Because when you have
11:11 85,000 concurrent users, even the most
11:13 perfectly optimized query takes time. Even
11:15 if you get the query down to 5
11:18 milliseconds, multiply that by 85,000 and
11:19 you need more infrastructure. There's a
11:21 reason all these giant tech companies
11:23 have crazy infrastructures. You need
11:24 caching to handle repeated requests.
11:25 You need connection pooling to manage
11:27 database connections. You need read
11:29 replicas to distribute load. This isn't
11:31 optional. This is literally the
11:33 baseline. So here's the question you
11:34 need to ask yourself every time you
11:36 build a system. How and when will you
11:38 need to scale to 10 times the
11:40 traffic? And if it's soon, can
11:41 your database handle it? Can your cache
11:43 handle it? Can your network handle it?
11:44 If you can't answer those questions,
11:46 your system is not ready to be scaled.
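One way to start answering those questions is the arithmetic behind the 5 millisecond query multiplied by 85,000 users. The sketch below applies Little's Law (average in-flight work equals arrival rate times time per item). It assumes, purely for illustration, that each concurrent user triggers one page view per second and that nothing is served from cache. `connections_needed` is a hypothetical helper, and the 13 versus 2 queries per page come from the product page example in the story.

```python
import math


def connections_needed(queries_per_second: float, query_ms: float) -> int:
    """Little's Law: average in-flight queries = arrival rate * time per query.

    The result is the number of database connections that are busy at once,
    so a pool smaller than this will queue requests or run out of connections.
    """
    return math.ceil(queries_per_second * query_ms / 1000)


users = 85_000   # Black Friday concurrency from the story
query_ms = 5     # the "perfectly optimized" query

# Assumption: one page view per user per second, no caching at all.
print("1 query per page:   ", connections_needed(users * 1, query_ms))
print("2 queries per page: ", connections_needed(users * 2, query_ms))
print("13 queries per page:", connections_needed(users * 13, query_ms))
```

Real traffic is burstier than this and a cache absorbs much of it, so treat the numbers as an order-of-magnitude check rather than a capacity plan. The point is that the query count per page multiplies directly into the number of connections the database has to hold open.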
11:47 And if your system's not ready to be
11:48 scaled because you don't have the
11:50 users yet, that's fine. But if you have the
11:51 users, you need to figure it out. Now,
11:54 pillar three is observability. This one
11:55 hurts the most because we could have
11:58 caught this. The legacy system had 247
11:59 metrics, like I mentioned earlier. We
12:01 built three. When things started going
12:03 wrong, we were completely blind. We
12:05 could see error rates climbing, but we
12:06 couldn't see why. We couldn't see which
12:08 queries were slow, which endpoints were
12:10 struggling, where the bottlenecks were.
12:13 Simply put, you can't fix what you can't
12:15 see. Observability isn't just logging.
12:17 Logs tell you what happened. That's
12:19 valuable, but metrics tell you why.
12:20 Here's what we should have implemented
12:22 from day one. We should have implemented
12:24 some type of query performance metrics.
12:25 We should have implemented some resource
12:27 utilization metrics. We should have
12:29 implemented, you know, some business
12:32 metrics over how everything's working.
12:33 The legacy system had all of this.
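As a rough illustration of those three kinds of metrics, here is a minimal in-process sketch using only the Python standard library. The `Metrics` class and every metric name in it are hypothetical. In practice a hosted tool or a metrics library would take the place of this class, but the shape of the instrumentation is similar: time the queries, sample resource gauges, count business events.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-memory registry: counters, gauges, and latency samples."""

    def __init__(self):
        self.counters = defaultdict(int)      # business events, errors
        self.gauges = {}                      # resource utilization snapshots
        self.timings_ms = defaultdict(list)   # query and endpoint latencies

    def incr(self, name, value=1):
        self.counters[name] += value

    def gauge(self, name, value):
        self.gauges[name] = value

    def observe_ms(self, name, ms):
        self.timings_ms[name].append(ms)

    @contextmanager
    def timer(self, name):
        """Record how long the wrapped block took, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.observe_ms(name, (time.perf_counter() - start) * 1000.0)

    def percentile(self, name, pct):
        """Nearest-rank percentile of the recorded samples, or None if empty."""
        samples = sorted(self.timings_ms[name])
        if not samples:
            return None
        rank = max(1, math.ceil(pct * len(samples) / 100))
        return samples[rank - 1]


metrics = Metrics()

# Query performance: time each database call (sleep stands in for the query).
with metrics.timer("db.query.get_product_ms"):
    time.sleep(0.001)

# Resource utilization: sample the connection pool (values are made up).
metrics.gauge("db.pool.in_use", 7)
metrics.gauge("cache.hit_rate", 0.23)

# Business metrics: count checkout outcomes.
metrics.incr("checkout.orders_completed")

print(metrics.percentile("db.query.get_product_ms", 95), metrics.gauges)
```

Tracking a percentile such as p95 rather than only an average matters here because a slow tail is often the first visible sign that a connection pool is about to run dry.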
12:35 That's how Mike and his team knew they
12:37 could handle Black Friday. They watched
12:40 the metrics climb and they knew exactly
12:41 where to put their ceilings to match
12:43 where the metrics could go. We didn't
12:44 know our ceiling until we hit our head
12:46 on it. And here's the thing, it's not
12:48 about building a monitoring system from
12:50 scratch. You can use tools.
12:51 There are tools that exist: Datadog, New
12:52 Relic, it doesn't matter. Just pick one.
12:54 Instrument your code. Expose the metrics
12:56 that matter. Observability is the
12:58 difference between responding to a
13:00 fire and preventing one. And as a
13:02 backend engineer, you always want to
13:04 prevent it. Now, pillar four is the
13:06 operational culture. This is the part
13:08 that nobody teaches you, and it's by far
13:10 one of the most important ones. Culture.
13:12 The culture you're in kills more systems
13:13 than bad code ever will. And here's what
13:15 I mean. I raised concerns three times.
13:17 One, in planning when I showed the Black
13:19 Friday metrics and asked about caching.
13:21 Two, in the code review, I documented
13:23 production readiness concerns. Three,
13:26 before beta, I questioned the load test
13:28 scope. Each time I was told the same
13:30 thing: don't overengineer, we'll fix
13:32 issues when they come up. They were
13:34 real "trust me, bro" moments. Those
13:36 kinds of messages shut down discussion
13:38 because Steven had 8 years of experience
13:40 and I only had two. His confidence was
13:43 treated as expertise. But confidence and
13:45 expertise aren't the same thing. Steven
13:47 was confident in clean code because
13:49 he had seen it work. But he'd never
13:51 scaled an application to 85,000
13:53 concurrent users. And neither had I.
13:56 Neither had anyone on our modern team.
13:58 The only person who had was Mike. And we weren't
14:00 even talking to Mike. We had a culture
14:02 that rewarded elegance over resilience,
14:04 that treated production readiness as a
14:06 nice-to-have instead of a requirement.
14:09 And that culture always comes from the
14:11 top. When leadership rewards beautiful
14:14 demos over boring reliability, you get
14:16 what we got. Real engineering isn't just
14:18 about writing code. It's about designing
14:21 systems that can survive contact with
14:23 reality. If you've ever been on a team
14:25 where clean code got prioritized over
14:27 operational reality, let me know in the
14:29 comments. And if you want me to break
14:30 down, you know, each one of those
14:32 pillars a little bit more, let me know.
14:33 I'll make videos on them. Thanks for