0:01 Hey there, this is Akshin Milan, and welcome back to a new video. In this video, I'll walk you through a problem statement that was given to me during an interview for a senior software engineer position. I can't share the company's exact name, but it was a London-based company and a remote opportunity. As soon as I joined the meeting with the two interviewers, they asked me to share my screen, and in the meeting chat they pasted this problem statement:
0:29 "Design a system that allows users to upload audio files and receive transcriptions. The system should handle 100,000 audio files per day with an average file size of 20 MB and a duration of 10 minutes. Users should receive a notification when the transcription is complete."
0:43 It was immediately clear that this was not just a backend-concepts interview. My expertise for that role was in Python, but I realized this was a system design interview, because the company was an AI data processing company. It's a good problem statement: I had to design a speech-to-text transcription system, meaning users upload audio files and get transcriptions of those files back.
1:13 For your information, this was round three of the whole interview process for this position. Once I had the statement, I realized that in this system I might be working with load balancers to handle load, because the audio files are large in number, and I was pretty sure that during the interview they would ask me to handle spikes: the 100,000 files might become 300,000 at peak times. I would also have to do some back-of-the-envelope calculations, like estimating the monthly cost of the whole system, or the total storage I'd be handling over one year or five years.
1:55 So I was clear I'd have to handle all that, but I did not start firing questions at the interviewer right away. I went into a flow, and I'll tell you how I navigated the interview. The first thing I told them was: "Just a clarification question. I've read the statement; we'll be building a speech-to-text transcription system, and on a normal day we'll handle 100,000 audio files, each 20 MB with a duration of 10 minutes. To go one step further, these are the functional requirements I can think of, and I have some follow-up questions I'd like you to clarify."
2:42 Then I used this platform itself, Eraser, and just like I'm writing now, I was writing during that interview as well.
2:49 So I wrote a heading: "Functional requirements." First question: which languages are we going to support? Is it just English, or English, Spanish, German, and so on? This question matters because you're going to use an AI model — it can be an LLM or your own fine-tuned model. We won't go into that depth, but language support is important, because if you're using an open-source model you have to be sure it supports those languages. He told me English and Spanish are the two languages we're going to support.
3:25 As soon as he told me that, I mentioned some models to him. You should have a bird's-eye view of a lot of technology; I had already worked with LLMs, so I was able to suggest that if English and Spanish are the languages, we can go ahead with the Whisper model, which is OpenAI's speech-to-text model.
3:44 At this point I wanted to know whether he was okay with using OpenAI's model, or whether he would ask me to fine-tune one. He was okay with it, because the whole interview was focused on the backend and system design rather than on fine-tuning and model training. So we went ahead with Whisper.
4:02 Once the languages were fixed, I stated the functional requirements. First: users should be able to upload audio files. This requirement tells us there should be an API endpoint and a UI where users can actually upload their audio files. The next functional requirement: the system should be able to transcribe the audio to text. And the next: users should get notifications — if my transcription fails, I should get an email or a push notification; if it succeeds, I should get an email that my transcription is ready.
4:39 Perfect. Once those requirements were clear, I went ahead with the non-functional requirements. First, the scale: 100k files per day, 300k files at spike times. Then latency: how much time should a transcription take? I went with a number and said 5 minutes should be sufficient for transcribing any audio file, and he was okay with that.
5:07 Next, high availability: my system should be 99.99% available. High availability implies that if my load increases, I should have load balancers. I didn't say that out loud at this point; I kept these pointers ready in my mind for when I would be designing the system.
5:24 Then I asked about the budget: what monthly budget are you thinking of to support this whole system? He told me $50,000 per month. Strictly speaking this falls under budget estimation rather than a non-functional requirement, but some interviews focus more on budgeting; if you're going for solutions architect or principal architect kinds of roles, you have to worry about budgets as well.
5:47 One more non-functional requirement was high accuracy, meaning the transcriptions should not be buggy. 95% accuracy is good: if there are 100 words in the generated corpus, at least 95 should be correct. At this point the interviewer nodded, and I was sure these requirements were sufficient.
6:12 Now I went ahead with some calculations, so we know how much storage and bandwidth we need — estimations.
6:25 So: 100k files at 20 MB per file. You have to convert these numbers up to terabytes. This comes out to 2,000 GB, which is 2 TB of data per day. Multiply by 30 and that's 60 TB per month. You might not know the cost of S3 storage, but you can ask the interviewer or estimate it at roughly $0.02 per GB per month; multiply that out and a single month's uploads come to around $1,200 of S3 storage — and since the files accumulate, the storage bill keeps growing month over month. When you've given a lot of system design interviews, you'll automatically remember these values — converting GB to TB, S3 storage costs, and so on — otherwise you can simply ask the interviewer.
7:21 Then comes the processing cost. If you go with OpenAI Whisper, Google's transcription service, or Amazon Transcribe, the approximate cost is $0.02 per minute — my interviewer gave me this number too. We have 100k files at 10 minutes per file, which comes out to a million minutes per day, and then you multiply that by the cost per minute of whichever transcription service you use.
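The back-of-the-envelope numbers above can be checked with a quick script. This is a sketch: the unit prices ($0.02 per GB-month for storage, $0.02 per minute for transcription) are the rough figures used in the conversation, not official cloud pricing.

```python
# Interview inputs
FILES_PER_DAY = 100_000
FILE_SIZE_MB = 20
MINUTES_PER_FILE = 10

# Rough unit prices from the conversation (assumptions, not real pricing)
STORAGE_PRICE_PER_GB_MONTH = 0.02
TRANSCRIBE_PRICE_PER_MINUTE = 0.02

storage_gb_per_day = FILES_PER_DAY * FILE_SIZE_MB / 1_000    # MB -> GB
storage_tb_per_month = storage_gb_per_day * 30 / 1_000       # GB -> TB

transcribe_minutes_per_day = FILES_PER_DAY * MINUTES_PER_FILE

# Storage cost for one month's worth of new uploads only;
# the real bill grows as old files accumulate.
monthly_storage_cost = storage_gb_per_day * 30 * STORAGE_PRICE_PER_GB_MONTH
daily_transcribe_cost = transcribe_minutes_per_day * TRANSCRIBE_PRICE_PER_MINUTE

print(storage_gb_per_day)          # 2000.0  -> 2 TB/day
print(storage_tb_per_month)        # 60.0    -> 60 TB/month
print(transcribe_minutes_per_day)  # 1000000 -> a million minutes/day
```
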
7:51 As soon as I reached here, my interviewer asked me to stop and move on to the actual system design, because we only had one hour. But you might not be stopped, so I'm going to put all the other costs — network cost, worker cost, and so on — in the description so you can do the calculation yourself; at the end you just add up all the costs, if your interviewer asks you to, and see whether the total fits the budget or not.
8:17 Great. Now let's go through the very first diagram I drew on the screen, a very high-level architecture diagram; I'll give the same explanation I gave my interviewer. First of all, for uploading the audio file we need to provide a client application to the user — a mobile app or a web app where the user can upload audio files. So I drew a rectangle and called it the client app; this application is only going to be responsible for taking the audio file and uploading it.
8:58 Next, we need an API server. The first thing I told him: the user could upload the file directly to a server, and the server could be responsible for going to Whisper (or a similar system), transcribing, and returning the transcription. But that system would be synchronous — the user has to keep the application open for the 5 or 10 minutes the transcription takes, and if the user closes the tab in between, the connection is lost, the transcription is never received, and resources are wasted. That is a synchronous system. I asked him: do we want a synchronous or an asynchronous system? He said we need an async system.
9:38 In that case, the user does not upload the entire file to the server. Instead, I'm going to have a server over here, which is my API server.
9:55 You can also have a load balancer, which I added at the end when he asked me to handle scale; but for now, say we only have this API server, connected to S3 buckets. In S3 we have a concept called pre-signed URLs. The client app calls a very simple API on the API server: "Hey, a user with this user ID is trying to upload a file; we need a pre-signed URL." The API server goes to S3 and says: "Hey S3, create a pre-signed URL that I can give to this client app." Using that URL, the client app can then upload the file directly to S3, and the API server won't be busy during that time.
10:54 So the API server gets the pre-signed URL — valid for, say, 1 hour or 1 day — gives it back to the client app, and the client app uploads the audio file directly to S3.
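The pre-signed-URL handshake can be illustrated with a toy HMAC-signed URL. This is only a sketch of the idea — expiry plus a signature the client can't forge — not S3's actual signing scheme; in practice the API server would simply call boto3's `generate_presigned_url` and never sign anything itself. All names here are made up for illustration.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # held by the storage/API side, never by the client

def make_presigned_url(user_id, file_id, ttl_seconds=3600, now=None):
    """API-server side: mint an upload URL valid for ttl_seconds."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{user_id}/{file_id}?expires={expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"https://bucket.example/{payload}&sig={sig}"

def is_valid(url, now=None):
    """Storage side: accept the upload only if signature and expiry check out."""
    path, sig = url.removeprefix("https://bucket.example/").rsplit("&sig=", 1)
    expected = hmac.new(SECRET, path.encode(), hashlib.sha256).hexdigest()
    expires = int(path.rsplit("expires=", 1)[1])
    current = time.time() if now is None else now
    return hmac.compare_digest(sig, expected) and current < expires

url = make_presigned_url("user-42", "audio-1.wav", ttl_seconds=3600, now=1000.0)
print(is_valid(url, now=2000.0))   # True: within the hour
print(is_valid(url, now=5000.0))   # False: expired
```

The key property is that the client can upload without the API server in the data path, but only to the exact object, and only within the time window, that the server signed off on.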
11:10 With this approach, our API server isn't busy while the audio file is being uploaded. Here we're talking about one audio file, but we have 100,000 of them — just imagine 100,000 audio files being uploaded through the API server at the same time and the load that server would be under. That's why this approach is good, and it's a very simple one.
11:34 Now, as soon as the upload is done, S3 receives the final object, and the audio file is with the S3 bucket.
11:43 We're actually going to have two S3 buckets. I'll create the second one now — we'll use it later, don't worry about it yet. The first bucket stores all the audio files, with a structure like user ID, then file ID — the MP3/MP4/WAV file itself. The second bucket will be responsible for storing the final transcripts; let's not worry about that for now.
12:12 As soon as the first S3 bucket receives the final object, on the client side we can show a message: "Your file is uploaded. Wait 5 or 10 minutes and you'll get an email when your audio transcription is ready." Perfect.
12:34 Now, S3 creates an event — we're going to use event-driven architecture. But one thing we missed: when the client app asked the API server for a pre-signed URL, the API server also creates an entry in our main database.
12:50 So we add one more system here: RDS, the main SQL database of our system. When the client app asks the API server for a pre-signed URL, the API server returns the URL but also creates an entry in RDS: this is the user ID, this is the timestamp at which the user asked for the pre-signed URL (i.e., started the process), and this is the status of the job. At this point the status could be "pending" or "uploading" — let's go with "pending", the simple one. So right now, the status of our job is pending.
13:42 Coming back to the flow: the S3 bucket has received the file completely, and now it has to notify someone — "we have received this file; go ahead and process it." One way is for S3 to notify the API server itself, and the API server would fetch and download the file. But then the API server is busy again and takes a lot of load, so we can't notify the API server. Instead, we're going to use a queue — and this is where the interviewer judges you on which system components you choose to handle a scale like 100,000 files. So go ahead and add SQS, the Amazon queue service; we'll have a queue over here.
14:45 The S3 bucket sends an event to this queue, and the event carries the information about the job — along with this data we also have the job ID, so that we can later update the job's status. With that job ID, the event is appended to SQS. The queue might hold, say, 100 jobs at once; that number is called the queue depth, and it will be important for scaling our system up and down — we'll come to that later. For now, assume the S3 bucket has notified the queue and we have a job inside the SQS.
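The job row the API server writes to RDS, and the statuses it moves through over the flow just described, can be sketched as a tiny state machine. The field names and statuses here are illustrative, not the interviewer's actual schema.

```python
from dataclasses import dataclass, field
import time

# Allowed status transitions for one transcription job.
TRANSITIONS = {
    "pending": {"processing"},
    "processing": {"success", "failed", "pending"},  # back to pending on retry
    "success": set(),
    "failed": set(),
}

@dataclass
class Job:
    job_id: str
    user_id: str
    status: str = "pending"            # set when the pre-signed URL is issued
    created_at: float = field(default_factory=time.time)

    def move_to(self, new_status):
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status

job = Job(job_id="a", user_id="user-42")
job.move_to("processing")   # a worker picked it up from SQS
job.move_to("success")      # transcription uploaded, notification queued
print(job.status)           # success
```

Modeling the transitions explicitly makes illegal updates (e.g. marking a finished job as processing) fail loudly instead of silently corrupting the job table.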
15:31 Now, who is going to process and complete these jobs? We'll have a fleet of EC2 instances called workers, so I'll go and add EC2. Notice that our main API server is not at all responsible for downloading the file, processing it, or sending notifications — the API server is just a central unit that orchestrates a few things. We have a fleet of EC2 instances: three, or say 100, depending on scaling up and down (we'll come to that), but let's go with three workers for now. So we have worker one, two, and three — three EC2 instances that continuously poll SQS and pull jobs from the queue.
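A worker's main loop can be sketched with an in-memory queue standing in for SQS. This is a toy illustration of the polling pattern, not production code; note the model is loaded once at worker startup, a point that comes up again later.

```python
import queue

def load_model():
    # Stand-in for loading the transcription model (e.g. Whisper) once at
    # worker startup, so per-job latency doesn't include model load time.
    return lambda audio_key: f"transcript of {audio_key}"

def run_worker(jobs, results):
    model = load_model()              # preloaded once, reused for every job
    while True:
        try:
            job = jobs.get_nowait()   # real workers long-poll SQS instead
        except queue.Empty:
            break                     # a real worker would keep polling
        results[job["job_id"]] = model(job["audio_key"])
        jobs.task_done()              # ~ deleting the message from SQS

q = queue.Queue()
q.put({"job_id": "a", "audio_key": "user-42/audio-1.wav"})
q.put({"job_id": "b", "audio_key": "user-7/audio-2.wav"})
results = {}
run_worker(q, results)
print(results["a"])   # transcript of user-42/audio-1.wav
```
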
16:28 Now, how do we handle failures? Say an EC2 instance picked up a job with job ID "a", tried to process the file, and the job failed somehow. We're not going to discard that job after a single failure; instead, the worker appends it back to the queue.
16:52 Maybe next time some other worker picks it up. But there's a concept called exponential backoff: say we tried to process a file and it failed — the next attempt happens after 5 minutes, the one after that after 25 minutes, then after 125 minutes, and so on. There should be an exponentially growing time gap between retries, and we often use a separate queue for this.
17:18 So you can have a separate queue for handling all the failure cases, called a dead-letter queue (DLQ). It's very important to mention this in your system design interview — it shows you've thought about failures and know the concept. If a job succeeds, we proceed; if it fails, the job goes to the DLQ, and there we apply exponential backoff. So let's connect the worker fleet with this DLQ as well.
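The retry schedule just described (5, 25, 125 minutes) is exponential backoff with base 5. A minimal sketch of the retry decision, with three attempts allowed before the job is parked in the DLQ — the function names and job fields are illustrative:

```python
BASE_DELAY_MIN = 5
FACTOR = 5
MAX_RETRIES = 3

def backoff_minutes(attempt):
    """Delay before retry number `attempt` (1-based): 5, 25, 125, ..."""
    return BASE_DELAY_MIN * FACTOR ** (attempt - 1)

def handle_failure(job, main_queue, dead_letter_queue):
    job["attempts"] = job.get("attempts", 0) + 1
    if job["attempts"] >= MAX_RETRIES:
        dead_letter_queue.append(job)      # give up: user gets a failure email
    else:
        job["retry_after_min"] = backoff_minutes(job["attempts"])
        main_queue.append(job)             # requeue for another worker

main, dlq = [], []
job = {"job_id": "a"}
handle_failure(job, main, dlq)   # attempt 1 -> requeued, retry after 5 min
handle_failure(job, main, dlq)   # attempt 2 -> requeued, retry after 25 min
handle_failure(job, main, dlq)   # attempt 3 -> parked in the DLQ
print(len(dlq))   # 1
```

In AWS terms, the retry ceiling maps to SQS's `maxReceiveCount` in a redrive policy; here it's modeled as a counter on the job itself.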
17:56 So now we've covered the queue (instead of SQS you could also use Kafka), the fleet of workers, the DLQ, and exponential backoff. These are concepts that are very important to mention in such a system design.
18:10 Now, when do we update the job status? One thing I missed: as soon as a worker picks up a job, it also updates the status to "processing". So the status is now "processing", not "pending" or "uploading". If the job fails three times — say our maximum retry count is three — the status becomes "failed" and we email the user: "Your file is corrupt, or we cannot process it at this time." If it succeeds, we set the status to "success", and then alerting comes into play: we notify some unit of the system so that it can notify the user via email.
19:15 Before we close the loop on this job and move on to notifications, you need to know that these workers — the ones processing files and creating transcriptions — will have the LLM or model already loaded. This was a question that came to me: how will you ensure the workers don't spend a lot of time loading the model? The answer is to preload it, so that every time a new job arrives, the worker already has the model in memory. All the workers have the model preloaded; as soon as a new job comes in, they use the preloaded model for transcription.
20:00 Once the transcription is done, a few things happen. First, as we saw, the status moves from "processing" to "success". Second, the job has to be deleted from SQS so it isn't processed again. Then two more things. One: we have another S3 bucket — I'll copy this one — for storing the transcription files, which are .txt files. The first S3 bucket stores the WAV/MP4 audio files; this one stores the text files. So the worker uploads its text file to this bucket, and all the workers do the same. Two: the worker writes to another queue. So we have a third queue in the system, and this queue is for notifications.
21:18 To recap what we've done: we've updated the job status in the main database and uploaded the transcription to the second S3 bucket — and that location can also be saved on the main job record, so we know where the final transcription lives. The final step is to put a job onto the notifications queue; that job contains the job ID and, in its metadata, the status — success or failed. Similarly, if a job fails even after three retries, it also goes to the notifications queue with status "failed" in the metadata.
22:04 This queue is also attached to a worker — maybe just one EC2 instance — that continuously polls the notifications SQS. This notification worker is only for sending emails or push notifications to our client. So we send the notification back to the client: "Hey, the file you submitted at this time is now ready for downloading."
22:45 "Here is the link." That link can be a download URL from S3, usable within a particular time frame: the download URL has a time-to-live, an expiry. If the user downloads within that window, perfect; otherwise the link stops working. That's why you sometimes get an email and, a week later, can't download the file anymore — the URL has expired. And yes, you can have a worker or a system that refreshes that URL as well.
23:17 But for now this is good — we've closed the loop. Let's quickly recap, and then we'll come to the follow-up questions I got after presenting this high-level architecture diagram.
23:30 The client app wants to upload a file and get a transcription. We go to the API server and ask for a pre-signed URL for the S3 bucket so we can upload the file directly to S3. The API server goes to S3, brings back a pre-signed URL, and gives it to the client; at the same time, it creates an entry in the SQL database with the user ID, timestamp, status ("pending"), and job ID. Once the file is fully uploaded, the S3 bucket sends an event with the job ID to SQS. The worker fleet polls SQS and picks up jobs. If a job fails, it goes to the dead-letter queue; if it still fails after the maximum of three retries, we send it to the notifications queue, the notification worker picks it up, finds it's a failed job, and emails the client: "Your file could not be processed."
24:35 If a worker successfully completes the transcription via the preloaded model, it saves the transcription to the transcription S3 bucket and updates the status to "success" (for failures, the status is updated too). Then it sends the job to the notifications queue, and the notification worker picks it up and sends a success email to the client app. That's how the system looks.
24:59 Now, the first question that came to me: why did you use a queue? What advantages does the queue provide to this whole system design? And then the next question from the interviewer: how are you going to handle autoscaling? How is your system going to handle autoscaling?
25:18 I think this was the question that got me selected, because it's something you need to know when your system has queues and workers, and it's a very fundamental concept. So: we have to handle autoscaling. Autoscaling should be driven by some number — if that number crosses an upper limit, the system should automatically scale up; if it falls below a lower limit, the system should automatically scale down. What is that number? It comes from the queue depth.
26:07 Queue depth is the number of jobs currently sitting in your queue. And the metric we actually use is queue depth per worker: the queue depth divided by the number of workers.
26:37 This is the metric. If it crosses the upper limit (we'll pick one right now), we scale up, i.e., increase the number of workers. If queue depth per worker falls below the lower limit, we decrease the number of active workers, i.e., scale the system down. This is something you have to mention.
27:02 Now an example. Say the upper limit is 500: if my queue depth per worker goes above 500, we scale up. For a scale-up scenario, say you have 10,000 messages in the queue and just 10 workers in your fleet. That's a queue depth per worker of 1,000 — a thousand jobs per worker, which is too much. Since it crosses 500, you need to scale up. (500 is usually a reasonable limit for jobs like these, assuming something like 30 seconds per job, and the interviewer agreed with me.)
27:54 So the strategy is: increase the number of workers by 20% or by 10, whichever is greater.
28:26 So 10 is bigger than two. So you will
28:28 add 10 more workers. So now your total
28:31 workers are 20. Okay. So this is how
28:34 your system autoscaled it up. You
28:35 previously had 10 workers. Now you have
28:39 20 workers. Okay. Let's say again this
28:42 scenario comes and you again have uh
28:44 like you again your Q depth per worker
28:46 increases above 500. So again you your
28:48 system has to make a call and it has to
28:52 scale it up. So again 20%
28:59 20% of 20 of 20 or 10. So 1 percentage
29:05 and one absolute. So 20% of 20 which is
29:10 four right? So four or 10 10. So you
29:12 will make it 30 workers. So now you have
29:15 30 workers. Again 20%
29:18 of 30 which is going to come six or 10.
29:19 So you will again go 10 and you will
29:23 have 40 workers. Again let's say again
29:25 the spike is there and you again have to
29:28 scale it up. So 20% of which is eight. 8
29:30 or 10 again 10. So you will have 50
29:33 workers. Now the scenario will come
29:36 where either 20% of 50 or 10 both are
29:38 10. So you'll again have 10. So you will
29:41 have 60 workers, right? So you see how
29:45 your system is scaling it up. Okay? So
29:47 this is your autoscaling strategy,
29:51 right? And 60 now 20% of 60 which is 12
29:54 or 10. 12 is greater. So you will not
29:56 have 70 workers. Now you will have 72
29:58 workers because you will have 12
30:01 workers. Okay? Which is 20% is greater
30:05 than 10. So you will have so this is how
30:08 if your uh if your load is increasing in
30:11 such quantities like so much load is
30:13 coming then you will have percentages
30:15 percentages winning over the absolute
30:18 numbers. So this is how your scaling up
30:21 will work. Similarly scaling down uh
30:24 let's say we have a lower limit of 200. If your queue depth per worker goes below 200, you need to scale down: you either remove 10 workers or decrease by 20%, whichever is greater.
30:39 Okay, so if you have, say, 50 workers: 20% of 50 is 10, or 10; both are the same, so you remove 10 and 50 becomes 40 workers. You see how we are decreasing the number of workers. At 40: 20% of 40 is eight, so you remove either eight or 10; 10 is bigger, so you remove 10 and 40 becomes 30. Like this, your workers keep decreasing.
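The scale-up and scale-down rules walked through above can be sketched in Python. This is a minimal illustration, assuming the thresholds from the example (scale up above 1,000 jobs per worker, scale down below 200) and the max(20%, 10) step; the function name and the replay loop are my own.

```python
import math

# Thresholds taken from the walkthrough above.
SCALE_UP_DEPTH = 1000    # queue depth per worker above this -> add workers
SCALE_DOWN_DEPTH = 200   # queue depth per worker below this -> remove workers

def desired_workers(current: int, queue_depth_per_worker: float) -> int:
    """Apply the max(20% of current, 10) step rule from the example."""
    step = max(math.ceil(0.20 * current), 10)  # whichever is greater
    if queue_depth_per_worker > SCALE_UP_DEPTH:
        return current + step
    if queue_depth_per_worker < SCALE_DOWN_DEPTH:
        return max(current - step, 1)  # never scale to zero workers
    return current  # within bounds: leave the fleet alone

# Replaying the scale-up walkthrough under a sustained spike:
workers = 10
history = [workers]
for _ in range(6):
    workers = desired_workers(workers, 1500)
    history.append(workers)
print(history)  # [10, 20, 30, 40, 50, 60, 72]
```

Note how the absolute step of 10 wins until the fleet reaches 60 workers, after which 20% (12) takes over, exactly as in the worked example.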
31:10 Perfect. So this is an example you can give to the interviewer so that he understands that you know such a strategy when you have queues and workers in place in your system, and how your system has to scale up and scale down. With this you can also say that this autoscaling is one of the reasons you went ahead with selecting a queue.
31:30 Right? Another reason is that the queue also acts as a buffer. If a spike is coming, if your load is increasing and there were no queue, only an API server, then your API server would crash, because so much load would land on one single server. But the queue acts as a buffer: jobs keep getting appended, and for the duration when the spike is ramping, say from 100k to 200k to 300k, the queue accommodates more and more jobs and your system does not crash. Without the queue, with just an API server, your system would crash.
32:10 Next is alerting. We
did not spend a lot of time on alerting, but my interviewer did ask which metrics I was going to track; monitoring and alerting come under one umbrella. I had already kept the metrics ready that I usually give for such system designs: you have business metrics, system metrics, and infra metrics.
32:36 Business metrics are the ones the business team will track, so they can also do business analysis: jobs created per minute (is the traffic growing?), jobs completed per minute (is the throughput keeping up?), jobs failed per minute (are we having quality issues?), and P50, P95, and P99
latency. Then transcription confidence scores: when you upload the transcription .txt, you also upload the confidence-scores file, so that the machine learning team can take that scoring file and improve their system if they are training their own models. A confidence score tells you: this is the actual word, this is the predicted word, this is the delta, this is the confidence. That data also needs to be stored and analyzed, so it comes under metrics.
33:32 Okay. Cost per transcription: how much cost you are bearing for doing one transcription, and you have to optimize that cost.
33:40 System metrics: SQS queue depth (so that autoscaling strategies can be built on it), worker count (how many workers are active in a day on average; is the autoscaling even working?), worker CPU utilization, database query latency, database connection count, API error rate, and API latency. Then infra: EC2 instance health (for how long the EC2 instances went down), S3 operation latency, network throughput, and disk utilization. All of these metrics you need to track, or at least keep them in mind and mention them in such interviews.
34:11 One question,
which was the final question for me: what if someone uploads a 2-hour-long podcast, a very long podcast, for transcription? In that case your one single file will be too long, too heavy, and one EC2 instance might fail. So you also need a chunking strategy in place: one worker can be there for chunking, or you can chunk while uploading itself, where a 2-hour podcast is divided into 10-minute audio chunks, all of those chunks are handled by various workers, and at the end the whole transcription is stitched back together and that one single transcript file is written to the transcription S3 bucket.
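The chunking idea can be sketched with a couple of small Python helpers. This is an illustration only: the 10-minute chunk size comes from the discussion, while the function names and the simple newline join for stitching are my own assumptions.

```python
CHUNK_SECONDS = 10 * 60  # split long uploads into 10-minute chunks

def chunk_boundaries(duration_seconds, chunk_seconds=CHUNK_SECONDS):
    """Yield (start, end) second offsets covering the whole audio file."""
    start = 0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        yield (start, end)
        start = end

def stitch_transcripts(parts):
    """Workers may finish out of order; reassemble text by chunk index."""
    return "\n".join(parts[i] for i in sorted(parts))

# A 2-hour podcast (7200 s) becomes 12 ten-minute chunks.
bounds = list(chunk_boundaries(7200))
print(len(bounds))            # 12
print(bounds[0], bounds[-1])  # (0, 600) (6600, 7200)
```

Each (start, end) pair would become one job on the queue, so the existing worker pool and autoscaling handle long files with no special casing.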
34:54 So chunking strategy is the answer. And one more thing, the final thing: what if someone is a big content creator, a very influential kind of person, and he has published a podcast in our system, and a lot of people, say a million, are trying to access that transcript? They want to read the captions for that podcast. For that, you handle the case via a CDN; AWS CloudFront is there. So these transcription files are not served by S3 or by your system instance; they are served by CloudFront. There will be one single GET request to S3, and then that transcription file will be cached at the CDN.
35:47 So at the time of GET, when users are trying to fetch the transcription file, you have CloudFront in place so that you don't have to go to the S3 bucket and query it for the transcription; you can just fetch the file via the CDN.
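As a rough Python sketch of the CDN idea: transcripts get a stable CloudFront URL, and a Cache-Control header is set on upload so the CDN and browsers may cache the file after the first GET. The distribution domain, path layout, and one-day TTL here are all hypothetical.

```python
# Hypothetical CloudFront distribution domain, for illustration only.
CDN_DOMAIN = "d1234example.cloudfront.net"

def transcript_url(job_id):
    """Clients fetch transcripts through the CDN, never directly from S3."""
    return f"https://{CDN_DOMAIN}/transcripts/{job_id}.txt"

# Header to set when writing the transcript to S3, so CloudFront can
# cache it for a day; only the first GET actually reaches the bucket.
CACHE_HEADERS = {"Cache-Control": "public, max-age=86400"}

print(transcript_url("abc123"))
```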
36:04 That's it. So this is how your system design might look. He might also ask how you will handle multiple API servers; in that case you have to answer: a load balancer, AWS ELB or an ALB, an application load balancer. I think this is the load balancer logo. Yes. So you can have an application load balancer and then various API servers, and this load balancer will decide which request should go where. There can be various strategies for that, rate limiting and all, but let's not go into that. So this is how your system design might look. It is ugly, but this is how it looks. And I hope you learned something new in this video.
36:45 Till the next video, keep coding, keep