This content is an interview transcript where a candidate, Sep, discusses their experience in data engineering, focusing on medallion architecture, SQL, Python, and data lake concepts.
Hi Sundi, my name is DJ and today I'll
be conducting your interview. Please let
me know about your experiences and tell
me something about yourself.
>> Hi, hi Raj. Thank you very much for giving me the opportunity for this interview. My name is Sep and I have been working in the data engineering domain for the past 2 years. I have been working with multiple technologies like PySpark, I have mostly worked with AWS cloud technologies, and I have worked on a project which follows the medallion architecture. It is a batch data processing pipeline that I have worked on, and that is my overall experience.
>> Okay. Okay. Good. So I'll start with your first question now. As you have mentioned the medallion architecture, I'd like to know: what is the purpose of the bronze, silver and gold layers in the medallion architecture? Could you please explain?
>> Yes, sure. In the bronze, silver and gold layers we keep the data. The data which we capture from the sources, we put into the bronze layer, and that data is completely raw; there are no data quality checks performed on it. We extract the data and put it onto the bronze layer, like a folder on S3. Later we perform some data quality checks, and after performing them we write the same data to the silver layer. While writing data to the silver layer, we might also go for the data modeling part. So the data which is present in the silver layer is going to follow some data quality rules as well as have its own model. In the gold layer, what we do is keep analysis-ready data. In the gold layer we have the summary tables as well as the aggregations, so that data analysts and data scientists can directly read from those tables and perform their own analysis. So from my point of view, this is the purpose of the bronze, silver and gold layers in the medallion architecture.
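[A minimal PySpark sketch of the flow described above, assuming hypothetical S3 paths and column names; the actual project code is not shown in the transcript.]

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: land the raw extract as-is (no quality checks)
raw = spark.read.json("s3://my-bucket/landing/orders/2024-01-01/")   # hypothetical path
raw.write.mode("append").parquet("s3://my-bucket/bronze/orders/")

# Silver: apply data quality rules and light modeling
bronze = spark.read.parquet("s3://my-bucket/bronze/orders/")
silver = (bronze
          .dropDuplicates(["order_id"])                 # hypothetical key
          .filter(F.col("order_amount").isNotNull()))
silver.write.mode("overwrite").parquet("s3://my-bucket/silver/orders/")

# Gold: analysis-ready summary table for analysts / data scientists
gold = silver.groupBy("country").agg(F.sum("order_amount").alias("total_revenue"))
gold.write.mode("overwrite").parquet("s3://my-bucket/gold/revenue_by_country/")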
>> Mhm. Okay. I'd like to know more about the bronze layer. So tell me, why do we keep the raw data in the bronze layer as immutable?
>> If we extract the data and keep it in the bronze layer, the benefit is that we don't have to extract it from the sources again and again. Let's say there is an issue in the pipeline and we don't have the raw data stored in some layer; we might have to extract it from the source again, and the challenge with that is that if the data has been deleted from the source, we will not be able to extract it. If we have the data in the bronze layer and the pipeline crashes, we can at any time refer to the data in the bronze layer and replay the pipeline. That is the reason we keep data in the bronze layer, and we normally keep it immutable; we don't modify it.
>> Okay. Okay. Fine. We'll move to the next question. So tell me, how do you handle duplicate records and partial files when multiple files arrive for the same day?
>> If multiple files arrive for the same date, the metadata of each file is going to be different. Based on the metadata, like the last modified timestamp as well as the file name, we can judge that for a certain date we have received two files. Then we also need to check for record-level duplicates. In PySpark we have fillna and dropna, so we can fill null values with some dummy values, and with the help of dropna it is possible to drop those rows. So in PySpark there are transformations readily available to deal with record-level duplicates. If it is a file-level duplicate we will check it through the file metadata, and if it is a record-level duplicate then we are going to use those transformations.
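[A sketch of how the record-level handling might look in PySpark, assuming a hypothetical DataFrame with an order_id key and a status column; dropDuplicates is the transformation commonly used alongside the fillna/dropna calls mentioned above.]

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

df = spark.read.parquet("s3://my-bucket/bronze/orders/2024-01-01/")   # hypothetical path

deduped = (df
           .fillna({"status": "UNKNOWN"})      # fill nulls with a dummy value
           .dropna(subset=["order_id"])        # drop rows missing the key
           .dropDuplicates(["order_id"]))      # keep one record per key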
>> Okay. So previously you mentioned that in your project you worked on the bronze, silver and gold layers. I'd like to know what kind of transformations you performed while moving from the bronze layer to silver.
>> Sure. Sure. The data used to be kept in the bronze layer and that was the completely raw data. What we used to do is convert that raw data into data which data analysts and data scientists can analyze and come up with some insights. We had to perform data cleaning, data enrichment, as well as data modeling on the data present in the bronze layer, and then we used to keep that data in the silver layer. On the data in the silver layer we used to perform the aggregation logic, and that data used to be kept in the gold layer. While performing data cleaning we heavily used deduplication and the dropna transformation. While checking the data quality rules, we used filter, and we also used case expressions, so that if there were any condition-based changes we wanted to make to the data, we could apply them. Talking about the enrichment part, of course we went with selectExpr in PySpark, and while we were calculating the aggregates we went with the group by transformation; we also went for the window operations. In the case of the data modeling part we had to heavily use the join operations. So these are the common operations which we used in our project.
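[A compact sketch of the kinds of bronze-to-silver and silver-to-gold transformations listed above, with hypothetical paths and column names.]

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("bronze-to-gold").getOrCreate()
bronze = spark.read.parquet("s3://my-bucket/bronze/orders/")          # hypothetical path

# Cleaning and data quality rules: deduplication, dropna, filter, case expressions
silver = (bronze
          .dropDuplicates(["order_id"])
          .dropna(subset=["customer_id"])
          .filter(F.col("order_amount") > 0)
          .withColumn("order_size",
                      F.when(F.col("order_amount") > 1000, "LARGE").otherwise("SMALL"))
          # Enrichment via selectExpr
          .selectExpr("order_id", "customer_id", "country",
                      "order_amount", "order_size",
                      "order_amount * 0.18 as tax_amount"))

# Modeling: join with a dimension table
customers = spark.read.parquet("s3://my-bucket/silver/customers/")
modeled = silver.join(customers, on="customer_id", how="left")

# Gold: aggregations and a window operation
w = Window.partitionBy("country").orderBy(F.col("order_amount").desc())
ranked = modeled.withColumn("rank_in_country", F.row_number().over(w))
revenue_by_country = modeled.groupBy("country").agg(F.sum("order_amount").alias("total_revenue"))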
>> Okay, cool. And in all of these actions, how did you handle schema evolution during ingestion?
>> Yes, schema evolution was a challenge for us as well, because when the schema changes at the source level, the pipeline needs to be modified or updated alongside it, and if these schema changes were made without our knowledge they could break our pipeline; these events have occurred in the past. So what we did was define the schema initially, and after that, whenever the data arrives, there is a way in Spark using which we can infer the schema of the file or the content we are able to see. So we used to infer the schema and match it against the schema we had defined, the schema we were expecting, and if there was any mismatch we used to raise flags and then deal with the data. Anyway, it was a strict policy that if there is any change in the schema, the data engineering team needs to be notified so that the pipelines can be updated. Regardless, we always got the updates, and as soon as the schema changed or evolved, we used to modify the pipelines. That's why the volume of errors associated with schema evolution was low.
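[A rough sketch of the kind of schema check described, assuming a hypothetical expected schema; the exact mechanism used in the project is not shown.]

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-check").getOrCreate()

# Schema we defined up front (hypothetical)
expected = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("order_amount", DoubleType()),
])

# Let Spark infer the schema of the incoming file
incoming = (spark.read.option("inferSchema", "true").option("header", "true")
            .csv("s3://my-bucket/landing/orders/"))                   # hypothetical path

# Compare inferred vs expected and raise a flag on mismatch
if set(incoming.schema.fieldNames()) != set(expected.fieldNames()):
    raise ValueError(f"Schema mismatch: {incoming.schema.fieldNames()} vs {expected.fieldNames()}")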
>> Okay. Okay. Fine. Moving ahead, I'd like to know how you would design a multi-region or cross-cloud medallion architecture. Now that we have spoken about the bronze and silver layers and how to move ahead with them, I'd like to know how you would design a multi-region or cross-cloud medallion architecture.
>> Okay. At my current experience level I didn't have any opportunity to work on designing the medallion architecture; that job is majorly taken care of by the architects who designed the project. My job is mostly around the PySpark part: if there are any changes, tickets get raised, those tickets are allocated to us, and we resolve them. These tickets are mostly around the processing part, and there might be some tickets around the orchestration part as well. But when it comes to architectural-level stuff, I'm not currently at a level where I have explored these things yet.
>> Mhm, okay, that's understandable. It's okay, not an issue. Okay. I've understood a little bit about how you've been working on your projects. The next section that I'd like to speak to you about is SQL. Okay. And we'll include some machine test questions as well for you to explain. Okay.
>> So tell me about inner join, left join and full join, and do explain with examples.
>> Okay. So is it fine if I share my screen so I'll be able to explain much better?
>> Yeah, certainly. Go ahead.
>> Is it visible now?
>> Yeah. Yeah, it is.
>> Yeah, go ahead.
>> So let me start with the inner join. In the case of inner join, if we have two tables, say a table called T1 and a table called T2, we have to pick some joining key based on which these two tables can be joined, and we mention the joining condition as well. Whenever the joining condition is true, meaning a join key that exists in T1 also exists in T2, those two records are going to get joined. So all the matching records from T1 and T2 will be there in the output in the case of inner join. In the case of left join, the records from the left table will be joined with the right table, and if there is no matching record then in front of that record we will see null values. So in the case of left join we will get all the records of the left table, but we might not get all the records from the right table; we will just get the matching pairs, and where the records from the right table are not matched we'll just get null values in place of the right table's columns. In the case of right join, the scenario is exactly the reverse: all records from the right table are going to be visible along with their respective matching records, but for the records of the right table that have no matching record in the left table, we'll see null values. And in the case of full join, we get to see all the records for all keys, and for the keys that don't have matching records in either the left or the right data frame, we see null values. So these are the join types, I guess. If you want anything else or you want me to dive deeper, please let me know.
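[A small illustration of the join types described above, using hypothetical tables t1 and t2 joined on an id column.]

-- Inner join: only matching keys from both tables
SELECT t1.id, t1.name, t2.amount
FROM t1
INNER JOIN t2 ON t1.id = t2.id;

-- Left join: all rows from t1, NULLs where t2 has no match
SELECT t1.id, t1.name, t2.amount
FROM t1
LEFT JOIN t2 ON t1.id = t2.id;

-- Full join: all keys from both sides, NULLs on whichever side is missing
SELECT t1.id, t1.name, t2.amount
FROM t1
FULL OUTER JOIN t2 ON t1.id = t2.id;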
>> Okay. I'd like to know about ROW_NUMBER. Write an SQL query to remove duplicates using ROW_NUMBER.
>> Okay. Okay. This query will remove the duplicates. In this query, of course, we are deleting the records from the users table, and this is the syntax for it.
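[The query itself is not captured in the transcript; a typical version against the users table mentioned above, assuming an email column defines a duplicate and id is the primary key, might look like this (PostgreSQL syntax).]

WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);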
>> Mhm, okay. Okay. Fine. Okay. Write an SQL query to list customers who did not place any orders.
>> Okay. So for the customers who have not placed any orders, I will consider two tables: one is the customers table and the other one is going to be the orders table. Let me just write the SQL query for it. I think this query is going to resolve our issue. I have considered the customers table as well as the orders table and used a left join to figure it out.
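[A typical form of the left-join query described, assuming customers(customer_id, name) and orders(order_id, customer_id).]

SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;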
>> Okay. Okay. What is the difference
between group by and partition by?
>> Okay. In the case of group by we are performing aggregations. Is it fine if I share my screen to explain this topic?
>> Yeah, no problem, you can use your screen whenever it is comfortable for you to explain your answers.
>> Sure, thank you.
So when we talk about the group by operation: let us say I have multiple records with different country codes, like India, then US, then again India, again US, and so on. The group by operation is going to group all the rows based on the column we are grouping by. If I'm performing group by using the country code column, it will just create groups of all rows associated with the respective country: India and all records of India are going to be in one group, US and all records for US are going to be in another group, and the same is the case for any other countries. Normally group by is a two-step operation: first we go for group by, and later we go for aggregation. Let's imagine I'd like to get the average revenue earned, the total count of records, or the sum of revenue; these kinds of use cases are possible through the group by operation. Talking about partition by, the partition by operation is different. The logic is similar, but it is not preparing collapsed groups; it is just partitioning the records, so all records of India are going to be, let's say, in a single partition, and all records of US might be in another partition. So the respective countries will be stored in respective logical partitions, and these are going to be the full rows, not the groups we get in the group by operation. Partition by is mostly useful in the case of window operations, where we can perform window functions like row number, rank and dense rank on top of these partitions. So according to me, this is the difference between group by and partition by.
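[A short illustration of the difference, assuming a hypothetical sales(country_code, revenue) table: GROUP BY collapses rows per country, while PARTITION BY keeps every row and adds a windowed value alongside it.]

-- GROUP BY: one aggregated row per country
SELECT country_code, SUM(revenue) AS total_revenue
FROM sales
GROUP BY country_code;

-- PARTITION BY: every row is kept, with a window function computed per country
SELECT country_code,
       revenue,
       ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY revenue DESC) AS rn
FROM sales;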
>> Okay. The next question is: how would you find the second highest distinct salary from the employees table?
>> Sure. Sure. Let me just write an SQL query for it. Okay. So this is the query using which we can find the second highest salary.
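[The query is not captured in the transcript; two common ways to do it against an employees(salary) table are shown below.]

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

-- Or with DISTINCT and OFFSET:
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;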
>> Okay. Okay. Mhm. Find the top three highest paid employees in each department.
>> Sure. Okay. So this SQL query is going to return the top three highest paid employees in each department.
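[Again, the query itself is not shown; a typical version with DENSE_RANK over an assumed employees(emp_name, department, salary) table might be:]

WITH ranked AS (
    SELECT emp_name,
           department,
           salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
)
SELECT emp_name, department, salary
FROM ranked
WHERE rnk <= 3;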
>> Okay.
Okay. What is the difference between
where and having?
>> Where and having? Both are going to be checking conditions, so they work on the boolean data type. However, the where clause runs before the aggregation. Let's say in a certain query we are working with group by and aggregation: the where clause is going to get executed before the group by clause, and the having clause is executed after the aggregation has been performed. So this is the major difference between where and having.
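[A quick illustration using the same hypothetical sales table: WHERE filters rows before grouping, HAVING filters the aggregated groups.]

SELECT country_code, SUM(revenue) AS total_revenue
FROM sales
WHERE revenue > 0                      -- row-level filter, applied before GROUP BY
GROUP BY country_code
HAVING SUM(revenue) > 100000;          -- group-level filter, applied after aggregation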
>> Mhm, okay. Okay. Fine.
Tell me the difference between delete,
truncate and drop.
>> Sure. Sure. Delete is there if I would like to perform some record-level deletes or some condition-based deletes from a certain table; I can use delete for that. If I go for truncate, in that case all data of the table is going to get truncated, or we can say deleted. In both delete as well as truncate, the table structure is not going to get eliminated, so the table entry is going to be there in the database. But if we go for a drop operation, in that case the data as well as the metadata get deleted. So this is the major difference between these three.
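[A minimal illustration with a hypothetical orders table.]

DELETE FROM orders WHERE order_date < '2023-01-01';  -- removes matching rows only
TRUNCATE TABLE orders;                                -- removes all rows, keeps the table definition
DROP TABLE orders;                                    -- removes the data and the table definition itself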
>> Okay. Okay. Cool. We'll move on to the Python section now and we'll see your skills in Python and how proficient you are.
>> So we'll start with how to find the duplicate elements in a list. For example, given a list of integers, you'll have to find the duplicate values; you can assume whatever numbers you like in the list.
>> Sure. Sure. So I have assumed these numbers, like 1, 2, 3 and so on, and I have written this Python code, which is going to remove the duplicates from this list.
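[The snippet is not captured in the transcript; a small sketch with an assumed list, showing both the duplicate values and the de-duplicated result described above.]

nums = [1, 2, 3, 2, 1, 4]          # assumed example list

seen = set()
duplicates = set()
for n in nums:
    if n in seen:
        duplicates.add(n)
    seen.add(n)

print(duplicates)        # {1, 2}
print(list(set(nums)))   # duplicates removed (order not guaranteed)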
>> Mhm. Okay. But how would you remove the duplicate values from a list while keeping the original order? How would you do that?
>> Sure. Sure. Let me write Python code for it. I'm considering the exact same list as in the previous example. If we work with this code snippet, it will also preserve the order: as we are iterating through the list elements and appending only the elements we haven't seen yet, it preserves the order of the list.
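[A sketch of the order-preserving de-duplication described, using the same assumed list.]

nums = [1, 2, 3, 2, 1, 4]

seen = set()
unique_ordered = []
for n in nums:
    if n not in seen:
        seen.add(n)
        unique_ordered.append(n)

print(unique_ordered)   # [1, 2, 3, 4]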
>> Mhm.
Okay. So let's suppose there are two lists and they have some common elements between them. How would you find these common elements?
>> Sure. Let me write a code snippet for it. In this example I have considered these two lists, A and B, and in order to get the common elements I have just used sets, which also removes the duplicates, and after that I have converted the result back to a list; it is going to give me the common elements between both the lists.
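[A sketch of the set-based approach described, with assumed lists a and b.]

a = [1, 2, 3, 4]
b = [3, 4, 5, 6]

common = list(set(a) & set(b))   # set intersection, converted back to a list
print(common)                    # [3, 4]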
>> Okay. And how would you count the frequency of elements in a list using a dictionary? We just saw your use of sets, so I'd like to know how you would use a dictionary to find the frequency of elements.
>> Sure. Let me write a code snippet for it.
>> Mhm.
>> So this code snippet is going to give me the frequency. In this case, I'm iterating through the list and adding one for each element as we see it. If there is no entry yet, it starts from zero, and after that it incrementally adds to the value whenever it sees a duplicate record, based on which we can get the frequency.
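[A sketch of the dictionary-based frequency count described, using dict.get with a default of zero.]

nums = [1, 2, 3, 2, 1, 1]

freq = {}
for n in nums:
    freq[n] = freq.get(n, 0) + 1   # start from 0 if unseen, then increment

print(freq)   # {1: 3, 2: 2, 3: 1}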
>> Okay. Okay. And how would you sort a dictionary based on values in ascending order?
>> Okay. So this is the code snippet which is going to solve the concern we are facing.
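[The snippet is not visible in the transcript; based on the explanation that follows, where the dictionary is called scores and one entry maps a key "a" to the value 80, it was presumably something like this (the other entries are assumed).]

scores = {"a": 80, "b": 65, "c": 90}   # "a": 80 is mentioned; the rest is assumed

sorted_scores = dict(sorted(scores.items(), key=lambda x: x[1]))
print(sorted_scores)   # {'b': 65, 'a': 80, 'c': 90}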
>> Can you please explain this? How have
you used this lambda function?
>> Sure. Sure. See, in this case what's happening is we have defined a dictionary which currently has three elements, so these are three key-value pairs. What we are doing is calling the items function on the scores dictionary, and from the scores dictionary I'm going to get a list of tuples containing these elements. Then we are running a lambda function, and this lambda function is going to be applied to each and every element. Here I'm saying x of one: if this is the element, then in the case of scores dot items, the 'a' is going to be at index zero and 80 is going to be at index one. So the lambda function is being passed to the sorted function, and it will sort the elements by the value which is present at index one. That is what is happening over here, and whatever we get, we are converting it to a dictionary and printing the sorted dictionary.
>> Mhm. Okay. Now that we're done with the dictionary, I'd like to know the difference between list and tuple. At the same time I want to know the difference between set and list. Do you understand?
>> Yeah. Yeah. Sure. So let me explain the difference between all four data structures which you have mentioned. First of all, all of these data structures are collections, collections of elements. A list is a collection of elements which might be of the same or different data types, and it is mutable, which means we can add, delete and modify elements in the list. A tuple is very similar to a list, but it is immutable, so we can't make any modifications to a tuple. Talking about sets, in the case of sets we don't have the concept of indexing and slicing; a set is used to perform set operations like union, intersection and set difference, and these operations are applicable to elements which belong to the set data type. When it comes to the dictionary, a dictionary is also mutable, where we can add and remove data, but it handles key-value pair data, so we have keys and values in the dictionary, and we can also have nested keys and values. So that is the main difference between all four data structures which you have mentioned.
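[A tiny sketch of the mutability and structure differences just described, with assumed example values.]

my_list = [1, 2, 3]          # mutable, ordered, allows duplicates
my_list.append(4)

my_tuple = (1, 2, 3)         # immutable: my_tuple.append(4) would raise AttributeError

my_set = {1, 2, 3}           # no indexing/slicing; supports union, intersection, difference
print(my_set | {3, 4}, my_set & {3, 4})

my_dict = {"a": 1, "b": 2}   # mutable key-value pairs
my_dict["c"] = 3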
>> Okay. But then how will you differentiate between a dictionary and a set?
>> Sure. A dictionary and a set both use curly braces to represent themselves, but in a set you get individual elements and in a dictionary we have key-value pairs. They differ based on the kind of data they hold.
>> Okay. Okay. Okay. So tell me, have you worked with data lakes? Do you have experience with data lakes?
>> Yes, yes, I have worked with data lakes, and in our project we have configured the data lake using S3 and the Glue catalog.
>> Okay. Yeah, it's the same thing I was actually going to ask you: what kind of object storage, like S3 or ADLS, and why is it preferred for data lakes?
>> Sure. Because when it comes to data lakes, we are not looking for full-time running servers which are going to handle our data like a proper data warehouse. In the case of data lakes, we can put data onto services like S3 or ADLS; these are object stores. And in the case of, let's say, S3 on AWS, the storage cost is not going to be too high if you compare it with a full-time data warehouse like Redshift. In the case of Redshift we have to maintain the servers; in the case of S3 we just have to pay for the storage, and while configuring the data lake all we need is the metadata, which can be kept in AWS Glue, so that we can get a data-warehouse-like environment on the data which is currently present on S3. So we can directly perform queries on the data which is on S3 if Glue is involved, in the case of AWS, and in the case of, let's say, Databricks we have the Unity Catalog for a similar feature.
>> Okay. So would you say what you just described is the purpose of the Glue catalog, or let's say a metastore, or would you like to define it further or explain?
>> Sure. Sure. The Glue catalog is a place where we can define the metadata, and based on the metadata we can perform queries. If I'm defining a table, then the table name, the table properties, the columns and the data types of the columns all count as metadata, and the actual data resides on S3. So whenever I'm connecting for the sake of queries, I can connect through Athena, a Python script, or even my application to the Glue catalog and just go for the SQL queries. So the purpose of the Glue catalog works both ways, actually: to access the data in an SQL fashion, and to maintain the metadata and the metadata versions. That is also one of the purposes of AWS Glue, that is, the AWS Glue catalog.
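[A rough illustration of the idea: an Athena/Hive-style external table whose definition (metadata) lives in the Glue catalog while the data stays on S3. The table name, columns and location are assumptions.]

CREATE EXTERNAL TABLE orders (
    order_id     string,
    customer_id  string,
    order_amount double
)
STORED AS PARQUET
LOCATION 's3://my-bucket/silver/orders/';

-- Queried directly through Athena using the Glue catalog metadata
SELECT customer_id, SUM(order_amount) AS total
FROM orders
GROUP BY customer_id;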
>> Okay. So, okay, fine. So tell me, what is partitioning and why is it needed?
>> Partitioning is a feature which is going to reduce the amount of data we are trying to process. Is it fine if I share the screen? I'll explain this part.
>> Go ahead. Please share your screen.
>> Let's consider we are not using partitioning and we have the records of different countries, like India, US, China, and then again some records from India, some from US, some from China, and then I run a query where I would like to find the maximum revenue for just India. In that case I'll have to process all of the data from start to end; if this is a data set of 10 terabytes, I'll have to go and process all 10 terabytes of data. But if we define partitions, the data is going to be stored in the respective folders: all data for India is going to be in the India partition, or you can call it a folder, all data for US is going to be in the US partition, and the data for China is going to be in the China partition. So if in the where clause I have mentioned India, the query will just get into the India partition and process that data. It will save us the trouble of processing the data for US as well as China. This is one of the benefits of partitioning, and it is a very powerful feature if used in an accurate way.
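[A sketch of partition pruning with a hypothetical partitioned table; only the India partition is scanned by the second query.]

CREATE EXTERNAL TABLE sales (
    order_id string,
    revenue  double
)
PARTITIONED BY (country string)
STORED AS PARQUET
LOCATION 's3://my-bucket/silver/sales/';

-- (partitions registered in the catalog, e.g. via MSCK REPAIR TABLE sales)

-- The WHERE clause on the partition column lets the engine read only .../country=India/
SELECT MAX(revenue)
FROM sales
WHERE country = 'India';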
>> Okay. Okay. Good. So moving on, I'd like to know what Hudi, Delta and Iceberg are, and when do you use upserts?
>> Sure. Sure. Hudi, Delta, or we can say Iceberg: Hudi and Iceberg have been added so that we can go for OLTP-like row-level inserts, updates and deletes on the data which is present on an object store like S3, and we can also go for the upsert operation, where upsert simply means update plus insert. If we don't have Hudi or Iceberg in the picture, or to be precise if we don't have Hudi in the picture, the upsert operation is going to get complicated because we'll have to manually write the logic for the upsert. Hudi also maintains the metadata, so the upsert operation becomes easier if we are using Hudi.
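[One common way an upsert is expressed on such table formats, for example with Delta Lake or Iceberg in Spark SQL, is a MERGE statement; the table and column names here are assumptions.]

MERGE INTO orders t
USING orders_updates s
ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET t.order_amount = s.order_amount, t.status = s.status
WHEN NOT MATCHED THEN
    INSERT (order_id, order_amount, status)
    VALUES (s.order_id, s.order_amount, s.status);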
>> Okay. And how do you implement cost optimization, or choose between Redshift, Snowflake or BigQuery?
>> The thing is, I just have two years of experience, so I'm not into cost optimization, and I'm not into the selection of the tools either. All I do is try to implement what has been assigned to me. But I do have a brief idea of Redshift, Snowflake and BigQuery. If we are going with the AWS ecosystem and we want a full-time data warehouse, then we can go for Redshift. Let's say I want a data warehouse with all the features and integrations, and I'd like to have it open to other platforms and features as well; then we can go for Snowflake. And if I'm going with Google's tech stack, or the Google Cloud stack, then I can go for BigQuery. I might be wrong on this because I have just a high-level idea of these things.
>> Okay. Okay. Are you familiar with Airflow? You're comfortable?
>> Yes, I'm quite comfortable.
>> Okay. So tell me, what is a DAG in Airflow?
>> A DAG is a directed acyclic graph. We have used Airflow in our project. There are multiple operations which we perform in our project: we are ingesting data, then we are performing data cleaning and data quality checks, later we are writing data to the data warehouse, and then we are validating that data to check whether everything is right or not. So there is a sequence of operations which we have to perform repeatedly, and we can automate this with the help of Airflow. We can define each and every step in Airflow as a task and choose the sequence of them, and the sequence of these tasks is nothing but the DAG. It is written in the Python language and we can submit it, and Airflow can execute these DAGs. So Airflow is not the one actually performing any heavy lifting; Airflow is just acting as an orchestrator. It is just triggering the activities which we have to perform.
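[A minimal sketch of such a DAG, with hypothetical task names mirroring the steps mentioned above.]

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():      print("ingesting data")
def clean():       print("cleaning and data quality checks")
def load():        print("writing to the warehouse")
def validate():    print("validating the load")

with DAG(
    dag_id="batch_pipeline",             # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_ingest   = PythonOperator(task_id="ingest",   python_callable=ingest)
    t_clean    = PythonOperator(task_id="clean",    python_callable=clean)
    t_load     = PythonOperator(task_id="load",     python_callable=load)
    t_validate = PythonOperator(task_id="validate", python_callable=validate)

    t_ingest >> t_clean >> t_load >> t_validate   # the task sequence is the DAG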
>> Okay. So then what would be the difference between a task and a task instance?
>> Sure. Let's imagine I have defined a task which says print hello. The definition of the task is one thing, and let's say that task gets executed once; I will say that is one task instance. If that task executes twice, I will call it two task instances. So the definition of the task is called the task, and the number of times the task gets executed, along with the time at which it gets executed, are what we call task instances.
>> Okay. And what happens when a task gets stuck in the queued state?
>> In the case of Airflow, if a task is stuck in the queued state there might be multiple reasons. The very first reason could be that the scheduler is down, which is why the tasks are not being launched. Another thing is the availability of workers: if the workers are not available, then also the task might stay in the queued state. Another thing is if we have configured a pool and all slots of that pool are blocked, so there is no free slot for our task to get launched; in that case as well the task might get stuck in the queued state. And then there is the kind of executor we are using: there are different types of executors which we can use with Airflow, like the Celery executor, the local executor and the sequential executor, and if there are too many tasks for the executor to handle, it will be quite difficult to launch our task right away when there are many waiting tasks. So the issue might lie in the executor, the scheduler, the worker configuration or the pool configuration. I will check these four things, and based on that I will get to the appropriate resolution.
>> Mhm, okay. But how will you use retries then? How will you use retries in Airflow?
>> Sure. Sure. In Airflow, for a certain task, or for the tasks, we can define the retries and the retry duration: if a certain task fails, should we retry, or should we simply let the whole thing fail? So there are two things to configure for retries: the retry count and the retry duration, that is, how many times we have to retry a certain task, and after how long. If a task fails, we don't go for an immediate retry, because even with an immediate retry we might not have dealt with the actual issue. So we can define the retry duration and the retry count for a task in Airflow at the task level, when we define it with the help of some operator.
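[A sketch of how retries are typically set on a task at the operator level; the numbers and task name are assumptions.]

from datetime import timedelta
from airflow.operators.python import PythonOperator

# defined inside a DAG block, as in the earlier sketch
validate = PythonOperator(
    task_id="validate_load",
    python_callable=lambda: print("validating"),
    retries=3,                            # retry count
    retry_delay=timedelta(minutes=5),     # retry duration: wait before retrying
)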
>> Okay. Tell me, have you had any experience designing a CI/CD pipeline for deploying an Airflow DAG?
>> Actually, I'm quite interested in CI/CD and I'm trying to pick up many things, but currently it doesn't fall under my responsibility; it is taken care of by the senior data engineers.
>> Mhm.
>> Okay. Okay. Fine. No problem. Okay. Let's keep this interview moving and go to the next question. So let's say one source, for example an FTP source, is delayed but the RDBMS is on time. Do you continue your pipeline, or how would you react?
>> Sure. If I have multiple sources for my pipeline and one of the sources is getting delayed, in that case I will first check whether both sources are required in the later stages of the pipeline. If the data coming from both of the sources is absolutely required for performing the operations, which means the sources are dependent on each other when we move into the silver or the gold layer, then I will have to wait for the arrival of data from the other source. But if these two sources are not dependent on each other, I can let the pipeline proceed further.
>> Mhm. Mhm. Okay. And describe to me a situation where your pipeline failed and how you fixed it. What kind of steps or actions did you take to fix your pipeline?
>> Yes. Yes. Our pipeline has crashed several times, and there were a few reasons behind it. One of the reasons was an out-of-memory error because we had misconfigured the resources. We diagnosed the issue, a root cause analysis was done, we changed the configuration of our Spark jobs, and eventually that error got resolved. Another instance where the pipeline broke was because of the schema we received: we were not notified about the change in schema, the newly updated schema was not going well with our current pipeline, and that is why the pipeline broke. We diagnosed that issue, marked those files as corrupt files, and took out what we needed for our pipeline from that data; later we evolved our pipeline to consider the schema changes which had happened. And the third instance when the pipeline broke was because of corrupt data which we received from the FTP source: we used to decompress the data, but the data was not in a proper format, so there was an issue when the pipeline initially executed. That issue was again diagnosed and also resolved. But yes, of course, the pipeline has broken several times. I might not be able to tell you all the instances, but these two to three instances were quite major and stuck in my mind.
>> Okay. Okay. Fine. Thank you, Sep. So your interview was fine. We'll proceed further with your processing and get back to you shortly. In the meantime, if you'd like to ask any questions, please go ahead.
>> Yes, I have just one question: if I ever get an opportunity to work with you, what kind of tech stack and project architecture might I be working on?
>> Okay. So the tech stack that we are currently working on is the Azure data engineering tech stack, and the architecture which we use in our projects is the lakehouse architecture. If you get selected, you'll be working on it.
>> Yeah. Yeah. I'm looking forward to this opportunity.
>> Certainly. Okay. Fine. Thank you, Sep.