YouTube Transcript:
AWS Data Engineering Interview
Skip watching entire videos - get the full transcript, search for keywords, and copy with one click.
Share:
Video Transcript
View:
connection established okay so hi AA
connection established okay so hi AA welcome to this round right so can you
welcome to this round right so can you
welcome to this round right so can you please introduce yourself and maybe the
please introduce yourself and maybe the
please introduce yourself and maybe the project that you have worked upon so
project that you have worked upon so
project that you have worked upon so yeah you can start yeah okay yeah thank
yeah you can start yeah okay yeah thank
yeah you can start yeah okay yeah thank you Nisha for giving me this opportunity
you Nisha for giving me this opportunity
you Nisha for giving me this opportunity to interview with you I'll start with my
to interview with you I'll start with my
to interview with you I'll start with my brief introduction so myself ASA I have
brief introduction so myself ASA I have
brief introduction so myself ASA I have close to five years of experience
close to five years of experience
close to five years of experience working in data now I have worked on
working in data now I have worked on
working in data now I have worked on banking as well as insurance to I
banking as well as insurance to I
banking as well as insurance to I started my journey with exential there I
started my journey with exential there I
started my journey with exential there I was working in ETL projects ETL data
was working in ETL projects ETL data
was working in ETL projects ETL data migration and implementation project and
migration and implementation project and
migration and implementation project and in quantify I was working in creating
in quantify I was working in creating
in quantify I was working in creating Inn Data Solutions using uh various AWS
Inn Data Solutions using uh various AWS
Inn Data Solutions using uh various AWS services like U red shift S3 uh DMS and
services like U red shift S3 uh DMS and
services like U red shift S3 uh DMS and I do have a good understanding on data
I do have a good understanding on data
I do have a good understanding on data modeling Concepts and data warehousing
modeling Concepts and data warehousing
modeling Concepts and data warehousing other than that like my go-to
other than that like my go-to
other than that like my go-to programming languages Python and SQL
programming languages Python and SQL
programming languages Python and SQL scripting so that's a brief about me so
scripting so that's a brief about me so
scripting so that's a brief about me so can you also mention like the
can you also mention like the
can you also mention like the architecture like what was the source
architecture like what was the source
architecture like what was the source location and where you were putting the
location and where you were putting the
location and where you were putting the data what type of Transformations you
data what type of Transformations you
data what type of Transformations you are doing so just to give an
are doing so just to give an
are doing so just to give an idea yeah okay so uh in one of my use
idea yeah okay so uh in one of my use
idea yeah okay so uh in one of my use case I'll discuss about one of my use
case I'll discuss about one of my use
case I'll discuss about one of my use case that was related to Insurance data
case that was related to Insurance data
case that was related to Insurance data so basically uh my architecture was like
so basically uh my architecture was like
so basically uh my architecture was like uh Source was the orle database where
uh Source was the orle database where
uh Source was the orle database where all the transaction were uh transactions
all the transaction were uh transactions
all the transaction were uh transactions were happening so we were using DMS
were happening so we were using DMS
were happening so we were using DMS service to get the data migrated to the
service to get the data migrated to the
service to get the data migrated to the uh S3 storage where we were keeping all
uh S3 storage where we were keeping all
uh S3 storage where we were keeping all the raw data then we were using uh like
the raw data then we were using uh like
the raw data then we were using uh like um uh we are leveraging uh red shift for
um uh we are leveraging uh red shift for
um uh we are leveraging uh red shift for our data warehousing solution where we
our data warehousing solution where we
our data warehousing solution where we have all the data models like Dimensions
have all the data models like Dimensions
have all the data models like Dimensions facts as well as there is one layer to
facts as well as there is one layer to
facts as well as there is one layer to it there is ODS layer which is basically
it there is ODS layer which is basically
it there is ODS layer which is basically we are keeping a exact replica of what
we are keeping a exact replica of what
we are keeping a exact replica of what data is there on the source so that it
data is there on the source so that it
data is there on the source so that it can act as a source of Truth for us when
can act as a source of Truth for us when
can act as a source of Truth for us when we are starting modeling our data so we
we are starting modeling our data so we
we are starting modeling our data so we can uh basically leverage that and uh
can uh basically leverage that and uh
can uh basically leverage that and uh that can uh act as a source for us uh in
that can uh act as a source for us uh in
that can uh act as a source for us uh in between we are uh basically creating
between we are uh basically creating
between we are uh basically creating stor procedure and external tables using
stor procedure and external tables using
stor procedure and external tables using red shift and of spectrum capabilities
red shift and of spectrum capabilities
red shift and of spectrum capabilities to get the CDC data incremental data and
to get the CDC data incremental data and
to get the CDC data incremental data and we have written the code in SQL so there
we have written the code in SQL so there
we have written the code in SQL so there is a couple of uh like merge statements
is a couple of uh like merge statements
is a couple of uh like merge statements and other uh quality related checks that
and other uh quality related checks that
and other uh quality related checks that we are doing so this was end to end
we are doing so this was end to end
we are doing so this was end to end architecture and we were dumping the
architecture and we were dumping the
architecture and we were dumping the data into our dimensions and uh dims and
data into our dimensions and uh dims and
data into our dimensions and uh dims and facts basically which is there in red
facts basically which is there in red
facts basically which is there in red shift which was uh later on consumed by
shift which was uh later on consumed by
shift which was uh later on consumed by the client as per the
the client as per the
the client as per the requirement so is uh the DMS that you
requirement so is uh the DMS that you
requirement so is uh the DMS that you are using right so is how it is
are using right so is how it is
are using right so is how it is basically how much data load that you
basically how much data load that you
basically how much data load that you have on the Oracle system like what is
have on the Oracle system like what is
have on the Oracle system like what is the what are the different tables and
the what are the different tables and
the what are the different tables and what is the data load that it is
what is the data load that it is
what is the data load that it is handling yeah so there are a lot of
handling yeah so there are a lot of
handling yeah so there are a lot of different tables uh in Insurance domain
different tables uh in Insurance domain
different tables uh in Insurance domain there are different tables related to
there are different tables related to
there are different tables related to like uh based on different granularities
like uh based on different granularities
like uh based on different granularities policy related risk there are coverages
policy related risk there are coverages
policy related risk there are coverages there are claim related data so there
there are claim related data so there
there are claim related data so there are uh like a lot of uh Source data so
are uh like a lot of uh Source data so
are uh like a lot of uh Source data so we have enabled CDC basically we are
we have enabled CDC basically we are
we have enabled CDC basically we are just getting the incremental data like
just getting the incremental data like
just getting the incremental data like the change that is happening what DMS is
the change that is happening what DMS is
the change that is happening what DMS is doing we have enabled the CDC in our
doing we have enabled the CDC in our
doing we have enabled the CDC in our source site so whenever there is a
source site so whenever there is a
source site so whenever there is a transaction it will go on and uh
transaction it will go on and uh
transaction it will go on and uh basically uh write it in a log uh what
basically uh write it in a log uh what
basically uh write it in a log uh what we call is a right Ahad logs so whenever
we call is a right Ahad logs so whenever
we call is a right Ahad logs so whenever the transaction is done that log will be
the transaction is done that log will be
the transaction is done that log will be captured there and DMS is taking all the
captured there and DMS is taking all the
captured there and DMS is taking all the logs from the client side and it will
logs from the client side and it will
logs from the client side and it will just uh migrate it to the S3 it will
just uh migrate it to the S3 it will
just uh migrate it to the S3 it will have the raw files as well like all the
have the raw files as well like all the
have the raw files as well like all the raw files we'll be receiving so since we
raw files we'll be receiving so since we
raw files we'll be receiving so since we are just dealing with the change data
are just dealing with the change data
are just dealing with the change data capture it's not uh in a like huge
capture it's not uh in a like huge
capture it's not uh in a like huge volume but when we are doing history
volume but when we are doing history
volume but when we are doing history load then uh yeah it is taking a lot of
load then uh yeah it is taking a lot of
load then uh yeah it is taking a lot of time DM is taking a lot of time to get
time DM is taking a lot of time to get
time DM is taking a lot of time to get the data from the source so sometimes it
the data from the source so sometimes it
the data from the source so sometimes it also take like more than two hours so
also take like more than two hours so
also take like more than two hours so like the row data right as I understand
like the row data right as I understand
like the row data right as I understand it is the exact replic of your orle
it is the exact replic of your orle
it is the exact replic of your orle system right so do you have set any
system right so do you have set any
system right so do you have set any retention at that part or will it be
retention at that part or will it be
retention at that part or will it be always in the S3 standard
always in the S3 standard
always in the S3 standard layer uh so like uh firstly the logs
layer uh so like uh firstly the logs
layer uh so like uh firstly the logs there is a retention period like
there is a retention period like
there is a retention period like previously there was a retention period
previously there was a retention period
previously there was a retention period of 2 to 3 days but if there is any
of 2 to 3 days but if there is any
of 2 to 3 days but if there is any failure which has not been resolved in 2
failure which has not been resolved in 2
failure which has not been resolved in 2 3 days so uh then we have changed it to
3 days so uh then we have changed it to
3 days so uh then we have changed it to 7 days so logs will be there uh till
7 days so logs will be there uh till
7 days so logs will be there uh till like 7 Days retention in S3 we what we
like 7 Days retention in S3 we what we
like 7 Days retention in S3 we what we are doing is like after 90 days like we
are doing is like after 90 days like we
are doing is like after 90 days like we are moving it to uh other basically we
are moving it to uh other basically we
are moving it to uh other basically we are archiving that
are archiving that
are archiving that data so rtion so say you have a
data so rtion so say you have a
data so rtion so say you have a requirement right where you don't
requirement right where you don't
requirement right where you don't suppose you have a location say uh in
suppose you have a location say uh in
suppose you have a location say uh in the S3 bucket right and you are writing
the S3 bucket right and you are writing
the S3 bucket right and you are writing the data the next time you run a load
the data the next time you run a load
the data the next time you run a load you don't want that location uh so you
you don't want that location uh so you
you don't want that location uh so you want that location to be overd right but
want that location to be overd right but
want that location to be overd right but in this case you don't want to lose the
in this case you don't want to lose the
in this case you don't want to lose the previous data so what capability of S3
previous data so what capability of S3
previous data so what capability of S3 would you use so that you can't you
would you use so that you can't you
would you use so that you can't you won't delete the previous version of the
won't delete the previous version of the
won't delete the previous version of the data uh so basically everyday data is
data uh so basically everyday data is
data uh so basically everyday data is going in a partition date so in S3 we do
going in a partition date so in S3 we do
going in a partition date so in S3 we do have date date folders uh so let's say
have date date folders uh so let's say
have date date folders uh so let's say if there are 100 transaction that has
if there are 100 transaction that has
if there are 100 transaction that has happened today so it will be in today's
happened today so it will be in today's
happened today so it will be in today's date and uh for tomorrow it will be in
date and uh for tomorrow it will be in
date and uh for tomorrow it will be in different date so whenever we are uh
different date so whenever we are uh
different date so whenever we are uh creating our external table we are
creating our external table we are
creating our external table we are adding the uh today's date partition to
adding the uh today's date partition to
adding the uh today's date partition to it and we just we are just reading uh
it and we just we are just reading uh
it and we just we are just reading uh the latest data from there and then uh
the latest data from there and then uh
the latest data from there and then uh like store procedure is running and it
like store procedure is running and it
like store procedure is running and it is dumping the data from is3 to uh our
is dumping the data from is3 to uh our
is dumping the data from is3 to uh our red shift ODS layer so it's like if I am
red shift ODS layer so it's like if I am
red shift ODS layer so it's like if I am running my job today I'll have if there
running my job today I'll have if there
running my job today I'll have if there is any case where failure has been uh
is any case where failure has been uh
is any case where failure has been uh like there was a failure so we can add
like there was a failure so we can add
like there was a failure so we can add multiple date partition as well to
multiple date partition as well to
multiple date partition as well to access different uh dates data but for
access different uh dates data but for
access different uh dates data but for now the use case is like like every day
now the use case is like like every day
now the use case is like like every day it will drop the previous date partition
it will drop the previous date partition
it will drop the previous date partition and add a new one and uh we are just
and add a new one and uh we are just
and add a new one and uh we are just able to access the one day data only
able to access the one day data only
able to access the one day data only okay so are you doing any reconciliation
okay so are you doing any reconciliation
okay so are you doing any reconciliation process as part of this
process as part of this
process as part of this job uh yeah we are doing reconciliation
job uh yeah we are doing reconciliation
job uh yeah we are doing reconciliation process once we are uh done with like
process once we are uh done with like
process once we are uh done with like the production load for all the use
the production load for all the use
the production load for all the use cases there will be SNS notifications
cases there will be SNS notifications
cases there will be SNS notifications that will get triggered based on the
that will get triggered based on the
that will get triggered based on the success and failures so mostly what we
success and failures so mostly what we
success and failures so mostly what we are doing is we are just uh based on
are doing is we are just uh based on
are doing is we are just uh based on some transformation we are checking
some transformation we are checking
some transformation we are checking whatever data is there in in our
whatever data is there in in our
whatever data is there in in our dimensions and facts uh and as per the
dimensions and facts uh and as per the
dimensions and facts uh and as per the requirement whatever data is there on
requirement whatever data is there on
requirement whatever data is there on the source so we have written SQL
the source so we have written SQL
the source so we have written SQL queries uh as part of Recon so it will
queries uh as part of Recon so it will
queries uh as part of Recon so it will basically match all the kpis and if
basically match all the kpis and if
basically match all the kpis and if there is any difference it will trigger
there is any difference it will trigger
there is any difference it will trigger a notification if there is no difference
a notification if there is no difference
a notification if there is no difference as well then then also like we have kept
as well then then also like we have kept
as well then then also like we have kept it that way that uh it will uh daily
it that way that uh it will uh daily
it that way that uh it will uh daily match some kpis for example like the
match some kpis for example like the
match some kpis for example like the policy count uh the premium amount so
policy count uh the premium amount so
policy count uh the premium amount so these kpis it will be matching and we
these kpis it will be matching and we
these kpis it will be matching and we are sending the triggers so how you
are sending the triggers so how you
are sending the triggers so how you scheduling the trigger ciliation process
scheduling the trigger ciliation process
scheduling the trigger ciliation process and at what time like is it a daily run
and at what time like is it a daily run
and at what time like is it a daily run that you are doing or a weekly run so
that you are doing or a weekly run so
that you are doing or a weekly run so how it is
how it is
how it is scheduled it's a daily run so for now
scheduled it's a daily run so for now
scheduled it's a daily run so for now like we have scheduled it like uh in the
like we have scheduled it like uh in the
like we have scheduled it like uh in the morning around 9:00 because we have we
morning around 9:00 because we have we
morning around 9:00 because we have we are starting all our use cases run after
are starting all our use cases run after
are starting all our use cases run after 12: after midnight so uh as for the load
12: after midnight so uh as for the load
12: after midnight so uh as for the load that we were getting everything will be
that we were getting everything will be
that we were getting everything will be like all the uh use cases jobs will uh
like all the uh use cases jobs will uh
like all the uh use cases jobs will uh will be successful by 7 in the morning
will be successful by 7 in the morning
will be successful by 7 in the morning so based on that like we are running it
so based on that like we are running it
so based on that like we are running it uh at like 9 in the like 9 in the
uh at like 9 in the like 9 in the
uh at like 9 in the like 9 in the morning yeah and say as part of your
morning yeah and say as part of your
morning yeah and say as part of your reconciliation process right you find
reconciliation process right you find
reconciliation process right you find some discrepancy in your data like maybe
some discrepancy in your data like maybe
some discrepancy in your data like maybe there is some corrupt record or
there is some corrupt record or
there is some corrupt record or something so how you would deal with
something so how you would deal with
something so how you would deal with that and how you would like reprocess it
that and how you would like reprocess it
that and how you would like reprocess it again yeah okay so if there is any
again yeah okay so if there is any
again yeah okay so if there is any mismatch in the data so what we are
mismatch in the data so what we are
mismatch in the data so what we are doing is we are going back to the code
doing is we are going back to the code
doing is we are going back to the code firstly we'll check the logs like there
firstly we'll check the logs like there
firstly we'll check the logs like there can be different type of uh failures if
can be different type of uh failures if
can be different type of uh failures if it's a job failure or is it a uh kpi
it's a job failure or is it a uh kpi
it's a job failure or is it a uh kpi mismatch so in case of uh there is any
mismatch so in case of uh there is any
mismatch so in case of uh there is any failure we are going to the step
failure we are going to the step
failure we are going to the step function that we are using for
function that we are using for
function that we are using for orchestrating all our tasks so there we
orchestrating all our tasks so there we
orchestrating all our tasks so there we do have enabled uh like Cloud watch logs
do have enabled uh like Cloud watch logs
do have enabled uh like Cloud watch logs we can go to the ECS task and we can
we can go to the ECS task and we can
we can go to the ECS task and we can check the cloudwatch logs and based on
check the cloudwatch logs and based on
check the cloudwatch logs and based on the error message that it is showing
the error message that it is showing
the error message that it is showing we'll go and fix in the code uh in the
we'll go and fix in the code uh in the
we'll go and fix in the code uh in the jobs basically and if it's a kbii
jobs basically and if it's a kbii
jobs basically and if it's a kbii mismatch related data then we know like
mismatch related data then we know like
mismatch related data then we know like uh from which fact we are getting this
uh from which fact we are getting this
uh from which fact we are getting this data and we'll have to basically
data and we'll have to basically
data and we'll have to basically backtrack the source of that particular
backtrack the source of that particular
backtrack the source of that particular kpi uh like what all the transformation
kpi uh like what all the transformation
kpi uh like what all the transformation we are applying to that uh particular uh
we are applying to that uh particular uh
we are applying to that uh particular uh like that metric and from where it is
like that metric and from where it is
like that metric and from where it is coming and what can be done like uh we
coming and what can be done like uh we
coming and what can be done like uh we will be backtracking it at every uh
will be backtracking it at every uh
will be backtracking it at every uh Point like every
Point like every
Point like every stage so how you are currently notifying
stage so how you are currently notifying
stage so how you are currently notifying your consumer so say you are producing
your consumer so say you are producing
your consumer so say you are producing the data and there are some consumers
the data and there are some consumers
the data and there are some consumers right and because of your process you
right and because of your process you
right and because of your process you identify maybe there is some SL sometime
identify maybe there is some SL sometime
identify maybe there is some SL sometime that is breached or maybe there is some
that is breached or maybe there is some
that is breached or maybe there is some freshness issue or something so how that
freshness issue or something so how that
freshness issue or something so how that will be communicated to the
will be communicated to the
will be communicated to the consumers uh so basically uh we have
consumers uh so basically uh we have
consumers uh so basically uh we have created views on top of our facts and
created views on top of our facts and
created views on top of our facts and the access has been given so it's uh
the access has been given so it's uh
the access has been given so it's uh like country related data that we are
like country related data that we are
like country related data that we are getting so there are some users that are
getting so there are some users that are
getting so there are some users that are accessing the uh like let's say Hong
accessing the uh like let's say Hong
accessing the uh like let's say Hong Kong data there are some users that are
Kong data there are some users that are
Kong data there are some users that are accessing Vietnam data So based on the
accessing Vietnam data So based on the
accessing Vietnam data So based on the uh like authent like we will be giving
uh like authent like we will be giving
uh like authent like we will be giving the them the uh permission to view the
the them the uh permission to view the
the them the uh permission to view the data so only the uh related use cases uh
data so only the uh related use cases uh
data so only the uh related use cases uh basically if there are some users that
basically if there are some users that
basically if there are some users that want they want the data uh to get viewed
want they want the data uh to get viewed
want they want the data uh to get viewed or something like that so they'll have
or something like that so they'll have
or something like that so they'll have to ask for Access and we need to add
to ask for Access and we need to add
to ask for Access and we need to add their name as well but suppose they
their name as well but suppose they
their name as well but suppose they already have the access but somehow
already have the access but somehow
already have the access but somehow because of your job fail or something
because of your job fail or something
because of your job fail or something you are not able to publish today's date
you are not able to publish today's date
you are not able to publish today's date data right and now when they check on
data right and now when they check on
data right and now when they check on their side they won't be able to get the
their side they won't be able to get the
their side they won't be able to get the latest data so as a proactive measure
latest data so as a proactive measure
latest data so as a proactive measure you can also notify them right as part
you can also notify them right as part
you can also notify them right as part of your process so how can you is it
of your process so how can you is it
of your process so how can you is it currently that You' have implemented if
currently that You' have implemented if
currently that You' have implemented if note how could you do that yes yes us uh
note how could you do that yes yes us uh
note how could you do that yes yes us uh SNS notification will get triggered
SNS notification will get triggered
SNS notification will get triggered after the run is complete and it's going
after the run is complete and it's going
after the run is complete and it's going like we have included the users as well
like we have included the users as well
like we have included the users as well who are consuming the data so they'll
who are consuming the data so they'll
who are consuming the data so they'll get a notification in case of there is
get a notification in case of there is
get a notification in case of there is any failure or there is any kpi mismatch
any failure or there is any kpi mismatch
any failure or there is any kpi mismatch they'll directly get the notification
they'll directly get the notification
they'll directly get the notification from
from
from there uh so will how will you implement
there uh so will how will you implement
there uh so will how will you implement it as part of your current
it as part of your current
it as part of your current workflow yeah uh so uh like uh we are
workflow yeah uh so uh like uh we are
workflow yeah uh so uh like uh we are orchestrating everything in Step
orchestrating everything in Step
orchestrating everything in Step functions there is one uh SNS uh trigger
functions there is one uh SNS uh trigger
functions there is one uh SNS uh trigger that we are uh basically we have written
that we are uh basically we have written
that we are uh basically we have written the Lambda as well where we are
the Lambda as well where we are
the Lambda as well where we are comparing all the data and uh that we
comparing all the data and uh that we
comparing all the data and uh that we will be invoking and it will uh based on
will be invoking and it will uh based on
will be invoking and it will uh based on some conditions it will trigger the
some conditions it will trigger the
some conditions it will trigger the notification we when we are creating the
notification we when we are creating the
notification we when we are creating the topics we are sending that them to the
topics we are sending that them to the
topics we are sending that them to the users they will subscribe to it based on
users they will subscribe to it based on
users they will subscribe to it based on the PS up method they will be receiving
the PS up method they will be receiving
the PS up method they will be receiving notifications later and in case of any
notifications later and in case of any
notifications later and in case of any failure or production issue then based
failure or production issue then based
failure or production issue then based on that like uh we'll have to estimate
on that like uh we'll have to estimate
on that like uh we'll have to estimate it and we'll create a like us a story
it and we'll create a like us a story
it and we'll create a like us a story for that like produ issue we'll have to
for that like produ issue we'll have to
for that like produ issue we'll have to log in jir and based on the we'll start
log in jir and based on the we'll start
log in jir and based on the we'll start working on it okay so in S3 bucket right
working on it okay so in S3 bucket right
working on it okay so in S3 bucket right can we have like two buckets with the
can we have like two buckets with the
can we have like two buckets with the same name in AWS
same name in AWS
same name in AWS S3 uh no uh it should be unique uh so
S3 uh no uh it should be unique uh so
S3 uh no uh it should be unique uh so but is it like is it not region specific
but is it like is it not region specific
but is it like is it not region specific or is it region specific d three service
or is it region specific d three service
or is it region specific d three service it's Global it's Global and every name
it's Global it's Global and every name
it's Global it's Global and every name should be unique is there any
should be unique is there any
should be unique is there any requirement for having such Global name
requirement for having such Global name
requirement for having such Global name space
space
space uh require uh sorry uh I did not
uh require uh sorry uh I did not
uh require uh sorry uh I did not understand your question so point is
understand your question so point is
understand your question so point is like is uh what is the use case like why
like is uh what is the use case like why
like is uh what is the use case like why ad decided to have the global name
ad decided to have the global name
ad decided to have the global name spaces even if we can specify the region
spaces even if we can specify the region
spaces even if we can specify the region in an S3 bucket okay I check I don't so
in an S3 bucket okay I check I don't so
in an S3 bucket okay I check I don't so say you have an S3 bucket right and like
say you have an S3 bucket right and like
say you have an S3 bucket right and like how can you secure your S3 bucket like
how can you secure your S3 bucket like
how can you secure your S3 bucket like what different uh things you can
what different uh things you can
what different uh things you can Implement by which you can secure your
Implement by which you can secure your
Implement by which you can secure your S3
S3
S3 bucket uh so basically we can provide
bucket uh so basically we can provide
bucket uh so basically we can provide permissions accordingly like uh there
permissions accordingly like uh there
permissions accordingly like uh there can be an uh access uh list we can add
can be an uh access uh list we can add
can be an uh access uh list we can add and uh we can allow or deny some
and uh we can allow or deny some
and uh we can allow or deny some resources or we can allow or uh deny
resources or we can allow or uh deny
resources or we can allow or uh deny based on the IM am
based on the IM am
based on the IM am permissions so that we can uh
permissions so that we can uh
permissions so that we can uh do okay any other thing that you can
do okay any other thing that you can
do okay any other thing that you can Implement I can think of right now okay
Implement I can think of right now okay
Implement I can think of right now okay so say you are working for say
so say you are working for say
so say you are working for say e-commerce company right hello say
e-commerce company right hello say
e-commerce company right hello say you're working for an e-commerce company
you're working for an e-commerce company
you're working for an e-commerce company and uh maybe handling the Sher data
and uh maybe handling the Sher data
and uh maybe handling the Sher data product right so all the customer
product right so all the customer
product right so all the customer information and everything you are
information and everything you are
information and everything you are dealing with those type of data uh so
dealing with those type of data uh so
dealing with those type of data uh so Cher data is basically in pii data right
Cher data is basically in pii data right
Cher data is basically in pii data right and it's your responsibility as part of
and it's your responsibility as part of
and it's your responsibility as part of the team to secure the pii information
the team to secure the pii information
the team to secure the pii information right so what are the diff so how you
right so what are the diff so how you
right so what are the diff so how you will Design uh this pipelines or maybe
will Design uh this pipelines or maybe
will Design uh this pipelines or maybe how you will secure this pii information
how you will secure this pii information
how you will secure this pii information so maybe you can list down some factors
so maybe you can list down some factors
so maybe you can list down some factors and maybe some ways uh so first thing is
and maybe some ways uh so first thing is
and maybe some ways uh so first thing is like uh this data like are we receiving
like uh this data like are we receiving
like uh this data like are we receiving the piia columns as well and uh what is
the piia columns as well and uh what is
the piia columns as well and uh what is the like use case like can we uh can we
the like use case like can we uh can we
the like use case like can we uh can we access that data uh if yes then we can
access that data uh if yes then we can
access that data uh if yes then we can uh do one thing we can create an
uh do one thing we can create an
uh do one thing we can create an abstraction layer on top of that where
abstraction layer on top of that where
abstraction layer on top of that where we can just uh basically uh mask that
we can just uh basically uh mask that
we can just uh basically uh mask that data so what we are doing is in some
data so what we are doing is in some
data so what we are doing is in some cases we do have two uh two different
cases we do have two uh two different
cases we do have two uh two different schemas one is restricted one and one is
schemas one is restricted one and one is
schemas one is restricted one and one is the general one so in the general one we
the general one so in the general one we
the general one so in the general one we will be keeping all the uh we will be
will be keeping all the uh we will be
will be keeping all the uh we will be doing all the masking on the columns
doing all the masking on the columns
doing all the masking on the columns that uh that comes under pi and in the
that uh that comes under pi and in the
that uh that comes under pi and in the restricted column we will be showing
restricted column we will be showing
restricted column we will be showing their values as well so what we are
their values as well so what we are
their values as well so what we are doing is uh we do have have data stored
doing is uh we do have have data stored
doing is uh we do have have data stored in our facts when we are creating the
in our facts when we are creating the
in our facts when we are creating the views on top of that then only in the
views on top of that then only in the
views on top of that then only in the view uh view query we are uh just have
view uh view query we are uh just have
view uh view query we are uh just have written a case statement where we are
written a case statement where we are
written a case statement where we are checking if that column is present or
checking if that column is present or
checking if that column is present or not if the value is there then we are
not if the value is there then we are
not if the value is there then we are just masking that value other than data
just masking that value other than data
just masking that value other than data masking like is is there any other thing
masking like is is there any other thing
masking like is is there any other thing that you will Implement so that say your
that you will Implement so that say your
that you will Implement so that say your data is present in the S3 bucket and
data is present in the S3 bucket and
data is present in the S3 bucket and currently uh you have the access right
currently uh you have the access right
currently uh you have the access right and you want to prevent any unauthorized
and you want to prevent any unauthorized
and you want to prevent any unauthorized access to your data so data asking is
access to your data so data asking is
access to your data so data asking is one of the technique that you will
one of the technique that you will
one of the technique that you will surely Implement before dumping it to
surely Implement before dumping it to
surely Implement before dumping it to the Target sources so like any other
the Target sources so like any other
the Target sources so like any other technique that you can Implement to
technique that you can Implement to
technique that you can Implement to secure your data maybe we can remove
secure your data maybe we can remove
secure your data maybe we can remove those uh columns from there and we can
those uh columns from there and we can
those uh columns from there and we can keep it in different uh uh like
keep it in different uh uh like
keep it in different uh uh like different file or different table and uh
different file or different table and uh
different file or different table and uh we can move it to different folder uh
we can move it to different folder uh
we can move it to different folder uh which do not have access to the normal
which do not have access to the normal
which do not have access to the normal users we can restrict the access to that
users we can restrict the access to that
users we can restrict the access to that we can write python code we can read the
we can write python code we can read the
we can write python code we can read the data we can uh basically separate the
data we can uh basically separate the
data we can uh basically separate the The pii Columns and the uh original
The pii Columns and the uh original
The pii Columns and the uh original columns that we want to use and we can
columns that we want to use and we can
columns that we want to use and we can create two different files out of it and
create two different files out of it and
create two different files out of it and one file will go to the restricted
one file will go to the restricted
one file will go to the restricted folder and one file can go uh to the
folder and one file can go uh to the
folder and one file can go uh to the normal folder that basically can be
normal folder that basically can be
normal folder that basically can be consumed by all the users okay so when
consumed by all the users okay so when
consumed by all the users okay so when you're explaining the architecture right
you're explaining the architecture right
you're explaining the architecture right so what was the format of the data that
so what was the format of the data that
so what was the format of the data that you have in the
you have in the
you have in the roer so it's basically firstly uh all
roer so it's basically firstly uh all
roer so it's basically firstly uh all the logs that we are receiving it's in
the logs that we are receiving it's in
the logs that we are receiving it's in CSV format and later on using python we
CSV format and later on using python we
CSV format and later on using python we are taking the chunks uh like chunks of
are taking the chunks uh like chunks of
are taking the chunks uh like chunks of data we are uh basically converting that
data we are uh basically converting that
data we are uh basically converting that uh that into park format so that we can
uh that into park format so that we can
uh that into park format so that we can access it later so how you transforming
access it later so how you transforming
access it later so how you transforming from rad to uh transform is there any
from rad to uh transform is there any
from rad to uh transform is there any Services of AWS you're
Services of AWS you're
Services of AWS you're using so uh firstly there is one raw
using so uh firstly there is one raw
using so uh firstly there is one raw Zone which we are calling uh which will
Zone which we are calling uh which will
Zone which we are calling uh which will have all the like all the logs so let's
have all the like all the logs so let's
have all the like all the logs so let's say if there are 100 transaction that
say if there are 100 transaction that
say if there are 100 transaction that has happened so it will have 10 uh like
has happened so it will have 10 uh like
has happened so it will have 10 uh like 100 CSV files inside it so now what we
100 CSV files inside it so now what we
100 CSV files inside it so now what we are doing is we have written a python
are doing is we have written a python
are doing is we have written a python code and the python code is running it
code and the python code is running it
code and the python code is running it will merge all the those 100 uh files
will merge all the those 100 uh files
will merge all the those 100 uh files because uh there can be a case like for
because uh there can be a case like for
because uh there can be a case like for every transaction it is creating a file
every transaction it is creating a file
every transaction it is creating a file so there can be like thousands or uh 10
so there can be like thousands or uh 10
so there can be like thousands or uh 10 thousands of files we have we are
thousands of files we have we are
thousands of files we have we are receiving in one day but we want to
receiving in one day but we want to
receiving in one day but we want to process everything so we are merging all
process everything so we are merging all
process everything so we are merging all those data and then we are in the python
those data and then we are in the python
those data and then we are in the python code only like we are creating U like we
code only like we are creating U like we
code only like we are creating U like we are converting it into par and we are
are converting it into par and we are
are converting it into par and we are keeping that data into our aggregate
keeping that data into our aggregate
keeping that data into our aggregate zone so agregate Zone uh we have in
zone so agregate Zone uh we have in
zone so agregate Zone uh we have in place which will have the exact same
place which will have the exact same
place which will have the exact same data but it it's uh in uh like we are
data but it it's uh in uh like we are
data but it it's uh in uh like we are it's compressed it's in park format that
it's compressed it's in park format that
it's compressed it's in park format that we can leverage later for the
we can leverage later for the
we can leverage later for the optimization techniques that we can use
optimization techniques that we can use
optimization techniques that we can use and uh it's acting uh like uh acting as
and uh it's acting uh like uh acting as
and uh it's acting uh like uh acting as a middle layer which can be later on
a middle layer which can be later on
a middle layer which can be later on consumed uh by the user so on top of
consumed uh by the user so on top of
consumed uh by the user so on top of that we are creating external tables and
that we are creating external tables and
that we are creating external tables and we are dumping the data into ODS so in
we are dumping the data into ODS so in
we are dumping the data into ODS so in this whole pipeline L how you are
this whole pipeline L how you are
this whole pipeline L how you are managing the schema right because CSV
managing the schema right because CSV
managing the schema right because CSV and uh don't come with the schemas right
and uh don't come with the schemas right
and uh don't come with the schemas right so how you managing your schemas schema
so how you managing your schemas schema
so how you managing your schemas schema drift or schema Evolution as part of
drift or schema Evolution as part of
drift or schema Evolution as part of this pipeline if there is any Evolution
this pipeline if there is any Evolution
this pipeline if there is any Evolution happening right so how you will
happening right so how you will
happening right so how you will incorporate that in a
incorporate that in a
incorporate that in a pipeline yeah so CSV does not provide us
pipeline yeah so CSV does not provide us
pipeline yeah so CSV does not provide us the uh feature for like schi Evolution
the uh feature for like schi Evolution
the uh feature for like schi Evolution so what we are doing is uh we have
so what we are doing is uh we have
so what we are doing is uh we have written a python code and uh like
written a python code and uh like
written a python code and uh like whenever there is a new file we are
whenever there is a new file we are
whenever there is a new file we are basically creating a data frame out of
basically creating a data frame out of
basically creating a data frame out of it and uh we are checking all The
it and uh we are checking all The
it and uh we are checking all The Columns we are creating a set of it and
Columns we are creating a set of it and
Columns we are creating a set of it and we are connecting to our red ship which
we are connecting to our red ship which
we are connecting to our red ship which is our Target and we are taking some
is our Target and we are taking some
is our Target and we are taking some data from that as well again doing the
data from that as well again doing the
data from that as well again doing the same process checking all the columns
same process checking all the columns
same process checking all the columns and creating a set out of it and then we
and creating a set out of it and then we
and creating a set out of it and then we have two sets like one set has all the
have two sets like one set has all the
have two sets like one set has all the columns from The Source One set has all
columns from The Source One set has all
columns from The Source One set has all the columns from the target we are
the columns from the target we are
the columns from the target we are taking a set difference and if there is
taking a set difference and if there is
taking a set difference and if there is uh some columns that is like we are
uh some columns that is like we are
uh some columns that is like we are getting as as an output of set
getting as as an output of set
getting as as an output of set difference then we know like these are
difference then we know like these are
difference then we know like these are all the columns that are uh that that is
all the columns that are uh that that is
all the columns that are uh that that is not there in the red shift in our Target
not there in the red shift in our Target
not there in the red shift in our Target and these are all the new columns so
and these are all the new columns so
and these are all the new columns so there like we are doing two things uh we
there like we are doing two things uh we
there like we are doing two things uh we are firstly like uh based on the columns
are firstly like uh based on the columns
are firstly like uh based on the columns which we are getting we are creating
which we are getting we are creating
which we are getting we are creating alter statements so we are adding new
alter statements so we are adding new
alter statements so we are adding new columns to our Target and we are just
columns to our Target and we are just
columns to our Target and we are just keeping it as we care because it can uh
keeping it as we care because it can uh
keeping it as we care because it can uh have any type of data and later on like
have any type of data and later on like
have any type of data and later on like we can go on and manually change that
we can go on and manually change that
we can go on and manually change that like as per the uh like the data
like as per the uh like the data
like as per the uh like the data requirement if it's uh we need to keep
requirement if it's uh we need to keep
requirement if it's uh we need to keep it integer or like any other data type
it integer or like any other data type
it integer or like any other data type so this will be the one thing and second
so this will be the one thing and second
so this will be the one thing and second thing we are sending notifications to
thing we are sending notifications to
thing we are sending notifications to your user so that we can get their
your user so that we can get their
your user so that we can get their confirmation like if those columns were
confirmation like if those columns were
confirmation like if those columns were needed and uh are they part of as part
needed and uh are they part of as part
needed and uh are they part of as part of any new requirement or things like
of any new requirement or things like
of any new requirement or things like that so can't we use like blue catalog
that so can't we use like blue catalog
that so can't we use like blue catalog service here to manage the schema like
service here to manage the schema like
service here to manage the schema like instead of managing like connecting to
instead of managing like connecting to
instead of managing like connecting to source and Target and comparing it so
source and Target and comparing it so
source and Target and comparing it so can't we use any managed catalog service
can't we use any managed catalog service
can't we use any managed catalog service Like
Like
Like Glue yes yes we can definitely use that
Glue yes yes we can definitely use that
Glue yes yes we can definitely use that but since it was already uh like flow is
but since it was already uh like flow is
but since it was already uh like flow is like we are getting the data from DMS
like we are getting the data from DMS
like we are getting the data from DMS and we are processing all we are doing
and we are processing all we are doing
and we are processing all we are doing all the processing writing all the SQL
all the processing writing all the SQL
all the processing writing all the SQL scripts and dumping the data so we are
scripts and dumping the data so we are
scripts and dumping the data so we are not using glue for that that's why we
not using glue for that that's why we
not using glue for that that's why we have not used but definitely we can uh
have not used but definitely we can uh
have not used but definitely we can uh yeah change that to uh we can use some
yeah change that to uh we can use some
yeah change that to uh we can use some other ews Service as well okay so say
other ews Service as well okay so say
other ews Service as well okay so say you're using glue right for maybe EMR
you're using glue right for maybe EMR
you're using glue right for maybe EMR service right so we have to specify the
service right so we have to specify the
service right so we have to specify the cluster configurations right like what
cluster configurations right like what
cluster configurations right like what type of instances we want to use so um
type of instances we want to use so um
type of instances we want to use so um we have like different instance typee in
we have like different instance typee in
we have like different instance typee in AWS service right so uh on basis of use
AWS service right so uh on basis of use
AWS service right so uh on basis of use case how we will specify like what type
case how we will specify like what type
case how we will specify like what type of instance we have to use from AWS
offering so uh like if we are using AWS
offering so uh like if we are using AWS clue we don't need to worry about the
clue we don't need to worry about the
clue we don't need to worry about the infrastructure management because it's a
infrastructure management because it's a
infrastructure management because it's a server serverless and it manages uh as
server serverless and it manages uh as
server serverless and it manages uh as per the load so if there's more uh like
per the load so if there's more uh like
per the load so if there's more uh like more data coming in our way and the
more data coming in our way and the
more data coming in our way and the workload is uh higher than expected like
workload is uh higher than expected like
workload is uh higher than expected like it will spin up all the cluster
it will spin up all the cluster
it will spin up all the cluster accordingly uh so I have used uh ews
accordingly uh so I have used uh ews
accordingly uh so I have used uh ews glue not uh EMR I can talk in terms of
glue not uh EMR I can talk in terms of
glue not uh EMR I can talk in terms of that like if firstly uh Whenever there
that like if firstly uh Whenever there
that like if firstly uh Whenever there is any use case we'll have to go on and
is any use case we'll have to go on and
is any use case we'll have to go on and uh analyze the data data volume data
uh analyze the data data volume data
uh analyze the data data volume data types and uh we will be doing all the
types and uh we will be doing all the
types and uh we will be doing all the analysis on that basis only we can
analysis on that basis only we can
analysis on that basis only we can either uh use some uh like EMR in EMR we
either uh use some uh like EMR in EMR we
either uh use some uh like EMR in EMR we can get uh all the flexibility to
can get uh all the flexibility to
can get uh all the flexibility to provision our own clusters so there it's
provision our own clusters so there it's
provision our own clusters so there it's very much necessary that we know what uh
very much necessary that we know what uh
very much necessary that we know what uh data volume we're getting and as for
data volume we're getting and as for
data volume we're getting and as for that like how can it be handled how many
that like how can it be handled how many
that like how can it be handled how many nodes how many basically how many
nodes how many basically how many
nodes how many basically how many executors do we need what is the memory
executors do we need what is the memory
executors do we need what is the memory what is the core that we need as per the
what is the core that we need as per the
what is the core that we need as per the requirement and the data we are dealing
requirement and the data we are dealing
requirement and the data we are dealing with So currently AMR also comes with a
with So currently AMR also comes with a
with So currently AMR also comes with a new offering of EMR serverless right so
new offering of EMR serverless right so
new offering of EMR serverless right so in this case you also can use EMR Ser
in this case you also can use EMR Ser
in this case you also can use EMR Ser serverless for your load so what what
serverless for your load so what what
serverless for your load so what what can be the use case for using EMR
can be the use case for using EMR
can be the use case for using EMR serverless over top of over glue yeah so
serverless over top of over glue yeah so
serverless over top of over glue yeah so if it's uh like similar like
if it's uh like similar like
if it's uh like similar like architecture wise like it's serverless
architecture wise like it's serverless
architecture wise like it's serverless and it also gives like uh manages all
and it also gives like uh manages all
and it also gives like uh manages all the infrastructure itself so maybe the
the infrastructure itself so maybe the
the infrastructure itself so maybe the data cataloging part that we have in AWS
data cataloging part that we have in AWS
data cataloging part that we have in AWS glue we don't have in am EMR I'm not uh
glue we don't have in am EMR I'm not uh
glue we don't have in am EMR I'm not uh wellers with the EMR uh like
wellers with the EMR uh like
wellers with the EMR uh like functionalities and features but I'll
functionalities and features but I'll
functionalities and features but I'll talk about uh aw's clue it does have uh
talk about uh aw's clue it does have uh
talk about uh aw's clue it does have uh data brew as well where we can uh like
data brew as well where we can uh like
data brew as well where we can uh like perform some analytics that is one plus
perform some analytics that is one plus
perform some analytics that is one plus Point uh it does have AWS crawlers as
Point uh it does have AWS crawlers as
Point uh it does have AWS crawlers as well that can help to crawl the data
well that can help to crawl the data
well that can help to crawl the data from some source to Etha or some other
from some source to Etha or some other
from some source to Etha or some other source so those are the functionalities
source so those are the functionalities
source so those are the functionalities maybe that uh EMR is not like it's not
maybe that uh EMR is not like it's not
maybe that uh EMR is not like it's not there in the EMR okay so say you are
there in the EMR okay so say you are
there in the EMR okay so say you are using glue service right for your detail
using glue service right for your detail
using glue service right for your detail Pipeline and you are using glue crawler
Pipeline and you are using glue crawler
Pipeline and you are using glue crawler and say your data is present in the S3
and say your data is present in the S3
and say your data is present in the S3 bucket right so how so say one day you
bucket right so how so say one day you
bucket right so how so say one day you use the glue crawler and schema is up to
use the glue crawler and schema is up to
use the glue crawler and schema is up to date in your glue pipelines right uh so
date in your glue pipelines right uh so
date in your glue pipelines right uh so maybe from some few days uh afterwards
maybe from some few days uh afterwards
maybe from some few days uh afterwards your schema has changed right so how
your schema has changed right so how
your schema has changed right so how will glue come to know that schema has
will glue come to know that schema has
will glue come to know that schema has changed so how you will implement this
changed so how you will implement this
changed so how you will implement this type of workflow so that your ETL
type of workflow so that your ETL
type of workflow so that your ETL pipelines would be aware of the recent
pipelines would be aware of the recent
pipelines would be aware of the recent schema
schema
schema changes so we are using AWS uh like data
changes so we are using AWS uh like data
changes so we are using AWS uh like data catalog part here and data catalog will
catalog part here and data catalog will
catalog part here and data catalog will basically store all the metadata
basically store all the metadata
basically store all the metadata regarding the data files and everything
regarding the data files and everything
regarding the data files and everything so if there is any change in the data so
so if there is any change in the data so
so if there is any change in the data so AWS catalog will capture that change as
AWS catalog will capture that change as
AWS catalog will capture that change as part of schema evaluation and when we
part of schema evaluation and when we
part of schema evaluation and when we are uh moving the data to like crawling
are uh moving the data to like crawling
are uh moving the data to like crawling the data from the S3 or some other
the data from the S3 or some other
the data from the S3 or some other source uh it will go on and check the
source uh it will go on and check the
source uh it will go on and check the metadata from data catalog and it will
metadata from data catalog and it will
metadata from data catalog and it will come to know about the changes in the
come to know about the changes in the
come to know about the changes in the schema so say you want to like have a
schema so say you want to like have a
schema so say you want to like have a pipeline implemented as uh in the
pipeline implemented as uh in the
pipeline implemented as uh in the similar flow so say you want to have an
similar flow so say you want to have an
similar flow so say you want to have an Automation in place uh in this case like
Automation in place uh in this case like
Automation in place uh in this case like whenever there is a schema changes or
whenever there is a schema changes or
whenever there is a schema changes or something or maybe you have a glue job
something or maybe you have a glue job
something or maybe you have a glue job which will basically maybe per day basis
which will basically maybe per day basis
which will basically maybe per day basis it will scan the data and in case there
it will scan the data and in case there
it will scan the data and in case there is any drift in the schema you will be
is any drift in the schema you will be
is any drift in the schema you will be get notified ified so you want to
get notified ified so you want to
get notified ified so you want to automate such a pipeline so how you will
automate such a pipeline so how you will
automate such a pipeline so how you will Design such a pipeline in glue uh so in
Design such a pipeline in glue uh so in
Design such a pipeline in glue uh so in glue what we can do is uh in that case
glue what we can do is uh in that case
glue what we can do is uh in that case like if there is any schema changes we
like if there is any schema changes we
like if there is any schema changes we need to
need to
need to notify uh maybe we can add some triggers
notify uh maybe we can add some triggers
notify uh maybe we can add some triggers or uh we can send some notifications
or uh we can send some notifications
or uh we can send some notifications using teams
using teams
using teams or maybe on some other channel like
or maybe on some other channel like
or maybe on some other channel like slack Channel teams uh what according to
slack Channel teams uh what according to
slack Channel teams uh what according to the functionality that it provides uh
the functionality that it provides uh
the functionality that it provides uh yeah I'm not uh much like I have to uh
yeah I'm not uh much like I have to uh
yeah I'm not uh much like I have to uh think about that use case but my
think about that use case but my
think about that use case but my Approach will be like if I am using glue
Approach will be like if I am using glue
Approach will be like if I am using glue uh somewhere like I will get previous
uh somewhere like I will get previous
uh somewhere like I will get previous date uh data right uh that has because
date uh data right uh that has because
date uh data right uh that has because uh the the schema was the previous
uh the the schema was the previous
uh the the schema was the previous schema that we were referring to and uh
schema that we were referring to and uh
schema that we were referring to and uh the new schema that we are getting we
the new schema that we are getting we
the new schema that we are getting we can also load that also in our AWS glue
can also load that also in our AWS glue
can also load that also in our AWS glue so maybe let's say like I have like
so maybe let's say like I have like
so maybe let's say like I have like today's the date that we have received
today's the date that we have received
today's the date that we have received the new schema and two days back the
the new schema and two days back the
the new schema and two days back the date was uh there like uh based on the
date was uh there like uh based on the
date was uh there like uh based on the partition date I can read two partitions
partition date I can read two partitions
partition date I can read two partitions now and from those two partitions I can
now and from those two partitions I can
now and from those two partitions I can read the data and check all the columns
read the data and check all the columns
read the data and check all the columns and uh basically get the difference in
and uh basically get the difference in
and uh basically get the difference in the columns whatever the difference is
the columns whatever the difference is
the columns whatever the difference is there and what is the data type changes
there and what is the data type changes
there and what is the data type changes basically uh based on that like if there
basically uh based on that like if there
basically uh based on that like if there is any difference that uh there will we
is any difference that uh there will we
is any difference that uh there will we can uh set some flags and based on those
can uh set some flags and based on those
can uh set some flags and based on those flags or we can uh invoke some Lambda or
flags or we can uh invoke some Lambda or
flags or we can uh invoke some Lambda or basically we can invoke SNS to trigger
basically we can invoke SNS to trigger
basically we can invoke SNS to trigger the notifications so something like that
the notifications so something like that
the notifications so something like that we can Implement okay so you mentioned
we can Implement okay so you mentioned
we can Implement okay so you mentioned in your resume like you have also used
in your resume like you have also used
in your resume like you have also used Dynam and elastic search as a database
Dynam and elastic search as a database
Dynam and elastic search as a database uh so what was the use case right and
uh so what was the use case right and
uh so what was the use case right and why we have like using two different
why we have like using two different
why we have like using two different databases so what were the different use
databases so what were the different use
databases so what were the different use cases that you were using this
cases that you were using this
cases that you were using this for yeah so uh basically dynamodb and
for yeah so uh basically dynamodb and
for yeah so uh basically dynamodb and elastic search we are using for same use
elastic search we are using for same use
elastic search we are using for same use case only what is what was happening is
case only what is what was happening is
case only what is what was happening is we were getting uh aay related data so
we were getting uh aay related data so
we were getting uh aay related data so whenever like uh we need to uh set a
whenever like uh we need to uh set a
whenever like uh we need to uh set a mandate so that like uh the premium or
mandate so that like uh the premium or
mandate so that like uh the premium or anything like that like it can be
anything like that like it can be
anything like that like it can be deducted automatically from the account
deducted automatically from the account
deducted automatically from the account so those sort of data we were getting in
so those sort of data we were getting in
so those sort of data we were getting in terms of CSV and txt file from the users
terms of CSV and txt file from the users
terms of CSV and txt file from the users themselves and using infoworks for the
themselves and using infoworks for the
themselves and using infoworks for the data injection part like we were using
data injection part like we were using
data injection part like we were using DMS ews service in this use case we are
DMS ews service in this use case we are
DMS ews service in this use case we are using infoworks which basically is a
using infoworks which basically is a
using infoworks which basically is a data injection tool there like we can
data injection tool there like we can
data injection tool there like we can write the same workflows and uh we can
write the same workflows and uh we can
write the same workflows and uh we can create a pipeline uh what it was doing
create a pipeline uh what it was doing
create a pipeline uh what it was doing was like it was reading the data from
was like it was reading the data from
was like it was reading the data from that file and uh it was uh converting it
that file and uh it was uh converting it
that file and uh it was uh converting it into par and again the so like the
into par and again the so like the
into par and again the so like the target was to uh basically convert in
target was to uh basically convert in
target was to uh basically convert in park and uh place it on S3 itself which
park and uh place it on S3 itself which
park and uh place it on S3 itself which was our um raw Zone there as well and
was our um raw Zone there as well and
was our um raw Zone there as well and later on like what we were doing is uh
later on like what we were doing is uh
later on like what we were doing is uh we were using AWS glue jobs to read the
we were using AWS glue jobs to read the
we were using AWS glue jobs to read the data from there and we were it's uh like
data from there and we were it's uh like
data from there and we were it's uh like we were creating the Json structure
we were creating the Json structure
we were creating the Json structure there like using stru type and struck
there like using stru type and struck
there like using stru type and struck field and the target was Dynamo DB
field and the target was Dynamo DB
field and the target was Dynamo DB because um what we like Dynamo DB and uh
because um what we like Dynamo DB and uh
because um what we like Dynamo DB and uh elastic search why we were using two
elastic search why we were using two
elastic search why we were using two different uh Services because uh firstly
different uh Services because uh firstly
different uh Services because uh firstly there are some functionalities that
there are some functionalities that
there are some functionalities that elastic search provides like it can
elastic search provides like it can
elastic search provides like it can provides us like we can uh write more
provides us like we can uh write more
provides us like we can uh write more complex queries it's more uh based on
complex queries it's more uh based on
complex queries it's more uh based on real-time analytics and also like uh it
real-time analytics and also like uh it
real-time analytics and also like uh it can uh it provides a uh full text search
can uh it provides a uh full text search
can uh it provides a uh full text search so if we want to uh search in a complex
so if we want to uh search in a complex
so if we want to uh search in a complex uh uh there are some partition Keys also
uh uh there are some partition Keys also
uh uh there are some partition Keys also like in Dynamo DB it's also uh very fast
like in Dynamo DB it's also uh very fast
like in Dynamo DB it's also uh very fast it's it has a very high throughput and
it's it has a very high throughput and
it's it has a very high throughput and uh it does store data in uh no SQL
uh it does store data in uh no SQL
uh it does store data in uh no SQL unstructured or maybe semi-structured
unstructured or maybe semi-structured
unstructured or maybe semi-structured data but it comes with the cost
data but it comes with the cost
data but it comes with the cost definitely our use case was like like if
definitely our use case was like like if
definitely our use case was like like if there is any change in the document
there is any change in the document
there is any change in the document whatever we are storing and based on the
whatever we are storing and based on the
whatever we are storing and based on the same partition key if I want to update
same partition key if I want to update
same partition key if I want to update that it will override that particular uh
that it will override that particular uh
that it will override that particular uh data which was there in Dynamo TP it
data which was there in Dynamo TP it
data which was there in Dynamo TP it does not index it and keep both the data
does not index it and keep both the data
does not index it and keep both the data fields so let's say if there is any like
fields so let's say if there is any like
fields so let's say if there is any like my partition uh like we need to uh
my partition uh like we need to uh
my partition uh like we need to uh basically specify the partitions and
basically specify the partitions and
basically specify the partitions and sort key there so let's say if I'm
sort key there so let's say if I'm
sort key there so let's say if I'm keeping my like employee ID employee
keeping my like employee ID employee
keeping my like employee ID employee name those two keys as part of partition
name those two keys as part of partition
name those two keys as part of partition keys and uh on those two keys I have
keys and uh on those two keys I have
keys and uh on those two keys I have some updates now I want to update that
some updates now I want to update that
some updates now I want to update that system like there is some changes okay
system like there is some changes okay
system like there is some changes okay so if I'll go on and hit the uh updating
so if I'll go on and hit the uh updating
so if I'll go on and hit the uh updating Dynamo DB it will override that uh data
Dynamo DB it will override that uh data
Dynamo DB it will override that uh data in elastic search it will index that
in elastic search it will index that
in elastic search it will index that like it will keep both the data like we
like it will keep both the data like we
like it will keep both the data like we can uh it doesn't have any uh if you
can uh it doesn't have any uh if you
can uh it doesn't have any uh if you want to uh basically prevent it from uh
want to uh basically prevent it from uh
want to uh basically prevent it from uh having the duplicates we'll have to IND
having the duplicates we'll have to IND
having the duplicates we'll have to IND that data and we can query that as well
that data and we can query that as well
that data and we can query that as well like based on uh if you want the um if
like based on uh if you want the um if
like based on uh if you want the um if you want the latest data we can just
you want the latest data we can just
you want the latest data we can just query it accordingly because it has its
query it accordingly because it has its
query it accordingly because it has its like already indexed documents so that
like already indexed documents so that
like already indexed documents so that was one use case which were uh because
was one use case which were uh because
was one use case which were uh because of which we are having the same data at
of which we are having the same data at
of which we are having the same data at both places but the consumers are
both places but the consumers are
both places but the consumers are different like the users are different
different like the users are different
different like the users are different so can you explain me the difference in
so can you explain me the difference in
so can you explain me the difference in the database Charing and the
the database Charing and the
the database Charing and the partitioning uh database charting and
partitioning uh database charting and
partitioning uh database charting and partitioning partitioning is uh uh in my
partitioning partitioning is uh uh in my
partitioning partitioning is uh uh in my understanding like where we are getting
understanding like where we are getting
understanding like where we are getting like we have a large data set and we
like we have a large data set and we
like we have a large data set and we want to process that data set uh in uh
want to process that data set uh in uh
want to process that data set uh in uh like very quick time so we'll be taking
like very quick time so we'll be taking
like very quick time so we'll be taking the chunks of it and uh we'll be keeping
the chunks of it and uh we'll be keeping
the chunks of it and uh we'll be keeping those chunks in different different
those chunks in different different
those chunks in different different partitions so that is partitioning to
partitions so that is partitioning to
partitions so that is partitioning to achieve parallelism where we can uh par
achieve parallelism where we can uh par
achieve parallelism where we can uh par parall we can process all those data and
parall we can process all those data and
parall we can process all those data and sharding uh like I have I'm aware on the
sharding uh like I have I'm aware on the
sharding uh like I have I'm aware on the AWS skesis I guess in sharding like we
AWS skesis I guess in sharding like we
AWS skesis I guess in sharding like we are also getting the data like from
are also getting the data like from
are also getting the data like from streaming
streaming
streaming I'm not sure if uh I'm correct here but
I'm not sure if uh I'm correct here but
I'm not sure if uh I'm correct here but uh I have read about charts which
uh I have read about charts which
uh I have read about charts which basically capture the streaming data and
basically capture the streaming data and
basically capture the streaming data and we do specify the capacity of that chart
we do specify the capacity of that chart
we do specify the capacity of that chart uh which will be there so it can like
uh which will be there so it can like
uh which will be there so it can like maybe um in one chart we can have 100
maybe um in one chart we can have 100
maybe um in one chart we can have 100 records or things like that that's
records or things like that that's
records or things like that that's mostly using for the uh streaming
mostly using for the uh streaming
mostly using for the uh streaming related data okay so like what is the
related data okay so like what is the
related data okay so like what is the secondary index Concept in the Dynamo DB
secondary index Concept in the Dynamo DB
secondary index Concept in the Dynamo DB so like we have the secondary local
so like we have the secondary local
so like we have the secondary local secondary index as well as the global
secondary index as well as the global
secondary index as well as the global secondary index so what is the use case
secondary index so what is the use case
secondary index so what is the use case of implementing the secondary index and
of implementing the secondary index and
of implementing the secondary index and what is the cons of using those
what is the cons of using those
what is the cons of using those secondary index I'm not uh much aware on
secondary index I'm not uh much aware on
secondary index I'm not uh much aware on that actually I'll have to okay look
that actually I'll have to okay look
that actually I'll have to okay look into it so say consider one scenario
into it so say consider one scenario
into it so say consider one scenario right so you are working for some
right so you are working for some
right so you are working for some business unit and uh so they are putting
business unit and uh so they are putting
business unit and uh so they are putting your data in the S3 bucket so they don't
your data in the S3 bucket so they don't
your data in the S3 bucket so they don't have a particular schedule when they are
have a particular schedule when they are
have a particular schedule when they are dumping it is like on an ad hoc basis on
dumping it is like on an ad hoc basis on
dumping it is like on an ad hoc basis on a daily basis right and they have
a daily basis right and they have
a daily basis right and they have mentioned an SLA for 30 minutes so
mentioned an SLA for 30 minutes so
mentioned an SLA for 30 minutes so whenever data LS to the S3 bucket you
whenever data LS to the S3 bucket you
whenever data LS to the S3 bucket you need to process that data in in the 30
need to process that data in in the 30
need to process that data in in the 30 minute window right so they have also
minute window right so they have also
minute window right so they have also given some business logic which you need
given some business logic which you need
given some business logic which you need to perform on your data and right and
to perform on your data and right and
to perform on your data and right and again they also want all the auditing
again they also want all the auditing
again they also want all the auditing and logging in place so that they can
and logging in place so that they can
and logging in place so that they can also track the data lineage so what
also track the data lineage so what
also track the data lineage so what would be your approach of handling such
would be your approach of handling such
would be your approach of handling such a pipeline uh for Designing such a
a pipeline uh for Designing such a
a pipeline uh for Designing such a pipeline by meeting the constraints that
pipeline by meeting the constraints that
pipeline by meeting the constraints that they have mentioned so first constraint
they have mentioned so first constraint
they have mentioned so first constraint is uh that it's on a dog basis we are
is uh that it's on a dog basis we are
is uh that it's on a dog basis we are not getting any notifications right uh
not getting any notifications right uh
not getting any notifications right uh what I can think of is uh what we can do
what I can think of is uh what we can do
what I can think of is uh what we can do is uh so whenever we can uh activate we
is uh so whenever we can uh activate we
is uh so whenever we can uh activate we can write a Lambda function and uh we
can write a Lambda function and uh we
can write a Lambda function and uh we can invoke it based on the put uh put
can invoke it based on the put uh put
can invoke it based on the put uh put event in S3 so if there is any file that
event in S3 so if there is any file that
event in S3 so if there is any file that has been placed in S3 it will
has been placed in S3 it will
has been placed in S3 it will automatically go on and basically invoke
automatically go on and basically invoke
automatically go on and basically invoke the Lambda function and Lambda function
the Lambda function and Lambda function
the Lambda function and Lambda function can take care of the uh processing part
can take care of the uh processing part
can take care of the uh processing part there um that's one thing and in case
there um that's one thing and in case
there um that's one thing and in case there are like the data size data volume
there are like the data size data volume
there are like the data size data volume is huge so what can we do in that case
is huge so what can we do in that case
is huge so what can we do in that case um basic uh what I'm thinking is maybe
um basic uh what I'm thinking is maybe
um basic uh what I'm thinking is maybe we can uh do the same thing like till
we can uh do the same thing like till
we can uh do the same thing like till the first part what we can do is uh
the first part what we can do is uh
the first part what we can do is uh whenever there's a put object it can
whenever there's a put object it can
whenever there's a put object it can invoke the Lambda and in the Lambda we
invoke the Lambda and in the Lambda we
invoke the Lambda and in the Lambda we can invoke AWS glue jobs maybe and uh it
can invoke AWS glue jobs maybe and uh it
can invoke AWS glue jobs maybe and uh it will take the data from S3 and in the
will take the data from S3 and in the
will take the data from S3 and in the AWS glue jobs it will start processing
AWS glue jobs it will start processing
AWS glue jobs it will start processing uh so that was the a do part and what
uh so that was the a do part and what
uh so that was the a do part and what was the second uh what were uh the other
was the second uh what were uh the other
was the second uh what were uh the other constraint uh so for other constraint is
constraint uh so for other constraint is
constraint uh so for other constraint is they also want to have the logging in
they also want to have the logging in
they also want to have the logging in place as well as the auditing mechanism
place as well as the auditing mechanism
place as well as the auditing mechanism so that they can also see the data
lineage okay so for that like uh uh data
lineage okay so for that like uh uh data lineage okay so uh in AWS glue if we are
lineage okay so uh in AWS glue if we are
lineage okay so uh in AWS glue if we are using that uh so it does uh have like uh
using that uh so it does uh have like uh
using that uh so it does uh have like uh we we are leveraging spark right so it
we we are leveraging spark right so it
we we are leveraging spark right so it does store the data in the lower uh like
does store the data in the lower uh like
does store the data in the lower uh like low level in rdds which does have the uh
low level in rdds which does have the uh
low level in rdds which does have the uh lineage information so maybe we can
lineage information so maybe we can
lineage information so maybe we can leverage
leverage
leverage that or in case and for the logging part
that or in case and for the logging part
that or in case and for the logging part we can
we can
we can active yeah so for any logging part or
active yeah so for any logging part or
active yeah so for any logging part or monitoring purpose we can uh activate uh
monitoring purpose we can uh activate uh
monitoring purpose we can uh activate uh AWS cloudwatch events there and in case
AWS cloudwatch events there and in case
AWS cloudwatch events there and in case of any failure in case of anything we
of any failure in case of anything we
of any failure in case of anything we want to uh basically see we can go on
want to uh basically see we can go on
want to uh basically see we can go on and uh check on the logs uh yeah so can
and uh check on the logs uh yeah so can
and uh check on the logs uh yeah so can you explain like how rdds will give you
you explain like how rdds will give you
you explain like how rdds will give you the data
the data
the data lineage yeah so uh rdds uh basically how
lineage yeah so uh rdds uh basically how
lineage yeah so uh rdds uh basically how uh okay so whenever we are calling an
uh okay so whenever we are calling an
uh okay so whenever we are calling an action basically uh it's a in sparkk
action basically uh it's a in sparkk
action basically uh it's a in sparkk It's a lazy evaluation it follows a lazy
It's a lazy evaluation it follows a lazy
It's a lazy evaluation it follows a lazy evaluation technique so if there is any
evaluation technique so if there is any
evaluation technique so if there is any action that has been called it will go
action that has been called it will go
action that has been called it will go on and from the start it will start uh
on and from the start it will start uh
on and from the start it will start uh take all the uh steps that will that is
take all the uh steps that will that is
take all the uh steps that will that is there it will create a execution plan uh
there it will create a execution plan uh
there it will create a execution plan uh and in that execution plan will have all
and in that execution plan will have all
and in that execution plan will have all the Transformations and everything that
the Transformations and everything that
the Transformations and everything that needs to be done to get to the output
needs to be done to get to the output
needs to be done to get to the output and it it's in certain order and so it
and it it's in certain order and so it
and it it's in certain order and so it will create dags for it so dag will have
will create dags for it so dag will have
will create dags for it so dag will have a particular tasks in certain order that
a particular tasks in certain order that
a particular tasks in certain order that needs to be executed that is basically
needs to be executed that is basically
needs to be executed that is basically the lineage information that rdd has uh
the lineage information that rdd has uh
the lineage information that rdd has uh so in that sense like we can uh if there
so in that sense like we can uh if there
so in that sense like we can uh if there is any node failure or anything like
is any node failure or anything like
is any node failure or anything like that happens uh so it can go on and uh
that happens uh so it can go on and uh
that happens uh so it can go on and uh Trace back and using that dag it can
Trace back and using that dag it can
Trace back and using that dag it can recomm that node
recomm that node
recomm that node so it does have a lineage information
so it does have a lineage information
so it does have a lineage information with it so say you have in this uh in
with it so say you have in this uh in
with it so say you have in this uh in this use case only you have the
this use case only you have the
this use case only you have the requirement to have all the different
requirement to have all the different
requirement to have all the different event so say you are saying I am
event so say you are saying I am
event so say you are saying I am triggering a Lambda right and then using
triggering a Lambda right and then using
triggering a Lambda right and then using Lambda I am again triggering the glue
Lambda I am again triggering the glue
Lambda I am again triggering the glue service right so all this event you want
service right so all this event you want
service right so all this event you want to track as part of your audit right
to track as part of your audit right
to track as part of your audit right what is the originating Source what is
what is the originating Source what is
what is the originating Source what is uh what is the transformation source so
uh what is the transformation source so
uh what is the transformation source so in this scenario like for this
in this scenario like for this
in this scenario like for this implementation how you will uh enable
implementation how you will uh enable
implementation how you will uh enable the auditing part uh I guess cloud trail
the auditing part uh I guess cloud trail
the auditing part uh I guess cloud trail is a
is a
is a uh AWS service that can help in auditing
uh AWS service that can help in auditing
uh AWS service that can help in auditing part uh but that is again on the entire
part uh but that is again on the entire
part uh but that is again on the entire infra level right that cloud trail yeah
infra level right that cloud trail yeah
infra level right that cloud trail yeah okay okay in that case like uh we can
okay okay in that case like uh we can
okay okay in that case like uh we can maybe leverage not AWS glue we can use
maybe leverage not AWS glue we can use
maybe leverage not AWS glue we can use airflow as well we can write uh we can
airflow as well we can write uh we can
airflow as well we can write uh we can create dags in airflow and uh we can
create dags in airflow and uh we can
create dags in airflow and uh we can create tasks and uh we
create tasks and uh we
create tasks and uh we can uh
can uh
can uh basically use their uh uh we can
basically use their uh uh we can
basically use their uh uh we can uh set the dependencies between them and
uh set the dependencies between them and
uh set the dependencies between them and based on that like it can leverage act
based on that like it can leverage act
based on that like it can leverage act as a leanage information in case of any
as a leanage information in case of any
as a leanage information in case of any failures okay uh so how you are
failures okay uh so how you are
failures okay uh so how you are currently deploying right your pipeline
currently deploying right your pipeline
currently deploying right your pipeline so what what is the is there any devops
so what what is the is there any devops
so what what is the is there any devops tool that you are using so you are also
tool that you are using so you are also
tool that you are using so you are also using quite a good Services of AWS right
using quite a good Services of AWS right
using quite a good Services of AWS right so how you deploying this
so how you deploying this
so how you deploying this infrastructure okay so uh basically uh
infrastructure okay so uh basically uh
infrastructure okay so uh basically uh we do have now a new framework with us
we do have now a new framework with us
we do have now a new framework with us with which is basically uh has been
with which is basically uh has been
with which is basically uh has been created by our team uh in quantify uh so
created by our team uh in quantify uh so
created by our team uh in quantify uh so to deploy what we are doing is we do
to deploy what we are doing is we do
to deploy what we are doing is we do have a repository get repository so all
have a repository get repository so all
have a repository get repository so all the changes that we are doing uh in
the changes that we are doing uh in
the changes that we are doing uh in terms of our uh Transformations and all
terms of our uh Transformations and all
terms of our uh Transformations and all the SQL related changes on in the
the SQL related changes on in the
the SQL related changes on in the scripts we will create a get uh request
scripts we will create a get uh request
scripts we will create a get uh request like basically purle request to merge
like basically purle request to merge
like basically purle request to merge the data so someone will approve that
the data so someone will approve that
the data so someone will approve that and will be able to merge the data and
and will be able to merge the data and
and will be able to merge the data and post that we do have a genin pipeline
post that we do have a genin pipeline
post that we do have a genin pipeline that we are using for the deployment
that we are using for the deployment
that we are using for the deployment part so gen pipeline is pointing to that
part so gen pipeline is pointing to that
part so gen pipeline is pointing to that git repository so if there is any change
git repository so if there is any change
git repository so if there is any change we'll need to again deploy the genkins
we'll need to again deploy the genkins
we'll need to again deploy the genkins it will go on and uh choose the stack
it will go on and uh choose the stack
it will go on and uh choose the stack and whatever the resources it will go on
and whatever the resources it will go on
and whatever the resources it will go on and allocate that so this is the cicd
and allocate that so this is the cicd
and allocate that so this is the cicd tool that we are using for now so what
tool that we are using for now so what
tool that we are using for now so what what stack it is deploying
what stack it is deploying
what stack it is deploying like so like it there is a code that has
like so like it there is a code that has
like so like it there is a code that has been written by the platform team uh to
been written by the platform team uh to
been written by the platform team uh to uh deploy all the resources that are
uh deploy all the resources that are
uh deploy all the resources that are needed like uh and provide all the uh
needed like uh and provide all the uh
needed like uh and provide all the uh permissions and accesses so if in case
permissions and accesses so if in case
permissions and accesses so if in case like we are using uh AWS S3 so all the
like we are using uh AWS S3 so all the
like we are using uh AWS S3 so all the resources that are using in
resources that are using in
resources that are using in communicating with S3 they need the
communicating with S3 they need the
communicating with S3 they need the permission to communicate with S3 so all
permission to communicate with S3 so all
permission to communicate with S3 so all those things there is a terraform code I
those things there is a terraform code I
those things there is a terraform code I guess like that has been actually taken
guess like that has been actually taken
guess like that has been actually taken care by the platform team so they are
care by the platform team so they are
care by the platform team so they are the ones who are deploying so it is a
the ones who are deploying so it is a
the ones who are deploying so it is a separate team who is managing the INF
separate team who is managing the INF
separate team who is managing the INF yes yes okay so uh you are also using
yes yes okay so uh you are also using
yes yes okay so uh you are also using Lambda right you have used Lambda so you
Lambda right you have used Lambda so you
Lambda right you have used Lambda so you already are aware that Lambda is a
already are aware that Lambda is a
already are aware that Lambda is a limitation right where it can't have the
limitation right where it can't have the
limitation right where it can't have the processing more than 15 minute right so
processing more than 15 minute right so
processing more than 15 minute right so say you have a use case right where you
say you have a use case right where you
say you have a use case right where you uh wanted to have sequential Lambda
uh wanted to have sequential Lambda
uh wanted to have sequential Lambda trigger right so maybe once one Lambda
trigger right so maybe once one Lambda
trigger right so maybe once one Lambda completes maybe before the timeout uh it
completes maybe before the timeout uh it
completes maybe before the timeout uh it will trigger another Lambda somehow and
will trigger another Lambda somehow and
will trigger another Lambda somehow and the processing would continue from that
the processing would continue from that
the processing would continue from that execution state only uh so can you think
execution state only uh so can you think
execution state only uh so can you think like how can we Implement such a use
like how can we Implement such a use
like how can we Implement such a use case uh let me think uh so in case we
case uh let me think uh so in case we
case uh let me think uh so in case we need to use Lambda only or processing
need to use Lambda only or processing
need to use Lambda only or processing the data that is huge more than expected
the data that is huge more than expected
the data that is huge more than expected which cannot be completed in 15 minutes
which cannot be completed in 15 minutes
which cannot be completed in 15 minutes of time span so uh yeah we can uh
of time span so uh yeah we can uh
of time span so uh yeah we can uh trigger another Lambda inside a Lambda
trigger another Lambda inside a Lambda
trigger another Lambda inside a Lambda that we can do and uh maybe what we can
that we can do and uh maybe what we can
that we can do and uh maybe what we can do is uh whatever the processing like
do is uh whatever the processing like
do is uh whatever the processing like based on the time limit so let's say
based on the time limit so let's say
based on the time limit so let's say there are three or four transformation
there are three or four transformation
there are three or four transformation that we need to do and as part of one
that we need to do and as part of one
that we need to do and as part of one Lambda time St like time out uh we can
Lambda time St like time out uh we can
Lambda time St like time out uh we can just do two Transformations so we can
just do two Transformations so we can
just do two Transformations so we can read the file we can perform those two
read the file we can perform those two
read the file we can perform those two transformation and we can uh get a
transformation and we can uh get a
transformation and we can uh get a output file and we can place it again on
output file and we can place it again on
output file and we can place it again on some other S3 Zone uh we can create some
some other S3 Zone uh we can create some
some other S3 Zone uh we can create some intermediate Zone in S3 we can do that
intermediate Zone in S3 we can do that
intermediate Zone in S3 we can do that and uh that can be passed as a argument
and uh that can be passed as a argument
and uh that can be passed as a argument in like as a variable in Lambda like the
in like as a variable in Lambda like the
in like as a variable in Lambda like the another Lambda that will be invoked
another Lambda that will be invoked
another Lambda that will be invoked inside this Lambda only so maybe that we
inside this Lambda only so maybe that we
inside this Lambda only so maybe that we can do or uh we can uh do one more thing
can do or uh we can uh do one more thing
can do or uh we can uh do one more thing instead of like creating the f file we
instead of like creating the f file we
instead of like creating the f file we can uh dump the data like whatever the
can uh dump the data like whatever the
can uh dump the data like whatever the process processing we have done we can
process processing we have done we can
process processing we have done we can dump the data into our red shift and uh
dump the data into our red shift and uh
dump the data into our red shift and uh from that point of state only it will
from that point of state only it will
from that point of state only it will fetch the recent data that has been that
fetch the recent data that has been that
fetch the recent data that has been that that is there in red shift and it can
that is there in red shift and it can
that is there in red shift and it can continue working on it okay so say you
continue working on it okay so say you
continue working on it okay so say you using Athena right so for some ad hoc
using Athena right so for some ad hoc
using Athena right so for some ad hoc analysis and something right uh so see
analysis and something right uh so see
analysis and something right uh so see there is uh the currently the definition
there is uh the currently the definition
there is uh the currently the definition that you have mentioned there is no
that you have mentioned there is no
that you have mentioned there is no partition right maybe on the next day uh
partition right maybe on the next day uh
partition right maybe on the next day uh the data has been changed right and now
the data has been changed right and now
the data has been changed right and now there is partition so will aena able to
there is partition so will aena able to
there is partition so will aena able to detect the partition automat ially or do
detect the partition automat ially or do
detect the partition automat ially or do you need to maybe have some commands
you need to maybe have some commands
you need to maybe have some commands perform by which it will be able to
perform by which it will be able to
perform by which it will be able to identify the
identify the
identify the partitions uh so as far as I remember
partitions uh so as far as I remember
partitions uh so as far as I remember like in Athena we need to give the
like in Athena we need to give the
like in Athena we need to give the partitions as well because that is how
partitions as well because that is how
partitions as well because that is how the query like it's able to access uh
the query like it's able to access uh
the query like it's able to access uh give the query result faster because
give the query result faster because
give the query result faster because it's working on partitions uh so as far
it's working on partitions uh so as far
it's working on partitions uh so as far as remember like I we need to give some
as remember like I we need to give some
as remember like I we need to give some partitions because based on that create
partitions because based on that create
partitions because based on that create an external table without partitions
an external table without partitions
an external table without partitions will AA give us an error okay I'm I'm
will AA give us an error okay I'm I'm
will AA give us an error okay I'm I'm not aware on that I'll have to look into
not aware on that I'll have to look into
not aware on that I'll have to look into it actually so does aena also physically
it actually so does aena also physically
it actually so does aena also physically load the data because it is also
load the data because it is also
load the data because it is also maintaining the catalog and meta store
maintaining the catalog and meta store
maintaining the catalog and meta store right so will it like entirely load the
right so will it like entirely load the
right so will it like entirely load the data and then it will do the
data and then it will do the
data and then it will do the processing uh no no it's not uh loading
processing uh no no it's not uh loading
processing uh no no it's not uh loading the data uh it's just referring to the
the data uh it's just referring to the
the data uh it's just referring to the data that is present on maybe S3 so that
data that is present on maybe S3 so that
data that is present on maybe S3 so that not loading the data so can we have an
not loading the data so can we have an
not loading the data so can we have an aena service say in one region uh which
aena service say in one region uh which
aena service say in one region uh which is say maybe doing the analysis on the
is say maybe doing the analysis on the
is say maybe doing the analysis on the data in some other S3 bucket in another
data in some other S3 bucket in another
data in some other S3 bucket in another region so can we have such a cross cross
region so can we have such a cross cross
region so can we have such a cross cross region communication here or is it a
region communication here or is it a
region communication here or is it a mandate to have the AA in the same
mandate to have the AA in the same
mandate to have the AA in the same region as the F3
region as the F3
region as the F3 bucket I'm not I'm not aware on that
bucket I'm not I'm not aware on that
bucket I'm not I'm not aware on that okay uh so can you please share your
okay uh so can you please share your
okay uh so can you please share your screen like I have in SLE questions yeah
screen like I have in SLE questions yeah
screen like I have in SLE questions yeah let me know if my screen is visible yeah
let me know if my screen is visible yeah
let me know if my screen is visible yeah yes so yeah your screen is visible can
yes so yeah your screen is visible can
yes so yeah your screen is visible can you see on the chat maybe so let me
you see on the chat maybe so let me
you see on the chat maybe so let me explain the question first right so here
explain the question first right so here
explain the question first right so here will have like the three different
will have like the three different
will have like the three different tables so consider this as for the
tables so consider this as for the
tables so consider this as for the e-commerce data right where we have the
e-commerce data right where we have the
e-commerce data right where we have the orders data and then also we have the
orders data and then also we have the
orders data and then also we have the products information right and then we
products information right and then we
products information right and then we have the order detail so order detail is
have the order detail so order detail is
have the order detail so order detail is like a more granular where you have will
like a more granular where you have will
like a more granular where you have will have the like the uh for a single order
have the like the uh for a single order
have the like the uh for a single order ID right you will have the different
ID right you will have the different
ID right you will have the different products right so order detail is like a
products right so order detail is like a
products right so order detail is like a more granular where you have all the
more granular where you have all the
more granular where you have all the product ID information how many quantity
product ID information how many quantity
product ID information how many quantity of that unit you have purchased what is
of that unit you have purchased what is
of that unit you have purchased what is the unit price right so this order
the unit price right so this order
the unit price right so this order details is at the detail Lev level right
details is at the detail Lev level right
details is at the detail Lev level right and this orders table is on a very high
and this orders table is on a very high
and this orders table is on a very high level like what is the order ID what is
level like what is the order ID what is
level like what is the order ID what is the order date what is the total amount
the order date what is the total amount
the order date what is the total amount right so that orders table is is the
right so that orders table is is the
right so that orders table is is the Gran is a more at high level of
Gran is a more at high level of
Gran is a more at high level of granularity right then this product
granularity right then this product
granularity right then this product table is containing the information
table is containing the information
table is containing the information right uh this is the product ID
right uh this is the product ID
right uh this is the product ID corresponding to this name and this
corresponding to this name and this
corresponding to this name and this category so you you have this three
category so you you have this three
category so you you have this three different uh tables right now what you
different uh tables right now what you
different uh tables right now what you need to find out in the SQL query is the
need to find out in the SQL query is the
need to find out in the SQL query is the top selling product right so what is the
top selling product right so what is the
top selling product right so what is the so maybe you generating some report for
so maybe you generating some report for
so maybe you generating some report for some customer or maybe some stakeholders
some customer or maybe some stakeholders
some customer or maybe some stakeholders right and you want to show them right
right and you want to show them right
right and you want to show them right this is the top selling product right
this is the top selling product right
this is the top selling product right what whatever we are offered and what is
what whatever we are offered and what is
what whatever we are offered and what is the criteria for finding the top selling
the criteria for finding the top selling
the criteria for finding the top selling product is basically characterized by
product is basically characterized by
product is basically characterized by the highest
the highest
the highest revenue uh what is the criteria again
revenue uh what is the criteria again
revenue uh what is the criteria again for the highest revenue so we have to uh
for the highest revenue so we have to uh
for the highest revenue so we have to uh do the segregation based on each
do the segregation based on each
do the segregation based on each category for the last quarter so say you
category for the last quarter so say you
category for the last quarter so say you are building a quarterly report right
are building a quarterly report right
are building a quarterly report right and in the quarterly report within each
and in the quarterly report within each
and in the quarterly report within each category what is the top selling product
category what is the top selling product
category what is the top selling product which will be characterized by the
which will be characterized by the
which will be characterized by the highest revenue so you need to create a
highest revenue so you need to create a
highest revenue so you need to create a SQL query for
SQL query for
SQL query for this let me know in case of any question
this let me know in case of any question
this let me know in case of any question yeah
yeah
yeah okay yeah so highest revenue is it based
okay yeah so highest revenue is it based
okay yeah so highest revenue is it based on like the total amount which we have
on like the total amount which we have
on like the total amount which we have here so you can ignore this total amount
here so you can ignore this total amount
here so you can ignore this total amount right I think it might not be accurate
right I think it might not be accurate
right I think it might not be accurate so consider this the order details and
so consider this the order details and
so consider this the order details and the quantity and unit price for the for
the quantity and unit price for the for
the quantity and unit price for the for the revenue okay okay so uh firstly what
the revenue okay okay so uh firstly what
the revenue okay okay so uh firstly what I'm thinking firstly we need the data
I'm thinking firstly we need the data
I'm thinking firstly we need the data just for the last quarter so uh let's
just for the last quarter so uh let's
just for the last quarter so uh let's say last three months uh okay so that we
say last three months uh okay so that we
say last three months uh okay so that we can filter out based on the order date
can filter out based on the order date
can filter out based on the order date okay uh just writing some approach here
okay uh just writing some approach here
okay uh just writing some approach here so last 3 months data we can filter out
so last 3 months data we can filter out
so last 3 months data we can filter out from uh order
from uh order
from uh order table now uh we need to join orders and
table now uh we need to join orders and
table now uh we need to join orders and Order detail together okay to get the um
Order detail together okay to get the um
Order detail together okay to get the um we can join these two based on order
we can join these two based on order
we can join these two based on order ID and we can get quantity and uh unit
ID and we can get quantity and uh unit
ID and we can get quantity and uh unit price okay and we'll try we'll multiply
price okay and we'll try we'll multiply
price okay and we'll try we'll multiply uh the number of units that we have with
uh the number of units that we have with
uh the number of units that we have with the unit price so to get the total
the unit price so to get the total
the unit price so to get the total amount if we are ignoring uh this
amount if we are ignoring uh this
amount if we are ignoring uh this particular total amount here so we need
particular total amount here so we need
particular total amount here so we need to multiply this
to multiply this
to multiply this quantity uh with unit price and getting
quantity uh with unit price and getting
quantity uh with unit price and getting the total
the total
the total amount
amount
amount now what uh we can do is now if we have
now what uh we can do is now if we have
now what uh we can do is now if we have the total amount here and
the total amount here and
the total amount here and uh we have the order ID okay based on
uh we have the order ID okay based on
uh we have the order ID okay based on the order ID here okay uh order ID so
the order ID here okay uh order ID so
the order ID here okay uh order ID so what type of join will you perform in
what type of join will you perform in
what type of join will you perform in this two
this two
this two table we need all t uh like all the
table we need all t uh like all the
table we need all t uh like all the orders then we can uh use l
orders then we can uh use l
orders then we can uh use l join I'm assuming like it will have
join I'm assuming like it will have
join I'm assuming like it will have information for all the orders because
information for all the orders because
information for all the orders because it's on the granular level of orders so
it's on the granular level of orders so
it's on the granular level of orders so even if we do in a join I don't think
even if we do in a join I don't think
even if we do in a join I don't think there will be any loss of data
there will be any loss of data
there will be any loss of data okay yeah okay so now I have my total
okay yeah okay so now I have my total
okay yeah okay so now I have my total amount I have filtered order date uh now
amount I have filtered order date uh now
amount I have filtered order date uh now I need to group my data um group my data
I need to group my data um group my data
I need to group my data um group my data based on product and get the sum of the
based on product and get the sum of the
based on product and get the sum of the total amount get the revenue basically
total amount get the revenue basically
total amount get the revenue basically and within each category of for the last
and within each category of for the last
and within each category of for the last quarter so category okay now category
quarter so category okay now category
quarter so category okay now category also we need so we need to uh basically
also we need so we need to uh basically
also we need so we need to uh basically join this
join this
join this product I like product table as well and
product I like product table as well and
product I like product table as well and based on product ID and we need uh
based on product ID and we need uh
based on product ID and we need uh category from here we need category from
category from here we need category from
category from here we need category from here now we have all the data we can
here now we have all the data we can
here now we have all the data we can group group it on um product
group group it on um product
group group it on um product ID but product ID unique right uh okay
ID but product ID unique right uh okay
ID but product ID unique right uh okay so the product ID is unique so it will
so the product ID is unique so it will
so the product ID is unique so it will automatically have different categories
automatically have different categories
automatically have different categories right uh we are considering these uh
right uh we are considering these uh
right uh we are considering these uh different products
different products
different products correct yeah okay so we can just group
correct yeah okay so we can just group
correct yeah okay so we can just group it on product
it on product
it on product ID and we can take the total sum of uh
ID and we can take the total sum of uh
ID and we can take the total sum of uh like some of this
like some of this
like some of this amount uh total amount uh so can you
amount uh total amount uh so can you
amount uh total amount uh so can you explain like why we are doing the
explain like why we are doing the
explain like why we are doing the grouping on the product ID uh so we need
grouping on the product ID uh so we need
grouping on the product ID uh so we need top selling product so top selling
top selling product so top selling
top selling product so top selling product will be the one that has
product will be the one that has
product will be the one that has generated the highest revenue so if I
generated the highest revenue so if I
generated the highest revenue so if I want the top selling product I have this
want the top selling product I have this
want the top selling product I have this product ID table and I have this product
product ID table and I have this product
product ID table and I have this product name and category cons like this is the
name and category cons like this is the
name and category cons like this is the unique one so if I am doing the uh group
unique one so if I am doing the uh group
unique one so if I am doing the uh group or grouping on product ID so whatever
or grouping on product ID so whatever
or grouping on product ID so whatever the orders are there for this particular
the orders are there for this particular
the orders are there for this particular order ID 101 I am uh basically basically
order ID 101 I am uh basically basically
order ID 101 I am uh basically basically adding all the revenues generated from
adding all the revenues generated from
adding all the revenues generated from all the orders for this particular
all the orders for this particular
all the orders for this particular product okay and I'm getting the total
product okay and I'm getting the total
product okay and I'm getting the total revenue for that particular product
revenue for that particular product
revenue for that particular product likewise I am doing it for all the
likewise I am doing it for all the
likewise I am doing it for all the products now so if I'm grouping it on
products now so if I'm grouping it on
products now so if I'm grouping it on product ID I will have all the products
product ID I will have all the products
product ID I will have all the products and all the revenue generated from uh
and all the revenue generated from uh
and all the revenue generated from uh like uh from all the orders for these
like uh from all the orders for these
like uh from all the orders for these products uh so now what you have in your
products uh so now what you have in your
products uh so now what you have in your select query in this
select query in this
select query in this case okay so in my select query
case okay so in my select query
case okay so in my select query uh in my select query I will have these
uh in my select query I will have these
uh in my select query I will have these two uh quantity and unit and
two uh quantity and unit and
two uh quantity and unit and uh I will I can uh like I can use
uh I will I can uh like I can use
uh I will I can uh like I can use product ID only because I'm grouping it
product ID only because I'm grouping it
product ID only because I'm grouping it by product ID but like in the question
by product ID but like in the question
by product ID but like in the question we have can you read out the question
we have can you read out the question
we have can you read out the question again what would be the top the top
again what would be the top the top
again what would be the top the top selling product yeah okay so we need
selling product yeah okay so we need
selling product yeah okay so we need product name and product
product name and product
product name and product category what will uh what is the like
category what will uh what is the like
category what will uh what is the like all the columns we need in the output
all the columns we need in the output
all the columns we need in the output can you uh just so we would need the
can you uh just so we would need the
can you uh just so we would need the category and then we would need the
category and then we would need the
category and then we would need the product name and the total revenue
product name and the total revenue
product name and the total revenue highest revenue that is
highest revenue that is
highest revenue that is there
there
there category product
category product
category product name and uh Revenue
name and uh Revenue
name and uh Revenue okay okay so I can you uh like uh
okay okay so I can you uh like uh
okay okay so I can you uh like uh instead of grouping it on product ID I
instead of grouping it on product ID I
instead of grouping it on product ID I can group it on product name and
can group it on product name and
can group it on product name and category also okay that because uh
category also okay that because uh
category also okay that because uh combination of this is unique okay and
combination of this is unique okay and
combination of this is unique okay and uh in the select query also like I can
uh in the select query also like I can
uh in the select query also like I can use the same
use the same
use the same okay will it give the fin because I need
okay will it give the fin because I need
okay will it give the fin because I need the no because I just need the uh top
the no because I just need the uh top
the no because I just need the uh top selling product so maybe uh I need to
selling product so maybe uh I need to
selling product so maybe uh I need to order
order
order it uh by this Revenue okay in descending
it uh by this Revenue okay in descending
it uh by this Revenue okay in descending order so now I have uh all the highest
order so now I have uh all the highest
order so now I have uh all the highest revenue first and then the lowest one in
revenue first and then the lowest one in
revenue first and then the lowest one in the uh bottom uh then I can just like
the uh bottom uh then I can just like
the uh bottom uh then I can just like maybe I'm thinking to limit it to one
maybe I'm thinking to limit it to one
maybe I'm thinking to limit it to one then it will give me the just one output
then it will give me the just one output
then it will give me the just one output with the
with the
with the highest but we require for each category
highest but we require for each category
highest but we require for each category uh we just we don't need a single row we
uh we just we don't need a single row we
uh we just we don't need a single row we need for each category the top selling
need for each category the top selling
need for each category the top selling product okay okay okay okay so in that
product okay okay okay okay so in that
product okay okay okay okay so in that case like uh we'll have to go with the
case like uh we'll have to go with the
case like uh we'll have to go with the uh window function
uh window function
uh window function here so uh yeah so what were your window
here so uh yeah so what were your window
here so uh yeah so what were your window condition H uh so in Partition by uh we
condition H uh so in Partition by uh we
condition H uh so in Partition by uh we will be writing this product uh name and
will be writing this product uh name and
will be writing this product uh name and category uh sorry uh in like the C yeah
category uh sorry uh in like the C yeah
category uh sorry uh in like the C yeah um these two are maybe just the category
um these two are maybe just the category
um these two are maybe just the category one we can keep in Partition by so we
one we can keep in Partition by so we
one we can keep in Partition by so we can partition the data based on all the
can partition the data based on all the
can partition the data based on all the categories and uh we can order the data
categories and uh we can order the data
categories and uh we can order the data on revenue and uh like this Revenue in
on revenue and uh like this Revenue in
on revenue and uh like this Revenue in descending order and then uh we can just
descending order and then uh we can just
descending order and then uh we can just uh ask it to maybe uh give it a name and
uh ask it to maybe uh give it a name and
uh ask it to maybe uh give it a name and uh we can just take the data or like we
uh we can just take the data or like we
uh we can just take the data or like we can filter it out post that we can just
can filter it out post that we can just
can filter it out post that we can just uh uh something like that like if I am
uh uh something like that like if I am
uh uh something like that like if I am using just a minute
using just a minute
using just a minute so I'm ranking my data let's say name
so I'm ranking my data let's say name
so I'm ranking my data let's say name and category I already have this as
and category I already have this as
and category I already have this as revenue
revenue
revenue and sending and you can just give it
and sending and you can just give it
and sending and you can just give it some Rank and later on like
some Rank and later on like
some Rank and later on like uh or maybe use qualify for this but
uh or maybe use qualify for this but
uh or maybe use qualify for this but that is only supported in some SQL not
that is only supported in some SQL not
that is only supported in some SQL not in all databases yeah yeah yes yes uh so
in all databases yeah yeah yes yes uh so
in all databases yeah yeah yes yes uh so in that case like uh in another uh like
in that case like uh in another uh like
in that case like uh in another uh like you can keep it as a subquery and later
you can keep it as a subquery and later
you can keep it as a subquery and later on we can just use RNs one so for every
on we can just use RNs one so for every
on we can just use RNs one so for every category whatever the highest revenue is
category whatever the highest revenue is
category whatever the highest revenue is we we are just switching that uh so can
we we are just switching that uh so can
we we are just switching that uh so can you go to the product table you scroll
you go to the product table you scroll
you go to the product table you scroll to product table right so in the product
to product table right so in the product
to product table right so in the product table right say you have a category of
table right say you have a category of
table right say you have a category of business called right so you have
business called right so you have
business called right so you have different product name A and B A and D
different product name A and B A and D
different product name A and B A and D sorry right so say uh as for this data
sorry right so say uh as for this data
sorry right so say uh as for this data I'm just assuming right so maybe you
I'm just assuming right so maybe you
I'm just assuming right so maybe you have the a as the highest selling
have the a as the highest selling
have the a as the highest selling product so you don't need a two separate
product so you don't need a two separate
product so you don't need a two separate Row for business card just a single Row
Row for business card just a single Row
Row for business card just a single Row for business card and a and the highest
for business card and a and the highest
for business card and a and the highest revenue so will your SQL query give me
revenue so will your SQL query give me
revenue so will your SQL query give me the single Row for a single category
the single Row for a single category
the single Row for a single category no in that case I don't uh I was
no in that case I don't uh I was
no in that case I don't uh I was assuming that the product name and
assuming that the product name and
assuming that the product name and category both are different both will be
category both are different both will be
category both are different both will be unique now if uh this is the case like
unique now if uh this is the case like
unique now if uh this is the case like then I can group it on category only
then I can group it on category only
then I can group it on category only then uh it will sum up all the revenues
then uh it will sum up all the revenues
then uh it will sum up all the revenues uh from the C and D product name and if
uh from the C and D product name and if
uh from the C and D product name and if that is the highest like whatever the
that is the highest like whatever the
that is the highest like whatever the highest one I guess uh it will give me
highest one I guess uh it will give me
highest one I guess uh it will give me the highest one so in
the highest one so in
the highest one so in case both the different product have the
case both the different product have the
case both the different product have the same Revenue so what would be the output
same Revenue so what would be the output
same Revenue so what would be the output in the
in the
in the rank uh it will give me the same like
rank uh it will give me the same like
rank uh it will give me the same like both the uh because uh okay so in case
both the uh because uh okay so in case
both the uh because uh okay so in case if uh okay uh because I am using
if uh okay uh because I am using
if uh okay uh because I am using Rank and uh if
Rank and uh if
Rank and uh if it's if it's the same it will have the
it's if it's the same it will have the
it's if it's the same it will have the same uh rank only so will it give me
same uh rank only so will it give me
same uh rank only so will it give me like the two different Row in the
like the two different Row in the
like the two different Row in the output yes uh yeah I guess so okay uh
output yes uh yeah I guess so okay uh
output yes uh yeah I guess so okay uh because from the next one it will uh
because from the next one it will uh
because from the next one it will uh like skip a rank if you're using uh rank
like skip a rank if you're using uh rank
like skip a rank if you're using uh rank function so say we want to add another
function so say we want to add another
function so say we want to add another CL right say we want to add another
CL right say we want to add another
CL right say we want to add another close here uh say we just need a single
close here uh say we just need a single
close here uh say we just need a single row right for the category and we can
row right for the category and we can
row right for the category and we can add another condition so in that case
add another condition so in that case
add another condition so in that case when there is a same revenue for
when there is a same revenue for
when there is a same revenue for different products uh we can go ahead
different products uh we can go ahead
different products uh we can go ahead with the alphabetical ordering for the
with the alphabetical ordering for the
with the alphabetical ordering for the product name so whatever ever uh product
product name so whatever ever uh product
product name so whatever ever uh product name comes first as per the alphabetical
name comes first as per the alphabetical
name comes first as per the alphabetical ordering we want to include that row so
ordering we want to include that row so
ordering we want to include that row so how you will include this condition in
how you will include this condition in
how you will include this condition in your
your
your query if I want to include uh only
query if I want to include uh only
query if I want to include uh only product name that
product name that
product name that comes uh okay only in this case if a and
comes uh okay only in this case if a and
comes uh okay only in this case if a and C has like the same Revenue uh Like A
C has like the same Revenue uh Like A
C has like the same Revenue uh Like A and D has the same Revenue based on the
and D has the same Revenue based on the
and D has the same Revenue based on the uh category business card uh I just want
uh category business card uh I just want
uh category business card uh I just want to display a right uh okay stand that uh
I can compare uh the product name and uh
I can compare uh the product name and uh because product name will have a and c
because product name will have a and c
because product name will have a and c and which one uh like either one is
and which one uh like either one is
and which one uh like either one is smaller it will uh it should give me the
smaller it will uh it should give me the
smaller it will uh it should give me the smaller one only how will you compare
smaller one only how will you compare
smaller one only how will you compare this two how will
this two how will
this two how will I okay so now I have two columns what I
I okay so now I have two columns what I
I okay so now I have two columns what I can do is I can uh use a row number or
can do is I can uh use a row number or
can do is I can uh use a row number or uh row number function based on the
uh row number function based on the
uh row number function based on the category and I can order it on product
category and I can order it on product
category and I can order it on product name and uh like product name uh
name and uh like product name uh
name and uh like product name uh ascending order if uh what I like I'm
ascending order if uh what I like I'm
ascending order if uh what I like I'm just think uh so if I am grouping it
just think uh so if I am grouping it
just think uh so if I am grouping it again on
again on
again on category
category
category um yeah so if I'm grouping it again on
um yeah so if I'm grouping it again on
um yeah so if I'm grouping it again on category and uh I am just
category and uh I am just
category and uh I am just ordering by uh my product
ordering by uh my product
ordering by uh my product name
name
name uh
uh
uh so I am just giving
so I am just giving
so I am just giving this just a
minute if I'm using maybe R number
minute if I'm using maybe R number function and I'm partitioning by
function and I'm partitioning by
function and I'm partitioning by category and I'm ordering by product
category and I'm ordering by product
category and I'm ordering by product name so whatever the ascending order
name so whatever the ascending order
name so whatever the ascending order will be it will have uh R number as one
will be it will have uh R number as one
will be it will have uh R number as one and uh the second one like d will have
and uh the second one like d will have
and uh the second one like d will have row number as two uh then I can filter
row number as two uh then I can filter
row number as two uh then I can filter it out based on rle number here as well
it out based on rle number here as well
it out based on rle number here as well like uh Ru number one so you will apply
like uh Ru number one so you will apply
like uh Ru number one so you will apply this condition after all the above logic
this condition after all the above logic
this condition after all the above logic right so you will add another window
right so you will add another window
right so you will add another window function uh to incorporate this
function uh to incorporate this
function uh to incorporate this logic okay so order by Revenue in
logic okay so order by Revenue in
logic okay so order by Revenue in descending then I can uh here only
descending then I can uh here only
descending then I can uh here only M I can
M I can
M I can use this condition like I can order it
use this condition like I can order it
use this condition like I can order it by like firstly it will order it by the
by like firstly it will order it by the
by like firstly it will order it by the revenue and then uh it will order it uh
revenue and then uh it will order it uh
revenue and then uh it will order it uh on product name okay again in category
on product name okay again in category
on product name okay again in category itself okay so we won't need this second
itself okay so we won't need this second
itself okay so we won't need this second window function will this work the
window function will this work the
window function will this work the condition no
condition no
condition no no yeah I guess uh yeah I'm thinking
no yeah I guess uh yeah I'm thinking
no yeah I guess uh yeah I'm thinking like it should uh
like it should uh
like it should uh work okay yeah yeah so I'm good from my
work okay yeah yeah so I'm good from my
work okay yeah yeah so I'm good from my side like do you have any questions for
side like do you have any questions for
side like do you have any questions for me yeah so uh
me yeah so uh
me yeah so uh based like do you have any feedback or
based like do you have any feedback or
based like do you have any feedback or anything yeah so you were very good at
anything yeah so you were very good at
anything yeah so you were very good at the SQL right so the
the SQL right so the
the SQL right so the thinking yeah so thinking uh you are
thinking yeah so thinking uh you are
thinking yeah so thinking uh you are able to think logically right so you
able to think logically right so you
able to think logically right so you understand the data uh the only thing
understand the data uh the only thing
understand the data uh the only thing that is important is the Assumption so
that is important is the Assumption so
that is important is the Assumption so don't assume anything you have the data
don't assume anything you have the data
don't assume anything you have the data set right the products uh it's there the
set right the products uh it's there the
set right the products uh it's there the different category and the different
different category and the different
different category and the different product name right so for the same two
product name right so for the same two
product name right so for the same two categories we have the different product
categories we have the different product
categories we have the different product name so it's not unique right so always
name so it's not unique right so always
name so it's not unique right so always if you are making an assumption just
if you are making an assumption just
if you are making an assumption just clarify it from the interviewer if the
clarify it from the interviewer if the
clarify it from the interviewer if the Assumption you are taking is right or
Assumption you are taking is right or
Assumption you are taking is right or wrong so that is one part otherwise your
wrong so that is one part otherwise your
wrong so that is one part otherwise your logic part is very good you are able to
logic part is very good you are able to
logic part is very good you are able to Think Through the problem and able to
Think Through the problem and able to
Think Through the problem and able to demonstrate your SQL capabilities right
demonstrate your SQL capabilities right
demonstrate your SQL capabilities right uh on the engineering side right I think
uh on the engineering side right I think
uh on the engineering side right I think uh main other thing you need to also
uh main other thing you need to also
uh main other thing you need to also focus is on the data security part so
focus is on the data security part so
focus is on the data security part so like how because when why we are using
like how because when why we are using
like how because when why we are using Cloud right cloud is not always secure
Cloud right cloud is not always secure
Cloud right cloud is not always secure so Cloud we have to make our services
so Cloud we have to make our services
so Cloud we have to make our services secure so what are the different things
secure so what are the different things
secure so what are the different things that we can incorporate like uh the
that we can incorporate like uh the
that we can incorporate like uh the question that I asked right how you will
question that I asked right how you will
question that I asked right how you will make your maybe F3 bucket secure or
make your maybe F3 bucket secure or
make your maybe F3 bucket secure or maybe the pii data so all those things
maybe the pii data so all those things
maybe the pii data so all those things there are different things that we can
there are different things that we can
there are different things that we can Implement so there is like encryption
Implement so there is like encryption
Implement so there is like encryption then also we have the server side loging
then also we have the server side loging
then also we have the server side loging then also we have the different IM roles
then also we have the different IM roles
then also we have the different IM roles so we can leverage IM roles and policies
so we can leverage IM roles and policies
so we can leverage IM roles and policies uh and also we have to follow the Leist
uh and also we have to follow the Leist
uh and also we have to follow the Leist privilege access policies so sometimes I
privilege access policies so sometimes I
privilege access policies so sometimes I have seen different codes where people
have seen different codes where people
have seen different codes where people just put star like give all the
just put star like give all the
just put star like give all the permissions for different service so we
permissions for different service so we
permissions for different service so we uh AWS doesn't recommend like this so
uh AWS doesn't recommend like this so
uh AWS doesn't recommend like this so always have the list granity at the
always have the list granity at the
always have the list granity at the possible in the permissions and other
possible in the permissions and other
possible in the permissions and other stuff uh so that is again very much
stuff uh so that is again very much
stuff uh so that is again very much important and uh so yeah I find like you
important and uh so yeah I find like you
important and uh so yeah I find like you good with the designing another thing
good with the designing another thing
good with the designing another thing but security is again another thing that
but security is again another thing that
but security is again another thing that we should also Focus as a data engineer
we should also Focus as a data engineer
we should also Focus as a data engineer so yeah that is yes definitely I'll look
so yeah that is yes definitely I'll look
so yeah that is yes definitely I'll look into it yeah and just maybe a little bit
into it yeah and just maybe a little bit
into it yeah and just maybe a little bit more on the uh different services so you
more on the uh different services so you
more on the uh different services so you are also good with the Dynam DB and
are also good with the Dynam DB and
are also good with the Dynam DB and elastic search uh why we are using the
elastic search uh why we are using the
elastic search uh why we are using the use of different so there can be a
use of different so there can be a
use of different so there can be a question like why we are replicating the
question like why we are replicating the
question like why we are replicating the data between the two different databases
data between the two different databases
data between the two different databases but you are clear on your use is like
but you are clear on your use is like
but you are clear on your use is like why there is a need so it's always the
why there is a need so it's always the
why there is a need so it's always the use case implementation specific and you
use case implementation specific and you
use case implementation specific and you are clear with that point right so I
are clear with that point right so I
are clear with that point right so I think I'm good yeah just focus on those
think I'm good yeah just focus on those
think I'm good yeah just focus on those little uh little details yeah on the
little uh little details yeah on the
little uh little details yeah on the infrastructure part I think otherwise
Click on any text or timestamp to jump to that moment in the video
Share:
Most transcripts ready in under 5 seconds
One-Click Copy125+ LanguagesSearch ContentJump to Timestamps
Paste YouTube URL
Enter any YouTube video link to get the full transcript
Transcript Extraction Form
Most transcripts ready in under 5 seconds
Get Our Chrome Extension
Get transcripts instantly without leaving YouTube. Install our Chrome extension for one-click access to any video's transcript directly on the watch page.
Works with YouTube, Coursera, Udemy and more educational platforms
Get Instant Transcripts: Just Edit the Domain in Your Address Bar!
YouTube
←
→
↻
https://www.youtube.com/watch?v=UF8uR6Z6KLc
YoutubeToText
←
→
↻
https://youtubetotext.net/watch?v=UF8uR6Z6KLc