YouTube Transcript:
🚀 Azure End-to-End ETL Pipeline Project | Azure Data Factory + Databricks + Power BI | For Beginners
Video Transcript
Hi everyone, my name is Rutuja, and welcome back to Data To Crunch, where we learn more about data engineering and business intelligence. In today's video we will dive into the exciting world of Azure. I'll walk you through the process of creating an end-to-end ETL pipeline in Azure, where we will use the main Azure services: Azure Data Lake Storage, Azure Data Factory, and Azure Databricks. Afterwards we'll see how we can connect these services to Power BI so that we can create reports and dashboards on top of the same data. We'll start by learning some basic concepts, then see how to set up these Azure services to build this end-to-end ETL pipeline, and afterwards see how to use these tools to clean, transform, and visualize our data efficiently. For the analysis we are going to use an Amazon Prime dataset, and by the end of this tutorial you will understand this ETL pipeline and have a deeper understanding of how these powerful Azure tools come together seamlessly for data workflows. Now let's move ahead and learn more about our dataset.
Before we start working with the data, let's quickly go over the Amazon Prime Movies and TV Shows dataset we will be working on. I downloaded this dataset from the Kaggle website, and I'll provide the link in the description box below. The dataset contains information about the titles available on Amazon Prime, which mainly includes movies and TV shows, and it has several key features we can use to analyze trends, ratings, genres, and more. Now let's take a look at the columns we have. We have a column that gives us a unique show ID, then a column for the type of the title (movie or TV show), the title itself, the director of the show, the cast, the country of production, a date added column that tells us when the movie or TV show was added to Amazon Prime, the release year, a ratings column, and, after ratings, the duration of the movie or TV show. This dataset offers a lot of interesting possibilities for analysis, like exploring relationships between genres, release years, and ratings, or examining trends in movie popularity across different countries. What I have done is download this dataset and upload it to my GitHub account; I'll provide that link in the description box below as well.
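If you want to peek at the dataset before touching any Azure service, a minimal pandas sketch like the one below works. The repository path in RAW_URL is a placeholder, not the exact link from the video, and the column names listed in the comment are the ones just described.

```python
# Quick local look at the dataset before building the pipeline.
# RAW_URL is a placeholder -- use the link from the description box.
import pandas as pd

RAW_URL = "https://raw.githubusercontent.com/<your-account>/<your-repo>/main/amazon_prime_titles.csv"

df = pd.read_csv(RAW_URL)

# Expected columns (as described above): show_id, type, title, director,
# cast, country, date_added, release_year, rating, duration, ...
print(df.columns.tolist())
print(df.head())
```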
Now we will look at our further procedure: how we can fetch this data from the GitHub account into our Azure services, and then how we can perform our cleaning and transformation activities on it. Let's have an overview of the whole procedure we will be working on. This is the overview of our end-to-end ETL pipeline, which we will create using Azure. Our source is the GitHub account link where we have stored our dataset, and our destination, or end point, will be Power BI, where we will create dashboards and reports on our transformed dataset to present it to the business. Here we are going to use three Azure services. The first one is Azure Data Lake Storage (ADLS), where we will store our data. For the storage we will follow the Medallion architecture, which is a data management framework commonly used in modern data lakes and data warehouses. In this framework we organize our data into three distinct layers, which we call medals: the first medal is the bronze layer, the second is silver, and the third is gold.
The purpose of the bronze layer is to store the raw, unprocessed data which we have ingested. Right now our dataset is in a GitHub account, so we need to fetch it into our Azure services and store it in ADLS. For that purpose we will use Azure Data Factory: it will fetch the data from the GitHub account and store that raw, unprocessed data into our ADLS account. Once Azure Data Factory has done its job, Azure Databricks comes into the scenario. Azure Databricks will work on this raw, unprocessed dataset and clean and transform it so that it becomes easier to analyze for the data scientists or the business intelligence team, whoever works on the further steps. So right now we have the raw, unprocessed data in the bronze layer; Azure Databricks will perform data cleaning and data transformation, taking the data from the bronze layer and, once the transformation is done, storing it into the silver layer. That is where the second layer comes into the scenario: cleaned and transformed data in our ADLS. At this stage we can say the data has been processed to handle inconsistencies, missing values, and duplicates, and to derive more meaningful relationships within the dataset. After the silver layer, the gold layer comes into the scenario. The gold layer represents the highest level of data refinement and is intended for business intelligence; at this stage the data is fully processed and optimized for final consumption. So once our data has been transformed and stored into the silver layer, we will use Azure Databricks again to fetch that data from the silver layer and perform some advanced transformations on it, so that we get data which is business ready, and we will store that business-ready data into the gold layer. What does business-ready data mean? It means the data is intended for business intelligence, reporting, and analysis: at this stage it is fully processed and optimized for final consumption by the business users. Here we can say our data is highly curated and stored in a final, reporting-friendly format such as Delta Lake or Parquet. Once we have the data in the gold layer, we will connect Databricks to the Power BI service; once that connection is ready, we will fetch the data from the gold layer, which is in Delta or Parquet format, pull it into Power BI, and create dashboards and reports. This is the overview of our end-to-end ETL pipeline. Now let's move ahead and first cover the basic concepts required to understand this ETL pipeline.
Let's try to understand the ETL process. For the first step, extract, the main goal is to pull the raw data from the source systems without making any changes to it. The extract phase involves gathering raw data from various source systems, which could be databases, APIs, flat files, or even live streams, and this data could be structured, semi-structured, or unstructured. After extract we have the transform process. The objective of this step is to ensure that our data is accurate, consistent, and formatted correctly for reporting or analytics. The transformation phase is where the raw data undergoes processing and cleaning to meet the desired structure, quality, and format for reporting or analysis. This phase includes data cleaning, aggregations, validations, filtering, and other operations that ensure the data is suitable for downstream applications. After transform we have load. The objective of load is to store the transformed data in a format that is optimized for querying and reporting, so that it is easy for users and applications to access. The loading phase is where the transformed data is stored into the final destination, usually a data warehouse or a data lake, where it can be queried, reported on, and analyzed; loading can happen either in batch mode or in real-time mode. In our case the ETL pipeline will work like this. For the extract step we have Azure Data Factory: it will pull the data from the external source, in our case the GitHub account (it could also be something like Kaggle), and store that data into the bronze layer, which we will create in ADLS, the Azure Data Lake Storage service. For the transformation step, Databricks comes into the scenario: it will read the raw data from the bronze layer, perform cleaning and transformation, bring the data into a more structured format, and store it into the silver layer. Once the data is in the silver layer, Databricks can also perform additional, more advanced aggregations and refinements; once those are done, that business-intelligence-ready data gets stored into the gold layer, where it is curated and optimized for reporting. For the load step, the business-ready data which is already in the gold layer gets loaded into the Power BI tool. We will fetch the data stored in the gold layer into Power BI, and once that is done we can create dashboards and reports to present to the business. This is how the complete ETL process will work in our case.
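To make the mapping concrete, here is how the three layers translate into ADLS paths and responsibilities. This is only a sketch: the storage account name follows what we create later in the video, and the folder names are illustrative.

```python
# How the Medallion layers map onto ADLS paths in this project.
ACCOUNT = "adlsprimework"  # storage account created later in the video

BRONZE = f"abfss://bronze@{ACCOUNT}.dfs.core.windows.net/"  # raw CSV landed by ADF
SILVER = f"abfss://silver@{ACCOUNT}.dfs.core.windows.net/"  # cleaned Delta written by Databricks
GOLD   = f"abfss://gold@{ACCOUNT}.dfs.core.windows.net/"    # aggregated, business-ready Delta

# Extract  : ADF copy activity -> GitHub CSV lands under BRONZE
# Transform: Databricks        -> BRONZE -> cleaned data under SILVER
#            Databricks        -> SILVER -> aggregated data under GOLD
# Load     : Power BI          -> reads GOLD (DirectQuery or Import mode)
```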
Now let's try to understand what the Medallion architecture is. The Medallion architecture is a data management framework commonly used in modern data lakes and data warehouses. It organizes the data into three layers, namely the bronze layer, the silver layer, and the gold layer, and each of these layers is designed to handle the data at a different stage of processing and optimization. In a Medallion architecture, data is progressively refined as it moves from the bronze layer, which is the raw layer, to the silver layer, which is the cleaned and enriched data layer, and then to the gold layer, which is the business-ready data layer. The goal is to ensure scalability, performance, and high-quality data at every stage. The first layer is the bronze layer, the raw layer, where our raw, unprocessed data is ingested and stored; it is also referred to as the landing zone or raw data zone. If we talk about the characteristics of this bronze layer, the data is typically stored in its native form, for example CSV, JSON, or Parquet, exactly as it comes from the source system, and no transformations are applied at this stage. Since the data is stored as-is, it may contain errors, duplicates, or missing values. The focus of the bronze layer is to capture and preserve all the incoming data without losing any information. In our case we are going to ingest our data from GitHub in this raw format and store it into ADLS.
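In the video this ingestion is done by an Azure Data Factory copy activity; the sketch below only illustrates what "landing the raw file in the bronze container, unchanged" means, done by hand with the azure-storage-file-datalake SDK. The URL, account name, and key are placeholders.

```python
# Manual illustration of landing raw data in the bronze layer
# (in the project itself, ADF performs this step).
import requests
from azure.storage.filedatalake import DataLakeServiceClient

RAW_URL = "https://raw.githubusercontent.com/<your-account>/<your-repo>/main/amazon_prime_titles.csv"

service = DataLakeServiceClient(
    account_url="https://adlsprimework.dfs.core.windows.net",
    credential="<storage-account-key>",
)

bronze = service.get_file_system_client("bronze")
csv_bytes = requests.get(RAW_URL).content

# Store the file exactly as it came from the source -- no transformations
# are applied in the bronze layer.
bronze.get_file_client("amazon_prime_titles.csv").upload_data(csv_bytes, overwrite=True)
```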
After the bronze layer we have the silver layer. We need this silver layer because it is where our data is cleaned, transformed, and enriched for further analysis. Data at this stage has been processed to handle inconsistencies, missing values, and duplicates, and to derive more meaningful relationships between the datasets. If we talk about the characteristics of the silver layer, at this stage the data is transformed into a more structured, normalized, or enriched form, which may involve operations like data cleaning, deduplication, or validation. It can also include basic business logic, like creating additional columns, data type conversions, or categorizations; we can perform these kinds of basic logical transformations in this layer. The data in this layer is typically suitable for EDA, which is exploratory data analysis, and for intermediate reporting. What could be the use case for the silver layer? Data scientists or analysts can use it to build machine learning models or to conduct further analysis in Databricks; basically it serves as the foundation for the gold layer's curated, business-ready data. In our case we are going to use Databricks for this silver layer: we will clean and transform the raw movie data, for example removing some duplicates and handling some missing values, and we will store the result into the silver layer in Delta or Parquet format.
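A minimal sketch of what that silver-layer step could look like in a Databricks notebook follows. The `spark` session is provided by Databricks, the paths are illustrative, and the specific cleaning rules are just examples of the kind of operations described above.

```python
# Silver layer: read the raw bronze CSV, clean it, and write Delta.
from pyspark.sql import functions as F

bronze_path = "abfss://bronze@adlsprimework.dfs.core.windows.net/amazon_prime_titles.csv"
silver_path = "abfss://silver@adlsprimework.dfs.core.windows.net/amazon_prime_titles"

raw = spark.read.option("header", True).csv(bronze_path)

cleaned = (
    raw.dropDuplicates(["show_id"])                              # remove duplicate titles
       .na.fill({"director": "Unknown", "country": "Unknown"})   # handle missing values
       .withColumn("release_year", F.col("release_year").cast("int"))  # basic type conversion
)

# Store the cleaned, structured data in the silver container as Delta.
cleaned.write.format("delta").mode("overwrite").save(silver_path)
```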
The next layer is the gold layer. The purpose of the gold layer is to represent the highest level of data refinement; it is intended for business intelligence, reporting, and analytics. At this stage the data is fully processed and optimized for final consumption by the business users. What are the characteristics of this gold layer? The data is highly curated: it often includes aggregations, summaries, or derived metrics that answer specific business questions. This layer is optimized for performance and fast querying, for example using indexing, partitioning, or Z-ordering, and it is stored in a final, reporting-friendly format such as Parquet or Delta Lake. The use case for the gold layer is dashboards, reports, and KPIs that support decision making: business teams and management can use this layer to extract insights, perform trend analysis, and make data-driven decisions. In our case this is where Power BI comes into the scenario. After transforming the data we might, for example, aggregate the movie counts by genre or by year and save that final dataset into the gold layer. The data is then ready to be consumed by a Power BI dashboard for reporting: we can pull it from the gold layer into Power BI and perform the next business-driven activities, which are creating reports and dashboards. This is how the Medallion architecture works and how it will be used in the end-to-end ETL pipeline we are going to create.
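As with the silver step, here is a hedged sketch of what the gold-layer step could look like in Databricks: read the cleaned silver data, build a small business-ready aggregate (titles per release year and type, as an example), and save it as Delta in the gold container. Paths and names are illustrative.

```python
# Gold layer: aggregate the cleaned data into a business-ready table.
silver_path = "abfss://silver@adlsprimework.dfs.core.windows.net/amazon_prime_titles"
gold_path   = "abfss://gold@adlsprimework.dfs.core.windows.net/titles_by_year"

cleaned = spark.read.format("delta").load(silver_path)

titles_by_year = (
    cleaned.groupBy("release_year", "type")  # e.g. movies vs. TV shows per year
           .count()
           .orderBy("release_year")
)

titles_by_year.write.format("delta").mode("overwrite").save(gold_path)
# Power BI then connects to this gold data for dashboards and KPIs.
```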
Now let's see what Microsoft Azure actually is. Microsoft Azure is a cloud computing platform and service offered by Microsoft. It provides a wide range of cloud services, including computing power, storage, networking, databases, analytics, artificial intelligence, and many more. Businesses and individuals can use Azure to build, deploy, and manage applications on a global scale without needing to maintain on-premises infrastructure. Why do we need Azure? There are a few key reasons. The first is scalability and flexibility: Azure allows businesses to scale resources up or down based on demand, ensuring optimal performance without the upfront hardware investment; to avoid that hardware investment, we can prefer Azure. The next is cost efficiency: Azure has a pay-as-you-go model, so organizations only pay for the resources they actually use, which ultimately reduces operational expenses. The third is global reach and availability: Azure has data centers across the world, so it provides high availability, lets applications run closer to the end users, and ultimately improves performance. After this I would mention security and compliance: Azure offers industry-leading security measures, including multi-layered protection, encryption, and compliance with global standards like GDPR and HIPAA, which is another reason many companies prefer it. And the next one is integration with Microsoft and open-source tools, which is one of the main reasons Azure is preferred: it works seamlessly with other Microsoft products like Office 365, Power BI, and SQL Server, as well as with open-source frameworks such as Python, Spark, and Kubernetes. These integrations are made easier with Azure, and that is one of the major reasons it is so widely adopted right now. Now we are going to create our ETL pipeline, so let's see which Azure services we are going to use.
The first one is Azure Data Lake Storage (ADLS). ADLS is a secure, highly scalable cloud storage service designed for big data and analytics workloads. Why do we need ADLS? It stores the raw, semi-processed, and processed data in the different layers we saw earlier in the Medallion architecture; that is the kind of data that gets stored in ADLS, so it basically gives us the storage. Apart from that, it supports a hierarchical namespace, which improves the performance of file operations: a hierarchical namespace allows you to create a folder, create subfolders inside it, and store files in those subfolders. ADLS gives us that facility and makes file operations easier. It also integrates seamlessly with Azure Databricks and the other Azure services. Azure Databricks matters here because in this ETL pipeline we are going to integrate ADLS with Databricks, which is why I call it out specifically, but ADLS is compatible with the other Azure services as well.
The next service is Azure Data Factory. This is a cloud-based ETL service used for data movement and transformation. Why do we need ADF? ADF will ingest the data from various sources; in our case it is HTTPS, but it could be SQL, other APIs, cloud storage, or an on-premises database. From these kinds of sources it helps us ingest data into ADLS. Apart from that, it can schedule and orchestrate workflows by automating the data pipeline. In our case we are fetching the data from one source, and with the copy activity in ADF we are moving the data from source to destination; we will do that with manual intervention, but if I wanted to run it without any manual intervention, we would need to schedule and orchestrate this workflow, and ADF has that facility too: orchestration and automation of the data pipeline. Apart from that, it integrates with Azure Databricks for the transformations and with Power BI for the visualization, and we are going to see how to connect Azure Databricks with Power BI. Next to that is Azure Databricks.
Azure Databricks is a big data analytics and machine learning platform based on Apache Spark and optimized for Azure. It provides high-performance distributed computing for large datasets, and it supports languages like Python, SQL, Scala, and R for data processing. It can process structured as well as unstructured data very efficiently, and it integrates seamlessly with ADLS, Azure Data Factory, and the other Azure services.
The next one we are going to use is Power BI. We all know Power BI is a business intelligence and data visualization tool that enables users to create interactive dashboards and reports. Why do we need it? It connects directly with Azure Databricks or ADLS for real-time insights, so it gives us a seamless integration, and it supports both DirectQuery and Import modes for data retrieval; that is the main purpose for which we are going to use Power BI. In our case the use case is to connect to the gold layer, fetch the data into Power BI to create dashboards, and visualize this data using KPIs, trend patterns, and business metrics. It also enables real-time reporting using the DirectQuery mode, which we are going to see in some time. These are the components we are going to use in our ETL pipeline.
Now let's learn more about the Azure management levels and hierarchy. Azure provides four levels of management: the first is the management group, followed by the subscription, then the resource group, and the last one is the resources. If we talk about the management group, it helps you manage access, policy, and compliance for multiple subscriptions; all the subscriptions in a particular management group automatically inherit the conditions applied to that management group. What is a subscription exactly? A subscription logically associates user accounts with the resources they create. Each subscription has limits, or quotas, on the amount of resources it can create or use, and organizations can use subscriptions to manage the costs and the resources created by a particular user, team, or project. Basically, in a company, a subscription is allocated to a particular team or user, so that whatever that user creates under that subscription gets billed to that subscription; it logically associates the user with the resources they have created. That is how the subscription level works. The next one is the resource group. A resource group is a logical container where you can deploy and manage Azure resources like virtual machines, web apps, databases, or storage accounts. You can think of it as a logical folder in which we store all our resources, and resources are nothing but the instances of the services you create in that resource group. In our case we have services like ADF, Databricks, Synapse (which we will use in another project), SQL Database, ADLS, or a virtual machine; these are all resources, instances of the services you can create, and we keep all of them together in one folder, which is the resource group. In one resource group I might keep SQL or ADLS; that depends on you. So what we'll do is this: we'll have a subscription (with a free account you get a subscription with some free credit, or with pay-as-you-go you create a new subscription), in that subscription we will create a resource group, and in that resource group we will create all the resources, or instances, that we are going to use to build our ETL pipeline. This is how the management levels and hierarchy work, and now we will implement them for our ETL pipeline: we'll see how to create the subscription, then the resource group, then the resources, and through that we will create our pipeline. So now let's move ahead.
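Just to make the nesting concrete, here is the same hierarchy written out as a small Python structure, using roughly the names this project uses; the management-group name is a placeholder, since we never create one explicitly in the video.

```python
# Azure management hierarchy for this project, as a nested structure.
hierarchy = {
    "management_group": {
        "name": "<my-management-group>",  # placeholder
        "subscriptions": [
            {
                "name": "Pay-As-You-Go",
                "resource_groups": [
                    {
                        "name": "RG_Prime_Work",
                        "resources": [
                            "adlsprimework (ADLS Gen2 storage account)",
                            "ADF_Prime_Work (Azure Data Factory)",
                            "Azure Databricks workspace",
                        ],
                    }
                ],
            }
        ],
    }
}
```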
Now let's start with our practical part. For that, just go to the Microsoft Azure portal; here you will get the option to sign in and get started with Azure. If you are a newcomer, just click on Get Started with Azure and then try Azure for free: you will get 200 US dollars of credit, and with that credit you can use the services for free for a month, which is a good option to opt for. Otherwise you can go for pay-as-you-go, where you only get billed, under that particular subscription, for the services you actually use. This is how the Microsoft Azure portal looks. These are the Azure services I have used recently (my recent work), and on the left-hand side, if you click on the three bars, you will get all the facilities like dashboards, all services, resource groups, SQL databases, virtual machines, and so on; if you don't see something, you can simply search for it here and you will get that particular service. The first thing in our management hierarchy was management groups, and after that, subscriptions. If I search for subscriptions, I get that service; click on it and you will see how many subscriptions you currently have, with an active or disabled status, and you can check the current cost each one is incurring. You'll see that I don't have any free subscription; I'm on pay-as-you-go, so I can see that subscription here with my current cost. If you would like to create a new subscription for billing purposes, just click on Add, give whatever subscription name you want, then enter the billing account and billing profile, that is, your card details, so that you get billed on that respective card, and click Review and Create. Right now I won't create any new subscription, I'll use the existing one, but this is how you can create a new subscription.
So yes, I would like to discard that. Now, once you are done with the subscription, the third step is to create the resource group. I would like to create one new resource group, so I'll come here and click on Create. By default it selects the subscription I have right now; if you have multiple subscriptions, just click on the drop-down and select whichever subscription you want. Right now I only have one, so I'll select that, and then you need to enter the resource group name you want to give to that respective resource group; for "RG Prime work" you'll see that it is available. Then for the region, we have many different regions, so I'll just select a nearby one, for example Central US or East US; let's just go with East US. Click on Next; we don't need tags, so Next again, and lastly it gives you a Review and Create page showing that under this subscription you are going to create this resource group in this region. It says the resource group was created, and now we can see it reflected here. This is my resource group, the third management level. Under this resource group we are now going to create our different Azure services, so it will act as a folder for the different kinds of services. The first service to create is our storage account, which is nothing but the ADLS account. For that I'll just click on Create, which navigates you to the marketplace, and here I need to search for "storage account".
This storage account service is what we need. You can see that the respective subscription and resource group have been allocated by default. If you wanted to create a new resource group at this point, you could click on Create New, assign a new resource group name, and click OK, and that new resource group would get created in the background; but we have already created our resource group first, and we are going to create our storage service inside it, so it has selected RG Prime work by default. If you would like to switch among your other existing resource groups, you can do that. So our resource group is RG Prime work. Then you need to give a storage account name; suppose I enter "ADLs". It gives me an error saying the name can contain only lowercase letters, so let's use "adlsprimework". It checks the name and shows that it is available. For the region I'll just keep the nearby one, East US. Then we have the primary service option; our primary service will be Azure Data Lake Storage, where we are going to store our data in three layers, so I'll select that primary service. Then we have the performance option.
In the performance option we have two choices, Standard and Premium. Standard performance is mainly suitable for general-purpose storage needs, such as storing large amounts of data that is accessed regularly. In terms of cost it is more cost-effective than Premium, which makes it suitable for budget-conscious scenarios where extreme performance is not a critical requirement, so purely from a costing perspective I would say we should go with Standard. Premium, on the other hand, is designed for high-performance workloads that demand low latency and high transaction rates, such as virtual machine disks, transactional databases, or big data systems; for those the Premium tier is very good, but in terms of cost it is generally higher than Standard, reflecting its superior performance capabilities. Our workload is not that heavy, so we'll just opt for the Standard option. Next, if we talk about redundancy, there are four types of redundancy; let's quickly see how these four redundant storage options work. Now let's talk about the Azure Storage redundancy offerings.
Azure Storage offers various redundancy options to ensure that your data remains available and durable even in the face of hardware failures, power outages, or natural disasters; we need our data to stay intact, and for that purpose Azure offers these redundancy options. These redundancy strategies replicate your data across different locations, providing protection against data loss and ensuring that business continuity is maintained. The key purposes of storage redundancy are as follows. First, data protection: redundancy safeguards your data from unexpected events such as hardware malfunctions or data center outages by maintaining multiple copies across various locations. Second, high availability: by replicating the data, redundancy ensures that your applications can access it with minimal downtime even during disruptions. Third, disaster recovery: in the event of a regional disaster, certain redundancy options protect you from data loss by storing the data in a secondary region, which facilitates swift recovery and continuity of operations. Selecting the appropriate redundancy option depends on your organization's specific needs, balancing factors like cost, data availability requirements, and disaster recovery objectives; based on those factors you need to select the redundancy offering that is best for your project.
The first option is locally redundant storage, LRS. LRS replicates your data three times within a single data center. We have a primary region, in our case East US; in that primary region we have a data center, and in that data center we have three copies of our data, all within the same data center. That is how LRS functions: it safeguards your data against hardware failures within the same data center. It is a good option, but the limitation is that it does not protect against a data-center-wide disaster like a fire or a flood; if that complete data center is disrupted, it is of no use. That rarely happens, but there is a possibility of it occurring. If your project has a very minimal budget, you can opt for LRS. The second option is zone-redundant storage, ZRS. ZRS replicates your data across three availability zones within a region: we have our East US primary region, in that region we have three different availability zones, and each has its own data center. In LRS the three copies were in one data center, but in ZRS the three copies are in different data centers, still within the same primary region. That is how ZRS functions. The next one is geo-redundant storage, GRS. GRS combines LRS within the primary region with asynchronous replication of the data to a secondary region: the data is synchronously replicated three times within a single storage cluster in the primary region, and then asynchronously replicated to a secondary, paired region. So East US is our primary region, where the data is replicated three times synchronously, and in addition the data is replicated three more times within a single data center in the secondary region. That is how GRS functions; it provides disaster recovery capability in case of a regional outage. The limitation of GRS is that failover to the secondary region requires manual intervention; without that, your data is served from your primary region. The fourth one is geo-zone-redundant storage, GZRS. GZRS combines ZRS within the primary region with asynchronous replication of the data to a secondary region: in our primary region, East US, the data is synchronously replicated three times across three different data centers (availability zones), and apart from that it is asynchronously replicated three more times into the secondary region. That is how GZRS functions; it offers enhanced disaster recovery and high availability. The limitation, again, is that failover to the secondary region requires manual intervention. GZRS provides the maximum protection for your data against disasters, but it is not cost-efficient; it requires a lot of cost, so the choice completely depends on your project requirements and the budget you are allowed to use. That is how these four redundancy options function. For cost-effectiveness I'll just go with locally redundant storage and click on Next.
Here we have one option called hierarchical namespace, or HNS. This is an optional feature in ADLS Gen2 that enhances the way your data is managed within the storage account: it introduces a file-system-like structure on top of Azure Blob Storage, making it more efficient for big data analytics and high-performance data operations. If we tick this option on the Storage V2 account, it basically turns the storage account into an Azure Data Lake Storage Gen2 account. Unlike traditional flat blob storage, HNS allows you to organize the data into directories and subdirectories: with normal blob storage you store your data in a flat file system, but with HNS enabled you can create directories, and inside a directory you can create subdirectories and store your files there. That is the hierarchical facility you enable by clicking this checkbox, and it converts the storage account into ADLS by offering this capability. With this feature it becomes easier to manage, structure, and navigate large datasets, just like a file system. Apart from that, HNS also enables access control lists, ACLs. We can apply these ACLs at the directory or file level, which offers fine-grained permissions; this is useful when multiple users or applications need access with different levels of permissions. We are going to need this ACL feature going forward, because we have Azure Databricks, and Databricks needs to access this ADLS account, specifically the folders where we have stored all our raw data. To access that data, Azure Databricks needs particular permissions, and we can grant those permissions using ACLs, the access control lists. So that is one of the main features we are going to need, and apart from that the sub-foldering capability is also important. We get these two main features by clicking on "Enable hierarchical namespace", so we are going to opt for that.
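To show where those permissions end up being used, here is one common way (sketched, not the only way, and not necessarily the exact method used later in the video) to let a Databricks notebook read an ADLS Gen2 account: register a service principal, grant it access to the containers (for example through ACLs or a role assignment), and set these Spark configs. All IDs and secrets below are placeholders.

```python
# Databricks notebook: configure access to ADLS Gen2 via a service principal.
account = "adlsprimework"

spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", "<client-secret>")
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)

# After this, abfss:// paths on the bronze/silver/gold containers are readable.
df = spark.read.option("header", True).csv(
    f"abfss://bronze@{account}.dfs.core.windows.net/amazon_prime_titles.csv"
)
```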
Next, coming here, we have one more feature: the access tier. Here we have three tiers, hot, cool, and cold. Azure Blob Storage offers these three tiers, and their main purpose is to help you manage storage costs based on how frequently you are going to access the data. We are creating this ADLS account, we are going to pull data into it and store it in directories, so I need to think about how frequently I will access this data; depending on that, I select which access tier is suitable for my storage account. The first one is the hot tier. It is designed for data that is accessed or modified frequently, so if you are going to access the data frequently you should opt for the hot tier. In terms of cost structure it has higher storage costs but lower access and transaction fees: storing the data costs more, but accessing it again and again does not cost that much. Its use cases are active datasets, real-time analytics, or data staging before processing, so most of the time the hot tier is beneficial. The next one is cool. The cool tier is suited for data that is infrequently accessed or modified; if we are not going to use the data frequently, we can opt for the cool tier. Its cost structure is the opposite: lower storage cost, but higher access and transaction fees, so storing is cheap but accessing the data costs more. Then we have the third one, the cold tier. It is ideal for data that is rarely accessed but still requires quick retrieval whenever needed. Essentially you are archiving your data: you will use it rarely, but when you do need it, it should not take a long time to access. For that purpose you can go for the cold tier. Its cost structure is even lower storage cost than the cool tier, but with higher access fees. The use case would be long-term backups or datasets you need to retain for compliance. There is one more point here, the retention requirement: data in the cold tier should stay in that tier for at least 90 days to avoid early deletion charges, and data in the cool tier should remain in that tier for at least 30 days to avoid early deletion charges. This retention requirement is important for both the cool and cold tiers: 30 days for cool, 90 days for cold. That is how the access tiers work.
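Note that the tier chosen here only sets the account default; individual blobs can be moved between tiers later, for example with the azure-storage-blob SDK. A small sketch, with placeholder account name, key, and paths:

```python
# Move a single, infrequently used file to the cool tier after the fact.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://adlsprimework.blob.core.windows.net",
    credential="<storage-account-key>",
)

blob = service.get_blob_client(container="bronze", blob="amazon_prime_titles.csv")
blob.set_standard_blob_tier("Cool")  # per-blob tier change, independent of the account default
```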
Right now we are going to opt for the hot access tier, because we are going to access the data frequently, and just click on Next. The following pages cover private endpoints, some additional features, and routing preferences, which we don't need for our ADLS creation, so click Next through them. Then it validates all your settings, we click on Create, and it will now create our storage account, our ADLS, inside the assigned resource group. It takes some time; it shows that the deployment is in progress, so we'll wait until the deployment succeeds. Okay, now our ADLS deployment is complete, so click on Go To Resource. This is how our ADLS looks. If we talk about the data storage options here, we can see four of them, so let's quickly see one by one how each works.
The first is blob storage, which here appears as containers. We can use a container to store large amounts of unstructured or structured data, like CSV files, documents, images, or videos. The data is organized into containers, which are similar to folders, and these containers hold the files; inside a container we can create folders so that everything is stored in an organized way. It is ideal for content storage and serving, backups, or streaming media. The next one is file shares. This provides fully managed file shares which are accessible via the SMB protocol (SMB is Server Message Block). These file shares can be mounted concurrently by cloud or on-premises deployments, which makes them suitable for migrating legacy applications to the cloud; if you want to migrate from a legacy setup to the cloud, file shares are the most useful. Then we have queues, which is queue storage. It enables asynchronous communication between application components by storing large numbers of messages. The messages are stored in queues and can be accessed from anywhere over HTTP or HTTPS. The use cases are building flexible applications, load leveling during high-traffic periods, or decoupling application components to improve scalability and resilience. The last option is tables. Table storage provides a NoSQL key-value store for rapid development; the key-value pair is important here, and using this key-value store we can keep massive semi-structured datasets in tables. It stores the data in tables with a schemaless design, which allows easy adaptation as application needs evolve. It is appropriate for storing structured, non-relational data such as user information, device data, or other metadata. That is a quick overview of these four types of storage; for our purposes we are going to opt for the containers.
Once you come here, you can click and create a container. We are going to follow the Medallion architecture, so we need a bronze layer, a silver layer, and a gold layer; let's just create three containers. I'll create "silver" first (okay, the first should have been bronze, but that's fine), then "bronze", and the third one will be "gold".
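In the video the containers are created through the portal; just as a sketch, the same three containers could also be created with the azure-storage-file-datalake SDK (credentials below are placeholders).

```python
# Create one container (file system) per Medallion layer.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://adlsprimework.dfs.core.windows.net",
    credential="<storage-account-key>",
)

for layer in ["bronze", "silver", "gold"]:
    service.create_file_system(layer)
```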
Right now, what we are going to do is fetch our data from the GitHub account; that will be our raw data, and we will store that raw data into our bronze layer. For that we will use Azure Data Factory, and with Azure Data Factory we will land the data here. Then we will fetch this data into Azure Databricks, perform some transformations, and store the result into the silver layer. Once we are done with those transformations, again with Azure Databricks we will fetch the data, perform some advanced transformations, and store that business-ready data into the gold layer. This is how we are going to use these three layers, and now we'll see how to get the data into these different layers. The first step is done: we have finished the ADLS creation, so we have our ADLS ready with a bronze container in it. Now we will create an Azure Data Factory and pull the data into the bronze layer; that is our second step. So now let's go and create the Azure Data Factory. I go into my resource group and again click on Create.
It navigates me to the marketplace, and here we search for Data Factory and create it. It has assigned the respective subscription and resource group, so now we need to give a name to our data factory; I'll just give something like "ADF Prime work". I'll keep the region as East US, and the version is V2, then Next. The git configuration, networking, advanced, and tags pages we don't need to configure right now, so we'll just check what is required, Basics, subscription, resource group, version, and region, and then click on Create. It states that deployment is in progress, so it will take some time to create. Okay, now it has been created, so we click on Go To Resource and refresh. Under the resource group we can now see our Azure Data Lake Storage, and along with it our Azure Data Factory too. We'll click on that and launch the studio. This is our Azure Data Factory, in which we will do the data movement using the copy activity. If you look on the left-hand side, we have the tabs Home, Author, Monitor, Manage, and Learning Center. In the Learning Center you get videos and tutorials you can navigate through; if you want to learn more, you can go there any time. Then we have the Manage tab, where we can create linked services and triggers, configure git, and also create integration runtimes. In the Monitor tab you have all the runs: if you debug or run anything, for example in our case when we run the pipeline, we can check its status by coming here, and the same goes for triggers; if you have set any trigger, you can keep a watch on it from here. And then we have the Author tab. In the Author tab we have the factory resources: pipelines, change data capture (which is in preview mode right now), datasets, data flows, and Power Query; all these options are available here. Now we are going to create our pipeline under the Author tab. For that, first we will look at the basics of how the copy activity functions and what the requirements are, and then we will try to implement that copy activity.
This is the overview of Azure Data Factory, under which we are going to perform a copy activity using a pipeline. In Azure Data Factory, orchestrating data movement from the source to the destination involves several interconnected components that work together to make the transfer efficient. In our case the source is the GitHub account and the destination is the bronze layer in ADLS. The GitHub account is the source: it acts as the origin of the data we want to process, which here is a repository containing the data file we want to pull into Azure Data Factory. Then we have linked services. If you look at the diagram, we have two linked services, one on the source side and one on the sink side. A linked service establishes the connection between ADF and a data store, so the linked service on the source side establishes the connection between GitHub and Azure Data Factory. It provides Azure Data Factory with the necessary credentials, and along with the credentials it gives the connection details needed to access the data in GitHub.
access this data in the GitHub then we have data set again we have two data set
have data set again we have two data set one on the source side one on the syn
one on the source side one on the syn side so source side data set it defines
side so source side data set it defines the structure and the location of the
the structure and the location of the data within the GitHub the purpose if
data within the GitHub the purpose if you see the purpose is to
you see the purpose is to specify which data in the GitHub account
specify which data in the GitHub account that should be accessed by the ADF
that should be accessed by the ADF including the file path and the data
including the file path and the data format so for that we have data set on
format so for that we have data set on the source side then we have copy
the source side then we have copy activity so once we have pulled this
activity so once we have pulled this data from the GitHub using this linked
data from the GitHub using this linked service and data set then this copy
service and data set then this copy activity will transfer or will done the
activity will transfer or will done the data movement from the source to the
data movement from the source to the destination so now with this data set or
destination so now with this data set or with this copy activity we have pulled
with this copy activity we have pulled the data now we want to store this data
the data now we want to store this data to our sync which is our destination so
to our sync which is our destination so for that also we have again data set at
for that also we have again data set at the sync and the linked service at the
the sync and the linked service at the sync so this data set it will Define the
sync so this data set it will Define the structure and the location of the data
structure and the location of the data destination data store now this data set
destination data store now this data set at the syn is responsible for performing
at the syn is responsible for performing the activities which are related to the
the activities which are related to the sync the first one was responsible for
sync the first one was responsible for the source now this one is responsible
the source now this one is responsible for the sync this goes the same for the
for the sync this goes the same for the link service also the link service on
link service also the link service on the sync side it will establish the
the sync side it will establish the connection between the a aure data
connection between the a aure data Factory and your destination data source
Factory and your destination data source so this link service will serve that
so this link service will serve that purpose for the sync side and the think
purpose for the sync side and the think is it's quite obvious that it's the
is it's quite obvious that it's the final destination where the data will
final destination where the data will get stored after this copy activity so
get stored after this copy activity so in our case it's the folder which which
in our case it's the folder which which is which we have created for the bronze
is which we have created for the bronze layer into Azure data L Storage services
So this is the overview of the Azure Data Factory activity we are going to perform. Now let's see how to implement this copy activity in Azure Data Factory. At this point we are clear that we need two linked services, one to establish the connection with the source and another for the destination, and along with these two linked services we need two datasets, again one for the source and one for the sink. So we need two linked services, two datasets, and the copy activity; we will interlink those components and then pull the data from the source to the sink. Now let's see how to implement this practically.
Now I'm in Azure Data Factory and we need to create a new pipeline. For that I'll go to the Author tab, click the three dots next to Pipelines, and click New pipeline. Once you click that, you need to give your pipeline a name, so let's say I give the name PrimeCopyData, and then I'll close the properties pane. Under a pipeline we have many activities: Move and transform, Synapse, Azure Data Explorer, Data Lake Analytics, Iteration and conditionals, and so on — there are a lot of activities you can perform. Right now we'll start with the basics: under Move and transform we have Copy data, and we'll simply drag this Copy data activity onto the canvas. This will be our copy activity, but we know that for our work we need two linked services and two datasets to establish the connections with the source and the destination. So first we'll create the two linked services. For that, go to the Manage tab, where we have Linked services, and click New. The first linked service needs to connect to our GitHub account — this is our GitHub account, and this is our data.
data so you when you will click on this RAW
so you when you will click on this RAW button then you will get this main data
button then you will get this main data okay and this will be our link from
okay and this will be our link from where we need to pull the data so this
where we need to pull the data so this data we can see is with https so we need
data we can see is with https so we need to establish a linked service with
to establish a linked service with HTTP so here come here
HTTP so here come here then we have link service for HTTP click
then we have link service for HTTP click on that continue
then you have to give a name let's say suppose I'll give
name let's say suppose I'll give HTTP Prime and LS for the link
HTTP Prime and LS for the link service then you have to I'll just take
service then you have to I'll just take Auto resolve integration runtime then
Auto resolve integration runtime then you have to give the best URL so for the
you have to give the best URL so for the best URL right now I'll just go with the
best URL right now I'll just go with the static pipeline creation and I'll just
static pipeline creation and I'll just copy this URL as it is and then I'll
copy this URL as it is and then I'll just paste that URL here so once you
just paste that URL here so once you will that do that come down and then the
will that do that come down and then the authentication type authentication type
authentication type authentication type I'll just select Anonymous and once you
I'll just select Anonymous and once you are done with inputting the required
are done with inputting the required information just click on test
information just click on test connection so it states that our
connection so it states that our connection is successful then I'll just
connection is successful then I'll just click on
click on create so this is a linked service which
create so this is a linked service which we need for the source side now we need
we need for the source side now we need another link service for the destination
another link service for the destination side so now again click on
side so now again click on new here our destination is azure data L
new here our destination is azure data L Gen 2 so we need to select data link
Gen 2 so we need to select data link storage Gen 2 click on continue then
storage Gen 2 click on continue then give the data set uh give the link
give the data set uh give the link service name here suppose I'll give
service name here suppose I'll give ADLs Prime
account key is fine subscription I'll select my respective
select my respective subscription then ADLs Prime
subscription then ADLs Prime work and then just test the
work and then just test the connection connection is successful
connection connection is successful click on
click on create now we are done with two linked
Now we're done creating the two linked services. Come back to the Author tab, where we have our copy activity, and let's give it a proper name — for example PrimeCopyData. Then go to the Source tab. We're done with the two linked services; now we need to create the two datasets, one for the source and one for the destination. The source is HTTP and the destination is our ADLS. We can create the source dataset right here: click New, our source is HTTP, select that, and click Continue. Under HTTP we know our file is in CSV format, so we select DelimitedText, which covers CSV, and click Continue. Then you need to give the dataset a name — for example HTTPPrimeDS, with DS for dataset — and select the linked service we just created, HTTPPrimeLS. For the relative URL: when we created the linked service we gave the full base URL, because we wanted a static pipeline, so there's no need to give any relative URL here. First row as header is true, and we don't need to import any schema, so select None. Now the source dataset is ready. Go to the Sink tab and again create a sink dataset for our ADLS: click New, our ADLS is Gen2, so choose Azure Data Lake Storage Gen2 and Continue. I want to save my file in CSV format, so I'll keep the format as CSV and click Continue. Then give the dataset a name, maybe ADLSPrimeDS, and select our newly created linked service. For the file path we need to point at the right folder, because we created three folders — bronze, silver, and gold — and you can see them here. Right now I just want to save all the raw data into the bronze layer, so select the bronze folder and click OK. When you save the data into the bronze folder you should also give the file a name, otherwise it will pick a system-generated file name by default. If you want your own file name you can give it — for example I'll reuse the original name, Amazon Prime titles, and for simplicity call the file AmazonPrimeTitles.csv. For import schema select None and then click OK.
So we're done with our Data Factory components and their interlinking: two linked services, two datasets, and the copy activity. Now let's see whether it works. First I'll validate everything I've done — it states that the pipeline has been validated — and then I'll click Debug. Once you click Debug it will show that the activity is queued, then the status changes to In progress, and once it's done it changes to Succeeded; it takes a little time. Now it has succeeded, and if we check the output it states how much data it has read, the duration, and the throughput. Now it's time to check whether our data has actually been copied. For that, go back to your ADLS account, open the container, and go into bronze. We can see we have one file here in CSV format. Let's check whether we're getting the expected output: click on it, then Edit — we're not going to edit anything, we just want to see whether the output is there — and we can see the data. If I click Preview we can see show ID, type, title, director, and so on, so we can say that Azure Data Factory has successfully copied our data from the source to Azure Data Lake Storage. Our first part is done.
services so now our first part is done now we will try to create AO data brakes
now we will try to create AO data brakes resource but before that we'll just come
resource but before that we'll just come back to this AO data Factory and the
back to this AO data Factory and the main important thing is that we need to
main important thing is that we need to publish all of our work otherwise if
publish all of our work otherwise if you'll not publish it then if once you
you'll not publish it then if once you will close this AO data Factory all of
will close this AO data Factory all of your work will be gone and you will not
your work will be gone and you will not be able to recover it so for that to be
be able to recover it so for that to be on safer side it's a good practice to
on safer side it's a good practice to always publish whatever you have done so
always publish whatever you have done so I'll just click on
publish now it states that it is deploying our changes which we have
deploying our changes which we have created it will take few seconds to
created it will take few seconds to publish it so now it says that
publish it so now it says that publishing completed so like again going
publishing completed so like again going forward if I'll come back to this Edo
forward if I'll come back to this Edo data Factory then my all work will be
data Factory then my all work will be here and there is no need to rework
here and there is no need to rework anything so now we are done with the
anything so now we are done with the Azure data Factory now we will create
Azure data Factory now we will create Azure data breakes so for that
Azure data breakes so for that again we will come to our Resource Group
again we will come to our Resource Group and there we will create this aure data
and there we will create this aure data Factory or sorry data brakes so one is
Factory or sorry data brakes so one is done then second data Factory is done
done then second data Factory is done now third again we'll create and go to
now third again we'll create and go to the
the marketplace then data
brakes click on this create data braks it is giving me proper
braks it is giving me proper subscription under that Resource Group
subscription under that Resource Group if you want to if you want like it's a
if you want to if you want like it's a best practice to use the same Resource
best practice to use the same Resource Group so we are going to keep the same
Group so we are going to keep the same Resource Group then we need to assign
Resource Group then we need to assign one workspace name or data break name so
one workspace name or data break name so what we can do for
what we can do for example let's say suppose aure data
example let's say suppose aure data freak ADB
freak ADB Prime
Prime work this name is available so we can
work this name is available so we can opt for this region East us I'll keep
opt for this region East us I'll keep the same because that's the nearest one
the same because that's the nearest one pricing tier we have premium standard
pricing tier we have premium standard and the trial so the best optimum
and the trial so the best optimum solution is the trial because you'll get
solution is the trial because you'll get 14 days free and it it will not charge
14 days free and it it will not charge you for anything for the 14 days so I'll
you for anything for the 14 days so I'll just select that then we have this
just select that then we have this option of managed Resource Group so what
option of managed Resource Group so what is it
is it exactly so this manage Resource Group is
exactly so this manage Resource Group is that when we will create this Azure data
that when we will create this Azure data bricks workspace right then Azure will
bricks workspace right then Azure will automatically generate
automatically generate some manage resource groups so right now
some manage resource groups so right now we know that this RG Prime work which we
we know that this RG Prime work which we have created it right but apart from
have created it right but apart from that in the back end when we are going
that in the back end when we are going to create this data brakes system is
to create this data brakes system is going to
going to generate different Resource Group in the
generate different Resource Group in the back end so these are system
back end so these are system generated which are created for this
generated which are created for this databas infrastructure resources so for
databas infrastructure resources so for that if you'll not give any name here
that if you'll not give any name here then it will automatically assign one
then it will automatically assign one name name which will be created by Azure
name name which will be created by Azure Azure but if you want to give your dat
Azure but if you want to give your dat like if you if you want to give that
like if you if you want to give that folder name like your any customized
folder name like your any customized folder name then you can give that and
folder name then you can give that and then datab briak will store those
then datab briak will store those backend resource groups which are system
backend resource groups which are system generated into that folder name so if
generated into that folder name so if you'll not give it then it is also fine
you'll not give it then it is also fine if you'll give it then it is also fine
if you'll give it then it is also fine so I'll just give that name and I'll
so I'll just give that name and I'll move ahead for this I'll give RG maybe
move ahead for this I'll give RG maybe Ma data
system this things available then next networking we don't
available then next networking we don't need to do any changes for the
need to do any changes for the networking then encryption security and
networking then encryption security and compliance tags and we are here with
compliance tags and we are here with review plus create now we will review it
review plus create now we will review it and then we will try to create it
and then we will try to create it so now let's try to understand
so now let's try to understand that it is going to take few seconds to
that it is going to take few seconds to create it right now deployment is in
create it right now deployment is in progress now let's just wait so that all
progress now let's just wait so that all that Azure will create these resources
that Azure will create these resources and backend system generated resource
and backend system generated resource groups and everything so it will take
groups and everything so it will take few seconds for
few seconds for that okay so now our deployment is
that okay so now our deployment is completed click on go to
completed click on go to Resource this is our data brakes we'll
Resource this is our data brakes we'll click on launch
click on launch workspace select your
So this is our Databricks platform. On the left-hand side we have Workspace, Catalog, and Compute — we will need to create a cluster under Compute — and then the SQL editor, queries, and so on. For now we'll only work with the pieces we need for this ETL pipeline; in another video I'll cover all of the components available in Azure Databricks. So we have our Databricks workspace ready, and this is our ADLS account, under which we have the bronze layer where we stored our raw data. Now we need to fetch that bronze-layer data into Azure Databricks. So how are we going to do that?
To fetch this data from the bronze layer, which lives in a storage account, Azure Databricks needs to have the right permissions. So first we need to grant those permissions to Azure Databricks, and for that we need to create an application — a service principal. Once we've created that application, we'll grant it access to read and write the data, and then we'll use that application from Databricks. To create the application we just go to Microsoft Entra ID, register the service application there, and then grant that application the rights to read and write the data. This way Azure Databricks will be able to read the data from ADLS, and once it has transformed the data it will push it into the silver layer and the gold layer. So now we'll go to Microsoft Entra ID.
layer so for that we'll go to now Microsoft intra ID for that you need to
Microsoft intra ID for that you need to again come back to
your Microsoft aure account we'll search
account we'll search for Microsoft intra ID so this is our
for Microsoft intra ID so this is our Microsoft intra ID here
Microsoft intra ID here just check the option of app
just check the option of app registration it is under manage click on
registration it is under manage click on app
app registration
registration then new
then new registration now give the applic uh give
registration now give the applic uh give your uh application name here for
your uh application name here for example if I'll
example if I'll give Prime
you can give any name whichever you want then we are fine with the single tenant
then we are fine with the single tenant and then click on
and then click on register so now our application has been
register so now our application has been registered then afterwards we need to
registered then afterwards we need to assign all the essential roles to our
assign all the essential roles to our application to read and write the data
application to read and write the data so now we have created the application
so now we have created the application The Next Step would be to give the
The Next Step would be to give the essential access for those applications
essential access for those applications for
for that come to your ADLs
account you have am Access Control here click on that then you need to
click on that then you need to add Ro
assignment here search for
for storage blob data contributor select
storage blob data contributor select that and then click on
next then User Group service principal select
members then search for your application which we created just
your application which we created just now and that was Prime service
now and that was Prime service application right select that it will
application right select that it will come under selected members click on
come under selected members click on select and once you will done
select and once you will done that review and
Now it will add that role assignment here. What happens is that once the role is assigned, it takes some time for the access to be reflected in Databricks. So if you go to Databricks right away, try to access the storage, and it doesn't work, don't worry — it usually takes a while for those roles to apply to the application, so give it some time and then come back to Databricks and try the data cleaning and transformation activities again. The next step is to create a compute cluster. In Databricks, go to the Compute tab and click Create compute. The policy will be Unrestricted; I want a single-node cluster; the default runtime is fine; terminate after 20 minutes of inactivity; for the node type I'll pick Standard_D3_v2, which has 14 GB of memory and 4 cores; and then click Create compute.
While our cluster is being created, let's gather all the information we'll need when connecting Databricks to ADLS. So go to App registrations and select the app you just created under Microsoft Entra ID. From here we're going to need the application (client) ID and the directory (tenant) ID, so let's copy that information somewhere. Then we'll need a secret, which we'll create now: under Manage, go to Certificates & secrets, then click New client secret, give it a description if you want, and click Add. From this client secret we need the Value field, so copy that — it will be our secret. So now we have the client ID, the tenant ID, and the secret: the three values we're going to require.
Now let's see how to connect Databricks to ADLS Gen2. I'm searching for how to connect Databricks to ADLS Gen2, and you'll find a tutorial from Microsoft on connecting to Azure Data Lake Storage Gen2 and Blob Storage. That page has the configuration code we're going to need: we'll replace the placeholders with our own values — the storage account name, the client ID, the secret, and the tenant ID — and then we'll see whether we can fetch the data into Databricks. OK, our cluster is ready now.
Come to the Workspace. It's always good practice to create a folder for your work — I'll call it Prime work — and inside it I'll create a Python notebook. This is our notebook; first we connect it to our newly created cluster — it's connected — and I'll rename the notebook to Prime silver layer transformation, though you can call it anything. These are our cells.
From here you can change the default language for the whole notebook — it can be Python, SQL, Scala, or R — and you can also change the language for an individual cell. If you want to run just that cell, click Run cell; if you want to run the whole notebook, click Run all. So now let's see whether we can connect Databricks to ADLS.
Let's use the Python code from the tutorial in Databricks. I'll copy the code and paste it into a cell; I don't need the first line, so I'll remove it. Then we have the storage account placeholder, so we need to put our storage account name — adlsprimework — everywhere the storage account appears; keep in mind you have to replace it in multiple places. I hope I've set it everywhere. Next we need to give the application ID — this is our application (client) ID, copy and paste it — then the directory ID, which is our tenant (directory) ID, and then the service credential, which is our secret value. Now let's run this cell and see whether it works. It looks fine — the cell has run successfully.
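For reference, the configuration cell from that Microsoft tutorial ends up looking roughly like this once the placeholders are filled in — a minimal sketch, assuming the storage account name adlsprimework and plain-text credentials (in a real project you would normally pull the secret from a Databricks secret scope rather than hard-coding it):

```python
# Sketch of the service-principal (OAuth) configuration for ADLS Gen2.
# The storage account name, IDs and secret below are placeholders / assumptions.
storage_account = "adlsprimework"            # ADLS Gen2 account used in this project
client_id       = "<application-client-id>"  # from the app registration overview
tenant_id       = "<directory-tenant-id>"    # from the app registration overview
client_secret   = "<client-secret-value>"    # the Value copied from Certificates & secrets

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Quick sanity check once the configuration is in place:
# list the bronze container over the abfss:// path.
display(dbutils.fs.ls(f"abfss://bronze@{storage_account}.dfs.core.windows.net/"))
```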
Then I just need to access the folders. For that I'll use dbutils.fs.ls and, in quotes, the path to our bronze layer. Let's go to the container and grab the information we need — we just need the bronze container name. So the path is abfss://, then bronze, then @ our ADLS account name, which was adlsprimework, then .dfs.core.windows.net, and a trailing slash. Let's see whether it works. OK — it's able to access the bronze folder in adlsprimework, and it shows one file named AmazonPrimeTitles.csv, which means it can reach that file. So now let's
so now let's see if we are able to read the data or
see if we are able to read the data or not
not so it's always a good practice
so it's always a good practice so for data reading or data
so for data reading or data understanding first let's just try to
understanding first let's just try to import all the P spark functions and the
import all the P spark functions and the types so I usually prefer to import
types so I usually prefer to import these libraries first so that it will
these libraries first so that it will not bother us
not bother us afterward so I'll try to import all the
afterward so I'll try to import all the functions and then all the types
That's done. Now we'll try to read the data and understand what's going on, so I'll add a markdown cell for data understanding. We need to read our data, which is in CSV format, so first we'll create a DataFrame — for example df_silver — equal to spark.read. The editor suggested some code by default, but let's understand it anyway: our data is in .csv format, so it's spark.read.format with csv, then the header option, because I want the header row as column names — otherwise it would come in as the first data row — so we set header to true. Then we set inferSchema to true so it takes whatever schema types the data has, and then we load the data, giving the path we want to read from. We know we want this data from the bronze layer in adlsprimework, but the file name was different — Amazon Prime titles — so I need to change the file name in the path. Then we just display the DataFrame. OK, that failed because we had .csv in the path twice; we'll try it again. Now we can see our data as a table: show ID, type, title, director, cast, and so on.
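Put together, the read cell looks something like this — a sketch assuming the same storage account and the file name AmazonPrimeTitles.csv used earlier (adjust the path to whatever you actually saved in the bronze folder):

```python
# Imports used by the transformation cells further down.
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Read the raw CSV from the bronze layer into a DataFrame.
# header=True keeps the first row as column names; inferSchema=True lets Spark guess the types.
df_silver = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("abfss://bronze@adlsprimework.dfs.core.windows.net/AmazonPrimeTitles.csv")
)

display(df_silver)        # preview the table in the notebook
df_silver.printSchema()   # column names with their inferred data types
```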
table show ID type title director cast and so on so now now from here also we
and so on so now now from here also we can see that this column show ID has a
can see that this column show ID has a type of string then type has string then
type of string then type has string then director has text string
director has text string format so this has happened because we
format so this has happened because we have turned on the info schema as true
have turned on the info schema as true so now let's see like after reading this
so now let's see like after reading this data it is important to check the schema
data it is important to check the schema to understand the structure of data so
to understand the structure of data so basically this will help us to identify
basically this will help us to identify the data types for each column such as
the data types for each column such as string integer and Float so if you want
string integer and Float so if you want to check the schema then what we can do
to check the schema then what we can do is like we have DF do silver dfdore
is like we have DF do silver dfdore silver and then we'll just say print
silver and then we'll just say print schema so this is our schema so here we
schema so this is our schema so here we will get the clear picture of like how
will get the clear picture of like how many column names which we have how many
many column names which we have how many columns we have then what are the names
columns we have then what are the names of those respective columns then what's
of those respective columns then what's what is the type of that column so that
what is the type of that column so that complete information which we will get
complete information which we will get using this do print schema so this is
using this do print schema so this is like this print schema is the method
like this print schema is the method which print the schema of the data frame
which print the schema of the data frame by showing the column names along with
by showing the column names along with their respective data types okay so this
their respective data types okay so this is how like we can read the data or
is how like we can read the data or understand the data then now next what
understand the data then now next what we can do we can try to clean the data
we can do we can try to clean the data so the next step would
so the next step would be data
cleaning so in the data cleaning first we can see how we can handle this
we can see how we can handle this missing values so for example
missing values so for example uh let's try to display this data frame
rating we can see that we have so many null values and along with the rating if
null values and along with the rating if we'll just check the country column in
we'll just check the country column in the country column also we have null
the country column also we have null values so for example suppose if I want
values so for example suppose if I want to replace the null values from the
to replace the null values from the rating column as unrated and I want to
rating column as unrated and I want to change the countries which has null
change the countries which has null values to the unknown so for that what
values to the unknown so for that what we can do we can say that DF do dfdore
we can do we can say that DF do dfdore Sil which is our data
Sil which is our data frame do
frame do na do
na do fill and then I need to give two columns
fill and then I need to give two columns right so for the two columns I'll need
right so for the two columns I'll need to give the curly bres and the first one
to give the curly bres and the first one would be rating so rating I would like
would be rating so rating I would like to change the null values with for
to change the null values with for example
example unrated and with this rating column
unrated and with this rating column there is one more column which is
country and Country column name I would like U null values I would like to
like U null values I would like to replace it with unknown and once we will
replace it with unknown and once we will do this transformation or we will do
do this transformation or we will do this data cleaning then assign that
this data cleaning then assign that modified data frame to our existing data
modified data frame to our existing data frame so that's why I'll do DF silver is
frame so that's why I'll do DF silver is equal to and then our condition and then
equal to and then our condition and then dot
dot display so now let's see if it is
display so now let's see if it is working or
okay so our country null values are replaced with unknown and then rating
replaced with unknown and then rating these are replaced with
these are replaced with unrated so in the same way like along
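That fill step, as a sketch (the column names follow this dataset, and the replacement values are just the ones chosen here):

```python
# Replace nulls in two specific columns and keep the result on df_silver:
# rating -> "unrated", country -> "unknown".
df_silver = df_silver.na.fill({"rating": "unrated", "country": "unknown"})
display(df_silver)
```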
In the same way we can also remove duplicate rows. For example, suppose you have multiple records of the same entry and you'd like to drop the duplicates: take our DataFrame, df_silver, call dropDuplicates, then again assign the result back to the existing DataFrame and display it.
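A sketch of that deduplication step (dropDuplicates with no arguments compares entire rows; you could also pass a subset such as ["show_id"] if you only want one row per show):

```python
# Drop rows that are exact duplicates across all columns, then keep the result.
df_silver = df_silver.dropDuplicates()
display(df_silver)
```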
Now we can see there are still a lot of null values for date added and for director, so we need to handle those values too. Let's try to fill the null values for the other columns as well — it has suggested some code: for date added I'll fill with 01-01-2025, for release year maybe 2020, duration as unknown is fine, description as unknown, and listed in as unknown. Let's see how it looks now. OK, we had forgotten some column names — director, and maybe the cast. Now let's say we instead want to remove the rows that contain any null value. For that we take our DataFrame, then .dropna, and pass the string "any" — "any" means it removes a row if that row contains a null value in any column. Let's see how it behaves: it seems like we still have a fair amount of data.
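A sketch of those two cells, assuming the column names from this dataset (director, cast, date_added, release_year, duration, listed_in, description) and the placeholder defaults chosen above:

```python
# Fill the remaining nullable columns with placeholder values.
df_silver = df_silver.na.fill({
    "director": "unknown",
    "cast": "unknown",
    "date_added": "01-01-2025",   # assumes date_added was read as a string
    "release_year": 2020,
    "duration": "unknown",
    "listed_in": "unknown",
    "description": "unknown",
})

# Alternatively, drop any row that still has a null in at least one column
# ("all" would only drop rows where every column is null).
df_silver = df_silver.dropna("any")
display(df_silver)
```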
Fine. Then, for example, you might want to rename a column for better readability. Let's say we have the column named title and I'd like to change it to content title. What I can do is take df_silver, then use the withColumnRenamed function, giving it the existing column name, title, and the name I'd like to change it to, content title, and then display the new, modified DataFrame. OK, it has been changed successfully.
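That rename, as a sketch (using content_title as the new column name):

```python
# Rename "title" to "content_title"; all other columns are untouched.
df_silver = df_silver.withColumnRenamed("title", "content_title")
display(df_silver)
```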
Now let's say I'd like to do one data transformation here, and for this I'll use the country column. The transformation is: if the entry in the country column is an actual country, I'll assign the value 1, and if the country value is unknown, I'll assign 0, in a newly created column. For that we take our DataFrame and use the withColumn function; we need to give the new column a name — maybe something like has country — and then the condition we just described: when our country column equals unknown, assign the value 0, otherwise assign 1. I want to add this new column to my existing DataFrame, so I'll assign the result back to the existing DataFrame and then display it. OK, what's it complaining about? We had written col twice — fixed. Now we can see that wherever the country column has the value unknown it gives us 0, otherwise it gives us 1.
us the value as one so now let's say suppose U I have transformed our data
suppose U I have transformed our data and now I would like to store that data
and now I would like to store that data into parket format and into our silver
into parket format and into our silver layer so basically what we need to do is
layer so basically what we need to do is like
here our data frame is DF do DF silver then we need to write our data
then we need to write our data frame in the format of
so right now I just want this into a pocket
format then mode I would like to give the mode as
the mode as aend and then I would like
aend and then I would like to save this data but before save I need
to save this data but before save I need to
to give file path here
give file path here so option
be AB FSS and now it should be into silver layer so I'll just change
into silver layer so I'll just change this to Silver and then
what should be the file name I would like to give
like to give this file name and I'll just add silver
this file name and I'll just add silver into it Amazon Prime
into it Amazon Prime titles maybe silver layer okay
titles maybe silver layer okay CSV then this would be our
CSV then this would be our path and once we are done with that path
path and once we are done with that path just save
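Put together, the write cell looks roughly like this — a sketch assuming the same storage account and an output folder name of AmazonPrimeTitlesSilver (Spark writes a folder of part files at this path, not a single file); the string passed to .mode() is one of the four modes discussed next:

```python
# Write the cleaned DataFrame to the silver container as Parquet.
(
    df_silver.write
    .format("parquet")
    .mode("append")   # or "overwrite" / "errorifexists" / "ignore"
    .option("path", "abfss://silver@adlsprimework.dfs.core.windows.net/AmazonPrimeTitlesSilver")
    .save()
)
```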
So we're using append mode here, and in total there are four write modes. The first is append. With append, say this is our ADLS and this is our silver layer, we already have files stored there, and now we're trying to store this Amazon Prime titles silver output: if a file with the same name already exists in the silver layer, append simply adds the new files alongside it. It doesn't erase the existing file — it keeps adding new files into the silver layer. That's append mode. The next mode is overwrite. With overwrite, if a file already exists in the silver layer and we try to store a new one with the same name, overwrite replaces the existing file with the new one: the previous file is gone, you won't be able to recover it, and you're left with just the new file. That's how overwrite mode works.
is how this overwrite mode works then we have error mode and ignore mode so with
have error mode and ignore mode so with the error mode again what
the error mode again what happens
if this right mode will try to store this same file into our ADLs and it sees
this same file into our ADLs and it sees that okay we have already existing file
that okay we have already existing file with the same name then with the error
with the same name then with the error mode what it will do it will not save
mode what it will do it will not save your new file into the ADLs
your new file into the ADLs folder and it will throw you some error
folder and it will throw you some error setting like maybe you already have file
setting like maybe you already have file there or something so it will basically
there or something so it will basically throw the error and it will not save
throw the error and it will not save your new work into the existing ADLs
your new work into the existing ADLs work so this is how
work so this is how this error mode will function and
this error mode will function and now the fourth one is ignore so for
now the fourth one is ignore so for ignore what happens is like again this
ignore what happens is like again this is our
is our ADLs we have this file already present
ADLs we have this file already present here and we are trying to save this file
here and we are trying to save this file here it will simply check and it will
here it will simply check and it will say that okay you I have already one
say that okay you I have already one file over there with the same name and
file over there with the same name and it will not save your file there but
it will not save your file there but also it will not give you any kind of
also it will not give you any kind of error so it will simply ignore it and it
error so it will simply ignore it and it will not save your work at that time and
will not save your work at that time and also will not give you any kind of error
also will not give you any kind of error so this is how this four mode works so
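To make those four modes concrete, here is a minimal PySpark sketch; df_silver and the storage-account path are placeholders standing in for the DataFrame and silver container used in this project, not names taken from the notebook. Pick whichever mode you need; running all four in a row is only for illustration.

```python
# Minimal sketch of the four Spark save modes; df_silver and the abfss:// path
# below are placeholders, not the exact names used in the video.
silver_path = "abfss://silver@storageaccount.dfs.core.windows.net/amazon_prime_titles"

# append: keeps existing files and adds new ones alongside them
df_silver.write.format("parquet").mode("append").save(silver_path)

# overwrite: replaces whatever is already stored at the path
df_silver.write.format("parquet").mode("overwrite").save(silver_path)

# error (the default, also called "errorifexists"): fails if data already exists
df_silver.write.format("parquet").mode("error").save(silver_path)

# ignore: silently skips the write if data already exists
df_silver.write.format("parquet").mode("ignore").save(silver_path)
```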
Now we are just going to use this append mode, and we would like to save the data in Parquet format. To write it into our silver layer we are going to use the Parquet file format, so let's try to understand a little bit more about it.

Parquet is a columnar storage file format, basically designed for use with big data processing frameworks like Apache Spark or Hadoop. It is an open-source format that has become one of the most commonly used file formats for big data because of its efficient storage and fast performance.

If we talk about its main key features, the first one is columnar storage. Parquet stores the data in a columnar layout, meaning the data is stored column by column rather than row by row. This allows for better compression and more efficient querying, because you can read only the columns you need. For example, if you only need the year column for your analysis and you don't want to fetch any other information, Spark will load only the year column into memory instead of reading entire rows. That is what columnar storage means.

The next key feature is compression. Parquet supports multiple compression algorithms, for example Snappy, gzip, or LZO/LZ4, and these help reduce the storage space. Together, the columnar layout and the compression can significantly reduce the amount of disk space required to store the data. And the third key feature I would say is schema evolution.
With schema evolution, Parquet allows you to change the schema of your data over time while you are still able to read the older data with its original schema. What do we mean by changing the schema? We just saw that we added one new column, country: if you look at the source file that column was not there, but after our changes we have this newly added column, which means we have changed the schema of our data. Normally, if we tried to write this data into the silver layer it could throw an error, because the schema of the source and the schema of the destination are not the same. But Parquet gives us that facility, or flexibility, of schema evolution: even if your data or your schema has changed, it will still allow you to store that data into our silver layer. So these are some important features of Parquet, and that's why we usually prefer the Parquet format or the Delta format.
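As a small illustration of the columnar storage and compression points above, the sketch below reads back only one column from the silver Parquet data; the path, the release_year column name and the codec shown are assumptions for demonstration, not values read from the notebook.

```python
# Illustration of column pruning on the silver Parquet data; path and column
# name are placeholders.
silver_path = "abfss://silver@storageaccount.dfs.core.windows.net/amazon_prime_titles"

# Because Parquet is columnar, selecting one column means Spark only has to
# read that column's pages, not whole rows.
years = spark.read.format("parquet").load(silver_path).select("release_year")
years.show(5)

# The compression codec is chosen at write time; Snappy is Spark's default,
# and gzip or lz4 could be requested instead, e.g.:
# df.write.option("compression", "gzip").parquet(silver_path)
```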
Now let's see whether it is allowing us to store the data or not... OK, it has been successful. Let's come to our silver layer: we can see we have one folder here, and inside that folder our Parquet file is visible. So our data has been loaded successfully into the silver layer, which means our second step is done.

Our third step will be to fetch this data from the silver layer back into Databricks and perform some advanced changes on it, so that it becomes business-ready data, and then push that data into the gold layer, the gold folder which we created earlier in ADLS. After the changes, we'll see whether we are able to load our business-ready data into this gold folder or not. So now let's see how we can move ahead.
For that, I would like to create one new notebook. I'll just create a duplicate of this notebook for our gold layer transformations, and I'll connect my cluster to it. We need to go through the same procedure again to fetch the data, so I'll just copy and paste the steps we created earlier. The first one was that Python code with the directory (tenant) ID, the secret keys and so on. Then, instead of bronze, we now need to check whether our silver folder is ready, and we can see that we have one folder there. And then we will import these libraries again.
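The copied authentication cell is not shown in full on screen, so here is a hedged sketch of what a service principal (OAuth) configuration for ADLS Gen2 typically looks like in Databricks; every identifier below is a placeholder, not the real tenant ID, client ID or secret used in the video.

```python
# Rough sketch of the service-principal (OAuth) setup for ADLS Gen2 performed
# by the copied cell; all values are placeholders.
storage_account = "storageaccount"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-tenant-id>/oauth2/token")

# Quick check that the silver container is reachable (dbutils is available in Databricks)
display(dbutils.fs.ls(f"abfss://silver@{storage_account}.dfs.core.windows.net/"))
```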
Then we will read our data again. So now I'll have df_gold = spark.read.format("parquet"), with header set to true and inferSchema set to true, and this time our data is not in the bronze layer but in the silver layer, so in the silver path we also need to point at that folder name. OK, we have made one mistake: the display should use df_gold. There we go, now we are able to fetch the new data which we stored into our silver layer. We can see our new column as well, which means that our schema evolution was successful. We had also renamed the title column to content title, and that is getting displayed too, and we don't have any null values. OK, that means our second step has worked properly.
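For reference, the read cell looks roughly like this; the container, account and folder names are placeholders, since the exact paths are not spelled out on screen.

```python
# Rough reconstruction of the read cell; all names below are placeholders.
silver_path = "abfss://silver@storageaccount.dfs.core.windows.net/amazon_prime_titles"

df_gold = (spark.read.format("parquet")
                .option("header", "true")       # carried over from the CSV read; harmless for Parquet
                .option("inferSchema", "true")
                .load(silver_path))

display(df_gold)   # Databricks display(); df_gold.show() works outside Databricks
```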
Now we'll see how we can apply some advanced transformations on this data, and then we'll try to load it into the gold layer. Let's say I want to do some advanced transformations. For example, in the date added column I would like to check the date format of that column, and if it is not already a proper date type then I would like to convert it into a proper format, which is MM-dd-yyyy. For that, we take our df_gold and use the withColumn function: we pass the date added column name, a comma, and then the condition, where we again reference the column and convert it into that date format.

And then, I think we don't need to stop at the full date: we can also extract just the year for business analysis. For that, df_gold is again our data frame dot withColumn, where we create one new column, year added, which takes the year out of the date added column of the data frame. Now let's just see whether it is working or not, so display it. OK, the date added column has been changed successfully, and it has also given us the year added column, which shows only the year taken from the date added column. So now we can see that this is working fine.
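Here is a hedged sketch of those two date steps in PySpark; the date_added / year_added column names and the incoming "MMMM d, yyyy" pattern are assumptions based on the dataset description, not copied from the notebook.

```python
# Hedged sketch of the date transformations; column names and patterns are assumed.
from pyspark.sql.functions import col, to_date, year, date_format

df_gold = (df_gold
           # parse the raw string into a proper DateType column
           .withColumn("date_added", to_date(col("date_added"), "MMMM d, yyyy"))
           # derive a year-only column for business analysis
           .withColumn("year_added", year(col("date_added"))))

# if an MM-dd-yyyy *string* representation is wanted for reporting:
df_gold = df_gold.withColumn("date_added_str", date_format(col("date_added"), "MM-dd-yyyy"))

display(df_gold.select("date_added", "year_added", "date_added_str"))
```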
Now let's try to implement one more transformation and see whether it works. If we look at the listed in column, we can see we have entries like animation, kids or comedy, kids, so let's see if we can split this one column into two different columns. For that we will use the split function. We take our data frame, then withColumn, and we would like to create a new column named, for example, category one; inside it we use the split function: I would like to split the listed in column by the comma and take the zeroth index of the result. Similarly, I'll just copy-paste it and create another column, category two, splitting the same listed in column and taking the first index. Let's see what it shows: we'll display our modified data frame. OK, now we can see category one and category two separated, so we have split the column in two.
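A sketch of that genre split, assuming the source column is listed_in (as in the Kaggle file) and the new columns are called category_1 and category_2:

```python
# Split the combined genre string into two columns; names are assumptions.
from pyspark.sql.functions import col, split

df_gold = (df_gold
           .withColumn("category_1", split(col("listed_in"), ",").getItem(0))
           .withColumn("category_2", split(col("listed_in"), ",").getItem(1)))

display(df_gold.select("listed_in", "category_1", "category_2"))
```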
Then we can check whether we have any missing values in either category. We can see that category one is always filled, but category two has a lot of null values, so we would like to replace those nulls with, for example, the text Unknown. To replace all the null values in category two with Unknown, we take df_gold, our data frame, and use withColumn on the category two column with a when condition: when category two is null, give Unknown; otherwise keep whatever the existing category is. Let's see, hopefully it will work... OK, now we can see that whatever null values were there have been replaced with the Unknown text in category two.
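The null replacement described above would look roughly like this, assuming the column is category_2 and the replacement text is "Unknown":

```python
# Replace nulls in category_2 with a default label; names are assumptions.
from pyspark.sql.functions import col, when

df_gold = df_gold.withColumn(
    "category_2",
    when(col("category_2").isNull(), "Unknown").otherwise(col("category_2"))
)

# na.fill would be an equivalent one-liner for the same effect:
# df_gold = df_gold.na.fill({"category_2": "Unknown"})
```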
So this is how you can perform some advanced transformations. I won't be able to cover every possible transformation, but these are a few I have done so that you understand what exactly needs to happen. Now we need to save this data in Delta format into our gold layer. For that we will again use the same technique to write the data, but this time in Delta format: I'll just change the format to delta, keep the mode as append, and the path will now point to the gold layer instead of silver; here I'll also change the folder name to gold, and now we will try to save it.
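A minimal sketch of that gold-layer write, with the container and folder names as placeholders mirroring the earlier silver write:

```python
# Write the business-ready DataFrame to the gold layer in Delta format.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

(df_gold.write
        .format("delta")     # Delta instead of plain Parquet
        .mode("append")      # same append semantics as before
        .save(gold_path))
```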
It is successful now, so let's cross-verify it. This is our gold layer, and in it we now have one new folder; inside it our file has been saved, along with the Delta logs that are stored alongside that Delta-format file. So the question is, why do we need to use this Delta format? There are several key features, or advantages, of this format.

The first advantage is that the Delta format supports ACID transactions. ACID stands for atomicity, consistency, isolation and durability, and these transaction guarantees basically ensure that all the read and write operations are done in a reliable and consistent manner. This is especially useful in big data environments where many users and processes may be accessing or modifying the same data concurrently. Atomicity ensures that a set of operations, for example a write, is treated as a single unit of work: it is either fully completed or not executed at all. That kind of atomicity, together with consistency, isolation and durability, is what the Delta log and the Delta format give you.
The next one I would say is time travel. I'll just cover the theoretical part here; in the next video I'll be trying to cover Azure Databricks completely, and there I'll show hands-on sessions for these activities, like time travel, versioning and schema evolution. Time travel is essentially versioning: Delta Lake provides the ability to travel back in time by querying older versions of the data. It supports time travel to access the historical versions of the data, which is especially useful for auditing, debugging, or recovering from errors. For example, suppose you are working on a data set and by mistake you delete some rows. Now you need to fetch, or recover, the past version of that data, where all of the rows were still present and nothing had been deleted. For that purpose we can use time travel: with it we can read an older version and restore it, so that we can carry on with our next activities, and the mistake of deleting some rows does not affect the previous version and we can proceed ahead successfully. That is what we use time travel for.
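For reference, querying an older version of the gold table looks roughly like this; gold_path is the same placeholder path as above and version 0 is just an example, since the hands-on walkthrough is left for the next video.

```python
# Read an earlier version of the Delta data (time travel); values are examples.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

df_v0 = (spark.read.format("delta")
              .option("versionAsOf", 0)          # or .option("timestampAsOf", "2024-01-01")
              .load(gold_path))
display(df_v0)

# The table history (one row per commit in the Delta log) can be inspected with SQL:
display(spark.sql(f"DESCRIBE HISTORY delta.`{gold_path}`"))
```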
Then, again, it also supports schema evolution, just as I already described for the Parquet format: if the schema of the data has been changed during the transformations and is different from the source, it will not give you an error while storing that data back to the sink. That is schema evolution. It also supports large-scale data processing, and it has data lineage and auditability, because Delta tables automatically track the data lineage. What is data lineage? Data lineage makes it easier to trace the flow of the data from source to destination, which basically ensures that our data has transparency, regulatory compliance, and effective troubleshooting. So for these purposes we use the Delta format, and within it we also have the Delta log transaction files.
If we talk about the Delta log briefly, we can say that it is an essential component of Delta Lake which tracks all the changes made to a Delta table, and that is what enables the ACID transactions, version control and schema evolution. The Delta log is stored in a folder called _delta_log, and it consists of sequential JSON files that record the metadata of each transaction, for example an insert, an update or a delete. These files are named incrementally, starting from zero, and as we make further changes and save them they continue as 01, 02, 03 and so on, in that incremental fashion. That is how the Delta log functions. This is just a brief overview; we'll try to cover it in another video that I'll create on Azure Databricks.
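If you want to peek at those transaction files yourself, a quick hedged sketch (with the same placeholder path) is:

```python
# List the Delta transaction log for the gold table; path is a placeholder.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

display(dbutils.fs.ls(f"{gold_path}/_delta_log"))
# Typically shows 00000000000000000000.json, 00000000000000000001.json, ...
# one JSON commit file per write, plus periodic checkpoint files.
```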
So now let's move ahead and see how we can connect our Databricks to Power BI. We know that we have pushed our data into the gold layer, but what if I would like to create one external table, a Delta table, on top of this data? Why is it useful to create an external table? Let's try to understand. Suppose this is our silver layer and this is our gold layer. We had the data in the silver layer, then we fetched it, transformed it, and pushed it into the gold layer, so now our files are in the gold layer. Suppose that by mistake I run a DROP command. In that case the drop would delete the table and the data that is already present in the gold layer, so all our data would be gone and we would no longer have access to it after that DROP command.

To avoid that, we create one external table on top of the gold layer. All of the data stays in the files in the gold layer, and the external table just sits on top of it; in this case, if we run the DROP command, it will only delete the external table and not the data that is already present in the gold layer. That is how our data is kept safe, and there will not be any loss of the original data saved in the gold layer. This is where the external table concept comes in. So let's try to create one external table on top of this data.
top of this data so for that what we can do for that we'll just create one
do for that we'll just create one database into a SQL language so for that
database into a SQL language so for that sake select SQL language from language
sake select SQL language from language from here and the code is very simple
from here and the code is very simple like the SQL create database and the
like the SQL create database and the database name so in this case my data
database name so in this case my data name database name would be gold layer
name database name would be gold layer now let's try to execute this
command Okay so it says that it is already
already present so now let let me just drop this
present so now let let me just drop this database first
okay so now we will create our database again with the name gold layer okay so
again with the name gold layer okay so this is our new gold layer and now we
this is our new gold layer and now we will save our
will save our data into Delta format as a table in the
data into Delta format as a table in the gold layer okay so for that what we will
gold layer okay so for that what we will do this the syntax will be same we will
do this the syntax will be same we will write it in Python this is our data
write it in Python this is our data frame name right format would be Delta
frame name right format would be Delta mode I would like to keep it as append
mode I would like to keep it as append then the path we need to give the path
then the path we need to give the path of
of our gold layer okay so this is my
our gold layer okay so this is my path let's suppose in this path I would
path let's suppose in this path I would like to create
one file folder inside that so maybe my folder name would be
gold and then I would like to save this as a table so for that save as table
as a table so for that save as table then I would I'll give the gold layer
then I would I'll give the gold layer which is our database name and Dot then
which is our database name and Dot then I would like to give the table name
I would like to give the table name as maybe if Prime
as maybe if Prime gold and now let's it should be in
gold and now let's it should be in inverted comma
okay so now we will try to run this
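Put together, the database and external table creation looks roughly like this; gold_layer, prime_gold and the path are assumed spellings of the names spoken in the video (the video drops and recreates the database instead of using IF NOT EXISTS).

```python
# Sketch of the database + external table creation; names and path are assumptions.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

spark.sql("CREATE DATABASE IF NOT EXISTS gold_layer")

(df_gold.write
        .format("delta")
        .mode("append")
        .option("path", f"{gold_path}/prime_gold")   # explicit path => external (unmanaged) table
        .saveAsTable("gold_layer.prime_gold"))
```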
OK, it has run successfully; now let's check. If we come to our gold layer we can see that we have a new folder named prime gold, and inside it we have the Parquet file along with the Delta log. Remember that we saved the format as Delta, and the Delta format is essentially the Parquet file format with a transaction log on top: you can see the Parquet file here, compressed with the Snappy algorithm we discussed earlier, and on top of that it has created the Delta log, so overall this is the Delta format in which we have saved our data. So that is how we can save our data as a table. Now suppose I want to check it, and I'd like to run a SQL command against it, to see whether we are able to access that table. For that: SELECT * FROM, then our database name, gold_layer, dot, and the table we created in it, prime_gold.
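Querying the registered table, and checking the drop-safety property discussed earlier, can be sketched like this (same placeholder names):

```python
# Query the external table registered in the metastore.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

display(spark.sql("SELECT * FROM gold_layer.prime_gold LIMIT 10"))

# Dropping an *external* table removes only the metastore entry; the Delta
# files under the gold path stay untouched, which is the safety property
# discussed above. (Uncomment to try.)
# spark.sql("DROP TABLE gold_layer.prime_gold")
# display(dbutils.fs.ls(f"{gold_path}/prime_gold"))   # data files are still there
```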
OK, so now we can see that the table we saved under the gold_layer database with the name prime_gold is accessible, and we can also use SQL commands, the SQL language, to work on that table. So that is how we can use it.

The next step is how we can connect this Databricks data to Power BI. For that, go to the Marketplace in Databricks and search for Power BI, then click connect, and it will ask you to download a connection file. But before that we need to select the compute, because it automatically connects to the serverless starter warehouse, while our files live on another cluster; so select your proper compute or cluster and then download the connection file. Once you have downloaded it, just click on that file and open it.

In my case I already had Power BI Desktop installed on my laptop; if you don't, just go to Google, search for "download Power BI Desktop", take the first link, and download and install Power BI Desktop on your laptop or desktop. In my case it connected directly to my Power BI, because I had already connected Power BI to this same Databricks workspace, and if you look under the Hive metastore you can see the database we created, gold_layer, and under it the table prime_gold. Before that, though, it will ask you to connect to your Databricks workspace: it will show a popup where you need to select Personal Access Token on the left-hand side, and once you select that it will ask you to enter your access token; you just enter it and click Connect.

Where do you get this access token? Come back to your Databricks workspace, select your account, go to Settings, and under Settings you will find User, then Developer, and there click on Manage access tokens. Generate one new token, click Generate, and it will show you the token. You have to copy it right away, because once you close the dialog you will not be able to see it again. Click Done, then come back to Power BI, and in that popup click Personal Access Token, paste the token you copied, and click the Connect button. Once you click Connect you will get the Navigator, and under the Navigator, under the Hive metastore, we can see the database we created, gold_layer, and under it our table. Just select this table and click Load.
The connection will take some time to load the data. OK, now we can see that we have our prime_gold data here, with all the columns present. Now you can create dashboards here: you can create multiple charts, like bar charts or pie charts, whichever is suitable for your information. You can drag the respective columns onto the X and Y axes, add legends and small multiples, all the required information, and build four to five charts that are most useful for the business. Remember that business users are often from a non-technical background, and when they look at your report or dashboard they should be able to grasp what is going on despite that, so it is very important to know which fields are important to display and which charts are worth showing to the business. That is how you can create a report in Power BI Desktop, and this is what we have created till now.
So I think that's it. This is how we have created an end-to-end ETL pipeline, starting from the source, which was our GitHub account, all the way to connecting it to Power BI with business-ready data. For that we used Azure Data Factory to ingest the data, then we used Databricks to build the medallion architecture layers, silver and gold; we saw how to connect Databricks to the ADLS account, how to fetch the data, how to transform it, and how to push it back into the different layers in ADLS Gen2; and once the work was done, how to connect our Databricks to Power BI. So that is how the end-to-end pipeline works. If you found this video helpful, please don't forget to like the video and subscribe for more content, and let me know in the comments if you have any questions or if you want more videos like this. In the next video I'll try to cover Azure Databricks in detail, with the new concepts we touched on, like Delta Lake and so on. Thank you so much for watching this video, and till then, keep crunching the data.