YouTube Transcript:
🚀 Azure End-to-End ETL Pipeline Project | Azure Data Factory + Databricks + Power BI | For Beginners
Video Transcript
Hi everyone, my name is Rutuja, and welcome back to Data To Crunch, where we learn more about data engineering and business intelligence. In today's video we will dive into the exciting world of Azure. I'll walk you through the process of creating an end-to-end ETL pipeline in Azure, where we will use the main Azure services: Azure Data Lake Storage, Azure Data Factory, and Azure Databricks. Afterwards we'll see how we can connect these services to Power BI so that we can create reports and dashboards on top of the same data. We'll start by learning some basic concepts, then see how to set up these Azure services to build this end-to-end ETL pipeline, and afterwards see how to use these tools to clean, transform, and visualize our data efficiently. For the analysis we are going to use an Amazon Prime dataset, and by the end of this tutorial you will understand this ETL pipeline and have a deeper understanding of how these powerful Azure tools come together seamlessly for data workflows. Now let's move ahead and learn more about our dataset.
Before we start working with the data, let's quickly go over the Amazon Prime Movies and TV Shows dataset we will be working on. I downloaded this dataset from the Kaggle website, and I'll provide the link in the description box below. The dataset contains information about the titles available on Amazon Prime, which mainly includes movies and TV shows, and it has several key features we can use to analyze trends, ratings, genres, and more. Now let's take a look at the columns we have. We have a column that gives us a unique show ID, then a column for the type of the title (movie or TV show), the title itself, the director of the show, the cast, the country of production, a date added column that tells us when the movie or TV show was added to Amazon Prime, the release year, a ratings column, and, after ratings, the duration of the movie or TV show. This dataset offers a lot of interesting possibilities for analysis, like exploring relationships between genres, release years, and ratings, or examining trends in movie popularity across different countries. What I have done is download this dataset and upload it to my GitHub account; I'll provide that link in the description box below as well.
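If you want to peek at the dataset before touching any Azure service, a minimal pandas sketch like the one below works. The repository path in RAW_URL is a placeholder, not the exact link from the video, and the column names listed in the comment are the ones just described.

```python
# Quick local look at the dataset before building the pipeline.
# RAW_URL is a placeholder -- use the link from the description box.
import pandas as pd

RAW_URL = "https://raw.githubusercontent.com/<your-account>/<your-repo>/main/amazon_prime_titles.csv"

df = pd.read_csv(RAW_URL)

# Expected columns (as described above): show_id, type, title, director,
# cast, country, date_added, release_year, rating, duration, ...
print(df.columns.tolist())
print(df.head())
```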
Now we will look at our further procedure: how we can fetch this data from the GitHub account into our Azure services, and then how we can perform our cleaning and transformation activities on it. Let's have an overview of the whole procedure we will be working on. This is the overview of our end-to-end ETL pipeline, which we will create using Azure. Our source is the GitHub account link where we have stored our dataset, and our destination, or end point, will be Power BI, where we will create dashboards and reports on our transformed dataset to present it to the business. Here we are going to use three Azure services. The first one is Azure Data Lake Storage (ADLS), where we will store our data. For the storage we will follow the Medallion architecture, which is a data management framework commonly used in modern data lakes and data warehouses. In this framework we organize our data into three distinct layers, which we call medals: the first medal is the bronze layer, the second is silver, and the third is gold.
The purpose of the bronze layer is to store the raw, unprocessed data which we have ingested. Right now our dataset is in a GitHub account, so we need to fetch it into our Azure services and store it in ADLS. For that purpose we will use Azure Data Factory: it will fetch the data from the GitHub account and store that raw, unprocessed data into our ADLS account. Once Azure Data Factory has done its job, Azure Databricks comes into the scenario. Azure Databricks will work on this raw, unprocessed dataset and clean and transform it so that it becomes easier to analyze for the data scientists or the business intelligence team, whoever works on the further steps. So right now we have the raw, unprocessed data in the bronze layer; Azure Databricks will perform data cleaning and data transformation, taking the data from the bronze layer and, once the transformation is done, storing it into the silver layer. That is where the second layer comes into the scenario: cleaned and transformed data in our ADLS. At this stage we can say the data has been processed to handle inconsistencies, missing values, and duplicates, and to derive more meaningful relationships within the dataset. After the silver layer, the gold layer comes into the scenario. The gold layer represents the highest level of data refinement and is intended for business intelligence; at this stage the data is fully processed and optimized for final consumption. So once our data has been transformed and stored into the silver layer, we will use Azure Databricks again to fetch that data from the silver layer and perform some advanced transformations on it, so that we get data which is business ready, and we will store that business-ready data into the gold layer. What does business-ready data mean? It means the data is intended for business intelligence, reporting, and analysis: at this stage it is fully processed and optimized for final consumption by the business users. Here we can say our data is highly curated and stored in a final, reporting-friendly format such as Delta Lake or Parquet. Once we have the data in the gold layer, we will connect Databricks to the Power BI service; once that connection is ready, we will fetch the data from the gold layer, which is in Delta or Parquet format, pull it into Power BI, and create dashboards and reports. This is the overview of our end-to-end ETL pipeline. Now let's move ahead and first cover the basic concepts required to understand this ETL pipeline.
Let's try to understand the ETL process. For the first step, extract, the main goal is to pull the raw data from the source systems without making any changes to it. The extract phase involves gathering raw data from various source systems, which could be databases, APIs, flat files, or even live streams, and this data could be structured, semi-structured, or unstructured. After extract we have the transform process. The objective of this step is to ensure that our data is accurate, consistent, and formatted correctly for reporting or analytics. The transformation phase is where the raw data undergoes processing and cleaning to meet the desired structure, quality, and format for reporting or analysis. This phase includes data cleaning, aggregations, validations, filtering, and other operations that ensure the data is suitable for downstream applications. After transform we have load. The objective of load is to store the transformed data in a format that is optimized for querying and reporting, so that it is easy for users and applications to access. The loading phase is where the transformed data is stored into the final destination, usually a data warehouse or a data lake, where it can be queried, reported on, and analyzed; loading can happen either in batch mode or in real-time mode. In our case the ETL pipeline will work like this. For the extract step we have Azure Data Factory: it will pull the data from the external source, in our case the GitHub account (it could also be something like Kaggle), and store that data into the bronze layer, which we will create in ADLS, the Azure Data Lake Storage service. For the transformation step, Databricks comes into the scenario: it will read the raw data from the bronze layer, perform cleaning and transformation, bring the data into a more structured format, and store it into the silver layer. Once the data is in the silver layer, Databricks can also perform additional, more advanced aggregations and refinements; once those are done, that business-intelligence-ready data gets stored into the gold layer, where it is curated and optimized for reporting. For the load step, the business-ready data which is already in the gold layer gets loaded into the Power BI tool. We will fetch the data stored in the gold layer into Power BI, and once that is done we can create dashboards and reports to present to the business. This is how the complete ETL process will work in our case.
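To make the mapping concrete, here is how the three layers translate into ADLS paths and responsibilities. This is only a sketch: the storage account name follows what we create later in the video, and the folder names are illustrative.

```python
# How the Medallion layers map onto ADLS paths in this project.
ACCOUNT = "adlsprimework"  # storage account created later in the video

BRONZE = f"abfss://bronze@{ACCOUNT}.dfs.core.windows.net/"  # raw CSV landed by ADF
SILVER = f"abfss://silver@{ACCOUNT}.dfs.core.windows.net/"  # cleaned Delta written by Databricks
GOLD   = f"abfss://gold@{ACCOUNT}.dfs.core.windows.net/"    # aggregated, business-ready Delta

# Extract  : ADF copy activity -> GitHub CSV lands under BRONZE
# Transform: Databricks        -> BRONZE -> cleaned data under SILVER
#            Databricks        -> SILVER -> aggregated data under GOLD
# Load     : Power BI          -> reads GOLD (DirectQuery or Import mode)
```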
Now let's try to understand what the Medallion architecture is. The Medallion architecture is a data management framework commonly used in modern data lakes and data warehouses. It organizes the data into three layers, namely the bronze layer, the silver layer, and the gold layer, and each of these layers is designed to handle the data at a different stage of processing and optimization. In a Medallion architecture, data is progressively refined as it moves from the bronze layer, which is the raw layer, to the silver layer, which is the cleaned and enriched data layer, and then to the gold layer, which is the business-ready data layer. The goal is to ensure scalability, performance, and high-quality data at every stage. The first layer is the bronze layer, the raw layer, where our raw, unprocessed data is ingested and stored; it is also referred to as the landing zone or raw data zone. If we talk about the characteristics of this bronze layer, the data is typically stored in its native form, for example CSV, JSON, or Parquet, exactly as it comes from the source system, and no transformations are applied at this stage. Since the data is stored as-is, it may contain errors, duplicates, or missing values. The focus of the bronze layer is to capture and preserve all the incoming data without losing any information. In our case we are going to ingest our data from GitHub in this raw format and store it into ADLS.
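In the video this ingestion is done by an Azure Data Factory copy activity; the sketch below only illustrates what "landing the raw file in the bronze container, unchanged" means, done by hand with the azure-storage-file-datalake SDK. The URL, account name, and key are placeholders.

```python
# Manual illustration of landing raw data in the bronze layer
# (in the project itself, ADF performs this step).
import requests
from azure.storage.filedatalake import DataLakeServiceClient

RAW_URL = "https://raw.githubusercontent.com/<your-account>/<your-repo>/main/amazon_prime_titles.csv"

service = DataLakeServiceClient(
    account_url="https://adlsprimework.dfs.core.windows.net",
    credential="<storage-account-key>",
)

bronze = service.get_file_system_client("bronze")
csv_bytes = requests.get(RAW_URL).content

# Store the file exactly as it came from the source -- no transformations
# are applied in the bronze layer.
bronze.get_file_client("amazon_prime_titles.csv").upload_data(csv_bytes, overwrite=True)
```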
After the bronze layer we have the silver layer. We need this silver layer because it is where our data is cleaned, transformed, and enriched for further analysis. Data at this stage has been processed to handle inconsistencies, missing values, and duplicates, and to derive more meaningful relationships between the datasets. If we talk about the characteristics of the silver layer, at this stage the data is transformed into a more structured, normalized, or enriched form, which may involve operations like data cleaning, deduplication, or validation. It can also include basic business logic, like creating additional columns, data type conversions, or categorizations; we can perform these kinds of basic logical transformations in this layer. The data in this layer is typically suitable for EDA, which is exploratory data analysis, and for intermediate reporting. What could be the use case for the silver layer? Data scientists or analysts can use it to build machine learning models or to conduct further analysis in Databricks; basically it serves as the foundation for the gold layer's curated, business-ready data. In our case we are going to use Databricks for this silver layer: we will clean and transform the raw movie data, for example removing some duplicates and handling some missing values, and we will store the result into the silver layer in Delta or Parquet format.
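A minimal sketch of what that silver-layer step could look like in a Databricks notebook follows. The `spark` session is provided by Databricks, the paths are illustrative, and the specific cleaning rules are just examples of the kind of operations described above.

```python
# Silver layer: read the raw bronze CSV, clean it, and write Delta.
from pyspark.sql import functions as F

bronze_path = "abfss://bronze@adlsprimework.dfs.core.windows.net/amazon_prime_titles.csv"
silver_path = "abfss://silver@adlsprimework.dfs.core.windows.net/amazon_prime_titles"

raw = spark.read.option("header", True).csv(bronze_path)

cleaned = (
    raw.dropDuplicates(["show_id"])                              # remove duplicate titles
       .na.fill({"director": "Unknown", "country": "Unknown"})   # handle missing values
       .withColumn("release_year", F.col("release_year").cast("int"))  # basic type conversion
)

# Store the cleaned, structured data in the silver container as Delta.
cleaned.write.format("delta").mode("overwrite").save(silver_path)
```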
The next layer is the gold layer. The purpose of the gold layer is to represent the highest level of data refinement; it is intended for business intelligence, reporting, and analytics. At this stage the data is fully processed and optimized for final consumption by the business users. What are the characteristics of this gold layer? The data is highly curated: it often includes aggregations, summaries, or derived metrics that answer specific business questions. This layer is optimized for performance and fast querying, for example using indexing, partitioning, or Z-ordering, and it is stored in a final, reporting-friendly format such as Parquet or Delta Lake. The use case for the gold layer is dashboards, reports, and KPIs that support decision making: business teams and management can use this layer to extract insights, perform trend analysis, and make data-driven decisions. In our case this is where Power BI comes into the scenario. After transforming the data we might, for example, aggregate the movie counts by genre or by year and save that final dataset into the gold layer. The data is then ready to be consumed by a Power BI dashboard for reporting: we can pull it from the gold layer into Power BI and perform the next business-driven activities, which are creating reports and dashboards. This is how the Medallion architecture works and how it will be used in the end-to-end ETL pipeline we are going to create.
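As with the silver step, here is a hedged sketch of what the gold-layer step could look like in Databricks: read the cleaned silver data, build a small business-ready aggregate (titles per release year and type, as an example), and save it as Delta in the gold container. Paths and names are illustrative.

```python
# Gold layer: aggregate the cleaned data into a business-ready table.
silver_path = "abfss://silver@adlsprimework.dfs.core.windows.net/amazon_prime_titles"
gold_path   = "abfss://gold@adlsprimework.dfs.core.windows.net/titles_by_year"

cleaned = spark.read.format("delta").load(silver_path)

titles_by_year = (
    cleaned.groupBy("release_year", "type")  # e.g. movies vs. TV shows per year
           .count()
           .orderBy("release_year")
)

titles_by_year.write.format("delta").mode("overwrite").save(gold_path)
# Power BI then connects to this gold data for dashboards and KPIs.
```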
Now let's see what Microsoft Azure actually is. Microsoft Azure is a cloud computing platform and service offered by Microsoft. It provides a wide range of cloud services, including computing power, storage, networking, databases, analytics, artificial intelligence, and many more. Businesses and individuals can use Azure to build, deploy, and manage applications on a global scale without needing to maintain on-premises infrastructure. Why do we need Azure? There are a few key reasons. The first is scalability and flexibility: Azure allows businesses to scale resources up or down based on demand, ensuring optimal performance without the upfront hardware investment; to avoid that hardware investment, we can prefer Azure. The next is cost efficiency: Azure has a pay-as-you-go model, so organizations only pay for the resources they actually use, which ultimately reduces operational expenses. The third is global reach and availability: Azure has data centers across the world, so it provides high availability, lets applications run closer to the end users, and ultimately improves performance. After this I would mention security and compliance: Azure offers industry-leading security measures, including multi-layered protection, encryption, and compliance with global standards like GDPR and HIPAA, which is another reason many companies prefer it. And the next one is integration with Microsoft and open-source tools, which is one of the main reasons Azure is preferred: it works seamlessly with other Microsoft products like Office 365, Power BI, and SQL Server, as well as with open-source frameworks such as Python, Spark, and Kubernetes. These integrations are made easier with Azure, and that is one of the major reasons it is so widely adopted right now. Now we are going to create our ETL pipeline, so let's see which Azure services we are going to use.
The first one is Azure Data Lake Storage (ADLS). ADLS is a secure, highly scalable cloud storage service designed for big data and analytics workloads. Why do we need ADLS? It stores the raw, semi-processed, and processed data in the different layers we saw earlier in the Medallion architecture; that is the kind of data that gets stored in ADLS, so it basically gives us the storage. Apart from that, it supports a hierarchical namespace, which improves the performance of file operations: a hierarchical namespace allows you to create a folder, create subfolders inside it, and store files in those subfolders. ADLS gives us that facility and makes file operations easier. It also integrates seamlessly with Azure Databricks and the other Azure services. Azure Databricks matters here because in this ETL pipeline we are going to integrate ADLS with Databricks, which is why I call it out specifically, but ADLS is compatible with the other Azure services as well.
The next service is Azure Data Factory. This is a cloud-based ETL service used for data movement and transformation. Why do we need ADF? ADF will ingest the data from various sources; in our case it is HTTPS, but it could be SQL, other APIs, cloud storage, or an on-premises database. From these kinds of sources it helps us ingest data into ADLS. Apart from that, it can schedule and orchestrate workflows by automating the data pipeline. In our case we are fetching the data from one source, and with the copy activity in ADF we are moving the data from source to destination; we will do that with manual intervention, but if I wanted to run it without any manual intervention, we would need to schedule and orchestrate this workflow, and ADF has that facility too: orchestration and automation of the data pipeline. Apart from that, it integrates with Azure Databricks for the transformations and with Power BI for the visualization, and we are going to see how to connect Azure Databricks with Power BI. Next to that is Azure Databricks.
Azure Databricks is a big data analytics and machine learning platform based on Apache Spark and optimized for Azure. It provides high-performance distributed computing for large datasets, and it supports languages like Python, SQL, Scala, and R for data processing. It can process structured as well as unstructured data very efficiently, and it integrates seamlessly with ADLS, Azure Data Factory, and the other Azure services.
The next one we are going to use is Power BI. We all know Power BI is a business intelligence and data visualization tool that enables users to create interactive dashboards and reports. Why do we need it? It connects directly with Azure Databricks or ADLS for real-time insights, so it gives us a seamless integration, and it supports both DirectQuery and Import modes for data retrieval; that is the main purpose for which we are going to use Power BI. In our case the use case is to connect to the gold layer, fetch the data into Power BI to create dashboards, and visualize this data using KPIs, trend patterns, and business metrics. It also enables real-time reporting using the DirectQuery mode, which we are going to see in some time. These are the components we are going to use in our ETL pipeline.
Now let's learn more about the Azure management levels and hierarchy. Azure provides four levels of management: the first is the management group, followed by the subscription, then the resource group, and the last one is the resources. If we talk about the management group, it helps you manage access, policy, and compliance for multiple subscriptions; all the subscriptions in a particular management group automatically inherit the conditions applied to that management group. What is a subscription exactly? A subscription logically associates user accounts with the resources they create. Each subscription has limits, or quotas, on the amount of resources it can create or use, and organizations can use subscriptions to manage the costs and the resources created by a particular user, team, or project. Basically, in a company, a subscription is allocated to a particular team or user, so that whatever that user creates under that subscription gets billed to that subscription; it logically associates the user with the resources they have created. That is how the subscription level works. The next one is the resource group. A resource group is a logical container where you can deploy and manage Azure resources like virtual machines, web apps, databases, or storage accounts. You can think of it as a logical folder in which we store all our resources, and resources are nothing but the instances of the services you create in that resource group. In our case we have services like ADF, Databricks, Synapse (which we will use in another project), SQL Database, ADLS, or a virtual machine; these are all resources, instances of the services you can create, and we keep all of them together in one folder, which is the resource group. In one resource group I might keep SQL or ADLS; that depends on you. So what we'll do is this: we'll have a subscription (with a free account you get a subscription with some free credit, or with pay-as-you-go you create a new subscription), in that subscription we will create a resource group, and in that resource group we will create all the resources, or instances, that we are going to use to build our ETL pipeline. This is how the management levels and hierarchy work, and now we will implement them for our ETL pipeline: we'll see how to create the subscription, then the resource group, then the resources, and through that we will create our pipeline. So now let's move ahead.
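Just to make the nesting concrete, here is the same hierarchy written out as a small Python structure, using roughly the names this project uses; the management-group name is a placeholder, since we never create one explicitly in the video.

```python
# Azure management hierarchy for this project, as a nested structure.
hierarchy = {
    "management_group": {
        "name": "<my-management-group>",  # placeholder
        "subscriptions": [
            {
                "name": "Pay-As-You-Go",
                "resource_groups": [
                    {
                        "name": "RG_Prime_Work",
                        "resources": [
                            "adlsprimework (ADLS Gen2 storage account)",
                            "ADF_Prime_Work (Azure Data Factory)",
                            "Azure Databricks workspace",
                        ],
                    }
                ],
            }
        ],
    }
}
```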
Now let's start with our practical part. For that, just go to the Microsoft Azure portal; here you will get the option to sign in and get started with Azure. If you are a newcomer, just click on Get Started with Azure and then try Azure for free: you will get 200 US dollars of credit, and with that credit you can use the services for free for a month, which is a good option to opt for. Otherwise you can go for pay-as-you-go, where you only get billed, under that particular subscription, for the services you actually use. This is how the Microsoft Azure portal looks. These are the Azure services I have used recently (my recent work), and on the left-hand side, if you click on the three bars, you will get all the facilities like dashboards, all services, resource groups, SQL databases, virtual machines, and so on; if you don't see something, you can simply search for it here and you will get that particular service. The first thing in our management hierarchy was management groups, and after that, subscriptions. If I search for subscriptions, I get that service; click on it and you will see how many subscriptions you currently have, with an active or disabled status, and you can check the current cost each one is incurring. You'll see that I don't have any free subscription; I'm on pay-as-you-go, so I can see that subscription here with my current cost. If you would like to create a new subscription for billing purposes, just click on Add, give whatever subscription name you want, then enter the billing account and billing profile, that is, your card details, so that you get billed on that respective card, and click Review and Create. Right now I won't create any new subscription, I'll use the existing one, but this is how you can create a new subscription.
So yes, I would like to discard that. Now, once you are done with the subscription, the third step is to create the resource group. I would like to create one new resource group, so I'll come here and click on Create. By default it selects the subscription I have right now; if you have multiple subscriptions, just click on the drop-down and select whichever subscription you want. Right now I only have one, so I'll select that, and then you need to enter the resource group name you want to give to that respective resource group; for "RG Prime work" you'll see that it is available. Then for the region, we have many different regions, so I'll just select a nearby one, for example Central US or East US; let's just go with East US. Click on Next; we don't need tags, so Next again, and lastly it gives you a Review and Create page showing that under this subscription you are going to create this resource group in this region. It says the resource group was created, and now we can see it reflected here. This is my resource group, the third management level. Under this resource group we are now going to create our different Azure services, so it will act as a folder for the different kinds of services. The first service to create is our storage account, which is nothing but the ADLS account. For that I'll just click on Create, which navigates you to the marketplace, and here I need to search for "storage account".
This storage account service is what we need. You can see that the respective subscription and resource group have been allocated by default. If you wanted to create a new resource group at this point, you could click on Create New, assign a new resource group name, and click OK, and that new resource group would get created in the background; but we have already created our resource group first, and we are going to create our storage service inside it, so it has selected RG Prime work by default. If you would like to switch among your other existing resource groups, you can do that. So our resource group is RG Prime work. Then you need to give a storage account name; suppose I enter "ADLs". It gives me an error saying the name can contain only lowercase letters, so let's use "adlsprimework". It checks the name and shows that it is available. For the region I'll just keep the nearby one, East US. Then we have the primary service option; our primary service will be Azure Data Lake Storage, where we are going to store our data in three layers, so I'll select that primary service. Then we have the performance option.
In the performance option we have two choices, Standard and Premium. Standard performance is mainly suitable for general-purpose storage needs, such as storing large amounts of data that is accessed regularly. In terms of cost it is more cost-effective than Premium, which makes it suitable for budget-conscious scenarios where extreme performance is not a critical requirement, so purely from a costing perspective I would say we should go with Standard. Premium, on the other hand, is designed for high-performance workloads that demand low latency and high transaction rates, such as virtual machine disks, transactional databases, or big data systems; for those the Premium tier is very good, but in terms of cost it is generally higher than Standard, reflecting its superior performance capabilities. Our workload is not that heavy, so we'll just opt for the Standard option. Next, if we talk about redundancy, there are four types of redundancy; let's quickly see how these four redundant storage options work. Now let's talk about the Azure Storage redundancy offerings.
Azure Storage offers various redundancy options to ensure that your data remains available and durable even in the face of hardware failures, power outages, or natural disasters; we need our data to stay intact, and for that purpose Azure offers these redundancy options. These redundancy strategies replicate your data across different locations, providing protection against data loss and ensuring that business continuity is maintained. The key purposes of storage redundancy are as follows. First, data protection: redundancy safeguards your data from unexpected events such as hardware malfunctions or data center outages by maintaining multiple copies across various locations. Second, high availability: by replicating the data, redundancy ensures that your applications can access it with minimal downtime even during disruptions. Third, disaster recovery: in the event of a regional disaster, certain redundancy options protect you from data loss by storing the data in a secondary region, which facilitates swift recovery and continuity of operations. Selecting the appropriate redundancy option depends on your organization's specific needs, balancing factors like cost, data availability requirements, and disaster recovery objectives; based on those factors you need to select the redundancy offering that is best for your project.
The first option is locally redundant storage, LRS. LRS replicates your data three times within a single data center. We have a primary region, in our case East US; in that primary region we have a data center, and in that data center we have three copies of our data, all within the same data center. That is how LRS functions: it safeguards your data against hardware failures within the same data center. It is a good option, but the limitation is that it does not protect against a data-center-wide disaster like a fire or a flood; if that complete data center is disrupted, it is of no use. That rarely happens, but there is a possibility of it occurring. If your project has a very minimal budget, you can opt for LRS. The second option is zone-redundant storage, ZRS. ZRS replicates your data across three availability zones within a region: we have our East US primary region, in that region we have three different availability zones, and each has its own data center. In LRS the three copies were in one data center, but in ZRS the three copies are in different data centers, still within the same primary region. That is how ZRS functions. The next one is geo-redundant storage, GRS. GRS combines LRS within the primary region with asynchronous replication of the data to a secondary region: the data is synchronously replicated three times within a single storage cluster in the primary region, and then asynchronously replicated to a secondary, paired region. So East US is our primary region, where the data is replicated three times synchronously, and in addition the data is replicated three more times within a single data center in the secondary region. That is how GRS functions; it provides disaster recovery capability in case of a regional outage. The limitation of GRS is that failover to the secondary region requires manual intervention; without that, your data is served from your primary region. The fourth one is geo-zone-redundant storage, GZRS. GZRS combines ZRS within the primary region with asynchronous replication of the data to a secondary region: in our primary region, East US, the data is synchronously replicated three times across three different data centers (availability zones), and apart from that it is asynchronously replicated three more times into the secondary region. That is how GZRS functions; it offers enhanced disaster recovery and high availability. The limitation, again, is that failover to the secondary region requires manual intervention. GZRS provides the maximum protection for your data against disasters, but it is not cost-efficient; it requires a lot of cost, so the choice completely depends on your project requirements and the budget you are allowed to use. That is how these four redundancy options function. For cost-effectiveness I'll just go with locally redundant storage and click on Next.
Here we have one option called hierarchical namespace, or HNS. This is an optional feature in ADLS Gen2 that enhances the way your data is managed within the storage account: it introduces a file-system-like structure on top of Azure Blob Storage, making it more efficient for big data analytics and high-performance data operations. If we tick this option on the Storage V2 account, it basically turns the storage account into an Azure Data Lake Storage Gen2 account. Unlike traditional flat blob storage, HNS allows you to organize the data into directories and subdirectories: with normal blob storage you store your data in a flat file system, but with HNS enabled you can create directories, and inside a directory you can create subdirectories and store your files there. That is the hierarchical facility you enable by clicking this checkbox, and it converts the storage account into ADLS by offering this capability. With this feature it becomes easier to manage, structure, and navigate large datasets, just like a file system. Apart from that, HNS also enables access control lists, ACLs. We can apply these ACLs at the directory or file level, which offers fine-grained permissions; this is useful when multiple users or applications need access with different levels of permissions. We are going to need this ACL feature going forward, because we have Azure Databricks, and Databricks needs to access this ADLS account, specifically the folders where we have stored all our raw data. To access that data, Azure Databricks needs particular permissions, and we can grant those permissions using ACLs, the access control lists. So that is one of the main features we are going to need, and apart from that the sub-foldering capability is also important. We get these two main features by clicking on "Enable hierarchical namespace", so we are going to opt for that.
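To show where those permissions end up being used, here is one common way (sketched, not the only way, and not necessarily the exact method used later in the video) to let a Databricks notebook read an ADLS Gen2 account: register a service principal, grant it access to the containers (for example through ACLs or a role assignment), and set these Spark configs. All IDs and secrets below are placeholders.

```python
# Databricks notebook: configure access to ADLS Gen2 via a service principal.
account = "adlsprimework"

spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", "<client-secret>")
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)

# After this, abfss:// paths on the bronze/silver/gold containers are readable.
df = spark.read.option("header", True).csv(
    f"abfss://bronze@{account}.dfs.core.windows.net/amazon_prime_titles.csv"
)
```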
Next, coming here, we have one more feature: the access tier. Here we have three tiers, hot, cool, and cold. Azure Blob Storage offers these three tiers, and their main purpose is to help you manage storage costs based on how frequently you are going to access the data. We are creating this ADLS account, we are going to pull data into it and store it in directories, so I need to think about how frequently I will access this data; depending on that, I select which access tier is suitable for my storage account. The first one is the hot tier. It is designed for data that is accessed or modified frequently, so if you are going to access the data frequently you should opt for the hot tier. In terms of cost structure it has higher storage costs but lower access and transaction fees: storing the data costs more, but accessing it again and again does not cost that much. Its use cases are active datasets, real-time analytics, or data staging before processing, so most of the time the hot tier is beneficial. The next one is cool. The cool tier is suited for data that is infrequently accessed or modified; if we are not going to use the data frequently, we can opt for the cool tier. Its cost structure is the opposite: lower storage cost, but higher access and transaction fees, so storing is cheap but accessing the data costs more. Then we have the third one, the cold tier. It is ideal for data that is rarely accessed but still requires quick retrieval whenever needed. Essentially you are archiving your data: you will use it rarely, but when you do need it, it should not take a long time to access. For that purpose you can go for the cold tier. Its cost structure is even lower storage cost than the cool tier, but with higher access fees. The use case would be long-term backups or datasets you need to retain for compliance. There is one more point here, the retention requirement: data in the cold tier should stay in that tier for at least 90 days to avoid early deletion charges, and data in the cool tier should remain in that tier for at least 30 days to avoid early deletion charges. This retention requirement is important for both the cool and cold tiers: 30 days for cool, 90 days for cold. That is how the access tiers work.
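Note that the tier chosen here only sets the account default; individual blobs can be moved between tiers later, for example with the azure-storage-blob SDK. A small sketch, with placeholder account name, key, and paths:

```python
# Move a single, infrequently used file to the cool tier after the fact.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://adlsprimework.blob.core.windows.net",
    credential="<storage-account-key>",
)

blob = service.get_blob_client(container="bronze", blob="amazon_prime_titles.csv")
blob.set_standard_blob_tier("Cool")  # per-blob tier change, independent of the account default
```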
Right now we are going to opt for the hot access tier, because we are going to access the data frequently, and just click on Next. The following pages cover private endpoints, some additional features, and routing preferences, which we don't need for our ADLS creation, so click Next through them. Then it validates all your settings, we click on Create, and it will now create our storage account, our ADLS, inside the assigned resource group. It takes some time; it shows that the deployment is in progress, so we'll wait until the deployment succeeds. Okay, now our ADLS deployment is complete, so click on Go To Resource. This is how our ADLS looks. If we talk about the data storage options here, we can see four of them, so let's quickly see one by one how each works.
The first is blob storage, which here appears as containers. We can use a container to store large amounts of unstructured or structured data, like CSV files, documents, images, or videos. The data is organized into containers, which are similar to folders, and these containers hold the files; inside a container we can create folders so that everything is stored in an organized way. It is ideal for content storage and serving, backups, or streaming media. The next one is file shares. This provides fully managed file shares which are accessible via the SMB protocol (SMB is Server Message Block). These file shares can be mounted concurrently by cloud or on-premises deployments, which makes them suitable for migrating legacy applications to the cloud; if you want to migrate from a legacy setup to the cloud, file shares are the most useful. Then we have queues, which is queue storage. It enables asynchronous communication between application components by storing large numbers of messages. The messages are stored in queues and can be accessed from anywhere over HTTP or HTTPS. The use cases are building flexible applications, load leveling during high-traffic periods, or decoupling application components to improve scalability and resilience. The last option is tables. Table storage provides a NoSQL key-value store for rapid development; the key-value pair is important here, and using this key-value store we can keep massive semi-structured datasets in tables. It stores the data in tables with a schemaless design, which allows easy adaptation as application needs evolve. It is appropriate for storing structured, non-relational data such as user information, device data, or other metadata. That is a quick overview of these four types of storage; for our purposes we are going to opt for the containers.
Once you come here, you can click and create a container. We are going to follow the Medallion architecture, so we need a bronze layer, a silver layer, and a gold layer; let's just create three containers. I'll create "silver" first (okay, the first should have been bronze, but that's fine), then "bronze", and the third one will be "gold".
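In the video the containers are created through the portal; just as a sketch, the same three containers could also be created with the azure-storage-file-datalake SDK (credentials below are placeholders).

```python
# Create one container (file system) per Medallion layer.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://adlsprimework.dfs.core.windows.net",
    credential="<storage-account-key>",
)

for layer in ["bronze", "silver", "gold"]:
    service.create_file_system(layer)
```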
Right now, what we are going to do is fetch our data from the GitHub account; that will be our raw data, and we will store that raw data into our bronze layer. For that we will use Azure Data Factory, and with Azure Data Factory we will land the data here. Then we will fetch this data into Azure Databricks, perform some transformations, and store the result into the silver layer. Once we are done with those transformations, again with Azure Databricks we will fetch the data, perform some advanced transformations, and store that business-ready data into the gold layer. This is how we are going to use these three layers, and now we'll see how to get the data into these different layers. The first step is done: we have finished the ADLS creation, so we have our ADLS ready with a bronze container in it. Now we will create an Azure Data Factory and pull the data into the bronze layer; that is our second step. So now let's go and create the Azure Data Factory. I go into my resource group and again click on Create.
It navigates me to the marketplace, and here we search for Data Factory and create it. It has assigned the respective subscription and resource group, so now we need to give a name to our data factory; I'll just give something like "ADF Prime work". I'll keep the region as East US, and the version is V2, then Next. The git configuration, networking, advanced, and tags pages we don't need to configure right now, so we'll just check what is required, Basics, subscription, resource group, version, and region, and then click on Create. It states that deployment is in progress, so it will take some time to create. Okay, now it has been created, so we click on Go To Resource and refresh. Under the resource group we can now see our Azure Data Lake Storage, and along with it our Azure Data Factory too. We'll click on that and launch the studio. This is our Azure Data Factory, in which we will do the data movement using the copy activity. If you look on the left-hand side, we have the tabs Home, Author, Monitor, Manage, and Learning Center. In the Learning Center you get videos and tutorials you can navigate through; if you want to learn more, you can go there any time. Then we have the Manage tab, where we can create linked services and triggers, configure git, and also create integration runtimes. In the Monitor tab you have all the runs: if you debug or run anything, for example in our case when we run the pipeline, we can check its status by coming here, and the same goes for triggers; if you have set any trigger, you can keep a watch on it from here. And then we have the Author tab. In the Author tab we have the factory resources: pipelines, change data capture (which is in preview mode right now), datasets, data flows, and Power Query; all these options are available here. Now we are going to create our pipeline under the Author tab. For that, first we will look at the basics of how the copy activity functions and what the requirements are, and then we will try to implement that copy activity.
This is the overview of Azure Data Factory, under which we are going to perform a copy activity using a pipeline. In Azure Data Factory, orchestrating data movement from the source to the destination involves several interconnected components that work together to make the transfer efficient. In our case the source is the GitHub account and the destination is the bronze layer in ADLS. The GitHub account is the source: it acts as the origin of the data we want to process, which here is a repository containing the data file we want to pull into Azure Data Factory. Then we have linked services. If you look at the diagram, we have two linked services, one on the source side and one on the sink side. A linked service establishes the connection between ADF and a data store, so the linked service on the source side establishes the connection between GitHub and Azure Data Factory. It provides Azure Data Factory with the necessary credentials, and along with the credentials it gives the connection details needed to access the data in GitHub.
access this data in the GitHub then we have data set again we have two data set
have data set again we have two data set one on the source side one on the syn
one on the source side one on the syn side so source side data set it defines
side so source side data set it defines the structure and the location of the
the structure and the location of the data within the GitHub the purpose if
data within the GitHub the purpose if you see the purpose is to
you see the purpose is to specify which data in the GitHub account
specify which data in the GitHub account that should be accessed by the ADF
that should be accessed by the ADF including the file path and the data
including the file path and the data format so for that we have data set on
format so for that we have data set on the source side then we have copy
the source side then we have copy activity so once we have pulled this
activity so once we have pulled this data from the GitHub using this linked
data from the GitHub using this linked service and data set then this copy
service and data set then this copy activity will transfer or will done the
activity will transfer or will done the data movement from the source to the
data movement from the source to the destination so now with this data set or
destination so now with this data set or with this copy activity we have pulled
with this copy activity we have pulled the data now we want to store this data
the data now we want to store this data to our sync which is our destination so
to our sync which is our destination so for that also we have again data set at
for that also we have again data set at the sync and the linked service at the
the sync and the linked service at the sync so this data set it will Define the
sync so this data set it will Define the structure and the location of the data
structure and the location of the data destination data store now this data set
destination data store now this data set at the syn is responsible for performing
at the syn is responsible for performing the activities which are related to the
the activities which are related to the sync the first one was responsible for
sync the first one was responsible for the source now this one is responsible
the source now this one is responsible for the sync this goes the same for the
for the sync this goes the same for the link service also the link service on
link service also the link service on the sync side it will establish the
the sync side it will establish the connection between the a aure data
connection between the a aure data Factory and your destination data source
Factory and your destination data source so this link service will serve that
so this link service will serve that purpose for the sync side and the think
purpose for the sync side and the think is it's quite obvious that it's the
is it's quite obvious that it's the final destination where the data will
final destination where the data will get stored after this copy activity so
get stored after this copy activity so in our case it's the folder which which
in our case it's the folder which which is which we have created for the bronze
is which we have created for the bronze layer into Azure data L Storage services
So this is the overview of the Azure Data Factory activity we are going to perform. Now let's see how to implement this copy activity in Azure Data Factory. At this point we are clear that we need two linked services, one to establish the connection with the source and another for the destination, and along with these two linked services we need two datasets, again one for the source and one for the sink. So we need two linked services, two datasets, and the copy activity; we will interlink those components and then pull the data from the source to the sink. Now let's see how to implement this practically.
Now I'm in Azure Data Factory and we need to create a new pipeline. For that I'll go to the Author tab, click the three dots next to Pipelines, and click New pipeline. Once you click that, you need to give your pipeline a name, so let's say I give the name PrimeCopyData, and then I'll close the properties pane. Under a pipeline we have many activities: Move and transform, Synapse, Azure Data Explorer, Data Lake Analytics, Iteration and conditionals, and so on — there are a lot of activities you can perform. Right now we'll start with the basics: under Move and transform we have Copy data, and we'll simply drag this Copy data activity onto the canvas. This will be our copy activity, but we know that for our work we need two linked services and two datasets to establish the connections with the source and the destination. So first we'll create the two linked services. For that, go to the Manage tab, where we have Linked services, and click New. The first linked service needs to connect to our GitHub account — this is our GitHub account, and this is our data.
data so you when you will click on this RAW
so you when you will click on this RAW button then you will get this main data
button then you will get this main data okay and this will be our link from
okay and this will be our link from where we need to pull the data so this
where we need to pull the data so this data we can see is with https so we need
data we can see is with https so we need to establish a linked service with
to establish a linked service with HTTP so here come here
HTTP so here come here then we have link service for HTTP click
then we have link service for HTTP click on that continue
then you have to give a name let's say suppose I'll give
name let's say suppose I'll give HTTP Prime and LS for the link
HTTP Prime and LS for the link service then you have to I'll just take
service then you have to I'll just take Auto resolve integration runtime then
Auto resolve integration runtime then you have to give the best URL so for the
you have to give the best URL so for the best URL right now I'll just go with the
best URL right now I'll just go with the static pipeline creation and I'll just
static pipeline creation and I'll just copy this URL as it is and then I'll
copy this URL as it is and then I'll just paste that URL here so once you
just paste that URL here so once you will that do that come down and then the
will that do that come down and then the authentication type authentication type
authentication type authentication type I'll just select Anonymous and once you
I'll just select Anonymous and once you are done with inputting the required
are done with inputting the required information just click on test
information just click on test connection so it states that our
connection so it states that our connection is successful then I'll just
connection is successful then I'll just click on
click on create so this is a linked service which
create so this is a linked service which we need for the source side now we need
we need for the source side now we need another link service for the destination
another link service for the destination side so now again click on
side so now again click on new here our destination is azure data L
new here our destination is azure data L Gen 2 so we need to select data link
Gen 2 so we need to select data link storage Gen 2 click on continue then
storage Gen 2 click on continue then give the data set uh give the link
give the data set uh give the link service name here suppose I'll give
service name here suppose I'll give ADLs Prime
account key is fine subscription I'll select my respective
select my respective subscription then ADLs Prime
subscription then ADLs Prime work and then just test the
work and then just test the connection connection is successful
connection connection is successful click on
click on create now we are done with two linked
Now we're done creating the two linked services. Come back to the Author tab, where we have our copy activity, and let's give it a proper name — for example PrimeCopyData. Then go to the Source tab. We're done with the two linked services; now we need to create the two datasets, one for the source and one for the destination. The source is HTTP and the destination is our ADLS. We can create the source dataset right here: click New, our source is HTTP, select that, and click Continue. Under HTTP we know our file is in CSV format, so we select DelimitedText, which covers CSV, and click Continue. Then you need to give the dataset a name — for example HTTPPrimeDS, with DS for dataset — and select the linked service we just created, HTTPPrimeLS. For the relative URL: when we created the linked service we gave the full base URL, because we wanted a static pipeline, so there's no need to give any relative URL here. First row as header is true, and we don't need to import any schema, so select None. Now the source dataset is ready. Go to the Sink tab and again create a sink dataset for our ADLS: click New, our ADLS is Gen2, so choose Azure Data Lake Storage Gen2 and Continue. I want to save my file in CSV format, so I'll keep the format as CSV and click Continue. Then give the dataset a name, maybe ADLSPrimeDS, and select our newly created linked service. For the file path we need to point at the right folder, because we created three folders — bronze, silver, and gold — and you can see them here. Right now I just want to save all the raw data into the bronze layer, so select the bronze folder and click OK. When you save the data into the bronze folder you should also give the file a name, otherwise it will pick a system-generated file name by default. If you want your own file name you can give it — for example I'll reuse the original name, Amazon Prime titles, and for simplicity call the file AmazonPrimeTitles.csv. For import schema select None and then click OK.
So we're done with our Data Factory components and their interlinking: two linked services, two datasets, and the copy activity. Now let's see whether it works. First I'll validate everything I've done — it states that the pipeline has been validated — and then I'll click Debug. Once you click Debug it will show that the activity is queued, then the status changes to In progress, and once it's done it changes to Succeeded; it takes a little time. Now it has succeeded, and if we check the output it states how much data it has read, the duration, and the throughput. Now it's time to check whether our data has actually been copied. For that, go back to your ADLS account, open the container, and go into bronze. We can see we have one file here in CSV format. Let's check whether we're getting the expected output: click on it, then Edit — we're not going to edit anything, we just want to see whether the output is there — and we can see the data. If I click Preview we can see show ID, type, title, director, and so on, so we can say that Azure Data Factory has successfully copied our data from the source to Azure Data Lake Storage. Our first part is done.
services so now our first part is done now we will try to create AO data brakes
now we will try to create AO data brakes resource but before that we'll just come
resource but before that we'll just come back to this AO data Factory and the
back to this AO data Factory and the main important thing is that we need to
main important thing is that we need to publish all of our work otherwise if
publish all of our work otherwise if you'll not publish it then if once you
you'll not publish it then if once you will close this AO data Factory all of
will close this AO data Factory all of your work will be gone and you will not
your work will be gone and you will not be able to recover it so for that to be
be able to recover it so for that to be on safer side it's a good practice to
on safer side it's a good practice to always publish whatever you have done so
always publish whatever you have done so I'll just click on
publish now it states that it is deploying our changes which we have
deploying our changes which we have created it will take few seconds to
created it will take few seconds to publish it so now it says that
publish it so now it says that publishing completed so like again going
publishing completed so like again going forward if I'll come back to this Edo
forward if I'll come back to this Edo data Factory then my all work will be
data Factory then my all work will be here and there is no need to rework
here and there is no need to rework anything so now we are done with the
anything so now we are done with the Azure data Factory now we will create
Azure data Factory now we will create Azure data breakes so for that
Azure data breakes so for that again we will come to our Resource Group
again we will come to our Resource Group and there we will create this aure data
and there we will create this aure data Factory or sorry data brakes so one is
Factory or sorry data brakes so one is done then second data Factory is done
done then second data Factory is done now third again we'll create and go to
now third again we'll create and go to the
the marketplace then data
brakes click on this create data braks it is giving me proper
braks it is giving me proper subscription under that Resource Group
subscription under that Resource Group if you want to if you want like it's a
if you want to if you want like it's a best practice to use the same Resource
best practice to use the same Resource Group so we are going to keep the same
Group so we are going to keep the same Resource Group then we need to assign
Resource Group then we need to assign one workspace name or data break name so
one workspace name or data break name so what we can do for
what we can do for example let's say suppose aure data
example let's say suppose aure data freak ADB
freak ADB Prime
Prime work this name is available so we can
work this name is available so we can opt for this region East us I'll keep
opt for this region East us I'll keep the same because that's the nearest one
the same because that's the nearest one pricing tier we have premium standard
pricing tier we have premium standard and the trial so the best optimum
and the trial so the best optimum solution is the trial because you'll get
solution is the trial because you'll get 14 days free and it it will not charge
14 days free and it it will not charge you for anything for the 14 days so I'll
you for anything for the 14 days so I'll just select that then we have this
just select that then we have this option of managed Resource Group so what
option of managed Resource Group so what is it
is it exactly so this manage Resource Group is
exactly so this manage Resource Group is that when we will create this Azure data
that when we will create this Azure data bricks workspace right then Azure will
bricks workspace right then Azure will automatically generate
automatically generate some manage resource groups so right now
some manage resource groups so right now we know that this RG Prime work which we
we know that this RG Prime work which we have created it right but apart from
have created it right but apart from that in the back end when we are going
that in the back end when we are going to create this data brakes system is
to create this data brakes system is going to
going to generate different Resource Group in the
generate different Resource Group in the back end so these are system
back end so these are system generated which are created for this
generated which are created for this databas infrastructure resources so for
databas infrastructure resources so for that if you'll not give any name here
that if you'll not give any name here then it will automatically assign one
then it will automatically assign one name name which will be created by Azure
name name which will be created by Azure Azure but if you want to give your dat
Azure but if you want to give your dat like if you if you want to give that
like if you if you want to give that folder name like your any customized
folder name like your any customized folder name then you can give that and
folder name then you can give that and then datab briak will store those
then datab briak will store those backend resource groups which are system
backend resource groups which are system generated into that folder name so if
generated into that folder name so if you'll not give it then it is also fine
you'll not give it then it is also fine if you'll give it then it is also fine
if you'll give it then it is also fine so I'll just give that name and I'll
so I'll just give that name and I'll move ahead for this I'll give RG maybe
move ahead for this I'll give RG maybe Ma data
system this things available then next networking we don't
available then next networking we don't need to do any changes for the
need to do any changes for the networking then encryption security and
networking then encryption security and compliance tags and we are here with
compliance tags and we are here with review plus create now we will review it
review plus create now we will review it and then we will try to create it
and then we will try to create it so now let's try to understand
so now let's try to understand that it is going to take few seconds to
that it is going to take few seconds to create it right now deployment is in
create it right now deployment is in progress now let's just wait so that all
progress now let's just wait so that all that Azure will create these resources
that Azure will create these resources and backend system generated resource
and backend system generated resource groups and everything so it will take
groups and everything so it will take few seconds for
few seconds for that okay so now our deployment is
that okay so now our deployment is completed click on go to
completed click on go to Resource this is our data brakes we'll
Resource this is our data brakes we'll click on launch
click on launch workspace select your
So this is our Databricks platform. On the left-hand side we have Workspace, Catalog, and Compute — we will need to create a cluster under Compute — and then the SQL editor, queries, and so on. For now we'll only work with the pieces we need for this ETL pipeline; in another video I'll cover all of the components available in Azure Databricks. So we have our Databricks workspace ready, and this is our ADLS account, under which we have the bronze layer where we stored our raw data. Now we need to fetch that bronze-layer data into Azure Databricks. So how are we going to do that?
To fetch this data from the bronze layer, which lives in a storage account, Azure Databricks needs to have the right permissions. So first we need to grant those permissions to Azure Databricks, and for that we need to create an application — a service principal. Once we've created that application, we'll grant it access to read and write the data, and then we'll use that application from Databricks. To create the application we just go to Microsoft Entra ID, register the service application there, and then grant that application the rights to read and write the data. This way Azure Databricks will be able to read the data from ADLS, and once it has transformed the data it will push it into the silver layer and the gold layer. So now we'll go to Microsoft Entra ID.
layer so for that we'll go to now Microsoft intra ID for that you need to
Microsoft intra ID for that you need to again come back to
your Microsoft aure account we'll search
account we'll search for Microsoft intra ID so this is our
for Microsoft intra ID so this is our Microsoft intra ID here
Microsoft intra ID here just check the option of app
just check the option of app registration it is under manage click on
registration it is under manage click on app
app registration
registration then new
then new registration now give the applic uh give
registration now give the applic uh give your uh application name here for
your uh application name here for example if I'll
example if I'll give Prime
you can give any name whichever you want then we are fine with the single tenant
then we are fine with the single tenant and then click on
and then click on register so now our application has been
register so now our application has been registered then afterwards we need to
registered then afterwards we need to assign all the essential roles to our
assign all the essential roles to our application to read and write the data
application to read and write the data so now we have created the application
so now we have created the application The Next Step would be to give the
The Next Step would be to give the essential access for those applications
essential access for those applications for
for that come to your ADLs
account you have am Access Control here click on that then you need to
click on that then you need to add Ro
assignment here search for
for storage blob data contributor select
storage blob data contributor select that and then click on
next then User Group service principal select
members then search for your application which we created just
your application which we created just now and that was Prime service
now and that was Prime service application right select that it will
application right select that it will come under selected members click on
come under selected members click on select and once you will done
select and once you will done that review and
Now it will add that role assignment here. What happens is that once the role is assigned, it takes some time for the access to be reflected in Databricks. So if you go to Databricks right away, try to access the storage, and it doesn't work, don't worry — it usually takes a while for those roles to apply to the application, so give it some time and then come back to Databricks and try the data cleaning and transformation activities again. The next step is to create a compute cluster. In Databricks, go to the Compute tab and click Create compute. The policy will be Unrestricted; I want a single-node cluster; the default runtime is fine; terminate after 20 minutes of inactivity; for the node type I'll pick Standard_D3_v2, which has 14 GB of memory and 4 cores; and then click Create compute.
While our cluster is being created, let's gather all the information we'll need when connecting Databricks to ADLS. So go to App registrations and select the app you just created under Microsoft Entra ID. From here we're going to need the application (client) ID and the directory (tenant) ID, so let's copy that information somewhere. Then we'll need a secret, which we'll create now: under Manage, go to Certificates & secrets, then click New client secret, give it a description if you want, and click Add. From this client secret we need the Value field, so copy that — it will be our secret. So now we have the client ID, the tenant ID, and the secret: the three values we're going to require.
Now let's see how to connect Databricks to ADLS Gen2. I'm searching for how to connect Databricks to ADLS Gen2, and you'll find a tutorial from Microsoft on connecting to Azure Data Lake Storage Gen2 and Blob Storage. That page has the configuration code we're going to need: we'll replace the placeholders with our own values — the storage account name, the client ID, the secret, and the tenant ID — and then we'll see whether we can fetch the data into Databricks. OK, our cluster is ready now.
Come to the Workspace. It's always good practice to create a folder for your work — I'll call it Prime work — and inside it I'll create a Python notebook. This is our notebook; first we connect it to our newly created cluster — it's connected — and I'll rename the notebook to Prime silver layer transformation, though you can call it anything. These are our cells.
From here you can change the default language for the whole notebook — it can be Python, SQL, Scala, or R — and you can also change the language for an individual cell. If you want to run just that cell, click Run cell; if you want to run the whole notebook, click Run all. So now let's see whether we can connect Databricks to ADLS.
Let's use the Python code from the tutorial in Databricks. I'll copy the code and paste it into a cell; I don't need the first line, so I'll remove it. Then we have the storage account placeholder, so we need to put our storage account name — adlsprimework — everywhere the storage account appears; keep in mind you have to replace it in multiple places. I hope I've set it everywhere. Next we need to give the application ID — this is our application (client) ID, copy and paste it — then the directory ID, which is our tenant (directory) ID, and then the service credential, which is our secret value. Now let's run this cell and see whether it works. It looks fine — the cell has run successfully.
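For reference, the configuration cell from that Microsoft tutorial ends up looking roughly like this once the placeholders are filled in — a minimal sketch, assuming the storage account name adlsprimework and plain-text credentials (in a real project you would normally pull the secret from a Databricks secret scope rather than hard-coding it):

```python
# Sketch of the service-principal (OAuth) configuration for ADLS Gen2.
# The storage account name, IDs and secret below are placeholders / assumptions.
storage_account = "adlsprimework"            # ADLS Gen2 account used in this project
client_id       = "<application-client-id>"  # from the app registration overview
tenant_id       = "<directory-tenant-id>"    # from the app registration overview
client_secret   = "<client-secret-value>"    # the Value copied from Certificates & secrets

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Quick sanity check once the configuration is in place:
# list the bronze container over the abfss:// path.
display(dbutils.fs.ls(f"abfss://bronze@{storage_account}.dfs.core.windows.net/"))
```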
Then I just need to access the folders. For that I'll use dbutils.fs.ls and, in quotes, the path to our bronze layer. Let's go to the container and grab the information we need — we just need the bronze container name. So the path is abfss://, then bronze, then @ our ADLS account name, which was adlsprimework, then .dfs.core.windows.net, and a trailing slash. Let's see whether it works. OK — it's able to access the bronze folder in adlsprimework, and it shows one file named AmazonPrimeTitles.csv, which means it can reach that file. So now let's
so now let's see if we are able to read the data or
see if we are able to read the data or not
not so it's always a good practice
so it's always a good practice so for data reading or data
so for data reading or data understanding first let's just try to
understanding first let's just try to import all the P spark functions and the
import all the P spark functions and the types so I usually prefer to import
types so I usually prefer to import these libraries first so that it will
these libraries first so that it will not bother us
not bother us afterward so I'll try to import all the
afterward so I'll try to import all the functions and then all the types
That's done. Now we'll try to read the data and understand what's going on, so I'll add a markdown cell for data understanding. We need to read our data, which is in CSV format, so first we'll create a DataFrame — for example df_silver — equal to spark.read. The editor suggested some code by default, but let's understand it anyway: our data is in .csv format, so it's spark.read.format with csv, then the header option, because I want the header row as column names — otherwise it would come in as the first data row — so we set header to true. Then we set inferSchema to true so it takes whatever schema types the data has, and then we load the data, giving the path we want to read from. We know we want this data from the bronze layer in adlsprimework, but the file name was different — Amazon Prime titles — so I need to change the file name in the path. Then we just display the DataFrame. OK, that failed because we had .csv in the path twice; we'll try it again. Now we can see our data as a table: show ID, type, title, director, cast, and so on.
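Put together, the read cell looks something like this — a sketch assuming the same storage account and the file name AmazonPrimeTitles.csv used earlier (adjust the path to whatever you actually saved in the bronze folder):

```python
# Imports used by the transformation cells further down.
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Read the raw CSV from the bronze layer into a DataFrame.
# header=True keeps the first row as column names; inferSchema=True lets Spark guess the types.
df_silver = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("abfss://bronze@adlsprimework.dfs.core.windows.net/AmazonPrimeTitles.csv")
)

display(df_silver)        # preview the table in the notebook
df_silver.printSchema()   # column names with their inferred data types
```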
table show ID type title director cast and so on so now now from here also we
and so on so now now from here also we can see that this column show ID has a
can see that this column show ID has a type of string then type has string then
type of string then type has string then director has text string
director has text string format so this has happened because we
format so this has happened because we have turned on the info schema as true
have turned on the info schema as true so now let's see like after reading this
so now let's see like after reading this data it is important to check the schema
data it is important to check the schema to understand the structure of data so
to understand the structure of data so basically this will help us to identify
basically this will help us to identify the data types for each column such as
the data types for each column such as string integer and Float so if you want
string integer and Float so if you want to check the schema then what we can do
to check the schema then what we can do is like we have DF do silver dfdore
is like we have DF do silver dfdore silver and then we'll just say print
silver and then we'll just say print schema so this is our schema so here we
schema so this is our schema so here we will get the clear picture of like how
will get the clear picture of like how many column names which we have how many
many column names which we have how many columns we have then what are the names
columns we have then what are the names of those respective columns then what's
of those respective columns then what's what is the type of that column so that
what is the type of that column so that complete information which we will get
complete information which we will get using this do print schema so this is
using this do print schema so this is like this print schema is the method
like this print schema is the method which print the schema of the data frame
which print the schema of the data frame by showing the column names along with
by showing the column names along with their respective data types okay so this
their respective data types okay so this is how like we can read the data or
is how like we can read the data or understand the data then now next what
understand the data then now next what we can do we can try to clean the data
we can do we can try to clean the data so the next step would
so the next step would be data
cleaning so in the data cleaning first we can see how we can handle this
we can see how we can handle this missing values so for example
missing values so for example uh let's try to display this data frame
rating we can see that we have so many null values and along with the rating if
null values and along with the rating if we'll just check the country column in
we'll just check the country column in the country column also we have null
the country column also we have null values so for example suppose if I want
values so for example suppose if I want to replace the null values from the
to replace the null values from the rating column as unrated and I want to
rating column as unrated and I want to change the countries which has null
change the countries which has null values to the unknown so for that what
values to the unknown so for that what we can do we can say that DF do dfdore
we can do we can say that DF do dfdore Sil which is our data
Sil which is our data frame do
frame do na do
na do fill and then I need to give two columns
fill and then I need to give two columns right so for the two columns I'll need
right so for the two columns I'll need to give the curly bres and the first one
to give the curly bres and the first one would be rating so rating I would like
would be rating so rating I would like to change the null values with for
to change the null values with for example
example unrated and with this rating column
unrated and with this rating column there is one more column which is
country and Country column name I would like U null values I would like to
like U null values I would like to replace it with unknown and once we will
replace it with unknown and once we will do this transformation or we will do
do this transformation or we will do this data cleaning then assign that
this data cleaning then assign that modified data frame to our existing data
modified data frame to our existing data frame so that's why I'll do DF silver is
frame so that's why I'll do DF silver is equal to and then our condition and then
equal to and then our condition and then dot
dot display so now let's see if it is
display so now let's see if it is working or
okay so our country null values are replaced with unknown and then rating
replaced with unknown and then rating these are replaced with
these are replaced with unrated so in the same way like along
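That fill step, as a sketch (the column names follow this dataset, and the replacement values are just the ones chosen here):

```python
# Replace nulls in two specific columns and keep the result on df_silver:
# rating -> "unrated", country -> "unknown".
df_silver = df_silver.na.fill({"rating": "unrated", "country": "unknown"})
display(df_silver)
```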
In the same way we can also remove duplicate rows. For example, suppose you have multiple records of the same entry and you'd like to drop the duplicates: take our DataFrame, df_silver, call dropDuplicates, then again assign the result back to the existing DataFrame and display it.
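A sketch of that deduplication step (dropDuplicates with no arguments compares entire rows; you could also pass a subset such as ["show_id"] if you only want one row per show):

```python
# Drop rows that are exact duplicates across all columns, then keep the result.
df_silver = df_silver.dropDuplicates()
display(df_silver)
```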
Now we can see there are still a lot of null values for date added and for director, so we need to handle those values too. Let's try to fill the null values for the other columns as well — it has suggested some code: for date added I'll fill with 01-01-2025, for release year maybe 2020, duration as unknown is fine, description as unknown, and listed in as unknown. Let's see how it looks now. OK, we had forgotten some column names — director, and maybe the cast. Now let's say we instead want to remove the rows that contain any null value. For that we take our DataFrame, then .dropna, and pass the string "any" — "any" means it removes a row if that row contains a null value in any column. Let's see how it behaves: it seems like we still have a fair amount of data.
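A sketch of those two cells, assuming the column names from this dataset (director, cast, date_added, release_year, duration, listed_in, description) and the placeholder defaults chosen above:

```python
# Fill the remaining nullable columns with placeholder values.
df_silver = df_silver.na.fill({
    "director": "unknown",
    "cast": "unknown",
    "date_added": "01-01-2025",   # assumes date_added was read as a string
    "release_year": 2020,
    "duration": "unknown",
    "listed_in": "unknown",
    "description": "unknown",
})

# Alternatively, drop any row that still has a null in at least one column
# ("all" would only drop rows where every column is null).
df_silver = df_silver.dropna("any")
display(df_silver)
```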
Fine. Then, for example, you might want to rename a column for better readability. Let's say we have the column named title and I'd like to change it to content title. What I can do is take df_silver, then use the withColumnRenamed function, giving it the existing column name, title, and the name I'd like to change it to, content title, and then display the new, modified DataFrame. OK, it has been changed successfully.
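That rename, as a sketch (using content_title as the new column name):

```python
# Rename "title" to "content_title"; all other columns are untouched.
df_silver = df_silver.withColumnRenamed("title", "content_title")
display(df_silver)
```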
Now let's say I'd like to do one data transformation here, and for this I'll use the country column. The transformation is: if the entry in the country column is an actual country, I'll assign the value 1, and if the country value is unknown, I'll assign 0, in a newly created column. For that we take our DataFrame and use the withColumn function; we need to give the new column a name — maybe something like has country — and then the condition we just described: when our country column equals unknown, assign the value 0, otherwise assign 1. I want to add this new column to my existing DataFrame, so I'll assign the result back to the existing DataFrame and then display it. OK, what's it complaining about? We had written col twice — fixed. Now we can see that wherever the country column has the value unknown it gives us 0, otherwise it gives us 1.
us the value as one so now let's say suppose U I have transformed our data
suppose U I have transformed our data and now I would like to store that data
and now I would like to store that data into parket format and into our silver
into parket format and into our silver layer so basically what we need to do is
layer so basically what we need to do is like
here our data frame is DF do DF silver then we need to write our data
then we need to write our data frame in the format of
so right now I just want this into a pocket
format then mode I would like to give the mode as
the mode as aend and then I would like
aend and then I would like to save this data but before save I need
to save this data but before save I need to
to give file path here
give file path here so option
be AB FSS and now it should be into silver layer so I'll just change
into silver layer so I'll just change this to Silver and then
what should be the file name I would like to give
like to give this file name and I'll just add silver
this file name and I'll just add silver into it Amazon Prime
into it Amazon Prime titles maybe silver layer okay
titles maybe silver layer okay CSV then this would be our
CSV then this would be our path and once we are done with that path
path and once we are done with that path just save
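Put together, the write cell looks roughly like this — a sketch assuming the same storage account and an output folder name of AmazonPrimeTitlesSilver (Spark writes a folder of part files at this path, not a single file); the string passed to .mode() is one of the four modes discussed next:

```python
# Write the cleaned DataFrame to the silver container as Parquet.
(
    df_silver.write
    .format("parquet")
    .mode("append")   # or "overwrite" / "errorifexists" / "ignore"
    .option("path", "abfss://silver@adlsprimework.dfs.core.windows.net/AmazonPrimeTitlesSilver")
    .save()
)
```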
So we're using append mode here, and in total there are four write modes. The first is append. With append, say this is our ADLS and this is our silver layer, we already have files stored there, and now we're trying to store this Amazon Prime titles silver output: if a file with the same name already exists in the silver layer, append simply adds the new files alongside it. It doesn't erase the existing file — it keeps adding new files into the silver layer. That's append mode. The next mode is overwrite. With overwrite, if a file already exists in the silver layer and we try to store a new one with the same name, overwrite replaces the existing file with the new one: the previous file is gone, you won't be able to recover it, and you're left with just the new file. That's how overwrite mode works.
is how this overwrite mode works then we have error mode and ignore mode so with
have error mode and ignore mode so with the error mode again what
the error mode again what happens
if this right mode will try to store this same file into our ADLs and it sees
this same file into our ADLs and it sees that okay we have already existing file
that okay we have already existing file with the same name then with the error
with the same name then with the error mode what it will do it will not save
mode what it will do it will not save your new file into the ADLs
your new file into the ADLs folder and it will throw you some error
folder and it will throw you some error setting like maybe you already have file
setting like maybe you already have file there or something so it will basically
there or something so it will basically throw the error and it will not save
throw the error and it will not save your new work into the existing ADLs
your new work into the existing ADLs work so this is how
work so this is how this error mode will function and
this error mode will function and now the fourth one is ignore so for
now the fourth one is ignore so for ignore what happens is like again this
ignore what happens is like again this is our
is our ADLs we have this file already present
ADLs we have this file already present here and we are trying to save this file
here and we are trying to save this file here it will simply check and it will
here it will simply check and it will say that okay you I have already one
say that okay you I have already one file over there with the same name and
file over there with the same name and it will not save your file there but
it will not save your file there but also it will not give you any kind of
also it will not give you any kind of error so it will simply ignore it and it
error so it will simply ignore it and it will not save your work at that time and
will not save your work at that time and also will not give you any kind of error
also will not give you any kind of error so this is how this four mode works so
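To make those four modes concrete, here is a minimal PySpark sketch; df_silver and the storage-account path are placeholders standing in for the DataFrame and silver container used in this project, not names taken from the notebook. Pick whichever mode you need; running all four in a row is only for illustration.

```python
# Minimal sketch of the four Spark save modes; df_silver and the abfss:// path
# below are placeholders, not the exact names used in the video.
silver_path = "abfss://silver@storageaccount.dfs.core.windows.net/amazon_prime_titles"

# append: keeps existing files and adds new ones alongside them
df_silver.write.format("parquet").mode("append").save(silver_path)

# overwrite: replaces whatever is already stored at the path
df_silver.write.format("parquet").mode("overwrite").save(silver_path)

# error (the default, also called "errorifexists"): fails if data already exists
df_silver.write.format("parquet").mode("error").save(silver_path)

# ignore: silently skips the write if data already exists
df_silver.write.format("parquet").mode("ignore").save(silver_path)
```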
Now we are just going to use this append mode, and we would like to save the data in Parquet format. To write it into our silver layer we are going to use the Parquet file format, so let's try to understand a little bit more about it.

Parquet is a columnar storage file format, basically designed for use with big data processing frameworks like Apache Spark or Hadoop. It is an open-source format that has become one of the most commonly used file formats for big data because of its efficient storage and fast performance.

If we talk about its main key features, the first one is columnar storage. Parquet stores the data in a columnar layout, meaning the data is stored column by column rather than row by row. This allows for better compression and more efficient querying, because you can read only the columns you need. For example, if you only need the year column for your analysis and you don't want to fetch any other information, Spark will load only the year column into memory instead of reading entire rows. That is what columnar storage means.

The next key feature is compression. Parquet supports multiple compression algorithms, for example Snappy, gzip, or LZO/LZ4, and these help reduce the storage space. Together, the columnar layout and the compression can significantly reduce the amount of disk space required to store the data. And the third key feature I would say is schema evolution.
With schema evolution, Parquet allows you to change the schema of your data over time while you are still able to read the older data with its original schema. What do we mean by changing the schema? We just saw that we added one new column, country: if you look at the source file that column was not there, but after our changes we have this newly added column, which means we have changed the schema of our data. Normally, if we tried to write this data into the silver layer it could throw an error, because the schema of the source and the schema of the destination are not the same. But Parquet gives us that facility, or flexibility, of schema evolution: even if your data or your schema has changed, it will still allow you to store that data into our silver layer. So these are some important features of Parquet, and that's why we usually prefer the Parquet format or the Delta format.
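As a small illustration of the columnar storage and compression points above, the sketch below reads back only one column from the silver Parquet data; the path, the release_year column name and the codec shown are assumptions for demonstration, not values read from the notebook.

```python
# Illustration of column pruning on the silver Parquet data; path and column
# name are placeholders.
silver_path = "abfss://silver@storageaccount.dfs.core.windows.net/amazon_prime_titles"

# Because Parquet is columnar, selecting one column means Spark only has to
# read that column's pages, not whole rows.
years = spark.read.format("parquet").load(silver_path).select("release_year")
years.show(5)

# The compression codec is chosen at write time; Snappy is Spark's default,
# and gzip or lz4 could be requested instead, e.g.:
# df.write.option("compression", "gzip").parquet(silver_path)
```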
Now let's see whether it is allowing us to store the data or not... OK, it has been successful. Let's come to our silver layer: we can see we have one folder here, and inside that folder our Parquet file is visible. So our data has been loaded successfully into the silver layer, which means our second step is done.

Our third step will be to fetch this data from the silver layer back into Databricks and perform some advanced changes on it, so that it becomes business-ready data, and then push that data into the gold layer, the gold folder which we created earlier in ADLS. After the changes, we'll see whether we are able to load our business-ready data into this gold folder or not. So now let's see how we can move ahead.
For that, I would like to create one new notebook. I'll just create a duplicate of this notebook for our gold layer transformations, and I'll connect my cluster to it. We need to go through the same procedure again to fetch the data, so I'll just copy and paste the steps we created earlier. The first one was that Python code with the directory (tenant) ID, the secret keys and so on. Then, instead of bronze, we now need to check whether our silver folder is ready, and we can see that we have one folder there. And then we will import these libraries again.
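The copied authentication cell is not shown in full on screen, so here is a hedged sketch of what a service principal (OAuth) configuration for ADLS Gen2 typically looks like in Databricks; every identifier below is a placeholder, not the real tenant ID, client ID or secret used in the video.

```python
# Rough sketch of the service-principal (OAuth) setup for ADLS Gen2 performed
# by the copied cell; all values are placeholders.
storage_account = "storageaccount"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-tenant-id>/oauth2/token")

# Quick check that the silver container is reachable (dbutils is available in Databricks)
display(dbutils.fs.ls(f"abfss://silver@{storage_account}.dfs.core.windows.net/"))
```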
Then we will read our data again. So now I'll have df_gold = spark.read.format("parquet"), with header set to true and inferSchema set to true, and this time our data is not in the bronze layer but in the silver layer, so in the silver path we also need to point at that folder name. OK, we have made one mistake: the display should use df_gold. There we go, now we are able to fetch the new data which we stored into our silver layer. We can see our new column as well, which means that our schema evolution was successful. We had also renamed the title column to content title, and that is getting displayed too, and we don't have any null values. OK, that means our second step has worked properly.
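For reference, the read cell looks roughly like this; the container, account and folder names are placeholders, since the exact paths are not spelled out on screen.

```python
# Rough reconstruction of the read cell; all names below are placeholders.
silver_path = "abfss://silver@storageaccount.dfs.core.windows.net/amazon_prime_titles"

df_gold = (spark.read.format("parquet")
                .option("header", "true")       # carried over from the CSV read; harmless for Parquet
                .option("inferSchema", "true")
                .load(silver_path))

display(df_gold)   # Databricks display(); df_gold.show() works outside Databricks
```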
Now we'll see how we can apply some advanced transformations on this data, and then we'll try to load it into the gold layer. Let's say I want to do some advanced transformations. For example, in the date added column I would like to check the date format of that column, and if it is not already a proper date type then I would like to convert it into a proper format, which is MM-dd-yyyy. For that, we take our df_gold and use the withColumn function: we pass the date added column name, a comma, and then the condition, where we again reference the column and convert it into that date format.

And then, I think we don't need to stop at the full date: we can also extract just the year for business analysis. For that, df_gold is again our data frame dot withColumn, where we create one new column, year added, which takes the year out of the date added column of the data frame. Now let's just see whether it is working or not, so display it. OK, the date added column has been changed successfully, and it has also given us the year added column, which shows only the year taken from the date added column. So now we can see that this is working fine.
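Here is a hedged sketch of those two date steps in PySpark; the date_added / year_added column names and the incoming "MMMM d, yyyy" pattern are assumptions based on the dataset description, not copied from the notebook.

```python
# Hedged sketch of the date transformations; column names and patterns are assumed.
from pyspark.sql.functions import col, to_date, year, date_format

df_gold = (df_gold
           # parse the raw string into a proper DateType column
           .withColumn("date_added", to_date(col("date_added"), "MMMM d, yyyy"))
           # derive a year-only column for business analysis
           .withColumn("year_added", year(col("date_added"))))

# if an MM-dd-yyyy *string* representation is wanted for reporting:
df_gold = df_gold.withColumn("date_added_str", date_format(col("date_added"), "MM-dd-yyyy"))

display(df_gold.select("date_added", "year_added", "date_added_str"))
```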
Now let's try to implement one more transformation and see whether it works. If we look at the listed in column, we can see we have entries like animation, kids or comedy, kids, so let's see if we can split this one column into two different columns. For that we will use the split function. We take our data frame, then withColumn, and we would like to create a new column named, for example, category one; inside it we use the split function: I would like to split the listed in column by the comma and take the zeroth index of the result. Similarly, I'll just copy-paste it and create another column, category two, splitting the same listed in column and taking the first index. Let's see what it shows: we'll display our modified data frame. OK, now we can see category one and category two separated, so we have split the column in two.
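A sketch of that genre split, assuming the source column is listed_in (as in the Kaggle file) and the new columns are called category_1 and category_2:

```python
# Split the combined genre string into two columns; names are assumptions.
from pyspark.sql.functions import col, split

df_gold = (df_gold
           .withColumn("category_1", split(col("listed_in"), ",").getItem(0))
           .withColumn("category_2", split(col("listed_in"), ",").getItem(1)))

display(df_gold.select("listed_in", "category_1", "category_2"))
```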
Then we can check whether we have any missing values in either category. We can see that category one is always filled, but category two has a lot of null values, so we would like to replace those nulls with, for example, the text Unknown. To replace all the null values in category two with Unknown, we take df_gold, our data frame, and use withColumn on the category two column with a when condition: when category two is null, give Unknown; otherwise keep whatever the existing category is. Let's see, hopefully it will work... OK, now we can see that whatever null values were there have been replaced with the Unknown text in category two.
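The null replacement described above would look roughly like this, assuming the column is category_2 and the replacement text is "Unknown":

```python
# Replace nulls in category_2 with a default label; names are assumptions.
from pyspark.sql.functions import col, when

df_gold = df_gold.withColumn(
    "category_2",
    when(col("category_2").isNull(), "Unknown").otherwise(col("category_2"))
)

# na.fill would be an equivalent one-liner for the same effect:
# df_gold = df_gold.na.fill({"category_2": "Unknown"})
```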
So this is how you can perform some advanced transformations. I won't be able to cover every possible transformation, but these are a few I have done so that you understand what exactly needs to happen. Now we need to save this data in Delta format into our gold layer. For that we will again use the same technique to write the data, but this time in Delta format: I'll just change the format to delta, keep the mode as append, and the path will now point to the gold layer instead of silver; here I'll also change the folder name to gold, and now we will try to save it.
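A minimal sketch of that gold-layer write, with the container and folder names as placeholders mirroring the earlier silver write:

```python
# Write the business-ready DataFrame to the gold layer in Delta format.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

(df_gold.write
        .format("delta")     # Delta instead of plain Parquet
        .mode("append")      # same append semantics as before
        .save(gold_path))
```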
It is successful now, so let's cross-verify it. This is our gold layer, and in it we now have one new folder; inside it our file has been saved, along with the Delta logs that are stored alongside that Delta-format file. So the question is, why do we need to use this Delta format? There are several key features, or advantages, of this format.

The first advantage is that the Delta format supports ACID transactions. ACID stands for atomicity, consistency, isolation and durability, and these transaction guarantees basically ensure that all the read and write operations are done in a reliable and consistent manner. This is especially useful in big data environments where many users and processes may be accessing or modifying the same data concurrently. Atomicity ensures that a set of operations, for example a write, is treated as a single unit of work: it is either fully completed or not executed at all. That kind of atomicity, together with consistency, isolation and durability, is what the Delta log and the Delta format give you.
The next one I would say is time travel. I'll just cover the theoretical part here; in the next video I'll be trying to cover Azure Databricks completely, and there I'll show hands-on sessions for these activities, like time travel, versioning and schema evolution. Time travel is essentially versioning: Delta Lake provides the ability to travel back in time by querying older versions of the data. It supports time travel to access the historical versions of the data, which is especially useful for auditing, debugging, or recovering from errors. For example, suppose you are working on a data set and by mistake you delete some rows. Now you need to fetch, or recover, the past version of that data, where all of the rows were still present and nothing had been deleted. For that purpose we can use time travel: with it we can read an older version and restore it, so that we can carry on with our next activities, and the mistake of deleting some rows does not affect the previous version and we can proceed ahead successfully. That is what we use time travel for.
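For reference, querying an older version of the gold table looks roughly like this; gold_path is the same placeholder path as above and version 0 is just an example, since the hands-on walkthrough is left for the next video.

```python
# Read an earlier version of the Delta data (time travel); values are examples.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

df_v0 = (spark.read.format("delta")
              .option("versionAsOf", 0)          # or .option("timestampAsOf", "2024-01-01")
              .load(gold_path))
display(df_v0)

# The table history (one row per commit in the Delta log) can be inspected with SQL:
display(spark.sql(f"DESCRIBE HISTORY delta.`{gold_path}`"))
```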
Then, again, it also supports schema evolution, just as I already described for the Parquet format: if the schema of the data has been changed during the transformations and is different from the source, it will not give you an error while storing that data back to the sink. That is schema evolution. It also supports large-scale data processing, and it has data lineage and auditability, because Delta tables automatically track the data lineage. What is data lineage? Data lineage makes it easier to trace the flow of the data from source to destination, which basically ensures that our data has transparency, regulatory compliance, and effective troubleshooting. So for these purposes we use the Delta format, and within it we also have the Delta log transaction files.
If we talk about the Delta log briefly, we can say that it is an essential component of Delta Lake which tracks all the changes made to a Delta table, and that is what enables the ACID transactions, version control and schema evolution. The Delta log is stored in a folder called _delta_log, and it consists of sequential JSON files that record the metadata of each transaction, for example an insert, an update or a delete. These files are named incrementally, starting from zero, and as we make further changes and save them they continue as 01, 02, 03 and so on, in that incremental fashion. That is how the Delta log functions. This is just a brief overview; we'll try to cover it in another video that I'll create on Azure Databricks.
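If you want to peek at those transaction files yourself, a quick hedged sketch (with the same placeholder path) is:

```python
# List the Delta transaction log for the gold table; path is a placeholder.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

display(dbutils.fs.ls(f"{gold_path}/_delta_log"))
# Typically shows 00000000000000000000.json, 00000000000000000001.json, ...
# one JSON commit file per write, plus periodic checkpoint files.
```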
So now let's move ahead and see how we can connect our Databricks to Power BI. We know that we have pushed our data into the gold layer, but what if I would like to create one external table, a Delta table, on top of this data? Why is it useful to create an external table? Let's try to understand. Suppose this is our silver layer and this is our gold layer. We had the data in the silver layer, then we fetched it, transformed it, and pushed it into the gold layer, so now our files are in the gold layer. Suppose that by mistake I run a DROP command. In that case the drop would delete the table and the data that is already present in the gold layer, so all our data would be gone and we would no longer have access to it after that DROP command.

To avoid that, we create one external table on top of the gold layer. All of the data stays in the files in the gold layer, and the external table just sits on top of it; in this case, if we run the DROP command, it will only delete the external table and not the data that is already present in the gold layer. That is how our data is kept safe, and there will not be any loss of the original data saved in the gold layer. This is where the external table concept comes in. So let's try to create one external table on top of this data.
top of this data so for that what we can do for that we'll just create one
do for that we'll just create one database into a SQL language so for that
database into a SQL language so for that sake select SQL language from language
sake select SQL language from language from here and the code is very simple
from here and the code is very simple like the SQL create database and the
like the SQL create database and the database name so in this case my data
database name so in this case my data name database name would be gold layer
name database name would be gold layer now let's try to execute this
command Okay so it says that it is already
already present so now let let me just drop this
present so now let let me just drop this database first
okay so now we will create our database again with the name gold layer okay so
again with the name gold layer okay so this is our new gold layer and now we
this is our new gold layer and now we will save our
will save our data into Delta format as a table in the
data into Delta format as a table in the gold layer okay so for that what we will
gold layer okay so for that what we will do this the syntax will be same we will
do this the syntax will be same we will write it in Python this is our data
write it in Python this is our data frame name right format would be Delta
frame name right format would be Delta mode I would like to keep it as append
mode I would like to keep it as append then the path we need to give the path
then the path we need to give the path of
of our gold layer okay so this is my
our gold layer okay so this is my path let's suppose in this path I would
path let's suppose in this path I would like to create
one file folder inside that so maybe my folder name would be
gold and then I would like to save this as a table so for that save as table
as a table so for that save as table then I would I'll give the gold layer
then I would I'll give the gold layer which is our database name and Dot then
which is our database name and Dot then I would like to give the table name
I would like to give the table name as maybe if Prime
as maybe if Prime gold and now let's it should be in
gold and now let's it should be in inverted comma
okay so now we will try to run this
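Put together, the database and external table creation looks roughly like this; gold_layer, prime_gold and the path are assumed spellings of the names spoken in the video (the video drops and recreates the database instead of using IF NOT EXISTS).

```python
# Sketch of the database + external table creation; names and path are assumptions.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

spark.sql("CREATE DATABASE IF NOT EXISTS gold_layer")

(df_gold.write
        .format("delta")
        .mode("append")
        .option("path", f"{gold_path}/prime_gold")   # explicit path => external (unmanaged) table
        .saveAsTable("gold_layer.prime_gold"))
```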
OK, it has run successfully; now let's check. If we come to our gold layer we can see that we have a new folder named prime gold, and inside it we have the Parquet file along with the Delta log. Remember that we saved the format as Delta, and the Delta format is essentially the Parquet file format with a transaction log on top: you can see the Parquet file here, compressed with the Snappy algorithm we discussed earlier, and on top of that it has created the Delta log, so overall this is the Delta format in which we have saved our data. So that is how we can save our data as a table. Now suppose I want to check it, and I'd like to run a SQL command against it, to see whether we are able to access that table. For that: SELECT * FROM, then our database name, gold_layer, dot, and the table we created in it, prime_gold.
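Querying the registered table, and checking the drop-safety property discussed earlier, can be sketched like this (same placeholder names):

```python
# Query the external table registered in the metastore.
gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/amazon_prime_gold"

display(spark.sql("SELECT * FROM gold_layer.prime_gold LIMIT 10"))

# Dropping an *external* table removes only the metastore entry; the Delta
# files under the gold path stay untouched, which is the safety property
# discussed above. (Uncomment to try.)
# spark.sql("DROP TABLE gold_layer.prime_gold")
# display(dbutils.fs.ls(f"{gold_path}/prime_gold"))   # data files are still there
```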
OK, so now we can see that the table we saved under the gold_layer database with the name prime_gold is accessible, and we can also use SQL commands, the SQL language, to work on that table. So that is how we can use it.

The next step is how we can connect this Databricks data to Power BI. For that, go to the Marketplace in Databricks and search for Power BI, then click connect, and it will ask you to download a connection file. But before that we need to select the compute, because it automatically connects to the serverless starter warehouse, while our files live on another cluster; so select your proper compute or cluster and then download the connection file. Once you have downloaded it, just click on that file and open it.

In my case I already had Power BI Desktop installed on my laptop; if you don't, just go to Google, search for "download Power BI Desktop", take the first link, and download and install Power BI Desktop on your laptop or desktop. In my case it connected directly to my Power BI, because I had already connected Power BI to this same Databricks workspace, and if you look under the Hive metastore you can see the database we created, gold_layer, and under it the table prime_gold. Before that, though, it will ask you to connect to your Databricks workspace: it will show a popup where you need to select Personal Access Token on the left-hand side, and once you select that it will ask you to enter your access token; you just enter it and click Connect.

Where do you get this access token? Come back to your Databricks workspace, select your account, go to Settings, and under Settings you will find User, then Developer, and there click on Manage access tokens. Generate one new token, click Generate, and it will show you the token. You have to copy it right away, because once you close the dialog you will not be able to see it again. Click Done, then come back to Power BI, and in that popup click Personal Access Token, paste the token you copied, and click the Connect button. Once you click Connect you will get the Navigator, and under the Navigator, under the Hive metastore, we can see the database we created, gold_layer, and under it our table. Just select this table and click Load.
The connection will take some time to load the data. OK, now we can see that we have our prime_gold data here, with all the columns present. Now you can create dashboards here: you can create multiple charts, like bar charts or pie charts, whichever is suitable for your information. You can drag the respective columns onto the X and Y axes, add legends and small multiples, all the required information, and build four to five charts that are most useful for the business. Remember that business users are often from a non-technical background, and when they look at your report or dashboard they should be able to grasp what is going on despite that, so it is very important to know which fields are important to display and which charts are worth showing to the business. That is how you can create a report in Power BI Desktop, and this is what we have created till now.
So I think that's it. This is how we have created an end-to-end ETL pipeline, starting from the source, which was our GitHub account, all the way to connecting it to Power BI with business-ready data. For that we used Azure Data Factory to ingest the data, then we used Databricks to build the medallion architecture layers, silver and gold; we saw how to connect Databricks to the ADLS account, how to fetch the data, how to transform it, and how to push it back into the different layers in ADLS Gen2; and once the work was done, how to connect our Databricks to Power BI. So that is how the end-to-end pipeline works. If you found this video helpful, please don't forget to like the video and subscribe for more content, and let me know in the comments if you have any questions or if you want more videos like this. In the next video I'll try to cover Azure Databricks in detail, with the new concepts we touched on, like Delta Lake and so on. Thank you so much for watching this video, and till then, keep crunching the data.