YouTube transcript:
PYSPARK X DBT End-To-End Data Engineering Project | Master Big Data Engineering

Summary

Core Theme

This content is a comprehensive tutorial demonstrating an end-to-end data engineering project using PySpark and DBT, focusing on building a Medallion architecture (Bronze, Silver, Gold layers) with dynamic and modular code practices.

This project can help you crack multiple offers in 2025, because you will master Spark Structured Streaming, PySpark transformations, dynamic data ingestion, integrating Python control flow and conditionals with PySpark, and building modular code with Python classes and utilities. You will handle incremental loads and upserts, and you will build the gold layer of the Medallion architecture using dbt, covering dbt dynamic sources, Jinja functions, incremental dbt models, the dbt CLI, dbt properties and configs, ephemeral materialization, slowly changing dimensions, dbt snapshots and much more. PySpark and dbt are the hottest topics right now in the world of data engineering, and by the end of this video you will master both, trust me. This will eventually help you crack your dream role this year. If you are excited about this complete end-to-end data engineering project, let me know in the comment section, and let's get started.

What's up, my fam? Happy Sunday, first of all. This Sunday is very special because of another project. I won't waste any time: the project video is already recorded, so I know that the knowledge you are going to get from it is next level. Let me give you a quick overview of the project we are going to build today. First of all, this project is purely open source: we will only be using the open-source frameworks PySpark and dbt, plus Python. That can help you ace interviews, because you can sit in any interview, whether Azure, GCP or AWS, since PySpark, dbt and Python are used everywhere. You may not be able to gauge the depth from this architecture diagram alone, but because the video is already recorded I can tell you that each layer is full of real-world scenarios.

Let's walk through the architecture, starting with the bronze layer. We work with PySpark streaming on our source: we have our source files stored in the data lake, and we start directly with Spark Structured Streaming to load that data incrementally. We are going to create dynamic notebooks for the incremental load, in pure Python and PySpark, using loops, arrays and some classes as well, so it will be a lot of fun. This lands our data in the bronze layer. The platform we use to run our Spark code is Databricks, but the code is pure PySpark, so you can run it anywhere.

Then, once the bronze layer is built as per the Medallion architecture, this data goes to the silver layer. (Let me add the arrow that is missing from the diagram.) In the silver layer we are going to create Python classes that make our code modular, so that we do not need to rewrite the same kind of transformations each time. Pure Python classes with PySpark code are one of the hottest interview topics right now, because they test both your conceptual knowledge and your implementation skills. We will also work with custom Python modules, or utilities. Then comes dbt, the heart of this video. Really, every layer is the heart: bronze, silver and gold are all insane in this video. I didn't realize how much would be built until the video was complete, and then I thought, "Man, this is amazing."

Next, we are going to build our star schema in the gold layer using dbt. This is not a high-level overview of dbt integration with PySpark; it goes in depth. You will learn about models, sources and Jinja functions, and yes, we will use Jinja templates as well, with loops and if conditions. We are also going to use snapshots, YAML files and custom configurations, all in pure dbt, and we will build our dimension and fact tables with it. We will also learn how to work with incremental data and upserts in dbt, which is a next-level thing. Everything is covered in this complete end-to-end data engineering project video. I know you are really excited, and I'm doubly excited because I know the quality of the content that is already created.

Just one request: drop a lovely comment on this video, because that helps me a lot to keep creating more videos like this. Plus, I have great news to celebrate with you all. Let's congratulate these people who have recently cracked interviews at MNCs, or at whatever company they wanted to join. Many, many congratulations! I want to see your name here as well, and you can achieve that by following what we discuss in these videos and projects and by staying focused. That's it. I'm waiting for your comment to be here, because I would love to feature it. I won't be able to feature every comment, but featuring some of them shows that we are celebrating the wins of our data fam, and I would love to celebrate your win as well. So, many congratulations!

Now let's talk about our data. We have selected an amazing dataset for this project. This is the repo, by the way, analy YouTube. Scroll down and click on the PySpark dbt project folder; since this is a PySpark and dbt project, the folder has the same name. These are all the CSV files, which you can download to your local system because you will be using them. This is a kind of Uber dataset, and we picked it because it gives you a sense of a dimensional data model, meaning dimension and fact tables. Dimensional modeling is one of the most popular and most in-demand data modeling techniques that data engineers work with, so you should know it. Fact tables, dimension tables, incremental data and slowly changing dimensions will all be covered in this project video, in pure PySpark and dbt. So this is the data that you can download.

Now let's get started with the second thing you should know, which is your environment. Which environment should we use for PySpark? We could develop everything locally on our own system, but not everyone has a laptop with enough RAM and processing power. So we are going to use Databricks Free Edition, which is totally free; you do not even need to add a credit or debit card. We just want a platform to run our PySpark code, and for me Databricks is the best platform to run PySpark code and big data workloads in general.

How can you create your Databricks Free Edition account if you haven't already? Open your browser, search for "Databricks Free Edition" and select "Try Databricks for free". I have already created mine, so let me open an incognito window. When you click "Try Databricks for free", you will see the option "Get Free Edition instead". Do not use the other option on this page, because that is the express edition, which is only available for 14 days, and you want something that is available for a longer period. So click on "Get Free Edition instead", then click "Sign up for Free Edition" and create your account. That's it. Once your account is created, you land on this page; click here, and this is your Databricks homepage.

I assume you know a little bit about Databricks. We are focusing on PySpark here, so that's fine, but if you want to learn and explore Databricks, you already know where to learn it, right? Databricks is in demand, and the platform is growing a lot. So go with the flow and get the right knowledge at the right time; this is the right time to gain knowledge in Databricks.

So this is our Databricks environment. First of all, we need to create a catalog. I would expect you to have some knowledge of Databricks, but even if you do not, just follow the steps, because we are focusing on PySpark. Still, you need to manage your metadata, and for metadata we create something called a catalog in Databricks. Do not overcomplicate it; it is very simple. Click on Catalog and you will see all the available catalogs. These are some catalogs that I created for my previous videos; you can ignore them. We will create a new one: click on the plus (Add) button, click on "Create a catalog", and name it PySpark DBT. Then click Create, configure the catalog, scroll down, and click Next and Save.

Our catalog is created, and its name is PySpark DBT. What is a catalog? It is a data management feature that, if you come from a SQL background, is roughly equivalent to a database; it is the top level of the catalog, schema, table hierarchy. Inside the catalog we have schemas. There are two schemas by default, default and information schema, and we will create our own schemas for the different layers. Since we are following the Medallion architecture, we create three schemas: bronze, silver and gold.

But before creating the Medallion schemas, we will create one more schema, the source schema. Our source could be anything: a SQL database, APIs, literally anything. In our case it is CSV files stored in the data lake. The good thing is that Databricks Free Edition is automatically attached to a data lake, with AWS as the provider, so we can leverage something called volumes. You do not need to worry about what volumes are, because this is just our source; as a data engineer, you should worry about how to use the source rather than how to create it. In this section we will create the volumes. If you know about volumes, good; if not, that is also fine, because you are simply creating the source. So let's create a schema: click on the three dots, or on "Create schema", name it source, and click Create.

So this is my schema called source, and within it I will create a volume. Click on the schema, click Create, and then click Volume. The volume name will be source data, and it will be a managed volume. Then click Create. So this is our source data volume, where we will upload all the CSV files.
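The UI steps above (catalog, schema, managed volume) could also be scripted. The sketch below only builds the SQL statements and, optionally, runs them through an existing SparkSession. The identifiers pyspark_dbt, source and source_data are assumed spellings, since the video creates these objects by clicking through the UI, and the helper names are hypothetical. Whether Free Edition permits CREATE CATALOG from SQL is not confirmed by the video.

```python
def setup_statements(catalog, schemas, volume_schema, volume):
    """Return SQL statements mirroring the UI steps: catalog, schemas, volume."""
    statements = [f"CREATE CATALOG IF NOT EXISTS {catalog}"]
    statements += [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}" for schema in schemas
    ]
    # CREATE VOLUME without EXTERNAL creates a managed volume.
    statements.append(
        f"CREATE VOLUME IF NOT EXISTS {catalog}.{volume_schema}.{volume}"
    )
    return statements


def run_setup(spark, statements):
    """Run each statement with an existing SparkSession (`spark` in a notebook)."""
    for statement in statements:
        spark.sql(statement)


# Assumed identifiers; adjust them to whatever you typed in the UI.
statements = setup_statements(
    catalog="pyspark_dbt",
    schemas=["source"],
    volume_schema="source",
    volume="source_data",
)
for statement in statements:
    print(statement)
```

In a notebook, run_setup(spark, statements) would execute them; doing it through the UI, as in the video, gives the same result.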

There are two ways to do this: either upload all the CSV files directly into the volume, or create a dedicated folder for each file. The choice is yours. To create a folder, click on the volume and select "Create directory", which creates a folder within your volume. The advantage is simple: if you have multiple files, you can keep them in dedicated folders, so it is good practice to create a folder per file. Let's say I want a directory for trips, because I have a trips file: I create a directory named trips. If you refresh this area, you will see the folder here.

If you click here, this is the trips folder. This is my first directory, and within it I can upload data using "Upload to this volume". When you click "Upload to this volume" and then Browse, make sure the destination is trips; otherwise the data is loaded into the root location of the volume, which we do not want. Let me click Browse and add the trips file, so trips.csv is added. There is also an option to overwrite files with the same file name; we do not have a file with that name yet, so it's fine. Click Upload, and the file trips.csv is uploaded.

Next we have vehicles. I click on source data, where you can see one folder, trips, then click "Create directory" and create a directory for vehicles. Then I upload to this volume, following the same steps, and select the vehicles file. Sorted. Let's create another directory named customers and upload the customers file to it. Then we do the same for drivers, locations and payments. Now all the folders for our volume are done. Why is this helpful? As I mentioned, we will add more and more files to these folders, because we also want to see how to process and load data incrementally. All of that is covered in this video, so sit back, relax and code.

So this is our data source, created within just 5 minutes. If you understood this part, good; if not, that's still fine, because it is very simple. This is the volume we have created. If you open the customers folder and click on the CSV file, you can click "Copy path" and paste it, and you will see that it is a path to the volume. This is the path you use in Spark code as well, and I will show you how. Under Files you see all the files, under Details you see the managed metastore that is created for us, and then there are Permissions.

Now we can finally create our first folder. Click on Home, click Create, and create a folder; the folder name will be PySpark dbt project. Within this folder we will create our notebooks. Why notebooks? Because notebooks are the building blocks for your Spark code on Databricks. You can also create plain Python files, but notebooks are easier to read and better when you want to run your code in chunks, and they are the approach I recommend here. So let's create our first notebook. If you are not seeing the same UI as mine, go to your name, click Settings, then Developer, scroll to the bottom and enable "Tabs for notebooks and files". This is a newer feature in the Databricks UI that opens notebooks as tabs: this is one notebook, and if I open a second one, it opens in another tab. This feature was not there before.

First of all, let's rename our notebook; I will name it bronze ingestion, because we want to ingest our data into the bronze layer first. And we are going to work with dynamic notebooks: we do not want to run our code for just one file or one source, we want to do it dynamically. I will show you how. Everything will be done in pure Spark code, and we will be creating classes, loops, conditionals, everything.

First, let me show you what our data looks like. In Catalog, the third icon, this is my catalog, PySpark DBT, this is my source schema, and this is my source data volume with all its folders. How can we look at this data? It is very simple: attach your notebook to a cluster. What is a cluster? Apache Spark is a distributed computing engine, and compute is its backbone, so you need a cluster to run it. Here we have a feature called serverless compute, which automatically adds nodes when required and removes them when you are working with less data, so it behaves like autoscaling. Serverless also means you are not managing the virtual machines; Databricks manages them on its side. If you create all-purpose compute instead, you have to configure and manage the cluster yourself, which you do not want here, so just go with serverless.

Now let's see how to look at the data. Very simple: write df = spark.read.format and pass the format of our files, which is csv. Then add .option with header set to true. For CSV in particular you should set header to true: CSV files carry a header row but no schema, so we tell Spark to use the header row for the column names. Then add another .option, inferSchema set to true. If we do not specify inferSchema, Spark treats all the columns as string columns, because CSV files store everything as text. That is why we set inferSchema.
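A minimal sketch of the reader described so far, assuming a spark session is already available, as it is in a Databricks notebook. The helper name read_csv_batch is hypothetical; the path to pass to load is covered next in the transcript.

```python
def read_csv_batch(spark, path):
    """Batch-read the CSV files under `path`.

    header=True uses the first row for column names; inferSchema=True makes
    Spark sample the data to pick column types instead of reading every
    column as a string.
    """
    return (
        spark.read.format("csv")
        .option("header", True)
        .option("inferSchema", True)
        .load(path)
    )


# In a Databricks notebook, where `spark` and `display` are predefined:
# df = read_csv_batch(spark, "/Volumes/<catalog>/<schema>/<volume>/customers")
# display(df)
```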

Then I add .load. In load we always pass the location of our files. So what is the location? We can work with databases or with data lakes, and in the modern world we work mostly with data lakes. But you do not actually need the location of the data lake itself; you only need the location of the volume, because the volume is your source, and volumes are built on top of the data lake. How do you get the volume location? Click on source, then source data, and next to whatever folder you want to read, say customers, click on the arrows and the location is inserted into the notebook. There is also a specific way to write the location by hand: first write Volumes, then the catalog name, then the schema name, which is source, then the volume name, which is source data, and then the folder, which is customers. That's it; run the cell.
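Unity Catalog volume paths follow the pattern /Volumes/<catalog>/<schema>/<volume>/<folder>. A small helper, hypothetical and not from the video, makes that order explicit; the catalog and volume identifiers below are assumed spellings.

```python
import posixpath


def volume_path(catalog, schema, volume, *parts):
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)


# Assumed identifiers for the catalog, schema and volume created earlier.
print(volume_path("pyspark_dbt", "source", "source_data", "customers"))
# prints /Volumes/pyspark_dbt/source/source_data/customers
```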

This is the format for specifying a location in the modern world, where we use volumes built on top of the data lake. If you do not remember it, take advantage of the UI: click on the two arrows and you get exactly the same path. Let me rerun the cell. To see the data, use the display command, display(df), and run the cell with Shift+Enter; you can either click the run button or press Shift and Enter together.

So this is my data; all the columns are here. Do not worry, these are pseudo phone numbers, so do not try calling them. This is the customers data. It is such a nice dataset that we also have a last updated timestamp. It is the Uber dataset, a pseudo one, but it resembles data used in the real world, which is why you see fields like last updated timestamp and sign-up date: these things exist in the real world.

I try to create these projects in such a way that you can highlight them smartly on your resume and say, "I have built this project." Of course, it depends on how you present the project, but if you can showcase it confidently, the interviewer will feel that you have worked on real data and a real project and will give your profile more consideration. There is a myth that you can only showcase projects built in industry. No: hobby projects or university projects are considered as well, if they are at that level. This is not a simple entry-level project; we are building a production-level, real-world project. When I say real world, I mean it: there will be challenges, and the environment will feel like the real one. If you do not trust me, talk to someone who is already a working data engineer, a real one, not someone who just wants to criticize me, and ask whether this project is relevant to the industry or to their organization. They will say, definitely yes.

So this is my data and my DataFrame. If you have some fundamental knowledge of the Medallion architecture, you know that in the bronze layer we write the data as is, without applying any transformation. Why? Because our pipelines need a source of truth, and the bronze data is an exact replica of the source. That is why we never apply transformations there; we simply append the data as is. We also want to load the data incrementally. Say today the customers folder has just one file, customers.csv. Tomorrow there will be a new file, because there can be new customers, then another, and another. We do not want to process all the files every time, because in the world of big data reprocessing everything costs money. So in the real world we load data incrementally.

So how do you load data incrementally? We will use PySpark streaming, that is, Spark Structured Streaming. Do not feel that it will be difficult because it is streaming; it will be very easy. And you can highlight a project in which you also used real-time data processing, which is a big deal, trust me. Now let's see how you can ingest the data incrementally.

We are all set to ingest the data incrementally. For that, let's first create our schema: go to Catalog, create a new schema within our catalog, name it bronze, and click Create. Within this bronze schema we do not need to worry about the data lake for the tables themselves, because the bronze tables will be Delta tables. But we do need somewhere to store the streaming checkpoint, which identifies which files are already processed and which files still need to be processed. This is a fundamental of Spark streaming; if you do not know it, no need to worry, I will show you step by step. So let's create a volume, name it checkpoint, and click Create. This is our checkpoint volume.

Let's go back to our bronze ingestion notebook. (Let me first change the theme to dark mode under the Developer settings.) So this is our notebook. Here we displayed the DataFrame with display(df); that was a static, batch DataFrame, just to look at the data. But the batch read has another purpose, given that we are working with Spark Structured Streaming. Whenever we work with Structured Streaming, we should always define the schema for our DataFrame (file-based streaming reads expect one by default), so that Spark knows which data type each column should have.

There are two ways to define the schema. One is to add the schema manually, column by column. The smarter way is to reuse the batch read: run df.schema, and you will see the complete schema. You can see that customer ID is of integer type and first name is of string type; this is the schema Spark inferred automatically during the batch read. If you are satisfied with it, store it in a variable, say schema_customers. If you are not happy with it, say you want to keep customer ID as a string, copy and paste the schema and change that one type by hand. This saves a lot of time, because you do not need to define every column manually; you get the schema and change only the columns you want. We will do the same thing. So this is our schema_customers, and it is for the customers data only.
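One way to get the "infer once, then adjust" workflow without copying and pasting is to turn the inferred schema into a DDL string with per-column overrides. This is a sketch under assumptions: schema_to_ddl is a hypothetical helper, it assumes simple column names without spaces, and the customer_id column name is a guess at the spelling in the CSV. It relies only on StructType.fields, StructField.name and DataType.simpleString(), and a DDL string is accepted by the reader's schema() method.

```python
def schema_to_ddl(schema, overrides=None):
    """Render a StructType as a DDL string, replacing selected column types.

    `overrides` maps column name -> Spark SQL type name, e.g. {"id": "string"}.
    """
    overrides = overrides or {}
    columns = []
    for field in schema.fields:
        type_name = overrides.get(field.name, field.dataType.simpleString())
        columns.append(f"{field.name} {type_name}")
    return ", ".join(columns)


# In a notebook, after the batch read with inferSchema:
# schema_customers = schema_to_ddl(df.schema, {"customer_id": "string"})
# schema_customers can then be passed to .schema(...) on a reader.
```

Keeping df.schema itself, unchanged, works just as well when the inferred types are acceptable.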

let's say I want to now start my

streaming data streaming data processing

right. So how we can just do that? Let

me just add

uh text and I will simply say let's say

H3 and bold and I will make it as

Okay, spark streaming. So in this

particular spark streaming we can simply

define our data frame and we can read

our data frame using stream method

instead of batch method. So the code is

very much similar. So it says spark.readStream

instead of spark.read, and then

format, and I will simply say CSV. Okay.

And then I will say .option header true.

Then I will say schema: instead of infer

schema, now we need to say schema. So

schema will be my schema

customers, right? And this is my data

frame and I do not want to display it.

Okay so this is my data frame that we

have read using streaming method.

Perfect. Let me just run this. What will

happen? Nothing. Because of lazy

evaluation we have just defined what we

need to do. Now we want to write this

data frame into the bronze layer and we

want to create a table on top of it.

df.writeStream

.format, and the best format is delta.

Okay. Then we can say output mode. So

now what should be the output mode? So

output mode should be append. Should be

what? Append. Why append? because we

want more files to be appended every

time we will be running this particular

stream. Make sense? So I'll simply say

append and then I will say dot option

and then I need to define the checkpoint

location. Now what is this checkpoint location?

So basically checkpoint location helps

spark to know which files are processed.

So let me just tell you. So let's say

this is your Spark, right? This is your

Spark... not Spark, basically this is

your, let's say, source.

Okay, and here is your Spark.

So this Spark engine will read the data

from the source and write the data to

the destination.

Make sense? Let's say this is one file

and one file is here. Now next day you

have another file. Let's say this is the

new file. This is a new file. Now it

needs to just process this file instead

of both the files. Right? So how spark

will know that it has already processed

this file. So that is why it creates

something called as checkpoint location.

As the name suggest it is a checkpoint.

So it will take care of that particular

spark streaming query that hey I have

already processed this green file

already processed this file. So the

file name, the metadata of the file,

everything will be stored here. So

before processing the next write stream

it will read the checkpoint and will say

hey this thing is already processed. So

I just need to process this particular

thing and just push it into the

destination. This is the advantage of

spark streaming checkpoint location and

we need to manage this particular

location that's why I created that

particular volume to store that

information right perfect so I will

simply say checkpoint location and what

will be the location as you know if you

go here and this is our schema bronze

and this is my location okay so far we

do not have any kind of folder in this

but we will create one folder like this

let's say

this one and within the within that I

want to create a folder called

customers. So what will be the advantage

of it? The advantage is very simple. So

within the checkpoint um volume we want

to store all the checkpoints for

customers, trips, locations, everything.
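To make the checkpoint idea concrete, here is a toy model in plain Python. It is only an illustration, not how Spark stores its checkpoint internally: the set stands in for the metadata kept in the checkpoint location, and the file names are made up.

```python
def run_once(source_files, checkpoint):
    """Process only files not yet recorded in the checkpoint, then record them.

    `source_files` is a list of file names in the source folder;
    `checkpoint` is a set standing in for the checkpoint metadata.
    """
    new_files = [f for f in source_files if f not in checkpoint]
    checkpoint.update(new_files)  # remember what has been processed
    return new_files


checkpoint = set()

# Day one: a single file in the source, so it gets processed.
day_one = run_once(["customers_1.csv"], checkpoint)

# Next day: a new file arrives; only that file is picked up.
day_two = run_once(["customers_1.csv", "customers_2.csv"], checkpoint)

# Running again with no new files processes nothing.
rerun = run_once(["customers_1.csv", "customers_2.csv"], checkpoint)
```

Keeping one checkpoint folder per entity (customers, trips, locations and so on) gives each stream its own independent record like this.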

Make sense? Very good. Now I will say

dot trigger. Now this is very important.

So whenever you want to just work with

you can say streaming basically

real-time data we want our pipelines to

run at an interval, right? 5 seconds, 10

seconds, 15 seconds, or maybe 1, 2 or 3

seconds, so it's up to you how you want to

process your data. Make sense? It can

be 1, 2 or 3 seconds, or you can

simply say processingTime

equals, let's say, 10 seconds. Okay, you can

write it like this, but obviously we do not

want to process our data, or basically

run our notebooks, every 10

seconds. So I will simply say once

equals true.

What is once equals true? So what it

will do if it has processed your files

it will stop immediately stop then when

you'll be just triggering it next time

then it will read only the incremental

files then stop then next time

incremental files then stop. So this way

we can save our compute and obviously if

we are using Free Edition we cannot use

the compute which will be just charging

a lot to Databricks, because you are

learning and you do not want to create a

long bill for them. So that is why we

always use once equals to true in the

you can say environment where you do not

want to spend a lot of compute. Make

sense? Okay. But yeah in the real world

you can simply use processingTime:

1 second, 2 seconds, 3 seconds, 10

seconds. But here we will simply be using

once equals true. Once it is loaded,

then stop. Simple. Perfect. Then at the

end we need to simply say .toTable,

and we want to create a table for this.

And we want to create a table in this

catalog. DBT...

no, not dbt, I think pyspark dbt, then bronze,

then customers. Make sense? So this is

our code that we want to write for

Spark Structured Streaming for

incrementally loading the data. But how

we can just make it dynamic that is the

question because if I run this code it

will simply load the data for customers.
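Put together, the static customers version described above might look roughly like this. It is a sketch under assumptions: `spark` is an existing SparkSession, the source path, checkpoint path and table name are placeholders supplied by the caller, and `trigger_kwargs` is a hypothetical helper added here only to show the two trigger choices side by side.

```python
def trigger_kwargs(run_once=True, interval="10 seconds"):
    """Keyword arguments for DataStreamWriter.trigger().

    once=True processes whatever is available and then stops;
    processingTime="10 seconds" keeps the query running on an interval.
    """
    return {"once": True} if run_once else {"processingTime": interval}


def load_customers_to_bronze(spark, schema_customers, source_path,
                             checkpoint_path, table_name):
    """Incrementally load customer CSV files into a Delta table."""
    # Streaming read: same shape as the batch read, but with an explicit schema.
    df = (
        spark.readStream.format("csv")
        .option("header", True)
        .schema(schema_customers)
        .load(source_path)
    )
    # Streaming write: append new rows, track progress in the checkpoint,
    # run once and stop, and register the result as a table.
    return (
        df.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_kwargs(run_once=True))
        .toTable(table_name)
    )
```

Newer Spark versions also accept `availableNow=True` in `trigger()`; whether that is available depends on the runtime in use.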

Simple then if I want to do it for

locations then I will be just copy and

pasting it in the next cell then next

then next you will be saying yeah that's

what we do in the real world right? Yes,

but I want you to become an efficient

data engineer, a pro data engineer. How

you can just become the one? By

obviously doing the special things,

the things which are not being done

by everyone else, right? So the best way

to process this data, the one that will make your

résumé stand out, or basically your

project stand out, is dynamic ingestion.

What do I mean by dynamic ingestion? So basically,

first of all I want to just show you

what is the static thing here. The

static thing is obviously this location

obviously this checkpoint location

obviously this table creation, and

then the schema. Let's talk about schema

at the end, because that is the one, you can

say, for which you need to perform a

little bit of pre-work, but that's fine.

but how we can just make everything else

dynamic. It's very simple. So I will

simply create a list of variables

because obviously I can just show you a

lot of ways but I I just want to show

you the PySpark way so that you will not

be stuck with any kind of, you can say,

tool that you're using. You can simply

run this PySpark code anywhere else.

Okay. So I will create a code cell and

I'll simply say

um let's say entities let's say I have

these entities and I can pass the list.

I have customers.

I have I think trips,

right? Then I have locations.

Then I have payments.

Payments or payment? Payments. Then

vehicles and customers. No, customers is

done. Vehicles.

Okay. Because I always prefer dynamic

things because dynamic data solutions

are the hottest topic right now. Every

organization or basically your hiring

manager will feel happy if you have

built a dynamic solution, because static

code is not very reliable in

today's world. Right? So vehicles is

done and then drivers. Okay. So this is

my list of entities that I want to

process. Make sense? Okay. Make sense?

So this is my list. Perfect. So what I

will do? I will run

a loop. Okay.

I will simply say for entity in

entities. Okay. And then I will replace

all these things with a variable. I will

simply say f. This is basically an f-string

in Python. And I will simply say entity.

Entity. Okay. And here as well, in the

checkpoint, I will say entity so that it

will create a dynamic folder for all the entities.

And then table name as well. I want to

create a new table for each entity. So

this is done. This is very simple. Make

sense? Very good. Now the only thing

left is this particular schema

customers. So how we can just work with

this thing. Okay. So for this particular

thing you will basically create an

array... basically not an array, a

schema for each entity. So this

way you have schema customers like this.

Make sense? Let's say df.schema.

there are like basically so many ways

but I want to just show you so that you

can just perform everything in a dynamic

way everything in a dynamic way make

sense so what I will do I will simply

copy this code for the batch read okay

and I will even paste it here

and I will just show you the manual way

as well because choice is yours I want

to make everything dynamic but you can

just follow a hybrid approach manual

plus uh automatic so here I will simply

say customers

Uh here it will become entity.

we can say

So it will become schema entity.

I hope you are absorbing the knowledge

what I'm trying to do. DF batch. Okay.

And then this will become DF batch.

Perfect. So what we are trying to do, we

are simply running a loop. Okay. So

first it will go and process this batch

data and we will simply read it. Okay.

And you know whenever we just read the

data we actually do not consume a lot of

computation. Okay. Even if we have hit

the trigger because it will simply read

a few files. That's it. Then it will

simply grab the schema for it. Okay? And

then it will start the stream processing

and it will pass that schema here. And

we can just change the variable name

here which will become schema entity.

Make sense? And this way you can

actually do this processing easily. This

is your full-fledged dynamic solution

for ingestion into the bronze layer. You

didn't define any schema. You didn't

write any schema. You didn't define any

data types. You didn't define any

location. You didn't define any table

and you didn't copy the code. Nothing.

Just pure dynamic solution and that's

it. Make sense? That's how you build the

real world solutions.
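The loop described above can be sketched like this. Treat it as a sketch under assumptions rather than the exact notebook cell: `spark` is an existing SparkSession, the source root, checkpoint root and catalog are passed in by the caller, and `bronze_targets` and `ingest_all` are hypothetical helper names used to keep the f-string logic in one place.

```python
ENTITIES = ["customers", "trips", "locations", "payments", "vehicles", "drivers"]


def bronze_targets(entity, source_root, checkpoint_root, catalog, schema="bronze"):
    """Build the per-entity source path, checkpoint path and table name."""
    return {
        "source": f"{source_root}/{entity}",
        "checkpoint": f"{checkpoint_root}/{entity}",
        "table": f"{catalog}.{schema}.{entity}",
    }


def ingest_all(spark, source_root, checkpoint_root, catalog):
    """Run the infer-schema-then-stream pattern once per entity."""
    for entity in ENTITIES:
        t = bronze_targets(entity, source_root, checkpoint_root, catalog)

        # Batch read, used only so Spark can infer the schema.
        df_batch = (
            spark.read.format("csv")
            .option("header", True)
            .option("inferSchema", True)
            .load(t["source"])
        )
        schema_entity = df_batch.schema

        # Streaming read with the schema captured above.
        df = (
            spark.readStream.format("csv")
            .option("header", True)
            .schema(schema_entity)
            .load(t["source"])
        )

        # One Delta table and one checkpoint folder per entity.
        (
            df.writeStream.format("delta")
            .outputMode("append")
            .option("checkpointLocation", t["checkpoint"])
            .trigger(once=True)
            .toTable(t["table"])
        )
```

Adding a new entity then means adding one string to `ENTITIES`, with no copied cells.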

Okay. What was the second way? If I do

not want to include this code in my

processing. So for that what you will

do? You will go here and you will keep

on adding this particular schema in

basically a list of dictionaries.

Basically this is schema customers then

schema this then schema that and blah

blah blah blah so many names. Okay. So

that particular array will look like

this. Let's say array.

Okay. So this will be a dictionary. Okay.

And this will be, let's say, name,

and name will be, let's say, customers.

Okay. And

here will be schema. Okay.

And schema will be like this.

schema will be this one. Okay. So this

way this is your first entity then

second then third then fourth. So you

will be keep on doing this thing. So

this will become your array. Okay. This

will become your array. And instead of

running a loop on just a list you will

run a loop on the array. And here you

can simply use the keys of each dictionary.

Make sense? So let me just show you. I

actually deleted it. So let me just

create another one. Array. Okay.

So let's say name. Okay. Let's say name

is customers

and then schema.

So if you want to just run a loop, you

can just do like this. So for now we

have just one entity. Let me just copy

and paste it. Let me just show you just

for your understanding, because

I want you to know everything.

Okay. So this is my array. So you will

run something like this: for i in this

array. Okay. And if you just want to use

any value from it, let's say. Okay. So

you can simply say print

i dot... okay, i dot...

not array, basically schema.

Make sense? Because this is a dictionary,

right? A dictionary of two keys, name and

schema. See, this is a dictionary of

two key-value pairs, name and schema.

These are basically your two key-value

pairs, right? So from these two key-value

pairs you can simply pick the schema.

Make sense? Because this is a

dictionary. So just print i.schema

and let's see. Uh, okay, makes sense,

because this is a dictionary, not a tuple,

so you will simply write it like this: i["schema"],

because each element of the loop

is a dictionary, right? So see, that's how

you can just use a schema so choice is

yours both are fine but I just wanted to

show you more dynamic way and now you

know now you know okay perfect so I can

also remove this particular these two

cells these are not required anymore
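The manual alternative, a list of dictionaries with a name and a schema for each entity, can be sketched like this. The schemas below are placeholder DDL strings written for illustration only (the `trips` columns in particular are made up); real ones would come from `df.schema` or be written by hand.

```python
# Placeholder schemas as DDL strings; not the real column lists.
entity_configs = [
    {"name": "customers", "schema": "customer_id INT, first_name STRING"},
    {"name": "trips", "schema": "trip_id INT"},
]

for item in entity_configs:
    # Each element is a dict, so use item["schema"];
    # item.schema would raise AttributeError.
    print(item["name"], "->", item["schema"])

# A lookup by entity name is often handier than scanning the list.
schemas_by_name = {item["name"]: item["schema"] for item in entity_configs}
```

Both approaches feed the same loop; the difference is only whether the schema is inferred at run time or maintained by hand.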

perfect so this is my code and that we

have written. So I can now run this code

and let's see if we have any errors. We

can fix it but I don't think so we

should have. And let's run this and

let's wait for it to process. Path

doesn't exist.

Entity. Okay, makes sense because I

didn't wrap this thing into a variable.

Makes sense. Okay, now let's run this.

Path does not exist. Entity. What do you

mean? DBFS volumes uh

source data. Okay, I think oh we forgot

to wrap this thing as well. Okay, so

let's do it because this was a batch

code and I can just do it. But what

was the issue? We didn't add f.

That's it. Minor minor minor things
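The "path does not exist" error comes down to a missing `f` prefix, which a two-line comparison makes visible. The path below is a placeholder, not the real volume path.

```python
entity = "customers"
source_root = "/path/to/source_data"  # placeholder, not the real volume path

# Without the f prefix the braces stay literal, so Spark is asked
# to read a folder that is literally named "{entity}".
without_f = "{source_root}/{entity}"

# With the f prefix the variables are substituted.
with_f = f"{source_root}/{entity}"
```

The same applies to the checkpoint path and the table name: any string that mentions `{entity}` needs the `f`.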

and let's wait. And this one cell alone

will process all our tables. Just

imagine the power of this particular...

Basically, there's an error. What

is the error?

Let's see what is the error.

Okay. Is there any error or warning?

Some streams terminated before this command

could finish. Okay. Query... this is...

a schema mismatch detected when writing

to the Delta table. Okay. To enable

schema migration using DataFrameWriter

or DataStreamWriter, please

set option mergeSchema true.

Okay. Okay.

Okay. So, let's first of all check the

state like what do we have currently?

Let's open the bronze and we just have

one table. Oh, I got it. I was so sure

that okay, something is wrong. We forgot

to add entity here. So, what actually

happened? It created a table named

entity, and when it was trying to

write the next entity into that same table, obviously it will

say, hey, schema mismatch is there. So,

it's fine. We can simply remove this

table. It's fine or we can even delete

it later on. It's fine. So let's run

this code and silly mistake just ignore

and just I was saying that just imagine

the power of this particular dynamic

solution. If let's say you want to

process hundreds of tables or basically

you want to create hundreds of tables

you do not need to copy and paste that

cell 100 times. Now just one cell 100

tables are done. This is the power and

this is done. See all the tables are

processed and I can also show you the

graph. This was a graph and currently it

show basically this graph is just a

representation of real-time data. This

job is finished. So that is why we do

not have anything in real time but this

is done. Let me just show you and

refresh the page. Our bronze ingestion

is done. It's done.

Okay. If I open pyspark dbt, if I open

bronze. Oh man. Yes. All these six

tables are there. You will say seven

bro. You need to ignore this one if you

remember. So all these six tables are

there. All these six delta tables are

there and we have enabled incremental

logic using dynamic notebook and you saw

how we do it in the real world. So that

is my point of creating this project as

well. Basically all the projects that I

create see I know there are a few people

who want to who love to criticize me.

Thank you so much for doing that. It's

fine. So even if someone says hey you

are just building a project which is

just the Kaggle using Kaggle data set

this data set that data set bro data

engineering is not just about data sets

data engineering is about solutions okay

so do not think like data engineering is

just about dealing with messy messy data

set no it's about solutions okay so

I would say thank you so much for saying

all those things and I I know like there

are few people that's it but I love

those people as well and that's it and I

just want you to become aware of these

things that you can build these things with

normal data sets as well. Yes. Okay.

Make sense? Okay. Very good. And just be

positive. Just be positive. It's fine.

It's fine. If someone is saying bad

about you just say thank you. Thank you

so much. Okay. So this was all about our

bronze layer. And our bronze layer is

ready. And if we just go and check our catalog,

it will be having all the tables. And

let's remove this entity table because

we do not want to keep it. Okay, just

click here three dots and delete.

Perfect. Our six data sets are there.

And don't worry, you will also get all

the notebooks in my GitHub repo so that

you can refer it for your future use

case, for your interviews, everything.

But I would recommend you to create your

own notebooks. Please, please, please.

But I know sometimes you can feel maybe

not feel like let's say you are just

experiencing some errors then you can

just refer to those notebooks. Okay. So

those notebooks are just for

the reference but please try to create

your own notebooks because it will boost

your knowledge and your confidence.

Okay. Simple sorted. Very good. Now

let's try to create our silver layer.

And now let's discuss like what we

actually need to do in the silver layer.

And we all know that silver layer means

data transformation, data processing.

But do you know what? I will be showing

you how you can create Python classes to

automate the data transformation step.

What another dynamic solution? Yes,

another dynamic solution because this

thing can actually make a difference in

your résumé, that we are going to do in

the silver layer. Okay, makes sense.

What's that? How we can just do that?

What are the things that we will be

doing it? Okay, let's see. So now let's

talk about our silver layer.

This silver layer will be special

because you will learn a lot in the

silver layer. Trust me,

I know you already know that PySpark is

the basis of data transformation, right?

So you will be learning a lot of things,

including upserts as well. If you know

about upserts, you should feel excited.

If you do not know about upserts, then

hold on, you will get to know. No

worries at all. So let's first of all go

to our workspace and this is our

project. Let's create our silver

notebook. Okay. So let's call it as silver

transformation.

Perfect. So first of all attach this

notebook with this particular cluster.

Perfect. So now

we want to transform our data. In short,

if you want to just define what the

silver layer is: the silver layer is basically the

transformation layer, where we apply

some transformations, we do cleaning, we

do a lot of other things as well that we

will be covering in this particular

section of the video. Okay, perfect. So

this particular notebook will be

handling all the data transformations

that we know and let me just plug in the mic

and let me check it is working. Yes, it

is working fine. So as we know that this

notebook will be just processing all the

data. Okay, perfect. And as I just told

you that this notebook will be handling

dynamic transformations as well. Make

sense? And obviously like there will be

like so many data objects within this

notebook. That's for sure. But we'll be

handling some dynamic transformations as

well. Perfect. So let's say we want to

transform our data which is let's say

how many I think we have six right? So

if I open bronze we have six tables.

Yeah. Let's do one by one. Let's say I

want to process first of all customers.

Okay. So let's say H3 or let's say H4

and then make it bold. Then let's say customers

make sense. So we are just now

processing customers. In order to

process the customers data, okay, we

will be first looking at the data. How

does it look like? Okay. So I will

simply say df equals spark.read, and this

time I will simply say .table, because

we have a Delta table for it, and simply

provide that Delta table name, which is

just the catalog name, schema name and

table name. So the catalog name is this,

pyspark dbt, dot bronze, dot customers.

Make sense?

simple sorted and let's display the data

as well or let's do it in the next cell

perfect so this will simply give us the

data not a big deal

display df

and let's try to display the data as

well so as we know that Each table is

unique in terms of transformation. Make

sense? But there are some

transformations which will be... oh, why is it

empty, by the way? No rows

returned. Wow. Why, why, why don't we have

any data? Okay,

we have all the things. Okay.

Maybe this one. Okay. Then let's see

Okay. This has data. Why customers

doesn't have any data? Why

customers? Okay. Makes sense.

Is something wrong with customers data?

Okay. No rows returned. Hm. Makes sense.

Let's go to our bronze ingestion. Let's

see what

went wrong. Okay. So if I open this

catalog, if I open this thing, then

bronze, then this is our volume. This is

our checkpoint. We have everything for

customers. Yes. So let's do one thing.

Let's drop this particular directory called

customers. And let's refresh it. And

what will happen after this? We will not

be having any record for customers.

Okay, makes sense. We have the table, but

we do not have any kind of, you can say,

checkpoint. Make sense? So if I will be

just running this particular

notebook for one more time it should

give me the data make sense yes source

data and then I just want to make sure

like the spelling is same because that

doesn't make any sense if the data is

not added there would be something maybe

so okay volumes. This is checkpoint. And

if I go to source,

Everything is fine. Everything is fine. Uh

let's try to process the data for one

more time. Okay. And this will be a

quick test as well for our incremental

data load. So that is also good. So if I

just show you the data for let's say drivers.

Okay. So let's do our testing for

incremental load. So if I say df dot

count, if I want to see like how many

records do I have? I have 50 records,

right? If I now run this particular

notebook for one more time, let's say

this, this, how many records should I

see in my drivers data frame? Should I

see 100 or just 50? If our logic is

right, if our implementation is right,

then we should only see 50 records

because we do not have any new file for

drivers. Right? If you see our source

doesn't have any new data. So that is

the power of idempotency, which is there

in our notebook. Make sense? Okay. Very

good. Now, for customers, we

should see new data because we deleted

the checkpoint for customers. So now

Spark doesn't know if we have new data

or not. So Spark will treat all the data

as new data. Make sense? If everything

is fine. Now let's try to see our

customers for one more time.

If it still shows zero, then obviously

there's something wrong. Oh, now we have

200. Oh, it's perfect. Maybe there would

be something wrong. Maybe a typo or

something. Display df.
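The two behaviours being tested here, a rerun that adds nothing and a deleted checkpoint that reloads everything, can be mimicked with a toy model. This is plain Python for illustration only; the file name is made up, and only the 50-row count for drivers comes from the walkthrough.

```python
def run_stream_once(source_files, checkpoint, table):
    """Toy model: append rows from files the checkpoint has not seen yet.

    `source_files` maps file name -> row count, `checkpoint` is a set of
    processed file names, `table` is a list of (file name, row count).
    Returns the total row count in the table after the run.
    """
    for file_name, row_count in source_files.items():
        if file_name not in checkpoint:
            table.append((file_name, row_count))
            checkpoint.add(file_name)
    return sum(rows for _, rows in table)


drivers_source = {"drivers.csv": 50}
drivers_checkpoint, drivers_table = set(), []

first = run_stream_once(drivers_source, drivers_checkpoint, drivers_table)
second = run_stream_once(drivers_source, drivers_checkpoint, drivers_table)  # no new files

# Dropping the checkpoint makes every file look new again.
drivers_checkpoint.clear()
third = run_stream_once(drivers_source, drivers_checkpoint, drivers_table)
```

Here `first` and `second` are both 50, while `third` is 100 because the same file is appended twice. The customers table was empty before its checkpoint was dropped, so reloading simply filled it; on a table that already holds data, dropping the checkpoint would duplicate rows.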

Okay, now I have data. Okay, finally. See

how you can test your ingestion as

well. So now you know that everything is

fine in our ingestion logic, because we

are not seeing duplicates. Okay, we are

just seeing the relevant data that we

should see. Make sense? Just 50 records.

Okay. And I can even show you:

in drivers, if you are seeing 100, that

means you have done something wrong. If

you see 50, that means everything is

fine. Uh, drivers or

driver?

What is the table name? Um,

bronze

drivers. Okay.

Now let's see. 50. Perfect. Our logic is

fine. If you see 100, you have done

something wrong. Okay. So this is our

data frame. So now, I was saying that

there are some transformations which

will be very, very customized and very

specific to each entity. But there will

be some transformations which will be

applicable to all the data frames. Just

let me know, what are those

transformations which are applicable to

all the data frames, no matter

what data frame you are processing? So

the answer is deduplication and applying

upserts. These kinds of transformations

are not specific to only one

data frame; they apply to all the data frames. Make

sense? So we'll be handling how we

can apply the generic

transformations without rewriting

the code, because I don't like

performing static things. I love dynamic

things. Make sense? So we'll be

creating Python classes for that, so

no need to worry. So let's say I want

to start with customers,

and you'll be learning a lot of PySpark

functions as well. So no need to

worry. Let's say display df.

Okay, perfect. We have 200 records.

That's fine. So now, within this

customers table, we want to make it more

enriched. Okay. And we want to basically

transform it. Make sense? Okay. So first

of all, the best thing about this is we

already have

the date-time column in the desired

format, which is a great thing.

That's fine, but every

time we add another column to our data

frame, which is another, you can

say, generic transformation that should

be applicable on all the data frames.

What is that particular timestamp

column? So basically that is called the

processing timestamp.

That means it will highlight the

timestamp when that record was updated,

because in the real world we need to see

when this record was updated, when

this record was upserted. So with that

timestamp we can, you can

say, filter out the records which are old

or which are new. Make sense? So that is

the power of that particular column that

we use. Make sense? Okay. So that will

also be added in our generic

transformations. Make sense?
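As a rough sketch of the processing-timestamp idea: the first two functions are a plain-Python mirror that can run anywhere, and the last one is the PySpark form. The column name `process_timestamp` and all function names are assumptions made for this example, not names taken from the notebook.

```python
from datetime import datetime, timezone


def stamp_records(records, now=None):
    """Return copies of `records` with a processing timestamp added."""
    now = now or datetime.now(timezone.utc)
    return [{**r, "process_timestamp": now} for r in records]


def newer_than(records, cutoff):
    """Keep only the records processed after `cutoff`."""
    return [r for r in records if r["process_timestamp"] > cutoff]


def add_process_timestamp(df):
    """PySpark version: stamp every row of a data frame.

    Requires pyspark; the import sits inside the function so the
    plain-Python helpers above can run without it.
    """
    from pyspark.sql.functions import current_timestamp

    return df.withColumn("process_timestamp", current_timestamp())
```

Because every table gets the same column in the same way, this is a natural candidate for the shared, generic transformations rather than per-entity code.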

Before the

generic transformations, we want to clean

up a lot of things here. So here you can

see that we have

here

this particular

email column. This one,

this email column. So let's say we want

to first of all clean this. Obviously,

it is by default cleaned up, but we

want to understand what the domains

of our customers are, so that we can run

our ads, anything. It totally depends upon

the requirement, but we need to fetch the

domains of the email IDs of our

customers. Make sense? How can we do

that? For that I will be using something

called... and let me just call it df

cust, just to make it more readable.

Okay, perfect. df_cust. So I will be

saying df_cust

equals

df_cust.withColumn, because this

is a new column. Whenever we want to

create a new column, we use withColumn.

Okay, I will say domain. Okay,

and I will apply a transformation called

split.

The split function. What will this function

do? This will split our column values

into a list based on a delimiter. So I

will say split on this column, which is

called email. And what is the delimiter?

The delimiter is the at sign (@). Make sense?

Perfect. Do you know what will happen?

It will create a list of values. And

each list will have two values,

because there's only one split. So 0 and

1. 0 and 1. 0 and 1. That means in

each list there'll be just two elements.

Now we'll be applying indexing,

because we do not want the first element

of it. Just see, this is a list.

Okay, just imagine this is a list, and

this is the first element. This is the

second element. So we do not want the

first one for now. Obviously we'll

be needing this in the future, but

for now we just need the domain. Make

sense? This is our domain. So I will

simply be getting index one.

Makes sense, because index zero is this

one, index 1 is this. So we'll simply

apply the index of one.

Make sense? So this is our

transformation, and you can even see it

if you want to. Let's say display df_cust,

and

we can also import all the things,

like all the libraries and all: from

pyspark.sql.functions

import *,

and from pyspark.sql.types

import *. Make sense? Good. So now

if I just show you this particular thing

you will be able to see domain column which is this one. See domain domain

which is this one. See domain domain domain domain these are all the domains

domain domain these are all the domains that I have that we have literally

that I have that we have literally extracted using pispar

Make sense? So what is the next cleanup that we want to do? This is our phone number, and you can see that there are so many things which are not relevant here. For example, there are hyphens that we can remove. There are dots that we can remove. There are brackets that we can remove. There are so many things, right? The plus one as well. We can remove the x as well. So we want to clean this phone number so that we can actually use this column and store the information in the right format. Make sense? So how can we do that? First of all, you have a lot of options. One thing you can do is use a function called regexp_replace, where you can replace the values. Let's say you want to replace a hyphen with nothing, and a dot with nothing. So you have a lot of options.

But there's a more efficient way that I want to show you. I will simply write df_cust = df_cust.withColumn. This time we do not want to create a new column; we want to modify the existing column. So it is very simple: I say phone number, which is the same name as the existing column, because I don't want to create a new column. Now what will be the transformation, or basically which function are we going to use? It is called regexp_replace. So what do we want to replace, and with what? We want to replace everything that is not a number. That's it. So I'll write a character list, 0 to 9. These are my numbers, and I do not want to keep anything that is not in this list. And where is that little cap, that reverse V (the caret, ^)? I'm just finding the key. Oh man, where is that key? Oh yeah, six. So these are the numbers 0 to 9, and we do not want to keep anything other than this. So I'll simply say: replace all the unnecessary stuff with nothing. Do not even add a space. Nothing. Okay, now let's display df_cust.

What is it saying? Missing one required positional argument: replacement. Oh, we forgot to pass the column name, I guess. So I can simply say phone number, because we only said to create this column called phone number, but on which column do we want to apply regexp_replace? Obviously on phone number. Now it's fine.

See, this is my new phone number column, which doesn't have any special characters, braces, dots, or letters. Nothing. But yes, some numbers will be longer, because in some countries we have different extension codes that we have to follow and cannot cut down. You have to keep everything that is a digit, because you do not know the rules of all the countries. Some have three to four digits of extension, some have zero to one. In our country we have plus one. So it depends. Make sense?
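To see what the "keep only digits" pattern does, here is a small plain-Python check using re.sub. The sample phone strings are made up, and the helper name clean_phone is hypothetical; in PySpark the same pattern and empty replacement would go to regexp_replace along with the column.

```python
import re

# Pattern from the walkthrough: anything that is NOT a digit 0-9.
NON_DIGITS = r"[^0-9]"


def clean_phone(raw: str) -> str:
    """Drop every non-digit character, adding nothing in its place."""
    return re.sub(NON_DIGITS, "", raw)


# Made-up sample values, only to show the effect of the pattern.
samples = ["+1 (555) 010-2345", "555.010.9999x12", "001-555-010-7777"]
cleaned = [clean_phone(s) for s in samples]
print(cleaned)

# PySpark equivalent (column name "phone_number" is an assumption):
# df_cust = df_cust.withColumn(
#     "phone_number", regexp_replace("phone_number", r"[^0-9]", "")
# )
```

Note that digits of an extension (the part after "x") or a country prefix stay in the result, which is why some cleaned numbers come out longer than others.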

Now I also want to show you one more thing that is specific to this particular data frame. Let's say that instead of having first name and last name as two separate columns, I want a single one, because I do not like having unnecessary columns. So how can you apply concatenation in PySpark? It is very simple. You will say df_cust = df_cust.withColumn, and this will be a new column called, let's say, full name. Here you can use the concat function, but I personally like using the concat_ws function. With it we define a separator, basically the character that will be added automatically between the two or more things being concatenated. So I'll say just add a space, that's it. And which columns do I want to concatenate? First name and last name. Perfect. Now let's display df_cust.

I can also do one more thing: I can drop first name and last name if I do not want to keep them. Why do I need them now? I already have the full name, right? So this is my full name, and I have dropped both columns. That's how you can drop the unnecessary columns as well if you do not want to keep them. Make sense?

Okay, very good. So we have this data frame so far. Now we can talk about some generic transformations that I mentioned in the beginning. For that we'll create a Python class in which we'll create multiple functions that are generic. One is deduplication, for sure, which we're going to do right now. The second will be upsert. And the third one will be... what should the third generic function be? What was the function name that we talked about in the beginning? Okay, we will see. If we want to add more things, we can just add them to that generic class.

Okay, so now let's create the class. But that class is special, right? I want to keep that class as a utility. So I can simply create a Python file here, or I can even create a folder if I want to. So I will say folder, and the folder name will be, let's say, utils. Simple. So this is my utils folder, and within that folder I will create my Python file and name it, let's say, custom_utils.py. Okay, so this is our Python file.

But in order to access this file, you need to do one thing at the beginning, and that is adding the source path to the system path. What do I mean? You need to write something like this: import os, import sys. First of all, import these two libraries. Then you need to add this path (which path? this one) to the system path. Make sense? And how can you do that? Let me show you. First of all you will say current_dir = os.getcwd(), which gets the current working directory. This is one way of doing it. In a plain Python file you could also say, I guess, os.path.abspath and then pass __file__, but this is a notebook so it won't work here. So we will simply use the get-working-directory call. What will it do? It will give you the current working directory path. See, it goes up to the PySpark dbt project folder. In some cases this is not mandatory, but sometimes you can see errors while importing the module, so this is the workaround for that: you need to add this whole path to your system path. Make sense? So I'll say os dot... not os, basically sys.path.append, and then pass the current directory. So you append this path, and now this path is added in your notebook. Make sense?
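The path setup described above can be sketched in a few lines. The variable name current_dir is an assumption, and the membership check is an addition so that re-running the cell does not append the same entry twice.

```python
import os
import sys

# In a notebook __file__ is usually undefined, so use the working directory.
current_dir = os.getcwd()

# Skip the append when the entry is already present (e.g. the cell was re-run).
if current_dir not in sys.path:
    sys.path.append(current_dir)

print(current_dir in sys.path)
```

Python searches the entries of sys.path in order when resolving an import, so a folder such as utils inside current_dir becomes importable once the directory is listed there.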

So now you can literally write anything here. Let's say I write variable_one = "blah blah blah blah". And if I want to use this variable, I can do it like this: from utils.custom_utils import variable_one. It's an import error: no module named... Very good. This is the error that I was talking about, and I'm glad it came up. It is saying that it cannot find the utility module by that name, even though we have already run the append call. So simply reset your cluster, and then add the path again. Once it is reset, it will take some time to start, maybe a few seconds. Done. Then you can add the path like this. Okay, perfect. Then run this. Now if you try to read this file, it should work. See, now it worked.

So you can simply say variable_one, and this is the value. Make sense? So the one thing is that you have to add this utility's path to your system path, and then you can use it. If you are seeing this error and you are not able to fix it, no need to worry. This was just, you can say, a best practice. If you just want to go with the flow and you do not want to keep seeing the errors, simply create the class that I'll be creating in this Python file right here in your notebook instead. You can create the class at the top of your notebook, because once you have the class in your notebook, you do not need to go anywhere else to import it. I just wanted to show you the best practice as well. But yes, I can feel your pain. If you are not able to get rid of the error, fine: simply copy the class and paste it in your notebook, and that's it. Your aim is to learn how to create the classes, not to figure out how to import them, right? That's just a best practice. Okay, makes sense.

Now let's create a class. The class name will be transformations. Within a class, if you have some fundamental understanding, we first of all create some class variables, which are called attributes. But I don't want to create any attributes because each function will be unique. So here I can simply say def, and I want to create a function called dedup. Make sense? Within a class, one thing that we always define is the self parameter. This dedup function will obviously accept a data frame, plus it will also take the list of columns on which we need to apply the deduplication. We can apply deduplication on one column, or on two columns, three columns, four columns, and so on. Right? So we will call it dedup_cols. Perfect. This can be a list, right? We can import List: from typing import List. So this will be a list, and df will be a data frame, and I can say from pyspark.sql import DataFrame. Okay, so df will be a DataFrame.

Okay, makes sense. And self is just the self variable, so it's fine. Whenever you want to perform deduplication, what do we do? We use something called a window function. And dedup is a very generic transformation that you will do in almost all data frames. So for that I will say df = df.withColumn, and we will create a column called, let's say, dedup key. Yes, let's create dedup key. What will this column do? It will first of all create a hash-style key column based on all the columns that you have provided in dedup_cols. Make sense? So this will combine all the columns that you have provided in the list. Very good. So how do you get the list? Very easy: it is the dedup_cols argument. And you know what? You can even receive this list in the form of a string. Really? Yes, and you can convert it into a list using the eval method, so it is very easy. But let's keep it a list. Or let's keep it a string... no, it's fine, it's fine. You should learn new things, okay, but for now this will be a list.

And how do we define the key? You already know we'll be using concat. Make sense? And what will the column names be? The column names will be the list of dedup columns. Now what is this asterisk (*) that I have used here? Whenever we work with PySpark transformations and there is some Python capability that we want to use, we can. This is a list, and in concat we normally pass the column names as separate strings. But this list can be unpacked using the asterisk. Make sense? We can unpack any list using the asterisk. So now what will it do? It will create a dedup key, and it will be the combination of all the columns. Perfect.
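The asterisk here is ordinary Python argument unpacking, not something specific to PySpark. A plain-Python sketch, where join_values is a hypothetical stand-in for any function that takes a variable number of positional arguments (as concat does) and the column names are made up:

```python
def join_values(*values: str) -> str:
    """Accept any number of positional arguments and glue them together."""
    return "".join(values)


dedup_cols = ["customer_id", "email"]  # hypothetical column names

# With the asterisk, each list element becomes its own positional argument,
# so this call is the same as join_values("customer_id", "email").
joined = join_values(*dedup_cols)
print(joined)
```

Without the asterisk the function would receive a single argument, the list itself, instead of one argument per column name.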

That is fine. Once that is done, we will say df = df.withColumn again, and I want to create the dedup counts, like how many duplicates we have. It will use a window function, so let's import that as well: from pyspark.sql.window import Window. Okay, makes sense. Now I want to say row_number().over, and what do we need to use in the window? Obviously the dedup key, right? So Window.partitionBy on the dedup key. These are the fundamentals of window functions, and you should know about window functions. Basically, row_number is a function that we use to eliminate duplicates. We are applying a grouping on the key columns, so that whenever there is more than one record for a particular group, the records get numbered: the first is one, the next is two, then three, and so on. Simple.

Do we need to apply any kind of order by? Yes, we can. And what should the column be? Obviously, if we keep getting more and more records and we have some duplicates, we should decide on a column that picks which record survives the deduplication, right? In some scenarios it can be a date column; in some scenarios it can be anything. So I will simply ask the user for that column. Here it will be this one, the last updated timestamp. I will take this as a parameter as well and call it cdc, and it will be a string type. So here I can say .partitionBy and then .orderBy, and in the order by I can simply pass cdc.

Once that is done, I just need to filter out all the records which are duplicates, so I will say col of dedup counts == 1, that's all. If we have any duplicate numbered 2, 3, 4, simply remove it. And at the end I can say df = df.drop, and let's drop the dedup key, because we do not want this column, and dedup counts too. Perfect. And at the end, return df. So this is the code for applying deduplication. Obviously this is a very generic transformation that you want to apply on all the data frames, and we do not want to write this code statically again and again. We will write this code one time, in a class, so that we can reuse it every time.

Make sense? So let's try to test it. First of all, I will import all the things in my notebook as well. Perfect. Let's run this. Cannot import name data frame. Okay, maybe it needs a capital F. Okay, perfect, it is a capital F: DataFrame. Let me change it here as well. This is just the type hinting.

Okay, so now let's test it. I will say from util.utils import transformations. Let's run this. No module named util. Oh man. Let's do it one more time. Let's add that path here, or let's actually move the file out. Let's put it directly inside this project folder. Okay, makes sense. And let's delete the old folder, because I think then we do not need to define the path again and again. Then you can say from utils import transformations, I think. Is it transformations or transformation? Transformations. Okay, it is the same. Cannot import name transformations from utils (unknown location). What? Oh, the file is custom_utils. Okay: from custom_utils import transformations. No module named custom_utils.

No module name custom utils. Okay. We have to just maybe let's copy and paste

have to just maybe let's copy and paste that particular utility in this one

that particular utility in this one because I don't want to run that

because I don't want to run that particular thing again and again. uh

particular thing again and again. uh because I'll be just turning it off then

because I'll be just turning it off then importing it then doing it then blah

importing it then doing it then blah blah blah blah I don't like it so let me

blah blah blah I don't like it so let me just grab this particular class here in

just grab this particular class here in my notebook

my notebook or instead of creating the Python file I

or instead of creating the Python file I can also create something called as

can also create something called as notebook but it is fine let's get rid of

notebook but it is fine let's get rid of this so I'm giving you the company so

this so I'm giving you the company so let's stick with only one approach okay

let's stick with only one approach okay perfect so let's actually run this class
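For readers who would rather keep the class in a Python file, here is a minimal sketch of the file-based import that was attempted above. It is not the setup used in the video: the package name demo_utils, the temporary project folder, and the ping method are all hypothetical, chosen only so the example runs anywhere. The two points it illustrates are that the project root has to be on sys.path, and that the class is imported from the module file, not from the folder.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: <project_root>/demo_utils/custom_utils.py
project_root = Path(tempfile.mkdtemp())
package_dir = project_root / "demo_utils"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")
(package_dir / "custom_utils.py").write_text(
    "class transformations:\n"
    "    def ping(self):\n"
    "        return 'ok'\n"
)

# Without the project root on sys.path, the import below raises
# ModuleNotFoundError ("No module named ...").
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
importlib.invalidate_caches()

# Import the class from the module file (custom_utils), not from the folder.
from demo_utils.custom_utils import transformations

cust_obj = transformations()
```

Pasting the class into the notebook, as done in the video, avoids the path setup entirely at the cost of keeping two copies of the code.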

It's fine. Now we want to test this particular class, right? So I will simply create an object of it. I will say cust object equals transformations(), because this is the class I want to use. Makes sense. Now for this particular object I want to use cust object dot dedup. So here I need to pass all the parameters. The first parameter is the DataFrame. Then I have the list of key columns, and in this case it is customer ID, I guess. Make sense? Then I have the last updated timestamp. Make sense? So now let's call the result cust DF transformed. And now let's see what will happen if I display it.

Let's see. df_cust is not defined. Okay, so we need to simply run all the cells, because we didn't run them after restarting. So let's say run everything, and let's wait.

So this is running. Okay. This is done. Perfect. So our class is also running fine. As you can see, our DataFrame is deduplicated and everything is running fine. Obviously we didn't have any duplicates, but if we do have them, they will be removed automatically. Make sense? So this is our DataFrame that is ready to go. Now we can try to add the upsert logic.

Before that you need to understand the upsert command, so let me tell you what it is. Upserts are very simple but tricky, and you need to understand why and what you need to do. Let's say you have your source and you have your destination, the target table. Let's say the source has data like this: 1, 2, 3. Make sense? This will be loaded into the target as 1, 2, 3. Next day the source has some more records: let's say 4, 5, and also 1 and 2, because there is a change in 2 and in 1 as well. So the incoming records are 4, 5, 1, 2. Make sense?

So what will happen in the target table? It will get 4 and 5. Now in the target table, which is our silver layer, we cannot store duplicates, as you know. But we also want the updated data for 1 and 2. That's why we have a concept called upsert. Upsert simply says: if you have any new data, insert it. If you have any old data but with new values, take the new values and update the existing record. So 1 and 2 will be updated with these two new values. Simple. This way we will not have any duplicates, plus we will have updated data all the time. Make sense?
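The 1, 2, 3 then 4, 5, 1, 2 walkthrough can be modelled in plain Python, with a dictionary standing in for the target table. This is only a sketch of the upsert semantics, not Spark code, and the letter values attached to each key are made-up placeholders.

```python
def upsert(target, source):
    """Return a new table: update keys that already exist, insert the rest."""
    merged = dict(target)
    for key, value in source.items():
        merged[key] = value  # same line handles both update and insert
    return merged


# Day 1: keys 1, 2, 3 are loaded into the target.
target = {1: "a", 2: "b", 3: "c"}

# Day 2: keys 4 and 5 are new; keys 1 and 2 arrive with changed values.
source_day2 = {1: "a2", 2: "b2", 4: "d", 5: "e"}

result = upsert(target, source_day2)
```

After the call, result holds exactly five keys: 1 and 2 carry the new values, 3 is untouched, and 4 and 5 have been added, so there are no duplicates.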

Very good. So now one thing which is very, very important. Let's say you are loading all the data. In this case you are loading all the data into the silver layer, right? You are not loading only the incremental data, because you are reading all the data which is available in the bronze layer. Make sense? So now you will have 1 and 2 from the first load and 1 and 2 from the second load as well, so they will be duplicates within the source itself. That is why, if you remember our code for the class, we provided an order by based on the CDC column.

So basically we have rewritten this particular class. Just for your reference, I copied and pasted the class here in case you see the errors, and I also pasted the code here, if you want to refer to this notebook. That way this notebook has both ways: you can import the file, or you can directly use the class. That's it. And just to make a distinction I have added an extra s here, so that you know this one is coming from the Python file, and here as well, s. Make sense? Okay, good. So this is done. We were just saying that this is why we have added this particular order by on the CDC column. Make sense?

So what it will do: it will deduplicate the data based only on the latest date. And if you are smart, you will have noticed we made a silly mistake. We have simply provided order by. See, let me first of all move this onto the next line so that it is more readable. Perfect. So this is order by. But if you remember, we need to fetch the latest data, not the previous data. So here we should be sorting in descending order, not in ascending order, because by default order by works in ascending order. Right? So we can simply say desc, like this. Make sense? It is complaining about order by, so maybe we can remove desc from here and add it here instead, as dot desc. Simple. Now just to make sure, let's run the code one more time and see if everything is fine.

WindowSpec object has no attribute desc. Okay. So it is saying it is not running fine with dot desc on dot orderBy. Makes sense. I think we can remove it from there, because this is inside the window function, and inside the window function we need to use something like this. Let's try this one: write desc around the column, like this, and everything should be fine. And let me update the code here as well: order by desc of CDC. Perfect. And now let's run the whole code one more time and see if it works, because ordering by the latest date is one of the most important steps. Without it you may not see a problem right away, but it makes the solution robust, so that even if you perform a backdated refresh it will not cause any problem.
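As a sketch of the fix just described, the function below wraps the CDC column in desc() inside orderBy, which avoids the "WindowSpec object has no attribute desc" error that comes from calling .desc() on the window itself. The class in the video is not shown in full, so the function name, the row_num helper column, and the sample rows are assumptions. A plain-Python model of the same rule (keep the row with the largest CDC value per key) is included so the logic can be checked without a Spark session.

```python
def latest_per_key(rows, key_cols, cdc):
    """Plain-Python model: keep the row with the largest cdc value per key."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in best or row[cdc] > best[key][cdc]:
            best[key] = row
    return list(best.values())


def dedup(df, key_cols, cdc):
    """PySpark version of the same rule (imports are local so the model
    above can run on a machine without pyspark installed)."""
    from pyspark.sql import Window
    from pyspark.sql.functions import col, desc, row_number

    # desc() wraps the column inside orderBy; WindowSpec has no .desc() method.
    window = Window.partitionBy(*key_cols).orderBy(desc(cdc))
    return (
        df.withColumn("row_num", row_number().over(window))
        .filter(col("row_num") == 1)  # row 1 is the latest record per key
        .drop("row_num")
    )


# Made-up sample rows: customer 1 appears twice, the later row should win.
rows = [
    {"customer_id": 1, "name": "pencil", "last_updated": "2025-01-01"},
    {"customer_id": 1, "name": "pen", "last_updated": "2025-01-02"},
    {"customer_id": 2, "name": "book", "last_updated": "2025-01-01"},
]
latest = latest_per_key(rows, ["customer_id"], "last_updated")
```

With the default ascending order, row 1 in each partition would be the oldest record, which is exactly the silly mistake mentioned above.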

Okay. So let's say we want to apply the upsert. Let's do it. This is our deduplicated data. Perfect. So I will simply write upsert here. Have we defined any kind of heading? No, it's fine. Even before the upsert itself, we want to say: if we do not have any data sitting in our table, then first of all we need to create the table, right? So I will say if spark.catalog.tableExists, and what is the name of the table? It is simply pyspark_dbt.silver.customers, because we want to write to the silver customers table. So I will say: if that table is not there, then first of all create the table. So I will say df.write.format, and the format will be delta, then dot mode can be, let's say, append. It doesn't matter, because this is just a one-time load. Then, not option, I will simply say dot saveAsTable pyspark_dbt.silver.customers.

We have not created the silver schema yet, so let's quickly create that. I can duplicate the tab, go to Catalog, and create our silver schema: create schema, and this will be called silver. Okay, perfect. Create. That's it, so our schema is also ready. So this is the table that we want to create. Mode is fine, saveAsTable is fine, and that's it. So this will simply write the data. Make sense?
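The first-load branch can be sketched as a small helper. The function name and the choice to pass spark in as an argument are assumptions made for illustration (in a Databricks notebook spark is already available as a global), and the table name in the usage comment is a guess at the name spoken in the video. Catalog.tableExists is available in PySpark 3.3 and later.

```python
def initial_load_if_missing(spark, df, table_name):
    """Create the Delta table from df if it does not exist yet.

    Returns True when the table was created by this call, and False when it
    already existed (in which case the caller should run the upsert instead).
    """
    if spark.catalog.tableExists(table_name):
        return False
    # One-time load, so the write mode makes little difference here.
    df.write.format("delta").mode("append").saveAsTable(table_name)
    return True


# Usage inside a notebook (assumed names):
# created = initial_load_if_missing(spark, df_cust, "pyspark_dbt.silver.customers")
```

Returning a flag keeps the if/else in the notebook short: when the helper returns False, the else branch applies the merge.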

data. Make sense? But if if if our table is there, our table is there

our table is there, our table is there how we need to tackle the upsert

how we need to tackle the upsert basically the merge command. This is the

basically the merge command. This is the main thing. So basically

main thing. So basically if you remember I told you that we also

if you remember I told you that we also want to create a function which is very

want to create a function which is very generic and that will define basically

generic and that will define basically the process date. Did we add any kind of

the process date. Did we add any kind of process date? No. So we should add that

process date? No. So we should add that particular process date and instead of

particular process date and instead of creating

creating static function I will simply create a

static function I will simply create a dynamic function here. So this will be

dynamic function here. So this will be called as DF and then let's say

called as DF and then let's say process time stamp.

process time stamp. Okay, this will take self and this will

Okay, this will take self and this will take data frame. Simple. And this will

take data frame. Simple. And this will return basically

return basically df equals df dotwidth column.

And this will create let's say process time stamp.

time stamp. Make sense?

Make sense? And this will be nothing but just the

And this will be nothing but just the current time stamp. That's it. Current

current time stamp. That's it. Current time stamp. And that's it. And return

time stamp. And that's it. And return df.

df. Perfect. Make sense? Let's run this.

Okay. So now I can use this transformation as well, right after the step where we deduplicated the data. Once we have deduplicated the data, let's add that column, and it's very simple: cust object dot process_timestamp. Let's see.

It is saying that the object has no attribute process_timestamp. Okay. Where is our cust object? Okay, the cust object is this one, transformations. Makes sense. Transformations, process_timestamp, and everything is fine. So let me run this one more time, and let me remove these outputs, because they make the view very messy once the code is built. Okay, perfect. Now let's run it one more time and see what the issue is.

Okay, it worked. It's fine. And obviously this cell was not completed, so that's why it is reporting incomplete input. Yes, it's fine. So now here we should see the column process_timestamp, which is the current timestamp. Make sense? So this is our timestamp column. That is fine.

But now what we need to do: we will simply say else, and we will also import a special library: from delta.tables import DeltaTable. So this is the library that we want to import. And do you know what? I want to make this dynamic as well. Really? Yes. Let's do it. It will be challenging, but that's how you learn, right? Let's do it here in our class.

Let's remove the spaces. Okay. So let's create another function, which is called upsert. For the upsert function we need self, df, and key columns. Key columns, obviously. So how will we do it? First of all, from delta.tables import DeltaTable, because this will be our object. So we say dt object equals DeltaTable.forPath. We can also say forName; it's up to you. Let's use forName because it will be easier. And now we can say pyspark_dbt.silver.customers, right? And if you want to make it dynamic, let's take the table name as a parameter as well. Let's add an f-string and put the table variable in it. Simple.

So this is our Delta object that we have created on top of the destination table. You need to imagine this: once your data is loaded in the initial run, in the first if branch, what you need to do now is apply the upsert. And when you apply the upsert you need to merge with something. You already have a source, which is df_cust, and you need the target so that you can merge both of them, similar to a join. Make sense? So I will say dt object dot alias, and I will give it the alias trg, which is target. Then dot merge, and I want to merge it with df, which is the source input, so I will say df.alias, and it will be called src. Make sense? So this is the merge command.

What will be the condition? This is very important, and this is something that you need to make dynamic. So how can you do that? Here you need to create the merge condition dynamically. You have key columns, right? Here you will use your Python skills, specifically a list comprehension. So I will say merge condition equals, and first of all I will create the list. The list is based on the key columns, right? Let's say I have key columns such as customer ID, and so on. So I will say for i in key columns. And what will be the output for each i? It will be an f-string which says src dot i, meaning the column, equals trg dot i, meaning the same column. That's the thing I want to produce for each key.

So this will be a list, and I want to turn this list into a single string using the join method. And what will be the delimiter? It will be nothing but AND. Simple. That covers the case where we have more than one key column. You need to process the list comprehension in your mind, and I know this is a little bit challenging, but it is good for your growth. You need to see how we build dynamic solutions, because once these solutions are built you simply use them instead of writing the code again and again. Make sense? So here I can simply pass merge condition. That's it. And it is purely dynamic. I do not need to hardcode anything. I will simply define the key columns list, and that's it. Boom. I do not need to worry about any manual code. See how powerful it is? So this is your upsert command. Okay, this is the merge condition.
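The dynamic merge condition comes down to one comprehension plus a join. The sketch below is plain Python and can be run as is. The aliases src and trg follow the spoken description, while the function name and the second key column, region, are hypothetical and only there to show the composite-key case.

```python
def build_merge_condition(key_cols, source_alias="src", target_alias="trg"):
    """Build a SQL join condition such as 'src.id = trg.id AND src.x = trg.x'."""
    return " AND ".join(
        f"{source_alias}.{c} = {target_alias}.{c}" for c in key_cols
    )


# Single key column, as in the customers example.
single = build_merge_condition(["customer_id"])

# Composite key: the extra column name is made up for illustration.
composite = build_merge_condition(["customer_id", "region"])
```

Here single evaluates to "src.customer_id = trg.customer_id", and composite joins the two comparisons with " AND ". Note that the delimiter needs a space on each side, otherwise the column name and the keyword run together.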

So once the merge condition is applied, we will simply say dot whenMatchedUpdateAll, then whenNotMatchedInsertAll, and then obviously at the end dot execute. And that's it. This is your upsert logic. Simple. There is no need to return anything, but since we are using a function we can return something. Let's say return done, or let's return 1. Anything. Make sense?

Now, in the update clause you can also use a condition. It's up to you, and it will make your code more robust. So what should the condition be? Let's say you are doing a backdated refresh, and you do not want to update the data based on your old data. Does that make sense? Let's say you are applying the upsert logic, and in your target table the value for ID equals 1 is, let's say, pen, and this was recently updated. But for some reason you applied a backdated refresh on your source table, and now you are getting the previous value for ID equals 1, which is pencil. Would you like to update that value with the previous value? Obviously no.

So you should add a condition here. Condition equals, and here we will say src dot last updated timestamp. It can be literally any column, but in our case it is the last updated timestamp. It should be greater than, or actually greater than or equal to, trg dot last updated timestamp. Now you will say: hey, will it be the same column in all cases? Obviously no. That's why I will create a variable, CDC, and use it here as well: dot CDC. Perfect. And I will get the CDC from my user. This is my upsert function. Let's run this. Perfect.
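Putting the pieces together, the upsert described above might look like the sketch below. It is written as a standalone function that receives spark explicitly, which is an assumption: in the video it is a method on the transformations class and relies on the notebook's global spark. The helper build_upsert_conditions and the column name in the example call are also made up for illustration. The Delta Lake calls used (DeltaTable.forName, merge, whenMatchedUpdateAll with a condition, whenNotMatchedInsertAll, execute) are part of the delta-spark Python API.

```python
def build_upsert_conditions(key_cols, cdc):
    """Return (merge_condition, update_condition) as SQL strings."""
    merge_condition = " AND ".join(f"src.{c} = trg.{c}" for c in key_cols)
    # Only overwrite a matched row when the incoming record is at least as
    # recent, so a backdated refresh cannot replace newer data.
    update_condition = f"src.{cdc} >= trg.{cdc}"
    return merge_condition, update_condition


def upsert(spark, df, key_cols, table_name, cdc):
    """Merge df into an existing Delta table identified by table_name."""
    from delta.tables import DeltaTable  # needs the delta-spark package

    merge_condition, update_condition = build_upsert_conditions(key_cols, cdc)
    target = DeltaTable.forName(spark, table_name)
    (
        target.alias("trg")
        .merge(df.alias("src"), merge_condition)
        .whenMatchedUpdateAll(condition=update_condition)
        .whenNotMatchedInsertAll()
        .execute()
    )
    return 1


# The condition strings can be inspected without a cluster:
conditions = build_upsert_conditions(["customer_id"], "last_updated_timestamp")
```

In the pen and pencil example, the older pencil row would match on the key but fail the update condition, so the target keeps pen.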

So now let's actually use this function. See how clean our code is looking right now. We already have the cust object, as you know. I will simply say cust object dot upsert, and I will pass df_cust, then customer ID, then the customers table name, and the last updated column. Make sense? And here it is df_cust as well. So easy, so simple, such neat and clean code, because all the processing is in our class, which you can obviously import. I can also copy and paste the updated code here, because we made a lot of changes. Okay, perfect. And let's rename it to transformations with a double s, just to make a distinction. Perfect. So that is our code, and that's how you keep the main notebook neat and clean.

Okay. So now let's actually run this particular thing: if and else. Now it should work fine. Transformations object has no attribute upsert. Oh man. We simply need to rerun everything, I think, because it is cust object dot upsert and we have just defined this upsert. Yes.

we have just defined this upsert. Yes. And the thing is we also need to update

And the thing is we also need to update our this thing cast object because this

our this thing cast object because this was created like long way back right. So

was created like long way back right. So let's reprocess everything.

And now it is complaining about tableOrViewName in DeltaTable.forName. Oh, makes sense: we forgot to pass spark here, which we always do. OK, perfect. Let's see.

This time: "dbName has been deprecated since 3.4 and might be removed in a future version. Use tableExists(`dbName.tableName`)". Oh really, this is deprecated? OK, this is basically a kind of warning, a FutureWarning, so you do not need to worry about it. It is saying to use tableExists with dbName.tableName. But we are using a catalog as well. That's why I like using DeltaTable.forPath, but it is what it is: spark.catalog.tableExists.

So is it an error or a warning? Let me just check whether it has created our silver layer, because I think Spark 4.0 brings so many changes. OK, so this is an error, not a warning. Hmm. We can use dbName.tableName, but let me just click on "Diagnose error". Oh, it's fine, let me just check. Wait, why did I pass spark here? We didn't need to pass spark here; we pass it when we create the Delta object. So it is saying "DeltaTable.forName() missing 1 required positional argument: 'tableOrViewName'". We need to correct it in the class, not here: in the Delta table object. Yes, here we need to pass spark. I was like, wait, did I make a mistake? So here we need to add spark. Makes sense. Let's rerun it. It failed because we never created any kind of catalog object using the spark variable.
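
The two fixes above can be sketched as follows, assuming a Unity Catalog style three-part name. The helper names are hypothetical; the point is only that spark.catalog.tableExists takes one qualified name, while DeltaTable.forName takes the SparkSession first and the table name second.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-part table name such as pyspark_dbt.silver.customers."""
    return f"{catalog}.{schema}.{table}"


def table_exists(spark, fq_name: str) -> bool:
    """Check for the table with a single qualified name.

    Passing a second positional argument here would be read as the
    deprecated dbName parameter, which is what triggers the warning.
    """
    return spark.catalog.tableExists(fq_name)


def delta_table(spark, fq_name: str):
    """Return a DeltaTable handle for an existing table."""
    # Imported lazily so the helpers above work without Delta installed.
    from delta.tables import DeltaTable

    # forName needs the SparkSession as its first argument; leaving it out
    # produces the "missing 1 required positional argument" error.
    return DeltaTable.forName(spark, fq_name)
```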

So let's see if it is working fine. Now it is saying "silver.customers is not a Delta table". OK, makes sense. I was wondering why it was going directly to the else: we need to say "if not", because if the table is not there, it should run the if branch first and only go to the second branch afterwards. Perfect, that's what it should do. So first of all it will create the table, which is this command. Makes sense? Then, once the table exists, it will run the upsert command. I think it should have done that. Let's refresh and check it.
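
The "if not exists, create; else upsert" flow described here might be sketched like this. The function name, the append write mode, and the callback are assumptions made for illustration; only tableExists and the DataFrame writer calls are real PySpark APIs.

```python
from typing import Callable


def create_or_upsert(spark, df, fq_name: str, upsert_fn: Callable[[], None]) -> str:
    """Create the Delta table on the first run, upsert on every later run."""
    if not spark.catalog.tableExists(fq_name):
        # First run: there is nothing to merge into yet, so write the table.
        df.write.format("delta").mode("append").saveAsTable(fq_name)
        return "created"
    # Later runs: the table exists, so merge instead of appending duplicates.
    upsert_fn()
    return "upserted"
```

Without the "not", the first run would fall into the merge branch and fail, because the target table does not exist yet.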

Silver, and yes, the customers table is there. Let's try to query the table: SELECT COUNT(*) FROM pyspark_dbt.silver.customers. Perfect, 200. Let me run this code one more time, and I should see only 200 records instead of 400, because we have upsert logic here.

Now it is saying "silver.customers is not a Delta table". So what have we done there? silver dot... Where is our Delta object? Oh, the catalog is not pyspark, it's called pyspark_dbt. Now let's run it one more time.

So now what's the error, bro? I think we have used something extra, some braces or something. It was saying there's an additional curly brace. Oh, silly mistake: we forgot to add the f prefix here. I was like, where did we miss curly braces?
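
A small illustration of the missing f prefix, using a hypothetical table variable. Without the prefix, Python leaves the braces in the string, so the table name that reaches Spark still contains a literal curly-brace placeholder.

```python
table = "customers"

# Without the f prefix the braces are kept literally, so the string
# still contains the text "{table}" instead of the table name.
broken = "pyspark_dbt.silver.{table}"

# With the f prefix the placeholder is replaced by the variable's value.
fixed = f"pyspark_dbt.silver.{table}"

print(broken)
print(fixed)
```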

Let's hope for the best. Oh, this view is so good, right? See, on the right-hand side they have added all this green; it is a kind of loading step. Nice. And by the way, very well done: all the records are processed now, and you can see 200 records. That means our upsert logic is perfect and we do not need to worry about anything. Our upsert, or basically merge, condition is perfect.

So this is done. What do we need to do now? We need to process all the tables similarly. Let me open another table that we have in bronze. The next table is, let's say, drivers. Let's cover that one.

Drivers: let's create a driver DataFrame, df_driver = spark.read.table with the drivers table, and let's have a look at it.

And here is your homework as well: if you want to add more functions to your class, you can, and you should, as long as they are generic and applicable to all the DataFrames. So far we have just three functions, and if we find more, we can add them. Makes sense?

So these are the drivers. We have similar transformations for drivers as we have for customers. Why? Because customers are human beings and drivers are human beings too, right? So we have first name, last name, phone number, all those things. Let me just copy and paste the transformations for drivers; it will save us a lot of time. First of all, we clean the phone number, and then we create the full name from the first name and last name. Makes sense? Make sure that you update the DataFrame name to df_driver, otherwise it can create a mess. OK, let's run this cell.

This is perfect. Now let's process our first name and last name, and call the DataFrame df_driver here as well.

Perfect. So this is also done. Let's run this. Now we'll be using our generic functions, and see, this will save us a lot of time and effort; you can process as many tables as you like without rewriting the same code again and again. I can simply say driver_obj = transformations(). Perfect, this is my object, driver_obj.

First of all, obviously, I would like to dedup my DataFrame. So I will say df_driver = driver_obj.dedup. I could also run process_timestamp before deduplication, but it's fine to apply deduplication first. What are the parameters? Let's check the class: the DataFrame, the dedup column list, and the CDC column. Just three. Makes sense. This is my first transformation.

Now I can say df_driver = driver_obj.process_timestamp, and I will simply pass df_driver. Then I want to apply the upsert: if not spark.catalog.tableExists, the same step that we did for customers, this time with drivers; else, simply apply upsert. We have the unique key driver_id, as you can see. So this is the command I can run now. It will first of all create the table, and then it will upsert if I reprocess it, just to test my upsert logic.
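
The per-table routine repeated for customers and drivers could be folded into one helper along these lines. This is a sketch under assumptions: the method names and argument order on the transformations object (dedup, process_timestamp, upsert) are inferred from the narration, and the default catalog name is illustrative.

```python
from typing import Sequence


def process_to_silver(
    spark,
    transformer,
    df,
    key_cols: Sequence[str],
    table: str,
    cdc: str,
    catalog: str = "pyspark_dbt",
):
    """Run the generic silver steps for one table and return the final DataFrame."""
    fq_name = f"{catalog}.silver.{table}"

    # Step 1: keep one row per key, then stamp the processing time.
    df = transformer.dedup(df, key_cols, cdc)
    df = transformer.process_timestamp(df)

    # Step 2: create the table on the first run, merge on later runs.
    if not spark.catalog.tableExists(fq_name):
        df.write.format("delta").mode("append").saveAsTable(fq_name)
    else:
        transformer.upsert(df, key_cols, table, cdc)
    return df
```

A call for drivers would then be a single line, for example process_to_silver(spark, driver_obj, df_driver, ["driver_id"], "drivers", "last_updated_timestamp").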

OK. And just to make sure everything is fine, I can say SELECT COUNT(*) FROM pyspark_dbt.silver.drivers. Perfect, a count of 50.

See how quickly we can develop now. It's a one-time effort, and everything will be dynamic in the future: you just reuse those pieces instead of developing each thing from scratch. Just imagine rewriting your upsert logic, your deduplication logic, your process-timestamp logic, everything, for every table. Imagine the hard work. That's how you build real-world solutions.

Now let's work with another table, another source basically, which is locations.

So df_loc = spark.read... And I hope you are understanding a lot of things. Even if you do not understand everything, just rewatch the segments of the video that matter, because they are important for absorbing the knowledge. Understanding the concept is one thing. Doing it yourself is another. And the third thing is taking notes. You have to follow all three steps.

A lot of you will say, "Hey, my step one is done, I have understood it, that's fine." No, you have to do it, you have to build it. And once you have built it, you need to take notes so that you can remember those things. Otherwise you will forget everything in just one week, trust me, even if you have built it. It's up to you.

OK: pyspark_dbt, location, basically locations.

So here we have locations, and this DataFrame is actually very clean, because this is just locations data and there is nothing to fix here. So I will keep it as it is and simply apply our generic transformations, and that's it.

Obviously this is a playground, and you can play with your DataFrames too and add more transformations as you like; we just need to cover everything in the given time. I will be really happy if you apply more transformations on your own, just for learning purposes. You could concatenate columns, or transform a column to uppercase. Good. So I will simply create an object called location, or basically loc_obj.

And if you are learning a lot, if you are learning new things you didn't know about, just drop a lovely comment in the comment section. It literally helps me a lot and makes me happy, and if you want me to keep creating these kinds of videos, I should know, right? So just let me know in the comment section.

OK: loc_obj = transformations(). So this is our object. Let's apply it: df_loc = loc_obj.dedup. Let's dedup it, and we can also apply process_timestamp, so let's apply both transformations together. Perfect. And now the upsert: if not spark.catalog.tableExists, and else upsert. Perfect, let's run this. And let's run it one more time just to check that our upsert logic is working fine for locations, because every table is unique, right? That is why we check: SELECT COUNT(*) FROM the locations table.

Perfect. Very good. Now let's process our fourth table: payments.

And you know what? I know that you have learned a lot in the silver layer: classes, dynamic solutions, everything dynamic. But our gold layer is even more interesting, because it is built with dbt. In dbt we'll be creating our slowly changing dimensions and everything else, and it will be a lot of fun, trust me, and you will learn a lot.

So let's do payments: df_pay = spark.read.table, and let's first of all display df_pay. The table is payments, not pay. Perfect.

This one is very interesting, because it holds all your payment details. The data is almost entirely clean, but I want to introduce one cool transformation that will be very handy for you. Let's say you want to apply CASE statements in PySpark. Let's learn that using this example.

Let's say you have the payment statuses, success or failed, and I think there is one more, pending. Yes, three: success, failed, and pending. And you have the payment methods: cash, wallet, and card, and card is very critical. So you want to create another column built specifically for card, and you need to say: if the payment method is card and the payment status is success, then say success. If it is pending, then pending. If it is failed, then failed. That is just for card; for any other value it will say "other". Makes sense? So it is like an online payment: you need to create a column called online_payment_status based on these conditions. Let's do that.

I'll simply say df_pay = df_pay.withColumn, and the column name will be online_payment_status. And here I will say when, col of payment_method equal to, equal to, card. That is one condition. Then I will add an ampersand, and col of payment_status equal to, equal to, success. These are two conditions: condition number one and condition number two. You should wrap each of them in its own parentheses. It is a kind of BODMAS rule, similar to that.

So when this is true, you define what you want to return. See, this is condition number one... who wrapped this like that? This is my first condition, and there should be no closing parenthesis here. Then this is my second condition. Let me just move it here. So: when, first condition, and, second condition. Makes sense? We have literally wrapped each one, and we can remove this extra parenthesis from here. Perfect. If both conditions are true, and we wrap both conditions in a new pair of parentheses, then I will return, let's say, online success. I know this is tricky, but it is good. So this is just my first when.

That is just one condition. Then I will write another when: I will say .when this is equal to card and failed, then it will be called online failed. Good. Those are two. I have another one, which is pending, so I can simply copy and paste from here and call it online pending. When all three are checked, if we have any other case, I will simply say .otherwise, and return offline. Simple. Now let's look at the output.
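
The chained when / otherwise expression described above might look like the sketch below. The column names, the capitalised literals ("Card", "Success", and so on), and the hyphenated labels are assumptions; adjust them to the actual data. Each comparison needs its own parentheses because in Python the & operator binds more tightly than ==. A plain-Python version of the same rule is included so the expected labels can be checked without a Spark session.

```python
def online_payment_status(method: str, status: str) -> str:
    """Plain-Python version of the rule, handy for checking expected labels."""
    if method == "Card" and status in ("Success", "Failed", "Pending"):
        return f"online-{status.lower()}"
    return "offline"


def with_online_payment_status(df):
    """Add online_payment_status to df using chained when/otherwise."""
    # Imported lazily so the helper above can be used without Spark.
    from pyspark.sql.functions import col, when

    return df.withColumn(
        "online_payment_status",
        # Each branch combines two parenthesised comparisons with &.
        when(
            (col("payment_method") == "Card") & (col("payment_status") == "Success"),
            "online-success",
        )
        .when(
            (col("payment_method") == "Card") & (col("payment_status") == "Failed"),
            "online-failed",
        )
        .when(
            (col("payment_method") == "Card") & (col("payment_status") == "Pending"),
            "online-pending",
        )
        # Anything that is not a card payment falls through to here.
        .otherwise("offline"),
    )
```

The branches are evaluated in order and the first match wins, which is the same behaviour as a SQL CASE WHEN ... ELSE ... END expression.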

display dfp pay let's try to see

let's try to see okay so here you can see offline offline

okay so here you can see offline offline and online failed because this is an

and online failed because this is an this is a card payment and it was failed

this is a card payment and it was failed and this should be failed as well this

and this should be failed as well this should be pending. Let's test it. See

should be pending. Let's test it. See online pending. So that's how you can

online pending. So that's how you can just work with case when statements in

just work with case when statements in your data frame in your pispark data

your data frame in your pispark data frame. Make sense? Make sense? Okay.

Very good. So that's how you can work with those things. Once you are done with this, let's create our class objects. Let's do that. So basically I'll simply say payment object equals Transformations, and I want to dedup it and process it. Once I'm done with those things, I will simply say: if not spark.catalog.tableExists, write the payments Delta table; else, upsert. See, are you noticing the power of dynamic code? It is now just a matter of applying this one line of code. Otherwise I would have been writing the same lines of code multiple times. Make sense?
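The "if the table does not exist, create it, else upsert" step can be sketched as one reusable function. This is a hedged sketch, not the video's Transformations class: the function names, the src/trg aliases, and the idea of passing a key list plus a change-tracking column are assumptions. It relies on spark.catalog.tableExists (PySpark 3.3+) and the Delta Lake DeltaTable merge API.

```python
def merge_condition(key_cols, src="src", trg="trg"):
    """Build the ON clause for the merge from a list of key columns."""
    return " AND ".join(f"{trg}.{c} = {src}.{c}" for c in key_cols)


def write_or_upsert(spark, df, table_name, key_cols, cdc_col):
    """Create the Delta table on the first run, upsert on every later run.

    table_name, key_cols and cdc_col are supplied by the caller; the names
    used for them in any example call are hypothetical.
    """
    # Imported lazily so merge_condition() can be used without Delta installed.
    from delta.tables import DeltaTable

    if not spark.catalog.tableExists(table_name):
        # First run: nothing to merge into yet, so write the table.
        df.write.format("delta").mode("append").saveAsTable(table_name)
        return "created"

    target = DeltaTable.forName(spark, table_name)
    (
        target.alias("trg")
        .merge(df.alias("src"), merge_condition(key_cols))
        # Only overwrite a row when the incoming version is at least as new.
        .whenMatchedUpdateAll(condition=f"src.{cdc_col} >= trg.{cdc_col}")
        .whenNotMatchedInsertAll()
        .execute()
    )
    return "upserted"
```

A call would look like write_or_upsert(spark, df_pay, "silver.payments", ["payment_id"], "last_updated_timestamp"), where the table, key, and timestamp names are placeholders for whatever the notebook actually uses.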

Okay. So this is also done. Now let's query it: select count(*) from silver.payments, with the PySpark dbt catalog in front. Oops, dot silver. Perfect. So, a thousand records. Perfect.

So now almost all the tables are done. How many are left? If I check bronze, I think we have two left: trips and vehicles. I will show you only one here, because the other one I will show you in dbt. I want to show you how you can incrementally apply upserts using dbt, because that's a great package for that. That is why one table will be transformed there, in dbt. That's how you will learn more and more things in dbt: other than just building the gold layer, we will also build one table of the silver layer. Make sense? Okay, very good.

So let's create, let's say, vehicles. Trips is our fact table, so we will create the silver-layer fact table in dbt, because fact tables are big tables and we need to load their data incrementally. We could also apply that logic here, but I want you to learn dbt in more depth as well. Gold will be very much in depth, I know, but yes, more and more things. Okay, so let's do it for vehicles.
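As an aside on the incremental idea mentioned for the trips fact table: instead of reprocessing everything, each run picks up only the rows that changed since the previous run. The plain-Python sketch below models that high-water-mark filter; it is not dbt or Spark code, and the row layout and column name are assumptions for illustration.

```python
from datetime import datetime


def rows_to_load(source_rows, last_loaded_at, ts_key="last_updated_timestamp"):
    """Return only rows newer than the high-water mark of the previous run."""
    if last_loaded_at is None:
        # First run: there is no previous load, so take everything.
        return list(source_rows)
    return [row for row in source_rows if row[ts_key] > last_loaded_at]


# Hypothetical sample data standing in for the trips table.
trips = [
    {"trip_id": 1, "last_updated_timestamp": datetime(2025, 1, 1)},
    {"trip_id": 2, "last_updated_timestamp": datetime(2025, 1, 2)},
    {"trip_id": 3, "last_updated_timestamp": datetime(2025, 1, 3)},
]

first_run = rows_to_load(trips, None)
second_run = rows_to_load(trips, datetime(2025, 1, 2))
print(len(first_run), [row["trip_id"] for row in second_run])
```

On a large fact table this is what keeps later runs cheap: the second run only touches the one row that arrived after the last load.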

For vehicles I will simply say df_vehicle = spark.table(...), then display(df_vehicle). Let's see if we have anything to apply. Okay: license plate, vehicle ID, make, year, vehicle type, and last updated timestamp. Hm, looks good. Model is this, and this is the license plate. I think everything is fine.

One thing I would like to do is with make, or you could also say model: it should be highlighted in the data frame, because there are very few distinct values and we should be able to highlight them. So I would like to make it uppercase, and it's very simple: df_vehicle = df_vehicle.withColumn("make", upper(col("make"))). That's it. Make sense? Just make. And if I say display(df_vehicle), you will see that our make is now uppercase. See, make. We could also convert model, but make is the important one. Makes sense? Yeah.

So now let's create the vehicle object. Same thing: process the timestamp, and then the same if not spark.catalog.tableExists ... else pattern. Perfect. Let's actually move it to a new code cell. Perfect. And let's see the count: select count(*) from the PySpark dbt catalog, dot silver, dot vehicles. 50. Now let's run this one more time, and it should still be exactly 50 if everything is fine.
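Why the count should stay at 50 on a rerun: an upsert keyed on the vehicle ID updates existing rows instead of appending duplicates. The sketch below models that behaviour in plain Python rather than Delta Lake, with hypothetical sample rows; the key and timestamp column names are assumptions.

```python
def upsert(target, incoming, key="vehicle_id", ts="last_updated_timestamp"):
    """Merge incoming rows into target (a dict keyed by `key`); newest row wins."""
    for row in incoming:
        current = target.get(row[key])
        if current is None or row[ts] >= current[ts]:
            target[row[key]] = row
    return target


# 50 hypothetical vehicles, with make already uppercased as in the walkthrough.
batch = [
    {"vehicle_id": i, "make": f"make{i % 5}".upper(), "last_updated_timestamp": 1}
    for i in range(50)
]

silver = {}
upsert(silver, batch)
count_after_first_run = len(silver)

upsert(silver, batch)  # rerun with the same batch
count_after_second_run = len(silver)
print(count_after_first_run, count_after_second_run)
```

A plain append would have produced 100 rows on the second run, which is exactly what the rerun check is meant to catch.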

Perfect. Our notebook is completed. Well done, my data fam. Well done.

So our silver notebook is ready, full of transformations and full of dynamic classes. Can you imagine? And don't worry, I'll upload this notebook, so just enjoy it and play with it. I know that you have learned a lot in the silver layer. That was my agenda from the beginning: I wanted to show you classes, upserts, and all those generic transformations, and I'm really happy that you learned how to work with Spark Structured Streaming dynamically. That's another amazing thing. So yeah, these are some of the things you can work with, and they are really, really helpful. So this was all about our PySpark coding. Now we are going to work with dbt.

Are you ready for dbt? Because it will be an amazing segment of the video. So now let's see how we can work with dbt. I would expect some knowledge of dbt before jumping into the dbt part. If you do not have any, I don't know what you are doing, bro, because I've already launched a detailed video on dbt. Just search for dbt masterclass and Ansh Lamba. I typed dbt and it is suggesting data modeling. Wow. Just go on YouTube and search for dbt Ansh Lamba. See, this is the video, dbt Masterclass. It is a 5-hour-long, amazing masterclass that covers dbt from scratch, so you can literally learn everything, including CI/CD. So go and watch that dbt video. I will try to include the video link as well so that you can check it out. Make sense?

So I would expect some knowledge of dbt, and let's actually perform all the things in dbt. Just for your information, we'll be using the dbt Cloud version, which is free. Plus it is very easy to manage, and very easy to get started with directly. So we'll be learning dbt Cloud. And obviously, if you are looking to crack interviews, dbt Cloud is a thing that you should know, because everything is on the cloud right now. Make sense? So we will be setting up everything on dbt Cloud. Let's do it.

Now we are going to talk about dbt. So what is dbt? Basically, dbt is the data build tool, which is designed to make your code more modular and to take over the overhead of managing all your transformations. Understood anything? No, I know. So basically, dbt is, first of all, very much in-demand open-source software. Obviously the managed offering is also there, but overall dbt is compatible with all the clouds: Azure, GCP, AWS. So you can target any job. dbt is also compatible with all kinds of data platforms, such as Fabric, Databricks, Snowflake, so many platforms, right? So dbt is a game-changing thing. dbt is an amazing tool. Plus, dbt can handle your code modularity, and you will see how it manages that. Don't worry, because we learn by doing. Again, there's a detailed 5-hour-long tutorial, and you can imagine what you will learn in 5 hours: almost everything about dbt, including deployment. So you can watch that video I just mentioned. Okay.

So now let's talk about dbt Cloud. In order to create a dbt Cloud account, you can simply open a new browser tab and search for dbt sign up. Click on Sign up for dbt and it will take you to www.getdbt.com. Here you can simply put your email, first name, and last name. Company is optional: if you are unemployed, it's fine; if you are a student, fine; if you are in a company, just put the company name. Then the password and everything, and just click on Create my account. Make sense?

Once you have created all these things, you can simply click on Login. Let me show you. In my case, let me search for dbt sign up. I already have the account, so I'll simply click on Login. It will take you to cloud.getdbt.com. Let me put in my email account. So this will be the homepage that you will also see. Okay? And don't worry about that, because we're going to discuss all the things. All the things means all the things within dbt.

Okay? Basically the major ones, because we already have one dedicated masterclass on dbt. In that particular video we were using dbt Core, which is the engine behind dbt and which is open source. In this particular project we are leveraging dbt Cloud, which is a managed service. We do not need to set up anything.

There's one more thing, which is called the dbt Cloud CLI. Cloud CLI means that whatever we are doing here in the cloud, we can also do through the CLI, which stands for... oh man, I forgot the full form, wait: command line interface. So you can literally use a command line interface to interact with your cloud version of dbt. It is not equivalent to the dbt Core CLI. See, there are two things you use locally, dbt Core and the dbt Core CLI. Then there is dbt Cloud, and then the dbt Cloud CLI, if you want to work with your cloud version, which is the paid version, but locally. Make sense?

Okay, very good. So basically, when you set up your first dbt account, you will get 30 days of premium services, the same that anyone on the premium or paid version gets. After 30 days you will be automatically transferred to the developer plan, which is free for as long as you want to use dbt. That's all, and you do not need to worry. You do not need to change anything: it will be automatically transferred to the free edition after 30 days. So you do not need to set up any billing, nothing. Everything will be automatic.

But yes, for the first 30 days you will get everything premium. The thing is, whatever we are building here in this project uses only the free services. You do not need to be within the first 30 days of premium services at all. That is the best thing about dbt Cloud. It is similar to Databricks in that way. Why? Because Databricks also gives you everything for free. dbt Cloud is also giving you everything for free, but just for one developer. That's it. You cannot bring your team into this workspace, and you can build only one project at a time, because you're using the free edition. I think they are giving you the free edition so that you can learn; you cannot build multiple projects. Make sense?

So let's build our first project, and I will name it, let's say, PySpark dbt project. Make sense? So this is my project name, and even if you do not define anything here, we can still configure it later on, so do not worry. If I click on Advanced settings, it will simply show Project subdirectory. We do not need to worry about that. Simply click on Continue. Okay, so this is my project.

Then it is asking me: hey, do you have any connection? No, we do not have any kind of connection, but we will be building one. What is a connection? Basically, in dbt we have two things. One is dbt itself, which creates the code and everything. The second thing is the adapter. In dbt we can use any adapter: Databricks, Fabric, and all those tools that we have. We want to work with Databricks, so we will be using the Databricks adapter and we'll be creating a connection with Databricks.

So simply select Add new connection and let's create a connection with an adapter. See, we have so many adapters. I'll simply pick Databricks, and the connection name will be Databricks, plus an abbreviation for connection. Now it is asking me for two things: host name and HTTP path. You can simply go to your Databricks workspace and go to Compute. Within Compute, click on this serverless warehouse and open Connection details. This is your host name; just provide it here. Where? Oh, here, host name. And then the HTTP path is this one. Make sense? This is your connection. Simple. Optional settings, it's fine. So now let's click on Save.

So this is my adapter, with all the things that I want. Now it is showing connection usage, environment type, and all those things. If I click on Optional settings, it shows Catalog. Let me first click on Edit. As per its definition, it is optional. But do you know what? It is not optional at all. You have to provide the catalog name. Have to. So our catalog name is PySpark dbt. Okay, then let's click on Save. It warns that your changes will impact all users. Yes, I'm the only user. Okay, so I hope now it makes sense, because we have our connection ready.

Okay, you can also land on the same page another way. Click here, then on your account name, and it may offer to create a new account. If you do not want to create a new account, it's fine. Then you can click on Account settings, and it will show you the settings page like this. These are all the connections that we have, and it is showing the Databricks connection. Sorted. Very good. So this is just a connection.

If I click on Projects, I will see my projects. I have just one project, and I cannot create multiple projects. See, this button is grayed out now, because I do not have access to multiple projects. Simple. Maybe in your case you can, because you will be within the 30 days of premium services. But once that is over, you will not be able to create a new project. Okay, makes sense. So this is our project, and our connection is also ready. If I click here, I will see all the configurations: project subdirectory, artifacts, source freshness, and everything. Right? Very good.

Now we want to start with our development. If you click here on Studio, it will say Configure environments. Environment required: this project does not have a development environment. So it will simply say Configure environment, because you need at least one environment, right? So let's create the environment and select this connection.

Now it is asking for development credentials. What kind of credentials? See, this is your Databricks, and dbt will access this Databricks workspace, so obviously it should have some kind of permissions. To provide the permissions, you can provide an access token. Simply click here and then Settings. Then go to Identity and access... um, not really. Go to Developer, then Access tokens, and let's create a new token. I created this one earlier, so let me revoke it. Let's create a new token and call it PySpark dbt. Let me set the lifetime, let's say 3 days. It's fine. Let me click on Generate. You also need to save it, otherwise it will be gone. Let me copy it and paste it here.

Okay, simple. And schema. What is this schema? Basically, whenever dbt creates our objects, by default it will use this thing called schema. You can literally provide anything, because you'll be overriding it anyway. By default, it picks your first name's initial plus your surname, so here it will be called dbt_alamba. I will keep it as it is. Keep the target name as default too, and just click on Test connection. Let's make sure everything is fine. And trust me, it is very, very easy to override these things. Usually we keep whatever is there and then override it whenever we deploy our stuff. Make sense?

Okay, very good. So now it is testing our connection, and I think it is completed, so we are good to go. Just click on Save. Only once that is completed can you go ahead. The next step is setting up a repository, basically a repo. We have so many options: GitHub, GitLab, Git clone, Managed. We will be using Managed. What is Managed? Managed is basically the Git service provided by dbt Cloud, so we do not need to set up any GitHub repo, Azure DevOps, nothing. Simply select Managed. The repository name will be, let's say, dbt repo. Okay, click on Create, and it is saying your project is ready.

Now it says here are some ways you can get started with your project, and if you get stuck and need assistance, you can chat with support, and so on. So what are the options?

The first is "Start developing in the Studio". That is what we will be doing, because we want to use the cloud Studio IDE instead of setting up everything on our own.

The second is "Get started with CLI". As I mentioned, you can create your project in dbt Cloud but develop it locally with the dbt Cloud CLI, which is different from dbt Core. Do not get confused. dbt Core is the open-source engine that you run on your machine, while the dbt Cloud CLI is the way to interact with the resources of your dbt Cloud account. Make sense?

Then there is "Check out the getting started tutorial", where you can learn about this. And even if you click it and open it in a new tab, you will see that it is the CLI setup for the cloud version, not for dbt Core.

So I will simply pick "Start developing in the Studio". Click here, and what will happen? It will create a very nice structure for my project, including every single thing. Just wait. It is saying something about no optional files. So this is my dbt Cloud project. You can click here, or click "Initialize dbt project"; either is fine. If you click here, you will see "Create folder" and "Create file"; it's up to you. It also says to get started by opening a file or creating a new scratchpad. Let's click "Initialize dbt project", because it is the equivalent of the CLI command dbt init. The moment I hit it, all these folders are here. See, everything is there. So this is the dbt initialize command, which is called dbt init. Make sense?

First of all, we need to create a branch, because we are working in a Git environment. We cannot make changes directly to the main branch; we should always create a feature branch. So let's create one. Let me click "Merge this branch"... no, not merge, because we first need to make the initial commit to register the branch. Just click "Commit changes". Now all these changes are committed. See, the main branch is locked. It is protected, because we have established that branch, and now we can create a feature branch on top of it. Click "Change branch" (you can click here as well). It asks which branch you want to go to. I do not have any other branch yet, right? So I can simply click "Create branch", name it feature, and click Submit. Now the branch is created, and you are in the feature branch.

Very good. Now just stay with me. I know some of you may have prior knowledge of dbt and some of you may be totally new to it. You do not need to worry, because I will make you feel comfortable.

First of all, we need to talk about the dbt project file, dbt_project.yml. This is the backbone of the whole project. Backbone is a big word, but it is the truth. And first of all, what is YAML? YAML is a file format that serves as a more readable alternative to JSON: when we want to create many nested objects, JSON can get very complicated, and YAML is like a more human-readable version of it.

Let's start from the top. It says name, which means the project name, and the project name is my_new_project. Now, this project name should be different from the project name that you defined earlier. What project name did we define? We said PySpark dbt project, right? So I like to keep it as pyspark dbt project, and since we cannot keep the name the same, I will simply add internal, because this is an internal project. Simple.

What is profile? Profiles are used when you want to develop something locally, but we are not developing locally, so just keep it as default. Then it lists model paths, test paths, and so on. What are these? Basically, whatever we build in dbt is developed using models, so the file simply specifies the paths here: all the models are in the models folder, and likewise for the other paths.
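As a rough sketch, the top of dbt_project.yml after this edit might look like the following. The underscored spelling of the project name is an assumption (project names cannot contain spaces), and the remaining keys are the defaults that dbt init typically generates, so your file may differ.

```yaml
# dbt_project.yml (top section); a sketch, not the exact file from the video
name: 'pyspark_dbt_project_internal'   # assumed spelling of the renamed project
version: '1.0.0'
config-version: 2

# Only relevant for local development with a profiles.yml; left at its default
profile: 'default'

# Folders where dbt looks for each kind of resource
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
```

The only line changed at this point is name; everything else stays as generated.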

Again, all these things are discussed in detail in the video that I just mentioned, the 5-hour dbt masterclass. Simply search YouTube for "dbt Ansh Lamba" and you will find it. Watch that video; only then will you be able to understand a lot of the things here. I will try my best to make you feel comfortable if you do not have any experience, but you need to take my advice as well: watch that video first. That is why it was made before this one. There's common sense behind it, right?

So here we have the models and everything. First of all, we'll just save this file, and that's it. You will see that the file is marked as modified. Makes sense?

So what is the first thing that we need to do? First of all, we want to query the data sitting in the bronze layer. Why the bronze layer? If you remember, there is one table we didn't transform, trips, which is a fact table. So let's try to transform that table and see what happens.

I can query the data, and I will show you how. If I go to analyses, this is an empty folder, basically a scratchpad where we can write SQL queries of our own. I will click "Create file" and call it, let's say, scratch.sql. Click Create. Perfect. Here I will simply write select * from pyspark_dbt.bronze.trips.

Perfect. Let me try to run this. How do you run it? Simply drag this panel up from the bottom and click Preview, just to preview what the data looks like. It will process the query, so let's wait for the results. And this is our preview. See, we are able to query the data sitting in Databricks, and this is dbt, right? This is the power of the connection we have built using dbt. It means we can query literally any data there.

Very good. But do you know what? I am going to talk about some more advanced, real-world things here. If I click on models, I see a directory called example, which I want to delete. Why? Because I will be creating my own directories here. Models are the things that will be deployed or created for our project: tables, views, anything you want to create is created using models. I know that I will be creating a gold layer and a silver layer. The gold layer for sure; why the silver layer? Because in the silver layer I'll be creating one table, trips. So I'll create two folders: first silver, and second gold.

Now, what is the advantage of creating folders here? I could create the files directly under models. But if I want to specify properties differently for silver, I can create a properties file within the silver folder. If I want specific properties for gold, I can create a properties file in the gold folder, covering only the files in the gold folder. That is the advantage of creating subdirectories here. Make sense?

Very good. I will create one more directory. Which one? Bronze, obviously? No, I already have the bronze layer. I'll be creating a directory for sources. What do I mean? I know that I can query the data like this, with the full table name. But I do not want to do that. Why? Because in dbt I also want to see the lineage. Can I see the lineage this way? Obviously not, because dbt doesn't know what this thing is. Yes, we are able to query the data, but it is not an internal object to dbt. To make this object internal to dbt, we have something called sources. With sources, I will register this table as my source.

Make sense? And what are sources? Let me show you. Search for "sources dbt" and let's read the documentation: "Add sources to your DAG". Let me zoom in. So this is the YAML configuration that we need to write. Let me copy it first, and then I will show you what it means. If I go back and go to the sources folder, I create a file and simply call it sources.yml. You can name it literally anything; I think that is written here as well. Yes, you can name it anything. So here I will simply paste that code.

So now let's talk about this code. What is it? First of all, we have defined the version, which is self-explanatory: we are simply using version 2. Then comes sources, which is the type of resource that you want to define. What are the different types? Sources, models, snapshots, and so on. Here we are defining sources, and under sources we have the list of sources. In the documentation example, the first source name is jaffle_shop and the second is stripe.

In our case, I will be creating two sources: first source_bronze and then source_silver. In source_bronze, what will the database be? The database name is equivalent to your catalog, so it will be pyspark_dbt. And the schema name will be bronze. Which tables do I want to register? Only one, because the other tables are already there: trips.

Let me copy this block and paste it below, because the second source will be source_silver. The schema will be silver, and here there will be several tables. We will be having trips in silver eventually, but for now we do not have it there, so we can remove it. Here we can say customers, then another name, locations, and then drivers. And be careful with the spaces you are using, because this is really, really important. This is YAML. So customers, locations, drivers, and then, let me just check, payments and vehicles.

So these are my sources. The advantage is that these are all now internal objects to dbt that I can use for the lineage, to track how the data is flowing. Make sense? Let me just save it.
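Putting the dictated pieces together, the sources file might look roughly like this. The identifiers pyspark_dbt, source_bronze, and source_silver are assumed spellings of what is said in the video; the keys (version, sources, database, schema, tables) are standard dbt source properties.

```yaml
# models/sources/sources.yml; a sketch assembled from the walkthrough
version: 2

sources:
  - name: source_bronze        # assumed spelling of "source bronze"
    database: pyspark_dbt      # on Databricks, database corresponds to the catalog (assumed name)
    schema: bronze
    tables:
      - name: trips            # the one bronze table not yet transformed

  - name: source_silver        # assumed spelling of "source silver"
    database: pyspark_dbt
    schema: silver
    tables:
      - name: customers
      - name: locations
      - name: drivers
      - name: payments
      - name: vehicles
```

Indentation carries the structure here: every table entry must sit at the same depth under tables, and each source's keys must line up under its own name.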

If I click on Lineage, it says lineage is currently unavailable, and it will refresh. Once it is refreshed, you will see all the objects here as well, but that's not a big deal.

Now, if I go back to the scratch file and want to use this internal object, how do I do that? By referencing it. The code is very simple: write select * from, then double curly braces, then the ref function, which names what you want to refer to. If you open the documentation, you will see how to refer to sources. Oh, sorry, we do not use ref here; we need to use source. ref is for models, and sources have their own function. So you say source, then the source name, then the table within it. What is the source name? If you go to the YAML file, the source name is source_bronze, and the object I want is trips. So the query becomes select * from {{ source('source_bronze', 'trips') }}.

If I now save this file and preview the data, should I see the data? If everything is fine, yes; if anything is wrong, no. And yes, we are able to see the data. Wow. This is how we refer to objects that are internal to dbt. If I open Lineage, I see this node called source_bronze.trips, and it is an internal object. And if I click on sources, see, all the objects are here. These are no longer just Databricks objects; they are internal to dbt as well, because we have created the sources. And how? Using a YAML file. I hope it makes sense.

Very good. Now we need to do a few things. If I go to dbt_project.yml, we first need to talk about materialization, which is very important. What is materialization? It is how dbt builds a model in the warehouse, for example as a table or as a view. Look at the models section here: under models you will see the project name. First of all, we need to change this project name to the one we picked at the top, because it must be exactly the same. Oops. It must be exactly the same. Okay.

So this is the project name, and within it we configure the models. Where are the models? They are stored here, in the models folder. The default config says that everything built inside the example folder should be materialized as a table. But we do not have the example folder; we deleted it. So I will say that whatever is stored in the gold folder should be materialized as a table, and whatever is stored under silver should be materialized as a table as well. You can even pick view, that's fine, but we want to materialize it as a table. By the way, we'll be creating snapshots for our gold layer, but that is fine; I will let you know how to change it. So let's click Save.
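The folder-level materialization described here can be sketched as the models block below. The project key is an assumed spelling and must match the name set at the top of dbt_project.yml; silver and gold are the folders created under models.

```yaml
# dbt_project.yml (models section); a sketch of the edit described above
models:
  pyspark_dbt_project_internal:   # assumed spelling; must equal `name:` at the top of the file
    silver:
      +materialized: table        # default for every model under models/silver
    gold:
      +materialized: table        # default for every model under models/gold
```

Because the setting is attached to a folder, any model file added to that folder later picks it up without further configuration.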

Makes sense. You can even define a properties file in the silver and gold folders. You can create any properties.yml there and put this configuration in it as well. Let me show you: search for "properties in dbt", "Define properties", and this is the kind of properties file you can use. These are the sources, which we have already talked about, and these are the models. So that's how you define a properties file: under models you have the name, the configurations, the columns, and if you also want to set the materialization, it goes there too. See: config, materialized: view, then columns, and so on. Everything has already been discussed in detail; here we are covering just what is necessary to complete this project. As the page says, here's an example that defines both sources and models for a project. That's how you can define properties. But we do not need to define any properties, because we are simply leveraging dbt_project.yml.
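For comparison only, a per-folder properties file of the kind shown in the documentation might look like this. The file is hypothetical and is not created in this project; the file path, description text, and column name are illustrative assumptions.

```yaml
# models/silver/properties.yml (hypothetical; this project relies on dbt_project.yml instead)
version: 2

models:
  - name: trips                    # must match the model file name (trips.sql)
    description: "Silver-layer trips fact table"   # illustrative text
    config:
      materialized: view           # model-level config takes precedence over the folder default
    columns:
      - name: trip_id              # assumed column name
        description: "Unique identifier of a trip"
```

A file like this is worth adding once individual models need their own descriptions or settings; for a single shared default, the folder-level config is enough.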

dbt project. Make sense? Okay. Very good. So now we

Make sense? Okay. Very good. So now we want to first of all create our silver

want to first of all create our silver entity which is trips. Yes. But for that

entity which is trips. Yes. But for that do you know what

do you know what we want to create an incremental data

we want to create an incremental data basically incremental processing unit.

basically incremental processing unit. Okay. So how we can just do that how? So

Okay. So how we can just do that how? So it is very simple. So first of all let's

it is very simple. So first of all let's go to silver. Okay. Let's create a new

go to silver. Okay. Let's create a new file and let's call it as trips.

file and let's call it as trips. SQL. Okay. This is our file. Let me

SQL. Okay. This is our file. Let me close all the other files. Okay. So this

close all the other files. Okay. So this is my file that I'll be just using.

is my file that I'll be just using. Okay.

Okay. Perfect.

Perfect. Perfect. Uh right. So this is my file.

Perfect. Uh right. So this is my file. Okay. So here I want to simply load the

Okay. So here I want to simply load the data. Okay. And how I can just load the

data. Okay. And how I can just load the data? I can simply say select ax from

data? I can simply say select ax from source, right? And then source name is

source, right? And then source name is source bronze and then trips. Perfect.

source bronze and then trips. Perfect. Let's try to run that. Drag it from the

Let's try to run that. Drag it from the bottom and click on preview.
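As a rough sketch, trips.sql at this point is a single select wrapped around dbt's `source()` function. The source name `source_bronze` and the table name `trips` are assumptions based on how they are spoken in the video; they have to match whatever is declared in the project's sources YAML.

```sql
-- models/silver/trips.sql (file path assumed from the narration)
-- source() is resolved to the fully qualified bronze table at compile time.
select *
from {{ source('source_bronze', 'trips') }}
```

Preview only runs this query and shows the result; it does not create anything in the warehouse.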

So it will take some time. It will simply run your code. So this is my data. Make sense? Do I have anything to transform here? Literally no. But it's fine. Actually, we can remove payment method and trip status, because the fact table should not hold this information. In trips that makes sense, because we have trip ID, driver ID, customer ID, vehicle ID, and we just want trip start time and trip end time. Yes, we want the date columns, but we do not want start location and end location. We want the distance in kilometers, and we do not want payment method and trip status, because all this information is already in the dimension tables. Makes sense? So now I can simply write the select statement here. Let's say select

and literally everything that I want, let's say trip ID, and then driver ID, and all those columns. Make sense? But I want to show you the power of dbt: you can even use something called Jinja functions. What is Jinja? Basically, Jinja is a templating language that helps us write our SQL code in a dynamic way. See, we learned how to write dynamic code using Python, basically PySpark. But how can we do the same thing in SQL? If you are using dbt, we can leverage Jinja. Make sense?

So how can we work with Jinja? There is some syntax, not templates, basically syntax, that you should know. For example, if you want to work with an if condition or a for loop, you can write something like this: for i in any list. And if you first of all want to create a list, you can say set. I think there's a very good cheat sheet that you can look at, if I'm able to find a Jinja cheat sheet, just to show you how you can create these things. Oh, we have Copilot as well. Very good. So these are the things that we can define: variable names, expressions, all these things. And I just want to show you how you can also set variables. Jinja templating is used more with HTML, but yes, we can leverage it in dbt as well. Let's see if we have something on this particular website.

I think we should have Jinja, not a cheat sheet, in the dbt reference as well, which will be much better for you to understand. Jinja and macros. Okay. Yeah, this is very good, that's what I want to show.

So let's say first of all you want to create a list. It is very simple. As I just mentioned, you can simply write set, and I can say columns, and then you define the column names. Now I have two options for the list of columns that I want to select. One is to write all the columns. The smarter way is to define only those columns which I do not want to read. Let's say I do not want to read payment method, and I also do not want to read trip status. Okay, makes sense. So this is my list. If I want to show you what the list looks like, I can simply say columns and click on Compile. You do not need to click on Preview. Yeah, perfect. So this is my list. Because this is compiled code, you do not need to run it. So these are the two items that I have in my list. Correct?

So what will I do now? I will start writing my select statement. I will simply say select. Now what do I need to write? I need to select all the columns. Let's do it the simple way, otherwise you will be totally confused. Let me give all the column names. The first one was trip ID, and then it was, I think, vehicle ID. Let me see what other columns we have. Let me open the catalog, then the bronze tables, and then trips. Perfect. So we have trip ID, driver ID, customer ID. Okay: trip ID, vehicle ID, customer ID, and then driver ID. Then we have trip start time and trip end time. Let me just copy and paste. Why am I writing it like this? You will get to know in just a few seconds. Let's say distance in kilometers. That is fine. And this one, last updated timestamp.
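The list being typed here can be sketched with a Jinja `set` statement. The snake_case spellings below (for example `distance_km` and `last_updated_timestamp`) are assumptions transcribed from speech; the real names are whatever the bronze trips table shows in the catalog.

```sql
{# Assumed column names; check them against the bronze trips table. #}
{% set cols = [
    'trip_id',
    'vehicle_id',
    'customer_id',
    'driver_id',
    'trip_start_time',
    'trip_end_time',
    'distance_km',
    'last_updated_timestamp'
] %}

-- Printing the variable lets Compile show the list without running a query.
{{ cols }}
```

Leaving `payment_method` and `trip_status` out of the list is what drops them from the silver table.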

Perfect. So let's say we have this list. Make sense? This is the list. Now I want to make my code modular. I can write select, and one option is to manually write all the column names, but I can simply run a loop. What loop? I can say for column in columns. This will run a loop, and inside the loop I can say print the column. Not print, basically use this variable. You can treat it like an f-string, because whenever we use an f-string, we put the variable in curly braces, and that's it. Make sense? And what is the table name? Basically source, with the bronze source and trips. Right. Perfect. Let me compile this code. This will automatically list all the column names.

What is it saying? endfor. Yeah, we didn't end the loop. We also need to end the loop, so I can simply say endfor like this. Okay. Let's say compile.
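A minimal sketch of the loop at this stage, with the list shortened for brevity. The column names and the `source_bronze` source name are assumptions, as before. It shows the `{% for %}` and `{% endfor %}` pair and the `{{ col }}` expression that prints each item.

```sql
{% set cols = ['trip_id', 'vehicle_id', 'customer_id'] %}

select
{% for col in cols %}
    {{ col }}
{% endfor %}
from {{ source('source_bronze', 'trips') }}
```

Compiling this emits each column name on its own line with nothing between them, so the result is not valid SQL until a separator is added.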

So see, it has literally written this code for me. That's the power of making the code modular. But there's a small mistake, not a mistake, I did it on purpose: I didn't add the comma. Obviously in SQL we need commas between the columns. You could simply type a comma after the variable, because a comma is just text, and it would be emitted. But there's a catch: you cannot write a comma after every column. Just look at this code. You can only write the comma up to the second-last item; for the last item of the loop you have to write nothing. There's a feature in the Jinja for loop for this. I think it will be in this documentation as well, otherwise I can show you in the official documentation. Uh, no. It goes something like this: if it is not the last element of the loop, add the comma; otherwise add nothing. That is the condition you want to build. So let's search: Jinja last, if not last, for loop, and I hope that if our search engine is good it will return the right web page. Yes, and by the way, thanks to Copilot, because this is the right code. It is saying if not loop.last, then add the comma, otherwise do not add the comma. Simple.

So I will add an if condition in my code. First of all I will remove this comma, and I will add the if condition after the variable. I know this is complex. Not complex, this is new, and you need to watch that dbt masterclass video in order to understand all these things, because we are simply applying the knowledge that we learned there. Right, so I'll simply say if, and we will use the percentage signs.

Perfect. If loop.last, basically not loop.last, right? So if not loop.last, then add a comma. If this is not the last element, add a comma; otherwise do not do anything. Then endif. Simple. Now let's compile the code and see the output. Perfect: trip ID comma, vehicle ID comma, customer ID comma, and so on, a comma after every column, but after the last column we do not have any comma, because we specified: add the comma after the column variable only if it is not the last element of the loop. If not loop.last, then add a comma. Simple. We can even save it. So this is the select statement that I have defined here. Make sense? Okay, very good.

So now if I simply run my model, do you know what it will do? It will create the table on top of this select statement. That's how dbt works.
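The select statement built up in this step can be sketched as follows. The column names and the `source_bronze` source name remain assumptions; `loop.last` is the standard Jinja loop variable that is true only on the final iteration.

```sql
{% set cols = [
    'trip_id', 'vehicle_id', 'customer_id', 'driver_id',
    'trip_start_time', 'trip_end_time',
    'distance_km', 'last_updated_timestamp'
] %}

select
{% for col in cols %}
    {# Emit a comma after every column except the last one. #}
    {{ col }}{% if not loop.last %},{% endif %}
{% endfor %}
from {{ source('source_bronze', 'trips') }}
```

For a plain list like this, Jinja's built-in `join` filter gives the same result in one line, `{{ cols | join(', ') }}`; the explicit loop is mainly useful when each column needs extra logic.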

Okay, but I want to make it incremental. So in order to work with incremental materialization (yes, we have a dedicated materialization for it, it's called incremental), you need to tell dbt that we want to create an incremental object. Let's talk about that as well. If I search for dbt incremental materialization, you will see that we can define the where clause, meaning on which column we want to incrementally load the data. And yes, this is the configuration that we can define: config, materialized equals incremental. Make sense?

"Okay, so Ansh Lamba." Yes, sir? "You said that we defined the materialization in dbt_project.yml, right? But now, as per the code, it is defining the materialization in the model itself. So what is that?" Basically, you can define the materialization anywhere, and there are levels. The first level is the parent YAML, which applies to everything. A materialization is just a configuration: you can write your configuration there, or you can write it here in the model. Simple, not a big deal. But when you define your configuration closer to your object, that one takes priority. That one takes precedence.
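As a small illustration of "closest config wins", here is a sketch of an in-model `config()` call. It assumes dbt_project.yml sets the silver models to `table`, as the narration implies; the source name is also an assumption.

```sql
-- models/silver/trips.sql (file path assumed)
-- Assume dbt_project.yml sets materialized: table for this folder.
-- The config() call below sits inside the model file itself, so it is the
-- most specific setting and overrides the project-level value for trips only.
{{ config(materialized='incremental') }}

select *
from {{ source('source_bronze', 'trips') }}
```

Other models in the same folder that have no `config()` block of their own keep the project-level materialization.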

So if I'm defining materialized equals incremental here, and I have written the materialization as table in the project file, table will not apply. Table will be overridden by this. We are simply overriding that value by defining it here. That's why I told you that even if I define it there as a table, that's fine, because I can override it using the config closer to the model. This is my model, trips.sql, and this config is the closest one to it because we are defining the config within the same file. This is the closest config. Make sense?

So this is defined, but this is just 50% of it. You cannot work like this: "Hey, I have defined incremental, now my model will incrementally load the data." Bro, this is only one part of it, because we work with this particular materialization using a macro. A macro is basically a function. So now this macro will come into the picture. I will simply say

is_incremental. Okay, is_incremental. So if this model is incremental, then run this code, which is this where clause. And how do we incrementally load the data? I will simply say where the CDC column, which is the last updated timestamp, is greater than... greater than what? Greater than or equal to? Not equal to, greater than: select the maximum of last updated timestamp from what? From this model, the silver model. For that there's another construct called this.

And now let's read the documentation. You will understand it better. What is it saying? It is saying that if is_incremental, and we add if because that's just an if condition, we will filter our data. On what? On the basis of the maximum date. And what will the maximum date be? This one. Okay, makes sense.

And maximum date means, just try to understand: let's say this is your source and this is your destination, and you want to incrementally load the data. Initially, when there is no data in the destination, what will the maximum of this date be? Obviously null. So we can substitute a date that is very early, and what will happen? All the data will go there. Make sense? Now once the data is there, max will return some value, and then it will load only the data which is greater than that date. This is a fundamental of data engineering, or basically of incremental load. That's why we say fundamentals are the backbone of these modern technologies, because in the modern technologies you will work directly with the fundamentals. Make sense?

Okay, so let's improve our code. First of all we say if, because that's just an if condition: if is_incremental. Then we say max of this column, and now I can wrap it in coalesce. If it is null, coalesce returns a very small date so that we read all the data. Make sense? Okay, very good. And this here is the this variable, which means "from the current model". Okay. Now let's say endif.
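Putting the pieces together, the incremental model described here might look roughly like this. It is a sketch under stated assumptions: the column names, the `source_bronze` source name, and `last_updated_timestamp` as the CDC column are transcribed from speech. `is_incremental()` and `{{ this }}` are standard dbt constructs.

```sql
{{ config(materialized='incremental') }}

{% set cols = [
    'trip_id', 'vehicle_id', 'customer_id', 'driver_id',
    'trip_start_time', 'trip_end_time',
    'distance_km', 'last_updated_timestamp'
] %}

select
{% for col in cols %}
    {{ col }}{% if not loop.last %},{% endif %}
{% endfor %}
from {{ source('source_bronze', 'trips') }}

{% if is_incremental() %}
-- Rendered only when the target table already exists and the run is not a
-- full refresh; {{ this }} points at the model's own table.
where last_updated_timestamp > (
    select coalesce(max(last_updated_timestamp), '1900-01-01')
    from {{ this }}
)
{% endif %}
```

On the very first run `is_incremental()` is false, so the where clause is left out and every row is loaded; the `coalesce` default covers the case where the target exists but is empty. Depending on the warehouse and the column type, the literal may need an explicit cast to a timestamp.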

Perfect. This is my model. Let's save it and try to preview it. Let's see what happens. We should see all the data, because obviously we do not have anything in the target yet and the date will be picked as 1900-01-01. Make sense? Very good. So that's your incremental load. Now we will be applying upsert as well. Wait, wait, wait.

So it is simply processing the data. Okay, perfect, we are able to read the data. Very good. Now, this is just an incremental data load, not an upsert. In order to apply upsert we need to define only one small thing: the key column. Let me show you if it is in the docs here. Defining a unique key. Yes, this is marked as optional, but we want to treat it as mandatory, because upsert is the backbone of these silver layers. So what is the unique key? The unique key is my trip ID. Make sense? Whatever column is unique in your data goes here. This is what makes your model able to handle upserts. Make sense? Basically it results in a merge command for your model. Okay.
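A sketch of the config with the key column added, assuming `trip_id` is the column name and that the adapter in use merges on `unique_key` (the narration describes the result as a merge command). The select is shortened to `select *` here to keep the focus on the config.

```sql
{{ config(
    materialized='incremental',
    unique_key='trip_id'
) }}

select *
from {{ source('source_bronze', 'trips') }}

{% if is_incremental() %}
where last_updated_timestamp > (
    select coalesce(max(last_updated_timestamp), '1900-01-01')
    from {{ this }}
)
{% endif %}
```

With `unique_key` set, incoming rows whose `trip_id` already exists in the target update the existing row instead of being appended as duplicates. If the adapter's default strategy is not merge, `incremental_strategy='merge'` can be added to the same `config()` call where the adapter supports it.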

Make sense? So that's how you work with these things. So now what will happen? Let me first of all save it and click Preview. You will say, "Ansh Lamba, this is just a select statement. We agree that we are using Jinja to make our code modular so that we can reuse it. Everything is fine. But where is our create table command?" Whenever we work with dbt, we do not need to worry about the create table command, because we just define the select statement, and whatever the output of that select statement is will go into a create table command. Literally everything. "Okay, that's cool." Yes, everything goes into the create table command automatically behind the scenes, and I will even show you the real command that runs behind the models.

How? Let's first of all run this thing. Click on this button, which is the CLI, the cloud CLI, which is in the cloud, not a local CLI. I'll simply say dbt run. What is dbt run? dbt run will run all your models, and in our case we have just one model. So simply say dbt run, and that's it. Now let's wait, and you will see it running the model that we have created. We have created only one model so far, trips, which is why you are seeing trips here. It will run everything, and it is done. Yes. Let's check whether it has created our trips data in the silver layer. Let's see. Let me refresh it.

And ideally it should not... what? Yes, because remember, we defined the schema as dbt_alamba, so my table, my model, will be here. See: trips. Okay, makes sense. The good thing is the trips table is there. I will even show you how you can customize your schema name as well. That's not a big deal. But first of all, let's see how this table was created. If you click on, I would say, Details, and scroll up, you will see all these things here. See, it has created a create or replace command automatically, and it is a Delta table. So Delta, and this is our command. That's how it runs your original code in the background.

I can even show you here as well. Whatever we run in dbt will be saved, I think, here. Let me show you in the target folder. See, this is grayed out because it is not actually deployed; it is for us, so that we can inspect it. If you open .gitignore, target will be mentioned there, because we do not want to deploy any target output. So click here on target, then on compiled, and inside compiled this is my project, then models, then silver, and then trips.sql. So this is the compiled code that dbt has produced for me. All the information is stored here. And if I look under target, then I have

compiled and run and everything manifest. Yeah, everything is there. You

manifest. Yeah, everything is there. You can just actually explore because it is

can just actually explore because it is totally grayed out. So yeah that's how

totally grayed out. So yeah that's how it can run your all the things and you

it can run your all the things and you know how it created your create table

know how it created your create table statement as well make sense okay very

Okay, very good. Now let's talk about this: hey Ansh Lamba, what if we want to create our objects in a dedicated schema? How can we do that? There is a way to override the schema behavior. For that, go to dbt_project.yml... oh man, this UI. Okay, so this is your dbt_project.yml, and you can define your schema here. Make sense?

Let's look at the documentation. Search for "custom schema dbt". So, custom schemas in dbt, and these are all the options we have. If you look at models, we know that we can even override the schema in the model's own config, which is in the code itself. But we do not want to do that. We want to configure the schema in our dbt_project.yml. See: "apply it to a subdirectory of models by specifying it in your dbt_project.yml". So we can do it in dbt_project.yml as well. Make sense?

Now let me see if there is a good example; otherwise I have to use my own. Okay: custom schemas, custom databases. Yeah, so it is not directly about that, but I can use this one and modify it with my own example. So this is, let's say, models. Within my models I can say +database. So, for example, if I go to my models, within my silver folder I can say something like +schema, and the schema name will be, let's say, silver. Make sense? The schema name will be silver.

So now if I deploy it, let's say I run the command one more time, you would expect this to deploy the object in the silver schema. Now, as per the rule, let's see what will happen. Basically, this will not deploy in our silver schema. Why? Let me show you. If I open this, it will create a new schema. What is the new schema name? It will be called dbt_alamba_silver. That means earlier it created dbt_alamba, which is the default schema, and this time it created another one, dbt_alamba_silver. So every time it creates a schema, it always uses dbt_alamba as a prefix.

But now 50% of our work is done, because we have added silver. The other 50% of the task is to remove dbt_alamba as a prefix. And in order to do that, I think the docs should tell you how. See, this is the macro that we can create. This is basically the macro, the function, that dbt uses to generate the database or schema name. If I go to custom schemas, I will see the schema macro. See, this is the one. This is the code that dbt uses by default. It is called generate_schema_name. dbt uses it, but I can override it. I don't need to write a macro from scratch; I will simply copy and paste this one into a new file. Let's say I call it generate custom... okay, create. YAML or SQL? I think it should be SQL. Yeah, it should be SQL.

The docs also say: to customize this macro, copy the example code from this section into a file named generate_schema_name.sql and make changes as necessary. And be careful: by default, dbt will ignore any custom generate_schema_name macros included in installed packages. So we have to create a file named exactly generate_schema_name, because dbt will read that file. So let's rename this file to generate_schema_name.sql. Okay, makes sense. But this was created outside the macros folder, so let me delete it from here and create it inside macros: create file, generate_schema_name.sql. Okay, perfect, the file is created. As you can see, it is written that you have to create generate_schema_name.sql inside macros.

So now I paste that code here. We know that this is the part which uses the default schema, so I can simply remove this part and save. Now if I say dbt run, let's see what happens. This is interesting. Perfect. Now let's check Databricks. Refresh it. Now let's check the silver schema. Silver. Perfect, it has the trips table now. So now it has created this table inside the dedicated schema, and it is amazing. Now we know how we can create the table using dbt, and from now on it will be created there automatically. Make sense?

Perfect. So all the things are now clear: how we can populate the incremental, upserted data with dbt. So now our silver layer is also ready. Now we are going to talk about how we can build slowly changing dimensions using dbt. This is challenging.

Okay, so now let's see how we can work with slowly changing dimensions using dbt, and it's called snapshots. It's called what? Snapshots. In dbt's language, slowly changing dimensions are called snapshots. Make sense? Sorted. They are really handy and very useful, and I will show you how you can create snapshots, one by one.

So now let's talk about slowly changing dimensions. Come here. Let's talk about slowly changing dimension type 2. If you are not aware of slowly changing dimensions, you should be. It's 2025, it's almost over, 2026 is coming, and you are still unaware of slowly changing dimensions? What are you doing, bro, sister, what are you doing? If you do not know about slowly changing dimensions: basically, our trips table is the fact table, in which we store all the transactions. A dimension is, you can say, the contextual data for those fact tables, whenever you want to provide some context.

Again, fundamentals are important. People are jumping directly onto the tools and technologies and forgetting about fundamentals and fundamental thinking. It's fine, no worries. By the way, if you want to brush up your fundamental knowledge, I have a video for you for free. Just go on YouTube and search "data engineering fundamentals". You will get everything. And since this is about data warehousing, I have a video for that too. Just go on YouTube and search "data warehousing masterclass Ansh Lamba". You will get everything for free. Just show your love on this channel. What are you doing, bro?

So basically, we are going to build slowly changing dimension type 2, which is the next level of slowly changing dimension. Slowly changing dimension type 1 is nothing but applying upserts, which are good, but these days we do not rely much on type 1, because type 1 does not retain history, whereas type 2 does. Amazing thing. Let me show you with an example. Don't worry, I'm here to explain everything.

So let's search "dbt snapshots". Don't worry, I'm not jumping directly into snapshots, but they have a very good example on their web page, and that's why I want to show you that. Okay. Feel my emotions. Yes, this is a very good example. You must have ordered so much stuff online. I know you are very, very rich and I know you love shopping. I know everything. So let's say there is an order with ID equal to 1. When you ordered the stuff, anything, shoes, clothes, whatever, the status would be pending, because at that time you had just placed the order. But after some hours, or the next day, or a few days later, the status of the same order ID will change and become shipped instead of pending. If you want to retain all of this history, like when the order was placed, when it was shipped, when it was delivered, and when it was returned, if you returned it, all of these things can be stored in the form of slowly changing dimension type 2. That's where the concept of snapshots comes into the picture.

Okay, so let's look at this example. See, earlier it was pending. And what are these two columns? These are two columns that will be created automatically, the from and to dates: dbt_valid_from and dbt_valid_to. Valid from means when this record was added. Valid to means until which date this record is valid: obviously, until the date when the data for that particular record changes. That means the moment this record is changed to, let's say, shipped, or basically any column is changed, this validity expires and a new validity starts from the 2nd of January 2024, and its valid to is null.

Why null? You have two options. What is the expiry date of, let's say, this mouse? I love my mouse, bro. If I say null, that means it will never expire. Or I can say the expiry date of this mouse will be in the year 9999, so 9999-99-09... a very big date from now. By the way, 9999 makes sense, but what is 99? We just have 12 months. Okay: 9999-12-01, let's say this one. In both scenarios I want to say that the expiry date of this mouse is very, very far away. Some organizations like to have null in their valid to column; some like to have a very big date. I personally like to keep a very big date, because I do not want to keep nulls in my column. I don't like nulls. Simple.

Okay, so that's all about snapshots, or slowly changing dimension type 2, from the conceptual perspective. Now I will let you know how we can implement it. The thing is, snapshots were recently updated in dbt. Earlier we used to write SQL statements, but now we write YAML if we want to create snapshots. And you know what, YAML is better, and you will feel why: it's very easy to configure.

So let's actually try one example. Let's take an example from here if there is a good one; otherwise I have to put in my own. I think this is a good one. See, it is also using this particular function. Makes sense. This is good, let's copy this one.

In order to create snapshots, basically slowly changing dimensions type 2, we cannot store them with our models in the gold layer or any other layer. No, there is a dedicated folder for that, called snapshots. The docs say: create a YAML file in your snapshots directory, and you can name it anything, let's say customers, and so on. Trips, whatever... trips is not a dimension, but you could have locations and all those things. And add your configuration details. You can also configure a snapshot from your dbt_project.yml file as well. Yes, you can. But I like to configure the snapshot within the snapshots folder. Why? Just tell me one thing. Come here. These are my models: silver, gold, everything. In Databricks... not Databricks, dbt. In dbt_project.yml we should define how we want to build our project in Databricks. But within the dedicated folders, we should say what to build and how to build it, right? Very good.

So let's go to the snapshots folder. It is already there. I love this folder structure created by dbt; we do not need to do anything. Let's say create file, and it will be called, let's say, customers.yml. Actually, let's not call it customers.yml, because in this one YAML file we will be creating all the snapshots. You can create individual YAML files as well. Yes, you can create multiple YAML files within the folder, but I find it better to go with a single YAML file. So I'll simply call it snapshots, or you can say SCD types, or basically SCDs. Okay, create. As you can see, we can create any file, multiple files as well, that's fine. But I want to create only one file with multiple snapshots. Let's paste the code here. This is the code that we have pasted, and we need to rename the file with a .yml extension, otherwise dbt will not pick it up. Yeah, perfect.

So this is our snapshots block. Make sense? And this is the name. Name means this is our first snapshot, and we want to name the snapshot customers, or basically dim_customers. It will make more sense: dim_customers. Okay, so what is relation? Relation means source. What is the source of this table? Obviously, we have a source, and it's called source silver. Make sense? And what is the table name? It's called, I think, customers. Makes sense. Now schema: what will the schema name be? We want to keep the schema name as gold. Database is pyspark_dbt.

unique_key: this is very important. What will be the key column for our customers dimension? It will be customer_id. If you have multiple key columns, you can add them as a comma-separated list.

Strategy is very important: timestamp. Basically, there are two strategies. Always go with timestamp if you can. If not, obviously you have to pick the other strategy (check), which I'm not a big fan of.

Timestamp strategy means... let me explain with an example, because with an example you will understand better. Let's say this is your source data and this is your target data. You moved some data from here to here, let's say order ID 1 with status shipped, so it is now in the target as well. Makes sense. The next day you get order ID 1 again, and this time it is delivered. The row with shipped is already in the target table. Make sense? Okay, very good.

Now, how would the dbt engine, or basically any engine, know that delivered comes after shipped? You will say it's common sense, right? Bro, dbt will not go on Myntra and do the shopping, right? It doesn't know that delivered comes after shipped. So how does it know? It knows based on a timestamp column: hey, shipped happened on this date, delivered happened on this date, so we need to keep delivered after shipped. We need a column that decides which value was updated after which value. That is why we specify a column here under updated_at: last update time... the last updated timestamp, I guess. Last updated timestamp, perfect.

And dbt_valid_to_current: we have two options. We can either keep it null, or we can keep it like this, 9999-12-31, and you can celebrate your New Year. I personally prefer keeping this value instead of nulls. So this is my first snapshot that I can create.

Yes, with just this configuration you can literally create a slowly changing dimension type 2. And if you have created a slowly changing dimension type 2 using PySpark code, you know how long it takes and how much effort it takes. You need to apply the upsert command two times: one time you update and flag the old records as expired, then you update the records, then you apply the filters. All of that is gone with this particular strategy called snapshots.
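To make the "expire, then insert" idea concrete, here is a minimal pure-Python sketch of what a timestamp-based type 2 history does on each run. It is a simplified simulation under stated assumptions, not dbt's actual implementation: the function name `apply_snapshot`, the columns `customer_id` and `city`, and the `valid_from` / `valid_to` names (standing in for the snapshot's validity columns) are all hypothetical.

```python
from datetime import datetime

FAR_FUTURE = datetime(9999, 12, 31)  # sentinel for "current" rows, used instead of NULL


def apply_snapshot(history, source_rows, unique_key, updated_at):
    """Return a new history list after one simplified snapshot run.

    history: list of dicts that carry valid_from / valid_to columns
    source_rows: current state of the source table (one row per key)
    """
    result = [dict(row) for row in history]  # copy so the input is not mutated
    current = {r[unique_key]: r for r in result if r["valid_to"] == FAR_FUTURE}

    for src in source_rows:
        key, ts = src[unique_key], src[updated_at]
        existing = current.get(key)
        if existing is not None and ts <= existing[updated_at]:
            continue  # nothing newer for this key, so a re-run adds no rows
        if existing is not None:
            existing["valid_to"] = ts  # expire the previous version
        result.append({**src, "valid_from": ts, "valid_to": FAR_FUTURE})
    return result


# Hypothetical data: the same customer changes city between two runs.
run1 = apply_snapshot(
    [],
    [{"customer_id": 1, "city": "Toronto", "last_updated_timestamp": datetime(2025, 1, 1)}],
    unique_key="customer_id",
    updated_at="last_updated_timestamp",
)
run2 = apply_snapshot(
    run1,
    [{"customer_id": 1, "city": "Ottawa", "last_updated_timestamp": datetime(2025, 2, 1)}],
    unique_key="customer_id",
    updated_at="last_updated_timestamp",
)
```

After the second run the history holds two rows for the customer: the Toronto row closed at the change timestamp, and the Ottawa row open until 9999-12-31. Running it again with unchanged source data adds nothing.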

Make sense? Very good. So let's see if we have to do anything else. Almost all the things are fine. The documentation says: since snapshots focus on configuration, the transformation logic is minimal. Typically you would select data from the source. If you need to apply transformations like filters or deduplication... Very good, this is a very good point. Glad I looked at this documentation. So basically, sometimes you need to apply a deduplication step as well. We have already applied the deduplication step in our silver layer, but sometimes people will not be applying deduplication in the silver layer. In that scenario, they can add a kind of transient layer which will act as the source for their snapshots. Okay, makes sense.

Now what is this ephemeral? Because we have seen incremental, table and view. Ephemeral is basically a kind of query, you can think of it like a temporary view, which will not be materialized: dbt inlines it as a common table expression into whatever selects from it. It will not be created in your database; it will reside in dbt. That's it. It will only run this query as a source, but it will not materialize it. That's the power of ephemeral. We do not need to use it because we are already applying the deduplication step in our silver layer. But if you do not have one, you can create a step for deduplication and just treat it as your source.
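The deduplication step mentioned above usually means "keep the latest row per key". Here is a minimal pure-Python sketch of that rule, written only to illustrate the logic such a transient layer would apply before the snapshot reads it. The function name `deduplicate_latest` and the sample columns are assumptions, not code from the project.

```python
from datetime import datetime


def deduplicate_latest(rows, key, order_by):
    """Keep one row per key: the one with the greatest order_by value."""
    latest = {}
    for row in rows:
        kept = latest.get(row[key])
        if kept is None or row[order_by] > kept[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical staged data with a duplicate for customer 1.
staged = [
    {"customer_id": 1, "email": "old@example.com", "last_updated_timestamp": datetime(2025, 1, 1)},
    {"customer_id": 1, "email": "new@example.com", "last_updated_timestamp": datetime(2025, 3, 1)},
    {"customer_id": 2, "email": "b@example.com", "last_updated_timestamp": datetime(2025, 2, 1)},
]
clean = deduplicate_latest(staged, key="customer_id", order_by="last_updated_timestamp")
```

A snapshot expects one row per unique key in its input, which is why this step has to happen somewhere upstream, either in the silver layer or in an intermediate step like this one.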

Simple. So that's how you can do that. Okay, make sense? So now let's try to run this particular thing and build it. And you cannot create this snapshot using dbt run. You have to run dbt snapshot. Let's hit enter and see if it creates the snapshot for us. And we'll be able to create this object in the gold schema without any kind of default prefix, because it will first pick up our macros folder, and in that it will run our SQL file to remove that prefix from the schema name. Simple. Okay. So it is running dim customers for now. And it failed. Very good. Why?

Right. It says a configuration does not apply to any resources, there is one unused configuration. Okay, what is the error? Okay, start snapshot, select... okay. Unresolved column, with suggestion. Oh, okay. It's called last updated timestamp. I think I misspelled something: last updated, not last update. Okay, it happens. dbt snapshot. Let's run it one more time. Let's see if it runs fine. Yeah, perfect. Are you excited to see the result? Let me just check the list that I created before starting this project so that I do not miss anything, because we are almost close to completing this project. Okay.

Hm. Okay. Makes sense. Let's check our dimension. If we go to Databricks and go to the gold layer... oh, we do not have a gold schema, right? Yeah, but dbt will create it for us. Oh wow, man. Thanks, dbt. So we have dim customers. Very good. And if we see the data, it's absolutely perfect. So now what I will do: I will simply create more dimensions like this. I simply need to copy and paste this code multiple times. So this will be my dim locations, the source will be locations, and the key is location ID. Everything else is fine. Makes sense. And then, perfect, this is also fine. Then what do we have? Payments, vehicles. Okay. Payment ID, dim payments. Perfect. Okay: one, two, three, four. One more. Perfect. Let's try to create all these snapshots.

I'll simply run dbt snapshot, and it will run all the dimensions together, and it will run our dim customers one more time. Why? Because that is also a dimension. Why is it running four? One, two, three, four. Oh man, I think I forgot to rename it. Cancel, cancel. Oh bro, why did you cancel? The fifth one was also added. Okay, it's fine. Let's run dbt snapshot. So it will run the snapshot for dim customers as well. And this time it will not simply add more records, because it is like incremental processing and it knows how to process it internally, because this is slowly changing dimension type 2. So this is fine. Okay. So now it is running the vehicles as well. All the dimensions are there and I'm so, so happy.

And let's refresh it and see all the dimensions ready. Perfect. If we see dim customers, it's fine, it's there. See, everything is there, bro. So we have all the things, dim customers, dim drivers, dim locations, dim payments and dim vehicles, in the gold schema. Why? Because we manually changed the setting to edit the prefix for the schema name. Simple. And I know that you also got some hint about the ephemeral materialization and how you can use it. Make sense? So now all our dimensions are ready.
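Why the dimensions land in a schema called exactly gold, rather than one with a prefix, comes down to the schema-name macro mentioned earlier. By default dbt builds the schema as the target schema plus an underscore plus the custom schema; a project can override that macro to return the custom schema alone. The sketch below mimics both behaviours in plain Python. The function names and the target schema value "default" are assumptions for illustration, and the real logic lives in a Jinja macro, not in Python.

```python
def default_schema_name(target_schema, custom_schema=None):
    """Mimic dbt's default rule: <target_schema>_<custom_schema>."""
    if custom_schema is None:
        return target_schema
    return f"{target_schema}_{custom_schema.strip()}"


def override_schema_name(target_schema, custom_schema=None):
    """Mimic an override that drops the prefix and uses the custom schema as-is."""
    if custom_schema is None:
        return target_schema
    return custom_schema.strip()


# Hypothetical target schema "default", custom schema "gold".
with_prefix = default_schema_name("default", "gold")
without_prefix = override_schema_name("default", "gold")
```

With the default rule the snapshots would be created in default_gold; with the override they are created in gold, which matches what shows up in the catalog here.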

So how can we work with the fact table? Hmm, that's interesting, because all our tables are ready, right? And in our gold layer, if I open models, our gold folder is empty because we'll be creating our fact table within this. Make sense? So in order to create this thing, we need to apply joins based on the IDs. Yes, obviously. And let me just check. Yes. So we have a source called silver.trips. Make sense? Because obviously we do not need to create a dimension for trips. Trips is a fact table. But that fact table should only have the IDs, and it has all the IDs, right? I think so. If I close this one, this one, this one, and if I go to gold... okay, I think we have IDs. Okay, makes sense.

I think, yes, we can create a snapshot for our fact table as well. Why? This is an optional step. Ideally we should not, but I just want to move my trips into the gold layer as well, just to make everything available in the gold layer. Why? Because we as data engineers do not just give access to the silver layer. We sometimes do, but we do not always give access to the silver layer to our data analysts, business analysts, report builders or anyone else. We simply provide access to the gold layer only. That's why I'm moving my stuff to the gold layer. Usually we create views, aggregated views, in the gold layer, but that's fine. Let's create our snapshot for the fact table as well. And let me just show you. Let's create a new YAML file.

Let's create a new file. I'll simply say fact table. Okay, fact. Make sense? And let me name it fact trips. Okay. And what will be the source of this? Obviously this is not a source. This is something else. What is that? So basically, when we defined this one, we simply used source. But here we cannot use source. Why? Because a source is an object that is already in the database and that we just wanted to bring into dbt. But for trips we didn't bring in the silver object as a source; we only brought in the bronze object. So if we want to use silver trips here, we have to use it as a real dbt object instead of a source. Really? Yes, really. So how can you do that? Basically, we need to replace it with the ref function. What is the ref function? Ref is the abbreviation for reference, where you simply refer to an object that you have within dbt. Let me just tell you. So this is the silver layer, right, trips.sql. So what will be the model name? It is the same name that you keep for the file: for trips.sql, it will be trips. So I can simply say the relation is ref, and the reference is trips. Oh, so that means I'm trying to query this particular trips model. That's it. And I can even show you if you want. Let's say I go to my analysis scratchpad. If I try to query this table, I can simply say ref and then trips, and I will see this particular table. So that's how we refer to the previous models.
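As a rough picture of what ref does, the sketch below takes a query containing a ref call, swaps it for the relation that the named model builds, and lists which models the query depends on. This is only a toy illustration in plain Python, not how dbt is implemented. The `MODELS` registry, the relation name `silver.trips` and both function names are hypothetical.

```python
import re

# Hypothetical registry: model name (file name without .sql) -> relation it builds.
MODELS = {"trips": "silver.trips"}

# Matches {{ ref('name') }} or {{ ref("name") }} with optional spaces.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")


def render_refs(sql, models):
    """Replace each ref call with the relation registered for that model."""
    def substitute(match):
        name = match.group(1)
        if name not in models:
            raise KeyError(f"unknown model: {name}")
        return models[name]
    return REF_PATTERN.sub(substitute, sql)


def referenced_models(sql):
    """List the model names a query depends on (its upstream lineage)."""
    return REF_PATTERN.findall(sql)


query = "select * from {{ ref('trips') }}"
compiled = render_refs(query, MODELS)
upstream = referenced_models(query)
```

Because the dependency is declared by model name rather than by a hard-coded table, the tool can work out both the final relation and the order in which objects have to be built.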

Make sense? So see, I can see this thing, and this is the lineage. This is my trips, which is a silver object, and this is my source. Make sense? Good. So I will simply use it here: ref, and then the schema is gold. The key is, I think, trip ID. Trip ID and timestamp; the last bit is fine. Everything is fine. So let me save and run dbt snapshot one more time, and boom, it will simply create all the five dimensions and one fact table automatically. Slowly changing dimension type 2. Can you imagine? Okay, facts is there. Okay, perfect. All the objects are very neat and clean. Very beautiful. Okay, so now if I just refresh it: in gold I also got facts. Very good. I have literally everything now, everything that I want. Did you like dbt? Just be honest. I love dbt because it makes your code modular. And I love modular code.

I hope you learned a lot. And now let's commit everything, and let's call it "development completed". Bro, commit changes. We are simply committing all the changes. Then we will merge this into our main branch. See: merge this branch to main. Let's merge it, so that all our development will be merged back to main. See, now this is the main branch and all the objects are here.

Very good. Now there's a small homework for you. What is the homework? Obviously we have completed our project. We have completed our dimensional data model. We have learned so many new things: Python classes, modularity, slowly changing dimensions type 2, incremental ingestion, Spark streaming, dynamic code, everything. First of all, one request: just drop a lovely comment in the comment section. Second, the small homework: you need to create at least three business views in the gold layer using multiple KPIs. Let's say you want to calculate the number of trips in some region, or maybe based on something else, I don't know, just do some analysis. Just build three views, called business views, in this gold layer. And you already know how to do that. You know how to use Jinja, you know how to use loops, you know how to use if conditions, you know everything.
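As a starting point for the homework, here is a small pure-Python sketch of one possible KPI: trips and total fare per region, computed by joining fact rows to a location dimension on its ID. In the project this would be a view in the gold layer; the Python version only shows the join and aggregation logic. The column names `location_id`, `region` and `fare`, and the sample rows, are assumptions and may not match the real tables.

```python
from collections import defaultdict


def trips_per_region(fact_trips, dim_locations):
    """Count trips and sum fares per region by joining on location_id."""
    region_by_location = {loc["location_id"]: loc["region"] for loc in dim_locations}
    summary = defaultdict(lambda: {"trip_count": 0, "total_fare": 0.0})
    for trip in fact_trips:
        # Trips whose location is missing from the dimension go to "unknown".
        region = region_by_location.get(trip["location_id"], "unknown")
        summary[region]["trip_count"] += 1
        summary[region]["total_fare"] += trip["fare"]
    return dict(summary)


# Hypothetical sample rows.
dim_locations = [
    {"location_id": 10, "region": "North"},
    {"location_id": 20, "region": "South"},
]
fact_trips = [
    {"trip_id": 1, "location_id": 10, "fare": 10.0},
    {"trip_id": 2, "location_id": 10, "fare": 20.0},
    {"trip_id": 3, "location_id": 20, "fare": 5.0},
]
kpi = trips_per_region(fact_trips, dim_locations)
```

The same pattern, a join from the fact table to one dimension followed by a group-by, covers most simple business views, for example trips per payment method or per vehicle type.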

Just do it for me. Okay, very well done. And now the last part, but not the least, because once you develop everything, if you do not share your data warehouse with your downstream developers, what is the point of developing these things? So let's say everything is built, and if I go to Catalog, my catalog is looking very beautiful, very perfect. Okay. Now this catalog needs to be shared with the data analysts, BI analysts, or basically anyone who wants to build reports. So how can you do that? You already know, by the way. You can simply go to Compute, then go to, let's say, this SQL starter warehouse, then Connection details. Here are the connection details that you can share. If they want JDBC, you can give it to them. If they want the server hostname and HTTP path, you can give those to them too. It's up to them what they want to do.

Makes sense. So this is your end-to-end project, and don't worry, I will upload all the notebooks. You would have already downloaded them by now. So that was all about your end-to-end data engineering project using open-source frameworks such as Apache Spark, basically PySpark, and dbt. How was it? Do let me know if you loved this video. Do let me know if you want me to continue creating these videos and projects, and just show your love and support in the comment section. Share this video with others, and just click on the video coming up on the screen right now, because I will see you there. Bye-bye.
