SQL Full Course for Beginners (30 Hours) – From Zero to Hero | Data with Baraa
Video Transcript
Hello and welcome to this unique course
to master SQL. My name is Baraa Salkini and
I lead big data projects at
Mercedes-Benz, with over a decade of
experience in SQL, data engineering,
building data warehouses and data
analytics. Now, of course, the first
question is what makes this course so
special. Well, not only will you learn
how to write SQL code, but more
important than that, you will learn how
exactly SQL works behind the scenes. So
I'm going to break down complex concepts in
SQL using hundreds of animated visuals.
This makes it much easier to
understand SQL, and it is also more
fun than me just sharing my screen and
showing you code, right? The second
reason is this course is taught by me. I
have industry experience and I will be
sharing with you everything that I know
about SQL and how I use it in my real
projects. So I will be sharing with you
hundreds of best practices, tips and
tricks and I'm going to show you my
decision-making process in SQL. So by the end
of this course, you will be ready to
solve any complex task like I do using
SQL. So now I designed this course to
cover the basics like writing your first
SQL query and then we're going to keep
progressing in the course by covering
advanced techniques in SQL like the
window functions, stored procedures,
indexes and even at the end we're going
to build a data warehouse using SQL. And
this course is suitable for anyone: data
engineers, data analysts, data scientists
and even students. And by the way,
the good news: everything is free
from start to end. I will be
sharing with you as well a lot of
materials, code, presentations and
animations, and there are no hidden
costs. So you don't have to pay for
anything. But my friends, in return I
would really appreciate it if you support the
channel so it can grow. All right, my
friends, I'm really excited about it. I
don't know about you. If you are
motivated, join me in learning SQL. So let's
go. All right. Now I'm going to show you
the road map in order to learn
everything about SQL starting from very
basics and then advance step by step
until we have very advanced topics. So
now at the start we have to understand
a few things, like what SQL is, why to learn
it, what databases are and the types of
databases. After the theory we're
going to prepare your PC with the data and
the software. Now once we have
everything then we can go to the next
chapter. These are the basics: how to query
data using SQL. Here we're going to
cover the basic components of each SQL
query, like SELECT, FROM and WHERE, those
basics. Now once you understand how to
query the data, how to get the data out
of the database the next step we're
going to go and learn how to define the
structure of the database: how to create
a new table, add a new column, remove a
column and also how to drop a table.
So with that you are defining new stuff
in the database and then the next
chapter you have to learn about the data
manipulation. This time we're going to
go inside the table and we're going to
learn how to insert new data, how to
update the data and also how to delete a few
rows from our database. So with that you
have the basics how to query data, how
to define the structure of your tables
and how to manipulate your data. And I
can say with that you cover the basics
about SQL. Now after that we start with
the intermediate phase where we're going
to deep dive into topics like how to
filter your data. Here we're going to
learn about the comparison operators,
logical operators, BETWEEN and LIKE, so
all the operators that you can use
to build a condition
to filter your data. After that comes
a very interesting topic: you
have to learn how to combine data. And
here we have two mechanisms, either using
joins or using the set operators. And
oh my god, joining data is going to be
a very interesting topic. Here we're going
to cover a lot: we're
going to start with the basic joins,
then go to the advanced ones, and then you have
to learn how to choose the right join.
After that you have to learn about
the set operators, and here you have
four methods: UNION, UNION ALL, EXCEPT and
INTERSECT. So with that you learn how
to combine multiple tables by combining
the columns or the rows of your tables.
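
To make the difference between combining columns (joins) and combining rows (set operators) concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the tables and rows are made up for illustration, not taken from the course data.

```python
import sqlite3

# In-memory SQLite database as a stand-in; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT);
    INSERT INTO customers VALUES (1, 'Maria'), (2, 'John'), (3, 'Peter');
    INSERT INTO orders VALUES (1001, 1, 35), (1002, 2, 15);
    INSERT INTO employees VALUES (1, 'John'), (2, 'Anna');
""")

# JOIN: combine the COLUMNS of two tables by matching rows on a key.
joined = conn.execute("""
    SELECT c.first_name, o.sales
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()


def names(operator):
    """Combine the ROWS of two queries with the given set operator."""
    sql = (
        "SELECT first_name FROM customers "
        f"{operator} "
        "SELECT first_name FROM employees"
    )
    return sorted(row[0] for row in conn.execute(sql))


union = names("UNION")          # distinct names from both tables
union_all = names("UNION ALL")  # all names, duplicates kept
except_ = names("EXCEPT")       # customers who are not employees
intersect = names("INTERSECT")  # names that appear in both tables

print(joined)
print(union, union_all, except_, intersect)
```

The join result has columns from both tables side by side, while each set operator returns a single column whose rows come from both queries.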
So this is very important. Now moving on
in our course. Using SQL you can do
a lot: cleaning up the data,
data preparation, and at the end
a lot of analytics and
aggregations. So there are two
families of functions. The first one is
the row-level functions, and here we
have a lot: you can transform
your string values, numbers, dates and
times, handle the nulls in SQL,
and at the end use the amazing CASE
statement. All of those are
transformations of one single
value. We call them row-level functions.
And after you learn how to do data
transformations, then you have to learn
about how to do data analytics and
aggregations using SQL functions. So
we're going to start with very basics
like the aggregate functions. And then
we're going to deep dive into the window
functions, analytical functions. And
here we have like aggregates, ranking
and value functions. Those are very
important tools for any data analyst or
data scientist doing analytics tasks in
SQL. So I can say the row-level functions are
for data engineers and the analytical
functions are for data analysts. So by
chapter 8 we can say you have
covered the intermediate level, and
the last four chapters will be the
advanced stuff in SQL. So here there are
a lot of techniques that you have to
learn about SQL. The first one is the
subquery, a query inside another query, and
the very famous CTE, the common table
expression. A lot of developers like
this one. Then you will learn
how to create views in the database.
If you learn this technique you're
going to be really professional in SQL.
Then we're going to learn how to create
tables using SELECT, and temporary tables,
and then we're going to learn about
stored procedures, how to write a program
in SQL, and after that of course come
the triggers. So those are the advanced
techniques that you have to learn in SQL
in order to do advanced projects using
SQL. So once you learn all those
concepts and start writing a lot of
SQL code, you will notice that some
queries are going to be really slow, and for
that you have to learn how to optimize
the performance of your queries and here
there are a lot of techniques. The most
famous one is to create an index in the
database or create a partition and at
the end I will be sharing with you the
top 10 best practices that I have
learned in my projects on how to
optimize the performance of your
queries. So this is very important and
then we're going to move to a very
interesting one. I will be sharing with
you how I use AI like ChatGPT or Copilot
when I'm using SQL in my projects. So here
you have to learn how to write correct
prompts to get assistance from AI as you
are using SQL. And finally, my
favorite one: it will be about SQL
projects. So my friends, here you have to
bring together everything that you have learned
about SQL in hands-on projects. With real
projects you will get challenges and
struggle, and this is where the
magic and the real learning happen. Here
there are three types of projects. The
first one is a data warehousing project.
This is a very data-engineering-focused
project where you're going to learn how
to build a real data warehouse, where
you're going to take the data from the
raw format and then process it in
different layers. Once you build it, then
you jump to another project. Here you're
going to start exploring the data and
start getting the first insights about
the business. And the last project that
you can do is the advanced data
analytics project. So this is a very
important section, where you do SQL
projects. So my friends this is the road
map on how to learn SQL. So as you can
see it takes you step by step from
basics to intermediate and you will end
up having advanced topics and with that
I can tell you you will learn everything
about SQL. Okay. So now let's start with
the first chapter, the introduction to
SQL, and here we're going to cover a few
topics. We have to understand first
what exactly SQL is, why we have to
learn it, what databases are and the
different commands that we have in
SQL. So it is the basics, the theory. Let's
go. So what exactly is SQL? Everything
generates data and data is everywhere.
Your first name is data. Your mobile and
everything inside the mobile is data.
Your car also generates a lot of data.
Bank, your finance statements,
everything is data. And now of course
the question is where do we store our
data? Personally we store a lot of our
data in things like Excel spreadsheets or
text files. So you store a lot of your
data in different files. Now how about
companies? They have a lot of things
that generate a lot of data: the
products that they produce, their
customers also generating a lot of
data, sales information and a lot of
other things. So companies generate massive
amounts of data. So now the big question
is how they handle the data, how they
store it. Of course, they cannot go
and use simple files. They need
something bigger, stronger and smarter.
And here is where the database comes in. So
think about the database like a
container for storing data. But instead
of just dumping files into folders, the
database organizes the data so it is
easy to access, to manage and to search.
So a database is simply a container
that stores data. Now you might ask
why we use a database. Can't we just
use files like I do personally? Well,
let me tell you why we use databases.
Imagine that someone asks the following
question. Go and find the total spending
in your data. So now, in order for Mike
to find the total spending and the
costs, he will be opening each of those
files one by one, searching for the
costs trying to combine the data and
it's going to be a very long and messy
process. But on the other side, if
your data is in a database and you want to
ask a question, it's going to be very
easy. All you have to do is
talk to the database, ask a question,
and the database can answer your
question with a result. And now comes of
course the question how do we talk to a
database? Well we use SQL. SQL is the
language that you use in order to talk
to the database. It stands for
structured query language SQL. And here
you have people that call it SQL like me
and others that call it SQL. There is no
right and wrong but if you follow me
through the course I think you will
start saying SQL. So by using SQL you
can ask the database, you can ask your
data, and the database is going to answer
your question by sending you a result.
So this process is very easy, simple and
fast and this is way better than having
your data stored in different files.
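
As a rough sketch of what "asking the database a question" looks like for the total spending example, here is a small Python snippet. It assumes Python's built-in sqlite3 module as a stand-in for a real database server, and the `spending` table and its rows are hypothetical.

```python
import sqlite3

# A throwaway in-memory database; the table and its rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (item TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO spending (item, cost) VALUES (?, ?)",
    [("rent", 900.0), ("groceries", 250.5), ("phone", 30.0)],
)

# The question "what is the total spending?" is a single SQL statement.
total_spending = conn.execute("SELECT SUM(cost) FROM spending").fetchone()[0]
print(total_spending)
```

One statement replaces opening every file and adding the numbers up by hand.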
Another reason why we use databases is
that they can handle really huge amounts
of data. Sometimes we have
millions of rows inside our database. But
on the other side, if you are storing
your data inside spreadsheets and you
have a massive amount of data, what
can happen is that your spreadsheets
just break. They simply can't handle big
data. And another reason why we use
databases is that they are just secure. It
is safer to store important and critical
data inside the database than just
storing it in spreadsheets and files. So
the databases are secure and you can
control who is accessing what. So it is
just more professional to store the data
inside a database. All right, my friends,
so far what we have learned: most
companies store their data inside a
container called a database, and for you
to ask questions and talk to
your database you have to speak
SQL. Now I'm going to show you how it
usually looks in companies. So we
have our data inside the database and
then you will have multiple people with
multiple roles that are just writing
different SQLs in order to talk to the
data. But not only employees and
people interact with the database. You
could build a website or an application
that also interacts with the database
by sending different SQL statements. And of
course, depending on how many people are
interacting with the application and the
website, it might generate a really
massive amount of SQL that is sent to the
database. And not only that, you might
also have tools for data
visualization, where you have
dashboards or reports, maybe created using
Power BI or Tableau, used by
stakeholders and managers in order to
make decisions. Those tools
will also be connected to the database and
creating SQL. So as you can see we
have a lot of interactions with the
database: from people, applications and tools,
a lot of things are generating SQL and
interacting with the database. But the
database is just a container, a storage,
right? So we need something, a software,
that manages all those requests, and
that's why we have something called a
database management system, DBMS. It is
software that is going to manage all
those different requests to our database,
and it is going to prioritize which
SQL must be executed first. This
software can also manage the security,
whether the SQL is allowed to be
executed in the first place. So my
friends, the DBMS is the software that
is going to manage the database. And now we
are not done yet. There is something
missing. So we have our data, we have
the software. What is missing here is
the hardware. In real companies, we
cannot run that on our PC, because first
our PC is weak and also it goes
offline. That's why we need a server.
A server is like a very powerful PC that
runs 24/7, so it is always
available. And here we can decide whether
we're going to have a server inside the
company or use cloud services
to run our database. So my friends,
so far what we have learned: the database
is a container to store the data, SQL
is the language to talk
to the database, the DBMS is the
manager, it manages the database, and the
server is the physical machine where
the database lives. So this is how it looks.
And now my friends, there are
different types of databases. So let's
see what do we have. The first and the
most famous one it is the relational
database. It is very simple. It is like
spreadsheets, which we call tables, where we
have columns and rows, and then there is
a relationship between those tables
to describe how they relate to each
other and that's why we call it
relational database. So if people hear a
database they're going to think about
this one. Now we have another type of
databases called key-value. This time
the data is organized completely
differently, where you have pairs of keys
and values. Think about it like a
big dictionary where you have a word,
that is the key, and the definition of the
word, that is the value. And now moving
on to the next one. This is also
important: column-based. Instead
of grouping the data by rows, this
type of database groups the data by
columns. That's why it's called
column-based. And this is a very advanced
database for handling huge amounts
of data where the main purpose is to
search for data. Moving on to another
database called graph database. The main
focus here is the relationship between
objects. So the main idea here is how to
connect my data points. And now finally
we have the document database. The data
is stored as entire documents where the
structure of the data is not that
important. What is more important is to
fit everything in one page in one
document. And now if you look at those
five types, we can group document,
graph, column-based and key-value:
those are called NoSQL databases,
and the relational database is the SQL
database. In this course, we will be
focusing of course on the relational
database. I'm sure you have heard
about Microsoft SQL Server,
MySQL and
PostgreSQL. They are SQL relational
databases. For key-value you have
Redis and Amazon DynamoDB, and
for column-based we have
Cassandra and Redshift. For the
graph database we have Neo4j, and
the very famous MongoDB is
a document database. Now my friends, for
this course we're going to be focusing
on the SQL relational databases because
it is the most famous one and the most
used one in companies and I will be
focusing on the Microsoft SQL server. So
those are the different types of databases.
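
The five shapes described above can be pictured with plain Python data structures. This is only an analogy under made-up data: real database engines store and index things very differently, and every name and value below is hypothetical.

```python
# Relational: rows in a table, every row has the same columns.
customers_table = [
    {"id": 1, "first_name": "Maria"},
    {"id": 2, "first_name": "John"},
]

# Key-value: like a dictionary, you look a value up by its key.
key_value_store = {"customer:1": "Maria", "customer:2": "John"}

# Column-based: the same data grouped by column instead of by row.
customers_by_column = {"id": [1, 2], "first_name": ["Maria", "John"]}

# Graph: the focus is on relationships (edges) between data points.
graph_edges = [("Maria", "KNOWS", "John")]

# Document: one self-contained document per entity, structure may vary.
customer_document = {
    "id": 1,
    "first_name": "Maria",
    "orders": [{"order_id": 1001, "sales": 35}],
}

# Same question, "what is the first name of customer 2?", in three shapes.
from_rows = next(row["first_name"] for row in customers_table if row["id"] == 2)
from_key = key_value_store["customer:2"]
from_columns = customers_by_column["first_name"][customers_by_column["id"].index(2)]
print(from_rows, from_key, from_columns)
```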
Now databases are very structured
and organized. They have the following
hierarchy. The starting point is the
server. As we learned, it is a powerful PC
and it is where the database lives, and
inside it we can have multiple
databases. So maybe you have a database
for the sales and another one for the
HR. So the server can host multiple
databases and as we learned a database
is a container of your data. Now moving
on to the next level. In each database
we can have multiple schemas. A schema
is like a category, or you can call it a
logical container that we can use
to group related objects.
Let's say you have a hundred tables.
You can put all the tables that have to
do with orders in one schema and
another group of tables in the
schema customers, and so on. So it helps
you organize your tables and your
objects in the database. And now if you
go inside a schema you can have multiple
objects, like tables. So now of course
the question is what is a table? It is
like a spreadsheet. It organizes your data
into columns. The column defines the data
that you store inside it. So you have
one column for the customer ID,
other columns for the names, the
scores, the birthday. So each column is
about one type of data, and sometimes we
call the columns fields. Now the
other thing that we have in tables is
the rows or sometimes we call it
records. It is where actually the data
is stored. In this example each
record represents one customer, one
person. So we have one record each for Maria,
John and Peter. Those we call them rows.
Now in each table there is one very
important column called the primary key.
It is always very important to have
one unique identifier for each customer,
for each row, and we use it for different
purposes: to combine the table with
another table, or to identify
one customer quickly. So it is unique.
It's like a fingerprint: no
two customers have the same ID. Now
where a column and
a row overlap we have a single value, a cell,
and each column stores a
specific data type. A data type
describes what kind of data we are storing,
like an integer, 1, 2, 30, or a decimal,
where you have a decimal point, 3.14. Now
if you want to store characters we have
different data types for that, for example if you
want to store a name or a
description. Here we can use char
or varchar. So you store inside them
something like the first name Maria.
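
The table anatomy described here (columns with data types, rows, and a primary key) can be sketched as follows. This assumes Python's built-in sqlite3 module as a stand-in for SQL Server; the scores and birth dates are made up, and SQLite accepts type names such as VARCHAR(50) without enforcing the length, which a server like SQL Server would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column has a name and a data type; id is the primary key.
conn.execute("""
    CREATE TABLE customers (
        id         INT PRIMARY KEY,
        first_name VARCHAR(50),
        score      DECIMAL(5, 2),
        birth_date DATE
    )
""")

# Each row (record) is one customer. Scores and dates are hypothetical.
rows = [
    (1, "Maria", 350.0, "1990-05-14"),
    (2, "John", 900.0, "1985-11-02"),
    (3, "Peter", 0.0, "1992-01-30"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# The primary key must be unique: a second row with id 1 is rejected.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Duplicate', 0, '2000-01-01')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# PRAGMA table_info lists the columns; index 1 of each result row is the name.
column_names = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(column_names, row_count, duplicate_rejected)
```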
Now you might ask what char and
varchar are. Char always has a fixed length.
If you define it as five characters,
it always reserves five
characters of space. But if you
want things more dynamic then you go
with varchar. And now moving on, we
have other data types called date
and time. If you want to store a date
like a birth date, you use date, and if you want to
store time information you can use
the time data type. So int, decimal,
char, date and time
are all data types. So my friends, as
you can see, SQL databases are very
structured. Okay. So now let's focus
more on SQL itself. We have in
SQL different types of commands. Let's
say that we have a database and this
database is empty. So we have nothing
inside it. Now, of course, the first
thing that you have to do is to write
SQL with the CREATE command in order to
create a brand new table in the database.
Once you execute it, the database is
going to go and build one, but this table is
empty. So we have nothing inside it. So
now what you have done here is you have
defined something new, right? And we
call this type of command the data
definition language, DDL. We have
CREATE to create something new, ALTER
to edit something that already
exists and DROP to delete
something, to drop for example a table.
So this is the first family of commands.
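
A minimal sketch of the three DDL commands in action, using Python's built-in sqlite3 module as a stand-in for SQL Server. The `persons` table and its columns are hypothetical, and SQLite's ALTER TABLE is more limited than SQL Server's, although adding a column works in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def table_names():
    """List the tables that currently exist in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


def column_names(table):
    """List the column names of a table (index 1 of each PRAGMA row)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


# CREATE: define something new. The table exists but is empty.
conn.execute("CREATE TABLE persons (id INT PRIMARY KEY, person_name VARCHAR(50))")
after_create = table_names()

# ALTER: edit the definition of something that already exists.
conn.execute("ALTER TABLE persons ADD COLUMN email VARCHAR(50)")
after_alter = column_names("persons")

# DROP: delete the object itself, not just its rows.
conn.execute("DROP TABLE persons")
after_drop = table_names()

print(after_create, after_alter, after_drop)
```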
Now if you look at our table, it is
empty. What do we need? We need data. So
let's say that we have a website or an
application. Now this application is
generating a lot of data. Now in order
for this application to move the data
inside our new table, it must use the
SQL command INSERT. So if you execute
INSERT, you can add new data to
your table. This type of command we
call data manipulation language, DML. And
here we have three commands: INSERT
to insert new data, UPDATE
to update already existing data
and DELETE to delete
data from your table. And that's why we
call it data manipulation language,
because you are manipulating your data.
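
The three DML commands can be sketched like this, again with Python's built-in sqlite3 module standing in for SQL Server and with made-up rows. Note that the table definition itself never changes here, only the data inside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INT PRIMARY KEY, first_name VARCHAR(50), score INT)"
)


def snapshot():
    """Return the current contents of the table, ordered by id."""
    return conn.execute(
        "SELECT id, first_name, score FROM customers ORDER BY id"
    ).fetchall()


# INSERT: add new rows to the table.
conn.execute(
    "INSERT INTO customers (id, first_name, score) "
    "VALUES (1, 'Maria', 350), (2, 'John', 900)"
)
after_insert = snapshot()

# UPDATE: change data that already exists. The WHERE clause picks the rows.
conn.execute("UPDATE customers SET score = 0 WHERE id = 2")
after_update = snapshot()

# DELETE: remove rows. Without a WHERE clause every row would be deleted.
conn.execute("DELETE FROM customers WHERE id = 1")
after_delete = snapshot()

print(after_insert, after_update, after_delete)
```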
So what do we have now? We have a table,
and we have data inside the table. Now
we can start asking questions.
Let's say that you have an analytical
question about your data. All
you have to do is write something
called a SQL query, and inside it you use
the SELECT command, but the whole thing
we call a query. So you send a query
to the database, you have a question, and
the database returns the
result, the data answering your query,
your question. We call this type of
activity in SQL the data query
language, DQL. And here we have only one command, and
it is very famous. We have SELECT.
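
As a small illustration of sending a query and getting a result back, here is a sketch with Python's built-in sqlite3 module as a stand-in database. The table contents and the question itself ("which customers have a score above 100?") are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INT PRIMARY KEY, first_name VARCHAR(50), score INT);
    INSERT INTO customers VALUES (1, 'Maria', 350), (2, 'John', 900), (3, 'Peter', 0);
""")

# The question, written as a query. SELECT only reads; it changes nothing.
question = """
    SELECT first_name, score
    FROM customers
    WHERE score > 100
    ORDER BY score DESC
"""

# The database answers with a result set.
answer = conn.execute(question).fetchall()
print(answer)
```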
We can use it to query our
data. So those are the three different
families of commands in SQL. And of course, we're
going to learn all of them, but we will
spend most of our time learning how to
answer. And now you might ask me, Baraa,
why do we have to learn SQL? And if
time went back, would you learn
SQL again? Well, for sure, of course.
And here are the top three reasons that
I have. The first one: you have to learn
it in order to talk to the data. You
know, most companies store their
data in databases, and this is the
standard way. This is how they do it.
And if you want to work at a company
in the data field and you want to talk
to their data, then you have to use SQL.
It's like you move to another country
where they speak another language and
you want to live there for a long time,
you have to speak their language. The
same thing here. If you want to work
with data, you have to learn the
language in order to speak to the
database, SQL. So this is for me the
most important reason why we have to
learn SQL. Second, SQL is in high demand.
If you go now and check the job
descriptions for software developers,
data analysts, data engineers and data
scientists, I promise you you will find
that they demand SQL.
You will find them asking for
SQL skills in almost every job
description. So if you check any
data-related jobs, you will find that
they ask for SQL skills. Now
another reason that I have is that it is an
industry standard. If you go and
check multiple modern data platforms and
tools like Power BI, Tableau, Kafka,
Spark and Synapse, you will see that
there is always a section where you
can enter SQL code. Most of those
vendors adopt SQL because it is the
standard. It is widely used. It is a
selling point that makes their tools
easy to use. So those are my top three reasons
why SQL is still relevant and why you
have to learn it. Okay, my friends. So
with that we now have a clear
understanding of what SQL is, why we need
it, what databases are and their
different types, why we have a DBMS and
servers, and you now also
understand how things are very
organized and structured inside
databases. So that's all, this is SQL. All
right. So with that we have covered the
basics about what is SQL and databases
now in the next step we're going to go
and set up our environments so that
means we're going to prepare your PC
with the data with the databases and all
the tools that you need in order to learn
SQL. Okay. So now go to the link in the
description and you will land here in my
newsletter website and you can subscribe
if you want to get weekly news about my
content. I also post about data
and many other projects. So once you do
that what we're going to do now we're
going to go to the downloads over here
and you will find here all the materials
of different courses and the one that we
want is SQL ultimate course. Let's go
over here. Now once you do that you will
land on this page, where I have listed
all the important links. The first
one, and the most important one, is to go
and download the course materials. Here
you can find everything: the code, the slides,
the presentations, the whole course. Or if
you don't want that, you can go to my Git
repository and there you will find
exactly the same materials. So let's go
and download everything. Okay. So now go
and put the downloaded folder somewhere
safe and let's go inside it. And here
you can find three things. The first one
is the data sets. Here if you go inside
it you will find the data for the course
the databases that we will be using in
order to practice SQL. So everything is
available here. In the second folder
you can find all the documentation.
That means all the visuals, the
presentation slides, everything that I
present during the course. It is
available here as documentation and notes
for you. Now moving on to the third one,
we have the scripts. During the
course we will be writing a lot of SQL
code, and all of that code is available
here. So that means this is all
the code that is used in the course.
Okay. So with that you have now all the
course materials. All right. So now the
next step is that we have to go and
download the SQL Server Express and you
can find the link as well over here. So
let's go there SQL Server Express. And
now we're going to land on the Microsoft
page where we can see the different
offerings from Microsoft for
SQL Server. Either we have it on
Azure or we can download it
on premises. But we don't want any of
that. Just scroll down to see those two
options. So the first option on the left
side we have the developer edition. You
will get all the features and services
that Microsoft offers with the SQL
Server. It is also free, but the
installation here is a little bit
complicated. In the second option on
the right side we have the Express
edition. Installation here is going to be
really fast and very easy. You will also
get everything that you need for
practicing and learning SQL. So both
options are free. It's just a matter
of the installation. We will go now for
the Express edition. So go and click
download now, and it's a very small file.
So let's go and start it. And now the
installation is going to start. We have
basic, custom and download media.
Download media means download now and
do the installation
later. Custom means we have more
control over how to download and install
things. Basic is the easiest one
and the quickest one. So let's go with
basic and click on that. Let's
go and accept all of that. And now
let's click on install. So now it's
going to install the applications,
drivers and so on. It may take a little
time. So in order to do that, let's go
and click on install SSMS. So let's
click on that and as well we can find
the link over here. So let's go to SQL
Server Management Studio. So let's click
on that. You can find of course this
link as well with the other links that I
have collected. So now we are again on a
Microsoft page. Let's scroll down and
now we will see the following link: free
download for SQL Server Management
Studio, SSMS. So let's go and click on
that and then it's going to go and
download it. Let's go and start it.
The first thing that we have to define is
the location. I will go with the default
install. Okay. Setup completed. We just
installed SSMS. So let's go and
close it. So now let's go and start it.
If you go to your menu over here, search
for SQL Server and you will find it
here. SQL Server Management Studio.
Let's go and start it. Okay, so now
we're going to get this window in order
to connect to our server. So again, what
is our server? It is the one we have
installed at the first step, SQL Server
Express. And that's why you're going to
see in the server name, your PC name, of
course, like it's not going to be my PC
name. But here we have something called
SQLEXPRESS. This is the server we just
installed. In the first option, we
have database engine, reporting
services and so on. Those are different products from
Microsoft. We're going to leave it as
database engine. And the server name should look like
this, with SQLEXPRESS. Now, how to access
this database? We have the following
options. We can do that using Windows
authentication or SQL Server
authentication. Let's
stick with Windows
authentication. The username is going
to be the PC name together with the Windows
user. If for some reason you don't have
that information, you can go to
your search, search for
cmd and then type whoami.
With that you will get the PC name
and the user that you are
currently logged in as. And this is exactly
what I'm seeing over here. One more
thing: if you're having issues connecting
to your database, make sure to check the
encryption. It should be mandatory, and
click on trust server
certificate. So once you do that you
will be able to connect. Okay. So with
that we have the server we have the
client. And now the last step we have to
go and create the database. We want to
insert our data. So now if you look to
the object explorer and open the
databases you can see that we don't have
any database. So now let's do something
about it. Go back to the course
materials. Inside the data sets you will
find
three folders: MySQL, Postgres
and SQL Server. So if you want to follow
this course using a different
database like MySQL or Postgres, you
can find the exact same data for the
database that you are using. But
in this course we are using SQL Server.
So if you follow me with that, go inside
the SQL Server folder and here you will
find four files with different
extensions. So what is going on here?
For this course we have two
databases: one that is very simple,
called my database, and a second one that
has more tables, called sales DB. And now
in SQL server there are multiple ways on
how to create databases. I will show you
now two methods on how to create the
database. Now the first option we want
to create the database from a script.
And if you look at those files, we have
here two files with the extension .sql.
Those are files with SQL code. So let's
start with the first one, the init SQL
Server my database .sql file.
Go inside it. And now here
we have the SQL code. Copy everything.
And now let's go back to our studio and
then go to the menu and click on new
query. And here in the middle you can
paste the code. So now we have the code
for the first database. And all what you
have to do is to go and execute it.
Once we execute it, you will see that we do
not get any error. On the left
side we don't see our database yet,
because we have to refresh. So right
click on Databases and click
refresh. And now you can see it: my
database. So now let's see the content.
Go and expand it and then expand the
tables. And now you see here our two
tables, customers and orders. Inside
those tables we can find our data. In
order to see the data right click for
example of the customers and let's go
with the option select top 1,000 rows.
Once you do that you can see now in the
results we have here five customers.
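The right-click option simply runs a query on your behalf. A rough hand-written equivalent is sketched below. It assumes the query window is connected to the first course database, and the statement the studio generates itself is more verbose (it lists every column and uses fully qualified names).

```sql
-- Sketch of what "Select Top 1000 Rows" amounts to for the customers table.
-- TOP (1000) caps the result at 1,000 rows; this table only holds five.
SELECT TOP (1000) *
FROM customers;
```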
This is our data inside the table
customers. So here again about the
interface on the left side we have the
object explorer where you can see the
whole structure of the database from
server to databases to tables. So you
can see the whole structure on the top
we have a menu with a lot of icons and
then in the middle this place here we
call it the SQL editor. Here we're going to
go and write our SQL code and then
once you execute it at the bottom you
will get the result and messages and
below the SQL editor we have the output.
So here you can see for example the data
the results or different messages from
the database. So the interface is very
simple. Now we have to go and get our
second database. So if you go back to
our files you can find a second SQL file
the init SQL server sales DB.sql. Open
that and let's go and copy everything
here and let's go back to our studio.
Same thing you have to go and create a
new query then paste the whole code and
this database is about the sales DB. So
let's go and execute it and with that we
will not get any errors and now we go to
the left side and we do the same thing
refresh and we can see the second
database sales DB. Now we can go and
explore it. So extend it go to the
tables and here you can see five tables
customers employees orders products. So
here this is the intermediate database
for our course. So now let's go and
check our data. For example, let's go to
the orders, right click on it and select
top 1,000 rows. And those are the orders of
our database. Perfect. So everything is
working. So those are the main two
databases that we will be working
through the whole course. And of course
if you want to go and practice using
another database, it's totally fine. For
example, from Microsoft, there is a
database called AdventureWorks. It is
really amazing. And I'm going to show
you now how to import it. We can go over
here the adventure works. So let's click
on this link. So now we are again in
Microsoft page. If you scroll down you
can see here three different types of
databases: the OLTP, data warehouse and
lightweight. So they are like different
databases. The OLTP is the most like
complicated one. A lot of tables and
transactions and so on. The data
warehouse is a really nice one in
order to do data analysis and so on. The
lightweight it is the simplest one. So
let's go for example and get the data
warehouse. So click on that and now as
you can see the extension of this file
is .bak and now I'm going to show you the
second way on how to create databases in
SQL server. So now all what you have to
do is to go to the following path. It
really depends where you have installed
the SQL server. So for me I have
installed it in the program files
Microsoft SQL Server MSSQL SQL Express
then MSSQL backup. You have to go there.
So here what you can do you can place
all the files with the extension bak.
For example, the AdventureWorks file that we
just downloaded. This is a backup file
for the database and we want to go and
restore it and with that you are
creating like a database. So this is the
second method on how to create databases
in SQL server by restoring the database.
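The walkthrough here uses the studio's restore dialog. T-SQL also has a RESTORE DATABASE statement that does the same job from a query window. The sketch below is not what the video shows: the database name and the file path are hypothetical placeholders to be replaced with your own backup file and backup folder.

```sql
-- Hypothetical placeholders: adjust the database name and the path
-- to match the .bak file you copied into the SQL Server backup folder.

-- Optional first step: list the data and log files stored in the backup.
RESTORE FILELISTONLY
FROM DISK = N'C:\path\to\Backup\AdventureWorksDW.bak';

-- Restore the backup as a new database and bring it online.
RESTORE DATABASE AdventureWorksDW
FROM DISK = N'C:\path\to\Backup\AdventureWorksDW.bak'
WITH RECOVERY;
```

If the file locations recorded inside the backup do not exist on your machine, the restore may also need a WITH MOVE option for each file; the dialog in the studio handles that for you.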
If for some reason the script didn't
work for you. Now let me show you
quickly how we can do that. Let's go
back to our studio. Right click on the
database and then here we have an option
called restore database. Click on that.
And now here we have two options under
the source database and device. The
default going to be database but we have
to switch to a device because we want to
import it from files. And then we go to
these three dots. Click on that. And now
we have to go to the option add. And now
it's going to take you to the place
where the SQL server creates backups. So
here we can find our files and what we
want to restore is the AdventureWorks
file. Select that. Then OK, one more
OK and one final OK. So now the
database will be restored and it is
successful. So now on the left side we
can see our third database. If you don't
see it go and refresh of course and here
you will find a lot of tables in the
adventure works. And as usual we can go
and explore the data by selecting top
thousand rows. So my friends now you
have three databases but of course our
focus is only the first two that we have
done my database and sales DB. And with
that you have learned two ways on how to
import databases into SQL server. So
with that my friends we have prepared
everything. We have SQL Server
Express running on your local PC. We
have the studio, the client that we're
going to use in order to interact
with the database and we have created
our two databases that we will be using
in order to practice SQL. So we are
ready. All right my friends. So with
that we are done with the first chapter.
We have our introduction to SQL and now
we're going to start learning the first
thing in SQL and that is how to query
our data. So let's go and start with that.
Okay, so now we can understand exactly
what is an SQL query. Now normally your
data is inside the table and your table
is inside the database and now you might
have a question from the business like
what is the total sales? What is the
total number of customers? So any
question that you have in your mind and
you want to go and ask your data you
want to go and retrieve data from the
database and in order to do that you
have to talk to the database using its
language the SQL. So in order to do that
you're going to go and write a query
where you write inside the query
something called select statement and
with that you are asking the database
for data. So once you execute your query
the database going to go and fetch your
data and then it prepares a result to be
sent back to you. So with that you are
asking the database a question by
writing a query and the database going
to process your query and answer your
question by sending back data and with
that we are like reading our data from
the database and the queries will not
modify anything will not change the data
inside your tables or even change the
structure of the database. So you use
select statement only in order to read
something from the database. You just
want to retrieve data from the database.
And now my friends, each SQL
query has usually different sections,
different components. We call them
clauses. And this is amazing because
you're going to have enough tools to
write a query that matches any question
that you have about your data. So what
we're going to do, we're going to cover
all those clauses step by step in order
to write any query that you need. So now
we're going to start with two clauses
that makes the simplest query in SQL.
All right. So now it's really
important for me that you understand how
SQL works with the code with the
queries. So now what I'm going to do,
I'm going to show you on the right side
the syntax of the query in SQL and then
on the left side I'm going to show you
exactly step by step how SQL going to go
and execute your query. So now we have
the table customers inside our database
and we will start with the easiest form
where we're going to select everything.
Select the star. So the select star is
going to go and retrieve all the columns
from your table. So everything and the
from clause it's going to tell SQL where
to find your data. So with the select we
select the columns that we want and the
from you specify the table where your
data come from. So the syntax going to
be very simple. In each query we start
always with the select. And now since we
want all the columns we're going to
write star and with that SQL going to
understand I want to see everything. And
then after that comes the keyword from.
And now we want to tell SQL where the
data come from. So we have to specify
the table name. And that's it. This is
all what you need to do. So once you
execute it what's going to happen? SQL
going to go and execute first the from
clause. So it's going to go and retrieve
all the data from the database to the
results. And then in the next step going
to go and check the select statement. So
which columns we have to keep in the
result since you are saying star then
the SQL going to keep everything all the
columns and with that you will see in
the result everything all the columns
and all the rows. So that's it. This is
how it works. Now let's go back to SQL
in order to select some data from our
database. Okay. So back to our studio.
Let's go and start a new query and let's
go and find our database just to expand
it and our tables. Now it is very
important to make sure that you are
connected to the correct database. So go
to the top left in the menu over here
and make sure to select your database.
So my database like this or we have a
command for that called use and then
just write the database name like this.
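As a minimal sketch of that command (the exact spelling MyDatabase is an assumption; use the name shown under Databases in the object explorer):

```sql
-- Switch the current query window to the first course database.
-- MyDatabase is an assumed spelling of the database name.
USE MyDatabase;
```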
So I'm telling SQL just use my database
like this and with that SQL going to
switch to your database. Now if you are
learning any new programming language,
it is very important to understand about
the comments. So comments are like notes
that you add to your code in order to
understand what is going on. And of
course the engine, the database will not
go and execute it. It's going to go and
ignore everything inside it. And there
is like two ways on how to do that.
Either you make inline comments by
typing two dashes like this and then you
write anything this is a comment. So now
in SQL if you see it is green that means
it is a comment. Now the other type: you
can have multi-line comments and in
order to do that what you can do you can
write slash and then star and then you
can write anything: "this", then start a
new line, "is a comment". So as you can see
all the lines after the slash star it is
getting green that means it is a comment
and now let's say that you are at the
end. So in order to close it you write
again star and then slash and that you
are telling SQL I'm done with my
comments. So those are the two types of
writing comments in SQL. Now back to our
query. Let's say that we have the
following task says retrieve all
customer data. So I would like to see in
the results all the data of my customers
everything all the rows and all the
columns. So currently our data is stored
inside the table called customers and I
need to see all the data in the output.
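Put together, the two comment styles and the query for this task might look like the sketch below. It assumes the query window is already switched to the first course database and that the table is named customers, as shown in the object explorer.

```sql
-- This is an inline comment: everything after the two dashes is ignored.

/* This is a multi-line comment.
   Everything between the opening slash-star
   and the closing star-slash is ignored. */

-- Task: retrieve all customer data
SELECT *
FROM customers;
```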
In order to do that we're going to write
a query and all our query start always
with a select and since I need
everything all the columns we write star
and then a new line. Let's go and
specify for SQL from where it's going to
go and get the data. So it's going to be
from and then we going to write the name
of the table. It must be exactly like it
is in the database. So it's called
customers and you have to write it here
as customers. So that's it. Let's go
and execute it. And now if you look to
the results, you can see we have four
columns and five rows. So with that you
are seeing everything inside the table
customers. You can see we have five
customers and you can see all the
columns about the customers. So this is
very simple. We have asked the database
a question using an SQL query and the
database should answer our question by
returning our data in the results. All
right. So now let's move to another
task. I'm going to go and create a new
query and this time we're going to
retrieve all the order data. So that
means I would like to see all the data
inside the orders. So let's go and write
a very simple query. We start as usual
with select and since we want
everything. So it is select star from
our table orders. So that's it. Let's go
and execute. And with that you can see
in the output we have again four columns
but this time we have only four rows. So
that means in this table we have four
orders and we can see all the data
inside this table. So with that we can
understand we have five customers inside
our database and these customers did
generate four orders. So as you can see
we are now talking to our database using
SQL. All right. So now let's move to the
next step in our query where you say you
know what I don't want to see all the
columns from the database. I want to be
more specific. So I would like to select
exactly the columns that I need. So now
we want to select few columns from the
database where we select only the
columns that we need instead of
everything. Now about the syntax we're
going to go and change a little thing.
So instead of using star we're going to
go and make a list of columns that we
want to see in the output. So we're
going to select column one column two
and we're going to separate them using a
comma. So we are just writing a list of
columns exactly after the select. And
for the from it's going to stay as it
is. So from a table. Now if you execute
this what going to happen as usual SQL
going to start with the from. So it's
going to go and get the data from the
database and then the next step is going
to go and check the select. So what
going to happen? SQL going to go and
keep only two columns like for example
the name and the country and all the
columns that are not mentioned in the
select statements will be excluded. So
SQL going to go and remove it from the
results and keeps only the columns that
we mentioned in our query. So this time
instead of having four columns in the
output we can have only two. So with
that you are like filtering the columns
and you are selecting exactly what you
need. So now let's go back to SQL in
order to practice this. All right. So
now we have the following task and it
says retrieve each customer's name,
country and score. So that means I don't
want to see everything from the table
customers. I need only to see the three
columns. So let's see how we can do
that. As usual we start with select and
I'm going to go with a star in order to
see the whole table first from the table
customers. So it's exactly like before.
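For reference, the query this task ends up with could be sketched as follows. The column names first_name, country and score are assumptions based on how the columns are spoken in the video; check the Columns folder of the table for the exact spelling.

```sql
-- Task: retrieve each customer's name, country and score
SELECT
    first_name,
    country,
    score       -- no comma after the last column
FROM customers;
```

The columns appear in the result in the same order as they are listed after SELECT.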
Let's go and execute it. And now I can
see everything inside the table
customers. But the task says I need only
three columns. So now what we're going
to do instead of the star, we're going
to make a list of columns. So we start a
new line and then we write the name of
the first column. So the first name, a comma,
a new line for the second column for the
country and then again a comma and then
we write the score. So with that we have
the three columns. Now what I usually
do, I go and select them and give it
then a push using a tab. This just looks
nicer and easier to read. So with that
we have now between the select and from
list of columns. Now there is like
mistake that happens a lot where we go
and type a comma after the last column.
So if you do that and execute it you
will get an error because SQL going to
expect from you a column after the comma
and since there is no column and
immediately you have a from you will get
an error. So there is no need for a
comma after the last column. Now let's
remove it and execute. And now you
can see in the output we don't have four
columns, we have only three: the first
name, the country and the score. And by
the way, they are ordered exactly like
you selected in your query. So first we
have the first name and then the country
and then the last one the score. So that
means if I go and now change the order.
So let's get the country at the end and
execute. You will see the country at the
end. I'm going to go and put it back in
between to match exactly like the task
and remove the last comma. So execute
again. And with that we have selected
few columns from our table. So we are
more specific to what we need. Okay. So
with that we have covered the two clauses select and
from. Next we're going to talk about the
where clause that you can use in order to
filter your data. So what is exactly where? We use
where in order to filter our data based
on a condition and any data that fulfill
the condition going to stay in the
output in the result and the data that
don't meet the condition will be
filtered out of the results. Condition
could be anything like for example we
say the score must be higher than 500 or
you can say the country must be equal to
Germany. So any condition that you have
in your question. Now let's see the
syntax in SQL. As usual we start with a
select. We select the columns that we
need. Then we write from where the data
come from and then after the from we're
going to write the where and exactly
after that you specify your condition.
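A minimal sketch of that shape, using the "score higher than 500" condition mentioned above (score is an assumed column name):

```sql
-- Keep only the rows whose score is higher than 500
SELECT *
FROM customers
WHERE score > 500;
```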
So now let's see how SQL going to
execute this. First SQL start as usual
from the from. So it's going to go and
get your data from the database and
after that SQL going to go and execute
the where clause. So let's say that the
condition is a score higher than 500. And
now what going to happen? SQL going to
check each row whether it meets this
condition or not. So for example for
Maria she doesn't fulfill the condition
because her score the 350 is not higher
than 500. So she doesn't fulfill the
condition and SQL going to go and remove
completely this row this record from the
results. Now SQL going to go to the
second record. So John is fulfilling the
condition. So he going to stay in the
result. The same thing for George. Now
moving on to the fourth one Martin. So
this customer is not fulfilling the
condition and SQL going to go and remove
it from the results. The same things
happen for the last customer. The score
is zero and not fulfilling the
condition. So that means if we apply
this filter, SQL going to return only
two customers out of five. So with that
we are filtering the rows based on
condition using the where clause. Now as
you can see in the result we are getting
all the columns but if you specify in
the query like for example only two
columns like the name and the country
then SQL going to start removing as well
the columns of the results. And this
means in the output we will get only two
columns and two rows. So with that you
are filtering the columns and the rows
of your results. So now let's go back to
SQL in order to practice this. All
right. So let's have the following task
and it says retrieve customers with a
score not equal to zero. So now if you
are looking to our task you see we have
like here a condition. The condition
says the score must not be equal to
zero. So I don't want to see all the
customers. I want to see only the
customers that fulfill this condition.
So it's like we have to filter the data.
So let's go and solve the task. Let's
start as usual. Select star. There's no
specifications about the columns from
our table customers. Okay. So I'm going
to start with this. Let's go and execute
it. Now if you look at the result, you
can see like almost all the customers
are fulfilling the condition. Their
scores are not equal to zero. Only one.
The last customer his score is zero. So
this customer does not fulfill our
condition. Now let's go and build filter
for that. So we're going to say where.
And now there will be a section that is
only focusing on how to build conditions
and filtering in SQL. So don't worry a
lot about the syntax of the conditions.
We're going to cover that later of
course but it is very simple. Now for
the condition we need a column. So on
which column is our condition based?
It's going to be on the score. So we're
going to write here score and since we
are saying not equal there is like an
operator in SQL called not equal and
then we have to write a value after
that. It's going to be a zero. So again
the condition is like this. The score
must not be equal to zero. It's very
simple, right? And with that we have our
condition and we are using the where in
order to filter the data. So let's go
and execute it. And now as you can see
SQL did remove the last customer because
he is not fulfilling this condition. And
we have now only the rows that fulfill
our condition. So as you can see it is
very simple how to filter the data. All
what you have to do is to write where
clause after the from and then write a
condition after that. Now let's have
another task like for example it says
retrieve customers from Germany. So I
don't want to see all customers from
different countries. I just want to see
the customers that come from Germany. So
that means we have a condition here.
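As a sketch, the filter from the previous task and the one this task calls for might be written like this. The column names score, country and first_name are assumptions; the single quotes around the text value are required in SQL.

```sql
-- Previous task: customers with a score not equal to zero
SELECT *
FROM customers
WHERE score != 0;

-- This task: customers from Germany.
-- Text values go between single quotes; numbers do not need them.
SELECT *
FROM customers
WHERE country = 'Germany';

-- Same row filter, keeping only two columns in the output
SELECT
    first_name,
    country
FROM customers
WHERE country = 'Germany';
```

SQL Server accepts both != and <> as the "not equal" operator.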
Country of the customer must be equal to
Germany. So let's go and remove the
current condition. It is not the one
that we need and execute. If you are
looking to the results, we have two
customers that come from Germany and we
are interested only to show those two
customers. So let's go and make a filter
for that. We're going to write where
clause and after that we need a column.
The column going to be the country. So
we're going to write here country and
this time the country must be equal to
Germany. So we're going to write an
equal operator. So we're going to write
Germany like this exactly like the value
inside our data. But now as you can see
we are getting like an error here. And
that's because in SQL if you want to
write a value that contains characters
then you have to put it between two
single quotes. So at the start you put a
single quote and as well at the end. And
now as you can see the red line is away
and the value now is red and that's
because it is a string value. It is a
value that contains characters and with
that you will not get an error. So if
your values contain only numbers you
can write them without single quotes. But
if your values contain characters then
you have to write it between two single
quotes. Okay. So now back to our
condition the country must be equal to
Germany. Let's go and execute it. And it
is working. So as you can see now we are
seeing in the output only the customers
that fulfill my condition where the
country is equal to Germany. So this is
exactly how we work with the where clause
in order to filter our data. So my
friends this is how you filter your
rows. And now let's say that I would
like to filter the rows together with
the columns. So I just want to keep the
first name and the country and not
interested to see the scores and the
ids. So in order to do that we're going
to go to the select and list the columns
that we want to see. So the first name
and after that a comma then the country
and that's it. So let's go and give it a
push and execute it. So we have two rows
and two columns. So guys as you can see
SQL is very simple. All right. So with
that you have learned how to filter your
data using the where clause. Next we're
going to talk about how to sort your
data. Okay. So what is exactly order by?
You can use this type of clause in order
to sort your data. And of course, in
order to sort your data, you have to
decide between two mechanisms. Either you want
to sort your data ascending from the
lowest value to the highest value or
exactly the opposite way using
descending from the highest value to the
lowest. And the syntax kind of looks
like this. So as usual, we start with
the select and then from and after the
from you can specify order by and with
that you are telling SQL we have to sort
the data and you have to specify two
things. First you have to specify for
SQL the column that should be used in
order to sort the results. So for
example you can say score and after the
column name you have to specify the
mechanism. So for example you say
ascending from the lowest to the
highest. And in SQL if you don't specify
the mechanism the default going to be
ascending. So you will not get an error
if you don't specify anything after the
column name. But my advice here is
always to specify something after the
column name because it's just
straightforward and easier to understand
and if someone reads it can understand
immediately it's going to be ascending
because maybe not everyone knows what is
the default in SQL. So always specify a
value even if it's like easier to skip
it and if you want to sort the data
from the highest to the lowest then you
can specify descending. So as usual SQL
going to go and start from the from it's
going to go and grab your data from
database. Then the second step is SQL
going to go and sort the result. So the
order by going to be executed and SQL
going to see okay I'm going to sort it
by the score and using the descending
mechanism and SQL going to go and
start like moving around your rows where
the first row going to be the customer
with the highest score and in this
example John has the highest score the
900. So John going to appear as a first
row at the result and that's because of his
score. After that the second highest
is going to be George with 750 and SQL
going to go and keep sorting the data
and then we have 500 then 350 and the
last row going to be the customer with
the lowest score the zero. So this is
how SQL executes your order by. Now
let's go back to SQL in order to
practice. All right. So now we have the
following task and it says retrieve all
customers and sort the result by the
highest score first. So now by looking
at the task we need all the customers.
So there is like no conditions or
anything to filter but we have to sort
the results. So let's go and do that.
We're going to start as usual by
selecting all the columns from the table
customers. So now if you go and execute
it you will get all your customers and
you are now seeing the data exactly like
stored in the database. And you can see
the result is not sorted by the scores.
So we have here a low score then high
score then low and so on. Now the task
says we have to sort the results. So we
have to go and use the order by and now
you have to understand from which column
and we can get that from the task. So it
says it should be sorted by the score.
So we're going to go and define the
score here. And the final thing that you
have to define is the mechanism
descending or ascending. And you can get
it as well from the task. So we have to
sort the data by the highest score
first. So the highest first and then the
lowest. So that means we're going to go
and use the descending. So that's all.
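The finished query for this task could be sketched as below, together with its ascending counterpart from the syntax discussion earlier (score is an assumed column name):

```sql
-- Task: retrieve all customers, highest score first
SELECT *
FROM customers
ORDER BY score DESC;

-- Opposite direction: lowest score first.
-- ASC is the default, but writing it out makes the intent obvious.
SELECT *
FROM customers
ORDER BY score ASC;
```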
Let's go and execute it. Now as you can
see in the results, the first customer
has the highest score. Then we have the
second one with the second highest until
the last one with the lowest score.
That's it. This is how you sort your
data. And with that we have solved the
task. Now let's do exactly the opposite.
So we want to sort the results by the
lowest score first. So that means we
want to see first the customers with the
lowest score like here in this example
we should see the ID number five as the
first because he has the lowest score
the zero. Now in order to do that all
what you have to do is to switch the
mechanism: instead of descending you
can use ascending. Let's go and execute
it. And that's it. As you can see now we
have the lowest score then the second
lowest score until the last row. It's
going to be the customer with the
highest score. So the lowest score comes
first. So it is very simple. This is how
you sort data in SQL. And now I'm going to show you one
more thing that you can do with the
order by. You can sort your data using
multiple columns. And we call it nested
sorting. So now let's take this very
simple example where you want to sort
your data using country. So we are
saying order by the column country and
the mechanism going to be ascending. So
from the lowest to the highest. Now if
you do that SQL going to go and sort the
data this time based on the country. So
we're going to have like the first two
customers from Germany. It is sorting it
alphabetically. Then we have the UK and
the last two going to be from USA. Now
if you are checking the final results
you might say you know what there is
like something wrong. The data is not
completely sorted correctly. So if you
are looking to the first two customers
that come from country Germany. You can
see the scores are sorted in ascending
way from the lowest to the highest. So
first we have 350 then 500. Then UK it's
fine because we have only one customer.
Now if you look to the customers from
USA you see that it is like sorted the other
way around. It is sorted descending from
the highest to the lowest. So first we
have the score 900 then zero. So there
is like no clean way on how the data is
sorted and the result is not really
clean and this issue happens usually if
you are sorting your data based on a
column that has repetition like here the
country we have twice Germany and twice
USA. So now in order to refine the
sorting and make it more correct, we can
include in the sorting another column in
this scenario for example the score. So
we can make a list of columns in the
order by and we can separate them using
the comma. And of course you can have
different mechanism for each column like
for the country we are saying it is
ascending but for the score we say you
know what let's make it descending. It
will not be only one for all columns. So
now what can happen is we're going to
start sorting the data for each section.
So for the two customers from Germany
the sorting going to be from the highest
to the lowest. So it's going to go and
switch the two customers. So Martin
going to be first because he has higher
score than Maria. And with that we are
refining the scores based on the same
value of course the country. Now for the
UK nothing going to happen because we
have only one value and for the USA as
well nothing going to happen because it
is already sorted in the correct way
from the highest to the lowest. So as
you can see if you are including a
second column you are refining your
sorting and as well my friends the order
is very important. So this is how you
can do nested sorting in SQL. Let's go
back to our SQL and start practicing.
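The nested sort just described can be sketched like this, with one direction per column (country and score are assumed column names):

```sql
-- Sort by country A to Z, then within each country by highest score first
SELECT *
FROM customers
ORDER BY
    country ASC,
    score   DESC;
```

The first column in the list has priority; the second one only decides the order among rows that share the same country.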
All right so now we have the following
task and it says retrieve all customers
and sort the results by the country and
then by the highest score. So again we
need all customers. So select everything
from customers table. And now the task
says we have to sort the result by the
country. So we're going to start with
the order by and since it says by the
country. We're going to go with the
country and we're going to sort it
alphabetically. So it's going to be
ascending. So let's go execute it. Now
you can see the data is sorted
completely differently by the country.
So we have first Germany, UK and then
USA. But that's not all: the task says then by
the highest score. So we have to go and
include another column in the sorting
and we can go and add that by adding a
comma and then mention another column
the score and now we have to specify the
mechanism. It says by the highest score.
So the highest must come first and with
that we are using descending. Now what
is the current situation in that? If you
look to the results for example for
those two customers we have 350 and then
500. So that means the scores are sorted
ascending right the same thing for USA.
So from the lowest to the highest. Now
if you go and do it like this what going
to happen it's going to go and switch
it. So you can see over here now for
Germany first comes the highest the 500
and then the 350 and for USA as well
they switched. So we have the highest
and then the lowest and with that we
have solved the task. Now again the
order of those columns is very
important. So since the score comes
after the country we will not get the
highest scores first at the results. So
we will not get the 900 as a first row.
And that's because the scores must be
sorted after the country. So the country
has more priority. Now if you go and
flip that. So let's go over here and
says sort first the score and then the
country. So let's go and execute it.
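A sketch of the flipped version being executed here, assuming the same directions as before (score descending, country ascending):

```sql
-- Flipped priority: score is sorted first, country only breaks ties
SELECT *
FROM customers
ORDER BY
    score   DESC,
    country ASC;
```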
SQL has first to sort the
scores. So with that you will get the
900 first, right? And then the
countries. And since there is like no
duplicates in the scores, this makes no
sense at all. So you can go and skip it.
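
A minimal sketch of the nested sort from this task. The column names (id, country, score) and the inline sample rows are assumptions reconstructed from the values mentioned in the walkthrough; the course queries a real customers table rather than an inline CTE.

```sql
-- Assumed sample data: five customers with the scores mentioned above.
WITH customers (id, country, score) AS (
    SELECT 1, 'Germany', 350 UNION ALL
    SELECT 2, 'USA',     900 UNION ALL
    SELECT 3, 'UK',      750 UNION ALL
    SELECT 4, 'Germany', 500 UNION ALL
    SELECT 5, 'USA',       0
)
SELECT *
FROM customers
ORDER BY
    country ASC,   -- first priority: countries alphabetically
    score   DESC;  -- tie-breaker: highest score first within each country
-- Rows come back as: Germany 500, Germany 350, UK 750, USA 900, USA 0
```

Swapping the two sort columns (ORDER BY score DESC, country ASC) puts the 900 first, and because no score repeats in this data, the country column then changes nothing.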
So nested sorting only makes sense if you have repeated values in your results and you
can use the help of a second column to refine the sorting. So that's it, and with that
we have solved the task. All right. So with that, you have learned how to sort your
data using ORDER BY. In the next step, we're going to talk about how to aggregate and
group up your data using GROUP BY, and we're going to put it between the WHERE and the
ORDER BY, because in the order of the query the GROUP BY comes between the WHERE and
the ORDER BY. So let's go. Okay. So what exactly is GROUP BY?
It's going to combine the rows that share the same value. So it's going to squash your
rows together to make the data aggregated. All that GROUP BY does is aggregate one
column by another column. For example, if you want to find the total score by country,
you aggregate all the score values for each country. If you have this kind of task,
then you can use GROUP BY. Let's see the syntax. We start as usual with the SELECT.
What we want to see in the result is two columns. First, we have to specify a
category, like the country. This is the value that you want to group the data by. And
second, a column where you are doing the aggregation. For example, you are saying: I
would like to see the total score. So we use the function SUM to add up the values of
the score. After that, as usual, we use the FROM to select the data from a specific
table. And now comes the magic: after the FROM, we use GROUP BY. Now SQL understands:
okay, I have to combine the data, I have to group up the data by something. And this
time we are saying you have to group up the data
of the country must be presented in the output only once and for each country we
output only once and for each country we want to see the aggregation and that is
want to see the aggregation and that is the total score. So let's see how is
the total score. So let's see how is going to execute it. So it's going to
going to execute it. So it's going to first start with the from it's going to
first start with the from it's going to go and get the data from the database
go and get the data from the database and then it's still going to execute the
and then it's still going to execute the group by and now scale understand okay I
group by and now scale understand okay I have to group up now the data by the
have to group up now the data by the country and it understands it has to
country and it understands it has to aggregate the scores for that. So it's
aggregate the scores for that. So it's going to go and identify the rows that
going to go and identify the rows that are sharing the same value. Like for
are sharing the same value. Like for example here we have two rows for
example here we have two rows for Germany and it's going to bring it to
Germany and it's going to bring it to the results. So now we have two rows for
the results. So now we have two rows for the same country but since we are saying
the same country but since we are saying group by country SQL going to try and
group by country SQL going to try and combine them smash them together in only
combine them smash them together in only one row. So each value of the country
one row. So each value of the country must exist at maximum once. We cannot
must exist at maximum once. We cannot leave it like this. So now what we going
leave it like this. So now what we going to do with the scores? We have two
to do with the scores? We have two scores. Now SQL going to check the
scores. Now SQL going to check the aggregate function. It is the
aggregate function. It is the summarization. So, and it's going to go
summarization. So, and it's going to go and add those values 350 + 500. And with
and add those values 350 + 500. And with that, we're going to get the total score
that, we're going to get the total score of 850. And with that, as you can see,
of 850. And with that, as you can see, scale is combining those two rows into
scale is combining those two rows into one. So, in the output, Germany will
one. So, in the output, Germany will exist only one. And about the scores, we
exist only one. And about the scores, we will get the total score. And the same
will get the total score. And the same thing going to happen for the next
thing going to happen for the next value. In the country, we have the USA.
value. In the country, we have the USA. We have it twice. So, we're going to get
We have it twice. So, we're going to get two rows. And scale going to combine
two rows. And scale going to combine those two rows in one because USA must
those two rows in one because USA must exist only once. And with the scores we
exist only once. And with the scores we will have the total scores. So 900 plus
will have the total scores. So 900 plus zero we will get 900. And with that it's
zero we will get 900. And with that it's still converted those two rows into one.
And for the last value in the countries, we have the UK. It's going to stay as it is.
There is no need to combine anything, because it's already one row. So my friends, if
you look at the output, you can see we grouped the original data by the country, and
that means we get one row for each value inside the country column. In the original
data you have five rows. In the output, if you are using GROUP BY like this, you get
only three rows. So this is exactly how GROUP BY works. Let's go back to SQL and
practice.
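
The grouping just walked through can be sketched roughly as follows. The column names and the inline sample rows are assumptions reconstructed from the numbers in the explanation (350 + 500 for Germany, 900 + 0 for USA, a single UK row), not the course's actual table definition.

```sql
-- Assumed sample data matching the five rows described above.
WITH customers (id, country, score) AS (
    SELECT 1, 'Germany', 350 UNION ALL
    SELECT 2, 'USA',     900 UNION ALL
    SELECT 3, 'UK',      750 UNION ALL
    SELECT 4, 'Germany', 500 UNION ALL
    SELECT 5, 'USA',       0
)
SELECT
    country,       -- the category: one output row per distinct value
    SUM(score)     -- the aggregation applied to each group of rows
FROM customers
GROUP BY country;
-- Three rows (row order is not guaranteed without ORDER BY):
-- Germany 850, UK 750, USA 900
```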
Okay. So we have the following task, and it says: find the total score for each
country. From reading this, you can understand we have to do an aggregation and we
have to combine the data by a column. Usually I start like this: I select the columns
that I need in order to solve the task. So what do we need? We need the country and
score from our table customers. Let's start like this. Now you can see we have the
countries and the scores. The task says we have to group up the data by the country,
so that means this is the column where we're going to do the GROUP BY, and the scores
will be aggregated. So what do we have to do? We're going to use GROUP BY, since it
says for each country: GROUP BY country. And now we have to aggregate the scores. We
cannot leave the column like this, so we're going to say the SUM of the score. Let's
go and execute it. And with that, as you can see, we are getting the total score for
each country. Instead of five customers, we have only three rows now, and that's
because the country column has three distinct values. Now if you check the result, you
can see something weird. It says "no column name". That's because we have changed the
scores. It's no longer the original score, it is the total score. We have summed those
values, so SQL doesn't know what to call the column. Those values don't come directly
from the database. It is a manipulation that you have done here. In order to give it
a nice name, we can add an alias. An alias is only a name that lives inside your
query. We can do it like this: AS, and then you can specify any name you want, for
example total score. Now SQL understands, okay, this is the name for this column, and
if you execute it, you will see the new name in the results. But you have to
understand, this name exists only in this query. You are not renaming anything inside
your database, and you cannot use it in any other query. It is just something that is
known inside this query and only for your results. And of course you can rename any
column. For example, here you can say this is the customer country, and if you execute
it, you are just renaming the column in the output. So this is really nice in SQL.
Okay. Now there is
one more thing about GROUP BY: the non-aggregated columns that you add in the SELECT
must also be mentioned in the GROUP BY. For example, let's say: okay, I'm seeing the
countries and the total scores, and I would like to see the first name as well. So you
go over here and say, you know what, let's get the first name. So country, first name,
the total score, and execute. You will get an error, because SQL is telling you: every
column in the SELECT must either be a column you group the data by or be aggregated.
The first name is not aggregated, and it is not used in the GROUP BY either. So it is
just here to confuse SQL, and it will not work. If you bring in a column, either it
should be inside an aggregation or it should be part of the GROUP BY. So in order to
fix this, if you really want to see the first name, you can say, you know what, let's
add it to the GROUP BY, and execute. This time it works, because all the
non-aggregated columns in the SELECT are also part of the GROUP BY. Now, as you can
see, we have the countries, the first name, and the total scores, and again we have
five rows, not three rows. That's because SQL is now grouping the data by two columns,
the combination of the country and the first name, and those two columns give five
combinations, which means you get five rows. So you have to be really careful with
what you define in the GROUP BY: the number of unique combinations that those columns
generate defines the number of rows in the results. If you remove the first name from
the SELECT and from the GROUP BY as well, you are grouping by only one column, and
this column has only three values, and that's why you get three rows. And with that,
of course, we have
solved the task. Now let's extend the task and say: find the total score and total
number of customers for each country. That means we need two aggregations: the total
score and also the total number of customers. From reading this, you can understand we
still want to group up the data by the country, but this time we need two types of
aggregation. We have almost everything. What is missing is the second aggregation.
What you can do is add another aggregate function, called COUNT. What we want to count
is the number of customers, so we can put the ID in it and call it total customers.
Now if you execute it, you get the total customers by country as well.
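
As a rough sketch, the extended task with two aggregations and aliases might look like this. The alias spellings (total_score, total_customers), the column names, and the inline sample rows are assumptions based on the values mentioned in this walkthrough.

```sql
-- Assumed sample data: five customers across three countries.
WITH customers (id, country, score) AS (
    SELECT 1, 'Germany', 350 UNION ALL
    SELECT 2, 'USA',     900 UNION ALL
    SELECT 3, 'UK',      750 UNION ALL
    SELECT 4, 'Germany', 500 UNION ALL
    SELECT 5, 'USA',       0
)
SELECT
    country,
    SUM(score) AS total_score,      -- alias exists only in this query's output
    COUNT(id)  AS total_customers   -- id is aggregated, so it is not in GROUP BY
FROM customers
GROUP BY country;
-- Germany: 850 and 2, UK: 750 and 1, USA: 900 and 2 (row order not guaranteed)
```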
And as you can see, SQL has no problem with the ID, and that's because you are
aggregating the ID, so SQL knows what to do with it and how to combine it. That means
you don't have to mention the ID in the GROUP BY, because you are aggregating it. So
that's all, and with that we have solved this task as well. All right. So with this,
you have learned how to group up your data using GROUP BY. Next, we're going to talk
about another technique for filtering your data, but this time using the HAVING
clause. So let's go. All right. So what exactly is HAVING? You can use it to filter
your data, but after the aggregation. That means we can use HAVING only after using
GROUP BY.
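
To make "filtering after the aggregation" concrete, here is a small sketch that keeps only the countries whose total score is above 800, the condition used in this part of the course. The column names, the alias, and the inline sample rows are assumptions.

```sql
-- Assumed sample data: five customers across three countries.
WITH customers (id, country, score) AS (
    SELECT 1, 'Germany', 350 UNION ALL
    SELECT 2, 'USA',     900 UNION ALL
    SELECT 3, 'UK',      750 UNION ALL
    SELECT 4, 'Germany', 500 UNION ALL
    SELECT 5, 'USA',       0
)
SELECT
    country,
    SUM(score) AS total_score
FROM customers
GROUP BY country
HAVING SUM(score) > 800;   -- applied to the grouped rows, not the original rows
-- Germany (850) and USA (900) remain; UK (750) is filtered out
```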
So let's see the syntax. Again, like in the previous example, we are finding the total
score by country, so we have our SELECT, FROM, and GROUP BY. Now you say, you know
what, I would like to filter the end results. In order to do that, we use HAVING after
the GROUP BY, and like with the WHERE clause, you have to specify a condition. We have
the following condition: we want to see in the results only the countries whose total
score is higher than 800. You might notice something here. With the GROUP BY we are
using the country, the column whose values we are grouping the data by, but with the
HAVING we are using the aggregated column, the SUM of the score. So this is how the
syntax works. Now let's see how SQL is going to execute it. As usual, SQL starts with
the FROM: we are getting our data. Then, in the second step, it aggregates the data by
the country. Like before, it groups the rows with the same value of the country, so we
get one row for each country. This is what happens if you use GROUP BY, and with that
we now have aggregated values, right? After the GROUP BY, SQL executes the HAVING.
HAVING is like a filter. We have a condition, the total score must be higher than 800,
and SQL checks the new results after the aggregation. For Germany we have a total
score of 850, so it meets the condition and stays in the results. The same thing for
USA: 900 is higher than 800 as well. But the UK does not meet the condition, because
750 is not higher than 800, so SQL filters out this row. That means after applying the
HAVING we get only two countries, because only they have values that fulfill the
condition. And that is what happens if you are using HAVING: it is simply filtering
the data.
But now you might be confused. You say, you know what, we have already used the WHERE
clause to filter the data, so why do we have another clause in SQL to filter my data?
Can't we just use the WHERE? Well, in SQL there are different ways to filter your
data, depending on the scenario. So let's add both of the filters to the query. We are
already using the HAVING after the GROUP BY, and now let's add the WHERE. The WHERE
comes between the FROM and the GROUP BY, so directly after the FROM. Here we are
saying the score must be higher than 400. So now we are filtering based on the scores
twice, right? With the WHERE we are saying the score must be higher than 400, and with
the HAVING we are saying the SUM of the score must be higher than 800. So what is the
big difference? It is when the filter happens. If you want to filter the data before
the aggregation, so you want to filter the original data, then you use the WHERE
clause. But if you want to filter the data after the aggregation, after the GROUP BY,
then you use HAVING. So it's really all about when the filter happens. Let's see how
SQL is going to execute this. As usual, first the FROM is executed to get the data.
After that, in the second step, the WHERE is executed. This is our first filter.
So SQL filters the data using WHERE before doing any aggregations. Based on our
condition, the first customer is filtered out because the score is less than 400, and
the same thing for the last customer. After applying the WHERE clause we get only
three rows, only three customers. Next, SQL executes the GROUP BY, so SQL groups the
data by the country. Now we have less data to be combined, and the values will not be
added up, because we have only one row for each country. After the data is aggregated
by the GROUP BY, SQL activates the second filter, the HAVING. So the next step is to
execute the HAVING, and here SQL filters the new results based on the total scores.
SQL checks them one by one. USA meets the condition. UK is filtered out because it is
not higher than 800. And this time Germany is filtered out as well, because this time
it does not fulfill the condition. In the previous example, without the WHERE, we had
more scores for Germany, and that's why it passed the test. But this time, since we
filtered out customers using the WHERE, Germany does not have enough score left to
pass the second filter. So in the output we get only one row, and that's because we
are filtering a lot of data. So it is very simple: WHERE is executed before the GROUP
BY, before the aggregations, and HAVING is executed after the GROUP BY, after the
aggregations. Now let's go back to SQL in order to practice.
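
The two-filter walkthrough above can be sketched as one query. The thresholds (400 and 800) come from the explanation; the column names, the alias, and the inline sample rows are assumptions.

```sql
-- Assumed sample data: five customers across three countries.
WITH customers (id, country, score) AS (
    SELECT 1, 'Germany', 350 UNION ALL
    SELECT 2, 'USA',     900 UNION ALL
    SELECT 3, 'UK',      750 UNION ALL
    SELECT 4, 'Germany', 500 UNION ALL
    SELECT 5, 'USA',       0
)
SELECT
    country,
    SUM(score) AS total_score
FROM customers
WHERE score > 400          -- filter 1: original rows, before aggregation
GROUP BY country
HAVING SUM(score) > 800;   -- filter 2: grouped rows, after aggregation
-- WHERE keeps the rows with 900, 750 and 500.
-- Totals are then USA 900, UK 750, Germany 500, so only USA passes HAVING.
```

Removing the WHERE line brings Germany back into the result, because its total is then 850 again.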
Okay. So now we have a very interesting task: find the average score for each country,
considering only customers with a score not equal to zero (so that sounds like a
condition), and return only those countries with an average score greater than 430 (so
this is another condition). I know there is a lot going on, so let's do it step by
step. Usually I start with a very simple SELECT statement with the columns and data
that I need. So what do we need here? We need the score and we need the country. So
all we need is two columns. I'm also going to select the ID, just to see the customer
ID. Then let's get the country and the score from our table customers, and query that.
As you can see, I start with the basics: query the data, and then build the next steps
on top of it. Now what do we have in the task? We have to find the average score for
each country. That means we have to do an aggregation. And here we have two
conditions. The first condition says we need only the customers with a score not equal
to zero. The second one says we need only the countries with an average score greater
than 430. Now you have to decide for each condition whether you're going to use WHERE
or HAVING. For the first one, we want to filter based on the scores, so that means we
want to filter before the aggregation. It's not saying the average score, it's saying
the score itself, so we can use a WHERE condition for this. The second one says
countries with an average score greater than 430. That means we want to filter the
data after aggregating the score, so for this condition we have to use HAVING. Now
what I would like to do first is implement the first condition. It's very simple.
After the FROM, we say WHERE the score is not equal to zero. Let's go and execute it.
And with that, we no longer have any customers where the score is equal to zero.
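
A sketch of this first step, before any aggregation is added. The column names and the inline sample rows are assumptions reconstructed from the scores mentioned earlier.

```sql
-- Assumed sample data: five customers, one of them with a score of zero.
WITH customers (id, country, score) AS (
    SELECT 1, 'Germany', 350 UNION ALL
    SELECT 2, 'USA',     900 UNION ALL
    SELECT 3, 'UK',      750 UNION ALL
    SELECT 4, 'Germany', 500 UNION ALL
    SELECT 5, 'USA',       0
)
SELECT
    id,
    country,
    score
FROM customers
WHERE score != 0;   -- <> is the standard spelling of "not equal"
-- Four rows remain (ids 1 to 4); the customer with score 0 is removed
```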
So with that we have solved this part. Now, for the second condition, we first have to
do the aggregation. We're going to start with the average score: we use the AVG
function on the score, and we're going to call it average score. We don't want to see
only the overall average score, we want to see the average score for each country.
That means we have to aggregate by the country, and for that we use GROUP BY. GROUP BY
always comes after the WHERE clause. So GROUP BY, and which column? It's going to be
the country. Now there is an issue here. You cannot execute it like this. We have to
get rid of the ID, because it is neither aggregated nor part of the GROUP BY, and we
don't need it at all. Let's go and execute it. With that we have the average score for
each country, so the aggregation and the first condition are completed. Now we're
going to talk about the last part: the average score must be higher than 430. For that
we're going to use HAVING, and HAVING comes after the GROUP BY. Now we need to specify
the condition. It must use the aggregated column, so we take the average of the score
from the SELECT, put it after the HAVING, and say it should be greater than 430. So
that's it. With that we have the last part as well.
that we have the last part as well. Let's go and execute it now. And with
Let's go and execute it now. And with that my friends we have filtered the
that my friends we have filtered the data after the aggregation. So this is
data after the aggregation. So this is how I decide between the where and
how I decide between the where and having. It is very simple. All right. So
having. It is very simple. All right. So with that you have learned how to filter
with that you have learned how to filter the aggregated data using the having.
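
The finished query from this task might look like the sketch below. The table name customers and the columns country and score follow the narration; the alias avg_score is an assumption, since the transcript only calls it "average score".

```sql
-- Average score per country, ignoring customers with a score of 0,
-- keeping only countries whose average is above 430.
SELECT
    country,
    AVG(score) AS avg_score   -- alias name is an assumption
FROM customers
WHERE score != 0              -- row filter, applied before the aggregation
GROUP BY country
HAVING AVG(score) > 430;      -- group filter, applied after the aggregation
```

The HAVING line repeats the AVG(score) expression rather than the alias, which matches the step of taking the average score from the SELECT and putting it after the HAVING.
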
And now, next, we're going to go back to the top of the query, where we can use
the keyword DISTINCT directly after the SELECT. So let's go now and learn about
the DISTINCT.

Okay. So what exactly is DISTINCT? If you use it in SQL, it's going to go and
remove duplicates in your data. Duplicates are repeated values in your data, and
DISTINCT is going to make sure that each value appears only once in the results.
So it sounds very simple, and the syntax is easy as well. As usual, we always
start with a SELECT, but directly after the SELECT we use the keyword DISTINCT.
So there is nothing between them. And then the normal stuff: we specify the
columns and then the FROM in order to get the data from the table. Let's say that
I would like to get a list of unique values of the country. So the first thing
that SQL is going to do, of course, is to get the data from the database using
the FROM. And now the second step is the SELECT. So SQL is going to execute it
and select only one column, the country. All other columns are going to be
excluded and removed from the results. And now SQL is going to go to the third
step. It's going to go and apply the DISTINCT on the country values. So it acts
like a filter that makes sure each value happens only once. So it's going to
start with the first value, Germany. Now it's going to look at the results. Do we
have Germany? Well, we don't have anything yet. So that's why it's going to
include it in the results. Then the next value is going to be USA. The same
thing: we don't have USA in the results, so it's going to go and include it. And
this happens as well for the UK. We don't have UK in the final results; that's
why it's going to be included as well. Now comes Germany again. Now it's going to
say, wait, we have it already. So it will not go and add it again to the output,
because it must appear only once. So we will not have Germany twice. And the same
for the last value, the USA: we have it already in the results, that's why it
will not appear again. And with that, we have removed the duplicates, or the
repetition, inside our data. So each value is unique.
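
The walkthrough above can be tried without any table, as a minimal sketch. The five values and their order come from the narration; the derived-table name sample_rows is made up for illustration.

```sql
-- Inline sample rows in the same order as the walkthrough.
SELECT DISTINCT country
FROM (VALUES
    ('Germany'),
    ('USA'),
    ('UK'),
    ('Germany'),
    ('USA')
) AS sample_rows(country);   -- hypothetical name for the inline rows
```

Each country should appear only once in the result. The order of the rows is not guaranteed unless an ORDER BY is added.
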
Now let's go back to SQL. Okay, the task is very simple. It says: return a unique
list of all countries. So let's go and do that. It's going to be fun. So SELECT,
and now let's get the column country from our table customers, like this. Now you
can see we have a list of all countries, but the task says we need a unique list.
So that means I cannot have repetitions inside it. And for that we're going to
use the very nice DISTINCT. So if you do it like this and execute, you will see
there will be no duplicates in your results, and all the values in the result are
going to be unique. So with that we have solved the task. It's very simple. Now,
there is one thing about the DISTINCT: I see a lot of people using it in cases
where it's not really necessary. So for example, let's go and get the ID. Now if
you go and execute it, you can see here we have a list of all IDs and there are
no duplicates. But now if I go and remove the DISTINCT and execute it, we will
get the same results, because the IDs are usually unique. So it really makes no
sense to go and say DISTINCT, because, as you can see, the database has to go and
make sure each value happens only once. So that's extra work for the database,
and it is usually an expensive operation. So if your data is already unique,
don't go and apply DISTINCT. Only if you see repetitions and duplicates and you
don't want to see them, only in this scenario, go and apply the DISTINCT. Don't
go blindly applying DISTINCT to each query just in case there are duplicates.
This is usually bad practice.
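
One way to check whether a column is already unique before reaching for DISTINCT is to compare two counts. This is only a sketch and not something shown in the video; it assumes the customers table has an id column with no NULL values.

```sql
-- If both counts match, id holds no duplicates and DISTINCT adds nothing.
SELECT
    COUNT(*)           AS total_rows,     -- every row in the table
    COUNT(DISTINCT id) AS distinct_ids    -- unique non-NULL id values
FROM customers;
```
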
Okay. So that's all for DISTINCT. Okay, my friends. So with that you have learned
how to remove the duplicates using the DISTINCT. In the next step we're going to
talk about another keyword that you can use together with the SELECT. You can use
TOP in order to limit your data. So now let's go and understand what this means.

Okay. So what exactly is TOP, or, as we call it in other databases, LIMIT? It is
again some kind of filtering in SQL. If you use it, it's going to go and restrict
the number of rows returned in the results. So you have control over how many
rows you want to see in the results. The syntax is very simple as well. Directly
after the SELECT you're going to use the keyword TOP, and then you specify the
number of rows you want to see in the results, for example three. Only after that
do you specify the columns that you want, and then from which table. Now let's
see how SQL is going to execute it. As usual, the FROM is going to be executed
and we will get our data, and then the second step is going to go and select the
columns. In this case all the columns are going to stay. And then, after that,
it's going to execute the TOP. So how does it work? It's very simple. Each row in
the result has a row number. It has nothing to do with your data, with the IDs.
For example, here in the current result we have row numbers 1, 2, 3, 4, 5. Those
numbers are not your actual data. It is something technical from the database. So
it is not equal to the IDs; the IDs are actually your content, your data. So here
we are not filtering based on the data but based on the row numbers. Since we
have defined three here, SQL is going to count: okay, row number one, two, three,
and that's it. So it's going to make a cut, and all the rows after number three
will be excluded from the results, and you will get only the three rows in the
results. So now, as you can see, this type of filtering is not based on a
condition or something; it's just based on the row numbers. So whatever results
you have in your data, it will go and make a cut at a specific row. So let's go
to SQL and practice that. Okay. So now we have a very simple task. It says:
retrieve only three customers. So let's go and do that. We're going to go and
SELECT star from our table customers and execute it. Now, as you can see in the
output, we have five customers. But the task says we want only three. And there
are no specifications at all about any condition. So I don't have to go and make
a WHERE clause where we write a condition based on our data. We just want three
customers. So we can do that very simply by just adding TOP directly after the
SELECT and then specifying the number of rows you want to see in the output. So
SELECT TOP three and then the star. Let's go and execute it. And with that we are
getting three customers. That's it. It's very simple. All right.
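
As a minimal sketch, the query from this task, plus the LIMIT spelling that the narration mentions for other databases:

```sql
-- SQL Server: keep only the first three rows of the result.
SELECT TOP 3 *
FROM customers;

-- Databases that use LIMIT instead (for example PostgreSQL, MySQL, SQLite)
-- put the row count at the end of the query:
-- SELECT * FROM customers LIMIT 3;
```

Without an ORDER BY, the database is free to decide which three rows come first, so this form only fits when any three rows will do.
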
Now moving on to another task. It says: retrieve the top three customers with the
highest scores. Now, of course, this is a mix between ordering the data and
filtering the data, right? We usually sort the data by the scores from the
highest to the lowest, but now we are doing both together. So let's do it again
step by step. I will just go back to the SELECT star from customers. Now, what we
can do is go and sort the data by the score from the highest to the lowest using
the ORDER BY. So ORDER BY score and then descending. So let's go and execute it.
And now you can see the first customer is the one with the highest score, and
then the second highest, and so on. Now, I think you already got it: in order to
get the top three customers with the highest scores, what you have to do is just
go over here and say TOP three and execute it. And with that you now have a
really nice analysis on your data. It's like a report where we are finding the
top customers with the highest score. So this is really amazing and very easy. So
as you can see, by mixing the TOP with sorting the data you can make top-N
analyses or bottom-N analyses. So let's have this task: retrieve the lowest two
customers based on the score. So now we want to get the lowest scores in our
table. And doing that is very simple. What we're going to do is flip that. So
we're going to sort our data based on the scores ascending, from the lowest to
the highest. And since we want only the lowest two customers, we're going to
replace the three with a two and execute it. And with that, we're going to get
the lowest two customers. It is Peter and Maria. They have the lowest scores.
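
The two tasks above can be sketched as follows, assuming the customers table has a score column as in the narration:

```sql
-- Top-N: the three customers with the highest scores.
SELECT TOP 3 *
FROM customers
ORDER BY score DESC;

-- Bottom-N: the two customers with the lowest scores.
SELECT TOP 2 *
FROM customers
ORDER BY score ASC;
```

Only the sort direction and the number after TOP change between the two queries.
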
Again, it's very easy. Okay, this is fun. Let's go to the next one: get the two
most recent orders. Well, this time we are speaking about another table. Let's go
and select everything from the table orders, like this. So now, as you can see,
we have four orders here, and we want the two most recent orders. Most recent
means we have to deal with the order dates, and we can build that by sorting the
data by the order date. So ORDER BY order date, and since we are saying the most
recent orders, so from the highest date to the lowest, that means descending,
right? Let's go and execute it. And now we can look at our result: this is the
last order in our business based on the order date, and this one is one of the
earliest orders. So with that we have sorted the data, and since we want the two
most recent orders, we go directly after the SELECT and say TOP two and execute.
And with that we now have the last two orders in our business. So as you can see,
by combining the TOP with the ORDER BY you can do amazing analysis.
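
A sketch of the query for the two most recent orders. The table name orders comes from the narration; the column name order_date is an assumption, since the transcript only says "order date".

```sql
-- The two most recent orders: newest date first, then keep two rows.
SELECT TOP 2 *
FROM orders
ORDER BY order_date DESC;   -- column name is an assumption
```
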
All right, so this is how you limit your data using TOP. And with that you have
learned the basics: all the clauses, the sections, that you can use in any query
in SQL. Now, next, what we're going to do is put everything together in one query
in order to learn how SQL is going to deal with all those clauses and how SQL is
going to execute it. So let's go and do that.

Okay. So now I'm going to show you the coding order of a query compared to the
execution order that happens in the database. The coding order of a query always
starts with a SELECT, and then directly after that you can put a DISTINCT, and
then after the DISTINCT you can put a TOP. So this is the order of all those
keywords. And then you can go and select a few columns, and after you specify the
columns, separated with commas, you tell SQL which table your data comes from
using the FROM clause. Now, after that, if you want to filter the data before the
aggregation, you can use the WHERE clause, and this always comes directly after
the FROM. And if you want to group the data, then you have to do it after the
WHERE clause using the GROUP BY, and after the GROUP BY comes the HAVING, if you
want to filter the aggregated data. And the last thing that you can specify in a
query is always the ORDER BY. So this is the order of all those components of the
query. And if you don't follow this order, you will get an error from the
database. Now, if you look at this query, there are a lot of things that are
going to filter your data. So let's check them one by one. The first thing that
you can do is filter the columns. If you don't want to see all the columns, if
you want to see only specific columns, you use the SELECT, and of course you must
use it. So the columns that you specify will be shown in the results. So it's
like filtering the columns. Now, there is another type of filter where you filter
out the duplicates, if you want to see unique results, and that's using the
DISTINCT. So this is another type of filter. Moving on, we can filter the result
based on the row numbers. So we can limit the result using the TOP. But this type
of filter doesn't need any conditions. It's purely based on the row number in the
results. Now, moving on, if you want to filter your data based on conditions on
your data, you can filter the rows before the aggregation using the WHERE clause.
And the last type of filtering: you can filter your rows after the aggregation
using the HAVING.
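
The coding order and the five kinds of filtering described above can be sketched in one query. The exact query on the slide is not in the transcript, so the table, columns, alias and thresholds below are reused from the earlier tasks as assumptions.

```sql
SELECT DISTINCT TOP 2          -- DISTINCT filters duplicates, TOP filters by row number
    country,                   -- the SELECT list filters the columns
    AVG(score) AS avg_score    -- alias name is an assumption
FROM customers
WHERE score != 0               -- filters rows before the aggregation
GROUP BY country
HAVING AVG(score) > 430        -- filters groups after the aggregation
ORDER BY avg_score DESC;       -- always written last
```

DISTINCT is redundant here, because GROUP BY already returns one row per country; it is included only to show where the keyword sits.
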
So as you can see, we have five different ways to filter the results in SQL. So
now let's see the execution order. As we learned, the first thing that's going to
happen is that SQL is going to execute the FROM clause. So SQL is going to go and
find your data in the database, and all the next steps are going to be based on
this data. Now, the next step is that it's going to go and filter the data using
the WHERE clause. This has to happen before anything else. So before any
aggregations and so on, we have to define the scope of the data. So once SQL
applies it, maybe some of the rows are going to be removed. And once the data is
filtered, in the third step SQL is going to execute the GROUP BY. So it's going
to take the results and start combining the similar values into one row and start
aggregating the data based on the aggregate function that you have specified. So
now, after the GROUP BY, after aggregating the data, it's going to go and apply
the second type of filter, the HAVING. So based on the condition, SQL is going to
go and remove some of the aggregated rows and keep the rest. Now moving on to
step number five. Finally, it's going to go and execute the SELECT DISTINCT. So
SQL is going to go and start selecting the columns that we need to see in the
results and remove the other stuff. And once the columns are selected, SQL is
going to go and execute the ORDER BY. So SQL is going to start sorting the data
based on the column that you have specified, and the mechanism (ascending or
descending) as well. So the data will be sorted differently. And, my friends, the
last step that is going to happen in your query will always be the TOP. So based
on the final results, SQL is going to go and execute the TOP. Here we are saying
TOP two; that means we want to keep only the first two rows, without any
conditions. So SQL is going to count: okay, row number one, two. And after that
it's going to make a cut and remove anything after that. So this is the last
filter that's going to happen, and the last step as well.
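
The seven steps just described can be marked on a query as a sketch. The query itself is an assumption assembled from the earlier tasks, since the one on screen is not in the transcript; the step numbers follow the narration.

```sql
SELECT DISTINCT TOP 2          -- step 5: SELECT DISTINCT, step 7: TOP
    country,
    AVG(score) AS avg_score    -- alias name is an assumption
FROM customers                 -- step 1: FROM
WHERE score != 0               -- step 2: WHERE
GROUP BY country               -- step 3: GROUP BY
HAVING AVG(score) > 430        -- step 4: HAVING
ORDER BY avg_score DESC;       -- step 6: ORDER BY
```

One consequence of this order, at least in SQL Server: an alias defined in the SELECT list (step 5) can be used in the ORDER BY (step 6) but not in the WHERE or HAVING (steps 2 and 4), which is why the HAVING line repeats AVG(score).
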
well the last step. So now if you sit back and look at this the coding order
back and look at this the coding order is completely different than the
is completely different than the execution order in the coding we have
execution order in the coding we have first to specify the select actually the
first to specify the select actually the select going to be executed just almost
select going to be executed just almost at the end. So at the step number five
at the end. So at the step number five and once you understand how SQL execute
and once you understand how SQL execute your query you can understand how to
your query you can understand how to build correct
queries.

So now, the first thing we have learned is that we can have one query, something like this: SELECT * FROM customers. This is one query, and in the output we have one result. But did you know that in SQL we can have multiple queries and multiple results in one go? So we can do everything together. For example, let's say I'm also selecting the data from orders. That means we have two queries, and if you go and execute, you will get two result grids. The first result grid is for the first query and the second one is for the second query. So you can run multiple queries in the same window, and the results are split into multiple grids depending on how many queries you have.

Usually in SQL you will find a semicolon at the end of each query, like this. So at the end of the first query we have a semicolon, and at the end of the second query we have another semicolon. For SQL Server it is not a must, but for other databases, if you have multiple queries in one execution, you must separate them with a semicolon. With that the database can understand: okay, this is the end of the first query and this is the end of the second query.
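A minimal sketch of the two-query script described here, assuming the customers and orders tables from the walkthrough. Executed together, each statement should produce its own result grid; the semicolons are optional in SQL Server for this case but mark where each query ends.

```sql
-- First query: its rows appear in the first result grid
SELECT *
FROM customers;

-- Second query: its rows appear in the second result grid
SELECT *
FROM orders;
```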
So you have separations between queries.

Okay, now moving on to another cool thing in SQL. What if we don't want to query the data inside our tables, but we would like to show a static value from us, from the one who is writing the query? This is very practical if you are practicing and you want to check something using a value from you, not from the tables. So how can we do that? It is very simple. We write SELECT, and after that, instead of a column name, you can add any value, like 123. So it is just a number, and we do not specify any table after it. We leave it like this, SELECT 123, and we don't need to use the FROM clause. Now if you go and execute it, you will get 123. So this is a static value. And of course you can rename the column, like static number, and execute it again. So with that we have a static value. And you can add anything, like a string as well. So let's say 'Hello' as static string, for example. Let's go and execute. Now we have two queries, and in the second one you can see our static value, Hello.
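As a small sketch of the static values described here: the aliases static_number and static_string are an assumed spelling of the column names mentioned in the walkthrough. SQL Server accepts a SELECT without a FROM clause; some other databases may need a dummy table instead.

```sql
-- A static number: no table and no FROM clause are involved
SELECT 123 AS static_number;

-- A static string works the same way
SELECT 'Hello' AS static_string;
```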
So in queries we can add values from us, not only select data from the tables. And of course you can mix things, so in one query we can have data from the database and static data from us. Let me show you what I mean. Let's go over here and say SELECT, and let's get, for example, the id and the first name from the table customers, like this. With that we are getting data from the database. But now I can add something from me, 'New Customer', and we can call it customer type. So what is going on here? Two columns from the database and one column from us, the static one. If you go and execute it, you can see that for the id and the first name the data comes from the database, but for each record we always get the same static value: New Customer, New Customer, and so on. So this piece of information comes from the query, it is not stored inside the database, and those two pieces of information come from the data stored inside the database. So this is a really cool thing.
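A sketch of the mixed query described here. The column names id and first_name and the alias customer_type are assumed spellings of the names spoken in the walkthrough.

```sql
SELECT
    id,                               -- stored in the customers table
    first_name,                       -- stored in the customers table
    'New Customer' AS customer_type   -- static value, repeated on every row
FROM customers;
```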
You can add a few pieces of information from you and get the rest of the data from the database. Those are the static values.

Okay, one more cool thing that I want to show you. Say you have a query like this, where you are selecting from a table and filtering the data. Sometimes, as you are writing a query, you don't want to execute the whole thing, you want to execute only a part of it. For example, I would like to see all the customers again in this query, without this filter. So instead of removing the filter, running the query, and then adding it again, you can highlight the part you want, without the filter, and execute. With that the database is going to execute exactly what you highlighted. And now, as you can see, I'm getting all the customers without the filter. And if you don't highlight anything and execute, what's going to happen? It's still going to execute the whole thing inside the editor. This is also really nice if you want to query another table quickly in the same editor. Say we want to select everything from orders, just quickly: you can highlight only this query and execute. With that SQL ignores everything else and executes only what I'm highlighting. This is really nice, it gives us speed and flexibility, and you're going to find me doing that a lot in the course.
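To make the highlighting idea concrete, here is a hypothetical query of the kind described. The transcript does not say which filter is on screen, so the country column and the value 'Germany' are assumptions used only for illustration.

```sql
-- Highlighting only the next two lines and executing returns all customers.
SELECT *
FROM customers
-- Executing the whole statement applies the filter as well.
WHERE country = 'Germany';  -- hypothetical filter, not taken from the transcript
```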
So this is really nice.

Okay, my friends. So with that we have learned the basics of the SQL query, the basic components of the SELECT statement, and with that you can talk to the database in order to get data. Now in the next chapter we're going to learn how to define the structure of our database. So we're going to learn the data definition language, DDL. So let's
go.

Okay. So usually, if you have an empty database, what you want to do is define the structure of your data. One of the first things that we usually do is create new tables. Here we have a command called CREATE, and if you use it you can create a new object inside the database, for example a table. Once you execute it, you get a brand new table, and usually the table is empty, without any data. So it is very simple. This is what the CREATE command does.

And now let's go to SQL in order to create a new table. So my friends, we have the following task: create a new table called persons with the columns id, person name, birth date, and phone.

Okay. So this time we will not start with SELECT, we will start with the command CREATE TABLE. We are telling SQL to create a table, and after that we have to define the name of the table. In this task we have to call it persons. Now we have to open two parentheses, like this, and in between we define the columns.

So what do we need? First we need an id. This is the first column name. Next we have to define the data type for this column. It's going to be an INT, so it is a number and does not contain any characters. Next we can define some constraints. We cannot have a person without an id, so it should not be null: NOT NULL. This is the first column. We have defined the name of the column, the data type, and the constraint.

Okay. So let's go to the second column. Here we're going to have a comma, and the next name is going to be person name. This is the column name. The data type for this column is going to be a VARCHAR, because the person name contains characters. And now we have to define the length, so I'm going to go with 50 characters. I would say this is a must, each person should have a name, so we're going to say NOT NULL as well. With that we have the name, the type, and the constraint.

Now let's move to the third column. It's going to be birth date. Which type of information do we have inside the birth date? It's going to be a date, not a number, not characters, so we're going to go with the data type DATE. And about the constraint, well, it depends. I would say in our application it is optional, because this is very personal information and maybe some persons will not provide their birth dates. So this is optional and I will not say NOT NULL, so nulls are allowed.

Now let's move on to the next one. It's going to be the phone. What is the data type of a phone? Well, sometimes we have numbers, sometimes characters and special characters, so we could have anything. That's why I'm going to go with VARCHAR. Here you can specify the length that you think is okay, I'm going to go with 15. Now of course it depends on the system that you are building, but I would say phones are very important in order to validate whether this is a real person, so we're going to say NOT NULL. We are not allowing nulls in this field. Perfect.
So with that we have covered all the columns that are required, and we have defined the data types as well as the constraints. Now the last thing: each database table should have a primary key, in order to make sure the table has integrity and can be connected to other tables. So what we're going to do is add the primary key constraint. We add a comma after the last column, and then we say CONSTRAINT. Now we have to give the primary key a name. This is only going to be visible to the database, so I'm going to call it PK, for primary key, followed by persons. After that we say PRIMARY KEY, and between two parentheses we pick which column is the primary key. Of course it's going to be the id, so we go over here and say id. So again, we are saying there is a new constraint, this is its name, which is only internal to the database, and then we are saying it is a primary key on the field id. So that's it. With that we have defined a primary key for our table. Let's go and execute it. As you can see, it is successful.
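Putting the walkthrough together, the statement might look like the sketch below. The underscore spellings (person_name, birth_date, pk_persons) are assumptions, since the transcript only gives the spoken names; the types, lengths, and constraints follow the description.

```sql
-- Create the persons table with four columns and a named primary key.
CREATE TABLE persons (
    id          INT         NOT NULL,   -- every person needs an id
    person_name VARCHAR(50) NOT NULL,   -- name is mandatory
    birth_date  DATE,                   -- optional, so NULLs are allowed
    phone       VARCHAR(15) NOT NULL,   -- digits plus special characters
    CONSTRAINT pk_persons PRIMARY KEY (id)
);
```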
Let's go and check our database for our new table. If you don't see it already, you have to right-click on the database and then refresh. So let's go to Tables, and now we have a brand new table called persons. With that we have created our new table.

Now of course, for DDL commands you will not get results or data. All you get is a message from the database, and the message here says that the command completed successfully, and then we have the time when it was completed. So that means a DDL command will never return data. It changes the structure of your database, it's not about retrieving data. This command did change something in our database, in this scenario it created a new table, and that's why we call this the data definition language, DDL: because we are defining the database.

Now of course, if you say SELECT * FROM our new table persons, highlight it, and execute it, you will see that we are getting the columns, so the id, the person name, the birth date, and the phone, but we don't have any rows. That means our table is empty.

Now it is very important that you save this in a SQL script, because maybe later you will have to redefine this table. But let's say that you have written different queries since then and you have lost the script, and now you would like to see the CREATE statement for this table again. Well, there is a trick for that. If you go to the left side, you see persons right here. Right-click on it, and then you have Script Table as, with different options that you can run on the table. The first one says CREATE To, and then let's go to New Query Editor.

So now what happened? The database read the metadata about persons and generated your DDL query, with a lot of extra stuff that we haven't written, but this is the template that the database uses. So now we can see a lot of stuff, but what is interesting is this CREATE TABLE. We can see CREATE TABLE, the schema dbo, the default one, then persons, and then we have our columns, the data types, and the constraints as well. So with that you got back your DDL statement, along with a lot of other stuff about the table that is not interesting right now. What I really need is to see the CREATE statement for this table, and this is how you can get back your DDL command. But of course, what I recommend is to always put your code inside a Git repository and always keep it up to date, so that you can always check your work and extend
it.

Okay. So now, what else can you do with the structure of your database? If you already have a table, you can edit and change the definition of the table. For example, let's say I would like to add a new column. In order to do that, we can use the command ALTER. ALTER means you want to edit the definition of your table and change it, like adding a new column, or maybe changing a data type, or anything else in the definition of the table. So you can use the ALTER command in order to change the definition of your table. And now let's go back to SQL and try to change something.

All right. Now the task says: add a new column called email to the persons table. It is very simple. We can use the ALTER TABLE command. We are not creating a new table, we want to edit an already existing table. So which table do we want to modify? It's going to be persons. We are telling SQL we want to change something in the table persons. And of course we have to tell SQL what we want to change. Are we removing a column? Are we adding a column? In this scenario we want to add a new column, so let's add the email. This is the column name, and just as when you are creating a table, you have to define the column name, the data type, and the constraint. For emails we're going to have characters, numbers, and special characters, so we're going to go with VARCHAR, and the length is going to be, let's say, 50. And I'm going to say each person has to have an email, so it's going to be NOT NULL. With that we are adding a completely new column. So that's it.
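The statement described here might look like the following sketch in SQL Server syntax, which is an assumption based on the tool used in the course. Adding a NOT NULL column without a default like this is expected to work only while the table is still empty, as it is at this point.

```sql
-- Add a new mandatory column to the existing persons table.
ALTER TABLE persons
ADD email VARCHAR(50) NOT NULL;
```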
Let's go and execute it. Now again, this is not a query. This is a DDL command, and in the output we will not get data. We will get a message telling us whether everything went correctly. It says the command completed successfully, and the time when it was completed. Now we can do a simple query just to check the table. And now you can see we have our columns, and at the end we have a new column called email.

This is very important: if you are adding a new column, it's always going to be at the end of the table. But now you might say, you know what, I would like to have the email somewhere in the middle, maybe after the person name. Well, in order to do that, you have to drop the table completely and create it from scratch using the CREATE command, which might be bad if you have data inside the table. So if you are fine with adding your new column at the end, you can use ALTER TABLE. But if you say, I would like it in the middle, then sadly you have to drop everything and start from scratch.

Okay. So now let's have another task, and it says: remove the column phone from the persons table. Now we're going to do exactly the opposite. We're going to remove it completely, with its data, from the table. We still say ALTER TABLE persons, so we are saying we want to edit the definition of the table persons. And now, instead of adding, we will be dropping a column. After that we have to specify the column name, which is going to be phone. But we don't have to mention the data type and the constraint again. That's because the database already knows this information. We need that information only if we are creating something new, which is why we can leave it out here. We just need the column name and the database is going to do the rest. So let's go and do that. Now you can see it was successful. Let's go and check our table. As you can see, we have the id, person name, birth date, and email, and we don't have the column phone. Be careful: if you are deleting a column, you will be losing all the data inside this column as well. So as you can see, this is very simple. This is how we can edit the definition of our table by adding and removing columns.
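A sketch of the column removal described here, again assuming SQL Server syntax. Only the column name is needed; the data in that column is lost together with the column.

```sql
-- Remove the phone column (and all of its data) from persons.
ALTER TABLE persons
DROP COLUMN phone;
```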
Okay, now moving on to the last one in this group of commands. So far we have created something new in the database, and we have changed the definition of something inside our database. And now the last one: you can drop something from the database. Let's say we have another table and we don't need it anymore. We can use the DROP command in order to remove the table completely from the database. This also means removing everything: the table and the data inside it. So now let's go to SQL and drop something from our database.

Okay. So now our task says: delete the table persons from the database. This is the simplest form of command in SQL, and yet the most risky one. So what do we need? We have to drop the whole table persons, we don't need it anymore. We're going to say DROP TABLE, and then all we have to do is give the name of the table, persons. So three words. You don't have to specify anything else. Just destroy the table persons. Let's go and execute it.
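The three-word command described here, as a sketch. It removes the table definition and every row in it, so it is worth double-checking the table name before executing.

```sql
-- Remove the persons table and all of its data from the database.
DROP TABLE persons;
```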
It is successful. So as you can see, it is very simple. Now on the left side, refresh your database and go to Tables, and you will not see the table persons. So the DROP command is very simple, and yet very risky. If you compare CREATE TABLE with DROP TABLE, you can see that destroying things is way easier than building them.

Those are the commands CREATE, ALTER, and DROP. We use those commands in order to define the structure of our database: the DDL commands. That was very simple. All right, so that's all about the data definition language, DDL, and with that you have learned how to define new stuff in your database. Now moving on to the next one, we're going to learn about the data manipulation language, and here we're going to learn how to manipulate our data inside the database. Let's
go all right so now what we're going to do we're going to go and modify and
do we're going to go and modify and manipulate your data inside the
manipulate your data inside the database. So now sometimes what happens
So now, sometimes you have a table inside your database and the table is empty.
You don't have any rows, any data, inside the table. In order to add your data
to the table, you can use the INSERT command. INSERT is going to add new rows
to your table, and of course the table doesn't have to be empty for you to add
data. You can add new rows to already existing data, and SQL is going to append
them to the existing rows. Now, my friends, in order to insert new data into
the target table there are two methods. The first, classical way is to use the
INSERT command and manually specify the values that should be inserted into the
table. So you start specifying the values in the script, and then they are
inserted as new rows into the target table. In this process you are manually
inserting new values into the table using an SQL script. So now we're going to
focus on this scenario of how to insert data.
All right. Now let's quickly check the syntax of the INSERT command. It starts
with the keywords INSERT INTO, and after that we have to specify the table
name, so where we want to insert. Then we make a list of all the columns that
we want to insert values into. After that we say VALUES, and finally we specify
the data that should be inserted into the table, and we write it as a list as
well, like we have done for the columns. Now, in the INSERT statement,
specifying those columns is totally optional. If you don't specify the columns
of the table, then SQL is going to expect you to insert a value into each
column. Sometimes, of course, we don't want to insert a value for each column,
and then you list only the columns you need and skip the rest. But if you want
to insert a value for each column, you can either specify all of them as a list
or skip the list. Now for INSERT statements there is a very important rule: the
number of columns and values must match. So if you specify three columns here,
then you must insert exactly three values as well. This must match. And one
last thing about the syntax: you can insert multiple rows in one go. For each
row you specify a list of values that must be inserted.
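The syntax described above can be sketched as follows. This is a sketch under assumptions: the identifiers (id, first_name, country, score) and their types are guesses based on the column names spoken in the video, and the sample rows are purely illustrative.

```sql
-- Assumed setup for a scratch database; names and types are guesses.
CREATE TABLE customers (
    id         INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

-- Column list followed by VALUES: four columns, so four values.
INSERT INTO customers (id, first_name, country, score)
VALUES (1, 'Maria', 'Germany', 350);

-- Several rows in one statement: one parenthesized value list per row,
-- separated by commas. Every list must have the same number of values.
INSERT INTO customers (id, first_name, country, score)
VALUES
    (2, 'John', 'USA', 900),
    (3, 'Georg', 'UK', 750);
```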
So that's all about the syntax. Let's go back to SQL to practice the INSERT
command. Okay. So now let's insert new customers. It's very simple. It starts
with INSERT INTO, so we are saying we want to insert data into something, and
we have to specify the table name, customers. After that we have to specify the
list of columns that we want to insert data into. What we can do is check which
columns we have inside our table. We can see we have ID, first name, country
and score, and we can make a list of those: ID, first name, country and score.
So we just have a list of all the columns inside our table customers. Now what
do we need? We need the values, so which data should be inserted. We open a
pair of parentheses, and now we have to specify an ID. We know the last
customer was five, so we're going to go with customer six. Now we have to give
the name of the customer. Let's go for Anna. And then a country. Let's go for
USA. And this customer has no score. So what can we do? We can say NULL. NULL
means nothing: we don't know the score of this customer. So with that you can
insert one row. But now let's say that I would like to insert a second row, one
more customer. What we can do is separate this with a comma, and then we can
repeat the whole thing again.
So the ID is seven. Let's call this customer Sam, and we don't know the country
of this customer, so we're going to say it's NULL. But the score we know
already. It is 100. So as you can see, we are adding a value for each of those
columns. And if you don't know the answer, then make it NULL, if the database
allows it to be NULL. Some columns are not allowed to be NULL, like the primary
key. So if you say NULL over here for the ID, the database will not allow it.
Well, actually we can test it. Let's execute. And you can see: you cannot
insert the value NULL into the column ID. So this is not allowed.
So we're going to put the seven back. But for the other columns it is allowed.
You can check the definition of the table. Now we execute. The output of a
modification command always indicates what happened to the data. It says two
rows affected. Affected might mean inserted, updated or deleted, so you get a
general statement from the database, but you do get how many records were
affected. We got two because we have inserted two records. So as you can see,
it's not like a query. We are not getting any data in the output. We are just
getting a message. This is a big difference between querying the data using
SELECT and modifying the data using INSERT. We are now making direct
modifications to the data inside our database. Of course, if you want to see
the data in customers, we can query the data, right? So, let's do that. SELECT
star FROM customers. I would like to see the whole table. So, mark it and
execute it. Now you can see we have seven customers. So we just manipulated our
data. We have Anna and Sam here. This is how you can insert data into the
database.
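The demo above, written out as a sketch. The values for customers 6 and 7 come from the walkthrough; the table definition is an assumed setup for a scratch database, with guessed identifiers and types.

```sql
-- Assumed setup for a scratch database; names and types are guesses.
CREATE TABLE customers (
    id         INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

-- Two new rows in one statement; NULL marks a value that is not known.
INSERT INTO customers (id, first_name, country, score)
VALUES
    (6, 'Anna', 'USA', NULL),
    (7, 'Sam', NULL, 100);

-- Rejected variant, kept as a comment so the script still runs:
-- id is the primary key and does not accept NULL.
-- INSERT INTO customers (id, first_name, country, score)
-- VALUES (NULL, 'Sam', NULL, 100);

-- INSERT only reports how many rows were affected; query to see the data.
SELECT * FROM customers;
```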
Now, there are a few rules you have to be careful about as you are inserting
new data into your tables. You have to pay attention that the order of the
columns that you have defined in the INSERT matches the values that you are
inserting. Let's have an example. I'm going to remove this part over here, and
let's say that we are inserting a new customer, number eight. Now in the first
name, instead of the name of the customer, we have inserted the country, like
USA, and in the country we have inserted the name. It's just a mistake, and we
are all human, right? So let's have a name like Max. Now if you execute it, the
database accepts it, because it is really hard for the database to understand
that you have made an error here. Both of them are VARCHAR, and the database
doesn't care about the content of the data as long as you are following the
rules of the data type. So now if you select the data from customers, you can
see we have a customer called USA from the country Max. SQL is going to insert
the data blindly as long as you are following the data type rules and the
constraints. For example, if you made the error over here and said the ID is
Max and the first name is, let's say, nine, and you executed it, here the
database is smart enough to say: you know what, there is something wrong, the
ID should not be a string. So the database is going to reject your insert. Be
careful with the order of your columns. Now let's query our table again.
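A sketch of the two mistakes described above. The table definition is an assumed setup with guessed identifiers and types, and the NULL used for the score is an assumption, since the video does not state that value.

```sql
-- Assumed setup for a scratch database; names and types are guesses.
CREATE TABLE customers (
    id         INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

-- Accepted, but wrong: name and country are swapped. Both columns are
-- VARCHAR, so the database has no way to notice the mistake.
INSERT INTO customers (id, first_name, country, score)
VALUES (8, 'USA', 'Max', NULL);

-- Rejected by engines that enforce column types strictly, kept as a
-- comment so the script still runs: 'Max' is not a valid INT for id.
-- INSERT INTO customers (id, first_name, country, score)
-- VALUES ('Max', 9, NULL, NULL);

-- Shows the stored mistake: first_name 'USA', country 'Max'.
SELECT id, first_name, country
FROM customers;
```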
Now, if in the INSERT command you are defining all the columns exactly like the
table, so as you can see we have a complete match here, ID, first name, country,
score, all the columns and in the correct order, there is a lazy way. You can
remove the whole column list, and with that the database understands that you
are inserting a value into each column, in the table's column order. So let's
do that correctly: nine, and here let's say we have someone from Germany. If
you execute it, it works even though we didn't define the columns, and that's
because the values that we are inserting have exactly the same number of
columns as the table and follow the rules as well.
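A sketch of the form without a column list. The table definition is an assumed setup with guessed identifiers and types. The name 'Andreas' and the NULL score are hypothetical placeholders, because the transcript gives only the ID 9 and the country Germany.

```sql
-- Assumed setup for a scratch database; names and types are guesses.
CREATE TABLE customers (
    id         INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

-- No column list: one value for every column, in the table's column order
-- (id, first_name, country, score). 'Andreas' and NULL are placeholders.
INSERT INTO customers
VALUES (9, 'Andreas', 'Germany', NULL);
```

This form depends on the table's column order, so a statement written this way can break or silently misplace values if the table definition changes later.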
Now moving on to the next one: you can list only two columns in the definition.
Say you already know that the country and the score are always NULL, and you
know only two pieces of information, the ID and the name. Then you don't always
have to say NULL, NULL, NULL and so on. We can skip that. Okay. So now let me
show you what I mean. After the table name we're going to define only two
columns, the ID and the first name. That means we are telling SQL we want to
insert into only two columns. And now you have to be careful. If you define two
columns here, then there should be two values as well. So we're going to remove
the country and the score, and we add only two pieces of information: 10, and
here for example Sara. If you execute it, it works. And now what is SQL doing
with the other two columns? They are going to be NULL. So let's select from our
table again. You can see here that Sara has NULL in the country and in the
score as well, because we didn't define that information.
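A sketch of the partial column list from this step. The table definition is an assumed setup with guessed identifiers and types; the values 10 and 'Sara' come from the walkthrough.

```sql
-- Assumed setup for a scratch database; names and types are guesses.
CREATE TABLE customers (
    id         INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

-- Only two columns are listed, so exactly two values are supplied.
INSERT INTO customers (id, first_name)
VALUES (10, 'Sara');

-- country and score were skipped, so they are stored as NULL here.
SELECT id, first_name, country, score
FROM customers
WHERE id = 10;
```

A skipped column receives NULL only because this assumed table defines no default for it; with a DEFAULT in the table definition, the default value would be stored instead.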
But be careful, you cannot skip a column here that is not allowed to be NULL.
You always have to have in your list all the columns that are NOT NULL and have
no default value. For example, I cannot insert only the first name. I will get
an error because the database will try to insert a NULL into the ID, and this
is not allowed. So you can skip only nullable columns. All right, my friends.
So that was the first method of inserting data into your target table: as you
saw, by manually typing the values inside an INSERT command using VALUES.
And now let's move to another method. We're going to insert data, but this time
not manually. We're going to insert data using another table. Imagine we have
the following scenario. We have an already existing table with data, and this
is going to be the source table, the source of your data. And we have another
table. This table is empty, and we want to insert new data into this target
table. What we can do is take the data from the source table and insert it into
the target table without manually writing the script for the values. So we are
moving, or more precisely copying, the data from one table to another. In order
to do that we need two steps. In the first step we write an SQL query using
SELECT, FROM and so on in order to select the data that we need from the source
table. Once you do that you get a result. This is like doing a normal query:
you write SELECT and you get an answer with the results. In the next step we
take this result and use an INSERT command to insert it into the target table.
And with that we have moved the data from the source table to the target table.
So first write the query on the source table, and in the second step use an
INSERT to move the result to the target table.
So let's go back to SQL to do that. Now we have the following task, and it
says: insert data from the table customers into the table persons. That means
the source table is customers and the target table is persons. How I usually do
it is that I keep my eye on the target table to understand its structure, and I
start writing the query from the source table. If you go to the left side, we
can see we have an ID, a person name, a birth date and a phone. And you can see
that only the birth date accepts NULLs, and for the rest we always have to
provide a value. So with that I now have an understanding of the table persons.
Now next I'm going to start writing the query from the source. We start like
this: SELECT star FROM our table customers, just to have an overview of our
table. In the next step we're going to design a result from this query that
matches the target table. In the output we need an ID, and we have it in the
original table, customers. We're going to select ID. Okay. Next we need a
person name, and in the original table we have something called first name. So
this is a perfect match. We're going to select this column as the second
column. So we have covered the first two. The third one is going to be the
birth date. Well, my friends, we don't have birth dates, but the database can
accept a NULL there. So I'm going to write NULL because I don't have this
information in the source table. The next one is going to be the phone. We
don't have phone information either, but we cannot have it as a NULL because it
says NOT NULL here. So what we're going to do is add a static value, a default
value. We're going to have two single quotes and in between we're going to say
unknown. Since the column is VARCHAR, it can accept this word. So now let's
just run the query. We have the ID, we have the first name, the birth date is
empty, and the phone is unknown.
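The query built in this step might look like the sketch below. The identifiers and types are guesses based on the spoken column names, the setup is an assumption for a scratch database, and the exact spelling of the placeholder ('Unknown') is assumed as well.

```sql
-- Assumed setup for a scratch database; names and types are guesses.
CREATE TABLE customers (
    id         INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (id, first_name, country, score)
VALUES (6, 'Anna', 'USA', NULL), (7, 'Sam', NULL, 100);

-- One output column per column of the target table persons, in the
-- target's order: id, person_name, birth_date, phone.
SELECT
    id,
    first_name,   -- feeds person_name
    NULL,         -- no birth date in the source; birth_date accepts NULL
    'Unknown'     -- phone is NOT NULL, so a fixed placeholder is used
FROM customers;
```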
Now you might say: but the column names are not matching the column names of
persons. Well, the database does not care about that. As long as the result
matches the table, it can insert it. The database never compares the column
names with each other; it goes by the position of the columns. If you like, you
can add aliases here exactly like the target table. It will not hurt, but it
has no effect on the result. All right. Okay. So now we have a SELECT query and
we have a result, but this is not an insert. So how are we going to insert the
result of this into the table persons? For that we need the INSERT INTO
command. So INSERT INTO, and now we have to specify the target table, which is
going to be persons. And of course you can list all the column names, but if
you have an exact match you can skip the list. For me, I always like to add it
just to make sure that we don't have any issue. So: the ID, person name, birth
date and the phone. That's it. Let's execute. So it is working now. We can see
10 rows affected. That means 10 rows were inserted from the table customers
into the target persons. What we can do now is query the table persons just to
check that everything is working perfectly. SELECT star FROM persons, and let's
execute. And with that you can see our 10 persons that we have added from the
customers.
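Both steps together, as a sketch. The table definitions are assumptions for a scratch database: the identifiers and types are guessed from the spoken column names, and only two sample rows are loaded instead of the ten in the video.

```sql
-- Assumed setup for a scratch database; names and types are guesses.
CREATE TABLE customers (
    id         INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

CREATE TABLE persons (
    id          INT NOT NULL PRIMARY KEY,
    person_name VARCHAR(50) NOT NULL,
    birth_date  DATE,
    phone       VARCHAR(15) NOT NULL
);

INSERT INTO customers (id, first_name, country, score)
VALUES (6, 'Anna', 'USA', NULL), (7, 'Sam', NULL, 100);

-- Step 1 is the SELECT, step 2 wraps it in INSERT INTO. Columns are
-- matched by position, not by name.
INSERT INTO persons (id, person_name, birth_date, phone)
SELECT id, first_name, NULL, 'Unknown'
FROM customers;

-- Check the result in the target table.
SELECT * FROM persons;
```

The source rows stay in customers; the statement copies them rather than removing them from the source.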
So with that we have moved the data from one table and inserted it into another
table. And as you can see, it was very simple. First you write a query on the
source table in order to collect the data that you need, and then you insert it
into the target table. So this is really nice and easy, and this is another way
to insert data into your database. Okay, so with that we have learned how to
insert data into our tables. Now let's say that I don't have something new. I
don't have any rows to add to my table, but I have an update. I would like to
change the content of already existing rows. So what can we do? We can use the
UPDATE command in order to change the content of already existing rows. So
again, my friends: INSERT is going to insert completely new rows, but UPDATE is
going to change the data of already existing rows.
Now let's have a quick look at the syntax of UPDATE. It starts with the keyword
UPDATE, and then we have to specify the table name. After that we use SET in
order to specify the new values for the columns. So for each column that you
want to update you write down a new value, and you separate the columns, of
course, using a comma. After that we have to specify a WHERE condition as well.
It's like in queries: you say WHERE and then you write a condition. If you
don't do that and you don't use the WHERE clause, what is going to happen? You
will end up updating all the rows inside your table. So that's why we always
need the WHERE clause.
clause. All right. So that's all about the syntax. Let's go back to SQL in
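
The syntax described above can be sketched as follows. This is only a minimal illustration, not the course's own script: it uses Python's built-in sqlite3 module so that it runs anywhere, and the table layout (id, first_name, score) and the sample rows are assumptions made up for the example. It shows how many rows the same UPDATE touches with and without a WHERE clause.

```python
import sqlite3

# Hypothetical table and rows, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (id, first_name, score) VALUES (?, ?, ?)",
    [(1, "Maria", 350), (2, "John", 900), (3, "Anna", None)],
)
conn.commit()

# General shape: UPDATE <table> SET <column> = <value>, ... WHERE <condition>
rows_with_where = conn.execute(
    "UPDATE customers SET score = 0 WHERE id = 3"
).rowcount  # only the row that matches the filter is changed

# The same statement without WHERE changes every row in the table.
rows_without_where = conn.execute("UPDATE customers SET score = 0").rowcount

conn.rollback()  # undo both updates so the sample data stays intact
print(rows_with_where, rows_without_where)
```

The rollback at the end is just a convenience for the sketch; it is a separate topic from the UPDATE syntax itself.
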
Let's go back to SQL in order to update our data. Okay, so we have the following
task. It says: change the score of customer 6 to zero. So that means we have to
modify the data of the customer with the ID equal to six. First I would like to
have a look at our data, so SELECT * FROM customers. The task is targeting this
customer over here, and we would like to replace the NULL with zero. Now how can we
update this information inside the table? We can use the UPDATE command. So what
are we going to do? We start by writing UPDATE, and after that we have to specify
the table name. What are we updating? We are updating customers. Then we tell the
database to SET the value of score to zero. So we would like to change the value
from NULL to zero.
And now here comes something very risky. Don't execute this query yet. If you do
that, what's going to happen? The database is going to go to the table customers
and replace the scores of all customers with zero. So it's going to update the
whole table, and this is of course very risky. That's why in the UPDATE command we
have to give a WHERE condition, a filter, in order to target only the specific row
or rows that you really want to modify. In this case we want to change only one
row. So what we have to do is specify the WHERE condition like we have done in the
SELECT query. Nothing new, right? So we say WHERE the customer ID is equal to six.
And with that SQL will not update everything. First it filters the data, and then
it updates.
And now, before I execute, just to make sure, I check which data is going to be
affected. It's very simple: you write SELECT * FROM the table customers, then take
the exact same WHERE and put it in the query, then select the whole thing and
execute. If this query gives me the data that should be modified, then I'm doing
the UPDATE command correctly. In this case we are targeting only one customer,
customer number six, and with that I feel really confident about my update. Since
I'm going to use this query later, I'm going to put the whole thing in a comment,
so if I execute now, only the UPDATE is going to be executed. So let's do that.
Now it's very important to check the message. You can see one row is affected,
which is really good, because if I saw 10 rows affected here, that would mean
everything was updated. Now let's check the data. I'm going to remove the WHERE
here and check the whole table. You can see we still have the old scores; only
Anna now has a score of zero instead of NULL. So this is how I usually update
data. You have to do it very carefully.
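
The careful workflow just described (preview with SELECT, run the UPDATE with the exact same filter, read the affected-row count, then check the table) could look roughly like this. It is a sketch under assumptions: Python's sqlite3 stands in for the database used in the video, and the column names and the scores of the other customers are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, score INTEGER)"
)
# Only Anna (id 6, score NULL) comes from the walkthrough; other values are invented.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Maria", 350), (2, "John", 900), (6, "Anna", None)],
)

# A fixed filter string, reused so the preview and the update cannot drift apart.
condition = "id = 6"

# Step 1: preview which rows the filter selects.
preview = conn.execute(f"SELECT * FROM customers WHERE {condition}").fetchall()

# Step 2: update only if the preview shows exactly the one expected row.
affected = 0
if len(preview) == 1:
    affected = conn.execute(
        f"UPDATE customers SET score = 0 WHERE {condition}"
    ).rowcount

# Step 3: check the whole table afterwards.
scores = dict(conn.execute("SELECT id, score FROM customers").fetchall())
print(preview, affected, scores)
```
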
Now let's move to another task. It says: change the score of customer number 10 to
zero and update the country to UK. So this time we are targeting customer number
10. As you can see, she doesn't have a country or a score, and the task wants us
to change the score to zero and the country to UK. So how are we going to do it?
We use the exact same command but with a different condition: the ID this time is
equal to 10, and the score is set to zero. But now we have to change the country
as well. If you want to update multiple columns, you add a comma after the score,
then on a new line you write country equals UK. So select the whole thing and
let's execute. Again it is affecting only one row, which is really good. And if
you check the table and search for Sara, you can see that in one update we have
updated two columns, the country and the score. So with that we have solved the
task. It's very simple.
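
Updating two columns in one statement, as in the task above, might be sketched like this. Again this is an illustration run through Python's sqlite3, not the author's script; the column names and the second sample row are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, first_name TEXT, country TEXT, score INTEGER)"
)
# Customer 10 has no country and no score, as in the task; row 5 is filler data.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(5, "Peter", "USA", 0), (10, "Sara", None, None)],
)

# Several column assignments are separated by commas inside a single SET.
affected = conn.execute(
    "UPDATE customers SET score = 0, country = 'UK' WHERE id = 10"
).rowcount

updated_row = conn.execute(
    "SELECT first_name, country, score FROM customers WHERE id = 10"
).fetchone()
untouched_row = conn.execute(
    "SELECT first_name, country, score FROM customers WHERE id = 5"
).fetchone()
print(affected, updated_row, untouched_row)
```
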
Now moving on to the next task. It says: update all customers with a NULL score by
setting their score to zero. So this time we are not speaking about one specific
customer. We are talking about updating the data for a subset of customers. Now
imagine you have hundreds of customers and you write one UPDATE command for each
customer. That would really be a waste of time. Instead, we can specify a
condition that targets multiple customers, and we do the update for those
customers in one go. So let's see how we're going to do it. We are talking only
about replacing the NULLs with zero, so we don't need the country: SET score equal
to zero. But now we will not be specific about the IDs. We have to make a new
condition, and it goes like this: WHERE score IS NULL. Of course, in the course we
have a full dedicated chapter about NULLs. Here all we are doing is searching for
scores that are NULL. But we cannot write an equals sign; we have to write it like
this: IS NULL.
Of course, before we update anything, we have to test it in a query. So SELECT *
FROM customers WHERE score IS NULL. Let's execute. As you can see, we have two
customers where the score is NULL. That means this condition is targeting a subset
of customers, and we're now going to do the update for multiple rows, for this
subset. So we can run this query. Let's execute it. Now you can see two rows are
affected, so multiple rows got updated. If you now query our table customers, you
can see we don't have any NULLs in the scores. We have replaced all the NULLs with
zero.
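
The difference between writing an equals sign and writing IS NULL, and the multi-row update that follows from it, can be sketched as below. This is a small illustration using Python's sqlite3; the names and scores are invented, only the idea of two customers with a NULL score is taken from the walkthrough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, score INTEGER)"
)
# Hypothetical rows: two of the three customers have no score yet.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Maria", 350), (7, "Sam", None), (8, "Lena", None)],
)

# Comparing with "= NULL" never matches, because NULL is not equal to anything.
matches_with_equals = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE score = NULL"
).fetchone()[0]

# IS NULL is the condition that actually finds the missing scores.
matches_with_is_null = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE score IS NULL"
).fetchone()[0]

# One UPDATE handles the whole subset in one go.
affected = conn.execute(
    "UPDATE customers SET score = 0 WHERE score IS NULL"
).rowcount
remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE score IS NULL"
).fetchone()[0]
print(matches_with_equals, matches_with_is_null, affected, remaining_nulls)
```
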
And of course you can do the same thing for other columns: you can write an UPDATE
command to replace all the NULLs in the country with maybe something like
"unknown" or any default value that you want. So this is how you can update
multiple rows in one go. All right, my friends. So with that we have learned how
to insert new rows into our tables and how to update the content of already
existing rows. Now the last thing that we can do to the data inside a table is to
remove rows from it, and we can do that using the DELETE command. If you use
DELETE, SQL removes already existing rows from your table. All right. The syntax
of DELETE is very simple. We say DELETE FROM and then we write the table name. And
here comes something very important: we have to add a WHERE condition. It's like
the UPDATE. If you don't include a WHERE condition, what's going to happen? You
will end up deleting all the rows inside the table. So the syntax is very simple.
Let's go back to SQL in order to delete some data.
Okay. So now we have the following task: delete all customers with an ID greater
than five. So we have to delete all the customers that we recently added. How are
we going to do it? It's very simple. We say DELETE FROM, which means I want to
delete something from a table, and we have to specify the table name, which is
customers. So the syntax is very simple. Now my friends, this is more risky than
UPDATE, because if you execute it like this (don't do that yet, wait), what's
going to happen? All the data of the customers is going to be deleted. You will
get an empty table, and we don't want that. So we do exactly what we did with the
UPDATE command: we specify the WHERE clause. It says the ID should be greater than
five. With that we are defining a subset of the data that should be deleted, not
everything.
And like we did with the UPDATE, we have to do a double check before deleting
anything. So again, we write SELECT * FROM the table customers and copy the WHERE
condition in order to test what is going to be deleted. It's going to be all the
customers with an ID higher than five. With that I'm making sure that my DELETE
command is correct, which, from what I see here, it is. So those five customers
should be deleted. Now let's delete those customers. And again it is very
important to read the message. It says five rows affected, which means five
customers got deleted. And this is better than 10, of course. So let's check which
customers are left. We have 1, 2, 3, 4, 5. Those are the original customers, and
everything else got deleted. With that we have solved the task, and this is how we
can delete data from tables. Be very careful. Always test before running the
DELETE command.
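
The same preview-then-execute habit applied to DELETE might look like this. It is a sketch, assuming a customers table with ten rows and IDs 1 to 10; the placeholder names are invented, and Python's sqlite3 is used only to make the statements runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT)")
# Ten placeholder customers with IDs 1..10 (names are made up).
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"customer_{i}") for i in range(1, 11)],
)

# Preview: which rows would the filter remove?
to_delete = [
    row[0]
    for row in conn.execute("SELECT id FROM customers WHERE id > 5 ORDER BY id")
]

# Delete with the exact same WHERE condition and read the affected-row count.
deleted = conn.execute("DELETE FROM customers WHERE id > 5").rowcount

# Check what is left in the table.
remaining = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(to_delete, deleted, remaining)
```
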
Okay. So now we have the following task. It says: delete all data from the table
persons. So that means we have to remove everything from the table persons. But we
don't want to delete the table; we just want to delete the data inside the table.
So what are we going to do? We write DELETE FROM, and then we specify the table
persons. If you execute it, what's going to happen? SQL removes all the data in
persons. But in SQL we have a more interesting command. If you want to delete
everything from the table persons, we have TRUNCATE. It gives the same result as
DELETE FROM persons: it makes the whole table empty. But why do I like to use
TRUNCATE? Because it is way faster than DELETE. If you have large tables, the
DELETE command is going to be really slow, because with DELETE there are a lot of
things happening behind the scenes, like logging every deleted row. But if you are
using TRUNCATE, the database skips all that extra work and it's going to be very
fast.
So if you want to delete all the data from a table, you can do it like this with
DELETE if it's a small table. But what I usually do is write TRUNCATE TABLE and
then the table name. We get the same effect, and with that I'm saying: reset
everything, make the table empty. So let's execute it. Now with that you will not
get the number of deleted rows, and that's why TRUNCATE is way faster. It does not
log each deleted row; it just removes all the data without those extra steps. So
this is how we can delete all the data from a table while the table still exists.
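
As a rough illustration of "empty the table but keep the table", here is a sketch under clearly stated assumptions. TRUNCATE TABLE exists in databases such as SQL Server, MySQL and PostgreSQL, but not in SQLite, so the statement is only shown as a string and never executed; a DELETE without WHERE is used as the SQLite stand-in. The persons columns are invented.

```python
import sqlite3

# Shown for reference only: SQLite does not support TRUNCATE TABLE,
# so this string is never executed in this sketch.
TRUNCATE_STATEMENT = "TRUNCATE TABLE persons"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, person_name TEXT)")
conn.executemany(
    "INSERT INTO persons VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "C")],  # hypothetical rows
)

# SQLite stand-in with the same end result: DELETE without WHERE removes every row.
conn.execute("DELETE FROM persons")
rows_left = conn.execute("SELECT COUNT(*) FROM persons").fetchone()[0]

# The table definition is still there, so it can be queried or refilled.
table_exists = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = 'persons'"
).fetchone()[0] == 1
print(rows_left, table_exists)
```

The speed difference discussed above cannot be seen in a three-row example; it only matters on large tables in a database that actually implements TRUNCATE.
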
Okay my friends, so with that you have learned the basics of how to manipulate the
data inside your database, the data manipulation language (DML). And with that I
can tell you we have covered the basics of SQL, the beginner level. In the next
chapters we will be at the intermediate level, and the first thing you're going to
learn there is how to filter your data. We're going to cover many operators that
you can use inside the WHERE clause. So let's go.
All right. So now let's have an overview of all the different operators in SQL.
The first group is the comparison operators. They are the easiest ones, where all
we have to do is compare two values, and we have six different variants of how to
do that. Next we have the logical operators. We use them in order to combine
multiple conditions. Moving on to the next one, we have the range operator. Here
we have only one, BETWEEN. We use it in order to check whether a value falls
within a specific range. Next we have the membership operators, and here we have
two: IN and NOT IN. All you have to do here is check whether a value is in a list
or not. And the last category is the search operator. Here as well we have only
one operator, LIKE. We use it in order to search for a specific pattern in a text.
So my friends, we're going to go through all those operators one by one.
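
To make the five categories concrete, here is one WHERE condition per category, run against a small sample table. This is only a preview sketch: the names and countries follow the customers mentioned in the course, while the column names and scores are assumptions, and Python's sqlite3 is used as the runner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, first_name TEXT, country TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 350),  # scores are invented for the sketch
        (2, "John", "USA", 900),
        (3, "George", "UK", 750),
        (4, "Martin", "Germany", 500),
        (5, "Peter", "USA", 0),
    ],
)

# One example condition per operator category.
conditions = {
    "comparison": "score > 500",
    "logical": "country = 'USA' AND score > 500",
    "range": "score BETWEEN 100 AND 500",
    "membership": "country IN ('Germany', 'UK')",
    "search": "first_name LIKE 'M%'",
}

results = {}
for category, condition in conditions.items():
    query = f"SELECT first_name FROM customers WHERE {condition} ORDER BY id"
    results[category] = [row[0] for row in conn.execute(query)]

for category, names in results.items():
    print(category, names)
```
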
Okay. So now let's deep dive into the first category, the comparison operators,
and we're going to cover all of them. So what exactly is a comparison operator? It
is very simple. We want to compare two things, and there are a lot of things that
we can compare in SQL. But the formula is always going to be like this: we have
the first expression, then an operator, and then another expression, and this
forms something called a condition.
So here we have a lot of variants. We can compare one column to another column.
For example, you can compare the first name with the last name, so both of the
expressions are columns. Another scenario: you want to compare a column with a
static value. For example, you say the first name must be equal to a value like
John. Now we are comparing a column with a value, no longer two columns. In
another scenario we apply a function to a column and then compare the result to,
say, a value. For example, we apply the UPPER function to the first name, and this
must be equal to a value like JOHN with all the letters in uppercase. One more
thing: you can write an expression on one of the sides. For example, you can say
that price multiplied by quantity must be equal to 1,000. So here we have an
expression with multiple columns on one side, and the output of this expression
must be equal to 1,000. The last one is a little bit more advanced, and we're
going to cover it in another chapter: we can put a whole query on one of the
sides, and we call this a subquery. So on one side you write a whole query,
SELECT, FROM, WHERE, whatever you want, and you compare the result of this query
to, for example, a value or a column. So as you can see, in SQL we can compare a
lot of things: columns with each other, a column with a value, or we use a
function, an expression, or even a whole query. So this is how we build conditions
in SQL.
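
The five shapes of a condition listed above can be tried out on a small table. The sketch below is hypothetical throughout: the sales table, its columns and its rows are invented to match the spoken examples, the subquery is just one possible choice, and Python's sqlite3 is used to run the statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table built only to illustrate the condition shapes.
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, "
    "price INTEGER, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "Smith", 100, 10),
        (2, "john", "Miller", 50, 3),
        (3, "Alex", "Alex", 250, 4),
        (4, "Maria", "Lopez", 20, 5),
    ],
)

# expression  operator  expression
conditions = {
    "column vs column": "first_name = last_name",
    "column vs value": "first_name = 'John'",
    "function vs value": "UPPER(first_name) = 'JOHN'",
    "expression vs value": "price * quantity = 1000",
    "column vs subquery": "price = (SELECT MAX(price) FROM sales)",
}

matching_ids = {}
for label, condition in conditions.items():
    query = f"SELECT id FROM sales WHERE {condition} ORDER BY id"
    matching_ids[label] = [row[0] for row in conn.execute(query)]

for label, ids in matching_ids.items():
    print(label, ids)
```

Note that in SQLite the plain comparison with 'John' is case-sensitive, which is why the UPPER variant finds one more row; other databases may behave differently depending on their collation settings.
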
Okay my friends. So let's see how conditions work in SQL. We have our data, the
name, the country, and the score, and let's say that we have built a condition
which says the country must be equal to USA. This is a very simple comparison
operator, and this is the condition that we are using inside the WHERE clause.
Once you apply this filter to your data, what's going to happen? SQL goes row by
row, evaluating whether the row meets the condition. If a row does not fulfill the
condition, SQL removes it from the results. If it does fulfill the condition, SQL
keeps it. So now we are comparing the values of a column with a static value, USA.
We compare whatever value we get from the country with USA.
together with the USA. So now let's see how is going to apply this filter to our
how is going to apply this filter to our data for the first customer Maria. Now
data for the first customer Maria. Now you can see the value inside the country
you can see the value inside the country is Germany. So Isql now going to go and
is Germany. So Isql now going to go and compare Germany to USA since it is not
equal. Then SQL is going to understand: okay, Maria is not fulfilling
the condition. So it is false, and SQL is going to go and remove this
customer from the results. So she is not fulfilling the condition.
Moving on to the next one, John. Now SQL is going to take the value
inside the country, USA, and it is equal to USA. So that means John is
fulfilling the condition, and SQL is going to be happy about it. So it
is true, and this means SQL is going to keep John in the final results.
Now moving on to George: the value is UK, not equal to USA. He is not
fulfilling the condition, so SQL is going to go and remove him from the
final result. Same thing for Martin: Germany is not equal to USA, so
SQL is going to remove this customer as well. And for the last one,
Peter, you can see the value is USA. So USA equals USA, the condition
is fulfilled, SQL is happy about it and is going to leave the customer
in the output. So now if you go and apply this condition using the
comparison operator to your data, only two customers are going to be
left in the output. This is exactly how the conditions and the
comparison operators work in SQL.
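
One way to watch this row-by-row check is to print the result of the condition next to each row. This is only a sketch: the five rows and their values come from the walkthrough, but the column name first_name and the data types are assumptions, and the script creates a throwaway customers table, so it is best run in a scratch database.

```sql
-- Throwaway copy of the example data (first_name and the types are assumed).
CREATE TABLE customers (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- Show what the condition evaluates to for every row.
SELECT
    first_name,
    country,
    CASE WHEN country = 'USA' THEN 'true' ELSE 'false' END AS condition_result
FROM customers;

-- Used as a filter, the same condition keeps only the 'true' rows:
-- John and Peter.
SELECT *
FROM customers
WHERE country = 'USA';
```

The first query keeps all five rows and only labels them; the second one is the actual filter, which is where the "false" rows disappear.
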
Okay. So now let's start with the first operator. It's very simple: we
have the equal operator. It checks whether two values are equal. Let's
have an example. Okay, so now we have this task. It says: retrieve all
customers from Germany. So this is very basic. We're going to select
all the columns, since we don't have any specifications, from the table
customers. And if you go and execute it, you will get all the
customers. But we don't need that, only the customers that come from
Germany. So we have to go and apply a condition using the WHERE clause:
country equal to the value Germany. So make sure you are writing it
exactly like in the database, otherwise it will not work. So let's go
and execute, and with that we are getting only the customers from
Germany. So it is very simple, and this is why we use the equal
operator.
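
The Germany task could look roughly like the query below. It is a sketch under assumptions: the rows match the ones used in the walkthrough, while the column name first_name, the data types, and the throwaway table created here are illustrative only.

```sql
-- Throwaway copy of the example data (first_name and the types are assumed).
CREATE TABLE customers (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- Retrieve all customers from Germany.
-- The text value goes in single quotes and is spelled as it is stored.
SELECT *
FROM customers
WHERE country = 'Germany';
-- keeps Maria and Martin
```
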
Okay. So now moving on to the next one, again very simple. If you want
to check whether two values are not equal, we can use the not equal
operator. So let's have an example. Okay, so now let's have the
opposite task. It says: retrieve all customers who are not from
Germany. So this is very simple. We are saying here they are not equal
to Germany, so we can use the not equal operator in order to get these
customers. So with that, as you can see after executing, we are getting
all the customers whose country is not equal to Germany. And there's
another way to write the not equal; doing it like this, we'll get the
same results.
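
The transcript does not capture which symbols appear on screen for the two spellings. The two common ones are `<>` (the standard form) and `!=` (accepted by most database systems), so the sketch below assumes those. The column name first_name, the data types, and the throwaway table are also assumptions; the rows are the ones from the walkthrough.

```sql
-- Throwaway copy of the example data (first_name and the types are assumed).
CREATE TABLE customers (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- Retrieve all customers who are not from Germany (standard spelling).
SELECT *
FROM customers
WHERE country <> 'Germany';

-- Same condition with the alternative spelling.
SELECT *
FROM customers
WHERE country != 'Germany';
-- both queries keep John, George and Peter
```
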
All right my friends, moving on to the next one. We can check if a
value is greater than another value, so we use the greater than
operator. Let's have an example. Okay, so now the next task says:
retrieve all customers with a score greater than 500. Now we want to
filter the data based on the score, so we're going to say WHERE score,
and since the task says greater than 500, we're going to use the
greater than operator with 500. It's very simple. So with that we will
get only the customers where the score is higher than 500. So for
example Maria is not fulfilling the condition, the same thing for
Peter, and as well for Martin: it must be greater than 500. So if you
go and execute it, you will get only those two customers, because
their scores are greater than 500.
Okay, moving on to the next one. This time we're going to check if a
value is greater than or equal to another value. So it is like a mix
between the greater than and the equal: if one of them is fulfilled,
then the value is going to meet the condition. So let's have an example
for that. Now, if the task says retrieve all customers with a score of
500 or more, this time we're going to include the customers whose score
is equal to 500 as well as higher. So we're going to have a similar
condition based on the score and the value 500, but this time we're
going to say greater than or equal to 500. So if you go now and execute
it, this time we're going to see the customer Martin with the score of
500. So in this scenario we're going to use greater than or equal.
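
Putting the two operators side by side makes the difference at the boundary value visible. This is a sketch with the rows from the walkthrough; the column name first_name, the data types, and the throwaway table are assumptions.

```sql
-- Throwaway copy of the example data (first_name and the types are assumed).
CREATE TABLE customers (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- Score greater than 500: John (900) and George (750).
SELECT *
FROM customers
WHERE score > 500;

-- Score of 500 or more: the same two, plus Martin (exactly 500).
SELECT *
FROM customers
WHERE score >= 500;
```
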
All right. So now let's keep moving. The next one is as well very
simple. We're going to check this time if a value is less than another
value, so we're going to use the less than operator. Let's have an
example. Now moving on to another simple task: retrieve all customers
with a score less than 500. So this time we want all the customers with
a lower score, and we're going to use exactly the opposite: the score
is less than 500. And again, here it is not equal, right? So if you go
and execute, you will get all the customers with low scores. You will
not get Martin, because Martin's score is equal to 500. So with that we
have solved the task: we have all the customers with a score less than
500.
Okay my friends, now moving on to the last one. I think you already got
it. So we're going to check whether a value is less than or equal to
another value. So you can go and combine the less than operator
together with the equal, and if one of them is fulfilled, then the
value is going to meet the condition. So let's have an example for
that. This time we are retrieving all customers with a score of 500 or
less. So the query is going to be very similar, but we are saying it is
less than or equal to 500. So we are including the value in our
condition. And with that, as you can see, we still have our two
customers where the score is less than 500, but we now have as well
Martin with a score of 500.
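
The less than pair mirrors the greater than pair, and again the only row that changes sides is the one sitting exactly on 500. A sketch with the rows from the walkthrough; the column name first_name, the data types, and the throwaway table are assumptions.

```sql
-- Throwaway copy of the example data (first_name and the types are assumed).
CREATE TABLE customers (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- Score less than 500: Maria (350) and Peter (0).
SELECT *
FROM customers
WHERE score < 500;

-- Score of 500 or less: the same two, plus Martin (exactly 500).
SELECT *
FROM customers
WHERE score <= 500;
```
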
Okay my friends. So with that we have covered the first group, the
comparison operators. Now we're going to move on to the next group.
We're going to speak about the logical operators, and here we have
three: AND, OR, NOT. So let's start with the first one. What exactly is
the AND operator? Okay. So now what is the definition of the AND? It
says all conditions must be true. So all the conditions that you have
in the WHERE clause must be true in order to keep the row in the
results. So let's understand what this means. Things are going to get
more complicated, where you can have not only one condition but you
might have multiple conditions in your query. So here we're going to
add a second condition, where we're going to say not only the country
must be equal to USA but also the score must be higher than 500. So now
you have two conditions and you have to put them in the WHERE clause.
Now you have to combine those conditions using a logical operator, and
here we have two options, two operators: the AND operator and the OR
operator. In this scenario, if you say AND, then SQL is very
restrictive. Both of the conditions must be true in order to keep the
row in the results.
So now let's see how this is going to work. Now for the first row and
for the first condition, you can see the country is Germany, and it is
not fulfilling the first condition. So this is going to be false. And
as well, if you check the second condition for the first row, you can
see the score is 350. So that means this customer is not fulfilling
even the second condition. So both of the conditions are false, and SQL
is going to go and remove this customer from the results. Now to the
next one, John. You can see John is fulfilling the first condition,
because the country is equal to USA, and as well fulfilling the second
condition: his score is 900, and this is higher than 500. So now SQL is
going to be very happy about it, because both of them are true, and
this is the only way to keep the row in the output, because we are
using the AND operator. So John is going to stay in the output. Now
moving on to George. He is not fulfilling the first condition. But now
the second condition is fulfilled: his score is 750, and this is higher
than 500. So now it's like 50/50, right? On one side it's false, but on
the other side it's true. But this is not enough for the AND operator.
Both of them should be true in order to keep the row in the output.
That's why SQL is going to remove this row. Now moving on to Martin. He
is not fulfilling either of the conditions, so SQL is going to go and
remove him from the results. And now for the last one. Peter is
fulfilling the first condition: the country is equal to USA. But the
second condition is sadly not fulfilled, since the score is zero, not
higher than 500. Again we have the same scenario: it's 50/50, and this
is not enough for the AND operator. That's why SQL is going to go and
remove him. So as you can see, if you use an AND operator, a lot of
rows are going to be removed if one of the conditions is not met. So
the AND operator is very restrictive: both of the conditions must be
fulfilled to keep the row in the results. So this is exactly how the
AND operator works.
Okay. So now we have the following task: retrieve all customers who are
from USA and have a score greater than 500. So here we are combining
multiple conditions, and let's go and do it step by step. So the first
thing that we have to do is select the data from the correct table. So
SELECT star FROM customers, and with that we are getting all the
customers from the table. Now the first condition: we need the
customers that come from USA. So we need only those two customers, and
in order to do that, as we learned, we can go and use the WHERE clause,
and the condition is going to be country equal to USA. So if you go and
execute, we will get those two customers. Nothing is new; we have used
the comparison operator equal. But we are not done yet. We have another
condition: from those two customers, we need only the customers whose
score is higher than 500. So now by looking at those two customers, you
can see that Peter here does not have a score higher than 500, and we
don't want to see that in the results. So now what we have to do is
write a condition for this one over here. So this is based this time on
the score, not on the country. So the score should be greater than 500.
Now as you can see, we have the first condition for the first
requirement and the second condition for the second requirement. Now
the question is how to connect those two conditions. So here we have
two options, AND and OR, and to be honest this is very simple. The task
says the customer should fulfill both of the conditions: should be from
USA and at the same time have a score greater than 500. So it is very
simple: we use AND. So with that we have connected both of those
conditions, and if you go and query it, you will get only one customer
that is fulfilling our conditions. So from all customers we have only
one customer that fulfills this condition, that comes from USA and at
the same time has a score higher than 500. So this is how we use the
AND operator in order to connect two conditions.
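
The finished AND query could be sketched as follows. The first SELECT is an addition for illustration: it prints the outcome of each condition per row, so the 50/50 rows from the walkthrough are visible. The rows come from the walkthrough; the column name first_name, the data types, the output column names, and the throwaway table are assumptions.

```sql
-- Throwaway copy of the example data (first_name and the types are assumed).
CREATE TABLE customers (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- Per-row view of the two conditions (illustration only).
SELECT
    first_name,
    CASE WHEN country = 'USA' THEN 'true' ELSE 'false' END AS from_usa,
    CASE WHEN score > 500     THEN 'true' ELSE 'false' END AS score_above_500
FROM customers;

-- Retrieve all customers who are from USA and have a score greater than 500.
SELECT *
FROM customers
WHERE country = 'USA'
  AND score > 500;
-- keeps only John, the one row where both conditions are true
```
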
Okay my friends. So that's all for the AND operator. Let's speak now
about the OR operator. All right. Now the OR operator: it says at least
one condition must be true. So it is less restrictive than the AND. It
is enough to have one condition true in order to keep the row in the
results. Let's understand exactly what this means. Okay. So now we have
the same scenario. We have two conditions, and in SQL you have to
connect them either using the AND operator or the OR operator. In this
scenario we're going to talk about the OR operator. And as we said, at
least one of the conditions must be fulfilled in order to leave the
record in the results.
So let's see what's going to happen here. Now the first customer,
Maria: she is not fulfilling the first condition, and as well not the
second condition. So both of them are false, and this is the only
scenario where SQL is going to remove the record from the results,
because it is not fulfilling the minimum: at least one of them should
be true. Both of them are false, so SQL is going to go and remove this
row. Now moving on to the next one, John. John is from USA and has a
score higher than 500. Both of the conditions are green, so both of
them are true, and this is more than enough to keep the row in the
output. That's why we will see John in the output. Now moving on to the
third one, George. George is not fulfilling the first condition,
because UK is not equal to USA. But George this time is fulfilling the
second condition. So we have a true here, and since we have at least
one true, this is good enough to keep the record in the output. So you
will see George in the results. Now moving on to Martin. He is not
fulfilling the first condition, and as well not fulfilling the second
condition. Both of them are false, and this is not enough to keep the
row in the output. So that's why SQL is going to go and remove it. Now
moving on to the last one, Peter. He is fulfilling the first condition
but not the second condition, but still everything is fine, because he
is fulfilling at least one condition. So we have the minimum, and SQL
is going to leave him in the output. So as you can see, the OR operator
is not restrictive like the AND operator. It's enough to have one true
in order to keep the data in the output. And this is exactly how the OR
operator works.
Now let's see the second task: retrieve all customers who are either
from USA or have a score greater than 500. So it is a very similar
task. We have two conditions. So we need the customers that are either
from USA, so the first condition is based on the country being equal to
USA, and the second condition is that the score is greater than 500.
But this time we are very relaxed: either this condition is fulfilled
or the second one. So instead of having AND, we will be using the
operator OR. So it is enough to fulfill one of those conditions. And if
you go and execute now, as you can see, we are getting more results,
because it is easier to fulfill the conditions. So we can see those
three customers, each fulfilling either the first condition or the
second one.
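
Swapping AND for OR in the same query is enough to see the result grow from one customer to three. A sketch with the rows from the walkthrough; the column name first_name, the data types, and the throwaway table are assumptions.

```sql
-- Throwaway copy of the example data (first_name and the types are assumed).
CREATE TABLE customers (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- Retrieve all customers who are either from USA or have a score
-- greater than 500.
SELECT *
FROM customers
WHERE country = 'USA'
   OR score > 500;
-- keeps John (both true), George (score only) and Peter (country only)
```

Maria and Martin are the only rows dropped, because neither condition is true for them.
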
All right my friends. So that's all for the OR operator, and we're
going to move to the last one in this group, the NOT. So what do we
mean with the NOT operator? Okay. So now what is this operator NOT? It
is a reverse operator. It's going to go and exclude the matching
values. So what exactly does this mean? Let's have a very simple
example. All right. So now the NOT operator is not like the OR and the
AND. This operator will not go and combine two conditions, so you can
use it with only one condition. And let's say that our current
condition is like this: the country must be equal to USA. So this is a
comparison operator. And if you apply it to your data, as we learned,
it's going to leave only two customers, John and Peter, because they
fulfill the condition, and all other customers will be removed because
they don't fulfill the condition. So nothing crazy so far.
But now if you go and apply the NOT operator to the condition, what is
going to happen? You're going to reverse the whole truth. So you are
saying: if this condition is fulfilled, the row must be removed from
the final results. So it is switching everything. We want to see the
customers that are not fulfilling the condition. So now let's see what
happens if you apply the NOT operator together with the condition. We
can see that the first customer is not fulfilling the condition, which
is a great thing. This is exactly what we want: we want the customer
that is not fulfilling the condition. That's why SQL is going to be
happy about it, make it true, and leave it in the output. So Maria is
fulfilling the whole thing: she is not meeting the condition, so SQL is
going to leave her in the output. Now for the next one. This customer
is fulfilling the condition, and that is not a good thing. So SQL is
going to go and this time remove John from the results, because he is
fulfilling the condition. And moving on to George. George is not
fulfilling the condition, which is amazing. So that's why SQL is going
to keep George in the output this time. The same thing for Martin:
Martin is not fulfilling the condition, so SQL is going to keep the
customer. And Peter: he is fulfilling the condition, so SQL is going to
go and remove this customer from the output. So as you can see, we have
reversed everything, right? The NOT operator is going to make the true
false and the false true. Okay. So this is how it works.
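
The reversal can be tried out by running the condition once without NOT and once with it. This is a sketch: the rows come from the walkthrough, while the column name first_name, the data types, and the throwaway table are assumptions. The parentheses around the condition are optional here and only added for readability.

```sql
-- Throwaway copy of the example data (first_name and the types are assumed).
CREATE TABLE customers (
    first_name VARCHAR(50),
    country    VARCHAR(50),
    score      INT
);

INSERT INTO customers (first_name, country, score) VALUES
    ('Maria',  'Germany', 350),
    ('John',   'USA',     900),
    ('George', 'UK',      750),
    ('Martin', 'Germany', 500),
    ('Peter',  'USA',       0);

-- Plain condition: keeps John and Peter.
SELECT *
FROM customers
WHERE country = 'USA';

-- Reversed condition: keeps everyone else (Maria, George and Martin).
SELECT *
FROM customers
WHERE NOT (country = 'USA');
```
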
it works. Now let's go back to SQL in order to practice. Okay. The next task
order to practice. Okay. The next task it says retrieve all customers with a
it says retrieve all customers with a score not less than 500. So this sounds
score not less than 500. So this sounds really funny. As usual we're going to go
really funny. As usual we're going to go and select star from customers. And now
and select star from customers. And now we have to filter the data based on this
we have to filter the data based on this condition. So the score is not less than
condition. So the score is not less than 500. Well, you can go and say well the
500. Well, you can go and say well the score is higher, greater or equal to
score is higher, greater or equal to 500, right? And with that it is not less
500, right? And with that it is not less than 500. So if you go and execute it,
than 500. So if you go and execute it, we just solve the task, right? We get
we just solve the task, right? We get all the customers that are not less than
all the customers that are not less than 500. Or you can go and use the not
500. Or you can go and use the not operator to make things more funnier. So
operator to make things more funnier. So you go over here and say it is not and
you go over here and say it is not and then you switch it. So you make like
then you switch it. So you make like this. So the score is less than 500. But
this. So the score is less than 500. But as we use here not then we twisted
as we use here not then we twisted everything. So we are saying the score
everything. So we are saying the score is not less than 500. And if you execute
is not less than 500. And if you execute it you will get the exact same results.
The NOT reverses the truth. If you remove it and execute, you get everything that is less than 500. But if you put the NOT back, you invert the whole logic, so when you execute you no longer get the scores that are less than 500. So this is really nice. This is how you use the NOT operator.
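As a minimal sketch of the filters above, the following Python snippet runs them against an in-memory SQLite database. The column names (first_name, country, score) and the sample rows are assumptions reconstructed from the five customers mentioned in this section, not the course's actual database, and the course may use a different database system.

```python
import sqlite3

# Hypothetical sample data: column names and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, country TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Maria", "Germany", 300),
        ("John", "USA", 900),
        ("George", "UK", 750),
        ("Martin", "Germany", 500),
        ("Peter", "USA", 0),
    ],
)

def first_names(where_clause):
    """Return the sorted first names of customers matching the WHERE clause."""
    sql = f"SELECT first_name FROM customers WHERE {where_clause} ORDER BY first_name"
    return [row[0] for row in conn.execute(sql)]

at_least_500 = first_names("score >= 500")        # direct comparison
not_below_500 = first_names("NOT (score < 500)")  # same filter, written with NOT
below_500 = first_names("score < 500")            # the condition without NOT

print(at_least_500)
print(not_below_500)
print(below_500)
```

The first two lists should be identical, while the third contains exactly the customers that the NOT version filters out.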
Okay, my friends. With that, we have covered everything about the logical operators. Now we're going to move to the third group and talk about the range operator. Here we have only one, BETWEEN. So what exactly is the BETWEEN
operator? Okay. So what is BETWEEN? It checks whether a value falls within a specific range. You have a range, and you are checking whether your value is inside the range or outside it. So let's understand exactly what this means. To build a range, you need two things: the lower boundary and the upper boundary. Once you have two boundaries, you have a range. Everything between those two boundaries is true, and everything outside them is false. For example, let's say the lower boundary is 100 and the upper boundary is 500. There is one thing you have to understand about BETWEEN: the boundaries are inclusive. That means if a value is exactly 100 or exactly 500, it is considered true, so it counts as inside the range.
Now, if you apply this filter to our data, where we say the score must be between 100 and 500, SQL does the following. For the first customer, Maria, it checks whether her score is inside the boundaries. As you can see, 300 is between 100 and 500, so she is in the green area. That's why SQL is happy about it and leaves the customer in the output. Now moving on to John. John has 900. As you can see, 900 is greater than 500, so this value falls outside the boundaries on the right side, which means John's score is not in the range. He is not fulfilling the condition, and SQL removes this customer from the results. Now moving on to George with 750: same thing, outside the range. SQL will not accept it and removes this customer from the final results. Now moving on to Martin. His score is 500, and this is exactly at the boundary. If it were 501, it would be outside. But since BETWEEN is inclusive, SQL accepts it: Martin is considered to be in the range and fulfilling the condition, so SQL keeps him in the final result. Now, speaking about Peter, he has a score of zero, and this is less than 100. So he is on the left side, not in the range, not fulfilling the condition, and SQL removes him. This is exactly how BETWEEN works in SQL. It's very simple.
Okay. So now we have the following task: retrieve all customers whose score falls in the range between 100 and 500. Let's start as usual by selecting all data from customers and executing it. The task says everything: we need all customers in a range, so we have a lower value and a higher value. To do that, as usual, we use the WHERE clause and specify the column we want to filter on, which is the score. Since we have two boundaries, we can use the BETWEEN operator. We start with the first boundary, the lower one, which is 100, then AND, then 500, the upper boundary. So: BETWEEN 100 AND 500. Now let's execute it. With that, we get only those two customers, because they fall within this window.
Now, there is another way to solve this task without using BETWEEN: we can use the comparison operators together with the logical operator AND. Let me show you how. I'm going to copy the whole thing, and now we're going to write two conditions. First, the score should be greater than or equal to 100, because the boundaries are inclusive. Second, the score should be less than or equal to 500, which is the upper boundary. With that, we have the two conditions, and we connect them using the AND operator. It's very similar to BETWEEN: we have an AND between the lower and the upper boundaries, but we are using the comparison operators. So it is greater than or equal to 100 and less than or equal to 500. If you run this query, you get exactly the same results.
Now, if you ask me which method is my favorite, I'm going with this one, and I will skip BETWEEN. To be honest, each time I forget whether the boundaries of BETWEEN are inclusive or exclusive. But if I read this script, I can see exactly that the boundaries are inclusive, because we have the equals signs here. So I really prefer using the comparison operators together with AND rather than using BETWEEN. It's up to you: if you have it memorized, then go with BETWEEN. But for me, I'm going with the comparison operators.
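The two approaches can be compared side by side in a small sketch, again assuming an in-memory SQLite database and a hypothetical customers table with first_name and score columns. The last query uses strict comparisons to show what an exclusive range would return instead.

```python
import sqlite3

# Hypothetical sample data: column names and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Maria", 300), ("John", 900), ("George", 750), ("Martin", 500), ("Peter", 0)],
)

def first_names(where_clause):
    """Return the sorted first names of customers matching the WHERE clause."""
    sql = f"SELECT first_name FROM customers WHERE {where_clause} ORDER BY first_name"
    return [row[0] for row in conn.execute(sql)]

with_between = first_names("score BETWEEN 100 AND 500")          # inclusive on both ends
with_comparisons = first_names("score >= 100 AND score <= 500")  # the explicit equivalent
exclusive_range = first_names("score > 100 AND score < 500")     # boundaries excluded

print(with_between)
print(with_comparisons)
print(exclusive_range)
```

Martin, whose score sits exactly on the upper boundary, appears in the first two results but drops out of the exclusive version, which is the detail the explicit comparison operators make visible in the query text.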
Okay, my friends. So that's all about BETWEEN and the range operator. Now let's move to another group: the membership operators. Here we have two, IN and NOT IN. So let's understand exactly what this
means. Okay. So what is the IN operator? It checks whether a value exists in a list. You have a list of values, and you are checking whether your value is a member of that list. Let's take a very simple example to understand what this means. Here is how it works: what you have to do is make a list of values. Let's say I have a list where I have specified two values, Germany and USA. Those two are the members of this list. Now, if you use the IN operator, it checks the value of the country and whether it is in the list or not. Let's do it one by one. For the first customer, Maria, her country is Germany, and Germany is a member of the list. So SQL is happy and leaves Maria in the final results. Now moving on to John. John comes from the USA, and USA is a member of the list, so he fulfills the condition as well and you will see John in the final results. Now we come to George. George comes from the UK, and UK is not a member of our list, so SQL removes this customer from the final results for not fulfilling the condition. For the last two, Martin and Peter, their countries are members of the list, so SQL leaves those customers in the final results. As you can see, it's very simple. All you have to do is define the members of a list and use the IN operator: if the value is a member of this list, it's true; otherwise it's false.
Now of course, the other operator is exactly the opposite, where we say NOT IN the list. We are searching for values that are not in this list. Because we are using NOT, it completely reverses the truth. If you apply this, you get only one customer in the result: George, because his country is UK and UK is not a member of the list. So if you use NOT together with the IN operator, you get exactly the opposite effect. This is how the IN and NOT IN operators work in SQL. Let's go back to SQL to practice that.
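Before the task, here is a minimal sketch of IN and NOT IN on the five customers from the walkthrough, run through Python's sqlite3 module. The column names are hypothetical, and the countries given to Martin and Peter are assumptions: the walkthrough only says their countries are members of the list.

```python
import sqlite3

# Hypothetical sample data: column names and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("Maria", "Germany"),
        ("John", "USA"),
        ("George", "UK"),
        ("Martin", "Germany"),  # assumed country
        ("Peter", "USA"),       # assumed country
    ],
)

def first_names(where_clause):
    """Return the sorted first names of customers matching the WHERE clause."""
    sql = f"SELECT first_name FROM customers WHERE {where_clause} ORDER BY first_name"
    return [row[0] for row in conn.execute(sql)]

in_list = first_names("country IN ('Germany', 'USA')")          # members of the list
not_in_list = first_names("country NOT IN ('Germany', 'USA')")  # everyone else

print(in_list)
print(not_in_list)
```

With this data, NOT IN should leave only George, matching the walkthrough.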
Okay. So now we have this task: retrieve all customers from either Germany or USA. Let's try to solve it. This is going to be a little bit tricky. So, SELECT * FROM customers as usual, and execute it. Now we need in the results only the customers that come from either Germany or USA. That means this customer over here should be excluded from the result, because he comes from the UK. So how are we going to write it? Maybe like this: the country is equal to Germany OR the country is equal to USA, right? Something like this. If you execute it, you get in the output only the customers that are from either Germany or USA. And with that we have solved the task, right?
Well, there is another way to solve this task that is clearer and shorter, using the IN operator. So how are we going to do it? Let's copy the whole thing into another query. Now, instead of having equals and ORs and so on, we use the IN operator, followed by a pair of parentheses, and inside them a list of values. The first value is Germany and the second value is USA, like this. We are saying the country should be in this list, Germany or USA, and if it is one of those values, then the condition is fulfilled. Now if you execute this one, you get the exact same results.
So, my friends, if you notice that you are repeating yourself in the WHERE condition, where you are only changing the value, the conditions are all based on the same column, and you are connecting them using OR, then there is something wrong. In this scenario, always think about using the IN operator, because the OR version gets really ugly once you have a lot of values. Imagine that in our database we have a lot of countries, and your query looks something like this: you keep repeating country equals this OR country equals that, and so on. Instead, you can have a really nice list of countries in one go. As you can see here, it is easier to extend, and it can perform better as well. So whenever you are repeating the same thing, only changing the value, and connecting all those conditions using OR, go and use the IN operator.
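The "easier to extend" point can be sketched as follows, assuming the same hypothetical customers table in an in-memory SQLite database. The repeated OR conditions are written out by hand, while the IN list is built from a Python list with one placeholder per value, so adding a country only means appending to that list. The variable names here are illustrative, not from the course.

```python
import sqlite3

# Hypothetical sample data: column names and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Maria", "Germany"), ("John", "USA"), ("George", "UK"),
     ("Martin", "Germany"), ("Peter", "USA")],
)

# Version 1: one equality test per value, chained with OR.
or_sql = (
    "SELECT first_name FROM customers "
    "WHERE country = 'Germany' OR country = 'USA' "
    "ORDER BY first_name"
)
or_rows = [row[0] for row in conn.execute(or_sql)]

# Version 2: a single IN list, generated from a Python list of wanted values.
wanted = ["Germany", "USA"]
placeholders = ", ".join("?" for _ in wanted)  # "?, ?" for two values
in_sql = (
    f"SELECT first_name FROM customers WHERE country IN ({placeholders}) "
    "ORDER BY first_name"
)
in_rows = [row[0] for row in conn.execute(in_sql, wanted)]

print(or_rows)
print(in_rows)
```

Both queries should return the same customers. Whether IN is actually faster than chained ORs depends on the database engine and its optimizer, so treat the performance remark as something to measure rather than assume.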
All right, my friends. So that's all for the membership operators. Now we're going to speak about the last one, the search operator. Here we have only one, LIKE. And each time we say LIKE, I'm going to remind you to like this course. So let's
go. Okay. So now, what is the LIKE operator? You can use it to search for a pattern in your text. So you have some text, some characters, and you are searching for a specific pattern inside it. Let's take an example to understand exactly what this means. If you don't have a coffee yet, go grab one, because you have to focus for this one. What we have to do is define a pattern in SQL. To build a pattern, we have two special characters. If you use a percent sign, you are saying anything. So I accept anything: it could be no characters at all, only one character, or many characters. Now, if you use an underscore, you are expecting exactly one thing, like one character or one digit. So it is exactly one. I know this sounds complicated, but with an example you will understand it. And I can tell you, the percent sign is way more popular than the underscore. I really rarely use the underscore.
So now let's say I build the pattern like this: the first character must be M, followed by a percent sign. Here I'm saying that in my text the first character must be an M, and after the first character I really don't care. It could be any character, any number, whatever. That is the pattern, so now let's take a few values and say whether each is true or false. Take the value Maria. You can see the first character is an M, which is perfect. This is exactly our pattern: the first character must be an M. After the M we have four characters, and whatever they are is totally fine. So we can say Maria fulfills our pattern. This is exactly what we are searching for, and this value fulfills the condition. Okay. Moving on to the next value, we have M-a. Here again the first character is an M, which is perfect, and after that we have only one character, a. Well, we said percent sign, so it could be anything: one character, multiple characters, a number, whatever. That's why this value matches our pattern, and we will see it in the output. Moving on to the next value, we have only a single M, which is totally fine as well, because we are saying the first character must be an M, followed by anything, and anything includes nothing at all. Moving on to the last scenario, we have Emma. This one is problematic, because the first character is an E and our pattern says it must start with M. That's why this value does not fulfill our pattern, and SQL removes it from the final results. So this is exactly what happens if you have this pattern and those values.
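The four values above can be checked with a short sketch, assuming an in-memory SQLite database and a hypothetical one-column table named samples that holds the example values.

```python
import sqlite3

# Hypothetical table holding the example values from the walkthrough.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (value TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?)",
    [("Maria",), ("Ma",), ("M",), ("Emma",)],
)

# 'M%' means: first character M, followed by zero or more characters of any kind.
starts_with_m = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM samples WHERE value LIKE 'M%' ORDER BY rowid"
    )
]
print(starts_with_m)
```

The single-letter value M matches because the percent sign also accepts zero characters, while Emma is dropped because its first character is an E.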
Now let's have another scenario, where you say: you know what, it could start with anything, but for me the last two characters are very important. They must be an I and an N. So the value could start with anything, but the last two characters must be I and N. Let's take the value Martin. SQL immediately checks the last two characters. You can see we have an I and an N, and the first part, M-a-r-t, is fine. It could be anything. So this value fulfills the condition, because the last two characters are I and N. Moving on to the next one, we have Vin: V, I, N. The last two characters are again exactly what we are searching for, so it fulfills the condition. Before them we have only a V, and we said anything with the percent sign, right? One more: we have just In. It fulfills the condition as well, because we don't have anything before it, and the percent sign always means anything. Moving on to the last scenario, we have Jasmine. It has an I and an N, but they are not the last two characters. The last two characters are an N and an E, which does not match our pattern. That is why this value does not fulfill our pattern, and you will not see it in the results. So with that, you can understand how to search for something in a text using the LIKE operator. Let's keep going.
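A similar sketch for the "ends with in" pattern, again with a hypothetical samples table in an in-memory SQLite database. One assumption worth flagging: SQLite's LIKE ignores case for ASCII letters by default, which is why the capitalized value In matches the lowercase pattern here. Other database systems or collations may treat case differently.

```python
import sqlite3

# Hypothetical table holding the example values from the walkthrough.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (value TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?)",
    [("Martin",), ("Vin",), ("In",), ("Jasmine",)],
)

# '%in' means: anything (including nothing), then the last two characters i and n.
ends_with_in = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM samples WHERE value LIKE '%in' ORDER BY rowid"
    )
]
print(ends_with_in)
```

Jasmine contains the letters i and n but ends in n and e, so it should be the only value missing from the output.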
Now let's say I have a percent sign at the start, a percent sign at the end, and in between only one character, an R. If you define it like this, you are saying that an R anywhere is good enough: whether it's at the beginning, at the end, or in between, the condition is fulfilled. So if you have Maria, you can see we have an R in the middle. On the left side we have two characters and on the right side two characters, but that doesn't matter. The main thing is that we have an R somewhere, so this fulfills the condition. Moving on to Peter, we have an R at the end, and that is totally fine, because we said the right side could be anything, including nothing. We have an R somewhere, so it fulfills the condition. Now another case, Ryan: we have an R at the start. We don't have anything before it, and after it we have three characters, which is totally fine. We don't really care about the position of the R. It is totally acceptable to have an R anywhere. And if you have only an R, that is good enough as well: you don't have anything before, you don't have anything after, and that's okay. But a word like Alice doesn't have any R inside it. This is the only case where we say there is no R here, and SQL removes this value from the results. This way of searching for something is very common: you don't care about what comes before or after the word, right? So if you are searching for any word, you put a percent sign before it and a percent sign after it.
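The "contains" search can be sketched the same way, this time passing the pattern as a query parameter so the same helper could be reused for any search term. The samples table and the contains helper are hypothetical names for illustration. As before, this relies on SQLite's default case-insensitive LIKE for ASCII letters, which is what lets the lowercase pattern match the capital R in Ryan. A case-sensitive system would need the pattern adjusted.

```python
import sqlite3

# Hypothetical table holding the example values from the walkthrough.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (value TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?)",
    [("Maria",), ("Peter",), ("Ryan",), ("R",), ("Alice",)],
)

def contains(term):
    """Return the values that contain the term anywhere, in insertion order."""
    pattern = f"%{term}%"  # percent sign before and after the search term
    sql = "SELECT value FROM samples WHERE value LIKE ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (pattern,))]

with_r = contains("r")
print(with_r)
```

Alice is the only value without an r, so it should be the only one missing from the output.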
Now, I know that we want to practice with the underscore. So let's say I have two underscores, then the character B, and then a percent sign. What I'm saying here is: there should be something in the first position, there should be something in the second position as well, then the third position must be exactly the character B, and after that it could be anything. We really don't care. I know this is a little bit complicated, so let's have an example. We have the value Albert. In the first position we have something, the A. In the second position we have something as well, the L. So far we are good with the pattern. Then in the third position we have B, so we have a complete match, and the rest, E-R-T, is whatever. With that, Albert matches our pattern. Moving on to the next one, Rob. In the first character we have something, the R, which is good. In the second character we have an O, so it's not empty; we have something. The third one is exactly B, and after that we don't have anything, which is fine. So again this value fulfills the condition. Moving on to the next one. It starts with an A, so we have something in the first position. In the second position we have something as well, the B. But now the third character is a problem: it is not a B, we have an E. That's why it is not following our pattern, and SQL removes it. Moving on to the last example, we have an A and an N. In the first position we have something, and in the second one as well. But in the third position we don't have anything. We don't have a B. So that's why it is removed.
So that's why it's going to be removed. So my friends I know that was a lot.
So my friends I know that was a lot. This is exactly how you build a pattern
This is exactly how you build a pattern for the like operator using the
for the like operator using the percentage and the underscore. But the
percentage and the underscore. But the percentage is more famous. So this is
percentage is more famous. So this is exactly how it works. Let's go back to
exactly how it works. Let's go back to scale in order to have some examples.
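The walkthrough above can be sketched as a small script. This is only an illustration: the table name `pattern_demo` and the values `Abel` and `An` are assumptions standing in for the third and fourth examples, which are only spelled out letter by letter. The pattern uses a lowercase `b` so that it also matches on systems where LIKE is case-sensitive.

```sql
-- Hypothetical demo table; the name and the values are illustrative only.
CREATE TABLE pattern_demo (val VARCHAR(20));

INSERT INTO pattern_demo (val)
VALUES ('Albert'), ('Rob'), ('Abel'), ('An');

-- '_' matches exactly one character, '%' matches zero or more characters.
-- '__b%' reads: any character, any character, then b, then anything.
SELECT val
FROM pattern_demo
WHERE val LIKE '__b%';
-- Matches Albert and Rob.
-- Abel has an e in the third position, and An has no third character at all.
```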
All right, let's start with this task: find all customers whose first name
starts with a capital M. So let's go and search for that information. We're going
to start as usual: SELECT star FROM customers. Now we have to build the filter
logic, so we're going to say WHERE. We are searching for something in the first
name, so we're going to say first name. It is very important that it starts with
an M, and the rest doesn't matter. So we're going to use the LIKE operator in
order to search. We're going to have our single quotes, we start with the M, and
then we add a percentage, because it doesn't matter what comes after that. For us
it is only important that the first character is an M. Let's go and execute it.
With that we got our two customers, Maria and Martin, and both of them start with
an M. So with that we have solved the task. It is very simple.
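As a minimal sketch of this task: the column name `first_name` is an assumption for the spoken "first name", and the five sample rows are made up from the names mentioned in this part of the course.

```sql
-- Assumed sample table; column names and ids are illustrative.
CREATE TABLE customers (id INT, first_name VARCHAR(50));

INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');

-- M first, then '%' for "anything (or nothing) afterwards".
SELECT *
FROM customers
WHERE first_name LIKE 'M%'
ORDER BY id;
-- With this sample data: Maria and Martin.
```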
Now we have the following task: find all customers whose first name ends with
an N. Let's first select all the customers. We need all those customers that have
an N at the end, so we have John and Martin as well. So how are we going to do
it? The same thing: WHERE first name LIKE, since we are searching, but here we're
going to change the expression. It must end with an N as the last character. What
comes before that doesn't matter, so it could be anything, but the last character
of the word should be an N. So the percentage comes first and the N comes last.
That's it. Let's go and execute. With that we got John and Martin, because the
last character is an N. It is very simple, right? It is all about where we place
the percentage.
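A minimal sketch of the "ends with" case, under the same assumptions as before (a `customers` table with an assumed `first_name` column and made-up sample rows). The pattern uses a lowercase `n` so it behaves the same whether or not LIKE is case-sensitive on your system.

```sql
-- Assumed sample table; column names and ids are illustrative.
CREATE TABLE customers (id INT, first_name VARCHAR(50));

INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');

-- '%' first for "anything before", then n as the last character.
SELECT *
FROM customers
WHERE first_name LIKE '%n'
ORDER BY id;
-- With this sample data: John and Martin.
```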
Okay. So now we have the next task: find all customers whose first name
contains an R. Here we don't have any specification whether it is at the start or
at the end; somewhere there should be an R. If you first execute without any
WHERE condition, you can see for example that Maria has an R somewhere in the
middle, George as well, Martin too, and Peter has it at the end. So we have a lot
of names with an R. So how can we search for that? We're going to stick with
WHERE first name LIKE, and here our character is going to be an R, and we put a
percentage before it and after it. So it doesn't matter what is before it or
after it; somewhere there should be an R. Let's go and execute it. With that we
got all our customers that have an R somewhere. As you can see, it is very
simple. If you put the percentage before and after, then you are open to more
results. This is used a lot in order to search for a value inside your database.
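A minimal sketch of the "contains" case, again assuming a `customers` table with a `first_name` column and made-up sample rows:

```sql
-- Assumed sample table; column names and ids are illustrative.
CREATE TABLE customers (id INT, first_name VARCHAR(50));

INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');

-- '%' on both sides: the r may sit at the start, in the middle, or at the end.
SELECT *
FROM customers
WHERE first_name LIKE '%r%'
ORDER BY id;
-- With this sample data: Maria, George, Martin and Peter (John has no r).
```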
All right. Now we're going to move to a funny one. It says: find all customers
whose first name has an R in the third position, for some reason. I don't know
why. So let's go and query our customers without any filter. For us it is very
important to find the customers where we have an R in the third position. For
example, with Maria the third character is an R, which is okay, but with Peter
the R is not the third character, so it does not fulfill the condition. So how
are we going to write that? Like this: WHERE first name LIKE, but now we have to
write the pattern from the start. The first position is going to be an
underscore, the second position is going to be an underscore as well, and in the
third position we have an R. With that we make sure the third position is an R
and that there are two positions before it. It doesn't matter what comes after
that, it could be nothing or more characters, so we finish with a percentage. If
you execute it like this, we get Maria and Martin, and we do not get Peter,
because the R is not in the third position. Now, if you don't do it correctly
with the underscores, say we remove one of them and execute, you will get
nothing, because we don't have any first name where the second position is an R.
So you have to be very careful with this.
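A minimal sketch of the positional pattern, including the "one underscore too few" mistake. The `customers` table, the `first_name` column and the sample rows are assumptions for illustration.

```sql
-- Assumed sample table; column names and ids are illustrative.
CREATE TABLE customers (id INT, first_name VARCHAR(50));

INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');

-- Two underscores = exactly two characters, then r in the third position.
SELECT *
FROM customers
WHERE first_name LIKE '__r%'
ORDER BY id;
-- With this sample data: Maria and Martin.

-- Only one underscore: now r would have to be the second character.
SELECT *
FROM customers
WHERE first_name LIKE '_r%';
-- With this sample data: no rows.
```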
All right, my friends. So this is how you search inside your values. And with
that we have covered all the different groups of operators that you can use
inside a WHERE clause, so you have learned how to filter your data using multiple
operators. You can now filter anything in SQL. Now we will move to a very
interesting topic: you will learn how to combine your data from multiple tables.
Here we have two main methods. The first one is SQL joins and the second one is
set operators. They are really big topics, so we're going to focus first on the
SQL joins, and here we have a lot of things to cover. So now we are talking about
the core of SQL. So let's
go. All right. So now we have two tables, table A and table B, and the big
question here is how to combine those two tables. What do we want exactly? Do you
want to combine the rows or the columns? If you say, I would like to combine the
columns, then we are talking about joining tables, so we're going to use joins in
SQL. Let's say that we are joining table A with table B and we start from table
A. SQL is going to take the columns and the rows of table A and call it the left
table, because we started from there. Then we join it with table B, and SQL is
going to call this second table the right table. What happens here is that SQL
takes the columns and the rows from the right table and puts them side by side
with the columns and rows of table A. So we are combining the columns; we are
putting them side by side. Now, if you say, you know what, I don't want to do
that, I would like to combine the rows, both tables have the same columns and I
just want to stack them, then we are talking about another method. It is called
the set operators. Here there is no left and right. Since we started with table
A, SQL is going to take the columns and the rows of table A and put them in the
result. Then it goes to the second table, table B, takes only the rows, and puts
them below the rows of table A. So we are putting the rows beneath each other; we
are appending. That means with the set operators we are combining the rows, so
our table is going to be longer, but with the joins we are combining the columns
side by side and we are getting a wider table.
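To make the "wider versus longer" contrast concrete, here is a rough sketch of both methods on two tiny tables. The table names, column names and values are placeholders invented for this illustration, and INNER JOIN and UNION ALL are just one representative of each method.

```sql
-- Hypothetical tables standing in for "table A" and "table B".
CREATE TABLE table_a (id INT, label VARCHAR(20));
CREATE TABLE table_b (id INT, label VARCHAR(20));

INSERT INTO table_a (id, label) VALUES (1, 'a1'), (2, 'a2');
INSERT INTO table_b (id, label) VALUES (2, 'b2'), (3, 'b3');

-- Join: columns side by side, matched on a key column (wider result).
SELECT a.id, a.label AS a_label, b.label AS b_label
FROM table_a a
INNER JOIN table_b b
    ON a.id = b.id;
-- One row here (id 2), carrying columns from both tables.

-- Set operator: rows stacked beneath each other (longer result).
-- Both SELECTs return the same number of columns; no key is involved.
SELECT id, label FROM table_a
UNION ALL
SELECT id, label FROM table_b;
-- Four rows here: two from table_a and two from table_b.
```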
But now, for each method there are different types. For example, for the joins
we have four very common types: we can do an inner join, a full join, a left
join, or a right join. Of course there are more than that, but those are the
basics. For the set operators we have types as well: UNION, UNION ALL, EXCEPT,
and INTERSECT. And for each method there are different rules. In order to join
tables, we have to define the key columns between the two tables. Don't worry,
we're going to learn about that later. That is the requirement for joining
tables. The requirement for combining tables using the set operators is that the
tables in your query have exactly the same number of columns, but here you don't
need any key in order to combine the tables. So guys, if you look at this, in
order to combine two tables you first have to decide: do I want to combine the
columns or the rows? So first you decide on the method, after that you have
different types for how exactly you combine the data, and of course there are
rules that you have to follow. Now, of course, we're going to cover everything in
the course, but in this section we're going to learn how to combine tables using
the SQL joins. So, we're going to go and dive into this
world. All right. So now, what exactly are SQL joins? Let's say that we have
two tables. In the left table we have the customer names, so we have four
customers. In the right table we have the country information about the
customers. Now we would like to query both pieces of information, the names and
the countries. In order to query those two tables in one query, we first have to
connect them, and in order to connect them we need a key: a column that exists on
both the left and the right side. Looking at this, the common column here is the
ID of the customer. Once we connect those IDs, we are able to query the tables
together, and SQL is going to start matching the IDs. For ID 1, we get the name
Maria and the country Germany. ID 2 connects John to USA. Now you can see that ID
3 is not connectable, so we cannot connect it to the right side. But ID 4 we can
use in order to connect Martin to Germany. So this is exactly what happens if you
join two tables: you connect them using a common column, a key like the ID, and
once we have a matching value, we can connect the two rows together. So this is
what we mean with SQL joins.
Now you might ask, why do we actually need joins? Well, the first and very
important reason is to recombine your data. Usually in databases, the data about
something like the customers can be spread across multiple tables. We could have
a table called customers, another one with the customer addresses, a third table
where you can find the orders of the customers, and maybe another one where you
can find the reviews of the customers. So as you can see, the data of the
customers is spread across four tables. Now what if I would like to see all the
data about the customers in one result? I would like to see the complete big
picture about our customers. What we can do is connect those four tables using
SQL joins, and once we do that, in one query I am able to combine all those
tables into one big result. This is the most important reason why we use SQL
joins: to combine all the data about a specific topic in order to see the big
picture.
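The ID matching walked through above (Maria with Germany, John with USA, ID 3 without a partner, Martin with Germany) can be sketched as follows. The table name `customer_countries`, the column names, and the name `George` for ID 3 are assumptions; an inner join is used here as one possible choice, and other join types treat the unmatched ID 3 differently.

```sql
-- Assumed sample tables; names other than Maria, John and Martin are invented.
CREATE TABLE customers (id INT, first_name VARCHAR(50));
CREATE TABLE customer_countries (id INT, country VARCHAR(50));

INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin');

INSERT INTO customer_countries (id, country)
VALUES (1, 'Germany'), (2, 'USA'), (4, 'Germany');

-- The shared id column is the key that connects the left and right table.
SELECT c.id, c.first_name, cc.country
FROM customers c
INNER JOIN customer_countries cc
    ON c.id = cc.id
ORDER BY c.id;
-- With this sample data: (1, Maria, Germany), (2, John, USA), (4, Martin, Germany).
-- ID 3 has no matching row on the right side, so the inner join leaves it out.
```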
Now, another reason why we use SQL joins is data enrichment. This is where I
want to get extra data, extra information. Let's say that you are querying the
table customers and this is your main table, the master table. You are able to
see all the data that you need, but sometimes you would like to get some extra
information from another table, for example the zip codes of the countries. So
you would like the help of another table; we call it a reference table or
sometimes a lookup table, where there is one extra piece of information that you
would like to add to your master table, the primary source of your data. What we
can do is join those two tables in order to enhance our table, so we are getting
extra relevant information for the customers. We call this process data
enrichment: I'm getting extra data for my main table. So this is another reason
why we use joins.
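As a rough sketch of data enrichment: the lookup table `country_codes`, its `iso_code` column and all sample rows are invented for this illustration. A left join is used here as one way to keep every row of the master table even when the lookup has no entry.

```sql
-- Master table (assumed columns) and a hypothetical lookup table.
CREATE TABLE customers (id INT, first_name VARCHAR(50), country VARCHAR(50));
CREATE TABLE country_codes (country VARCHAR(50), iso_code CHAR(2));

INSERT INTO customers (id, first_name, country)
VALUES (1, 'Maria', 'Germany'), (2, 'John', 'USA'), (3, 'George', 'UK');

INSERT INTO country_codes (country, iso_code)
VALUES ('Germany', 'DE'), ('USA', 'US');

-- Add one extra column from the lookup table to every customer row.
SELECT c.id, c.first_name, c.country, cc.iso_code
FROM customers c
LEFT JOIN country_codes cc
    ON c.country = cc.country
ORDER BY c.id;
-- All three customers stay in the result; iso_code is NULL for the customer
-- whose country is missing from the lookup table.
```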
All right. So far we have used joins in order to get data from two tables. But
there is another use case for SQL joins: we use them in order to check the
existence of your data in another table, or maybe the non-existence. Let's say
that I have a table called customers, and I'm working with this table and doing
queries. But now I would like to check something: I would like to check whether
our customers ordered something. In order to check that, I need the help of
another table, for example the table orders. That means I'm using the table
orders only for my check. I don't want to get any extra data from the orders in
my final result; I'm just using the table orders, and we call this table a
lookup. What we can do is connect those two tables together, and based on the
existence of the customer inside the second table, the orders, the customer
either stays in the final result or is removed. That means I'm filtering the data
based on the join. And of course I can check the non-existence as well: I would
like to see in the final result all the customers that didn't order anything. So
it is the same scenario.
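Both checks can be sketched like this. The `orders` columns and all sample rows are assumptions, and the two queries are only one common way to express the idea with joins.

```sql
-- Assumed sample tables; ids and order numbers are made up.
CREATE TABLE customers (id INT, first_name VARCHAR(50));
CREATE TABLE orders (order_id INT, customer_id INT);

INSERT INTO customers (id, first_name)
VALUES (1, 'Maria'), (2, 'John'), (3, 'George');

INSERT INTO orders (order_id, customer_id)
VALUES (1001, 1), (1002, 1), (1003, 3);

-- Existence: keep only customers that appear in orders.
-- No order columns are selected; DISTINCT collapses customers with several orders.
SELECT DISTINCT c.id, c.first_name
FROM customers c
INNER JOIN orders o
    ON o.customer_id = c.id;
-- With this sample data: Maria and George.

-- Non-existence: keep only customers that have no row in orders.
SELECT c.id, c.first_name
FROM customers c
LEFT JOIN orders o
    ON o.customer_id = c.id
WHERE o.customer_id IS NULL;
-- With this sample data: John.
```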
So my friends, those are the three main reasons why you use SQL joins. First,
if you want to combine the data from multiple tables into one big picture, you
use a join in order to get the data from the different tables. In the second use
case, you are working with one table but you would like to get extra information
from another table, so you are doing something called data enrichment. And in the
third scenario, we don't want to combine the data; we just want to join with
another table in order to check the existence of your records in that other
table. So this is why we need joins in
SQL. Now, there are a lot of different possibilities for how to join tables,
how to join the data. In order to make it easy to understand, we're going to
visualize the tables as two circles. So we have table A and table B. Table A is
on the left side, and we call it the left table. Table B is on the right side,
and we call it the right table. The side of the tables is very important. Now, if
you combine those two circles, you get three different possibilities. The circles
overlap, and exactly there is where we have the matching data between the two
tables, the data that is available on the left and on the right. Another
possibility is that you want to get all the data from one of the tables, so you
get all the rows from one circle. And the third possibility is that you want to
get only the unmatching data from one table: if something exists in one table but
not in the other table, we call it unmatching data. Those are the three scenarios
that you have to ask yourself about once you are combining tables, and this can
generate a lot of join types. Here we have the basic SQL joins, those are the
classical ones, and it depends on the scenario whether you want only the matching
rows or all the rows from either the left or the right. And we have the advanced
SQL joins, where we focus on the unmatching data. Now we're going to cover all
those types one by one. We're going to start with the basics, and the first
option that you have is to get all the data without joining tables. So let's see
what this means.
So what do we mean by no join? Well, we want to return the data from two
tables without combining them. Actually, this is not a join type, because we are
not combining anything; we just want to query the data from two tables. That
means from table A we want to see everything, all the rows, and from table B we
want to see everything as well, all the rows. So we want to see two results, and
there is no need to combine them. Let's see the syntax for that. All you have to
do is very simple: SELECT star FROM table A, then a semicolon, and then start
another query, SELECT star FROM table B. So that's it. And of course, since we
are not combining the data, there will be no join in the syntax. Let's go to SQL
in order to do that. Okay. So now we have the following task. It says: retrieve
all data from customers and orders in two different results. That sounds like we
don't have to combine the tables. All we have to do is the following: we select
the data from the first table, like this, and then we write another query for the
second table, the orders. We don't have to combine them in one big query; we just
use very simple SELECT statements in order to retrieve the data. If you execute
it, since you have two separate queries, you get two results: in one result you
get all the customers, and in the other result you get all the orders, and the
data is not combined at all. So this is how you query two tables without
combining them. With that we are getting all the data without joining the tables.
joining the tables. Now we're going to start talking about the first type of
start talking about the first type of join the inner join where we start
join the inner join where we start combining the data from two tables. So
combining the data from two tables. So let's
go. Okay. So now what exactly is an inner join? This type is going to return
only the matching rows from both tables. That means we will see in the output
only matching rows. So now what do we need from the left table? We want only
the matching data. We will not get the whole circle of A. We will get only
the part that overlaps with table B. So we want to see the data from A only
if it exists in table B. And now what do we need from table B? Exactly the
same thing, only the matching data. That means I don't want to see all the
data from B. I want to see only the data in B that has a match in table A,
on the left side. And with that you will get only the matching data from
both tables.
Now let's see how we can write that in SQL. It is a usual query, and as
always we start with a SELECT. We select, for example, all the columns, then
FROM, and here we specify the table name, so it's going to be A. So far
nothing new. But now we want to add table B as well in the same query. In
order to do that we use the keyword JOIN, and then we say table B, the name
of the table. And since we have different types of joins in SQL, you can
specify the type of the join before the keyword JOIN. If you don't specify
anything, the default type is inner join. But my friends, the best practice
is to always mention the type. I don't like to rely on the defaults, because
in projects maybe not everyone is aware of them. So don't skip that. Always
specify the type. So what we're going to do is put the keyword INNER before
the JOIN. With that, SQL is going to know how to deal with the rows between
the two tables. But we are still not done. We have to tell SQL how to combine
the tables, and for that we use the keyword ON. After that you specify the
join condition. As we learned, in order to join two tables we have to find a
common column in order to match the data, right? And usually in SQL those are
the keys or IDs. So the condition can be like this: the key from table A must
be equal to the key from table B. So this is the join condition, and using it
SQL can start matching the data from the left table and the right table.
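
As a rough illustration of that syntax, the sketch below joins two tiny tables on a shared key. It assumes SQLite through Python's sqlite3 module; the tables a and b, the key column id, and all values are hypothetical. It also runs the same query with a bare JOIN to show the default behaviour mentioned above.

```python
import sqlite3

# Hypothetical tables "a" and "b" sharing a key column called "id".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, a_value TEXT);
    CREATE TABLE b (id INTEGER, b_value TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (2, 'b2'), (3, 'b3'), (4, 'b4');
""")

# Join type written out explicitly, followed by the join condition after ON.
explicit = conn.execute("""
    SELECT *
    FROM a
    INNER JOIN b
        ON a.id = b.id
    ORDER BY a.id;
""").fetchall()

# Same query with a bare JOIN: the default join type is the inner join.
implicit = conn.execute("""
    SELECT *
    FROM a
    JOIN b
        ON a.id = b.id
    ORDER BY a.id;
""").fetchall()

print(explicit)
print(explicit == implicit)
```

Only ids 2 and 3 exist in both tables, so only those rows come back, with the columns of a and b side by side.
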
And there is one thing that is very important while you are joining tables:
you have to understand the order of the tables in your query. Now, in the
inner join the order of the tables doesn't really matter. Whether you start
from A or you start from B, it doesn't matter, because you will get the same
results. Both tables have the same priority, and it doesn't matter where we
start: whether we say FROM A JOIN B or FROM B JOIN A, we will get the exact
same results. So in the inner join you don't have to worry about the order
of the tables. So that's all about the inner join. Now let's go back to SQL
in order to practice.
Okay. So now we have the following task, and it says: all customers along
with their orders, but only for customers who have placed an order. So my
friends, that means we need the data from the customers and from the orders,
from two tables, and we have to put everything in one result. That means we
have to join two tables. Now let's do it step by step. We're going to say
SELECT * FROM customers, and then we have to join it with the orders, so
we're going to say JOIN orders. Now you have to specify the join type. Is it
inner, left, full, and so on? Well, that depends on the task. It says we want
all customers, but only customers who have placed an order. So there is a
condition right here. We don't want to see everything from customers. We
want to see only the matching data, only if the customer has an order in the
orders table. And for that we can use the inner join.
Of course, you can leave it like this, with just JOIN, and you will get the
same effect, but I'm going to specify it like this, INNER JOIN, just to make
it clear that we are speaking about the inner join. After that we have to
specify the join condition. So we have to find a common column between the
customers and the orders. How I usually do it: I explore both of the tables.
So I'm going to select everything from customers and also everything from
orders. Let's execute. Now we start searching for a common column between
those two tables. From the first table we have first name, country, and
score, and you don't find any of that information in the second table. The
only one is the ID. The ID of the customer can be found in the orders table
as well, in the second column here. So this is the common column between
those two tables.
And usually in databases we create IDs exactly for this, in order to connect
tables. It's really rare that we use something like country, score, or first
name in order to join tables. We usually use the IDs. So let's go back to our
query and use those two columns. It's going to be the ID from customers equal
to the customer ID. So that's it. With that we have the condition, we have
decided on the type, and we can execute it. Now you can see we are getting
only three customers, right? If you don't apply the inner join, we can see
that we have five customers. So that means we actually have two customers
without any orders, without any matching data in the other table. And you can
also see very nicely that we now have not only the columns from customers but
also all the columns from orders, side by side. So with that we have combined
the data, and with that we have also solved the task.
But we will not leave our query like this, because it is not really good
practice. What we have to do is select only the columns that really make
sense in our query, because in many cases your tables will have a lot of
columns that are not needed. For example, if you check here, you see we have
the customer ID here and the customer ID again over here. So it's a
repetition, and it's enough to see it only once. So what you have to do is
pick the few columns that you want. For example, I'm going to start with the
ID and maybe the first name, and that's all from the first table. Let's get
the order ID, and I don't want the customer ID again. So from the second
table I'll also add the sales. Let's execute it. And with that you can see
very nicely the customers' names and their orders with the sales. And now
comes something very important.
Sometimes if you have two tables you might have columns that have the same
name. Imagine the order ID in the table orders were called ID. That means we
have the same name in both tables, and this makes SQL very confused. Here you
will get an error that tells you: I really don't know what you mean by ID. Is
it from the table customers or from orders? So we have to tell SQL exactly
which table this column comes from. In order to do that in SQL, before the
column name you write the table name, customers, and then you add a dot. Now
we are telling SQL that this column, ID, comes from the table customers. SQL
will not be confused about it, and it's going to get the ID from customers.
And for the second ID you can go over here and write orders.id before it, so
that SQL knows this ID comes from orders and the other one comes from
customers.
And it is always good practice, especially if you are joining tables, to
assign a table to each column. Because after a while, if you open your query
and you see sales, you ask yourself: does sales come from customers or from
orders? And if you have a long list of columns, it's going to be really
confusing. That's why we consider it best practice to always prefix each
column with the table name, especially if you are doing joins. So it's going
to be like this. Of course, if you have only one table, it's clear that all
the columns in the SELECT come from this table. But since here we are dealing
with multiple tables, it is good to show it like this. And of course here we
don't have the ID, we have the order ID. The same thing goes for the join
condition: the ID here comes from customers, and the customer ID comes from
orders. So now it is clear for everyone which column comes from which table.
But now you might say: you know what, each time I have to write customers,
and this is a very long name. And sometimes in real projects you're going to
see tables that have really long names, and it's going to be really annoying
to add the name each time before each column, right? So instead of that we
can assign aliases to the tables, not only to the columns. Usually we go over
here and say AS, and maybe you use only one character, like the first
character, C. And now instead of saying customers you can go over here and
say C. The same thing for the second column, and over here as well. And you
can now use the C everywhere in your query. The same thing for the orders:
you can go over here and say AS O. And now instead of orders you say O here.
And now it is very easy to see that those two columns come from C, meaning
customers, and those two columns come from O, the orders.
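
Putting the last few steps together, here is a minimal sketch of the query with selected columns, table prefixes, and the aliases c and o. It assumes SQLite through Python's sqlite3 module; the column names follow the transcript, while the rows and the orders_alt table (used only to provoke the ambiguous-column error) are hypothetical.

```python
import sqlite3

# Hypothetical sample data: five customers, three of them with an order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
    INSERT INTO customers VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);
""")

# Every column carries its table alias, so its origin is obvious at a glance.
rows = conn.execute("""
    SELECT
        c.id,
        c.first_name,
        o.order_id,
        o.sales
    FROM customers AS c
    INNER JOIN orders AS o
        ON c.id = o.customer_id
    ORDER BY c.id;
""").fetchall()
for row in rows:
    print(row)

# Hypothetical variant: if the key of the orders table were also called "id",
# an unqualified "id" would be ambiguous and the database rejects the query.
conn.execute("CREATE TABLE orders_alt (id INTEGER PRIMARY KEY, customer_id INTEGER)")
try:
    conn.execute(
        "SELECT id FROM customers "
        "INNER JOIN orders_alt ON customers.id = orders_alt.customer_id"
    )
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)
print("unqualified id ->", ambiguous_error)
```

With this sample data only the three customers that have an order appear in the joined result, and writing customers.id or orders_alt.id instead of a bare id would make the second query valid.
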
Those are the best practices when you are joining tables together in SQL.
And of course with that we have solved the task. And about the order of the
tables, it doesn't matter where you start. For example, if you take customers
here and put it in the JOIN, and put orders in the FROM, so you just switch
the tables, and then execute it, you will get the exact same results. So if
you are doing an inner join between two tables, don't worry about the order
of the tables.
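
The claim about table order is easy to check with a small sketch. It assumes SQLite through Python's sqlite3 module and the same hypothetical customers and orders rows as a stand-in for the course data.

```python
import sqlite3

# Hypothetical sample data (an assumption, not the course database).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
    INSERT INTO customers VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);
""")

# customers in FROM, orders in JOIN
customers_first = conn.execute("""
    SELECT c.id, c.first_name, o.order_id, o.sales
    FROM customers AS c
    INNER JOIN orders AS o
        ON c.id = o.customer_id;
""").fetchall()

# Tables switched: orders in FROM, customers in JOIN
orders_first = conn.execute("""
    SELECT c.id, c.first_name, o.order_id, o.sales
    FROM orders AS o
    INNER JOIN customers AS c
        ON c.id = o.customer_id;
""").fetchall()

# Row order is not guaranteed without ORDER BY, so compare sorted copies.
same_rows = sorted(customers_first) == sorted(orders_first)
print(same_rows)
```

One caveat: with SELECT * the columns would appear in a different left-to-right order after the switch, even though the matched rows are the same, which is another reason to list the columns explicitly.
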
Okay. So now let's understand exactly how SQL executes the inner join. Okay.
So again here we have our query, and then we have the two tables, customers
and orders. And here we have the IDs on which we are joining the data: this
is the ID from the table customers, and this is the customer ID that we have
in orders. Now let's see how SQL executes this. We are saying I would like to
see the ID and the first name. So we will get the ID and the first name from
the table customers, and we would like to get the order ID and the sales from
the table orders. So our result is going to focus on those four columns. Now
the data should be joined between those two tables using the inner join, and
SQL is going to start from the left table, customers, because we say FROM
customers. So it's going to start matching the ID from the left table with
the right table. It's going to ask: is there a match between the first
customer record and the first order? Well, yes, it is the same ID, so SQL is
going to say okay, the condition is fulfilled and we are allowed to see the
data. So the data will be presented in the output: we're going to have the
ID, Maria, the order ID for Maria, and the sales of this order. So there is a
match. Then SQL is going to go to the second order record. Well, we don't
have a match. The third, we don't have a match.
And so on for the last one. So we have only one match for this ID. Then SQL
is going to go back to customers, pick the second one, and start matching
again with the first order. Do we have a match? Well, no. Then it's going to
go to the second. Well, now we have a match. So SQL is going to be happy: the
condition is fulfilled and we will see the results. So we're going to see the
first name and the order information for this customer in the output. It's
going to keep searching, and we don't have a match here either. So that's it.
Now for the third customer it again begins from the start. Is there a match?
No. On to the second, then to the third, and here we have a match. So it's
going to show this information, since there is a match: customer three,
George, with the order ID and the sales of this customer's order in the
output. Now it's going to keep continuing the search. Well, we don't have any
other match.
Then it's going to go to the fourth customer and start matching. Do we have
this ID here? Do we have a match? Well, no. Then the second, third, and
fourth order: we don't have any order for this ID. There is no match at all.
And since we are saying inner join, SQL will not show the data of this
customer in the results. There is no match, and SQL is going to totally
ignore this customer. Then we go to the last one and again start matching
this ID with the orders. Well, there is no match here either, so SQL is going
to exclude this customer from the results. So this is exactly how the inner
join works: it starts from the left side and matches the data against the
right side, and only if there is a match is the result presented in the
output. This is exactly why we are getting these results, and how the inner
join works.
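
The row-by-row matching walked through above can be imitated with two nested loops. This is only a conceptual sketch in plain Python with hypothetical sample rows: it mirrors the animation, not what a database engine necessarily does internally, since real engines may choose other strategies such as hash or merge joins.

```python
# Hypothetical sample rows: (id, first_name) and (order_id, customer_id, sales).
customers = [(1, "Maria"), (2, "John"), (3, "George"), (4, "Martin"), (5, "Peter")]
orders = [(1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10)]


def inner_join(left_rows, right_rows):
    """Return (id, first_name, order_id, sales) for every matching pair."""
    result = []
    for cust_id, first_name in left_rows:                 # start from the left table
        for order_id, customer_id, sales in right_rows:   # check every order
            if cust_id == customer_id:                    # the join condition
                result.append((cust_id, first_name, order_id, sales))
        # A customer with no matching order adds nothing to the result.
    return result


joined = inner_join(customers, orders)
for row in joined:
    print(row)
```

Customers 4 and 5 never satisfy the condition, so they are skipped, just as in the walkthrough.
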
So now if you look again at the reasons why we join tables, we can say that
we can use the inner join in order to recombine multiple tables into one big
picture. That's the first use case. And we can also use the inner join in
order to filter the data. Since we are saying only the matching data, that
means we are filtering the data: we are checking the existence of the records
in another table. So you can use the inner join either to combine data from
multiple tables, or you can use it only for filtering purposes, only to check
the existence of your rows. Those are usually the two use cases of the inner
join. All right. So that's all about the first type, the inner join. Next
we're going to talk about the left join. So we're going to focus on the left
side. So let's go.
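
The filtering use case can be sketched as a query that returns columns from one table only and uses the join purely as an existence check. It assumes SQLite through Python's sqlite3 module and hypothetical sample rows; the DISTINCT is an addition of this sketch, there so that a customer with several orders would still be listed once.

```python
import sqlite3

# Hypothetical sample data (an assumption, not the course database).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
    INSERT INTO customers VALUES (1, 'Maria'), (2, 'John'), (3, 'George'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);
""")

# Only customer columns are selected; orders is joined just to keep the
# customers that have at least one matching order.
customers_with_orders = conn.execute("""
    SELECT DISTINCT c.id, c.first_name
    FROM customers AS c
    INNER JOIN orders AS o
        ON c.id = o.customer_id
    ORDER BY c.id;
""").fetchall()

print(customers_with_orders)
```
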
Okay. So now what exactly is a left join? This type is going to return all
the rows from the left table and only the matching rows from the right table.
So now if you look again at our two circles, A and B: what do we need from
the left table? We want to see everything, all the rows, all the data. That
means we will get a full circle. And from the right table we want to get only
the matching data. So we don't want to see everything from table B. We want
to see only the records that have a match in table A. That means, my friends,
the left table has more priority here. This is the primary source of your
data, the main source, and we cannot miss anything. This is very important:
we want to see all its data. Table B, on the other hand, is a secondary
source of data, and we are joining it only to get additional data. So I don't
want everything. I want only the data that has a match in the left table. So
this is what we mean by a left join.
a lift join. Now if you look to the syntax it's going to be very similar to
syntax it's going to be very similar to the inner join. So we start from the
the inner join. So we start from the left table the A. Then we say left join
left table the A. Then we say left join the right table B and then the same
the right table B and then the same condition using keys. So here we just
condition using keys. So here we just switch the type. Instead of inner we
switch the type. Instead of inner we have now left. But now here with the
have now left. But now here with the syntax we need to be very careful. The
syntax we need to be very careful. The order of the tables now is very
order of the tables now is very important. You have to start from the
important. You have to start from the correct table. So you have to mention
correct table. So you have to mention the left table exactly in the from
the left table exactly in the from clause and then you join it with the
clause and then you join it with the right table. So in the join you have to
right table. So in the join you have to specify the right table. If you don't do
specify the right table. If you don't do it like this then you will not get all
it like this then you will not get all the data from a and you will not get the
the data from a and you will not get the results that you are expecting. So this
results that you are expecting. So this is what we mean with the left join.
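The syntax just described can be sketched as follows. This is only an illustration: the tables `a` and `b`, their columns `id`, `a_val` and `b_val`, and the sample rows are assumptions, and the SQL is run through Python's built-in `sqlite3` module purely so that the example is self-contained.

```python
import sqlite3

# In-memory database with two tiny, made-up tables (not from the course).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, a_val TEXT);
CREATE TABLE b (id INTEGER, b_val TEXT);
INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO b VALUES (1, 'b1'), (3, 'b3'), (9, 'b9');
""")

rows = conn.execute("""
SELECT a.id, a.a_val, b.b_val
FROM a                 -- left table: every row is kept
LEFT JOIN b            -- right table: contributes only matching rows
  ON a.id = b.id       -- join condition on the keys
ORDER BY a.id
""").fetchall()

# Row 2 of table a has no partner in b, so b_val comes back as None (NULL).
print(rows)
```

Row 9 of `b` never shows up, because `b` is only the secondary table here.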
Let's go back to SQL in order to practice. All right, so now we have the following task: get all customers along with their orders, including those without orders.
So again we need the data from two tables, customers and orders, and we want everything in one result. That means we have to join the data.
Now the task says "including those without orders". That means I want to see everything from the customers table, both the matching and the unmatching data.
Looking at our current query, this is not working, because we are not getting everything, right? We are getting only the customers that have a match in the orders table, and of course this does not fulfill the task.
If you read the task, you can understand that the main table here is customers. The task is not about seeing all the orders without missing any order; the orders are here only for additional information.
So in order to not lose any customer data, we make sure we start from the customers table, which puts customers on the left side.
After that, instead of INNER JOIN, which is not the right thing for this task, we say LEFT JOIN, and with that we guarantee that we get all the data from customers.
So we say LEFT JOIN orders, and of course the condition stays as it is. This is how we are connecting the two tables.
That's actually it. Let's go and execute it. Looking at the result, you can see that we now have five customers, even the customers that didn't place any orders.
You can see that Martin and Peter don't have any order ID, which means they didn't order anything. And as you can see, SQL is showing us NULLs when there is no match. So with that we have solved the task.
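A minimal sketch of this task, under stated assumptions: only "five customers", "Martin" and "Peter" come from the walkthrough. The names for IDs 1 to 3, the order IDs, the sales values and the `customer_id` column are placeholders invented for illustration, and the query is run with Python's built-in `sqlite3` module so it can be executed as is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
-- Names for IDs 1-3, order IDs and sales values are made up for illustration.
INSERT INTO customers VALUES
  (1, 'Customer1'), (2, 'Customer2'), (3, 'Customer3'), (4, 'Martin'), (5, 'Peter');
INSERT INTO orders VALUES
  (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);
""")

left_join_rows = conn.execute("""
SELECT c.id, c.first_name, o.order_id, o.sales
FROM customers AS c        -- main table: start here so no customer is lost
LEFT JOIN orders AS o      -- secondary table: only adds order details
  ON c.id = o.customer_id
ORDER BY c.id
""").fetchall()

for row in left_join_rows:
    # Martin and Peter are listed with None (NULL) for order_id and sales.
    print(row)
```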
Now, my friends, one more thing. As I told you, the order of the tables is very important. Customers is now the left table, because you start from it, and the second table, orders, is the right table.
Now if you switch them, so that we start from orders and then join it with customers, and you execute it, you will not get all the customers, and of course the task is no longer solved.
As you can see, you get a completely different result if you switch the tables. So be careful where you start and how you join the tables in order to get the effect that you want.
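To make the pitfall concrete, here is a small sketch that runs the same LEFT JOIN with both table orders and compares which customers come back. The sample rows, the `customer_id` column and the placeholder names for IDs 1 to 3 are assumptions for illustration, and Python's built-in `sqlite3` module is used only to make the snippet runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
-- Hypothetical sample data; only Martin and Peter are named in the walkthrough.
INSERT INTO customers VALUES
  (1, 'Customer1'), (2, 'Customer2'), (3, 'Customer3'), (4, 'Martin'), (5, 'Peter');
INSERT INTO orders VALUES
  (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);
""")

# Correct order for the task: customers is the left table.
customers_first = conn.execute(
    "SELECT c.first_name FROM customers AS c "
    "LEFT JOIN orders AS o ON c.id = o.customer_id"
).fetchall()

# Switched order: orders is now the left table, so customers may be dropped.
orders_first = conn.execute(
    "SELECT c.first_name FROM orders AS o "
    "LEFT JOIN customers AS c ON c.id = o.customer_id"
).fetchall()

names_customers_first = {name for (name,) in customers_first}
names_orders_first = {name for (name,) in orders_first}

# Customers that disappear once the tables are switched.
missing_after_switch = names_customers_first - names_orders_first
print(sorted(missing_after_switch))
```

With this sample data the customers without orders are exactly the ones that vanish after the switch.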
All right, so now I'm going to put everything back like before. Now let's go and understand how exactly this query is executed.
Okay, so again we have the data from customers and orders, and this time we are doing the left join. Let's see how SQL is going to do it.
SQL says: okay, we need the ID and the first name, and we will get those in the results; and from the right table we need only two pieces of information in the output, the order ID and the sales. So those are the columns that we need.
In the left join, SQL does things a little bit differently. It again starts from the left table, customers, but this time it immediately puts the row in the output without trying to match anything or check whether the data exists in the other table.
It doesn't matter: SQL does no validation of whether the customer exists in orders. Since it's a left join, it is going to show all the data from the left table anyway, so there is no check.
As a next step, in order to get the order ID and the sales, SQL starts searching. It goes over to the orders and searches: where do we have a customer with this ID? Well, it's the first order. So we get the order ID and the sales information, and we see them in the output. That's it for the first one.
Now it goes to the second row, and the same thing happens: SQL immediately puts the row in the output without checking anything.
And then, in order to get the order data, it starts searching for this ID. We have it here in the second row, with the order ID and the sales, and SQL puts those values in the output as well.
The same for the third one: SQL immediately puts everything in the output and then starts searching for orders with this ID. We have one over here; this order belongs to customer ID number three.
So far we are getting the same result as the inner join, but we are not done yet. Now comes exactly the difference. SQL gets Martin, puts him immediately in the output, and starts searching for an order with this ID.
Do we have any order for ID number four? Well, this time we don't have anything. SQL, of course, will not exclude ID number four; it leaves it in the output.
But in SQL, if there is no match, we still have to have something in the output. So SQL says the order ID is going to be NULL: we don't know it, it is unknown. And the same for the sales. So in the left join, if there is no match, you will see NULLs.
The same thing happens for the next customer, Peter. SQL puts the row immediately in the output and then starts searching the orders. Do we have anything for ID number five? We don't have anything, so SQL presents NULLs in the output as well.
That's why you saw NULLs in the output: those customers don't have any orders.
So this is exactly the effect of the left join: you get everything from the left table and only the matching data from the right side, and where something is not matching you get NULLs. So that's it; this is how SQL executes the left join.
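The row-by-row walkthrough above can be mimicked in a few lines of plain Python. This is a conceptual sketch of the narration only: a real database engine is free to pick a different physical strategy, and the function name `left_join`, the placeholder customer names, the order IDs and the sales values are all assumptions made for illustration.

```python
# (id, first_name); names for IDs 1-3 are placeholders.
customers = [(1, "Customer1"), (2, "Customer2"), (3, "Customer3"),
             (4, "Martin"), (5, "Peter")]
# (order_id, customer_id, sales); values are made up.
orders = [(1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10)]


def left_join(left_rows, right_rows):
    """Keep every left row; attach matching right rows, or NULLs if none."""
    output = []
    for cust_id, first_name in left_rows:
        # Search the right table for orders that belong to this customer.
        matches = [(order_id, sales)
                   for order_id, customer_id, sales in right_rows
                   if customer_id == cust_id]
        if not matches:
            # No match: the left row stays, the right columns become None.
            matches = [(None, None)]
        for order_id, sales in matches:
            output.append((cust_id, first_name, order_id, sales))
    return output


result = left_join(customers, orders)
for row in result:
    print(row)
```

Order 1004 never appears, because the loop is driven by the left table only.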
Okay, so now back to the use cases of joins. If I think about the left join, I can use it in order to recombine data and build the big picture, and as well in the second use case, where we use it to get extra information from another table, so that we have a main table and a secondary table.
So we use it for both use cases, and as well in the third use case, only with a twist that we're going to learn later. So that's all about the left join.
Now we have another type that is exactly the opposite of the left join: the right join. So let's understand what this means.
Okay, so what exactly is a right join? It is the total opposite of the left join. This type returns all the rows from the right table and only the matching rows from the left table.
So here the main table, the main focus, is the right table. SQL gets you all the rows, everything, from table B, the right table, but from the left side we get only the matching data. That means from the left side you get only the data that has a match on the right side.
With that, the right table becomes the primary, main source of your data. It is the very important table, while the left table is not that important; you are just joining it in order to get additional data.
As for the syntax, it's not that crazy. All you have to do is change the join type: instead of LEFT you say RIGHT JOIN.
And again, the order of the tables is very important here, because the side makes a difference. We start from the left table, A, and then RIGHT JOIN it to table B. So it sounds very similar to the left join; we are just switching things.
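Here is a hedged sketch of the RIGHT JOIN syntax on two made-up tables `a` and `b` (the columns and rows are assumptions, not course data). It uses Python's built-in `sqlite3` module; SQLite only supports RIGHT JOIN from version 3.39.0, so the query is executed only when the bundled library is new enough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, a_val TEXT);
CREATE TABLE b (id INTEGER, b_val TEXT);
INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO b VALUES (1, 'b1'), (3, 'b3'), (9, 'b9');
""")

RIGHT_JOIN_SQL = """
SELECT a.a_val, b.id, b.b_val
FROM a                 -- left table: contributes only matching rows
RIGHT JOIN b           -- right table: every row is kept
  ON a.id = b.id
ORDER BY b.id
"""

# RIGHT JOIN needs SQLite 3.39.0 or newer; skip the query on older builds.
if sqlite3.sqlite_version_info >= (3, 39, 0):
    rows = conn.execute(RIGHT_JOIN_SQL).fetchall()
else:
    rows = None

print(rows)
```

When the query runs, row 9 of `b` is kept with `None` for `a_val`, while row 2 of `a` is dropped.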
Now let's go back to SQL in order to practice. Okay, my friends, so now we have the following task: get all customers along with their orders, including orders without matching customers.
So again we have customers and orders and we are doing a join, but here the condition is different: we want to see all the orders, even if they don't have a matching customer.
That means I would like to see everything from the orders table, and the customers table is here only supporting and helping. So the main table that we are focusing on is orders: we want to see everything from it, and from customers only the matching rows.
If you look at the current results, you can see only three orders, right? But if you go back to the original table, you can see that we have four orders. So with the current query we are not seeing all the orders.
So how are we going to solve it? If you start from the customers table, you can say: you know what, instead of LEFT JOIN we're going to say RIGHT JOIN. With that you guarantee that you get everything from the orders table.
The left table, customers, is now not that important, and you will see the customer data only if there is a match. So doing the right join like this guarantees that we see every order, whether there is a match or not.
Now if you execute it, you can see on the right side the order ID and the sales, and we now see all the orders. On the left side, the ID and the first name, we see only the customers that did order something. And for the orders without a known customer, we are getting NULLs.
So with that, you have solved the task using the right join.
So now, my friends, you have to go and solve this task to get the exact same results, but you are allowed to use only the left join; you are not allowed to use the right join. So go pause the video, solve the task, and I'll meet you soon.
Now, my friends, in SQL there are always alternatives for how to solve a task. If you want to get all the data from B and only the matching data from A, you can do it like we have done, using the right join.
But you can also switch the sides and make table B the left table and table A the right table. You can do that in SQL, of course, but then you have to switch the join type: instead of RIGHT we have to use LEFT, since table B is now on the left side.
And you have to switch the order as well: you start from table B and then you say LEFT JOIN table A, with, of course, the same join condition.
If you do that, you will get the exact same result as the first query. So if you just switch the tables and also switch the join type, you get the same results. And to be honest, my friends, I don't like the right join.
It's just in the last 10 years, I always tend to start from a table and then use
tend to start from a table and then use a left join. And from my point of view,
a left join. And from my point of view, the left join is way more famous than
the left join is way more famous than the right join. And I think I never used
the right join. And I think I never used a query where I'm using a right join. So
a query where I'm using a right join. So my advice for you always try to skip the
my advice for you always try to skip the right join and stick with the left join
right join and stick with the left join just get the order of the tables in the
just get the order of the tables in the query correct and you will get the same
query correct and you will get the same results. So with that you know an
results. So with that you know an alternative for the right join. Now all
alternative for the right join. Now all what you have to do is to go and switch
what you have to do is to go and switch the right to left. Uh this is not enough
the right to left. Uh this is not enough because if I go and execute it. So now
because if I go and execute it. So now all what I have to do is to go and
all what I have to do is to go and switch the tables like this. So we start
switch the tables like this. So we start from the table orders because I want to
from the table orders because I want to see everything from the orders and then
see everything from the orders and then lift join it with the customers. And of
lift join it with the customers. And of course we don't have to change anything
course we don't have to change anything here. It doesn't matter the order
here. It doesn't matter the order because we have an equal operator here.
because we have an equal operator here. What is very important here is where you
What is very important here is where you start from which table and what is the
start from which table and what is the table that you are joining with. So if
table that you are joining with. So if you go and execute it, you will get the
you go and execute it, you will get the exact same results. So now I'm seeing
exact same results. So now I'm seeing all the orders. I'm not missing anything
all the orders. I'm not missing anything and only the matching customers. And I
and only the matching customers. And I prefer this way solving this task
prefer this way solving this task instead of using the right join. All
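A sketch of that alternative: the orders table goes into FROM and customers is attached with LEFT JOIN. The sample rows, the `customer_id` column and the placeholder names are assumptions for illustration. The comparison against a real RIGHT JOIN is attempted only when the SQLite library behind Python's `sqlite3` module is version 3.39.0 or newer, because older versions do not support RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
-- Hypothetical sample data: order 1004 points at a customer that does not exist.
INSERT INTO customers VALUES
  (1, 'Customer1'), (2, 'Customer2'), (3, 'Customer3'), (4, 'Martin'), (5, 'Peter');
INSERT INTO orders VALUES
  (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);
""")

# Alternative to the right join: start from orders, then LEFT JOIN customers.
swapped_left_rows = conn.execute("""
SELECT c.id, c.first_name, o.order_id, o.sales
FROM orders AS o
LEFT JOIN customers AS c
  ON c.id = o.customer_id
ORDER BY o.order_id
""").fetchall()

for row in swapped_left_rows:
    print(row)

# Optional cross-check against RIGHT JOIN (requires SQLite 3.39.0+).
same_as_right_join = None
if sqlite3.sqlite_version_info >= (3, 39, 0):
    right_join_rows = conn.execute("""
    SELECT c.id, c.first_name, o.order_id, o.sales
    FROM customers AS c
    RIGHT JOIN orders AS o
      ON c.id = o.customer_id
    ORDER BY o.order_id
    """).fetchall()
    same_as_right_join = (right_join_rows == swapped_left_rows)
print(same_as_right_join)
```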
All right, so that's all about the right join. Next we're going to combine everything: we're going to talk about the full join. So let's go.
Okay, so what exactly is a full join? If you use it, SQL returns everything, all the rows from both tables.
If you check our circles again: from the left table we want to get everything, all the rows, so you get the whole circle; and from the right table as well you want to get everything, all the rows, the whole circle. So you want everything, the matching and the unmatching, all the data from left and right.
Now let's check the syntax. It's very simple: the join type here is FULL JOIN. And the full join is very similar to the inner join in one respect: you remember, the order of the tables is not important at all.
So here there is no main table and secondary table. Both tables are important, and it doesn't matter in your query where you start.
You can start from A full join B or you can start from B then full join A. you
can start from B then full join A. you will get the exact same results. It
will get the exact same results. It sounds simple. Let's go to SQL and
sounds simple. Let's go to SQL and practice the full join. All right. So
practice the full join. All right. So now we have the following task and it
now we have the following task and it says get all customers and all orders
says get all customers and all orders even if there is no match. So now again
even if there is no match. So now again we need the data from customers and
we need the data from customers and orders. But now of course which type
orders. But now of course which type we're going to use? It says even if
we're going to use? It says even if there is no match but it didn't say no
there is no match but it didn't say no match from orders or customers. So you
match from orders or customers. So you can understand from this task we are not
can understand from this task we are not focusing only on the orders or the
focusing only on the orders or the customers. Both of them are equally
customers. Both of them are equally important and we need all the data. So
important and we need all the data. So that means we need all the data from
that means we need all the data from left, all the data from right and we can
left, all the data from right and we can go and use the full join. So now we have
go and use the full join. So now we have this query over here. We are starting
this query over here. We are starting from customers and then joining to
from customers and then joining to orders. But now instead of having left,
orders. But now instead of having left, we're going to say full join. So now
we're going to say full join. So now let's go and just execute it. Now if you
let's go and just execute it. Now if you are looking to the left side, you can
are looking to the left side, you can see we are getting all the customers,
see we are getting all the customers, right? So we have our five customers and
right? So we have our five customers and if you are looking to the right, you can
if you are looking to the right, you can see all our orders. So with that we have
see all our orders. So with that we have everything from left and everything from
everything from left and everything from right and the matching data is just side
right and the matching data is just side by side in the results and if there is
by side in the results and if there is no match we are getting nulls. So
no match we are getting nulls. So actually with that we have solved the
actually with that we have solved the task and again it doesn't matter how you
task and again it doesn't matter how you start. You can start from the orders and
start. You can start from the orders and then join it to the customers and you
then join it to the customers and you will get the exact same results. So you
will get the exact same results. So you are getting exactly the same data. Now
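The full join task, and the claim that the starting table does not matter, can be sketched like this. The sample rows, the `customer_id` column and the placeholder names are assumptions for illustration. Python's built-in `sqlite3` module is used for convenience, and since SQLite supports FULL JOIN only from version 3.39.0, both queries are skipped on older builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
-- Hypothetical sample data: two customers without orders, one order without a customer.
INSERT INTO customers VALUES
  (1, 'Customer1'), (2, 'Customer2'), (3, 'Customer3'), (4, 'Martin'), (5, 'Peter');
INSERT INTO orders VALUES
  (1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 6, 10);
""")

CUSTOMERS_FIRST = """
SELECT c.id, c.first_name, o.order_id, o.sales
FROM customers AS c
FULL JOIN orders AS o ON c.id = o.customer_id
"""

ORDERS_FIRST = """
SELECT c.id, c.first_name, o.order_id, o.sales
FROM orders AS o
FULL JOIN customers AS c ON c.id = o.customer_id
"""

full_rows = None
swapped_rows = None
# FULL JOIN needs SQLite 3.39.0 or newer; skip the queries on older builds.
if sqlite3.sqlite_version_info >= (3, 39, 0):
    full_rows = conn.execute(CUSTOMERS_FIRST).fetchall()
    swapped_rows = conn.execute(ORDERS_FIRST).fetchall()
    # Compare as sets: without ORDER BY the row order is not guaranteed.
    print(len(full_rows), set(full_rows) == set(swapped_rows))
else:
    print("FULL JOIN is not available in this SQLite version")
```

The same column list is selected in both queries, so only the row order can differ between the two starting tables.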
are getting exactly the same data. Now let's go and understand exactly how is
let's go and understand exactly how is executed the full join. Okay again we
executed the full join. Okay again we have the data of the customers and the
have the data of the customers and the orders and our full join. So now we're
orders and our full join. So now we're still going to identify those columns
still going to identify those columns that we want to see in the results. So
that we want to see in the results. So the ID and the first name, the order ID
the ID and the first name, the order ID and the sales informations to the
and the sales informations to the output. Now it's still going to start
output. Now it's still going to start from the left table since it is started
from the left table since it is started with the customers. It's still going to
with the customers. It's still going to take simply everything from the left
take simply everything from the left table and present it in the output.
table and present it in the output. Since it is full join, we want to see
Since it is full join, we want to see all the data from the left side. And now
all the data from the left side. And now start searching for matches from the
start searching for matches from the right table. So let's start with the
right table. So let's start with the first customer. And as usual, we will
first customer. And as usual, we will get the order from the customer number
get the order from the customer number one. And the same thing for the second
one. And the same thing for the second customer, we have as well here match. So
customer, we have as well here match. So we will get as well. It's like that lift
we will get as well. It's like that lift join. And for the third one, we have as
join. And for the third one, we have as well a match. And we're going to have it
well a match. And we're going to have it like this. And since we don't have
orders for those two customers, we will get NULLs in the output as well. So SQL is going
to mark it with NULL. The same thing over here, and as well for the last customer. So we
get NULLs for those two customers. Now of course SQL will not stop here, otherwise we
would get the effect of a left join. Now SQL is going to start looking at the right side
to find any order that is not in the output. So SQL is going to see: okay, the first order
is in the output. The second one is in the output as well. The third one too, but the
fourth one is not in the results. So SQL is going to take this order and put it in the
output. This order has no match at all on the left side. And with that, if you look at the
right side, you can see SQL is going to be happy because we have all the orders from the
right table. And of course SQL will not leave it like this. Instead, SQL is going to show
NULLs on the left side. So there is no ID and there is no first name. This is exactly why
we got these results. And this is how SQL executed the full join.
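The walkthrough above can be reproduced with a minimal sketch. It assumes Python's built-in sqlite3 module instead of SQL Server, and hypothetical sample data: only Martin, Peter, and the idea of one order without a customer come from the video, while the other names, the IDs, and the column name order_id are made up for illustration. SQLite versions before 3.39 have no FULL JOIN, so the sketch falls back to an equivalent built from two left joins.

```python
import sqlite3

# Hypothetical sample data: five customers, four orders, one order (1004)
# pointing at a customer ID that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Maria'), (2, 'John'), (3, 'Georg'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1), (1002, 2), (1003, 3), (1004, 6);
""")

# The full join as described: everything from both sides, NULLs where no match.
FULL_JOIN = """
    SELECT c.id, c.first_name, o.order_id, o.customer_id
    FROM customers AS c
    FULL OUTER JOIN orders AS o ON c.id = o.customer_id
"""

# Fallback for SQLite older than 3.39: all customers via a left join,
# plus the orders that found no customer.
EMULATED = """
    SELECT c.id, c.first_name, o.order_id, o.customer_id
    FROM customers AS c
    LEFT JOIN orders AS o ON c.id = o.customer_id
    UNION ALL
    SELECT c.id, c.first_name, o.order_id, o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c ON c.id = o.customer_id
    WHERE c.id IS NULL
"""

query = FULL_JOIN if sqlite3.sqlite_version_info >= (3, 39, 0) else EMULATED
full_join_rows = conn.execute(query).fetchall()
for row in full_join_rows:
    print(row)
```

The two customers without orders come back with None (NULL) in the order columns, and the order without a customer comes back with None in the customer columns.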
Okay. So now if you are looking at the use cases, I can say you can use the full join to
combine the data from multiple tables if you don't want to miss anything from any of the
tables: all data, the matching and the unmatching. But I don't usually use it for data
enrichment, the second use case. Where we can also use the full join is the last use case,
but with a little twist that we're going to learn later. So this is mainly where we can
use the full join.
All right. So with that we have covered the basic types of joins: inner, left, right and
full join. Those are the classical joins for combining two tables. Now we're going to
start talking about the advanced SQL joins, and we're going to cover the first one, the
left anti-join. So let's see what this means.
Okay. So now what exactly is a left anti-join? In this mechanism we want to return rows
from the left table that have no match in the right table. So by looking at our two
circles, from the left table we want to see only the unmatching rows: only rows that exist
in table A but don't exist in table B. If there is matching data, we don't want to see it.
And from the right table we don't want anything. We don't want any data. That means the
only source of your data is going to be the left table. We are just joining the tables to
do a check, to filter the data.
So now for the syntax, this can be interesting. We don't have a special join type called
left anti-join, at least in SQL Server, but we can still create this effect. Since we are
saying left, we can use the type LEFT JOIN and then, as usual, the join condition with the
keys. But if you leave it like this you will get the effect of the left join, and we don't
want that, because with the left join you get the complete circle from the left table. In
order to remove the matching data, this overlap in the middle, we can use a filter, and in
order to filter the data we use the WHERE clause. So to get rid of the matching data we
take the key from the right table and we say the key must be NULL. If the key is NULL,
that means there is no match on the right side. And if you do it like this you will get
the effect of the left anti-join: only the data in the left that has no match on the
right. So now let's go into SQL and create this effect.
Okay. So now we have the following task, and it says: get all customers who haven't placed
any order. By looking at this task, clearly we are focusing on the table customers, but we
want to see the customers that didn't order anything. So they are in our database, but the
customers are inactive. Now there are different ways to solve this task, but we're going
to solve it using joins.
Let's start by writing a very simple query where we select everything from the table
customers. You can see these are our five customers. And now I want to check which of
those customers didn't order anything yet. Since we are talking about orders, we can join
it with the table orders. So we're going to say LEFT JOIN the table orders AS o, and then
we connect the tables using the ID and the customer ID. If you execute it now, we are
still seeing all the customers because we are using the left join, and we can see the
order information of each customer. You can see immediately that those two customers
didn't order anything because we are seeing NULLs here, right? They are empty. There are
no orders.
Now we can use this information to filter the data. I just want to see Martin and Peter.
So we say WHERE, and all you have to do is take the key that we are using to join the
tables, this one over here, and say it must be NULL. So IS NULL. That means you want to
see the data only if the customer ID is NULL. Let's execute it. Perfect. Now you are
getting the customers who haven't ordered anything, and this is exactly the effect that we
wanted, the left anti-join. We are getting the data from the left side where there is no
match on the right side.
So you always have to do it in two steps. First, join the data as you normally do using
the classical join, the left join. Then, as the second step, filter using the WHERE
clause. If you do it like this you can check for non-existence, and with that we get the
effect of the left anti-join. So that's it.
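The two steps described here (left join, then filter on the right table's key) can be tried with a small sketch. It assumes Python's built-in sqlite3 module rather than SQL Server, and hypothetical sample data: Martin and Peter are the two inactive customers from the video, everything else (other names, IDs, the column name order_id) is invented for the example. The video selects everything with a star; this sketch selects only the customer columns to keep the output short.

```python
import sqlite3

# Hypothetical sample data, not the course database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Maria'), (2, 'John'), (3, 'Georg'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1), (1002, 2), (1003, 3), (1004, 6);
""")

customers_without_orders = conn.execute("""
    SELECT c.id, c.first_name
    FROM customers AS c
    LEFT JOIN orders AS o
        ON c.id = o.customer_id      -- step 1: keep every customer
    WHERE o.customer_id IS NULL      -- step 2: keep only those with no match
    ORDER BY c.id
""").fetchall()

print(customers_without_orders)
```

Commenting out the WHERE line gives back the plain left join, which is a handy way to check step 1 before applying the filter.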
Okay, so now if you are looking at this picture, I think you already know where we use the
left anti-join. We use it only in the last use case, where we are checking existence. If
you use the left join together with the WHERE, you can check for the non-existence of your
data in another table. So this is exactly for this scenario.
All right. So that's all about the left anti-join. Now we're going to speak about the
exact opposite of that. We will cover the right anti-join. It's going to be very similar,
but we are just switching sides. So let's go.
Okay. So now what exactly is the right anti-join? Well, it is the opposite of the left
anti-join. We want to return the rows from the right table that have no match in the left
table. So again, if you look at our two circles, what is important now is the right table.
We want to see only the unmatching rows from the right table: only the rows that exist in
B but not in A. And from the left table we don't need anything. No data is needed. That
means the only source of data is the right table, and you are using the left table as a
filter, as a lookup, just to check existence.
The syntax is going to be very similar to the left anti-join. We don't have a special type
called right anti-join. We have to use the classical one, the RIGHT JOIN. But if you do
that you will get everything from the right table. In order to get rid of the matching
data in the middle we use a filter. We use the WHERE clause, where we say we are
interested only in the unmatching data. So we take the key from the left table and we say
the key from the left IS NULL. If you do that you will get rid of any matching data. IS
NULL means there is no match. And again, the same thing here: the order of the tables is
very important, since we are talking about sides, and you have to do it correctly.
Okay. So now the task says: get all orders without matching customers. It is exactly the
opposite. We want to see all the orders that don't have a valid customer. This is a really
bad scenario: you have orders in your business without a valid customer. So let's see how
we can discover that using SQL joins.
As you can see, we are focusing completely on the orders. It's not the customers anymore.
And we want to see only the orders where there is no match with the customers. Again we
have two steps. The first step is the normal join, using either the left or the right
join. Looking at this query, you can leave it like this, where you start from the
customers. But if you want to fully focus on the orders, you have to switch this from left
to right. With that you will get all the orders and only the matching customers. Let's
remove the WHERE clause from here first. I'm just making it a comment, and with that SQL
is going to totally ignore this line of code. So let's execute it. Now you can see we are
getting all the orders, right, and data from customers only if there is a match.
Of course this is not the task. We don't want to see all the orders. We want to see only
the orders where we don't have a match from the customers. If you look at this, those
three orders are okay. They are totally fine. We are finding customers for them, so they
have valid customers. But this order here is really bad. There is no valid customer for
this order, and our task is to show only this type of order in the result.
What we have to do is use the WHERE clause to get exactly this effect. This time we're
going to say the ID of the customer from the table customers must be NULL. So we remove
this here, take the join key from the customers table, and say this ID must be NULL. Let's
execute it. Perfect. With that we are getting the effect of the right anti-join, and we
are getting those orders that don't have any customers. So we have solved the task.
Now, my friends, you have to solve this task without using the right join, but you still
have to get the same effect. You want to get exactly those orders without customers. So
pause the video and go solve the task.
[Music]
Now again, as you know me, I don't like right joins. We can create the same effect if we
switch the sides of the tables. If you put the B table on the left side and the A table on
the right side, you will get the same effect if you switch the type of join from right to
left and just switch the tables. So you start from the B table, since it's on the left
side, and then join it with A. And of course we still say in our WHERE condition that the
data from A is NULL, so there is no match. If you do this you will get the exact same
results as the right join query, by using the left join and just switching the tables. And
with that you know that in SQL we always have alternatives.
I hope that you are done. It's very simple what we're going to do. We switch the joins,
and since orders is the main table, we start from the table orders. So we are putting it
on the left side, and the right table is going to be the customers. Of course the
condition stays as it is. We want to see the orders where there is no customer, so we
don't have to switch anything here or in the join key. Let's execute it. With that you are
getting the exact same results. Since we are using the star here, the output always starts
with the columns of the left table and then shows the columns of the right table, so the
column order is different. But the result is still valid. We are getting this type of
order without matching customers. And I prefer this way.
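The preferred variant, starting from orders and using a left join, can be sketched as follows. This assumes Python's built-in sqlite3 module rather than SQL Server, and hypothetical sample data in which order 1004 points at a customer ID that does not exist (the IDs, most names, and the column name order_id are made up). The left join form also has the practical advantage of running on SQLite versions older than 3.39, which do not support RIGHT JOIN.

```python
import sqlite3

# Hypothetical sample data, not the course database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Maria'), (2, 'John'), (3, 'Georg'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1), (1002, 2), (1003, 3), (1004, 6);
""")

# Same effect as the right anti-join, with the tables switched:
# orders is now the left table, customers is only used as a lookup.
orders_without_customers = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c
        ON c.id = o.customer_id   -- join key stays the same
    WHERE c.id IS NULL            -- no matching customer was found
""").fetchall()

print(orders_without_customers)
```

Note that the filter still tests the key of the customers table, exactly as in the right join version; only the table order and the join type changed.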
All right. So now with that we have the left and the right, and of course what is next? We
get the full. So let's speak now about the full anti-join in SQL. Let's go.
Okay. So now what exactly is a full anti-join? Well, this time we don't have sides. We
want to return only the rows that don't match in either table. What does this mean? If you
look at the left circle, we want only the unmatching rows. We don't want the whole circle.
We want only the data that exists in A but doesn't exist in B, the right table. It sounds
like the left anti-join, but since we are saying full, you have to do the same thing on
the right side as well. So from the right table we also want only the unmatching rows. We
want to see in the result the data that is in B but doesn't have a match in A.
If you look at this, it means we want to see only the unmatching data, and this is exactly
the opposite effect of the inner join. In the inner join we were interested only in the
matching data, only where there is an overlap. With the full anti-join it is exactly the
opposite. We don't want to see the matching data. We want to see everything else, the
unmatching data.
So how are we going to write this query? Again, we don't have a special type called full
anti-join. We use the help of the classical full join, the basic one. So you start from A
FULL JOIN B and then the same key. What is interesting is the WHERE condition. Now we have
two conditions, right? In order to get all data from A that has no match in B, you have to
make a filter where you say the key from the B table must be NULL. And since we want the
exact same thing from the right table, all the data in B that has no match in A, you have
to say as well that the key from the A table must be NULL. So now we have two conditions.
In SQL, if you have two conditions in the WHERE clause, you have two options: either use
the AND operator or the OR operator. The one that we're going to use here is the OR
operator. So either the key from the right is empty or the key from the left is empty. If
you do it like this, you will get the effect of the full anti-join. And of course, since
both sides are equal here, the order of the tables is not that important. You can say FROM
A FULL JOIN B or FROM B FULL JOIN A. It doesn't matter. So now let's go back to SQL in
order to create this effect.
Okay. So we have the following task, and it says: find customers without orders and orders
without customers. If you look at this, it means we want to see only the unmatching data
from customers and from orders as well. There is no main table and secondary table. Both
of them are equally important. Since we are talking about the unmatching data and the
anti-join, we have to do it in two steps. In the first step we do the classical join, and
then we focus on the WHERE clause. So let me turn the WHERE clause into a comment. Since
we want the data from left and right, we're going to use the full join. Let's execute it.
Now you can see we are getting the effect of the full join. We are getting all the orders
and all the customers as well.
But now we are interested only in the strange cases where there are orders without
customers, like this one here, and customers without orders. That means the first three
rows are not really interesting for us. They are boring. We have matching data here, and
this is totally fine, but we are not focusing on that now. We are focusing only on rows
where data is missing from the left or from the right. As you notice I'm saying "or", and
this is very important because we're going to use the OR operator.
So let's focus first on this scenario over here. We want to get an order without a
customer. That means the customer's ID must be NULL, and we have it already here: WHERE
the ID of the customer IS NULL. If I execute it, I get only one record, this one over
here. But I also want to get the opposite scenario, a customer without an order. In this
scenario the customer ID in the orders table must be NULL. So we're going to say OR the
customer ID in the orders IS NULL, or we can write it side by side like this: either the
right side is NULL or the left side is NULL. If you execute it, you will get the effect of
the full anti-join. And with that we are finding the customers without orders and the
orders without customers. I think this is really fun and really easy as well. So this is
how we do the full anti-join.
All right. So now if you look at the use cases, we use the full anti-join again exactly
for the last use case, to check existence. If you combine the full join with the WHERE,
you can check the existence or the non-existence of your data in another table. So this
is exactly the scenario for that.
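As a rough sketch of the full anti-join, the snippet below runs the FULL JOIN plus OR filter described above. It assumes Python's built-in sqlite3 module rather than SQL Server, and hypothetical sample data (Martin and Peter without orders come from the video; the other names, the IDs, and the column name order_id are invented). Because SQLite only supports FULL JOIN from version 3.39, the sketch falls back to combining the two one-sided anti-joins with UNION ALL on older versions. That fallback is a substitute technique, not what the video shows.

```python
import sqlite3

# Hypothetical sample data, not the course database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Maria'), (2, 'John'), (3, 'Georg'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1), (1002, 2), (1003, 3), (1004, 6);
""")

# Full join, then keep rows where either side found no match.
FULL_ANTI_JOIN = """
    SELECT c.id, c.first_name, o.order_id, o.customer_id
    FROM customers AS c
    FULL OUTER JOIN orders AS o ON c.id = o.customer_id
    WHERE o.customer_id IS NULL   -- customers without orders
       OR c.id IS NULL            -- orders without customers
"""

# Fallback for SQLite older than 3.39: the two one-sided anti-joins combined.
EMULATED = """
    SELECT c.id, c.first_name, o.order_id, o.customer_id
    FROM customers AS c
    LEFT JOIN orders AS o ON c.id = o.customer_id
    WHERE o.customer_id IS NULL
    UNION ALL
    SELECT c.id, c.first_name, o.order_id, o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c ON c.id = o.customer_id
    WHERE c.id IS NULL
"""

query = FULL_ANTI_JOIN if sqlite3.sqlite_version_info >= (3, 39, 0) else EMULATED
unmatched_rows = conn.execute(query).fetchall()
for row in unmatched_rows:
    print(row)
```

Replacing OR with AND in the filter would return nothing for this data, since no joined row has NULL keys on both sides at once.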
Okay, my friends, now we have a bonus section where I'm going to challenge you to solve
the following task without using an inner join. It says: "Get all customers along with
their orders, but only for customers who have placed an order, but without using an inner
join." So pause the video now and go and solve this task.
[Music]
Okay, so now let's see how we're going to solve this. We want the customers, the orders,
blah blah blah. But we want only the customers who have placed an order. Previously we
used the inner join to solve this task, but this time we are not allowed to use it. So
let's solve it. This is how I'm going to do it: SELECT star FROM the table customers, and
give it an alias. Now I'm getting all the customers, but I am interested only in the
customers who have placed an order. As we know from before, there are two customers who
didn't order anything, and we don't want to see them in the final results.
Now how will we get that? Well, we can use the help of the table orders to check the
existence of our customers there. And of course, I'm not allowed to use the inner join. So
I'm going to use a left join with the table orders and then combine them as usual, nothing
new, with the customer ID. Now let's execute it. As you can see, we are doing it step by
step. You don't have to rush everything in one go. You start simple, check the results and
decide on the next step. By looking at these results, I want to get those three customers
because they have ordered something and we are seeing data about their orders, and I don't
want to get the last two in the result. So again we can still use the customer ID from the
right table in order to decide which data going to stay in the result and
data going to stay in the result and which data should be filtered. We're
which data should be filtered. We're going to go and use the wear clause and
going to go and use the wear clause and then the key from the orders and this
then the key from the orders and this time we're going to say is not null. I
time we're going to say is not null. I know we didn't learn yet about the not
know we didn't learn yet about the not and the logical operators but using the
and the logical operators but using the not null it means there should be data
not null it means there should be data inside the column it must not be null if
inside the column it must not be null if you do it like this and execute you will
you do it like this and execute you will get the exact effect as the inner join.
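The claim above can be checked with a minimal sketch. It assumes SQLite through Python's built-in sqlite3 module rather than the SQL Server setup used in the course, and the table layout, the column names and every name except Maria and John are hypothetical sample data.

```python
import sqlite3

# Hypothetical sample data: five customers and four orders. Order 1004 points
# at a customer id that does not exist, so only three customers have orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Maria'), (2, 'John'), (3, 'Georg'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1), (1002, 2), (1003, 3), (1004, 6);
""")

# Left join, then keep only rows where the key from the right table is filled.
left_filtered = conn.execute("""
    SELECT c.id, c.first_name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o ON c.id = o.customer_id
    WHERE o.customer_id IS NOT NULL
    ORDER BY c.id
""").fetchall()

# The inner join that the task forbids, used here only for comparison.
inner = conn.execute("""
    SELECT c.id, c.first_name, o.order_id
    FROM customers AS c
    INNER JOIN orders AS o ON c.id = o.customer_id
    ORDER BY c.id
""").fetchall()

print(left_filtered)           # the three customers who placed an order
print(left_filtered == inner)  # True
```

Changing the filter to IS NULL would turn the same query into the left anti join from the earlier scenarios.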
So as you can see, as you are joining the tables using the left join, you can
control what you want to see using the WHERE clause, using the filter, and
this is how you can solve this task without using an inner join. Okay, so
with that we have covered all those three scenarios in order to find the
unmatching data: the left, right and full anti joins. Now we can speak about
one crazy join. We call it the cross join. This one is totally different from
all the other types that we have learned. So let's understand exactly what
the cross join is. Let's
go. So now what exactly is a cross join? Now in some scenarios we want to
combine every row from the left with every row from the right. So that means
I want to see all the possible combinations from both tables. So we are doing
something called a Cartesian join. So now if you look at our two circles, we
want everything from A and as well everything from B. So that means I want to
see everything from A combined with everything from B. So in this example, we
have two rows in A and three rows in B. If you do a cross join, you will get
six possible combinations, by just multiplying the number of rows in A and B.
So be careful using the cross join. If you use it, you will get a crazy
number of rows in the results and you're going to make the database really
busy finding the result for you. So now about the syntax, it's going to be
the easiest. So you start as usual from one of those tables, the A for
example, and then you say cross join B.
So now my friends, if you look at this, you can see it's not like the
previous joins that we have done. Before, we always talked about unmatching
rows, matching rows and so on. But here we don't care at all about whether
the data is matching or not. I just want to see all the possible
combinations, everything. So since we don't care about matching the two
tables, we don't have to specify any condition. So there is no need to use
the keyword ON, because we don't need any condition. So that's it. You just
say cross join B and the magic can happen. So this is a cross join. Let's go
to SQL to try that. Okay. So now we have the following task. It says generate
all possible combinations of customers and orders. So that means we want
everything with everything using the cross join, and this is going to be very
simple. So we're going to start with select star from whatever table. So you
can start from the customers and then you say cross join orders. That's it.
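A minimal sketch of this cross join, assuming SQLite through Python's sqlite3 module and hypothetical sample tables with five customers and four orders (the counts used in this part of the course):

```python
import sqlite3

# Hypothetical sample data with five customers and four orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Maria'), (2, 'John'), (3, 'Georg'), (4, 'Martin'), (5, 'Peter');
    INSERT INTO orders VALUES (1001, 1), (1002, 2), (1003, 3), (1004, 6);
""")

# No ON clause: every customer row is paired with every order row.
rows = conn.execute("""
    SELECT c.id, c.first_name, o.order_id, o.customer_id
    FROM customers AS c
    CROSS JOIN orders AS o
""").fetchall()

print(len(rows))  # 5 customers * 4 orders = 20

# Order 1001 belongs to customer 1, yet it is paired with all five customers.
with_order_1001 = [row for row in rows if row[2] == 1001]
print(len(with_order_1001))  # 5
```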
Very simple. Let's go and execute it. So now, as you know, we have five
customers and four orders. And if you multiply them, you will get 20 rows in
the results. So now we are getting everything with everything, even if the
data is not matching at all. So you can see, for example, the orders here. So
this is one order that belongs to only one customer, the customer ID one. So
it is actually an order from Maria, but still we are seeing this same order
with the other customers, since we want to combine everything with
everything. So there are no rules. The same thing for the next set. So this
second order actually belongs to John, but we are seeing this order with all
customers. So that's it. This is how the cross join works. And now you might
ask me why we have this. It makes no sense, right? Well, my friends, I rarely
use it. But sometimes I want to generate test data, or maybe you have, for
example, a table called colors and a table called products, and you would
like to see all the combinations between the products and the colors. So in
some scenarios it really makes sense to see all your products together with
all the colors, without any matching conditions or whatever. So there are a
few scenarios for the cross join, if you are doing simulations or testing. So
this is how we do the cross join.
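The colors and products scenario can be sketched like this. It is only an illustration: SQLite through Python's sqlite3 module is an assumption, and the table contents are invented.

```python
import sqlite3

# Invented sample data: two products and three colors.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (name TEXT);
    CREATE TABLE colors (name TEXT);
    INSERT INTO products VALUES ('Shirt'), ('Cap');
    INSERT INTO colors VALUES ('Red'), ('Green'), ('Blue');
""")

# Every product combined with every color, with no matching condition.
variants = conn.execute("""
    SELECT p.name, c.name
    FROM products AS p
    CROSS JOIN colors AS c
    ORDER BY p.name, c.name
""").fetchall()

for product, color in variants:
    print(product, color)  # 2 products * 3 colors = 6 lines
```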
So that's all about the cross join. And with that we have covered the four
advanced types of joins. Now if you look at this, you might ask, okay, how am
I going to choose between all those types? So you might ask me, okay Baraa,
how do you do it? Well, I'm going to show you now the decision tree that I
usually follow in order to choose the correct type.
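As a compact companion to the decision tree walked through next, here is a minimal sketch of it as a lookup function. The function name choose_join and its two parameters are hypothetical and only mirror the spoken description. The anti joins are not SQL keywords; they stand for a left or full join combined with a WHERE ... IS NULL filter.

```python
def choose_join(want: str, one_side_is_main: bool = True) -> str:
    """Return the join type for a goal, following the decision tree.

    want: 'matching'   -> only rows that match in both tables
          'all'        -> all rows, nothing may be lost
          'unmatching' -> only rows without a match
    one_side_is_main: True if one table is the main (master) table.
    """
    if want == "matching":
        return "INNER JOIN"
    if want == "all":
        return "LEFT JOIN" if one_side_is_main else "FULL JOIN"
    if want == "unmatching":
        return "LEFT ANTI JOIN" if one_side_is_main else "FULL ANTI JOIN"
    raise ValueError(f"unknown goal: {want!r}")


print(choose_join("matching"))                            # INNER JOIN
print(choose_join("all", one_side_is_main=True))          # LEFT JOIN
print(choose_join("unmatching", one_side_is_main=False))  # FULL ANTI JOIN
```

The right join is left out on purpose, since the tree never uses it.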
So now if I'm combining two tables and I want to see in the results only the
matching data between the two tables, then I go and use the inner join. We
don't have any other type for that. So that's simple. But now if I want to
see everything, all the data, and I don't want to miss anything after joining
two tables, then I take a different path, and here I ask myself: is one side
more important than the other? Am I interested in all the data from one
table, from one side, like here we have a main table or a master table? Then
I go and use the left join. But if I want to see all the data from all tables
in my query, everything, so there is no one table more important than the
other, then I go with the full join.
So this is another path. And now the third path, if I'm interested in seeing
only the unmatching data. So I'm doing some kind of checkups and so on. And
here again the same thing: do I want to see the unmatching data from only one
side? If there is one table that is important, then I go and use the left
anti join. So I want to see the unmatching data from one table and I'm using
the other table only for the check. But if both of the tables in my query are
important, there is no main table and secondary table, both are important,
then I go and use the full anti join. So actually that's it. This is the
decision tree that I usually follow as I'm writing a query. And you might ask
me, how about the right join? Well, as you know me, I don't have it at all in
my decision tree. So I don't use it at all. Now by looking at this, I can
tell you, if I check most of the queries that I write, very often I use the
left join. So I can tell you this is my favorite way to join tables. So let
me show you exactly
why. Usually I write queries in order to do data analysis. So in data
analytics you always have a starting point. You have a topic that you are
analyzing, like the customer. So you always have a master table. So I always
start with the main table of my analysis. So in my query I start from this
table, from table A, the main table. And then what happens? The data is not
enough in this table. I need some extra data that comes from another table,
like the table B. So the table B is here only as additional data to the
master table. So I go and use the left join in order to connect the table B.
And then I find other interesting information in another table, in table C.
So the same thing happens. I go and join the tables using the left join, and
so on. So I keep connecting multiple tables to this main table in the middle.
And my query is going to look like this: always doing left joins with
multiple tables. Now, of course, you might say, "Yeah, but sometimes you
would like to see only the matching data and so on. So, it only makes sense
to use the inner join." Well, in order to do that, I can control everything
that I want to see in the final results using the WHERE clause. So, in the
WHERE clause, I define exactly what I want to see in the final result. So,
with that, I get more flexibility on whether I want to see the matching data,
the unmatching data and so on, like we did in the left anti join, right? So
as I'm analyzing data, I very frequently tend to have this setup where I
start from the main table and I left join all the other tables, and with the
WHERE conditions I control the final results.
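The setup described above, one main table with everything else left joined to it and WHERE conditions shaping the result, can be sketched as follows. SQLite through Python's sqlite3 module is an assumption, and the tables a, b and c with their columns are hypothetical stand-ins for the main table A and the additional tables B and C.

```python
import sqlite3

# Hypothetical tables: a is the main table, b and c only add extra data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (a_id INTEGER, city TEXT);
    CREATE TABLE c (a_id INTEGER, score INTEGER);
    INSERT INTO a VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO b VALUES (1, 'Berlin'), (2, 'Paris');
    INSERT INTO c VALUES (1, 90);
""")

# Start from the main table and left join every additional table to it.
base = """
    SELECT a.id, a.name, b.city, c.score
    FROM a
    LEFT JOIN b ON b.a_id = a.id
    LEFT JOIN c ON c.a_id = a.id
"""

# Without a filter, all three rows of a survive.
all_rows = conn.execute(base + " ORDER BY a.id").fetchall()

# WHERE decides the rest: only rows matched in b (inner join effect) ...
matched_in_b = conn.execute(
    base + " WHERE b.a_id IS NOT NULL ORDER BY a.id"
).fetchall()

# ... or only rows with no match in c (left anti join effect).
missing_in_c = conn.execute(
    base + " WHERE c.a_id IS NULL ORDER BY a.id"
).fetchall()

print(all_rows)
print(matched_in_b)
print(missing_in_c)
```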
So this is how I connect multiple tables together. So now if I want to
visualize this in circles, it's going to look like this. We have the circle
A. So this is the master table, the starting point. I want to see all the
data from table A, and then I left join it with another table B, and from
table B I want to see only the matching data. So it's like the left join. Now
what is going to happen? I'm going to go and add another table, so another
circle, the circle C. And from the circle C, we want to see only the matching
data. And of course you can keep adding circles to this, but it's always
going to be the same thing: each added circle is going to have only the
matching data. So now, as we learned, we can use joins in order to combine
multiple tables to get a complete big picture about a topic like the
customers. I would like to see everything about the customers in the final
results. So either you're going to do it like me, where you start from the
main table and then go and left join all the other tables, or maybe you say,
you know what, there is no main table about the customer's data, all the
tables are equally important. Then you can go and join all those tables using
the inner join, if you are interested only in the matching data. So what can
happen, if you have those circles again: from the A you need only the
matching data, from B you need only the matching data as well, and the same
from the third circle. So you are interested only in the overlap between all
three tables. So you will get only this section where all three tables
overlap. So this is of course another way to join multiple tables. Okay. So
now my friends, let's go back to SQL in order to practice how to join
multiple tables.
Okay. So now let's have a task. This is going to be a little bit challenging.
We will be doing multi-joins using the SalesDB. Retrieve a list of all orders
along with the related customer, product and employee details. And for each
order display the following. We want to see the order ID, the customer name,
the product name, sales, price, salesperson name. So there are a lot of
things going on. And the first thing that you're going to notice is that now
we are using a different database. We will not be using MyDatabase, we're
going to go and use the SalesDB. So this is the first thing that we have to
do. So instead of using MyDatabase, we say use SalesDB and then execute it.
We are now connected to the SalesDB. So this is the first thing. So now if
you are reading this task, there are a lot of tables that are involved. We
need the orders, we need the customers, products and employees. So there are
four tables needed in this task and we need different stuff from each table.
So now how do I think about it? Well, it is mainly focusing on the table
orders, right? So we need all the orders, we cannot miss any order here. So
this sounds to me like the main table. And then it says along with that we
need other information. So that means the other tables are not as important
as the orders. So this gives me a feeling about what the main table is, and
this is going to be my starting point. So let's start from that, from the
table orders. So select star from, and here you have to pay attention that
this database always has a schema. It's called, if you look at the left side,
sales dot the table name. So we have to write that now in our query. So we're
going to write it over here: sales dot and then the table name orders. Let's
go and execute it.
Now I know this is the first time that you are querying this table. We have a
lot of information here and we have a lot of IDs as well. Those IDs are going
to help us, of course, with joining our data with the other tables. So what
do we need from here? We need the order ID. So we have it over here. We're
going to get the order ID. This time the naming convention is different. We
don't have underscores; we have a different type of naming. So be careful
with that. So what else do we need? We need the sales. So if you go to the
right side over here, we have a column called sales and we're going to go and
include it in the results. Now all the other information is actually not
needed, but I need those IDs in order to join it with the other tables. So
now what I'm going to do, I'm going to go and give it an alias, O. So now I'm
going to go and assign it to each column. This comes from the orders, and the
same thing for the sales. So that's it for now. And if I go and execute it, I
will get the orders and the sales. All right, so that's all for the first
table. Let's go now and see what we need. We need the customer's name. Well,
actually we don't have this piece of information in the orders. So all you
have to do is go and explore the other tables in order to find this column.
So how I usually do it, I go and explore the tables like this. So I write a
simple select from each table. So the customers. So now I go and repeat this
for each table inside the database. So we have the customers, employees, we
have the orders, the orders archive and the products as well. So now I start
exploring the tables. So if I go to the customers over here, we can see we
have five customers here and we can see the names of the customers. So we see
the first name and the last name, and this is exactly what I need for my
query. Now of course we have to go and connect this table with the orders. So
we need a common column. Usually it's going to be the ID. So here we have the
customer ID, and if you go and query the orders, you can find the customer ID
here as well. Now if you are working in big projects with hundreds of tables,
exploring each one of them is going to be really hard. So instead of that, a
good project, a good database usually has an entity relationship model, an ER
model, like the one that we have for the course. And here you can easily find
the tables that you have inside your database and the relationships between
them as well, and this is very important, especially if you want to join
tables. So now by just looking quickly at this diagram, I can understand,
okay, there is an ID called customer ID inside the table orders and it is a
foreign key to the primary key, the customer ID. So that means if I want to
connect the orders with the customers, I have to use that customer ID. So as
you can see, this is really nice documentation and I can quickly understand
how to join the tables. So now back to our query. Now what I'm going to do,
I'm going to say left join.
So with that I guarantee that all the orders are going to be present in the
output, and I will always see 10 orders. So now let's join it with the table
customers, sales dot customers, and let's give it an alias like this. And now
we're going to build the joining condition. So it's going to be the customer
ID from the table orders equal to the customer ID from the table customers,
so that SQL understands how to match the two tables.
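The join built so far can be sketched as below. Several things here are assumptions rather than the course's actual setup: SQLite through Python's sqlite3 module stands in for SQL Server, an attached in-memory database named Sales imitates the sales schema, the column names (OrderID, CustomerID, FirstName, LastName, Sales) are guesses at the naming without underscores, and the rows are invented. One invented order points at a missing customer to show that the left join still keeps it. The name columns in the SELECT list are the ones picked up in the next step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An attached database named Sales lets the query use Sales.<table> names.
conn.execute("ATTACH DATABASE ':memory:' AS Sales")
conn.executescript("""
    CREATE TABLE Sales.Customers (
        CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Sales.Orders (
        OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, Sales INTEGER);
    INSERT INTO Sales.Customers VALUES (1, 'Ana', 'Ruiz'), (2, 'Ben', 'Katz');
    -- Order 3 references customer 7, which does not exist (invented case).
    INSERT INTO Sales.Orders VALUES (1, 1, 10), (2, 2, 15), (3, 7, 20);
""")

# Orders is the main table, so every order stays in the output.
rows = conn.execute("""
    SELECT o.OrderID, o.Sales, c.FirstName, c.LastName
    FROM Sales.Orders AS o
    LEFT JOIN Sales.Customers AS c ON o.CustomerID = c.CustomerID
    ORDER BY o.OrderID
""").fetchall()

for row in rows:
    print(row)  # order 3 appears with None for both name columns
```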
And now the two tables are connected and I can get the information now from
the customers. So let's go and get the first name and the last name as
well. So now let's go and execute it. So now as you can see, we have
customers for each order, which is really nice. So with that we got the
customer name and the order ID. Now the next one, we need the product name.
So you can either go here and start exploring, I think it is inside the
table products. And here you can see we have the product. This is the name
of the products. And if you check our ER diagram, you can see we can connect
the table orders with the products using the product ID. So we have the
product ID on the left and on the right as well. And now we can go and build
this join as well over here. So again I go with a left join. I don't want to
lose anything from the table orders. Sales dot products, and we give it the
alias P. Now the condition for that, here you have to be very focused. You
want to get the product from the orders. So you say O dot product ID equal
to the product ID from the table products. So as you can see, in the joins
we are always joining with the table orders. Right? We are not trying to
join, for example, the customers with the products.
Always we are joining with the main table. So with that we have connected
the third table and we can get the information that we need. So we need the
product, and I'm going to go and rename it product name. So let's go and
execute it. And with that, my friends, I'm getting now the product
information from the table products. So we have the sales as well, and we
need the price. So if you go to the products, you can see we have price
information as well. I forgot about it. So let's go and get it as well from
the same table: price. So let's go and execute it. And with that we have
the prices as well. Now the last column, it says we want to get the
salesperson name. So the name of the employee, right? Now if you go and
explore as well, we have here the employees table; execute it. You can see
we have here the first name and the last name of the employees and we have
an ID. So now we need this ID in the orders as well. So you can see we have
the product ID, the customer ID. We already used those two. But we have
here one more extra ID called the salesperson ID. Of course, it is not
called employee ID.
So here you might be a little bit skeptical about it. That's why we have to
go and check our ER diagram again. And as you can see, the employee ID from
the employees is connected to the salesperson ID. So with that I have a
better feeling about it and I understand: okay, I can connect the orders
with the employees using the salesperson ID. So let's go and do that. I'm
going to say left join. So as you can see, I'm just doing left joins. Sales
dot employees as E, and the condition, again very important: always the
first table is included in the join condition. And here we're going to say
the salesperson ID is equal to the employee ID. So with that we have
connected the employees as well, and we will get the first name and the
last name as well. So perfect, that's it. Let's go and execute it. And as
you can see, guys, now we are getting the name of the salesperson. Now here
comes an issue. As you are joining multiple tables and you are getting
columns from different tables, what can happen? You might encounter this
scenario where you have the same names in multiple tables. So now as you
can see, we have the first name, last name from the employees, and we have
the first name, last name from the customers as well, and it's going to be
really hard to understand from the result what we are talking about. Is it
the customers? Is it the employee?
That's why in this scenario, if you have the same names, we have to go and
start giving aliases. So for the first one we're going to say customer first
name, and for the last name as well we're going to say customer last name.
Same thing for the employee. So let's say employee first name, or we can
call it the salesperson, whatever, and employee last name. So if you go and
execute it now, it's going to be more clear. Here we are talking about the
name of the customer, and here we are talking about the name of the
employee. And again one more thing: if you are not using aliases, it's
going to be an issue. So for example, if you go over here and you don't use
the table name before the column, so if I go and remove it and execute it,
you will see I'm getting an error. Now SQL can't understand what you are
talking about. Is it the first name of the customer or from the employees?
Because you are not specific about it. So you have to tell SQL to which
table this column belongs. It's very important to use a table name or the
alias before the column name, especially if you have the same column.
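The finished query for this task could look roughly like the sketch below. The column spellings (`Product`, `Sales`, `Price`, `SalesPersonID`, `EmployeeID`) and the output aliases are assumptions reconstructed from what is said in the video, not copied from the screen.

```sql
-- Orders is the starting table; every other table is left joined to it.
SELECT
    o.OrderID,
    c.FirstName AS CustomerFirstName,
    c.LastName  AS CustomerLastName,
    p.Product   AS ProductName,
    o.Sales,
    p.Price,
    e.FirstName AS EmployeeFirstName,
    e.LastName  AS EmployeeLastName
FROM Sales.Orders AS o
LEFT JOIN Sales.Customers AS c
    ON o.CustomerID = c.CustomerID
LEFT JOIN Sales.Products AS p
    ON o.ProductID = p.ProductID
LEFT JOIN Sales.Employees AS e
    ON o.SalesPersonID = e.EmployeeID;  -- key names differ, so check the ER diagram

-- Writing FirstName without the c. or e. prefix would make SQL Server reject
-- the query with an "Ambiguous column name" error, because both tables have it.
```

Every `ON` condition has the `o` alias on one side, which is the "always join with the main table" habit described above.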
So now we will not get an error. And with that we have solved the task. You
really have to pay attention to the join keys. You have to get the
condition right, because as you can see, now we have a lot of tables and a
lot of columns, and sometimes an issue happens where you specify the wrong
columns or the wrong joins and the result makes no sense at all. So always
double check: are you using the correct keys in order to join the tables?
So with that you have solved the task, and this is exactly how I join
tables. I always have a starting point from an important table and
everything else is going to be left joined, and if I want to remove any
scenario from my results, then I go and use the where clause. So this is
how I join multiple tables. Okay, my friends. So with that you have now
learned everything about how to join the tables in SQL, and this is very
important to understand. Now moving on to the second method on how to
combine your data from multiple tables. We have the set operators. So we're
going to go and cover how to combine the rows from multiple tables. So let's
go. All right, my friends. So now, as we learned before, in order to combine
two tables we have two methods. If you want to combine the columns, we use
the joins. And we have learned all those different types on how to combine
data using joins. So we have covered this section. But now if we want to
combine the rows of two tables, we can use the set operators. And here we
have four different types. We have union, union all, except and intersect.
So now we're going to go and deep dive into this world of how to combine the
rows of tables using the set operators. And now, of course, in this course
we're going to cover everything. So let's
go. All right. So now let's have a look at the syntax of the set operators.
Okay. So now let's say that we have the following query. We are selecting
the data from the customers. So this is our first query, or our first
select statement, and we have another one which is very similar, where we
are selecting the information from the employees, and this is our second
select statement. So now what we can do, we can put between those two
queries a set operator, like for example the union. We can of course use
any other set operator, like the union all, intersect, except and so on.
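As a rough sketch of that syntax: the transcript does not say which columns the slide selects, so `FirstName` and `LastName` are assumptions, as is the exact spelling of the table names.

```sql
-- First SELECT statement
SELECT FirstName, LastName
FROM Sales.Customers

UNION  -- could also be UNION ALL, INTERSECT or EXCEPT

-- Second SELECT statement
SELECT FirstName, LastName
FROM Sales.Employees;
```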
So as you can see, the syntax is very simple. We have two different queries
and we just put the set operator between them. So this is what the syntax of
the set operators looks like. All right, friends. So now we're going to
talk about the rules of the set operators. And we're going to start with
rule number one, the SQL clauses. In each individual select statement or
query, we can use almost all the SQL clauses, like where, join, group by,
having. But there is one exception, the order by. Order by, you can use it
only once and only at the end of the entire query. So that means we cannot
use order by in each select statement or in each query. We can use it only
once and only at the end of the entire query. All right. So about the
syntax again: here we have our two select statements and in between them we
have the set operator. So now in each query we can go and use multiple
things like join, where, group by, having. So we can make each query as
complex as we want. So everything is allowed, but not the order by. The
order by must always be placed at the end of the entire query. So if you
want to sort the result by the first name, you have to use the order by
exactly at the end.
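A small sketch of rule number one. The `WHERE` filter is only an illustrative assumption to show that other clauses are fine inside each query; the table and column spellings are assumed as well.

```sql
SELECT FirstName, LastName
FROM Sales.Customers
WHERE LastName IS NOT NULL   -- clauses like WHERE are fine inside each query
                             -- an ORDER BY here, before UNION, is not allowed
UNION

SELECT FirstName, LastName
FROM Sales.Employees

ORDER BY FirstName;          -- one ORDER BY after the last query sorts the combined result
```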
So we are not allowed to use order by in each query. Okay. Moving on to rule
number two, the number of columns. The number of columns in each query must
be the same. Okay. So now in order to understand this rule, let's have this
very simple example. We're going to go and select the first name and the
last name from the table sales customers. So this is our first query, our
first select statement, and let's say that I have another one where we
want to select the first name, last name, but this time from another table,
the employees. So with that we have our two queries, and now I would like to
go and combine them into one result. So we're going to go and use the set
operator union. Let's go and execute it. So now as you can see, in the
result we get the first name and last name from two tables, the customers
and employees. And it is working because we are fulfilling the rule that
says the number of columns must be the same in both queries. So how many
columns do we have in the first query? We have two, right? And in the second
query we have two columns as well. So that's why everything is working. So
now let's go and break the rule by adding another column to the first query.
So let's say that I would like to have the customer ID as well in the first
query, and with that, as you can see, in the first query we have three
columns but in the second we have only two. So let's go and execute it. Now
as you can see, in the result we get an error that says if you are using
union, intersect and all those set operators, you must have an equal number
of columns between the queries. So this is the rule: you have to have the
same number of columns. In order to repair it, I'm just going to remove the
customer ID. Okay. So here again we have two columns, and the second one two
columns as well, and everything is going to work.
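The experiment above, sketched with assumed table and column spellings. The rule-breaking variant is left commented out so the block still runs as a whole.

```sql
-- Breaks the rule: three columns against two. SQL Server rejects this with an
-- error saying the combined queries must have an equal number of expressions.
-- SELECT CustomerID, FirstName, LastName
-- FROM Sales.Customers
-- UNION
-- SELECT FirstName, LastName
-- FROM Sales.Employees;

-- Repaired: CustomerID removed, two columns on each side.
SELECT FirstName, LastName
FROM Sales.Customers
UNION
SELECT FirstName, LastName
FROM Sales.Employees;
```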
Okay. Moving on to rule number three. The data types of the columns in each
query must match, must be compatible. In order to check that, what we're
going to do, we're going to go to the object explorer on the left side.
Let's go and browse the customers and the columns. And as you can see, we
have here the first name and last name with the same data type. We have the
varchar. And if you go to the employees, you can see the first name, last
name having varchar as well. So the first column is varchar in the first
query and for the employees as well, and the last name from the customers
has the same data type as the last name from employees. So the data types
are matching. Now let's go and break this rule. Instead of having the first
name, I would like to go and use the customer ID. So now let's check the
customer ID on the left side. It is an int, an integer. But the first name
is a varchar. So here we have a mismatch between data types. Let's go and
try to execute it. So now we are getting an error that says SQL is trying to
convert the value Frank to an integer.
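If you prefer a query over browsing the object explorer, the data types can also be looked up in SQL Server's `INFORMATION_SCHEMA.COLUMNS` view. This is an alternative not shown in the video; the schema, table and column spellings in the filter are assumptions. The failing union is kept as a comment so the block still runs.

```sql
-- Look up the data types of the columns we want to combine.
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'Sales'
  AND TABLE_NAME IN ('Customers', 'Employees')
  AND COLUMN_NAME IN ('CustomerID', 'EmployeeID', 'FirstName', 'LastName');

-- The mismatch from the video: an INT (CustomerID) paired with a VARCHAR
-- (FirstName) in the first column position. SQL Server tries to convert
-- names such as 'Frank' to an integer and raises a conversion error.
-- SELECT CustomerID, LastName
-- FROM Sales.Customers
-- UNION
-- SELECT FirstName, LastName
-- FROM Sales.Employees;
```

In SQL Server the direction of the conversion follows data type precedence: int ranks above varchar, so the text values are the ones being converted.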
So what does this mean? The first query controls the names of the output
columns, and for the data types SQL Server picks the type with the higher
precedence, which here is the integer. So here we have an integer, and now
SQL is trying to convert the first name values to an integer as well, and
of course it will not work because we have characters inside here and it
cannot convert characters to an integer. So we have a mismatch between data
types, between the customer ID and the first name, and that's why we get an
error. For the second column we don't have an issue, because it is varchar
in the first table and in the second table as well. So now in order to
repair it, either select the first name in the first query, or we can go
over here and say employee ID, and with that, if I execute it, we will not
get any errors, because the employee ID is an integer as well and we have a
match in the data types. So as you can see, it's not enough to have the same
number of columns. You have to have matching data types as well between
those two queries. Okay, let's
move to the next rule. Rule number four, the order of columns. The order of
columns in each query must be the same as well. Okay, so let's understand
what this means. Now we have here again the same example, where we are
selecting the ID and last name from customers and we are combining it using
union with the employee ID and last name from the employees. And as you can
see, everything is working because we have the same number of columns and
we have matching data types. So now let's go and break it. What I'm going
to do, I'm just going to switch those two columns. So first I'm selecting
the last name and then the customer ID. So again I have the same number of
columns, and the ID is an integer matching the ID of the employee, and the
last name has the same data type. So let's go and execute it. So here again
SQL is going to throw an error and say SQL is trying to convert the value
Goldberg to an integer. So it's like character to integer. It will not
work. So what happened here? I have here the same information. I have an ID
and last name, and an ID and last name. Well, SQL doesn't work like this.
SQL is going to go and map the first column from the first query to the
first column of the second query. So it's going to go and map last name to
employee ID. And since they have different data types, SQL is going to
throw an error. So SQL doesn't understand or doesn't know how to map, let's
say, the ID with the ID. So as you can see, here we have the same
information in customers and employees, but not in the same order. SQL
cannot go and map the information based on the names of the columns. It's
simply going to map the columns like this: the first column from the first
query with the first column from the second query.
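The positional mapping can be sketched as follows, again with assumed table and column spellings. The swapped version is commented out so the block runs as a whole.

```sql
-- Fails: columns are matched by position, not by name, so LastName (varchar)
-- is paired with EmployeeID (int) and CustomerID (int) with LastName (varchar).
-- SELECT LastName, CustomerID
-- FROM Sales.Customers
-- UNION
-- SELECT EmployeeID, LastName
-- FROM Sales.Employees;

-- Works: same order in both queries, first the ID, then the last name.
SELECT CustomerID, LastName
FROM Sales.Customers
UNION
SELECT EmployeeID, LastName
FROM Sales.Employees;
```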
So as you can see, in this rule you must have the same order of the columns.
First the ID and then the last name, and with that it's going to work again.
All right, moving on to rule number five, the column aliases. The column
names that we see in the output, in the result, are defined and determined
by the column names of the first query, the first select statement. So that
means the first query is responsible for naming the columns in the output.
Okay. So let's understand what this rule means. Again we have the same
example: the customer ID, last name from customers, union, employee ID, last
name from employees. So if you check the output closely, you can see that in
the output we have the customer ID and not the employee ID.
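A sketch of this naming rule. The aliases `ID` and `Last_Name` are illustrative, and the table and column spellings are assumptions.

```sql
SELECT
    CustomerID AS ID,      -- alias in the first query: becomes the output column name
    LastName
FROM Sales.Customers
UNION
SELECT
    EmployeeID,            -- no alias needed here
    LastName AS Last_Name  -- alias in a later query: ignored in the output
FROM Sales.Employees;
-- The output columns are named ID and LastName.
```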
customer ID and not the employee ID. Even though we have the ids from the
Even though we have the ids from the employee ID, but as you can see the
employee ID, but as you can see the first query is controlling the naming of
first query is controlling the naming of the output. So since the first column
the output. So since the first column called the customer ID, you will see it
called the customer ID, you will see it in the output as a customer ID. So the
in the output as a customer ID. So the naming of the like the next queries will
naming of the like the next queries will be totally ignored. So that's why if you
be totally ignored. So that's why if you want to give aliases to the output,
want to give aliases to the output, you're going to go and do it only for
you're going to go and do it only for the first query. So for example, I go
the first query. So for example, I go over here and say instead of having
over here and say instead of having customer ID, I would like to call it as
an ID. So now if I go and execute it, as you can see in the output, we get ID as
the column name. So I don't have to give this alias in each query. I don't have
to go to the second query and say you are the ID as well, because it's enough to
define it in the first query. There's no need to repeat the same names in the
next queries. Let's take another example where we would like to have an alias
for the last name. I would like to have it like this, Last_Name, and let's do it
in the second query. Let's go and execute it. Now as you can see in the output,
we still have LastName and there's no underscore, because this alias is totally
ignored by SQL. This is not the first query, and the first query says the column
is LastName without an underscore. So again, if you want to rename it, take the
alias and put it in the first query. Let's go and execute it. So my friends, the
first query is very important because it gives the names for the output. If you
want to use aliases and rename columns, do it only in the first query. And the
first query also controls the data types.
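
The alias rule can be tried with a small self-contained query. This is a minimal sketch, not the exact script from the video: the CTEs `Customers` and `Employees`, their column names, and the sample rows are assumptions standing in for the course tables, and the syntax targets SQL Server.

```sql
-- Hypothetical stand-ins for the course tables; names and rows are illustrative.
WITH Customers (CustomerID, LastName) AS (
    SELECT CustomerID, LastName
    FROM (VALUES (1, 'Gold'), (2, 'Brown')) AS c (CustomerID, LastName)
),
Employees (EmployeeID, LastName) AS (
    SELECT EmployeeID, LastName
    FROM (VALUES (1, 'Lee'), (2, 'Baker')) AS e (EmployeeID, LastName)
)
-- The alias in the first query names the first output column: ID.
SELECT CustomerID AS ID, LastName
FROM Customers
UNION
-- The alias in the second query is ignored; the column stays LastName.
SELECT EmployeeID, LastName AS Last_Name
FROM Employees;
```

The result set has the columns `ID` and `LastName`. Moving `AS Last_Name` into the first `SELECT` is what renames the second column.
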
All right. Now to the last rule: mapping the correct information. If your query
fulfills all the other rules and you don't get an error from SQL, that doesn't
mean your result is accurate and correct. You are the only one responsible for
mapping the information between the queries correctly, because SQL doesn't
understand the content of your tables and queries. If you don't match the
information correctly between the queries, you will get inaccurate and wrong
results in the output. Okay. So now back to our example. Let's say I would like
to get the first name and the last name from the customers and the same
information from the employees. Let's go and execute it. Now as you can see,
it's very nice: we are getting the first name and last name from both tables in
one result, and we are fulfilling all the requirements of SQL. Same number of
columns, same data types and so on. Now let's go and produce incorrect results.
I'm just going to swap the first name and last name in the second query. So
first the last name and then the first name. Let's go and execute it. As you can
see, we still get results, because we are fulfilling all the other rules: we
have the same number of columns and we have matching data types. The first name
is a character column and the last name is a character column as well, so SQL
just presents the result as you define it. But the result is completely wrong.
If you check the first column, the first name, we can see last names inside it.
For example, Brown and Baker are last names, but we see them inside the first
name. And the same thing in the last name: now we can see first names inside it.
Mary and Carol are first names. So as you can see, the result has really bad
data quality. We are mixing things up and it doesn't make any sense. But SQL
will not know that, because SQL doesn't know the content of your data. It's just
matching the data types. The first name is VARCHAR and the last name is VARCHAR
as well, so everything is fine and you will get results. So my friends, you are
responsible for having the same information mapped between the two queries, and
not getting an error from SQL doesn't mean that we have correct results. So pay
attention to the information that you are mapping between the two queries.
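
The swapped-column mistake can be reproduced with a short sketch. The CTE names, column names, and sample rows below are illustrative assumptions, not the course's actual tables; only Brown, Baker, Mary, Carol and Kevin are names mentioned in the video.

```sql
-- Hypothetical stand-ins for the course tables; rows are illustrative.
WITH Customers (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Kevin', 'Brown'), ('Mary', 'Lane')) AS c (FirstName, LastName)
),
Employees (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Carol', 'Baker'), ('Kevin', 'Brown')) AS e (FirstName, LastName)
)
SELECT FirstName, LastName
FROM Customers
UNION
-- Columns swapped by mistake. Both are character columns, so SQL raises
-- no error, but employee last names now land in the FirstName column.
SELECT LastName, FirstName
FROM Employees;
```

The query runs without complaint, yet the output contains rows such as `Baker, Carol` and `Brown, Kevin` next to `Kevin, Brown`, which is exactly the kind of silent data-quality problem described above.
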
All right. So those are the rules of the set operators. The first one is that
ORDER BY can only be used once, at the end of the entire query. All queries must
have the same number of columns, matching data types, and the same order of
columns. The first query always controls the names and aliases of the result
set, and the data types as well. And the last rule is to make sure that you are
mapping the correct information between the queries. So those are the rules of
the set
operators. Okay. So what is UNION? UNION is going to return all distinct, unique
rows from both queries. That means it's going to combine everything, and all the
rows are going to be presented in the output. Since it says all distinct, unique
rows, that means UNION is going to remove all duplicates from the combined
result set. So UNION makes sure that each row appears only once. All right. So
now let's have this very simple example. We have two sets of data. We have the
customers, with five customers and their first names, and we have another set
called employees, with the first names of five employees. Now if you take a look
at the first names, you can see that we have the same persons as customers and
as employees. We have Kevin and Mary in both sets of data. So how is SQL going
to execute UNION? It's going to return everyone from customers and everyone from
employees. But since Kevin and Mary would appear twice in the output, we're
going to have them only once. So this is how UNION works. It returns everyone
from the two sets, but without duplicates.
All right. So now we have the following task, and it says: combine the data from
employees and customers into one table. That means in one table we want to
combine all the information from employees and customers. So which information
do we need? This is the first question that I usually ask myself. In order to
answer it, first we have to explore the data. So SELECT * FROM Sales.Customers
and then a semicolon. Then I'm going to write another query, SELECT * FROM
Sales.Employees, and a semicolon. Why am I using two semicolons? Because I'm
telling SQL that we now have two separate queries. They have nothing to do with
each other. If you go and execute it like this, in the output you can see we got
two result grids. The first result grid is for the first query and the second
one is for the second query. So they have nothing to do with each other. I just
want to explore those two tables in order to understand how I'm going to map
that information.
So now if we check those two tables, you can see that both of them have IDs, so
we can map those, right? Both of them have a first name and a last name as well,
so I can map the first name and last name together. Now in the customers we have
country, but we don't have this information in the employees, so we have to
ignore it. And we have a score here as well, where we don't have a score for the
employees. That means I can map three pieces of information between the
customers and employees. Now of course we can ask, do we really need the IDs? It
doesn't really make sense to have the IDs in the combined table. They are not
unique anymore, because we have customer ID 1 and employee ID 1. So I think we
can ignore them. The only two pieces of information that are really useful to
map are the first name and last name. So now let's go and add those two columns.
We need the first name and last name, and the same information from the
employees as well. But now we want everything to be in one query. That's why I'm
going to remove the semicolons, and now we have to use a set operator between
those two queries. In order to combine the data we have two options, either
UNION or UNION ALL. The task doesn't mention anything about duplicates, so I
would like to go with UNION in order to remove the duplicates, if there are any.
So that's it. Let's go and execute it. Now as you can see in the output, we have
only one result, because we have only one big query. And now we have the first
names and last names from the customers and employees.
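
A minimal sketch of the task query follows. To keep it runnable on its own, the two CTEs stand in for the course's customers and employees tables; the column names and the last names (apart from Kevin Brown and Baker, which appear in the video) are assumptions made for illustration.

```sql
-- Hypothetical stand-ins for the course tables; rows are illustrative.
WITH Customers (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Joseph', 'Gold'), ('Mark', 'Stone'), ('Anna', 'Adams'),
                 ('Kevin', 'Brown'), ('Mary', 'Lane')) AS c (FirstName, LastName)
),
Employees (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Carol', 'Baker'), ('Frank', 'Lee'), ('Michael', 'Ray'),
                 ('Kevin', 'Brown'), ('Mary', 'Lane')) AS e (FirstName, LastName)
)
-- No semicolon between the two SELECTs: UNION makes them one query.
SELECT FirstName, LastName
FROM Customers
UNION
SELECT FirstName, LastName
FROM Employees;
```

With this sample data the query returns 8 rows: Kevin Brown and Mary Lane exist in both sets but appear only once.
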
And now one more thing about the order of the queries. It doesn't matter whether
we start with the employees or with the customers, we will get the exact same
results. But pay attention to the naming of the columns. The first query always
controls the names, but since they have the same names here, it should not be a
problem. So if I switch those two tables and run it again, we get the exact same
results. So now let's understand how SQL combined the data using UNION. Okay. So
now we have the results from the first query and the second query, employees and
customers, and we are combining the data using UNION. The first step in SQL is
to take the columns from the first query, which is the employees. So it takes
first name and last name as the column names of the result. The next step is to
start combining the rows of those two tables. First it takes the rows from
employees, and it also checks whether there are duplicates in the data. As you
can see, we don't have any duplicates here, so we get the five employees. The
next step is to start adding rows from the second query, the customers, very
carefully, without generating any duplicates. The first customer is not in the
output yet, so SQL adds it to the result, appends it. The next customer is Kevin
Brown. As you can see, we already have him in the results. That's why SQL will
not add him to the result, otherwise it would generate a duplicate. So SQL
ignores this customer. The same thing for Mary. We have Mary in the results as
well, so SQL skips her. Then we go to Mark. As you can see, we don't have Mark
in the results. That's why SQL takes this customer and puts him in the output.
And then the last one, Anna. We don't have Anna in the results, so SQL can add
her to the results as well. And with this, SQL has combined the rows of those
two tables, and we have eight persons. So as you can see, SQL is combining the
data, but very carefully, not generating any duplicates. All right. So that's
it. This is how the UNION operator works.
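
One way to picture the steps just described: UNION gives the same rows as stacking both inputs with UNION ALL and then keeping each distinct row once. The sketch below expresses that equivalence; it is a conceptual illustration, not a claim about how the database engine executes UNION internally. The CTEs, column names, and most last names are assumptions.

```sql
-- Hypothetical stand-ins for the course tables; rows are illustrative.
WITH Customers (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Joseph', 'Gold'), ('Mark', 'Stone'), ('Anna', 'Adams'),
                 ('Kevin', 'Brown'), ('Mary', 'Lane')) AS c (FirstName, LastName)
),
Employees (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Carol', 'Baker'), ('Frank', 'Lee'), ('Michael', 'Ray'),
                 ('Kevin', 'Brown'), ('Mary', 'Lane')) AS e (FirstName, LastName)
)
-- Stack all 10 rows first, then keep each distinct row once.
SELECT DISTINCT FirstName, LastName
FROM (
    SELECT FirstName, LastName FROM Employees
    UNION ALL
    SELECT FirstName, LastName FROM Customers
) AS Combined;
```

This returns the same 8 persons as `Employees UNION Customers` on the sample data.
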
Okay. So now UNION ALL. UNION ALL is going to return all rows from both queries.
So it's very similar to UNION. It combines all the rows, and everything is
presented in the combined result set. But the big difference to UNION is that
UNION ALL will not remove any duplicates. It is the only set operator that
doesn't remove duplicates, and it shows all the rows as they are. So if a row
comes 10 times from the queries, you will find it 10 times in the output as
well. Now you might ask me when to use UNION and when to use UNION ALL. I'm
going to say there is one big difference between them: UNION ALL has way better
performance and it's faster than UNION. That's because UNION ALL doesn't perform
additional steps like removing duplicates. So my friends, that means if you
already know that there are no duplicates in your queries, you know your tables,
you know your queries, there are no duplicates, then don't use UNION. Always use
UNION ALL, because you will get better performance. Another scenario for UNION
ALL is when I would like to see the duplicates. I'm doing data quality checks
and I would like to see whether there are duplicates after I combine multiple
queries. In this situation I use UNION ALL as well.
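
The data quality scenario could look like the following sketch. Grouping the UNION ALL result and counting rows is one possible way to surface duplicates; it is not shown in the video. The CTEs, column names, and most last names are illustrative assumptions.

```sql
-- Hypothetical stand-ins for the course tables; rows are illustrative.
WITH Customers (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Joseph', 'Gold'), ('Mark', 'Stone'), ('Anna', 'Adams'),
                 ('Kevin', 'Brown'), ('Mary', 'Lane')) AS c (FirstName, LastName)
),
Employees (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Carol', 'Baker'), ('Frank', 'Lee'), ('Michael', 'Ray'),
                 ('Kevin', 'Brown'), ('Mary', 'Lane')) AS e (FirstName, LastName)
)
-- UNION ALL keeps duplicates, so they can be counted afterwards.
SELECT FirstName, LastName, COUNT(*) AS Occurrences
FROM (
    SELECT FirstName, LastName FROM Employees
    UNION ALL
    SELECT FirstName, LastName FROM Customers
) AS Combined
GROUP BY FirstName, LastName
HAVING COUNT(*) > 1;
```

On the sample data this lists Kevin Brown and Mary Lane, each with 2 occurrences. The same check with UNION instead of UNION ALL would return nothing, because the duplicates would already be gone.
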
Now we have again the same example. We have the customers and employees, and we
have the same persons, Kevin and Mary, as customers and as employees. If you
want to combine the data using UNION ALL, it's going to return all rows
including duplicates. That means SQL executes UNION ALL like this: it returns
everything from customers and everything from employees, and Kevin and Mary are
presented twice in the output. So as you can see, UNION ALL returns all the rows
as they are from the two result sets, and if there are duplicates in the sets,
we get duplicates in the output as well. So Kevin exists twice in the output and
Mary twice as well. So this is how UNION ALL works.
All right. So now we have a very similar SQL task, and it says: combine the data
from employees and customers into one table, including duplicates. So it's
exactly like the last task, but this time the task says include duplicates. So
we cannot use UNION. We now have to use UNION ALL. We have the exact same query:
we are selecting the employees' first and last name and the customers' first and
last name. And instead of using UNION, we're going to use UNION ALL. All we have
to do is go over here and say UNION ALL. Now pay attention to this. With UNION
previously, we got eight records, eight persons, in the output. So let's go and
execute it and check the results. As you can see, we now get 10 persons instead
of eight. That's because we have five customers and five employees, and we have
duplicates inside the data. We have two duplicates. If you check, we have Mary
here and Mary over here as well, and the same goes for Kevin: we have Kevin over
here and here as well. So we have duplicates inside the data, and SQL just
combined the two tables.
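
The only change from the previous task is the operator, as this sketch shows. The CTEs stand in for the course tables; their names, the column names, and most last names are assumptions for illustration.

```sql
-- Hypothetical stand-ins for the course tables; rows are illustrative.
WITH Customers (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Joseph', 'Gold'), ('Mark', 'Stone'), ('Anna', 'Adams'),
                 ('Kevin', 'Brown'), ('Mary', 'Lane')) AS c (FirstName, LastName)
),
Employees (FirstName, LastName) AS (
    SELECT FirstName, LastName
    FROM (VALUES ('Carol', 'Baker'), ('Frank', 'Lee'), ('Michael', 'Ray'),
                 ('Kevin', 'Brown'), ('Mary', 'Lane')) AS e (FirstName, LastName)
)
SELECT FirstName, LastName
FROM Employees
UNION ALL   -- keeps every row, duplicates included
SELECT FirstName, LastName
FROM Customers;
```

With five rows in each set, the result has 10 rows, and Kevin Brown and Mary Lane each appear twice.
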
Okay. So now we're going to understand how SQL executes UNION ALL in order to
combine data. All right. Again we have the two results from the queries, the
employees and the customers, and SQL does the same steps. First it gets the
column names from the first query and puts them in the output. Then it takes all
the employees and puts them in the output without checking anything. That means
if there are duplicates in the data, they are presented in the output as well.
It's very simple. Now it goes to the second step and takes all the customers and
appends them to the output like this. So that's it. It's very fast. It just
combines all the rows from the employees and all the rows from the customers,
and with that we get the 10 persons. And as you can see, we have duplicates in
the data: we have Mary twice and Kevin twice as well. That's why UNION ALL is
the fastest. It doesn't have any extra steps or checks. It just takes all rows
from all queries and puts them in the output. All right. So as you can see, it's
very simple, right? So that's all for the UNION
ALL. Okay. So what is EXCEPT? Sometimes we call it MINUS in other databases, but
in SQL Server we call it EXCEPT. It's going to return the distinct rows from the
first query that are not found in the second query. From this definition we can
understand that the order of the queries can affect the final result. There is a
first query and a second query. So it is the only set operator where you have to
pay attention to the order of the queries. And like the others, it removes the
duplicates from the result set. All right. Again we have this very simple
example. We have two sets, five customers and five employees, and there are the
same persons as customers and as employees, Kevin and Mary. Now we're going to
combine those two sets using EXCEPT, or as we sometimes call it, MINUS. It says
it's going to return the unique rows in the first table that are not in the
second table. So what is going to happen? What is the first table? Let's say the
customers on the left side. Here we have five persons: Joseph, Mark, Anna, Kevin
and Mary. The rule is that we need the customers that are not employees. So it's
safe for Joseph, Mark and Anna, because they do not exist in the second set.
That's why SQL returns those three values. But for the two customers Kevin and
Mary, there is an issue. Kevin and Mary are members of the second set, the
second table, the employees. That's why SQL excludes them from the output,
because they do not fulfill the rule. So in the output we get only three
customers, and all the values from employees and the common values between
customers and employees are excluded from the output. So this is how EXCEPT
works.
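
The first-names example above can be written as a short sketch. The CTE and column names are assumptions; the first names are the ones used in the video.

```sql
-- Hypothetical stand-ins for the two sets of first names.
WITH Customers (FirstName) AS (
    SELECT FirstName
    FROM (VALUES ('Joseph'), ('Mark'), ('Anna'), ('Kevin'), ('Mary')) AS c (FirstName)
),
Employees (FirstName) AS (
    SELECT FirstName
    FROM (VALUES ('Carol'), ('Frank'), ('Michael'), ('Kevin'), ('Mary')) AS e (FirstName)
)
-- First set: customers. Rows that also exist in employees are excluded.
SELECT FirstName
FROM Customers
EXCEPT
SELECT FirstName
FROM Employees;
```

This returns Joseph, Mark and Anna. Kevin and Mary are dropped because they are also in the second set, and nothing from the employees set is returned.
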
except works. All right. So let's have a very simple skill task and it says find
very simple skill task and it says find the employees who are not customers at
the employees who are not customers at the same time. Okay. So let's see how
the same time. Okay. So let's see how we're going to solve that. We're going
we're going to solve that. We're going to stay with the same queries as usual.
to stay with the same queries as usual. We have the employees and the customers
We have the employees and the customers but instead of having union all we're
but instead of having union all we're going to use the set operator except. So
going to use the set operator except. So now since we are using except we have to
now since we are using except we have to make sure that the order of the queries
make sure that the order of the queries are correct. So the first query is the
are correct. So the first query is the employees which is correct because we
employees which is correct because we have to find the employees who are not
have to find the employees who are not customers at the same time. So we are
customers at the same time. So we are focusing on the employees. The first
focusing on the employees. The first table is correct and the second table is
table is correct and the second table is customers. If the task says find the
customers. If the task says find the customers who are not employees at the
customers who are not employees at the same time then we have to go and switch
same time then we have to go and switch it. We have first to query the
it. We have first to query the customers. So now everything is correct.
customers. So now everything is correct. Let's go and execute it. And now in the
Let's go and execute it. And now in the output we see three employees who are
output we see three employees who are not customers at the same time. So we
not customers at the same time. So we have Carol, Frank and Michael. But as we
have Carol, Frank and Michael. But as we know we have five employees Kevin and
know we have five employees Kevin and Mary. They are not here in the result
Mary. They are not here in the result because they are customers as well. So
because they are customers as well. So now let me show you what can happen if I
now let me show you what can happen if I just switch those informations. So we
just switch those informations. So we start with customers and then with
start with customers and then with employees. Let's go and execute it. As
employees. Let's go and execute it. As you can see, we're going to get
you can see, we're going to get completely different results. Now we are
completely different results. Now we are getting customers informations. And now
getting customers informations. And now in the output, we got three customers
in the output, we got three customers who are not employees at the same time.
who are not employees at the same time. This is not what we want from this task.
This is not what we want from this task. So if you do it like this, it's going to
So if you do it like this, it's going to be incorrect. So pay always attention
be incorrect. So pay always attention here to the order of that query. So now
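The two orderings discussed above can be sketched as follows. This is a minimal, runnable illustration that uses Python's built-in sqlite3 module instead of the SQL Server setup from the video. The table layout (a single first_name column) and the sample rows are assumptions built from the names mentioned in the walkthrough, not the course's actual schema.

```python
import sqlite3

# Hypothetical sample data: first names only, taken from the names mentioned above.
EMPLOYEES = ["Frank", "Kevin", "Mary", "Michael", "Carol"]
CUSTOMERS = ["Joseph", "Kevin", "Mary", "Mark", "Anna"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT)")
conn.execute("CREATE TABLE customers (first_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?)", [(n,) for n in EMPLOYEES])
conn.executemany("INSERT INTO customers VALUES (?)", [(n,) for n in CUSTOMERS])

# Employees first: employees who are not customers (what the task asks for).
employees_only = conn.execute("""
    SELECT first_name FROM employees
    EXCEPT
    SELECT first_name FROM customers
    ORDER BY first_name
""").fetchall()

# Customers first: a different question, customers who are not employees.
customers_only = conn.execute("""
    SELECT first_name FROM customers
    EXCEPT
    SELECT first_name FROM employees
    ORDER BY first_name
""").fetchall()

print(employees_only)  # [('Carol',), ('Frank',), ('Michael',)]
print(customers_only)  # [('Anna',), ('Joseph',), ('Mark',)]
```

Only the order of the two SELECT statements changes between the two queries, yet the results share no rows at all.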
So now let's go and correct it. So we're going to have first employees and then
customers. Let's execute it. And now let's go and understand how SQL executes
the EXCEPT operator.

All right. So again we have the results from the two queries, or from two
tables, and now we are doing EXCEPT between them. So let's see how SQL is going
to execute it. As usual, it's going to take first the column names from the
first query, from the employees, and put them in the output. And now SQL is
going to present data only from the first query in the output, and it's going to
use the customers only as a check. So SQL will not put any data or rows from the
customers in the output. It will just use the second query as a lookup in order
to check the data. So it's going to start with the first employee, Frank Lee. Do
we have Frank Lee in the customers? Well, no, we don't. That's why SQL is going
to accept it and put it in the output. Then in the next step, it's going to go
to the second employee and check. As you can see, we already have it in the
customers. So SQL is going to ignore it. It's not allowed to be in the output.
The same thing for Mary. We have her in the customers as well. That's why she
will not be presented in the output. Then Michael: we don't have a Michael in
customers. That's why he can be presented in the output. And the same thing for
Carol. We don't have Carol as a customer, so we're going to have her in the
output. So as you can see, we get data only from the first table, and the second
table is only used in order to check the information against it. So we don't
have any customers in the output, only employees.

So now let's quickly check what happens if we switch the tables. So now we have
the customers as the first table. SQL is going to take the columns from the
first table, and it's going to start presenting the customer information in the
output and use the employees only as a lookup. So do we have Joseph? We don't
have him in the employees. Then Kevin and Mary: we already have them in the
employees. And Mark and Anna are not part of the employees. That's why SQL can
present the results in the output like this. So now, as you can see, SQL is
focusing on the table customers, and we are getting data from the customers, not
from the employees. Employees is only used as a check. So with that we
understand that the order of the queries is very important for EXCEPT. We will
get different results if we have a different order.
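The row-by-row walkthrough above can be mimicked in a few lines of Python. This is only a conceptual model of the behaviour described (rows come from the first query, the second query is a lookup), not how a database engine actually runs EXCEPT. The function name except_rows and the sample lists are hypothetical.

```python
def except_rows(first, second):
    """Conceptual model of EXCEPT: output rows come only from `first`;
    `second` is used purely as a lookup and never contributes rows."""
    lookup = set(second)   # second query: only checked against
    seen = set()           # EXCEPT also removes duplicates from the output
    output = []
    for row in first:
        if row not in lookup and row not in seen:
            seen.add(row)
            output.append(row)
    return output


# Hypothetical sample data based on the names in the walkthrough.
employees = ["Frank", "Kevin", "Mary", "Michael", "Carol"]
customers = ["Joseph", "Kevin", "Mary", "Mark", "Anna"]

print(except_rows(employees, customers))  # ['Frank', 'Michael', 'Carol']
print(except_rows(customers, employees))  # ['Joseph', 'Mark', 'Anna']
```

Swapping the arguments swaps which list supplies the output and which one is only the lookup, which is why the order matters.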
All right. So that's all for the EXCEPT operator.

Okay. So what is INTERSECT?
INTERSECT returns only the rows that are common to both queries. It's something
very similar to the inner join, and here as well it removes duplicates. So there
will be no duplicates in the output.

All right. Again we have this very simple example where we have five customers
and five employees, and now we're going to combine them using INTERSECT. So what
INTERSECT does is return the common rows between the two tables. So how is SQL
going to execute it? It's very simple. SQL searches for the common values. So
what are the common values? Kevin and Mary. SQL is going to return only those
two values, Kevin and Mary, and all others are going to be excluded from the
results. It's very simple, right? It returns only the common values, and this is
how INTERSECT works in SQL.

Okay, let's have this simple task, and it says: find the employees who are also
customers. So we're going to have the same queries, employees and customers, but
instead of EXCEPT we're going to use INTERSECT. Since we are finding the common
information between the employees and customers, it's very simple and
straightforward. Let's go and execute it. And with that we get Kevin and Mary.
These are the two persons who are employees and customers at the same time. And
of course here we don't have to pay attention to the order of the queries. It's
going to be the same if we say find the customers who are also employees. So if
you just switch, for example, the customers with the employees, you will see
that we get the exact same results. So it doesn't matter which query is first.
Again, just pay attention to the first query, because it defines the column
names.

So now let's understand how SQL executes INTERSECT behind the scenes. Okay,
again our two tables, and now we are doing INTERSECT. So as usual SQL is going
to take the columns from the first query, and now it's going to find the common
data between those two results. It's going to do it row by row. So we have the
employee Frank. Do we have him as a customer? No. So he will not be in the
output. Kevin Brown: we have him in the employees and as a customer over here as
well. So that's why we get him in the output. The same thing for Mary. We have
Mary as an employee and as a customer as well. So we're going to have her in the
output. Michael and Carol are not customers. They are only employees. That's why
we will not get them in the output. The same thing goes for the customers.
Joseph, Mark and Anna are not in the output because they are not employees. So
with that we get only the common information between the two tables or two
queries, and it doesn't matter whether we start with customers or with
employees, we get the same information at the end. All right, so that's all.
It's very simple, right? This is how INTERSECT works in SQL.
All right friends, so now we come to the part where I'm going to show you how I
usually use the set operators in my projects for data analysis or for data
engineering. So here are the most important use cases for the set operators.

All right, the first use case is combining similar tables before doing data
analysis. In some scenarios, we want to generate a report, and we end up writing
similar queries on top of similar tables, and at the end we join all the results
from the queries in order to present the final report. Now, instead of doing
that, what we can do is first combine all the similar information into one
table, and then run one analysis query on top of it in order to generate the
report. And we can do that using UNION or UNION ALL.

Let's have a few examples. So let's say that we have four tables: employees,
customers, suppliers and students. As you can see, all of them share the same
kind of information. They hold data about persons. Now let's say that you are
generating a report that requires all the individuals of the organization in the
database. So what you're going to end up doing is writing a SQL query for the
employees, another one for the customers, and others as well for the suppliers
and the students. And then you're going to merge all the results from those
queries into the final report. Now the issue with this setup is that you have a
lot of similar queries. You have it here four times. And what might happen is
that you change the logic of the first two queries and forget later to do it for
the other two, and you get really inconsistent data in the reports.

So instead of that, we can use the set operators in order to first combine all
those tables into one big table. So what we're going to do is use a UNION in
order to combine those four tables into the table persons. So we're going to
have it like this: we take all the rows from the employees and put them in
persons, then all the rows from the customers, from the suppliers and from the
students as well, and put everything in one big table that holds all the
information about the individuals that we have inside our database. And the next
step, after we combine the data, is to write a SQL query in order to analyze
this new big table, and the result is presented in the reports. And of course
the advantage here is that we have only one SQL query for the data analysis on
top of this table, instead of having it four times. And if you change the logic
of the SQL query, it is applied automatically to all the data that we have in
the database. And we have already done this example, where we combined the data
from the employees and customers.
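The combine-first, analyze-once idea can be sketched like this, using Python's built-in sqlite3 module. Everything concrete here is an assumption for illustration: the columns first_name and country, the sample rows, the choice of a view named persons, and the per-country count used as the example analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows: four tables that all describe persons.
conn.executescript("""
    CREATE TABLE employees (first_name TEXT, country TEXT);
    CREATE TABLE customers (first_name TEXT, country TEXT);
    CREATE TABLE suppliers (first_name TEXT, country TEXT);
    CREATE TABLE students  (first_name TEXT, country TEXT);
    INSERT INTO employees VALUES ('Frank', 'USA'), ('Kevin', 'USA');
    INSERT INTO customers VALUES ('Kevin', 'USA'), ('Anna', 'Germany');
    INSERT INTO suppliers VALUES ('Mark', 'Germany');
    INSERT INTO students  VALUES ('Carol', 'UK');

    -- Step 1: combine the similar tables once. UNION removes the duplicate Kevin row.
    CREATE VIEW persons AS
        SELECT first_name, country FROM employees
        UNION
        SELECT first_name, country FROM customers
        UNION
        SELECT first_name, country FROM suppliers
        UNION
        SELECT first_name, country FROM students;
""")

# Step 2: one analysis query on top of the combined data instead of four.
report = conn.execute("""
    SELECT country, COUNT(*) AS total_persons
    FROM persons
    GROUP BY country
    ORDER BY country
""").fetchall()

print(report)  # [('Germany', 2), ('UK', 1), ('USA', 2)]
```

If the report logic changes, only the single query in step 2 has to be edited, which is the consistency benefit described above.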
Another scenario where we have to combine data before doing any reporting is
that sometimes the database developers divide one big table into multiple small
tables in order to optimize the performance, for example by splitting the orders
by year. We have orders 2022, orders 2023, and so on. Now again, if you want to
generate a report in order to analyze the orders over the years, over time,
either you write a query for each of those tables, or you first combine all
those tables into one table called orders. So what we're going to do is use a
UNION between all those tables in order to generate one central table called
orders: all the rows from the first table, all the rows from the next table, the
next one and the last one. So we put everything in one big table, and once we
have the orders, we write an analytical SQL query on top of the orders in order
to generate the report. So, as you can see, it's a very important step in order
to prepare the data before doing data analysis.
Okay. So now let's have the following SQL task, and it says: the orders are
stored in separate tables. We have the orders and the orders archive. Now
combine all orders data into one report without duplicates. Okay. So looking at
the task, we have to combine two tables, orders and orders archive, so it's
either UNION or UNION ALL. But since the task says without duplicates, that
means we have to go with UNION. But now, before we combine any data, we first
have to understand the content of the orders and the orders archive in order to
map the columns correctly. So first we have to explore the two tables. Let's
start with selecting everything from orders, then a semicolon, and the same from
the second table, sales orders archive, and a semicolon as well. So let's go and
execute it. Now in the output we get two results, because we have two separate
queries. The first result is for the orders and the second one is for the orders
archive. Let me just make it a little bit bigger. And now, as you can see, we
have almost identical tables. We have the order ID, product ID, customer ID. So
everything looks identical, and of course we can check that using the Object
Explorer on the left side. So we have here the orders, and those are the
columns. And if you go to the orders archive, you can see that we have the exact
same columns. So that means we can map all columns from orders to all columns of
orders archive. So let's go and do that. I'm just going to remove all the
semicolons, and then we're going to use the UNION. So now we have everything in
one query. Let's go and execute it. Now we get in the output one single result,
one single table with all the information from orders and orders archive. So we
have all orders now in one table, and everything currently matches. So with that
we have solved the task. We have one result with all orders, we don't have any
duplicates since we are using UNION, and we have combined the data.

But now we have one issue with that. This solution, this query, is quick and
dirty, and actually it's not following the best practices. The best practice
here is to list all the columns explicitly in each query without using the star.
All right. So now let's go and do that. We need a list of all columns from the
table orders and the table orders archive. And since we have a lot of columns,
what we're going to do is go to the Object Explorer, right-click on the table
name, and then select the top thousand rows. So let's click on that. And now we
get a very simple SELECT statement with all the column names from the table
orders. This is what I usually do if I need all the columns in my SELECT
statements. So let's copy it and go back to our query. Then let's replace the
first star with those columns. And we're going to do the same thing for the
orders archive, since they have the same names. So let's go and do that as well.
Let me just make this smaller in order to see the query. So now we have a SELECT
for the table orders with all columns, and a SELECT with all columns for the
table orders archive as well. So let's go and execute it. And of course we get
the same results.

Now you might ask why we are doing this. Why didn't we stick with the star? It's
quick. It's simple. Well, for the following reason. Currently the status is that
everything matches. We have 100% identical tables. But what happens over time is
that we do development in our solution, and we might change the schema of the
table orders. So we might rename stuff, we might add new columns, or maybe
switch the columns. This means that over time the table orders will no longer be
identical to the archive. And this is of course a problem if you are mapping the
data blindly using the star. So let me show you what I mean. Let's say that we
are developing the orders table and we just switch those two columns in the
schema for some reason. So now we have the product ID first and then the order
ID. So let's go and execute it. Now if you are using the star, you will not
notice this. But if you are listing the columns in the script, you see
immediately that here we have first the order ID and then the product ID, and
here we have the opposite. So listing the columns is clearer than using the
star. And now, as you can see in the output, we have a problem: here we have
order IDs and then suddenly we have something like the product ID. So we end up
with incorrect data, which leads to incorrect analysis. So here the best
practice is to not use the star and to clearly list all the columns.
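The column-switch problem described above can be reproduced in a small sketch with Python's built-in sqlite3 module. The two tables, their three columns (order_id, product_id, sales) and the sample values are assumptions for illustration. The real tables in the video have more columns and live in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas: the first two columns of orders were switched during
# development, so they no longer line up with orders_archive.
conn.executescript("""
    CREATE TABLE orders         (product_id INTEGER, order_id INTEGER, sales INTEGER);
    CREATE TABLE orders_archive (order_id INTEGER, product_id INTEGER, sales INTEGER);
    INSERT INTO orders         (order_id, product_id, sales) VALUES (1, 101, 10);
    INSERT INTO orders_archive (order_id, product_id, sales) VALUES (2, 102, 20);
""")

# Quick and dirty: the star maps columns by position, so IDs get mixed up silently.
cur = conn.execute("SELECT * FROM orders UNION SELECT * FROM orders_archive")
star_columns = [col[0] for col in cur.description]
star_rows = sorted(cur.fetchall())

# Explicit column lists: each column is mapped to its true counterpart.
explicit_rows = sorted(conn.execute("""
    SELECT order_id, product_id, sales FROM orders
    UNION
    SELECT order_id, product_id, sales FROM orders_archive
""").fetchall())

print(star_columns)   # ['product_id', 'order_id', 'sales']
print(star_rows)      # [(2, 102, 20), (101, 1, 10)]  order ID 2 sits under product_id
print(explicit_rows)  # [(1, 101, 10), (2, 102, 20)]
```

Neither query raises an error, which is the danger: the star version returns a result that looks valid but mixes order IDs and product IDs in the same column.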
Now one more technique that I usually use once I'm combining data is that I add
the source of the data inside the query. So what do I mean by that? You can see
that we have here two orders with the order ID one. They are not duplicates.
They are completely different information, and that's because they come from
different tables. So what I usually do is add the source of each record. It's
really nice information for the analytics, for the users, to understand where
these records come from. So how are we going to do that? We're going to have,
for example, as the first column the following word, let's say orders, and we're
going to call it, let's say, source table. And we're going to do the same thing
in the second query as well. Right? But the source table here is not the orders,
it's the orders archive. So I'm just adding a static column to my query in order
to see the source table. So now we have here two different values. Let's go and
execute it. And now you see we have created a new column called source table
which has only two values: orders and orders archive. Let's go and sort the data
by the order ID. So ORDER BY order ID. Let's go and execute it. And now you can
see it very clearly. The first order with order ID one comes from the table
orders, and the second one comes from the orders archive. So this is really nice
information that you can add to your data once you are combining multiple
tables.
tables. So that's all about this use case on how to combine data between
case on how to combine data between different
tables. All right. Now we have another use case for the set operators. It's
use case for the set operators. It's more for data engineers. We can use the
more for data engineers. We can use the except in order to find the delta
except in order to find the delta between two batches of data. For
between two batches of data. For example, data engineers build data
example, data engineers build data pipelines in order to load daily new
pipelines in order to load daily new data from the source systems to a data
data from the source systems to a data warehouse or a data lake. Now, in those
warehouse or a data lake. Now, in those data pipelines, we have to build a logic
data pipelines, we have to build a logic in order to identify what are the new
in order to identify what are the new data that is generated from the source
data that is generated from the source system in order to insert it in the data
system in order to insert it in the data warehouse. One way to do it is to use
warehouse. One way to do it is to use the set operator except in order to
the set operator except in order to compare the current data with the
compare the current data with the previous load. Let's have a very simple
previous load. Let's have a very simple example. So in the day number one we
example. So in the day number one we have two customers one and two. So what
have two customers one and two. So what going to happen in this day we're going
going to happen in this day we're going to go and load those two customers into
to go and load those two customers into the data warehouse. So in the data
the data warehouse. So in the data warehouse we will get as well one and
warehouse we will get as well one and two. So this is for the first day
two. So this is for the first day nothing is crazy. We just load the data
nothing is crazy. We just load the data as it is. Now for the second day we will
as it is. Now for the second day we will get the new data from the source system
get the new data from the source system and it's going to look like this. So now
and it's going to look like this. So now if you check the second day you can see
if you check the second day you can see that we have again the customer number
that we have again the customer number one we have already loaded to the data
one we have already loaded to the data warehouse. So we have it as the previous
warehouse. So we have it as the previous day but we have a new customer ID number
day but we have a new customer ID number three. So now in order to load only the
three. So now in order to load only the new data we don't need to load again the
new data we don't need to load again the customer number one. What we can do? We
customer number one. What we can do? We can do an accept between the day number
can do an accept between the day number two with the previous load with the day
two with the previous load with the day number one. So now if we simply do an
number one. So now if we simply do an accept between those two sets we're
accept between those two sets we're going to go and identify the new data
going to go and identify the new data that is existing in the source system
that is existing in the source system which is only the record number three.
which is only the record number three. So now what going to happen if we do
So now what going to happen if we do except between day two and day one we
except between day two and day one we will get one record the new record that
will get one record the new record that we're going to go and insert it inside
we're going to go and insert it inside our data warehouse. So as you can see
our data warehouse. So as you can see this set operator except is very
this set operator except is very powerful in order to compare two sets
powerful in order to compare two sets and not only for data analysis we can
and not only for data analysis we can use it as you can see for data
use it as you can see for data engineering in order to identify what is
engineering in order to identify what is the new data that is generated from the
the new data that is generated from the sources in order to insert it inside our
sources in order to insert it inside our data warehouse.
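The day one / day two comparison above can be sketched as follows. This is only a minimal illustration run through Python's built-in sqlite3 module: the table names `day1_load` and `day2_load`, the column `customer_id`, and the exact content of day two (customers 1 and 3) are assumptions, not something shown in the course.

```python
import sqlite3

# Hypothetical staging tables, one per daily batch from the source system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE day1_load (customer_id INTEGER);
    CREATE TABLE day2_load (customer_id INTEGER);
    INSERT INTO day1_load VALUES (1), (2);
    INSERT INTO day2_load VALUES (1), (3);  -- assumed content of day two
""")

# Rows in the current batch that were not part of the previous load.
delta_sql = """
    SELECT customer_id FROM day2_load
    EXCEPT
    SELECT customer_id FROM day1_load
"""
new_rows = conn.execute(delta_sql).fetchall()
print(new_rows)  # [(3,)]
```

Only customer 3 comes back, so only that record would be inserted into the warehouse.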
Okay, one more use case for the set operators that I personally use a lot in
my projects: if you are doing data migrations, you can use EXCEPT in order to
check the data quality, and more specifically the data completeness. Okay, so
we have the following scenario where we are doing a data migration between two
databases. Let's say that we would like to move this table from database A to
database B, so we load the table into the new database. What is very important
after you move the data is to check whether all the records did move from
database A to database B, that we are not missing anything, not even one
record. So we want to do a data completeness test, and there are many methods
for doing this test. One of them is to use the set operator EXCEPT. So how are
we going to do it? We do an EXCEPT between the table from database A and the
table from database B in order to find any record that is still in database A
but was not migrated to database B. And of course the best result is that we
get nothing. The result should be empty. If we get an empty result, that means
all the rows from database A exist in database B. Now of course we are not
done yet. We want to do the comparison the other way around. We want to find
any new rows in database B that we don't find in database A. Those two tables
must be identical. So what we're going to do is another EXCEPT, but this time
the first table is the one from database B, and then we compare it with
database A.
And we have the same expectation: the output should be empty as well. After
doing the EXCEPT twice, once for each side, and getting empty results both
times, we know those two tables are identical and we are not missing anything.
So this is another amazing use case for the set operators, in order to improve
the quality of your data migrations and to do a data completeness test.

Okay. So now let's have a quick summary of the set operators. A set operator
combines the rows of multiple queries, multiple tables, into one single
result. We have four different types of set operators. The first one is UNION,
which combines all the rows but without including any duplicates. The second
one is UNION ALL. It's very similar, but it keeps the duplicates. The third
one is EXCEPT, which shows all the rows from the first query that cannot be
found in the second query. And the fourth one is INTERSECT, which shows the
common rows between two queries. Of course, SQL has rules for using the set
operators: both queries should have the same number of columns, the same data
types, and the same order of columns. And the last rule: don't forget that the
first query controls the aliases, the names of the columns, and the data types
of the entire result. We have also found amazing use cases for the set
operators. For example, using UNION and UNION ALL in order to combine similar
information into one big table.
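The combine use case, including the static source column from the orders and orders archive example, might look like this. It is a sketch run through Python's built-in sqlite3 module: the table and column names (`orders`, `orders_archive`, `order_id`, `sales`, `source_table`) and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, sales INTEGER);
    CREATE TABLE orders_archive (order_id INTEGER, sales INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 15);
    INSERT INTO orders_archive VALUES (1, 10), (3, 20);
""")

# Each query adds a static text value that records where the row came from.
# The first query decides the column names of the combined result.
combined_sql = """
    SELECT 'Orders' AS source_table, order_id, sales FROM orders
    UNION
    SELECT 'OrdersArchive' AS source_table, order_id, sales FROM orders_archive
    ORDER BY order_id, source_table
"""
combined = conn.execute(combined_sql).fetchall()
for row in combined:
    print(row)
```

Because the source column differs between the two queries, order 1 shows up twice here, once per table, which is exactly what makes the origin of each record visible.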
Or we can use the amazing EXCEPT operator in order to compare two different
results and find the differences between them. I usually use it for data
quality checks, to test data completeness.
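The two-directional completeness test from the migration scenario can be sketched like this. It runs through Python's built-in sqlite3 module, and for simplicity two tables in one in-memory database stand in for database A and database B. The names `customers_a`, `customers_b`, the helper `rows_only_in`, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_a (id INTEGER, name TEXT);  -- stands in for database A
    CREATE TABLE customers_b (id INTEGER, name TEXT);  -- stands in for database B
    INSERT INTO customers_a VALUES (1, 'Maria'), (2, 'John'), (3, 'Georg');
    INSERT INTO customers_b VALUES (1, 'Maria'), (2, 'John');  -- row 3 not migrated
""")

def rows_only_in(first, second):
    """Return the rows of `first` that EXCEPT cannot find in `second`."""
    sql = f"SELECT * FROM {first} EXCEPT SELECT * FROM {second}"
    return conn.execute(sql).fetchall()

missing_in_b = rows_only_in("customers_a", "customers_b")     # A minus B
unexpected_in_b = rows_only_in("customers_b", "customers_a")  # B minus A

# The migration passes the test only if both directions come back empty.
is_complete = not missing_in_b and not unexpected_in_b
print(missing_in_b, unexpected_in_b, is_complete)  # [(3, 'Georg')] [] False
```

Here the first direction returns the one record that never arrived in B, so the check fails, as it should.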
And another use case: as a data engineer, you can implement EXCEPT in the
logic of your data pipelines in order to identify the new data that must be
inserted into your system.
Okay my friends. So with that we have learned all the set operators that we
have in SQL, and with that you have learned how to combine your data from
multiple tables using SQL. So we are done with this chapter. Now we're going
to go to the right side and start talking about the functions in SQL. Here we
have two big families. The first one is the row-level or single-value
functions, and the second one is the aggregate and analytical functions. So
let's start with the first one, the row-level functions.
And here we can group them into multiple categories, and we will start with
the string functions. But first let's understand what exactly functions are
and why we need them in SQL. So let's go.

Okay. So what exactly is a function and why do we need it? Again, we have our
data inside a table, and there is a lot of stuff that you can do with your
data. Sometimes you have to change the values of your data, like doing data
manipulation, or you want to do some aggregations and analyses. So maybe you
want to analyze your data, find insights, and build reports. Sometimes you
might find bad data inside your tables and you want to clean that up, so you
want to do data cleansing. And sometimes you have to do data transformations
and data manipulation in order to solve some SQL tasks. In SQL, in order to
solve those tasks, we have functions.
So again, what exactly is a function? It is a built-in code block that accepts
an input value. The function processes this value and returns a result, an
output value. So you give an input value, it does some transformations, and
it gives an output. We can group the functions into two big categories. The
first one we call single-row functions. You give the function only one value
and in return you get one value as well. So the input for the function is one
single value, like Maria, and the output of the function is a single value as
well. One value in, one value out. The other category of functions we call
multi-row functions. For example, the function SUM accepts multiple rows,
multiple values, say 30, 10, 20, 40. The function then summarizes all those
rows and returns only one value in the output. The sum of all those values is
100. So the input is multiple rows and the output is one single value. Those
are the two main categories of functions in SQL.
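To make the two categories concrete, here is a small sketch run through Python's built-in sqlite3 module. The table `demo`, its columns, and the names other than maria are made up for illustration, and UPPER is just one possible single-row function; the values 30, 10, 20, 40 are the ones from the SUM example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demo (name TEXT, score INTEGER);
    INSERT INTO demo VALUES ('maria', 30), ('john', 10), ('georg', 20), ('martin', 40);
""")

# Single-row function: one value in, one value out, applied to every row.
per_row = conn.execute("SELECT UPPER(name) FROM demo ORDER BY rowid").fetchall()

# Multi-row function: many rows in, one value out.
total = conn.execute("SELECT SUM(score) FROM demo").fetchone()[0]

print(per_row)  # [('MARIA',), ('JOHN',), ('GEORG',), ('MARTIN',)]
print(total)    # 100
```

Four rows go into each query, but only the single-row function returns four rows back.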
Now my friends, you have to understand something about functions: you can
nest functions together. So you can use multiple functions together in order
to manipulate one value. And this technique is not only in SQL, it works in
any programming language. Let's have this example. We have the function LEFT.
It extracts a few characters, let's say two characters. The input for this
function, let's say, is Maria. This value enters the function, the function
extracts the first two characters, and in the output we get only two
characters, Ma. So this is one function: we have an input and an output. Now
you might say, you know what, we have multiple steps on this value. In the
first step we want to extract the first two characters using the LEFT
function, but we have a second step: we want to transform this output into
lowercase characters. So we have another function, LOWER, and the input for
this second function is the output of the first function. So Ma is at the
same time an output and an input for another function.
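The chain described so far can be tried out step by step. This sketch uses Python's built-in sqlite3 module, and because SQLite has no LEFT function, `substr(value, 1, 2)` stands in for `LEFT(value, 2)` here; that substitution and the helper name `scalar` are assumptions of this example. The last line wraps one more function around the result in the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a query that returns exactly one value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Step 1: the inner function alone (substr stands in for LEFT).
step1 = scalar("SELECT substr('Maria', 1, 2)")
# Step 2: the output of step 1 becomes the input of LOWER.
step2 = scalar("SELECT lower(substr('Maria', 1, 2))")
# Step 3: a further function wraps the whole expression again.
step3 = scalar("SELECT length(lower(substr('Maria', 1, 2)))")

print(step1, step2, step3)  # Ma ma 2
```

Reading the expression from the inside out gives the order in which the steps run.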
So the LOWER function takes this value and converts it to lowercase
characters. It's like inside a factory: the materials are processed at
multiple stations, and the output of one station is the input for the next
station. This is exactly what we can do with functions. So how are we going
to build that? The first step is to start with the first function. That is
simple, one function. For the next step, on the left side you write LOWER
and put the whole thing in parentheses. Now the whole first function is
inside another function, and with that you have nested one function in
another. And of course, if you need a third function, for example the
length, you put the whole thing again between two parentheses. That means
the output of LEFT goes to LOWER and the output of LOWER goes to the length.
So it is very simple, and the order of execution always starts with the
innermost function. The LEFT function is executed first, then the outer
function, LOWER, and the last function to be executed is the length. This is
how nested functions work in SQL or in any programming language.

Now my friends, in SQL we have a lot of functions, and that's why we have to
group them into subcategories as well. If you are talking about the
single-row functions, we have functions for string values, for numeric
values, for date and time, and functions to handle the nulls.
And if you are talking about the multi-row functions, here we have basically
two groups. The first one is the simple aggregate functions. Those are the
basics for aggregating your data. And we have another, advanced one: we call
it the window functions, or sometimes the analytical functions. Now my
friends, it is very important to understand those functions, because using
them you can do whatever you want with your data. And if I look at those two
groups, the single-row functions are functions to manipulate and prepare the
data for the second group. So if you are thinking about data engineers and
data analysts: the data engineers prepare the data in SQL using the
single-row functions. You use them to clean up, transform, and manipulate
your data in order to prepare it for the analyses. And if you are a data
analyst, you will mostly be using the aggregate functions, in almost every
task. So I really see it like this: the single-row functions for data
engineers and the multi-row functions for data analysts. And my friends,
what we're going to do in this course is visit each of those subgroups one
by one, exploring the functions, understanding how they work and when to use
them. So let's start with the first group, the string functions. Here we're
going to learn how to manipulate string values. So let's go.

Okay. So since we have a lot of string functions, I'm going to divide them
into categories based on their purpose. For example, we have a group of
functions that manipulate string values: concatenation, UPPER, LOWER,
REPLACE, and so on. Another group has only one function, and it is where we
can do calculations on string values. And the last group is all about how to
extract something from a string value. Here we have three functions: LEFT,
RIGHT, SUBSTRING. So now let's start with the first group, data
manipulation, and the first function we have here is CONCAT.

All right. So what exactly is CONCAT, or concatenation? It combines multiple
string values into one value. So if you have multiple things, you can put
everything in one value. Let's have a very simple example. Let's say that
you have one value, Michael. So here you have a first name, and you have a
totally separate value for the last name, another column where you have a
value like Scott. And now you say, you know what, it makes no sense to have
the first name separated from the last name. I would like to combine them in
one value. So you can use CONCAT in order to combine those two values, or
multiple values, into one single value like Michael Scott.
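The Michael Scott example can be sketched as below. It runs through Python's built-in sqlite3 module and uses the standard `||` concatenation operator in place of CONCAT, since older SQLite builds do not ship a CONCAT function; the table `people` and its column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (first_name TEXT, last_name TEXT);
    INSERT INTO people VALUES ('Michael', 'Scott');
""")

# Combine two columns into one value, with a space as the separator.
full_name = conn.execute(
    "SELECT first_name || ' ' || last_name AS full_name FROM people"
).fetchone()[0]
print(full_name)  # Michael Scott
```

In a system that has CONCAT, the same idea would be written with the two columns and the separator as its arguments.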
I think that pretty much sums it up. So it is nicer to see the full name in
one value instead of having two columns for that. So that's it. This is why
we need concatenation. Now let's go back to SQL in order to try that out.
Okay. So now we have the following task: show a list of customers' first
names together with their country in one column. That means we have to make
a list of customers and we have to combine two columns into one. So let's
start writing the query. SELECT: we need the first name and the country from
the table customers. First let's go and execute this. As you can see, we
have a list of customers, but the issue here is that the first name and the
country are in different columns, and the task says they should be in one
column. In order to combine those two things we have to use the concatenate
function, so CONCAT. I'm going to start with the first argument, which is the
first name, and then the country, like this. And we're going to give it a
name, let's call it name country. Now let's go ahead and execute it. In the
output you can see we have a new column called name country, and we have
both pieces of information in one column. So we have MariaGermany, JohnUSA.
But it doesn't really look good because there's no spacing between them. We
can make some separation between them by just adding one more thing in
between, for example a space. So now we are concatenating the first name
together with a space, this over here, and then the country. Let's go and
execute it. Now as you can see, we have a nice separation between the first
name and the country. And of course you can use different separators, like
maybe a minus or an underscore, and you will get the same effect. So with
that we have a list of customers where we have the first name together with
the country in one column. As you can see, it's very simple. This is how you
combine two columns into one. It is a really nice and easy transformation.
Okay. So that's all about concatenation in SQL. Next we're going to talk
about two functions, UPPER and LOWER.

Okay. So what is the UPPER function? It converts all the characters of a
string to uppercase. It makes everything capitalized. And the LOWER function
is exactly the opposite: it converts everything to lowercase.
and convert everything to a lower case. So let's have very simple example for
So let's have very simple example for those two functions. Okay. So now we
those two functions. Okay. So now we have like three values with different
have like three values with different cases. The first one where you have only
cases. The first one where you have only the first character capitalized and the
the first character capitalized and the rest is lowered and then the same value
rest is lowered and then the same value but everything is lowered and a third
but everything is lowered and a third one where you have everything with an
one where you have everything with an uppercase. Now if you go and apply the
uppercase. Now if you go and apply the function upper to those three values
function upper to those three values what going to happen for the first value
what going to happen for the first value going to go and turn it into an
going to go and turn it into an uppercase. So everything going to be
uppercase. So everything going to be capitalized not only the first
capitalized not only the first character. And now for the second value
character. And now for the second value going to turn it as well to completely
going to turn it as well to completely capitalized. So all the characters going
capitalized. So all the characters going to change. And for the last value it is
to change. And for the last value it is already capitalized. So in the output
already capitalized. So in the output you will get the same value. So actually
you will get the same value. So actually nothing going to happen for that. So
nothing going to happen for that. So this is simply the uppercase. Now let's
this is simply the uppercase. Now let's see what can happen if you use the lower
see what can happen if you use the lower case. For the first value only the first
case. For the first value only the first character going to be changed and then
character going to be changed and then you will have everything in lower case.
you will have everything in lower case. The second value it is already a
The second value it is already a lowerase value. So if you apply lower
lowerase value. So if you apply lower case nothing going to happen. You will
case nothing going to happen. You will get the same value. But for the last one
get the same value. But for the last one everything here is capitalized and if
everything here is capitalized and if you apply lower case all the characters
you apply lower case all the characters going to convert to a lower case. So my
going to convert to a lower case. So my friends this is very simple. Let's go
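
The three cases above can be sketched as one query. The transcript does not
name the value shown on screen, so 'Maria' here is an assumption, as are the
alias names:

```sql
-- Same word in three casings: capitalized, all lowercase, all uppercase.
SELECT
    val,
    UPPER(val) AS upper_val,   -- 'MARIA' for every row
    LOWER(val) AS lower_val    -- 'maria' for every row
FROM (VALUES ('Maria'), ('maria'), ('MARIA')) AS t(val);
```
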
Let's go back to SQL in order to practice that. Okay. So we have the following
task, and it says: transform the customer's first name to lowercase. So now as
you can see, in the first names here the first character is a capital and the
rest is lowercase. So now in this task we have to convert the whole thing into
lowercase. So let's go and do that. It's very simple. We're going to say LOWER,
first name, and let's go and call it low name. So that's it. Let's go and
execute it. Now if you go and compare the low name with the first name, you can
see all the characters are now in lowercase. So that's it for the task. We have
transformed the first name to lowercase. All right. The next task is exactly
the opposite: transform the customer's first name to uppercase. So let's go and
have a new column. We're going to say UPPER, then the first name, as up name.
So that's it. It's very simple. Let's go and execute. Now you can see in the
output we have a new column called up name, and inside it we have the first
name, but now all the characters are in uppercase. So this is how you convert
the case to lower or to upper in SQL.
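
Both tasks fit in one query. This is only a sketch: the sample rows and the
underscore spellings low_name and up_name are assumptions, since the transcript
only gives the aliases as spoken.

```sql
-- Hypothetical rows in place of the course's customer table.
SELECT
    first_name,
    LOWER(first_name) AS low_name,  -- 'maria', 'john', 'george'
    UPPER(first_name) AS up_name    -- 'MARIA', 'JOHN', 'GEORGE'
FROM (VALUES ('Maria'), ('John'), ('George')) AS customers(first_name);
```
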
Okay. So that's all about the upper and the lower. Next we're going to talk
about a very interesting function. It is the trim. So the trim function is
going to go and remove the leading and trailing spaces in your string values.
So it's going to go and get rid of the empty spaces at the start and at the end
of a string value. Let's have a very simple example. Okay. So now we're going
to have different scenarios. In the first one, you can have a value like John
where you don't have any spaces, and this is the normal case. But sometimes you
might have it like this, where at the start you have a leading space. You have
an empty space, or sometimes we call it white space. In another scenario the
space might be at the end of the word. So here we call it a trailing space. And
in another scenario you might have both of them. This is really bad, where at
the start you have the leading space and at the end you have the trailing
space. And of course you might not have only one space, you might have multiple
spaces, depending on how long the user pressed the space bar, right? So of
course, my friends, spaces are really evil and it makes no sense to have them
in your data. Now what you have to do is data cleansing. We have to clean up
this mess, and you have the best function in order to clean up the data: you
have the trim. So if you apply trim to the first value, nothing is going to
happen because everything is clean and we don't have any spaces. Now if you
apply it to the second case, where you have a leading space, SQL is going to go
and remove this space. The same thing for the trailing space. So if you have a
space at the end, the trim function is going to find it and clean that up. And
if you have it at the start and at the end, then it's as well no problem. It's
going to go and clean that up. And as well the trim function can go and clean
multiple spaces. So if you have like five spaces or 10 spaces at the end or at
the start, the trim function is going to go and clean that up. So this is how
the trim works.
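
The four scenarios can be tried on literal values. In this sketch the square
brackets are only there to make the spaces visible; they and the alias names
are assumptions, not part of the course material. TRIM with this form is
available in SQL Server 2017 and later.

```sql
SELECT
    CONCAT('[', val, ']')       AS original,  -- spaces visible inside brackets
    CONCAT('[', TRIM(val), ']') AS trimmed    -- '[John]' for every row
FROM (VALUES
    ('John'),         -- clean value
    (' John'),        -- leading space
    ('John '),        -- trailing space
    ('   John   ')    -- several spaces on both sides
) AS t(val);
```
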
And now let's go back to SQL in order to find out whether we have any spaces.
Okay. So now we have a very tricky and interesting task. It says: find the
customers whose first name contains leading or trailing spaces. So now by
looking at those values we have to find any spaces inside the customer's name.
Now by just looking at these results you will not find any white spaces,
because they are really hard to see, especially if they are trailing spaces.
Now we have to write a query in order to detect any spaces in the names. So how
can we do that? Okay. So now think about it a little bit, and I can give you a
hint. You can use the function trim in order to remove any white spaces, and
you have to use it inside a WHERE clause. So what we're going to do, we're
going to say WHERE. So now we have to build a condition to detect any spaces.
So we are saying: the first name is not equal to itself, the first name, after
applying a trim. So after trimming the first name, if it is not equal to the
first name, that means there were spaces. So again, what is going on here?
Let's go for Maria. If Maria has no spaces and you trim this value, nothing is
going to happen. The value is going to stay exactly like before because there
are no white spaces. But if in Maria there is any leading or trailing space,
the trimmed value will not be equal to the first name. So if the column is not
equal to the same column after trimming it, that means there are spaces. So
let's go and execute it. And now we can see in the output we have one customer,
John, where we have this situation.
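
A sketch of that condition, with made-up rows in which only John carries a
leading space (the table and column names are assumptions):

```sql
-- Keep only the rows whose value changes when trimmed.
SELECT first_name
FROM (VALUES ('Maria'), (' John'), ('George')) AS customers(first_name)
WHERE first_name != TRIM(first_name);
-- returns the single row ' John'
```

One caveat on SQL Server in particular: string comparison there ignores
trailing spaces, so this condition catches leading spaces, as with John here,
but it would treat 'John ' and 'John' as equal.
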
Now if you don't believe me or you don't follow me here, we can have another,
easier check. So let's go and comment this out and let's have a look at our
first names. Now we can go and calculate the length of the first name like we
have done before. So length name, and let's go and execute it. Now as you can
see here, for Maria we have five characters, but for John we have four
characters and the length is five, and that's because we have a space
somewhere, and the space is going to count as a character. So here there is
something wrong, right? And you can check the others as well: everything is
matching, but only for John we have an issue. And now in order to see this more
clearly we're going to use two functions, the trim and the length. So first
let's go and trim the first name. And after trimming the values, I'm going to
calculate the length. So we are nesting together the trim and the length. And
I'm going to call it length trim name. So let's go and execute it. Now we can
see the length before trimming any value, and we can see the length after
trimming the values. So you can see over here that John before trimming is five
and after trimming is four. So we have an issue here. Now we can make things
clearer, where we go and subtract from the length of the first name the length
of the first name after trimming the values. So here we can call it maybe a
flag or something. So let's go and execute it. Now by looking at the flag it is
really easy to see: if we have a zero, then everything is fine, we don't have
any white spaces. But if we have higher than zero, like here one, then this is
an indicator that we have a white space.
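
The three length columns can be sketched side by side. The rows and the alias
spellings len_name, len_trim_name and flag are assumptions for illustration:

```sql
SELECT
    first_name,
    LEN(first_name)                         AS len_name,       -- before trimming
    LEN(TRIM(first_name))                   AS len_trim_name,  -- after trimming
    LEN(first_name) - LEN(TRIM(first_name)) AS flag            -- 0 means clean
FROM (VALUES ('Maria'), (' John'), ('George')) AS customers(first_name);
-- 'Maria': 5, 5, 0    ' John': 5, 4, 1    'George': 6, 6, 0
```

Note that SQL Server's LEN does not count trailing spaces, so in that dialect
the flag reflects leading spaces only.
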
Either you do it like this, where the first name is not equal to the first name
after trimming, or you use a more complicated solution where you say WHERE, and
I'm going to remove this from here, the length of the first name is not equal
to the length after trimming. So if you go and execute it, you will get exactly
John again. So this is how we detect any empty spaces inside our data using the
trim function, or maybe as well using the length. But I really prefer the first
solution; it is way easier, using one function.
All right, so that's all about how to remove the empty spaces using the trim.
Next, we're going to talk about a very important function called replace. Now
the replace function is going to go and replace a specific character. So that
means we have something old and we want to replace it with something new. Let's
have a very simple example to understand it. All right. So now imagine we have
a phone number where the data is split by dashes. Now let's say that I don't
like to have the dash in my data. I would like to have a slash, or any other
special character. Now in order to replace the dash, we can use the function
replace. So we have to specify two things for SQL: the old value, the dash, and
the new value, the slash. So if you do that, in the output it's going to go and
remove all those dashes between the numbers, and the replacement is going to be
the slash between them. So it's very simple, right? All you are doing is
replacing an old value with a new value, and that's why we call it replace. But
we can use this function as well in order to remove something, not only to
replace, and you can do that by not specifying anything in the new value, like
just the single quotes, and with that it's going to be nothing, a blank. So now
what's going to happen is it's still going to go and replace the dash with a
blank, and that means I'm just removing the dashes from the output. So if you
do it, you will remove the dashes and you will get only numbers. So if the
replacement is going to be a blank, then that means this function will be
removing any value that you specify. So this is exactly how it works and this
is why we use the replace function in SQL.
Now let's go back in order to practice. So let's do the same example. This time
we're going to go and select from a static value. So we're going to get
123-456-7890. So if you go and execute it, you can see we are getting the phone
number. Now let's go and remove the dashes from this value. So let's have a new
line and we start with REPLACE. The first thing that you have to specify for
SQL is the value itself. So let's go and get the value. This is the first
argument. The second argument is going to be the old value. So the old value is
going to be the dash. And now the third argument will be the replacement. And
since we want to remove it, we don't want to replace it with anything, we will
have just single quotes and nothing between them. So there's no space between
those single quotes. Now we can go and rename stuff, like this is the phone and
this is the clean phone. Let's go and execute it. Now, as you can see in the
output of the function, we don't have any dashes between the numbers. And you
can go and test stuff. Like for example, I can go and add a slash and execute
it. You will see slashes between them. So you can go and try multiple things.
So this is one nice use case for the replace function.
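
Both variants in one sketch. The digits come from the walkthrough; where the
dashes sit inside the number, and the alias names, are assumptions:

```sql
SELECT
    '123-456-7890'                    AS phone,
    -- Empty string as third argument: every dash is removed.
    REPLACE('123-456-7890', '-', '')  AS clean_phone,   -- '1234567890'
    -- Slash as third argument: every dash becomes a slash.
    REPLACE('123-456-7890', '-', '/') AS slash_phone;   -- '123/456/7890'
```
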
Now there is another use case for the replace function: sometimes in my data,
file names are going to be stored, like for example, let's say reports.txt, and
now let's say that I would like to change the file format from .txt to .csv.
Now how are we going to do that? We're going to go with a new line, say
REPLACE, and then the first argument is going to be the value. So let's take
our value from here. And now what is the old value? It's going to be the txt,
and I want to replace it with another format, with another extension. So it's
going to be the csv. So we're going to say this is the new file name and this
is the old file name. So let's go and execute it. And now as you can see in the
output, SQL did replace the txt with csv. This is as well where I use the
replace function in my projects. So my friends, the replace function is really
fun and those are two nice use cases for the replace.
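
The extension swap, sketched on a literal (the alias names are assumptions):

```sql
SELECT
    'reports.txt'                        AS old_filename,
    REPLACE('reports.txt', 'txt', 'csv') AS new_filename;  -- 'reports.csv'
```

REPLACE changes every occurrence of the old value, not only the last one, so a
name like 'txt_notes.txt' would come back as 'csv_notes.csv'. Passing '.txt'
and '.csv' with the dot narrows the match a little.
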
All right. So that's all about the replace function in SQL, and with that we
have covered the whole data manipulations. Now in the next group we're going to
talk about the calculations. And here we have only one function, the length.
Now the length function, it's very simple. It's going to go and count how many
characters you have in one value. So you are calculating the length of a value.
Let's have a very simple example to understand it. Okay. So now let's say that
we have the value Maria. If you apply the length function to that, what's going
to happen? It's going to go and start counting how many characters we have
inside this value. So the M is 1, A 2, 3, 4, 5, and in the output you will get
the number five. So five is the length, or the total number of characters, in
this value. Now let's say that you have a number like 350. If you go and apply
the length function, it's still going to go and count how many digits we have.
The three is 1, five is 2, zero is 3. So the total length for that is going to
be three. So you can apply it even to numbers, and not only that, you can go
and apply it to a date value. So let's say that you have the following date,
2026-01-23. So SQL is going to go and count each digit, each character, even
the dashes, not only the numbers; a dash is a character as well, right? So the
total length of this date is going to be 10. So you can pass any data type to
the length function and in the output you will always get a number. That's it.
This is how you can count the number of characters in any value.
order to practice that. Okay. So now we have the task calculate the length of
have the task calculate the length of each customer's first name. So it is
each customer's first name. So it is very simple. We're going to go and apply
very simple. We're going to go and apply the function length len to the column
the function length len to the column first name and we're going to call it
first name and we're going to call it length name. So let's go and execute it.
length name. So let's go and execute it. And with that as you can see we are
And with that as you can see we are getting in the output numbers and these
getting in the output numbers and these numbers are the number of characters of
numbers are the number of characters of each name of our customers. So this is
each name of our customers. So this is how we calculate the length and that's
how we calculate the length and that's it for this group. Now moving on to the
Now moving on to the next one. It's going to be very interesting. Now we're
going to talk about how to extract something from a string value. And here
we're going to cover two functions, the left and the right. Now the left
function is going to go and extract a specific number of characters from the
start of a string value. So if you want to get a few characters at the
beginning of a value, you can use the left. But now the right function is
exactly the opposite. It's going to go and extract a specific number of
characters from the end of a string value. So if you want a few characters from
the end of your value, you can use right. Now in order to apply the left or the
right function, you have to give SQL two things: the value that you want to
extract a part from, and the number of characters, how many characters you want
to extract, and this is the same for the left and the right. Now let's say that
we have again this value Maria. And now if the task says I would like to
extract the first two characters, since we are talking about the starting
position, we're going to use the left function. And since it says two
characters, we're going to go with the two. So it's going to start counting, M
is one, A is two, and after that it's going to stop and make a cut, and it's
going to go and return the two characters, Ma. So we are counting from the left
side going to the right side, right? Now if your task says extract the last two
characters, here we are talking about the end position of your value, and for
that we're going to use the right function, since we are approaching from the
right side, and since we want only two characters, the number of characters is
going to be two. So this time it's going to start counting from the right side,
moving to the left side. So A is one, I is two, and that's it. Then SQL is
going to stop and extract only those two characters, ia. So if you want to
extract data at the starting position, you use the left. But if you want to
extract characters from the end position of your value, then you use the right
function.
scaler in order to practice. Okay. So now we have the following task. Retrieve
now we have the following task. Retrieve the first two characters of each first
the first two characters of each first name. So we just need the first two
name. So we just need the first two characters. Since we are coming from the
characters. Since we are coming from the left side, we can go and use the
left side, we can go and use the function left. So it's very simple.
function left. So it's very simple. First name and we need only two
First name and we need only two characters. So two. So we're going to
characters. So two. So we're going to call it first to character. Let's go
call it first to character. Let's go ahead and execute it. And now you can
ahead and execute it. And now you can see in the output we have two characters
see in the output we have two characters MA. Now with John we have only G because
MA. Now with John we have only G because we have a leading space. Well, you can
we have a leading space. Well, you can leave it like this or you can transform
leave it like this or you can transform it. And then George we have G and so on.
it. And then George we have G and so on. So with that we are getting the first
So with that we are getting the first three characters. Now in order to fix it
three characters. Now in order to fix it for John what we're going to do we're
for John what we're going to do we're going to say trim first and then apply
going to say trim first and then apply the lift. So with that we are getting
the lift. So with that we are getting rid of all white spaces and then we
rid of all white spaces and then we apply the lift. So with that everything
apply the lift. So with that everything looks perfect. So for John we have jo.
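A minimal sketch of the LEFT and RIGHT calls described above, assuming SQL Server (T-SQL) syntax. The literal values are illustrative stand-ins for the first name column, the column aliases are made up for this example, and TRIM assumes SQL Server 2017 or later:

```sql
-- Literal values stand in for the first name column (illustrative only).
SELECT
    LEFT('Maria', 2)       AS first_2_char,            -- 'Ma'
    RIGHT('Maria', 2)      AS last_2_char,             -- 'ia'
    LEFT(' John', 2)       AS first_2_char_untrimmed,  -- ' J' (the leading space counts)
    LEFT(TRIM(' John'), 2) AS first_2_char_trimmed;    -- 'Jo'
```

Wrapping the value in TRIM before LEFT is what removes the leading space, so the count starts at the first real letter.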
So this is how we can get the first two characters of a column. Now let's move
to the next one. The task says: retrieve the last two characters of each first
name. So this time we need the last two, so we are coming from the right side.
We're going to do it like this: we're going to say RIGHT, first name, and then
two as well. So, "last two characters". Let's go and execute it. And now, as
you can see in the output, we have a new column where we have the last two
characters from the first name. So we have here "ia", "er", and for John it is
working as well, and that's because we don't have any trailing spaces. But if
you have any trailing spaces, then go and use the TRIM function. All right, so
that's all for LEFT and RIGHT, and now we're going to go to the last function:
we have the SUBSTRING. So the SUBSTRING is going to go and extract a part of a
string at a specified position. So this time we don't want something from the
beginning or the end. We want something in the middle. So we want to specify
the starting position, and we want to extract a few characters from there. So
let's have a very simple example to understand it.
Now, in order to use the SUBSTRING you need three things. The first one is the
value itself, from which you want to extract a specific part. Then you have to
specify the starting position, where SQL is going to start extracting the
characters that you want. And SQL needs the length as well: how many characters
we have to extract. So now let's say that we have the following task: after the
second character, extract two characters. From reading this, you can see we
specified the starting position, which is after the second character, and the
length is going to be two characters. So let's have this example. If you have
Maria, we have to specify the starting position. We are saying "after the
second character". So the first character, M, is one. Then A is two. After two,
we get position number three, right? So we start from R. That means we have to
specify three for SQL, because the starting position is going to be number
three. This is after the two. Now we want only two characters, so we want the R
and the I. If you give this to SQL (Maria, starting position three, and length
two), SQL can go and extract the two characters "ri". And this is exactly what
we want: two characters after the second character. So with that, we didn't
extract something from the left or from the right. We extracted at a specific
position. And this is exactly why we need the SUBSTRING. Now let's make it a
little bit more difficult, where we're going to say: after the second
character, extract everything, all the characters. So not only "ri", I would
like "ria". Now nothing has changed about the starting position. It's going to
stay at three. But if you are looking at this value and you want to extract
everything starting from R, that means you have to specify a length of three.
But this is not really good, because let's have another value in the same
column. So we have Martin. The starting position is going to be R as well, but
now the length is going to be different. We have here four characters. So now
the length is not three anymore. It is four. But you have to specify something
at the end for SQL. You can go for four. That's fine for Maria as well. But if
you have a lot of values, it's going to be really hard to specify exactly the
correct length.
That's why, instead of specifying a static number like three or four, we can
use another function. So now, my friends, if you use the LEN function, you will
get the total number of characters, right? So for Maria you will get five. For
Martin you will get six. And those numbers are okay to use as the length,
because they are more than what we need. And that's totally fine. So if you are
saying, "Okay, for Maria, start from the third position and cut five characters
for me", SQL is going to find only three, but you will not get an error. So you
are asking for more than you need, and you will always get all the characters
from the starting position onward. So this is a little trick that we use in
order to make the length dynamic, where we cannot find one value that we can
use in all scenarios. And now let's go back to SQL in order to practice the
SUBSTRING.
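The Maria and Martin walkthrough above can be sketched as follows, assuming SQL Server (T-SQL), where the length function is spelled LEN. The literals and column aliases are illustrative only:

```sql
-- SUBSTRING(value, start_position, length); positions are counted from 1.
SELECT
    SUBSTRING('Maria', 3, 2)              AS two_after_second,  -- 'ri'
    SUBSTRING('Maria', 3, LEN('Maria'))   AS rest_of_maria,     -- 'ria'
    SUBSTRING('Martin', 3, LEN('Martin')) AS rest_of_martin;    -- 'rtin'
```

Passing LEN of the whole value as the length always asks for at least as many characters as remain, so the same expression works for values of any length.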
Okay, so now we have the following task, and it says: retrieve a list of
customers' first names after removing the first character. Now don't ask me
why, but for some reason we don't want to see the first character of the first
names. We want to remove it. So how can we do that? We cannot use LEFT or
RIGHT. We have to go with the SUBSTRING, because it is a little bit more
complicated. So, SUBSTRING, and the first argument is going to be the value, so
it comes from the first name. Then the second argument is the starting
position, so where we want to start. Since the task is saying "I want all the
characters after the first character", that means we will be starting from
position number two. For example, with Maria the first character, M, is
position number one, and we want to start our substring from position number
two. So that was the easy part. Now the next question is how many characters we
want to keep. Do we keep four characters here? In Maria we have four
characters, but in John we have only three, then the next one is four, and so
on. So let's go, for example, with four, and let's call it "sub name". So we
make it static. What can happen? It's going to work for some scenarios, like
Maria: we have here "aria", and for Peter we are getting it too. But for Martin
it is not working. We are not getting the last N, because it has five
characters after the first one. And just by looking at the result, as you can
see, we have one issue with John, and that's because the first character is a
space. So this is really annoying. That's why we use TRIM first, just to get
rid of all those white spaces. And now you can see it's working fine. We are
not getting the J. We have everything after the first character. So now,
instead of keeping this static, what we're going to do is make it variable.
We're going to go and use the length of the first name. With that we make sure
we have enough length to extract. And this can work for any value inside the
first name, even if the name is 20 characters long. So let's go and execute.
And now you can see that for Martin it is working. We have here five characters
after the M. And here we have four characters after the M as well. And here we
have three characters after the J. So it is working completely, and it is fully
dynamic. So this is the trick of using the length together with the SUBSTRING.
And as you can see, we are now using three functions in one go. We have the
LEN, we have the TRIM, and we have the SUBSTRING. And this is what happens in
SQL: we use multiple functions together in order to solve complex tasks. So
this is how you can extract a substring from a string. All right, so that's all
about the SUBSTRING, and with that we have covered a lot of very important
string functions in SQL, and now you have enough tools to manipulate the string
values in your data. Okay, my friends. So with that we have learned how to
manipulate your string values inside SQL using the string functions. Now we
will move to the second group: you will learn how to manipulate the numbers,
the numeric values. So let's go. Okay, so now let's have this example: 3.516.
Let's say that you want to apply the function ROUND and you are using two
decimal places. So what is going to happen? It's going to go and keep only two
digits after the decimal point.
So five and one. And the third digit after the decimal point, the six, will
decide whether the number is going to round up or stay as it is. And since six
is higher than five, SQL is going to round the number up. So instead of having
51 we will get 52. And after that the third digit is going to reset to zero. So
in the output you will get 3.52. Now let's say that you have done ROUND, but
only for one decimal place. It's still going to go and keep only one decimal
place, and that is the five. And this time the second digit is going to decide
whether we round up or not. And since one is less than five, there is no need
to round up, and the five is going to stay as it is. It will not turn to six.
So there is no round up, and the digits after the five are going to reset to
zero. So we're going to get 3.5. Now let's say that you say ROUND with zero.
That means "I don't want to see any digits after the decimal point". So now SQL
is going to go and check the first digit after the decimal point, the five.
This one is going to decide whether the three is going to turn to four or not.
And since we have five, it is good enough to round the number up, because five
or anything above five is going to round the number up. That's why it's going
to be a round up, and SQL is going to return four at the end, and all the
digits after the decimal point are going to be reset to zero. So this is
exactly how the ROUND function works in SQL. Now let's see how we can do that
in SQL. Okay, so now let's go and practice the number functions. What we're
going to do is write a SQL SELECT, but this time we will not select any data
from the database. We are going to practice using a static value, for example
the value 3.516. So let's go and execute it. With that I have this decimal
number. Now let's go and start practicing the ROUND function. So let's go and
round this number, 3.516, and this time we are rounding to two decimals. Let's
go and call it "round two" and execute it. As you can see in the output, we are
rounding to two decimal places and we have the two, because, as we learned, the
six is going to round it up. Now let's go and do the same thing for one. So
"round one", execute. And as you can see in the output, we are rounding to one
decimal. So we have the five, and everything else is zero. And we don't have
six here, because the one is lower than five and it will not round the number
up. And let's round by zero. It is rounding it to an integer, to the four, and
all the decimal digits are zero. And we have four because we have five, and
five is going to round the number up. So as you can see it is really nice, and
this is how we round numbers in SQL. Now there is another number function which
is really cool, called ABS, or the absolute. What it is going to do is convert
any negative number to a positive. So let me show you what I mean. Let's say we
have minus 10. So this is a negative number. But if I say ABS, so the absolute
of minus 10, what will I get? I will get a positive number. So it's like giving
us the absolute of any number, or in other words, it is like converting the
negative to a positive.
And if the number is already positive, nothing is going to happen. So if I say
the absolute of 10, I will get 10 as well. So this is a really nice and cool
function that is really important for transforming numbers in many scenarios,
like if you have mistakes in your database. Let's say negative sales: it makes
no sense to have sales that are negative. So in order to correct the data we
can use ABS to convert all the negative numbers to positive. So this is a
really nice, cool, and easy function to learn. All right, my friends. So that's
all for the numeric functions.
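A short sketch of the ROUND and ABS examples walked through above, assuming SQL Server (T-SQL). The column aliases are illustrative; the trailing zeros in the comments reflect that ROUND on a decimal literal keeps the original number of decimal places in SQL Server:

```sql
-- No table is needed: the query works on static values only.
SELECT
    3.516           AS original_value,
    ROUND(3.516, 2) AS round_2,       -- 3.520 (the 6 rounds the 1 up to 2)
    ROUND(3.516, 1) AS round_1,       -- 3.500 (the 1 is below 5, no round up)
    ROUND(3.516, 0) AS round_0,       -- 4.000 (the 5 rounds the 3 up to 4)
    ABS(-10)        AS abs_negative,  -- 10
    ABS(10)         AS abs_positive;  -- 10
```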
We have covered two very simple functions, and now in the next topic we have a
lot of functions about how to manipulate the date and time in SQL. So let's go.
So what is a date? If you take a look at a calendar and you pick any date, for
example August 20th, 2025, this date could represent an event like a birth date
(happy birthday!) or a project deadline at your work, and mainly it has three
components. The first part is a four-digit number indicating the year. The next
component is the month. Normally we represent the month with a number between 1
and 12. And the last component is the day. This is a number between 1 and 31,
depending on the month. Now, in databases we call this structure of those three
components a date. So this is what we mean by dates in SQL.
All right, so now let's move to the next one. What is time? Time refers to a
specific point within a day. For example, we have 18 hours, 55 minutes, and 45
seconds (18:55:45). This structure has three components as well. The first one
we call the hours. It is a number between 0 and 23 indicating the hour of the
day. The next one is the minutes. This is a number between 0 and 59. Moving on
to the last component, we have the seconds. This is again the same thing, a
number between 0 and 59. This structure with those three components we call, in
databases and SQL, a time. So this is what we mean by time. Now to the last
type. If you go and combine the date together with the time and you put them
side by side, you will get a new structure and a new name in databases, and we
usually call it TIMESTAMP. This name is used in many databases like Oracle,
PostgreSQL, and MySQL. But in SQL Server we have another name for that. We call
it DATETIME. So again, it's very simple. The DATETIME or TIMESTAMP has the date
information together with the time information. Here in this example we have
six components from left to right, and we have a kind of hierarchy in this
structure. We start with the highest, which is the year. Then we have the month
and the day, and then we continue to the hours, minutes, and seconds. So those
are the three different types of date and time information in SQL.
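To make the three types concrete, here is a small sketch, assuming SQL Server (T-SQL) type names. The literal values reuse the date and time from the explanation above, and the column aliases are illustrative:

```sql
-- Casting string literals to each of the three date and time types.
SELECT
    CAST('2025-08-20' AS DATE)               AS date_only,      -- year, month, day
    CAST('18:55:45' AS TIME)                 AS time_only,      -- hours, minutes, seconds
    CAST('2025-08-20 18:55:45' AS DATETIME2) AS date_and_time;  -- all six components
```

In databases that use the TIMESTAMP name instead, the third cast would target that type rather than DATETIME2.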
We have the date alone, or the time alone, or both together in the DATETIME.
All right, let's now explore the data that we have inside our database,
searching for date and time information. Let's go to the table Orders, and if
you expand it, you will find here two columns having the data type DATE. So we
have the order date with the type DATE, and the shipping date with the data
type DATE as well. And if you check the last column, the creation time, this
one is DATETIME2. So now let's go and query that information in order to
understand the structure. I'm just going to select the order ID, the order
date, the ship date, and the creation time from Sales.Orders, and from is big.
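The query being typed here might look like the sketch below. The exact identifiers are assumptions inferred from the spoken names (order ID, order date, ship date, creation time, and the sales orders table); the real schema may spell them differently:

```sql
-- Column and table names are assumed from the narration, not confirmed.
SELECT
    OrderID,
    OrderDate,     -- DATE: year, month, day only
    ShipDate,      -- DATE: year, month, day only
    CreationTime   -- DATETIME2: date plus time, including fractional seconds
FROM Sales.Orders;
```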
So let's go and execute it. Now, if you check both the order date and the ship
date, you can find that here we have only the information about the date, and
we have nothing about the time. So again, here we have a year, month, and day,
and that's why they have the data type DATE. Now let's go and check the
creation time. Not only do we have the date information, but we have the time
information as well. It starts with the date information, year, month, and day,
and then we have hours, minutes, and seconds, and then we have fractions of a
second, milliseconds and so on. So this is what the DATETIME or TIMESTAMP looks
like in databases, and this is what the date looks like. All right, my friends.
Now in SQL I can say that we have three different sources for querying dates.
The first one is dates that are stored inside our database, like we saw here in
those columns: the order date, shipping date, and creation time. All of those
are columns that hold this information, and they are stored inside our
database.
and they are stored inside our database. So this is the first source of dates
So this is the first source of dates that we can get inside our queries. Let
that we can get inside our queries. Let me just remove those stuff and let's
me just remove those stuff and let's stick with the creation time. So let's
stick with the creation time. So let's just execute it. So those are date and
just execute it. So those are date and time informations stored inside our
time informations stored inside our database. The second type is a
database. The second type is a hard-coded date string that we can use
hard-coded date string that we can use inside our queries. Let me show you an
inside our queries. Let me show you an example. So now if we go to a new line,
example. So now if we go to a new line, I can go and define a date like this. So
I can go and define a date like this. So 2025 August 20th. So that in this string
2025 August 20th. So that in this string we have hardcoded a date that is static
we have hardcoded a date that is static for all rows. Let me just call it
for all rows. Let me just call it hardcoded and let's go and execute it.
hardcoded and let's go and execute it. Now we can see in the output we're going
Now we can see in the output we're going to get a static date for all rows. So
to get a static date for all rows. So this going to be the same for all rows
this going to be the same for all rows inside our table. So this value is not
inside our table. So this value is not stored inside our database. This value I
stored inside our database. This value I just added to our query and hardcoded
just added to our query and hardcoded it. So sometimes in queries we define
it. So sometimes in queries we define our dates that's going to be used maybe
our dates that's going to be used maybe later in calculations and so on. Now the
later in calculations and so on. Now the third source of getting dates inside our
third source of getting dates inside our query is using the function get date.
query is using the function get date. Get date is the first and the most
Get date is the first and the most important function that we use in SQL.
important function that we use in SQL. It's going to go and return the current
It's going to go and return the current date and time at the moment of executing
date and time at the moment of executing the query. So let's try that out. I'm
the query. So let's try that out. I'm going to go and get a new line. So get
going to go and get a new line. So get dates. It's very simple. It doesn't
dates. It's very simple. It doesn't accept any values inside the function.
accept any values inside the function. So it's going to be empty. So let's call
So it's going to be empty. So let's call it today. All right. Let's go and
it today. All right. Let's go and execute it. And of course, we're going
execute it. And of course, we're going to get different results because the get
to get different results because the get date now is the date and the time that
date now is the date and the time that I'm recording this video. So currently
I'm recording this video. So currently it is July 18, 2024. And I'm recording
it is July 18, 2024. And I'm recording this around 20 p.m. So as you can see,
this around 20 p.m. So as you can see, this going to be as well repeated for
this going to be as well repeated for each row. We're going to get always the
each row. We're going to get always the same value. So again, this depend on the
same value. So again, this depend on the execution of that query. So during the
execution of that query. So during the tutorial, you're going to learn a lot
tutorial, you're going to learn a lot about the get date and we're going to
about the get date and we're going to use it in a lot of functions. So those
use it in a lot of functions. So those are the three different sources of
are the three different sources of getting date information inside your
getting date information inside your query either from a column inside our
query either from a column inside our database or hardcoded using a string.
database or hardcoded using a string. And the third one is using the get date
And the third one is using the get date in order to get the current date and
in order to get the current date and time informations at the moment of the
time informations at the moment of the query
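The three sources can be sketched side by side in one query. This is an illustration under assumptions: @Orders is a hypothetical stand-in for Sales.Orders with made-up rows, and the aliases HardCoded and Today follow the names used in the narration.

```sql
-- Hypothetical stand-in for Sales.Orders; rows are made up.
DECLARE @Orders TABLE (OrderID INT, CreationTime DATETIME2);

INSERT INTO @Orders (OrderID, CreationTime)
VALUES
    (1, '2025-01-01 12:34:56'),
    (2, '2025-02-05 23:22:04');

SELECT
    OrderID,
    CreationTime,                -- source 1: a value stored in a column
    '2025-08-20' AS HardCoded,   -- source 2: a literal, identical on every row
    GETDATE()    AS Today        -- source 3: current date and time at execution
FROM @Orders;
```

The Today column changes every time the query runs, while HardCoded and CreationTime do not.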
Nice. Now we have a clear understanding of what date and time are in SQL. The next question is how to manipulate this information using SQL functions. Okay. Now we have our date, August 20th, 2025. One of the things that we can do with the date is extract different parts of it. For example, if we are interested only in the year, we can go and extract only the year part. Or if you are interested in the month, you can go and extract the month and you will get August. And of course, we can go and extract the day, and we will get the 20. So this is the first thing that we can do: we can extract the parts of the date.
Now another thing that we can do is change the date format. So instead of having a small minus between those date parts, we can go and split them using a slash. We can even start first with the month, August, then 20, the day, and then the year, but with only the short form of the year, 25. Or we can change the format where we say we don't need any special character and just leave it as a space. So as you can see, we are changing and manipulating the format of the date.
Another category or task is date calculations. So we can take our date and add to it, for example, 3 years, or we can find the difference between two dates, like we are doing a subtraction, and we will get, for example, 30 days. So we can add stuff, subtract stuff, or find differences between two dates. It's like we are doing calculations on the date.
Now the last thing that we can do with this date is test it or validate it, whether it is a real date that SQL understands. So we can put it to the test, and at the output we're going to get true or false, or zero and one. So as you can see, here we have different ways, or let's say categories, of how to manipulate our dates in SQL.
Now we're going to group the different date and time functions under four categories. The first category, and the most important one, is part extraction, and here we have around seven different functions that we can use to do this task. Another category is format and casting, and here we have three different functions: FORMAT, CONVERT, and CAST. Then the third category is date calculations, where we have two functions, DATEADD and DATEDIFF. And the last category is validation, where we have only one function, called ISDATE. So as you can see, we have a lot of SQL functions. We have 13 date and time functions that we're going to cover in this tutorial on how to manipulate date and time information in SQL, and this is how we can group them into four different categories. Let's start now with the biggest category, part extraction. We're going to cover all seven of those functions in detail on how to extract parts.
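As a rough preview of the four categories, the sketch below applies one function from each to the example date. The variable @d, the format string, the second date, and the aliases are assumptions chosen for illustration; the functions themselves are covered in detail later.

```sql
DECLARE @d DATE = '2025-08-20';

SELECT
    YEAR(@d)                          AS PartExtraction,  -- 2025
    FORMAT(@d, 'MM/dd/yy')            AS Formatting,      -- month/day/two-digit year
    DATEADD(year, 3, @d)              AS PlusThreeYears,  -- 2028-08-20
    DATEDIFF(day, '2025-07-21', @d)   AS DaysBetween,     -- 30
    ISDATE('2025-08-20')              AS IsValidDate;     -- 1 or 0; result depends on session language/date format
```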
All right, friends. Now we're going to cover three very easy, quick functions in SQL to extract the parts of a date. They are very simple. The DAY function returns the day from a date, in the same way MONTH returns the month from a date, and guess what, YEAR returns the year from a date. Okay. So now, in order to understand how they work, we have a date like this one: 2025, August 20th. Sometimes you are not interested in the whole date. You would like to get only a part of this date. So you use the function DAY in order to extract the two digits, 20. In another scenario you might be interested in the month information, so you would like to get those two digits, 08. So we can use the function MONTH in order to extract the month information and get August, so 08 (returned as the number 8). And one more situation is where you want to have only the year information. So you are interested in the four digits, 2025. You can use the function YEAR in order to extract it, and in the output you will get 2025. So it's very simple. This is how those three functions work.
All right. Now let's check the syntax of those three functions. It's pretty easy. We always have it like this: a keyword called DAY, which is the function name, and then it accepts only one parameter, the date. The same goes for the others. We have a function called MONTH, and it accepts as well only one parameter, the date, and the same thing for YEAR. So the syntax is very straightforward. It accepts only one value, the date, and the function name is the name of the part that we want to extract.
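The syntax can be sketched on the example date from the narration. The variable name @d and the aliases are assumptions for illustration.

```sql
DECLARE @d DATE = '2025-08-20';

SELECT
    DAY(@d)   AS DayPart,    -- 20
    MONTH(@d) AS MonthPart,  -- 8 (an integer, so no leading zero)
    YEAR(@d)  AS YearPart;   -- 2025
```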
All right. So now let's try out those functions. I will be working with the column creation time. Let's try, for example, extracting the year from the creation time using the YEAR function. It's going to be very simple. It's going to be YEAR and then the creation time, like this, and let's call it year. That's it. Let's go and execute it. Now as you can see, it's very simple. We have only one year, 2025, from the creation time. So with that, we got a new column where we have only the year information inside it, and this information comes from the creation time. So we have only 2025. Now let's do the same for the month. We're going to have the same thing: MONTH, creation time, and let's call it month. So let's execute it. Now as you can see in the output, we got as well the number of the month. So we have here January, February, and March, and this information as well is extracted from the creation time. And the same thing using the DAY function. So let's go and use that: DAY, creation time, and we call it day. So now as you can see in the output, we have the day part from the creation time. Here we have 1, 5, 10 and so on, and all of this information comes from the creation time. So as you can see, those three functions are very simple and quick for extracting parts from a date or datetime.
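A sketch of the three calls applied to a column. @Orders is a hypothetical stand-in for Sales.Orders, and both the rows and the alias names are assumptions for illustration.

```sql
-- Hypothetical stand-in for Sales.Orders; rows are made up.
DECLARE @Orders TABLE (OrderID INT, CreationTime DATETIME2);

INSERT INTO @Orders (OrderID, CreationTime)
VALUES
    (1, '2025-01-01 12:34:56'),
    (2, '2025-02-05 23:22:04'),
    (3, '2025-03-10 08:15:30');

SELECT
    OrderID,
    CreationTime,
    YEAR(CreationTime)  AS CreationYear,   -- 2025 on every row
    MONTH(CreationTime) AS CreationMonth,  -- 1, 2, 3
    DAY(CreationTime)   AS CreationDay     -- 1, 5, 10
FROM @Orders;
```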
All right. So what is DATEPART? DATEPART returns a specific part of the date as a number. All right. So now back to our example. We have learned how to extract the day, month, and year. But of course, in a date we have more information that we could extract, not only those three. We could extract, for example, the week, right, or the quarter. All of this information is stored in the date as well. We cannot see it as a value, but in SQL you can extract the week and the quarter. We don't have dedicated functions for those, because they are not as commonly used as the year, month, and day, but we can still extract this information using DATEPART. For example, we can say DATEPART and specify the part as week, and with that SQL is going to return, for this example, 34. And maybe in another situation you are interested in the quarter, right, so you can specify it like this: DATEPART, quarter. So we are interested in the quarter part, and in the output you will get 3. This is exactly the power of DATEPART: you can extract way more of the parts that are available in these dates. And one more thing to notice about DATEPART, YEAR, and DAY: all of them always return an integer, a number. So we have 3 for the quarter, 34 for the week, 20 for the day, 2025 for the year, and so on. All of those values are integers. So integer is the data type of the output of these functions.
Okay. So let's have a look at the syntax of DATEPART. It starts with the function name DATEPART, and it accepts two parameters. The first one is the part that we want to extract. So we define what we want: the month, the day, the year, and so on. And the second parameter is the date itself. So let's have an example. We can say DATEPART, and we would like to extract the month from the order date. So the part is month, and the order date is the date that we want to extract from. With that, we are specifying the part as month. Now in SQL there is another way to specify the parts. We can use an abbreviation of the month. So if, instead of writing the whole word month, you write mm, you will get the same result. It's like an abbreviation or shortcut for writing scripts. But I rarely see that in implementations. I always tend to write it out completely, like this, month, because it's more standard if you are switching between different databases. So as you can see, it's very simple. You have to give SQL two things: which part you want to extract and the date that you want to extract from.
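The two-parameter syntax, sketched on the example date. The variable @d and the aliases are assumptions; the week value quoted in the comment is the one given in the narration, and week numbering can depend on session settings such as SET DATEFIRST.

```sql
DECLARE @d DATE = '2025-08-20';

SELECT
    DATEPART(month,   @d) AS MonthFull,    -- 8
    DATEPART(mm,      @d) AS MonthAbbrev,  -- 8, same result via the abbreviation
    DATEPART(quarter, @d) AS QuarterPart,  -- 3
    DATEPART(week,    @d) AS WeekPart;     -- 34 in the narration's example
```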
Okay. So now we're going to extract different parts from the creation time using DATEPART. Let's start, for example, by extracting the year again. So let's go and do that: DATEPART, and then we have to specify which part we need. So we're going to write year, like this, and then the next one is going to be the value, so it's going to be the creation time. Let's call it year, and let's say DATEPART. Let's go and execute it. So now at the output you can see we got again the year that is extracted from the creation time. It's identical to the YEAR function, so there is no difference between them. Both of them are integers and hold the year information. Now we can go and try different parts. For example, let's copy the whole thing and extract the month. So you can go over here, change it to month, rename it, and execute. At the output you see we got the months, identical as well to the MONTH function. And the same thing for the day. We are just changing the part, and in the output we are getting that part. So here we have the days as well, identical to the DAY function.
So far we don't have anything new from DATEPART, because we have it already from the other functions. But now we're going to extract other parts that are not year, month, and day. For example, let's go and get the hours. So we have DATEPART, and here as the part you say hour, and let's call it hour as well. Let's go and execute it. Now you can see in the output we have a new dedicated column that shows only the hour. So we have here 12, 23, and so on, and this information comes from the time. In the same way you can specify minutes and so on. But now let's get something interesting, like the quarter. So let's duplicate it, and instead of hour let's get quarter. This information is not displayed in the creation time, but SQL can extract it. So let's call it quarter and execute it. Now as you can see in the output, we have one new field called quarter, and inside it we have a 1 everywhere, because all those dates are in the range of quarter one. So as you can see, this is amazing, of course, for reporting and analysis. Let's go and have something else, like the weekday. So over here, instead of quarter, let's write weekday and rename this as well to weekday. So let's go and execute it. All right. Now let's get something else, like for example the week. So I just duplicated it, and over here instead of quarter let's write week. I would like to get the week number. So let's go and execute it. Now in the output, as you can see, we got a dedicated field that shows us the week number from the creation time. We can see these dates come from week number one, those two come from week number two, and so on. So that's it. As you can see, guys, all the information that you are getting from DATEPART is numbers. And now we can extract way more information than only the year, month, and day, even if that information is not displayed directly in the field itself, like the quarter, weeks, and so on.
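The sequence of DATEPART calls from this walkthrough could be collected into one query along these lines. @Orders is a hypothetical stand-in for Sales.Orders, and the rows and alias names are assumptions; the weekday and week numbers depend on session settings such as SET DATEFIRST, so no values are quoted for them.

```sql
-- Hypothetical stand-in for Sales.Orders; rows are made up.
DECLARE @Orders TABLE (OrderID INT, CreationTime DATETIME2);

INSERT INTO @Orders (OrderID, CreationTime)
VALUES
    (1, '2025-01-01 12:34:56'),
    (2, '2025-02-05 23:22:04');

SELECT
    OrderID,
    CreationTime,
    DATEPART(year,    CreationTime) AS Year_dp,     -- same value as YEAR()
    DATEPART(month,   CreationTime) AS Month_dp,    -- same value as MONTH()
    DATEPART(day,     CreationTime) AS Day_dp,      -- same value as DAY()
    DATEPART(hour,    CreationTime) AS Hour_dp,     -- 12 and 23 for these rows
    DATEPART(quarter, CreationTime) AS Quarter_dp,  -- 1 for both rows
    DATEPART(weekday, CreationTime) AS Weekday_dp,  -- day-of-week number
    DATEPART(week,    CreationTime) AS Week_dp      -- week-of-year number
FROM @Orders;
```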
All right. So now we have a function very similar to DATEPART. We have DATENAME. The only difference here is that it returns the name of the date part. All right. So now back to our example. We have learned that we can extract different types of parts from one date, but we learned as well that all of them are numbers. What if we would like to extract the name of the month? So instead of 8, I would like to get the name of the month, August. Or instead of the 20, I would like to get the day name, which in this example is going to be Wednesday. In order to get the name of the parts, we have to use the function DATENAME. So for example, if you use the function DATENAME with the part month, you will not get 8 in the output. You will get the full name of the month, August. So as you can see, we are getting a string, a full name. And the same thing if you use DATENAME with weekday: you will not get 20 like with the DAY function, you will get the name of the day, Wednesday, and here as well the output is a string. So as you can see, it's very simple. We are using DATENAME in order to get the name of the parts, and the data type of the output here is a string, not an integer. So here we have different types of functions that are all doing the same job: extracting parts from one date.
Okay. So now, checking the DATENAME syntax, it's identical to DATEPART. We are just switching the function name. It needs you to define the part and the date as well. The only difference here is that we are getting a different data type at the output. So here we are getting a string instead of an integer.
right. So now let's check the date name. It is very similar to the date part. So
It is very similar to the date part. So we're going to have it like this. We're
we're going to have it like this. We're going to work as well with the creation
going to work as well with the creation time. So we're going to say date name
time. So we're going to say date name and then after that we have to define
and then after that we have to define the parts. So let's go for example with
the parts. So let's go for example with the month and our field is as usual the
the month and our field is as usual the creation time and let's call it month
creation time and let's call it month date
date name like this. So that's it. Let's go
name like this. So that's it. Let's go and execute it. Now if you go to the
and execute it. Now if you go to the output over here you can see we have the
output over here you can see we have the month but this time we don't have
month but this time we don't have numbers. We have the full name of the
numbers. We have the full name of the month. So we have January, February,
month. So we have January, February, March instead of having 1 2 3. So this
March instead of having 1 2 3. So this is the big difference between the date
is the big difference between the date name and date part. Date part you get
name and date part. Date part you get numbers. Date name you get the name of
numbers. Date name you get the name of the part. So let's do the same thing for
the part. So let's do the same thing for the day. We would like to get the name
the day. We would like to get the name of the day. So I'm just duplicating it.
of the day. So I'm just duplicating it. But now in order to get the full name of
But now in order to get the full name of the day, we cannot go with the day.
the day, we cannot go with the day. We're going to go with the week day as a
We're going to go with the week day as a part. So that's it. I will call it week
part. So that's it. I will call it week day. So let's execute it. Now as you can
day. So let's execute it. Now as you can see in the output, we have here a new
see in the output, we have here a new column called week day. And inside it we
column called week day. And inside it we have the name of the day instead of a
have the name of the day instead of a number. So here we have Wednesday,
number. So here we have Wednesday, Sunday, Friday and so on. So the full
Sunday, Friday and so on. So the full name of the day go of course with the
name of the day go of course with the day. Let's go and try that
day. Let's go and try that out. So this is the day of the month and
out. So this is the day of the month and of course the day of the month has no
of course the day of the month has no name and SQL of course going to return
name and SQL of course going to return the numbers again. So you can see 1 5 10
the numbers again. So you can see 1 5 10 20 and so on. But still there is a
20 and so on. But still there is a difference between the day from the day
difference between the day from the day name and the day from the date parts. In
name and the day from the date parts. In the date parts we are getting integers.
the date parts we are getting integers. So if you store this information in a
So if you store this information in a new table it's going to be stored as an
new table it's going to be stored as an integer. But in the date that you are
integer. But in the date that you are getting from the date name it is a
getting from the date name it is a number but still it can be stored as a
number but still it can be stored as a string value. So the data type of those
string value. So the data type of those numbers is a string and the data types
numbers is a string and the data types of the day from the date part is an
of the day from the date part is an integer. And the same thing can happen
integer. And the same thing can happen if you extract for example a year. So
if you extract for example a year. So you don't have like a full text of the
you don't have like a full text of the year. So let me just do it like this. So
year. So let me just do it like this. So if we say a year, you will not get the
if we say a year, you will not get the name of the year. You're still getting
name of the year. You're still getting the numbers, the digits, but the data
the numbers, the digits, but the data type here is a string. So that's it.
type here is a string. So that's it. This is the difference between the date
This is the difference between the date name and the date parts. For the month
name and the date parts. For the month and weekday, you will get the full name.
and weekday, you will get the full name. For the other stuff, you will get
For the other stuff, you will get numbers but with the string data type.
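The comparison above can be sketched in a few lines of T-SQL. This is only an illustration under assumptions: the variable `@CreationTime` and its value are made up to stand in for the creation time column, and the column aliases are hypothetical. The month and weekday names depend on the language setting of the session (English is assumed in the comments).

```sql
-- Hypothetical sample value standing in for the creation time column
DECLARE @CreationTime DATETIME2(0) = '2025-08-20 18:55:45';

SELECT
    DATEPART(month,   @CreationTime) AS Month_DatePart,   -- integer: 8
    DATENAME(month,   @CreationTime) AS Month_DateName,   -- string: 'August'
    DATENAME(weekday, @CreationTime) AS Weekday_DateName, -- string: 'Wednesday'
    DATEPART(day,     @CreationTime) AS Day_DatePart,     -- integer: 20
    DATENAME(day,     @CreationTime) AS Day_DateName,     -- string: '20'
    DATENAME(year,    @CreationTime) AS Year_DateName;    -- string: '2025'
```

The two day columns look the same in the result grid, but `Day_DatePart` is an integer while `Day_DateName` is a character value, which matters once the result is stored in a table or compared with other values.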
So the most important thing about DATENAME is to present easy-to-read, human-readable information to the users. Imagine you are building a report called sales by month, and you show the user the months as numbers: 1, 2, 3 until 12. This is of course okay, but it is much nicer if you present that information as full text. So you go with DATENAME, and instead of 1 you show January, February, March: the full name of the month. This is going to look much nicer in reporting for the users. So this is the core use case of DATENAME.
So what is DATETRUNC? DATETRUNC truncates the date to a specific part. Let's understand what this means. Okay. Now let's check the syntax of DATETRUNC. It's exactly the same as DATEPART and DATENAME. You have to define the part and the date that you want to work on. The only thing that is different here is the function name. So as you can see, all three functions have the same structure: you have to provide the part, like month, day, week, hour, minute and so on, and the date or date and time that you want to apply it to. And of course with DATETRUNC we are getting at the output a date or a date time.
Okay. So now let's understand exactly how DATETRUNC works. We have the following date time, and as we learned, we have a hierarchy where we start at the highest level with the year, then we move to the month, day, hours, minutes and seconds. Looking at this information, it is very precise: we know the exact second of this event. So the level of detail here is very high.
Now DATETRUNC is going to allow us to change the level of detail of this information by specifying the level we want. Let's take for example DATETRUNC with minute. We are saying that we are interested only in the minutes level, not in the seconds. So what happens? Everything between the year and the minutes is kept, which means all that information will not be changed, and only the seconds are reset. The seconds are too detailed for us, so SQL is going to reset them to 00. We are saying the lowest level we care about is the minutes, and we are not interested in anything below it, the seconds.
Let's say now the minutes are too detailed and I would like to be at the hours level, so we specify hour for DATETRUNC. Here things change: we keep the information between the year and the hours, and anything after that is reset. So now the minutes and the seconds are in the range of the reset, and SQL is going to reset the 55 to 00. Now the level of detail is a little bit lower. We know the information only up to the hours, and we are not interested in the minutes and the seconds.
I think you already get it. If you say DATETRUNC with day, what's going to happen? It's going to keep everything between year and day, and the whole time is reset: the hours, minutes and seconds all reset to 00. So by looking at this we don't know anything about the time. We only have information about the date.
Now we can go one more step and say, you know what, I'm not interested in the days, I'm doing analysis at the month level. So what is kept here is only two pieces of information, year and month, and everything below that, the day and the time, is reset. But this time SQL will not reset the day to 00, because there is no day called 00. A month always starts with the first day, so it's going to reset to 01. So the parts in the date reset to 01, and the parts in the time reset to 00. Now we are at the level of the month.
Now you can go to the last step and say, you know what, I'm interested only in the years, and I'm doing analysis only at this level, the highest level. So you say DATETRUNC with year. What's going to happen? It keeps only the year, and everything below that is reset, everything between the month and the seconds. So here SQL is going to reset the month and the day as well, to 01 01. The only value that is kept is the year and everything else is reset, so this is the 1st of January and the time is completely reset. Now we are at the lowest level of detail. We only have information about the year, and we don't care about any other parts.
So as you can see, DATETRUNC is not really extracting a part. DATETRUNC is resetting things. We are navigating through the hierarchy of the date and time, and we are controlling at which level we are doing the analysis.
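The walk through the hierarchy can be sketched as a single query. This is a minimal illustration, assuming SQL Server 2022 or later (where `DATETRUNC` is available); the variable `@CreationTime` and its value are made up for the example, with only the 55 minutes taken from the explanation above.

```sql
-- Hypothetical sample value: precise down to the second
DECLARE @CreationTime DATETIME2(0) = '2025-08-20 18:55:45';

SELECT
    @CreationTime                    AS Original,     -- 2025-08-20 18:55:45
    DATETRUNC(minute, @CreationTime) AS Minute_Level, -- 2025-08-20 18:55:00
    DATETRUNC(hour,   @CreationTime) AS Hour_Level,   -- 2025-08-20 18:00:00
    DATETRUNC(day,    @CreationTime) AS Day_Level,    -- 2025-08-20 00:00:00
    DATETRUNC(month,  @CreationTime) AS Month_Level,  -- 2025-08-01 00:00:00
    DATETRUNC(year,   @CreationTime) AS Year_Level;   -- 2025-01-01 00:00:00
```

Reading the columns from left to right, each level keeps everything above it in the hierarchy and resets everything below it: time parts go to 00, date parts go to 01.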
So as you can see, in the end it's not very complicated once you understand how it works, and it is very useful in analysis. So this is how DATETRUNC works in SQL.
Okay, let's have a few examples of DATETRUNC together with the creation time. As you can see, the level of the creation time is the seconds, so we have seconds information in the creation time. Now I would like to move it to the minutes. So let's say DATETRUNC, and let's truncate at the minutes level for the creation time. Let's call it minute date trunc. Let's go and execute it. Now if you check the output and compare it to the creation time, you can see we have zeros at the seconds. So the seconds are completely reset compared to the creation time.
Now let's say that I'm not interested in the time information inside the creation time. I would like to get only the date. In order to do that, we can use DATETRUNC and reset to the level of the day. So let's duplicate the line, and instead of minute, let's say day. Let's check the output. Now in the result you can see all the time information is reset to zeros, and we only have information about the date. So we have year, month and day, and everything else is reset to zero.
Now of course we can go to the maximum, where we say I just need the year and nothing else. So let's try that out. We take DATETRUNC and say year, and let's call it year. Let's go and execute it. Now if you check the output, you can see that everything is reset besides the year. We only have the year information; everything else is reset to the first of January, and the time is reset as well. So as you can see, the output of DATETRUNC here is still a date time, and it helps us navigate through the hierarchy of the date time so that we can truncate at the level that we want.
All right. So now we're going to check why DATETRUNC is an amazing function for data analysis. Let's have this example. We select the creation time, and we want to count the number of orders based on the creation time, from our table sales orders, and we use GROUP BY in order to group the data by the creation time. Let's go and execute it. Now as you can see, we get one everywhere, because the level of detail, the granularity, of the creation time is very high. That's because we have the seconds here, and since our data is small, we will not get two orders in the same second.
Now in data analytics you would like to quickly aggregate the data at a different granularity, for example at the month level. You can do that very quickly using DATETRUNC: let's say month, and let's call it creation, and we use the same expression in the GROUP BY. Let's go and execute it. Now as you can see, in the output we have only three rows instead of 10 rows, and that's because we have three months. That means we just rolled up to the month level instead of the seconds. We can see that in January we have four orders, in February four as well, and in March only two. So now we are talking about a different level of detail and granularity in the output.
And now you might say, let's aggregate the data at a different level, the year level. You can just change the part to year and execute it. With that we are at the highest level of aggregation, the year level, and since in our data we have only 2025, we get the total number of orders inside the table, and that is 10.
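The roll-up from seconds to months can be sketched as follows. This is only an illustration: the table variable `@Orders` and its ten rows are invented to mimic the distribution described above (four orders in January, four in February, two in March), and `DATETRUNC` assumes SQL Server 2022 or later.

```sql
-- Hypothetical sample data standing in for the orders table (10 rows)
DECLARE @Orders TABLE (OrderID INT, CreationTime DATETIME2(0));

INSERT INTO @Orders (OrderID, CreationTime) VALUES
    (1,  '2025-01-01 12:34:56'), (2,  '2025-01-05 23:22:04'),
    (3,  '2025-01-10 18:24:08'), (4,  '2025-01-20 05:50:33'),
    (5,  '2025-02-01 14:02:41'), (6,  '2025-02-06 15:34:57'),
    (7,  '2025-02-16 06:22:01'), (8,  '2025-02-18 10:45:22'),
    (9,  '2025-03-10 12:59:04'), (10, '2025-03-15 23:25:15');

-- Grouping by the truncated value rolls the rows up to one row per month
SELECT
    DATETRUNC(month, CreationTime) AS Creation,
    COUNT(*)                       AS OrderCount
FROM @Orders
GROUP BY DATETRUNC(month, CreationTime)
ORDER BY Creation;
-- 2025-01-01 00:00:00 -> 4, 2025-02-01 00:00:00 -> 4, 2025-03-01 00:00:00 -> 2
```

Changing `month` to `year` in both places collapses the same data into a single row with a count of 10, while grouping by the raw `CreationTime` returns ten rows with a count of 1 each.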
And this is really amazing in data analytics. You can quickly change the granularity, the level of aggregation or detail, by simply defining the level inside DATETRUNC. So this is why DATETRUNC is amazing. It allows us to do analysis and aggregations by zooming in and zooming out.
Okay. So now we're going to talk about the last function in the part extraction category. We have EOMONTH, end of month. As the name says, it returns the last day of a month. So let's see how EOMONTH works. This is very simple. Let's take our date, the 20th of August 2025. If you apply this function to it, what's going to happen? It's going to change only the day information. Instead of 20, it goes to the last day of the month, so it changes the 20 to 31, the last day of August in 2025. Let's take another example, the 1st of February 2025. If you apply EOMONTH, it changes the day from the 1st to the 28th, the last day of February. So as you can see, it's very simple. Let's take another example where it is already the last day of the month, the 31st of March. If you apply EOMONTH here, what happens? Nothing. You get the same value in return. So this is how it works, and as you can see, the output of EOMONTH is always a date.
All right. Now quickly about the syntax of EOMONTH. It has the exact same syntax as DAY, MONTH and YEAR. It needs only one parameter, the date (a second one, the number of months to add, is optional). So we have to pass a date in order to find out the end of the month. Let's go and find the end of the month of our creation time. So EOMONTH like this, and let's pass our creation time. So let's see the end of month. Let's go and execute it. Now in the output you can see we have a new column, a date column, and inside it we have the end of the month values. For example, here we have January, January, January and so on, so you will always see the end of January, and the same thing for February and March. So that's it. This is a really nice function in case you need the end of the month for each date. Maybe you're creating a report or an analysis where you need this information.
And now you might ask me, how about getting the first day of the month? Is there a function for it? Well, no. But there is a trick to get the first day of the month using another function that we just learned. Think about it: how do we get the day as one everywhere? We have to get the 1st of January, the 1st of February and the 1st of March. So how can we do that? Well, using DATETRUNC. Let me show you how. We say DATETRUNC and we reset at the level of the month. We don't need the days, so they are reset to the first. Our field is the creation time, and this is going to be the start of month. Let's go and execute it. Now as you can see in the output, we have the start of month, and we have a one everywhere, since we reset at the level of the month. This gives us the first day of the month.
And now you might say, you know what, here we have a lot of zeros, how do I get it exactly like the end of the month? The zeros are there because DATETRUNC on a date time gives us a date and time back. That means we have to change the data type, which we're going to learn later using the CAST function, but we can do it right now. We say CAST, and we want to change the whole thing to date. Now we have changed the data type from date time to date, and in the output, as you can see, we have only the date information. So now it's really amazing that you have two dates. The first one is the start of the month and the second is the end of the month.
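Both dates can be produced side by side, as in this minimal sketch. The three sample values are the dates from the EOMONTH walkthrough, with made-up times added; the derived table `d` and the column names are hypothetical, and `DATETRUNC` again assumes SQL Server 2022 or later.

```sql
-- Hypothetical sample values; only the dates come from the examples above
SELECT
    d.SampleDate,
    CAST(DATETRUNC(month, d.SampleDate) AS DATE) AS StartOfMonth, -- day reset to 01, time dropped by the cast
    EOMONTH(d.SampleDate)                        AS EndOfMonth    -- last day of the same month, as a date
FROM (VALUES
    (CAST('2025-08-20 18:55:45' AS DATETIME2(0))),
    (CAST('2025-02-01 00:00:00' AS DATETIME2(0))),
    (CAST('2025-03-31 09:15:00' AS DATETIME2(0)))
) AS d(SampleDate);
-- 2025-08-20 ... -> 2025-08-01, 2025-08-31
-- 2025-02-01 ... -> 2025-02-01, 2025-02-28
-- 2025-03-31 ... -> 2025-03-01, 2025-03-31
```

Without the `CAST`, the start of month would come back as `2025-08-01 00:00:00`, which is the "lot of zeros" mentioned above.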
And that information might be helpful if you are generating reports and you need the start and the end of the month.
So now we come to the part where we ask the question: why do we need those parts? Why do we need to extract the date parts from a date? Let's have the following use cases. The first use case of extracting a part is data aggregation and reporting. Sometimes we are building reports based on our data, and we have to aggregate our data by a specific time unit. For example, we are building a report to show the sales by year, so we have different years and we are aggregating the data based on the year. Or you want to drill down to more detail and aggregate the data by the quarter, so in this report we are showing the sales by quarter: Q1, Q2, Q3, Q4. Or you decide to go into even more detail, where the report says sales by month, and you start aggregating your data by the month, so you have January, February, March and so on. So as you can see, we can use those different parts to aggregate the data, and these different parts offer us different analyses with different levels of detail.
So now we have the following task: how many orders were placed each year? That means we have to group our data by the year and count the number of orders. Let's go and solve it. We start with the SELECT. What do we need? We need the order date, which indicates when the order was placed, and we have to count the star, so this is going to be the number of orders, from our table sales orders, and we have to group by the order date. So that's it. Let's go and execute it. Now in the output we are getting the number of orders, but by the order date. So we are still not there. We have to have it as a year. We don't need the whole date information, only the year information. That means we have to extract the year part. We can do it like this: we go with YEAR, and we use it as well in the GROUP BY. So that's it. Let's go and execute it.
And with that as you can see we got the number of orders for each year. And
number of orders for each year. And since in our data we have only 2025 we
since in our data we have only 2025 we will get only one row. So with that the
will get only one row. So with that the task is solved. We are now aggregating
task is solved. We are now aggregating the data on the level of the year. Now
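The query described for this task can be sketched as follows. The names Sales.Orders and OrderDate are assumptions based on how the table and column are spoken in the video; adjust them to your own schema.

```sql
-- Assumed names: table Sales.Orders with a date column OrderDate.
-- Count the orders per year by grouping on the extracted year part.
SELECT
    YEAR(OrderDate) AS OrderYear,
    COUNT(*)        AS NrOfOrders
FROM Sales.Orders
GROUP BY YEAR(OrderDate);
```

The same expression appears in both the SELECT list and the GROUP BY, which is what turns the per-date counts into per-year counts.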
Now let's have another task which is the same, only with a different part: how many orders were
placed each month? So we have to go and change it to a month. It's very simple. We're going to
use the function MONTH, and as well in the GROUP BY. So let's go and execute it. And now, as you
can see in the output, we don't have one row. Now we have three rows. And that's because we have
three months inside our data, and for each month we get the total number of orders. So for
January we have four, for February we have four and for March we have two orders.
Now you might say, you know what, I don't want the months as numbers. I would like to have the
full name of the month. So in order to do that we're going to go and use the function DATENAME.
So let's go and use DATENAME, and then we have to specify the date part. It's going to be the
month, and the value is going to be the order date, and we have to have the same thing as well in
the GROUP BY. So let's go and execute it. Now you can see in the output we are getting the full
name of the month, which is easier to read. So this is one of the use cases why we need to
extract parts from a date: in order to aggregate the data on a specific level.
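A minimal sketch of the monthly version, again assuming a Sales.Orders table with an OrderDate column. Grouping by the month number as well as the month name is an addition of this sketch, not something shown in the video; it is only there so the rows can be sorted in calendar order instead of alphabetically.

```sql
-- Assumed names: Sales.Orders, OrderDate.
-- Orders per month, shown with the full month name.
SELECT
    DATENAME(month, OrderDate) AS OrderMonth,
    COUNT(*)                   AS NrOfOrders
FROM Sales.Orders
GROUP BY MONTH(OrderDate), DATENAME(month, OrderDate)
ORDER BY MONTH(OrderDate);   -- calendar order rather than alphabetical
```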
So now let's have the following task, and it says: show all orders that were placed during the
month of February. So that means we don't need all the orders. We need only a subset of the
orders based on the order date. Now let's go and check the data. So SELECT star first, FROM sales
orders, and let's go and execute it. So now with that we have our 10 orders. Now if you check the
order date over here, you can see that we have orders in January, February and March. We are
interested only in the orders that were placed in February, so only this subset. So that means we
now have to filter the data based on the month information. So what we're going to do, we're
going to have a WHERE clause. And now we don't need the whole order date, we need only the part
month. So we're going to go with the function MONTH and the order date, and this is going to be
equal to 2, since the output is going to be a number. So let's go and execute it. Now as you can
see, SQL did filter the data, and in the output we have only the orders that were placed in the
month of February. So this is as well a very common use case. Why do we need the parts? We use
them in order to filter the data based on a specific part of the dates.
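As a sketch, the filter looks like this (Sales.Orders and OrderDate are assumed names). MONTH returns an integer, so the comparison is against the number 2 rather than the string 'February'.

```sql
-- Assumed names: Sales.Orders, OrderDate.
-- Keep only the orders placed in February of any year.
SELECT *
FROM Sales.Orders
WHERE MONTH(OrderDate) = 2;
```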
So as you can see it's very quick and easy. And here my recommendation is that if you are
filtering the data, always use the numbers. So use a date function that gives you a number,
because it's generally faster to search for integers than for characters or strings. So don't use
the DATENAME function in order to search or filter the data. It's better to use DATEPART or
MONTH, YEAR and DAY, since you can work with numbers, and numbers are faster for retrieving and
filtering your information.

Okay. So now we have a lot of functions, and I would like now to do a quick recap about the data
type of their results. So as we learned, we have functions like DAY, MONTH, YEAR and DATEPART,
and the output of all those functions is going to be an integer. It's going to be a number. Now
we have another function, DATENAME. If you use it, the output of this function is going to be a
string, because here we are extracting the name of the date part. And if you go and use
DATETRUNC, you will get DATETIME2 in the output, so you are getting both the date and the time.
And the last function that we learned, EOMONTH: if you use it, in the results you will get the
data type DATE. So it is really important to understand the data type of the output so that you
don't get any unexpected results.
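If you want to check these result types yourself, one possible way is SQL_VARIANT_PROPERTY, which reports the base type of an expression. This is a sketch under assumptions: the table Sales.Orders and a DATETIME2 column CreationTime are assumed names, and DATETRUNC needs SQL Server 2022 or later. As far as the SQL Server documentation describes it, DATETRUNC returns the same type as its input, so the type you see for it may depend on the column you pass in.

```sql
-- Assumed names: Sales.Orders with a DATETIME2 column CreationTime.
-- Inspect the data type each date function returns.
SELECT TOP (1)
    SQL_VARIANT_PROPERTY(MONTH(CreationTime), 'BaseType')            AS MonthType,
    SQL_VARIANT_PROPERTY(DATEPART(week, CreationTime), 'BaseType')   AS DatePartType,
    SQL_VARIANT_PROPERTY(DATENAME(month, CreationTime), 'BaseType')  AS DateNameType,
    SQL_VARIANT_PROPERTY(DATETRUNC(month, CreationTime), 'BaseType') AS DateTruncType,
    SQL_VARIANT_PROPERTY(EOMONTH(CreationTime), 'BaseType')          AS EoMonthType
FROM Sales.Orders;
```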
All right. So now you might say, you know what, those are a lot of functions, and it looks like
they are all doing the same stuff: we are extracting the parts of the dates. So now you might ask
me, how do you decide when to use which function? This is how I usually do it. First I ask myself
which part I want to extract. If I want to extract a day or a month, then I ask the question: do
I need it as an integer, as a number? If yes, then I go and use the DAY function or the MONTH
function, because they are quick and I will get exactly what I need. But if I need the full name
of the month or the day, then I go with the function DATENAME. Now moving back: if I'm interested
in the part year, here we don't have a year name or something, so I'm going to go immediately
with the function YEAR. But now let's say that I don't need the day, month or year. I'm
interested in other parts like the week, the quarter and so on. Only for this scenario, I go with
the function DATEPART.
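The decision process above can be sketched in one query, with one column per branch. Sales.Orders and OrderDate are assumed names, and the column aliases are made up for illustration.

```sql
-- Assumed names: Sales.Orders, OrderDate.
SELECT
    OrderDate,
    DAY(OrderDate)                AS DayNumber,      -- day or month as a number: DAY / MONTH
    MONTH(OrderDate)              AS MonthNumber,
    DATENAME(weekday, OrderDate)  AS NameOfDay,      -- full names: DATENAME
    DATENAME(month, OrderDate)    AS NameOfMonth,
    YEAR(OrderDate)               AS YearNumber,     -- the year: YEAR
    DATEPART(week, OrderDate)     AS WeekNumber,     -- any other part: DATEPART
    DATEPART(quarter, OrderDate)  AS QuarterNumber
FROM Sales.Orders;
```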
So this is my decision process. This is how I decide when to use which SQL function in order to
extract the parts of the dates.

All right. So now I have prepared for you here a list of all parts that we can use inside those
three functions: DATEPART, DATENAME and DATETRUNC. And you can see in this table the different
outputs using those three functions. So for example, if you go and use the month with DATEPART
you will get 8, but for DATENAME you will get August, and for DATETRUNC you will get a truncated
date time at the level of the month, where you reset the days and times.
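Here is a small sketch of that month example on a fixed value. The sample timestamp is hypothetical, chosen only because it falls in August, and DATETRUNC assumes SQL Server 2022 or later.

```sql
-- Hypothetical sample value, not taken from the course data.
DECLARE @ts DATETIME2 = '2025-08-20 18:55:45';

SELECT
    DATEPART(month, @ts)  AS Month_DatePart,   -- 8
    DATENAME(month, @ts)  AS Month_DateName,   -- August (with an English session language)
    DATETRUNC(month, @ts) AS Month_DateTrunc;  -- 2025-08-01 00:00:00.0000000
```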
So this is a full list of all examples, you can go and check it. And there is one more thing that
I have prepared for you in order to practice with all those different parts: I have made one big
query with all the different parts. So if you go and download the queries of this chapter, you
will find the following files, and let's go now and open "all date parts". So we're going to go
inside it, and here we have a long query. So what we're going to do, we're going to select
everything and copy it, and let's go back to our SQL and paste it. So let me just zoom out and
then let's go and execute the whole thing. So now in my code I have just done a UNION for each
possible part. For example, for the year we have DATEPART, DATENAME and DATETRUNC, and I'm
currently using GETDATE. So we are manipulating this one, and then the output is presented over
here. So you can see it like this: if you use the part year, for DATEPART you will get 2024, the
same thing for DATENAME, and this one is for DATETRUNC. And with that you have all possible parts
that you can use in SQL in one query. So with that you can learn what the outputs are for the
different parts.
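The downloadable query itself is not reproduced in this transcript, so the following is only a shortened sketch of the idea: one SELECT per date part, stacked together, all applied to GETDATE. The column aliases are made up, UNION ALL is used here instead of UNION, and DATETRUNC assumes SQL Server 2022 or later.

```sql
-- One row per date part, comparing the three functions on the current date and time.
SELECT 'Year' AS PartName,
       DATEPART(year, GETDATE())  AS DatePart_Output,
       DATENAME(year, GETDATE())  AS DateName_Output,
       DATETRUNC(year, GETDATE()) AS DateTrunc_Output
UNION ALL
SELECT 'Quarter',
       DATEPART(quarter, GETDATE()),
       DATENAME(quarter, GETDATE()),
       DATETRUNC(quarter, GETDATE())
UNION ALL
SELECT 'Month',
       DATEPART(month, GETDATE()),
       DATENAME(month, GETDATE()),
       DATETRUNC(month, GETDATE())
UNION ALL
SELECT 'Week',
       DATEPART(week, GETDATE()),
       DATENAME(week, GETDATE()),
       DATETRUNC(week, GETDATE());
```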
All right. So with that we have learned all those functions on how to extract the parts of dates.
All right. Moving to the second category: we're going to learn how to do formatting and casting
for the date information in SQL using three functions.

So now, before we deep dive into formatting and casting, I would like you to understand what a
date format is. So back to our example: we have here the date and time information, and we
understood there are components, year, month, day and so on. Now if you check the date time,
there is a combination of numbers and characters. For example, 2025 is a number, but between the
year and the month there is like a minus, and this is a character. So this is a very specific
format, and in SQL we can have a code for this format. So for example, let's start with the year.
We have here four digits, and we can represent it with four y: yyyy. We call those characters
format specifiers. So this is how we represent the year. Then between the year and the month
there is this small minus, and then the month is two digits, and we're going to represent it with
two big M: MM. Then between the month and the day there is a minus, so we have a minus as well,
and then the day is going to be represented with two small d: dd. Then we have like a space
between the date and the time, and then we start with the time. So it starts with the hour, big H
and big H, because here we have the 24-hour system, and then we have a colon and small m, small
m. So as you can see here, the formats are case sensitive. There is a big difference between a
small m and a big M: two small m means minutes, but two capital M means month. Then a colon and
two small s. So now the whole code is called the date format. So this is the date format
representation of this value.
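To see the specifiers in action, here is a small sketch using the FORMAT function on a hypothetical sample value. The second column deliberately uses a lowercase mm in the date part to show why the case matters.

```sql
-- Hypothetical sample value.
DECLARE @ts DATETIME2 = '2025-08-20 18:55:45';

SELECT
    FORMAT(@ts, 'yyyy-MM-dd HH:mm:ss') AS FullFormat,   -- 2025-08-20 18:55:45
    FORMAT(@ts, 'yyyy-mm-dd')          AS WrongCase;    -- 2025-55-20 (mm is minutes, not month)
```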
Now in the world there are different ways to represent a date. So for example, in SQL we have the
international standard ISO 8601, and the date format is like we have learned: first it starts
with the year. So four digits for the year, minus, two digits for the month, minus, two digits
for the day. So year, month, day. But in the USA we have a different standard. First it starts
with the month, so we have MM, and then after that it is followed by the day, and at the end we
have the year. So this is the standard format that is used in the USA. And in Europe we have a
different representation of the date. It starts first with the day, then the month and then the
year. So this is exactly the opposite of the international standard. So as you can see, we don't
have one standard. We have different ways of representing dates. But SQL Server is following the
format of the international standard. So SQL Server always starts with the year, then month, then
day.
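The three conventions can be written as format strings and compared side by side. This is a sketch on a hypothetical sample date; the minus is used as the separator in all three, as in the explanation above.

```sql
-- Hypothetical sample value.
DECLARE @d DATE = '2025-08-20';

SELECT
    FORMAT(@d, 'yyyy-MM-dd') AS InternationalStyle,  -- 2025-08-20
    FORMAT(@d, 'MM-dd-yyyy') AS UsaStyle,            -- 08-20-2025
    FORMAT(@d, 'dd-MM-yyyy') AS EuropeStyle;         -- 20-08-2025
```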
So all dates that are used in our SQL database follow this format.

Okay. So after we understood what a date format is, now let's talk about formatting and casting.
So what is formatting? It is changing the format of a value from one to another. So we are
changing how the data looks. So for example, we have our date, and it's following the
international standard: it starts with year, month, then day. Now we can go and change the format
using the function FORMAT, where we can go and define a different date format, like one that
starts with the month, and then we have a slash instead of a minus, and then the day, slash,
year. So in the output we're going to get it like this, and even the year is only two digits, not
four. So here we are providing SQL with the format that we would like to see the data in. Or you
can go with another format, where you have three big M and then four digits for the year, and
between them is just a space. So in the output you will get the abbreviation of the month name,
then a space and the year. So this is one way to format data. But in SQL there is another
function that helps us to format data, and that is CONVERT. Here we provide not the format
itself, we provide a style number. So for example the style number 6 will show it like this: day,
space, after that the abbreviated name of the month, and then two digits of the year. Or if you
use another style, 112, then you will get the year, month, day without any separation between
them. And of course we can style not only the date and time, we can style numbers as well, and
here we can use the function FORMAT in order to change the format of the number. So if you're
using the format N for numeric values, then the values will be separated with commas. Or if you
use C for currency, then you will get the dollar sign. Or if you go and use P, then you will get
the percentage, and at the end you have the percentage character. So as you can see, we can
change the format of numbers as well, not only dates. So this is what we mean by formatting: we
are just changing how the value looks.
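The formatting options mentioned above, sketched on hypothetical sample values. The exact text produced by FORMAT depends on the culture (en-US is assumed here), so only the culture-independent CONVERT style 112 carries an expected-output comment.

```sql
-- Hypothetical sample value.
DECLARE @d DATE = '2025-08-20';

SELECT
    FORMAT(@d, 'MM/dd/yy')        AS UsaShortYear,    -- month/day/two-digit year
    FORMAT(@d, 'MMM yyyy')        AS MonthAndYear,    -- abbreviated month name, space, year
    CONVERT(VARCHAR(20), @d, 6)   AS Style6,          -- day, abbreviated month, two-digit year
    CONVERT(VARCHAR(20), @d, 112) AS Style112,        -- 20250820
    FORMAT(1234567.89, 'N')       AS NumericStyle,    -- thousands separators
    FORMAT(1234567.89, 'C')       AS CurrencyStyle,   -- currency symbol
    FORMAT(0.25, 'P')             AS PercentStyle;    -- percentage with the percent sign
```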
Now, on the other hand, casting changes the data type from one to another. So for example, if we
have the value 123 as a string, we can go and convert it from the data type string to an integer.
So in the output we will get 123 as well, but as a number. Or we can go and change the data type
from a date to a string, so in the output it is not a date anymore, it is a string value. Or the
other way around, we can change the data type from a string to a date. So as you can see, we can
change the data type from one to another, and we can do that using two functions. The first one,
and the most famous one, is the CAST function. Or in SQL Server we can use the CONVERT function
as well in order to change the data type.
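A minimal sketch of those conversions with CAST and CONVERT. The literal values are made up for illustration.

```sql
SELECT
    CAST('123' AS INT)                             AS StringToInt,    -- 123 as a number
    CAST('2025-08-20' AS DATE)                     AS StringToDate,   -- a DATE value
    CAST(CAST('2025-08-20' AS DATE) AS VARCHAR(10)) AS DateToString,  -- '2025-08-20' as a string
    CONVERT(INT, '123')                            AS ConvertToInt;   -- same cast, CONVERT syntax
```

CAST takes the form CAST(value AS type), while CONVERT puts the target type first: CONVERT(type, value).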
So this is what we mean by casting: changing the data type from one to another.

All right. So let's start with the first function, FORMAT. So what is FORMAT? As the name
suggests, it formats a date or time value. So it's like we are changing how the date and time
look. Okay. So let's check the syntax of FORMAT. It accepts two parameters, and a third one that
is optional. For the first one we have to provide a value. It could be a date or a number. For
the second one we have to provide the format. So here we are specifying the new look, the new
format for this value. Now the third one is optional. It is the culture. Culture means: show me
the value, whether it's a date, time or number, in the style of a specific country or region.
Each country, each region has a different format, so here we can go and change it to a specific
region's format. But as I said, it is optional. Let's have an example. So here we are saying: go
and format the order date using the following format. So dd for the day, then a slash, then we
have the month, then a slash, then the year. So SQL is going to go and format the value with this
new format. And as you can see, here we didn't specify any culture, since it's optional. Let's
see another option, where we can say, you know what, I would like to have the order date
formatted with this format, but we would like to go and add the style of Japan. So we are
specifying here the code for the style of Japan. And of course we can go and use FORMAT not only
for dates but for formatting numbers as well. So here we are specifying the value, the format is
D, and we have activated the culture option as well: we are using the style of France. So this is
the syntax of FORMAT. Using the culture option is not really common, so I rarely see someone
using it. The first example is the most used one in projects, where we leave the culture as
default and are not using the culture at all. And of course, if you don't specify anything, it is
going to go and use the default culture, which is en-US. So this is all about the syntax of
FORMAT.
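The three call shapes described above can be sketched like this. Sales.Orders and OrderDate are assumed names, and 'ja-JP' and 'fr-FR' are the standard culture codes for Japan and France. For the number example this sketch swaps in the N specifier instead of the D mentioned in the video, because N makes the culture-specific separators visible on a decimal value.

```sql
-- Assumed names: Sales.Orders, OrderDate.
SELECT
    FORMAT(OrderDate, 'dd/MM/yyyy')          AS DefaultCulture,  -- value + format
    FORMAT(OrderDate, 'dd/MM/yyyy', 'ja-JP') AS JapanCulture,    -- value + format + culture
    FORMAT(1234.56, 'N', 'fr-FR')            AS FrenchNumber     -- number with a culture
FROM Sales.Orders;
```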
about the syntax of the format. All right. So now let's have a few examples
using the FORMAT function. We're going to format the creation time, so we write
FORMAT, and what are we formatting? The creation time. Now you can define any
specifier you want. For example, let's say dd. Let's execute it and check the
output. If you use dd, you get the day information: two digits for the day,
including the leading zero. So we are getting 01, 05, and so on, and all of
those values are the day. Now let's try something else and add one more d, so
we have ddd here as well. Let's execute it. If you check the output, we are now
getting the name of the day. It is not the full name; we are getting a short,
abbreviated name of the day. This is sometimes nice if you are creating a
calendar or something similar. Let's add one more d, so we have dddd, and check
the result for this one. Now in the output we are getting the full name of the
day. So it's really nice: we have full flexibility on how to format our day.
Okay, let's keep playing and get something else. I'm just going to duplicate
everything and go with the month now. So this is MM, MMM, and MMMM, with a
capital M, because the specifiers are case sensitive and the small mm means
minutes. Let's execute it. As you can see, we are getting the same thing but
for the month: with MM we get two digits, with MMM we get the abbreviated name
of the month, and with MMMM we get the full name of the month.
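The day and month specifiers walked through above can be put side by side in one query. This is only a minimal sketch, assuming SQL Server (where FORMAT is available). The variable @CreationTime, its value, and the column aliases are hypothetical stand-ins for the creation time column used in the video. The day and month names depend on the session language.

```sql
-- Hypothetical sample value standing in for the CreationTime column
DECLARE @CreationTime DATETIME2 = '2025-01-05 14:30:00';

SELECT
    FORMAT(@CreationTime, 'dd')   AS DayTwoDigits,   -- day with leading zero: 05
    FORMAT(@CreationTime, 'ddd')  AS DayShortName,   -- abbreviated day name
    FORMAT(@CreationTime, 'dddd') AS DayFullName,    -- full day name
    FORMAT(@CreationTime, 'MM')   AS MonthTwoDigits, -- month with leading zero: 01
    FORMAT(@CreationTime, 'MMM')  AS MonthShortName, -- abbreviated month name
    FORMAT(@CreationTime, 'MMMM') AS MonthFullName;  -- full month name
```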
So it's like we are extracting a date part with FORMAT, but of course we don't
usually use it like this. We write the whole format that we need for a date.
For example, let's change the date to the USA format. We say FORMAT again on
the creation time, and now we write the USA format: MM, then after the month a
minus, then dd, and after that the year, so four times y. That's it. Let's
call it USA format and execute it. Now you can see in the output we got a new
column where we see the date information in the USA standard: it starts with
the month, then the day, and afterward the year. Of course we can do the same
thing to generate the standard format of Europe. I'll just duplicate it, and
now the format starts with the day, then the month, and then the year. If you
check the output, you can see it starts with the day, then a minus, then the
month, then a minus, and the year. So as you can see, we are changing the
format of the date from the creation time to something new.
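As a small sketch of the two full-date formats just described, assuming SQL Server: the variable @CreationTime and the aliases USA_Format and EURO_Format are hypothetical names chosen for illustration.

```sql
-- Hypothetical sample value standing in for the CreationTime column
DECLARE @CreationTime DATETIME2 = '2025-01-05 14:30:00';

SELECT
    FORMAT(@CreationTime, 'MM-dd-yyyy') AS USA_Format,  -- month first: 01-05-2025
    FORMAT(@CreationTime, 'dd-MM-yyyy') AS EURO_Format; -- day first:   05-01-2025
```

Note that FORMAT returns a string, so the result is meant for display and no longer behaves like a date.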
All right. So now we have the following task: show the creation time using the
following format. It's a very weird format. It starts with the word Day, then
we have the abbreviation of the day name and then the abbreviation of the
month. Next is the quarter information, then the year, and after that the time,
and we say whether it's PM or AM. It's a slightly weird format that you don't
see everywhere, but we still want to practice how to construct such a custom
format. So let's do it step by step. I'm going to go over here and add a new
line. The first part is the word Day. We don't have any format specifier for
that; it's just characters, so it is static for every row. What do we do? We
write it as a string: Day. Let's execute it. With that we got a static value:
everywhere we have the word Day. After that we have a space, so I'm going to
include it after Day in the string. So we have Day, then a space, and after
that we need the abbreviation of the day name. We use the plus operator to
concatenate the strings, and then we need the FORMAT function on the creation
time. What do we need? The short name, so it's three times d. Let's execute
it, and let me name the column custom format. Now as you can see in the output
we have Day, then a space, and then the abbreviated name of the day. So far it
looks good. What do we need after that? A space and then the abbreviation of
the month. We can add all of that to the same format string, so we don't have
to create two FORMAT calls. So a space, and the abbreviation of the month is
MMM. Let's test it. Great. Now as you can see we got the abbreviation of the
month as well, side by side.
So far we have covered this part. Now we have to move to the second part. We
still need a space and then Q1. The Q is going to be static, so we cannot
extend this format; we have to start a new piece. I'm just going to add a plus
here and a new line. What do we need? First a space between the month and the
quarter, so let's add a space, and then the Q as a static value, like this.
Now we need the quarter information, and we don't have a format specifier for
that. That's why we have to use one of the part extraction functions, and
since we are building a string, I will go with DATENAME: quarter, extracted
from the creation time. Let's test it. Now in the output you can see we have
Q1 everywhere, and that's because all of those dates are in Q1.
All right. So now we are halfway through our format. What do we need next? A
space, then the year information, and then the time information. To get the
space, we simply concatenate a space. Now let's go to a new line, and to get
the year I will use FORMAT as well. So FORMAT, and again the creation time.
How do we format it now? We need the year, so it's four times y. After that we
have a space and then the time information. We can still do that inside the
same format, right? So we add a space here. What comes next? The hours. It's
going to be the small h, because here we are working with PM and AM, not the
24-hour system. After that we have the colon, then the minutes, which is the
small mm, and then the seconds. So far this is exactly this part over here.
What is missing now? A space and the AM/PM designator. For that we add a space
as well and then the small tt. All right, we are almost there. Let's execute
it. Now you can see it is working: we have the year, then a space, the hours,
minutes, and seconds, a space, and then the designator. This one is PM and this
one is AM, which is correct. So that's it, we are done. This is how you can
create those crazy formats in SQL with the help of FORMAT, or DATENAME, or
some static values like we just added here.
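Putting the steps of this task together, the expression could look roughly like the sketch below. It assumes SQL Server. The variable @CreationTime, its value, and the alias CustomFormat are hypothetical, and the exact time specifiers (hh, mm, ss) are my reading of the spoken description rather than a confirmed copy of the query from the video.

```sql
-- Hypothetical sample value standing in for the CreationTime column
DECLARE @CreationTime DATETIME2 = '2025-01-01 12:34:56';

SELECT
    'Day '                                        -- static text, kept outside FORMAT
    + FORMAT(@CreationTime, 'ddd MMM')            -- abbreviated day and month names
    + ' Q' + DATENAME(quarter, @CreationTime)     -- FORMAT has no quarter specifier
    + ' '
    + FORMAT(@CreationTime, 'yyyy hh:mm:ss tt')   -- year, 12-hour time, AM/PM
    AS CustomFormat;
-- With a us_english session this should come out as:
-- Day Wed Jan Q1 2025 12:34:56 PM
```

The word Day stays outside the format string on purpose: letters such as d and y inside a format string would be read as specifiers instead of plain text.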
So I think it's really fun formatting the dates in SQL.
Now one use case for the FORMAT function that I frequently use in my projects
is formatting the date before doing aggregations. It's like part extraction,
but here we have more customization on how we represent the date in the
reports. We can show a report like sales by month where we display the date,
for example, as the abbreviated name of the month, Jan, together with two
digits for the year, 25. Once we change the format like this and then
aggregate the data, we have a nice report about the sales by month. So let's
have a quick aggregation using FORMAT. We say SELECT, then the order date, and
count the number of orders from our table sales orders, and then GROUP BY the
order date. Before we format the order date, let's first execute it as it is.
As you can see, the level of detail is very high: we have 10 rows, and for
each day we have one order. Now, we learned that we can use DATEPART to
extract one part and then aggregate on it. Instead of that, we're going to use
the FORMAT function. So let's change the order date to FORMAT, and the value
is the order date. Our format is going to be like this: three capital M and
then two digits for the year. That's it. Let's call it order date. We need the
same expression for the order date in the GROUP BY as well, and a comma here.
So that's it, let's execute it. In the output, as you can see, we have three
months, and here we have the aggregation, the number of orders for each month.
So it's like DATEPART, but now we are customizing the format as we want.
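The aggregation described above can be sketched as follows, assuming SQL Server. The inline Orders data stands in for the sales orders table from the video. The names OrderID, OrderDate, OrderMonth, and NrOfOrders are assumptions, as is the space between MMM and yy in the format string.

```sql
-- Hypothetical rows standing in for the sales orders table
WITH Orders AS (
    SELECT OrderID, OrderDate
    FROM (VALUES
        (1, CAST('2025-01-01' AS DATE)),
        (2, CAST('2025-01-05' AS DATE)),
        (3, CAST('2025-02-10' AS DATE)),
        (4, CAST('2025-03-15' AS DATE))
    ) AS v(OrderID, OrderDate)
)
SELECT
    FORMAT(OrderDate, 'MMM yy') AS OrderMonth,  -- abbreviated month + 2-digit year
    COUNT(*)                    AS NrOfOrders
FROM Orders
GROUP BY FORMAT(OrderDate, 'MMM yy');           -- same expression as in SELECT
```

With these sample rows the query returns one row per month instead of one row per day.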
So we can use FORMAT to change the granularity of the date before doing
aggregations.
Now I'm going to show you a real use case for formatting in real projects. Our
data could be stored in different technologies: it could be stored in a CSV
file, we could get it using an API call, or, in a very common scenario, it
could be stored in a database. What we usually do is extract the data from
these different sources into one central storage. It can happen that you get
different formats for the dates, and of course this is a problem for
analytics: you cannot present different formats for the dates. So we clean up
the formats into one standard format. That means we have to convert the
incoming data to the new format, and once we have one standard format we can
use it in analytics and reports.
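One possible way to sketch this cleanup in SQL Server is shown below. The video does not show a query for this scenario, so everything here is an assumption: the source names, the raw date strings, and the choice of TRY_CONVERT with a style number per source are swapped in only to illustrate the idea of bringing differently formatted dates to one standard type.

```sql
-- Hypothetical incoming rows: each source delivers dates in its own format
WITH Incoming AS (
    SELECT SourceName, RawDate
    FROM (VALUES
        ('csv', '08/25/2025'),   -- mm/dd/yyyy
        ('api', '25.08.2025'),   -- dd.mm.yyyy
        ('db',  '2025-08-25')    -- yyyy-mm-dd
    ) AS v(SourceName, RawDate)
)
SELECT
    SourceName,
    RawDate,
    CASE SourceName
        WHEN 'csv' THEN TRY_CONVERT(DATE, RawDate, 101)  -- style 101: mm/dd/yyyy
        WHEN 'api' THEN TRY_CONVERT(DATE, RawDate, 104)  -- style 104: dd.mm.yyyy
        WHEN 'db'  THEN TRY_CONVERT(DATE, RawDate, 23)   -- style 23:  yyyy-mm-dd
    END AS StandardDate                                  -- NULL if parsing fails
FROM Incoming;
```

Once every row carries a real DATE value, a single format can be applied on top of it for reports.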
So this is a very common use case in data preparation and data cleanup:
bringing different formats into one standard format.
Now in SQL we have many different date and time specifiers. As I said, they
are case sensitive, and each one of them has a different meaning. I have
prepared for you all the possible specifiers that we can use with FORMAT. Not
only that: if you go back to the queries for this chapter, you can find one
called all date formats. If you open it, you can copy the whole query, go back
to SQL, and execute it. You get a live example, because I'm applying the
formats to GETDATE. So you can find here a list of all the possible date
specifiers that you can use with FORMAT. I would say go and practice with
those different date formats in order to understand what is possible in SQL.
As we learned, we can change not only the format of a date but also the format
of a number using the FORMAT function, and those are the different
possibilities that you can use as a specifier to change the format of numbers.
I have also prepared all those different specifiers in one big query.
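A few number specifiers can be tried out directly, as in the sketch below. It assumes SQL Server; the sample values and aliases are made up, and the culture en-US is passed explicitly only so that the separators in the comments are predictable. This is a small selection, not the full list from the video's query.

```sql
SELECT
    FORMAT(1234.56, 'N2', 'en-US')      AS NumericFormat,   -- 1,234.56
    FORMAT(1234.56, 'C',  'en-US')      AS CurrencyFormat,  -- $1,234.56
    FORMAT(0.256,   'P1', 'en-US')      AS PercentFormat,   -- percentage, 1 decimal
    FORMAT(42,      'D5')               AS ZeroPadded,      -- 00042
    FORMAT(1234.56, '#,##0.0', 'en-US') AS CustomPattern;   -- 1,234.6
```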
So if you open it, copy it, put it in SQL, and execute it, you will find all
the different possibilities that we have as specifiers to change the format of
numbers.
All right. So what is CONVERT? It's very simple: it changes a value to a
different data type, and at the same time it helps with formatting the value.
Okay, let's check the syntax of CONVERT. It starts with the function CONVERT,
and it accepts two required parameters. The data type comes first, since we
can use this function to cast data types, so you can use string, integer,
date, and so on. Then we have to specify the value, so which value should be
cast. The last parameter is optional: there you define the style, the format
of the value. Let's take a very simple example. We say convert to the data
type integer, INT, and the value that should be converted is 123 as a string,
so it converts it to an integer. In the second example we say convert to a
VARCHAR, and the value that should be converted is the order date. The order
date is a date, so we convert it from a date to a VARCHAR using the style 34.
So here we are specifying a style, a format, for this value. Of course it is
optional, and if you don't specify anything, the default style that is used is
zero. So this is the syntax of CONVERT in SQL.
All right. So now we're going to have a few examples of how to work with
CONVERT. Let's convert, for example, a string to an integer. We say CONVERT,
and what is the target data type? It's the integer, and the value is, for
example, 123 as a string. Let's call the column like this: string to integer,
and the function is CONVERT. In the column name, as you can see, I'm using
square brackets, and that's because I'm using spaces and so on; with brackets
I get more freedom in how to name things. So this is just the name; it is not
a function or anything. Let's execute it. As you can see, it works. We are
converting from a string value to an integer, and this 123 in the output is
not a string; its data type is integer. All right. Now let's have another
example where we want to convert from a string to a date. The target is DATE,
and for the value let's take our usual date string, and we call it string to
date CONVERT. Okay, let's execute it. In the output we get this string as a
date, and with that we have converted the data type from string to date. Now
let's have another example where we want to convert a datetime to a date. As
you remember, the creation time is a datetime, and we would like to have it as
a date only. So let's CONVERT, and the target is DATE as well, but this time
the value is the column creation time. Let's give it a name: we are converting
datetime to date. Of course, here we have to select from a table, so FROM
sales orders. That's it, let's execute it. As you can see in the output, we
got only the date. I'm going to select the creation time in the query as well.
Now, as you can see, the creation time was a datetime before, so it has the
time information as well. But if you cast it using CONVERT and make it a date
only, SQL converts it to a date and you lose all the information about the
time. So far, what we are doing here is just casting: we are changing the data
type from one to another.
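The three casting examples above can be sketched in one query, assuming SQL Server. The variable @CreationTime stands in for the creation time column, the date string 2025-08-20 is a made-up sample (the video only refers to "the usual value"), and the bracketed aliases follow the naming idea described in the transcript.

```sql
-- Hypothetical sample value standing in for the CreationTime column
DECLARE @CreationTime DATETIME2 = '2025-01-01 12:34:56';

SELECT
    CONVERT(INT, '123')           AS [String to Int CONVERT],     -- integer 123
    CONVERT(DATE, '2025-08-20')   AS [String to Date CONVERT],    -- a real DATE value
    @CreationTime                 AS CreationTime,                -- original datetime
    CONVERT(DATE, @CreationTime)  AS [Datetime to Date CONVERT];  -- time part is dropped
```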
But with CONVERT we can do both: casting and formatting. So let's see how we
can do that. I will just get rid of those columns at the start and keep the
creation time. Now we're going to convert the datetime of the creation time to
a VARCHAR, a string, and also give it the USA standard format. We start with
CONVERT. We are changing to VARCHAR, so this is the new data type, and the
value is the creation time. If I don't give it a style, it stays with the
default format, but we would like to have the USA standard. To do that, we add
the style of the format, which is going to be 32. So that's it. Let's give it
a name like this: USA standard, and we are using the style 32. This is just a
name again, so it's not a function. Let's go ahead and execute it. Now in the
output we got a new field, and the data type of this field is VARCHAR, so it's
not a date or datetime. As you can see, the date is now formatted using this
style, the 32, the US standard format: it starts with the month, then the day,
and then the year. Now let's do the same thing to get the standard format in
Europe. I will just copy the whole thing and change the style: instead of 32
we go with 34. I will change the name as well. So we are just changing the
style. Let's go ahead and execute it. As you can see, we got the same thing:
we have a VARCHAR as well, and the format is now different. We have the day
first, then the month, and then the year. So this is how you work with the
CONVERT function. You can use it for casting only, or you can do casting and
formatting together. You have both things in one function.
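Casting and formatting in one CONVERT call can be sketched as below, assuming SQL Server. The video uses the styles 32 and 34. As an assumption on my side, this sketch swaps in the documented styles 110 (USA, mm-dd-yyyy) and 105 (Italian, dd-mm-yyyy), because their output is easy to state with certainty; the style numbers from the video can be substituted in the third argument. The variable and alias names are hypothetical.

```sql
-- Hypothetical sample value standing in for the CreationTime column
DECLARE @CreationTime DATETIME = '2025-01-05 14:30:00';

SELECT
    CONVERT(VARCHAR(10), @CreationTime, 110) AS [USA Std. Style:110],  -- 01-05-2025
    CONVERT(VARCHAR(10), @CreationTime, 105) AS [EURO Std. Style:105]; -- 05-01-2025
```

Both results are VARCHAR values, so the third argument controls only how the date is written out as text.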
And now, if you're asking which styles are available, there are many styles
that you can use inside CONVERT. I have prepared for you a list of all the
styles that you can use with CONVERT: we have styles only for dates, other
styles only for time, and styles for date time. Now, in the download folder
you can find one file called all culture formats. There you can find one query
that I have prepared, which contains the different cultures and examples. So
let's go and copy it, go back to SQL, paste it, and see the results. Now, if
you check the output, the first column is the culture that is used. We have a
lot of cultures, around 17, and you can see how the numbers or the dates are
formatted based on each culture. So it's really fun. You can check here, for
example, the format in Japan or Korea or France, and the German one. If you
scroll down, you can find the Arabic, the Russian, and so on. So you can see
the format of each date changing based on the culture.
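A query of that kind could look roughly like the sketch below. It is not the
file from the download folder, only a small illustration that assumes the
culture argument of FORMAT is what drives the output. The sample number, the
sample date, and the four culture codes are arbitrary choices.

```sql
-- FORMAT(value, format, culture): the optional third argument selects the culture.
SELECT 'en-US' AS Culture,
       FORMAT(1234567.89, 'N', 'en-US')                 AS FormattedNumber,
       FORMAT(CAST('2025-08-20' AS DATE), 'D', 'en-US') AS FormattedDate
UNION ALL
SELECT 'de-DE',
       FORMAT(1234567.89, 'N', 'de-DE'),
       FORMAT(CAST('2025-08-20' AS DATE), 'D', 'de-DE')
UNION ALL
SELECT 'fr-FR',
       FORMAT(1234567.89, 'N', 'fr-FR'),
       FORMAT(CAST('2025-08-20' AS DATE), 'D', 'fr-FR')
UNION ALL
SELECT 'ja-JP',
       FORMAT(1234567.89, 'N', 'ja-JP'),
       FORMAT(CAST('2025-08-20' AS DATE), 'D', 'ja-JP');
```

Here 'N' is the standard number format and 'D' is the long date pattern, so each
row shows the same two values written the way that culture writes them.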
So I would say, have fun. Go and try those different culture formats in order
to format your numbers or dates.
So what is the CAST function? It converts a value to a different data type. So
it turns one data type into another. All right, so now let's check the syntax
of CAST. I really like this one. It is not the typical syntax in SQL. CAST is
the function, and inside it we need two things, but they are not separated with
a comma as we learned before with all the other functions. This time they are
separated with the keyword AS. So it's like natural English: you are saying,
cast the value as a data type. So you are casting the value to a new data type.
Let's have this very simple example: cast the value '123' as integer. So
previously it is a string, and it's going to be converted to an integer. As you
can see, it's very simple. Now, in this example we are saying, cast this string
value as a date, so it is converted from string to date. As you can see, with
CAST we don't have any option for formatting or styling the values. It is only
dedicated to casting the value from one data type to another. So this is the
syntax of CAST. It is a very straightforward and really nice function. Okay, so
now let's have a few examples of CAST. Let's go and convert a value from a
string to an integer. It's very simple.
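The two syntax examples described above can be sketched like this. The date
string '2025-08-20' is an assumption, since the narration does not spell out
the exact value shown on screen, and the aliases are illustrative.

```sql
-- CAST(value AS data_type): the value and the target type are separated by AS.
SELECT
    CAST('123' AS INT)         AS StringToInt,   -- the string '123' becomes the integer 123
    CAST('2025-08-20' AS DATE) AS StringToDate;  -- assumed sample value; string becomes a date
```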
We're going to say CAST. Now we need the value, so let's go with '123'. So we
have here a string. Then we say AS, and then we have to define the data type,
which is going to be integer. So that's it. Let's give it a name like this,
string to integer. Let's go and execute it. Now, as you can see, we got the
value, but with the data type integer. From string to integer. Now let's do it
the other way around: we cast from integer to string. So we say CAST 123 AS
VARCHAR, and we give it the name int to string. Let's go and execute it. Now in
the output we have 123, but this time it has the data type VARCHAR. Now let's
go and work with dates. We're going to convert a string value to a date. Our
value is going to be the usual one, and we want it from string to date, so the
data type is going to be DATE. Let's give it a name, string to date.
Let's go and execute it. Now we have this value with the data type DATE. So
that's it. Now let's say that I would like to have this value as a date time.
I will just copy the whole thing, go to a new line, and say DATETIME2. The name
of this one is going to be string to date time. Let's go and execute it. Now in
the output, as you can see, we are getting not only the date but the time
information as well. But since we didn't provide SQL with any time information,
SQL shows it as zeros. Now let's do one more casting, where we change the data
type from date time to date. So now we need our creation time, but we have to
get it from the table, so FROM Sales.Orders. Let's go and execute it. Now in
the output you can see the creation time is a date time. We have the time
information, but we are not interested in it. I would like to have this field
as a date. So what we're going to do is very simple. We say CAST, the value is
creation time, then the keyword AS, and we need it as a DATE. We're going to
give it the name date time to date. Let's go and execute it. Now, as you can
see in the output, we got the creation time but only with the date
information. We don't have anything about the time, so we get it as a date
instead of a date time. So that's it. This is an amazing function in SQL. It's
very simple, and we can use it only for casting, so only to change the data
type from one to another. We cannot use this function to change the format. If
you are casting, you always get the standard format from SQL.
So now let's go and compare our functions side by side. We have our three
functions, CAST, CONVERT, and FORMAT, and we can do two things: either casting
or formatting. Starting with casting: with the first function, CAST, we can
change any type to any other type. So there is no restriction at all. The same
goes for CONVERT: we can convert anything to anything. But with FORMAT we can
change only to a string, so any data type like a date or a number to a string
value, because the main purpose of FORMAT is not changing the data type. Now,
if we are talking about changing the format of the values, you cannot use the
CAST function to change the format. The CAST function is only for casting. It
makes sense. As for CONVERT, we can use it to change the format of the date and
time, but we cannot use it to change number formats. For that we have a
dedicated function called FORMAT. We can use it to change the format of the
date and time and the numbers as well.
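To see the comparison in one place, here is a small side-by-side sketch. The
table Sales.Orders and the column CreationTime are assumed from the narration,
and the format string, culture code, and sample number are arbitrary choices
for illustration.

```sql
-- Assumed names: Sales.Orders and CreationTime are inferred from the narration.
SELECT
    CreationTime,
    CAST(CreationTime AS DATE)         AS CastToDate,     -- casting only, standard format
    CONVERT(DATE, CreationTime)        AS ConvertToDate,  -- casting only
    CONVERT(VARCHAR, CreationTime, 34) AS ConvertStyled,  -- casting plus a date/time style
    FORMAT(CreationTime, 'dd/MM/yyyy') AS FormatDate,     -- date formatting, result is a string
    FORMAT(1234.56, 'N', 'de-DE')      AS FormatNumber    -- number formatting, result is a string
FROM Sales.Orders;
```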
So those are the main differences between those three functions. All right, so
with those three functions we have learned how to do formatting and casting on
date information. Now, moving on to the third group, we have the date
calculations. Here we have two functions for doing date calculations, or
mathematical operations, on dates.
Okay, so now we're going to start with the first function, DATEADD. So what is
DATEADD? DATEADD allows us to add or subtract a specific time interval to or
from a date. Let's understand how DATEADD works. Here again we have our date,
August 20th, 2025. In some scenarios we would like to add years to our date.
For example, let's say I would like to add three years to our date. We can do
that using DATEADD. If you do that, in the output you will get 2028, August
20th. Only the year part is changed, since we have added three years. In other
scenarios you would like to add months. For example, let's add two months to
August. In the output you will get 2025-10-20, so with that we have added two
months. And of course we can add days to our date. For example, let's add five
days to our date. In the output we get the same year, 2025, and the same month,
August, but the day is changed to 25. So we have added five days to the
original date. And of course we can also subtract from dates, even though the
function is called DATEADD. For example, we can subtract three years from our
date, and if you do that you will get 2022, August 20th. Or if you subtract two
months from our date, it stays the same year, 2025, but this time instead of
August we go back to June, with the same day, 20. And the same thing happens
for the days. If you subtract five days, you get the same year, 2025, the same
month, August, but the day is going to be 15 instead of 20. So as you can see,
with DATEADD you can manipulate the years, the months, and the days by adding
or subtracting intervals. This is how DATEADD works. All right, so now let's
check the syntax of DATEADD. Here things are a little bit more complicated. We
have to provide three pieces of information. The first one is the part: what do
you want to add? Do you want to add years, or months, or days, and so on. The
second one is the interval, so how many days, how many years, how many months.
And the last one is the date. This is the date that we're going to manipulate
by adding or subtracting intervals. Let's check the following example. We are
saying here DATEADD. The part here is year, which means we want to manipulate
only the year part. The interval here is two. It is positive, so we want to add
two years. So it's going to go to each order and add two years to each date
value. Now let's check another example. Here we are saying DATEADD month, so we
want to manipulate the month part. But here we are saying minus 4, which means
we want to subtract four months from each value in the order date. So as you
can see, with the sign of the interval, whether it's positive or negative, we
control whether the function does addition or subtraction.
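The walk-through and the syntax above can be sketched as two small queries.
The first uses the literal date from the explanation. The second applies the
same calls to a column, where Sales.Orders, OrderID, and OrderDate are assumed
names based on the narration and the aliases are illustrative.

```sql
-- Part 1: DATEADD(part, interval, date) on the literal date 2025-08-20.
SELECT
    DATEADD(year,   3, CAST('2025-08-20' AS DATE)) AS PlusThreeYears,   -- 2028-08-20
    DATEADD(month,  2, CAST('2025-08-20' AS DATE)) AS PlusTwoMonths,    -- 2025-10-20
    DATEADD(day,    5, CAST('2025-08-20' AS DATE)) AS PlusFiveDays,     -- 2025-08-25
    DATEADD(year,  -3, CAST('2025-08-20' AS DATE)) AS MinusThreeYears,  -- 2022-08-20
    DATEADD(month, -2, CAST('2025-08-20' AS DATE)) AS MinusTwoMonths,   -- 2025-06-20
    DATEADD(day,   -5, CAST('2025-08-20' AS DATE)) AS MinusFiveDays;    -- 2025-08-15

-- Part 2: the same syntax applied to every row of a column.
-- Assumed names: Sales.Orders, OrderID and OrderDate are inferred from the narration.
SELECT
    OrderID,
    OrderDate,
    DATEADD(year,   2, OrderDate) AS TwoYearsLater,    -- positive interval adds
    DATEADD(month, -4, OrderDate) AS FourMonthsBefore  -- negative interval subtracts
FROM Sales.Orders;
```

The same call shape works for any other part and interval, for example
DATEADD(day, -10, OrderDate) to go ten days back.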
So let's have a few examples of DATEADD using our field order date. For
example, let's add two years to each date. We can do it like this: DATEADD. We
are adding years, which is why we go with the part year. How many years are we
adding? Two years, so this is our interval. And our value is the order date.
Now in the output, as you can see, we got a date, and this date is always two
years later than the order date. So everywhere you see 2027. Now let's add, say,
three months to each date. I'm just going to copy it and say month. Let's
change the interval to three, and we're going to call it three months later.
Now if you check the output over here, we have a new date, and it is always
three months later than the order date. For example, here we have January, but
in the new one we have April, and for the next one we have February, and in the
new field we have May. So as you can see, we are adding months to our original
field, order date. Now let's say that I would like to subtract 10 days. Let's
do the same. We're going to have DATEADD. Since we are talking about days, the
part is going to be day. We're going to subtract 10 days, so minus 10, from the
order date. Let's call it 10 days before. Let's go and execute it. Now we got a
new date as well, and this date is always 10 days before the order date. For
example, let's take order number seven. In the order date we have the 15th, but
in the new column we have the 5th. So we have subtracted 10 days from the
original field, order date. As you can see, it's very simple to add or subtract
days, months, or years using DATEADD.
All right. So what is DATEDIFF? DIFF stands for difference, and DATEDIFF allows
us to find the difference between two dates. Let's understand how DATEDIFF
works in SQL. Imagine we have two dates: the order date, 2025 August 20th, and
the shipping date, the 1st of February in the next year, 2026. Now we might ask
the question, how many years have passed between the order date and the
shipping date? To answer this question, we can use the function DATEDIFF and
define the part year. If you do it like this, it compares the two dates and
returns one. DATEDIFF counts the year boundaries crossed between the two dates,
and here we cross from 2025 into 2026, so the difference is one year. But now,
if the question is how many months are between the order date and the shipping
date, here again we can use DATEDIFF between the order date and the shipping
date, but with the part month. If you do it like this, in the output you will
get six months. And of course, if the question is how many days are between the
order date and the shipping date, here we can use the function DATEDIFF where
we specify day inside it, and in the output you will get 165. So this is how
DATEDIFF works.
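For the two dates named above, 20 August 2025 and 1 February 2026, the three
questions can be sketched as one query. The aliases are illustrative. The
values in the comments follow from how DATEDIFF counts the boundaries of the
given part that are crossed between the start date and the end date.

```sql
-- DATEDIFF(part, start_date, end_date) counts the part boundaries crossed.
SELECT
    DATEDIFF(year,  CAST('2025-08-20' AS DATE), CAST('2026-02-01' AS DATE)) AS YearsBetween,   -- 1
    DATEDIFF(month, CAST('2025-08-20' AS DATE), CAST('2026-02-01' AS DATE)) AS MonthsBetween,  -- 6
    DATEDIFF(day,   CAST('2025-08-20' AS DATE), CAST('2026-02-01' AS DATE)) AS DaysBetween;    -- 165
```

Note that the year result is 1 even though fewer than twelve months lie
between the two dates, because only the change from 2025 to 2026 is counted.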
You go and subtract two different dates, and you get a number in the output:
how many years, how many months, how many days. So that's it. All right, now to
the syntax of DATEDIFF. It accepts three parameters here as well. The first one
is the part, as usual: year, month, day. And then we need two dates, not only
one. We need the start date and the end date. That means the start date is the
earlier date and the end date is the later date. For example, here we have
DATEDIFF, and we are saying, find the difference in years between the order
date, which is the start date, and the shipping date. So which date normally
happens first? First we have to order something, so we have the order date, and
once you order, what happens next is the shipping. That's why the shipping date
is the end date. So we want to find the difference between them in years. Or of
course, if you want to find the difference between them in days, we have to
change the part from year to day. So as you can see, the syntax is very simple
and very logical, right? All right, let's have the following simple task. It
says: calculate the age of employees. Let's see how we can solve that. We're
going to first select all the information from employees, so Sales.Employees.
Okay, let's execute it.
Now in the employees table we don't have any information about the age, but we
have the birth date. So we can transform this birth date into an age. And of
course, how do we calculate the age? We count how many years are between this
year and the birth date. That means we have to use two functions, DATEDIFF and
GETDATE, in order to get the current year. So let's go and do that. I'm going
to first select only a few pieces of information: employee ID and birth date.
Let's start with DATEDIFF. Since we are talking about the age, we are
calculating how many years, which is why the part is going to be year. What is
the start date? It is the birth date of the person, so it's going to be the
birth date. And now we need the end date. We don't have anything here about the
end date. The end date is going to be the current date. In order to get it,
we're going to go with the function GETDATE, and with that we are getting the
current date information.
And this is exactly what we want. So let's close it and call it age. It's very
simple: we are counting how many years are between the birth date and the
current date. Let's go and execute it. So now we are getting the ages. As you
can see, the first person is 33, the second one is 52, and so on. Now, you
might be getting different values than I'm getting, and that may be because
you are doing the course in 2025 or 2026, so the employees are going to be
older. Now we are in 2024, and I'm getting those ages. So this is how we
calculate the age with the help of two functions, DATEDIFF and GETDATE.
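The age calculation described above can be sketched as follows. The names
Sales.Employees, EmployeeID, and BirthDate are assumptions based on what is
said in the narration.

```sql
-- Assumed names: Sales.Employees, EmployeeID and BirthDate are inferred from the narration.
SELECT
    EmployeeID,
    BirthDate,
    DATEDIFF(year, BirthDate, GETDATE()) AS Age  -- start date: birth date, end date: today
FROM Sales.Employees;
```

Because DATEDIFF with the part year only compares the year numbers, this gives
the current year minus the birth year. An employee whose birthday has not yet
happened this year is therefore shown one year older than they actually are,
which is fine for a quick estimate like this one.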
So now we have another task for DATEDIFF, and it says: find the average
shipping duration in days for each month. Here we have a lot of information, so
let's do it step by step.
Let's first find out the shipping duration in days. Let's select a few pieces
of information from our table: SELECT order ID, then the order date and the
ship date, and I think that's it, FROM Sales.Orders. Let's go ahead and execute
it. Now we have our 10 orders, with the order date and the shipping date. Now
we have to create a new field called shipping duration. What is the shipping
duration? It is the number of days between the order date and the shipping
date, so how many days it took from the order placement until the day of
shipping. That means we have two dates and we have to find the difference
between them, so we're going to go with the function DATEDIFF. Since we are
saying in days, we have to go with the part day. What is the start date? The
start date is the order date. And what is the end date? It's going to be the
shipping date, like this. I'm going to call it day to ship. Let's go and
execute it.
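The query built up in this step could look roughly like this. Sales.Orders,
OrderID, OrderDate, and ShipDate are assumed names based on the narration, and
the alias Day2Ship is one possible spelling of the name that is spoken.

```sql
-- Assumed names: Sales.Orders, OrderID, OrderDate and ShipDate are inferred from the narration.
SELECT
    OrderID,
    OrderDate,
    ShipDate,
    DATEDIFF(day, OrderDate, ShipDate) AS Day2Ship  -- days from order placement to shipping
FROM Sales.Orders;
```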
Now, checking the result: order one, for example, was ordered on the 1st of
January and shipped on the 5th of January. Between those two dates we have 4
days, so four is the shipping duration. And if you go to order number three,
the difference between the order date and the shipping date is 15 days. With
that we have solved this part, shipping duration in days. But now the task says
we have to find the average duration for each month. That means we have to
take, for example, the month of January and find the average duration. So we
have to do a simple aggregation. We go to the start of DATEDIFF and say AVG,
and we close it over here.
And we're going to close it over here. And let's go and rename it average
And let's go and rename it average shipping. And now we have to aggregate
shipping. And now we have to aggregate by the month. So we don't need the whole
by the month. So we don't need the whole order dates. We need the month of the
order dates. We need the month of the order date. So like this. We don't need
order date. So like this. We don't need of course the order ID, but now we need
of course the order ID, but now we need to group up the data using this
to group up the data using this dimension, the month order dates. So
dimension, the month order dates. So that's it. Let's go and execute it. So
that's it. Let's go and execute it. So now in the output you can see we have
now in the output you can see we have three months and for each month we have
three months and for each month we have the average shipping durations in days.
the average shipping durations in days. So for the first month it is around 7
So for the first month it is around 7 days for February is as well 7 days and
days for February is as well 7 days and for March we have less duration 5 days.
for March we have less duration 5 days. So with that we have solved the task. As
So with that we have solved the task. As you can see the date diff is very strong
you can see the date diff is very strong function in order to do data analytics
function in order to do data analytics using the dates information. All right.
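A sketch of the aggregated version of this task, again assuming the table Sales.Orders with columns OrderDate and ShipDate. The aliases OrderMonth and AvgShip are illustrative.

```sql
-- Average shipping duration in days, per month of the order date.
SELECT
    MONTH(OrderDate) AS OrderMonth,
    AVG(DATEDIFF(day, OrderDate, ShipDate)) AS AvgShip
FROM Sales.Orders
GROUP BY MONTH(OrderDate);
```

One thing to keep in mind: in SQL Server, AVG over an integer expression returns an integer, so the averages come back as whole numbers. If fractional days matter, one option is to cast the DATEDIFF result to a decimal type before averaging.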
All right. So now we have the following task, and it says: find the number
of days between each order and the previous order. There's a lot of stuff
going on here, so let's do it step by step. Let's start by selecting the
basic stuff. So SELECT order ID, order date FROM the table Sales.Orders.
Let's go and execute it. So we have our 10 orders and we have the current
order dates. Now we have to find the difference between two dates: the
current order date and the previous order date. In our data we have the
current order date, but we don't have the previous order date for each
order. And in order to calculate the previous one, do you remember the
window functions? We can use LAG in order to access a value from a
previous record. So let's go and do that. The order date, I'm just going
to call it current order date. And let's find the previous order date. So
we're going to go with the LAG of the order date, because we are
interested in the value of the order date. Now in the OVER we have to sort
the data, so we're going to sort it by the order date as well. This is
going to help us always access the previous value of the order date. We're
going to call it previous order date. Let's go and execute it and let's
check the result.
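A minimal sketch of this step in T-SQL, assuming the same Sales.Orders table. The aliases CurrentOrderDate and PreviousOrderDate follow the names spoken in the video.

```sql
-- LAG reads the OrderDate of the previous row in the window's sort order.
SELECT
    OrderID,
    OrderDate AS CurrentOrderDate,
    LAG(OrderDate) OVER (ORDER BY OrderDate) AS PreviousOrderDate
FROM Sales.Orders;
```

The first row in the sort order has no predecessor, so LAG returns NULL for it.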
execute it and let's check the result. For the first order, we don't have
For the first order, we don't have anything previously. So that's why we
anything previously. So that's why we are getting a null. For the second
are getting a null. For the second record, the current order date is the
record, the current order date is the 5th of January and the previous one is
5th of January and the previous one is the 1st of January. And this value comes
the 1st of January. And this value comes from the previous record, the previous
from the previous record, the previous order. Great. Amazing. So with that we
order. Great. Amazing. So with that we have now the two dates, the current date
have now the two dates, the current date and the previous one. And now we can go
and the previous one. And now we can go very simply finding the number of days
very simply finding the number of days between those two dates. And we can do
between those two dates. And we can do using the amazing function date diff. So
using the amazing function date diff. So we are interested on the days that's why
we are interested on the days that's why it's going to be the day. So what is the
it's going to be the day. So what is the starting day? If you check those two
starting day? If you check those two dates, you can see that the previous
dates, you can see that the previous order date is the starting date. So
order date is the starting date. So we're going to take the whole thing, the
we're going to take the whole thing, the whole window function and put it over
whole window function and put it over here. So I just moved my picture. So
here. So I just moved my picture. So here is the previous order dates. And
here is the previous order dates. And now the end date, what's going to be?
now the end date, what's going to be? It's going to be the current order date
It's going to be the current order date which is our order date like this. So
which is our order date like this. So again, we are finding the number of days
again, we are finding the number of days between the previous dates and the
between the previous dates and the current dates. So that's it. Let's close
current dates. So that's it. Let's close it. So I'm just going to call it number
it. So I'm just going to call it number of
of days. So let's go and execute it. Now of
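Put together, the time-gap query described here might look like the following sketch. The table and column names are assumed as before, and the alias NrOfDays is illustrative.

```sql
-- Days between each order and the one before it (ordered by OrderDate).
SELECT
    OrderID,
    OrderDate AS CurrentOrderDate,
    LAG(OrderDate) OVER (ORDER BY OrderDate) AS PreviousOrderDate,
    DATEDIFF(
        day,
        LAG(OrderDate) OVER (ORDER BY OrderDate),  -- start: previous order date
        OrderDate                                   -- end: current order date
    ) AS NrOfDays
FROM Sales.Orders;
```

Because DATEDIFF returns NULL when either date is NULL, the first order gets NULL in NrOfDays.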
Now of course we have a NULL here, so we will get a NULL in the output as
well. And now you can check over here how many days there are between
those two dates. We have exactly four days. And for the next ones we have
around 5 days, 10 days and so on. So we have solved the task. We now have
the number of days between each order and the previous order. This type of
analysis is very important in business. We call it time gap analysis, and
we have done it with the help of the window function and the date function
DATEDIFF. So DATEDIFF is an amazing function for doing data analysis. All
right. So with those two functions we have learned how to do mathematical
operations on date information, or we can call it date calculations.
Now moving on to the easiest and the last group, we have the date
validation. And here we have only one function, ISDATE.

Okay. So what is ISDATE? ISDATE is very simple. It's going to check
whether a value is a date. It returns one if the string value is a valid
date, or zero if it is not a valid date. Okay. So let's quickly check the
syntax of ISDATE. It's very simple. The keyword ISDATE is the function
name and it accepts only one value. So for example you can pass a string
like this and ask SQL: is it a date? So ISDATE and the value, and of
course for this example you will get one. So as you can see, we are
passing a string value here and we are validating whether it is good
enough to be a date. Or you can specify a number, like here 2025. So is
this value a date? SQL is going to accept it and say yeah, this is a year,
so you will get a one as well. So you can also pass a number or integer.
You are just checking whether the values are suitable to be a date. So
that's all about the syntax of ISDATE.
Okay. So now let's have a few examples. For example, let's go and SELECT,
and we're going to say ISDATE and check a value. Let's say this value is
the string 123. Let's call it date check one. Let's go and execute it. Now
the output says no, it is not a date, and that's why we are getting the
value zero, which is correct because 123 is not a date. Let's pick another
value. The same thing, ISDATE, and now the value is going to be the 20th of
August 2025 in the standard format. Let's call it date check 2, and let's
go and execute it. Now in the output we get one. That means the value that
we have provided is a date, and that's why we have a one in the output:
SQL is saying this is a date. Now let's have another example. We're going
to take the whole thing, so this is check three, and remove this from
here. But I would like to change the format. So let's say that we start
with the day, then the month, and then the year. Let's go and check. Now
in the output you can see it is zero, because SQL does not understand the
format. We are not following the standard format of the database, and
that's why SQL is going to say no, this is not a date, this is just a
string value. So this means SQL is going to understand the value as a date
only if it follows the standard format. Now let's check another thing. For
example, let's say ISDATE and have only the year, so 2025, and let's give
it the name date check four. Let's go and execute it. Now in the output we
get one, so SQL is considering this value a date. That means SQL is smart
enough to understand that we have provided a year, and it is going to
accept it and say okay, maybe this is the 1st of January 2025. Now let's
do the same thing for the month and see whether SQL is going to accept it.
So check five, and we have the month of August. Let's go and check. Now
SQL is going to say no, I don't understand this value, this is zero. So
that means the value provided is not a date. So by checking those results,
as you can see, SQL understands only the standard formats, and it also
allows you to check whether a year is a date. So this is how ISDATE works
in SQL.
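The five checks from this walkthrough can be sketched in one T-SQL statement. The exact string literals are assumptions reconstructed from the spoken description (for example '2025-08-20' for "2025 August 20"), and the results in the comments are the ones reported in the video.

```sql
-- ISDATE returns 1 if the value can be read as a date, otherwise 0.
SELECT
    ISDATE('123')        AS DateCheck1,  -- reported: 0, not a date
    ISDATE('2025-08-20') AS DateCheck2,  -- reported: 1, year-month-day
    ISDATE('20-08-2025') AS DateCheck3,  -- reported: 0, day-month-year
    ISDATE('2025')       AS DateCheck4,  -- reported: 1, a year on its own
    ISDATE('08')         AS DateCheck5;  -- reported: 0, a month on its own
```

Note that ISDATE depends on the session's language and date format settings, so a day-month-year string like the third one may be accepted on a server configured differently.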
And now you might ask: well, when am I going to do this, when am I going
to check whether a value is a date or not? Let me give you the following
scenario. Imagine that we have the following data. We have four values as
strings. If you check the data, you can see that we are following the
standard format, but one value has an issue. So we have a data quality
problem here. Now what we want to do is cast this string value to a date.
We don't want it to stay a string value, we would like to have it in the
final result as a date. What we usually do is put a subquery on top of
those values, like this. So now we're going to say we would like to cast
the order date as a date, we don't want it as a string, and we're going to
call it order date, from these values. So let me just make it like this
and let's go and execute it. Now SQL is going to give you an error and say
well, I cannot convert everything to a date because you maybe have corrupt
data, and this is of course because of this row. SQL is not able to
convert this string to a date. Of course the example here is very simple
and we know that, but if you have a huge table it's going to be really
hard to identify those issues. But I would still like to convert those
values. I don't want to get an error, and if there are some values, like
here, that are corrupt, the value could just be NULL. So how can we force
SQL to convert the data type from string to date and not give us this
error? For this we can use the help of the function ISDATE.
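To make the failing scenario concrete, here is a small sketch. The four string values are hypothetical stand-ins (the transcript does not spell them out), with one deliberately invalid entry.

```sql
-- Hypothetical sample data: three valid date strings and one invalid one.
-- Running this raises a conversion error because of '2025-08-32'.
SELECT
    CAST(OrderDate AS DATE) AS OrderDate
FROM (
    SELECT '2025-08-32' AS OrderDate UNION ALL
    SELECT '2025-08-20' UNION ALL
    SELECT '2025-08-21' UNION ALL
    SELECT '2025-08-23'
) AS t;
```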
Let me show you how I usually do it. So let's check whether the order date
is a date. Let's have it like this. And before we execute, I'm going to
turn this into a comment, because if I execute it like this we will get an
error. And let's get the order date in our SELECT. So let's go and execute
it. Now as you can see in the output, we have our string values, so they
are not yet dates, and we have the result of our check. For the first row
we are getting a zero, so it's saying this value is not a date. But for
all other values we are getting one, so they pass the check and they are
dates. So now we're going to build a logic that says: cast the value from
string to date only if the flag, the check, is equal to one. That means we
can use the help of the CASE WHEN statement. Let me show you how we can do
that. Let's do it step by step. We're going to say CASE WHEN. Now we need
the check, so ISDATE of the order date. If the output of this check is
equal to one, then you are allowed to do the casting. So let's take the
CAST as the result of this condition, and if it's not equal to one then it
can stay a NULL. So let's have a NULL if it didn't pass the test. So END,
and we can call it new order date. Now let's go and execute it. As you can
see, we are not getting an error from SQL. If you check the output, for
the invalid date we are getting a NULL instead of an SQL error, and only
the string values that are valid dates are allowed to be cast. So you can
cast a string value to a date even though you have bad data quality. This
is a very important step in preparing the data before doing analysis, and
it also helps us find data quality issues.
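A sketch of the guarded cast described here, using the same hypothetical sample values as stand-ins for the data in the video. The aliases IsValidDate and NewOrderDate are illustrative.

```sql
-- Cast only the strings that pass the ISDATE check; the rest become NULL.
SELECT
    OrderDate,
    ISDATE(OrderDate) AS IsValidDate,
    CASE
        WHEN ISDATE(OrderDate) = 1 THEN CAST(OrderDate AS DATE)
        -- no ELSE branch: values that fail the check fall through to NULL
    END AS NewOrderDate
FROM (
    SELECT '2025-08-32' AS OrderDate UNION ALL  -- hypothetical invalid value
    SELECT '2025-08-20' UNION ALL
    SELECT '2025-08-21' UNION ALL
    SELECT '2025-08-23'
) AS t;
```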
So for example we can go over here and say, you know what, let's search
for all issues. We're going to take the ISDATE check and say: let me see
all string values that are invalid, that are failing the test. So let me
execute it. And with that we are getting this record. Now imagine we have
a lot of data. It's now really easy to identify those issues just by using
ISDATE. So this is also an amazing way to identify data quality issues.
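The same check can be moved into a WHERE clause to list only the problem rows. A sketch, again over hypothetical sample values:

```sql
-- Return only the strings that fail the date check.
SELECT
    OrderDate
FROM (
    SELECT '2025-08-32' AS OrderDate UNION ALL  -- hypothetical invalid value
    SELECT '2025-08-20' UNION ALL
    SELECT '2025-08-21' UNION ALL
    SELECT '2025-08-23'
) AS t
WHERE ISDATE(OrderDate) = 0;
```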
Now of course you might say, you know what, I don't want to see a NULL
here, maybe let's get a dummy value. Well, it's very easy. We can go over
here and say ELSE, and we can put for example a very large value,
something like this, that is easy to identify. So now with that, instead
of getting NULLs inside your data, you can get such a dummy value.
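A sketch of the ELSE variant. The dummy value '9999-01-01' is an assumption (the transcript only says "a very large value"); any easily recognizable date would do.

```sql
-- Replace failed conversions with a recognizable dummy date instead of NULL.
SELECT
    OrderDate,
    CASE
        WHEN ISDATE(OrderDate) = 1 THEN CAST(OrderDate AS DATE)
        ELSE '9999-01-01'  -- assumed dummy value, implicitly converted to DATE
    END AS NewOrderDate
FROM (
    SELECT '2025-08-32' AS OrderDate UNION ALL  -- hypothetical invalid value
    SELECT '2025-08-20' UNION ALL
    SELECT '2025-08-21' UNION ALL
    SELECT '2025-08-23'
) AS t;
```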
So now you understand the use case of ISDATE and why this function is
amazing for doing data cleanup.

All right. So with that we have covered 13 different date and time
functions in SQL. We have learned how to extract the date parts using
seven different functions, and we have learned when to use which one. They
are amazing for doing data aggregations and filtering. Then we have
learned how to change the date format from one to another and how to
change the data types. Then we learned how to do mathematical operations
on our dates: how we can add or subtract days, years or months from a
date, or the amazing function DATEDIFF, where we can find the difference
in days or years between two dates. And with the last one we can validate
whether the values that we have are dates or not. So as we learned, date
functions are amazing functions for doing data analysis and reporting. All
right my friends. So with that we have learned a lot of very important SQL
functions and how to manipulate the date and time values in your database
using SQL. Now in the next section we're going to start talking about the
null functions in order to handle the NULLs inside your tables. So let's
go.
So what are the NULLs? Imagine you are filling out a form. There will
usually be fields that are required and other fields that are optional. So
what usually happens? We leave those optional fields unanswered. We don't
provide any values and we leave them empty. Once we are done filling out
the form and we click on register, the data will be inserted into database
tables. So now what happens? The fields where you have provided answers
and values are filled inside the table, while the unanswered fields have
no value, and this is what we call a NULL in SQL. So in databases a NULL
means nothing, unknown. It is not equal to anything. It is not equal to
zero, or an empty string, or a blank space. A NULL is simply nothing. It
tells us there is no value and it is missing. It's like saying I don't
know what this value is.
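A small sketch of the "not equal to anything" point. The column name Answer is hypothetical, and the behavior shown assumes SQL Server's default ANSI_NULLS setting.

```sql
-- Comparing NULL with '', ' ' or '0' never evaluates to true,
-- so every CASE below falls through to its ELSE branch.
SELECT
    CASE WHEN Answer = ''  THEN 'match' ELSE 'no match' END AS VsEmptyString,
    CASE WHEN Answer = ' ' THEN 'match' ELSE 'no match' END AS VsBlankSpace,
    CASE WHEN Answer = '0' THEN 'match' ELSE 'no match' END AS VsZero
FROM (SELECT CAST(NULL AS VARCHAR(10)) AS Answer) AS t;
```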
saying I don't know what this value is. So this is what a null means in
SQL. All right friends, so now we're going to do a deep dive into special SQL
going to do a deep dive into special SQL functions on how to handle the nulls
functions on how to handle the nulls inside our data. Now in some scenarios
inside our data. Now in some scenarios we have nulls inside our tables and we
we have nulls inside our tables and we would like to go and remove it and
would like to go and remove it and replace it with a new value like for
replace it with a new value like for example 40. And in order to do that in
example 40. And in order to do that in scale we have two functions. The first
scale we have two functions. The first one called is a null and the second one
one called is a null and the second one called coales. But now let's say that we
called coales. But now let's say that we have another scenario where we have a
have another scenario where we have a value inside our table like the 40 and
value inside our table like the 40 and we want to go and make it as a null. So
we want to go and make it as a null. So now we are doing the exact opposite. We
now we are doing the exact opposite. We are replacing the value with a null and
are replacing the value with a null and for that we have the SQL function null
for that we have the SQL function NULLIF. So as you can see, with those two
scenarios we are replacing things: from null to a value, or from a value to
null. They are really helpful for manipulating the data inside our databases.
Now moving on to another scenario where we don't want to manipulate anything.
We don't want to replace or convert anything; we just want to check whether we
have a null value in our database. For that we have the keyword IS NULL. Note
that there is a space between IS and NULL, so it is different from the first
function, ISNULL. If you apply IS NULL you get a boolean, true or false. For
this scenario you will get true. The second option is to check whether the
value is not null. For that we can use IS NOT NULL, and for this example you
will get false. So in the output we are getting a boolean, true or false. Those
keywords are really useful for checking whether we have nulls inside our data.
So this is the big picture of all the functions that we have in SQL to handle
nulls. Now let's go and understand those functions one by one.
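As a rough sketch of the IS NULL and IS NOT NULL checks: the script below runs them from Python against an in-memory SQLite database. SQLite is only a stand-in for the SQL Server used in the course, and the orders table, its columns and its two rows are made up for illustration. SQLite has no boolean type, so true and false come back as 1 and 0. In SQL Server the same checks would normally sit in a WHERE clause or a CASE expression rather than directly in the SELECT list.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table: order 1 has a shipping address, order 2 does not.
conn.execute("CREATE TABLE orders (order_id INTEGER, ship_address TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "A"), (2, None)])

# IS NULL / IS NOT NULL used as flags (SQLite returns 1 for true, 0 for false).
flags = conn.execute(
    """
    SELECT order_id,
           ship_address IS NULL     AS is_missing,
           ship_address IS NOT NULL AS is_present
    FROM orders
    ORDER BY order_id
    """
).fetchall()

# The same check used as a filter, which is the portable form.
missing_ids = [row[0] for row in conn.execute(
    "SELECT order_id FROM orders WHERE ship_address IS NULL ORDER BY order_id"
)]

for row in flags:
    print(row)
print("orders without a shipping address:", missing_ids)
```

Nothing is replaced here; the data stays as it is and the query only reports where the nulls are.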
So let's start with the first function, ISNULL. ISNULL replaces a null with a
specific value. The syntax of ISNULL is very simple. We use the keyword ISNULL
and it accepts two arguments: first the value, and second the replacement
value. So let's have an example. We can use ISNULL for the column called
shipping address. We are checking for nulls inside it, and if SQL encounters a
null, it replaces it with the value 'unknown'. So this is like a default value
for the nulls. The first value is a column and the second value is static: it
is always going to be 'unknown' if we find any nulls. Now of course, in other
scenarios we don't want it to always be 'unknown'. We would like to use another
column to help the first one. So let's have this scenario. With this syntax we
are checking the values of the shipping address, and if we find any nulls, the
replacement comes from the billing address. So here in this example we have two
columns and no static value. We get the value of the billing address only if
the shipping address is null. So here we are replacing the nulls with the help
of another column, and in the first scenario we are replacing the nulls with a
static value, the default value. So let's have a very simple example in order
to learn how this works. What we are doing is checking whether the value is
null. If yes, we get the value from the replacement, and if the value is not
null, we show the value itself. So we have the following example. We are going
to check the values from the shipping address, and if there are nulls, replace
them with the default value 'N/A'. So let's see how SQL executes this very
simple example. We have two orders. For the first order we check the shipping
address: is the value of this address null? Well, no. We have the value A. So
that's why SQL returns the same value, and in the output we get A. If it's not
null, it returns the same value. Now SQL moves to the second order, and here we
have the shipping address as a null. So what is going to happen here? If the
value is null, then we get the replacement value. And what is the replacement
value? It is 'N/A'. So that's why in the output we will not get a null, we will
get 'N/A'. So if you check the result, what happens? We get the addresses from
the shipping address, and only if we have a null do we get the default value.
It's very important to understand: if you are using a default value, you will
never get a null in the output. All right. So let's have another example for
the second scenario, where we are not using a default value, we are using a
column. So we have a supporting column that supplies the replacement. In this
scenario we are saying ISNULL, shipping address and billing address. So we have
two columns, and of course the logic is going to be the same, right? We are
checking only once. Let's see how SQL executes this example. This time we have
three orders, and we have addresses from the shipments and as well from
billing. Now SQL always focuses on the shipping address, since it is the first
column. We are not checking the billing address at all. So it starts with the
first order. Is it null? Well, no, we have the value A. So we get it in the
output, and SQL will not take anything from the billing address. So we get A.
That's it for the first order. Now SQL goes to the second order, and this time
we have a null. In the rule we are saying: if the shipping address is null, go
get the value from the billing address. So this time we go to the replacement,
right? We get the value C in the output because the shipping address is null.
Now let's move to the third row. As you can see, here we again have a null. So
SQL goes and gets the value from the billing address. But in this scenario the
billing address is null as well. That's why we get the value null in the
output. So as you can see, when the replacement values come from a column,
there is no guarantee that there will always be a value. Like here in the third
order, it is a null, and that's why we get null in the output as well. So if
you think you are using ISNULL to replace all the nulls by having two columns,
you might still end up with a null in the output if the replacement column
contains nulls. If you want to make sure you don't get any nulls in the output,
you have to use a static value. So this is how SQL executes ISNULL.
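The two ISNULL scenarios can be sketched like this, using Python with an in-memory SQLite database as a stand-in for SQL Server. SQLite has no two-argument ISNULL, so the sketch uses IFNULL, which plays the same role there. The table and column names are hypothetical, and the billing address 'B' for the first order is an assumption, since the walkthrough never reads it.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, ship_address TEXT, bill_address TEXT)"
)
# Three orders as in the walkthrough; 'B' for order 1 is an assumed value.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "A", "B"), (2, None, "C"), (3, None, None)],
)

# IFNULL(value, replacement) is SQLite's counterpart of SQL Server's ISNULL.
rows = conn.execute(
    """
    SELECT order_id,
           IFNULL(ship_address, 'N/A')        AS with_default,   -- static replacement
           IFNULL(ship_address, bill_address) AS with_fallback   -- column replacement
    FROM orders
    ORDER BY order_id
    """
).fetchall()

for row in rows:
    print(row)
```

With the static default, no row can come back as null. With the fallback column, the third order still comes back as null because its billing address is null too.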
All right. So what is COALESCE? COALESCE returns the first non-null value from
a list. Now the syntax of COALESCE is way better than ISNULL. Here it accepts a
list of many values. So for example we have value 1, 2, 3, and you can add
four, five, as many as you want. So we are creating a list of values to be
checked. For example, we can still use it like ISNULL, where we have the
shipping address and replace the null with a static value, 'unknown'. Or, as we
learned, we can use two columns, the shipping address and the billing address.
So far these are the same use cases as ISNULL. But of course COALESCE is not
limited to two values; we can use three. So we are saying: go check the
shipping address; if it's null, then go check the billing address; if that is
null as well, then use the default value at the end, the static one, 'unknown'.
So as you can see, we can use more than two values with COALESCE. Okay. So now
let's understand how COALESCE works. The workflow is similar to ISNULL. In this
example we have two columns, the shipping address and the billing address. SQL
considers them as a list and starts checking from left to right. So it checks
whether the first value, from the shipping address, is null. If it's not null,
then we get value one, so we get the value from the shipping address. And if it
is null, then SQL gets value two, so we get the value from the billing address.
Now we have similar data. We have three orders. Let's see how SQL executes it.
It starts with the first row and focuses on the shipping address. Here the
value is not null, it is A. That's why we get value one. We get the value from
the shipping address and nothing else is checked. Now moving on to the second
row. This time the shipping address is null. So SQL gets the value from the
second column, and it's going to be C, right? So in the output we get C. Now to
the last row. The shipping address is null, so SQL gets the value from the
second column, and this time we get a null as well, like with the ISNULL
function. So in the results we are getting exactly the same result as ISNULL.
For this scenario it doesn't matter whether you use ISNULL or COALESCE. Now of
course we are still not happy with that, because I don't want to see any nulls
in the output, and I still want to use the billing address rather than only a
static value. So I would like to have the values from the billing address, and
as well a default value at the end so that I don't have any nulls in the
output. So how are we going to solve it? Now we can use the power of COALESCE,
where we can include multiple values in one function. What we're going to do is
have the shipping address first, then the billing address, and at the end the
default value. So now we have a list of three values, and of course our
workflow is going to be a little bit bigger. Again it starts from the left and
goes to the right. First it checks value one. If it is null, then it checks
value two as well. And if value two is null as well, we get the last value,
value three. So now let's run the example again using the new COALESCE. The
first thing we check is the first value, which is the shipping address for
record number one. As you can see, the value is not null. We have an A here. So
what is going to happen? We get the value A in the output. That means this path
is activated and we will not check anything else. The first value is returned
and everything else is ignored. SQL will not check anything further. So as you
can see, we are returning the first non-null value. Now let's move to the
second order. We check the first value again. Is it null? Well, yes. As you can
see, we have a null here. So that means we activate the path over here on the
right side. Now SQL will not blindly put anything from the billing address into
the results. First SQL has to check whether it's null or not. It is not null,
it is C, so SQL returns it in the output. We have activated this path, so SQL
is returning value two, which is the value from the billing address. Now let's
move to the third order. SQL first checks the shipping address. Is it null?
Well, yes, it is null. So SQL starts checking the second value. This time SQL
will not return the billing address value, since it's null. It returns the
third value. And what is the third value? It is our static value, 'N/A'. So in
the output we get 'N/A', our default value. With that, as you can see, we will
not get any nulls in the output. We are using the default value and as well
multiple columns. So if you check the output, the first priority is always the
values from the first column, the shipping address. If it's null, then the
second priority is the billing address. If that's null, then the last priority
is the default value. So as you can see, SQL is checking the values from left
to right, and it stops immediately once it encounters the first non-null value
and returns it in the results. So this is how COALESCE works.
All right. So now let's have a quick summary of the differences between
COALESCE and ISNULL. As we learned, ISNULL is limited to two values, whereas
with COALESCE you can have a list of multiple values, which is a great
advantage compared to ISNULL. Now if we are talking about performance, ISNULL
is faster than COALESCE. So if you want to optimize the performance of your
query, then go with ISNULL. Now there is another problem with ISNULL: we have
different keywords for different databases. For Microsoft SQL Server we use
ISNULL, as we learned. Oracle has a different implementation and uses NVL, and
in other databases like MySQL you have IFNULL. All three functions do the same
thing, but we have different implementations for different databases. On the
other hand, COALESCE is available in all the different databases. So here we
have something like an agreement, a standard, between the databases on using
COALESCE. Again, this is a great advantage for COALESCE, because if you are
writing scripts and someday you want to migrate from one database to another,
with COALESCE you don't have to change anything. But if you are using ISNULL,
then you have to go and adjust your queries and scripts with the correct
functions. That's why I always tend to use COALESCE and avoid ISNULL. Only if
it's really necessary, when I have really bad performance, do I go and try
ISNULL. But I usually stick with COALESCE. So that is my advice for you: go
with COALESCE and stick with the standard.
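A small sketch of the two-value limit, run from Python against in-memory SQLite. This is only an illustration on one engine: SQLite's IFNULL stands in for the vendor-specific functions (ISNULL, NVL, IFNULL), and the behaviour shown is SQLite's, not a statement about SQL Server or Oracle.

```python
import sqlite3

# In-memory SQLite database; IFNULL here stands in for the vendor-specific
# two-argument functions mentioned above (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Two arguments work with both functions.
ifnull_two = conn.execute("SELECT IFNULL(NULL, 'fallback')").fetchone()[0]

# COALESCE accepts a longer list and returns the first non-null entry.
coalesce_three = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'fallback')"
).fetchone()[0]

# SQLite rejects a third argument to IFNULL.
try:
    conn.execute("SELECT IFNULL(NULL, NULL, 'fallback')")
    ifnull_accepts_three = True
except sqlite3.OperationalError as exc:
    ifnull_accepts_three = False
    print("IFNULL with three arguments failed:", exc)

print(ifnull_two, coalesce_three, ifnull_accepts_three)
```

A query written with COALESCE would not need this kind of rewrite when a third fallback is added or when the script moves to another database.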
Now the use cases of COALESCE and ISNULL are very similar, and we mainly use
them to handle nulls before doing any SQL task. For example, we can use them to
handle nulls before doing data aggregations. So let's understand what this
means. Imagine that we have three sales: 15, 25, and a null. Now if you use an
aggregate function like the average, what's going to happen? SQL calculates it
like this: 15 + 25 divided by two, and the average is going to be 20. So as you
can see, SQL is including only the two values 15 and 25 and totally ignores the
null value. The null is not included in the calculation, because if SQL did
that, the output would be null as well. So the nulls are totally ignored. The
same thing happens with the other aggregate functions, like SUM, COUNT if you
are counting the sales, MIN and MAX. There is only one exception, for the
aggregate function COUNT. If you are using it with the star, SQL is not
considering the values, it is considering the rows. That's why SQL includes all
those rows and the final output is going to be three. Now in some scenarios, if
your business understands the null as zero, then you're going to have a problem
with the results of your analysis if you don't handle the nulls. So what do we
have to do? We have to handle the nulls before doing the aggregations. We have
to replace the null with zero using either ISNULL or COALESCE. Once you do
that, the calculation for the average changes. It's going to be 15 + 25 + 0
divided by 3, and the output this time is going to be 13.3. With that you get
more accurate results for the business if they understand nulls as zero.
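The arithmetic for the three sales can be checked with a short script, again using Python with in-memory SQLite as a stand-in for SQL Server. The sales table and its amount column are hypothetical names. SQLite's AVG returns a floating-point number, so the averages print as 20.0 and about 13.33.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INTEGER)")
# The three sales from the example: 15, 25 and a null.
conn.executemany("INSERT INTO sales VALUES (?)", [(15,), (25,), (None,)])

avg_raw, total, count_values, count_rows, avg_zero_filled = conn.execute(
    """
    SELECT AVG(amount),               -- nulls ignored: (15 + 25) / 2
           SUM(amount),               -- nulls ignored: 15 + 25
           COUNT(amount),             -- counts non-null values only
           COUNT(*),                  -- counts rows, nulls included
           AVG(COALESCE(amount, 0))   -- null treated as 0: (15 + 25 + 0) / 3
    FROM sales
    """
).fetchone()

print("AVG(amount)              =", avg_raw)
print("SUM(amount)              =", total)
print("COUNT(amount)            =", count_values)
print("COUNT(*)                 =", count_rows)
print("AVG(COALESCE(amount, 0)) =", round(avg_zero_filled, 2))
```

Whether 20 or 13.33 is the right answer depends on what a null means to the business; the query only makes that choice explicit.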
All right. So now we have the following example. It says: find the average
scores for the customers. So let's go and solve it. We select the customer ID
and the score from the table customers. Let's execute it. As you can see, we
have four customers with a score, and the last one doesn't have any score, so
we have it as a null. Let's calculate the average of the score, and I would
like to use a window function in order to see the details as well. So this is
the average scores. Let's execute it. Now what is going on here? The four
values are added together and divided by four, and the null is totally ignored.
Of course the question is what the business understands by the null. If it is
zero, then we have inaccurate results. So let's go and fix it. This time we are
going to have the average, but instead of the score directly, we handle the
nulls first. We have to replace any nulls with zero. We can use COALESCE or
ISNULL. I will go with COALESCE like this: score, and if you find any null,
make it zero. So that's it, and I will use the window function as well. Let's
call it average scores two. Now let's execute it. As you can see, in the output
we got 500, and this is different from the previous average because we have
replaced the null with zero. Let's display it in order to understand it. I will
copy it and put it here. Let's call it score two and execute it. So now SQL
sums all those values and divides by five, and that's why we are getting 500.
So if our business understands the null as zero, this average is more accurate
after we handle the null. As you can see, in some scenarios we have to handle
the nulls before doing any data aggregations.
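The customers query can be sketched like this, using Python with in-memory SQLite as a stand-in for SQL Server (window functions need SQLite 3.25 or newer). The column names and the four scores are assumptions: the scores were picked so that the zero-filled average comes out at 500, matching the result described above.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, score INTEGER)")
# Assumed scores: four customers with a score, the fifth with a null.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, 350), (2, 900), (3, 750), (4, 500), (5, None)],
)

rows = conn.execute(
    """
    SELECT customer_id,
           score,
           AVG(score) OVER ()              AS avg_scores,   -- null ignored
           COALESCE(score, 0)              AS score2,       -- null shown as 0
           AVG(COALESCE(score, 0)) OVER () AS avg_scores2   -- null counted as 0
    FROM customers
    ORDER BY customer_id
    """
).fetchall()

for row in rows:
    print(row)
```

The empty OVER () keeps every customer row in the output and repeats the overall average next to it, which is why both averages can be compared row by row.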
All right, moving on to the next use case for COALESCE and ISNULL. We can use them to handle the NULLs before doing any mathematical operations. Let's understand what this means using the plus operator. If you use the plus operator between two numbers, like 1 + 5, you are adding the values and you will get 6. And if you use the plus operator between string values, like 'A' + 'B', you are concatenating and the output is going to be 'AB'. Now if you replace the 1 with a value like zero, so 0 + 5, we will get 5. Nothing fancy about that. And for the strings, if you replace a value with an empty string, so there are zero characters between the two quotes, plus the 'B', in the output you will get only 'B'. So it's fine and nothing is critical.
But now we come to the problem. If you replace the 1 with NULL, in the output you will get NULL, because you are saying 5 plus something that I don't know. SQL says: okay, you are adding a value to no value, it is unknown, so I don't know what the answer is going to be either. That's why the result is NULL. The same thing happens with anything else, like strings. If you say NULL + 'B', SQL says the same thing: the NULL is unknown, so the answer is unknown as well. So my friends, this is very critical in analysis and when working with data. It means we have to handle the NULLs before doing any mathematical operations. And this is not only for the plus operator, it's as well for the other operators like minus and so on.
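
The operator behavior described above can be checked with a minimal sketch. It assumes Python's built-in sqlite3 module, where string concatenation is written `||`; the `+` concatenation in the transcript is SQL Server syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Numbers: a NULL on either side of an arithmetic operator yields NULL.
numbers = conn.execute("SELECT 1 + 5, 0 + 5, NULL + 5, 5 - NULL").fetchone()

# Strings: SQLite concatenates with ||; a NULL operand makes the result NULL.
strings = conn.execute("SELECT 'A' || 'B', '' || 'B', NULL || 'B'").fetchone()

print(numbers)  # (6, 5, None, None)
print(strings)  # ('AB', 'B', None)
```

Python shows SQL NULL as `None`. The empty string behaves like zero does for addition, while NULL makes the whole expression unknown.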
All right. So now let's have the following task. It says: display the full name of the customers in a single field by merging their first and last names, and add 10 bonus points to each customer's score. So let's go and solve it. We're going to select the basic information first. Let's get the customer ID. What else do we need? The first name, the last name, and the scores. So that's it, from Sales.Customers. Let's go and execute it.
Now the first part of the task is to generate a new field called full name, where we have to merge, or concatenate, the first and last names. So let's go and do that. We need the first name, plus, then a space between the first and last name, then plus the last name, as full name. Let's go and execute it. Now if you check the result for the first customer, it is working. We have Joseph Goldenberg. The same thing for the second customer. But for the third customer we have a problem. This customer doesn't have a last name, but she has a first name, Mary. The full name here is completely NULL, which is not correct. For this example we should at least show the first name Mary, even though the last name is missing. So the result is not really accurate, and that's because we are using the plus operator between a NULL and Mary.
That means we have to handle the NULLs before using the plus operator. Again, here we can go with COALESCE or ISNULL. So let's create a new field using COALESCE. It's going to be the last name, and now we have to define a replacement value in case it's NULL. We could have something like 'unknown', or we could have an empty string, and we can do that using two quotes with nothing between them. So we are using an empty string. Let's check the results: call it last name two and execute it. Now we can see that the last name for Mary is an empty string and no longer a NULL. So now SQL knows this is a string with no characters inside it. With that, SQL has more information and we can concatenate the values. So let's go and do that. We're going to take the whole expression and replace the last name with the COALESCE. Let me just remove this last name over here and execute it.
So now, as you can see, things look better. In the full name for Mary we now have only the first name. And of course, if you don't like it like this and you would like to have another default value, you can go over here and say something like N/A, not available. Let's go and execute it. With that you can see immediately that there is a missing last name. But it doesn't really look good, so I will just remove it and go with the empty string. Let's execute it. With that we have solved the first part of the task: we have the full names and we are not missing any information from the first name and the last name.
Now let's go to the second part of the task, where we have to add 10 bonus points to each customer's score. So we have to add 10 to each score. Let's go and do it. I'm going to put it at the end: score + 10, and let's give it the name score with bonus. That's it, let's execute it. In the output you can see it's very easy. We have added 10 to each score, so we have increased the score points for each customer. But for the last customer, Anna, you can see she doesn't have a value in the scores, and that's why SQL didn't add the 10. We get a NULL here as well. And of course this might not be fair: the last customer is not getting any points even though we have increased the score for all the others.
So that means we have to handle the NULL by replacing it with zero, and only after that add the 10 to it. Let's go and do that. I'm going to add a COALESCE: if the score is NULL, make it zero, and afterward add the 10 points. Let's execute it. Now, as you can see in the results, everything is fair. We have 10 bonus points for each customer, even if the customer doesn't have any value in the scores. Anna has a NULL, but she is still getting 10 points. So here again, as you can see, if you don't handle the NULLs correctly before doing the mathematical operations, you might get unexpected results. So be careful with the NULLs and handle them correctly before adding anything.
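
The whole task can be sketched as follows. This is an illustration under assumptions, not the exact query from the video: it runs on Python's built-in sqlite3 module (so concatenation uses `||` instead of SQL Server's `+`), the column names are hypothetical, and the three rows are made-up sample data that reuse the first names mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, first_name TEXT, last_name TEXT, score INTEGER)"
)
# Hypothetical rows: Mary has no last name, Anna has no score.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Joseph", "Goldenberg", 350), (2, "Mary", None, 750), (3, "Anna", "Adams", None)],
)

# Without NULL handling: one NULL operand turns the whole expression into NULL.
naive = conn.execute(
    "SELECT first_name || ' ' || last_name, score + 10 FROM customers ORDER BY id"
).fetchall()

# With NULL handling: replace NULL first, then concatenate or add.
handled = conn.execute(
    """
    SELECT first_name || ' ' || COALESCE(last_name, ''),
           COALESCE(score, 0) + 10
    FROM customers
    ORDER BY id
    """
).fetchall()

print(naive)    # [('Joseph Goldenberg', 360), (None, 760), ('Anna Adams', None)]
print(handled)  # [('Joseph Goldenberg', 360), ('Mary ', 760), ('Anna Adams', 10)]
```

Note that Mary's full name keeps the separating space at the end ('Mary '), because only the last name was replaced with an empty string.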
Okay, moving on to the next use case for COALESCE and ISNULL. We can use them to handle the NULLs before doing joins. This is a slightly advanced use case, but it's very important to understand it. So let's understand why this is important. Let's have, for example, two tables, table A and table B. In some scenarios we have to combine those two tables using joins. In order to join two tables, we have to specify the keys between table A and table B to join on. In this example, we have two keys to join the tables. Now, if those keys don't have any NULLs inside them and all the data is filled, then your join is going to work perfectly and you will get the expected results. But you might have a special case where there are NULLs inside the keys. So there are missing values, and this is a big problem, because in the output you will get unexpected results and some records will be totally missing. In this scenario we have to handle the NULLs inside the keys before doing the joins. Let's have a very simple example in order to understand this behavior.
All right. So now let's have this very simple example where we have two tables and we want to combine them. In the first table we have year, type, and orders, and in the second table we have year and type as well, plus sales. We would like to combine those two tables in order to have all the information in one result. We can of course use an inner join between table one and table two. As for the keys of the join, as you can see, we have the year in both tables and the type as well. So we're going to use both of those columns as the key for the join.
Let's go step by step through how SQL is going to execute this. We need the year and the type in the results, so SQL is going to take those two columns to the results. We also need the orders and the sales, so it's going to take the orders from the first table and the sales from the second table. Now let's start doing it row by row. The first key is made of those two columns: we have 2024 and the type A. SQL is going to start searching for those two values in the second table, and as you can see we have a match, right? Since it's an inner join, it is going to present in the output only the matching rows from left and right. So in the output we're going to get the whole row from table one, and we will get the sales from table two.
All right, so that's all for the first row. Now let's move to the second row. What are the values of the keys? We have 2024 and NULL. If you check the matches on the right side, you can see we have a match here, right? It is logical: it's 2024 and NULL as well, so everything is matching and we should get it in the result. Well, SQL cannot use the equal operator to compare NULLs when joining tables. So even though it logically makes sense to have this row in the output, SQL cannot compare the NULLs. That's why this is a problem: for this combination SQL will not find any match, so we will not get any information for the combination of 2024 and NULL. For us in the business, this means missing information and inaccurate results. So we're going to miss this row, and SQL is going to jump to the third row.
Here, what are the values of the key? We have 2025 and B. SQL is going to search for it in the second table and it's going to find a match over here. So in the output we're going to get those values: the orders are going to be 50 and the sales 300. Now it's going to go to the last row, and here we have the same problem again. We have 2025 and NULL. Of course, if you check the data you will say yes, we have a match over here, but SQL will ignore it. We have exactly the same situation, and we will not find this row in the results. So in the output we will get only two rows, even though those two tables are identical if you compare the keys. With that we are losing data and providing inaccurate results. So my friends, if you have NULLs inside your keys, you will lose records in the output. That's why it's very important to handle the NULLs inside the keys before doing the joins.
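
The lost rows can be reproduced with a small sketch. It assumes Python's built-in sqlite3 module, and the table names `table1` and `table2` are hypothetical. The sales values and most of the orders values follow the walkthrough; the orders value of the first row (30) is not stated in the transcript and is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (year INTEGER, type TEXT, orders INTEGER)")
conn.execute("CREATE TABLE table2 (year INTEGER, type TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(2024, "A", 30), (2024, None, 40), (2025, "B", 50), (2025, None, 60)],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [(2024, "A", 100), (2024, None, 20), (2025, "B", 300), (2025, None, 200)],
)

# NULL = NULL does not evaluate to true, so rows with a NULL type never match.
rows = conn.execute(
    """
    SELECT t1.year, t1.type, t1.orders, t2.sales
    FROM table1 AS t1
    INNER JOIN table2 AS t2
        ON t1.year = t2.year AND t1.type = t2.type
    ORDER BY t1.year, t1.orders
    """
).fetchall()

print(rows)  # [(2024, 'A', 30, 100), (2025, 'B', 50, 300)]
```

Only two of the four rows survive the join, although both tables contain the same four key combinations.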
All right, so now, in order to fix it, we're going to use either COALESCE or ISNULL in the join. As you can see, we are not using the type directly. We are handling it by replacing the NULL with an empty string. It doesn't matter which value you are using. The main thing is that you have a value and SQL can map it. So you could have an empty string, a blank, or any default value. But I usually go with the empty string, since it's a little bit faster than having any other characters. So now what is going to happen is that those NULLs are replaced everywhere with an empty string. Now we don't have any NULLs inside our keys, so let's see what happens.
We're going to start with the first row again. Here we have a match from the right table and we're going to see the whole record in the output, so we will get the sales as 100 as well. Now SQL is going to go to the second row. This time we don't have a NULL, we have 2024 and an empty string. SQL is going to search for a match and it's going to find it over here: we have 2024 and an empty string as well. So what happens in the output? We're going to get 2024, but for the type we will get a NULL. We will not get an empty string, we will get a NULL, and that's because we are handling the NULL only in the join.
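
A minimal sketch of this fix, again assuming Python's built-in sqlite3 module, hypothetical table names, and a made-up orders value (30) for the first row. COALESCE is applied only inside the ON clause, so the selected type column still shows the original NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (year INTEGER, type TEXT, orders INTEGER)")
conn.execute("CREATE TABLE table2 (year INTEGER, type TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(2024, "A", 30), (2024, None, 40), (2025, "B", 50), (2025, None, 60)],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?)",
    [(2024, "A", 100), (2024, None, 20), (2025, "B", 300), (2025, None, 200)],
)

# The NULL keys are replaced with '' only for the comparison in ON;
# the SELECT list reads t1.type unchanged.
fixed_rows = conn.execute(
    """
    SELECT t1.year, t1.type, t1.orders, t2.sales
    FROM table1 AS t1
    INNER JOIN table2 AS t2
        ON t1.year = t2.year
       AND COALESCE(t1.type, '') = COALESCE(t2.type, '')
    ORDER BY t1.year, t1.orders
    """
).fetchall()

for row in fixed_rows:
    print(row)
# (2024, 'A', 30, 100)
# (2024, None, 40, 20)
# (2025, 'B', 50, 300)
# (2025, None, 60, 200)
```

One caveat worth keeping in mind: the replacement value should not also occur as a real key value, otherwise rows with a genuinely empty type would match rows whose type is missing.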
So as you can see, we have the ISNULL on the type in the join, but we don't have it in the select. In the select, the type is going to be like the original data, and the original data was a NULL. We are handling the NULL in the join only to let SQL understand how to map and match the data. In this example I'm not changing the values in the select, so that's why we will get the original value. For the orders we will get 40, and the sales are going to be 20. Now moving on to the third row. I think you already get it: SQL is going to find the match and the sales are going to be 300.
All right, now we're going to move to the last one, and here we have the same scenario. We have 2025 and an empty string, so it's not NULL anymore. SQL is going to search for those values and it's going to find them over here. In the output the type is NULL, not an empty string, because we didn't handle it in the select. The orders are going to be 60 and the sales are going to be 200. So as you can see, now the result is complete. We successfully combined both tables into one big result using a join, with the help of the ISNULL function, in order to have complete results and not miss any value. So my friends, be very careful: always check whether the keys have NULLs or not, and if you find NULLs, handle them immediately so you don't lose any records in the results and you get accurate analysis.
All right, moving on to the next use case for ISNULL. We can use it to handle the NULLs before sorting the data.
So imagine we have the following sales: 15, 25, and NULL. If you sort the data by the sales ascending, from the lowest to the highest, what happens? SQL is going to show the NULLs at the start. That is not because the NULL is the lowest value, because NULL has no value, but SQL shows it like this. It's going to place the NULL at the start, then below it we're going to have the lowest value, which is 15, and at the end we're going to have 25. Now if you do the exact opposite and sort the data from the highest to the lowest using descending, SQL is going to sort it like this: we're going to have 25, then 15, and the last thing that appears in the list is the NULL. So here SQL is showing the NULLs at the end, and that is again not because NULL is the lowest value, it has no value, but SQL shows it at the end. So this is how SQL deals with the NULLs when you are sorting the data.
In order to understand this use case, let's have the following task. The task says: sort the customers from the lowest to the highest scores, with NULLs appearing last. All right, let's solve it. This is going to be a very interesting one. We need the customer information, so let's select the customer ID and the scores from Sales.Customers and execute it. We have a simple list of all customers and their scores. But now we have to sort the data from the lowest to the highest, so we're going to use the ORDER BY clause, and we need the field score. Since it's lowest to highest, we need ascending, and in SQL that is the default, so we don't have to mention it. Let's go and execute it. Now, as you can see in the results, it starts from the lowest and goes to the highest, and the first part of our task is solved.
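
The default placement of NULLs described above can be checked with a short sketch. It assumes Python's built-in sqlite3 module, which happens to order NULLs the same way as the SQL Server used in the course; some other engines, such as PostgreSQL and Oracle, place NULLs last in ascending order by default. The table name `sales_demo` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (sales INTEGER)")
conn.executemany("INSERT INTO sales_demo VALUES (?)", [(15,), (25,), (None,)])

# Ascending order: the NULL is listed first, then 15, then 25.
ascending = [r[0] for r in conn.execute("SELECT sales FROM sales_demo ORDER BY sales ASC")]

# Descending order: 25, then 15, and the NULL is listed last.
descending = [r[0] for r in conn.execute("SELECT sales FROM sales_demo ORDER BY sales DESC")]

print(ascending)   # [None, 15, 25]
print(descending)  # [25, 15, None]
```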
But now of course we have an issue, right? We have a NULL, and as we learned, SQL is going to put it in first place on the list. But the task says "with NULLs appearing last". So we really don't want to see the NULLs at the start. We don't care about them, so we would like to have them at the end of the list. That means we have to handle the NULLs before sorting the data. And here we have two ways to do it: one that is lazy, and another that is more professional.
Let me show you the lazy way first. We're going to replace the NULL with a very big number. For example, we're going to use COALESCE with the score and then type a lot of digits so that we have a really big score. I just want to select it in order to see the results. As you can see, it's a very big number here. Now take this and replace the score in the ORDER BY with the new expression. That's it, let's execute it. If you check the results, we have already solved the task. We have listed all the customers from the lowest to the highest score, and the NULLs are at the end.
So now the question is, why do we call this lazy or not professional? That's because we are defining a static value. For this example it is working, but we don't know what's going to happen later. Maybe things change and you get a score higher than this value, and then sorting the data will make no sense, since the NULL is going to end up in between the values. Who knows, your replacement value might even be a real value inside the data.
Now let me show you the other way, which is more professional, where we don't
play with luck at all. So let's go and do that. Let me just move this a little
bit here. I'm going to create a new logic where we say CASE WHEN the score IS
NULL THEN we want the value one, ELSE the value is going to be zero, END. So
we are just creating a flag with zero and one: if the score is null we get the
flag one, and if we have a value for the score we get zero. Let's have it like
this, and I will just get rid of this COALESCE. Let's go and execute it. Now
if you check our nice new flag, you can see we have zeros everywhere we have a
value in the score, and only where we have a null do we get the flag of one.
Now that we have this, we're going to sort our data based on this flag and the
score. The task doesn't mention anything about the flag, but we are using it
in order to force the nulls to be at the end of the result. Let me show you
how we're going to do that. Let me just remove all this. First we want to sort
the data by our new flag in order to make sure that the nulls are at the end,
so we put our flag first, and then afterward we sort the data by the score. So
let's go and add the score.
So again, what we are doing: first we sort the data by the flag in order to
push the nulls to the end. Then, for all the rows where the flag values are
equal to each other, SQL is going to sort the data by the score. Both of them
are ascending. Let's go and execute it. Now as you can see, we get exactly the
same results: the values from the lowest to the highest, and the nulls are at
the end. And as you can see, in the ORDER BY we didn't use any static values
or any big numbers. Of course, we don't need the flag in the SELECT, so we can
go and remove it. Let's execute it. And with that we have solved the task.
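The flag-based sort can be sketched like this. As before, the customers CTE, its column names, and its rows are hypothetical stand-ins for the course's customers table.

```sql
-- Hypothetical sample data standing in for the course's customers table.
WITH customers AS (
    SELECT *
    FROM (VALUES
        (1, 'Ava',  350),
        (2, 'Ben',  900),
        (3, 'Cara', 750),
        (4, 'Dan',  500),
        (5, 'Anna', NULL)
    ) AS v (customer_id, first_name, score)
)
SELECT
    customer_id,
    first_name,
    score
FROM customers
ORDER BY
    CASE WHEN score IS NULL THEN 1 ELSE 0 END,  -- flag: rows with a score (0) come before NULLs (1)
    score;                                      -- within the same flag, lowest score first
```

Because the flag appears only in the ORDER BY, no made-up value ever enters the result, and it keeps working whatever range the real scores take.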
So as you can see, we can use those nice functions like COALESCE or ISNULL in
order to handle the nulls before sorting your data.
So what is the function NULLIF? NULLIF is going to compare two values, and it
returns a null if they are equal; otherwise, if they are not equal, it returns
the first value. Okay, so now the syntax of NULLIF. It accepts only two
values: value one and value two. Here again, of course, you can use a column
with a static value, like 'unknown', so we are comparing the values between a
column and a static value. Or you can compare two columns, like the shipping
address and the billing address. So again, it accepts only two values.
We cannot have it like COALESCE, where we have a list of multiple values. All
right, so now let's understand exactly what we mean with NULLIF. The workflow
is going to be like this: SQL is going to check two values, value one and
value two. If they are equal, then SQL is going to return a null. But if the
two values are not equal, it is going to return the first value, the one on
the left side. So by checking the outcomes here, we will never have a scenario
where we get the second value.
That means the second value is always used as a check: we are checking against
this value. So we get either value one or a null. Let's have this very simple
example. We are saying NULLIF price, and we are checking whether it's equal to
minus one. So we are saying: if the price is equal to minus one, then replace
it with a null, because it is a data quality issue that we have a price that
is negative. It makes no sense for our business. If it is minus one, then for
us it means a null: we don't know the price of this product. So we will
correct it using NULLIF. Let's check this very simple example. We have two
orders. SQL is going to start with the first order and check the first value.
What is the first value? It is the price, and here we have 90. SQL is going to
check: is 90 equal to minus one? Well, no. That means it's going to execute
this path, so in the output we get the first value, which is 90. Now let's
move to the second order. Here we have minus one. SQL is going to check: is
this minus one equal to the minus one that we have in the NULLIF? Well, yes.
That means SQL is going to execute the other path, where we get the null value
in the output. Now if you compare the result from NULLIF with the price, you
can see we no longer have the minus one. And as you can see, we are now doing
exactly the opposite of COALESCE and ISNULL: we are replacing a real value
with a null. Now moving on to
the second example, and this is a very interesting one in analytics, where we
use two columns inside the NULLIF. In this example we are saying NULLIF
original price and discount price. So SQL has to compare the prices between
those two columns, and if they are equal it should return a null. Now you
might say: okay, why are we doing this? Well, we can use it in order to
highlight or flag special cases inside our data. The special case here is when
the original price is equal to the discount price. If those two prices are
equal, that means we have an issue in our program, or something went wrong as
we were inserting the data. So let's see what's going to happen. For the first
row we compare the 150 from the original price with the discount price. They
are not equal, right? So that means it returns the original price, the 150, in
the output. Let's move to the second order. Here we have the original price
250, and the discount price is 250 as well. They are equal, and if they are
equal then we get a null in the output. As you can see, again we are not
getting any values from the discount; we are using it only for a check. So
with that we have a quick flag, using the nulls as a flag, in order to
identify where we have equal values. So this is how NULLIF works.
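Both NULLIF examples can be sketched in one query. The orders CTE and its column names are hypothetical, and the two examples are merged into a single table for brevity. The prices 90 and -1 and the original prices 150 and 250 come from the walkthrough; the first row's discount price of 120 is invented, since the lesson only says it differs from 150.

```sql
-- Hypothetical two-row table combining both walkthrough examples.
WITH orders AS (
    SELECT *
    FROM (VALUES
        (1, 90, 150, 120),
        (2, -1, 250, 250)
    ) AS v (order_id, price, original_price, discount_price)
)
SELECT
    order_id,
    price,
    NULLIF(price, -1) AS clean_price,                       -- 90 stays 90, -1 becomes NULL
    original_price,
    discount_price,
    NULLIF(original_price, discount_price) AS price_check   -- 150 for order 1, NULL where both prices match
FROM orders;
```

In both columns the second argument is only ever compared against; it never shows up in the output.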
All right friends, here we have a very nice use case for NULLIF, and that is
preventing the error of dividing by zero. Let's see what this means. Okay,
let's have the following task. It says: find the sales price for each order by
dividing the sales by the quantity. So let's go and solve it. This should be
very easy. We need the order ID, the sales, and the quantity from
Sales.Orders. Let's go and execute it. So now we have 10 orders; those are the
sales and the quantity. Now it's very easy to calculate the price: it's going
to be the sales divided by the quantity, and we're going to call it price.
Let's go and execute it. Now as you can see, we got an error that says: Divide
by zero error encountered. That means somewhere we have a zero for the
quantity, and this is a problem. Let's go and check the data again. I'm just
going to comment out the whole thing, and let's execute it. Now by checking
the result, yes: for order ID 10 we have a quantity of zero. Of course it will
not work if you divide by zero. So how can we solve it? We can use the magic
of NULLIF, where we replace the zero with a null. Getting a null is way better
than getting an error, right? So let's go and do that. I'm just going to
remove the comments.
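A small sketch of the failing division next to the NULLIF fix. The orders CTE, its column names, order 9, and the sales figures are hypothetical; only the zero quantity on order 10 comes from the lesson. The failing expression is left commented out so the query runs on engines that raise an error for division by zero, such as SQL Server.

```sql
-- Hypothetical sample data standing in for the course's orders table.
WITH orders AS (
    SELECT *
    FROM (VALUES
        (9,  60, 2),
        (10, 60, 0)   -- a quantity of 0 is what triggers the error
    ) AS v (order_id, sales, quantity)
)
SELECT
    order_id,
    sales,
    quantity,
    -- sales / quantity AS price,             -- would fail with a divide-by-zero error on order 10
    sales / NULLIF(quantity, 0) AS price      -- 0 becomes NULL, and dividing by NULL yields NULL
FROM orders;
```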
And here we're going to say NULLIF: if the quantity is equal to the value
zero. That's it. Let's go and execute it. Now as you can see, it is working.
With that we are making sure that we are not dividing by zero, because we
replace the zero with a null, and if you divide anything by null you get a
null. If you check the result over here, for order 10 we got the price of
null, which is correct. For all the other values everything is working because
we have values and we didn't replace them with a null; that's why we have
values for the price. This is a very common use case for NULLIF: we can use it
in order to prevent dividing by
zero.
All right, so what is IS NULL? It returns true if the value is null. So it is
checking the value: if it's null it returns true, otherwise it returns false.
The exact opposite happens if you use IS NOT NULL. With these keywords you get
true if the value is not null; otherwise, if it is null, you get false. Okay,
so the syntax for that is very simple. It starts with a value or an
expression, and after that we have the keywords IS, space, NULL. IS NOT NULL
is exactly the same: we have a value, then IS, then the NOT operator, then
NULL. So it's very simple.
Let's have an example. We are checking whether the value of the shipping
address is null, so we can write it like this: shipping address IS NULL. Or we
can check the opposite, whether it's not null: shipping address IS NOT NULL.
It's very easy. Okay, so now let's understand how this works. We are checking
the value. If the value is null, we return true; if it is not null, we return
false. As you can see, it never returns the value itself or any nulls. We are
getting a boolean of true or false, so we are creating something like a
boolean flag in order to assist us with the checks. We have this very simple
example, price IS NULL, and we have those two rows. So we are checking whether
the price is null. In the first order it is not null, right? That's why we get
false in the output. In the second order the value is null, so the check is
correct, and that's why we get true. Now of course, if we use IS NOT NULL, it
is going to be the exact opposite. Is the price not null? Well, yes, it's not
null, that's why you get true over here. For the second check it is null,
right? So the output is going to be false. We get the exact opposite. So
that's it. It's very simple how IS NULL and IS NOT NULL
works.
All right. One very obvious use case for IS NULL and IS NOT NULL is searching
for missing information, or searching for nulls. Maybe after that we can clean
up our data by removing the nulls from our data set. Let's have the following
task. It says: identify the customers who have no scores. All right, let's go
and solve it. This is very simple. Let's start by selecting star from
Sales.Customers, so we get everything. Let's go and execute it. Now as you can
see, we have our five customers. But the task says we have to return all the
customers who have no score. That means the result should return only the last
record, since the score of Anna is null. So let's add a WHERE clause. WHERE,
and now what do we need? We need the score. Then we don't use the equals sign;
we use IS NULL, like this. That's it. Let's go and execute it. And with that,
as you can see, it's very simple: we have filtered our data, and now we can
see all the customers where the score is null. This is a very basic check to
understand whether our data contains nulls. All right, moving on to the next
task. It says: show a list of all customers who have scores. So back to our
example, this time we're going to do exactly the opposite. We want a list of
all customers where we have a value in the scores. So we're going to say WHERE
score IS NOT NULL. If you go and execute it, you can see we get a clean list
where all the customers have a score. With that, we get rid of all the nulls
inside the score field, and maybe this is helpful in order to do further
analysis.
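The two filtering tasks can be sketched as follows. The customers CTE, its column names, and the four scored rows are hypothetical; the lesson only tells us there are five customers and that the last one, Anna, has no score.

```sql
-- Hypothetical sample data standing in for the course's customers table.
WITH customers AS (
    SELECT *
    FROM (VALUES
        (1, 'Ava',  350),
        (2, 'Ben',  900),
        (3, 'Cara', 750),
        (4, 'Dan',  500),
        (5, 'Anna', NULL)
    ) AS v (customer_id, first_name, score)
)
SELECT *
FROM customers
WHERE score IS NULL;        -- task 1: customers without a score (only Anna here)
-- For task 2, customers who have a score, swap the filter:
-- WHERE score IS NOT NULL
```

Writing `score = NULL` instead would match no rows at all, because a comparison with NULL never evaluates to true; that is why the filter uses IS NULL rather than the equals sign.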
All right friends, now we come to a very interesting use case for IS NULL, and
that is introducing a new type of join between tables that's going to help us
find the unmatching rows between two tables. Let's have a quick recap of the
joins in SQL in order to understand the new types. Basically we have two sets,
or let's say two tables, the left and the right. If you use an inner join, we
are finding only the matching rows between the left table and the right table,
so in the result we get only the matching rows. Now we have another type of
join called the left outer join. If you use this type, in the result you get
all the rows from the left table and only the matching rows from the right
table. Then we have another type which is exactly the opposite, the right
join. Here we get all the rows from the right table and only the matching
information from the left table. And now to the last type that we learned: we
have the full join, where we get all the rows from the left and all the rows
from the right as well, so we will not be missing anything. Those are the four
basic joins that we have learned in SQL. But in SQL we have other types as
well that are more advanced, and for those we don't have any keywords in SQL.
The first one is called the left anti-join. What we are saying here is that we
need all the rows from the left table, but this time without the matching
rows. All the information that matches the right table, we don't want to see
in the results. As I said, we don't have an extra keyword for this type of
join. In order to get this effect, we combine the left join with IS NULL. With
that we get all the data from the left side, but without anything that matches
the right side. This is what we call a left anti-join. And we have another
advanced type of join called the right anti-join. This is exactly the
opposite: we are saying all the rows from the right table without any matching
rows from the left table, so all the information on the right side that does
not match the left side. Again, we don't have a keyword for that; we work with
the right join plus IS NULL. So with that, as you can see, we have two new
types of joins added to our four basic joins. Now this might be confusing, so
let's have the following task in order to understand it: show a list of all
details for customers who have not placed any orders. All right, let's see how
we can create the effect of the left anti-join. Let's do it step by step. We
need two tables here: the customers and the orders. Since we are focusing on
the customers, the left table is going to be the customers. So let's go and do
that.
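For orientation, here is a sketch of the left anti-join pattern described above: a LEFT JOIN combined with IS NULL. Both CTEs, their column names, and their rows are hypothetical stand-ins for the course's customers and orders tables. Filtering on o.customer_id is an assumption; any right-table column that is never null in a real match would serve.

```sql
-- Hypothetical sample data: customer 5 has no rows in orders.
WITH customers AS (
    SELECT *
    FROM (VALUES
        (1, 'Ava'),
        (2, 'Ben'),
        (3, 'Cara'),
        (4, 'Dan'),
        (5, 'Anna')
    ) AS v (customer_id, first_name)
),
orders AS (
    SELECT *
    FROM (VALUES
        (1001, 1),
        (1002, 2),
        (1003, 3),
        (1004, 4),
        (1005, 1)
    ) AS v (order_id, customer_id)
)
SELECT c.*
FROM customers AS c
LEFT JOIN orders AS o
    ON c.customer_id = o.customer_id   -- keep every customer, matched or not
WHERE o.customer_id IS NULL;           -- then keep only the customers with no match
```

A right anti-join would mirror this: RIGHT JOIN, with the IS NULL filter on a column from the left table.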
We're going to say SELECT star FROM Sales.Customers. This is our first table,
and we are using the alias C. Let's go and execute it. Now as you can see, we
got the list of all customers, so we have all the details for our customers.
But now we have to join it with the orders. In order to do that, let's have a
new line: LEFT JOIN Sales.Orders, and let's give it the alias O. Now we have
to define the key for the join, so ON: it's going to be the customer ID equal
to the customer ID in the orders table. Now what we're going to do is show the
order ID from the orders table as well, just to see whether we have a match or
not. Let's have it like this and execute it.
let's have it like this and execute it Now let's go and check the results. As
Now let's go and check the results. As you can see those four columns comes
you can see those four columns comes from the table customers and only the
from the table customers and only the last column come from the orders. So now
last column come from the orders. So now what is interesting is to check the
what is interesting is to check the order ID whether we have nulls or not.
order ID whether we have nulls or not. So as you can see for the customer one
So as you can see for the customer one we have everything matching. For the
we have everything matching. For the customer two as well we have orders the
customer two as well we have orders the three as well for only the last one
three as well for only the last one customer ID 5 we have here a null. So
customer ID 5 we have here a null. So that means SQL was not able to find any
that means SQL was not able to find any order for this customer. So again what
order for this customer. So again what this means we have only one customer
this means we have only one customer Anna where she doesn't have any order
Anna where she doesn't have any order but all other customers they did have an
but all other customers they did have an order and that's because we have values
order and that's because we have values from the right table. So once we have
from the right table. So once we have values that means we have matching but
values that means we have matching but since here we have a null that means we
since here we have a null that means we don't have any matching. So now since
Now, the left anti join says we would like to have all the data from the left
table that has no match in the right table. So for this example we would like
to get only this customer, Anna. And this is exactly what fulfills our task.
The task says: list all details for customers who have not placed any order.
So, all data from customers where we don't have a match in the orders. Now I
think you already got how to get this effect. We're going to filter the data
like the following: we're going to have the WHERE clause, and now we need a
column from the right table, from the orders. We're going to go with the
customer ID that comes from the orders, so we're going to say the customer ID
from o IS NULL. Of course you can go with the order ID as well; you're going
to get the same effect. But I always like to use the key that we are using in
the join. So let's go and execute it. And now, as you can see, we got the
effect of the left anti join, and with that we got the customer that we are
aiming for. Here we have the data from the left side that is not matching the
right side: the customers who have not placed an order. With that we have
solved the task. As you can see, we have implemented the left anti join by
combining the LEFT JOIN together with IS NULL. So this is the power of playing
with the nulls in SQL.
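The LEFT JOIN plus IS NULL pattern described above can be sketched as a single query. This is only an illustration, not the exact query from the video: the two CTEs stand in for Sales.Customers and Sales.Orders so that the example runs on its own, and the column names (CustomerID, FirstName, OrderID) and every sample row except "customer 5 is Anna" are assumptions.

```sql
-- Hypothetical stand-ins for Sales.Customers and Sales.Orders.
-- Column names and all rows except "customer 5 = Anna" are assumptions.
WITH Customers AS (
    SELECT *
    FROM (VALUES
        (1, 'Customer 1'),
        (2, 'Customer 2'),
        (3, 'Customer 3'),
        (4, 'Customer 4'),
        (5, 'Anna')
    ) AS t(CustomerID, FirstName)
),
Orders AS (
    SELECT *
    FROM (VALUES
        (1001, 1),
        (1002, 2),
        (1003, 3),
        (1004, 4)
    ) AS t(OrderID, CustomerID)
)
SELECT
    c.*,
    o.OrderID
FROM Customers AS c
LEFT JOIN Orders AS o
    ON c.CustomerID = o.CustomerID   -- the join key
WHERE o.CustomerID IS NULL;          -- keep only left rows without a match
```

With this sample data the query returns a single row, customer 5 (Anna), with a NULL in OrderID. Removing the WHERE clause gives back the plain LEFT JOIN result, where every customer appears and only Anna has a NULL on the right side.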
Now my friends, there is something that really confuses a lot of developers,
or anyone who is working with data in databases and SQL, and that is the
difference between nulls, empty strings and blank spaces. With the nulls, as
we learned, we are saying: I don't know what the value is, it is unknown. On
the other hand, with the empty string you are saying: I know the value, it is
nothing. So the empty string is a string value which has zero characters. This
is totally different from the nulls; with the nulls we don't know anything
about the value.
Now, sometimes it may happen that you are filling in a form, you come to one
field, and by mistake you hit the space bar. With that you are entering a
space into the field, and you just jump to the next field without entering any
other values. So now we have a space character inside the field. This is
really evil in databases, because once the user enters a blank space, the
database is going to store it as a value and it's going to take storage. It
could be one space or many spaces, depending on how long you press the space
bar. So the blank space is a string, but its size is not zero like the empty
string; its size is the number of spaces you have entered. So here it's not
like the null: we know the value, it is a string, and its characters are
spaces.
Okay, so let's see those three scenarios inside SQL. Now, I have some dummy
data here using a CTE statement. Don't worry about it, I'm going to teach you
all of that in the next tutorials. So here we have four rows: the first one
with the value A, the next one with null, the third one with an empty string
(as you can see, there is nothing between those two quotes), and in the last
one we have a space between the two quotes. Now let's go and query this
temporary table: SELECT * FROM orders, and execute. Now, by looking at the
values of the category you can find all the scenarios. The first scenario is
the easiest one, where we have a normal value: we have here an A. But in the
other three rows we don't have normal values, we have like empty stuff. The
first one is the null, so we don't have a value. This is the special marker
from SQL; it says NULL, so there is no value. The other two are really
confusing. As you can see, it's really hard to tell, by just looking at the
data or at the results, whether it is an empty string or a blank space. This
confuses a lot of developers, or anyone working with data, who sees those
results. It's really hard to detect data quality issues by just looking at the
results.
So now, in this scenario, what I do is calculate the length of each value
inside my column. Let's go and do that. In SQL Server we're going to use the
function DATALENGTH, and our field is going to be the category. Let's call it
category length and execute it. And now let's check the result. For the first
one, since we have only one character, the length is going to be one, which is
correct. In the next row we have the category null. We don't know the value,
and so we don't know the length of the value either, right? That's why we get
a null. Now, moving to the next ones, as you can see those two look exactly
the same. But with the help of the DATALENGTH function we can see that the
third category value has a length of zero. That means it is an empty string
and we don't have any hidden characters in there. So with that we are sure
this is an empty string. But now let's move to the last one. Here it is very
tricky and evil: we have a hidden space inside this value, and we can see that
from the length of this field. As you can see, we have here a one, which means
we have one hidden space inside this value and it is not an empty string. So I
have here only one space; let's go and give it another space and calculate the
length. As you can see, we have two spaces, and that's why the length is two.
So don't count on your eyes in order to understand the spaces. Go and
calculate the length in order to be very precise.
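The length check above can be reproduced with a small self-contained query. The transcript does not show the exact CTE used in the video, so the one below is an assumption that only mirrors the four rows described (A, null, empty string, one space); the column names id, category and category_length are illustrative.

```sql
-- Assumed dummy data: four rows matching the scenarios described above.
WITH orders AS (
    SELECT *
    FROM (VALUES
        (1, 'A'),    -- normal value
        (2, NULL),   -- unknown value
        (3, ''),     -- empty string: zero characters
        (4, ' ')     -- blank: one space character
    ) AS t(id, category)
)
SELECT
    id,
    category,
    DATALENGTH(category) AS category_length   -- bytes used by the value
FROM orders;
```

For these rows category_length comes back as 1, NULL, 0 and 1. Two details are worth knowing when adapting this: DATALENGTH counts bytes, so an NVARCHAR column reports two bytes per character, and LEN is not a substitute here because it ignores trailing spaces and would report 0 for the blank value.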
So now let's go and compare the three scenarios side by side.
Let's start with the first one, the representation in the table. The null we
are going to see as NULL inside the table. The empty string is going to be
like two quotes with nothing between them. And the blank space is two quotes
as well, with one or many spaces between them.
Now, if we are talking about the meaning: the null means unknown, we don't
know the value. The empty string is known, but it is nothing, it is an empty
value. And the third one, blank spaces, is known as well, and the spaces are
the value.
Now, if we are talking about the data types: since the null is no value, we
don't have a data type for it; it is like a special marker in SQL. The empty
string has a data type. It is a string, and the size of this string is going
to be zero, since we have zero characters inside the empty string. Moving on
to the blank spaces: it is a string, since a space is a character, and its
size is going to be one or many.
And now, if we are talking about the storage, the null is the best. Nulls
don't occupy a lot of storage, while the empty string and the blank spaces
occupy storage and memory and waste space. So if you are worried about the
storage, the best option here is a null.
Now, talking about the performance, you will get the best performance if you
are using nulls. The empty string is fast as well, but not as fast as the
nulls. The worst option here is the blank spaces; it is slow. So again, if
speed is important for you, you should store those scenarios as nulls.
Now, if we are talking about the comparison, when you are searching for those
values: if you want to search for the null, you have to use IS NULL. On the
other hand, if you want to search for the empty string or the blank spaces,
you have to use the equal operator. So that's all. Those are the main
differences between the null, the empty string and blank spaces.
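The search rules from the comparison can be sketched with one query that flags which predicate finds which row. The CTE and all column names are assumptions made for illustration; only IS NULL, the equal operator and DATALENGTH come from the discussion above.

```sql
-- Assumed dummy data: the same four scenarios as before.
WITH orders AS (
    SELECT *
    FROM (VALUES
        (1, 'A'),
        (2, NULL),
        (3, ''),
        (4, ' ')
    ) AS t(id, category)
)
SELECT
    id,
    category,
    -- NULL can only be found with IS NULL, never with =
    CASE WHEN category IS NULL THEN 'yes' ELSE 'no' END AS found_by_is_null,
    -- empty strings and blanks are searched with =
    CASE WHEN category = ''  THEN 'yes' ELSE 'no' END AS found_by_equal_empty,
    CASE WHEN category = ' ' THEN 'yes' ELSE 'no' END AS found_by_equal_space,
    -- the byte length is what separates an empty string from a blank
    CASE WHEN DATALENGTH(category) = 0 THEN 'yes' ELSE 'no' END AS is_zero_length
FROM orders;
```

One caveat when running this on SQL Server: the = comparison ignores trailing spaces, so both `category = ''` and `category = ' '` match the empty string and the blank row alike. If the two cases have to be told apart in a filter, the DATALENGTH check is the safer test.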
Now you might ask: you know what, why do I have to understand the differences
between all that stuff, the nulls, empty strings and blanks? Everything is
like empty, so why do I care? Well, in real projects, I promise you that you
will be working with sources and data that have bad data quality, and you
might encounter all three scenarios in your data. If you don't do any data
preparation, like cleaning up the data, handling those three scenarios and
bringing standards to your data, and you jump immediately to the analysis,
you will end up providing inaccurate results in your reports and analyses,
which leads to wrong decisions. So preparing your data by cleaning it up,
handling those three scenarios and bringing standards is a very important step
before doing any analysis. So this is how we're going to do it: together with
the stakeholders and the users of your reports and analyses, you have to
define clear data policies. They are like rules, and you have to commit
yourself to following those rules during the implementation.
And here we have three different options. The first one: you can define the
data policy like this: only use nulls and empty strings, but avoid using blank
spaces. In my projects I cannot imagine a scenario where we need blank spaces.
They are just evil; just get rid of them. All right, so with this policy we
have to get rid of all blank spaces inside our data, and in order to do that
we have a wonderful function in SQL called TRIM. The TRIM function removes the
spaces from a string on the left side and on the right side as well, so all
the leading spaces and the trailing spaces are going to be removed. Now, if we
apply the TRIM function on the category, what's going to happen? All the blank
spaces are going to be removed and the value is going to be turned into an
empty string. So let's go and do that. It's very simple: we're going to use
the TRIM function, apply it on the category, and call it policy 1. Let's go
and execute it. Now, by just comparing policy 1 with the category, it looks
identical, but it's not. In order to get a better feeling for this, we can
test it using DATALENGTH. So let's use the DATALENGTH function again: we're
going to use it for the whole result, and I'm also going to use it for the
category, without the TRIM, just to compare. Let's go and execute it. Now, if
you check the result, as you can see, for the category we again have a length
of two, because here we have two spaces, but with policy 1 we have zero. So
those two values, after applying the TRIM function, have a length of zero, and
with that we don't have blank spaces. That means we are now sure that after
applying the TRIM we have either a null or an empty string. So let me just get
rid of all that information. Now I am sure both of them are empty strings. As
you can see, it's very simple: using only one SQL function you are cleaning up
the data and bringing standards.
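Policy 1 and its length check can be sketched as follows. The CTE is an assumption that mirrors the rows described in the video at this point (the blank value has two spaces), and the column aliases are illustrative. TRIM is available from SQL Server 2017; on older versions LTRIM(RTRIM(category)) is the usual substitute.

```sql
-- Assumed dummy data; the last row holds two spaces, as in the walkthrough.
WITH orders AS (
    SELECT *
    FROM (VALUES
        (1, 'A'),
        (2, NULL),
        (3, ''),
        (4, '  ')
    ) AS t(id, category)
)
SELECT
    id,
    category,
    DATALENGTH(category)       AS category_length,  -- before cleanup
    TRIM(category)             AS policy1,          -- blanks become ''
    DATALENGTH(TRIM(category)) AS policy1_length    -- after cleanup
FROM orders;
```

For the blank row, category_length is 2 while policy1_length is 0; the null row stays null in both columns, because TRIM of a null is still null.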
All right, moving on to option two. You can define your data policy like this:
only use nulls, and avoid both empty strings and blank spaces. That means in
our business we don't have anything meaningful for the empty string and the
blank spaces; we can use only the nulls. Okay, so now let's go and implement
this rule. We have to convert a value to a null, and the value is going to be
the empty string. As we learned, we can use the NULLIF function in order to
get nulls instead of values. So let's go and apply this policy. But here we
have two values, the empty string and the spaces. Instead of having two rules
for that, I'm going to first convert the blank spaces to an empty string, like
we have done here. So I'm going to take the result of the TRIM function as a
first step, and afterwards we're going to use NULLIF. We're going to say:
NULLIF on the result of the TRIM, so if you find any empty string, convert it
to null. So that's it, policy 2. As you can see in the result, we have
converted those empty strings and blanks to a null. With that we are getting
three nulls, and of course we still get the value A. And now, if you compare
those three columns side by side, you're going to see that policy 2 is really
easier to understand than the previous ones, right? If you compare policy 2
to policy 1, you can see it's easier to understand and easier to handle as
well. So again, it's very easy to do data cleanup: with only two functions we
now have standards inside our data.
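The two-step rule for policy 2 (TRIM first, then NULLIF) can be sketched like this. As before, the CTE and the aliases are assumptions for illustration and not the exact statement from the video.

```sql
-- Assumed dummy data: value, null, empty string, blank spaces.
WITH orders AS (
    SELECT *
    FROM (VALUES
        (1, 'A'),
        (2, NULL),
        (3, ''),
        (4, '  ')
    ) AS t(id, category)
)
SELECT
    id,
    category,
    TRIM(category)             AS policy1,  -- step 1: blanks -> ''
    NULLIF(TRIM(category), '') AS policy2   -- step 2: '' -> NULL
FROM orders;
```

With these rows policy2 is A for the first row and NULL for the other three, which matches the "three nulls and the value A" result described above.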
And now, moving on to the last option, we can define our data policy like
this: use only a default value, unknown, and avoid using anything else like
nulls, empty strings and blank spaces. That means in the analyses and reports
we want to see the value unknown, and we have to handle all three cases and
convert them to unknown. Now, in order to implement policy 3, we have to
replace a null with a default value, and here we have two options: either use
ISNULL, or use COALESCE. I will go with COALESCE, and I'm going to use the
category directly: if you find any null, replace it with the default value
unknown, and let's call it policy 3. Let's go and execute it. Now, if you
check the result over here, you see that we got it right only once. We
replaced the null with unknown, but we still have the empty strings and
blanks, and that's because we rushed into using COALESCE and skipped the other
steps. So as you can see, preparing the data is something you have to do
slowly, step by step. First we have to convert everything to a null, like in
policy 2, and after that, as the last step, we use the default value. That
means instead of using the category, we have to take the result of policy 2.
So let's copy it, replace the category with those two steps, and execute it.
Now, as you can see, we have the default value for all three scenarios. First
we trim the data in order to remove all the blank spaces. In the second step
we replace all the empty strings with a null, and with that we get a null for
all three scenarios. And finally, we replace the nulls with a default value,
unknown. So that's it for the three policies. These are the different ways to
clean up the data and bring standards before doing analysis.
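The three steps for policy 3 can be sketched in one expression, next to the "rushed" version that only handles nulls. The CTE, the aliases and the literal 'unknown' as the default value follow the description above but are otherwise assumptions.

```sql
-- Assumed dummy data: value, null, empty string, blank spaces.
WITH orders AS (
    SELECT *
    FROM (VALUES
        (1, 'A'),
        (2, NULL),
        (3, ''),
        (4, '  ')
    ) AS t(id, category)
)
SELECT
    id,
    category,
    -- rushed version: only the NULL row is replaced
    COALESCE(category, 'unknown')                   AS policy3_rushed,
    -- step 1 TRIM, step 2 NULLIF, step 3 COALESCE
    COALESCE(NULLIF(TRIM(category), ''), 'unknown') AS policy3
FROM orders;
```

In policy3 the last three rows all read unknown, while policy3_rushed still carries the empty string and the blanks through. COALESCE is used here rather than ISNULL on purpose: ISNULL takes the data type of its first argument, so with a short string column it can cut the default value off.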
analyszis. And now you might ask me, okay, which one should I use in my
okay, which one should I use in my project? Like if I want to suggest
project? Like if I want to suggest something for the users, which one
something for the users, which one should I use? Well, it really depends on
should I use? Well, it really depends on the business, but I tried always to
the business, but I tried always to avoid this one, the policy one, because
avoid this one, the policy one, because it's always confusing and I have always
it's always confusing and I have always explained for the users. So now we are
explained for the users. So now we are left with the two and three. Well, I use
left with the two and three. Well, I use both of them in different scenarios. I
both of them in different scenarios. I normally go with the policy 2 because it
normally go with the policy 2 because it takes less storage and as well the
takes less storage and as well the performance of your queries afterward
performance of your queries afterward going to be really good. So if I'm doing
going to be really good. So if I'm doing data preparations in my ETL before
data preparations in my ETL before inserting it inside a table, I go with
inserting it inside a table, I go with the policy too. But in other hand, if
the policy too. But in other hand, if I'm doing a preparation step before
I'm doing a preparation step before showing it in a report like in Tableau
showing it in a report like in Tableau or PowerBI. So if it is like one of the
or PowerBI. So if it is like one of the last steps before showing the data to
last steps before showing the data to the users, I go with the policy 3
the users, I go with the policy 3 because if you present a null inside a
because if you present a null inside a report, it's going to be really hard to
report, it's going to be really hard to read. So having like a word like
read. So having like a word like unknown, it's easier to understand.
Okay, so we have missing data here. Again, if the data preparation happens
right before I present the data to the users, I go with policy 3, where I use
default values. But if I'm preparing the data before inserting it into the
database, I go with policy 2, because it's going to optimize the storage. It's
going to be really bad if you go with policy 3 here, because it's really bad to
store the whole word, like unknown, each time there is no value. It's going to
take a lot of space, and you're going to get bad performance as you build the
queries. That's why I tend to store the data using nulls, and if you present it
to the users, show it as a default value. So as you can see, it's very
important to understand the differences between nulls, empty strings, and
blanks, and how to prepare the data by cleaning it up and bringing in standards
and policies before doing any analysis. So with this we have cleared up the
confusion between those scenarios, and if you encounter this in your projects,
you know how to deal with
it. All right. So now let's have a quick summary of the nulls. Nulls are
special markers in SQL that say there is no value: it is missing, it is
unknown. So nulls are not equal to zero, an empty string, or blank spaces. And
using nulls inside our database is going to save some storage and provide
strong performance in your queries. And in SQL we have different functions to
handle the nulls. If you want to replace a null with a value, you can go with
either the function COALESCE or ISNULL. If you want to do the opposite, where
you replace a value with a null, you can use the function NULLIF. And in other
cases, where we only want to check whether there are nulls or not, we can use
IS NULL or IS NOT NULL. And we have learned as well that we have to treat the
nulls, especially before doing any tasks on the data. That means we have to
handle the nulls before doing, for example, data aggregations like average,
sum, max, min, and so on. We have to handle the nulls as well before doing any
operations with the plus operator, like adding numbers or concatenating two
strings. In some scenarios, as we learned, we have to handle the nulls before
doing joins, and in other cases before sorting the data. And we have learned as
well that by combining joins and IS NULL we get new types of joins, like the
left anti join and the right anti join, where we exclude the matching rows
using IS NULL. And we can use the null functions to bring standards and data
policies into our data, like using nulls or using default values like unknown.
All right, my friends. So with that you have learned how to handle the nulls
inside your data, and now we're going to move to a very special topic called
the case statements. This is a very important tool for doing data
transformations. So let's
go. Case statements allow you to build conditional logic in your SQL query by
evaluating a list of conditions one by one and returning a value when the first
condition is met. So now let's understand the syntax of the case statements and
what this means. Okay. So now let's see the syntax step by step. It starts with
the keyword CASE. This CASE indicates that we are now starting a conditional
logic in SQL. It's like in programming languages, where you start with
if-else: the if is the keyword that starts the logic. And the whole logic ends
with another keyword called END. Once SQL sees the END, this is the end of the
conditional logic. So CASE is the start and END is the end. Now what we're
going to have in between are the conditions, right? A condition starts with the
keyword WHEN. Now we are telling SQL we have a condition to be evaluated, and
then we specify that condition. Now we have to tell SQL what happens if this
condition is fulfilled, so we have to use another keyword called THEN. We are
telling SQL: show this result if the condition is true. So as you can see, it's
very simple. It's like natural language, right? It's like in English: when
condition one is met, then show the result. It's very logical, right? And now
of course we can add a second condition inside our case statement. We're going
to have the same setup: when condition two is true, then show result number
two. We specify the keyword WHEN, then we have a second condition, and if this
condition is true, we tell SQL to show another result. And of course it's very
important to understand that SQL is going to process the conditions from top to
bottom. So the first, most important condition should be at the start. SQL is
going to check this condition first. If it fails and it's not true, then it's
going to jump to the second condition.
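To see why the top-to-bottom processing matters, here is a small sketch that evaluates the same value with two orderings of the same conditions. It uses SQLite through Python's sqlite3 module only so that it can run on its own. The thresholds, the labels, and the helper function label are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def label(sales, case_expr):
    # Hypothetical helper: evaluate one CASE expression for one sales value.
    return conn.execute(f"SELECT {case_expr}", {"sales": sales}).fetchone()[0]

# The stricter condition comes first.
strict_first = (
    "CASE WHEN :sales > 50 THEN 'high' WHEN :sales > 20 THEN 'medium' END"
)
# Same conditions, but the looser one comes first and wins for every sales > 20.
loose_first = (
    "CASE WHEN :sales > 20 THEN 'medium' WHEN :sales > 50 THEN 'high' END"
)

print(label(60, strict_first))  # high
print(label(60, loose_first))   # medium
```

In the second ordering the condition for high is never reached for a value like 60, because the first condition that is true already decides the result.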
So the order of the conditions is very important in your logic. And now of
course we can add multiple conditions, depending on the logic, using the
keyword WHEN. Once we are done defining all the conditions, we can specify an
ELSE keyword. The ELSE introduces the default value, and it is optional; you
can skip it. The value of the ELSE, the default, is going to be used only if
all the conditions fail. That means none of our conditions is true and nothing
is fulfilled, and then SQL is going to use the value from the ELSE. So it is
the default value that's going to be used if all conditions are false. Those
are the keywords that you must use inside each case statement: CASE, WHEN,
THEN, and END. Only the ELSE is optional, so you can use it or skip it. So this
is the main structure and the syntax of each case
statement. Now let's have a very simple example in order to understand how SQL
executes the case statements behind the scenes. All right, let's have this very
simple example where we have only one condition. As you can see in the syntax,
it starts with CASE and ends with END, and in between we have only one
condition, and here we are evaluating the sales. The condition says: if the
sales is higher than 50, then show as a result the value high. So it's very
simple, only one condition, and on the right side we have a flowchart in order
to understand how the logic is executed. Now what we're going to do is evaluate
those four sales through this logic and see what the output of the case
statement is going to be. So let's do it one by one. Let's start with the first
sales. It is 60. Here we check: is 60 higher than 50? Well, yes. That means
this sales meets the condition, we get true, and we get in the output the value
high. So the first sales fulfills the condition, and SQL gives us the value
from this condition. All right. Now SQL goes to the next value and starts
evaluating the 30. We ask the same question, the same condition: is 30 higher
than 50? Well, no. That means for this condition we get false, so we take the
path of the false. Now if you take the path of the false, we will not get any
value, right? That means the output is going to be a null. So the output for
the 30 is null, and that's because we didn't define anything in our logic about
the default option. We don't have an ELSE here. And this is what happens: if
you don't use ELSE and no condition is met, you get a null in the output of the
case statement. So now let's move to the next one. It's the same thing: 15 is
smaller than 50, so it's not fulfilling the condition, and we get a null as
well. And for the last one, since it's null, we get a null as well, since it
will not fulfill the condition.
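The walkthrough of the four values can be reproduced with a small sketch. It runs the one-condition CASE on SQLite through Python's sqlite3 module so that it is self-contained. The orders table and its column names are assumptions for this illustration; only the four sales values come from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the four sales values from the example.
conn.execute("CREATE TABLE orders (order_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 60), (2, 30), (3, 15), (4, None)],
)

# One condition and no ELSE: rows that do not meet the condition get NULL.
rows = conn.execute("""
    SELECT sales,
           CASE
               WHEN sales > 50 THEN 'high'
           END AS category
    FROM orders
    ORDER BY order_id
""").fetchall()

print(rows)  # [(60, 'high'), (30, None), (15, None), (None, None)]
```

Python shows the SQL null as None, so only the 60 gets a category.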
So now, after evaluating all those sales, only the first sales fulfills the
condition, and that's why we have only one value, the high. All right. So now
let's keep moving and adding stuff to our case statement. Now we are adding a
second condition. It says: after checking whether the sales is higher than 50,
and that fails, check the sales again, whether it's higher than 20. If yes,
then show the value medium. So in our workflow we are adding a second condition
to be checked if the first one is false. Now let's evaluate our sales again and
check the output. The first one is the 60. As you can see, 60 is higher than
50, so we are fulfilling the first requirement, and that's why we get the value
high. So it's the same as before; here we're going to get high in the output.
And here it's very important to understand one thing: SQL didn't evaluate the
second condition in this scenario. SQL didn't waste any time checking the other
condition. It skipped everything once it got a true from one condition. This is
exactly how SQL processes the CASE WHEN. It checks each condition from top to
bottom, and once it finds a true, it stops everything immediately, returns the
value from this condition, and does not evaluate any other conditions. So now
it jumps to the next value. We are at the value of 30, so let's evaluate the
conditions. Is 30 higher than 50? Well, it's not, so it's false. What happens
now is that SQL jumps to the next condition and starts evaluating the second
one, whether it's true or false. So we check here: is 30 higher than 20? Well,
yes. So it's fulfilled, and we get the value medium. SQL stops everything and
shows medium in the output for this value. So in this scenario, we have
evaluated both of the conditions that we have in the case statement. Now it
goes to the third one. We have 15. Is 15 higher than 50? Well, no. So we get
false for the first condition. Then it jumps to the second condition and checks
it. Is 15 higher than 20? Well, no as well. So what happens now? The false path
is taken over here, and we don't get any value in return. So we get null in the
output. And for the last one we have null. We get null as well, because it
doesn't fulfill any of those conditions and we didn't define an ELSE in the
case statement. So if we define the conditions like this, we get the category
medium for the 30. And this is how SQL evaluates multiple conditions in the
case statement. All right. Now we're going to go to the final form of our case
statement and add an ELSE, so we're going to have a default value. We are
saying here: if the sales is not higher than 50 and not higher than 20, then
show low as a default value. That means any sales that is equal to or smaller
than 20 is going to get the value low. And now, very interesting: if you check
the workflow over here, you can see that we now have a value for each path. For
the first condition we get high, for the second one medium, and if nothing is
fulfilled we always get the value low. So there is no way in this chart to get
any nulls, right? So let's evaluate our values again. I think you already get
it. The 60 fulfills the first requirement, and SQL stops everything immediately
and just shows the value high. So on the right side over here nothing is
evaluated, because the first condition is true. Here in the output we get the
value high, so nothing changed from the two previous examples. Now it goes to
the next value. We have the 30. We evaluate the first condition: it's false.
The next one, is it higher than 20? It is true, and that's why SQL still shows
the value medium. We had this as well in the previous example. So, medium. Now
SQL moves to the next one, and here things get interesting. So, the value of
15. We evaluate the first condition. Is it higher than 50? Well, no. Is it
higher than 20? Well, no. So now we are in a scenario where none of those
conditions is true. That's why SQL is going to execute the ELSE. If you check
our chart, it's false and we get the value low. So in the output we don't get a
null this time; because we have an ELSE, we get the value low. The same thing
now for the null.
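The final form, with two conditions and an ELSE, can be sketched the same way. This again uses SQLite through Python's sqlite3 module only to be runnable on its own, and the orders table and column names are assumptions for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the four sales values from the example.
conn.execute("CREATE TABLE orders (order_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 60), (2, 30), (3, 15), (4, None)],
)

# Two conditions plus ELSE: every row ends up on a path that returns a value.
rows = conn.execute("""
    SELECT sales,
           CASE
               WHEN sales > 50 THEN 'high'
               WHEN sales > 20 THEN 'medium'
               ELSE 'low'
           END AS category
    FROM orders
    ORDER BY order_id
""").fetchall()

print(rows)  # [(60, 'high'), (30, 'medium'), (15, 'low'), (None, 'low')]
```

The row with the null sales also lands on low, because a comparison with a null is never true and so neither condition is met.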
Null will not fulfill the first condition or the second condition, and that's
why we get the value from the ELSE as well. So here in the output we get the
value low as well. So as you can see, if you use an ELSE with a default value
inside the case statement, you make sure that there will be no nulls in the
output. So with that you have learned the different options that we have inside
the case statement and how SQL executes the case behind the scenes. All right,
friends. So now we come to the part where I'm going to show you the most useful
use cases of the case statement that I usually use in my projects. So let's
start. The main purpose of the case statement is to do data transformations,
and data transformation is a very important process in each data project. One
very important task in data transformations is that we can generate new
information. We can create new columns based on the existing data that we have
in the database using the case statement, and this of course helps us derive
new information for our analysis without modifying the source database, only
for analytics. So, my friends, the main purpose of the case statement is to do
data transformations by creating and generating new columns.
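A minimal sketch of that idea: the CASE adds a column to the query result, while the stored table keeps its original columns. It runs on SQLite through Python's sqlite3 module so that it is self-contained. The orders table, its three rows, and the category labels are assumptions for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table with made-up rows.
conn.execute("CREATE TABLE orders (order_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 60), (3, 50)])

# The derived column exists only in the result of this query.
cursor = conn.execute("""
    SELECT order_id,
           sales,
           CASE
               WHEN sales > 50 THEN 'high'
               WHEN sales > 20 THEN 'medium'
               ELSE 'low'
           END AS category
    FROM orders
    ORDER BY order_id
""")
result_columns = [col[0] for col in cursor.description]
result_rows = cursor.fetchall()

# The stored table still has only its two original columns.
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]

print(result_columns)  # ['order_id', 'sales', 'category']
print(result_rows)     # [(1, 10, 'low'), (2, 60, 'high'), (3, 50, 'medium')]
print(table_columns)   # ['order_id', 'sales']
```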
by creating and generating new columns. So now let's start with the first use
So now let's start with the first use case and the most important and famous
case and the most important and famous one is we use case statement in order to
one is we use case statement in order to categorize the data. This means we are
categorize the data. This means we are going to group up the data into
going to group up the data into different categories based on certain
different categories based on certain conditions. And now you might ask why
conditions. And now you might ask why this use case is important. Well,
this use case is important. Well, classifying and grouping data is
classifying and grouping data is fundamental in data analysis and
fundamental in data analysis and reporting because it makes the data
reporting because it makes the data easier to understand and as well to
easier to understand and as well to track. But what's more important, it
track. But what's more important, it going to help us aggregating the data
going to help us aggregating the data based on the categories. All right. So
based on the categories. All right. So now let's have the following task. And
now let's have the following task. And it says generate a report showing total
it says generate a report showing total sales for each of the following
sales for each of the following categories. category high if the sales
categories. category high if the sales is over 50. Category medium if the sales
is over 50. Category medium if the sales is between 20 and 50 and low if the
is between 20 and 50 and low if the sales is 20 or less and sort the
sales is 20 or less and sort the categories from the highest sales to the
categories from the highest sales to the lowest. Okay, so let's do it step by
lowest. Okay, so let's do it step by step. And now before we do any data
step. And now before we do any data aggregations, we have to go and create a
aggregations, we have to go and create a new column called categories because we
new column called categories because we don't have it in the database. So now
don't have it in the database. So now let's start with very simple select
let's start with very simple select statements. So select what do we need?
statements. So select what do we need? Let's take the order ID, the sales and
Let's take the order ID, the sales and that's it for now. So from sales orders
that's it for now. So from sales orders let's go and execute it. And now we have
let's go and execute it. And now we have our 10 orders and we have to go and now
our 10 orders and we have to go and now create a new column called categories.
create a new column called categories. And we're going to do that using the
And we're going to do that using the case statements. So let's take a new
case statements. So let's take a new line and we start with case and then
line and we start with case and then again a new line in order to define the
again a new line in order to define the first condition using the when. So the
first condition using the when. So the first condition is the high where sales
first condition is the high where sales is over 50. So it's very simple. So when
is over 50. So it's very simple. So when the sales is higher than 50, what can
the sales is higher than 50, what can happen if this is true? We want to show
happen if this is true? We want to show the value high. So this is the first
the value high. So this is the first condition. And then let's move to the
condition. And then let's move to the second one. If the sales is higher than
second one. If the sales is higher than 20, that means it's less than 50 and
20, that means it's less than 50 and higher than 20, then we want to see the
higher than 20, then we want to see the value medium. And now for the last
value medium. And now for the last category, the low, we don't have to go
category, the low, we don't have to go and create a condition for that because
and create a condition for that because if those two fails, then that means that
if those two fails, then that means that the sales either equal to 20 or less. So
the sales either equal to 20 or less. So what we're going to do, we're going to
what we're going to do, we're going to just do a simple else and show the value
just do a simple else and show the value low like this. Let me make this a little
low like this. Let me make this a little bit smaller. Now what is missing in our
bit smaller. Now what is missing in our case is of course the end. Without it,
case is of course the end. Without it, you're going to get an error. So end and
you're going to get an error. So end and let's give it a name
let's give it a name category. So we are ready. Let's go and
category. So we are ready. Let's go and execute it. So now let's check randomly
So now let's check a few rows at random. As you can see here, we have the
sales of 15; it is Low, which is correct. Then we have here 60; it's above 50
and we have the category High. And now if you check order number six, we have
the sales of 50; it's Medium, because it is not higher than 50, it is between
20 and 50. So now, as you can see, we have classified our orders using the
category.
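
As a rough sketch of the classification just described, the snippet below runs the same CASE logic from Python's built-in sqlite3 module, so it can be tried without a SQL Server instance. The table name orders, its columns, and the ten sample rows are assumptions made for this illustration; the course itself queries its own orders table.

```python
import sqlite3

# Assumed sample data: a stand-in for the course's orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, sales INTEGER)")
sample_sales = [10, 15, 20, 60, 25, 50, 30, 90, 20, 60]
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    list(enumerate(sample_sales, start=1)),  # order_id 1..10
)

classify_sql = """
SELECT order_id,
       sales,
       CASE
           WHEN sales > 50 THEN 'High'    -- evaluated first
           WHEN sales > 20 THEN 'Medium'  -- reached only when sales <= 50
           ELSE 'Low'                     -- everything that is 20 or less
       END AS category
FROM orders
ORDER BY order_id
"""
classified = conn.execute(classify_sql).fetchall()
for row in classified:
    print(row)
```

The order of the WHEN branches matters here: the second condition only needs `sales > 20` because any row above 50 has already been caught by the first branch.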
Now the next step is to go and aggregate the data. So how are we going to do
that? We will use a subquery. So let's do it like this. We're going to go and
SELECT, and of course we're going to group the data by the category. So we're
going to select the category, and we need the total sales. That means we're
going to use the function SUM for the sales, and we're going to call it total
sales. So now we have to nest the queries together. So FROM, this is our
query, like this, and then we have to close it, and GROUP BY. So we are
grouping by the category. Okay. So with that we are now aggregating the sales
by the category. It's very simple. Let's go and execute it. So now in the
result we have only three categories. We don't have the 10 orders because now
we are doing data aggregation. So the granularity is now on the level of the
category.
So now we can see the total sales for High is 210. For Low we have 65 and for
Medium we have 105. And of course we are not done yet, because the task says
to sort the categories from the highest sales to the lowest. That means we
have to go and use an ORDER BY at the end, and we're going to sort the data
by the sales from the highest to the lowest. That means descending. So that's
it. Let's go and execute it. And now with that we have our report. We are
showing the total sales by category, and the data is sorted from the highest
to the lowest. So the highest category is High, then Medium, and the last one
is Low.
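
Putting the steps together, the nested query could look roughly like this. Again this is only a sketch on SQLite through Python's sqlite3 module, not the course's SQL Server session; the orders table and its sample rows are assumptions chosen for illustration.

```python
import sqlite3

# Assumed sample data: a stand-in for the course's orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    list(enumerate([10, 15, 20, 60, 25, 50, 30, 90, 20, 60], start=1)),
)

report_sql = """
SELECT category,
       SUM(sales) AS total_sales
FROM (
    -- inner query: label every order with its category
    SELECT order_id,
           sales,
           CASE
               WHEN sales > 50 THEN 'High'
               WHEN sales > 20 THEN 'Medium'
               ELSE 'Low'
           END AS category
    FROM orders
) AS t
GROUP BY category           -- one row per category
ORDER BY total_sales DESC   -- highest total first
"""
report = conn.execute(report_sql).fetchall()
for category, total_sales in report:
    print(category, total_sales)
```

Because the CASE result is just a column of the inner query, the outer query can group and sort by it like any other column.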
So my friends, as you can see, with the help of CASE WHEN we have created new
information from our data, the category, and then we have created insights, a
report, based on this new information, where we have aggregated our data
using it. So the use case of categorizing data using CASE statements is
fundamental and very important in each data project.

Okay. So now one more thing before we jump to the next use case: there is one
rule to follow if you are using CASE statements, and that is that the data
types of the results must match. What this means: if we check our example
again, we can see that the result of each condition is a string. As you can
see, we have here High, Medium, and Low, and all of those values follow the
same data type. So it is correct.
So now if I go and break this rule, for example after this THEN let's have
the value 2, we have a number and we have characters. So let's go and execute
it. And now of course we're going to get an error, because SQL is trying to
convert the value Low to an integer, which is not possible. So the data types
of the results must match, and that includes not only the value after the
THEN but also the value after the ELSE, because this value is part of the
output as well. So let's have here Medium again. And now let's change the
value after the ELSE to, let's say, 1. Let's go and execute it again. SQL is
going to throw an error, because this is an integer and the others are
strings. So this is the rule of using the CASE statement: the data types
after THEN and after ELSE must match. And if you ask me whether there is any
restriction on the clauses where you can use the CASE statement: you can use
it everywhere, in SELECT, in joins, FROM, WHERE, GROUP BY, ORDER BY,
everywhere. So there are no restrictions, and we have only this one rule.

Okay friends, another use case for the CASE statement: we can use it in order
to map values. We can use the CASE statement to transform the data from one
form to another in order to make it more readable and more usable for
analytics. One scenario of mapping values is that sometimes the database
developers store the values inside the database as codes and flags. So for
example, the status of the order could be stored as 1 and 0 instead of
active and inactive. This is one technique to optimize the performance of the
database for the application, because 1 and 0 is way faster than storing the
whole string. But in data analysis we usually generate a report to be read by
humans. And now instead of showing the data as 0 and 1, it's going to be
nicer and more readable if you show the data as active and inactive. So for
these scenarios, we're going to use the CASE statement to translate those
cryptic, technical values into readable terms. Otherwise, everyone who
consumes your report is going to ask you what you mean by the 0 and 1.
Let's have the following task, and it says: retrieve employee details with
gender displayed as full text. Okay. So now let's go and solve it. First
we're going to explore the data. So let's go and show the employee ID, and
let's take the first name, last name, and we need the gender. So gender FROM
Sales.Employees. So that's it. Let's go and execute it. So now, as you can
see in the result, we got our five employees, and the gender is stored as
only one character, F and M. And of course it's easy to understand that F is
female and M is male, but we would like to show it in the report as full
text: Female and Male instead of those abbreviations. So in order to do that,
we're going to use the CASE statement to do the mapping between the old value
and the new value.
So let's go and create a new column using the CASE. We're going to have two
conditions here because we have two values. Let's start with the first one.
So we're going to have a new line and WHEN. So WHEN the gender equals 'F',
ladies first, THEN 'Female'. And now for the second value it's going to be
exactly the same: WHEN gender equals 'M' THEN 'Male'. Be careful with the
case sensitivity of the values. And of course we will not end this without an
ELSE. So ELSE, then we can have the default value. We're going to have the
default value 'Not Available'. It's better than having NULLs. So what we are
missing is the END. So we're going to have an END over here, and we're going
to call it gender full text. So that's it. Let's go and execute it. Now if
you check the results, we have done the mapping between the old format of the
value and the new format. So instead of M and F we have Male and Female. And
of course we don't have any NULLs here. That's why we don't have 'Not
Available' in the data. But if you have huge data, of course you can have a
NULL somewhere, and then you will get this default value. So this is how you
can do mapping between values very easily using CASE statements.
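
The gender mapping can be tried out with a small stand-in table. This is a sketch on SQLite via Python's sqlite3 module, not the course's SQL Server session; the employees table, the names, and the extra row with a missing gender are made up to show when the ELSE default appears.

```python
import sqlite3

# Hypothetical employees; the third row has no gender to trigger the ELSE.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Frank", "M"), (2, "Mary", "F"), (3, "Alex", None)],
)

gender_sql = """
SELECT employee_id,
       first_name,
       gender,
       CASE
           WHEN gender = 'F' THEN 'Female'
           WHEN gender = 'M' THEN 'Male'
           ELSE 'Not Available'   -- NULL or any unexpected code lands here
       END AS gender_full_text
FROM employees
ORDER BY employee_id
"""
mapped_genders = conn.execute(gender_sql).fetchall()
for row in mapped_genders:
    print(row)
```

A NULL gender never satisfies `gender = 'F'` or `gender = 'M'`, which is why the row without a gender falls through to the ELSE branch.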
Okay, let's have another task for the mapping use case, and the task says:
retrieve customer details with abbreviated country code. Sometimes, as we are
generating reports, maybe using Power BI or Tableau, we don't have enough
space to use the full name of the values. So what do we need? We need
abbreviations, a short form of the values, and in SQL we can use the CASE
statement to map the full value to an abbreviated value. So it's like the
previous example, but the other way around.
All right. So now let's go and solve it. We're going to select a few details
like the customer ID. Let's take the first name, last name, and what do we
need? We need the country FROM Sales.Customers. So that's it. Let's go and
execute it. And now, as you can see, we get our five customers, and we have
the country as a full name. Now of course for the report we need abbreviated
values. So we're going to map those full names of the countries to a short
form. But in real projects you might get big tables where you have thousands
or millions of records, so you cannot just check it like this. What I usually
do is retrieve a distinct list of all values from the column. So I usually
have a separate query for that. So we're going to have SELECT DISTINCT
country FROM the table Sales.Customers. It's just for me to see all the
possible values inside the database. So now you see the second result over
here. We have only two values, Germany and USA, and then I can go and map
the data correctly. So always, if you are mapping data using CASE WHEN, you
have to understand all the possible values that you have inside the table.
So let's go and generate this new information. Let's start with CASE, and
then a new line: WHEN country equals the first value. It's going to be
'Germany'. Make sure you write it exactly like in the database: the first
character is capital and the rest is lowercase. So what happens then? We're
going to have the abbreviation of Germany. It's going to be 'DE'. All right.
So this is for the first value. And then let's move to the second one. It's
going to be country equals 'USA'. It's already abbreviated, but maybe we can
get only two characters. So 'US', like this. And now let's go and add an
ELSE. It's optional, but we add it in case we have NULLs in the data or we
get a new value. So ELSE, it's not available, so 'n/a'. So that's it. And
never forget about the END. So END, and the name is going to be country
abbreviation. So that's it. Let me just get rid of the other query. So the
mapping is correct. Let's go and execute it. And now if you check the
results, we got a new column called country abbreviation. And as you can see,
the mapping is working. Here we have Germany and we have here DE, and for the
USA we have US.
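
The two-step habit described here, first list the distinct values and then write the mapping, might look like this. The snippet is a sketch on SQLite via Python's sqlite3 module; the customers table, the first names, and the rows are assumed sample data.

```python
import sqlite3

# Hypothetical customers with only two distinct countries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, first_name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Anna", "Germany"), (2, "Ben", "USA"), (3, "Cara", "USA"),
     (4, "Dan", "Germany"), (5, "Eva", "USA")],
)

# Step 1: check which values actually occur before writing the mapping.
distinct_countries = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT country FROM customers ORDER BY country"
    )
]
print(distinct_countries)

# Step 2: map every known value; ELSE covers NULLs and future values.
abbr_sql = """
SELECT customer_id,
       country,
       CASE
           WHEN country = 'Germany' THEN 'DE'
           WHEN country = 'USA'     THEN 'US'
           ELSE 'n/a'
       END AS country_abbr
FROM customers
ORDER BY customer_id
"""
abbreviated = conn.execute(abbr_sql).fetchall()
for row in abbreviated:
    print(row)
```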
So with that we have solved the task, and we have done the mapping correctly
between the old value and the new value.

All right friends, now there is a special case for the syntax of the CASE
statement if you are using it for mapping values. So let's go and check it.
Let's say that we have a lot of different distinct values inside the country,
not only two values, and you are mapping the values using CASE WHEN. You're
going to end up always writing the same thing: country equals Germany,
country equals India, country equals United States, and so on. So we are
always using the column country. The conditions over here always use one
column, and the operator is always equal. So only for this scenario we have
another syntax for the CASE statement, and it looks like this. We start with
the keyword CASE, but immediately after that we're going to use the column
that we want to evaluate, and here you can use only one column; you cannot
use multiple columns. So now we are telling SQL that we are evaluating one
column, the country, and then for each condition we have the following: we
say WHEN 'Germany', which means when country is equal to Germany, THEN 'DE'.
So as you can see, we don't have the whole condition here; we have only a
possible value that we can see inside the country. So we are asking: is the
value of country Germany? If it's true, then show DE. The next one: is it
India? Then IN. United States, US, and so on.
So we call this syntax the quick form of the CASE statement, and the one on
the left side we call the full form of the CASE statement. And of course the
restriction of the quick form is that you can use only one column, and it's
only for the equal operator. So that means you can use the quick form only
for these scenarios. If things get a little bit complicated, where you have
to mix conditions and build complex logic, you cannot use the quick form. So
I would say if you are sure that the logic will not get complicated and you
can always stay with the same column, you can go with the quick form. But I
would recommend always going with the full form, for one simple reason: if
you add one small piece of logic, you have to go and rewrite the whole CASE
statement back to the full form. But of course there is nothing wrong with
using the quick form if the logic can stay static and you are sure that you
are using only one column and you are just doing mapping, with no extra
logic.
Okay. So now let's try this quick form of the CASE statement on the previous
example. I will just go and copy everything to a new column, and I'm going to
rename it to two. And now how are we going to do it? It's going to be CASE,
but this time we're going to write country, and then inside the WHEN we will
have only the values. So no need for the condition. It's going to be like
this. Let me scroll up. So that's it. As you can see, it's shorter and
quicker than writing the whole condition each time. So now let's go and
execute this. And as you can see in the result, we get identical values. So
now you know one more trick in the CASE statement.
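
To see that both forms give the same result, the sketch below puts them side by side. Standard SQL calls the full form a searched CASE and the quick form a simple CASE. The customers table, its rows (including one NULL country), and the column aliases are assumptions for illustration, run on SQLite via Python's sqlite3 module.

```python
import sqlite3

# Hypothetical customers; the third row has no country at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Germany"), (2, "USA"), (3, None)],
)

compare_sql = """
SELECT customer_id,
       -- full form: a complete condition in every WHEN
       CASE
           WHEN country = 'Germany' THEN 'DE'
           WHEN country = 'USA'     THEN 'US'
           ELSE 'n/a'
       END AS country_abbr,
       -- quick form: the column once, then only the values to compare
       CASE country
           WHEN 'Germany' THEN 'DE'
           WHEN 'USA'     THEN 'US'
           ELSE 'n/a'
       END AS country_abbr_2
FROM customers
ORDER BY customer_id
"""
both_forms = conn.execute(compare_sql).fetchall()
for row in both_forms:
    print(row)
```

The quick form only ever tests equality against the one column, so a condition such as a range check or a second column would force a rewrite into the full form.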

All right, moving on to the next use case for the CASE statement. We can use
it in order to handle NULLs. Handling NULLs means replacing a NULL with a
value. And as we learned before with the window aggregate functions,
sometimes NULLs lead to incorrect calculations and results, which leads to
wrong decision-making. We're going to have a dedicated chapter later on how
to handle NULLs in SQL, but now we're going to learn how to handle NULLs
using CASE statements. So now let's have the following task, and it says:
find the average score of customers and treat NULLs as zero, and additionally
provide details such as customer ID and last name. Okay. So now let's solve
it step by step. Again we have here details, and we also have to do an
aggregation; that means we have to use window functions. And we must not
forget that we have to treat the NULLs, so we have to handle them.
So now let's go and start with a very simple SELECT. So SELECT customer ID,
we need the last name, and we need the score as well. So FROM
Sales.Customers; let's go and execute it. So as usual we have our five
customers and the scores, and here we have a NULL. Now we're going to write
the window function, but without handling the NULLs, just in order to see the
difference. So we need the average function, for what? For the score. Do we
have to partition the data now? Well, no. So we're going to leave the OVER
empty. We need the average score of all customers. So that's it. Let's go and
give it a name and then execute it. I think I have a mistake here: it is
score, not scores. And now, as you can see, we have the average of 625. And
as you learned before, SQL skips the NULL, so it sums up the four remaining
values and divides by four. But our business understands the NULLs as zero,
not as missing information. So we have to go and handle the NULL.
Let's go and create a new column for the scores, but this time we're going
to use the CASE statement. It's going to be very simple. We're going to say
WHEN the score IS NULL (in SQL we don't write = NULL, we say IS NULL) THEN 0.
So with that we are replacing the NULLs with zero, right? Now otherwise, what
should happen? If it's not NULL, we need the score as it is. We should not
manipulate anything. So the default value, the ELSE, is the score itself if
the score is not NULL. So now let's go and END it, and let's call it score
clean. So let's go and execute it. Now if you check the result over here,
it's almost identical to the score. We don't have any new values for the
scores; only the NULLs are now zero, and all other values are not affected.
We didn't touch them. We didn't transform them at all. So this is what we
mean by handling NULLs: replacing NULLs with another value.
another value. So now in order to finish the task we have to do the average for
the task we have to do the average for the score clean and not for the original
the score clean and not for the original score. So how we going to do it? Let's
score. So how we going to do it? Let's go and copy the whole case statements.
go and copy the whole case statements. I'm just going to do it in another
I'm just going to do it in another column. So let's have an average and
column. So let's have an average and inside it we have the case statements
inside it we have the case statements like this. Let me just sort it like
like this. Let me just sort it like this. And now what is missing is the
this. And now what is missing is the over and it's going to be empty. So
over and it's going to be empty. So average customer let's call it clean. So
average customer let's call it clean. So this is the logic. Let me just make
this is the logic. Let me just make everything smaller. So now as you can
everything smaller. So now as you can see it's exactly like the previous one
see it's exactly like the previous one but instead of using the original score
but instead of using the original score now we are using the column that we have
now we are using the column that we have created. But of course we don't need the
created. But of course we don't need the alias over here. So we have to remove
alias over here. So we have to remove it. So it start with case and end. So
it. So it start with case and end. So let's go and execute it. And now you can
let's go and execute it. And now you can see in the output we got a new value for
see in the output we got a new value for the average and it is more accurate for
the average and it is more accurate for the business. So now we have 500.
the business. So now we have 500. Previously we had
Previously we had 625. So as you can see you have to
625. So as you can see you have to understand what the nulls means in your
understand what the nulls means in your business and handle it correctly.
business and handle it correctly. Otherwise you will get wrong results. So
Otherwise you will get wrong results. So that's it. We use case statements in
that's it. We use case statements in order to handle the nulls inside our
order to handle the nulls inside our data.
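The NULL-handling step above can be sketched in runnable form. This is only a minimal illustration, not the exact query from the video: it uses Python's built-in sqlite3 module so it runs anywhere, and the table name `customers`, the column names, and the five scores are assumptions chosen so that the two averages come out as the 625 and 500 mentioned in the walkthrough.

```python
import sqlite3

# Hypothetical in-memory table. The scores are illustrative values picked so
# that the averages match the 625 and 500 quoted in the walkthrough.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, 350), (2, 900), (3, 750), (4, 500), (5, None)],
)

query = """
SELECT
    customer_id,
    score,
    -- Replace NULL with 0, leave every other value untouched
    CASE WHEN score IS NULL THEN 0 ELSE score END AS score_clean,
    -- AVG skips NULLs, so this averages over four rows only
    AVG(score) OVER () AS avg_customer,
    -- Same CASE expression inside AVG, without the alias: five rows
    AVG(CASE WHEN score IS NULL THEN 0 ELSE score END) OVER () AS avg_customer_clean
FROM customers
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Whether replacing NULL with 0 is the right choice depends on what a missing score means in the business; the sketch only shows the mechanics.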
Conditional aggregation means we're going to go and apply an aggregate function
in SQL, like SUM, AVG, or COUNT, but this time only on a subset of the data
that meets specific conditions. This technique is amazing for doing deep-dive
analyses or targeted analyses on a specific subset of the data. So now let's
have the following SQL task in order to understand this use case. The task
says: count how many times each customer has made an order with sales greater
than 30.

All right. So, as usual, we can do it step by step. What do we need? We need
the orders. So let's get the order ID, and as well let's get the customer ID,
like this, and the sales from Sales.Orders. Let's go and execute it. What else
am I going to do? I'm going to go and order the data by customer ID. So let's
execute it again. Okay. Now the task sounds easy, but it's a little bit tricky.
We have to count the number of orders for each customer where the sales is
higher than 30. Let's have an example. Take customer number one. The total
number of orders is three, right? But we have to count only the orders where
the sales is higher than 30. And in this example, we have only one order where
the sales is higher than 30: it's only order number four. So the count for
customer ID number one should be one. Now let's check another customer, for
example number two. As you can see, we have three orders, but none of them has
sales higher than 30. So the count should be zero here. So how are we going to
do that? We have to go and flag each row according to whether it's higher than
30 or not. If it's higher than 30, it gets the flag one. If it's less than or
equal to 30, it's going to get zero. And then we're going to go and sum up all
those flags in order to get the count.
So let's do it step by step. Let's first create the flag. We're going to go and
use CASE, and then our condition is very easy. We're going to say WHEN. So what
is the condition? Sales greater than 30. So if sales is higher than 30, then
what happens? We're going to flag it with one, because later we're going to go
and sum up the ones. And now ELSE: if it's not higher than 30, so equal to 30
or less, it's going to get zero. All right. So now let's go and END it, and
let's call it sales flag. Now let's go and execute it and check the results.

All right. If you check the results, we now have a very nice flag in order to
see which orders have sales higher than 30. For example, let's take customer ID
number one. As you can see, only order number four has sales higher than 30,
and it's flagged with one, and all the others are zero. Now let's take customer
ID number three. As you can see, we now have two orders where the sales is
higher than 30, and we have the one twice. And now we can use this flag in
order to do the aggregation. If you go and sum up the flag for customer ID
number three, we will get two, and this is the count of orders where the sales
is higher than 30, right? Let's take another example, customer ID number two.
We have zero everywhere, and if we sum up those values we will get zero, which
is the count of orders where the sales is higher than 30, which is correct.

So as you can see, first we have built an extra column in order to help us do
the aggregation, and now in the next step we're going to go and aggregate this
column. So let's go and do that. We don't need all of that information, like
the order ID. We need the customer ID, because it is the granularity for the
aggregation. Let's remove the ORDER BY, and now let's go and group the data by
customer ID. But of course we need the aggregate function. So how are we going
to do it? We're going to go and SUM the whole flag. And now of course we're
going to go and rename this, since it is now an aggregated column, so we're
going to call it total orders. So now let's go and execute it and check the
result. As you can see, we now have our four customers. For customer ID number
one, we got only one order higher than 30. The second one has no orders higher
than 30. The third has two, and the fourth has one. And with that, we have
solved the task.

Now I would like to add one more thing to our query in order to see the normal
aggregation next to the conditional aggregation. Usually we go and count, for
example, the star in order to get the total orders. And let's rename the
previous one to high sales.
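The flag-then-sum technique described above can be sketched as follows. This is a minimal illustration under stated assumptions: it runs on Python's built-in sqlite3 rather than the database used in the video, the table is called `orders` here, and the ten rows are made-up values arranged to match the per-customer results discussed in the walkthrough (customer 1 has three orders with one above 30, customer 2 has three with none above 30, and so on). The column aliases are likewise illustrative.

```python
import sqlite3

# Hypothetical orders table; rows are illustrative, not the course data set.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 2, 10), (2, 3, 15), (3, 1, 20), (4, 1, 60), (5, 2, 25),
        (6, 3, 50), (7, 1, 30), (8, 4, 90), (9, 2, 20), (10, 3, 60),
    ],
)

# Step 1: flag each row with 1 when sales > 30, otherwise 0.
flag_query = """
SELECT order_id, customer_id, sales,
       CASE WHEN sales > 30 THEN 1 ELSE 0 END AS sales_flag
FROM orders
ORDER BY customer_id, order_id
"""
flags = conn.execute(flag_query).fetchall()

# Step 2: sum the flags per customer (conditional aggregation) and put the
# unconditional COUNT(*) next to it for comparison.
agg_query = """
SELECT customer_id,
       SUM(CASE WHEN sales > 30 THEN 1 ELSE 0 END) AS total_orders_high_sales,
       COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""
summary = conn.execute(agg_query).fetchall()
for row in summary:
    print(row)
```

Note that an order with sales of exactly 30 gets the flag 0, because the condition is strictly greater than 30.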
So let's go and execute it. Here we are just doing an aggregation without any
condition, and now we can see how many orders each customer made. We can see
that customer ID number one ordered three times, but only one order is higher
than 30. So this is a normal aggregation, and this is a conditional aggregation
using the CASE statement.

All right, friends. So now let's do a recap of the CASE statement. A CASE
statement evaluates a list of conditions one by one and returns a value as soon
as the first condition is met. If we talk about the rules of using the CASE
statement, we have only one: the data types of the results after THEN and ELSE
must match. And if we talk about the use cases of the CASE statement, the main
use case is to do data transformations, especially by creating new columns and
deriving new information. As we saw, there are amazing use cases for the CASE
statement. For example, we can use it in order to categorize our data: as we
learned, we can go and create new groups of data to be aggregated for our
reports. Then we saw that another use case is mapping values. We can use the
CASE statement to help us map the cryptic technical values that are stored in
databases to new values which are more readable and friendlier to use. The next
use case that we have learned is handling NULLs. We can use the CASE statement
in order to replace the NULLs with a value to make our aggregations more
accurate. And the last use case that we have learned, and I think the most used
one in my projects, is doing conditional aggregations, where we can aggregate a
subset of data that meets specific conditions in order to do focused and
targeted analysis.

Okay, my friends. So with that we have covered all the topics and all the
functions for transforming a single value in SQL, the row-level functions. That
was very important, especially for data engineers. So we are done with this
chapter. Now we are moving to a very interesting chapter. Finally we're going
to talk about data analytics in SQL, and we will be covering the aggregate and
the analytical functions that we have in SQL. First we're going to start with
the basics, so we will learn simple functions for aggregating your data. So
let's go.
Hey, my friends. So now we're going to talk about the aggregate functions in
SQL. They are amazing if you are a data analyst or data scientist, where we
usually use them in order to uncover insights about our data. The aggregate
functions accept multiple rows as input, and the output of an aggregate
function is usually one single value. So now we're going to go and cover first
the basic aggregate functions in SQL. So let's go.

In our database we have four orders, and we have the sales information for each
one of them. One question that comes to mind is: what is the total number of
orders in our business? How many orders do we have? In order to answer that we
use the function COUNT, because what it does is count the number of rows inside
our table. If you apply the COUNT function on this data, SQL is going to go and
start counting how many rows we have. The total number is four, and in the
output we will get four. As you can see, we don't really care about the content
of the table. SQL is just counting how many rows there are. So the number is
not based on the sales information or the orders. This is how the COUNT
function works.

Now we have another question, and we say: I would like to find the total sales
in our business. That means we have to go and add up all the sales that we have
in the orders, and for that we have the SUM function. If you go and apply the
SUM function, it's going to add up all the sales and return at the end the
total sales. In this example, it's going to be 80.
So, as you can see, the aggregate function accepts multiple rows, multiple
values, and the output is going to be one single value, the aggregated value.
Now, moving on, I would like to understand what the average sales is in our
business. It sounds simple. In order to do that, we're going to use the AVG
function. If you apply it on the sales, it's going to add up all those values
and divide the sum by the number of values. So you will get an average of 20.
Now comes an interesting question, where you want to find the highest sales in
the data. For that we can use the function MAX. Once you apply it, it's going
to go and start searching for the highest value inside our table. So this time
we are not really aggregating the data into something new. It's like searching
for the highest value among multiple values. In this example we will get 35 as
the highest sales.
Now of course, if you want to see the lowest sales in your business, you can
use the MIN function. If you go and apply it, the same thing happens: it's
going to go and start searching for the lowest value in the sales. In this
example, it's going to be 10. So as you can see, guys, the aggregate functions
are very simple yet very powerful. They are really useful for insights, in
order to understand how well your business is performing. So now let's go to
SQL in order to try those functions.

Okay. So now we're going to go and analyze the orders table inside our database
by doing very simple aggregations. Let's start with the first task. It says:
find the total number of orders. This time we are targeting the table orders,
so let's just start with the SELECT. Now we can see we have four orders, and we
would like to have one number. What can we do? We can go and say COUNT star as
total number of orders. So let's go and execute it. And with that we got one
number. It is four. This is the total number of orders.

Now let's move to the second task. It says: find the total sales of all orders.
This time we have to add up all the sales values into one big value. So how do
we do it? We're going to use the function SUM, and this time we are targeting
the sales, and we're going to go and call it total sales. So let's go and
execute it. And with that we have 80 as the total sales. All the sales values
are summed into one big value. As you can see, now we are exploring the
business, right? We are understanding how many sales, how many orders. So this
is really the basics of analytics in SQL.

Now let's go to the third task. Let's find the average sales of all orders. So
we're going to have AVG, this time of the sales, as average sales. Again, very
simple. Let's go and execute it. The total sales is 80, but the average sales
is 20. All the sales values are added up and then divided by the number of
orders, so 80 divided by four, and with that SQL finds 20 as the average.

Now let's get to the interesting stuff. Let's go and find the highest sales of
all orders. What is the highest sales that happened in our business? In order
to do that, we can use the function MAX on sales, as highest sales. Very nice.
Let's go and execute.
So the highest sales in the database is 35. And now I think you already know
what the next task is: find the lowest sales of all orders. This is exactly the
opposite, so we're going to go and use MIN on sales, as lowest sales. So let's
go and execute. The lowest sales in our business was 10.

So my friends, as you can see, the aggregate functions are really amazing. If
you use them like this, you will get the big numbers about our business. But
don't forget: if you combine the aggregate functions with a GROUP BY, then you
will be breaking those big numbers down, for example by the customer ID, maybe
by a date, or by a country. Anything you specify in the GROUP BY is going to
break those big numbers into smaller numbers based on the column that you are
using. For example, let's go with the customer ID over here, and let's put it
at the start of the SELECT as well. Now if you go and execute it, you can see
in the output that all those numbers are not big numbers anymore. We drill down
to more detail based on the column that we have specified. So now we have, for
each customer, the total number of orders, the total sales, the average sales,
the highest sales, and the lowest sales. Of course the data is very small, and
those numbers can be more interesting if you have bigger data. So if you
combine the aggregate functions together with the GROUP BY, you will break
those big numbers into more detail based on the column that you are grouping
by.

What you can do now is go and apply those functions to the customers as well.
There we have a score, and you can go and find the average score, the highest
score, and the lowest score, and then you can group the data by the country,
for example. So pause the video and do some aggregations on the table
customers.
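The five basic aggregate functions and their combination with GROUP BY, as walked through above, can be sketched like this. It is a minimal sketch rather than the query from the video: it runs on Python's built-in sqlite3, and the table name `orders`, the order and customer IDs, and the aliases are assumptions. The four sales values are chosen so the totals match the numbers quoted in the walkthrough (4 orders, total 80, average 20, highest 35, lowest 10); the assignment of orders to customers is made up.

```python
import sqlite3

# Hypothetical four-row orders table; IDs and customer assignment are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1001, 1, 35), (1002, 2, 15), (1003, 3, 20), (1004, 1, 10)],
)

# The "big numbers": one row for the whole table.
totals = conn.execute("""
SELECT COUNT(*)   AS total_nr_orders,
       SUM(sales) AS total_sales,
       AVG(sales) AS avg_sales,
       MAX(sales) AS highest_sales,
       MIN(sales) AS lowest_sales
FROM orders
""").fetchone()
print(totals)

# The same aggregates broken down per customer with GROUP BY.
per_customer = conn.execute("""
SELECT customer_id,
       COUNT(*)   AS total_nr_orders,
       SUM(sales) AS total_sales,
       AVG(sales) AS avg_sales,
       MAX(sales) AS highest_sales,
       MIN(sales) AS lowest_sales
FROM orders
GROUP BY customer_id
ORDER BY customer_id
""").fetchall()
for row in per_customer:
    print(row)
```

The same pattern applies to the suggested exercise: swap in the customers table, aggregate the score column, and group by country.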
All right, my friends. So with that you have learned the basics of how to
aggregate your data using SQL. Now we're going to move to a more advanced way
of aggregating your data. We will start talking about the window functions, the
analytical functions. First we're going to talk about what exactly window
functions are, and we're going to cover the basics of this topic. So let's go.
Window functions, or as we sometimes call them, analytical functions, are very
important functions in SQL. Everyone must know them, especially if you are
doing data analysis. Each time I write a SQL script in order to do data
analytics, I end up using them. So as usual, we're going to first understand
the concept behind them, and then we're going to start practicing. So let's go.

Okay, guys, so now let's start with the first question. What are SQL window
functions? They are functions that allow you to do calculations, like
aggregations, on top of a subset of data without losing the level of detail of
the rows. So it is something very similar to the GROUP BY, but here we have a
special case: you don't lose the level of detail. In order to understand the
definition, let's have a very simple example. Okay. So now let's understand how
SQL works with the GROUP BY clause.
Let's say that we have a very simple example. We have four orders: two orders
for the caps and two orders for the gloves. And let's say that I would like to
see the total sales for each product. If we decide to use the GROUP BY, what
SQL is going to do is take the first two orders for the caps and put them in
one row. So in the output we're going to have only one row for the caps, with
total sales of 40. The same thing is going to happen for the gloves. We're
going to take the two rows of the gloves from the input, and in the output
we're going to have only one row for the gloves. That means the number of rows
depends on the number of products we have in our data. We have two products, so
we get two rows. So SQL is really smashing or squeezing the results in the
output. And this is exactly what the GROUP BY does to our data: it aggregates
the rows, it aggregates the data into a different level of detail. On the left
side we see four rows; on the right side we have two rows, and with that we are
losing some details in the results. But still, we have solved the task.

So now let's see what is going to happen if you use a window function in SQL.
Okay. We have the same data and the same task: we have to find the total sales
for each product. If you use a window function, SQL is going to do the
following. It is going to process each row individually. So what happens? It
starts with the first row, order ID one. In the output we're going to get the
same row, order ID one, but we will also get the total sales for the caps. Here
the total sales is going to be 10 plus 30, so we will get 40. Then it's going
to jump to the second row and process it as well. In the output we will get
order ID two, the product caps, and we have the same aggregation, since we are
talking about the same product.
So we will get 40. Then it's going to go to the third order, and here we have
the gloves. So in the output again we have the order ID 3, the product gloves,
and the total sales this time is going to be 5 + 20, so we will get 25. Then it
goes to the last row, the order ID number four. In the output we're going to
get order ID four, gloves, and again 25. So now we can notice that if you use
the window function, you will not lose the level of detail of your data. So we
are doing something called row-level calculations. So if in the input data we
have four orders, in the output we're going to get four orders, and we will
still get our aggregations correctly. So now if you compare both of the methods
side by side, we can see that we are solving the same task. So we are finding
the total sales for each product, but with the GROUP BY we are smashing,
squeezing the results from four orders into two rows, one row for each product.
So that means with the GROUP BY the granularity is changing, right? In the
input the order ID is controlling the level of detail, but in the output of the
GROUP BY the product is controlling the level of detail. So we have a different
granularity. But on the other hand, with the window functions we are still able
to do aggregations but we are not losing the level of detail. So the
granularity of the input is going to be the same as the output in the results.
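
A minimal sketch of this side-by-side comparison, using Python's built-in sqlite3 module as a stand-in for the course's database (window functions need SQLite 3.25 or newer). The table name orders and its column names are assumptions made for illustration; the four rows follow the caps and gloves example.

```python
import sqlite3

# Hypothetical in-memory table mirroring the four-order example
# (table and column names are assumptions, not the course's schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Caps", 10), (2, "Caps", 30), (3, "Gloves", 5), (4, "Gloves", 20)],
)

# GROUP BY: one output row per product, so four rows collapse into two.
grouped = conn.execute(
    "SELECT product, SUM(sales) AS total_sales "
    "FROM orders GROUP BY product ORDER BY product"
).fetchall()

# Window function: every input row is kept and carries its product's total.
windowed = conn.execute(
    "SELECT order_id, product, "
    "       SUM(sales) OVER (PARTITION BY product) AS total_sales "
    "FROM orders ORDER BY order_id"
).fetchall()

print(grouped)   # [('Caps', 40), ('Gloves', 25)]
print(windowed)  # [(1, 'Caps', 40), (2, 'Caps', 40), (3, 'Gloves', 25), (4, 'Gloves', 25)]
```

The first result has two rows (product granularity), the second keeps all four rows (order granularity) with the same totals attached.
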
So this is exactly the main difference between the GROUP BY and the window
function. If you just want to do simple aggregations, then go with the GROUP BY.
But if you care about the level of detail and you need to add more details
to your results, then you can go with the window function, where you can do
aggregations plus having more details.
And now if you go and compare the functions between the window and the
GROUP BY, we can find that both of them have exactly the same functions for the
aggregations. So we have the COUNT, SUM, AVG, MIN, MAX. And here comes
another difference between the window and the GROUP BY. The GROUP BY has only
the aggregate functions. So that's it.
But in the window functions, we have way more functions to use for analytics.
So for example, we have the ranking functions. And we have here another
group of functions for the value, or we call them analytical functions. So that
means with the SQL window functions we have a lot of functions. We can cover a
lot of analytical use cases and advanced, complex stuff. But with the GROUP BY
we have only the aggregate functions, only for simple use cases. So this is
another difference between the GROUP BY and the window. GROUP BY: use it if you
have simple analysis, simple aggregations. Window functions: we're
going to use them for more advanced data analysis, where we're going to cover a
lot of use cases. All right guys, so now we're going to have a few tasks in
order to understand one thing: why do we need SQL window functions, and why in
some scenarios GROUP BY is not enough and we have to use SQL window functions.
So let's go. All right. So let's start with a very simple task. It says: find
the total sales across all orders. So we need one value with the total sales.
Let's see how we can do that. First make sure that you are using the database.
So USE the sales database, in case you have closed the client, so that we don't
get any errors. So now we're going to start with the first thing. We're going
to go and select the sales. You're going to find it in the sales orders table.
So now let's just query the data. And as you can see, we have 10 orders with 10
sales. We didn't aggregate anything yet.
So we have the raw data now. So now in order to solve the task, we're going to
use the function SUM. So SUM of sales, and we're going to give it a new name,
total sales. We don't have to use any GROUP BY because we don't have to group
up anything. So that's it. Let's go and execute that. And as you can see, SQL
is going to return one value, 380. This is the total sales that we have inside
our data. And this is the highest level of aggregation. So with that we have
solved the task. We have the total sales across all orders. We don't have to
group up anything. Let's move to the next example. Let's say that in the next
task, this time we want to find the total sales but for each product. So
not for all orders; for each product we want to find the total
sales. So this time we don't need only one value. We need one value for each
product. In order to do that, we're going to go and use the GROUP BY
clause. And we're going to group up by the product ID, and the GROUP BY needs
the dimension in the SELECT as well. So we can do it like this. So that's it.
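
A minimal sketch of these two queries, run through Python's built-in sqlite3 module as a stand-in database. The orders table, its column names, and its five rows are hypothetical, so the totals differ from the 380 shown in the video.

```python
import sqlite3

# Hypothetical stand-in for the course's orders table
# (names and rows are assumptions made for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25)],
)

# Task 1: no GROUP BY, so all rows collapse into a single value.
total = conn.execute("SELECT SUM(sales) AS total_sales FROM orders").fetchall()

# Task 2: GROUP BY product_id returns one row per product.
by_product = conn.execute(
    "SELECT product_id, SUM(sales) AS total_sales "
    "FROM orders GROUP BY product_id ORDER BY product_id"
).fetchall()

print(total)       # [(130,)]
print(by_product)  # [(101, 30), (102, 15), (104, 25), (105, 60)]
```
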
Let's go and execute the query. Now as you can see in the results, we don't
have one value. We don't have the highest aggregation. This time we are
drilling down to the next level of detail. So the level of detail here is the
product ID. We have one row for each product. So for the first product we have
140, the next one 105, and so on. So as you can see, we are now splitting the
data at the level of the product ID, and we went from 10 orders to four rows in
the results, and that's because we have four products. So the number of rows in
the output is going to be defined by the dimension, the product ID, and with
that we have solved the task. We have the total sales for each product. All
right guys, so let's keep progressing in our examples. Now the next one is
going to be a little bit advanced, where we have the same aggregation: find the
total sales for each product. Additionally, provide details such as the order
ID and the order date. So, as you can see, we have already solved the first
part. We are finding the total sales for each product. Now, we just have to add
some additional information, like the order ID and the order date. So, let's go
over here and just add it in our SELECT. So, order ID, and let's have the order
date. So, let's go and execute that. I'm just going to make it a little bit
bigger.
So, let's go. But now as you can see, SQL will not be happy. It's going to
throw an error that says the stuff that you are adding to your SELECT is not
included in the GROUP BY. So as you can see, in the GROUP BY we have only one
dimension, or one field, called the product ID. But in our selection we have
three dimensions: the order ID, the order date, and the product ID. So there is
no matching between the SELECT and the GROUP BY, and SQL will not allow it. And
now you might say, you know what, let's add everything to the GROUP BY. So with
that we're going to get our aggregation and we're going to get our details as
well.
So let's try that. I'm just going to zoom out a little bit, and instead of
having only the product ID, let's add everything. So the order ID, order date,
and the product ID. So now we have matching, and SQL should not throw any
error. Let's go and execute it. So now let's check whether we have solved the
task. The task has two parts, right? We have to do the aggregations and to
provide details. So as you can see, we have solved the second part. We have the
details, order ID and order date. But now the first part, finding the total
sales for each product, is destroyed, because if you check the results, we
have the product ID 101. It has the total sales of 10. But in the third
order, we have it as 20 for the same product. So actually the data is not
aggregated, and that's because we are aggregating at a different level and we
have included way more stuff that we don't need for the aggregations. We are
aggregating at the order ID level. So as you can see, now we are hitting the
limits of GROUP BY. We cannot provide aggregations and also provide
additional information from our data.
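
A minimal sketch of this dead end, grouping by every selected column, again using Python's built-in sqlite3 module with a hypothetical orders table (names and rows are assumptions). Only the group-by-everything query is shown: unlike the database used in the video, SQLite does not raise an error when a selected column is missing from the GROUP BY, so the error itself cannot be reproduced here.

```python
import sqlite3

# Hypothetical stand-in table; column names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER, order_date TEXT, product_id INTEGER, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2025-01-01", 101, 10),
        (2, "2025-01-05", 102, 15),
        (3, "2025-01-10", 101, 20),
    ],
)

# Grouping by all three columns makes every order its own group,
# so SUM(sales) is just that order's sales.
rows = conn.execute(
    "SELECT order_id, order_date, product_id, SUM(sales) AS total_sales "
    "FROM orders "
    "GROUP BY order_id, order_date, product_id "
    "ORDER BY order_id"
).fetchall()

for row in rows:
    print(row)
# (1, '2025-01-01', 101, 10)
# (2, '2025-01-05', 102, 15)
# (3, '2025-01-10', 101, 20)
```

Product 101 appears once with 10 and once with 20, never with its real total of 30.
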
You have to pick one. That's why we have to go to the second option, where we
can use the window functions. So let's do that. I'm just going to get rid of
the GROUP BY part and all the fields as well. Let's go back to the root. So now
we have the SUM of sales, and if I execute this, I'm going to get one value. So
we are at the highest level of aggregation. So now we need to use the
window function. I'm just going to remove the name. And now we're going to
tell SQL this is a window function: using OVER after the aggregation, or the
function, tells SQL we are talking about window functions. So let's just
execute it like this. And with that we got 10 rows, and that's because we have
10 orders, and for each row we have exactly the same value. So we have the
total sales of all orders for each row. So as you can see, SQL understands this
is a window function and that it should not group all the data in one row. It
should keep exactly the same rows, the same number of rows as the input. So
with that we have the window function, but we have to split the data by the
products.
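
A minimal sketch of this intermediate step, an aggregate followed by an empty OVER (), using Python's built-in sqlite3 module (SQLite 3.25 or newer) with a hypothetical five-row orders table. The table, column names, and values are assumptions, so the repeated total is 130 rather than the video's 380.

```python
import sqlite3

# Hypothetical stand-in table; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25)],
)

# OVER () with nothing inside: the window is the whole table,
# so every row gets the grand total and no rows are collapsed.
rows = conn.execute(
    "SELECT order_id, SUM(sales) OVER () AS total_sales "
    "FROM orders ORDER BY order_id"
).fetchall()

print(rows)  # [(1, 130), (2, 130), (3, 130), (4, 130), (5, 130)]
```
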
So now we're going to use the keyword PARTITION BY. It's like the GROUP BY but
with another wording. Then the product ID, the same dimension. So with that we
have the total sales by products as a name. So let's go and execute this. So
now as you can see, in the output we still have the same number of rows. We
have 10 orders.
We have 10 rows, but the result did change, because now we are aggregating
the data at the level of the product ID. In order to understand the results, we
have to add more information to our SELECT. So now let's add the same
dimension. It's going to be the product ID. I'm just going to add it at the
front over here. So let's select. And as you can see, now it makes more sense.
We have those products, and they always have the exact same sales, and the same
for the next product and so on. And now here comes the magic of the window
function.
We can add more information to our SELECT statement without having any
errors. So now we need additional information like the order ID. So we
can go over here and say order ID, order date. Any type of column, you can add
it to your SELECT. And let's go and execute. So as you can see, now we got the
result, even though columns in the SELECT such as the order ID and the order
date are not part of the window aggregation. So with that we have solved the
task. We have the additional information, the order ID and the order date, and
also the first part of the task, to find the total sales for each product. So
each of those values is the total sales for each product.
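
A minimal sketch of the finished query, using Python's built-in sqlite3 module (SQLite 3.25 or newer) as a stand-in database. The orders table, its column names, and its rows are hypothetical; only the shape of the query follows what is described above.

```python
import sqlite3

# Hypothetical stand-in table; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER, order_date TEXT, product_id INTEGER, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2025-01-01", 101, 10),
        (2, "2025-01-05", 102, 15),
        (3, "2025-01-10", 101, 20),
        (4, "2025-01-20", 105, 60),
        (5, "2025-02-01", 104, 25),
    ],
)

# PARTITION BY splits the window per product; the detail columns
# (order_id, order_date) can be selected freely next to the aggregate.
rows = conn.execute(
    "SELECT order_id, order_date, product_id, "
    "       SUM(sales) OVER (PARTITION BY product_id) AS total_sales_by_product "
    "FROM orders ORDER BY order_id"
).fetchall()

for row in rows:
    print(row)
# (1, '2025-01-01', 101, 30)
# (2, '2025-01-05', 102, 15)
# (3, '2025-01-10', 101, 30)
# (4, '2025-01-20', 105, 60)
# (5, '2025-02-01', 104, 25)
```

Both orders of product 101 now show the product total of 30 while keeping their own order ID and order date.
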
And with that we have solved the task, and this is exactly why we need window
functions. In real projects things get really complicated. You are doing
different tasks in one query. So you are doing aggregations, you are doing some
other stuff. So just focusing on the aggregations is not going to be enough.
You always have to add additional information to your query. So as you
can see, we use GROUP BY to do simple analysis, but as things get complicated
in the analytics, we use the window functions in order to show the
aggregations and also add additional information.
All right everyone. So now we're going to go and deep dive into the syntax of
the SQL window functions. We're going to cover everything, each part of the
syntax, for you to understand how to use them. So let's go. All right. So let's
start first by understanding the basic components, or the basic parts, of each
window syntax. Mainly we have two parts.
The first part is going to be the window function. We have, like, SUM, AVG and
so on. The second main part is going to be the OVER clause. And inside the OVER
clause we have three different parts. The first one is going to be the
PARTITION clause, the second the ORDER clause, and the last one we
have is the FRAME clause. And those are all the components that you can use
inside the window function. So two main parts: the window function and the OVER
clause. And inside the OVER we have partition, order, and frame. Let's go into
more detail. So for example, we have the following window function. So as you
can see, we have a lot of stuff going on here. We're going to understand it
step by step, component by component. Let's start from the left, from the first
one. So what do we have over here? We have a function, a window function. So
what is a window function? Like here we have the average.
It's like any other function in SQL. You can use it in order to do calculations
on top of the window. So the first thing to define in a window is the function
of the window. And as we learned before, we have a long list of window
functions available in SQL. And we group them into three groups. The first one
we have is the aggregate functions. So we have the COUNT, SUM, AVG, MAX. All
those functions we have for the GROUP BY as well. So those are used for the
aggregations. The second group of functions we have is the ranking functions.
So we have the ROW_NUMBER, RANK, NTILE and so on. So we can use this group in
order to give a rank to our data. The last group we call value, or sometimes
analytics, functions. So here we have very important functions like the LEAD,
LAG, FIRST_VALUE and the LAST_VALUE in order to access a specific value. And of
course we're going to go and learn all of them one by one, understanding the
concepts, some examples, and also for you to understand when to use them for
data analysis. All right, so now let's keep moving, understanding the other
parts of the window syntax. Now inside the function AVG we have here a
field name, or column name, called sales.
This is called a function expression. It's like a value, a parameter, an
argument that we can pass to the function. And here we can use multiple
different things, depending on the function of course. So here it could be
empty, like here in the ranking. RANK doesn't allow you to use an expression,
so it should always be empty. Or we can use a column, like in the example where
we use the sales. So we use the column name as an argument or an expression:
for the average, we are finding the average of sales. Or we could use a number.
So here in NTILE we are allowed to use only numbers. Or we could have multiple
things. For example, in LEAD we can have sales, then numbers, and so on. So
things get complicated. Don't worry about it, I'm going to explain that. Or we
can have a whole conditional logic. So for example, here we have CASE WHEN and
so on inside the SUM. So the whole thing over here forms an expression for the
SUM. So as you can see, we can build a complex logic here, and the output of
this logic can be passed to the function SUM. So that means as an expression
for the function we can use different things; of course it depends on whether
the function allows it or not. All right. So now let's have a quick overview in
order to understand which data types are allowed in the expressions for those
functions. Let's see the aggregate functions. As you can see, the COUNT
function accepts any data type, but the others, like SUM and AVG, allow only
numerical data types (in most databases MIN and MAX also accept other sortable
types such as dates and strings). All right. So now let's move to the rank
functions. The expression is pretty easy: it should be empty. They don't allow
any argument or anything inside those functions. So as you can see, all of them
are empty, with only one that accepts numerical values, which is NTILE. You
have to define a numeric value. And now moving on to the last type, we have the
value functions. They accept any data type inside the expressions. So as you
can see, each function has its own specifications, and you have to be careful
which data type you are using in the expressions. Okay. So now let's keep
moving to the next one. We have a very important part in the window syntax. So
far, what do we have? We have a function. We have an expression. It's the usual
stuff. We have done that before using the GROUP BY. Now we have to tell SQL
that we are dealing with a window function. It's not a normal one.
In order to do that, we have to specify the keyword OVER. So the second
main part of the syntax is the OVER clause. We use it to define a window,
and inside it we can define multiple things, like the PARTITION BY, the
ORDER BY, and the frame, but all of those are optional: we can skip them
and leave it empty. So the main task of OVER is, first, to tell SQL that
we are dealing with a window function here, and second, you can use it to
define a window of your data. So now we're going to cover everything
inside the OVER clause, and we're going to start with the first one, the
PARTITION BY. All right. So now we're going to learn how to define a
window inside the OVER clause. The first part that we can define is the
PARTITION BY. For example, here we have PARTITION BY category. We have to
define the dimension; it's very similar to GROUP BY, just different
wording. So the first part is the partition clause. What does it do? It
divides the entire data set into groups, or you can call them windows or
partitions. So here we tell SQL how to divide our data, and we have two
options. Let me just show you. So say we don't use anything and we leave
it empty.
You see OVER, and PARTITION BY is not used. What happens? SQL uses the
entire data to do the calculation, so the whole data set counts as one
window. We are telling SQL: don't divide anything, leave it as it is. The
second option is to divide the data with PARTITION BY. We define the
window like this, PARTITION BY product for example, and SQL divides the
entire data into different windows, here for example two windows. This
time the calculation, the sum of sales, is not applied to the entire data
set; it is applied to each window individually. So we find the sum of
sales for window one separately from the total sales of window two. All
right. So now we have this very simple example. We have three fields
here: the month, the product, and the sales. It's really simple
information. And now we have the following SQL window function.
So we have the SUM of sales, and inside the OVER clause we are not using
anything, so no PARTITION BY. How is SQL going to define the window? SQL
says: okay, I don't have to divide anything, the entire data set is one
window. So SQL goes over here and says the whole thing is one window.
There are no partitions, nothing; we have only one window, so the entire
data is aggregated. This is what happens if you don't use PARTITION BY
and you leave the OVER clause empty: the entire data is one window.
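The empty OVER clause described above can be sketched like this. It is a minimal illustration that runs the SQL through Python's built-in sqlite3 module (SQLite 3.25 or newer is needed for window functions) as a stand-in for the course database; the table name `monthly_sales` and all sample values are assumptions, not the course's data.

```python
import sqlite3

# Hypothetical sample data (month, product, sales), for illustration only.
rows = [
    ("Jan", "Bottle", 20),
    ("Jan", "Caps", 10),
    ("Jan", "Gloves", 30),
    ("Feb", "Bottle", 40),
    ("Feb", "Caps", 5),
    ("Feb", "Gloves", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, product TEXT, sales INTEGER)")
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", rows)

# Empty OVER (): no PARTITION BY, so all rows form a single window.
result = conn.execute(
    """
    SELECT month, product, sales,
           SUM(sales) OVER () AS total_sales
    FROM monthly_sales
    """
).fetchall()

for row in result:
    print(row)
```

Every row keeps its own details, and the same grand total is repeated next to each of them, because the whole table is the one and only window.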
All right. So now let's move to the next example. We don't want to have
only one window; we would like to have multiple windows, so we have to
divide the data by something. In the OVER clause we define the window
like this: PARTITION BY month. So it's not empty; we are now dividing the
data by the field month, and the values inside this column divide the
data set. Here we have two months, January and February. So what does SQL
do? It divides the data into two sets. The first window is this one,
January (let me make it smaller), and the second window is February. So
we have two windows inside our data, and the calculation happens on each
window separately. As you can see, we are using the month to divide our
data set into two windows: one window for January and another window for
February.
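The split into a January window and a February window can be sketched as follows. As before, this is only an illustration under assumptions: the SQL runs through Python's built-in sqlite3 module (SQLite 3.25 or newer), and the table `monthly_sales` with its values is made up for the example.

```python
import sqlite3

# Hypothetical sample data (month, product, sales), for illustration only.
rows = [
    ("Jan", "Bottle", 20),
    ("Jan", "Caps", 10),
    ("Jan", "Gloves", 30),
    ("Feb", "Bottle", 40),
    ("Feb", "Caps", 5),
    ("Feb", "Gloves", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, product TEXT, sales INTEGER)")
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", rows)

# PARTITION BY month: one window per distinct month value,
# and SUM is calculated inside each window separately.
result = conn.execute(
    """
    SELECT month, product, sales,
           SUM(sales) OVER (PARTITION BY month) AS sales_by_month
    FROM monthly_sales
    """
).fetchall()

# Collect the per-window totals to make the two windows visible.
totals = {month: sales_by_month for month, _, _, sales_by_month in result}
print(totals)
```

All six rows are still returned; the only change from the empty OVER clause is that each row now carries the total of its own month instead of the grand total.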
So now let's have a quick overview of the options that we have with the
PARTITION BY. The first option, as we learned: we can just skip it. So
without PARTITION BY, for example here, total sales across all rows, and
we don't define anything inside the OVER clause. The second option: we
can use one field, one column, for example PARTITION BY product. So we
are using one dimension. But we can also mix things: we can use multiple
columns, or multiple dimensions, in the PARTITION BY, for example here
PARTITION BY product and order status. So with the PARTITION BY we can
define a list of dimensions that are used to divide our data. In this
example we are saying: find the total sales for each combination of
product and order status. So those are the different options for working
with the PARTITION BY. Now let's have this overview again for all
functions. The PARTITION BY is optional for all of those functions, so if
you don't use it in any of them, you will not get any errors. So now
let's go back to SQL in order to start practicing with this clause. Okay.
So now we have the following task: find the total sales across all
orders, and we have to provide additional information like the order ID
and the order date. So let's go and solve it step by step. First, I would
like to provide the details, so I'm going to select the order ID and the
order date from the table Sales.Orders.
And next we're going to work with the aggregation. We need to find the
total sales across all orders. Again, since we have details and
aggregations here, we cannot use GROUP BY; we have to use a window
function. So we're going to use the function SUM for sales. And now we
have to tell SQL we are working with window functions; that's why we're
going to use the OVER clause. The next step is to think about defining
the window. So let's check the task. It says total sales across all
orders. That means we don't have to partition or divide the data set into
chunks or partitions; we have to leave it as it is, so the whole data is
one window. And that's why we don't use PARTITION BY inside the
definition.
We're going to leave it empty. Let's go now and give it a name; it's
going to be the total sales. Let's go and execute this. And now in the
results, as you can see, we have all the orders, all the details, and as
well the total sales across all orders. So with that, we have solved the
task: we have the total sales and as well some details about the order.
All right. So now let's move to the next task. It's going to be very
similar. It says: find the total sales for each product, and we have to
provide additional information like the order ID and the order date. So
it's a very similar task, but this time we have to divide the entire data
into windows, and that's going to be by the product, since we are saying
total sales for each product. So this time we have to divide the data.
We're going to define the window like this: PARTITION BY, and we can use
the dimension product ID. Let's go and execute this. So now you can see
that in the total sales we no longer have the total sales of the whole
data; they are divided. But in order to understand the results, let's
include the product ID in the results. So product ID, and execute. Now,
by looking at the results, you can see that the data is divided into four
windows. Let's see them. It's going by the product ID, so this dimension
is controlling the partition. The first window is the product ID 101.
So we have the total sales for this product, 140. The next window is 102,
the third one 104, and the last window is only one row, the 105, with
total sales of 60. So with that we have solved the task: we have the
total sales for each product, and as well we have some details. Now I
would like to show you the dynamic of the window function. We can add
multiple aggregations on multiple levels. Let me show you what I mean.
Let's say we stay with the same example, but we're going to find the
total sales across all orders and as well the total sales for each
product. What we can do is use the window functions on different levels,
for example here by removing the whole definition. So here we have the
total sales for the entire data, for the first task, and the next one is
the total sales but divided by the product ID. Let's rename this one to
total sales by product. Let's go and execute this. And now, you know
what, I'm going to add the sales as well, just to explain the flexibility
of the window function.
So let's add the sales and execute it again. And now, by looking at the
results, you can see we have the sales information three times but with
different granularities. The first one, the sales, is the sales without
any aggregation. It is the highest level of detail of the sales: we have
the sales for each order. The next one is the total sales with the window
function. Here we have the highest level of aggregation: the total sales
of all orders. And the last one, the total sales by product, is something
in the middle: we are aggregating on a window, and the window is the
product ID. So as you can see, we have different granularities of the
aggregations, and this is exactly the flexibility that we have with the
window function. We can do all of that in one query. Okay. So now let's
keep moving and add more to our task.
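Before moving on, the three granularities just discussed (sales per order, total sales of all orders, total sales per product) can be sketched in one query. This is a minimal illustration that runs the SQL through Python's built-in sqlite3 module (SQLite 3.25 or newer) as a stand-in for the course database. The table name `orders`, the snake_case column names, and the row values are assumptions, chosen so that the totals agree with the numbers mentioned in the walkthrough (140 for product 101, 60 for product 105); the order date is left out for brevity.

```python
import sqlite3

# Illustrative rows (order_id, product_id, sales); assumed values.
orders = [
    (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25),
    (6, 104, 50), (7, 102, 30), (8, 101, 90), (9, 101, 20), (10, 102, 60),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

result = conn.execute(
    """
    SELECT order_id,
           product_id,
           sales,                                    -- row level, no aggregation
           SUM(sales) OVER () AS total_sales,        -- one window: all rows
           SUM(sales) OVER (PARTITION BY product_id)
               AS sales_by_product                   -- one window per product
    FROM orders
    ORDER BY order_id
    """
).fetchall()

for row in result:
    print(row)

# Per-product totals, collected from the last column.
by_product = {row[1]: row[4] for row in result}
```

Each window function carries its own OVER clause, so the two aggregations do not interfere with each other or with the unaggregated `sales` column.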
It says: find the total sales for each combination of product and order
status. So this time we have to divide the data not only by the product
but also by another dimension, the order status. So now let's see how we
can do that. I'm just going to show the dimension order status in the
results, and we're going to add the following: SUM of sales, OVER since
it's a window function, and now let's define the window with PARTITION
BY. We have the product ID again, but not only this dimension, the order
status as well. And let's call it sales by product and status. Let me
just rename those things. Okay. So let's go and execute. All right. So
now let's check the results. It is the last aggregation over here. And as
you can see, this aggregation has a different granularity than the
previous one, and we have more detail. This time we are splitting the
data by two dimensions. So the first window is the product ID together
with the order status; it's only those two rows. We have the product ID
101 and the order status delivered, so the total sales of this window is
10 + 20, and we get 30. The next window is the same product but with a
different status, so it's 101 shipped; we sum those two values and get
110. The next product and order status is 102 delivered, and we have it
only once, so it's going to be the same value. The next partition, or
window, is two rows: 102 shipped is those two values, 60 + 15, and we get
75. So as you can see, the product ID and the order status are
controlling how many windows we get. Here we get around six windows. With
the product ID alone we got only four windows, and without using anything
inside the OVER clause we get only one window.
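Partitioning by two dimensions can be sketched as follows. Again this is only an illustration: the SQL runs through Python's built-in sqlite3 module (SQLite 3.25 or newer), and the table `orders`, the column names, and the rows are assumptions chosen to agree with the sums mentioned above (30, 110, and 75).

```python
import sqlite3

# Illustrative rows (order_id, product_id, order_status, sales); assumed values.
orders = [
    (1, 101, "Delivered", 10), (2, 102, "Shipped", 15),
    (3, 101, "Delivered", 20), (4, 105, "Shipped", 60),
    (5, 104, "Delivered", 25), (6, 104, "Delivered", 50),
    (7, 102, "Delivered", 30), (8, 101, "Shipped", 90),
    (9, 101, "Shipped", 20), (10, 102, "Shipped", 60),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER, product_id INTEGER, order_status TEXT, sales INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)

# Two columns in PARTITION BY: one window per (product, status) combination.
result = conn.execute(
    """
    SELECT order_id, product_id, order_status, sales,
           SUM(sales) OVER (PARTITION BY product_id, order_status)
               AS sales_by_product_and_status
    FROM orders
    ORDER BY product_id, order_status, order_id
    """
).fetchall()

# Distinct windows and their totals.
windows = {(r[1], r[2]): r[4] for r in result}
for key, total in windows.items():
    print(key, total)
```

The number of windows is simply the number of distinct combinations of the listed columns that occur in the data, which is why adding a column to PARTITION BY can only keep or increase the number of windows.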
So this is how the PARTITION BY works. All right. So that was the first
part of the window definition within the OVER clause. Let's move to the
next part: we have the ORDER BY. For example, we can use ORDER BY order
date; it's just a field. The ORDER BY clause is very important in order
to sort your data within a window, and it is very important as well for
many functions. Just checking the overview over here: for the aggregate
functions it is optional, so you can leave it out or add it. But for the
rank functions and as well for the value functions it is a must. If you
want to use those functions, you must use the ORDER BY clause, because it
makes no sense, for example, to rank the data without sorting it first.
Okay guys. So now back to our very simple example, and we have the
following query. The function this time is RANK, so we have to rank the
data, and the definition of the window is PARTITION BY month. That means
we divide the data by the month; we have it over here. And then the
second part is ORDER BY sales descending, so we have to sort each window
in descending order. That means we start with the highest value and end
with the lowest value.
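The query just described, RANK with PARTITION BY month and ORDER BY sales descending, can be sketched like this. It is a minimal illustration that runs the SQL through Python's built-in sqlite3 module (SQLite 3.25 or newer); the table `monthly_sales`, the product names, and the January values are assumptions, while the February values 70, 40, and 5 follow the example.

```python
import sqlite3

# Hypothetical sample data (month, product, sales), for illustration only.
rows = [
    ("Jan", "Bottle", 20),
    ("Jan", "Caps", 10),
    ("Jan", "Gloves", 30),
    ("Feb", "Bottle", 40),
    ("Feb", "Caps", 5),
    ("Feb", "Gloves", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, product TEXT, sales INTEGER)")
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", rows)

# PARTITION BY splits the rows by month; ORDER BY sorts inside each window;
# RANK then numbers the rows of each window independently.
result = conn.execute(
    """
    SELECT month, product, sales,
           RANK() OVER (PARTITION BY month ORDER BY sales DESC) AS sales_rank
    FROM monthly_sales
    ORDER BY month, sales_rank
    """
).fetchall()

for row in result:
    print(row)

# Rank per (month, sales) pair, to show that each window restarts at 1.
ranks = {(r[0], r[2]): r[3] for r in result}
```

The ORDER BY inside OVER only controls the ranking within each window; the final ORDER BY at the end of the query is what sorts the printed output.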
So let's see how SQL is going to execute this. First, PARTITION BY month:
it divides the data into two partitions, because we have two values for
the month. Let's see how this looks: one window for January and another
window for February. All right. Now SQL goes to the second part and
executes ORDER BY sales descending. What happens? SQL goes to each window
separately and sorts the data from the highest to the lowest, without
checking the other window. So among those three values, the highest one
is this one, so it goes on top. Let me just sort it: this is the lowest,
and this one goes in the middle. So SQL sorts this window separately from
the next one. Once it's done, it goes to the second one. The highest
value is this one, and you are the lowest. Let me just do it like this.
So SQL sorts it like this: the highest one is 70, the next one is 40, and
the last one is 5. With that, SQL is done with the definition of the
window: it is split by the month, and each window is sorted by the sales.
The next step is to rank those values. It's really simple. In the output,
it ranks the data like this: the first one is this value, the next one is
two, and the third one is three. As you can see, SQL is ranking only this
window, and it repeats the same thing for the second window, so each
ranking is separate from the others. As you can see, it's very simple.
This is how SQL executes PARTITION BY together with ORDER BY for the RANK
function.
All right. So now let's have a quick task for the ORDER BY. It says: rank
each order based on its sales from the highest to the lowest, and we have
to provide additional information like the order ID and the order date.
So let's see how we can write the query. We have the basic stuff: order
ID, order date, and sales. And now we're going to rank the data using a
window function. So we use the function RANK, then we tell SQL this is a
window function with OVER, and inside it we now have to provide the
definition of the window. By checking the task, you can see that we don't
have to divide the data, so we don't have to use PARTITION BY. We just
have to use RANK, and with RANK we have to use the ORDER BY; it is a
must. So we use ORDER BY, the field is the sales, and descending, from
the highest to the lowest. Let's just call it rank sales, and let's go
and execute this. As you can see, our results are sorted from the highest
to the lowest: the sales of 90 is at the top and the lowest is the 10.
And as well we have a rank: the top rank is one and the lowest rank is
10. So as you can see, we just quickly created a rank in SQL.
we just quickly create a rank in SQL. It's very simple. The whole thing is one
It's very simple. The whole thing is one window since we are not using partition
window since we are not using partition pi. And of course if you want to have
pi. And of course if you want to have ascending so from the lowest to the
ascending so from the lowest to the highest you can just remove it because
highest you can just remove it because optionally going to be ascending. So
optionally going to be ascending. So let's go and execute the query. So now
let's go and execute the query. So now we can see the orders are sorted the way
we can see the orders are sorted the way around. So we start with the lowest and
around. So we start with the lowest and end up with the highest. And of course
end up with the highest. And of course you're going to get the same results if
you're going to get the same results if you go over here and add ascending. So
you go over here and add ascending. So if you execute you see we got exactly
if you execute you see we got exactly the same results. So this is how you use
the same results. So this is how you use the order by inside the window
the order by inside the window definition.
Okay guys, so with that we have covered the second part of the window
definition. Now we're going to go to the last part, the most advanced
part of the window, and we have the following: ROWS UNBOUNDED PRECEDING.
We call this the frame clause, or window frame. What we are doing here
is defining a subset of rows within each window that is relevant for the
calculation. I totally understand if this is confusing or complex at the
start. It was for me as well. So we're going to deep dive into the
concept in order to understand how this works, and we're going to do it
step by step, so don't worry about it. All right. So now let's
understand what is going on with the frame clause from the basics. If
you do aggregations and you don't use a window function, you're going to
consider the entire data, all rows inside the table. But what we can do
is divide the data into windows using PARTITION BY. So for example here
we have window one and window two. Now if you do aggregations, all the
rows in window one are going to be aggregated, and then SQL is going to
go to window two and aggregate all of its rows. What we can do in SQL is
say: you know what, I don't want all rows inside the window, I want a
subset of rows inside the window. So what we are doing here is that we
have those two windows, but we specify a scope, a subset of data from
each window, to be involved in the aggregations. And of course not only
aggregations; we can do ranking and other calculations. So here it's
like we have a window inside a window. We are defining a scope of rows:
not all rows should be involved in the calculation, but only a specific
subset of data. And we can do that using the frame clause. So again, you
can use PARTITION BY in order to divide the entire data set into
multiple windows. And now for the frame clause: if you don't want to
consider all the rows within each window in the calculation, and you
want to focus on only a subset of data within each window, then you go
and use the frame clause.
All right. So now let's go and understand the syntax of the frame
clause. Let's have the following example. We are saying the window
function is the average of sales, and then we define the window. So
first we have PARTITION BY category, ORDER BY order date, and then we
have the frame clause. It's going to be the following: ROWS BETWEEN
CURRENT ROW AND UNBOUNDED FOLLOWING. First we have the frame type, and
there are two types: ROWS and RANGE. Then we have BETWEEN and AND, which
define the range. The first part of the range is the frame boundary with
the lower value, and it accepts three types of keywords: CURRENT ROW,
N PRECEDING or UNBOUNDED PRECEDING. Then we have another frame boundary,
the higher value, and it accepts the following: CURRENT ROW,
N FOLLOWING or UNBOUNDED FOLLOWING. So as you can see, we are defining a
boundary, a range from the lower value to the higher value. Now we have
some rules. We cannot use the frame clause without ORDER BY, so ORDER BY
must exist in the window definition in order to use the frame clause.
And the second rule says the lower boundary must come before the higher
boundary. So we always start with the lower boundary and end with the
higher boundary. You cannot switch that.
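To make the three parts of the window definition concrete, here is a small sketch of an average of sales with PARTITION BY, ORDER BY and a frame clause that runs from the current row to the end of the window. It is only an illustration under assumptions: it runs on Python's built-in sqlite3 (SQLite 3.25 or later) rather than the editor used in the video, and the table `orders`, its columns and its values are hypothetical.

```python
import sqlite3

# Hypothetical data: two categories, ordered by date within each one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, order_date TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A", "2025-01-01", 10),
        ("A", "2025-01-02", 20),
        ("A", "2025-01-03", 30),
        ("B", "2025-01-01", 40),
        ("B", "2025-01-02", 60),
    ],
)

rows = conn.execute(
    """
    SELECT category, order_date, sales,
           AVG(sales) OVER (
               PARTITION BY category      -- 1) divide the data into windows
               ORDER BY order_date        -- 2) sort the rows inside each window
               ROWS BETWEEN CURRENT ROW   -- 3) frame: lower boundary ...
                    AND UNBOUNDED FOLLOWING  -- ... and higher boundary
           ) AS avg_sales
    FROM orders
    ORDER BY category, order_date
    """
).fetchall()

avg_by_row = [r[3] for r in rows]
for row in rows:
    print(row)
```

Each row's average only looks at that row and the rows after it in the same category, which is why the values change from row to row inside a window.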
Okay. So now we have a very simple example. We have the month and the
sales, and the following query: SUM of sales is the window function, and
the definition of the window is ORDER BY month. We are not using
PARTITION BY, just to make our life easier. And the frame clause is
defined like this: ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING. So now
let's see how SQL is going to execute this. The first part of the
definition is ORDER BY month; as you can see, the months are sorted
already. So now SQL is going to work with the frame definition, current
row and two following. SQL is going to process this row by row. It
starts with the first row, and that is our current row, as in the SQL.
So this is our current row, and we say the range goes until the two
following rows, which are February and March. That means the pointer for
the two following is going to be here. With this we have the frame
boundaries, and SQL has the following scope for the first row: three
rows, and the sum of those three rows is 60. So we will get 60 for the
first row, because the scope is not all rows but only this subset of
data. Okay. So with that SQL is done with the first row and jumps to the
second row. The pointer for the current row is at February, and the two
following ends at April. As you can see, we are sliding down in the
subset of data, in the window, and with that we have a new scope, a new
subset, and the sum of all those values is 45. So that's it. I think you
get it already. It goes to the next one: the pointer is on March and the
two following is on June, and it slides like this. We have those three
rows in the scope, and the sum of that is 105. Now things get
interesting for the next row. The pointer for the current row is April,
but the two following would be after the end of the table. So as we
slide down, as you can see, the scope, the subset of the frame, is only
two rows, and the output is 75. And finally, if you go to the last row,
it is the current row and we have only one row in this subset, because
the two following is just outside of the table, and we get the same
value as the sum. So as you can see, that's it. It's very simple, right?
We use the frame in order to scope which rows are involved in the
calculations. So all you have to do is define the boundaries of the
frame, the lower and the upper boundary.
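The row-by-row walkthrough above can be reproduced with a small sketch. The individual sales figures are not spelled out in the transcript, so the values below are assumptions, picked so that they are consistent with the totals quoted for this example (45, 105, 75, and later 115, 40, 60 and 135); with these values the first row's frame sums to 60. The table and column names are hypothetical, and the sketch uses Python's built-in sqlite3 (SQLite 3.25 or later) only to be runnable.

```python
import sqlite3

# Assumed monthly figures (no May row, as in the walkthrough).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5), (6, "Jun", 70)],
)

rows = conn.execute(
    """
    SELECT month, sales,
           SUM(sales) OVER (
               ORDER BY month_no
               ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
           ) AS total_sales
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

# One total per row: the frame slides down and shrinks near the end.
totals = [r[2] for r in rows]
for row in rows:
    print(row)
```

The last two rows show the shrinking frame: April only has one following row left, and June has none, so June's total is just its own value.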
Let's see what other options we have with the frames. Okay. So here we
have the same example, but we redefine the boundaries of the frame like
this: ROWS BETWEEN CURRENT ROW, this is the first boundary, AND
UNBOUNDED FOLLOWING. This means that we are always targeting the last
record in the window or in the table. So UNBOUNDED FOLLOWING is always
static, and in this example it points to June. SQL still goes row by
row, and the current row starts at January and then February. Let's take
this example: the pointer is on February, and the subset, the frame, is
those four rows: February, March, April, June. So it's four rows, and
the total aggregation of that is 115. So you can do it like this.
Previously it was more flexible, it was two following, but this time we
have UNBOUNDED FOLLOWING, which means the boundary is always the last
row. So as we move through the records, the frame gets smaller and
smaller, and at the last one both boundaries are on the same record: the
current record is also the unbounded following.
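As a rough check of the shrinking frame described above, the sketch below sums from the current row to the end of the table. The monthly figures are assumed values chosen to match the totals quoted in the transcript (for example 115 for February and 135 for all rows); the table name and columns are hypothetical, and Python's built-in sqlite3 (SQLite 3.25 or later) stands in for the editor used in the video.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5), (6, "Jun", 70)],
)

# The upper boundary is pinned to the last row; only the lower one moves.
remaining = [
    r[0]
    for r in conn.execute(
        """
        SELECT SUM(sales) OVER (
                   ORDER BY month_no
                   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
               )
        FROM monthly_sales
        ORDER BY month_no
        """
    )
]
print(remaining)
```

The frame loses one row at each step, so the totals fall from the grand total at January down to June's own value.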
Okay, let's see the next one. The frame definition is going to be the
following: ROWS BETWEEN 1 PRECEDING AND CURRENT ROW. So here it is the
other way around: 1 PRECEDING is lower than the current row. Let's see
how SQL is going to execute this. Let's say that we are currently at
March. This is the current row, and we are saying between 1 PRECEDING,
which means one row before the current row. So the frame looks like
this, and we have only two rows. The value is the sum of those two rows,
which is 40. That means we are always targeting the rows before the
current row. Okay. So now let's keep going with the other options in
order to understand everything about the frame. We redefine it like
this: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. UNBOUNDED
PRECEDING is the first row in the table or in the window, so it is
static like this. It is the first one, January. And let's say that the
current row is March. So the subset looks like this: those three rows,
and the total of that is 60. Now as SQL proceeds to the next row, the
first boundary stays fixed, so it always points to January, and the
subset gets a little bit bigger until we reach the last row, where the
subset is all the rows. So with that we get really great flexibility in
how to define the subset and how the subset shifts through the window.
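The two PRECEDING variants above can be compared side by side in a short sketch. As before, the monthly figures are assumed values that match the totals quoted in the transcript (40 and 60 at March), the names are hypothetical, and Python's built-in sqlite3 (SQLite 3.25 or later) is used only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5), (6, "Jun", 70)],
)

rows = conn.execute(
    """
    SELECT month,
           -- previous row plus the current row
           SUM(sales) OVER (
               ORDER BY month_no
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS prev_and_current,
           -- first row of the window up to the current row (running total)
           SUM(sales) OVER (
               ORDER BY month_no
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

prev_and_current = [r[1] for r in rows]
running_total = [r[2] for r in rows]
for row in rows:
    print(row)
```

January has no preceding row, so its two-row frame contains only itself, while the running total keeps growing until it covers every row.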
Okay, so now we are just having fun, we are just playing around with the
boundaries. We don't always have to use the current row as a boundary.
For example, in this definition we use ROWS BETWEEN 1 PRECEDING AND
1 FOLLOWING, so CURRENT ROW doesn't appear in the boundaries at all.
Let's say again our current row is March. One preceding is February and
one following is April. With that our frame is the three rows, like
this, and the aggregation of this is 45. So as you can see, the
boundaries are 1 PRECEDING and 1 FOLLOWING; a boundary doesn't always
have to be the current row. All right. So now I think you already get
it. What is the last option? We're going to have everything. The
definition of the frame is ROWS BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING. What do we have here? UNBOUNDED PRECEDING is
January and UNBOUNDED FOLLOWING is June, so the frame is everything, all
the rows. And it doesn't matter where we are with the current row; it is
always a fixed subset, always everything. So whether we are here, or at
February or March, we consider all rows, and the total sales is 135. We
get the exact same result for all rows. So I think it's not that
complicated, right? We just have to provide the boundaries, and then the
calculation depends on the frame, on the subset of data.
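The last two frame options can be sketched the same way. Again the monthly figures are assumptions consistent with the totals quoted in the transcript (45 at March, 135 overall), the names are hypothetical, and Python's built-in sqlite3 (SQLite 3.25 or later) is only a convenient way to run the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5), (6, "Jun", 70)],
)

rows = conn.execute(
    """
    SELECT month,
           -- one row before, the current row, and one row after
           SUM(sales) OVER (
               ORDER BY month_no
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS neighbours,
           -- the whole window, regardless of the current row
           SUM(sales) OVER (
               ORDER BY month_no
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS everything
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

neighbours = [r[1] for r in rows]
everything = [r[2] for r in rows]
for row in rows:
    print(row)
```

Note that the current row is still inside the 1 PRECEDING to 1 FOLLOWING frame even though the CURRENT ROW keyword is not written, because it lies between the two boundaries.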
Okay guys, so now let's go back to SQL and start practicing in order to
understand how the frame works. Let's define a window like this: SUM of
sales, and the window definition like this. We're going to divide the
data by order status, and let's say we sort it by order date. And let's
define a frame like this: ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING.
Let's give it a name, total sales. So let's go and execute it. Now let's
look at the data. You see that SQL divides our results into two
sections, two windows: delivered and shipped. And you can see that the
data is sorted by the order date. For example, in the status delivered
we can see the 1st of January, the 10th, and so on. And then the third
part: we have defined a frame in each window. For example, let's take
the first one. This is the current row, and we say the frame is between
the current row and the two following orders. That means the scope is
like this: 10 + 20 + 25, which is 55. What is also interesting to check
here is the last record of each window. Let's take this window here,
where the last record is order number seven, and let's say this is the
current record. We set the frame between the current record and the two
following. But since it is the last record of this window, SQL will not
consider the next two orders, because those two orders are outside of
the window. That's why we have 30 here and SQL doesn't sum all those
values: we have 30 and there is nothing after that, so we get 30. So as
you can see, the frame is calculated within one window. It will not
consider anything outside of the window. So this is how the frame works
within partitions.
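The practice query above can be sketched as follows. The ten orders are hypothetical rows, chosen so that the first delivered frame sums to 10 + 20 + 25 = 55 and the last delivered order (number seven) has sales of 30, as described; the real course data may differ. The sketch runs on Python's built-in sqlite3 (SQLite 3.25 or later), not on the editor shown in the video.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_status TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2025-01-01", "Delivered", 10),
        (2, "2025-01-05", "Shipped", 15),
        (3, "2025-01-10", "Delivered", 20),
        (4, "2025-01-20", "Shipped", 60),
        (5, "2025-02-01", "Delivered", 25),
        (6, "2025-02-05", "Delivered", 50),
        (7, "2025-02-10", "Delivered", 30),
        (8, "2025-02-18", "Shipped", 90),
        (9, "2025-03-10", "Shipped", 20),
        (10, "2025-03-15", "Shipped", 60),
    ],
)

rows = conn.execute(
    """
    SELECT order_id, order_status, sales,
           SUM(sales) OVER (
               PARTITION BY order_status
               ORDER BY order_date
               ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
           ) AS total_sales
    FROM orders
    ORDER BY order_status, order_date
    """
).fetchall()

# Split the totals per window to show that frames never cross partitions.
delivered = [r[3] for r in rows if r[1] == "Delivered"]
shipped = [r[3] for r in rows if r[1] == "Shipped"]
print(delivered)
print(shipped)
```

The last delivered row keeps a total of 30 even though more orders follow it in the result, because those orders belong to the shipped window.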
So now I would like to show you a few more things about frames. We can
use shortcuts, but only with PRECEDING. For example, let's say I change
the definition like this, to 2 PRECEDING AND CURRENT ROW. Let's go and
execute it, and we get those results. If you want to check the results
quickly, let's take for example this order here: we are always summing
its value with the values of the two previous orders. That means those
three orders are involved in the frame, and the output is 55. Now there
is a shortcut in SQL, but only for PRECEDING, where we can remove the
range. We can remove everything and leave it like this: ROWS
2 PRECEDING. If you go and execute it, we get the exact same results. So
this is a quick way, a shortcut, to define a frame, but it only works
with PRECEDING. For example, if I go here and say UNBOUNDED, it's going
to work: we get the results between UNBOUNDED PRECEDING and the current
row. But if you go here and say, you know what, let's have UNBOUNDED
FOLLOWING, SQL is going to say there's an error. And the same thing if
you remove UNBOUNDED and say for example 1 FOLLOWING: SQL will not like
it. So you can use the shortcut only with PRECEDING.
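A quick way to check the shortcut rule is to run the long and short forms and compare them. The sketch below does this on Python's built-in sqlite3 (SQLite 3.25 or later), which is an assumption of this example rather than the engine used in the video; the error messages differ between engines, so the code only checks whether the statement is rejected. The helper names `totals` and `is_rejected` and the order rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, order_status TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2025-01-01", "Delivered", 10),
        ("2025-01-10", "Delivered", 20),
        ("2025-02-01", "Delivered", 25),
        ("2025-01-05", "Shipped", 15),
        ("2025-01-20", "Shipped", 60),
    ],
)


def totals(frame):
    """Hypothetical helper: run the windowed SUM with the given frame clause."""
    sql = f"""
        SELECT SUM(sales) OVER (
                   PARTITION BY order_status
                   ORDER BY order_date
                   {frame}
               )
        FROM orders
        ORDER BY order_status, order_date
    """
    return [r[0] for r in conn.execute(sql)]


def is_rejected(frame):
    """Hypothetical helper: True if the engine refuses the frame clause."""
    try:
        totals(frame)
    except sqlite3.Error:
        return True
    return False


long_form = totals("ROWS BETWEEN 2 PRECEDING AND CURRENT ROW")
short_form = totals("ROWS 2 PRECEDING")
running_long = totals("ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW")
running_short = totals("ROWS UNBOUNDED PRECEDING")

# The shortcut is not accepted with FOLLOWING.
rejects_unbounded_following = is_rejected("ROWS UNBOUNDED FOLLOWING")
rejects_one_following = is_rejected("ROWS 1 FOLLOWING")
print(long_form, short_form, rejects_unbounded_following, rejects_one_following)
```

The short form always implies CURRENT ROW as the upper boundary, which is why it can only describe frames that start at or before the current row.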
And one last thing about the frames: there is a default frame. If you
don't specify any frame and you use ORDER BY, SQL is going to use a
default frame. If you check the result, you will notice that for this
window here the values are not the totals of the sales for the whole
window. There is a hidden frame, and the default frame in SQL is like
this: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. So this is the
default frame if you use ORDER BY. Now if you go and execute it, you
will see that we get the exact same results. So be careful: once you use
ORDER BY with aggregate functions, there will be a hidden frame, a
default frame like this, between UNBOUNDED PRECEDING and the current
row. That means there are three ways to get this frame between UNBOUNDED
PRECEDING and current row. Either write it out like this, or use the
shortcut like this. Let me just execute it, and we get the same result.
Or just remove it completely, and we get the same results as well. Now
again, the hidden frame, the default frame, only applies with ORDER BY.
So if you go here for example and remove the ORDER BY, let's see the
results. The whole window will be aggregated. Let me just select it. You
can see that SQL considers all the rows in the aggregation, and we get
the total sales for the whole window. So there is no frame defined; it
is only present once you use ORDER BY. All right friends, so with the
frame clause we have now covered all the components of how to define a
window inside an OVER clause, and with that we have covered everything
about the syntax of the window functions.
Okay guys, so now we're going to go and understand the rules, or let's
say the limitations, of window functions. Let's learn what you are not
allowed to do while using window functions. Okay, so the first rule is
that you are allowed to use a window function only in the SELECT clause
and in the ORDER BY clause. Here we have again the same example, where
we are finding the total sales by order status. As you can see, we used
the window function in the SELECT clause and we didn't get any error,
right? Now we can use it in the ORDER BY as well. So let's say ORDER BY,
and let's copy everything into the ORDER BY, but not the name. If I go
and execute this, there are no errors and SQL allows it. As you can see,
the result didn't change. So let's sort it descending, for example. I'm
going to write DESC here and execute. Now we have the total sales with
the highest values first, then the lowest values. So having this rule,
that we can use it only in SELECT and ORDER BY, means we cannot use
window functions in order to filter data. Let me show you: for example,
instead of ORDER BY, let's have a WHERE clause where the total sales is,
let's say, bigger than 100. So let's go and execute this. And as you can
see, SQL is going to say no, you are not allowed to do that. You can do
that only for SELECT and ORDER BY.
order by. We are not allowed to use it for filtering data using the wear clause
for filtering data using the wear clause and as well you are not allowed to use
and as well you are not allowed to use it in the group by. So if I go and do a
it in the group by. So if I go and do a group by and as well remove the
group by and as well remove the condition over here. So if you execute
condition over here. So if you execute it you're going to get the same error.
it you're going to get the same error. You are not allowed to use the window
You are not allowed to use the window function in the group by. So only with
function in the group by. So only with the order by or as well in the select
the order by or as well in the select clause. Okay. So now to the second rule.
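The first rule can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption, not the database used in the video; SQLite 3.25 or newer is needed), and the `orders` table and its values are hypothetical. The helper `is_rejected` is an illustrative name introduced here.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_status TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Delivered", 10), (2, "Shipped", 15), (3, "Delivered", 20), (4, "Shipped", 60)],
)

window = "SUM(sales) OVER (PARTITION BY order_status)"

# Allowed: the window function appears in SELECT and again in ORDER BY.
allowed = conn.execute(
    f"SELECT order_id, {window} AS total_sales FROM orders "
    f"ORDER BY {window} DESC, order_id"
).fetchall()
print(allowed)


def is_rejected(sql):
    """Return True if the engine refuses to run the statement."""
    try:
        conn.execute(sql).fetchall()
        return False
    except sqlite3.OperationalError as exc:
        print("rejected:", exc)
        return True


# Not allowed: filtering or grouping directly on a window function.
where_rejected = is_rejected(f"SELECT order_id FROM orders WHERE {window} > 100")
group_by_rejected = is_rejected(f"SELECT COUNT(*) FROM orders GROUP BY {window}")
```

The exact error text differs between database engines, so the sketch only checks that the statement is refused.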
Okay, so now to the second rule. You cannot use a window function inside another
window function. That means you cannot nest window functions. Let me show you
what I mean by that. Let's remove the GROUP BY; now everything should be
working. Let's copy the whole window function over here and just nest it. So
instead of sales, we now have a window function inside another window function.
As you can see, this is the inner window function, and the rest, the outside, is
the outer window function. If I execute this, you will see that SQL is going to
tell us you cannot use a window function in the context of another window
function. So we cannot do nesting with window functions. As you can see, this is
another limitation of those functions.
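A minimal sketch of the nesting rule, again using Python's built-in sqlite3 module as a stand-in engine (an assumption; SQLite 3.25 or newer) and a hypothetical `orders` table. It first tries the nested form, then stages the two window functions with a subquery, which is one common way around the limitation.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_status TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Delivered", 10), (2, "Shipped", 15), (3, "Delivered", 20), (4, "Shipped", 60)],
)

# A window function used as the argument of another window function.
nested = """
SELECT order_id,
       SUM(SUM(sales) OVER (PARTITION BY order_status))
           OVER (PARTITION BY order_status)
FROM orders
"""
try:
    conn.execute(nested).fetchall()
    nested_rejected = False
except sqlite3.OperationalError as exc:
    nested_rejected = True
    print("rejected:", exc)

# Staged version: the inner window function runs in a subquery,
# the outer one runs on the subquery's result.
staged = """
SELECT order_id,
       SUM(status_total) OVER (PARTITION BY order_status) AS sum_of_totals
FROM (
    SELECT order_id,
           order_status,
           SUM(sales) OVER (PARTITION BY order_status) AS status_total
    FROM orders
) AS t
ORDER BY order_id
"""
staged_rows = conn.execute(staged).fetchall()
print(staged_rows)
```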
All right, moving to the third rule, or let's say a piece of info: the window
function is executed after the data is filtered with the WHERE clause. Let's
have an example. Let's say that I would like to have the same information, the
total sales for each status, but only for two products, 101 and 102. So let's do
that. We're going to use the WHERE clause and say product ID IN, and then
specify 101 and 102. Let's execute this. Now you can see we still have two
partitions, one for Delivered and one for Shipped, but the total sales is
reduced because we are focusing on only two products and we filtered the whole
data set. So how does SQL work? First the WHERE clause is executed, and then the
window functions are calculated. That means first filtering and then
aggregation.
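The filter-then-window order can be seen in a short sketch. Python's built-in sqlite3 module is used as a stand-in engine (an assumption; SQLite 3.25 or newer), and the `orders` table with its product IDs and sales values is hypothetical.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, product_id INTEGER, "
    "order_status TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 101, "Delivered", 10),
        (2, 102, "Shipped", 15),
        (3, 101, "Delivered", 20),
        (4, 105, "Shipped", 60),
        (5, 104, "Delivered", 25),
    ],
)

select_part = (
    "SELECT order_id, order_status, "
    "SUM(sales) OVER (PARTITION BY order_status) AS total_sales FROM orders"
)

# No filter: every order contributes to its status total.
unfiltered = conn.execute(f"{select_part} ORDER BY order_id").fetchall()

# WHERE runs first, so only products 101 and 102 reach the window function.
filtered = conn.execute(
    f"{select_part} WHERE product_id IN (101, 102) ORDER BY order_id"
).fetchall()

print(unfiltered)
print(filtered)
```

Both partitions survive the filter, but their totals shrink because the excluded orders never reach the window function.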
Okay guys, now we're going to move to the last rule, the most interesting one,
and it says the following: you are allowed to use a window function together
with the GROUP BY clause only if you use the same columns. So let me explain
what I mean, but first some coffee.
Let's have the following task, and it says: rank the customers based on their
total sales. Now it sounds really easy, but if you check it, you have two
calculations here. In the first one you have to rank the customers, and the
second calculation is an aggregation: you have to find the total sales for each
customer. Okay, so now I'm going to show you step by step how I usually solve
those tasks. For now let's look at the total sales. It is an aggregation, right?
So we can use the SUM function, and this function is available both with GROUP
BY and as a window function. For now I'm going to go with GROUP BY, and that's
because the task is very simple. We don't have to show any other details, right?
It's all about aggregation, so why not use GROUP BY? And now to the first part,
where we have to rank the customers. We cannot use the RANK function with GROUP
BY, right? GROUP BY uses only aggregations, so here we are forced to use a
window function. That means for the rank I'm going to use a window function, and
for the total sales I'm going to use GROUP BY. So now let's do it step by step.
First we have to find the total sales for each customer using GROUP BY. It's
very simple. I'm just going to remove all that stuff in our SELECT statement. We
need the customer ID, and we don't need a window function over here. Then after
the FROM we're going to have GROUP BY customer ID. So now I'm just grouping the
customers and finding the sum of all sales. Let's execute this. As you can see
in the results, we have four customers, and that's why we have four rows, and we
have the total sales as well. So let's say half of the task is already solved,
right? Now what is missing is that we need a rank. So let's build that. In the
second step we're going to use the RANK function, and we can define a window for
that. So OVER, and inside it we will not partition the data at all, because it's
already grouped. So what we're going to write is OVER, ORDER BY. The RANK
function always needs an ORDER BY; don't worry about it, we can talk about it
later. Now we are ranking the data based on the total sales, that means the sum
of sales. So let's just copy this and put it after the ORDER BY. And now we have
to decide whether ascending or descending. It's going to be descending, so the
highest sales first and then the lowest sales. So as you can see, we now have a
rank for the customers, and we have a window function together with the GROUP BY.
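The query built in these two steps can be sketched as follows. Python's built-in sqlite3 module serves as a stand-in engine (an assumption; SQLite 3.25 or newer). The `orders` table, the alias names, and the sales values are hypothetical, picked only so that four customers come out in a clear order.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 50), (2, 1, 60), (3, 2, 55), (4, 3, 100), (5, 3, 25), (6, 4, 90)],
)

query = """
SELECT
    customer_id,
    SUM(sales) AS total_sales,                               -- step 1: GROUP BY aggregation
    RANK() OVER (ORDER BY SUM(sales) DESC) AS rank_customers -- step 2: rank the grouped totals
FROM orders
GROUP BY customer_id
ORDER BY rank_customers
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

The window function has no PARTITION BY because the GROUP BY has already reduced the data to one row per customer.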
So now let's execute this and see whether SQL is going to allow it. Let's run
it, and as you can see, SQL runs it and we get the rank for each customer.
Customer three has the highest total sales, then customer number one, and the
last one is customer number two with the lowest total sales. All right, so we
solved the task: we have now ranked the customers based on their total sales. As
you can see, SQL allows you to use a window function together with GROUP BY, but
only with one rule: anything that you use inside the window function must be
part of the GROUP BY. For example, we fulfilled the rule because we are using
the sum of sales, and the sum of sales is part of the GROUP BY query, right? Now
if I break the rule by not using the SUM, so if I remove the SUM and use only
sales, SQL will not allow it, because sales is not part of the GROUP BY. As you
can see, SQL is very strict about this. If you want to do everything in one
query, without using subqueries and so on, you have to use the exact same
columns. For example, if instead of sales I use the customer ID here, then since
the customer ID is part of the GROUP BY, SQL allows it. So be careful when using
a window function together with GROUP BY. As long as you are using the same
columns, nothing is going to go wrong and SQL is going to allow it. Okay, so now
I'm just going to fix this and run it. As you can see, it's really easy if you
follow those steps. First, build the query using GROUP BY. Don't think about the
window function yet, just build the GROUP BY. Then, as the last step, you define
and build the window function. With that you can solve really nice analytical
use cases with one simple query, without having to build subqueries and so on.
You can use GROUP BY together with window functions. All right guys, so those
are the four rules for SQL window functions.
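As a sketch of the "same columns" rule, the query below uses only expressions that belong to the grouped query inside its window functions: an aggregate (`SUM(sales)`) and the grouped column (`customer_id`). Python's built-in sqlite3 module is a stand-in engine here (an assumption; SQLite 3.25 or newer), and the table and values are hypothetical. SQLite is more permissive than some engines about non-grouped columns, so the rejected variant from the walkthrough (ordering by plain `sales`) is deliberately not run.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 50), (2, 1, 60), (3, 2, 55), (4, 3, 100), (5, 3, 25), (6, 4, 90)],
)

query = """
SELECT
    customer_id,
    SUM(sales) AS total_sales,
    -- allowed: an aggregate of the grouped data
    RANK() OVER (ORDER BY SUM(sales) DESC) AS rank_by_total,
    -- allowed: a column that is listed in the GROUP BY
    RANK() OVER (ORDER BY customer_id) AS rank_by_id
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```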
All right friends, so now let's have a quick recap of the SQL window functions.
Let's start with the definition. A window function performs calculations, like
aggregations, on a subset of the data without losing the level of detail. That
means we can do aggregations and at the same time keep the details. Now, of
course, there is a lot of similarity between window functions and GROUP BY. But
the main difference is that window functions are very powerful and dynamic
compared to GROUP BY, and we have way more functions than with GROUP BY, right?
So if you are doing data analysis and you have an advanced use case, then you
have to use window functions; they are more suitable for complex and advanced
data analysis. On the other hand, if you have a simple question, a simple data
analysis, then you can use the aggregate functions with GROUP BY. And of course
you can use them in the same query: in the same SELECT you can mix GROUP BY with
a window function, with only one rule, you have to use the same columns. And of
course the first step is to do the GROUP BY, and then later you add the window
function in the same query.
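The "level of detail" difference from the recap can be sketched side by side. Python's built-in sqlite3 module is a stand-in engine (an assumption; SQLite 3.25 or newer), and the `orders` table and its values are hypothetical.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_status TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Delivered", 10), (2, "Shipped", 15), (3, "Delivered", 20), (4, "Shipped", 60)],
)

# GROUP BY collapses the rows: one output row per status.
grouped = conn.execute(
    "SELECT order_status, SUM(sales) FROM orders "
    "GROUP BY order_status ORDER BY order_status"
).fetchall()

# The window function keeps every order row and adds the total next to it.
windowed = conn.execute(
    "SELECT order_id, order_status, sales, "
    "SUM(sales) OVER (PARTITION BY order_status) AS total_sales "
    "FROM orders ORDER BY order_id"
).fetchall()

print(grouped)
print(windowed)
```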
And now to the next point, the window components. We have two main components.
The first one is the window function, and the second is the window definition
using the OVER clause. Inside the OVER clause we can define three things. If you
want to divide the data to create windows, you can use PARTITION BY. In the
second section we have ORDER BY to sort your data. And in the last part you can
specify a subset of data, a frame, within each window. Now let's move to the
last part: we have rules for the SQL window functions. The first is that if you
have two or more window functions, you cannot nest them; you have to use
multiple subqueries. The next point is that you can use a window function only
in the SELECT and ORDER BY clauses. So, for example, you cannot use a window
function in the WHERE clause to filter the data. Talking about filtering data,
when is SQL going to execute the window function? It's always after SQL filters
the data. All right, so those are the basics of the SQL window functions. With
that we have learned the basics about window functions in SQL, and next we're
going to start talking about the functions. The first group is the window
aggregate functions, and here we're going to learn how to summarize our data for
a specific group of rows. So let's
go. Okay guys, let's say that in our data we have the following information: we
have the months and the sales. Now if you apply any aggregate function in SQL,
what is going to happen? SQL is going to go through all rows of the window, or
of the entire data, and start aggregating the data. That means in the output SQL
is going to give you one single aggregated value. SQL summarizes all those
values, and in the output you're going to find, for example here, the total
sales, which is 175. Or you can use the average, or count the data, and so on.
So the aggregate functions deliver, at the end, one aggregated value for a
window or for the entire data.
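A minimal sketch of "one aggregated value for the entire data", using Python's built-in sqlite3 module as a stand-in engine (an assumption; SQLite 3.25 or newer). The `monthly_sales` table and its six values are hypothetical; they were chosen only so that the total matches the 175 mentioned above and are not the rows shown in the video.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (sales_month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("Jan", 20), ("Feb", 10), ("Mar", 30), ("Apr", 5), ("May", 70), ("Jun", 40)],
)

# Plain aggregate: the whole table collapses into a single value.
(plain_total,) = conn.execute("SELECT SUM(sales) FROM monthly_sales").fetchone()

# Window aggregate with an empty OVER (): the window is the entire data,
# so the same single value is attached to every row.
windowed = conn.execute(
    "SELECT sales_month, sales, SUM(sales) OVER () AS total_sales FROM monthly_sales"
).fetchall()

print(plain_total)
for row in windowed:
    print(row)
```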
Okay. So now let's have a quick overview of the syntax of all aggregate
functions. Most of them follow the same rules. First, as usual, we have to
define the function name; in this example we have the average. Then, as the next
part, we have to define the expression inside it; we cannot leave it empty. Here
we are using the sales. And the second rule, for all functions besides COUNT:
the data type of this field should be a number. And this of course makes sense,
right? We cannot find the average of the first name of customers or something
like that, so we have to pass a number. Next we have to define the window. We
have the PARTITION BY, and it is optional, so you could use it or leave it out,
it depends. Then we have the ORDER BY. It is optional as well; it is not
required, so you could use it or leave it out. That means the whole definition
of the window could be empty for the aggregate functions. Let's have a look at
all the functions. We have COUNT, SUM, AVG, MIN, MAX. And as you can see, only
COUNT accepts all data types as an expression or argument. All the others
require a number as the data type. And for all functions, the PARTITION BY is
optional; the same goes for ORDER BY and the frame. So everything is optional
over here. So what we're going to do now is deep dive into each of those
functions in order to understand how they work and what the use cases are, and
of course we're going to practice in SQL. We're going to start with the first
one, the function
COUNT. Okay, so what is the COUNT function? It's really simple. It returns the
number of rows within each window. So it helps you understand how many rows you
have within each subset of data. Now let's understand how SQL works with this
function. All right guys, so now we have again this very simple example for the
orders, with the following information: we have the products and the sales. And
now we want to solve a very simple task: how many orders do we have for each
product? In order to solve it, we can use the COUNT function like the following.
We say COUNT and pass it the star as the argument or expression. With that we
are telling SQL: go and count how many rows we have in our table. But we have a
window definition like this: OVER, PARTITION BY product. So what is SQL going to
do? It's going to divide the data set into two partitions. We're going to have
one partition for the caps and another one for the gloves. With that we have
prepared our data into windows and we are ready to do aggregations. So how many
rows do we have within each window? It's going to be three. For this window it's
three rows, and for the next window we have three rows as well. So we're going
to have three, three, and three. It's very simple, right guys? We are just
finding the number of rows within each window. But now, with the aggregate
functions, we have to be very careful with the null values. For the COUNT star,
as you can see over here, we are not specifying anything about the sales. We are
just saying: find me the number of rows. That means SQL will just count the null
as one row. So if we are using the star as the argument for the COUNT function,
the null will not affect anything.
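The COUNT star example can be sketched like this, with Python's built-in sqlite3 module as a stand-in engine (an assumption; SQLite 3.25 or newer). The `orders` table mirrors the caps and gloves setup described above, but the concrete sales values are hypothetical apart from the 70 and the null.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "Caps", 20),
        (2, "Caps", 10),
        (3, "Caps", 5),
        (4, "Gloves", 30),
        (5, "Gloves", 70),
        (6, "Gloves", None),  # a row with a NULL sale still counts as a row
    ],
)

query = """
SELECT order_id,
       product,
       COUNT(*) OVER (PARTITION BY product) AS total_orders
FROM orders
ORDER BY order_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every row reports three, including the gloves row whose sale is null, because the star counts rows rather than values.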
count the null will not affect anything. So whether we have nulls or nots we are
So whether we have nulls or nots we are just counting how many rows do we have
just counting how many rows do we have inside our data. But in some scenarios,
inside our data. But in some scenarios, we should be ignoring the nulls in our
we should be ignoring the nulls in our account. For example, let's say that I
account. For example, let's say that I would like to count how many sales do we
would like to count how many sales do we have within each product. That means if
have within each product. That means if we have nulls, it should not be counted.
we have nulls, it should not be counted. So now in order to achieve this task,
So now in order to achieve this task, what we're going to do, we're going to
what we're going to do, we're going to use instead of a star over here, we're
use instead of a star over here, we're going to have the filled sales. So now
going to have the filled sales. So now with this, we are telling SQL, don't
with this, we are telling SQL, don't just count blindly how many rows do we
just count blindly how many rows do we have within each window. You should be
have within each window. You should be very careful with the values. Find how
very careful with the values. Find how many cells do we have within each
many cells do we have within each window. So now let's see what's going to
window. So now let's see what's going to happen. For the first window we have
three cells, so we have three values. So the number of rows is correct. But for
the next window, how many values do we have? We have two: the 30 and then the
70. The last one is null, so it will not be counted. It will be ignored. That's
why we get the value two in the output: we have two sales. So as you can see,
the result did change, and we are now sensitive to the null values. So be
careful what you specify for the count. If you use a column name like this, it
will ignore the nulls. But if you have a star, it's just going to go and find
how many rows we have within each partition. Okay. So now if you compare the
results side by side, you can see that if you specify a column within the count
function, it's going to be sensitive to the nulls. It's going to ignore them
and will not use them within the aggregation. That's why we have here only two
rows. But if you use the star within the count function, what's going to
happen? SQL is just going to count the rows, nulls included. So we're going to
find the number of rows that we have inside each window. And there is one more
way to do the same thing as here on the left side: instead of the star you can
use one. So you might find somewhere that people are using count
one and then the same window function and we will get exactly the same result.
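
A minimal sketch of the comparison described here, using Python's built-in sqlite3 module only so that the SQL can be run as-is. The table name product_sales and its layout are assumptions; the six sales values are the ones used in the caps and gloves example. It also assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# In-memory SQLite database; the table name and layout are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sales (product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO product_sales VALUES (?, ?)",
    [("Caps", 20), ("Caps", 10), ("Caps", 5),
     ("Gloves", 30), ("Gloves", 70), ("Gloves", None)],  # None is stored as NULL
)

query = """
SELECT product,
       sales,
       COUNT(*)     OVER (PARTITION BY product) AS count_star,
       COUNT(1)     OVER (PARTITION BY product) AS count_one,
       COUNT(sales) OVER (PARTITION BY product) AS count_sales
FROM product_sales
"""
rows = conn.execute(query).fetchall()

# Every row of a window carries the same counts, so keep one triple per product.
counts = {product: (star, one, col) for product, _sales, star, one, col in rows}
for product, triple in sorted(counts.items()):
    print(product, triple)
```

With this sample data the caps window gives 3 for all three counts, while the gloves window gives 3 for the star and the one but 2 for the sales column, because the null sale is skipped.
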
So the nulls will be counted and will not be ignored. Now you might ask me
which one to use, the one or the star. Well, I would say it doesn't matter,
right? We are getting the same results, and if you are thinking about
performance, I hardly find any difference between them. So you can try both of
them and stick with the one that gives you better performance. Now we have a
special case for the count function: compared to numeric aggregate functions
like the sum, it allows any data type. That means we can use numbers,
characters, dates and so on. So we can specify something like the product for
the count instead of the sales. We can go over here and say product, and it's
going to count how many rows we have for the product. So it's going to be three
over here. And since we don't have any nulls here, it counts all of them. So we
have three rows. And be careful here: we are not counting the unique values. We
are just counting the rows that we have inside our data. So this will not be
counted as one, and this as well would not be one. We have the caps three
times; that's why we have here three. Okay. So now we have this very simple
task: find the total number of orders. We want to find how many rows, how many
records, we have inside the table orders. So let's go and solve it. Let's start
by selecting just star from the table orders without anything else, like this.
As you can see, we have 10 orders. It's very simple. It's very easy as well.
But now let's say that you have thousands or millions of rows. You cannot do it
like this by just checking the rows. What are you going to do? We're going to
use the function count. So we can go over here and say count star, and then
let's give it a name, total orders. So let's go and execute it. Now as you can
see, we got only one record, one value. We don't see
any other details. We got the 10 orders. So this is the total number of orders.
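
As a rough sketch, the plain aggregate just executed might look like this when run through Python's sqlite3 module. The ten rows and the column names are made up for illustration; only the row count of 10 comes from the walkthrough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
# Hypothetical sample: ten orders spread over four customers.
customer_ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4]
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    list(enumerate(customer_ids, start=1)),
)

# A plain aggregate collapses the whole table into a single row.
result = conn.execute("SELECT COUNT(*) AS total_orders FROM orders").fetchall()
total_orders = result[0][0]
print(result)
```
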
This is very helpful for understanding the content of your data. We call this
overall analysis, or let's say having the big numbers about your business. For
example, how many orders do we have? How many customers, products, employees
and so on. Having those big numbers is going to help us track our business, to
understand how well we are doing with the orders, with the customers and so on.
So this is the basics of reporting. Now let's extend our task by saying:
provide details such as the order ID and the order date. So let's go and do
that: select order ID, order date. And now of course we cannot do it like this.
Let me just execute it. We get an error, because here we have different levels
of detail in our select. To solve this, we're going to use the over clause, and
with that we are telling SQL this is a window function. So now let's go and
execute it. With that you can see we have solved the task. We have the details,
the order ID and the order date. This is the highest level of detail, since we
have the order ID, and we also have the highest level of aggregation: the total
number of orders in the entire table orders. Now let's keep going and add more
to our task. Let's say that we want to find the total number of orders, but for
each customer. That means this time we have to divide our data by the
customers. So let's go and do that. We're going to use a window function again:
count star over, and to divide the data we use partition by with the field
customer ID. Let's call it orders by customers. I would also like to see the
customer information in the query; that's why I'm going to add the customer ID.
All right, that's all. Let's go and execute it. Now as we learned before, SQL
first divides the data. We have four customers, so we're going to get four
windows. The first window is for the customer ID number one, and as you can see
we have three rows. That's why we have here three orders. And the same thing
for customer two: we have three orders. Customer three, three orders. But for
the last customer, the customer ID number four, we have only one row and one
order. So now if you look at the total orders and the orders by customers, you
can see we are no longer doing the overall analysis. We are doing a comparison
between different categories, and in this example the category is the
customers. With that we can understand the behavior of our customers as well.
You can see that we have three customers that have exactly the same number of
orders. So they are very similar, but we have one
extreme which is the customer ID number four. This customer has only one order.
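
The two window counts from this task can be sketched as follows, again through Python's sqlite3 module. The order dates, the column names and the assignment of orders to customers are assumptions; the sample is only built so that customers 1 to 3 have three orders each and customer 4 has one, as in the walkthrough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, customer_id INTEGER)"
)
# Hypothetical rows: 10 orders, customers 1-3 with three each, customer 4 with one.
customer_ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4]
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"2025-01-{i:02d}", c) for i, c in enumerate(customer_ids, start=1)],
)

query = """
SELECT order_id,
       order_date,
       customer_id,
       COUNT(*) OVER ()                         AS total_orders,
       COUNT(*) OVER (PARTITION BY customer_id) AS orders_by_customers
FROM orders
ORDER BY order_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

# One entry per customer: how many orders fall into that customer's window.
orders_by_customer = {cust: n for _id, _date, cust, _total, n in rows}
```

Unlike the plain aggregate, all ten detail rows survive; the empty OVER () repeats the overall total on each row, and the partitioned count repeats the per-customer total.
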
So this is the only customer that has different behavior from all the other
customers. You see, with a very simple query we are now able to analyze our
business and understand the behavior of our customers. If you divide the data
with partition by and use count, you can compare things with each other. All
right. So now let's keep moving. Next we're going to understand the special
cases that we have with the function count. We have this very simple task. It
says: find the total number of customers, and additionally we have to provide
all customer details. I think it's very easy to solve. We're going to select
star, since we need all details, from Sales.Customers. Let's just have a look:
we have five customers. The function is count star over, and we don't have to
divide the data, since we have to find the total number of customers for the
entire table. It's going to be called total customers. So nothing new, that's
it: we have five customers. As we learned before, if you pass the star to the
count function, what you are telling SQL is: just go and count how many rows we
have inside the table customers. So SQL is just going to start counting and say
we have five customers, five rows. So it doesn't
matter whether we have nulls inside our data like in the last name or the score.
It's just going to count the number of rows. Now let's say that we have the
following task: find the total number of scores for the customers. What we need
with this task is to find out how many scores we have inside our data. As you
can see, we have four scores, but the last customer doesn't have any score; we
have it as a null. So the result should be four. We cannot use the star for it,
because we would get five. We have to count the scores. So let's see how to do
that. We're going to use count as well, but this time on the score, and the
definition of the window is going to be empty. Call it total scores, and let's
execute this. Now we can see in the results that we got four scores, which is
correct, because SQL ignored the null. SQL is now focusing on only one column,
and within those values the nulls are not counted. This is really great for
checking the quality of your data. Let's say that you are not expecting any
nulls inside your data. Instead of going manually through all the records, you
can find the total number of customers like this, then count the total number
of scores, and you can see there is a difference. Just by comparing the two
numbers I can say: you know what, we have one null, without checking every
record in our data. With that we can check the quality of our data and
understand very quickly how many nulls we have in the field score. You can do
the same for, say, the first name. Let me show it to you. I'm just going to
copy this and say first name, or let's say country actually. So I will go with
the country. Let's call it total countries and execute this. Now if you check
the result, you can see we have five rows with countries. SQL focuses on the
countries and does not find any nulls. So we have complete data here. We don't
have any nulls, because the total number of customers is equal to the total
number of values within the country. I can immediately see: okay, the data
quality of the country is very good. All right. Now one more thing about the
count function that we learned before: we can use either star or one to count
how many rows we have. So let's just try it. I'm going to duplicate it, and
instead of having a star, let's have a one. I'm just going to give it a name
here: this one is going to be one, and the other one star. So let's go and
execute it. Now if you check the output, we got exactly
identical results. So there is no difference between those two queries.
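
The four counts built up in this task can be sketched in one query. The customer names, countries and score values below are invented for illustration, as are the column names; what matches the walkthrough is the shape of the data: five customers, a null score for the last one, some null last names, and no null countries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, first_name TEXT, last_name TEXT, "
    "country TEXT, score INTEGER)"
)
# Hypothetical rows; None is stored as NULL.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Anna", "Berg", "Germany", 350),
        (2, "Ben", None, "USA", 900),
        (3, "Carla", "Diaz", "USA", 750),
        (4, "Dev", "Ek", "Germany", 500),
        (5, "Eli", None, "USA", None),
    ],
)

query = """
SELECT *,
       COUNT(*)       OVER () AS total_customers_star,
       COUNT(1)       OVER () AS total_customers_one,
       COUNT(score)   OVER () AS total_scores,
       COUNT(country) OVER () AS total_countries
FROM customers
"""
rows = conn.execute(query).fetchall()

# The window is the whole table, so every row carries the same four counts.
star, one, scores, countries = rows[0][-4:]
missing_scores = star - scores  # rows minus non-null scores = number of nulls
print(star, one, scores, countries, missing_scores)
```

The difference between the row count and the column count is the quick null check described above: here it reports one missing score and no missing countries.
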
It's up to you. You can try it and check the performance. I usually go with the
star instead of the one. Okay. So now we're going to talk about a very
important use case for the SQL window function count that I frequently use in
my real projects. The data that we use for data analysis usually has bad data
quality. If we don't find those data quality issues and clean them up before
doing the analysis, what is going to happen? We're going to deliver bad
results, bad analysis, which is going to lead to bad decisions. One very common
data quality issue that you might encounter in your projects or in your data is
having duplicates. Duplicates are really bad for data analysis. To discover, or
let's say identify, the duplicates in our data, we can use the SQL window
function count. So now let's go and have some examples. Okay. The task says:
check whether the table orders contains any duplicate rows. How are we going to
do that? Looking at the table orders over here, we can see that there are many
orders. But how do we find the duplicates? Well, the first step is to
understand what the primary key of the table orders is. What we usually do is
check the data model, if there is one. For example, for this course we have the
following data model, and we can see it is defined that the order ID is the
primary key for the orders and the product ID is the primary key for the
products. So that means for our
table the orders we have the order ID as the primary key and it should be unique.
It should not contain any duplicates. So now let's go to our data and check the
order ID. Just by looking at the data, you can see that we don't have any
duplicates, right? All of them are unique: we have 1, 2, 3, 4 and so on. But of
course in real projects you cannot do it like this. You have to build a query
to find out whether the primary key is unique. Now you might say that primary
keys are usually unique, because we can define that in the DDL, in the rules
for building the table. Well, that's true. If you have it like this, then you
don't have to look for duplicates. But usually in data analysis we export a lot
of files and a lot of data into a separate database, and we don't build such
rules there. So to check the quality of the primary keys that you get from the
source, we can use the count function. Let's go and build it. I'm going to
select the order ID first as a detail. And now we're going to do the following:
count and then star, and let's define the window. It's going to be partition
by, and the field here is going to be the primary key, so the order ID. I'm now
checking the quality of this field; it should not contain any duplicates. And
we're going to give it a name: check primary key. My expectation is that the
result of this should be at most one. That means we have one row for each
primary key, and that means it is unique. If we get anything more than one,
then it means we have duplicates. Let's go and
run the query. And as you can see in the results we get for each primary key one.
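
A small sketch of this primary key check, run through Python's sqlite3 module. The table deliberately has no primary key constraint, to mimic data loaded without such rules; the ten order IDs and the column alias check_pk are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY constraint on purpose: uniqueness is what we want to verify.
conn.execute("CREATE TABLE orders (order_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(1, 11)])

query = """
SELECT order_id,
       COUNT(*) OVER (PARTITION BY order_id) AS check_pk
FROM orders
ORDER BY order_id
"""
rows = conn.execute(query).fetchall()

# A clean key column yields exactly 1 for every row.
max_check = max(check for _order_id, check in rows)
print(rows)
print("key is unique" if max_check == 1 else "duplicates found")
```
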
So that's great. That means we don't have any duplicates inside our data and
the primary key is unique. The table orders is clean. Now let's check our
database. We have here another table called orders archive. Let's go and check
the table. First I'm just going to select the data: select star from
Sales.OrdersArchive. Let's check the results. Here we can see that we have
exactly the same structure as the table orders. So now let's check whether the
data is clean as well. We're going to use exactly the same query as before, but
instead of the table orders, we're going to take the orders archive. So that's
it. Let's go and execute it. So
now by checking the data, you can see that we don't have everywhere one.
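
The same check on the archive table can be sketched like this, together with the subquery filter that keeps only the duplicated keys. The table name orders_archive and most of the IDs are assumptions; the sample is only built so that order ID 4 appears twice and order ID 6 three times, as in the walkthrough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_archive (order_id INTEGER)")
# Hypothetical archive rows: ID 4 appears twice, ID 6 three times.
archive_ids = [1, 2, 3, 4, 4, 5, 6, 6, 6, 7]
conn.executemany(
    "INSERT INTO orders_archive VALUES (?)", [(i,) for i in archive_ids]
)

check_query = """
SELECT order_id,
       COUNT(*) OVER (PARTITION BY order_id) AS check_pk
FROM orders_archive
"""

# Wrap the check in a subquery so the window result can be filtered.
duplicates_query = f"""
SELECT *
FROM ({check_query}) AS t
WHERE check_pk > 1
ORDER BY order_id
"""
dup_rows = conn.execute(duplicates_query).fetchall()
duplicate_ids = sorted({order_id for order_id, _check in dup_rows})
print(dup_rows)
print(duplicate_ids)
```

The filter has to live in an outer query because a window function result cannot be referenced in the WHERE clause of the same SELECT. Each duplicated key still shows up once per row, so the list of affected IDs is reduced to distinct values at the end.
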
Sometimes we have two rows for the same primary key, which is really bad. Here
for the order ID 4 we have two orders with the same order ID, and for the order
ID 6 we have three orders. That means those rows are duplicates, and they
violate our data model. What else we can do is generate a list specifically for
this data quality issue, containing only the duplicates. Anything that has a
one we are not interested in. To do that, we're going to use a subquery. Let's
say select star from, then use the first query as a subquery, and in our filter
we say where the check primary key is higher than one. That means I need only
the order IDs where we have duplicates. So let's go and execute this. Now I
have a list of the primary keys where we have duplicates: the order ID 4 and
the order ID 6. So guys, as you can see, the window count function is wonderful
for finding data quality issues like duplicates. All right guys, so
those are the four most important use cases in the SQL window function count.
So the first one: we can use it to do overall analysis. Or we can use it to do
category analysis, like the analysis we did on the customer behavior. Another
use case: we can use it to check for nulls inside our data. And the last use
case: we can use it to identify duplicates, a data quality issue, in our data.
So now let's go and check the next function. We have the sum. All right. So now
let's understand what the sum function is. It's very simple: it returns the sum
of all values within each window. Now let's understand how SQL works with this
function. All right. This is very easy, and we are using the same simple
example. Now we would like to find the total sales for each product. We can
define it like this: sum of sales, since we are finding the total sales, and
then we define the window like this, over partition by product. As we learned,
SQL first divides our data into two windows: one window for the caps and
another window for the gloves, right? After SQL defines the windows, it starts
aggregating the data. Sum of sales means that for the first window, where we
have the three sales, it simply adds up all those values. So we are adding
20 + 10 + 5 and we get the result 35. In the output we get 35 for every row of
that window. So that's it for the first window. And as you can see, SQL
aggregates the data within each window separately. That means while we are
aggregating the data for the caps, it does not check anything for the gloves.
They are completely separate. Now it goes to the next window. Here we have two
values and a null. Again, the null is just ignored. So what do we get? We have
30 + 70, and the total sales for that window is 100. As you can see, it is very
simple, right? So 100, 100 and so on. So guys, that's it. It's really simple.
We don't have a lot of special cases here like with the count function. It's
only that it ignores the nulls in the calculation, and the requirement here is
that it allows only integers, or let's say numbers. So we cannot say sum of the
products, since the products are not
numbers they are characters. So you can only use numbers for the sum function.
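The caps and gloves walkthrough can be reproduced with a small sketch. It is written in Python with the standard-library sqlite3 module rather than the database used in the video, purely so that it runs anywhere (window functions need SQLite 3.25 or newer). The table name demo_sales, its column names, and the id values are assumptions made for illustration; only the products and the sales figures come from the walkthrough.

```python
import sqlite3

# Hypothetical table mirroring the caps/gloves example (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_sales (id INTEGER, product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO demo_sales VALUES (?, ?, ?)",
    [
        (1, "Caps", 20), (2, "Caps", 10), (3, "Caps", 5),
        (4, "Gloves", 30), (5, "Gloves", 70), (6, "Gloves", None),  # one NULL sale
    ],
)

# SUM is evaluated separately inside each product window; NULLs are skipped.
sum_query = """
SELECT id, product, sales,
       SUM(sales) OVER (PARTITION BY product) AS total_by_product
FROM demo_sales
ORDER BY id
"""
sum_rows = conn.execute(sum_query).fetchall()
for row in sum_rows:
    print(row)  # every Caps row carries 35, every Gloves row carries 100
```

Every row keeps its own details while carrying the total of its window, which is the difference from a GROUP BY query that would collapse each product to a single row.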
Let's go now and work through some tasks and use cases in order to practice in
SQL. Find the total sales across all orders, and also find the total sales for
each product. Additionally, we have to provide some details like the order ID
and the order date. So let's go and do that. Select order ID, order date, and
let's get the sales as well. Now we have to find the total sales across all
orders. That means we use the window function SUM of sales, and the definition
of the window is empty since we don't have to divide the data. We call it total
sales, and we select from the table sales orders. That's it. Let's go and
execute it. As you can see, we got all the details that we need and also the
total sales, the sum of all those sales in one field. With that we have our
overall analysis, one big number for our reporting. We know how much sales we
made in the entire business. Now let's go to the next task. It says total
sales for each product. I think you already know what we're going to do: SUM of
sales, OVER, PARTITION BY product ID. We call it sales by products, and with
that we are dividing the data by product. Let's go and execute it. As you can
see, we don't have the product information, so let's add the product ID to the
query just in order to analyze the results. We can see from the data that the
winner is product ID 101. It has the highest sales if you compare it with the
other products, and the lowest one is product ID 105. So as you can see, we can
use the window function SUM together with PARTITION BY in order to compare the
products with each other, for example to understand their performance. It's a
really great analysis of performance.
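A minimal sketch of the two practice queries, again using Python's sqlite3 module instead of the course database. The orders table, its column names, the dates, and the ten rows are hypothetical sample data, chosen only so that the results agree with what the transcript mentions (product 101 highest, product 105 lowest).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, "
    "product_id INTEGER, sales INTEGER)"
)
# Hypothetical rows (an assumption, not the course data set).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2025-01-01", 101, 10), (2, "2025-01-05", 102, 15),
        (3, "2025-01-10", 101, 20), (4, "2025-01-20", 105, 60),
        (5, "2025-02-01", 104, 25), (6, "2025-02-05", 104, 50),
        (7, "2025-02-15", 102, 30), (8, "2025-02-18", 101, 90),
        (9, "2025-03-10", 101, 20), (10, "2025-03-15", 102, 60),
    ],
)

totals_query = """
SELECT order_id, order_date, product_id, sales,
       SUM(sales) OVER ()                        AS total_sales,      -- empty window: all rows
       SUM(sales) OVER (PARTITION BY product_id) AS sales_by_product  -- one window per product
FROM orders
ORDER BY order_id
"""
totals_rows = conn.execute(totals_query).fetchall()
for row in totals_rows:
    print(row)
```

With an empty OVER () the whole result set is one window, so total_sales repeats the same grand total on every row, while sales_by_product changes only when the product changes.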
All right. Now we're going to move on to a very interesting use case for the
aggregate functions, not only for the sum but for the others as well. It is the
comparison analysis. Okay, so let's quickly understand what the comparison use
case is. We compare the current value, for example, let's say we are currently
in the month of March and the sales are 30, with an aggregated value, for
example the total sales using the sum function. What happens if you compare the
current value with the total sales? You are doing an analysis called
part-to-whole analysis, which helps us understand how important the sales in
this month were compared to the total sales. Or we can compare it to the best
month, the highest value. For example, if the highest value is in June, we can
compare this month with the best month of the year, or with the lowest month of
the year. Or we can compare the sales of the current month with the average in
order to understand whether we are above the typical sales or below the
average. This is a very important analysis in order to study and understand the
performance of the current data. All right, let's have
an example in order to understand the use case. Find the percentage
contribution of each product's sales to the total sales. Let's go and solve it
step by step. First, let's select the order ID, and let's take the product ID
and the sales as well, just like this, from sales orders. Let's go and execute
it. Okay, so now as you can see in the results, we got the first part of the
equation: we have the sales. Nothing crazy over here. Now we need the total
sales over all the data. So we take the SUM of sales, and the window definition
is empty. This is the total sales. Let's go and execute it. Now we have
everything for the equation: the sales and the total sales, and that is enough
in order to find the percentage of the contribution. The calculation is very
simple: we divide the sales by the total sales. Let's go and do that. It's the
sales divided by the total sales, so we copy the whole window function over
here, and then we multiply it by 100. That's it. Let's go and execute it. Now
you notice that in the output we got zeros. This is because of the data type.
If we go to our table over here on the left side, you can see that the sales
column has the data type integer, and if you divide integers you will not get a
float or decimal number. You have to change the data type. It's enough to
change it for one of them, so we do it for the sales over here. We use the
following statement: CAST sales AS FLOAT. I'm just converting the integer to a
float. Let me give it a name: percentage of total. That's it. Let's go and
execute it. Now in the output you can see we got the percentage of the total,
or let's say the percentage of contribution. Next, we're going to round those
numbers because we have a lot of decimals. In order to do that, we use the
ROUND function like this, with two decimals. Let's go and execute it. Now, as
you can see, it is much easier to read because we have only two decimals, and
we can see immediately that order eight is the highest contributor to the
total. This is what we call part-to-whole analysis, where we find the
percentage of the total. It is a very common analysis in order to understand
the performance of each order compared to the total. So this is an example of
how the window function helps us compare the current value with an aggregated
value. All right everyone, that's all for the window function SUM. Next we're
going to talk about the average function.
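The part-to-whole steps can be sketched as follows, using Python's sqlite3 module as a stand-in for the course database. The orders table and its rows are hypothetical sample data. SQLite also performs integer division when both operands are integers, so the zeros described in the transcript show up here too, and casting one side to a float fixes them in the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
# Hypothetical rows (an assumption, not the course data set).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25),
     (6, 104, 50), (7, 102, 30), (8, 101, 90), (9, 101, 20), (10, 102, 60)],
)

pct_query = """
SELECT order_id, product_id, sales,
       SUM(sales) OVER () AS total_sales,
       -- integer / integer truncates to 0 before the multiplication
       sales / SUM(sales) OVER () * 100 AS pct_integer_division,
       -- casting one operand to a float keeps the decimals
       ROUND(CAST(sales AS FLOAT) / SUM(sales) OVER () * 100, 2) AS percentage_of_total
FROM orders
ORDER BY order_id
"""
pct_rows = conn.execute(pct_query).fetchall()
for row in pct_rows:
    print(row)

# The order with the largest share of the total.
top_order = max(pct_rows, key=lambda r: r[5])
print("highest contributor: order", top_order[0])
```

Casting only one operand is enough because the division is then carried out in floating point; rounding is applied last so that it does not distort the ratio itself.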
All right. So now let's understand what the average function is. As the name
says, it finds the average of the values within each window. Now let's go and
understand how SQL works with the average. Back to our very simple example, and
the task says: find the average sales for each product. It's really easy. We
use AVG, pass to it the column sales, and define the window like this:
PARTITION BY product. The first thing SQL does is define the windows, so it
divides our data into two partitions, one for the caps and one for the gloves.
I hope everyone knows how to calculate the average: you add up all the values
and divide by the number of rows. So SQL adds 20 + 10 + 5 and divides by three
rows, and the output is 11 (the exact value is 35 / 3, about 11.67; with
integer sales, as here, the fraction is dropped). We get it for each row. As
you can see, SQL just ignored everything in the next window. We are focusing
only on the caps. Now it goes to the second window and starts doing the same
aggregation. But here we have the special case of the null. The null is ignored
in the calculation, so it's 30 + 70, and we are including just two rows. It's
divided by two, and the average is 50. So we get the result 50 for each row,
and we are completely ignoring the nulls. But now we might be in a scenario
where your users understand the business like this: if we find a null in the
sales, it means zero. There are no sales, so it is actually a zero, but we
store it in the database as a null. That means the average that we have
provided is not really correct. We have to divide by three. So first we have to
handle the nulls before doing the aggregation, before finding the average. Now,
we're going to have a whole chapter on how to handle nulls in SQL.
There we'll see what the different functions are, but for now we're going to
go with the function COALESCE. Okay, so what we're going to do: we will not use
the sales as it is. First we handle the nulls. That means we use COALESCE on
sales and replace the nulls with zeros. As you can see, we are not using the
sales immediately. We handle it first, and then we find the average. SQL goes
over the values here, and if it finds any null it replaces it with zero, and
that then has an effect on our average. It is 30 + 70, but now plus 0, and we
have three rows. So instead of dividing by two, SQL divides by three, and the
result is 33. That means we get 33 in the output for each row, and with that we
are now fulfilling the expectation of the business: if you have a null, it is
handled as zero, and the result is more accurate. You see, it is very tricky.
If you are doing data analysis and aggregations, be very careful with the
nulls. Understand them, understand what they mean for the business, and handle
them correctly in order to get correct results in your analysis.
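To see the effect of the null on the average, here is a sketch under the same assumptions as before: Python's sqlite3 module as the engine and a hypothetical demo_sales table holding the caps and gloves figures. One difference to keep in mind is that SQLite's AVG always returns a floating-point value, so the averages print with decimals (about 11.67 and 33.33) instead of the whole numbers 11 and 33 mentioned in the walkthrough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_sales (id INTEGER, product TEXT, sales INTEGER)")
# Hypothetical table and ids; products and sales follow the walkthrough.
conn.executemany(
    "INSERT INTO demo_sales VALUES (?, ?, ?)",
    [(1, "Caps", 20), (2, "Caps", 10), (3, "Caps", 5),
     (4, "Gloves", 30), (5, "Gloves", 70), (6, "Gloves", None)],
)

avg_query = """
SELECT id, product, sales,
       -- NULL rows are skipped: Gloves is (30 + 70) / 2
       AVG(sales) OVER (PARTITION BY product) AS avg_ignoring_nulls,
       -- NULL replaced by 0 first: Gloves is (30 + 70 + 0) / 3
       AVG(COALESCE(sales, 0)) OVER (PARTITION BY product) AS avg_nulls_as_zero
FROM demo_sales
ORDER BY id
"""
avg_rows = conn.execute(avg_query).fetchall()
for row in avg_rows:
    print(row)
```

Which of the two columns is the right one depends on what a null means for the business; the query cannot decide that for you.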
Let's go back and practice SQL using some tasks and use cases. Okay, so let's
start with the basics. We have the following task: find the average sales
across all orders, and also find the average sales for each product. And don't
forget the details. Let's go and solve it step by step. Select order ID, order
date, and let's get the sales as well. Now let's find the average sales. It's a
window function with the sales inside it, the usual stuff, and the window is
empty. We call it average sales, and the table is sales orders. That's it.
Let's go and execute it. Oh, we have to select everything, of course. So what
SQL did in the output: it added up all those values and then divided by 10.
With that we have the average sales of 38. Very easy. This is again what we
call an overall analysis. Let's move to the next one: find the average sales
for each product. Again we build the window function like this: AVG of sales,
OVER, and we divide it by product ID. We call it average sales by products, and
we add the product ID to the query. That's it. Let's go and execute. We missed
something here, it is the PARTITION BY. Let's execute again. With that we have
the following data. SQL divides the data by product. For example, for this
product we have those four orders, so SQL adds up the four values and then
divides by four. That's why we have 35 here. The same thing for the next
product: it divides by three. And the last one just divides by one. That's why
we have 60. As you can see, the aggregation is done separately for each window,
and this is also a very nice way to compare the averages between the different
products. Okay. So now let's
have an example in order to learn how to deal with the nulls. Let's say that
we have the following task: find the average score of the customers, and also
show additional information like the customer ID and the last name. Let's go
and solve this. We are now targeting the table customers, so let's just select
from it first, like this. Now let's include the customer ID and the last name,
and let's have the score as well. But this time we want to find the average
score, so it's AVG of score. Since we don't partition the data, we leave the
window definition empty like this, and we call it the average score. That's
it. Let's go and execute it. As you can see, we have the average score of 625.
SQL adds up the four values and divides by four. But here we have a null. So
now we have to understand the business, or ask about it: what does the null
mean in the scores of the customers? Is it zero, or is it something empty? If
it's zero, then the average that we have is wrong, because it should be divided
by five and not four. Let's say it's zero. That means we have to handle the
nulls. So now we use the function COALESCE: COALESCE of the score, and we
replace the null with zero. We call it customer score. Let's go and execute
this. As you can see, if there is a value, it stays exactly the same value, and
only if we have a null is it replaced with zero. Now let's correct the average.
I'm just going to do it like this: let's copy the whole thing, but now, instead
of using the score, we use the score with the nulls handled. I'm just going to
replace it like this, and name it here without nulls. Let's go and execute it.
Now, as you can see, we are getting a more valid result in the output compared
to the previous one. And this is only for the case where the null means zero.
So guys, as you see, be very careful with the nulls, especially if you are
doing aggregations, and handle them correctly before doing any aggregations
like the average.
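The customer score task might look like the sketch below, run through Python's sqlite3 module. The customers table, the last names, and the individual scores are invented for illustration; the transcript only states that there are four scores plus one null and that their average is 625, so the four values here were picked to reproduce that number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, last_name TEXT, score INTEGER)")
# Hypothetical customers: four scores that average 625, plus one NULL.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alpha", 350), (2, "Bravo", 900), (3, "Charlie", 750),
     (4, "Delta", 500), (5, "Echo", None)],
)

score_query = """
SELECT customer_id, last_name, score,
       COALESCE(score, 0)                AS customer_score,          -- NULL shown as 0
       AVG(score) OVER ()                AS avg_score,               -- divides by 4
       AVG(COALESCE(score, 0)) OVER ()   AS avg_score_without_nulls  -- divides by 5
FROM customers
ORDER BY customer_id
"""
score_rows = conn.execute(score_query).fetchall()
for row in score_rows:
    print(row)
```

Placing both averages side by side makes the impact of the null handling visible in a single result: the same data gives 625 when the null is skipped and 500 when it is counted as zero.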
like the average. All right. Moving on to the last use case. We have the
to the last use case. We have the comparison analyzes and the task says
comparison analyzes and the task says find all orders where the sales are
find all orders where the sales are higher than the average sales across all
higher than the average sales across all orders. So that means we have to go and
orders. So that means we have to go and compare the current sales with the
compare the current sales with the aggregated value and this time the
aggregated value and this time the average of sales. So now let's go and do
average of sales. So now let's go and do it step by step. So what we're going to
it step by step. So what we're going to do we're going to go and select of
do we're going to go and select of course the order ID. What do we need the
course the order ID. What do we need the let's take the product ID and we need
let's take the product ID and we need the current sales. So it's going to be
the current sales. So it's going to be the sales as it is and that's it for
the sales as it is and that's it for now. So from sales orders. So that's it.
now. So from sales orders. So that's it. Let's go and execute it. So now by
Let's go and execute it. So now by checking the result, you can see that we
checking the result, you can see that we got the first part of the equation,
got the first part of the equation, right? We have the sales for each order.
right? We have the sales for each order. Now we need the second part, the average
Now we need the second part, the average sales across all orders. In order to do
sales across all orders. In order to do that, we're going to go and use the
that, we're going to go and use the window function average sales and we're
window function average sales and we're going to use over since across all
going to use over since across all orders that means it's going to be
orders that means it's going to be empty. So let's give it a name average
sales. So let's go and execute it. So now in the output we got the average
sales, and it's 38. So now we need all the orders that are higher than the
average. As you can see, for example, order one is not higher, but order four
is higher than the average. In order to filter the data, we cannot use the
window function in the WHERE clause, right? So what we're going to do, sadly,
is use a subquery. It's going to be like this: SELECT star FROM, and then we
define the condition outside the subquery. So it's going to be WHERE the sales
is higher than the average sales. So that's it. Let's go and execute it. And
now, as you can see, it's very simple. We got all the orders that are higher
than the average, right? You can see all those sales are higher than the
average. It would be nice if we could do all of that in the first query, but
since we cannot, we need to use a subquery in order to filter the data
afterward. So with that we can understand the importance of comparison
analysis. For example, here we are evaluating whether the data points are above
the average or below the average, and this is very important in business
analysis. All right, everyone. So that's all for the window function average.
Next, we're going to talk about two very interesting functions, the min and max.
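The filter described above can be sketched as follows. This is only a minimal illustration, not the query from the video: it runs the SQL through Python's built-in sqlite3 module, and the table name `orders`, the column names, and the ten sample rows are assumptions chosen so that the average comes out as the 38 mentioned in the narration.

```python
import sqlite3

# Assumed sample data: table and column names are hypothetical, and the rows
# are picked so that the overall average of sales is 38.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25),
    (6, 104, 50), (7, 102, 30), (8, 101, 90), (9, 101, 20), (10, 102, 60),
])

# A window function is not allowed in WHERE, so the average is computed in an
# inner query and the comparison happens in the outer query.
above_avg_sql = """
SELECT *
FROM (
    SELECT order_id, product_id, sales,
           AVG(sales) OVER () AS avg_sales
    FROM orders
) AS t
WHERE sales > avg_sales
ORDER BY order_id
"""
above_avg = conn.execute(above_avg_sql).fetchall()
for row in above_avg:
    print(row)
```

With this sample data only the orders whose sales exceed the overall average are returned, each still carrying the average next to it.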
All right guys, so what are the min and max functions? They are very simple
but yet very powerful functions for analytics. The min is simply the function
that returns the minimum, or let's say the lowest value within a window, and
the max is exactly the opposite. It's going to find the maximum value, or the
highest value within a window. So now let's go and understand how SQL works
with these functions. All right. So now we have the same data and we have two
tasks. First, we have to find the lowest sales for each product. And second,
side by side, we would like to find the highest sales for each product. So
we're going to use min and max. And as you can see, the syntax is very simple:
MIN of the sales, and then the partition is going to be by the product. And
here as well the same stuff, but with the MAX. Okay. So now let's see how SQL
is going to execute the first query. As usual, first it's going to prepare the
data. So it's going to split the data into two windows, one for the caps and
another one for the gloves. And after that it's going to search for the lowest
sales within each window separately. So for the first window we have the
following values: 20, 10 and five. And of course the lowest value is going to
be the five. So that's why SQL is going to find it over here, and everywhere
for this window it's going to be the value five. So we have it as the lowest
sales for the product caps. So now it's going to jump to the next window for
the gloves and start searching the values. As you can see, we have 30, 70 and
null. Null will be ignored, so null will not be considered as the lowest value.
So SQL is going to find the lowest sales, the 30. It's actually the first row
within this window, and the output is going to be 30 for each row. So that's
it. It's very simple, right? Now let's move to the next one.
We have the same stuff but using max. So the data is partitioned, and for the
first partition, what is the highest value? It's going to be the first row,
right? The 20. So SQL is going to find it, and in the output we will get the
highest sales 20 for this window. Then it's going to go to the second window
and search for the highest value. Here we have two values, 30 and 70, and it's
going to be the 70, right? So it's going to point to it over here, and in the
output we will get 70 everywhere. So guys, it's really simple, right? Now let's
go back to our scenario from the average, where in our business we understand
nulls as zero in the sales. That means first we have to handle the nulls and
replace them with zero, and then we're going to search for the value. So what's
going to happen? We're going to replace nulls with zero. For the max nothing is
going to change: the highest value is going to be 70 and we're going to get the
same output. But for the min we now have a new lowest value. It's not the 30
anymore, it's actually the zero. So SQL is going to go over here and replace
the 30 with zero. So zero is the lowest sales for the product gloves. So again,
guys, the nulls are very tricky and those functions are really sensitive to the
nulls. Understand what the nulls mean and handle them correctly so that you get
correct results in the output. So that's it. Let's go back to SQL to have some
tasks and use cases in order to practice SQL.
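The caps and gloves walkthrough can be reproduced with a small sketch. It runs through Python's built-in sqlite3 module; the table name `product_sales`, the column names, and the use of COALESCE to turn NULL into zero are assumptions for illustration, while the six values are the ones read out in the narration.

```python
import sqlite3

# Values as narrated: caps 20, 10, 5 and gloves 30, 70, NULL.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sales (product TEXT, sales INTEGER)")
conn.executemany("INSERT INTO product_sales VALUES (?, ?)", [
    ("Caps", 20), ("Caps", 10), ("Caps", 5),
    ("Gloves", 30), ("Gloves", 70), ("Gloves", None),
])

min_max_sql = """
SELECT product,
       sales,
       MIN(sales) OVER (PARTITION BY product) AS lowest_sales,   -- NULL is skipped
       MAX(sales) OVER (PARTITION BY product) AS highest_sales,
       MIN(COALESCE(sales, 0)) OVER (PARTITION BY product) AS lowest_sales_null_as_zero
FROM product_sales
ORDER BY product, sales
"""
min_max_rows = conn.execute(min_max_sql).fetchall()
for row in min_max_rows:
    print(row)
```

Comparing the third and the last column for the gloves shows how much the result depends on whether a NULL is skipped or read as zero.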
All right everyone, let's start with the basic stuff. Find the highest and
lowest sales of all orders, and as well find the highest and lowest sales for
each product, and we have to provide additional information. So let's go and
solve it. Select the order ID, and let's take the product ID as well. Now let's
find the highest sales of all orders. It's going to be the MAX function for the
sales, and the window definition is going to be empty, since it's for all
orders. So we call it highest sales. Let's go for the lowest sales of all
orders. It's going to be exactly the opposite: the MIN function for sales, OVER,
and then we have the lowest sales. So I'm just going to make it capital. Let's
select from the table Sales.Orders. I think that's it. Let's have the sales as
well, actually. All right. So now let's go and execute it. This is very simple,
right? These are all the sales. What is the highest sales? We have the 90 of
order eight. So, as you can see, we now have the highest sales, the 90, and the
lowest sales is the 10. The first order is the lowest. So, it's very easy. Now
we're going to repeat the same stuff for the product, so we have to partition
the data by the product ID. What I'm going to do is just copy and paste stuff
around. The first one is going to be PARTITION BY the product ID, so highest
sales by product. And the next one is going to be the same stuff, copy and
paste, by the product. So that's it. Let's go and execute it. Now again the
data is going to be partitioned and divided by the product. For the first
window, what is the highest sales? It's going to be the 90, and the lowest
sales is going to be the 10. So it's exactly like the overall, right? Now let's
go to the second window over here. We can see that the highest sales is the 60,
the first one, and the lowest this time is 15. And this is great in order to
see that SQL is going to execute each of those functions for each window
separately. So let's go to the last window. It's a funny one. The sales is 60
and we have only one row, so it's going to be the highest and as well the
lowest sales. So with that, as you can see, we can define a range for each
product, and the ranges differ from one product to another. For example, for
this product 101 the range is from 10 until 90, but for the second product we
have it between 15 and 60.
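A minimal sketch of this task, run through Python's built-in sqlite3 module. The table `orders`, its column names and the sample rows are assumptions; the rows are chosen so that they agree with the figures mentioned above (overall 10 to 90, product 101 from 10 to 90, the second product from 15 to 60, and a single-row product with 60).

```python
import sqlite3

# Assumed sample data; names and rows are hypothetical but consistent with
# the numbers quoted in the narration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25),
    (6, 104, 50), (7, 102, 30), (8, 101, 90), (9, 101, 20), (10, 102, 60),
])

range_sql = """
SELECT order_id,
       product_id,
       sales,
       MAX(sales) OVER ()                        AS highest_sales,
       MIN(sales) OVER ()                        AS lowest_sales,
       MAX(sales) OVER (PARTITION BY product_id) AS highest_sales_by_product,
       MIN(sales) OVER (PARTITION BY product_id) AS lowest_sales_by_product
FROM orders
ORDER BY product_id, order_id
"""
range_rows = conn.execute(range_sql).fetchall()
for row in range_rows:
    print(row)
```

An empty OVER () treats the whole table as one window, while PARTITION BY product_id restarts the search for each product, which is why the last two columns change from product to product.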
Okay guys, let's move to the next one, which is one of my favorites in the
window functions, where we filter the data using the min and max functions.
Let's have the following task. It says: show the employees who have the highest
salaries. So this sounds very simple, but we can use the help of window
functions in order to solve it. Now we are working with the table employees.
Let's just select the data, so select from Sales.Employees. That's it. Let's go
and execute it. So now we have five employees and we have those different
salaries. Let's go and find the highest salary. So MAX of salary, and let's use
the window function OVER, but we don't partition the data at all. So it's going
to be like this, highest salary. Let's go and execute it. And now, by checking
the results, we got a new column called highest salary, and inside it we have
the 90k. If you check those five salaries, you can see that the highest is from
the employee Michael. But the task is still not solved. We have to show only
the employees who have the highest salaries, so we somehow have to filter the
data and only show this employee. In order to do that we have to use a
subquery, since we cannot use the window function in the WHERE clause. So what
we're going to do is SELECT star FROM, and then our first query is going to be
the inner query. And we have the following condition: the salary should be
equal to the highest salary. So it's very simple. With that we are comparing
the salaries with the highest salary. If there is a match, the data is going to
be presented. So let's go and execute that. And that's it. As you can see, we
got the employee with the highest salary. And if there are multiple employees
with the same salary of 90k, of course we're going to get all of them in the
results. I think Michael is going to need a new job, right? This is the worst.
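The highest-salary filter can be sketched like this, using Python's built-in sqlite3 module. The table name `employees`, its columns, and every row except Michael's 90k are assumptions made up for the illustration.

```python
import sqlite3

# Five hypothetical employees; only "Michael" with 90000 comes from the narration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, first_name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "Frank", 55000), (2, "Kevin", 65000), (3, "Mary", 75000),
    (4, "Michael", 90000), (5, "Carol", 55000),
])

# The inner query attaches the overall maximum to every row;
# the outer query keeps the rows whose salary equals that maximum.
top_paid_sql = """
SELECT *
FROM (
    SELECT *,
           MAX(salary) OVER () AS highest_salary
    FROM employees
) AS t
WHERE salary = highest_salary
"""
top_paid = conn.execute(top_paid_sql).fetchall()
print(top_paid)
```

Because the comparison is an equality against the maximum, ties are kept: a second employee with the same top salary would also be returned.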
So this is another use case for the window functions min and max. All right.
So now we come to the use case of comparison analysis, where we want to compare
the current sales with the highest and the lowest value. We have the following
task. It says: find the deviation of each sales from the minimum and the
maximum sales amount. So now, as you can see, this is our sales, this is the
highest and this is the lowest. We just have to subtract the values from each
other in order to get the deviation. So it's very simple. Let's get the first
deviation, where we're going to subtract the lowest value from the sales. It's
going to be like this. So what we are doing over here is subtracting the lowest
sales of all records from the sales, and we're going to call it deviation from
min. Let's go and execute it. Now we can see from those values how far the
current value is from the extreme. The extreme here is the lowest value. So
this is a really great way to analyze the extremes in your data. As we get near
the extreme, the value is going to be low. As you can see, here we have a zero.
This is the lowest, because it is exactly the extreme. So actually this is our
value, the 10. Now the next one, which is 15, is a little bit away from the
extreme, so we have it here as a five. This is not far away from our extreme
value. And then if you check this value over here, we have 80, so the distance
is very far from our extreme value, the lowest sales. So this is a really nice
analysis in order to evaluate the sales in your data. Now of course we can
evaluate our data against another extreme, which is the highest sales. In order
to do that, we take the highest sales and subtract the sales from it. So we
call it deviation from max. Let's go and execute it. Now we can see in the
output that we get exactly the opposite distances.
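Both deviations can be sketched in one query, run here through Python's built-in sqlite3 module. The table `orders`, its column names and the sample rows are assumptions, chosen so that the lowest sale is 10 (order 1) and the highest is 90 (order 8) as in the narration.

```python
import sqlite3

# Assumed sample data; names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 60), (5, 104, 25),
    (6, 104, 50), (7, 102, 30), (8, 101, 90), (9, 101, 20), (10, 102, 60),
])

deviation_sql = """
SELECT order_id,
       sales,
       sales - MIN(sales) OVER () AS deviation_from_min,  -- distance to the lowest sale
       MAX(sales) OVER () - sales AS deviation_from_max   -- distance to the highest sale
FROM orders
ORDER BY order_id
"""
deviations = conn.execute(deviation_sql).fetchall()
for row in deviations:
    print(row)
```

The order of the subtraction is chosen so that both distances are non-negative: zero means the row sits exactly on that extreme.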
So order number one is the farthest from the extreme. As you can see, we have
the value of 80, and order eight is the identical one, which is why we have the
distance of zero. So now we can also see very quickly which data points are the
nearest to the extreme, to the highest sales. So as you can see, guys, using
the window functions min and max is very powerful in order to understand and
evaluate your data points relative to the extremes.
All right everyone, so now we're going to focus on a very important use case.
One of the must-know use cases for data aggregations is doing a running total
and a rolling total. These two concepts are very important for data analysis
and reporting, and you must know them. The key use case for those two concepts
is tracking. For example, we can track the current total sales against the
target sales in our business. And as well, it's great for doing historical
analysis of the trends. Okay. So now the question is, what are running and
rolling totals? They are basically very similar. They aggregate a sequence of
members, and the aggregation gets updated each time we add a new member to the
sequence. A sequence could be a time sequence, and that's why we call this type
analysis over time. So now we still have the question, what is the difference
between the running and the rolling totals? The running total is going to
aggregate everything from the beginning until the current data point without
dropping off any old data. The rolling total, on the other hand, is going to
focus on a specific time window, like the last 30 days or the last two months,
and each time we add a new member or a new data point to the window we drop off
the oldest data point in the window. With this we get the effect of a rolling,
or let's say shifting, window. Okay, I totally understand if this might be
complicated. Now let's go and have a very simple example in order to understand
this concept and as well how we can solve it using SQL.
All right guys, so now we have a very simple example. We have the months and
sales, and we have them twice because I want to show you side by side how SQL
works with the running total and the rolling total. So now what is the task? On
the left side, we want to find the running total of sales for each month, and
on the right side we would like to find the three-month rolling total of the
sales for each month. They sound very similar, but on the right side we have
only a fixed window. So now, how can we solve this using SQL? On the left side
we can use SUM of sales. We want to aggregate all the sales using the sum
function, and the definition for the window is going to be like this: ORDER BY
month. And of course you can do anything here, like having an average. If you
use an average with ORDER BY you will get the running average, or the running
max, or the running count, and so on. So that means whenever you mix an
aggregate function together with an ORDER BY, you generate the effect of a
running total. Now on the right side we can have the same stuff, an aggregate
function together with ORDER BY. So SUM of sales, ORDER BY month. So far we
have everything like the left side, right? But now you might ask why this is
going to generate the effect of the running total. We didn't specify any crazy
stuff here, right? It's all about the definition of the frame clause. Do you
remember, if you use an ORDER BY and you don't specify a frame clause, you get
a hidden, or let's say default, frame clause, and it's going to look like this:
rows between unbounded preceding and current row. And what was the definition
of the running total? It's going to aggregate all the data from the very
beginning, the unbounded preceding, until the current position, the current
row, without dropping off any old members. So that means the definition of the
running total is exactly the definition of the default frame clause. That's why
it generates the effect of the running total. Now let's go to the right side,
the rolling total. Here again we have the same stuff, right? We aggregate the
data using the sum function and we sort the data, ORDER BY month. So with that
we are as well generating the effect of a running total, as each time you use
ORDER BY with an aggregate function. But in the rolling total we always want to
specify a frame, so here in this example three months. That means if we are
getting a new month, we don't want to keep including the oldest months. We
always want a fixed window. Now, in order to have this fixed window effect, we
have to redefine the frame clause, because if you leave it as the default, like
the running total, the frame is going to keep extending. You will see this
effect in the example. So now we define it like this: rows between two
preceding and current row.
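The two window definitions can be compared side by side in a small sketch, run through Python's built-in sqlite3 module. The table `monthly_sales`, its columns, and the sales figures from March onward are assumptions; only January's 20 and February's 10 come from the example in the video. One caveat worth knowing: in SQL Server, PostgreSQL and SQLite the implicit default frame is defined with RANGE rather than ROWS, which gives the same running total as long as the ORDER BY values are unique, as the months are here.

```python
import sqlite3

# Hypothetical monthly data; only the first two sales values (20 and 10)
# are taken from the narration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", [
    (1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5),
    (5, "May", 70), (6, "Jun", 40), (7, "Jul", 15),
])

totals_sql = """
SELECT month,
       sales,
       -- running total: ORDER BY with the implicit default frame
       SUM(sales) OVER (ORDER BY month_no) AS running_total,
       -- the same running total with the frame written out
       SUM(sales) OVER (ORDER BY month_no
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total_explicit,
       -- three-month rolling total: current row plus the two rows before it
       SUM(sales) OVER (ORDER BY month_no
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_3m_total
FROM monthly_sales
ORDER BY month_no
"""
totals = conn.execute(totals_sql).fetchall()
for row in totals:
    print(row)
```

For the first three months both totals agree, because the rolling frame has not yet reached its full size; from the fourth month on, the rolling total starts dropping the oldest month while the running total keeps growing.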
between two preceding and current row. So the total number of rows going to be
So the total number of rows going to be included in each window going to be
included in each window going to be maximum of three months. So now I know
maximum of three months. So now I know you might saying bar what you are
talking about? You didn't get anything. That's totally normal; you will understand it
only with an example. So let's start with the left side. First, SQL is going to sort
the data, so everything is sorted from the earliest month to the latest one, from
January until July. Everything is good. Now it's going to start working with the
frame. The frame says UNBOUNDED PRECEDING, so this boundary is static: it always
points to January, the first row in the data set. And of course we are working from
top to bottom, so the current row points to January as well. So the frame looks like
this: it's only one row, and the total sales of this row is 20. That's why we get 20
in the output. Now let's move to the right side. The current row is January as well.
And what is 2 PRECEDING? We don't have it yet, so it points somewhere before the
table. So again, what is the frame? It's one row as well, so in the output we get
exactly the same result, 20. So far there is no difference between the running total
and the rolling total.
the running total and the rolling total. But let's keep going. Now we're going to
But let's keep going. Now we're going to go to the next row over here. So what
go to the next row over here. So what can happen to our frame? It going to go
can happen to our frame? It going to go and extend, right? So we're going to
and extend, right? So we're going to have now two months in this frame. And
have now two months in this frame. And what is the total sales over here? It's
what is the total sales over here? It's going to be 30. So we added a new
going to be 30. So we added a new member. You can calculate it like this.
member. You can calculate it like this. Either go and calculate all the sales
Either go and calculate all the sales within the frame or you can go and say
within the frame or you can go and say this is the previous aggregated value
this is the previous aggregated value plus the new member. So the previous one
plus the new member. So the previous one was 20. The new member is 10. We will
was 20. The new member is 10. We will get 30. Both of them is correct. So now
get 30. Both of them is correct. So now let's move to the right side. What's
let's move to the right side. What's going to happen? We're going to be as
going to happen? We're going to be as well at February. The two preceding is
well at February. The two preceding is still like pointing somewhere outside.
still like pointing somewhere outside. And here the window going to go and
And here the window going to go and extend like this. We have two months and
extend like this. We have two months and the same aggregation going to happen. So
the same aggregation going to happen. So we have 30. So so far nothing crazy
we have 30. So so far nothing crazy right. Let's go to the next month March.
right. Let's go to the next month March. The frame going to be extended. So we
The frame going to be extended. So we have now three months. And the
have now three months. And the aggregation going to be either here 60
aggregation going to be either here 60 or 30 + 30. We will get the running
or 30 + 30. We will get the running total of 60. And now on the right side
total of 60. And now on the right side what going to happen? We're going to
what going to happen? We're going to point as well to March. And this time
point as well to March. And this time the two proceeding going to be pointing
the two proceeding going to be pointing to January. And this is the first time
to January. And this is the first time we are getting the whole fixed frame.
we are getting the whole fixed frame. Right? So we have here three muscles in
Right? So we have here three muscles in this frame. So what is the total of
this frame. So what is the total of that? It's going to be 60. Okay. So now
that? It's going to be 60. Okay. So now you say, okay, we're still getting the
you say, okay, we're still getting the same results. There's no difference. I'm
same results. There's no difference. I'm going to say wait for it. It's going to
going to say wait for it. It's going to be the next one. So as we go to April,
be the next one. So as we go to April, the effect here is that the frame going
the effect here is that the frame going to get extended to four months because
to get extended to four months because always we start from the first month
always we start from the first month until the current month without dropping
until the current month without dropping any member outside. So what is the total
any member outside. So what is the total of this? It's going to be 65. Sorry,
of this? It's going to be 65. Sorry, like this. So now on the right side,
like this. So now on the right side, what going to happen? We're going to go
what going to happen? We're going to go and add a new member. the April but we
and add a new member. the April but we are at the maximum sides of the window
are at the maximum sides of the window we have only three and that's because
we have only three and that's because the two preceding going to shift as well
the two preceding going to shift as well down over here so the boundary going to
down over here so the boundary going to be from February until April and with
be from February until April and with that we are dropping off January and now
that we are dropping off January and now you're going to see the effect it is
you're going to see the effect it is sliding it is rolling or shifting from
sliding it is rolling or shifting from top to bottom and that's because the
top to bottom and that's because the boundaries as well shifting so you can
boundaries as well shifting so you can see now the effect of the rolling total
see now the effect of the rolling total the newest member going to be added the
the newest member going to be added the oldest member going to be But we are
oldest member going to be But we are allowed only to have three muscles. So
allowed only to have three muscles. So what is the total of this? It's going to
what is the total of this? It's going to be 45. So this times we are not
be 45. So this times we are not aggregating this value the 60 together
aggregating this value the 60 together with the five. We are aggregating the
with the five. We are aggregating the values within the window. So now let's
values within the window. So now let's keep going. Now we are at June. What can
keep going. Now we are at June. What can happen on the left side? The frame going
happen on the left side? The frame going to get bigger. And with that we will get
to get bigger. And with that we will get the result of 135. So the frame is
the result of 135. So the frame is getting really bigger. But on the right
getting really bigger. But on the right side it's going to has a fixed frame. So
side it's going to has a fixed frame. So we are just sliding, shifting and
we are just sliding, shifting and rolling. So with that we are adding new
rolling. So with that we are adding new member. Another member is leaving the
member. Another member is leaving the oldest one. And the total over here
oldest one. And the total over here going to be 105. And now we're going to
going to be 105. And now we're going to go to the last row. We will have
go to the last row. We will have everything for the ring total. So the
everything for the ring total. So the whole data set is going to be
whole data set is going to be aggregated. So this is the maximum what
aggregated. So this is the maximum what we're going to get. It's going to be
we're going to get. It's going to be around 175. But on the right side it
around 175. But on the right side it just going to keep shifting until we
just going to keep shifting until we reach the last record. the window the
reach the last record. the window the frame going to be as well shifting like
frame going to be as well shifting like this. So the total of this going to be
this. So the total of this going to be 105. Okay guys so you see it's very
105. Okay guys so you see it's very simple the running total it's always
simple the running total it's always consider everything from the starting
consider everything from the starting position until the current row without
position until the current row without dropping any member. The rolling total
dropping any member. The rolling total it's always drop the oldest member in
it's always drop the oldest member in order to add something new and the
order to add something new and the window is keep shifting. So the running
window is keep shifting. So the running total is very great in order to do
total is very great in order to do tracking like for example budget
tracking like for example budget tracking or we track for example the
tracking or we track for example the current total sales with a target or
current total sales with a target or something like that. So always we are
something like that. So always we are considering the whole data sets but with
considering the whole data sets but with the rolling total we always do here
the rolling total we always do here focused analyzes. We are always
focused analyzes. We are always interested with the window of 3 months.
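
The walkthrough above can be reproduced with a small query. This is only a sketch under a few assumptions: the CTE name `monthly_sales_demo` and its column names are made up for illustration, only the first four months are included (their values are the ones stated in the walkthrough), and the syntax is meant for SQL Server or PostgreSQL.

```sql
-- Hypothetical demo data: the name and columns are assumptions, not the course's table.
-- Only January to April are included; those values come from the walkthrough.
WITH monthly_sales_demo AS (
    SELECT *
    FROM (VALUES
        (1, 'January',  20),
        (2, 'February', 10),
        (3, 'March',    30),
        (4, 'April',     5)
    ) AS v (month_no, month_name, sales)
)
SELECT
    month_name,
    sales,
    -- Running total: the frame start is pinned to the first row, so the frame only grows.
    SUM(sales) OVER (
        ORDER BY month_no
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    -- Rolling total: at most three rows; the oldest row drops out as the frame slides.
    SUM(sales) OVER (
        ORDER BY month_no
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_total
FROM monthly_sales_demo
ORDER BY month_no;

-- month_name  sales  running_total  rolling_total
-- January        20             20             20
-- February       10             30             30
-- March          30             60             60
-- April           5             65             45
```

The two columns agree until the frame on the right is full; April is the first row where the rolling total (10 + 30 + 5) differs from the running total.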
So they might sound very similar, but they have a completely different scope of
analysis. Both of them do aggregations over time, so they help us analyze things over
time, like checking whether our business is growing or declining. So guys, as you can
see, using very simple SQL with window functions we can do really great analysis on
our data. This stuff is really fundamental for data analysis and for reporting for
our business. So window functions are really powerful for data analytics. Okay. So
now we have the following task, and it says: calculate the moving average of sales
for each product over time. So here we have something called a moving average. It is
very similar to the running total. In the running total we used COUNT and SUM and so
on, but here we use the AVG function, and instead of calling it a running average we
call it a moving average. So let's solve the task. Let's start, as always, by
selecting the usual stuff: the order ID, the product ID, and, since it's over time,
the order date as well, and last the sales, from our table Sales.Orders. That's it,
let's execute it. So now we have our 10 orders with the product, order date and sales.
Let's start building our window function step by step. Which function do we need? We
need the average. This is the easy part, since the task says moving average. And we
need the sales, so it's going to be the average of sales. Now let's define the
window. Do we have to divide the data, partition the data? Well, yes. It says for
each product, which means we use the PARTITION BY clause on the product ID. I would
say that's it for the first step: the average by product. Let's execute it. If you
check the result, you can see that we got our windows. The first one is for product
101, and the average of its sales is 35. So we have one aggregated value for each
window, and the same for the next product, and so on. We don't have any progress over
time, nothing like an average that moves over time, right? We don't have this effect.
We have just one average for each window. So in order to get the effect of the moving
average, it works like the running total: we have to use the aggregate function
together with ORDER BY. I'm going to do it in a new column and copy everything from
the previous one. And now what do we add? ORDER BY. Since it's over time, we use the
order date, and we sort it ascending, because over time always means starting with
the earliest date and ending with the latest one, from the lowest to the highest. We
leave it like this. Let's call it moving average. Now let's execute it. We got an
extra comma here because of the copy and paste, so let's execute it again.
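
Put together, the two columns built in this step might look like the sketch below. The CTE `orders_demo`, the order IDs and the dates are assumptions made up for illustration; only the four sales values of product 101 (10, 20, 90, 20) are taken from the walkthrough of the results. The syntax is meant for SQL Server or PostgreSQL.

```sql
-- Hypothetical demo data: order IDs and dates are invented;
-- the four sales values of product 101 come from the walkthrough.
WITH orders_demo AS (
    SELECT *
    FROM (VALUES
        (1, 101, CAST('2025-01-01' AS DATE), 10),
        (2, 101, CAST('2025-01-10' AS DATE), 20),
        (3, 101, CAST('2025-02-18' AS DATE), 90),
        (4, 101, CAST('2025-03-10' AS DATE), 20)
    ) AS v (OrderID, ProductID, OrderDate, Sales)
)
SELECT
    OrderID,
    ProductID,
    OrderDate,
    Sales,
    -- Step 1: one average per product, repeated on every row of the window.
    AVG(Sales) OVER (PARTITION BY ProductID) AS AvgByProduct,
    -- Step 2: adding ORDER BY makes the average accumulate row by row.
    AVG(Sales) OVER (PARTITION BY ProductID ORDER BY OrderDate) AS MovingAvg
FROM orders_demo
ORDER BY ProductID, OrderDate;

-- Sales  AvgByProduct  MovingAvg
--    10            35         10
--    20            35         15
--    90            35         40
--    20            35         35
-- (Some engines display the averages with decimals, e.g. 15.0.)
```

With ORDER BY and no explicit frame, the window runs from the start of the partition up to the current row, which is why the last MovingAvg value equals AvgByProduct.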
All right, now let's check the results. Take the first window over here. You can see
progress in the moving average: it goes 10, 15, 40, 35. So the average is moving. We
don't have one fixed number for the average; we have different values. So how does
SQL work this out? It's really simple: it goes row by row. For the first row, what is
the average of 10? It's 10. Moving on to the next one, it's 10 + 20 divided by 2, so
you get 15. For the third one, all three values are summed and divided by three, and
you get 40. And for the last row in the window, all four values are summed and
divided by four, and you get 35. This is exactly the same value as in the previous
column, the average by product, where we don't have an ORDER BY. You get 35 there as
well, exactly like this last row, because it is the same calculation: summing all
four values and dividing by four. Now the next value is interesting. As you can see,
it comes from another window. Here we have sales of 15 for product 102, and the
average is 15 as well. So SQL is not considering the old values from the other
window; it calculates each window separately. So here again, this is the first value
of this window: sales 15, average 15. Then it's the same procedure, right? Summing
those values, dividing by two, and so on.
In data analysis we call this last field a moving average, and you can implement it
very simply using the AVG function together with ORDER BY. All right, let's move to
the next task. It says: calculate the moving average of sales for each product over
time, including only the next order. As you can see, we have already done the first
part, right? We have the moving average, partitioned by product. But here we have an
extra specification: including only the next order. That means we are talking about
the current order and the next order. So here we have a fixed frame, a fixed window.
We don't need the average of the whole window; we need at most two orders in each
calculation. So how do we do that? We define a custom frame clause inside our window
function. That means we cannot leave the frame at its default; we have to specify it.
So let's do that. I will just copy the old definition of the window, because it's
exactly what we need: the average of sales, OVER, PARTITION BY product ID, ORDER BY
order date. That is the first part. Now we would like to have this fixed window, so
we define our frame clause. I'm just going to zoom out a little bit. It's going to be
ROWS BETWEEN, and then we have the boundaries of the frame. The task says including
the next order, so we use FOLLOWING. The first boundary is the CURRENT ROW, and since
it's the next order, the second one is 1 FOLLOWING. That is our frame, including only
the next order. Let's call it rolling average. That's it, let's execute. Now let's
check the result. You can see the moving average has completely different values from
the rolling average.
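
As a sketch of the query described here: the CTE `orders_demo`, the order IDs and the dates are assumptions for illustration, while the four sales values of product 101 (10, 20, 90, 20) are the ones used in the course's walkthrough. The syntax is meant for SQL Server or PostgreSQL.

```sql
-- Hypothetical demo data: order IDs and dates are invented;
-- the sales values of product 101 are the ones discussed in the course.
WITH orders_demo AS (
    SELECT *
    FROM (VALUES
        (1, 101, CAST('2025-01-01' AS DATE), 10),
        (2, 101, CAST('2025-01-10' AS DATE), 20),
        (3, 101, CAST('2025-02-18' AS DATE), 90),
        (4, 101, CAST('2025-03-10' AS DATE), 20)
    ) AS v (OrderID, ProductID, OrderDate, Sales)
)
SELECT
    OrderID,
    ProductID,
    OrderDate,
    Sales,
    -- Default frame: everything from the start of the partition to the current row.
    AVG(Sales) OVER (
        PARTITION BY ProductID
        ORDER BY OrderDate
    ) AS MovingAvg,
    -- Custom frame: only the current order and the next one (at most two rows).
    AVG(Sales) OVER (
        PARTITION BY ProductID
        ORDER BY OrderDate
        ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
    ) AS RollingAvg
FROM orders_demo
ORDER BY ProductID, OrderDate;

-- Sales  MovingAvg  RollingAvg
--    10         10          15   -- (10 + 20) / 2
--    20         15          55   -- (20 + 90) / 2
--    90         40          55   -- (90 + 20) / 2
--    20         35          20   -- last row: no next order inside the window
```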
So let's understand why. We're going to do it row by row. Take the first row over
here. The sales is 10 and the rolling average is 15. Why is that? Because the
calculation considers the next value: 10 + 20 divided by 2 gives 15. That means SQL
defined the frame as those two rows for the calculation of the first row. Now moving
on to the second row. SQL includes the third one as well, the next one, right? But
since the window is only two orders, it drops the first row. So the next frame looks
like this, and as you can see it's 20 + 90 divided by 2, which gives 55. So now you
can see the effect of the rolling average, right? For the next one it's exactly the
same. We are at the third row, it includes the next one, and we get the same value,
because 90 + 20 divided by 2 gives 55. Now the last row in the window is interesting.
It cannot consider a next value, because that would be outside of the window. So the
sales is 20 and the average stays at 20 as well. So that's it. All right, guys.
So with that we have learned about the moving average, the rolling average and those
amazing concepts using window functions. All right, now let's have a quick overview
of the different use cases of the aggregate functions, and how the definition of the
window changes the whole use case. The first use case is finding the overall total.
If you don't define anything in the window, if you leave it empty, you are doing an
overall analysis: SQL aggregates the whole data set and then provides this
aggregation for each row. That is what happens if you leave it empty. Moving to the
next step, we can do an analysis called total per group. For that we add PARTITION BY
to the definition of the window. By adding, for example, PARTITION BY product, the
data is split into two categories, two groups, and the aggregation is done for each
window separately. This is of course a great analysis for comparing different
products, like here the caps and the gloves. So it is helpful for comparing
categories, and you get this total-per-group analysis if you use PARTITION BY. Now if
you use ORDER BY, you land in the third use case: as we learned, we are doing a
running total.
As you can see in the output, we are building a cumulative value for the sales, and
this helps us do progress-over-time analysis in order to understand the performance
of our business. And now moving on to the last use case, the final phase of the
window function with aggregation. Here you have the aggregate function together with
ORDER BY and a customized fixed window, and we can use it to track progress over time
within a specific fixed window. And of course you get the same effect in these use
cases if you use the other functions: not only SUM, you can use AVG, COUNT, MAX, so
all aggregate functions. So guys, as you can see, the window function in SQL is very
important for data analytics. Just by changing one part of the window, you generate a
whole new use case for data analytics. All right friends, so now let's do a quick
recap of the window aggregate functions. What do they do? They aggregate a set of
values and return a single aggregated value for each row. So it's very similar to
GROUP BY, but here we don't lose the details. Now to the next point: what are the
syntax rules for the expressions? They all expect a number in the expression, so you
have to pass a number like sales or any integer; only for COUNT can you use any data
type. And things are very simple for the aggregate functions.
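
To make the four use cases from the overview concrete, here is a sketch in which the same SUM is paired with four different window definitions. The CTE `sales_demo`, its columns and all of its values are assumptions invented for illustration (the course's own caps-and-gloves figures are not given in the transcript), and the syntax is meant for SQL Server or PostgreSQL.

```sql
-- Hypothetical demo data: every name and value here is invented for illustration.
WITH sales_demo AS (
    SELECT *
    FROM (VALUES
        (1, 'Caps',   20),
        (2, 'Caps',   10),
        (3, 'Gloves', 30),
        (4, 'Gloves',  5)
    ) AS v (sale_no, product, sales)
)
SELECT
    sale_no,
    product,
    sales,
    -- 1) Empty window: overall total, repeated on every row.
    SUM(sales) OVER () AS overall_total,
    -- 2) PARTITION BY: total per group.
    SUM(sales) OVER (PARTITION BY product) AS total_per_product,
    -- 3) ORDER BY: running total.
    SUM(sales) OVER (ORDER BY sale_no) AS running_total,
    -- 4) ORDER BY plus a fixed frame: rolling total over two rows.
    SUM(sales) OVER (
        ORDER BY sale_no
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS rolling_total
FROM sales_demo
ORDER BY sale_no;

-- sale_no  product  sales  overall  per_product  running  rolling
--       1  Caps        20       65           30       20       20
--       2  Caps        10       65           30       30       30
--       3  Gloves      30       65           35       60       40
--       4  Gloves       5       65           35       65       35
```

All four input rows are still present in the result, which is the difference from GROUP BY: the aggregate is attached to each row instead of collapsing the rows.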
aggregate functions are very simple. Everything is optional inside the
Everything is optional inside the definition of the overclouds or the
definition of the overclouds or the definition of the window. So you can go
definition of the window. So you can go and use partition by order by frames or
and use partition by order by frames or not or just leave everything empty. So
not or just leave everything empty. So everything is optional. So now as we
everything is optional. So now as we learned we have a lot of use cases for
learned we have a lot of use cases for the aggregate functions and they are
the aggregate functions and they are really amazing for analytics. So the
really amazing for analytics. So the first one the simplest one you can do
first one, the simplest one: you can do an overall analysis if you just leave
the window definition empty. So you will get one big number about your
business. And the next use case, we can do a total-per-group analysis. As
you've learned, we can use PARTITION BY in order to compare categories with
each other, like comparing the products or customers and so on. Moving on to
the next one, we can do part-to-whole analysis. We can go and compare the
performance of each data point with the overall. So you can, for example,
compare the sales to the total sales in the window or in the entire data set.
And we have many comparison analyses. We can go and compare the current value
with the average, or we can compare it to the extremes, to the highest sales,
to the lowest sales, and so on. And another use case, we can go and identify
data quality issues in our data. So we can go, for example, and identify
duplicates using the COUNT function. Moving on to the next use case, we have
outlier detection. We can go and find out which data points are above the
average and below the average and so on. Then the next one, we have the
running total. As we learned, it is a great tool in order to track the
progress or to monitor the performance of our business over time. Or if you
want to be more specific, you can go and use the rolling total in order to
have a specific window and only track this window, like three months or
something like that. And the last use case, we can go and calculate the
moving average of our data. So it's really amazing how ORDER BY and aggregate
functions can open a door for you to advanced analyses. So guys, as you can
see, we have a lot of use cases for the window aggregate functions in the
world of data analytics. All right. So with that we have covered the
aggregate window functions, and the next step is going to be very important.
We will learn how to rank our data using window functions. So let's go.
All right. So now let's say that we have the following data. We have
products and their sales. If you now want to go and rank your products, first
you have to sort the data based on something, for example ranking the
products based on their sales. So that means SQL first is going to go and
start sorting your data from the highest to the lowest. So sorting the data
is always the first thing SQL has to do before ranking anything. Now in order
to rank our data we have two methods. The first method we call integer-based
ranking. So that means SQL is going to go and assign each row an integer, a
whole number, based on the position of the row. So now, looking at the
example, in the first row we have product E with sales of 70; it's going to
be rank number one. Then the next row, product B with 30 sales, will get rank
number two. Then the next ones are going to be three, four, and the last one
is going to be five. So that means SQL here is assigning an integer to each
row based on its position in the sorted list. So this method we call
integer-based ranking. Now let's go to the second method: we have
percentage-based ranking. So in this method SQL is going to go first and
calculate the relative position of the row compared to all the others and
then assign a percentage to each row. So here in the output it is going to
start assigning percentages instead of integers, and we're going to have a
scale from zero to one. So now if you go and compare both of the methods, you
can see that on the left side, in the integer-based ranking, we have discrete,
distinct values. So it starts from 1, then 2, 3, and ends in this example at
five. So it really depends on how many rows we have in the results.
So it could be five, it could be 500, 5 million and so on. But on the right
side we always have the same scale, from zero to one. So between zero and one
we have an infinite number of data points, and this scale we call a
normalized scale, or we call it a continuous scale, continuous values. So now
the question is when to use which method. So for example, percentage-based
ranking is great for answering questions such as: find the top 20% of
products based on their sales. So this method is a great way to understand
the contributions of data values to the overall total, and we call this kind
of analysis a distribution analysis. On the other hand, with integer-based
ranking we can answer questions like: find the top three products. So with
this question we are not interested in the contribution of each product to
the overall total. We are just interested in the position of the value within
a list. So this is as well very commonly used in analysis and reporting. We
call it top/bottom N analysis. So now let's group our ranking functions based
on those two methods. For the first group, integer-based ranking, we have
four functions: ROW_NUMBER, RANK, DENSE_RANK, and NTILE. But on the other
hand we have only two functions that generate percentage-based ranking: we
have CUME_DIST and as well PERCENT_RANK.
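To make the two ranking methods concrete, here is a minimal sketch that puts an integer-based rank next to the two percentage-based ones. It is not the course's own code: it runs the SQL through Python's built-in sqlite3 module (SQLite 3.25 or newer is assumed for window functions), the table name products is made up, and only the values for products E and B come from the example above; the other three rows are assumptions.

```python
import sqlite3

# In-memory SQLite database; window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product TEXT, sales INTEGER)")
# E = 70 and B = 30 come from the example; the other three values are assumed.
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("A", 20), ("B", 30), ("C", 5), ("D", 10), ("E", 70)],
)

rows = conn.execute(
    """
    SELECT product,
           sales,
           ROW_NUMBER()   OVER (ORDER BY sales DESC) AS int_rank,  -- 1, 2, 3, ...
           PERCENT_RANK() OVER (ORDER BY sales DESC) AS pct_rank,  -- 0.0 up to 1.0
           CUME_DIST()    OVER (ORDER BY sales DESC) AS cume_dist  -- above 0, up to 1.0
    FROM products
    ORDER BY sales DESC
    """
).fetchall()

for product, sales, int_rank, pct_rank, cume_dist in rows:
    print(product, sales, int_rank, round(pct_rank, 2), round(cume_dist, 2))

# Integer-based question: the top three products by position.
top_three = [product for product, _, int_rank, _, _ in rows if int_rank <= 3]
print(top_three)
```

The integer column grows with the number of rows, while the two percentage columns always stay on the zero-to-one scale, which is the contrast described above.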
So now that was an introduction, an overview of those methods and how we
group those ranking functions. Next we're going to go and learn about the
syntax of the ranking functions. Most of them follow the same rules. So for
example, we always start with the function name. So we have here RANK. But as
you can see, we don't use any expressions. So they don't allow you to use any
argument inside them. It must be empty. So this is the first rule of using
rank functions. Then about the definition of the window: as usual,
PARTITION BY is optional. You can use it or leave it. And now to the second
part, we have ORDER BY, and this one is required. So you must order the data,
or sort your data, in order to do ranking. So you cannot leave it empty. So
that means for the definition of the window we should have at least an
ORDER BY, for example here on sales. So we cannot leave it empty. All right.
So the two requirements: you cannot use any expressions for those functions,
and as well you have to sort your data using ORDER BY. Okay. So now let's
have an overview of all the functions. So as you can see, all those functions
are ranking functions, and almost all of them don't allow you to use any
expressions inside them. The exception is this function here, NTILE: it
accepts a number inside it. So that means you cannot use it empty. You should
use a number inside it. All the others must be empty. So now for
PARTITION BY, all of them are optional, and for ORDER BY, all of them are
required. So you must use ORDER BY. And the frame clause is not allowed in
the ranking functions.
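As a rough illustration of these syntax rules, the sketch below shows a ranking function with empty parentheses, an optional PARTITION BY, the ORDER BY inside the window, and NTILE as the one function that takes a number. It is an assumption-laden example rather than the course's query: the SQL is run through Python's sqlite3 module, and the table orders and its four rows are invented. The rule that ORDER BY is mandatory reflects the engine used in the course; other engines may be more lenient.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25 or newer
conn.execute("CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
# Hypothetical rows, only meant to show the shape of the syntax.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 10), (2, 101, 30), (3, 102, 20), (4, 102, 50)],
)

rows = conn.execute(
    """
    SELECT order_id,
           product_id,
           sales,
           -- empty parentheses, optional PARTITION BY, ORDER BY present
           RANK()   OVER (PARTITION BY product_id ORDER BY sales DESC) AS rank_in_product,
           -- NTILE is the exception: it takes the number of buckets
           NTILE(2) OVER (ORDER BY sales DESC) AS bucket
    FROM orders
    ORDER BY product_id, sales DESC
    """
).fetchall()

for row in rows:
    print(row)
```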
So you cannot change the definition of the frame inside the window function.
So now what we're going to do, as usual, is go and deep dive into all of
those functions in order to understand when to use them and what the use
cases are, and as well practice in SQL. So we're going to start with the
first one, ROW_NUMBER.
All right. So what is ROW_NUMBER in SQL? The ROW_NUMBER function is going to
go and assign each row a unique number as a rank, and it doesn't care at all
about ties. That means if you have two rows sharing the same value, they will
not share the same rank. Okay. So now we have a very simple example. We have
a list of all sales and we have the following query. So it's going to start
with the ranking function ROW_NUMBER. It doesn't accept any argument inside
it. And the definition of the window is going to be like this: ORDER BY sales
DESC. So that means we're going to go and sort the data descending, from the
highest to the lowest. So SQL is going to go and do the following. The
highest is going to be the 100. The lowest is going to be the 20. And here we
have the 80 twice. So now once SQL is done sorting the data, what's going to
happen? It's going to start assigning a rank. So ROW_NUMBER is going to go
and assign a unique number to each row. So that means it's going to start
with the first one. The 100 is going to be rank number one. The next one is
going to be rank number two. The other 80 is going to be rank number three.
And the next one, four. And then the last one is going to be five. And now if
you check the output, you can see that all those numbers are unique. We don't
have any repetitions. So 1, 2, 3, 4, 5: there are no repetitions. They are
unique, distinct values. And as well there is no skipping of ranks. So that
means we have here 1, 2, 3; there is no jumping to 6, 7 or something. It is a
clear sequence of distinct values and there are no gaps.
But still there is something special in our data. We can see that in the
sales we have the same value twice. So we have two rows with the same sales.
As you can see, with ROW_NUMBER they get distinct values. So they do not
share the same rank. So that means ROW_NUMBER does not handle ties. If you
have multiple rows sharing the same value, they will not share the same rank.
They are going to have distinct, different ranks. So this is how ROW_NUMBER
works in SQL. It generates a unique rank for each row. It does not handle
ties, and as well it doesn't leave any gaps. So there is no skipping of
ranks. So now let's go to SQL in order to have a few examples and use cases.
All right. So now we have the following task. It's very simple. Rank the
orders based on their sales from the highest to the lowest. So now this is
very easy. We're going to go and select the data first. So order ID, product
ID. Let's take the sales as well and select the table. So it's going to be
sales orders. Let's go and execute it. So with that we got all our orders.
What we're going to do now is to assign each row a rank. So that means we
need a column here that contains the rank for each row. So in order to do
that we're going to go and use the window function ROW_NUMBER. It doesn't
accept any argument inside it, so it should be empty. And then we have to
define the window. So as we learned, in the ranking functions we cannot leave
it empty. We have to sort the data using ORDER BY. So ORDER BY is a must. We
don't have to use any PARTITION BY. So we're going to rank all the data that
we have inside the table. So how to sort the data? The task says it should be
based on their sales from highest to lowest. That means we ORDER BY sales,
and since it is from highest to lowest we have to use descending. And now
we're going to go and give it a name: sales rank, and let's say row, since we
are using ROW_NUMBER. So that's it. It's very simple.
Let's go and execute it. So now let's have a look at the results. Before, SQL
sorted the data by the order ID, since we didn't define anything. But since
now we are ordering by sales descending, SQL went and sorted the data by the
sales from the highest to the lowest and started assigning a rank, or let's
say an integer, a unique integer, to each row. So now the highest order is
going to be order number eight. We have sales of 90. This is the highest one.
So as you can see we have 1, 2, 3, 4, 5, up to 10. So now by checking the
results you can see that the ranking here is unique. So there are no
duplicates over here, and as well there is no skipping or gaps. So we have
everything between 1 and 10, even though we have in our data a couple of
sales sharing the same value. So for example we have those two orders: you
can see both of them have 60 as the sales, but they don't share the same
rank. Right? So we have here as well orders 9 and 3: they share the same
value, 20, but they don't share the same rank. So with that we have solved
the task. It's very simple. We now have a rank based on the sales from the
highest to the lowest.
All right. So what is the RANK function in SQL? The RANK function is going to
go and assign each row a number, a rank, and this time it is going to go and
handle ties. So that means if in your data you have two rows having the same
value, they are going to share the same rank. One thing about the RANK
function is that it's going to go and leave gaps in the ranking. So there is
a possibility of skipping ranks. In order to understand how the RANK function
works in SQL, we're going to have a very simple example. All right. So again
the same data but with a different function. So our window looks like this.
It starts with the function RANK; it doesn't accept any argument inside it.
Then we have the window like this: ORDER BY sales descending, from the
highest to the lowest. And our data is already sorted like that. So now how
is SQL going to go and assign the ranks? The first row is going to be the
highest rank. So the value 100 is going to be one. Then the second one is
going to be two. But now for the third one, as you can see, we have here two
values that are the same. So we have a tie, and this time SQL is going to go
and let them share the same rank. So both of them are going to be rank two.
So it's not like ROW_NUMBER, where we have three over here. This time we have
two because we have a tie. So having the same values means they are going to
share the same rank. And now moving to the next value, it is going to be a
tricky one, because if you check over here you can see that the next rank
should be three, right? So we have one, two, and then the next value
generated in the rank should be three. But SQL is going to say: you know
what, this value's position is number four. So as you can see: 1, 2, 3, 4. So
actually the position number here is four, and SQL is going to go and give it
the rank of four. So with that SQL is leaving a gap in the ranking. You can
see we are skipping rank number three, and this always happens once you have
a tie where rows are sharing the same rank. So the next one is going to be
easy. It's going to be row number five. So now by looking at the output of
the RANK function you can see that we don't have a unique ranking.
Here we have shared ranks in case of ties. So it handles ties, but here we
have gaps in the ranks. So we are skipping ranks. When I think about the RANK
function I think about the Olympics. If two athletes tie for the gold medal,
first place, there will be no silver medal for second place; the next medal
is going to be the bronze, for third place. All right. So now let's go to SQL
in order to practice the RANK function. All right. Now we're going to go and
solve the same task but using the RANK function. So what we're going to do:
we're going to stay with the same example over here, and we're going to rank
the orders based on their sales from highest to lowest, but this time using
the RANK function. So we use RANK, and the parentheses are going to be empty,
and then our window is going to be exactly the same as before. So OVER,
ORDER BY sales, and DESC. So let's give it a name: sales rank, and, yeah,
let's say rank. So that's it. As you can see, the syntax is very simple and
very similar to ROW_NUMBER. We just changed the function. So now let's go and
execute this in order to check the results. So now let's go and check the
results by looking at the new rank. If you go and compare it with the old
rank, we can see that we are sharing some ranks, right? We have here the two
twice. So the rank number two, we have it twice because we have over here the
same value. So 60, 60: we have here two and two. But if you compare it to
ROW_NUMBER, you can see that there the rows are not sharing the same rank. So
this is one difference. And as well here, the same thing. They have the same
value. The sales is 20. So we have rank number seven twice. And here, with
ROW_NUMBER, we have different values. And with the next value, as you can
see, we are skipping a rank. So there is a gap; there is no rank of eight. So
you can see that this is row number nine, and that's why it gets the nine.
The same thing, I believe, over here. So now if you check those two ranks,
the next one should be three. But since it is in row number four, it's going
to get rank four. So by checking the results we can see that we are sharing
the same ranks, and as well we have gaps. So this is how RANK works.
All right. So what is a dense rank? It is very similar to the ranking function.
is very similar to the ranking function. It's going to go and assign for each row
It's going to go and assign for each row a number rank and it as well handles the
a number rank and it as well handles the ties. So same values they going to share
ties. So same values they going to share the same ranking but this time it
the same ranking, but this time it doesn't leave any gaps like the RANK
function does. So DENSE_RANK will not leave any gaps. It will not skip any
rank. In order to understand this, we're going to have a very simple example.
So let's go.

All right. So again the same data, but with a different function. This time we
have the rank function DENSE_RANK, and the window is going to be the same:
order by sales descending, from the highest to the lowest. The data is already
sorted as well. Let's see how SQL is going to assign the ranks. As usual, the
first row is going to get rank number one and the second row gets rank two,
but here again we have two rows with the same value. So it behaves like RANK:
they are going to share the same rank, and both of them are going to have rank
number two.

Now you might say, well, this is very similar to the RANK function, so why do
we have DENSE_RANK? I'm going to say wait for it. The difference shows up in
the next value, the one that comes exactly after the tie. With RANK, SQL went
and took the position number, and the row number was four, right? So 1, 2, 3,
4. But this time, with DENSE_RANK, SQL will not leave gaps in the ranking.
There will be no skipping: the next rank in the sequence is three, so that's
why we get rank three for this value. As you can see, there is no gap. We have
one, we have two, and three. We are not skipping and we are not leaving any
gaps. And the last one is going to be four. This is exactly the difference
between DENSE_RANK and RANK.

So now, by checking the output of DENSE_RANK, you can see that we don't have
unique ranks. We have shared ranks; as you can see, we have repetition here.
So it handles the ties, and as well it doesn't leave any gaps. It doesn't skip
anything in the ranking.
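
The five-row walkthrough above can be sketched as a small runnable example.
This is only an illustration under stated assumptions: it uses SQLite through
Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer)
rather than the database used in the course, and the table name, column names,
and sales values are hypothetical.

```python
import sqlite3

# Hypothetical five-row sample; the sales values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100), (2, 80), (3, 80), (4, 50), (5, 30)],
)

rows = conn.execute(
    """
    SELECT order_id,
           sales,
           RANK()       OVER (ORDER BY sales DESC) AS sales_rank,
           DENSE_RANK() OVER (ORDER BY sales DESC) AS sales_rank_dense
    FROM orders
    ORDER BY sales DESC, order_id
    """
).fetchall()

rank_values = [r[2] for r in rows]   # RANK: the value after the tie jumps ahead
dense_values = [r[3] for r in rows]  # DENSE_RANK: the value after the tie is next in sequence
print(rank_values)
print(dense_values)
```

With this sample, the two rows with sales of 80 share rank two in both columns;
only the rows after the tie differ between the two functions.
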
Okay, so that's it. Now, let's go back to SQL to practice DENSE_RANK.

All right, so now we have the same task: rank the orders based on their sales
from highest to lowest. We're going to do the same stuff, but this time using
the function DENSE_RANK. DENSE_RANK takes no arguments, so the parentheses
stay empty. Then we define the window like all the others: OVER, ORDER BY
sales DESC. And then we're going to give it the name sales rank dense. That's
it. As you can see, all of those functions have exactly the same syntax,
right? So let's go and execute it.

Okay, now let's check the results. We got our newest rank using DENSE_RANK.
Just by checking the results, you can see that it handles the tie: we have
rank two twice, right? Let's check the example over here. We have the sales 60
twice. That's why they share the same rank in the dense rank and as well in
the normal rank. But what is interesting is the value after the tie. As you
can see, with the dense rank we have three, so we didn't skip any rank. We
don't have any gap: 1, 2, and then 3. RANK, however, just focuses on the
position number. It is row number four, and that's why it's four. With that we
have a gap.

So as you can see, we don't have any gaps in the dense rank: we have three,
four, five. And now we have the same two values again. We have sales of 20
twice, and they share rank six. So now there is a difference between the dense
rank and the rank: in the rank we have seven, seven, but in the dense rank we
are at six, six. The difference comes from the fact that RANK skipped rank
number three earlier. For the remaining rows you can see we have seven and
eight.

Now if you compare those three rankings, you can see that they all start with
rank number one, but they don't all end with the same rank. The row number and
the rank really focus on the position number, the row number of the orders.
You can see over here that it is row number 10, and that's why we have 10 and
10. So the scale for the rank is from 1 to 10, exactly the same as for the row
number. But with the dense rank we have a scale from 1 to 8, and that's
because rows shared the same rank, and with that we lost, let's say, a few
ranks. So the scale is different from the two others. And that's because we
have ties twice: this is one tie, and over here we have another tie.
That's why we are missing two ranks here. So this is how the dense rank works.
And you can now go and compare all three together in order to understand how
those ranks are working.

All right. So now let's quickly compare the three functions side by side.
Let's start with the first point, the uniqueness of the rank. If you compare
those three, you can see that only the row number generates unique, distinct
ranks. With the two others we have duplicates, or let's say shared ranks.

Okay, now the second point: whether the function handles ties. The only one
that doesn't handle ties is the row number. The two others handle the ties,
since they offer the shared rank.

And now we have the last point, about leaving gaps or skipping ranks. If you
check the row number and the dense rank, you can see there is no skipping.
There are no gaps for the row number and no gaps for the dense rank. Only with
the RANK function, the middle one, are we skipping ranks and leaving gaps.

So that's it, guys. These are the differences between those three functions. I
usually tend to work with the row number more often than the two others.
All right guys, so I had a look at those three functions and I checked my real
projects, and I found out that there are many more use cases for the function
ROW_NUMBER compared to the other functions, DENSE_RANK and RANK. So now I'm
going to show you a few use cases for the row number that I usually use in my
real projects, in order for you to understand how important the ROW_NUMBER
function is. So let's go to SQL.

All right. Let's start with the first use case. We have the task: find the top
highest sales for each product. This is very classic in reporting or data
analysis. We call this top-N analysis. Here the managers or decision makers
would like to see the best performers or the biggest successes in our data,
for example the top five customers, or the top five products or categories,
and so on. This is a very important analysis in order to focus on the best
products or on the most important customers, and as I said it is very classic
and very important for making decisions in the business.

So now let's see how we can solve this. We're going to start with the usual
stuff. Let's first select the data: select the order ID, and let's take as
well the product ID and the sales from the Sales.Orders table. Let's go and
execute this. As we know, for each product we have multiple orders and
multiple sales, but we are interested only in the highest sales for each
product. So we have to go and create a rank. In order to do that we're going
to use the function ROW_NUMBER, and now we have to define the window. Do we
need PARTITION BY? Check the task. It says "for each product", which means we
have to divide the data by the product ID. So let's go and use PARTITION BY
product ID. And now we must use the ORDER BY. How do we sort the data? By the
sales, right? And it is from the highest to the lowest. So let's go with
sales, and we add descending. So from highest to lowest.
Let's go and give it a name. It's going to be rank by product. So let's go and
execute this.

Now, by looking at the result, you can see that SQL did divide the data by the
product ID. We have four windows here. In the first one you can see that the
rank starts at one and ends with four. The top rank goes to order number eight
with sales of 90, and then it goes down to rank four. In the second window we
have a new ranking, so it resets: the first is going to be order number 10 and
the last one is going to be order number two. As you can see, each window has
its own ranking, and the last window has only one row.

Now of course, in the task we have to return the highest, so we are not
interested in the others. We have to return this row, this row as well, and
this one and this one. As you can see, we have to return everything that has
rank one. We are not interested in the ranks 2, 3, 4 and so on. We would like
to have only the highest.

So now, in order to filter the data, we're going to use a subquery. We write
SELECT star FROM, wrap our query, and then add the following condition: WHERE
rank by product equals one. We are interested only in rank number one. So
let's go and execute it. And with that, since we have four products in our
data, we get only four rows, and we have the highest sales. As you can see, we
have only rank number one over here, and those sales are the highest for each
product. With that we have solved the task by doing a top-N analysis.
Okay, moving on to the next use case. We have the following task, and it says:
find the lowest two customers based on their total sales. So now we have the
exact opposite use case. We call it bottom-N analysis. In this example, the
decision makers in the business want to optimize the costs, they want to cut
costs, and for that they have to analyze the lowest performers among the
products or the lowest performers among the employees. So with this analysis
the decision makers are not focusing on the most successful stuff. They are
focusing on the lowest performers.

So now let's solve this task. If you check the question, we have multiple
things, right? We have the total sales, and as well we have to find the lowest
two customers. So we have ranking and as well aggregation. Remember, we can do
both together with the GROUP BY. Let's do it step by step.

First, let's select the data. What do we need? Order ID, customer ID, and the
sales from Sales.Orders. Let's go and execute this. If you check the customers
over here, we have four customers and they have multiple sales. Now we would
like to have the total sales for each customer in order to find the lowest
two. So let's start with the aggregation. We're going to aggregate the sales,
so the SUM of sales, and let's call it total sales. And in order to do the
GROUP BY, the select list can keep only the customer ID next to the aggregate.
So GROUP BY customer ID. It is a very simple GROUP BY statement. Let's go and
execute this. By checking the result, we can see that SQL did aggregate the
data. We have four rows, because we have four customers, and we have their
total sales. So we have solved the first part of the task: we have the total
sales for each customer.

Now let's move to the second part. It says "lowest two customers". That means
we have to use the ranking functions in order to rank those customers. We are
not interested in all customers, only in the lowest two. In order to do that,
we're going to use the window function ROW_NUMBER, and then OVER. Do we have
to partition the data? Well, no, we don't have to do that. We only have to
sort the data, so ORDER BY. This time we're going to use the aggregation in
the ORDER BY, so the SUM of sales, and we want to have it sorted from the
lowest to the highest. So I'm just going to use the default, which is
ascending. Now let's call it rank customers. So that's it.
Again, the rule here is that if you are using a window function together with
GROUP BY, you can use only columns that are used in the GROUP BY, or
aggregations like the SUM of sales here. So this should be working. Let's go
and execute it.

As you can see in the results, we got an extra column for the rank. The lowest
customer is customer number two. The second one is customer number four, with
total sales of 90. And the customer with the highest sales is the last one,
customer number three with 125. So now we have almost everything, but the list
should contain only the lowest two. In order to filter the data, we're going
to use a subquery. So SELECT star FROM our query, and then we have to define
the condition: WHERE rank customers is smaller than or equal to two. Right?
With that we get the first two. So let's go and execute this. And with that we
got the lowest two customers based on their total sales: customer IDs two and
four. So that's it. We have solved the task, and now we have done a bottom-N
analysis.
Okay, let's keep moving to the next use case, and we have the following task.
It says: assign unique IDs to the rows of the table Orders Archive. So now,
guys, you might be in a situation where you have a table without any primary
key and you would like to create an ID for each row. In order to do that, we
can use the function ROW_NUMBER to generate a unique identifier for each row
inside our table if we don't have one. Generating such an ID for each row is
very important for stuff like importing data, exporting data, maybe joining
tables using this ID, or let's say optimizing the performance of a query using
the ID. So now let's see how we can generate that using ROW_NUMBER.

Okay. Let's first select from the table Orders Archive in order to understand
the content. So SELECT star FROM Sales.OrdersArchive. Let's go and execute. By
checking the result, you can see that we have 10 orders and we have
repetitions in the order ID over here, so it is not really a primary key. As
you can see, we have the ID four twice, and here we have the ID six three
times. So what we're going to do is generate a unique identifier for each row.

In order to do that, we go over here and say ROW_NUMBER, and then we define
the window. We don't partition the data at all, but we have to sort the data
by the order ID. So ORDER BY order ID, or you can use something else as well,
like the order date; it doesn't matter. So let's add the order date to it as
well, and let's call it unique ID. Let's go and execute this.

Now, by checking the data, you can see that we have a new ID over here that
comes from the row number, and we have a unique identifier. As you can see, we
have 10 rows, and with that we have as well 10 distinct unique IDs. So with
this we have solved the task, and we now have a unique identifier, an ID, for
the table Orders Archive. Having this ID, we can do many things, like joining
tables or doing something special and important called paginating.
Imagine we have a huge table and we would like to retrieve the data. In order
to not fetch all the data in one go, we can divide the data by the primary ID
or by a unique identifier. For example, we can make one page from 1 until
100,000, and then the second page goes from 100K to 200K. By dividing the
data, we can maybe improve exporting or importing data, or we can have faster
retrieval for the users. We don't want to have the whole data in one go, in
one page. So paginating has a lot of benefits, and we can do that only if we
have a nice ID like this.
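
The unique-ID and pagination ideas above can be sketched together. This is a
hedged illustration only: it assumes SQLite through Python's sqlite3 module
(SQLite 3.25 or newer) instead of the course's database, the table, columns,
and sample rows are hypothetical, and `fetch_page` is an illustrative helper
name rather than an existing API.

```python
import sqlite3

# Hypothetical archive table without a primary key: order 4 appears twice,
# order 6 three times, ten rows in total.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_archive (order_id INTEGER, order_date TEXT, order_status TEXT)"
)
conn.executemany(
    "INSERT INTO orders_archive VALUES (?, ?, ?)",
    [
        (1, "2024-03-01", "Shipped"), (2, "2024-03-05", "Shipped"),
        (3, "2024-03-10", "Shipped"), (4, "2024-03-20", "Shipped"),
        (4, "2024-03-20", "Delivered"), (5, "2024-04-01", "Shipped"),
        (6, "2024-04-05", "Shipped"), (6, "2024-04-05", "Shipped"),
        (6, "2024-04-05", "Delivered"), (7, "2024-04-10", "Shipped"),
    ],
)

# No PARTITION BY: the whole table is one window, so the numbers run 1..N.
NUMBERED = """
    SELECT ROW_NUMBER() OVER (ORDER BY order_id, order_date) AS unique_id,
           order_id, order_date, order_status
    FROM orders_archive
"""

def fetch_page(page, page_size):
    """Return one page of rows, selected by a range of generated unique IDs."""
    first = (page - 1) * page_size + 1
    last = page * page_size
    return conn.execute(
        f"SELECT * FROM ({NUMBERED}) AS numbered "
        "WHERE unique_id BETWEEN ? AND ? ORDER BY unique_id",
        (first, last),
    ).fetchall()

all_ids = [
    r[0]
    for r in conn.execute(
        f"SELECT unique_id FROM ({NUMBERED}) AS numbered ORDER BY unique_id"
    )
]
first_page = fetch_page(1, 4)
last_page = fetch_page(3, 4)
print(all_ids, len(first_page), len(last_page))
```

One caveat with this sketch: rows that tie on every ORDER BY column can receive
their numbers in any order, so for stable pages it is safer to add a
tiebreaker column or to store the generated ID in the table.
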
this. All right. Right. Today I'm going to show you the last use case for the
to show you the last use case for the function row number that I usually use
function row number that I usually use in my real projects. So sometimes if you
in my real projects. So sometimes if you are doing data analyszis you're going to
are doing data analyszis you're going to find out that there are data quality
find out that there are data quality issues especially with the duplicates.
issues especially with the duplicates. So what I usually use I use the raw
So what I usually use I use the raw number in order to identify the
number in order to identify the duplicates. Not only that I can use it
duplicates. Not only that I can use it in order to delete the duplicates. So we
in order to delete the duplicates. So we can use it in order to do data
can use it in order to do data cleansing. And this is essential task
cleansing. And this is essential task for each data engineer not only data
for each data engineer not only data analysts in order to prepare and clean
analysts in order to prepare and clean up the data before doing data analyzes.
up the data before doing data analyzes. So let's have the following task.
So let's have the following task. Identify duplicate rows in the table
Identify duplicate rows in the table orders archive and return a clean result
orders archive and return a clean result without any duplicates. So not only we
have to identify the duplicates, we also have to return no duplicates in our results.
So let's see how we can do this. Let's first select the data: SELECT * FROM
Sales.OrdersArchive. Let's go and execute. Now, by looking at the data, you can see
that we have duplicates. We have an issue: order ID number four is in our database
twice. It doesn't make sense, right? It should be there only once. So which one is the
correct one? If you check the data over here, you can see that this order is shipped
and then delivered, so it looks like the last one is the correct one. So how can we
pick it? If you scroll to the right, you can see that we have a creation time, and we
usually use such a timestamp to identify the last valid version of an order. Here we
can see immediately that this creation time is higher than the previous one, which
means this row is the more up to date, the more current one. So what we're going to do
is rank our data for each order ID and sort it by the creation time in order to find
the last inserted row for each order. Let's see how we can do that. We're going to go
over here and add ROW_NUMBER, then OVER, and we're going to partition by the primary
key, so PARTITION BY OrderID. And as we said, we have to order the data by the
timestamp, so ORDER BY CreationTime, descending, because we want the highest first
and then the lowest. That's it. Let's call it rn and execute the query.
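The ranking step just described could look roughly like the sketch below. The sample rows are invented stand-ins for Sales.OrdersArchive, and the column names OrderID, OrderStatus and CreationTime are assumed from the narration. The timestamps are kept as ISO-formatted strings so that sorting them as text matches chronological order.

```sql
-- Hypothetical sample data: order 4 is stored twice, order 6 three times.
WITH OrdersArchive AS (
    SELECT *
    FROM (VALUES
        (1, 'Delivered', '2025-01-01 12:34:56'),
        (4, 'Shipped',   '2025-01-20 05:22:04'),
        (4, 'Delivered', '2025-01-20 14:50:33'),
        (6, 'Shipped',   '2025-02-16 23:25:15'),
        (6, 'Shipped',   '2025-02-17 07:10:44'),
        (6, 'Delivered', '2025-02-18 10:45:22')
    ) AS v (OrderID, OrderStatus, CreationTime)
)
SELECT
    OrderID,
    OrderStatus,
    CreationTime,
    -- Restart the numbering for every OrderID; the newest row gets rn = 1.
    ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY CreationTime DESC) AS rn
FROM OrdersArchive;
```

With this sample, order 1 only gets rn = 1, order 4 gets 1 and 2, and order 6 gets 1, 2 and 3. Any rn above 1 signals a duplicate.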
So now, by checking the data: if everything is clean and we don't have duplicates,
every value should be one, because for each primary key we should have at most one
row. But you can see over here that we have a two, and here we have a three too. That
is an indicator that we have duplicates inside our data. Let's check them one by one.
As you can see, order ID one appears only once, so it has rank one. The second one has
rank one as well. But here we have the issue: we now have two ranks for order ID four.
So which one is the correct one? In our logic, we say it is the last row that was
inserted into our data, and this is rank number one. If you scroll to the right side,
you can see that the creation time here is higher than that of the second one. So with
that we have identified what we want: the last inserted row for each ID. Now let's
check this one over here. Here we have it three times, and it says the first one has
the highest creation time. If you go to the right side and compare those timestamps,
you can see that this first record is the latest one that was inserted into our data.
So this is the one that we need. The other two we don't need, because they are old
information. So everything that doesn't have rank number one is not valid. It's
something old, and it's actually bad data quality, so we want to remove it, or rather
not select it. Now, in order to have clean data, we're going to use this query as a
subquery. So SELECT * FROM the subquery, and we are only interested in rank number
one. We don't need anything else. So let's go and execute. And now, if you check the
results and look at the order ID over here, it is unique. We don't have any
duplicates, right? 1, 2, 3, 4, 5, 6, 7. There are no duplicates at all. We now have
only the latest inserted data for the orders, and we don't have any duplicates or
data quality issues.
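A minimal sketch of that clean-up query, wrapping the ranking in a subquery and keeping only rank one. The sample rows and column names are again assumptions made for illustration.

```sql
-- Hypothetical sample data: order 4 is stored twice, order 6 three times.
WITH OrdersArchive AS (
    SELECT *
    FROM (VALUES
        (1, 'Delivered', '2025-01-01 12:34:56'),
        (4, 'Shipped',   '2025-01-20 05:22:04'),
        (4, 'Delivered', '2025-01-20 14:50:33'),
        (6, 'Shipped',   '2025-02-16 23:25:15'),
        (6, 'Shipped',   '2025-02-17 07:10:44'),
        (6, 'Delivered', '2025-02-18 10:45:22')
    ) AS v (OrderID, OrderStatus, CreationTime)
)
SELECT *
FROM (
    SELECT
        OrderID,
        OrderStatus,
        CreationTime,
        ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY CreationTime DESC) AS rn
    FROM OrdersArchive
) AS ranked
WHERE rn = 1;   -- keep only the latest row per OrderID
```

The filter has to sit in the outer query because a window function cannot be referenced in the WHERE clause of the same SELECT that computes it. With this sample, the result has one row each for orders 1, 4 and 6.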
So now, of course, we can go with these results and do the analysis, and this is
exactly what data engineers usually do: clean up the data and prepare it before doing
any data analysis. And of course, you may want to communicate those data quality
issues to the source of the data. Let's say you are not the owner of that information.
You can generate a list of all the data quality issues, send it to the source system,
and tell them to clean it up at the source. In order to select the bad data, we can
just change the condition here and say: if rn is higher than one, then the row is bad
data. So let's go and execute this. With this, the results contain all the records
that shouldn't exist in the data in the first place. We can go and export them,
communicate them to the source, and tell them: check here, you have something wrong in
your system, and this information should not be inserted in the data.
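Flipping the condition, as described, returns only the rows that should not exist. This is a sketch over the same invented sample rows, with assumed column names.

```sql
-- Hypothetical sample data: order 4 is stored twice, order 6 three times.
WITH OrdersArchive AS (
    SELECT *
    FROM (VALUES
        (1, 'Delivered', '2025-01-01 12:34:56'),
        (4, 'Shipped',   '2025-01-20 05:22:04'),
        (4, 'Delivered', '2025-01-20 14:50:33'),
        (6, 'Shipped',   '2025-02-16 23:25:15'),
        (6, 'Shipped',   '2025-02-17 07:10:44'),
        (6, 'Delivered', '2025-02-18 10:45:22')
    ) AS v (OrderID, OrderStatus, CreationTime)
)
SELECT *
FROM (
    SELECT
        OrderID,
        OrderStatus,
        CreationTime,
        ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY CreationTime DESC) AS rn
    FROM OrdersArchive
) AS ranked
WHERE rn > 1;   -- every outdated copy, ready to export and report
```

With this sample, the report contains three rows: the older copy of order 4 and the two older copies of order 6.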
So everyone, it is very strong, right? It is very powerful, and I use it a lot in my
projects. There are many use cases for the ROW_NUMBER function in SQL. We can use it
for top-N analysis and bottom-N analysis, to find the best and worst performers. We
can also use it to assign unique IDs for pagination, or to discover data quality
issues and clean up our data. So it is an amazing function in SQL, and you're going to
use it a lot. That's it for the three functions ROW_NUMBER, RANK and DENSE_RANK. Now
we're going to talk about the
NTILE. Okay, so what is NTILE? NTILE in SQL is very simple. It divides your rows,
your data, into a specific number of almost equal groups, or as we sometimes call
them, buckets. In order to understand how SQL works with this function, we're going to
have a very simple example. So let's go. Okay, we have the following setup. We have
four rows of sales, and we would like to divide them into two groups, or two buckets.
To do that we can use the NTILE function. It has a different syntax than the other
ranking functions. It starts with NTILE, and then we must define a number. We cannot
leave it empty like with the other ranking functions. So here we have two buckets.
Then OVER, and here again we have to sort the data. It is a must: ORDER BY Sales,
descending, from the highest to the lowest. Now, as usual, SQL is going to sort the
data. In this example we already have it sorted. Then it is going to start assigning
each of those rows to a bucket. But first SQL has to calculate the bucket size, so how
many rows we can put inside each bucket. The calculation is very simple: the bucket
size equals the number of rows divided by the number of buckets. So what is the number
of rows here? We have four rows, right? So we have four over here. The number of
buckets we define in the syntax of the query. Here we defined two buckets, because we
need two groups. That means we are dividing four by two, and the size of each bucket
is going to be two. With this, SQL is ready and is going to start assigning each row
to a bucket. It starts at the top. The first row goes into bucket number one. Then it
goes to the next one and says: okay, we still have enough space in the bucket, right?
So it assigns this row to bucket one as well. But with this we have reached the
maximum number of rows for the bucket, so the next row is assigned to another bucket.
It's going to be two, and the last one is going to be two as well. So as you can see,
it's very simple. We have just assigned our sales, based on the sorting of course, to
two buckets. These two sales belong to bucket number one, and the other two belong to
bucket number two. Very easy.
So that was very straightforward, because we are dividing even numbers and we got
perfectly sized buckets. But what is going to happen if we have an odd number? Say we
have five rows here instead of four. The bucket size is going to be five divided by
two, so we get 2.5. Now of course SQL will not put two and a half rows in each bucket,
because then we would be splitting one row across two buckets. Of course this will not
work. We should now have one bucket with three rows and another bucket with two. And
the rule in SQL makes it very clear. It says: larger groups come first, then smaller
ones. That means if we have an odd number like this, which doesn't divide evenly, the
larger group is going to be the first group. So it's going to look like this. Let's
reset everything and see what happens. The first row is going to be one. The second
one is one as well. The third one is going to be one as well, so the first bucket is
larger than the second one. Then the rest are going to be two. As you can see, the
larger group comes first, then the smaller. And this is how SQL works if you have odd
numbers. You don't have perfectly sized buckets here. You have approximately, or
roughly, equally sized buckets. So this is how NTILE works.
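The uneven case works out as a quotient and a remainder: 5 rows divided by 2 buckets is 2 with a remainder of 1, so the first bucket receives one extra row. A minimal sketch with five invented sales values:

```sql
-- Five hypothetical sales rows, split into two buckets.
WITH SalesRows AS (
    SELECT * FROM (VALUES (100), (80), (60), (40), (20)) AS v (Sales)
)
SELECT
    Sales,
    NTILE(2) OVER (ORDER BY Sales DESC) AS Bucket
FROM SalesRows
ORDER BY Sales DESC;
-- Bucket 1 holds three rows (100, 80, 60); bucket 2 holds two rows (40, 20).
```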
Now let's go back to SQL in order to practice this function. Okay, so let's have some
fun working with it. We're just going to select something like OrderID and Sales from
Sales.Orders. Let's go and execute it, and with that we get our 10 rows. Now let's say
that I would like to create only one bucket from the data. So NTILE with only one
bucket, then OVER, and we're not going to use PARTITION BY. Let's take ORDER BY Sales,
descending. That's it. I'm going to call it OneBucket. So let's go and execute it. As
usual, SQL is going to sort the data and then calculate the bucket size. It's going to
be 10 rows divided by one, so the size of the bucket is 10. That's why you see a one
everywhere here, because all those rows fit into one bucket. So this is very simple:
we have only one bucket. Now let's have two buckets. I'm just going to copy and paste,
and instead of one we're going to have two, and let's call it TwoBuckets. Let's go and
execute this. Here again, what is the size of the buckets? It is 10 divided by two, so
we get perfectly sized buckets. The first bucket is going to be five rows, and the
second one is going to be the next five rows. So it is perfect. Let's go to the next
one and have three buckets. So three. Let's go and execute. What happens now is that
SQL divides 10 by three in order to get the size of the bucket, and it's about 3.3. It
is a decimal, so we will not get perfectly sized buckets. Again, the larger group
comes first, then the smaller ones. As you can see, we have to fit four rows into the
first group so that the others get three. That's why the first bucket is the biggest
one. So four rows go into the first bucket, then the next three rows go into bucket
two, and the last three rows go into bucket three. As you can see, the largest group
is the first bucket. Now let's keep playing with the data and take four. We would like
to have four buckets, and now things get interesting. SQL is going to divide 10 by
four, and we get 2.5. So again we will not get perfectly sized groups. SQL now has to
fit 10 rows into four groups. The first three rows go into bucket number one, and the
next three rows, like this, go into bucket number two. Then you can see over here that
we have two buckets with a size of two. With that we can fit 10 rows into four groups.
And again you can see that the larger groups come first, like this one and then the
second, and the smaller ones come later. Okay, so this is how NTILE works in SQL.
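The four experiments can be put side by side in one query. This is a sketch over ten invented rows; the column names OrderID and Sales follow the narration, while the values are assumptions.

```sql
-- Ten hypothetical orders standing in for Sales.Orders.
WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, 100), (2, 90), (3, 80), (4, 70), (5, 60),
        (6, 50),  (7, 40), (8, 30), (9, 20), (10, 10)
    ) AS v (OrderID, Sales)
)
SELECT
    OrderID,
    Sales,
    NTILE(1) OVER (ORDER BY Sales DESC) AS OneBucket,     -- sizes: 10
    NTILE(2) OVER (ORDER BY Sales DESC) AS TwoBuckets,    -- sizes: 5, 5
    NTILE(3) OVER (ORDER BY Sales DESC) AS ThreeBuckets,  -- sizes: 4, 3, 3
    NTILE(4) OVER (ORDER BY Sales DESC) AS FourBuckets    -- sizes: 3, 3, 2, 2
FROM Orders
ORDER BY Sales DESC;
```

In each column, the leftover rows from the division go to the first buckets, one extra row each, which is why the larger buckets always come first.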
And now you might say: you know what, why do I need buckets in the first place? What
is the use case? There are two use cases for the NTILE function in my projects. On the
one hand, if I am a data analyst, I'm going to use the NTILE function to segment my
data. On the other hand, if I'm a data engineer, I'm going to use the NTILE function
for ETL processing and load balancing. Let's start with the first use case, as a data
analyst, where you want to do segmentation with the NTILE function. Segmentation is a
very nice way to understand your data. You can segment your data into different
buckets or groups, for example by segmenting the customers. You can group your
customers depending on their behavior, like the total sales or the total number of
orders. With that you can build, for example, a VIP segment, then a medium one, and
then a low one.
So now, in order to understand the segmentation use case, let's have the following
task. Okay, the task says: segment all orders into three categories, high, medium and
low sales. To solve this, let's do the basic stuff first, right? So select OrderID,
let's take the Sales from our table Sales.Orders, and let's go and execute it. As
usual, we get our 10 sales. Now if you check the task, it says we need three
categories, so that means we need three buckets, right? And it says high, medium and
low sales, so that means we are dividing by the sales. Let's do it step by step. We're
going to use NTILE since we need to segment the data. Three categories means three
buckets. Then let's define the window with OVER. We don't have to divide the data with
PARTITION BY, we just need to sort it by the sales. So it's going to be ORDER BY
Sales, and let's take DESC since we want to sort from the highest to the lowest.
That's it. Let's call it Buckets. So let's go and execute this. Now if you check the
data, you can see that the orders are segmented into three buckets. The first bucket
contains all orders with high sales, the second one all orders with medium sales, and
the last one all orders with low sales. As you can see, we have already categorized
our data into three groups. But now we have numbers, and maybe the user is expecting
to see the text high, medium, low. That means we're going to translate those numbers
into words. Of course, we cannot do that inside the window function. We're going to do
a data transformation using the CASE WHEN statement. Don't worry about it, we're going
to have a complete dedicated section explaining CASE WHEN. For now just follow me in
order to see how this works. We're going to use a subquery. So it's going to be
SELECT, let's take the star for everything, and then let's have the following logic.
CASE WHEN Buckets equals one, THEN the sales are high. So we are just mapping the
numbers to text. Otherwise, WHEN Buckets equals two, THEN we are targeting the medium
group. And then the last group: WHEN Buckets equals three, THEN those sales are low.
Let's close it with END and call it SalesSegmentations. That's it. Let me just make it
a little bit smaller so you can see it. All right, then FROM, and then we have our
subquery like this. As you can see, we have just mapped the numbers to text. We are
just doing translations. So let's go and execute it. Now by checking the results, we
have our three categories for the users. The first category is the high sales, the
second one is the medium sales, and the third one is the low sales.
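Put together, the segmentation query described above might look like the sketch below. The ten sample rows are invented, and the aliases Buckets and SalesSegmentations follow the names used in the narration.

```sql
-- Ten hypothetical orders standing in for Sales.Orders.
WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, 100), (2, 90), (3, 80), (4, 70), (5, 60),
        (6, 50),  (7, 40), (8, 30), (9, 20), (10, 10)
    ) AS v (OrderID, Sales)
)
SELECT
    *,
    -- Translate the bucket numbers into labels for the user.
    CASE
        WHEN Buckets = 1 THEN 'High'
        WHEN Buckets = 2 THEN 'Medium'
        WHEN Buckets = 3 THEN 'Low'
    END AS SalesSegmentations
FROM (
    SELECT
        OrderID,
        Sales,
        NTILE(3) OVER (ORDER BY Sales DESC) AS Buckets
    FROM Orders
) AS t
ORDER BY Sales DESC;
```

With ten rows, the four highest sales are labeled High and the remaining six are split into three Medium and three Low, following the larger-groups-first rule.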
So guys, you see NTILE is very powerful for segmenting our data. Now you can go and
segment things like the customers by their total sales, the products by their prices,
the employees by their salaries, and so on. All right, so this is the first use case
for the NTILE function, as a data analyst, where you segment your data in order to
understand the behavior. On the other hand, if you are a data engineer, you can use
the NTILE function for load balancing in your ETL. I'm just going to explain it in a
very simple sketch. All right, we have the following scenario: we have two databases,
and we would like to move one big table from database A to database B. In this case
I'm doing something called a full load. That means I'm loading all the rows from one
database to another. If you do it in one go, it could take a long time. It could take
hours or sometimes even days, and at the end you might get some network errors because
you have stressed the network between those two databases. Then everything breaks, you
lose the data, and you have to start again. So instead of loading this table in one
go, we can split it into fractions, or let's say buckets. We can split this table, for
example, into four small tables using the NTILE function. After we split this big
table into small tables, we start moving those small tables one after another. With
that we are not stressing the network, and it's going to succeed. After loading
everything, at the end we have those small tables in the target database, and of
course we can use UNION to merge them in order to get back the big table that we have
in the original database. So this is a very common use case for NTILE: to split the
load and to balance the processing of extracting data.
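A rough sketch of the splitting idea, under several assumptions: the eight sample rows are invented, the chunks are ordered by OrderID (any sort column would work), and each chunk is exported by filtering on its bucket number.

```sql
-- Eight hypothetical orders standing in for the big source table.
WITH Orders AS (
    SELECT *
    FROM (VALUES
        (1, 10), (2, 15), (3, 20), (4, 60),
        (5, 25), (6, 50), (7, 30), (8, 90)
    ) AS v (OrderID, Sales)
),
Bucketed AS (
    -- Tag every row with the chunk it belongs to.
    SELECT *, NTILE(4) OVER (ORDER BY OrderID) AS Bucket
    FROM Orders
)
SELECT OrderID, Sales
FROM Bucketed
WHERE Bucket = 1;   -- run again with Bucket = 2, 3 and 4 for the other chunks
```

With this sample, the first chunk contains orders 1 and 2. On the target side, the loaded chunks can be stitched back together with UNION ALL, which skips the duplicate-removal step that a plain UNION performs; that choice is a suggestion here, not something the sketch in the video specifies.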
All right. So now we have the following SQL task: in order to export the data,
divide the orders into two groups. So let's go and do that. First we select
everything from the table Sales.Orders, just to see the data. Let's execute it.
We got our 10 orders, and what we have to do is split them into two groups. To
do that we can use the NTILE function. Two groups means two buckets. So let's
define the window. Here we don't have to partition the data using PARTITION BY,
but we do have to specify the ORDER BY. So which column are we going to use to
sort the data? There is no rule here: you can split the data by sales, by order
status, by date, by anything you want. But we usually use the primary key. It's
just more systematic and cleaner, especially if you have a sequence of numbers
in the order ID. You can export the first range of orders, then go to the next
group, and so on. So let's go with the order ID and give the column the name
Buckets. That's it. Let's hit execute. As you can see, it's very simple. We got
our two groups: this is the first batch of the data and this is the second
batch. Now we can select the first batch, export it, and import it into the
next system, and after that we go with the second batch. And of course, if you
still suffer from the size of those packets, you can split the data into
smaller ones. You can go over here and make it four. With that we get smaller
buckets, and it might be easier to export the data. So this is a really great
use case for the NTILE function.
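The query built in this task could look roughly like the sketch below.
Sales.Orders is the table named in the task; the column names OrderID and Sales
and the derived-table alias t are assumptions based on how they are spoken in
the video.

```sql
-- Split the orders into two export batches, sorted by the primary key.
SELECT
    OrderID,
    Sales,
    NTILE(2) OVER (ORDER BY OrderID) AS Buckets  -- use NTILE(4) for smaller batches
FROM Sales.Orders;

-- Pick one batch for export. The window function has to live in a subquery,
-- because its result cannot be filtered directly in the WHERE clause.
SELECT *
FROM (
    SELECT
        OrderID,
        Sales,
        NTILE(2) OVER (ORDER BY OrderID) AS Buckets
    FROM Sales.Orders
) AS t
WHERE Buckets = 1;
```

With the 10 orders from the example, NTILE(2) puts 5 rows into each bucket.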
All right everyone. So with this you have learned the two use cases for the
NTILE function that I usually follow in my projects. As a data analyst you can
use it to do segmentation, and as a data engineer you can use it to do load
balancing of the ETL. Okay everyone, so with that we have covered everything
about the integer-based ranking functions. Now we're going to talk about the
second method. We have the percentage-based ranking functions, and here we have
two functions: CUME_DIST and PERCENT_RANK. So let's have a quick recap. With
percentage-based ranking, SQL calculates a relative position as a percentage
and assigns it to each row. The output is a continuous, normalized scale from 0
to 1, and this is really useful for distribution analysis. Those functions
consider in their calculation the overall total, the whole size of the data
set, which helps us find out the contribution of each value to the overall
total. In SQL we have two different formulas to generate the percentage. On one
hand we have the function CUME_DIST, and on the other hand we have
PERCENT_RANK. That means we have two different functions with different
formulas to calculate the percentage. So now let's start with the first
function, the
CUME_DIST. All right everyone. So now let's start with the first function. We
have CUME_DIST, and it stands for cumulative distribution. It calculates the
distribution of your data points within a window. To understand what this
means, we're going to take a very simple example and see how SQL works with
this function. So let's go. All right. Again we have our very simple example of
the sales, and we have the following query: CUME_DIST, and we don't give any
argument inside it, so it's going to be empty. The window is going to be like
usual: ORDER BY Sales descending, from the highest to the lowest, and the ORDER
BY is a must. The first step is that SQL sorts the data. We have it already
sorted from the highest to the lowest. The next step is that SQL starts
calculating the percentage for each row, and we have a very simple formula. It
says CUME_DIST equals the position number of the value divided by the number of
rows. It's very simple. Let's do it step by step. SQL starts with the first
value in our list, and it's calculated like this. What is the position number
of the first value? It's going to be one, right? This is the first value in our
list. And what is the total number of rows? We have five rows, right? 1, 2, 3,
4, 5. So we divide one by five, and the result is 0.2. This is the value for
the first row. Okay. Now SQL goes to the next row, and this time we get a
special case. As you can see, we have the 80 twice, so we have a tie here.
First we need the position number. As you can see, we are at position number
two, right? But since we have the 80 multiple times, SQL takes the last
position where we see the value 80, and that is record number three. That's why
SQL says that for this record the position number is three and not two. Then it
divides it by five and we get the value 0.6. This is the most confusing thing
about this function. If SQL finds a tie, it completely ignores the current
position number, so we don't have two. It takes the last position number for
the same value, and the last one in our list is record number three. That's why
we have three over here. Okay. So now let's keep moving. Let's go to the third
row. As you can see, we are again in the tie, but this time this is the last
time we see 80; the next value is not 80. So what's going to happen? We get the
exact same result: 3 divided by 5. As you can see, if we have a tie, the rows
share the same percentage. That means with CUME_DIST, if you have the same
values, they share the same rank. Let's keep moving to the fourth one. What is
the position number of the 50? We are at record four. So position number four
divided by five gives 0.8. Okay. Now let's move to the last one, and it is the
easiest one. Which position do we have over here? It is position number five.
It's the last one, and the number of rows is five. That's why we get one. So
guys, that's it.
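The walkthrough can be reproduced with a small self-contained query. The values
80, 80, 50 and 30 come from the example; 100 for the top row is an assumption,
since the transcript never states it, and the names SalesData and DistRank are
made up for this sketch. The syntax assumes SQL Server.

```sql
-- Five sales values built inline; the first value (100) is assumed.
WITH SalesData AS (
    SELECT Sales
    FROM (VALUES (100), (80), (80), (50), (30)) AS t(Sales)
)
SELECT
    Sales,
    CUME_DIST() OVER (ORDER BY Sales DESC) AS DistRank
FROM SalesData
ORDER BY Sales DESC;
-- Sales  DistRank
-- 100    0.2   (1 / 5)
-- 80     0.6   (3 / 5, the last position of the tie)
-- 80     0.6   (3 / 5)
-- 50     0.8   (4 / 5)
-- 30     1     (5 / 5)
```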
This is how the cumulative distribution works. Once you understand the formula,
it's going to be very easy to understand the output. As you can see,
calculating the percentage always depends on the total size of our data set.
You can see here the number of rows. So with this we get an output that helps
us understand the distribution of our data points within the data set. All
right everyone. So now we're going to go and focus on the second
function that generates a percentage as a rank. We have PERCENT_RANK.
PERCENT_RANK focuses on generating the relative position of each row within a
window. To understand what this means, let's take a very simple example and see
how SQL works with this function. So let's go. Okay, again we have those sales,
a very simple example, and the syntax is going to be like this: PERCENT_RANK,
and inside it we don't use any arguments. The window is going to be like this:
ORDER BY, which is a must, Sales descending, from the highest to the lowest.
The first step SQL does is sort the data from the highest to the lowest, and we
have it already like this. Next, SQL starts calculating the percentage, which
is very similar to the cumulative distribution, but this time it's going to be
like this: the position number minus one, divided by the number of rows minus
one. So it's like the same formula, but here we subtract one from both numbers.
Okay. So now let's go through all rows step by step and see the output. SQL
starts with the first row, right? What is the position number of the first row?
It's going to be one. Then we have to subtract one from it; that's why we get
zero. Now, what is the total number of rows? We have five rows here, and we
subtract one, so we get four. Zero divided by any non-zero value is zero, so
for the first value we get zero. All right. Now let's move to the second row
over here. Here we have our special case where we have a tie: two sales share
the same value, 80. For PERCENT_RANK, SQL behaves differently than for
CUME_DIST. Remember, in CUME_DIST SQL searched for the last position of the
shared value. That was position number three, since this is the last time we
see 80. But with PERCENT_RANK, SQL sticks with the first occurrence of the
shared value. So checking those two 80s, what is the first occurrence? It is
record number two. That's why we have position number two; subtract one and we
get one. And the total number is the same: we have five, subtract one, and we
have four. If you divide one by four, you get 0.25. So this is the percentage
of this value. Now let's go to the third row. Here we have the tie again, so
SQL sticks with position number two, the first occurrence. It's going to be the
same: two minus one gives one, and the total number of rows, five minus one,
gives four. That's why we get the exact same result. So as you can see, with
PERCENT_RANK it's like CUME_DIST: the shared values share the same percentage
rank as well. Now let's move to the fourth one. We have the value 50. What is
its position? It's record number four. Subtract one and we get three, and if
you divide three by four you get 0.75. And now, moving to the last value over
here, it's going to be easy. What is the position number of the 30? It is five.
Five minus one is four, and we have four as well for the total number of rows
minus one. So if you divide four by four you get one. So that's it, guys. This
is how PERCENT_RANK works. It always has the scale from 0 to 1. It's always
like this, no matter which values we have inside, and it's going to have a
continuous scale. And again, if you have a tie, the rows share the same
percentage rank.
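The same five values can be used to check the PERCENT_RANK walkthrough. As
before, 100 for the top row is an assumption (only 80, 80, 50 and 30 are named
in the example), the names SalesData and PercRank are made up for this sketch,
and the syntax assumes SQL Server.

```sql
-- Five sales values built inline; the first value (100) is assumed.
WITH SalesData AS (
    SELECT Sales
    FROM (VALUES (100), (80), (80), (50), (30)) AS t(Sales)
)
SELECT
    Sales,
    PERCENT_RANK() OVER (ORDER BY Sales DESC) AS PercRank
FROM SalesData
ORDER BY Sales DESC;
-- Sales  PercRank
-- 100    0      ((1 - 1) / (5 - 1))
-- 80     0.25   ((2 - 1) / (5 - 1), the first position of the tie)
-- 80     0.25   ((2 - 1) / (5 - 1))
-- 50     0.75   ((4 - 1) / (5 - 1))
-- 30     1      ((5 - 1) / (5 - 1))
```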
Okay guys. So now if you compare those two functions, you're going to see that
they are really similar to each other. The output of both functions is a
percentage-based ranking, and both of them handle ties the same way: tied
values share the same percentage rank. If you check the syntax, they are very
similar. And by checking the formulas of both, we always consider the overall
size of the data set. The size is considered in the calculation to help us find
the relative position of each value compared to the overall, and this is very
important in analysis in order to measure the contribution of each value to the
overall. Now about the use cases: if you want to focus on the distribution of
your data points, go with the cumulative distribution, but if you want to focus
on the relative position of each row, go with PERCENT_RANK. All right. There is
one more difference between CUME_DIST and PERCENT_RANK, and you can see it if
you check the formulas. CUME_DIST is more inclusive: we always consider the
position number of the current row. But with PERCENT_RANK we don't consider the
current row. We skip it, or make it exclusive. So we say PERCENT_RANK is more
exclusive and the cumulative distribution is more inclusive. Now if you ask me
the hard question, which one to use, I'm going to say: if you want to be more
inclusive, go with the cumulative distribution. If you want to be more
exclusive with the current row, go with PERCENT_RANK. They are very similar to
each other. So if you want to calculate the distribution of your data, go with
the cumulative distribution. If you want to find the relative position of each
row, then go with PERCENT_RANK.
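One way to see the inclusive and exclusive behaviour side by side is to rebuild
both formulas by hand from COUNT and RANK. This is only a sketch: the column
names DistRankByHand and PercRankByHand are made up, the value 100 in the
sample data is assumed, and the syntax assumes SQL Server.

```sql
-- Five sales values built inline; the first value (100) is assumed.
WITH SalesData AS (
    SELECT Sales
    FROM (VALUES (100), (80), (80), (50), (30)) AS t(Sales)
)
SELECT
    Sales,
    CUME_DIST()    OVER (ORDER BY Sales DESC) AS DistRank,
    PERCENT_RANK() OVER (ORDER BY Sales DESC) AS PercRank,
    -- Inclusive: counts every row up to and including the current value,
    -- so tied rows all get the last position of the tie.
    1.0 * COUNT(*) OVER (ORDER BY Sales DESC)
        / COUNT(*) OVER () AS DistRankByHand,
    -- Exclusive: counts only the rows before the current value (RANK - 1),
    -- out of all other rows. Assumes the window has more than one row.
    1.0 * (RANK() OVER (ORDER BY Sales DESC) - 1)
        / (COUNT(*) OVER () - 1) AS PercRankByHand
FROM SalesData
ORDER BY Sales DESC;
```

The hand-built columns should match the built-in ones row by row, apart from
the number of decimal places shown.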
All right. So now we have the following task: find the products that fall
within the highest 40% of the prices. Let's go and solve this. We are targeting
the table products, and I will just select two columns, the product and the
price, from Sales.Products. That's it. Let's execute this. As you can see, we
got five products and their prices. The task says find the highest 40%, so we
have to generate a percentage rank. To do that we have the two functions,
CUME_DIST and PERCENT_RANK. This time I will go with CUME_DIST. So let's do
that: CUME_DIST, and then let's define the window like this. It's going to be
ORDER BY; we are targeting the prices, right? So ORDER BY the price from the
highest to the lowest, and let's give it the name DistRank. Let's execute this.
With that, SQL generates a percentage ranking for us using the formula that we
just learned. In the output we are getting all the products, but the task says
we have to get only the products that are in the highest 40%. That means the
first row, the second row, and that's it. Those rows are in the highest 40%;
the rest are below that. To filter the data we're going to use a subquery,
since a window function cannot be used directly in the WHERE clause. So SELECT
star FROM, then our subquery like this, and then our filter: DistRank smaller
than or equal to 0.4. This is our threshold to get the data. Let's execute
this. As you can see, we got the top products, the top 40%. Of course you can
also format the percentage. We can do that like this: take the DistRank and
multiply it by 100. Let's execute this. As you can see, we got 20 and 40. We
can add the percent character as well, right? So we say CONCAT and add the
character after the value, like this, and let's call it the DistRank
percentage. That's it. Let's execute it. With that we have solved the task: we
have the products that fall within the highest 40%. Of course you can also try
PERCENT_RANK. It's very simple: we just switch the cumulative distribution with
the function PERCENT_RANK. Let's execute it. As you can see, we get the exact
same result: we're still getting the gloves and caps as the highest products
within the 40% of the prices. So guys, that's it. It's very simple, right?
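Put together, the solution described above might look like the following
sketch. Sales.Products is the table named in the task; the column names Product
and Price, the alias DistRankPerc and the derived-table alias t are assumptions
based on how they are spoken in the video.

```sql
-- Products whose price falls within the highest 40%.
SELECT
    Product,
    Price,
    DistRank,
    CONCAT(DistRank * 100, '%') AS DistRankPerc  -- formatted for display
FROM (
    SELECT
        Product,
        Price,
        CUME_DIST() OVER (ORDER BY Price DESC) AS DistRank
    FROM Sales.Products
) AS t
WHERE DistRank <= 0.4;  -- threshold from the task
```

To try the other function, replace CUME_DIST() with PERCENT_RANK() in the
subquery and keep the rest as it is.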
All right friends, so now let's have a quick recap for the window ranking
quick recap for the window ranking functions. So what they're going to do,
functions. So what they're going to do, they're going to go and assign a rank
they're going to go and assign a rank for each row within a window. And we
for each row within a window. And we have two types of ranking, right? The
first one is the integer-based ranking. It's going to go and assign a
number, an integer, to each row. And here we have four functions:
ROW_NUMBER, RANK, DENSE_RANK, and NTILE. And the second type of ranking is
the percentage-based ranking. So here SQL is going to go and calculate a
rank and then assign it to each row. And here we have two formulas or
functions: CUME_DIST, the cumulative distribution, and the second one is
PERCENT_RANK.

And now to the next point, the rules of the syntax. The expression should
be empty, so we should not pass any argument to the functions. We must use
ORDER BY in order to sort our data, so it is required. And the frame clause
is not allowed, so you cannot go and customize a frame within the window
function.

And as we learned, there are many use cases for the ranking functions. For
example, we have the top-N analysis and the bottom-N analysis in order to
identify the top performers or the worst performers in our business.
Another use case: using ROW_NUMBER we can identify and remove duplicates in
our data. So we can use it in order to find data quality issues and as well
to improve the quality. And another use case: if our table doesn't have a
clean primary key, we can go and generate unique IDs using ROW_NUMBER, and
we can use them as well for pagination. One more use case was data
segmentation: you can use NTILE in order to segment your customers, your
products, employees, and so on. And another use case: we can do data
distribution analysis. As we learned, we can use CUME_DIST in order to
understand the distribution of our data points compared to the overall
data. And the last use case is more for data engineering: we can use the
NTILE function in order to equalize the loading process of our ETLs. So as
you can see, there are many use cases for the ranking functions.
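To make this recap concrete, here is a minimal sketch of the six ranking
functions side by side. It runs the SQL through Python's built-in sqlite3
module so it can be executed as-is. The product_sales table and its rows
are made up for illustration, and the database used in the course may
behave slightly differently. Note that NTILE is the one function here that
does take an argument, the number of buckets.

```python
import sqlite3

# Hypothetical demo data: two products tie at 80 to show how ties are ranked.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sales (product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO product_sales VALUES (?, ?)",
    [("A", 100), ("B", 80), ("C", 80), ("D", 50)],
)

ranking_rows = conn.execute(
    """
    SELECT product,
           sales,
           -- product is added as a tie-breaker so the result is deterministic
           ROW_NUMBER()   OVER (ORDER BY sales DESC, product) AS row_num,
           RANK()         OVER (ORDER BY sales DESC)          AS rnk,
           DENSE_RANK()   OVER (ORDER BY sales DESC)          AS dense_rnk,
           NTILE(2)       OVER (ORDER BY sales DESC, product) AS bucket,
           CUME_DIST()    OVER (ORDER BY sales DESC)          AS cume_dist,
           PERCENT_RANK() OVER (ORDER BY sales DESC)          AS pct_rank
    FROM product_sales
    ORDER BY sales DESC, product
    """
).fetchall()

for row in ranking_rows:
    print(row)
```

The tied products share the same RANK and DENSE_RANK, while ROW_NUMBER
stays unique, which is why ROW_NUMBER is the one used for generating IDs
and removing duplicates.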
Okay, so that's all about how to rank your data using the window functions,
and now we're going to cover the last group. We will learn about the value
window functions: how to access other records. So let's go.

All right, everyone. So now we have this very simple example. We have the
months and the sales. Now we can use the value functions in order to access
a value from another row. In order to understand it, let's say that SQL is
now processing the months and we are currently at the month of March. So
now, for example, I would like to access the value from the previous month,
from February. In order to do that we can use the LAG function in order to
get the value of 10. So with that we have in the same row the current sales
of the month March and as well the sales from the previous month, February.
And maybe in other cases I would like to get the sales of the next month,
from April. In order to do that we can use the function LEAD, and we will
get in the same row the value five. So now I can very quickly compare the
current month with the previous month and as well with the next month.

And now in other cases you might be interested in the first month of your
list. So here it's going to be January. In order to get the sales of the
first month you can use the function FIRST_VALUE. So we're going to get in
the same row 20. And now for the last option, I think you already get it.
We can go and get the sales of the last month. So here it is July. For that
we're going to use the function LAST_VALUE, and we will get the value of
40. So this is exactly the purpose of the value functions, or analytical
functions: we can access a value from other rows.

And here it is really important to understand as well that the value
functions are like the ranking functions. We have to use ORDER BY in order
to sort the data, in order to understand what is the first row and the last
row. In this example, the data is sorted by the month. So guys, the access
functions are really important for analytics. You can use them in order to
access a value from other rows in order to do comparisons.
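The March example above can be sketched as a single query. This is a
minimal illustration run through Python's built-in sqlite3 module, not the
course's own script: the monthly_sales table name and the month_no sort
column are assumptions, and May and June are left out because the
walkthrough does not state their sales. The frame on LAST_VALUE follows the
recommendation given in the syntax overview.

```python
import sqlite3

# Only the months whose sales are named in the walkthrough are loaded.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5), (7, "Jul", 40)],
)

overview_rows = conn.execute(
    """
    SELECT month,
           sales,
           LAG(sales)         OVER (ORDER BY month_no) AS prev_sales,
           LEAD(sales)        OVER (ORDER BY month_no) AS next_sales,
           FIRST_VALUE(sales) OVER (ORDER BY month_no) AS first_sales,
           LAST_VALUE(sales)  OVER (
               ORDER BY month_no
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_sales
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

# The March row: current 30, previous 10, next 5, first 20, last 40.
march = overview_rows[2]
print(march)
```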
All right. So now let's have a quick overview of the syntax and the rules
for the value functions. So here we have four functions: LEAD, LAG,
FIRST_VALUE, and LAST_VALUE. As you can see, we can group them into two
groups. We have LEAD and LAG, which are very similar to each other,
especially in the syntax: we can use three arguments inside them,
expression, offset, and default. For FIRST_VALUE and LAST_VALUE we can use
only an expression. So that means we have to pass a value to those
functions. You cannot leave it empty.

Now about the expression data type: you can use any field with any data
type. There is no restriction, for example, to using only numbers. Any data
type is allowed. Now about the definition of the window. The PARTITION BY,
as usual, is optional, like in any other group. The ORDER BY here is a
must. You must define an ORDER BY. It's like the ranking functions, so here
you cannot leave it empty.

And now we come to the last one, the frame clause. Things are really
different over here. For the first two functions, LEAD and LAG, you are not
allowed to define any frame, so you are not allowed to define any subset of
data. It's very similar to the ranking functions: you must use ORDER BY,
but you cannot define the frame of the window. But for the other two
functions, FIRST_VALUE and LAST_VALUE, the frame is optional. You can go
and use it. And for LAST_VALUE it is recommended to define a frame clause.
Don't worry about it. We're going to have enough examples in order to
understand it.

So as you can see, those functions have different requirements, and there
is no generic rule for all of them. But one thing that they all agree on is
that you must use ORDER BY. So now, as usual, we're going to go and deep
dive into those functions. We're going to address first the two functions
LEAD and LAG, because they are very similar to each other. We're going to
understand the use cases, when to use them, and of course we're going to
practice in SQL. So let's go.
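The overview only says that a frame is recommended for LAST_VALUE. As a
hedged preview of why, the sketch below relies on the standard default
frame, which with ORDER BY ends at the current row, so without an explicit
frame LAST_VALUE simply returns the current row's own value. It uses
Python's built-in sqlite3 module and a hypothetical monthly_sales table.
The course's later examples may explain this differently.

```python
import sqlite3

# Hypothetical table holding the four months used in the walkthroughs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)],
)

frame_rows = conn.execute(
    """
    SELECT month,
           sales,
           -- default frame: from the start of the window up to the current row
           LAST_VALUE(sales) OVER (ORDER BY month_no) AS last_default_frame,
           -- explicit frame: from the current row to the end of the window
           LAST_VALUE(sales) OVER (
               ORDER BY month_no
               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
           ) AS last_explicit_frame
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

for row in frame_rows:
    print(row)
```

With the default frame the third column just repeats the sales column. With
the explicit frame every row gets April's sales.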
LEAD and LAG functions. The LEAD function allows you to access a value from
the next row within a window, whereas the LAG function is exactly the
opposite. It's going to allow you to access a value from a previous row
within a window. It sounds very easy, right? So let's understand how SQL is
going to execute those functions.

Okay. So now let's have a quick overview of the syntax for both of the
functions LEAD and LAG. We have here a very simple example for the LEAD
function. As usual, we start with the function name. It's going to be LEAD.
And after that we're going to go and pass the arguments. As you can see, we
have multiple things here, so let's do it step by step. The first thing is
that we're going to go and specify an expression. And the data type could
be any data type. It could be a number, like here the sales. It could be
characters like names, or dates, or anything. So this is required. We have
to specify an expression. We cannot leave it empty. And we can use any data
type.

Now moving on to the next one. We have here a number. So what is this? This
is the offset, and this offset is optional, so you can go and skip it. So
what does the offset mean? What we are doing over here is specifying for
SQL the number of rows forward or backward from the current row. In this
example we are specifying the offset as two using LEAD. And with that we
are telling SQL: go jump two rows forward and get me the value. And if you
are using LAG, it means you are telling SQL: go two rows back and get me
the value. So here you are telling SQL how many rows it needs to jump, and
if you don't specify anything, like leave it empty, SQL is going to go and
use a one. So the default of the offset is going to be one if you don't
specify anything.

All right, moving on to the last one, the third one. This is optional as
well. You can go and leave it empty. So here it is the default value. Now
what happens with those functions is that sometimes SQL jumps two rows
ahead or something like that, and SQL doesn't find anything. There are no
more rows available to access, and with that SQL is going to go and return
a NULL. So that means if SQL goes to the next rows or goes to the previous
rows and doesn't find anything, SQL as a default is going to go and return
a NULL. So if you don't specify anything over here, in those scenarios you
will have a NULL as the return value of the whole function. But in some
scenarios you don't want to have a NULL, you would like to have a value. So
here you are defining the default value. It should not be a NULL, it should
be a 10. So SQL, if you don't find anything, return a 10. Don't return a
NULL.

So again, guys, the default value and the offset are optional for you to
configure. But you should know the defaults if you don't specify anything:
for the offset it is going to be one, and for the default value it is going
to be NULL. But you must specify an expression. So here you cannot leave it
empty.

All right. So that's all about the arguments that you can pass to the LEAD
or LAG functions. Then the next things are the standard ones. We have the
OVER clause, then we have the PARTITION BY. As usual, PARTITION BY is
optional. And then to the ORDER BY. Those functions are like the rank
functions: they require you to sort the data. It is a must to sort the
data, otherwise SQL will not know what is the next row and what are the
previous rows. So we have to sort the data. It is required. You cannot skip
this. It is not optional. All right. So the syntax is not crazy, right? We
have the usual stuff, but in addition we can go and configure the default
value and the offset.
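Putting the pieces of the syntax together, a call could look like the
sketch below: an expression (sales), an offset of 2, and a default of 10,
as in the example described above, followed by the OVER clause with an
optional PARTITION BY and the required ORDER BY. The table
product_monthly_sales, the product column used for partitioning, and the
rows are hypothetical, and the query is run through Python's built-in
sqlite3 module only so that it is executable.

```python
import sqlite3

# Hypothetical data: two products with three months of sales each.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_monthly_sales "
    "(product TEXT, month_no INTEGER, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO product_monthly_sales VALUES (?, ?, ?)",
    [
        ("A", 1, 20), ("A", 2, 25), ("A", 3, 30),
        ("B", 1, 7), ("B", 2, 9), ("B", 3, 11),
    ],
)

syntax_rows = conn.execute(
    """
    SELECT product,
           month_no,
           sales,
           LEAD(sales, 2, 10) OVER (   -- expression, offset, default
               PARTITION BY product    -- optional
               ORDER BY month_no       -- required
           ) AS sales_two_rows_ahead
    FROM product_monthly_sales
    ORDER BY product, month_no
    """
).fetchall()

for row in syntax_rows:
    print(row)
```

Because of the PARTITION BY, the jump never crosses from product A into
product B: once fewer than two rows remain inside a partition, the default
of 10 is returned.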
Okay guys, now we have a very simple example. We have months and sales, and
we're going to go and understand how SQL works for both of the functions
LEAD and LAG side by side. In the first example we are interested in the
sales of the next month. In order to do that we're going to use the LEAD
function. So LEAD, and then we're going to specify the argument. It is the
sales, since we want the value of sales, and then we define the window like
this: ORDER BY month. So it's going to be ascending. And now on the right
side we're going to be interested in the sales of the previous month. In
order to do that we're going to use the LAG function. It's going to be very
similar to the LEAD. We have LAG and then the sales, since we are
interested in the sales, and we're going to sort the data by the month.

So now let's see how SQL is going to process this information side by side
and row by row. It's going to start with the first row over here. What is
the next month of January? It is February, and we are interested in the
sales of this row. So SQL is going to take the value from the next row, and
we're going to have the value of 10. So now by looking at January we can
see the sales of the next month, February, in the same row. Now let's check
the right side over here. Now we are interested in the previous month. So
what is the previous month of the first row? It will be nothing, right? We
cannot point to anything. That's why SQL is going to say this is NULL.
There is no previous month for the current row, and we're going to have it
as a NULL.

Okay. So now it's going to go to the next row. We are at February. What is
the next month? It's going to be March, and it's going to point to it. So
we will get 30 as the sales of the next month, March. And on the right
side, what is the previous month of February? It's going to be January,
right? So it's going to get the sales of the previous month, and here we
will get 20. So as you can see, it's very simple. With LEAD, we are always
checking the next value. With LAG, we are always checking the previous
value.

So let's keep going. We are currently at March. What is the next month?
It's going to be April. So it's going to go and point to it like this, and
we will get the sales of the next month, April. For March on the right
side, what is the previous month? It is February, right? So it's going to
go and point to February, and we will get the sales of 10. And now it gets
interesting at the last row over here. You can see that we are at April.
What is the next month of April? There is nothing, because we are at the
end of our table, right? So since there is no month after that, we will get
a NULL in the output. But for the LAG, we still have a previous month for
April. So what is the previous month? It is March, and we will get the
sales of March. So it's going to be 30.

So that's it, guys. It's really simple, right? They are just doing the
opposite things. Now if you check those values side by side, you can see
that with LEAD we will always get a value for the first row, but for the
last row it is always going to be empty, because there is no next value. We
are at the end of the table. But if you check the LAG, for the first row we
will always get a NULL, because there is no previous value or previous
record before the first row. And for the last record, as you can see, we're
always going to get a value, because we will have a previous value.
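The side-by-side walkthrough above can be reproduced with a short query.
This is a minimal sketch using Python's built-in sqlite3 module. The
monthly_sales table name and the month_no column used for sorting are
assumptions, while the four sales values are the ones from the walkthrough.

```python
import sqlite3

# The four months and sales values used in the walkthrough.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)],
)

side_by_side = conn.execute(
    """
    SELECT month,
           sales,
           LEAD(sales) OVER (ORDER BY month_no) AS next_month_sales,
           LAG(sales)  OVER (ORDER BY month_no) AS prev_month_sales
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

# SQL NULL is returned to Python as None.
for row in side_by_side:
    print(row)
```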
Okay, let's move on in order to understand how SQL works this time with the
offset and the default value. So now we have the same data, but we have a
different task. On the left side, we would like to get the sales of two
months ahead. So it's not the next month, it's going to be two months. And
we would like to tell SQL: if you don't find any value, don't return NULL,
return a zero for us. So this is going to be our default. Now if you check
the syntax, it's going to be exactly like before, but we are now adding an
offset of two, because we are interested in two months ahead, and we are
specifying here a default value of zero. So if you don't find anything, put
zero, don't put NULL. Now on the right side we have the exact opposite. We
are interested in the sales of two months ago. So we are not interested in
the directly previous month, we need the sales of two months ago. And here
the same thing: if you don't find anything, don't return NULL, give us a
zero. So as you can see, we have the same syntax but using the function
LAG.

So now let's understand how SQL is going to execute this step by step and
side by side. It's going to start with the first month, January. SQL is
going to ask: what is the sales of two months ahead? We are at January. It
will not be February, it's going to be the month of March. So it's going to
go and point to it like this, and we will get the value of 30. So 30 is the
sales of two months ahead. And now on the right side we are as well at
January. It's going to ask the question: what is the sales of two months
ago? We don't have any previous data, right? So we will not get anything.
It would return NULL, but it's going to check: do we have a default value?
Well, yes. So this time SQL will not return NULL. It's going to return the
default value, and this time it's going to be zero.

All right. So now let's go to the next row. We are currently at February.
What is the sales of two months ahead? It will not be March, it's going to
be April. So it's going to go and point to it like this, and we will get
the value of five. Now on the right side we are currently at February. The
question is: what is the sales of two months ago? We have history. We have
the previous month, but we don't have two months in the history. That's why
we will still get zero as the output, from the default value.

Okay. So now let's keep going to the next row. We are currently at March.
SQL is going to ask: what is the sales of two months ahead? We have only
one month after that, but we don't have two months. That's why SQL will not
find anything. It would return NULL, but it's going to go and use the
default. So here we're going to get the value of zero. There is no more
data available in the table. But now on the right side we are currently at
March, and we are asking: what is the sales of two months ago? Now we have
enough history in the past, and it's going to get the value of 20.

All right. So now let's go to the last month over here in our table, April.
What is the sales of two months ahead? We don't have any data, so it's
going to be zero as well. But now on the right side, we are currently at
April. What is the sales of two months ago? We have enough history. That's
why SQL is going to go and point to it like this. So we will get February,
which is going to be 10. So that's it. This is how SQL works with LEAD and
LAG using an offset and as well a default value.
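The same four rows can be used to check the offset and default walkthrough.
As before, this is a minimal sketch using Python's built-in sqlite3 module
with an assumed monthly_sales table and month_no sort column. The offset of
2 and the default of 0 are the values from the walkthrough.

```python
import sqlite3

# The four months and sales values used in the walkthrough.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)],
)

offset_rows = conn.execute(
    """
    SELECT month,
           sales,
           LEAD(sales, 2, 0) OVER (ORDER BY month_no) AS sales_two_months_ahead,
           LAG(sales, 2, 0)  OVER (ORDER BY month_no) AS sales_two_months_ago
    FROM monthly_sales
    ORDER BY month_no
    """
).fetchall()

# Rows without a partner two months away get the default 0 instead of NULL.
for row in offset_rows:
    print(row)
```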
Let's go back to SQL in order to practice those two functions.

Okay, so now we have the following task, and it says: analyze the
month-over-month performance by finding the percentage change in sales
between the current and the previous month. So that means we have to go and
compare the current month with the previous month. The main use case for
LEAD and LAG is to do comparison analysis, and we have a very common use
case that is called time series analysis. It is the method of analyzing our
business, our data, in order to understand the patterns and trends over
time. And one of the most important and classic questions that you're going
to get from the decision makers or the business is to do year-over-year
analysis or month-over-month analysis. The year-over-year analysis is going
to help us to understand the overall growth or decline in the performance
of our business over the years. On the other hand, we have the
month-over-month analysis in order to do short-term trend analysis and as
well to discover the patterns in the seasonality. So the main focus is to
understand the performance of our business over time.
let's go back to it in order to solve the task. Okay guys, so now let's go and
the task. Okay guys, so now let's go and do it step by step. Now what is the
do it step by step. Now what is the first step? Before we go and compare
first step? Before we go and compare things together, we have to collect the
things together, we have to collect the data. We have to do the calculations
data. We have to do the calculations first. So we have to find out first the
first. So we have to find out first the total sales for the current month and
total sales for the current month and then the total sales for the previous
then the total sales for the previous month. And after that we can go and
month. And after that we can go and compare them. So now let's start with
compare them. So now let's start with the easy stuff. We have to find out the
the easy stuff. We have to find out the current sales for the current month. So
current sales for the current month. So in order to do that, let's just do very
in order to do that, let's just do very simple select. So what do we need? We
simple select. So what do we need? We need let's take the order ID. Let's take
need let's take the order ID. Let's take the order date because inside it we have
the order date because inside it we have the month. Uh let's go and collect the
the month. Uh let's go and collect the sales. So that's it for now from sales
sales. So that's it for now from sales orders. So let's go and execute this. So
orders. So let's go and execute this. So now in the result we got the usual
now in the result we got the usual stuff. We have 10 orders, sales and
stuff. We have 10 orders, sales and order dates. But the order date is on
order dates. But the order date is on the level of the days and we are not
the level of the days and we are not interested on the whole date. We would
interested on the whole date. We would like to get only the month in order to
like to get only the month in order to calculate the total sales for the month.
Now we're going to go and use a function in order to extract the month from a date. Don't worry
about it. We're going to have a dedicated chapter to show you how to deal with date formats in
SQL. So now what we're going to do: we will use a very simple function called MONTH on the
order date. And let's call it order month. So that's it. Let's go and execute it. Now, as you
can see, we got a new field where we have only the month information. So here we have January,
February, and March. So now the next step is that we want to find the total sales for each
month. So what we're going to do is use GROUP BY. So, let's do that. We're going to say we want
the sum of sales. I'm just going to call it current month sales. And let's go and get rid of
all the other columns. We're going to group by the month, right? So, GROUP BY, and let's have
the month. So, that's it.
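A minimal sketch of this first step, in SQL Server syntax. The inline Orders rows are hypothetical stand-ins for the course's sales orders table, chosen so that the monthly totals come out as the 105, 195, and 80 implied by the figures mentioned later in the video; the column aliases are assumptions as well.

```sql
-- Hypothetical sample rows, not the course's real data.
WITH Orders AS (
    SELECT OrderID, CAST(OrderDate AS DATE) AS OrderDate, Sales
    FROM (VALUES
        (1, '2025-01-10', 45),
        (2, '2025-01-20', 60),
        (3, '2025-02-05', 95),
        (4, '2025-02-15', 100),
        (5, '2025-03-10', 80)
    ) AS v(OrderID, OrderDate, Sales)
)
SELECT
    MONTH(OrderDate) AS OrderMonth,        -- 1, 2, 3
    SUM(Sales)       AS CurrentMonthSales  -- total per month
FROM Orders
GROUP BY MONTH(OrderDate);
```

Because the select list contains only the grouped expression and an aggregate, the order-level columns (order ID, full date) have to be dropped, which is the "get rid of" step in the narration.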
Let's go and execute it. So, it's very simple, right? We now have the three months and the
total sales of the current month. So with that we got the first piece of information that we
need in order to do the comparison. We have for each row the total sales for the current month.
The next thing that we're going to do is find the total sales for the previous month, side by
side in the same row. And in order to do that, as we have learned, we can use the LAG function.
So we're going to integrate the LAG window function in the same GROUP BY query. We're going to
do it like this: LAG, since we are now interested in the previous month. That's why we're going
to put the sum of sales as the expression inside it. And after that we're going to define the
window. It's going to be like this: OVER, and ORDER BY is a must. So we're going to sort the
data by the month, right? So let's go and do it. And with that we have defined the previous
month sales, so we name it previous month sales. So now let's go and execute it in order to see
the results. All right. So now let's check the results. The first row: what is the previous
month? There is no previous month. We are at the first record and the first month; that's why
we have NULL. Now let's go to February. What are the sales of the previous month, January? It
is 105. So this is correct. And now to the last value, March. What are the sales of February,
the previous month? It is 195. So with that we have the two pieces of information: the current
month and the previous month. So guys, as you can see, it's magic, right? It's very simple. We
can use the LEAD and LAG functions in order to access values from other rows without doing any
complicated joins and so on.
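The LAG step described above might look like the following in SQL Server syntax. The Orders rows are hypothetical sample data (picked so the monthly totals are 105, 195, and 80), and the aliases are assumed names.

```sql
-- Hypothetical sample rows standing in for the sales orders table.
WITH Orders AS (
    SELECT OrderID, CAST(OrderDate AS DATE) AS OrderDate, Sales
    FROM (VALUES
        (1, '2025-01-10', 45),
        (2, '2025-01-20', 60),
        (3, '2025-02-05', 95),
        (4, '2025-02-15', 100),
        (5, '2025-03-10', 80)
    ) AS v(OrderID, OrderDate, Sales)
)
SELECT
    MONTH(OrderDate) AS OrderMonth,
    SUM(Sales)       AS CurrentMonthSales,
    -- The window function runs after GROUP BY, so it can read the aggregate
    -- of the previous grouped row. Default offset is 1, default value is NULL.
    LAG(SUM(Sales)) OVER (ORDER BY MONTH(OrderDate)) AS PreviousMonthSales
FROM Orders
GROUP BY MONTH(OrderDate);
```

With these rows, PreviousMonthSales is NULL for January, 105 for February, and 195 for March, matching the values read out in the video.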
Okay. So now what is the next step? We're going to subtract the previous month's total sales
from the current month's total sales. In order to do that we're going to use a subquery like
this: SELECT star FROM, and we're going to wrap the query like this as a subquery. And now the
calculation is very simple. Let me just move this a little bit down. So it is the current month
minus the previous month, and let's call it month-over-month change. So that's it. Let's go and
execute this. So now let's check the results. For the first month, you can see that we don't
have any value, and that is correct because the previous month is empty. So there is no change.
And now moving on to February. You can see over here we got plus 90. That means we have an
improvement in the performance of our sales. Now moving on to the last one. It's really bad. We
have a decline in our performance. We can see that we have minus 115. So that means the current
month is doing really badly compared to the previous month. So March is a really bad month.
Okay. So now, as you can see in the output, we got the absolute numbers, but the task says find
the percentage change. So we have to convert this to a percentage, and we can do it like this.
It's very simple. Let's do it in a new column. I'm just going to zoom out a little bit. So,
it's going to be the change, the difference, divided by the previous month sales. And then
let's multiply it by 100 in order to get the percentage. So, like this. And now, as you can
see, we got zeros. That's because those numbers are integers, so SQL does an integer division.
So, we have to cast one of those values. I'm just going to do it for the first one: CAST as
FLOAT. So, that's it. Let's go and execute it again. Now the result looks better. We have the
percentages, but we have a lot of decimals. So let's round the number to, let's say, one
decimal. So only one, and let's give it a name. So we call it month-over-month percentage. So
let's execute. Now, as you can see, things look better. And with that we have calculated the
percentage change in sales between the current and the previous month. And this is how we do
month-over-month analysis.
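Putting the steps together, the whole month-over-month query could look roughly like this (SQL Server syntax). The Orders rows are hypothetical sample data chosen to reproduce the +90 and -115 changes mentioned in the video, and the aliases MoM_Change and MoM_Percentage are assumed names.

```sql
-- Hypothetical sample rows standing in for the sales orders table.
WITH Orders AS (
    SELECT OrderID, CAST(OrderDate AS DATE) AS OrderDate, Sales
    FROM (VALUES
        (1, '2025-01-10', 45),
        (2, '2025-01-20', 60),
        (3, '2025-02-05', 95),
        (4, '2025-02-15', 100),
        (5, '2025-03-10', 80)
    ) AS v(OrderID, OrderDate, Sales)
)
SELECT
    *,
    CurrentMonthSales - PreviousMonthSales AS MoM_Change,
    -- CAST to FLOAT first: integer / integer would truncate to 0
    ROUND(CAST(CurrentMonthSales - PreviousMonthSales AS FLOAT)
          / PreviousMonthSales * 100, 1) AS MoM_Percentage
FROM (
    SELECT
        MONTH(OrderDate) AS OrderMonth,
        SUM(Sales)       AS CurrentMonthSales,
        LAG(SUM(Sales)) OVER (ORDER BY MONTH(OrderDate)) AS PreviousMonthSales
    FROM Orders
    GROUP BY MONTH(OrderDate)
) AS t;
```

On this sample, February shows a change of 90 (about 85.7 percent) and March a change of -115 (about -59 percent); January stays NULL in both columns because there is no previous month. A real report would also have to decide what to do when the previous month's sales are 0, which this sketch does not handle.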
All right. So now we have another use case for the LEAD and LAG functions. We can use them in
order to do customer retention analysis. It's all about measuring customer behavior and
loyalty. So we are helping the business and decision makers to build strong relationships with
the loyal customers and to focus on their needs. So now let's see how we can use the LEAD and
LAG functions in order to do customer retention analysis. So let's go. All right. So now we
have the following task, and it says: in order to analyze customer loyalty, rank customers
based on the average days between their orders. So there are a lot of things going on over
here. Let's do it step by step. I always like to start with a very simple select. So let's
select information like the order ID. Let's get the customer ID, and since we want the days, we
would like to have the date as well. So, order date from the sales orders table, and let's sort
the data: ORDER BY customer ID and order date. So that's it. Let's go and execute. Now, as
usual, we got our 10 orders, the customers, and when they ordered. So now let's check the task.
Let's solve this part over here: days between their orders. So we have to find how many days
are between two orders. For example, if we check customer number one over here, he ordered
around 10 January, and the second order is 10 days later, on 20 January. So we have to subtract
those two dates. Now, in order to subtract those values and do calculations, we have to have
everything in the same row. For example, if we are at the first row over here, I would like to
have one more column about the next order, so the date of the next order. So we have to access
a value from another row. Of course, we can do joins, but we have the LEAD and LAG functions.
And for this scenario, we're going to use the LEAD window function. So let's go and do that.
I'm going to call the order date over here current order. And let's calculate the LEAD. I would
like to get the next order date, so I would like to get this value over here in the same row.
That's why this time we're going to pass the order date. And now let's define the window. We
have to partition the data because we are analyzing each customer separately, right? So that's
why we have to partition by the customer ID.
And of course, in order to do the LEAD, we have to use ORDER BY. So let's define that as well:
ORDER BY, and it's going to be by the order date. Now we have to give it a name. The order date
here is the current order; this one is going to be the next order. So, next order. Let me zoom
out a little bit and make this smaller. So let's go and execute it. Now, as you can see in the
output, we got a new column called next order. And with that we have the current order from the
current row, and the value from the next row as well. So what is the next row? It's going to be
20 January. The same thing, of course, for the next row. Over here we have the current order
date and the next order date. So this value is going to be exactly the same as the next one
over here, 15 February. And then, since we are working with windows and this is the whole
window over here, the last order for this customer is 15 February. There is no next order, so
this is going to be NULL. The same thing if you check the other customers: you're going to see
that the last order never has a next order. So it looks like everything is fine. And the last
customer has only one order. So now with this we have all the information for our calculations.
We have the current order and the next order in the same row. Now we can subtract them in order
to get the days between those two orders. And in order to subtract dates, we have to use the
function DATEDIFF. Don't worry about those functions.
We're going to explain all of that in the next chapters. So for now just follow me with these
steps. What we're going to do is find the difference between this date, the order date, and the
whole thing over here. Right? So the whole thing here is the next order. Let's do it in a new
line, and it's going to be very simple. So, DATEDIFF: we are finding the difference between two
dates. The syntax is going to be like this. First we have to define what we are talking about:
are they days, months, years, and so on. So we have to tell SQL, find me the difference in
days. Now we have to specify two dates. The first one is going to be the order date; this is
the current date. And the second date is going to be the whole thing from here. So let's take
it and put it side by side, and this calculation is going to give us the number of days. So
we're going to call this days until next order. All right. So now let's go and execute the
whole thing. So now let's check the result. As you can see over here, we got 10. So this is 10
days between those two dates, and for the next one we have around 26 days. Here we have a NULL
because we don't have a date here, and for the next one we have 31 days. So we have a whole
month over here. So everything is working perfectly, and with that we have solved only this
part: days between their orders. So guys, you see, right? This is the magic of the LEAD and LAG
functions.
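As a rough sketch in SQL Server syntax, the LEAD and DATEDIFF steps could be written like this. The Orders rows are hypothetical sample data, arranged so that customer 1 shows the 10-day and 26-day gaps mentioned above; the real course table has 10 orders and is not reproduced here. The aliases are assumed names.

```sql
-- Hypothetical sample rows standing in for the sales orders table.
WITH Orders AS (
    SELECT OrderID, CustomerID, CAST(OrderDate AS DATE) AS OrderDate
    FROM (VALUES
        (1, 1, '2025-01-10'),
        (2, 1, '2025-01-20'),
        (3, 1, '2025-02-15'),
        (4, 2, '2025-01-05'),
        (5, 2, '2025-02-05'),
        (6, 3, '2025-01-15'),
        (7, 3, '2025-02-15'),
        (8, 4, '2025-03-01')
    ) AS v(OrderID, CustomerID, OrderDate)
)
SELECT
    OrderID,
    CustomerID,
    OrderDate AS CurrentOrder,
    -- next order date of the same customer; NULL for the customer's last order
    LEAD(OrderDate) OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS NextOrder,
    -- DATEDIFF(unit, start, end): days from the current order to the next one
    DATEDIFF(day, OrderDate,
             LEAD(OrderDate) OVER (PARTITION BY CustomerID ORDER BY OrderDate)
    ) AS DaysUntilNextOrder
FROM Orders
ORDER BY CustomerID, OrderDate;
```

For customer 1 this gives 10, 26, and NULL; customers 2 and 3 each get 31 and NULL; customer 4, with a single order, gets only NULL.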
We can very easily access any information we need in the same row in order to do such an
important analysis, and with a very simple query. We are not doing any crazy stuff like joins.
We are just specifying the LEAD function. So now we have all the information that we need.
Next we're going to calculate the average of those days. In order to do that we have to use a
subquery. So let me just zoom out. Let's go and select star, just to prepare the subquery. So
the whole thing is going to be a subquery. I'm just going to get rid of the ORDER BY; it's not
necessary now. So let me just put it like this and shift it. So now what do we need? We need
the average of the days, so we need the average of this value. So what can we do? We're going
to use a GROUP BY. So, customer ID, since we have to find the average for each customer, and
we're going to take this value and say average of days until next order, and we're going to
call it average days. And here we have to group by, so GROUP BY customer ID. So like this. Let
me just make this a little bit smaller and zoom in here. So that's it. Now we are just doing a
very simple average and GROUP BY statement. So let's go and execute it. Now, as you can see,
it's going to go and aggregate the data.
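The aggregation step can be sketched as follows (SQL Server syntax). The Orders rows are hypothetical sample data, and AvgDays is an assumed alias.

```sql
-- Hypothetical sample rows standing in for the sales orders table.
WITH Orders AS (
    SELECT OrderID, CustomerID, CAST(OrderDate AS DATE) AS OrderDate
    FROM (VALUES
        (1, 1, '2025-01-10'), (2, 1, '2025-01-20'), (3, 1, '2025-02-15'),
        (4, 2, '2025-01-05'), (5, 2, '2025-02-05'),
        (6, 3, '2025-01-15'), (7, 3, '2025-02-15'),
        (8, 4, '2025-03-01')
    ) AS v(OrderID, CustomerID, OrderDate)
)
SELECT
    CustomerID,
    AVG(DaysUntilNextOrder) AS AvgDays   -- AVG skips the NULL gaps
FROM (
    SELECT
        CustomerID,
        DATEDIFF(day, OrderDate,
                 LEAD(OrderDate) OVER (PARTITION BY CustomerID ORDER BY OrderDate)
        ) AS DaysUntilNextOrder
    FROM Orders
) AS t
GROUP BY CustomerID;
```

On this sample the averages are 18 for customer 1, 31 for customers 2 and 3, and NULL for customer 4. Note that in SQL Server AVG over an integer column returns an integer, so a fractional average would be truncated unless the value is cast first.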
So we now have only four customers, and for each customer we have the average days between
their orders. So now what is missing in our task? If you check over here, it says rank the
customers based on this average. So we have to use the RANK function. So here again is another
window function that we have to use. We're going to do it together with the GROUP BY. So let me
just make this a little bit smaller, and then let's do it over here. I'm just going to go with
the RANK function. Then we're going to define the window like this: OVER, ORDER BY, and then
we're going to sort the data by the average days. So that means we're going to take this
calculation over here and put it in the ORDER BY. It's going to be ascending, so we are
focusing on the lowest average days. So that's it. Let's call it rank average. So now let's go
and execute this. Now, by checking the result, you can see we have a ranking for the average.
And here SQL says that the number one customer, the most loyal customer, is customer number
four, which is not really correct, because we don't have a lot of information about customer
number four. He or she ordered only once. So either you filter the data and remove this
customer, where you say if the average is NULL then don't put it in the ranking, or we can
replace this value with a very large value in order to put it at the end of our list. For
example, we can go over here and replace the NULL with COALESCE like this. And we say: if the
average is NULL, then give me a crazy number like this, a very large one.
So that's it. Let's go and execute. And now as you can see this customer going
now as you can see this customer going to be at the end of our list. And now we
to be at the end of our list. And now we can see that the most loyal customer is
can see that the most loyal customer is number one. And then the other two
number one. And then the other two customers are in the rank two. Here we
customers are in the rank two. Here we are sharing the same rank since we have
are sharing the same rank since we have the same average. So guys with that we
the same average. So guys with that we have solved the task and we have ranked
have solved the task and we have ranked the customers based on the average days
the customers based on the average days between their orders. So we have now a
between their orders. So we have now a really nice rank and we can understand
really nice rank and we can understand now the behavior of the customers and
now the behavior of the customers and maybe we have to go and focus on the
maybe we have to go and focus on the customer number one and understand her
customer number one and understand her or her needs. And of course the function
or her needs. And of course the function that helped us here in order to do such
that helped us here in order to do such a customer retention analyszis is the
a customer retention analyszis is the lead function in order to find the next
lead function in order to find the next order to calculate the days. So this is
order to calculate the days. So this is how you use lead functions to do such a
how you use lead functions to do such a use case.
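A sketch of the complete ranking query, in SQL Server syntax. The Orders rows are hypothetical sample data, the aliases are assumed names, and 999999 is just an arbitrary large placeholder; the exact number used in the video is not stated in the transcript.

```sql
-- Hypothetical sample rows standing in for the sales orders table.
WITH Orders AS (
    SELECT OrderID, CustomerID, CAST(OrderDate AS DATE) AS OrderDate
    FROM (VALUES
        (1, 1, '2025-01-10'), (2, 1, '2025-01-20'), (3, 1, '2025-02-15'),
        (4, 2, '2025-01-05'), (5, 2, '2025-02-05'),
        (6, 3, '2025-01-15'), (7, 3, '2025-02-15'),
        (8, 4, '2025-03-01')
    ) AS v(OrderID, CustomerID, OrderDate)
)
SELECT
    CustomerID,
    AVG(DaysUntilNextOrder) AS AvgDays,
    -- NULL sorts first in ascending order in SQL Server, so replace it with
    -- a large number to push single-order customers to the end of the ranking
    RANK() OVER (ORDER BY COALESCE(AVG(DaysUntilNextOrder), 999999)) AS RankAvg
FROM (
    SELECT
        CustomerID,
        DATEDIFF(day, OrderDate,
                 LEAD(OrderDate) OVER (PARTITION BY CustomerID ORDER BY OrderDate)
        ) AS DaysUntilNextOrder
    FROM Orders
) AS t
GROUP BY CustomerID;
```

On this sample, customer 1 gets rank 1, customers 2 and 3 share rank 2, and customer 4 lands at rank 4. The alternative mentioned in the narration, filtering out customers whose average is NULL, would instead add a HAVING AVG(DaysUntilNextOrder) IS NOT NULL clause.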
The FIRST_VALUE and LAST_VALUE functions. I think the names say everything,
right? FIRST_VALUE allows you to access a value from the first row within a
window, whereas LAST_VALUE does exactly the opposite: it allows you to access a
value from the last row within a window. Easy, right? So now let's understand how
SQL executes those functions. Okay. As usual, we have a very simple example. We
have the months and sales, and we have them twice because we would like to compare
the two functions, FIRST_VALUE and LAST_VALUE, side by side. On the left side we
would like to get the sales of the first month, and on the right side we would
like to get the sales of the last month. For the first task we can use
FIRST_VALUE. It's very simple: the FIRST_VALUE function, then the argument is
going to be sales, since we want the sales, and then the window is going to be
defined as ORDER BY month, because we want to get the first month. So as usual, we
must use ORDER BY. Now on the right side, in order to get the sales of the last
month, we can use LAST_VALUE, right? So the same thing: LAST_VALUE of sales, OVER,
ORDER BY month. As you can see, on the left and on the right we don't use any
frame definition, so the default is going to be used. All right.
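The two queries just described can be sketched like this. It is a minimal illustration, assuming SQLite (3.25 or newer) via Python's sqlite3 module instead of the course's own environment. The table name monthly_sales and the helper column month_no, which gives the months a sortable order, are hypothetical. The sales figures are the ones used in the walkthrough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)",
                 [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)])

# Both windows only use ORDER BY, so the default frame applies to each of them.
rows = conn.execute("""
    SELECT month, sales,
           FIRST_VALUE(sales) OVER (ORDER BY month_no) AS first_month_sales,
           LAST_VALUE(sales)  OVER (ORDER BY month_no) AS last_month_sales
    FROM monthly_sales
    ORDER BY month_no
""").fetchall()

for row in rows:
    print(row)
```

Running it prints one row per month with both window columns next to the original sales, which makes it easy to follow the row-by-row processing.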
So now let's see how SQL is going to process both of those queries side by side.
The first step is that SQL sorts the data. The rows are already sorted by month
from the lowest to the highest. The next step is to go row by row, finding the
first value on the left side. So what is the unbounded preceding? It is static
and always points to January. So this is always going to be the unbounded
preceding, and we have it on both sides like this. And what is the current row? At
the start it is the first row, and it is the same on the right side. So the window
definition contains only one row, right? What is the first value in this window?
It is 20. Right? The same thing on the right side: what is the last value in this
window? It is 20 as well. So we get exactly the same results. Now let's move to
the second row. The current row is pointing to February, and the frame is extended
like this. What is the first value in this frame? It is 20 as well, right? So in
the output we get 20. On the right side the current row is also pointing to
February and the window gets extended. So what is the last value of this frame? It
is 10. Now let's keep going. We go to March and the window gets extended. What is
the first value? It is always the same, so 20. On the right side the window gets
extended. What is the last value? It's going to be 30.
So as you can see, the default definition always has a static start, always the
same start of the subset, and as we move with the current row the frame gets
extended. Now we move to the last row, and with that the whole data set is inside
the frame, and the first value is going to be 20. On the right side it is the same
thing: the frame gets extended like this, and this time the last row is April, so
five. Now if you compare them side by side, you see that on the left side the
task is solved and everything is working correctly, right? For each row we always
have the sales of the first row, and the first row is January. So we have 20
everywhere, which is correct. But if you check the right side, you can see there
is something wrong, right? We are not getting the last value. We should always get
April, right? We should have a five everywhere. Instead we have exactly the same
result as the sales column, so it's really useless to use it like this, right? And
that is of course because SQL is using the default definition of the window frame.
LAST_VALUE is the only one of the window functions we cover where you cannot rely
on the default frame definition. You have to customize the frame definition in
order to get the effect of the last value. For FIRST_VALUE everything works if you
use the default frame and don't specify anything, but for LAST_VALUE you will not
get the correct effect without customizing the window frame. So my friends, you
can use the FIRST_VALUE function like all other window functions without defining
a frame. You can go with the default and you will get the effect of the first
value, but for LAST_VALUE you have to define a frame. So let's see how we can
solve that. All right. In order to solve this, we are going to define the frame
like this: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. So we just switch
things around. Now let's see how this is going to work. Of course SQL is going to
sort the data and so on. This time it keeps a pointer to the unbounded following,
so it always points to the last row in our data set, and then it proceeds step by
step. For the first row the frame is the whole thing, right? From the current row
until the unbounded following. So what is the last value, the last row? It is the
five, right? April. So we get five in the output. Now let's proceed to the next
row. The frame gets shorter and smaller. And what is the last value? It is five
as well, right? Now we jump to the next one, and the frame is going to be like
this. What is the last value? Five as well. And then we get to the last row like
this: the current row is equal to the unbounded following. We have only one row,
and it is five as well. So as you can see, it's very simple: just fix the frame
clause and you will get LAST_VALUE working as expected.
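The frame fix walked through above can be sketched as follows, again assuming SQLite (3.25 or newer) through Python's sqlite3 module. The table monthly_sales and the ordering column month_no are hypothetical names. The figures are the ones from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)",
                 [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)])

rows = conn.execute("""
    SELECT month, sales,
           -- Default frame: ends at the current row, so it just echoes sales.
           LAST_VALUE(sales) OVER (ORDER BY month_no) AS last_default,
           -- Custom frame: always reaches the final row of the data set.
           LAST_VALUE(sales) OVER (
               ORDER BY month_no
               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
           ) AS last_fixed
    FROM monthly_sales
    ORDER BY month_no
""").fetchall()

for row in rows:
    print(row)
```

The last_default column repeats the sales of each row, while last_fixed shows the April value of 5 on every row.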
So this is how SQL is going to do it. Now let's go back to SQL and start
practicing. All right. We have the following task: find the lowest and highest
sales for each product. So let's see how we can do this. As usual, we start with
a very simple SELECT statement. So select the order ID, we need the product ID,
and their sales as well, from the sales orders table. That's it. Let's execute
this. In the output we get our orders, products and sales. Now let's start with
the first part of the task: find the lowest sales for each product. In order to
do that, we can use the FIRST_VALUE function. So let's do that. FIRST_VALUE, and
then we have to give an expression. We need the lowest and highest sales, so let's
put the sales inside it. And now we have to define the window, so OVER. Since we
are saying "for each product", that means we have to make windows. So we have to
divide the data using PARTITION BY product ID. And then we must use an ORDER BY,
right? We have to sort the data by the sales. Since the first value should be the
lowest value, we have to sort ascending, from the lowest sales to the highest
sales. So we just leave it like this, as the default, and we call it lowest
sales. Let's execute this and check our results. First SQL partitions the data by
the product ID, so as you can see we now have four windows. Then it sorts the
data by the sales, so the data are sorted from the lowest to the highest, from 10
to 90. So what is the first value of the sales? It is the first row, right? So
it's going to be 10. That's why we have a 10 everywhere. Let's check another one.
Let's take this one here. This window has two rows and it is sorted. The lowest
sales, or let's say the first value, is 25. So with that we have solved the first
part of the task, finding the lowest sales for each product. Let's go to the next
one.
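The first part of the task can be sketched as below. This assumes SQLite (3.25 or newer) through Python's sqlite3 module. The table name orders, the column names and all rows are hypothetical stand-ins for the course's sales orders table, chosen only so that one product runs from 10 to 90 and another has two rows starting at 25.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the course's sales orders table.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 25), (5, 104, 50),
    (6, 104, 20), (7, 102, 60), (8, 101, 90), (9, 105, 60),
])

rows = conn.execute("""
    SELECT order_id, product_id, sales,
           -- One window per product, sorted ascending, so the first row
           -- of each window holds the lowest sales.
           FIRST_VALUE(sales) OVER (
               PARTITION BY product_id ORDER BY sales
           ) AS lowest_sales
    FROM orders
    ORDER BY product_id, sales
""").fetchall()

for row in rows:
    print(row)
```

Every row of a product carries the same lowest_sales value, which is what the result check in the walkthrough looks for.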
We have to find out the highest sales for each product. So let's use LAST_VALUE
for this. Let's have a new line. We're going to have LAST_VALUE, again of the
sales. Then we define the window, and it's going to be the exact same window: we
have to partition the data by the product ID and order the data by sales. So
let's just copy the previous one and call it, for now, highest sales. Let's
execute it. Now if you check the results, you will see our issue here again,
right? We are not getting the highest sales for this window. The highest sales is
90, but as you can see, we are getting exactly the same values as the sales
column. We have explained that in the previous example. So in order to fix this,
we are going to add the frame to it: ROWS BETWEEN CURRENT ROW AND UNBOUNDED
FOLLOWING.
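A sketch of the corrected query, assuming SQLite (3.25 or newer) through Python's sqlite3 module. The orders table and its rows are hypothetical sample data, not the course's actual table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the course's sales orders table.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 25), (5, 104, 50),
    (6, 104, 20), (7, 102, 60), (8, 101, 90), (9, 105, 60),
])

rows = conn.execute("""
    SELECT order_id, product_id, sales,
           -- The explicit frame lets LAST_VALUE see the end of each partition.
           LAST_VALUE(sales) OVER (
               PARTITION BY product_id ORDER BY sales
               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
           ) AS highest_sales
    FROM orders
    ORDER BY product_id, sales
""").fetchall()

for row in rows:
    print(row)
```

With the frame in place, each product's rows all show that product's highest sales instead of repeating their own value.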
So now let's execute this and check the result. As you can see over here, we got
the highest sales correctly. For this window the highest one is 90, for this
window it is 60, and so on. So with that you have solved both parts of the task,
the lowest and the highest sales. But now I would like to give you my honest
opinion about this task. I would not use LAST_VALUE to find the highest sales.
Let me show you how I usually do it. I'm going to use FIRST_VALUE in order to find
the last value. So let me show you what I mean. Let's add a new line. I will just
take the whole thing from the lowest sales, but I'm going to change the order.
That means we will not sort the data ascending, from the lowest sales to the
highest sales. We're going to switch it, so we sort the data from the highest
sales to the lowest sales. And with that, the first value is going to be the
highest sales. So let me rename it to highest sales, let's say number two. Let's
execute this. Now you can see over here that we got the exact same results,
because we sorted the data differently and took the first value. So this gives
you the exact same effect as LAST_VALUE. And as you can see, I don't have to
define any frame or anything like that. I can stick with the default frame and
just twist the ORDER BY. So this is how you can do it using only FIRST_VALUE. Now,
just for the sake of this task, there is another possibility for how to solve
this. You can use the MIN and MAX functions. So let me take the same thing and
have a new one, the lowest sales. We can say, you know what, let's get the MIN. So
we are saying "find me the minimum sales", and we don't have to sort anything. We
can just partition the data like this. Let's give it another ID and execute it. As
you can see, we got the exact same results as the other two highest sales. So as
you can see, we can solve this task using three different functions: either use
LAST_VALUE, where you have to define the frame, or use FIRST_VALUE, where you
flip the ORDER BY, or simply use the MAX function in order to get the highest
sales. So guys, as you can see, we can use FIRST_VALUE and LAST_VALUE in order to
find the extremes, like the lowest and the highest sales in this example. So
there is a similarity between those two functions and MIN and MAX. And of course,
what are we going to do with this value? We can compare it with the current sales.
For example, we can extend our task and say: find the difference in sales between
the current and the lowest sales. In order to do that, let me just clean up all
this stuff and stick with the FIRST_VALUE and the highest value like this. We now
have to compare the current sales, which is this field over here, the original
sales, with the lowest sales, the whole expression from here. So let's do that.
We're going to have a new line and simply subtract the lowest sales from the
sales, like this.
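The subtraction just described can be sketched as follows, assuming SQLite (3.25 or newer) through Python's sqlite3 module. The orders table and its rows are hypothetical sample data, and sales_difference is an illustrative alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the course's sales orders table.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 25), (5, 104, 50),
    (6, 104, 20), (7, 102, 60), (8, 101, 90), (9, 105, 60),
])

rows = conn.execute("""
    SELECT order_id, product_id, sales,
           FIRST_VALUE(sales) OVER (
               PARTITION BY product_id ORDER BY sales
           ) AS lowest_sales,
           -- Current sales minus the lowest sales of the same product.
           sales - FIRST_VALUE(sales) OVER (
               PARTITION BY product_id ORDER BY sales
           ) AS sales_difference
    FROM orders
    ORDER BY product_id, sales
""").fetchall()

for row in rows:
    print(row)
```

For the hypothetical product 101, the row with sales of 90 ends up 80 above that product's lowest sales of 10.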
And let's give it a name: sales difference. So that's it. Let's execute it. Now
as you can see in the result, in one row I'm comparing the current sales, which
is 90, with the lowest sales for this product, which is 10. With that we get the
distance, let's say, between those two values, and it's going to be 80. For the
next one, the distance between this value and the lowest value is shorter, so we
are near the lowest value. So as you can see over here, we can now compare the
current sales with one extreme in order to find the distance between the two
values. So this is again a very important technique for doing comparison
analysis.
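The three ways of getting the highest sales per product mentioned in this section (LAST_VALUE with a custom frame, FIRST_VALUE with a flipped ORDER BY, and MAX) can be compared in one query. This is a sketch assuming SQLite (3.25 or newer) through Python's sqlite3 module. The orders table, its rows and the column aliases are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the course's sales orders table.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, 10), (2, 102, 15), (3, 101, 20), (4, 105, 25), (5, 104, 50),
    (6, 104, 20), (7, 102, 60), (8, 101, 90), (9, 105, 60),
])

rows = conn.execute("""
    SELECT order_id, product_id, sales,
           -- Option 1: LAST_VALUE needs an explicit frame.
           LAST_VALUE(sales) OVER (
               PARTITION BY product_id ORDER BY sales
               ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
           ) AS highest_via_last,
           -- Option 2: FIRST_VALUE with descending order, default frame.
           FIRST_VALUE(sales) OVER (
               PARTITION BY product_id ORDER BY sales DESC
           ) AS highest_via_first,
           -- Option 3: MAX over the partition, no sorting required.
           MAX(sales) OVER (PARTITION BY product_id) AS highest_via_max
    FROM orders
    ORDER BY product_id, sales
""").fetchall()

for row in rows:
    print(row)
```

All three columns agree on every row of this sample, so the choice between them is mostly a matter of readability.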
All right friends, so now let's do a quick recap of the value functions, or as we
sometimes call them, analytical functions. What they do is allow you to access a
specific value from another row. This helps you to do complex calculations with
very simple SQL, without having to join tables together or do self joins. For the
value functions we have four types, or let's say four functions. The first one
allows you to access the previous value, like the previous month, using the LAG
function. The next one allows you to access the next value, the next month, using
the LEAD function. Then we have another one that allows you to access the first
value in a subset, using the FIRST_VALUE function. And as another option we can
access the last value in a subset, using the LAST_VALUE function.
Moving on to the next one, we have the rules of the syntax. So about the first
rules of the syntax. So about the first point, it is the expressions. We can go
point, it is the expressions. We can go and use any data type. It could be a
and use any data type. It could be a number, string, a date, anything. Now in
number, string, a date, anything. Now in order to perform those functions, we
order to perform those functions, we have to go and sort the data by the
have to go and sort the data by the order by. So order by is required. It is
order by. So order by is required. It is a must. Then for the frame, you are
a must. Then for the frame, you are allowed to use it. So it is an optional
allowed to use it. So it is an optional thing. I would say always leave it empty
thing. I would say always leave it empty for the frame. But only for the last
for the frame. But only for the last value, you have to go and customize
value, you have to go and customize otherwise it will not work. Now to the
otherwise it will not work. Now to the next point, we have the use cases. We
next point, we have the use cases. We have simply very important use cases for
have simply very important use cases for the value functions in data analytics.
So what can we do? We can do time series analysis. As we learned, we can do
month-over-month analysis and year-over-year analysis. Those analyses are
classics, and it's always the first question in data analysis in order to
measure: are we growing with the business or are we declining? How is the
performance of the current year compared with the previous year? So as you can
see, we are always doing comparisons using those window functions. The next use
case is about time as well: we can do time gap analysis, as when we analyzed the
customer behavior, the customer retention, where we calculated the average days
between two orders. And the last use case is about comparison as well: comparison
analysis. We can use the value functions in order to compare the current value
with an extreme, like comparing the current sales with the highest sales or with
the lowest sales. So my friends, those analyses are essential in data analysis.
You will be encountering them in each company. In each business you have to
answer those questions, and you can do that very easily using the SQL window
functions. All right my friends, so that's all about the window value functions,
and with that we have covered everything about how to aggregate your data using
SQL. Those are very important tools for doing data analytics in SQL, especially
if you are a data scientist or data analyst.
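As a small reminder of the month-over-month use case named in this recap, here is a sketch with LAG. It assumes SQLite (3.25 or newer) through Python's sqlite3 module. The monthly_sales table, the month_no column and the mom_change alias are hypothetical, with figures reused from the earlier month example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (month_no INTEGER, month TEXT, sales INTEGER)")
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)",
                 [(1, "Jan", 20), (2, "Feb", 10), (3, "Mar", 30), (4, "Apr", 5)])

rows = conn.execute("""
    SELECT month, sales,
           -- Current month minus previous month; NULL for the first month.
           sales - LAG(sales) OVER (ORDER BY month_no) AS mom_change
    FROM monthly_sales
    ORDER BY month_no
""").fetchall()

for row in rows:
    print(row)
```

A positive mom_change means the month grew compared with the previous one, and a negative value means it declined.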
So with that we are done with this chapter, and I can tell you that with it we
have covered the intermediate level. We have learned how to filter the data, how
to combine the data, and the most important functions in SQL. Now we're going to
go to the third and last level: we will now cover the advanced level. The first
part is going to be about advanced SQL techniques. If you go inside it, in SQL
there are different techniques for organizing our complex projects. So first I'm
going to explain what exactly I'm talking about, what complex queries are and why
we have them, and then we're going to start with the first topic, the subqueries.
So let's go. Normally in projects we have a database, and we have a person who is
responsible for the database, the database administrator, who takes care of the
database structure. In a very simple scenario we have a user who is writing
queries in order to retrieve data from the database. He or she writes an SQL
query, this query is sent to the database, which executes it, and then the
database returns the results. So at the end our user sees the result of the query
that he or she wrote. This is a very simplified scenario of how we use a database.
But my friends, in the real world things are totally different.
Things in real projects get very complicated, like this. For example, you have a financial analyst who is writing a huge, very complex block of SQL, and there will be another user with a different role, like a risk manager, who is as well writing a very complex query. From different departments, from different projects, for different tasks, you will have a lot of analysts writing many complex queries. All those analysts and managers have direct access to your database, and they are executing complex analytical queries in order to generate maybe a report or something.
Now not only are those guys doing analysis on your database, you will have as well our friend the data engineer, who is saying: you know what, I'm building a data warehouse and I would like to extract your data. So that data engineer is going to write an extract query in order to extract the data from the database. Then he has a different script for the transformations, in order to manipulate, filter, clean up and aggregate your data. And then a third script collects the result of the transformations and loads it into another database called the data warehouse. A data warehouse is a special database that collects data from different sources and integrates it in one place in order to do analytics and reporting.
At the end of this chain you will have a data analyst, and she as well writes queries in order to analyze the data in the data warehouse. Or you might have a different query in order to prepare the data before inserting it into a tool like Power BI to generate visualizations and reports. We call this a data warehouse system or a business intelligence system: it extracts your data, manipulates it and transforms it for analysis.
Now not only do we have a data engineer and a data analyst accessing your database and running queries, we have as well our friend the data scientist. Our data scientist as well has direct access to your database. So he might write different queries in order to extract and manipulate the data that is needed to develop a model and do machine learning and AI. And one more scenario that I see in many projects is where the result of the data analyst is going to be used in another query, in order to prepare the results for data visualizations in Power BI or to export something like an Excel list.
So as you can see, we have a lot of people with different roles who want to access your database and do analysis on top of it, and that's because everyone wants to answer questions based on the data. And if I look at this, I still think this is a simplified version of how things work in data projects. I can tell you, in real projects things are way more complicated than this.
So now if you sit back and look at this, you will find many challenges and problems. For example, all those people are not talking to each other, and each one of them is creating their own query. But if you take all those queries and compare them side by side, you will find logic in the scripts and queries that keeps repeating. So the queries from the analysts, the data scientists and the data engineers might contain redundant logic. The issue with this is that we have the same effort repeated over and over, and maybe not everyone gets the logic implemented correctly, because not all of them have the right skills in SQL. So this is a big issue in this setup.
Now we have another challenge in this scenario. If you don't optimize it, you will have performance issues everywhere. The data warehouse or data engineer scripts might take like 5 hours, the query from the analyst might take like 40 minutes, and before inserting the data into reports we might have 30 minutes here and 1 hour there. Everyone else is as well suffering from bad performance on their queries, and the performance everywhere is really bad. So if everyone is writing big complex queries, don't expect them to have good performance.
Now to the third challenge that I have observed in many projects, and that is the complexity. Behind the original database you might have a data model that is prepared and optimized for only one application. So you will have a lot of tables in the data model, all those tables have different relationships between them, and of course only the developers and the experts of this database understand the physical data model behind it. If you give access to all those analysts, they will have a lot of questions, because first they have to understand the data model before writing any query. That means a lot of data workers keep asking our database expert questions, for example: How do I connect table A with table B? Where do I find my columns? What does this table mean? I'm getting bad results in my query because your data is really corrupt. So the developers of the database will get a lot of questions from the analysts, and they have to explain their data model over and over so that the users are able to write those complex queries. That means all those users are stressing the database team with many questions, and as well the users are writing very complex queries.
So the complexity is a really big challenge.
Now as well, by looking at this picture you will find a lot of arrows from those queries to the database, and this might cause a lot of database stress. Repeatedly executing big complex queries puts really big stress on the database, and it might even bring the database down.
And the last challenge in this picture is data security. If you leave it like this, giving the users direct access to your database tables, you might have a problem. It might be okay for some data engineers and so on, but you don't want to give each data analyst full access to the database tables. You have to protect your tables, the columns, the rows, everything. So you cannot leave it like this, where everyone has direct access to the physical database tables.
Now enough talking about challenges, problems and issues. Let's be solution-oriented. So what are the solutions to those issues? Of course, there are many solutions, but we're going to focus now on five techniques. We can use subqueries or CTEs (common table expressions). We can introduce views to our database, or temporary tables, or we can use the CTAS technique: CREATE TABLE AS SELECT. So this is exactly why we have to understand those five techniques: in order to solve all those issues that we might face in our data projects.
All right friends, so now that we have understood the importance of those five techniques, let's take a quick and simplified look at the database architecture, because I want you to understand what happens behind the scenes and how the database executes the queries from these five techniques. By understanding this architecture you will understand how things work. So let's go.
Every story has two sides. We have the server side and the client side. The client side is, for example, you writing an SQL query for a specific purpose. On the server side we have many things. The server is where the database lives, and it has many components, like the database engine. The database engine is the brain of the database: it handles different operations like storing, retrieving and managing data in the database. So each time you execute a query, the database engine is going to take care of it.
In the database we have another very important component, the storage, and the two main types of storage in a database are disk storage and cache. The disk storage is like long-term memory, where the data is stored permanently. It's like the disk in your PC: it keeps the data permanently, even if you turn off the system. One important feature of the disk is that it can store a lot of data. But the disadvantage of disk storage is that it is slow, both to write and to read.
On the other hand we have the cache, a short-term memory where the data is stored temporarily. It's like the RAM in your PC. It holds the most frequently used data, so the database can access it quickly in order to retrieve data. The big advantage of the cache is that it is fast: it is very fast for the database to retrieve data from the cache compared to the disk. But the disadvantage of the cache is that the data is stored there only for a short period. So it's a trade-off between the speed, how much data you can store, and for how long.
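One way to watch this cache and disk trade-off in SQL Server, which is not shown in the video, is the SET STATISTICS IO option. The sketch below assumes a table named Sales.Orders; the exact numbers depend on your system, and for a small table that is already in memory both runs may report no physical reads at all.

```sql
-- Report page reads for each statement in the Messages output.
SET STATISTICS IO ON;

-- First run: pages that are not yet in memory are read from disk
-- (reported as physical reads or read-ahead reads).
SELECT * FROM Sales.Orders;

-- Second run: the same pages are usually still in the cache,
-- so the output typically shows logical reads only.
SELECT * FROM Sales.Orders;

SET STATISTICS IO OFF;
```

Logical reads are pages served from memory, physical reads are pages that had to be fetched from disk first.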
Now let's talk about the disk storage. This is very important. In databases there are typically three types of storage areas: the user data, the system catalog and the temporary data, and each storage type has a different purpose.
So what is user data storage? It is the main content of the database. It stores the actual data, all the information that is relevant for the users, all the important data that the users care about. So this is the storage that the users are interacting with all the time. Where do we find the user data? If you go to our database SalesDB and then to the tables, we find all the tables that we have already used: the customers, employees, orders and so on. Those tables are the user data. So if I say SELECT * FROM Sales.Orders, all the information that we are seeing now is the user data. This is what we as users actually care about. All the other stuff that we see inside databases, as a user we don't care about it; we care only about our data. But in the database we don't have only the user data, we have a lot of other information. So this is what we mean by the user data storage.
Now what is the system catalog? This is the database's internal storage for its own information. It's like a blueprint that keeps track of everything about the database itself. That means the main purpose of the system catalog is to hold the metadata about the database. So what is metadata? Metadata is data about data. Now let's understand what this means.
What we have done so far is that we have created a table called customers, and we have defined multiple columns inside it, like the customer ID, first name and last name, and then we have inserted our data into this table. So we have inserted five customers. That information is my data: I have created it and stored it inside the database. That's why we call it the user data. So far nothing is new.
What happens behind the scenes is that the database server will not only store the user data that you have provided, it is also going to store a different type of data inside the database, and this data is the metadata. So the database server is going to store the metadata of the customers table, and it is going to look like this: there is a table name, there are column names, and those are the column names that you defined as you were creating the table customers. It is going to store additional information as well, like the data type (the customer ID is INT and the last name is VARCHAR) and many other things, like the length of the column and whether the column is nullable or not.
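As a rough illustration of user data versus metadata, the sketch below creates a small customers table and then lists, as comments, the kind of description the server keeps about it. The column names, the length of 50, the nullability and the sample row are assumptions made up for this example, not the exact definition used in the course.

```sql
-- Hypothetical definition; lengths and nullability are assumed for illustration.
CREATE TABLE customers (
    customer_id INT         NOT NULL,
    first_name  VARCHAR(50) NULL,
    last_name   VARCHAR(50) NULL
);

-- User data: the rows you insert yourself (made-up sample row).
INSERT INTO customers (customer_id, first_name, last_name)
VALUES (1, 'Maria', 'Lopez');

-- Metadata: what the server records about the table, conceptually:
--   table_name | column_name | data_type | max_length | is_nullable
--   customers  | customer_id | int       |            | NO
--   customers  | first_name  | varchar   | 50         | YES
--   customers  | last_name   | varchar   | 50         | YES
```

The INSERT changes only the user data; the metadata changes only when the structure of the table changes.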
So as you can see, in the metadata we have a description, data about the structure of the customers table. In the metadata we can find a lot of information, not only about the tables and columns but as well about the schemas and the database. So you can find a full catalog of the structure of your database. The basic table, the customers table, contains the actual data: it stores data about the customers. But the metadata of the customers table contains data about data. So in databases, each table that you use to store your data has a twin table that describes the structure of your data. This is what we mean by a system catalog or metadata.
Now you might ask: where can I find the system catalog and metadata inside our client here? Well, you cannot navigate through this information in the object explorer like we used to do for the user data. But you can find it in a special hidden schema called the information schema. The information schema in SQL Server is a system-defined schema that contains a set of built-in views that help us find information about our database, like tables, columns and other objects.
So let's go and explore it. We're going to say SELECT * FROM INFORMATION_SCHEMA, and then let's add a dot. Now we get from SQL a list of all the views that are available to browse the metadata of our database. For example, you can see here TABLES, you can see information about the VIEWS, and as well about the COLUMNS. So let's select COLUMNS and execute it. In the output we can find information about the schema and about the table names, like for example here the customers. Let me just select this table. Then we find all the columns inside this table and how they are sorted: we have here the order of each column, and as well the data type and the size of each column, and many other things. So as you can see, we got here all the metadata of each table and as well of each column inside the table.
With that you can check which tables exist in your database. For example, I find here something called test two. Maybe I was trying to test something, so I can go now and clean up stuff, right? And this is exactly why the database maintains such a catalog. It helps the database to quickly find the structure of each table and of each column, and it helps me as a user as well to browse the catalog of the database. For example, I can go over here and say: okay, let's get the distinct table names. With that I get a list of everything that I have inside the database. So we have the customers, the employees and some tests that I have done. So metadata is awesome.
Now we come to the third storage, the temporary data storage. It is a temporary space used by the database for short-term tasks, like processing a query or sorting data. And once these tasks are done, what is going to happen? The database is going to clean up the storage.
Now of course the question is: where can we find the temporary tables that use the temporary storage on the disk? Well, if you go to the object explorer, you will not find them inside our database SalesDB, but you will find them inside the system databases. Since we are working locally, we have full access to everything inside the SQL Server. But in real projects, if you are just a user or, let's say, a developer, you will usually not have access to the system databases; that is only for the database administrators. But now we are working on a local copy. So let's go to the system databases, and here you have a special SQL Server database called tempdb. If you go inside it, we find here tables and temporary tables. So this is exactly where you can find all the temporary tables that you are generating. Currently we haven't created any temporary tables, that's why it's empty. But once you start creating temporary tables, you will find them underneath this folder. We will learn about temporary tables in the next sections.
sections. So these are the main component of the database architecture.
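To connect the tempdb folder described above with something you can run, here is a minimal sketch. It assumes SQL Server, and the table name #DemoOrders is a hypothetical name chosen only for illustration; it does not come from the course. It creates a local temporary table and then looks it up in the catalog of tempdb instead of the user database.

```sql
-- Hypothetical demo table; the name #DemoOrders is an assumption for illustration.
-- The leading # makes it a local temporary table in SQL Server.
CREATE TABLE #DemoOrders (
    OrderID INT,
    Sales   INT
);

-- Temporary tables are stored in tempdb, not in the user database.
-- SQL Server appends a suffix to the stored name, so match on the prefix.
SELECT name, create_date
FROM tempdb.sys.tables
WHERE name LIKE '#DemoOrders%';

-- Clean up; the table would also disappear when the session ends.
DROP TABLE #DemoOrders;
```

After the CREATE TABLE, refreshing the Temporary Tables folder in Object Explorer should show the same table that the catalog query returns.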
So now let's have an example. Now we
have a table called orders that is
stored inside the user storage, and the
metadata of this table is stored in the
catalog. So now let's say that you are
at the client side and you write a
simple SELECT query in order to select
the data of the orders. So now that
query is sent to the server in order to
be executed, and the database engine
is going to take the query in order to
process it. So first the database engine
is going to check whether we have the data
in the cache, because if the data is
stored in the cache, then things are going to
be really fast and the database engine
can solve the task quickly. But in this
scenario we don't have the orders
information in the cache. That's why the
database engine is going to say: okay, it's
not in the cache, let's check the disk. So
it will find the orders information on
the disk and the query is going to be
executed. Then the result of this query
is going to be sent back to the client side,
where at the end you will see
in the output the result of the table
orders. So this is how the SQL database
executes a very simple SELECT query.
A subquery is a query inside another query.
So what does this mean? Let's have a
sketch to understand it. So far, what
we have learned: we have different
database tables like the orders,
customers and so on, and we write
simple SQL queries like SELECT, FROM,
WHERE. So SQL is going to retrieve data
from the database tables, and in the
output we will get some kind of results.
So this is what you have done so far. We
have done very simple queries. Now in
our query we can have things a little bit
different. So we could have another
query that is inside our query, where we
do the same things like SELECT, FROM,
WHERE. So we have now a query inside our
query. We call this embedded query
a subquery, and the original
query, the first one where we have SELECT
FROM, we call the main query. So now if
you execute the whole query, what is going
to happen? SQL is first going to go and
select the subquery and then it's going
to execute it. So it's going to go and
select and retrieve data from our
database tables, and the result of the
subquery will not be sent to the users,
to us. So we cannot see it. What can
happen? The result can stay inside the
query as an intermediate result, and
then our main query can go and start
interacting with this intermediate
result from the subquery. So the main
query is going to do some kind of
operations on top of this intermediate
result and use it for filtering or
joining or any purpose, and still the
main query can go and query the original
database tables. So now the main query
has two sources for data: the original
database tables and as well the result
from another query. So now by looking at
this you can see the subquery is a query
inside the main query and it plays the role
of a supporter. So it supports the main
query with data, and the main job of the
main query is of course to get all that
data and to show us at the end the final
results.
Now there are two things
about this intermediate result that we
got as a result from the subquery. Once
the execution of the query is completely
done, SQL is going to go and
destroy this intermediate result. So
it's going to totally drop it. So we
will not find it anywhere. It's
completely lost. Now the other thing
about the intermediate result is this:
imagine you are making another query
that is completely outside of the first
query. We are selecting from a few tables in
our database. Now you might say: you know
what, is it possible to access the
intermediate result from the first
query? So now we are talking about a
completely external query. You cannot do
that. The intermediate result of the
subquery is only locally known to the
main query itself, and it is not globally
available for any other query. So the
subquery can be used only from the main
query.
So with that we have understood what subqueries are, and now you might ask me: why
do we need them in the first place? Why
are subqueries important? Let's have
the following sketch. Now in a complex
task we might have to do several things
in our query. For example, in the first
step we have to go and join tables in
order to prepare the data, and then the
outcome of the joins should be filtered.
So this is going to be our step two. And
then on top of that, in step three we
have to go and do transformations like
maybe handling the nulls or creating new
columns and many other things. And in the
last step we want to go and do data
aggregations like summarizing the data
or finding the average. Now if you go
immediately and start writing the SQL
query without having a plan, what can
happen is you're going to end up having a
long, complex SQL query, and it's going to
be really hard to write and as well to
understand and read.
And now what we can
do instead of that: we're going to go and
divide our task based on those steps. So
we're going to write one query section
for each step. For example, for joining
tables we're going to have one query, for
filtering another one, for transformation
another one, and for the aggregation
we're going to have the last query. So
now, since each step is like a
preparation for the next step, we can go
and say each of those queries is a
subquery. So for step one, step two,
step three, we have subqueries, and they
are all doing calculations and
preparations for the last step, the
aggregations, and we call the last step
the main query. And of course the whole
thing can exist in one single query. So
if you want to visualize this, you have
a subquery as a circle, and then this
circle belongs to a bigger circle called
the main query. By the way, sometimes we
call the main query the outer query,
and the subquery we can call an inner
query. And of course, we can have many
subqueries, many small circles inside
each other, to form something called
nested queries. So this is the main
purpose of using subqueries in our
scripts and queries. It's going to help
us to reduce the complexity and it's going to
make the query easier to read, and we can have
a logical flow inside our
queries.
Now for the subqueries there are many different types and categories. So now
what we're going to do: I'm going to show
you an overview of all those types and
categories, and then later we're going to
deep dive into each of those types. So
first of all, if you are thinking about
the dependencies between the subquery
and the main query, there are mainly two
types of subqueries. We have the
non-correlated subquery. That means the
subquery is independent from the main
query. And the second type is the
correlated subquery. It's exactly the
opposite. The subquery is going to depend on
the main query. Of course, we will
explain all of this in detail.
Don't worry about it. So, this is the
first group.
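As a rough preview of the two dependency types, here is a small sketch, assuming SQL Server. The table variables, column names, and values are hypothetical sample data. In the first query the subquery could be run on its own; in the second one it refers to a column of the main query.

```sql
-- Hypothetical sample data (names and values are assumptions).
DECLARE @Customers TABLE (CustomerID INT, FirstName VARCHAR(50), Country VARCHAR(50));
DECLARE @Orders    TABLE (OrderID INT, CustomerID INT, Sales INT);

INSERT INTO @Customers (CustomerID, FirstName, Country)
VALUES (1, 'Anna', 'Germany'), (2, 'Ben', 'USA'), (3, 'Cara', 'Germany');

INSERT INTO @Orders (OrderID, CustomerID, Sales)
VALUES (1, 1, 10), (2, 2, 30), (3, 1, 50);

-- Non-correlated: the subquery does not reference the main query.
-- Highlighting and executing only the inner SELECT would work.
SELECT OrderID, CustomerID, Sales
FROM @Orders
WHERE CustomerID IN (
    SELECT CustomerID
    FROM @Customers
    WHERE Country = 'Germany'
);

-- Correlated: the subquery references c.CustomerID from the main query,
-- so it cannot be executed on its own.
SELECT
    c.CustomerID,
    c.FirstName,
    (SELECT COUNT(*)
     FROM @Orders AS o
     WHERE o.CustomerID = c.CustomerID) AS TotalOrders
FROM @Customers AS c;
```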
Now, there is another way
to group the subqueries,
depending on the result type. So, I mean
with this that the subquery has
different outputs and results. For
example, we have the scalar subquery. It
returns only one single value. Or
another type, it's called the row
subquery. It's going to return multiple
rows. And the final type is called the table
subquery. It is a subquery that returns
multiple rows and as well multiple
columns. Now we come to the third
and last way to categorize
the subqueries, and this time based on
the location and the clauses. So we are
describing here where the subquery is going
to be used within the main query. So we
can use it in different locations and
clauses, like the SELECT clause, or we can
use it in the FROM clause, and this is
the most common type for the subqueries,
or we can use it before joining tables,
and we can use it in order to filter the
data in the WHERE clause. And in the WHERE
clause, as we learned, there are two
different sets of operators. We can use
the subquery together with the
comparison operators, the less, greater,
equal and so on. Or we can use it with
the logical operators like IN, ANY,
ALL and EXISTS. So now those are the
different types and categories for the
subqueries, and we're going to now deep
dive into all of them. So now let's go
and start with the easiest category, the
result types of the subqueries.
Now we have different types of subqueries based on the results. So this
means the amount of data that the
subquery is going to return. So the first
type is the scalar subquery. So it is a
subquery that is going to return only
one single value, like for example the
value three. Let's have an example for
the scalar subquery. So in this query,
for example, if you are saying SELECT
star, you will get all columns and all the
rows from one table. But for the scalar
subquery we need only one value. So how
we usually get it is by doing some
aggregation. For example, if you go and
say let's get the average of sales. So
let's execute it. And with that, in the
output we have only one value, 38.
We call such a query a scalar query.
So it has only one row and only one
column. So this is a scalar query.
All right. So now to the second type: we have
the row subquery. So it is a subquery
that is going to return multiple rows and a
single column. So we're going to have
values like 1, 2, 3. So it is only one
column with multiple rows. Let's have an
example for the row query. As you can
see, now we are saying SELECT star from
the table orders, and now we are getting
multiple rows and multiple columns. But
for the row queries we need only one
column. So you can go over here and for
example say customer ID. And if you go
and execute it, now if you check the
output, we have a single column and as
well multiple rows. So we have like a
list of values, and this is what we call a
row query.
All right. So now to the last
type: we have the table subquery. It's
going to go and return multiple rows and
as well multiple columns, like any
regular table. So this subquery is going
to return a lot of values. Okay. So
let's see an example of a table
query. So if you check our example here,
SELECT star from orders, we got here
multiple rows and as well multiple
columns. And of course we can go and
select multiple columns, like for example
the order ID and the order date. So if
we execute it, here in the output we have
multiple columns, we have two columns, and
as well multiple rows. That's why this
kind of query is as well a table query.
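The three result types from this part can be put side by side. This is a small sketch, assuming SQL Server; the table variable @Orders, its columns, and the values are hypothetical sample data and will not reproduce the numbers from the video.

```sql
-- Hypothetical sample data (names and values are assumptions).
DECLARE @Orders TABLE (OrderID INT, CustomerID INT, OrderDate DATE, Sales INT);

INSERT INTO @Orders (OrderID, CustomerID, OrderDate, Sales)
VALUES (1, 2, '2025-01-01', 10),
       (2, 3, '2025-01-05', 15),
       (3, 1, '2025-01-10', 20);

-- Scalar query: one row and one column, so a single value
SELECT AVG(Sales) AS AvgSales
FROM @Orders;

-- Row query (in the course's terminology): one column, multiple rows
SELECT CustomerID
FROM @Orders;

-- Table query: multiple rows and multiple columns
SELECT OrderID, OrderDate
FROM @Orders;
```

Any of these three can later be wrapped in parentheses and used as a subquery; the result type decides where in the main query it fits.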
All right. So with that we have learned
the different types of subqueries based
on the result type. Now we're going to
go and learn how to use the subqueries
in different locations in our query. So
we're going to start with how to use a
subquery in the FROM clause.
Okay. So we typically use
subqueries in the FROM clause in order
to create temporary result sets that act
as a table for the main query. So
in some scenarios we cannot use the
tables directly from the database. We
have to prepare the data somehow before we do
our actual query. Okay. So let's check
the syntax of the subquery inside the
FROM clause. So we start with the usual
stuff, where we go and say SELECT and a few
columns that we want to retrieve, and
then we say FROM. Usually after the
FROM comes the table name from our
database that we want to query. But this
time, instead of writing the table name,
we're going to have another SQL query.
So that means we don't define the table
name, we define another SELECT
statement, where we have again
SELECT a column from a specific table, and
then maybe we have a filter. And in
order now to tell SQL this is a
subquery, we have to use
parentheses. So we're going to have the
parentheses at the start and at the end.
This is a subquery. This is not the main
query. And after the parentheses, we can
go and define the alias for the results
that we're going to get from this
subquery. In some databases this alias
is optional, but for SQL Server
we have to go and specify an alias. So
it is a must in SQL Server. So, again,
we call this a subquery, and the outer
query we call the main query. So, this
is the syntax of the subquery in the
FROM clause.
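The syntax just described can be written out like this. It is a minimal sketch, assuming SQL Server; the table variable @Orders, its columns, the filter value, and the alias t are hypothetical placeholders.

```sql
-- Hypothetical sample data (names and values are assumptions).
DECLARE @Orders TABLE (OrderID INT, Sales INT);

INSERT INTO @Orders (OrderID, Sales)
VALUES (1, 10), (2, 25), (3, 40);

SELECT t.OrderID, t.Sales      -- main query: the columns we want to retrieve
FROM (                         -- opening parenthesis: a subquery instead of a table name
    SELECT OrderID, Sales      -- subquery: another SELECT statement
    FROM @Orders
    WHERE Sales > 20           -- optional filter inside the subquery
) AS t;                        -- closing parenthesis plus alias (required in SQL Server)
```

Removing the alias makes SQL Server reject the query with a syntax error, which is the "must" mentioned above.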
Okay, so now we have
the following task, and it says: find the
products that have a price higher than
the average price of all products. So
we're going to do it step by step, and
here we have two steps. The first one is
that we have to go and calculate the
average price of all products, and in the
second step we're going to use this
value in order to filter the table
products, in order to find the prices
that are higher than this average price.
So let's start with the first step, where
we're going to find the average price.
I'm going to select the following
information: product ID and
price from the table Sales.Products. So
let's go and execute it. So now we have
the products and as well the prices, and
we need this price here in order to
compare it with the average price. So
that means we need this price and,
side by side, we need the average
price. So that means we need
aggregations and details, and that's why
we're going to go with the window
function AVG. So let's go and do
this. This is very simple. So it's going
to be the average of the
price, and we don't want to partition the
data. So it's going to be an empty OVER,
and this is going to be the average price,
like this. So let's go and execute it.
And with that we have calculated the
average price. So now we have all the
information in the first step. We have
the average price, we have the price, and
as well the products.
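The first step can be sketched as follows, assuming SQL Server. The course runs this against Sales.Products; to keep the example self-contained, it uses a hypothetical table variable @Products with made-up values, and the column names ProductID and Price are inferred from the narration.

```sql
-- Hypothetical stand-in for Sales.Products (names and values are assumptions).
DECLARE @Products TABLE (ProductID INT, Price INT);

INSERT INTO @Products (ProductID, Price)
VALUES (101, 10), (102, 15), (103, 20), (104, 25), (105, 30);

-- Step 1: keep the details (ProductID, Price) and add the aggregation side by side.
-- AVG(...) OVER () uses an empty OVER clause, so the data is not partitioned
-- and every row gets the average of all prices.
SELECT
    ProductID,
    Price,
    AVG(Price) OVER () AS AvgPrice
FROM @Products;
```

With this sample data every row shows an AvgPrice of 20, next to its own price.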
So now the next
step is that we have to go and filter
the data to find all the products
where the price is higher than the
average. That means we will do this step
based on the information that we have
now. So that means we have to go and use
the logic of subquery and main query.
Since this is the first step to prepare
the data, we're going to use this as a
subquery. So we're going to call this a
subquery, like this. And we have to go
and use it in the main query. So how are we
going to do that? We have to go and
write the main query. So
I'm going to start over here: SELECT,
and then I will take all the columns,
FROM. So this is the main query. Let me
just make this a little bit smaller. And
what we're going to do now: the
main query is going to get the data from
the subquery. So the whole thing is going
to be used inside the FROM clause. So now,
in order to put the subquery inside the
main query, we have to go and use
parentheses. So we're going to have them
at the start and as well at the end. And
what we usually do: we go and add a
tab in order to understand, okay, this is
the subquery and then this is the main
query. So now one more thing that we
have to add for the whole subquery in
SQL Server: we have to give it
an alias. So you can go and give it any
name that you would like. I usually go
with only one character, the T. It
stands for table. So you can use
anything that you want. But in
SQL Server we have to give an alias to the
subquery.
So now what are we saying? We
are saying select everything from the
subquery. If you go over here and
execute it, you will get the exact same
results, because the main query is doing
nothing. It's saying just select
everything from the subquery. But now, in
order to solve the task, we are not
interested in all products. We are
interested only in the products where the
price is higher than the average. That's
why we have to go and use the WHERE
clause. So we're going to say where the
price is higher than the average price.
So this filtering is done in the main
query. It's not inside the subquery. So
now that means in the main query we are
doing something. Let's go and execute
it. And with that we solved the task. We
are getting now two products where the
price is higher than the average price.
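Putting both steps together, the finished query can be sketched like this. It assumes SQL Server and, instead of the course table Sales.Products, uses a hypothetical table variable @Products with made-up values; the column names and the alias AvgPrice are inferred from the narration, and t is the one-character alias mentioned above.

```sql
-- Hypothetical stand-in for Sales.Products (names and values are assumptions).
DECLARE @Products TABLE (ProductID INT, Price INT);

INSERT INTO @Products (ProductID, Price)
VALUES (101, 10), (102, 15), (103, 20), (104, 25), (105, 30);

-- Main query: select everything from the subquery and do the filtering
SELECT *
FROM (
    -- Subquery: prepare the price and the average price side by side
    SELECT
        ProductID,
        Price,
        AVG(Price) OVER () AS AvgPrice
    FROM @Products
) AS t                      -- alias for the subquery, required in SQL Server
WHERE Price > AvgPrice;     -- the filter lives in the main query, not in the subquery
```

With this sample data the average is 20, so the query returns products 104 and 105. Against the course database, @Products would be replaced by Sales.Products and the setup lines dropped.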
price is higher than the average price. So as you can see it's very simple. If
So as you can see it's very simple. If the task has multiple steps then we can
the task has multiple steps then we can do that using multiple sub queries until
do that using multiple sub queries until we have the main query and we can learn
we have the main query and we can learn from this that the subquery is here is
from this that the subquery is here is only to support the main query. So we
only to support the main query. So we are preparing here that all the data
are preparing here that all the data that we need in order to have the final
that we need in order to have the final result for the main query. So for this
result for the main query. So for this task we cannot go immediately
task we cannot go immediately calculating the results we have first.
calculating the results we have first. So for this kind of task we cannot
So for this kind of task we cannot immediately like put everything in one
immediately like put everything in one select query. We have first to prepare
select query. We have first to prepare the data in one subquery and then pass
the data in one subquery and then pass the values for the main query. And this
the values for the main query. And this is what we mean with the table subquery.
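
A minimal sketch of the finished query, written out for reference. The
subquery itself was built earlier in the lesson; here it is assumed to
compute the average with a window function, and the identifiers
(Sales.Products, ProductID, Price, AvgPrice, the alias t) are assumed
spellings, not confirmed by the transcript.

```sql
-- Main query: keep only the products priced above the average.
SELECT *
FROM (
    -- Subquery (assumed shape): each product next to the overall average price.
    SELECT
        ProductID,
        Price,
        AVG(Price) OVER () AS AvgPrice
    FROM Sales.Products
) AS t                   -- SQL Server requires an alias for a FROM subquery
WHERE Price > AvgPrice;  -- the filter lives in the main query, not the subquery
```
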
And here is one quick tip for you. If you would like to see the intermediate
results that we are getting from the subquery, you can go and highlight the
subquery itself, without the parentheses. So we are just highlighting the
subquery. You can go now and execute it. And with that, SQL will not go and
execute everything. SQL is going to execute only what you are highlighting.
So this is a really nice way to see the results of the subquery while you are
debugging or searching for errors. You can go and see the intermediate
results that are used by the main query. And of course, if you deselect, so
nothing is highlighted, and execute, SQL is going to go and execute the whole
query. So this is how we use the table subquery inside the FROM clause.
All right. So let's have another task, and it says: rank the customers based
on their total amount of sales. So again, if you check here, we have two
steps. First we have to find the total amount of sales, and then after that
we have to go and rank the customers. So again we have two steps, and we can
use subqueries in order to solve it. So let's start with the first step,
where we're going to find the total amount of sales. So let's go and select
the customer ID and as well the sales from the table Sales.Orders. Let's go
and execute it. So now in the output we have multiple customers and their
sales. We now have to find the total amount of sales for each customer. That
means we have to go and use GROUP BY. So we're going to go and summarize the
sales, call it total sales, and then group the data by the customer ID. So
like this. Let's go and execute it. Now, as you can see in the output, we
have four customers, and we have the total sales for each customer. And with
that we have solved the first step. We have the total amount of sales for
each customer, and we have now prepared the data for the next query in order
to rank the customers. So I think you are already getting how important
subqueries are for doing step-by-step analysis.
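
The first step described above could look roughly like this. The column
names CustomerID, Sales and TotalSales are assumed spellings of what is
said in the video.

```sql
-- Step 1: total sales per customer (this later becomes the subquery).
SELECT
    CustomerID,
    SUM(Sales) AS TotalSales
FROM Sales.Orders
GROUP BY CustomerID;
```
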
So this is our subquery. Now we need the main query. So I will start
preparing it. So, main query, like this. And let's go first and select
everything. So SELECT * FROM, let me just make this a little bit bigger, like
this. And now we have to go and convert this query to a subquery. So we need
the parentheses, the opening and the closing one, and for SQL Server I'm
going to give it an alias, and I would like to push everything to the right
side. So let's go and execute it. Perfect. So it is working. With that, the
subquery is passing the data in the FROM clause to the main query. Now of
course the main query is useless at the moment. It's just selecting the data.
We have to go and calculate the rank, and for that we have a very nice window
function. So we're going to go and use RANK. It doesn't need any parameters.
Then OVER, and we have to sort the data with ORDER BY. So we have to go and
sort the data by the total sales descending, from the highest to the lowest.
So we're going to go with the total sales and DESC. So now, as you can see,
we are using the total sales that we have already prepared in the subquery.
So without first preparing the data we would not be able to rank the
customers in the main query. So that's it. Let's go and execute it. And with
that SQL sorted our data, and we have a nice ranking based on the data that
we had from the subquery. So this is the customer with the highest sales, and
then customer number one, and so on. So again, in this task we have multiple
steps, and we use the power of subqueries in order to do it step by step. So
that's all on how to use the subquery inside the FROM clause.
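
Put together, the ranking query described above might be written as follows.
This is a sketch: CustomerRank and the alias t are hypothetical names, and
the other identifiers are assumed spellings.

```sql
-- Main query: rank the customers using the totals prepared by the subquery.
SELECT
    *,
    RANK() OVER (ORDER BY TotalSales DESC) AS CustomerRank
FROM (
    -- Subquery: total sales per customer.
    SELECT
        CustomerID,
        SUM(Sales) AS TotalSales
    FROM Sales.Orders
    GROUP BY CustomerID
) AS t;
```
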
Okay. So now let's see quickly how SQL executed our query. So we have here
our query, and we are querying the table orders. So the first step is that
SQL is going to go and identify the subquery, and then it is going to go and
execute it. So SQL is going to execute the subquery part, where we are
aggregating the data based on the customer ID. So once the subquery is
executed, the next step is that the result is going to be kept as an
intermediate result. So these results we will not see in the output. They are
going to be temporarily saved in memory. So now the next step is that SQL is
going to go to the main query and execute it based on the intermediate
results. So that means the main query will not go back to the original table.
It's going to go and query the intermediate results. So here what SQL is
going to do is rank the intermediate results by introducing a new column
where we see the ranks 1, 2, 3, 4, and the output of the main query is going
to be the final result. So as you can see, it's very simple. First SQL
executes the subquery, and the result of the subquery is going to be used in
the main query, and once the main query is executed we get the final results.
So the subquery here is only supporting the main query.
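
One way to picture these steps is to do them by hand with a temporary table.
This is only an illustration of the mental model, not a claim about what the
engine physically does, since the optimizer is free to plan the query
differently. The temp table name #CustomerTotals is hypothetical.

```sql
-- Step 1: run the "subquery" and keep its result as an intermediate table.
SELECT
    CustomerID,
    SUM(Sales) AS TotalSales
INTO #CustomerTotals
FROM Sales.Orders
GROUP BY CustomerID;

-- Step 2: the "main query" reads the intermediate result, not Sales.Orders.
SELECT
    *,
    RANK() OVER (ORDER BY TotalSales DESC) AS CustomerRank
FROM #CustomerTotals;

-- Step 3: the intermediate result is thrown away afterwards.
DROP TABLE #CustomerTotals;
```
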
So those are the steps that SQL uses in order to execute subqueries.
So now let's understand how the database server executes subqueries behind
the scenes. Let's go. So now let's say that you are a data analyst and you
are writing a query at the client side where you have a subquery inside the
main query. So once you go and execute it, what's going to happen? The
database engine is going to go and identify the subquery, and in this
situation the database is going to execute the subquery first. So here the
subquery is selecting and retrieving data from the table orders. So that
means the database has to retrieve the data from the disk storage, from the
user data. So now once the subquery is executed, the intermediate results are
going to be stored in the cache. So this means the result of the subquery is
temporary and as well very fast to retrieve. And now once the database engine
is done with the subquery, it is going to go and start executing the main
query. In this scenario it depends completely on the result of the subquery.
So that means the main query is going to go and interact with the cache
storage. So this means the data is going to be retrieved very fast from the
result of the subquery. Once it's done, it's going to forward the result to
the database engine, and the database engine is going to forward the results
to the client side. And at your side you will find the final result. And of
course, once everything is executed, the database engine is going to go and
clean up the cache. So the subquery results are going to be destroyed and
removed completely from the cache in order to free up space for other
queries.
So this is how the database server executes subqueries behind the scenes.
All right. So now we're going to talk about how to use the subquery in the
SELECT clause. So we typically use subqueries in the SELECT clause to
aggregate data side by side with the columns of the main query. Okay. So
let's check the syntax of the subquery in the SELECT clause. So we start with
the simple stuff, where we say: okay, let's go and select a column that we
want to retrieve from a specific table. So nothing new, we are just querying
a table. And now what we can do in this query is that not only can we select
the columns from a specific table, we can also insert here, inside the
SELECT, another query, a full query with SELECT, FROM and WHERE. So again
it's a query inside another query, and we call this, of course, a subquery.
In order to tell SQL this is a subquery, we go and add the parentheses. With
that, SQL is going to understand: this is a subquery, and the result of this
query is going to be used in the SELECT. So we can handle it like any other
column. We can go and give it an alias. Here it is optional, not mandatory,
to add an alias. So this inner query we call a subquery, and the outer query
is going to be the main query. So this is how you put a subquery in the
SELECT clause. But there is one rule for this query: the subquery must be a
scalar subquery. That means the result must be a single value, because
otherwise it will not work. SQL here is expecting only one value. So this is
how we use the subquery inside the SELECT clause.
All right, let's have the following task, and it says: show the product IDs,
product names, prices, and the total number of orders. So now if we check the
task, there are two parts. The first part is that we are showing the details
about the products, and the second part is that we have to go and calculate
the total number of orders. So let's see what we're going to do. First let's
go and solve the simple part here, where we have the product IDs, product
names and prices. So we're going to go and select the product ID and the
product and then the price from the table Sales.Products. Let's go and
execute it. So with that we have solved the first part of the task. We have
the details about the products. Now we go and solve the second part. We have
to go and calculate the total number of orders. Now this information comes
from a different table than the products. We cannot calculate it from
products. We have to go and query the orders. So now what am I going to do?
I'm going to go and calculate this part in a separate query instead of having
it here inside the products query. So let's have a semicolon in order to have
a second query. So we're going to go and select the total number of orders.
That means we can simply do COUNT(*) from the table Sales.Orders. Let me just
make it a little bit bigger. So we're going to call it total orders, and a
semicolon as well. So now if you just execute the whole thing, you will get
two parts in the results. First you have the details of the products, and in
the second part we have now the total number of orders. We have 10 orders.
But now with that we have two different queries, separated from each other,
and we have two different results. But in the task we have to show all that
information in one result.
So now what we can do is put one query inside another query. So now if you
check the second query, the total orders, you can see we have only a single
value. So we have a scalar subquery. That's why we can go with this as a
subquery, like this. And I'm going to go and put everything in one line in
order to see it. So let's remove the semicolons. We don't need them. And now
what we're going to do is take the whole thing and put it inside the main
query. So this is the main query. And now think about it as a new column. So
I will put the query here. So it is just one new column in our SELECT. But in
order to have it as a subquery, we have to use the parentheses at the start
and at the end. And of course, we have to go and give it a name. So I'm going
to go and use the same name over here. So it's going to be AS total orders.
So with that, the setup for the subquery is ready, and it is inside the
SELECT clause in the main query. Let's go and execute it. Now, as you can
see, we have everything together. We have the three pieces of information,
the product details, side by side with the total orders, and since it is
always the same value, it is going to be repeated for each row. So this is
what we call a scalar subquery inside the SELECT clause.
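
As a sketch, the combined query described above could be written like this.
The identifiers ProductID, Product, Price and TotalOrders are assumed
spellings of the names used in the video.

```sql
-- Main query: product details from Sales.Products.
SELECT
    ProductID,
    Product,
    Price,
    -- Scalar subquery: one value, repeated on every row of the result.
    (SELECT COUNT(*) FROM Sales.Orders) AS TotalOrders
FROM Sales.Products;
```
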
And here again, it is very important to understand: if you are using a
subquery inside the SELECT clause, only a scalar subquery is allowed. So for
example, instead of having one value from the aggregation, we can go and use
the order ID. So let's see what is going to happen. We will get an error. It
is going to say the subquery is returning more than one value, and this is
not allowed because we are using the subquery in the SELECT clause. So that's
why we have to have only one value, and by using the aggregation you will get
one value. So let's repair it. And it's working.
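
For illustration, here is the broken variant described above. It is meant to
fail: assuming Sales.Orders holds more than one row (the video shows 10),
SQL Server rejects it with an error saying the subquery returned more than
one value. The column name OrderID is an assumed spelling.

```sql
-- Intentionally wrong: the subquery returns one row per order,
-- but a subquery in the SELECT list must return a single value.
SELECT
    ProductID,
    Product,
    Price,
    (SELECT OrderID FROM Sales.Orders) AS TotalOrders
FROM Sales.Products;
```

Switching the subquery back to an aggregate such as COUNT(*) restores a
single value and the query runs again.
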
And now again, if you would like to see only the results from the subquery,
what you can do is highlight the subquery like this, without the parentheses
of course, and go and execute it. And with that you can see in the output the
10. This is the intermediate result that's going to be passed to the main
query. And if you want the whole thing to be executed, just unmark it and
execute, and with that everything is executed, the subquery and the main
query. So this is the scalar subquery in the SELECT clause.
Okay, so now let's see quickly how SQL executed this query step by step. So
this is our original query, and we need two tables from our database for it.
So the first step is that SQL is going to go and identify the subquery and
execute it. So this is the first step. The subquery is targeting the orders
table, and we are simply doing a count. So in the output we will get an
intermediate result where we are counting the number of rows of the orders.
Now the next step is that SQL is going to go and pass this value to the main
query. So this is the second step, and if you go and pass this value to the
main query, it's going to look like this. So you are saying product ID,
product and the 10. So after SQL has prepared the main query, SQL is going to
go and execute it. So this time we are targeting the products, and in the
output we will get all the information from the products without any filter,
because here we don't have any WHERE clause, and the final result we will get
like this. So we will have the product ID, the product and the total that we
got from the subquery. So as you can see, the subquery here is a scalar
subquery, where we have only one single value. So again it's very simple. SQL
always starts with the subquery, and then it's going to go and pass the
values to the main query, and at the end the main query is going to be
executed and we will get the final result from it. So this is how SQL
executed our query.
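
The second step above, where the value is passed into the main query, can be
pictured as the subquery being replaced by its result. A sketch, using the
count of 10 from the video and assumed column spellings:

```sql
-- What the main query effectively looks like once the subquery
-- has been evaluated: the scalar subquery becomes the constant 10.
SELECT
    ProductID,
    Product,
    Price,
    10 AS TotalOrders
FROM Sales.Products;
```
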
All right, next we're going to talk about how to use the subquery in the JOIN
clause. All right, so now as we are joining tables in SQL, sometimes we have
to go and prepare the data before doing the join, to dynamically create a
result set for joining with another table. So again, here we cannot join the
tables directly. We have to do a preparation step before doing the join.
Okay, let's have the following task, and it says: show all customer details
and find the total orders of each customer. Now, of course, in SQL you don't
have only one solution, you have multiple solutions. But I would like to
solve this task using a subquery. So now if you check the task, we have two
parts. In the first part we have to show all the customer details. And in the
second part we have an aggregation: find the total orders of each customer.
So now let's solve those different parts using two different queries. Let's
start with the easiest one: show all customer details. So I think this is
very simple. So SELECT * FROM Sales.Customers. So let's go and execute it. So
in the output we have all the details about the customers, and we have solved
the first part. Very simple. Now let's go and solve the second part. We have
to find the total number of orders of each customer. That means, let me just
have a semicolon over here, we have to go to the table orders. So let's go
and select first the order ID and customer ID from the table Sales.Orders,
like this. So I will just highlight the second query and execute it. Now in
the output we have 10 orders, and we have the different customers. Now in
order to find the total orders for each customer, we have to go and use GROUP
BY. In order to do that, it's very simple. We're going to go over here and
say COUNT, let's go with the star, and then we're going to go and group the
data by the customer ID. I will go and call this total orders. So let's go
and execute only this part, and with that we have four customers and we have
the total number of orders. So with that we have solved the second part of
the task.
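
At this point the two parts exist as two independent queries. A sketch of
both, with assumed identifier spellings:

```sql
-- Part 1: all customer details.
SELECT *
FROM Sales.Customers;

-- Part 2: total number of orders per customer.
SELECT
    CustomerID,
    COUNT(*) AS TotalOrders
FROM Sales.Orders
GROUP BY CustomerID;
```
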
solved the second part of the task. So now what I'm going to do, I'm going to
now what I'm going to do, I'm going to go and execute both of those queries
go and execute both of those queries using the semicolon separately like
using the semicolon separately like this. I will just make this a little bit
this. I will just make this a little bit bigger. So let's go and execute it. Now
bigger. So let's go and execute it. Now in the output we have the two results,
in the output we have the two results, all details about the customers and the
all details about the customers and the total number of orders for each
total number of orders for each customer. So now what we want to do is
customer. So now what we want to do is to go and combine those two results in
to go and combine those two results in one. And in order to do that we can use
one. And in order to do that we can use the joins. So now we have to think about
the joins. So now we have to think about what is the first query, what is the
what is the first query, what is the second query. Since the first query
second query. Since the first query returns all the customers that we have
returns all the customers that we have in the database, I would like to have
in the database, I would like to have this as the left table and since in the
this as the left table and since in the second query we have only four
second query we have only four customers, I would like to have it then
customers, I would like to have it then as the right table and I will go with
as the right table and I will go with the left join so that I don't miss any
the left join so that I don't miss any customer because if I do the inner join,
customer because if I do the inner join, I will lose the customer number five. So
I will lose the customer number five. So let's go and do that. So this is the
let's go and do that. So this is the first query in the main query. So I'm
first query in the main query. So I'm going to call this main query.
And now I'm going to give it an alias as well, like c. Now we're going to join this
table from the database with the result, the output, of this query. That means we do it
like this: LEFT JOIN, and now we're joining with a subquery, so we need our
parentheses. I will just put a few spaces here so that it's clear it is a subquery, and
we need an alias for it, so let's say for example o. With that we are joining a table
with the result of a subquery. What is still missing, of course, is joining the tables
using a key. If you check the two results, you can see that in both queries we have the
customer ID. That's why we're going to join on the customer ID: ON, then the customer
ID from the customers with the customer ID from the subquery, like this. So we have
everything; let's go and execute it. As you can see in the output, we have all the
details about the customers together with the total number of orders for each customer,
and as you can see we didn't miss any customer. We have all the customers from the
database, and we can see that Anna doesn't have any orders. Now you might say: you know
what, we have the customer ID here twice. So what I'm going to do is select all the
columns from the customers, but from the subquery I'm interested only in the total
orders. Like this. Let's go and execute it. Let's make this a little bit smaller. Now
the results are really clean: we have all details from the customers and as well the
total orders of each customer.
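
The query built up in this walkthrough can be sketched as follows. This is a minimal,
runnable sketch rather than the course's own script: it uses Python's built-in sqlite3
module instead of the SQL Server setup shown in the video, the table and column names
(customers, orders, customer_id, total_orders) are assumptions, and the sample rows are
invented so that they match the numbers mentioned above (five customers, ten orders,
and customer five, Anna, without any orders).

```python
import sqlite3

# Hypothetical sample data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Maria', 'Germany'), (2, 'John', 'USA'), (3, 'Georg', 'UK'),
        (4, 'Martin', 'Germany'), (5, 'Anna', 'USA');
    INSERT INTO orders VALUES
        (1, 2), (2, 3), (3, 1), (4, 1), (5, 2),
        (6, 3), (7, 1), (8, 4), (9, 2), (10, 3);
""")

query = """
    SELECT c.*, o.total_orders            -- main query: all customer columns
    FROM customers AS c
    LEFT JOIN (
        -- subquery: total number of orders per customer
        SELECT customer_id, COUNT(*) AS total_orders
        FROM orders
        GROUP BY customer_id
    ) AS o
        ON c.customer_id = o.customer_id  -- join key present in both results
    ORDER BY c.customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because of the LEFT JOIN, the customer without orders still appears, with NULL (None in
Python) in the total_orders column. Swapping in an INNER JOIN would drop that row.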
And of course, as we learned, if you would like to check the results from only the
subquery, you highlight it and execute it. So as you can see, you can put subqueries
almost everywhere, and this is how we use subqueries inside joins. Okay. So now we're
going to focus on how to use the subquery in the WHERE clause. In SQL, as we learned,
we can filter tables using the WHERE clause with static values. But in real data
projects we're going to filter the data based on complex logic. In order to prepare
this complex logic, we use subqueries to make dynamic filtering for our main tables.
Now, in order to filter data using the WHERE clause we have to use operators, and we
can split them into two groups: the comparison operators, and another set that we can
call logical operators, or sometimes subquery operators. First we're going to talk
about the comparison operators. They are operators that we can use to compare two
values, which helps us filter the data based on a specific condition. In the SQL basics
we learned that we have different comparison operators, and they are very simple. To
compare two values we have operators like equal, and as well not equal, the opposite.
We have greater than, less than, and as well greater than or equal to, and the last
one, less than or equal to. So they are very simple. Now instead of comparing two
values, we're going to compare a value with the result of a subquery using the
comparison operators. All right,
let's check the syntax of the subquery inside the WHERE clause using the comparison
operators. We start with the standard stuff, where we SELECT a few columns that we want
to retrieve, and we get the data directly FROM a specific table in our database. Now we
come to the WHERE condition, where we want to filter the table. So we say WHERE and
then we select a specific column from table one. Since we are talking about the
comparison operators, we can go with an operator, for example equal, and usually we
specify here a static value like a number or a string. But instead of having a static
value, we can get the value from another SELECT statement, another query, like here,
selecting a column from table two with a filter. Now whatever comes from this subquery
is going to be used to filter table one. And of course we are telling SQL this is a
subquery by putting parentheses at the start and at the end, and the outer query is
going to be the main query. So as you can see, we are using the subquery in order to
filter the main query. And here in SQL, if you're using a subquery with the comparison
operators, we have a rule: the subquery must be a scalar subquery, so only one single
value. That's all about how to use the subquery in the WHERE clause using the
comparison operators. All right. So now we have
again the same task, and it says: find the products that have a price higher than the
average price of all products. We have solved this task already using the subquery
inside the FROM clause, but now we're going to solve it again using a subquery, this
time inside the WHERE clause. So let's do it step by step. Let's get the information
that we need: we need the product ID and the price from the table Sales.Products. So
let's go and execute it. Now we got the list of all products, but we have to filter
this information using the column price. In the result we got all the products, but we
don't need all of them. We need only the products where the price is higher than the
average. That means we have to filter the table based on the values of the price. In
order to do that, we're going to use the WHERE clause and filter the data based on the
price, and since we need "higher than", we use the comparison operator greater than.
Next we need the value, the average price. So how are we going to do it? We don't have
the average price out of the box in the table products; we have to calculate it. That's
why we're going to write another query where we find the average price from the table
Sales.Products, like this. Now let's highlight it and execute it. With that we now have
the average price of our products, and as you can see, in the output we have only one
single value. So this is a scalar query. Now what do we need? We need this value to be
used in order to filter the first query.
So that's why the first query is the main query, and the second one is the subquery
that is going to support the main query in order to filter the data. What we're going
to do now is take the subquery and use it in the WHERE clause. And of course we have to
tell SQL this is a subquery; that's why we have to put it inside parentheses. With that
we have the subquery inside the WHERE clause in order to filter the main query. So
let's go and execute it. And now, as you can see, in the output we have only two
products where the price is higher than the average price. With that we have solved
the task, but this time using the subquery in the WHERE clause in order to filter the
main query. And of course, in order to see this value in our SELECT, since it is a
scalar subquery we can as well go over here and put it in our SELECT just to see the
value, and call it average price. So let's go and execute it. And with that we can see
the average price in our results as well.
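
The finished query from this task can be sketched like this. It is a hedged,
self-contained illustration using Python's built-in sqlite3 module rather than the SQL
Server environment from the video. The table name products, the column names, and the
five prices are assumptions, chosen so that the average comes out as 20 and two
products lie above it, as described in the walkthrough.

```python
import sqlite3

# Hypothetical sample data: five invented products whose average price is 20.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, price REAL);
    INSERT INTO products VALUES (101, 10), (102, 15), (103, 20), (104, 25), (105, 30);
""")

query = """
    SELECT product_id,
           price,
           (SELECT AVG(price) FROM products) AS avg_price  -- scalar subquery in SELECT
    FROM products
    WHERE price > (SELECT AVG(price) FROM products)        -- scalar subquery in WHERE
    ORDER BY product_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The subquery returns exactly one row and one column, which is what makes it usable on
the right-hand side of a comparison operator such as greater than.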
So this is how we use the subquery in the WHERE clause using a comparison operator.
Okay. So let's see quickly how SQL is going to execute our query step by step. As
usual, first SQL is going to identify the subquery; it's going to be our SELECT average
price and so on. In the next step, SQL is going to execute our subquery. It is based on
the products, and since we are doing an aggregation without GROUP BY, in the output we
will get only one value. So the average is going to be 20. This value is stored
intermediately in memory, so we will not see it in the output. SQL is going to pass
this value to the main query, so the main query is going to look like this: we are
selecting a few columns from the table and we are filtering the data based on the price
being higher than the value 20 that we got from the subquery. Now once SQL has
everything for the main query, SQL is going to execute it. SQL goes to the products and
selects only the products where the price is higher than 20. So it's only those two
rows, and in the output we will get the final result, the two products.
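
The steps described here can be imitated by hand. The sketch below is only a conceptual
model of that explanation, not a statement about how a real database engine plans the
query internally. It uses Python's sqlite3 module, and the products table with its five
prices is invented sample data.

```python
import sqlite3

# Hypothetical sample data: five invented products whose average price is 20.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, price REAL);
    INSERT INTO products VALUES (101, 10), (102, 15), (103, 20), (104, 25), (105, 30);
""")

# Step 1: run the subquery on its own. An aggregate without GROUP BY
# returns a single row with a single value.
avg_price = conn.execute("SELECT AVG(price) FROM products").fetchone()[0]

# Step 2: pass that intermediate value to the main query as a plain number.
main_rows = conn.execute(
    "SELECT product_id, price FROM products WHERE price > ? ORDER BY product_id",
    (avg_price,),
).fetchall()

# For comparison: the single statement with the subquery inside the WHERE clause.
nested_rows = conn.execute("""
    SELECT product_id, price
    FROM products
    WHERE price > (SELECT AVG(price) FROM products)
    ORDER BY product_id
""").fetchall()

print(avg_price)
print(main_rows)
```

Both routes give the same two products, which is the point of the explanation: the
subquery result behaves like a value that is handed to the main query.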
So product ID and product price. That's it. It's very simple. This is how SQL executed
our query: as usual, first starting with the subquery, passing the value to the main
query, and at the end the main query is executed with the information from the subquery
and we get the final results. So that's it. All right. So now we're going to talk about
the second group of operators, and we're going to start with the IN operator. So what
is the IN operator? As we learned before, with the comparison operators we can filter
the data based on only one single value. But in some scenarios we have to filter the
data based on multiple values, not only one. In this case we can use the IN operator.
If you use the IN operator, it checks whether the value matches any value from a list,
so a list of multiple values. If it matches any of them, we get a true. Okay. So now
let's have a quick look at the syntax of the subquery using the IN operator. We start
with the classic stuff, where we say we would like to retrieve column one and column
two from table one, and we want to filter the data based on a column from table one.
After specifying the column we use the IN operator, and after that we could specify
static values, but since we are talking about subqueries, the values are going to come
from another query. So here we have another SELECT statement from table two, and we
filter the data for this query. The result of this subquery is going to be used to
filter the data using the IN operator. And the big difference between the IN operator
and the comparison operators is that the subquery is allowed to have multiple rows.
There is no rule about having one single value, a scalar subquery; we can have a list
of multiple values in the result. So this is the syntax of the subquery using the IN
operator.
All right, let's practice using this task. It says: show the details of orders made by
customers in Germany. So let's see how we can solve this task. First it needs the
details of orders, and as we know we have the table Sales.Orders. So let's go and
execute it. In the output we have all orders with all details. But for the task we
don't need all the orders; we need only the orders that were made by customers from
Germany. Now if you check the table orders, you don't find any information about the
countries, right? So we have to get it from another table, and as we know, we can find
this information in the table customers. So let's build another query. Let's say
SELECT * FROM Sales.Customers, like this, and execute only the second query, like this.
Now, as you can see, in the customers we have the country column, and this is exactly
what we need. So now let's make a list of all customers from Germany. We don't need all
customers; we need only the ones that come from Germany.
That's why we're going to use the WHERE clause and say country equal to the value
Germany, like this. So let's execute it again and check the results. Now in the output
we have our German customers, number one and number four. Now we're going to use this
information in order to filter the table orders. So let's go back to the table orders
over here. Here we have the customer ID information, and as we can see, we need the
orders where the customer is either one or four. In order to filter that, we go to the
first query and use the WHERE clause, like this, and say the customer ID. Now since we
have two values, one and four, we can use the operator IN. So let's use IN and build
the list: let's have the one and four. So let's go and execute it. Now we can see the
results: we have the orders, but only from customers one and four. With that we have
solved the task; we have the details of orders made by customers in Germany. Right? But
of course this is a really bad solution, because what if we get a new customer in the
future? You don't want to keep adding values here each time you have a new customer. We
want the values for this list to be dynamic. So we don't need static values, we need
dynamic values, and we can use subqueries in order to retrieve that information. Right?
And we have it already in the second query. So let's go back to the second query over
here. We need only those two values, one and four. That's why we go to the query and
say, okay, let's retrieve the customer ID. So let's execute it again. And with that we
have the one and four, exactly like we have it here in the first query. And of course
in the future, if there's another customer that comes from Germany, this list is going
to be a little bit longer.
So this query is always going to retrieve all the customer IDs that have the country
equal to Germany. What we're going to do now is take this as a subquery. Let's take
everything from it and put it in place of those static values. Of course we put a few
spaces to the right side so that it's clear this is a subquery, and of course here we
don't use any aliases. So what we are doing now: the results from this subquery are
going to be used in order to filter our main query. So let me just call it main query,
like this, and make this smaller.
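
The query at this point can be sketched as follows, next to the static-list version it
replaces. This is an illustrative sketch using Python's built-in sqlite3 module, not
the SQL Server script from the video. Table names, column names, and all rows are
invented, arranged so that customers one and four are the German ones, as in the
walkthrough.

```python
import sqlite3

# Hypothetical sample data: customers 1 and 4 are from Germany.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'Germany'), (2, 'USA'), (3, 'UK'), (4, 'Germany'), (5, 'USA');
    INSERT INTO orders VALUES
        (1, 2), (2, 3), (3, 1), (4, 1), (5, 2),
        (6, 3), (7, 1), (8, 4), (9, 2), (10, 3);
""")

# Static list: works today, but has to be edited for every new German customer.
static_sql = "SELECT * FROM orders WHERE customer_id IN (1, 4) ORDER BY order_id"

# Dynamic list: the subquery produces the customer IDs at query time.
dynamic_sql = """
    SELECT *
    FROM orders
    WHERE customer_id IN (
        SELECT customer_id FROM customers WHERE country = 'Germany'
    )
    ORDER BY order_id
"""
static_rows = conn.execute(static_sql).fetchall()
dynamic_rows = conn.execute(dynamic_sql).fetchall()

# A new German customer places an order: only the dynamic query picks it up.
conn.execute("INSERT INTO customers VALUES (6, 'Germany')")
conn.execute("INSERT INTO orders VALUES (11, 6)")
static_after = conn.execute(static_sql).fetchall()
dynamic_after = conn.execute(dynamic_sql).fetchall()

print(dynamic_rows)
print(dynamic_after)
```

Note that the subquery used with IN has no alias and returns a single column, possibly
with many rows.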
query like this and make this smaller. So let's go and execute it. And now we
So let's go and execute it. And now we are getting the same results. We are
are getting the same results. We are getting all the orders from only the
getting all the orders from only the customers one and four where they come
customers one and four, where they come from Germany. This information comes
dynamically from the subquery, so we don't have to worry about new customers
from Germany: they are going to be added here automatically, and this query is
always going to return all the orders from Germany. So this is the power of the
subquery together with the IN operator when you have multiple values, multiple
rows. So we have solved the task.

All right. Now one more thing. Let's say that the task is exactly the opposite.
It says: show the details of orders made by customers who don't come from
Germany. There are two ways to do it. Either you go to the subquery and say the
country should not be equal to Germany. If you execute it, you will get all the
customer IDs that are not from Germany, and if you execute the whole thing, you
will get all the orders where the customers are not from Germany. Or you stay
with equal to Germany, but you invert the whole logic by using the NOT operator.
So now we are saying the customer ID should not be equal to any of those values,
so it should not be equal to one or four, and for that we use the NOT IN
operator. So let's go and execute it. With that we get all the orders where the
customers don't come from Germany, just by using the NOT IN operator. So that's
all about the NOT IN and the IN operators.
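The IN and NOT IN variants described above can be tried out with a small sketch.
It uses Python's built-in sqlite3 module only so that it runs anywhere; the
table names, column names and rows are assumptions made for illustration, not
the course database.

```python
import sqlite3

# Hypothetical miniature of the customers/orders tables (assumed names and rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Germany'), (2, 'USA'), (3, 'USA'), (4, 'Germany');
    INSERT INTO orders VALUES (1, 2), (2, 3), (3, 1), (4, 1), (5, 2), (6, 3), (7, 1), (8, 4);
""")

# Orders from German customers: the subquery returns the list (1, 4).
german_orders = conn.execute("""
    SELECT order_id, customer_id
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'Germany')
    ORDER BY order_id
""").fetchall()

# Opposite task, way 1: flip the condition inside the subquery.
other_orders_v1 = conn.execute("""
    SELECT order_id, customer_id
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM customers WHERE country <> 'Germany')
    ORDER BY order_id
""").fetchall()

# Opposite task, way 2: keep the subquery and negate the membership test.
other_orders_v2 = conn.execute("""
    SELECT order_id, customer_id
    FROM orders
    WHERE customer_id NOT IN (SELECT customer_id FROM customers WHERE country = 'Germany')
    ORDER BY order_id
""").fetchall()

print(german_orders)
print(other_orders_v1 == other_orders_v2)
```

Both ways agree here because no country in the sample is NULL. With a NULL
country, the first way would drop that customer's orders while the second would
keep them, so the two are not interchangeable in general.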
All right. So now let's see step by step how SQL executes our query. We are
targeting two tables, the customers and the orders. The first step is that SQL
identifies the subquery and executes it. The subquery here filters the data
based on the country, and in the output we get only two rows. So it is one
column with multiple rows. This is a row subquery, and this is our intermediate
result, which is passed to the main query.

Our main query looks like this: we select a few pieces of information from the
orders, and we filter the orders table based on the customer ID, where we say
the customer ID must be one of those values, one or four. So the subquery here
supplies the main query with the values for the filter.

Now once SQL has everything, it executes the main query, and this goes as
follows. We start with the first row, and here the customer ID is equal to two.
The value two is not equal to one or four, which is why this row is excluded
from the final results. Now let's move to the second row. Here we have the value
three, and three is not equal to any of those values either, so this row fails
as well and we will not have it in the output. Then SQL goes to the next one.
This time the customer ID is one, and it is equal to one of those values. So we
have a match, and this row is included in the results. The same goes for the
next row, because it also has the customer ID one, and so on. After SQL has
checked whether each of those customer IDs is in the list one or four, we get
the final results, with all the orders where the customer ID is either one or
four. So this is how SQL executes the IN operator using subqueries.

Okay. So now moving on to the ANY operator.
So we can use the ANY operator to check whether a value matches any value from a
list. That means we can use it to check whether a condition is true for at least
one of the values in a list.

Okay. So now let's quickly check the syntax of the subquery using the ANY and
ALL operators. As we learned before, we can use a subquery inside the WHERE
clause to filter the main query using the comparison operators, like here less
than. The syntax of the ANY operator is that you use the comparison operator and
immediately after that you use the keyword ANY. For the ALL operator it is
exactly the same: after the comparison operator you put the keyword ALL. So the
syntax is very simple. We just add those keywords. So let's practice using the
following task.
Find female employees whose salaries are greater than the salaries of any male
employee. That means we want to compare the salaries between the male and female
employees, and specifically we are searching for female employees whose salary
is greater than that of at least one male employee.

So let's solve it step by step. Let's start by selecting a few columns, for
example the employee ID, first name, gender and salary, from the table
Sales.Employees. So let's go and execute it. Now we have five employees: three
of them are male and two are female. Since we want to compare the data between
male and female, let's create two queries. The first one filters the data based
on the gender, so the first one is for the female employees, and we can remove
this information over here. Let me just make this a little bit smaller and zoom
out. The second query is the exact opposite: let's get the employee information
for the male employees. So let's go and execute it. Now the first result is for
the female employees and the second one is for the male employees. So what do we
need in the output? We need the female employees.
That means this is going to be our main query. So we are focusing on the female
employees, and we are using the male employees only as a filter. What we need
from them is only the salary information, which is why we can prepare it like
this. I will just put everything on one line to make it clear. So this is going
to be our subquery.

Now we work with the main query, where we add one more filter based on the
salary. We say the salary is greater than, and now we need the values from the
subquery. So this is our subquery; we put it like this, and don't forget the
parentheses at the start and at the end. I would still like to keep those two
queries. So let's go ahead and execute it. Now we get an error, and that's
because our subquery returns multiple rows, which is not acceptable: we are
using a comparison operator, and SQL expects a scalar subquery there, so only
one single value.

To solve this issue, we can use the logical operators, either ALL or ANY. Since
we are saying it's enough for the salary of the female employee to be higher
than at least one male employee, we go with the ANY operator. So let's go after
the comparison operator and add the keyword ANY, and let's execute it again. As
you can see in the output, we got only one female employee, whose salary is
higher than that of at least one of those male employees.
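A minimal sketch of this ANY filter, under some loud assumptions: the table
name, column names and salary figures are made up for illustration, and SQLite
(used here only because it ships with Python) does not implement ANY. The ANY
form is therefore kept as a reference string, and the sketch runs an equivalent
comparison against the smallest male salary, then cross-checks it with Python's
any().

```python
import sqlite3

# Hypothetical employees table; names match the lesson, salaries are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY, first_name TEXT, gender TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Frank', 'M', 55000), (2, 'Kevin', 'M', 65000), (3, 'Mary', 'F', 75000),
        (4, 'Michael', 'M', 90000), (5, 'Carol', 'F', 50000);
""")

# Reference only: the shape used in the lesson, for engines that implement ANY
# (for example SQL Server or PostgreSQL). It is not executed here.
ANY_FORM = """
    SELECT employee_id, first_name, salary
    FROM employees
    WHERE gender = 'F'
      AND salary > ANY (SELECT salary FROM employees WHERE gender = 'M')
"""

# "Greater than at least one value" is the same as "greater than the smallest
# value", as long as the subquery returns at least one non-NULL row.
any_result = conn.execute("""
    SELECT employee_id, first_name, salary
    FROM employees
    WHERE gender = 'F'
      AND salary > (SELECT MIN(salary) FROM employees WHERE gender = 'M')
""").fetchall()

# Cross-check the semantics row by row with Python's any().
male_salaries = [s for (s,) in conn.execute(
    "SELECT salary FROM employees WHERE gender = 'M'")]
females = conn.execute(
    "SELECT employee_id, first_name, salary FROM employees WHERE gender = 'F'"
).fetchall()
any_check = [row for row in females if any(row[2] > m for m in male_salaries)]

print(any_result)
```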
So let me just get the first name as well from the second query, just to have it
like this. Now if you compare the salary of Mary, it is not higher than
Michael's, but it is higher than Frank's and Kevin's. Since we are using the ANY
operator, it's enough for Mary to have a salary higher than one of those values.
In this case, it's higher than both Frank's and Kevin's, and the condition is
fulfilled.

That's why we are getting Mary. And the other female employee, let me just
check: we have Carol, and her salary is less than the salaries of all the male
employees. To be included, it must be higher than at least one of them. So with
that, we have solved the task, right?

All right. So now we have another operator that is similar. We call it the ALL
operator. We can use it to check whether a value matches all values in a list.
That means we can use it if we need to check whether a condition is true for
every value in a list. I know that might sound a little bit complicated, but
don't worry about it, we can look at an example. Let's say that our task says:
find female employees whose salaries are greater than the salaries of all male
employees. That means the condition is now more restrictive. Mary should now
have a salary higher than every male employee, so it should be higher than all
those values that we have from the male employees. And of course in this
scenario it is not, because we have Michael. Mary has a lower salary than
Michael, and this is a problem because Mary should have a higher salary than
everyone. So let's go and try it.
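A matching sketch for the ALL variant, under the same assumptions as before:
hypothetical table and column names, made-up salaries, and SQLite standing in
for the course database. SQLite does not implement ALL, so the ALL form is kept
as a reference string and the sketch compares against the largest male salary,
cross-checked with Python's all().

```python
import sqlite3

# Hypothetical employees table; names match the lesson, salaries are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY, first_name TEXT, gender TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Frank', 'M', 55000), (2, 'Kevin', 'M', 65000), (3, 'Mary', 'F', 75000),
        (4, 'Michael', 'M', 90000), (5, 'Carol', 'F', 50000);
""")

# Reference only: the shape for engines that implement ALL. Not executed here.
ALL_FORM = """
    SELECT employee_id, first_name, salary
    FROM employees
    WHERE gender = 'F'
      AND salary > ALL (SELECT salary FROM employees WHERE gender = 'M')
"""

# "Greater than every value" is the same as "greater than the largest value",
# as long as the subquery returns at least one row and no NULLs. (On an empty
# subquery, ALL is true for every row, while the MAX comparison matches none.)
all_result = conn.execute("""
    SELECT employee_id, first_name, salary
    FROM employees
    WHERE gender = 'F'
      AND salary > (SELECT MAX(salary) FROM employees WHERE gender = 'M')
""").fetchall()

# Cross-check the semantics row by row with Python's all().
male_salaries = [s for (s,) in conn.execute(
    "SELECT salary FROM employees WHERE gender = 'M'")]
females = conn.execute(
    "SELECT employee_id, first_name, salary FROM employees WHERE gender = 'F'"
).fetchall()
all_check = [row for row in females if all(row[2] > m for m in male_salaries)]

print(all_result)  # empty: nobody in the sample out-earns Michael
```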
If I write ALL here and execute it, you will see that we do not find any results
that fulfill this requirement. So we don't have any female employee whose salary
is higher than that of all male employees, and that's because we have a very
small data set, where Michael earns more than both female employees. So this is
how we use the ALL and ANY operators in our subqueries in SQL.

All right. So with that we have covered almost everything about how to use
subqueries in different locations and clauses. But we didn't talk about the
EXISTS operator, and that's because I would like you to first understand a very
important concept in subqueries, where we have two different types of subqueries
based on their dependencies: the non-correlated and the correlated subqueries.
After that we're going to go back to the EXISTS operator.

All right, friends. So now we come to the part about subqueries that is a little
bit complicated.
Now we're going to talk about the dependencies between the subquery and the main
query. So far, all the examples and subqueries that we have learned were
non-correlated subqueries. A non-correlated subquery is a subquery that can run
independently of the main query. That means the subquery is like a standalone
query. On the other hand, we have the exact opposite type of subquery, the
correlated subquery. A correlated subquery is a subquery that relies on values
from the main query for each row it processes. That means the subquery here is
completely dependent on the main query.
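The two definitions can be sketched as a loop that counts how often the subquery
runs. This is only a conceptual model with made-up tables and rows: it imitates
the row-by-row picture in Python, whereas a real database engine is free to
optimize and may not literally re-run the subquery for every row.

```python
import sqlite3

# Hypothetical tables; customer 2 deliberately has no orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Germany'), (2, 'USA'), (3, 'USA'), (4, 'Germany');
    INSERT INTO orders VALUES (1, 3), (2, 1), (3, 1), (4, 4);
""")

# Non-correlated: the subquery mentions nothing from the main query, so it runs
# once and its result is reused as an intermediate list.
non_correlated_runs = 0
german_ids = {cid for (cid,) in conn.execute(
    "SELECT customer_id FROM customers WHERE country = 'Germany'")}
non_correlated_runs += 1
all_orders = conn.execute(
    "SELECT order_id, customer_id FROM orders ORDER BY order_id").fetchall()
non_correlated = [row for row in all_orders if row[1] in german_ids]

# Correlated: the subquery needs the customer_id of the current main-query row,
# so it is evaluated again for each row; a row is kept only if something comes back.
correlated_runs = 0
correlated = []
main_rows = conn.execute(
    "SELECT customer_id FROM customers ORDER BY customer_id").fetchall()
for (customer_id,) in main_rows:
    found = conn.execute(
        "SELECT 1 FROM orders WHERE customer_id = ? LIMIT 1", (customer_id,)
    ).fetchone()
    correlated_runs += 1
    if found is not None:
        correlated.append(customer_id)

print(non_correlated_runs, non_correlated)
print(correlated_runs, correlated)
```

With four customer rows the per-row subquery runs four times, and the second
customer is dropped because the subquery returns nothing for that row.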
So I know this might be a little bit confusing. That's why we have the following
very simple sketch, in order to understand exactly how this works. As usual we
have our database tables, and this time SQL starts by executing the main query
first. This is the first thing that happens. The main query queries the database
in order to get results, and SQL processes the results row by row.

So now what is going to happen? The main query passes the information of the
first row to the subquery. So the subquery gets its data from the main query,
and SQL executes the subquery. Here the subquery returns a value, for example
one. It's very important to understand that SQL, or the main query, now checks:
is there a result from the subquery? In this example, yes, we have a result. So
here SQL is checking the output of the subquery for the first row, and if there
is a result, SQL returns the row in the final result. This whole iteration
happened only for the first row.

So we process the whole thing again from the start for the second row. The main
query gets the second row from the database and passes it to the subquery. Once
the subquery gets this new information, SQL executes the subquery once again.
Now let's say that after executing the subquery there were no results, so the
subquery is not returning anything after the execution. What happens now? SQL
and the main query check: okay, there is no result from the subquery, and this
means this row should be excluded and not presented in the output. So we will
not see this row in the output. As you can see, SQL executed the subquery once
again for the second row.

This keeps happening as long as we have rows. For example, we have another row.
The main query passes it to the subquery, the subquery is executed for the third
time, and the result of the subquery is one. So the same thing happens: SQL
checks it, okay, we have a value, so this row is allowed to be in the final
results, and so on. The cycle keeps repeating for each row that is retrieved
from the main query, and once we have processed all the rows, the final result
is presented in the output.
So what we have understood so far: the correlated subquery always depends on the
main query, and the subquery is executed for each row that we get from the main
query. In this example we have four rows, and the subquery is executed four
times. So this is how the correlated subquery works. It's a little bit more
complicated than the non-correlated subquery.

The non-correlated subqueries are really straightforward. First the subquery
queries the database only once, and the output of the subquery is an
intermediate result that is used by the main query. Then the main query queries
the intermediate result, and in the output we get the final results. So as you
can see, the execution of the non-correlated subquery is straightforward. There
are no iterations; everything is executed only once.

Now if you compare them side by side, you can see that the non-correlated
subquery is completely independent of the main query. That means the subquery is
executed only once, and after that SQL executes the main query, also only once,
using the result from the subquery. But on the left side, with the correlated
subquery, the subquery is executed multiple times and is completely dependent on
the main query, and there is an iteration for each row that is retrieved from
the main query. The process keeps cycling until all the rows are processed. And
this is exactly how the correlated and the non-correlated subqueries work in SQL.
All right. So now let's have the following task, and it says: show all customer
details and find the total orders of each customer. We have already solved this
task, and you know, in SQL we don't have only one query to solve something; we
have multiple ways to do it. We solved this task before using subqueries and
joins. Now we're going to solve it using a subquery in the SELECT clause, and as
well using a correlated subquery.

So again let's do it step by step. It's very simple. First we need all the
customer details, so as we learned: SELECT star FROM Sales.Customers. If you
execute it, you will get all the details of all customers. Now we need to find
the total number of orders of each customer. Before, we solved this using a
simple query where we used the COUNT function together with a GROUP BY, but this
time we're going to do it a little bit differently. So let's write a query
saying SELECT COUNT star FROM the table Sales.Orders, and let's execute it. With
that we have the total number of orders.

So let's take this query and use it in the SELECT, so we are using it as a
scalar subquery. Let's just put it over here, and this is the main query. In
order to make this a subquery, we add the parentheses and give it the alias
total sales. So now let's go and execute it. As you can see, we have here all
the details about the customers, and we have the total sales. But we have one
issue. We don't need just the overall total of orders. We need the total orders
for each customer, and each customer has a different total.

We cannot use the following setup: we cannot say GROUP BY customer ID and then
also select the customer ID here, and so on. If you execute it, you will get a
problem. And that's because if you execute this subquery on its own, you will
get multiple rows and multiple columns. So you have a table subquery, and this
type of subquery is not allowed in the SELECT clause, right? We have to have
only a scalar subquery. So that's why we cannot do that, and we have to remove
all of that.
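The dead end just described can be reproduced with a small sketch. The tables,
columns and rows are assumptions for illustration, and the exact error text
belongs to SQLite (used here via Python's sqlite3 module); other database
engines reject the same query with their own message.

```python
import sqlite3

# Hypothetical customers/orders tables (assumed names and rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Germany'), (2, 'USA'), (3, 'USA'), (4, 'Germany');
    INSERT INTO orders VALUES (1, 2), (2, 3), (3, 1), (4, 1), (5, 2), (6, 3), (7, 1), (8, 4);
""")

# Scalar, non-correlated subquery in SELECT: it yields one value (the count of
# the whole orders table), so every customer row shows the same number.
same_total = conn.execute("""
    SELECT customer_id,
           (SELECT COUNT(*) FROM orders) AS total_orders
    FROM customers
    ORDER BY customer_id
""").fetchall()

# Adding GROUP BY turns the subquery into a table subquery (several rows, two
# columns), which cannot stand where a single value is expected.
try:
    conn.execute("""
        SELECT customer_id,
               (SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id) AS total_orders
        FROM customers
    """).fetchall()
    rejected = False
except sqlite3.Error as exc:
    rejected = True
    print("rejected:", exc)

print(same_total)
```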
those stuff. But we can go and solve it using the
But we can go and solve it using the correlated subqueries. So now the
correlated subqueries. So now the subquery is completely independent from
subquery is completely independent from the main query. So in order to correlate
the main query. So in order to correlate it, what we're going to do, we're going
it, what we're going to do, we're going to go and connect it. So I'm going to
to go and connect it. So I'm going to give aliases for the tables and I'm
give aliases for the tables and I'm going to say where the customer ID equal
going to say where the customer ID equal to the customer ID from the main query
to the customer ID from the main query from the customers. So again we are
from the customers. So again we are connecting the customer ID from the
connecting the customer ID from the orders in the subquery with the customer
orders in the subquery with the customer ID from the table customers that comes
ID from the table customers that comes from the main query. So now we are
from the main query. So now we are saying okay execute this only for a
saying okay execute this only for a specific customer not for the whole
specific customer not for the whole table. So let's go and execute it. So
table. So let's go and execute it. So now in the output we have the total
now in the output we have the total sales for each customer and we don't
sales for each customer and we don't have here like the total sales in the
have here like the total sales in the whole table orders and that's because
whole table orders and that's because what is happening for each row the
what is happening for each row the subquery going to be executed. So for
subquery going to be executed. So for the customer number one this query going
the customer number one this query going to be executed like this count the total
to be executed like this count the total number of orders where the customer ID
number of orders where the customer ID equal to the one. So let me just show
equal to the one. So let me just show you what this means. If I go and remove
you what this means. If I go and remove this from here and just put the number
this from here and just put the number one. So if I go and execute this, you
one. So if I go and execute this, you will see the customer ID one has three
will see the customer ID one has three orders. And let's just put it back and
orders. And let's just put it back and execute. And the same thing going to
execute. And the same thing going to happen for each customer. So for each
happen for each customer. So for each customer, for each row, this subquery
customer, for each row, this subquery going to be executed and it can be
going to be executed and it can be filtered with the customer ID that comes
filtered with the customer ID that comes from the main query. So this is another
from the main query. So this is another way in how to solve this task using the
way in how to solve this task using the correlated subqueries. So now let's
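A minimal sketch of the correlated subquery in the SELECT clause described above. The table names, column names, and sample rows are assumptions made up for illustration (small stand-ins for the course's customers and orders tables), and the syntax is kept to what a recent SQL Server, PostgreSQL, MySQL, or SQLite should accept.

```sql
-- Setup: hypothetical sample tables, not the course dataset.
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50), country VARCHAR(50));
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, sales INT);
INSERT INTO customers (customer_id, first_name, country) VALUES
    (1, 'Maria', 'Germany'), (2, 'John', 'USA'), (3, 'Georg', 'USA'),
    (4, 'Martin', 'Germany'), (5, 'Anna', 'USA');
INSERT INTO orders (order_id, customer_id, sales) VALUES
    (1, 1, 10), (2, 1, 15), (3, 1, 20), (4, 2, 60), (5, 2, 25),
    (6, 3, 50), (7, 3, 30), (8, 4, 90), (9, 4, 20);

-- Correlated scalar subquery: the inner WHERE refers to c.customer_id,
-- which comes from the current row of the main query.
SELECT
    c.customer_id,
    c.first_name,
    (SELECT COUNT(*)
     FROM orders AS o
     WHERE o.customer_id = c.customer_id) AS total_orders
FROM customers AS c
ORDER BY c.customer_id;
-- With the sample rows above: customer 1 has 3 orders, customers 2 to 4
-- have 2 each, and customer 5 has 0.
```

Replacing `c.customer_id` with the literal `1` inside the subquery reproduces the single-customer check from the walkthrough.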
So now let's summarize and quickly go through the differences between non-correlated and correlated subqueries.

If we are talking about the definition: a non-correlated subquery is independent of the main query, but on the other hand a correlated subquery is dependent on the main query.

If we are talking about the execution: the non-correlated subquery is executed only once, and then its result is used by the main query. With a correlated subquery, the subquery is executed for each row that we have from the main query. And as we learned, we can execute a non-correlated subquery on its own, so we can select it and run it. But we cannot execute a correlated subquery on its own, so we always have to execute the whole thing.

If we are talking about which one is easier, I think it's clear that non-correlated subqueries are easier to write and to read. On the other hand, correlated subqueries are harder to read and more complex.

Now, if we are talking about the performance of the database: since a non-correlated subquery is executed only once, this of course leads to better performance, because things are really straightforward. On the other hand, with correlated subqueries there is more effort, because SQL has to check a lot of stuff and the subquery is executed many times. So non-correlated subqueries are generally faster.

We use non-correlated subqueries to do a static comparison. The subquery is executed only once, and we get a single static value to use for filtering and so on. On the other hand, we use correlated subqueries to do a row-by-row comparison. Since we don't have a static value here, each time the subquery runs we can get a different result, and this makes the filter more dynamic. So those are the big differences between non-correlated and correlated subqueries.
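To make the static versus row-by-row comparison concrete, here is a hedged side-by-side sketch. The `orders` table and its rows are invented for illustration, and "executed once" versus "executed per row" describes the logical model; a real query optimizer is free to run either query differently.

```sql
-- Setup: hypothetical sample table, not the course dataset.
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, sales INT);
INSERT INTO orders (order_id, customer_id, sales) VALUES
    (1, 1, 10), (2, 1, 15), (3, 1, 20), (4, 2, 60), (5, 2, 25),
    (6, 3, 50), (7, 3, 30), (8, 4, 90), (9, 4, 20);

-- Non-correlated: the subquery has no reference to the main query.
-- It can be run on its own and yields one static value (the overall average).
SELECT o.order_id, o.sales
FROM orders AS o
WHERE o.sales > (SELECT AVG(sales) FROM orders)
ORDER BY o.order_id;
-- With the sample rows: orders 4, 6 and 8.

-- Correlated: the subquery refers to o.customer_id from the main query,
-- so each order is compared with the average of its own customer.
SELECT o.order_id, o.customer_id, o.sales
FROM orders AS o
WHERE o.sales > (SELECT AVG(o2.sales)
                 FROM orders AS o2
                 WHERE o2.customer_id = o.customer_id)
ORDER BY o.order_id;
-- With the sample rows: orders 3, 4, 6 and 8.
```

Order 3 shows the difference: its sales value of 20 is below the overall average but above the average for customer 1, so only the row-by-row comparison returns it.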
All right. So now that we understand the concept of the two types, correlated and non-correlated subqueries, we're going to cover the last operator for subqueries: EXISTS. So what is the EXISTS operator?

All right. So now we're going to talk about a very interesting operator in SQL, EXISTS. In some scenarios, as you are querying the data from one table, you need to check whether the rows of this table exist in another table. That means you are checking the existence of your rows in a different table. And exactly in this scenario we use subqueries together with the operator EXISTS. The EXISTS operator is very simple: it just checks whether the subquery returns any rows.
All right. So now let's understand the syntax of correlated subqueries using the EXISTS operator. This can be a little bit complicated, but we're going to do it step by step, so don't worry about it.

Let's start with the easy stuff. In the main query we write a simple SELECT. We are selecting a few columns from table 2. Now, we don't need all the data from table 2; we want to filter the table using the WHERE clause. Immediately after WHERE we write another keyword, EXISTS. So we don't specify any column before EXISTS like we did with the comparison operators or the IN operator. We don't need that, because we are not filtering based on a value; we are filtering based on logic. That's why the word EXISTS comes immediately.

Directly after EXISTS we define the subquery, like this: SELECT 1 FROM table 1. The 1 is not a must, but it is very commonly used here. We are not using the subquery to retrieve information from table 1. We are just testing whether the subquery returns anything or not, and we don't care about the returned value. It could be a one, it could be a column, it could be anything. We only care whether the subquery returns anything, so that's why we write any value, like a 1 here.

Now we are not done yet. This subquery is not yet connected to the main query, and we somehow have to connect them. We can do that using the WHERE clause of the subquery, where we connect the ID from table 1 in the subquery with the ID from the outer query, the main query. With that we are building a relationship between the subquery and the main query. The subquery now depends on the values from the main query, because here we have the table 2 ID, so the IDs from the main query filter the subquery. So this is the syntax of correlated subqueries using EXISTS, where we make the subquery depend entirely on the main query.
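The syntax just described can be sketched as follows. The names `table1`, `table2`, `id`, `col1`, and `col2` are placeholders in the spirit of the walkthrough, and the rows are invented so the statement can actually be run.

```sql
-- Setup: placeholder tables with made-up rows.
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
CREATE TABLE table1 (id INT);
CREATE TABLE table2 (id INT, col1 VARCHAR(20), col2 VARCHAR(20));
INSERT INTO table1 (id) VALUES (1), (2), (2);
INSERT INTO table2 (id, col1, col2) VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z');

SELECT t2.id, t2.col1, t2.col2      -- main query: a few columns from table2
FROM table2 AS t2
WHERE EXISTS (                      -- no column before EXISTS
    SELECT 1                        -- the selected value is never used
    FROM table1 AS t1
    WHERE t1.id = t2.id             -- the link to the main query
)
ORDER BY t2.id;
-- With the sample rows: ids 1 and 2 are returned, id 3 is not.
```

Note that id 2 appears twice in `table1` but only once in the result: EXISTS only tests for a match and does not multiply rows the way a join would.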
So let's understand how EXISTS works. Each row that we have from the main query triggers an execution of the subquery. The subquery helps us evaluate this row, so we are testing the row. Now, if the subquery doesn't return anything, so there are no results, what happens? The row that we are evaluating from the main query is excluded from the final results. On the other hand, if the subquery returns something, so we have some kind of result, then the row that we are evaluating is included in the final results. So the subquery is used to do a test: do we have a result or not? Based on this, SQL either includes or excludes the row from the final results. This is the logic behind EXISTS in SQL.
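One way to watch this per-row test is to move EXISTS from the WHERE clause into a CASE expression, so every row of the main query is shown together with its verdict. This is only an illustrative sketch; `table1`, `table2`, and their rows are made-up placeholders.

```sql
-- Setup: placeholder tables with made-up rows.
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS table2;
CREATE TABLE table1 (id INT);
CREATE TABLE table2 (id INT, col1 VARCHAR(20));
INSERT INTO table1 (id) VALUES (1), (2), (2);
INSERT INTO table2 (id, col1) VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- For each row of table2, report whether the subquery finds a match.
SELECT
    t2.id,
    CASE
        WHEN EXISTS (SELECT 1 FROM table1 AS t1 WHERE t1.id = t2.id)
        THEN 'included'
        ELSE 'excluded'
    END AS exists_test
FROM table2 AS t2
ORDER BY t2.id;
-- With the sample rows: ids 1 and 2 are 'included', id 3 is 'excluded'.
```

Using the same EXISTS in a WHERE clause would keep exactly the rows marked 'included'.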
All right. So now we're going to solve the same task using EXISTS. The task says: show the details of orders made by customers in Germany. We have already solved this task using the IN operator and a subquery. Now we're going to solve it using EXISTS, following the same logical steps as before.

First we select all the details from the table Sales.Orders. Let's execute it. With that we have all the orders and all the details. But of course we don't need all that information. We need only the orders that were made by customers from Germany. So that is the first query.

Let's construct the second query. We're going to say SELECT * FROM Sales.Customers. But we don't need all the customers. We need only the customers where the country is equal to the value Germany. So let's execute it. Now we have all customers that come from Germany.
Now we have to put those two queries together in order to get the final results. As we learned before, the second query is going to be our subquery, so it supports the first query in filtering the data. The first query is going to be our main query. Let me just make this smaller, and the text as well.

Now we don't need all the orders, right? We need only the orders where the customer comes from Germany, so we need the WHERE clause. We can state the filter logic like this: show the order details only if the customer exists in the subquery. Now we have to put in our subquery, which is this one over here. Let's move it to the right side, and in order to have it as a subquery we have to wrap it in parentheses.

Since we want EXISTS to work as a correlated subquery, we cannot leave it like this. We have to connect the subquery with the main query. Right now the subquery is independent from the main query. Because we want to check, for each order from the orders table, whether its customer exists in the subquery, we're going to add a condition like the following. It's like with joins: we have to connect the customer IDs. So we give the orders table an alias, and the table in the subquery as well. Then we say that the customer ID from the orders should be equal to the customer ID from the subquery, from the table customers. So again, this customer ID comes from the subquery and this customer ID comes from the main query.

Since we are using the subquery only to test the existence of the customer, that is, whether the subquery returns anything or not, it doesn't matter what you select in the subquery. You can go with the star, a column, or any static value. By convention SQL developers go with the static value 1. Of course you can select a column like the customer ID, but retrieving that information is an unnecessary step, so it is cleaner, and it can be faster for SQL, if you say SELECT 1. So let's stick with the best practice: use the value 1 when you are working with EXISTS.

So this is our subquery and I think we have everything. Let's go and execute it. As you can see in the output, we got all the orders where the customers come from Germany. And if you try another value in the SELECT and execute, you will get exactly the same results, so it doesn't matter which value you use. With that we have solved the task, this time using EXISTS.
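A sketch of the finished query, assuming simplified lowercase table and column names and a few invented rows in place of the course's orders and customers tables.

```sql
-- Setup: hypothetical sample tables, not the course dataset.
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50), country VARCHAR(50));
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, sales INT);
INSERT INTO customers (customer_id, first_name, country) VALUES
    (1, 'Maria', 'Germany'), (2, 'John', 'USA'), (3, 'Georg', 'USA'),
    (4, 'Martin', 'Germany'), (5, 'Anna', 'USA');
INSERT INTO orders (order_id, customer_id, sales) VALUES
    (1, 1, 10), (2, 1, 15), (3, 1, 20), (4, 2, 60), (5, 2, 25),
    (6, 3, 50), (7, 3, 30), (8, 4, 90), (9, 4, 20);

-- Orders made by customers in Germany.
SELECT o.*
FROM orders AS o
WHERE EXISTS (
    SELECT 1
    FROM customers AS c
    WHERE c.country = 'Germany'
      AND c.customer_id = o.customer_id   -- link subquery to main query
)
ORDER BY o.order_id;
-- With the sample rows: orders 1, 2, 3 (customer 1) and 8, 9 (customer 4).
```

Swapping `SELECT 1` for `SELECT *` or `SELECT c.customer_id` returns the same orders, since only the existence of a row is tested.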
Now if the task says: show the details of orders made by customers that don't come from Germany, it's going to be very simple. We use the operator NOT before EXISTS, so WHERE NOT EXISTS. Now we are flipping the whole logic and saying there should be no match in the subquery. If you execute it, you will get all the orders where the customers don't come from Germany, simply by using the NOT logic.
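The flipped version, again as a sketch over invented sample tables (the names and rows are assumptions, not the course data):

```sql
-- Setup: hypothetical sample tables, not the course dataset.
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50), country VARCHAR(50));
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, sales INT);
INSERT INTO customers (customer_id, first_name, country) VALUES
    (1, 'Maria', 'Germany'), (2, 'John', 'USA'), (3, 'Georg', 'USA'),
    (4, 'Martin', 'Germany'), (5, 'Anna', 'USA');
INSERT INTO orders (order_id, customer_id, sales) VALUES
    (1, 1, 10), (2, 1, 15), (3, 1, 20), (4, 2, 60), (5, 2, 25),
    (6, 3, 50), (7, 3, 30), (8, 4, 90), (9, 4, 20);

-- Orders for which NO matching German customer exists.
SELECT o.*
FROM orders AS o
WHERE NOT EXISTS (
    SELECT 1
    FROM customers AS c
    WHERE c.country = 'Germany'
      AND c.customer_id = o.customer_id
)
ORDER BY o.order_id;
-- With the sample rows: orders 4, 5, 6 and 7.
```

One detail worth keeping in mind: NOT EXISTS also keeps orders whose customer ID has no row in `customers` at all, because for those orders the subquery returns nothing either.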
And there is one more thing that is annoying about correlated subqueries compared to non-correlated subqueries. Let me go back to the IN operator. This is a non-correlated subquery, and if I select only the subquery, I can execute it independently. So I can check the intermediate results and validate my query.

The problem with the correlated subquery is that I cannot highlight the subquery and execute it. That's because inside the subquery we are using a column that is outside our subquery, one that comes from the main query. This piece of information is unknown to SQL at that point, and that's why we are getting this error: SQL is saying, I don't know where this column comes from. So this is a little bit annoying with correlated subqueries: you cannot test the intermediate results.

What I usually do is test the intermediate result for only one row. For example, I pick a customer here, say customer two. So I say the customer ID should be equal to two, and I remove the column reference. I got this value from the main query. If I execute it now, I can see that the subquery is not returning anything, because there is no such value. With that, I'm testing just one row. And of course, to make the whole query work again, I have to put back the column from the main query.

So this is why correlated subqueries are a little bit harder to understand compared to non-correlated ones: we cannot test the intermediate results like we can there. So this is another way to solve this task, using a correlated subquery with the operator EXISTS.
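The one-row testing trick can be sketched like this. The `customers` table and its rows are invented for illustration, and the alias `o` stands for the orders table of the main query, which is deliberately absent when the subquery is run alone.

```sql
-- Setup: hypothetical sample table, not the course dataset.
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50), country VARCHAR(50));
INSERT INTO customers (customer_id, first_name, country) VALUES
    (1, 'Maria', 'Germany'), (2, 'John', 'USA'), (3, 'Georg', 'USA'),
    (4, 'Martin', 'Germany'), (5, 'Anna', 'USA');

-- Running the correlated subquery on its own fails, because o.customer_id
-- only exists inside the main query (left commented out on purpose):
-- SELECT 1 FROM customers AS c
-- WHERE c.country = 'Germany' AND c.customer_id = o.customer_id;

-- Workaround: replace the outer reference with one concrete value.
SELECT 1
FROM customers AS c
WHERE c.country = 'Germany'
  AND c.customer_id = 2;
-- With the sample rows: no rows, customer 2 is not from Germany.

SELECT 1
FROM customers AS c
WHERE c.country = 'Germany'
  AND c.customer_id = 1;
-- With the sample rows: one row, so an order of customer 1 would pass the test.
```

After checking a few values this way, the literal is replaced with the outer column again to restore the full query.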
Okay. So now let's see step by step how SQL executes a correlated subquery using the EXISTS operator. This time SQL will not start with the subquery. SQL starts immediately with the main query. SQL first identifies the main query and executes it, but it executes it row by row.

The first row is the first customer, so SQL puts the first customer to the test. The next step is that SQL passes the value of the customer ID from the main query to the subquery. So we are doing exactly the opposite of the non-correlated case. SQL prepares the subquery with the following information: the customer ID is equal to one. Then SQL executes it. Once SQL has executed this query, we get the result 1, because there are rows in the orders table where the customer ID is equal to one. So what happens? The row from the main query passes the test, and this customer is included in the final results.

The next step is that SQL starts testing the second customer. We put this customer to the test and pass the value to the subquery, so here we have the value two. SQL executes this query, and of course we get a result, because there are multiple rows where the customer ID is equal to two. That's why the output of this subquery is 1. SQL says: great, we have a value from the subquery, so it is safe to show this customer in the output.

Then SQL goes to the next row, and so on. For the next two customers the same thing happens. All of those customers get a value from the subquery, so they all pass the test and we have them in the output.

Now SQL goes to the last row from the table customers. We have Anna, and we put Anna to the test. SQL passes the value five to the subquery and executes this query against the table orders. Once SQL executes this query, nothing is returned, because the table orders has no customer ID equal to five. SQL says: we are not getting any results from the subquery. That's why this customer fails, and SQL does not show her in the output; the row is completely removed. So the customer Anna is excluded because the subquery returns nothing. Customer ID five, Anna, does not exist in the table orders, so she fails the test, and in the final results we have only four customers.

This is exactly the purpose of EXISTS: we are checking the existence of our rows in another table, in another query. So this is how SQL executes a correlated subquery using the EXISTS operator.
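The walkthrough above corresponds to a query of roughly this shape. The table and column names and the sample rows are assumptions chosen to mirror the narration: five customers, with customer 5 (Anna) having no orders.

```sql
-- Setup: hypothetical sample tables, not the course dataset.
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50), country VARCHAR(50));
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, sales INT);
INSERT INTO customers (customer_id, first_name, country) VALUES
    (1, 'Maria', 'Germany'), (2, 'John', 'USA'), (3, 'Georg', 'USA'),
    (4, 'Martin', 'Germany'), (5, 'Anna', 'USA');
INSERT INTO orders (order_id, customer_id, sales) VALUES
    (1, 1, 10), (2, 1, 15), (3, 1, 20), (4, 2, 60), (5, 2, 25),
    (6, 3, 50), (7, 3, 30), (8, 4, 90), (9, 4, 20);

-- Customers that have at least one order.
SELECT c.customer_id, c.first_name
FROM customers AS c
WHERE EXISTS (
    SELECT 1
    FROM orders AS o
    WHERE o.customer_id = c.customer_id   -- value passed in from the main query
)
ORDER BY c.customer_id;
-- With the sample rows: customers 1 to 4 pass the test;
-- Anna (customer 5) has no orders and is excluded.
```

The row-by-row description is the logical model of the query; a database engine may choose a different physical plan, such as a semi-join, that produces the same result.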
All right friends, so with that you have covered everything about subqueries, all the different categories and types, and now we're going to do a quick recap.

As we learned, a subquery is simply a query inside another query. We use subqueries to break down a complex query into smaller, simpler, easy-to-manage pieces, which makes everything easier to develop and to read.

As we learned, there are many different use cases for subqueries. We use subqueries to create temporary result sets to be used later by another query. We learned that we can use subqueries to prepare the data before joining tables. Another very important use case is filtering our data using dynamic and complex filter logic. As we learned, we can use correlated subqueries with the EXISTS operator to check the existence of rows in another table. And correlated subqueries also help us do a row-by-row comparison.

All right my friends, so with that we have covered an important technique for nesting your queries in SQL. In the next step we're going to talk about one of the most famous techniques for doing multiple steps in SQL: the CTE, the common table expression. So let's go.
A CTE, common table expression, is a temporary named result set, like a virtual table, that can be used multiple times within your query to simplify and organize a complex query. So let's understand what this means using the following sketch. We have our database tables like orders, customers and so on. In a very simple scenario we write a simple SQL query to retrieve data from the database, and in the output we get the result of the query. This is the simplest version of querying data. Now things get complicated in our project, and we could use the following technique in our query. We still have this section where we are saying SELECT ... FROM. But now inside our query we can write another query, for example SELECT ... FROM ... WHERE, which has nothing to do with the first query, and we can give this new query inside our query a name. We call this query a CTE query, a common table expression, and the first query outside this CTE we call the main query.
Now if you check this, we have a query inside another query. So let's see what SQL is going to do with this. The first thing it does is execute the CTE query. So the CTE query is executed and retrieves some information from our database tables. Now the output is available only inside the query, and it has the shape of a table, for example sales. So the sales table and the orders table are both tables, but one is stored in the database and the other one is an intermediate virtual table. Then in the main query we can start querying the sales table, the result of the CTE, like any other normal table, the same way we query the database tables. So the main query retrieves some information and maybe does some manipulation on top of the sales table, the CTE result. And of course the main query can also say, you know what, let's query a few tables from the database as well. So the main query has two sources of tables: it gets them either directly from the database or from the table that is created inside the query. Once everything is done, the final result of the main query is presented to the user. So as you can see, the CTE query has one task: it generates a table that lives inside our query, and we can use it as we want.
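The sketch described here could look roughly like the following. The CTE name sales comes from the lesson; the columns, sample rows and the aggregation inside the CTE are assumptions chosen only to make the example runnable in a scratch database.

```sql
-- Hypothetical sample data for illustration.
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50));
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, amount INT);

INSERT INTO customers (customer_id, first_name) VALUES
    (1, 'Maria'), (2, 'John'), (3, 'Georg');
INSERT INTO orders (order_id, customer_id, amount) VALUES
    (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 3, 45);

-- CTE query: builds the intermediate "sales" table from a database table.
WITH sales AS (
    SELECT customer_id, SUM(amount) AS total_sales
    FROM orders
    GROUP BY customer_id
)
-- Main query: reads from two sources, the CTE result (sales)
-- and a table stored in the database (customers).
SELECT c.first_name, s.total_sales
FROM sales AS s
JOIN customers AS c ON c.customer_id = s.customer_id
ORDER BY c.customer_id;
```

Inside the main query, sales is queried and joined exactly like a stored table, which is the point of the sketch.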
So now this intermediate table that is created by the CTE has two characteristics. First, this table will not live long. Once the query ends, SQL destroys this table, so it is not available afterward and we are not able to query it anymore. SQL is doing a cleanup here. And the second characteristic: let's imagine that we have another, separate query that retrieves data directly from the database tables. Now if you say, let's join those tables with the sales from the first query as well, it will not work, because SQL is going to say, I don't know what you are talking about. That's because sales is only locally available to the main query in the same query. It's not globally available to any query like the database tables are. It is dedicated only to the main query within the same query.
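A small sketch of this scoping rule, assuming a hypothetical orders table. The failing statement is left commented out so the script still runs from top to bottom.

```sql
-- Hypothetical sample data for illustration.
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, amount INT);
INSERT INTO orders (order_id, customer_id, amount) VALUES
    (1, 1, 10), (2, 1, 20), (3, 2, 15);

-- The CTE "sales" exists only for the statement it is attached to.
WITH sales AS (
    SELECT customer_id, SUM(amount) AS total_sales
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, total_sales
FROM sales;          -- works: same statement as the CTE definition

-- A second, separate statement cannot see the CTE anymore.
-- Uncommenting the next line is expected to fail, because no table
-- or view named "sales" is stored in the database:
-- SELECT customer_id, total_sales FROM sales;
```

If the intermediate result has to outlive one statement, it needs to be stored some other way; the CTE itself is gone as soon as its statement finishes.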
And now you might tell me, Baraa, wait, I have heard this story before, right? This is identical to the story that you told us about subqueries. So what exactly is the difference between the subquery and the CTE? Well, you are totally right. The story is identical for subqueries and CTEs, but there are still differences between them. So let me show you a few differences. Let's put them side by side: on the left side we have the subquery, on the right side the CTE. Now if you look at how we wrote the CTE and the subquery, you can see that we write the subquery from bottom to top. First we have this inner query, the subquery, and then on top of it we have the main query. On the other hand, we write the CTE from top to bottom. First we write this inner query, the CTE query, and then beneath it we write the main query. So this is the first difference between them: the way we write the query. If I'm thinking about subqueries, I start from bottom to top. If I'm thinking about CTEs, I think from top to bottom.
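To make the two writing directions concrete, here is the same logic written both ways. This is only a sketch: the table, the columns, the threshold of 20 and the name customer_totals are assumptions for illustration.

```sql
-- Hypothetical sample data for illustration.
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, amount INT);
INSERT INTO orders (order_id, customer_id, amount) VALUES
    (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 3, 45);

-- Subquery style: the inner query sits in the middle of the statement,
-- so you read it from the inside (bottom) up to the main query (top).
SELECT t.customer_id, t.total_sales
FROM (
    SELECT customer_id, SUM(amount) AS total_sales
    FROM orders
    GROUP BY customer_id
) AS t
WHERE t.total_sales > 20
ORDER BY t.customer_id;

-- CTE style: the inner query comes first, the main query beneath it,
-- so you read the steps from top to bottom.
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_sales
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, total_sales
FROM customer_totals
WHERE total_sales > 20
ORDER BY customer_id;
-- Both statements return customers 1 and 3.
```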
But still you say, you know what, I don't care how we write it, they are doing the same thing. The subquery introduces an intermediate result that is used later by the main query, and the same goes for the CTE: it presents an intermediate table that is also used by the main query. Now let me tell you the big difference between them. With the subquery, the result can be used only once. You cannot have another place in your main query where you reuse the result of the subquery, so you can use it in at most one position and only once. On the other hand, with the CTE technique, you can think of the sales table as a virtual table, and not only can you use it in one place in the main query, you can use it in many other places. So you can join it again, which means I'm using the output of the CTE query in two different places in the main query, or maybe three. So you can have another place where you query the sales table that is only available in our query. This is the main and most important difference between the subquery and the CTE.
It's from the name, common table expression: we think about the result of the CTE as a table. So we can select from it, and we can join it with any other table. It is like a hidden virtual table that lives inside our query. The subquery is totally different: it's a result for only one position in the main query and it's used only once. That means if you want the subquery in two or three different places, you have to write the subquery two or three different times.
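The reuse difference can be sketched as follows, assuming a hypothetical orders table: the task is to list customers whose total is above the average customer total, which needs the per-customer totals in two places.

```sql
-- Hypothetical sample data for illustration.
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, amount INT);
INSERT INTO orders (order_id, customer_id, amount) VALUES
    (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 3, 45);

-- With subqueries, the same aggregation has to be written twice.
SELECT t.customer_id, t.total_sales
FROM (
    SELECT customer_id, SUM(amount) AS total_sales
    FROM orders
    GROUP BY customer_id
) AS t
WHERE t.total_sales > (
    SELECT AVG(x.total_sales)
    FROM (
        SELECT customer_id, SUM(amount) AS total_sales
        FROM orders
        GROUP BY customer_id
    ) AS x
);

-- With a CTE, the aggregation is written once and referenced twice.
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total_sales
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, total_sales
FROM customer_totals
WHERE total_sales > (SELECT AVG(total_sales) FROM customer_totals);
-- Totals are 30, 15 and 45, the average is 30, so only customer 3 is returned.
```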
So now you understand why we have CTEs and why we have subqueries.
All right. So with that you have understood what a CTE is. Now the question is why do we need the CTE in the first place? What is the main purpose of the CTE? Let's go back to the sketch. Let's say in our complex SQL task we have to do the following steps. In step one, we have to join the tables together to prepare all the data that we need for the next step. In the second step, we have to aggregate the data; maybe we are doing summarizations. Now in our task we also have to do different types of aggregations based on different data. So what might happen is that we have to join the same tables again to prepare the data and perform a different type of aggregation, for example the average, which is going to be the last step. Now we have learned before that we can use subqueries to build this logical flow. So for step one, step two and step three we will have subqueries, and the final step is going to be in the main query. But if we keep doing this we're going to have a problem, and that is we are repeating the same step more than once. We are joining the tables twice, in steps one and three, for different purposes, which leaves us with two different subqueries that look exactly the same. And this is exactly the weak point of subqueries: they might introduce redundancy. That means subqueries alone will not help you eliminate all the duplication in your code. But we still have different techniques to solve this issue.
So what are we going to do? We're going to have only one step that joins the tables. This data is then used in step two to aggregate the data. And then we don't need step three, joining the data again. We're going to reuse step one, and use the same data for step four, which is aggregating the data using the average. And we can do this with the help of the amazing CTE. Now if you compare the steps with subqueries to the steps with the CTE, you can see that with the CTE we are reducing the number of steps, which can reduce the size of the query. And again, with subqueries we think about the steps from bottom to top, but with the CTE it's the other way around: we think from top to bottom. So the first step on top is joining the tables, and below it come step two and step three. And of course, since we are repeating the join, we put it in a CTE, and then we can use it twice in different places in the main query. So as you can see, there are a lot of benefits to the CTE. Like with subqueries, we are breaking down complex queries into smaller pieces that are easier to write, manage and understand, and we have a logical flow from step one to three, but with one more benefit: we reduce the redundancy in our code. So we don't have to join the tables twice.
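The "join once, aggregate twice" flow might be sketched like this. The table layout, the sample rows and the name order_details are assumptions for illustration; the lesson only fixes the steps (one join, one summarization, one average).

```sql
-- Hypothetical sample data for illustration.
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50));
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, amount INT);

INSERT INTO customers (customer_id, first_name) VALUES
    (1, 'Maria'), (2, 'John'), (3, 'Georg');
INSERT INTO orders (order_id, customer_id, amount) VALUES
    (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 3, 45);

-- Step 1 (written once): join the tables to prepare the data.
WITH order_details AS (
    SELECT o.order_id, c.customer_id, c.first_name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
)
SELECT
    d.first_name,
    -- Step 2: first aggregation, total per customer.
    SUM(d.amount) AS total_sales,
    -- Last step: second aggregation, average over all orders,
    -- reusing the same joined data instead of joining again.
    (SELECT AVG(amount) FROM order_details) AS avg_order_amount
FROM order_details AS d
GROUP BY d.first_name
ORDER BY d.first_name;
```

The join appears only once in the statement, even though the main query reads its result in two places.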
Now I'm going to show you a simple example of how the CTE makes our life easier. In our query we might have to do different things. For example, we have to find the top customers, so we can put this in one CTE. We might also need to calculate the top products, and we can put this in another CTE. So you don't have to put everything in one big CTE, because then you would have the same issue of a complex query. And let's say that we also have to calculate the daily revenue; this goes in one more CTE. Once we have all those parts, we can put everything together in the main query. Now if you look at this structure, you can see it's really easy to understand this code. It's easy to read. So CTEs improve the readability of our queries: your code is divided into clear sections, making it easier to understand what each part does. And if you keep looking at this, we have another advantage: the CTE introduces modularity. That means it breaks your code into smaller, manageable parts. Instead of writing one huge complex query, you break it down into smaller chunks using CTEs. Each CTE is self-contained and handles a specific part of the problem, and then you combine them all together in the final query. It's like we are putting together a puzzle piece by piece.
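A rough sketch of this modular structure, with one CTE per question and a main query that puts the pieces together. Everything concrete here is an assumption: the columns product_id and order_date, the sample rows, and the simplification of "top customers" and "top products" to plain totals per customer and per product.

```sql
-- Hypothetical sample data for illustration.
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    product_id  INT,
    order_date  DATE,
    amount      INT
);
INSERT INTO orders (order_id, customer_id, product_id, order_date, amount) VALUES
    (1, 1, 101, '2025-01-01', 10),
    (2, 1, 102, '2025-01-01', 20),
    (3, 2, 101, '2025-01-02', 15),
    (4, 3, 102, '2025-01-02', 45);

-- Piece 1: sales per customer.
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS customer_sales
    FROM orders
    GROUP BY customer_id
),
-- Piece 2: sales per product.
product_totals AS (
    SELECT product_id, SUM(amount) AS product_sales
    FROM orders
    GROUP BY product_id
),
-- Piece 3: revenue per day.
daily_revenue AS (
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
)
-- Main query: combine the pieces, one row per order.
SELECT o.order_id, ct.customer_sales, pt.product_sales, dr.revenue
FROM orders AS o
JOIN customer_totals AS ct ON ct.customer_id = o.customer_id
JOIN product_totals  AS pt ON pt.product_id  = o.product_id
JOIN daily_revenue   AS dr ON dr.order_date  = o.order_date
ORDER BY o.order_id;
```

Several CTEs share a single WITH keyword and are separated by commas; each one can be read and checked on its own before looking at the main query.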
And now one very important advantage of the CTE is reusability. That means we can have a result set that is used multiple times inside our query. You write the logic, the code, only once and then use it in different places inside your query. This is very important. Not only do you avoid wasting time writing the same stuff over and over, it also reduces the errors and mistakes that you might make when repeating the same code. Especially if you later want to change the logic, you have to visit every place where you wrote this logic and make the changes, and you might forget some places. That's why the CTE is amazing: you write the logic once and then reuse it in different places. So these are the advantages of using this technique, the CTE, inside your queries.
So again, you are at the client side and you are a data analyst. You are writing a query where you define a CTE called details, and inside it you have some logic. In the main query you are selecting data from orders and joining it with details, the CTE, multiple times using multiple conditions. Now once you execute this query, the database engine reads the query and says, aha, we have a CTE here, and it has the highest priority. That means it is going to execute the CTE first. Let's say that in the CTE you are retrieving data from the table orders, and the table orders is of course in the disk storage, inside the user data. Once the CTE is completely executed, the database engine places the result in the cache and names this result details. It's like a table name. So the database engine is done with the CTE. Now it grabs the main query and starts executing it step by step. The first step is to get the data from orders. Since orders exists in the disk storage, it retrieves it from there.
Now the database engine checks details. Okay, we have it in the cache. That means we don't have to search for it in the disk storage, and it starts retrieving the data from details at high speed. Then it goes to the second step, joining the data with details again. So again the database engine goes to the cache, finds the table details and retrieves the data, maybe based on different conditions. And then a third time we are joining to details, and we get the data from the cache. So as you can see, in the main query we are using the result of the CTE multiple times in different places, and the retrieval of all that information happens at high speed. So one big benefit of using the CTE is that it utilizes the high-speed memory of the cache. Retrieving the data from the cache, from details, is way faster than retrieving the data from the disk storage, from orders. Once the main query is completely executed, the result is returned to the database engine, which sends it back to the client side, and we will see the results in the output.
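A query of the shape walked through here, with a CTE called details that the main query joins more than once under different conditions, might look like the sketch below. Only the names details and orders come from the walkthrough; the columns, the data and the join conditions are assumptions. Note also that the "execute once, then read from cache" picture is best treated as a mental model: whether a CTE result is really computed once and kept in memory, or re-evaluated for each reference, depends on the database engine and its optimizer.

```sql
-- Hypothetical sample data for illustration.
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, amount INT);
INSERT INTO orders (order_id, customer_id, amount) VALUES
    (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 3, 45);

-- The CTE: some logic whose result is named "details".
WITH details AS (
    SELECT customer_id, SUM(amount) AS total_sales
    FROM orders
    GROUP BY customer_id
)
SELECT
    o.order_id,
    d.total_sales            AS customer_total,
    COUNT(bigger.customer_id) AS customers_with_higher_total
FROM orders AS o
-- First use of details: the total of the customer who placed the order.
JOIN details AS d
    ON d.customer_id = o.customer_id
-- Second use of details, different condition: customers with a higher total.
LEFT JOIN details AS bigger
    ON bigger.total_sales > d.total_sales
GROUP BY o.order_id, d.total_sales
ORDER BY o.order_id;
```

To see what a particular engine actually does with the repeated reference, its execution plan is the place to look rather than the query text.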
So that's it. It's amazing, right? This is how the database server executes the amazing technique, the CTE, behind the scenes.
All right. So now for the CTE, we don't have only one kind of CTE. We have different types. Mainly there are two types of CTE: the non-recursive CTE and the recursive CTE. And for the non-recursive CTE we can say we have two subtypes: the first is the standalone CTE and the second is the nested CTE. Now we're going to deep dive into each type, and we will start with the easiest form of the CTE, the standalone CTE. It is the simplest form.
So what is a standalone CTE? It is a CTE query that is defined and used independently in the query. That means it is self-contained and doesn't depend on anything: it doesn't depend on any other CTE or query. So we can run the standalone CTE query independently from anything else inside our query.
So let's understand what this means. We have our CTE. It queries the database tables, and in the output we get an intermediate result, which can then be used by the main query. So the main query queries the intermediate result and presents the final result in the output. Now if you check our CTE, it is completely independent from anything else. It simply queries the database and has one output. Since this CTE is independent from anything else, we call it a standalone CTE. If you compare this CTE with the main query, you can see that the main query cannot be executed alone, because it needs the result of the first query. So we cannot say the main query is independent; it cannot be executed alone. It always depends on the CTE query. That means the CTE needs to be executed first, and then the main query can be executed. So this is what we mean by a standalone CTE: it doesn't depend on anything else.
So now we can understand the syntax of the CTE. We have a very simple query:
SELECT, FROM, WHERE. So it is a very simple SELECT statement. Now in order to
put it inside a CTE, we can go and use the WITH clause. It starts with the
keyword WITH, then the CTE name, which is like a table name, and then we have
the keyword AS in order to say this CTE is defined as follows. So this is the
definition of the CTE, and it is wrapped in two parentheses, the opening and
the closing one. With this you are telling SQL: okay, now we are talking
about a CTE and it has a name. So if you are using a query inside the WITH
clause, we call this a CTE query. It is where you define the CTE. Now of
course we don't want only to define a CTE. We want to use it. So outside of
this definition we can go and use it like this: we are saying SELECT from the
CTE name. That means we want to select the data from the result of the CTE.
And here it's very important to use exactly the same name as you defined in
the WITH clause. So if you leave it like this, we can call this the main
query. It is the place where we use the CTE. So this is the syntax of a very
simple CTE in SQL.
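The syntax just described can be tried out with a small, runnable sketch. This is not the script from the video: it assumes SQLite through Python's built-in sqlite3 module instead of the SQL Server setup used in the course, and the orders table, its columns, the sample rows, and the CTE name big_orders are all made up for illustration.

```python
import sqlite3

# Hypothetical toy table; the course's real tables are not reproduced here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER);
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 15);
""")

query = """
WITH big_orders AS (        -- CTE query: WITH, then a name, then AS ( ... )
    SELECT order_id, sales
    FROM orders
    WHERE sales >= 15
)
SELECT order_id, sales      -- main query: reads from the CTE by the same name
FROM big_orders
ORDER BY order_id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(2, 20), (3, 15)]
```

The name after FROM in the main query has to match the name given after WITH; changing only one of them makes the statement fail with a "no such table" error in SQLite.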
Okay. So now what we're going to do: we're going to have a task that keeps
progressing through this section. We're going to start with the first step
and we will keep adding steps as we progress through the CTEs. So now the
first step in this task says: find the total sales per customer. And of
course, since we have only one step, it makes no sense to use a CTE. But we
will use one since we know that there will be more steps later. So let's
start doing that. Now before I use any CTE, I would like just to write our
query first. We need the total sales for each customer. It's very simple. So
we're going to go and SELECT, and what do we need? Let's go and get the
customer ID, and we need to do an aggregation on the sales. So sum the sales,
and we're going to call it total sales, FROM the table. And now since this is
our first query, we have to get the data from our database. We don't have any
other option. Our data is going to be in the sales orders table. So let's go
and get it. And don't forget the GROUP BY for the aggregation. We are
grouping by the customer ID. That's it. Let's go and execute it. And as you
can see in the output, nothing is fancy. We are just aggregating the sales by
customer. So with that, we have solved the task.
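The first step, written as a plain query before any CTE is involved, could look like the sketch below. It assumes SQLite via Python's sqlite3 module rather than SQL Server, and the table name orders, the column names, and the three sample rows are hypothetical stand-ins for the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER);
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 15);
""")

# Step 1 as a plain query: total sales per customer.
# ORDER BY is added here only to make the printed order deterministic.
total_sales = conn.execute("""
    SELECT customer_id, SUM(sales) AS total_sales
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(total_sales)  # [(1, 30), (2, 15)]
```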
But now I would like to put my query in a CTE. And that's because later
we're going to add more steps. So let's put our query in a CTE. In order to
do that, we're going to start with the WITH keyword. And now we have to
define the name of the CTE. So I'm going to call it CTE_Total_Sales, like
this. And then afterward we're going to say AS, and then we have to go and
add the parentheses at the start and as well at the end. And with that we are
telling SQL this query is a CTE query. So you can think of it as SQL storing
the result of this query in a cache in memory to be used later in the main
query. So that is our CTE, and of course what is missing is the main query,
and you have to write it directly after the definition of the CTE. I will
just make a small comment here about the main query. Let me just make this
smaller, like this. And now we have to go and write a very simple SELECT ...
FROM statement. And now I would like to get more details from the customers
table. So I will just go now to the customers. So now we are not querying the
CTE, right? We are just querying the database table that we have, and I would
like to get from the customers the customer ID and the first name, and let's
go and get the last name as well. So now if we go and run this, what happens?
In the output we are getting the data completely from the database table, the
customers, and of course we are not using the CTE at all inside our main
query. Of course, we can do that, but it's just a waste of space in memory,
because in this picture SQL executes the CTE and keeps its result around
without anyone using it. And of course, we would like to use the CTE in our
main query. So, let's go and do that.
So, let's go and do a join, but this time we're going to join the data from
the CTE. So let's go and get the name, and I will just call it CTS. So what
we are doing now: we are joining the physical table, the customers, with the
virtual table that we have created with the CTE and that exists only in our
query. And of course we are not only joining the tables, we would like to get
the information from the CTE. So CTS, and we need only the total sales. So
that means those three columns come from our database table customers, and
only this column, the total sales, comes from our CTE. So let's go and
execute the whole thing. Now as you can see in the output, everything is
working. We have the three columns from the table customers and we have the
total sales for each customer, and this total sales comes from our CTE. Now
as you can see, the last customer has a NULL over here, and that's because in
the table orders we don't have customer five. And now you might say: you know
what, I would like to see the intermediate result from the CTE, because what
we are seeing now in the output is the final result from the main query. So
what we can do in order to see the result of the CTE: we're going to mark the
query in the CTE, of course without the parentheses or the WITH. So just the
query, and execute it. And with that you can see in the output the
intermediate result that we are passing to the main query. And as you can
see, we don't have customer number five here. That's why in the final result
we are getting NULL, and that's of course because we are using the LEFT JOIN.
So if I execute the whole thing, you can see we are getting customer five
over here with the NULL.
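The CTE joined to the customers table, together with a look at the intermediate result, can be sketched as follows. The assumptions here are SQLite through Python's sqlite3 module instead of SQL Server, hypothetical table and column names, and invented sample rows in which customer 3 (rather than customer five as in the video) has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, first_name TEXT, last_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER);
    INSERT INTO customers VALUES (1, 'Ana', 'Lee'), (2, 'Ben', 'Kim'), (3, 'Cy', 'Roe');
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 15);
""")

# The CTE body is kept in its own string so it can also be run alone,
# which mirrors marking only the inner query in the editor.
cte_body = """
    SELECT customer_id, SUM(sales) AS total_sales
    FROM orders
    GROUP BY customer_id
"""

full_query = f"""
WITH cte_total_sales AS ({cte_body})
SELECT c.customer_id, c.first_name, c.last_name, cts.total_sales
FROM customers AS c
LEFT JOIN cte_total_sales AS cts
    ON cts.customer_id = c.customer_id
ORDER BY c.customer_id      -- only for a deterministic print order
"""

intermediate = conn.execute(cte_body + " ORDER BY customer_id").fetchall()
final = conn.execute(full_query).fetchall()
print(intermediate)  # [(1, 30), (2, 15)]
print(final)  # [(1, 'Ana', 'Lee', 30), (2, 'Ben', 'Kim', 15), (3, 'Cy', 'Roe', None)]
```

Customer 3 is missing from the intermediate result, so the LEFT JOIN keeps the customer row and fills total_sales with NULL, which Python shows as None.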
So as you can see, it's very simple. We just treat it as any normal database
table. But this table is created from the query that we have defined in the
CTE over here. Now of course in the CTE you can use any kind of clauses, like
SELECT, FROM, JOIN, GROUP BY, HAVING, everything that you want, window
functions, all aggregate functions. But there is one restriction: you cannot
go and use the ORDER BY clause. So you cannot sort the data in the CTE. Let's
go and try it out. Let's go and say ORDER BY, and let's say I want to sort by
the order ID, for example. So let's go and execute it. You can see here SQL
is saying: okay, I cannot do it for you, because ORDER BY is not allowed in
many things. So you cannot use it in views, in subqueries, in common table
expressions, the CTE over here. So it is not allowed. You cannot use ORDER BY
in the CTE. But of course you can go and sort the data in the main query. So
if you go over here and say ORDER BY customer ID and execute it, it's going
to work. So in the main query you can use ORDER BY, but in the CTE this is
the only thing that you cannot use. So that's it. This is our first CTE in
this section. All right. So this is the simplest form of the CTE, the
standalone.
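The portable way to sort, with ORDER BY in the main query only, can be sketched like this. One caveat on the assumptions: the sketch runs on SQLite via Python's sqlite3 module, and SQLite happens to accept ORDER BY inside a CTE, whereas SQL Server rejects it unless TOP, OFFSET, or FOR XML is also specified. The sketch therefore only shows the pattern that works in both. Table names, columns, and rows are hypothetical, and sorting by total sales instead of customer ID is chosen just to make the effect visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER);
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 15), (4, 3, 50);
""")

sorted_rows = conn.execute("""
    WITH cte_total_sales AS (
        SELECT customer_id, SUM(sales) AS total_sales
        FROM orders
        GROUP BY customer_id          -- no ORDER BY inside the CTE
    )
    SELECT customer_id, total_sales
    FROM cte_total_sales
    ORDER BY total_sales DESC         -- sorting happens in the main query
""").fetchall()
print(sorted_rows)  # [(3, 50), (1, 30), (2, 15)]
```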
Now we can have not only one CTE, we can have multiple CTEs. So it's going
to look like this. We have our database, and this time we don't have only one
CTE. We have multiple CTEs in our query, and each CTE is going directly to
the database and querying it in order to prepare an intermediate result. So
in this example four CTEs are going to the database and preparing four
different intermediate results, and of course SQL is going to execute them
from top to bottom. So first CTE 1, then 2, 3, 4, but they have nothing to do
with each other. So now once we have all four intermediate results, the main
query is going to go and retrieve all that information and do some magic in
order to prepare the final result for the end user. So now by looking at this
sketch you can understand that all those CTEs are independent from each
other. There is no nesting or anything. Each CTE is self-contained and could
be executed on its own without depending on any results from any other CTE or
any other query. It goes directly to the database and gets the data. So
that's why all of them are standalone CTEs. And since we have multiple CTEs,
we call them multiple standalone CTEs. That's it. It's simple.
So now let's check the syntax of multiple standalone CTEs. We're going to
start writing our first CTE. It starts with the WITH clause, then we have the
CTE name and then the logic of our CTE. So nothing new. This is how we define
the CTE. And then in order to use it, we're going to have our main query
where we select from our new CTE, and we make sure we are using the name of
our CTE. So nothing new. Now in order to add another CTE to our query, what
we're going to do: we're going to go after the definition of the first CTE,
and below it we're going to go and start defining CTE 2. But this time, as
you can see, we are not using the WITH clause. We are using a comma. So that
means only the first CTE is going to use the WITH clause in order to tell SQL
we are talking about CTEs. All the other CTEs you're going to separate using
a comma. So the syntax is going to be a comma instead of WITH, then the name
of the CTE, and then we're going to say AS followed by the definition. So
we're going to write here the query of the second CTE. Now of course if you
want to go and add more CTEs, you go and use a comma below it and define the
third CTE as well. So you can have as many CTEs as you want, always separated
with commas, but only the first CTE starts with WITH. And of course in the
main query we can go and use the results from CTE 2, where we are for example
joining the data between CTE 1 and CTE 2. So as you can see, in the main
query we are collecting the data from these different CTEs in order to do the
final step. It starts with WITH, so SQL understands: okay, now we are talking
about a CTE. And once SQL sees a comma after the parenthesis, SQL understands:
okay, now we are talking about another CTE. And if you don't use a comma
after the parenthesis, SQL understands: okay, we don't have any more CTEs,
the next query is the main query. So this is how you create multiple
standalone CTEs.
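The comma-separated syntax can be sketched as below. The assumptions are SQLite via Python's sqlite3 module rather than SQL Server, and two hypothetical CTEs (cte_order_count and cte_max_sale) over an invented orders table; neither CTE appears in the video, they only stand in for "CTE 1" and "CTE 2".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, sales INTEGER);
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 15);
""")

query = """
WITH cte_order_count AS (           -- only the first CTE starts with WITH
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
)
, cte_max_sale AS (                 -- every further CTE starts with a comma
    SELECT customer_id, MAX(sales) AS max_sale
    FROM orders
    GROUP BY customer_id
)                                   -- no comma here: the main query follows
SELECT oc.customer_id, oc.order_count, ms.max_sale
FROM cte_order_count AS oc
JOIN cte_max_sale AS ms
    ON ms.customer_id = oc.customer_id
ORDER BY oc.customer_id
"""
combined = conn.execute(query).fetchall()
print(combined)  # [(1, 2, 20), (2, 1, 15)]
```

Both CTEs read straight from the orders table and neither refers to the other, which is what makes them standalone.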
All right. So now back to our task, where we are creating a report step by
step. We now have a second step in the task, where it says: find the last
order date for each customer. So now we have to go and add one more piece of
information about our customer: when was the last time the customer ordered?
So how are we going to do it? We have to add this to our query, and I would
like to use a CTE again in order to hold this logic. So as we learned from
the first task, this is the first step, which finds the total sales for each
customer. And here we have the main query. Now I would like to put another
CTE in between. And as we learned from the syntax, we have to go and add a
comma. We cannot go and use WITH again. And we have to give it a name. So
let's call it CTE_Last_Order, and we have to define it. So AS and then the
two parentheses. And now in between we have to go and add our logic. So now
we have to focus only on this logic. Forget about the other CTE and the main
query. We have to find the last order date for each customer. So we're going
to go and query the table orders again. What do we need? We need the customer
ID. We need the order date from our sales orders table. So that's it for now.
Let's just select it and execute it. And now with that you can see all the
customers and all the orders as well. But we would like to have the latest
order date for each customer. And we can go and use our aggregate function,
the MAX function. So what we're going to do is like here at the top: we have
to go and use the function MAX and group by the customer ID. So GROUP BY the
customer ID. Let me just shift it like this. And let's give it the name last
order. So like this. And as you can see, I'm selecting only my query now. I'm
not selecting everything. And I keep executing just to check the results
before we integrate it into the main query. So now as you can see, we have
one row for each customer and we have the latest order date for each customer
as well. So with that we have solved this subtask.
really easy to extend. I'm just making like another box and I'm adding inside
like another box and I'm adding inside it the business logic that I want and
it the business logic that I want and this going to solve one problem from the
this going to solve one problem from the whole task. So you feel now exactly the
whole task. So you feel now exactly the power of the CTE. We are making complex
power of the CTE. We are making complex logic but still it's easy to add. Now
logic but still it's easy to add. Now imagine you are not doing this. You are
imagine you are not doing this. You are always extending one big query. It's
always extending one big query. It's going to be really hard to extend and
going to be really hard to extend and that's why a lot of SQL developers
that's why a lot of SQL developers really love using CTE and they like use
really love using CTE and they like use it in each query or in each task that
it in each query or in each task that they have. So we have solved this task
they have. So we have solved this task and we have to go now integrated in the
and we have to go now integrated in the main query. It's going to be very
main query. It's going to be very simple. So we're going to get over here
simple. So we're going to get over here and we will go and just add another
and we will go and just add another join. So we're going to join it with the
join. So we're going to join it with the city and as you can see SQL now is
city and as you can see SQL now is offering it as a table even though it is
offering it as a table even though it is not a physical table that exists in our
not a physical table that exists in our database. It only lives inside our data
database. It only lives inside our data but still SQL treat it as a table. And
but still SQL treat it as a table. And this is exactly what we are doing. We
this is exactly what we are doing. We treat those informations as table. So
treat those informations as table. So city the last order and I will call it
city the last order and I will call it CL. And then of course we have to go and
CL. And then of course we have to go and do the same condition like here. So the
do the same condition like here. So the CLLO customer ID should be equal to the
CLLO customer ID should be equal to the customer ID from the first table, the
customer ID from the first table, the customers. And of course we have to go
customers. And of course we have to go and add this new information to the main
and add this new information to the main query. So
query. So CL the last order. So now what we're
CL the last order. So now what we're going to do, we're going to go and
going to do, we're going to go and execute the whole thing. So we have now
execute the whole thing. So we have now two CDs and as well our main query. So
two CDs and as well our main query. So let's go and execute it. Now again let's
let's go and execute it. Now again let's check the data. The first three columns
check the data. The first three columns comes from the physical table customers.
comes from the physical table customers. The fourth one, the total sales comes
The fourth one, the total sales comes from our first city over here. So from
from our first city over here. So from here and the last order comes from our
here and the last order comes from our new city that we just defined the city
new city that we just defined the city number two. So as you can see guys,
number two. So as you can see guys, everything feels like organized and
everything feels like organized and structures and we have like flow and of
structures and we have like flow and of course those cities are standalone
course those cities are standalone cities. So we can go always and select
cities. So we can go always and select the city and execute it separately. It
the city and execute it separately. It doesn't need anything else from outside
doesn't need anything else from outside this query. It just needs the tables
this query. It just needs the tables inside your database. So guys again here
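Putting both steps together, the report with two standalone CTEs and two left joins can be sketched like this. The assumptions are SQLite via Python's sqlite3 module rather than SQL Server, hypothetical lower-case table, column, and CTE names, invented rows, and the aliases cts and cl chosen to echo the ones mentioned in the video.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, first_name TEXT, last_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         order_date TEXT, sales INTEGER);
    INSERT INTO customers VALUES (1, 'Ana', 'Lee'), (2, 'Ben', 'Kim'), (3, 'Cy', 'Roe');
    INSERT INTO orders VALUES
        (1, 1, '2025-01-05', 10), (2, 1, '2025-02-10', 20), (3, 2, '2025-01-20', 15);
""")

report_query = """
-- Step 1: total sales per customer
WITH cte_total_sales AS (
    SELECT customer_id, SUM(sales) AS total_sales
    FROM orders
    GROUP BY customer_id
)
-- Step 2: last order date per customer
, cte_last_order AS (
    SELECT customer_id, MAX(order_date) AS last_order
    FROM orders
    GROUP BY customer_id
)
-- Main query: physical table joined with both CTEs
SELECT c.customer_id, c.first_name, c.last_name, cts.total_sales, cl.last_order
FROM customers AS c
LEFT JOIN cte_total_sales AS cts ON cts.customer_id = c.customer_id
LEFT JOIN cte_last_order AS cl ON cl.customer_id = c.customer_id
ORDER BY c.customer_id
"""
report = conn.execute(report_query).fetchall()
for row in report:
    print(row)
```

The first three columns come from customers, the fourth from the first CTE, and the fifth from the second CTE; the customer without orders gets None in both CTE columns.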
So guys, again, pay attention here: if you want to add more CTEs, use the
comma. You cannot go and use, for example, another WITH here. If I execute
it, I will get an error. So you have to separate them with this comma. And
another mistake that I make frequently is that I forget and add a comma after
the last CTE, and this happens to me if I'm using a lot of CTEs. So if I go
and do it like this, I will get an error as well, because the main query
doesn't need a comma. So the last CTE should not have a comma after the
parenthesis. So I just remove it and execute.
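Both mistakes can be reproduced in a small sketch. It assumes SQLite via Python's sqlite3 module, so the exact error text differs from the SQL Server message shown in the video; the helper is_valid and the tiny CTEs a and b are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def is_valid(sql):
    """Return True if SQLite accepts and runs the statement, False on an error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.OperationalError:
        return False

statements = {
    # Mistake 1: a second WITH instead of a comma
    "second_with": "WITH a AS (SELECT 1 AS x) WITH b AS (SELECT 2 AS y) "
                   "SELECT x, y FROM a, b",
    # Mistake 2: a comma after the last CTE
    "trailing_comma": "WITH a AS (SELECT 1 AS x), b AS (SELECT 2 AS y), "
                      "SELECT x, y FROM a, b",
    # Correct: WITH once, commas between CTEs, none before the main query
    "correct": "WITH a AS (SELECT 1 AS x), b AS (SELECT 2 AS y) "
               "SELECT x, y FROM a, b",
}
results = {name: is_valid(sql) for name, sql in statements.items()}
print(results)  # {'second_with': False, 'trailing_comma': False, 'correct': True}
```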
execute. So guys with us we have now multiple cities inside our
query. All right. So now what is a nested CTE? It is a city inside another
nested CTE? It is a city inside another city. So it's kind of like subqueries, a
city. So it's kind of like subqueries, a query inside another query. So not only
query inside another query. So not only a main query can use the result of CTE
a main query can use the result of CTE another CTE can use the result from a
another CTE can use the result from a CTE and of course the nested CTE is like
CTE and of course the nested CTE is like a main query is depend on other query
a main query is depend on other query that means you cannot go and select it
that means you cannot go and select it and run it independently from the query.
and run it independently from the query. So always you have to run the CTE inside
So always you have to run the CTE inside it first before seeing the result of the
it first before seeing the result of the nested CTE. Okay. So now let's
Okay. So now let's understand what this means. Again we have our database,
and we have a CTE query that goes directly to the database and queries the
data from there, and in the output we get the intermediate result. Now in
this scenario we will not have only one intermediate result, because we
have many different steps. We need another intermediate result before
everything is prepared for the main query. So that means we have another
step that's going to be built on top of the first intermediate result: we
can have another CTE that queries the result of the first CTE and builds
another intermediate result on top of it. So as you can see here, we have
CTE1 and CTE2, and that means we now have two intermediate results. And of
course we can go and add CTE 3, 4 and so on. But now let's say that CTE2 is
going to prepare the final intermediate result for the main query.
So now the main query is going to query the second intermediate result and
do the final step, where the final result can be presented to the user. And
of course, if it is needed, the main query can access not only the second
intermediate result from the second CTE but also the first intermediate
result from CTE1.
Now we call the first CTE a standalone CTE because it doesn't depend on any
intermediate results. It goes directly to the database and gets the data.
But since the second CTE is completely dependent on CTE1, this time we're
going to call it a nested CTE, because we cannot go and execute it on its
own. It always depends on CTE1. And of course the main query depends on
everything. So as you can see, using CTEs we build something like a chain.
So this is what we mean by the standalone CTE and the nested CTE.
Okay. So now let's understand the syntax of the nested CTE. We start as
usual with the definition of the first CTE using the WITH clause, then the
name of the CTE and the definition of the CTE. So here it's nothing new.
Now we go and define the second CTE as we learned, using the comma, then
the name of the CTE and the definition. So this is our CTE number two. Now
the second CTE depends on the results of the first CTE. So how are we going
to do it? It's very simple. For CTE number two, we select the data from CTE
number one. And with that, we are making the second CTE dependent on the
first one. So this means the second CTE is getting the data from the first
one and querying it in order to do the second step. And with that we are
nesting one CTE in another, and CTE2 is completely dependent on the first
one.
So again, we call the first CTE a standalone CTE because it doesn't depend
on anything. We can execute it on its own, and it just needs the data
directly from the database. But the second CTE, since it is completely
dependent on CTE number one, we call a nested CTE. So they are very
similar; we are just selecting the data from CTE number one. And now comes
our main query. Of course it's going to use the data from the second step,
so it selects the data from CTE number two. But that's not a rule: it can
still go and select the data from CTE number one. So this is how we can
create a nested CTE in SQL.
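[Editor's sketch] The syntax described above could look roughly like the following. This is a minimal illustration, not the query shown in the video: the table demo_orders, its columns, its rows, and the names cte_1 and cte_2 are all assumptions made so that the block runs on its own.

```sql
-- Throwaway sample table (hypothetical name, columns and rows).
CREATE TABLE demo_orders (order_id INT, customer_id INT, sales INT);
INSERT INTO demo_orders (order_id, customer_id, sales)
VALUES (1, 1, 20), (2, 1, 60), (3, 2, 10), (4, 2, 25);

WITH cte_1 AS (
    -- Standalone CTE: reads directly from a table in the database.
    SELECT customer_id, SUM(sales) AS total_sales
    FROM demo_orders
    GROUP BY customer_id
),                      -- a comma separates the CTE definitions
cte_2 AS (
    -- Nested CTE: reads from cte_1, so it cannot be run without it.
    SELECT customer_id, total_sales
    FROM cte_1
    WHERE total_sales > 50
)                       -- no comma after the last CTE
-- Main query: here it reads from the last step, but it could also
-- select from cte_1.
SELECT customer_id, total_sales
FROM cte_2;

-- Clean up the sample table.
DROP TABLE demo_orders;
```

With these sample rows, customer 1 totals 80 and customer 2 totals 35, so the main query returns only customer 1.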
All right guys, back to our project, where we are creating a report about
the customers, and we would like to add one more step. So the task is: rank
the customers based on total sales per customer. So this is one more step
inside our project, and we would like to use CTEs as well in order to
implement this step. So now what do we need? We need to rank the customers
based on the total sales for each customer. So here we have two steps.
First we have to calculate the total sales per customer, and then we have
to rank them based on this information. And of course the sales are stored
inside the orders.
So now let's go and start implementing the CTE. We're going to have a
comma, and we're going to call it CTE customer rank, AS, and then we have
the parentheses, and inside them we're going to develop the logic. So first
we have to aggregate the data to get the total sales. So SELECT customer ID
and then SUM the sales from the sales orders table, and then of course
GROUP BY the customer ID.
And now I can hear you telling me: Baraa, we have already done this. We
already have this logic, so why are we repeating it? If we go to the first
CTE, you can see we have already done that. And you are totally right. We
already have the logic, so it makes no sense to repeat it. If we do this,
then we didn't understand the power of the CTE. We don't have to repeat the
same logic; we can reuse the CTE inside another CTE. So now we don't need
all that stuff. We can focus immediately on ranking the customers.
So first let me just select the data from the first CTE. So I'm going to go
and SELECT. What do we have? We have customer ID and we have total sales.
And this time we're going to select it not from any physical table; we're
going to select from our CTE. So like this. And now we're going to select
the whole query in the editor and execute it.
Well, this is the issue with nesting CTEs. Sadly, this CTE is completely
dependent on the first CTE, so we cannot go and execute it on its own. And
this is of course very annoying, because each time I execute the query, at
the end of the query SQL is going to destroy all the CTEs. So we will not
find the CTE in memory, and that's why, once the query has executed, SQL
doesn't know anything about this CTE. In order to see the result of this
one, we always have to execute the CTE that it uses together with it. So
what I usually do: I go over here and comment out everything in the main
query, and now I can execute the whole thing and see the outcome of this
nested CTE in the output. So this is the big difference between the
standalone CTEs, like here, and the nested ones.
So now let's go back to our task. We have to rank those customers based on
the total sales. So we can use the RANK function from the window functions.
So RANK() OVER, and we don't have to partition the data; we just want to
sort the data by the total sales descending. Like this, the highest sales
is going to get rank number one. So let's give it the name customer rank.
Now as you can see, we have a really nice rank beside this information.
Customer three has the highest sales and customer two has the lowest total
sales.
So with that, as you can see, we didn't repeat ourselves. We just reused
another CTE in our current CTE. And this is exactly why this technique is
so good for reducing redundancy and reducing the complexity of the whole
query. So nested CTEs are annoying to execute, but they reduce the
redundancy in our code.
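[Editor's sketch] The ranking step and the way it is tested could be written along these lines. The table demo_orders and its rows are made up for illustration (chosen so that customer 3 ends up highest and customer 2 lowest, as described above), and the CTE and column names are assumptions rather than the exact identifiers used in the video.

```sql
-- Hypothetical sample data; not the course database.
CREATE TABLE demo_orders (order_id INT, customer_id INT, sales INT);
INSERT INTO demo_orders (order_id, customer_id, sales)
VALUES (1, 2, 10), (2, 3, 15), (3, 1, 20), (4, 1, 60), (5, 2, 25),
       (6, 3, 50), (7, 1, 30), (8, 4, 90), (9, 2, 20), (10, 3, 60);

WITH cte_total_sales AS (
    -- Step 1 (standalone): total sales per customer.
    SELECT customer_id, SUM(sales) AS total_sales
    FROM demo_orders
    GROUP BY customer_id
),
cte_customer_rank AS (
    -- Step 2 (nested): reuses step 1 instead of repeating the SUM.
    SELECT
        customer_id,
        total_sales,
        RANK() OVER (ORDER BY total_sales DESC) AS customer_rank
    FROM cte_total_sales
)
-- While testing, the real main query is commented out and replaced by a
-- plain SELECT from the nested CTE. The WITH block must be executed
-- together with it, because CTEs only exist for this one statement.
SELECT customer_id, total_sales, customer_rank
FROM cte_customer_rank
ORDER BY customer_rank;

-- Clean up the sample table.
DROP TABLE demo_orders;
```

With these rows the totals are 110, 55, 125 and 90 for customers 1 to 4, so the ranks come out as customer 3 first, then 1, then 4, then 2.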
Now we are done with our logic. We tested everything. So what we're going
to do is integrate it in our main query. Let me just remove the comments
from here, and let's add it to the main query. We will do the same thing:
we're going to do a LEFT JOIN with the last CTE that we just created. Let
me just call it CCR, with the same condition; we are always joining on the
customer ID. But don't forget to change the alias. So it is CCR customer ID
equal to the customer ID from the first table. And of course we have to
select the new information, so CCR dot customer rank. And now let's go and
execute the whole thing.
Now as you can see in the results, those three columns come from the
customers table. The total sales comes from the first CTE, the last order
from the second CTE, and the customer rank comes from the nested CTE that
we just created. So guys, creating such a report is not a simple task,
because it involves different aggregations and different functions, but our
work is organized. As you can see, it's very simple. We have step one, step
two, step three, and the main query. And it's really easy to add more
components to our query.
Now, I would really like to keep practicing with those nested queries. So
we have the following task. We would like to add one more step to our
report: segment the customers based on their total sales. I would like to
implement this using a CTE as well. So let's go and solve it. We want to
add a new CTE. It's going to be CTE customer segments, AS, and then we have
to define our logic.
Now if you check our task, it has two parts. We have to find the total
sales, and then we have to segment the customers based on this information.
So it is something very similar to what we have done in step three. That
means we don't have to calculate the total sales again. We can use our
amazing first CTE here as well.
So let's go and do it. What do we need? We need the customer ID, like this.
And let's do a basic segmentation using CASE WHEN. So let's say CASE WHEN
the total sales is higher than 100, then the customer belongs to the group
High. And let's add another category: if it's not higher than 100 but it is
higher than 50, then the customer belongs to Medium. And if the total sales
is less than or equal to 50, what's going to happen? We say ELSE, the
customer belongs to the Low category. So that's it. We have an END, and
let's call it customer segments.
All right. But of course we have to select it from a table, and it's going
to be our CTE, the total sales one. So let's put it in our new CTE. And I
would like to test it before putting it inside our main query. That's why I
will comment out everything in my main query, since sadly it is a nested
CTE. And we will just select from our new nested CTE, like we have done
before. So let's go and execute it.
Now as you can see in the output, we have two customers with the category
High and two customers with Medium. But in order to make sure that
everything is working perfectly, I would like to add the total sales just
to see the numbers. So let's go and execute it. Well, you can see
everything is correct: those customers have higher than 100 in total sales,
and those two have higher than 50. But let's change things around. I would
like to have 80 as the Medium threshold, just in order to get a Low. So
with that, customer number two has lower sales than 80. That's why we are
getting the segment Low.
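[Editor's sketch] A possible form of the segmentation CTE, tested on its own with the total sales shown next to the segment. The sample table and rows are hypothetical (picked to match the outcome described above), the thresholds 100 and 80 follow the narration, and the identifier names are assumptions.

```sql
-- Hypothetical sample data; not the course database.
CREATE TABLE demo_orders (order_id INT, customer_id INT, sales INT);
INSERT INTO demo_orders (order_id, customer_id, sales)
VALUES (1, 2, 10), (2, 3, 15), (3, 1, 20), (4, 1, 60), (5, 2, 25),
       (6, 3, 50), (7, 1, 30), (8, 4, 90), (9, 2, 20), (10, 3, 60);

WITH cte_total_sales AS (
    -- Standalone CTE: total sales per customer.
    SELECT customer_id, SUM(sales) AS total_sales
    FROM demo_orders
    GROUP BY customer_id
),
cte_customer_segments AS (
    -- Nested CTE: classifies each customer using the totals from above.
    -- CASE evaluates the branches top to bottom and stops at the first
    -- match, so the second branch only sees totals of 100 or less.
    SELECT
        customer_id,
        total_sales,
        CASE
            WHEN total_sales > 100 THEN 'High'
            WHEN total_sales > 80  THEN 'Medium'
            ELSE 'Low'
        END AS customer_segments
    FROM cte_total_sales
)
-- Test query standing in for the commented-out main query.
SELECT customer_id, total_sales, customer_segments
FROM cte_customer_segments
ORDER BY customer_id;

-- Clean up the sample table.
DROP TABLE demo_orders;
```

With these rows, customers 1 (110) and 3 (125) are High, customer 4 (90) is Medium, and customer 2 (55) is Low.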
Everything is done, and we have segmented the customers into different
categories. So I don't need to test anymore. Let's integrate it in our main
query. We're going to do the same thing over here. We're going to say LEFT
JOIN, and we're going to get our new CTE, so CCS, and we have to do the
join condition. Don't forget to change it. And we have to select our nice
new information; it's going to be the customer segments. And now we can go
and execute the whole thing.
So we now have four different CTEs and one main query. And in the output we
can see all three columns from the customers table, then the first CTE, the
second, the third, and this is our new column that we just created. So
again, we have done this using a nested CTE, like this. Let me just add it.
And it was really easy to extend and to add to our report.
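[Editor's sketch] Put together, the report with four CTEs and one main query might look something like this. It is a reconstruction from the narration, not the script from the video: the tables demo_customers and demo_orders, every column name, every row, and the aliases cts and clo are assumptions (only the aliases CCR and CCS are mentioned in the transcript).

```sql
-- Hypothetical sample tables; names, columns and rows are illustrative.
CREATE TABLE demo_customers (
    customer_id INT, first_name VARCHAR(50), last_name VARCHAR(50)
);
CREATE TABLE demo_orders (
    order_id INT, customer_id INT, order_date DATE, sales INT
);
INSERT INTO demo_customers (customer_id, first_name, last_name)
VALUES (1, 'Ann', 'Lee'), (2, 'Ben', 'Cole'),
       (3, 'Cara', 'Diaz'), (4, 'Dan', 'Fox');
INSERT INTO demo_orders (order_id, customer_id, order_date, sales)
VALUES (1, 2, '2025-01-01', 10), (2, 3, '2025-01-05', 15),
       (3, 1, '2025-01-10', 20), (4, 1, '2025-01-20', 60),
       (5, 2, '2025-02-01', 25), (6, 3, '2025-02-05', 50),
       (7, 1, '2025-02-15', 30), (8, 4, '2025-02-18', 90),
       (9, 2, '2025-03-10', 20), (10, 3, '2025-03-15', 60);

WITH cte_total_sales AS (        -- step 1: standalone
    SELECT customer_id, SUM(sales) AS total_sales
    FROM demo_orders
    GROUP BY customer_id
),
cte_last_order AS (              -- step 2: standalone
    SELECT customer_id, MAX(order_date) AS last_order
    FROM demo_orders
    GROUP BY customer_id
),
cte_customer_rank AS (           -- step 3: nested, reads step 1
    SELECT customer_id,
           RANK() OVER (ORDER BY total_sales DESC) AS customer_rank
    FROM cte_total_sales
),
cte_customer_segments AS (       -- step 4: nested, reads step 1
    SELECT customer_id,
           CASE
               WHEN total_sales > 100 THEN 'High'
               WHEN total_sales > 80  THEN 'Medium'
               ELSE 'Low'
           END AS customer_segments
    FROM cte_total_sales
)
-- Main query: one LEFT JOIN per CTE, always on the customer ID.
SELECT c.customer_id, c.first_name, c.last_name,
       cts.total_sales, clo.last_order,
       ccr.customer_rank, ccs.customer_segments
FROM demo_customers AS c
LEFT JOIN cte_total_sales       AS cts ON cts.customer_id = c.customer_id
LEFT JOIN cte_last_order        AS clo ON clo.customer_id = c.customer_id
LEFT JOIN cte_customer_rank     AS ccr ON ccr.customer_id = c.customer_id
LEFT JOIN cte_customer_segments AS ccs ON ccs.customer_id = c.customer_id
ORDER BY c.customer_id;

-- Clean up the sample tables.
DROP TABLE demo_orders;
DROP TABLE demo_customers;
```

Each block handles one part of the report, so adding a further column means adding one CTE, one join and one entry in the select list.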
All right guys, so with that we have done a mini project where we analyzed
the customer information based on different aspects of our data, and we
have done it step by step. Now you have a feeling for how to write complex
SQL queries with the help of CTEs. So as you can see, if you go through the
script you can understand: okay, it is divided into multiple steps, and
each block is responsible for one specific problem of the whole report. And
this is exactly the power of the CTE: it introduces modularity. Each CTE is
self-contained and deals with one issue, and this is an amazing way to
organize your project using SQL and to structure your work.
All right, my friends. So now let's have a little break in order to have a
real talk about the CTE. But first, some coffee. Now I can say that I have
been working with SQL for a really long time, over 15 years. And I can say
as well that I have met a lot of SQL developers in different projects. And
if there is one thing that all those SQL developers love, it is the CTE.
They love using it everywhere. Each time they write a query, they are going
to be writing a CTE. And of course it's fine, it's not a bad thing. But the
problem is that they overuse it. Of course not all of them, but a lot of
SQL developers overuse the CTE. Of course the CTE is very powerful, but
remember: with great power comes great responsibility.
So my advice for you, especially if you are new to CTEs: try not to add a
new CTE each time you are doing something new. I saw it a lot: for each new
calculation, for each new column, they jump immediately and create a new
CTE. And what happens at the end? We can have a massive number of CTEs
inside one query, and the developer thinks everything is now organized and
easy to read. But believe me, it's exactly the opposite. If you open any
code and you have a lot of CTEs, and especially if they are nested CTEs, it
is impossible to understand what is going on. Even if the developer
describes each CTE and its task, it's going to be really hard to understand
and to read if everything is nested and you have, I don't know, 20 CTEs in
one query. And as well, you might use a lot of memory and get bad
performance.
So my advice for you: as you are creating new CTEs, always think about
whether you can merge two CTEs into one. It is really important to rethink
and refactor your CTEs in order to merge them and reduce the number of
CTEs. But now if you ask me how many CTEs are okay in one query, well, I
don't have a magic number for that. But normally I tend to say between
three and five CTEs is fine. It's going to be easy to understand and to
read and so on. But once you get more than five CTEs, then you have to
rethink your code. Maybe you have to create another complete query so you
don't have to put everything in one query. So this is my advice for you.
Try not to overuse CTEs in your projects. Don't create one for each step;
always refactor the CTEs, consolidate them, and try not to have more than
five CTEs in one query. Be responsible using the CTE. And let's go back to
our course.
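[Editor's sketch] As one possible illustration of the advice to merge CTEs: the rank and the segment both depend only on the total sales, so they could be computed in a single nested CTE instead of two. This consolidation is not shown in the video; the table, rows and the name cte_customer_profile are hypothetical.

```sql
-- Hypothetical sample data; not the course database.
CREATE TABLE demo_orders (order_id INT, customer_id INT, sales INT);
INSERT INTO demo_orders (order_id, customer_id, sales)
VALUES (1, 2, 10), (2, 3, 15), (3, 1, 20), (4, 1, 60), (5, 2, 25),
       (6, 3, 50), (7, 1, 30), (8, 4, 90), (9, 2, 20), (10, 3, 60);

WITH cte_total_sales AS (
    SELECT customer_id, SUM(sales) AS total_sales
    FROM demo_orders
    GROUP BY customer_id
),
cte_customer_profile AS (
    -- One nested CTE replaces the separate rank and segment CTEs,
    -- since both columns are derived from the same input rows.
    SELECT
        customer_id,
        total_sales,
        RANK() OVER (ORDER BY total_sales DESC) AS customer_rank,
        CASE
            WHEN total_sales > 100 THEN 'High'
            WHEN total_sales > 80  THEN 'Medium'
            ELSE 'Low'
        END AS customer_segments
    FROM cte_total_sales
)
SELECT customer_id, total_sales, customer_rank, customer_segments
FROM cte_customer_profile
ORDER BY customer_rank;

-- Clean up the sample table.
DROP TABLE demo_orders;
```

Whether merging is worthwhile is a judgment call: it cuts the number of CTEs and joins, at the cost of a block that does two things at once.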
So with that we have learned the standalone CTE and the nested CTE, and
both of them belong to a type called non-recursive CTE. So what is a
non-recursive CTE? It is a CTE that is executed only once. There is no
repetition or looping or anything. SQL executes it in one go, and that's
it. On the other hand, the recursive CTE is exactly the opposite. A
recursive CTE is a self-referencing query that repeatedly processes the
data until a certain condition is met. We usually use the recursive CTE if
we have a hierarchical structure and we want to navigate and travel through
the hierarchy. I know this might be confusing, but don't worry about it.
We're going to have very simple examples.
Now again we have our tables in the database, and we have a CTE. The query
of the CTE is going to be executed for the first time, and in the results
we get the initial data from the CTE, but it is not everything yet. This
intermediate result is not ready for the main query yet. Instead, it goes
back to the CTE, and the CTE checks whether the current result meets a
specific condition. If the check says no, it's not meeting the condition,
what's going to happen? The CTE query is going to be executed for the
second time. So as you can see, we are looping through the CTE.
Now the result of the second iteration, the second execution, will be added
to the intermediate result. So now the intermediate result has more data,
and again, before we can use it from the main query, it is going to be
checked by the CTE. Does the result fulfill the condition? If it's still
no, then go and execute the CTE again. So we have a third iteration, and
new data is going to be added to the intermediate result. So this is our
third iteration.
result. So this is our third iteration. Now it's going to be checked again from
Now it's going to be checked again from the CTE. Did we fulfill the condition?
the CTE. Did we fulfill the condition? If the answer is yes, then the loop
If the answer is yes, then the loop breaks and everything ends, so
there will be no fourth iteration of the CTE. The CTE says: okay, I'm
done, my intermediate result is final and ready to be used by the main
query. From here nothing new happens. The main query retrieves the data
from the intermediate result and does some magic in order to prepare the
final result. That means there are no iterations or looping inside the
main query. The looping happens only in the CTE, and that's why we call
it a recursive CTE. If you compare it with the other types, all other
types always go in one direction and the CTE is executed only once, but
the recursive CTE keeps looping until the condition is met, and only
then does it forward the data to the main query. Normally we use the
recursive CTE for navigating through a hierarchical structure. So if
you have hierarchical structures in your data, you can use the recursive
CTE in order to navigate through them. So this is the recursive CTE.

Okay, so now let's check the syntax of the recursive CTE. It is a little
bit complicated, but we're going to do it step by step. So what do we
have? We have a query and we would like to put it in a CTE. So we have
the usual stuff: the WITH clause, the name of the CTE, then AS, and then
the query. This is the definition of our CTE. But if you leave it like
this, SQL is going to execute it only once, and we would like to make a
loop.
So in order to do that, we have to define a second SELECT statement
inside our CTE, like this. We are selecting the data, and here we have
to define a breaking condition. So in the second query we define a
condition in order to break the loop; otherwise it's going to loop
forever or the system is going to break. You could put it in the WHERE
clause, or you can even put it in an INNER JOIN, because both of them
filter the data and either one can be used to break the loop.

All right, so there is still something missing. How are we going to make
things loop? Well, we have to reference the CTE to itself. We say the
second query selects the data from the same CTE. That means we now have
a query that is querying itself, and this is of course what we want: we
want to make iterations, we want to make a loop. That's why we have to
reference it to itself. Now, in SQL you cannot have it like this. You
cannot have two SELECT statements in one query; you have to connect them
somehow.
That's why we connect them with UNION ALL. Some databases also accept
UNION here if you don't want duplicates, but SQL Server requires UNION
ALL in a recursive CTE. Now, we call the first query the anchor query.
The anchor query is the first query that interacts with the database
and provides us the initial intermediate result. It is the starting
point of the iteration, the first step in the process, and it is
executed only once. We call the second query the recursive query, and
we call it that because this query is executed multiple times. It keeps
repeating and adding data to the intermediate result until the condition
is met or, let's say, there is no more data available to be processed.
So this is the syntax of the CTE query. For the main query nothing has
changed: we just use the CTE name in the main query. So this is the
syntax of the recursive CTE.

Think about it like this: SQL executes the anchor query only once, and
after that it goes through the recursive query and keeps looping and
iterating until a certain condition is met, and then SQL leaves the CTE.
This is what we mean by the anchor and recursive queries.
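The shape described here, an anchor query and a recursive query joined by UNION ALL inside one CTE, can be sketched with a tiny countdown. This is an illustration only, written for SQL Server; the names Countdown and N are made up for the sketch and do not come from the course.

```sql
WITH Countdown AS
(
    -- Anchor query: runs once and seeds the intermediate result
    SELECT 3 AS N

    UNION ALL

    -- Recursive query: reads from the CTE itself and runs repeatedly;
    -- the WHERE clause is the breaking condition
    SELECT N - 1
    FROM Countdown
    WHERE N > 1
)
-- Main query: runs once, after the loop has finished
SELECT N
FROM Countdown;
```

Run as written, it should return the values 3, 2 and 1: the anchor supplies 3, and the recursive query stops producing rows once N reaches 1.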
So now let's have a simple task in order to understand the recursive
CTE. The task says: generate a sequence of numbers from 1 to 20. Let's
do it step by step. That means we have to create a loop from 1 to 20,
and after 20 the loop should stop. So let's go and do it.

The first step of the recursive CTE is to build the anchor query. The
anchor query is responsible for the first iteration, that is, the first
row of the output. So what is the first value between 1 and 20? It is
one. So let's write a query that generates the value one: SELECT, then
1 AS, and I'm going to give it the name MyNumber. That's it. Let's
execute it. Now you can see in the output we have the first member of
our sequence, and this is exactly the task of the anchor query: it
retrieves the first step in the iteration. So let's call it the anchor
query.

The next step is to build the iteration, so we need a CTE. I will build
the CTE now: we say WITH, we call it Series, and then we put everything
in parentheses. Then we go to the main query. This is the main query,
and we select everything from Series, the CTE. Let's execute it just to
make sure that everything is working fine. So far we didn't create any
loop or anything.
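The query on screen at this point is not part of the transcript. A minimal sketch of it in T-SQL, assuming the CTE is spelled Series and the column MyNumber, might look like this:

```sql
WITH Series AS
(
    -- Anchor query: produces the first member of the sequence
    SELECT 1 AS MyNumber
)
-- Main query
SELECT *
FROM Series;
```

With only the anchor query inside the CTE there is nothing recursive yet, so this should return a single row containing 1.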
We have just created a CTE on top of the anchor query and we just call
it from the main query. So now we come to the second step of building
the recursive CTE: we have to build the recursive query. So let's do it.
I will just make this a little bit smaller. Before we start writing the
query, we have to use UNION ALL in order to connect the anchor query
with the recursive query. And let me note that this is the recursive
query. So how are we going to build it? Let's start with the SELECT.
Next, what I usually do is make sure that we are making a recursive CTE,
so I go with selecting FROM and then the name of the current CTE. That
way we are referencing the CTE to itself in order to make it recursive
and to do the looping. Now here comes the tricky part: we need to create
the sequence.
Now what is the current value? The current value is one, right? And
what do we need? We need the second value in the sequence, which is
two. We could do it with 1 + 1; if you do it like this, you get the
output two. But what we are actually doing here is always taking the
current value and adding one in order to generate the next value. So
instead of saying 1, we take MyNumber, the current value, and add one
to it in order to generate the next value in the sequence. That means
MyNumber always holds the current value, and we do the operation + 1 in
order to generate the next number. Having it like this, we are
generating the sequence of numbers.

Now if you execute it like this, let me just execute it, what happens?
It breaks, because SQL will not allow it: SQL limits the recursion to
100 iterations by default. After more than 100, SQL breaks the query so
that we don't loop forever. This is bad, because we didn't define the
breaking mechanism of the loop. So we also have to define in the
recursive query how the loop is going to end, and we usually use a
condition. For example, we can use the WHERE clause and say: okay, keep
looping and keep generating, but always check whether the value of
MyNumber is less than 20. You might ask: shouldn't it be less than or
equal to 20? Well, no, because with less than or equal to 20, once
MyNumber is equal to 20 you are allowing one more iteration, and you
will get 21 in the output. That's why we use less than 20.
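Putting the pieces of this walkthrough together, the finished query could be sketched as follows. The spelling of Series and MyNumber is an assumption based on how the names are spoken; the structure follows the steps described.

```sql
WITH Series AS
(
    -- Anchor query: the first value of the sequence
    SELECT 1 AS MyNumber

    UNION ALL

    -- Recursive query: current value + 1, as long as the
    -- current value is still below 20
    SELECT MyNumber + 1
    FROM Series
    WHERE MyNumber < 20
)
-- Main query
SELECT MyNumber
FROM Series;
```

The condition is checked against the current value before the next one is generated, so the last row that passes is 19, which produces 20. With <= 20, the row holding 20 would also pass and produce 21.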
Now let's execute it and check the sequence. It starts with 1, 2, 3,
4, 5 and goes on until we reach 20. So with that we have solved the
task. Again, it's not that hard, right? We are just providing the
initial step, and then we are providing the loop, where we define
inside it how the loop is going to end.

Now there is one more thing that you can do with the recursive CTE: you
can define a limit on the iterations. For example, you can say that if
this iterates more than 10 times, SQL should break and stop. So you can
define for SQL the maximum number of recursions. How can we do that? We
can do that in the main query. If you go over here and say OPTION, then
parentheses, and then MAXRECURSION, after that you can define the limit.
For example, let's go with 10. Now of course our code needs more than
10 iterations to reach 20, but here we are making the rule that it
should not iterate more than 10 times. So let's execute it. Now we can
see that our SQL breaks, and it says the maximum recursion is 10. As
you can see in the output, we are getting the error for having more
than 10 iterations, which is not allowed. So with that you can control
how many recursions you can have.

Let's say that you would like to have a thousand iterations. So you go
over here and say, you know what, I would like to have a sequence up to
1,000. Let me just comment the option out. If you execute it, you will
get an error, because the default is 100. But of course you can
increase the maximum recursion. For example, let's go with 5,000. Now
it works, and in the output you get a sequence up to 1,000. So with
this you can control how many iterations are allowed in your query, so
that you have control over it.
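The two experiments described here can be sketched in T-SQL like this. The OPTION (MAXRECURSION n) hint is SQL Server syntax and is attached to the main query; the CTE and column names are the same assumed spellings as before.

```sql
-- Experiment 1: a limit of 10 is too low to reach 20,
-- so SQL Server stops this statement with an error
WITH Series AS
(
    SELECT 1 AS MyNumber
    UNION ALL
    SELECT MyNumber + 1
    FROM Series
    WHERE MyNumber < 20
)
SELECT MyNumber
FROM Series
OPTION (MAXRECURSION 10);

-- Experiment 2: a sequence up to 1,000 exceeds the default of 100,
-- so the limit is raised explicitly
WITH Series AS
(
    SELECT 1 AS MyNumber
    UNION ALL
    SELECT MyNumber + 1
    FROM Series
    WHERE MyNumber < 1000
)
SELECT MyNumber
FROM Series
OPTION (MAXRECURSION 5000);
```

In SQL Server the hint accepts values from 0 to 32,767, and 0 removes the limit entirely, so a missing breaking condition combined with 0 would loop until something else stops it.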
Okay, so now let's understand step by step how SQL executes the
recursive query. Here we have a flow diagram in order to understand the
process, the steps of executing the recursive query. So let's go and do
it. At the start, the first step is to run the anchor query. Our anchor
query is just a SELECT of the value one, so in the output we get the
value one in MyNumber. As you can see, the anchor query is executed
only once. There are no iterations or anything: SQL executes it once
and then goes to the next step.

So what is the next step? It's going to execute the recursive query. It
goes over here, and now what happens? We get the current value of
MyNumber. The current value is one, and then we add one to it. So 1 + 1:
from the recursive query we get two, which is added to our results. Now
it checks the condition: is MyNumber smaller than 20? Well, yes, it's
smaller than 20, and since that's true, it re-executes the recursive
query. So now we are doing the second iteration. Again it goes to the
recursive query and asks: what is the current value of MyNumber? It is
two. So 2 + 1: the second iteration gives us the value three.

As you can see, each time the recursive query is executed, it adds more
values to our result. The same question is asked again: is MyNumber
smaller than 20? Well, yes, it is smaller, so it re-executes the
recursive query. SQL keeps looping and iterating and adding values to
the output until we reach the value 20. Now SQL asks: is MyNumber, which
is 20, smaller than 20? Well, no. It is false, so the chain breaks and
we don't loop anymore. This is the end of the CTE, and this is the final
result that is going to be used by the main query. So this is how SQL
executes this recursive CTE.
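One way to watch these steps in the output itself is to carry a counter through the recursion. The Iteration column below is an addition made for this sketch and is not part of the course query: 0 marks the row from the anchor query, and each pass of the recursive query adds one.

```sql
WITH Series AS
(
    -- Anchor query: executed once, marked as iteration 0
    SELECT 1 AS MyNumber, 0 AS Iteration

    UNION ALL

    -- Recursive query: each pass adds one value and bumps the counter
    SELECT MyNumber + 1, Iteration + 1
    FROM Series
    WHERE MyNumber < 20
)
SELECT MyNumber, Iteration
FROM Series
ORDER BY MyNumber;
```

The value 1 should appear with Iteration 0 and the value 20 with Iteration 19, which matches the walkthrough: one anchor execution followed by repeated executions of the recursive query until the condition fails.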
Now let's have another task for the recursive CTE. This time it's going
to be a little bit more advanced. The task says: show the employee
hierarchy by displaying each employee's level within the organization.
That means we have to show for each employee, for each row, a level
that tells us where the employee sits in the hierarchy. First let's
explore the employees table. So let's select everything from
Sales.Employees. Okay, let's execute it.

Looking at the results, we have a few pieces of information about each
employee: the department, the gender, the salary. But here we have the
key: it is the manager ID. This is a self-reference to the same table.
For example, for the first employee the value is NULL. That means this
employee has no manager, which makes this employee the big boss, the
CEO. Looking at the next two employees, they have the manager ID one.
So who is the manager of those two? It's the first row, employee ID
number one. So employee number one is the boss of those two employees.
Then for the fourth one, we can see the manager ID number two, so the
manager of Michael is actually Kevin, the second row. And for Carol the
manager ID is three, which means Mary is the manager of Carol. This is
exactly what we can work with in the recursive CTE: we can use this
information in order to create a loop.

So let's do it step by step. First we start with the anchor query, as
usual. So this is the anchor query, and here the first step, the first
record, is going to be the highest manager, which is the CEO, right? In
order to select only that first record, we can say WHERE manager ID IS
NULL. Let's execute it. With that we now have the first row, and we can
use this as the first step in our iteration. Now let's pick a few
columns in the SELECT, like the employee ID and the first name, and
let's get the manager ID as well. And now we have to start creating the
levels, right? This is the first level, so I'm going to have the value
1 AS, let's call it, Level. So our CEO has level number one. Let's
execute it. As you can see, Frank is the CEO and he is in level number
one. So this is our anchor query.
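To try the anchor query without the course database, here is a sketch against a small temporary table. #Employees is a hypothetical stand-in for Sales.Employees, reduced to the five rows and three columns mentioned in the walkthrough; the column spellings and the IDs 4 and 5 for Michael and Carol are assumptions.

```sql
-- Hypothetical stand-in for Sales.Employees (SQL Server 2016 or later
-- for DROP TABLE IF EXISTS)
DROP TABLE IF EXISTS #Employees;

CREATE TABLE #Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName  VARCHAR(50),
    ManagerID  INT NULL        -- self-reference to EmployeeID
);

INSERT INTO #Employees (EmployeeID, FirstName, ManagerID)
VALUES (1, 'Frank',   NULL),   -- no manager: the CEO
       (2, 'Kevin',   1),
       (3, 'Mary',    1),
       (4, 'Michael', 2),
       (5, 'Carol',   3);

-- Anchor query: the top of the hierarchy, labeled as level 1
SELECT EmployeeID, FirstName, ManagerID, 1 AS Level
FROM #Employees
WHERE ManagerID IS NULL;
```

Against this sample data the query should return only Frank, with Level 1.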
Now we have to do the iteration, right? So we have to start creating
the CTE. Let's say WITH, call it CTE employee hierarchy, then AS, and
then this is the definition of our CTE. Let me just format it like
this. And of course, what else do we need? We need the main query. In
the main query we select everything from our new CTE, like this. So
let's test it. All right.

Now we have prepared the CTE and the main query, and the next step is
to build the recursive query. But first we need the UNION ALL in order
to connect the two queries, and then the recursive query, and now we
can start building the logic. We want to find all the employees whose
manager is employee ID number one, right? Because they are going to
have the second level in the hierarchy. So we SELECT, and we need the
same columns: the employee ID, the first name, and the manager ID. And
we need the level, so this is going to be level number two. It's not
correct yet; I just want to show what this means. Because we need the
employee ID and the first name and so on, we cannot get them from the
CTE yet, because in the CTE we have only one employee. So we still have
to go to the database and grab the next employees. I will select from
the employees table here as well and give it an alias like E. So far we
are not doing anything recursive yet, right? In the recursive query
we're still querying the database. But now we don't need all the
employees from this table; for now we need all the employees where the
manager ID is equal to one.
Of course, in order to get those employees where the manager equal to
employees where the manager equal to one. So we can do it with the workclouds
one. So we can do it with the workclouds for example and say manager ID equal to
for example and say manager ID equal to one. Let me just select this and query
one. Let me just select this and query it. Now we will get those two employees
it. Now we will get those two employees where their manager is the CEO the top
where their manager is the CEO the top manager. But of course we cannot do it
manager. But of course we cannot do it like this. What we're going to do we're
like this. What we're going to do we're going to join this table with our
going to join this table with our current CTE in order to make a loop. So
So let me show you what I mean. We will remove this. We're going to use the inner
join, and we're going to reference the CTE, and let's give it the name C H, and
we connect it like this. So in the ON we're going to say the manager ID of the
employee should be equal to the employee ID. So the employee ID at the start is
going to be number one. So it's going to be like this: employee ID. Now we are
connecting the manager ID with the employee ID, and we are as well reusing the
CTE inside itself in order to make the iterations. And here we don't need the
WHERE clause, because the inner join is going to filter the data automatically.
As we learned, the inner join shows only the matching rows from the left and the
right, so that means there will be filtering.
So we are almost there, but of course we don't want to show the level as a fixed
two. What we're going to do is show it like this: level + one. So the current
level is one. The second iteration is going to be two, and the third iteration is
going to be three. So I think we have everything for our iteration.
Let me just check and make this smaller. Now again, we have here our anchor
query. This is only for the top-level manager. And then here we are just
connecting the managers with the employees. And we are reusing the CTE in order
to get the effect of a loop. And as well we are using the inner join in order to
break the loop once there are no more rows to process. So let's go and execute
it. Now let's check the output. This is our top manager, so level one. This
information comes from the anchor query. Then in the second iteration we get the
employees where the manager ID is equal to one. So it's going to be those two
employees. So those employees are the second level in the hierarchy of our
organization.
And then, in the third iteration, we're going to search for all employees where
their manager ID is equal to either two or three. And this is going to result in
those two employees, Carol and Michael, because their manager ID is equal to
three or two, and they're going to get the level of three. And then after that
SQL is going to try to search for employees where the manager ID is equal to four
or five, and SQL will not find anything, and that's why the loop breaks. So with
that we have solved the task. All right.
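For reference, the query built in this walkthrough might look like the sketch below. It uses Python's built-in sqlite3 module only so that the example is self-contained and runnable; the course itself works in SQL Server, where the same CTE is written with a plain WITH (no RECURSIVE keyword). The table layout, the five rows, and the names CTE_Emp_Hierarchy and ceh are assumptions reconstructed from the narration, not the course's actual script.

```python
import sqlite3

# In-memory SQLite database standing in for the course's SQL Server database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (
        EmployeeID INTEGER PRIMARY KEY,
        FirstName  TEXT,
        ManagerID  INTEGER          -- NULL for the top manager
    );
    -- Rows reconstructed from the narration (assumed sample data).
    INSERT INTO Employees VALUES
        (1, 'Frank',   NULL),
        (2, 'Kevin',   1),
        (3, 'Mary',    1),
        (4, 'Michael', 2),
        (5, 'Carol',   3);
""")

hierarchy_sql = """
WITH RECURSIVE CTE_Emp_Hierarchy AS (
    -- Anchor query: runs once and returns the top manager at level 1.
    SELECT EmployeeID, FirstName, ManagerID, 1 AS Level
    FROM Employees
    WHERE ManagerID IS NULL

    UNION ALL

    -- Recursive query: employees whose manager is already in the CTE.
    SELECT e.EmployeeID, e.FirstName, e.ManagerID, ceh.Level + 1
    FROM Employees AS e
    INNER JOIN CTE_Emp_Hierarchy AS ceh
        ON e.ManagerID = ceh.EmployeeID
)
SELECT EmployeeID, FirstName, ManagerID, Level
FROM CTE_Emp_Hierarchy
ORDER BY Level, EmployeeID;
"""

rows = conn.execute(hierarchy_sql).fetchall()
for row in rows:
    print(row)
```

Frank comes out at level 1, Kevin and Mary at level 2, and Michael and Carol at level 3, which matches the output described above.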
I totally understand if this is complicated, but now we're going to do it step by
step in order to understand how SQL executed this and why we have done it in this
way. So again, we have our flow diagram. We start by running the anchor query,
then the recursive query, and then we have a check. If the check fails, we
iterate; otherwise we end. So let's do it step by step. Here we have the table
employees, and beneath it we have the result of the CTE. So the first step says
we run the anchor query, and we run it only once. So SQL is going to go to the
anchor query and start executing it. So here we are selecting from the table
employees, but we are making a filter on the manager ID: the manager ID should be
null. So that means we will get the record of Frank, and Frank is going to be in
the output. And we are saying the level of this employee is one, so we will have
here the level one. So this is the output of the anchor query, and that's it. It
will never be executed again.
Now we go to the next step. Now we will run the recursive query. So what's going
to happen? In the recursive query we are saying: okay, I would like to select
data from the employees as well and join it with the CTE result, but the join
should be an inner join, so only the matching data between the CTE and the
employees. And now comes the join condition, and this is the key for this
iteration: we are saying the manager ID of the employee should match the employee
ID from the CTE. So SQL is going to go and join the table with the CTE. So now we
have here only employee ID number one. So SQL is going to do it step by step,
searching for any matches. So for the first row we don't have a match, because
the manager ID is not equal to one. So that's why it will not be included in the
result.
The second row: here the manager ID is equal to one, and this is a match with the
employee ID. So SQL is going to take it and put it in the output. Not only that,
SQL is going to increase the level. So the current value here is one, so level +
one. What happens? We will get the value two. We are still in the same iteration;
we are not iterating yet. So this is the first iteration of the recursive query,
and it runs until the whole join is done. On to the next row: we have a match as
well, because the manager ID is equal to one. And we're going to have the same
thing. The level is going to be two as well, because the value of the level
didn't change; the current value is still equal to one. And this is going to keep
going. So for manager IDs two and three, we don't have any matches. And with
that, SQL is done executing the recursive query.
All right. So now SQL is going to say: okay, did we process everything? Well, no.
We still have missing output; we still have missing employees. That's why we
didn't fulfill the condition, and we're going to run this again. So now in the
second iteration, it's going to join the CTE result with the employees again by
matching the manager ID and the employee ID. But this time it's going to focus
only on those two IDs, so the two and three. So SQL is going to go and find any
match where the manager ID is equal to two or three. So it's going to do it step
by step. The first one is not. The second one is not either. The third one is
not, because the manager ID is one. But now at employee number four we have a
match. So SQL is going to take this one and put it in the output like this. And
now in this iteration, what is the current level? It is two, but we add one to
it, and that's why we will get three in the output.
And then SQL keeps going. So we have here employee number five, and the manager
ID is equal to three. So what happens? SQL takes it as well and puts it in the
output as part of the result of the CTE, and the current level is again two +
one, so we're going to have three as well. So with that SQL is done joining the
tables and is going to ask again: did we process all employees? Well, yes, it's
true. That means we don't have to do any more iterations, because if you do any
more iterations SQL will not find anything. So for example, if you go over here,
let me just remove this, and let's say we are joining with the four and five. So
what happens is SQL is going to search in the manager IDs for four and five, and
it will not find anything. So that means we will not be adding anything to the
CTE.
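The step-by-step execution described above can be mimicked in a few lines of plain Python. This is only an illustrative sketch of the flow (anchor once, then repeat the join against the rows found in the previous step until a step adds nothing); it is not how any database engine is actually implemented, and the employee rows are the same assumed sample data taken from the narration.

```python
# Employees as (EmployeeID, FirstName, ManagerID); assumed sample rows.
employees = [
    (1, "Frank", None),
    (2, "Kevin", 1),
    (3, "Mary", 1),
    (4, "Michael", 2),
    (5, "Carol", 3),
]

# Step 1: the anchor query runs exactly once (ManagerID IS NULL, level 1).
current = [(eid, name, mgr, 1) for eid, name, mgr in employees if mgr is None]
result = list(current)
iterations = 0

# Step 2 onward: the recursive query joins the table only against the rows
# produced by the previous step, and the loop ends when a step adds nothing.
while current:
    iterations += 1
    next_rows = [
        (eid, name, mgr, level + 1)
        for eid, name, mgr in employees
        for (parent_id, _, _, level) in current
        if mgr == parent_id          # the inner join condition
    ]
    print(f"iteration {iterations}: {[row[1] for row in next_rows]}")
    result.extend(next_rows)
    current = next_rows

print(result)
```

With this data the loop body runs three times: the first two passes add Kevin and Mary, then Michael and Carol, and the third pass finds nobody with manager ID four or five, so the loop stops.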
That's why SQL stops. So we have a complete result, and we now have all the data
from the employees in the output, and this result is going to be passed to the
main query. So this is why we have done it like this, and this is how SQL
executed this recursive query.
I would like to visualize for you what this means, the levels or the structure of
the organization. So the hierarchy looks like this. At level one the top manager
is Frank. So this is level number one. And then we go to level number two. So we
have two employees, Kevin and Mary, at level two. So they work together and their
boss is Frank. So it's going to look like this, and they are at level two. Then
we have Michael, who directly reports to whom? To Kevin, because here we have the
manager ID two and the employee ID two. So we have one employee here. And Carol
is as well at level three, and she reports to Mary. So both Michael and Carol are
at level three.
So this is what we mean by the level. It can help us identify which employee is
at which level in the organization. If you have a hierarchy in your data, and you
can see that within one table things are referencing each other, like here where
the manager ID is actually an employee ID, so the IDs reference each other, this
means there is a hierarchy and a structure in this table, and you can use the
recursive CTE in order to build those levels and to navigate as well through the
hierarchy.
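To make the levels concrete, here is a small sketch that groups the employees by level and shows who reports to whom. As before, it runs on SQLite through Python's sqlite3 module purely for convenience (SQL Server would use WITH without RECURSIVE), and the sample rows, the CTE name, and the extra ManagerName column are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (
        EmployeeID INTEGER PRIMARY KEY,
        FirstName  TEXT,
        ManagerID  INTEGER
    );
    -- Assumed sample rows taken from the narration.
    INSERT INTO Employees VALUES
        (1, 'Frank', NULL), (2, 'Kevin', 1), (3, 'Mary', 1),
        (4, 'Michael', 2), (5, 'Carol', 3);
""")

levels_sql = """
WITH RECURSIVE CTE_Emp_Hierarchy AS (
    SELECT EmployeeID, FirstName, NULL AS ManagerName, 1 AS Level
    FROM Employees
    WHERE ManagerID IS NULL
    UNION ALL
    -- Carry the manager's name along so each row shows the reporting line.
    SELECT e.EmployeeID, e.FirstName, ceh.FirstName, ceh.Level + 1
    FROM Employees AS e
    INNER JOIN CTE_Emp_Hierarchy AS ceh
        ON e.ManagerID = ceh.EmployeeID
)
SELECT FirstName, ManagerName, Level
FROM CTE_Emp_Hierarchy
ORDER BY Level, EmployeeID;
"""

# Group the employees by their level in the organization.
by_level = {}
for first_name, manager_name, level in conn.execute(levels_sql):
    boss = f" (reports to {manager_name})" if manager_name else ""
    by_level.setdefault(level, []).append(first_name + boss)

for level in sorted(by_level):
    print(f"Level {level}: " + ", ".join(by_level[level]))
```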
All right. So that's all for the recursive CTE, and with that we have covered all
the different types of CTEs that we have in SQL. So now let's have a quick recap.
So we have learned that the CTE, the common table expression, is a temporary
named result, like a virtual table, that can be used from different places in the
query. And we have a lot of advantages with the CTE. The main one is that it
breaks the complexity of a query into multiple small pieces, which makes our
query much easier to read and as well to understand. So it improves readability.
Another advantage of the CTE is that those multiple small pieces are really easy
to manage and to develop. So those pieces are like self-contained, which makes
our queries more modular. So it introduces modularity inside our queries. And we
also learned that the CTE helps us to reduce the redundancy inside our queries,
because it makes the result of one query usable in multiple places inside our
query. So it makes our code smaller and reduces redundancy. And one more
advantage of the CTE is that it helps us to do looping and iterating in SQL by
using the recursive CTE. And we have understood as well that we can treat the CTE
result like any other physical table inside our database. So we can treat it and
handle it like any other table, with only one exception: this table lives only in
one query. So we cannot query the CTE from an external query. Now, we have
learned that the result of the CTE can be used from the main query. This is the
classical one.
But not only can we use it in the main query; we can also use it in another CTE
query, which leads to having nested CTEs. And of course we have learned as well
that we can use the result of the CTE within itself, which makes the CTE
recursive and allows for looping and iterating. And I can only keep recommending
not to use more than five CTEs in one query. Otherwise you're going to get the
exact opposite of the benefits of CTEs, where your code is going to be really
hard to understand and to read and even to extend. Okay my friends, with that we
have covered this amazing and very important technique in SQL, the common table
expressions, the CTEs.
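Two points from this recap, reusing one CTE in several places (including inside another, nested CTE) and the fact that a CTE lives only inside its own query, can be tried out with the sketch below. The Orders table, its rows, and the CTE names are hypothetical and exist only for this illustration; SQLite via Python's sqlite3 module is used just to keep it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample table, not taken from the course database.
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER,
        Sales      INTEGER
    );
    INSERT INTO Orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 5), (4, 3, 40);
""")

nested_sql = """
WITH CTE_Total_Sales AS (
    -- First CTE: total sales per customer.
    SELECT CustomerID, SUM(Sales) AS TotalSales
    FROM Orders
    GROUP BY CustomerID
),
CTE_Avg_Sales AS (
    -- Nested CTE: built on top of the first CTE.
    SELECT AVG(TotalSales) AS AvgSales
    FROM CTE_Total_Sales
)
-- Main query: reuses the first CTE again, together with the nested one.
SELECT t.CustomerID, t.TotalSales
FROM CTE_Total_Sales AS t
CROSS JOIN CTE_Avg_Sales AS a
WHERE t.TotalSales > a.AvgSales
ORDER BY t.CustomerID;
"""
above_average = conn.execute(nested_sql).fetchall()
print(above_average)

# The CTE does not exist outside the query that defined it.
try:
    conn.execute("SELECT * FROM CTE_Total_Sales")
    cte_visible_outside = True
except sqlite3.OperationalError as error:
    cte_visible_outside = False
    print("External query failed:", error)
```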
Now in the next step we're going to talk about a new type of object that you can
use in databases. We don't have only tables; we have views as well. And views are
amazing for giving you dynamics and flexibility in your project. So let's talk
about views. Now, a view is not like a query that we can use in SQL. It is an
object that we can find in the database.
So before we jump immediately to the view, I would like to give you the big
picture, the whole structure of the database. So let's go. We have like a
hierarchy structure, and the highest level of this hierarchy is the SQL server.
The SQL server manages multiple databases. It's like the control center that
keeps everything running and accessible. Now inside the SQL server we have
multiple databases. So a database is a collection of information that is stored
in a structured way. It's where all your data is kept and organized in different
tables and objects. And each database is separated from the others and has its
own data.
Now inside each database we can find multiple schemas. A schema is like a logical
way to group related objects, like tables and views, together within a database.
Like for example, if you have a database called sales, we can group different
tables about the orders underneath the schema orders. And maybe we have multiple
views and tables about the customers, which we can put in the schema customers.
So if you find multiple tables and views that are describing the same object, the
same topic, we put them all together underneath one schema. So again, a database
could be like the sales database and the HR database. They are completely
different types of data.
And underneath the sales database we can have different sections. We have the
section about the orders and the section about the customers. And now moving on,
what can we find inside the schema? We can find tables. A table is where your
data is actually stored. It contains multiple columns and rows. So it is where
the data physically lives. And now inside the schemas we have another type of
object. We call it a view. And of course in this section we are focusing on the
views. So a view is like a virtual table that has a structure and everything, but
inside it we don't have any data. So the view does not store any data, and in
order to see the data we have to execute the query behind the view, and only
after that are we going to see some data. But it is not like the tables; it
doesn't store the data permanently.
Now inside the tables we can define multiple things like columns and keys as
well, and the same goes for the views: inside the views we can define multiple
columns. And one last level: for each column we have a name and a data type. So
as you can see, the databases are really organized, and we have a hierarchy where
the top node is the SQL server and the lowest node is the columns and rows. So
this is what we call the database structure.
Now in order for you to build and manage this structure, we have a set of
commands. We call it DDL, short for data definition language. So DDL is a set of
commands that allows us to define and manage the structure of the database. So we
have commands like CREATE, which helps us to create databases, schemas, tables,
and views. Another command is called ALTER: of course, after you create something
you might like to make changes and updates later. And of course we have DROP in
order to remove any database object, like dropping a schema or dropping a
database, tables, or views. So as you can see, the DDL commands can help us to
manage the database structure.
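A minimal sketch of these DDL commands, and of the point that a view stores no data of its own, could look like this. It runs on SQLite through Python's sqlite3 module, which is an assumption made for runnability: SQLite has no SQL Server style schemas and no ALTER VIEW, so only CREATE and DROP are shown. The Customers table, its rows, and the view name V_German_Customers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- CREATE a table (where the data physically lives) and fill it.
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        FirstName  TEXT,
        Country    TEXT
    );
    INSERT INTO Customers VALUES (1, 'Maria', 'Germany'), (2, 'John', 'USA');

    -- CREATE a view: only the query is stored, not its result.
    CREATE VIEW V_German_Customers AS
        SELECT CustomerID, FirstName
        FROM Customers
        WHERE Country = 'Germany';
""")

view_query = "SELECT FirstName FROM V_German_Customers ORDER BY CustomerID"
before = conn.execute(view_query).fetchall()

# New table data shows up in the view immediately, because the view's
# query is executed again each time the view is read.
conn.execute("INSERT INTO Customers VALUES (3, 'Georg', 'Germany')")
after = conn.execute(view_query).fetchall()
print(before, after)

# DROP removes the view object; the table and its data are untouched.
conn.execute("DROP VIEW V_German_Customers")
remaining_views = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view'"
).fetchall()
customer_count = conn.execute("SELECT COUNT(*) FROM Customers").fetchone()[0]
print(remaining_views, customer_count)
```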
So from this picture we have understood that we can create views inside schemas
in the database. So now if you check the client and the Object Explorer, you can
find the exact same hierarchy. So it starts with the SQL server. This is our
local server that runs on our machine, and inside it we can find multiple
databases, and one of them is our SalesDB that you have installed together with
other databases like AdventureWorks.
So now if you go to the SalesDB over here, you can drill down to the next level,
and now we can find here a lot of objects, and among them the ones that you know:
we have tables and views. And now you might say: okay, but between the database
and the tables we have schemas, so where are the schemas? Well, actually, if you
go inside the tables you're going to find our tables, customers, employees, and
so on, but before each of them we have a name, like Sales.Customers, and you can
find it everywhere: Sales.Customers, Sales.Employees, and so on. Sales is the
schema that brings all those tables together underneath one logical schema. So we
have a database called SalesDB, we have a schema called Sales, and we have a
table called Customers.
And now if you would like to see all the schemas inside this database, what can
you do? You can go to Security over here, and then here we have a folder called
Schemas. If you go over there, you will find the list of all schemas that we have
in this database. You might say: but we didn't create all that stuff; we have
only the Sales one that we know. Well, as you create a database in SQL Server,
you get a lot of other default system schemas that the server creates. One of
them is the information schema, which holds a lot of views about the catalog and
the metadata, where you can find the list of columns, tables, views, and so on.
So here we have only one schema that we, as the user, have created. It is Sales.
So let's go back. Now if you go inside one of those tables you will find here
multiple things, like columns, keys, constraints, and so on.
columns, keys, constraints and so on. And if you go to the columns you will
And if you go to the columns you will end up at the lowest level of the
end up at the lowest level of the hierarchy. And here we have the columns
hierarchy. And here we have the columns like the customer ID and we have some
like the customer ID and we have some extra informations like the data type
extra informations like the data type length and so on. So this is the
length and so on. So this is the structure and the hierarchy of
structure and the hierarchy of databases.
Now I would like you to understand a fundamental concept of databases that helps in understanding views: the three-level architecture of the database. This architecture describes the different levels of data abstraction in a database. So let's see what this means. The architecture is divided into three levels. The first level is the physical level, then we have the logical level, and the third one is the view level. Now let's understand what each level means.

The physical level is the lowest level of the database, where the actual data is stored in physical storage. Usually the people who have access to this layer are the database administrators, because they are the experts who have to manage a lot of things: managing the access and the security of this layer, optimizing the performance, making sure that everything is secure, managing backup and recovery, doing all the configuration and many other tasks. At the physical layer we have to deal with things like data files, partitions, logs, catalogs, blocks, caches and many other things that each database needs in order to store your data. So as you can see, this layer is very complicated, and you really need to be a database expert in order to manage all of that. We call this layer the physical layer, or sometimes the internal layer.
So now let's move to the next level: the logical level. The logical layer is less complicated than the physical layer. At this level you deal with how to organize your data. Normally we have here application developers or data engineers who access the logical level in order to define the structure of the data. Those developers can focus on how to structure the data rather than on how exactly the data is stored physically on the storage. They don't have to deal with all those details; they leave them to the database administrator and focus only on how to structure the data. That's why this kind of role needs its own abstraction level, which is the logical level.

So what are the developers actually doing at this level? Well, they are creating tables and defining the relationships between those tables, or they define views. They can create indexes on the tables in order to optimize performance, or maybe they create stored procedures, functions and other code in order to manage those tables. So as you can see, they are building the data model and structuring your data, but they don't care at all where the data is stored physically in the database. Things here are less complicated than at the physical layer, and it is a perfect abstraction for developers to build projects. We call this the logical layer, or sometimes the conceptual layer.
Okay. So now, moving on to another level of abstraction, we have the view level. The view level is the highest level of abstraction in the database, and it is what the end users and applications can access and see. For example, you could have one view for business analysts, so you prepare and customize views that are suitable only for the business analysts. And you might say: you know what, let's prepare another set of views that are suitable for data visualization and reporting. For example, you can connect Power BI in order to create dashboards, so these views are fully customized and prepared to be connected to the Power BI reports. You can keep doing that, creating multiple sets of views, each suitable for a specific purpose and use case. So as you can see, at this level we are exposing our data to multiple users and multiple applications.

So now the question is: what do we have to deal with at the view level? Well, here you have only views, which hold only the information relevant to the use case or the users. The users at this level have only views. They don't have to deal with tables, indexes, stored procedures, files, logs, partitions or anything else. This is the highest level of abstraction because the focus of this layer is to be friendly for the end users and easy to consume. We call this layer the view layer, or sometimes the external layer.

So this is the three-level architecture of databases, or as we also call it, the three abstraction levels of the database. The physical layer has the highest complexity and the lowest abstraction, and the view layer has the highest abstraction. This is one more reason why views are a very important concept in SQL databases.
Okay. So with that we have enough fundamentals in order to start talking about views. So the question is: what are views? A view is a virtual table in SQL that is based on the result of a query, without actually storing the data in the database. In short, this means a view is a stored, or persisted, SQL query in the database. So let's understand what this exactly means.

So far you have worked with database tables, and all you have done is write a SELECT query in order to retrieve the data from a table. Once we execute the query, we get the result back. Now, views also have the structure of a table, but without any data inside. And for each view there is a query attached to it. So there is no data, but we have a query in order to get the data. We call the normal table a physical table, and the view a virtual table.

So how exactly do we get the data? If you write a query that selects data from the view, not from the table, SQL triggers the query that is attached to the view. This query is responsible for querying the physical table, then the result fills the structure of the view, and we get back the results. So we are directly querying a view, but indirectly we are querying a physical table. The view sits between us and the data. That means my real data is stored inside the database tables, and the views are an abstraction layer between me and my real data. And of course the data is not stored inside the view. Each time I query the view, the SQL query behind the view is executed again: it retrieves the data, brings it back to the view, and then I see it in the output. So this is what we mean by a SQL view.

So now let's have a quick comparison between tables and views. Tables store the actual data physically in the database, so tables are where the data is persisted. Views, on the other hand, are virtual tables: they do not store any data inside the database, but they present the data from the underlying tables. So that means views don't persist any data physically.
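To make the idea of a stored query concrete, here is a minimal sketch. The view name V_Customer_List is hypothetical, and the columns FirstName and Country are assumed to exist in Sales.Customers; only CustomerID is mentioned in the walkthrough.

```sql
-- The view stores only this SELECT statement, not the rows.
-- Assumptions: view name V_Customer_List is made up for illustration;
-- FirstName and Country are assumed columns of Sales.Customers.
CREATE VIEW Sales.V_Customer_List AS
SELECT CustomerID, FirstName, Country
FROM Sales.Customers;
GO  -- batch separator used by SSMS and sqlcmd (CREATE VIEW needs its own batch)

-- Selecting from the view runs the stored query against the
-- physical table Sales.Customers every time it is executed.
SELECT *
FROM Sales.V_Customer_List;
```

If rows are later added to Sales.Customers, the next SELECT on the view shows them without any change to the view itself, because nothing is copied into the view.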
Now, tables are hard to maintain and hard to change. Any change, like adding or moving columns, requires a lot of effort for the migration, especially if you have large tables. Views, on the other hand, are way easier to maintain and very flexible to change. All you have to do is change the query of the view. So you can change things in views very quickly compared to tables.

But if we are talking about performance, tables are faster than views. For example, if you do a simple SELECT on a table, you get the data back as soon as the database fetches it. But if you are selecting from a view, it is effectively two queries: the query that comes from the user and the query of the view, and the query of the view could be very complicated in order to extract the data from the underlying tables. So selecting from a view is generally slower than selecting from a table, and at best equally fast.

Now, if you have a table, you can read from it and you can also write to it. Views, as the name says, are meant for viewing: you use them to read data, not to write to the database. (Strictly speaking, SQL Server accepts inserts and updates through some simple single-table views, but the typical use of a view is to read.) Okay. So those are the big differences between views and tables.
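The point about flexibility can be sketched as follows. This is only an illustration: the view name V_Customer_Countries and the column Country are assumptions, and CREATE OR ALTER VIEW requires SQL Server 2016 SP1 or later.

```sql
-- First version of the view (hypothetical name, assumed Country column).
CREATE OR ALTER VIEW Sales.V_Customer_Countries AS
SELECT CustomerID, Country
FROM Sales.Customers;
GO  -- batch separator used by SSMS and sqlcmd

-- Changing the view means replacing its query.
-- No stored rows have to be migrated, because the view holds none.
CREATE OR ALTER VIEW Sales.V_Customer_Countries AS
SELECT CustomerID, Country
FROM Sales.Customers
WHERE Country IS NOT NULL;
GO

-- Removing the view drops only the stored query; Sales.Customers is untouched.
DROP VIEW Sales.V_Customer_Countries;
```

A comparable change to a table, such as adding or moving a column, touches the stored data itself, which is why it usually takes more care.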
All right. So with that we have a clear understanding of what views are. But now you might ask me: why do we need views? That's why we're now going to dive deep into multiple scenarios and use cases that you might encounter in your SQL projects. So let's start with the first use case. The first use case, and the core reason why we use views in our data projects, is to store central logic from a complex query in the database so that everyone can access it. With that we improve reusability across multiple queries and we reduce the complexity of the overall project. So let's understand what this means.

In our project we have two tables in the database, orders and customers, and we learned previously that if we have a complex query, we can use a CTE. For example, in our CTE we are joining tables and doing some aggregations using SUM, and the CTE holds the data as an intermediate result. Then we have the main query, where for example we do step two and rank the data. So the whole thing is in one query, and let's say a financial analyst was doing this type of analysis. Now what could happen is that you have another user, for example a budget analyst, who is doing exactly the same first step. He also has a CTE where the data is first joined and then aggregated using SUM, but in the last step, in the main query, he is not ranking, he is just doing MAX and MIN. And not only that: we have a third user, the risk analyst, who is also doing the same initial step using the CTE, joining the tables and doing the summarization. But the risk analyst in this scenario is just comparing the data in the last step in the main query.

So now if you sit back and look at this, you can see that all three data workers are doing the same first step. All of them write the same CTE: they join the data and then summarize it. And of course it is a complete waste of time that each one of them has to create the CTE from scratch in order to do some analysis. It is complete redundancy and makes no sense. This is exactly the disadvantage of using only CTEs in projects.
Now, what can we do instead? Those three data workers decide to say: you know what, let's put the first step as a view in the database. So instead of using a CTE each time, we take this script and put it in the database. Now we have central logic stored in the database where everyone can use it. We have this query, this logic, only once, and everyone can benefit from it. So now the financial analyst, instead of going directly to the physical tables, can go to the view. This means she only needs to write one script, the ranking script. The same goes for the budget analyst: he only has to write the query for the MAX and MIN. And the risk analyst just needs to compare the data. So as you can see, all those queries are reduced and the analysts can focus only on the analysis. This is exactly the magic of views in data analytics. This logic, this knowledge, can be centralized in the database, and that is way faster and better than having the logic written again each time someone wants to do an analysis. So this is why we need views in data projects.
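The scenario with the three analysts could look roughly like this in SQL. It is a sketch under assumptions: the view name V_Customer_Sales, the join key CustomerID, and the Sales column in Sales.Orders are illustrative, and the risk analyst's "comparison" is interpreted here as a comparison against the overall average.

```sql
-- Shared first step, written once: join the tables and aggregate.
-- Assumptions: V_Customer_Sales is a hypothetical name; both tables are
-- assumed to share a CustomerID column, and Sales.Orders a Sales column.
CREATE VIEW Sales.V_Customer_Sales AS
SELECT
    c.CustomerID,
    SUM(o.Sales) AS TotalSales
FROM Sales.Customers AS c
JOIN Sales.Orders AS o
    ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID;
GO  -- batch separator used by SSMS and sqlcmd

-- Financial analyst: only the ranking step is left to write.
SELECT
    CustomerID,
    TotalSales,
    RANK() OVER (ORDER BY TotalSales DESC) AS SalesRank
FROM Sales.V_Customer_Sales;

-- Budget analyst: only MAX and MIN.
SELECT
    MAX(TotalSales) AS HighestSales,
    MIN(TotalSales) AS LowestSales
FROM Sales.V_Customer_Sales;

-- Risk analyst: compare each customer with the overall average
-- (one possible reading of "comparing the data").
SELECT
    CustomerID,
    TotalSales,
    TotalSales - AVG(TotalSales) OVER () AS DiffFromAverage
FROM Sales.V_Customer_Sales;
```

The join and the SUM appear only in the view definition; each analyst's query contains just the step that is specific to that analysis.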
So now if you compare views with CTEs: CTEs are used to reduce redundancy within one single query, so they improve reusability within one query. With views, on the other hand, we are reducing redundancy across multiple queries. So we are reducing the complexity of the whole project, and views improve reusability across multiple queries. Now think about it like this. We use views in order to persist logic in the database. The logic is so important that we want to persist it in the database. In tables we persist data, but with views we are persisting logic. In a CTE, on the other hand, the logic is not persisted. It is temporary and is calculated only on the fly within the scope of one query. This logic is important only in this scenario and not for any other queries, which is why it makes no sense to persist it using a view. So you have to decide: if this logic is very important, take it out of the CTE and put it in a view. But if you think, you know what, this logic is only important in this one query, then stay with the CTE, because creating views always needs some extra steps in order to maintain them. You have to create the view, and you have to drop the view if you don't need it anymore. With a CTE there is almost no maintenance. The database automatically does the cleanup once the query is done, so there is no extra step to drop a CTE or anything like that. That's why CTEs are easier to use than views. So those are the big differences between views and CTEs.

Okay. So now let's quickly check the syntax of a view. We have a query like SELECT, FROM, WHERE. This is a simple SELECT statement. But in order to create a view, an object in the database, we have to use a DDL command, CREATE. So we say CREATE VIEW, because we want to create a view, then the name of the view, and then, like with the CTE, we say AS followed by a pair of parentheses with the query inside. So as you can see, it's very simple. We call this a DDL command, where we are telling the database: go and create a view, and the logic of the view comes from this query. This is how you create views in the database.
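The syntax just described, CREATE VIEW, then a name, then AS and the query in parentheses, can be sketched with a concrete example. The view name V_Order_Dates and the columns OrderID and OrderDate are assumptions chosen for illustration.

```sql
-- CREATE VIEW <name> AS ( <query> )
-- Assumptions: V_Order_Dates is a hypothetical name; OrderID and
-- OrderDate are assumed columns of Sales.Orders.
CREATE VIEW Sales.V_Order_Dates AS
(
    SELECT OrderID, OrderDate
    FROM Sales.Orders
    WHERE OrderDate IS NOT NULL
);
```

The parentheses mirror the CTE style; SQL Server also accepts the SELECT statement directly after AS without them.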
Okay. So now let's take the following task: find the running total of sales for each month. I'm going to start this task by solving it using a CTE. First I'm going to do a few aggregations at the level of the month. So let's SELECT. What do we need? We need the order date, but we need it as a month, so I'm going to use DATETRUNC and say that I would like to have the date at the granularity of month, and let's call it OrderMonth. After that we do a few aggregations: for example, let's get the SUM of sales and call it TotalSales. That's it for the start. Now let's select it FROM the table Sales.Orders and GROUP BY the month. So something like this.
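Written out, the first step might look like the query below. The column names OrderDate and Sales and the aliases OrderMonth and TotalSales follow the spoken description but their exact spelling is an assumption. DATETRUNC is available in SQL Server 2022 and later.

```sql
-- Step 1: total sales per month.
-- Assumptions: Sales.Orders has OrderDate and Sales columns.
SELECT
    DATETRUNC(month, OrderDate) AS OrderMonth,  -- date reduced to month granularity
    SUM(Sales)                  AS TotalSales
FROM Sales.Orders
GROUP BY DATETRUNC(month, OrderDate);
```

The GROUP BY repeats the DATETRUNC expression because T-SQL does not allow grouping by a column alias defined in the same SELECT.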
by the month. So something like this. Let's go and execute it. And now for
Let's go and execute it. And now for this we get for each month the total
this we get for each month the total sales. And now the next step that we
sales. And now the next step that we have to go and calculate the running
have to go and calculate the running total for the sales. This is of course
total for the sales. This is of course not the running total. So that means
not the running total. So that means either we can go and use subqueries. So
either we can go and use subqueries. So this means this is our first step and we
this means this is our first step and we need a second step. So either use
need a second step. So either use queries or cities. I will go with the
queries or cities. I will go with the city over here. So I'm going to say with
city over here. So I'm going to say with city and
city and monthly summary and we're going to
monthly summary and we're going to define it like this. And now what we're
define it like this. And now what we're going to do, we're going to go and
going to do, we're going to go and define the main query. So the main query
define the main query. So the main query going to be simple. So select and let's
going to be simple. So select and let's go and get the order month. And now we
have to build the running total, so we're going to use a window function: SUM
of total sales, and then OVER. We don't have to partition the data; we will
just sort it by the order month, and we can leave it ascending. So this is the
running total, and of course we have to select from our CTE here. Let's execute
it, and with that we are getting the running total. Of course, we can add the
total sales to the output in order to understand the results. So here in the
output we are just building cumulative sales. For this scope everything is
fine, and we are using the CTE. But now imagine that this logic is important
for multiple queries. It's really nice to have such a report, where we
aggregate the data at the month level, and it could be used by different users
and different queries. So now we say: how about putting this logic in one view,
so that everyone can access it and we don't
have to repeat the same aggregations over and over. And now before we put it
in a view, maybe another user says: you know what, we would like to have one
more aggregation, not only the total sales. Let's make the scope a little bit
bigger so that everyone can benefit from it. For example, we can go over here
and add the total number of orders: we say COUNT, take the order ID, and call
it total orders. Or maybe someone else says, let's get the quantities as well.
So we can sum the quantity like this and call it total quantities. With that we
are doing a lot of aggregations at the month level. Let's execute only the CTE.
Now we have a really nice report that is based on the months and can be
used from many different queries. So now what we're going to do, we're going to
take this and put it in a view. Let's select only this logic and create a new
query. We put our query here, and now we have to write the DDL in order to
create a view. It goes like this: CREATE VIEW. Let's give it a name, maybe
starting with V underscore, and this is going to be the monthly summary, so
V_Monthly_Summary. This is the name of the view. Then AS, and then we put
everything in parentheses; it's like building a CTE. So here we have our logic,
and here is our DDL statement to create the view. Now let's execute it. As you
can see, the output says only that the command completed, because this is not a
SELECT query. This is a DDL command, so SQL is just going to tell you: okay,
either I created
it successfully or not. So now the question is where do I find now my view?
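Before turning to Object Explorer, here is a sketch of the DDL that was just described, followed by a catalog query that answers the same "where is my view" question in T-SQL. The column names of Sales.Orders (OrderDate, Sales, OrderID, Quantity) and the output aliases are assumptions, not taken from the video.

```sql
-- Assumed columns on Sales.Orders: OrderDate, Sales, OrderID, Quantity.
-- No schema is given for the view, so it is created in the default schema.
CREATE VIEW V_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales)                  AS TotalSales,
        COUNT(OrderID)              AS TotalOrders,
        SUM(Quantity)               AS TotalQuantities
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
);
-- GO is a batch separator understood by SSMS and sqlcmd, not a T-SQL statement.
GO

-- List the view together with the schema it was created in.
SELECT
    SCHEMA_NAME(schema_id) AS SchemaName,
    name                   AS ViewName
FROM sys.views
WHERE name = 'V_Monthly_Summary';
```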
Well, if you go to the Object Explorer, you can see that underneath our
database SalesDB we have something called Tables, which are the tables we are
used to querying. Beneath it we have our Views as well. If you expand Views, we
are not seeing any view yet, because we have only just created it. So go over
here and refresh, and once you do that you will see the newly created view.
This is the one that we just created. Now we can create a new query and simply
query the view: SELECT * FROM V_Monthly_Summary. Let's execute it. As you can
see, we are now getting the result of the view, and I'm accessing this logic
from a completely external query. So now I can
think about the view as any other table that we have in the database. And again
the big difference between views and tables: a table has actual data, and
everything there is persisted, but the view is just an abstraction for me.
Behind it there is a query that goes to the tables and queries them in order to
present the results. But I don't care about all those details; I can go
straight to the query over here and start querying. So now, in order to create
the running total of sales, I don't have to create the CTE or subqueries. I
just take our main query. Let's go back over here. Instead of using the CTE, I
can directly access the view. As you can see, my query is now very simple: I'm
doing step two immediately, without having to prepare the data first. If I
execute it, I will get exactly the same results. And now, if you compare the
query on top of the view with the CTE query, you can see that the CTE has more
steps and is a little bit more complicated than the query on top of the view.
This is exactly the benefit of the view. We reduce the
complexity and it is very easy to consume from the point of view of users.
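To make the comparison concrete, the two versions could be sketched as follows. The column names on Sales.Orders and the alias RunningTotal are assumptions; the second query also assumes the view V_Monthly_Summary from this walkthrough already exists.

```sql
-- Version 1: both steps in one statement, using a CTE.
-- Assumed columns on Sales.Orders: OrderDate, Sales.
WITH CTE_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales)                  AS TotalSales
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
)
SELECT
    OrderMonth,
    TotalSales,
    -- No PARTITION BY: one running total across all months, sorted ascending.
    SUM(TotalSales) OVER (ORDER BY OrderMonth) AS RunningTotal
FROM CTE_Monthly_Summary;

-- Version 2: step one lives in the view, so only step two remains.
SELECT
    OrderMonth,
    TotalSales,
    SUM(TotalSales) OVER (ORDER BY OrderMonth) AS RunningTotal
FROM V_Monthly_Summary;
```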
So this is how you can put your logic in a central place using views, and with
that we have learned how to create a view. Now one more thing, about schemas.
If you check our tables over here, they all have one schema: we have
Sales.Customers, Sales.Employees, Sales.Orders, and so on. Our new view has the
schema dbo. If you create any object, whether it's a table or a view, and you
don't specify a schema, it lands in a default schema called dbo. Now let's go
back to our DDL script. As you can see over here, we didn't specify any schema.
We
just said okay, this is the view name. And now in order to put our view in the
correct schema, since we don't want it in the default one, you have to specify
the schema name in the DDL. To do that, we go to the name of the view and write
the schema name before it, separated with a dot. So the first part is the
schema name and the second part is the view name. Now let's execute it. If you
check over here, you don't see anything new, but if you refresh you will find
another view in the correct schema: we have Sales.V_Monthly_Summary, and this
is exactly what we want. This is how you can assign a view, or even a table, to
the correct schema if you don't want to use the default one for
the view. All right. So now the next
step is that you say you know what I would like to clean up. I don't need
those two views in my database. So how do we delete a view? We can use the DROP
command. It is very simple. You create a new query and say DROP, and then you
say what you want to drop: you want to drop a VIEW. Then you have to specify
the schema and name of the view. But since it is in the default schema dbo, I
don't have to write the schema down, so we can start immediately with the view
name: V_Monthly_Summary. That's it. Now we execute it. It says it's completed,
but as you can see nothing has changed, so we refresh. Now we can see that the
database did remove the view in the schema dbo. This is how you can
drop a view in SQL. Okay. So now to the next step. Let's go back to our DDL of
creating the view Sales.V_Monthly_Summary. Now you say: you know what, I would
like to change the logic inside the view. So how can we update this content?
How can I update my query? Say we delete this column, for example, because I
need only three columns, and then execute it. The database says: I cannot do it
for you, because we already have such a view. SQL Server will not replace it;
it's going to say no, we already have an object with the same name. So how can
we update the view? Well, in other databases, like Postgres for example, it's
very simple. You can say CREATE OR REPLACE VIEW. It's like telling the
database: create this view, or if it already exists, replace it, and you will
not get an error in Postgres. But in SQL Server it is a little bit
more complicated. we don't have this command. So here you have two ways.
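As a sketch of the two ways walked through next: the view name Sales.V_Monthly_Summary comes from this walkthrough, while the column names on Sales.Orders are assumptions.

```sql
-- Way 1: two manual steps. Run the DROP on its own, then run CREATE VIEW again.
-- DROP VIEW Sales.V_Monthly_Summary;

-- Way 2: one re-runnable script.
-- OBJECT_ID(name, 'V') returns NULL when no view with that name exists,
-- so the DROP is skipped the first time the script runs.
IF OBJECT_ID('Sales.V_Monthly_Summary', 'V') IS NOT NULL
    DROP VIEW Sales.V_Monthly_Summary;
-- GO ends the batch (SSMS/sqlcmd); CREATE VIEW has to start a batch of its own.
GO

-- Assumed columns on Sales.Orders: OrderDate, Sales, OrderID.
CREATE VIEW Sales.V_Monthly_Summary AS
(
    SELECT
        DATETRUNC(month, OrderDate) AS OrderMonth,
        SUM(Sales)                  AS TotalSales,
        COUNT(OrderID)              AS TotalOrders
    FROM Sales.Orders
    GROUP BY DATETRUNC(month, OrderDate)
);
```

Newer SQL Server versions also offer single-statement forms that may be worth knowing: CREATE OR ALTER VIEW (SQL Server 2016 SP1 and later) and DROP VIEW IF EXISTS (SQL Server 2016 and later).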
Either you say: you know what, let's first drop the view. You write DROP VIEW
with the same name over here, then you select just the DROP VIEW statement and
execute it, and the view is dropped. Then we recreate the view like this. So
what we have done is destroy the view and then recreate it using the new logic.
Or you say: I would like to have everything in one go; I don't want to do it in
two steps. For that, in SQL Server you have to use T-SQL, Transact-SQL. It is
an extension of SQL that is specific to SQL Server. It's like programming,
where you can add variables or add checks. We will not do a deep dive into this
language, but I would like to show you how to do it for views,
so just follow me with that I'm going to go and replace the whole thing and then
we're going to say IF, and now we are checking the system catalog with
OBJECT_ID. Here we specify the view name, so let's copy the whole thing,
including the schema. Then we tell SQL that this is a view. If this object
exists, so we say IS NOT NULL, which means it exists in the catalog, then what
should SQL do? It should drop this view. So we say DROP VIEW, just like we did
before, then a semicolon, and then we say GO. With that we are telling SQL that
the T-SQL part is done, so the check logic is done, and after that we have the
DDL for our view. So again, what are we doing? Before creating the view, we
check whether the view exists. If it exists, we tell SQL to go and drop it. If
it doesn't exist, which means we haven't created this view yet and it is a
brand-new view, then this step is going to be skipped, since there is nothing to
drop. So now if you go and execute the whole thing it will work and of course
if you refresh over here you still see the view. So SQL did destroy the view
first and then recreated it, and the same happens if you execute it again. So
this is how you replace the logic of a view in SQL Server. With that we have
learned all the scenarios: how to create a view, how to drop a view, and how to
update the logic of a view.
Now back to our database architecture, and let's understand how
the database executes views. So now let's say that the data engineer is
creating a view called top end. The query is sent to the database engine, and
the database engine understands that this is a view and not a table. So now the
database engine goes to the disk storage, to the catalog, and stores there not
only the metadata about the view but also the SQL behind the view. It takes the
SQL statement that you defined in CREATE VIEW and places it in the catalog as
well. If you compare this with tables: for tables the catalog holds only
metadata, but for views we have both the metadata and the query of the view.
You can also see that the database engine does not create a table in the user
data, so no data is stored on the disk or in the cache. The actual data, the
physical data, will not be stored
anywhere. We are storing only metadata and the query inside the system catalog.
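In SQL Server, one way to see this for yourself is to read the stored query text back from the catalog views. The sketch below assumes a view named Sales.V_Monthly_Summary exists (the one from the earlier walkthrough); the output aliases are made up for readability.

```sql
-- The view's metadata lives in sys.views; its query text in sys.sql_modules.
-- Assumes Sales.V_Monthly_Summary exists; substitute any existing view name.
SELECT
    SCHEMA_NAME(v.schema_id) AS SchemaName,
    v.name                   AS ViewName,
    m.definition             AS ViewDefinition   -- the CREATE VIEW text as stored
FROM sys.views AS v
JOIN sys.sql_modules AS m
    ON m.object_id = v.object_id
WHERE v.object_id = OBJECT_ID('Sales.V_Monthly_Summary');
```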
So now we tell our data analyst: okay, we have a new view. The data analyst can
write a query to retrieve the data from the view, so he says SELECT from the
view and executes it. The database engine takes it and understands: okay, now
we are talking about a view. So the database first has to retrieve not the data
but the query from the catalog, in order to understand what has to be executed.
Then the database executes the query of the view first, and the data for this
query comes from a physical table called Orders. So now the database engine
queries Orders to retrieve the data for the end user, then the user's query is
executed and the result is sent back to the data analyst. As you can see, there
are two queries. The SQL engine first has to execute the query from the view,
and only after that the database engine
can execute the query that comes from the user. So actually the data comes
always from a physical table, but we are not giving the data analyst access to
the table. We are just giving access to the view. This happens each time an end
user selects data from the view: the database engine always grabs the query
from the catalog, executes it first in order to get the data, and then executes
what the end user wants. And now, if the data engineer says: no, let's drop the
view, she writes a query to drop the view, and the database engine goes to the
system catalog and deletes both the metadata and the query. So as you can see,
if you drop a view, you are not losing the actual data. There will be no user
data lost at all, so don't worry about it. What you are losing is only the
query and the metadata about your view. Only if you drop a physical table, like
Orders, will you lose your data. So dropping a view is not as bad as dropping a
database table. So this is
how the database works with the views behind the scenes.
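The point about dropping can be checked with a short sketch. It assumes the view Sales.V_Monthly_Summary from the earlier walkthrough exists and reads from Sales.Orders; the output aliases are made up.

```sql
-- Dropping the view removes only its catalog entry (metadata and query text).
DROP VIEW Sales.V_Monthly_Summary;

-- The view is gone from the catalog: OBJECT_ID returns NULL for it now.
SELECT OBJECT_ID('Sales.V_Monthly_Summary', 'V') AS ViewObjectId;

-- The physical table the view was reading from still holds all its rows.
SELECT COUNT(*) AS OrderRows
FROM Sales.Orders;
```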
Now moving on to the next use case of using views in projects: we use views in
order to hide complexity and to improve abstraction. In many scenarios we work
with very large and complex databases, and we can use views to reduce the
complexity and make things easier for the users. So let's understand what this
means. I'm going to explain a scenario that happens in almost every project.
If you get access to a database where you want to do analyses, you will often
be in a situation where you find a large database where the
tables are very complex to understand. They have a lot of columns. They have
technical and cryptic names, and how the tables are connected to each other and
the relationships between them are almost impossible to understand. You then
have to be deeply involved with the data models, the documentation, and the
experts until you understand how to query this database. So if you are not a
developer, from an end user's perspective it can be a nightmare, where you are
trying to do multiple joins just to make a simple analysis. Of course, from the
database perspective, this data model is good enough for one application. But
if you are opening your database to multiple data analysis projects, this can
be a nightmare, because you have to
go and explain for each user how to query the data. So what we usually do
is this: instead of giving direct access to such a technical and
hard-to-understand data model, we as developers create multiple views, since we
are the experts on the data model. These new views are an abstraction of the
complexity that I have in my database. We have to make sure that those views
provide objects that are friendly: they have a full English name that makes
sense, and the columns are friendly as well. We also try not to offer too many
views, so the users don't have to do all the joins. So we provide a few views
that are friendly and contain a lot of the information that the users need for
their analyses. With that, the users have access to
something more friendly and easy to consume and then they can write simple
queries to do analyses on top of these friendly views. We can give this a name:
we are providing a data product on top of my complex physical database. So here
again you see how important views are for providing an abstraction and
easy-to-consume objects for the users. With that I can hide all my complexity,
and the script of the view is developed by the experts, and only once, so that
the users don't have to understand or write these complex SQL joins. With that
you can make your data projects way easier than before. So this is another
important use case for views, where we can use them to provide
abstraction and as well easy and friendly objects for the end users.
Okay. So now let's have the following task: provide a view that combines details
from orders, products, customers and employees. So instead of having all those
tables from our database, we have to provide one combined view that has
everything, well, almost everything. So now let's see how we can create such a
view. Let's start with the table orders. I'm going to SELECT * FROM Sales.Orders
and execute it. This is the central table that connects everything. You can see
here we have the order ID, product ID, sales, customers and so on. So it is a
great starting point. Now we're going to be picky about the columns. I would not
show all the columns, but I would say let's show, for example, the order ID. This
is essential. It's nice to have a unique identifier. Now the product ID, I will
not show it, but I will just list it over here. The same for the customer ID and
the salesperson ID. Those I would like to replace later, so I will just make them
a comment so I don't forget about them, because it makes no sense to show the
product ID, customer ID and so on. We would like to show the details about each
object: instead of the product ID, I would like to show, for example, the product
name itself and some other information from the table products. And with that we
are reducing the complexity. So now what else can we get from the table orders?
We can get the order date. I will put it here. And maybe we can get things like
sales and quantity. So like this. Of course, we could put in all the columns, but
for now I will go with this information. Now, it's important, since we're going
to have a lot of tables, to make sure we are using aliases. So now we're going to
have the o for each of those columns. All right. Fine. So now we
have four details from the table orders.
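Written out, the query at this point might look like the sketch below. The identifier spellings (Sales.Orders, OrderID, OrderDate, Sales, Quantity and the three ID columns) are assumptions based on how the names are spoken, so adjust them to your own database.

```sql
-- Start from the central table and keep only the columns worth exposing.
-- Identifier spellings are assumed from the spoken names.
SELECT
    o.OrderID,
    o.OrderDate,
    -- o.ProductID,      -- to be replaced by product details
    -- o.CustomerID,     -- to be replaced by customer details
    -- o.SalesPersonID,  -- to be replaced by employee details
    o.Sales,
    o.Quantity
FROM Sales.Orders AS o;
```

The commented-out ID columns are only reminders of which lookups still have to be joined in.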
Now, what is next? We have the product ID. So let's get the information from the
products. What we're going to do is use a LEFT JOIN, just to make sure we don't
miss any order. If you go with the INNER JOIN, you might miss some orders, so I
will not do that. So let's join it with the products like this. And now we have
to join the tables. We can use the keys: the product ID equal to the order's
product ID. All right. So now the question is which information we want to show
to the users. Let's go to the table products. We have the product, the category
and the price. I would say let's get the product and category. That's enough. So
now instead of the ID I'm going to have it like this: the product and the
category. Now let's test it. I'm going to execute it. As you can see, we don't
have a product ID. We have the product name, which is more friendly. So we now
have those two columns from the orders, those two from the products, and the last
two as well from the orders. It looks really nice and friendly, and with that the
users don't need an extra table called products. We have everything in one. Now
let's do the same for the customers. So let's join Sales.Customers as c, and join
them using the key: the customer ID equal to the customer ID. Now we have to
grab a few columns from the customers.
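At this stage the query could be sketched as follows, with products already contributing two columns and customers joined but not yet selected from. Table, column and alias names are again assumptions taken from the spoken names.

```sql
-- Orders enriched with product details; customers are joined but their
-- columns have not been picked yet. Names are assumed, not confirmed.
SELECT
    o.OrderID,
    o.OrderDate,
    p.Product,
    p.Category,
    o.Sales,
    o.Quantity
FROM Sales.Orders AS o
LEFT JOIN Sales.Products AS p
    ON p.ProductID = o.ProductID      -- LEFT JOIN keeps orders without a matching product
LEFT JOIN Sales.Customers AS c
    ON c.CustomerID = o.CustomerID;   -- same idea for customers
```

With an INNER JOIN instead, any order whose product or customer has no match would silently drop out of the result, which is why the LEFT JOIN is the safer choice here.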
Let's go and check. We have a first name, last name, country and score. I would
say I would go with the names and the countries, but instead of having first name
and last name I'm going to put everything in one. So we have to concatenate the
information. We're going to take the first name, then a plus, then a space
between the first name and the last name, then another plus and the last name,
like this. Now we will not call it name. We're going to call it customer name,
because later we're going to have an employee name as well. All right. Next we
want to get the country, and we have to say this is the country from the
customers. So we're going to call it customer country, and that's it. Let's
execute it. Now we can see we have again our orders and products, and now we have
the information from the customer. But here we have an issue: we have some nulls,
and that's because there is no last name. So what we're going to do is handle the
nulls for the last name and for the first name as well. We're going to use
COALESCE: if the last name is null, then make it an empty string, and the same
thing for the first name. All right. So now let's go and execute it.
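The null handling can be tried in isolation on the customers table. This is a minimal sketch assuming the columns are named FirstName, LastName and Country; in SQL Server, concatenating a NULL with + yields NULL by default, which is why the whole name disappeared before.

```sql
-- COALESCE swaps a NULL name part for an empty string, so the
-- concatenation no longer collapses to NULL. Column names are assumed.
SELECT
    COALESCE(c.FirstName, '') + ' ' + COALESCE(c.LastName, '') AS CustomerName,
    c.Country AS CustomerCountry
FROM Sales.Customers AS c;
```

When one part is missing, the result keeps a leading or trailing space; wrapping the expression in LTRIM(RTRIM(...)) would remove it if that matters.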
So with that we are getting the first name if the last name is missing, or if the
first name is missing we get the last name. So it looks good. With that, we have
the customer's details. The last thing we have to get is the employees. The
employee here is called salesperson ID, which we can connect directly to the
table employees. So if you go to the employees over here, which columns do we
need? We have the first name, last name, department and so on. I would say let's
get the names and the departments. First let's join it: LEFT JOIN Sales.Employees,
and we're going to join it using the employee ID, matched with the salesperson ID
that comes from the orders table. So now instead of the salesperson ID we're
going to have the same thing as before. I will just copy and paste this. Instead
of the alias c we're going to have e, and e over here as well, and we're going to
call it sales name. And what else are we going to have? We're going to have the
department. So department, and that's it. Let's execute it. So now we have a lot
of information in our view. We have the first columns from the orders, then from
the products, here we have the ones from customers, those two from the employees,
and the last two again from the orders. With that we have now combined all the
relevant information from multiple tables in our database in only one view. This
result is relatively big, but we have all the information in one, and it is more
friendly for the users to consume our data instead of joining all those four
tables together. So now, as the next step, we're going to put the result of this
query in a view in our database so that our end users can start consuming it. So
how are we going to do it? This is our combined query, and now we're going to
write the DDL for it. So CREATE VIEW, and now we're going to give it the name
order details, and then AS, and we're going to put the whole thing in two
parentheses, at the start and at the end. And of course don't forget the schema.
Our schema is Sales, so Sales dot, then we have the view name, just to have it in
the correct schema and not in dbo. So everything is ready. Let's go
ahead and execute it.
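Put together, the DDL might look like the sketch below. The view name Sales.V_Order_Details follows how the view is queried later in the session; all column and table spellings are assumptions based on the spoken names rather than a confirmed script.

```sql
-- One combined, consumer-friendly view over four tables.
-- All identifiers are assumed from the spoken names.
CREATE VIEW Sales.V_Order_Details AS
(
    SELECT
        o.OrderID,
        o.OrderDate,
        p.Product,
        p.Category,
        COALESCE(c.FirstName, '') + ' ' + COALESCE(c.LastName, '') AS CustomerName,
        c.Country AS CustomerCountry,
        COALESCE(e.FirstName, '') + ' ' + COALESCE(e.LastName, '') AS SalesName,
        e.Department,
        o.Sales,
        o.Quantity
    FROM Sales.Orders AS o
    LEFT JOIN Sales.Products  AS p ON p.ProductID  = o.ProductID
    LEFT JOIN Sales.Customers AS c ON c.CustomerID = o.CustomerID
    LEFT JOIN Sales.Employees AS e ON e.EmployeeID = o.SalesPersonID
);
```

In SQL Server, CREATE VIEW has to be the first statement in its batch, so it is easiest to run it on its own. Consumers then only need SELECT * FROM Sales.V_Order_Details and never see the joins.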
Now let's go and check our database. If you refresh, you will find our second
view, order details. So now let's test it. We're going to say
SELECT * FROM Sales.V_Order_Details. Let's execute it. And with that we are now
getting a combined view that shows all the important information from the
database. So this is what the users can see. And with that the users don't care
about how many tables we have in the database and how to join all those tables.
We have only one view and we can start working on it. This is a very common use
case for the
views. Okay, moving on to the next scenario, the next use case. We use SQL views
in order to implement security and to protect our data in the database. In many
scenarios, we have sensitive information in our data and we cannot share it with
everyone. So one of the best practices is to create views in order to protect
your data before sharing it with the users. Let's understand what this means.

Let's first understand the scenario without views, with only tables. Let's say
that you have the table orders, with four columns and three rows, and then you
have, for example, a manager who has direct access to the database and starts
writing some queries in order to retrieve data. But in your project you have
multiple people who have access to your database, for example a data analyst, and
she is also writing a script in order to retrieve data from the orders. And maybe
you have a student who has access to your database and is querying the data like
any other role, like the manager and the data analyst. So as you can see, you now
have different roles in your project, and all of them have the same rights by
accessing your table directly. So a manager, a data analyst or a student, they
are all seeing the whole table, all rows and all columns. And of course in real
projects this is a big problem. Sometimes the data is sensitive and you cannot
give access to everyone. And if you are using only tables this is going to be a
nightmare, because you can create multiple tables, but it's going to be really
hard to keep all those tables in sync.

Instead of that, we have views. What you can do is remove all access to the
physical table and instead create a view for each role. For example, you can
create a view called orders managers, and maybe you give it all the data and all
the columns, because the managers are allowed to see, let's say, sensitive data.
It's still nice to create a view: maybe you change your mind later and remove
something. Now let's say that for the data analysts you want to offer all the
data, but there is one column that is very sensitive. So you can create another
view called orders analyst. In this view only three columns are available, A, B
and C, and then you give access to all data analysts, and with that you have
protected this sensitive information. We call this column-level security. And now
we come to our poor students. Here we create another view where we are not only
protecting the column D, but we are also protecting a few rows, for example row
number three, because we want to offer only a little information to the students.
So we are protecting the columns and the rows as well, and for that we can create
another dedicated view called, for example, orders students, and offer it to the
students. With that we are doing column-level security and row-level security as
well. So we are offering multiple views very easily, without having to worry
about how to load the data from one table to another. Creating those views is
really easy and gives us a perfect tool to manage the security of our data. So
this is one very common use case of using views in data projects.
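The three role-specific views from this scenario could be sketched like this. Everything here is hypothetical: a schematic Sales.Orders table with columns A, B, C and a sensitive column D, a row filter that assumes column A identifies the row, and database roles named Manager, DataAnalyst and Student.

```sql
-- Hypothetical schematic table: Sales.Orders(A, B, C, D), with D sensitive.
-- GO is the batch separator used by SSMS and sqlcmd; each CREATE VIEW
-- has to start its own batch.

-- Managers: all rows and columns, but still through a view.
CREATE VIEW Sales.V_Orders_Managers AS
SELECT A, B, C, D FROM Sales.Orders;
GO

-- Analysts: column-level security, D is left out.
CREATE VIEW Sales.V_Orders_Analyst AS
SELECT A, B, C FROM Sales.Orders;
GO

-- Students: column-level and row-level security.
-- The predicate is illustrative and assumes A identifies the row.
CREATE VIEW Sales.V_Orders_Students AS
SELECT A, B, C FROM Sales.Orders WHERE A <> 3;
GO

-- Hypothetical roles: access goes to the views, not to the table.
REVOKE SELECT ON Sales.Orders FROM Manager, DataAnalyst, Student;
GRANT SELECT ON Sales.V_Orders_Managers TO Manager;
GRANT SELECT ON Sales.V_Orders_Analyst  TO DataAnalyst;
GRANT SELECT ON Sales.V_Orders_Students TO Student;
```

Because all three views read from the same physical table, there is nothing to keep in sync: a new order shows up in each view according to its own column list and filter.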
So now let's have the following task: provide a view for the EU sales team that
combines details from all tables and excludes data related to the USA. The first
part of the task is similar to what we have already done, but we cannot offer all
the data to the users. This time we are providing a view that is specifically
created for one team, the EU sales team. The first part we have already done,
where we combine all details in one view. But the problem with the view that we
have created is that it shows all data. Now the requirement has changed: we
cannot show all data. We have to exclude the USA data from our details. So let's
see how we can do that. It's very simple. We're going to grab the same query. We
will not repeat that. So here we have again the joining of the tables and
everything prepared. But instead of showing all data, we're going to filter the
data based on the customer country. It's very simple: at the end we add a WHERE
clause where c.Country is not equal to 'USA'. So now we have a filter. Let's
execute it. And with that, as you can see in the output, we are getting the
orders that are not from the USA. With that we are protecting the data of the
USA, and the EU sales team can access only their data. So it looks nice and
protected. And with that we are now doing row-level security. That means we are
hiding all the orders, all the rows, that this group of users is not allowed to
see and consume. So what is the next step? It is very simple. We're going to put
everything in one view. We have the query ready and we can create the new view.
So we write CREATE VIEW, then we need the schema, and the name is going to be
almost the same: order details, but with EU. And then we have to have AS and the
parentheses, like this. So everything is ready. Let's execute it. And now we can
refresh in order to see our new view. If you still don't see it, you can go to
the views folder over here and refresh it as well. So with
that I can see we have our new view.
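The EU view is the same combined query with one extra filter. A sketch, assuming the same identifier spellings as before and the view name Sales.V_Order_Details_EU:

```sql
-- Same combined query as the general view, restricted to non-USA customers.
-- All identifiers are assumed from the spoken names.
CREATE VIEW Sales.V_Order_Details_EU AS
(
    SELECT
        o.OrderID,
        o.OrderDate,
        p.Product,
        p.Category,
        COALESCE(c.FirstName, '') + ' ' + COALESCE(c.LastName, '') AS CustomerName,
        c.Country AS CustomerCountry,
        COALESCE(e.FirstName, '') + ' ' + COALESCE(e.LastName, '') AS SalesName,
        e.Department,
        o.Sales,
        o.Quantity
    FROM Sales.Orders AS o
    LEFT JOIN Sales.Products  AS p ON p.ProductID  = o.ProductID
    LEFT JOIN Sales.Customers AS c ON c.CustomerID = o.CustomerID
    LEFT JOIN Sales.Employees AS e ON e.EmployeeID = o.SalesPersonID
    WHERE c.Country != 'USA'   -- row-level filter for the EU sales team
);
```

One thing to keep in mind: the comparison c.Country != 'USA' is not true for a NULL country, so an order without a matching customer is filtered out as well, even though the LEFT JOIN kept it.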
Now, of course, the next step is to test it. So let's create a new query:
SELECT * FROM Sales.V_Order_Details_EU. Let's test it. And with that, as you can
see, we are getting the combined view only for the data that is relevant for the
EU sales team. I'm not seeing any USA records here. So with that, we are
providing a view that protects a few rows, like the orders from the USA. As you
can see, views are really great for providing security to our data, whether we
are protecting the columns or the rows. For example, in our view we can say: not
only do I want to remove the USA orders, but let's say the department information
is sensitive and I would like to hide it from the view. Then you can simply
remove it from the SELECT, and with that you are doing column-level security. So
now I have two options that I can provide to the users. The first option doesn't
have any row-level security. It is the first view, the order details. We don't
have any filters there, so it's going to show all the orders. Here we give access
only to people who are allowed to see all data. And we have another option, the
details with the EU. It doesn't show all data. It shows only a subset that is
relevant for the EU team. So now it's really easy to control the security of my
data using views. And this is a very important use case for the
views. Okay, so moving on to the next use case for views: we can use them in
order to have more dynamics and flexibility in our projects. Let's understand
what this means. Say you have a table and you have multiple users accessing this
table. Now what can happen? You might change your mind about the design and the
data model of your database. You might say, you know what, instead of having one
table I'm going to split it into two tables. Or maybe another decision: you know
what, I'm going to rename a table. Or on another day you decide, let's rename a
few columns, or maybe add a column or remove a column. So you are making changes
to your physical data model and you are changing things in the tables. You know
what's going to happen? All those users who are accessing the tables are going to
scream, because all of them have complex SQL queries, and your small changes to
the tables are breaking everything in their queries. And what does this mean? It
means escalations, and you no longer have the freedom to change anything in your
database without first talking to 100 people. So we don't do that. Instead, we
use views. You create a view and you tell the users: okay, take this view,
consume it, and leave me alone. And now you have your freedom again to make any
changes you want. You go to your tables and do splitting, renaming and changing
everything you want, as long as you update the query between the table and the
view to make sure that the users don't notice any change. For example, if you
split the table into two tables, then you have to put a join or a union in the
view in order to reconstruct the same structure that the users are used to. And
if you would like to rename something in your database, like instead of ID you
are now calling it a key, all you have to do is go to the query of the view and
rename it back from a key to an ID. So no one is going to
notice that you are doing changes to the physical tables.
physical tables. So using views and offering it to users is a gamecher for
offering it to users is a gamecher for you because giving the users views kind
you because giving the users views kind of gives you more freedom dynamic and
of gives you more freedom dynamic and flexibility to change anything in your
flexibility to change anything in your data model and the tables without
data model and the tables without getting any headache. So this is amazing
getting any headache. So this is amazing use case for the
views. Okay, moving on. We have a lot of use cases for the views. They are just
amazing. So the next one is that we can use views to introduce a second
version of the data model in another language. So we could offer multiple
languages to the users. Let's understand what this means. So now we have the
following scenario. We have again our table orders, where the data is persisted
and everything is in English. And of course, what happens sometimes is that you
have an international team accessing your data. So you have a team in the USA
and maybe a team from Germany whose members are also end users that want to
access the data. Of course it depends on the number of users that are using
your database. But if you have a lot of users that come from Germany and also
from India, it might make sense to go and translate your data and the table
structure into another language. So for example, instead of giving access to
the table orders, we can create another view called Bestellung. That's "order"
in German. But you are not only giving a new name to the object, you could also
go and rename all the columns inside the view. Then the German users are going
to access the German view, and it's going to be easier for them to understand
the content of your database. The same thing for the Indian team. For the
Indian users, you can go and provide a view in Hindi. I'm not sure whether I'm
pronouncing the word correctly, but this is the first word that I have said in
Hindi. I don't promise that I'm going to learn the Hindi language, because it's
enough to learn German. So I'm trying as well to write this word, Adish. I hope
it is correct. And to be honest, it is really interesting how you write this
word in Hindi. So now back to the topic. As you can see, we are now using the
views to provide a translation for our database by just giving a new name to
the views and also to the columns. So this is another nice use case that I
usually use in my projects to provide multiple languages for the data model
that I have, and I can do that with the power of
views. Now we come to my favorite use case for the views, one that I personally
recommend in each project: we can use views as virtual data marts in a data
warehouse. So now, why is this my favorite? Because I'm a specialist in data
warehouses and data lakes, and this topic is a very important decision in each
project like this. So let's understand what this means. So now, a classical
data warehouse architecture based on the approach of Inmon is going to look
like this. We have multiple source systems where our data is spread, and now we
would like to go and extract all our data from these multiple sources and put
it in one big database called the data warehouse. And there will be a lot of
operations on this central database: the data is going to be cleaned first,
then maybe integrated together, and maybe we are building some historical data
there. So we're going to be doing multiple steps in order to prepare the data
for complex reporting and analyses. And what we usually do in the data
warehouse is store all that information as physical tables. Now once we have
built the data warehouse, what's going to happen? We're going to have multiple
use cases that would like to access the data warehouse, maybe to do some
different reporting. Now, it's going to be very complex if we connect a
reporting engine like Power BI directly to the data warehouse. So instead of
this, we try to split the data warehouse into multiple subsets. We can split it
by topic, domain, or department, and we call those subsets data marts. So a
data mart is always specific to a use case that focuses on one topic. For
example, we could have a dedicated mart for sales and another data mart that
is dedicated only to finance topics, but both of them come from our data
warehouse. Then the last layer is going to be, for example, the reporting and
dashboarding. Maybe you have something like Power BI where you are creating a
dashboard on one data mart, like sales, and maybe a few things from other
marts. But now the big question here in the data mart is, how should I store
the data? Should I store the data using tables or should I use views? And now
the best practice says, if you are building data marts, then use views. And we
call these virtual data marts. And there are many reasons why using views in a
data mart is way better than using tables. For example, views are more dynamic
and quicker to change, because usually in the data mart you are building a lot
of business logic and you want to have some flexibility and speed. And the
maintenance effort is much simpler: no need to build any ETLs or data loads
from the data warehouse to the data marts, and this makes the data warehouse a
real single point of truth for your data. Once you start copying data from one
layer to another layer, it's going to be chaotic and really hard to maintain,
and you have to have really strict monitoring and data quality. So that's why,
using views, you're always going to reflect the status of the data warehouse,
and this can of course help you with data consistency, which is a critical
point in each data warehouse project. So there are many reasons why we build
virtual data marts and go with views in this layer. So as you can see, the
views are playing a very important role in building a data warehouse. So this
is another amazing and very important use case of using views in your data
projects. All right friends, so now let's have a quick recap about views. So
we have learned that a view is a virtual table that is based on the result of a
query, without actually storing any data in the database. So we use views in
order to persist complex SQL logic and queries in the database. And we have
learned that in some scenarios views are better than CTEs, because a view
improves the reusability and reduces the complexity across multiple queries,
which reduces the complexity of the whole project, whereas a CTE only improves
the reusability within one query. And we have learned as well that views in
some scenarios are better than tables. We have learned that they are very
flexible and easier to maintain, since they don't store any data, and it's
really fast and easy to change stuff in a view compared to a table. But we
have also learned that tables are faster than views. Now, there are endless use
cases for the views. But from my experience in projects, I have chosen for you
the best use cases for the views. The first use case is, if we find a common,
repeated logic in SQL queries, we can go and store this logic in a view in the
database, so that the users don't have to keep repeating the logic over and
over. So we use views in order to have a central business logic. Another use
case is to hide the complexity of your physical data model and to offer the
users a highly abstracted layer. So you provide the users with something very
friendly and you hide all the complex technical data model that you have in
the database, because not everyone is an expert in your data model. One more
use case: we can use views in order to implement security and to protect our
sensitive data in the database. So we can offer multiple views in order to
protect columns or rows in a table.
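The security use case in this recap can be sketched in a few lines. This is
only a minimal sketch under assumptions: it runs on SQLite through Python's
built-in sqlite3 module just so that it is runnable anywhere, and the table
employees, its columns, and both view names are hypothetical. On a server
database you would additionally give the users access to the views only, not
to the underlying table.

```python
import sqlite3

# In-memory SQLite database, used here only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table holding a sensitive column (salary).
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, country TEXT, salary REAL);
    INSERT INTO employees VALUES
        (1, 'Anna', 'Germany', 5200),
        (2, 'Raj',  'India',   4800),
        (3, 'Tom',  'USA',     6100);

    -- Protecting a column: the view simply does not expose salary.
    CREATE VIEW v_employees_public AS
    SELECT id, name, country FROM employees;

    -- Protecting rows: the view exposes only the rows of one country.
    CREATE VIEW v_employees_germany AS
    SELECT id, name, country FROM employees WHERE country = 'Germany';
""")

# Column names that a user of the public view can see.
public_columns = [col[0] for col in conn.execute("SELECT * FROM v_employees_public").description]
# Rows that a user of the Germany view can see.
germany_rows = conn.execute("SELECT name FROM v_employees_germany").fetchall()

print(public_columns)
print(germany_rows)
```

The salary column never shows up in the first view, and the second view
returns only the German employee, while the table itself still holds
everything.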
Another use case we have learned is that we can use views to have more
dynamism and flexibility for your database: we offer the users a view of the
table, and then you have the freedom to change stuff in your physical data
model without affecting all the users. And another nice use case for the
views: we can offer multiple languages for our data model. And in the last use
case, we have learned how views play an important role in a data warehouse
system. So views are amazing. All right my friends.
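Two of the use cases from this recap, renaming a column back inside a view and
offering a translated view, can be sketched together. Again this is only a
sketch under assumptions: SQLite through Python's sqlite3 module is used just
to make it runnable, and the names orders_physical, order_key, and the German
column names are hypothetical. Only the view name Bestellung comes from the
example in the video.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical table after a refactoring: the column once called id is now order_key.
    CREATE TABLE orders_physical (order_key INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders_physical VALUES (1, 'Maria', 120.0), (2, 'John', 75.5);

    -- Compatibility view: renames order_key back to id, so user queries keep working.
    CREATE VIEW orders AS
    SELECT order_key AS id, customer, amount FROM orders_physical;

    -- German view over the same data: new object name and translated column names.
    CREATE VIEW Bestellung AS
    SELECT order_key AS bestell_id, customer AS kunde, amount AS betrag
    FROM orders_physical;
""")

# What the existing users see: still the old column name id.
english_columns = [col[0] for col in conn.execute("SELECT * FROM orders").description]
# What the German users see: translated column names, same rows.
german_columns = [col[0] for col in conn.execute("SELECT * FROM Bestellung").description]
german_rows = conn.execute(
    "SELECT bestell_id, kunde FROM Bestellung ORDER BY bestell_id"
).fetchall()

print(english_columns)
print(german_columns)
print(german_rows)
```

Both views read from the same physical table, so no data is copied. Only the
names that the users see are different.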
So with that we have learned everything about this new object, the views in
databases. They are amazing for flexibility and dynamism in your projects. Now
in the next one we're going to learn how to create tables based on a query, and
we will learn about the temporary tables. So let's
go. Okay. So now first let's have a look again at the database structure. We
have learned that in each SQL Server there are multiple databases, and in each
database there are multiple schemas. And inside each schema we can define
multiple objects, like tables and views. And now we will be focusing on the
object table. And we have learned as well that we can use DDL, the data
definition language, which is a set of SQL commands, in order to define this
database structure. So we can use the SQL command CREATE in order to define a
new table, or ALTER in order to update the structure, or DROP in order to drop
the whole table. So a table is an object in the database structure. And we
have learned as well that there are three levels of the database architecture,
and we have understood that at the logical level, the middle one, the
conceptual level, we deal with the tables as application developers or data
engineers. So we define tables and the relationships between them. So if you
are an end user or a business analyst, it's going to be a little bit harder to
work with the tables. You have to be a developer or a data engineer. But
working with tables is way easier than working with the complexity of the
database at the physical level. So you don't have to be a database expert or
administrator to work with tables. So the difficulty here is in the middle. The
abstraction is not that low, but also not that high. So now let's answer the
question, what are tables? A database table is a structured collection of data.
It's like a simple grid or spreadsheet that you might find in Excel. So it has
different columns, where each column represents a field, like the ID, name, and
country. And the table has multiple rows as well, and each row represents a
record or an entry of the data. So for example, if this table is about the
employees, then each record, each row, is one employee. Now, the intersection
between a row and a column we call a cell, and a cell is a single piece of
data. Now the whole table is going to be stored physically in the database as
database files. So in the database there are multiple files that hold the
information about the table, and those files are stored physically in the disk
storage of the database. So that means the data inside your tables is not
stored like a spreadsheet, like an Excel file, but in special database files
that normal developers and end users don't have access to. So tables, again,
are like an abstraction and representation of the actual data that is in the
files. So each time you are querying a database table, the database has to go
to those files and fetch the data for you.
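The three DDL commands and the picture of a table as columns and rows can be
sketched as follows. This is only a sketch under assumptions: SQLite through
Python's sqlite3 module stands in for the server used in the course so that
the example is runnable, and the salary column is hypothetical. The employees
table with id, name, and country follows the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE defines a new table: every column is one field.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")

# ALTER updates the structure of the existing table (here: one more column).
conn.execute("ALTER TABLE employees ADD COLUMN salary REAL")
columns_after_alter = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]

# Every row is one record, here one employee; salary is left empty (NULL).
conn.execute("INSERT INTO employees (id, name, country) VALUES (1, 'Anna', 'Germany')")
rows = conn.execute("SELECT * FROM employees").fetchall()

# DROP removes the whole table, structure and data together.
conn.execute("DROP TABLE employees")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(columns_after_alter)
print(rows)
print(remaining_tables)
```

The exact ALTER syntax differs between database systems, but the three roles
stay the same: CREATE defines, ALTER changes the structure, DROP removes.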
All right. So this is what we mean with database
tables. Okay. So now we have different types of tables in SQL. We have tables
that stay forever. We call them permanent tables. So they stay as long as you
don't drop them. And you have another type of tables called temporary tables.
And those tables are going to be deleted and dropped once the session ends. So
now we're going to focus first on the first type, the permanent tables. And
there are two ways to create them. The first way is the classical way, where
you create a table from scratch and then you go and insert your data. So we
call it create insert. And the other way is called create table as select. It's
going to create the table as well, but based on a SQL query.
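The two ways just named can be sketched side by side. This is a minimal
sketch under assumptions: it uses SQLite through Python's sqlite3 module to
stay runnable, and the tables orders and customer_totals are hypothetical.
SQLite spells the second way CREATE TABLE ... AS SELECT, while SQL Server
writes the same idea as SELECT ... INTO.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Way 1, create insert: define the structure first, then load the data.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'Maria', 120.0), (2, 'John', 75.5), (3, 'Maria', 30.0);

    -- Way 2, create table as select: structure and data both come from the query result.
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer;
""")

# The new table took its columns from the SELECT list ...
ctas_columns = [row[1] for row in conn.execute("PRAGMA table_info(customer_totals)")]
# ... and its rows from the query result.
ctas_rows = conn.execute(
    "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
).fetchall()

print(ctas_columns)
print(ctas_rows)
```

Nothing about customer_totals was declared by hand: its two columns and its
two rows are whatever the query returned.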
So let's understand the differences between
them. The create insert method is the classical way to define and create tables
in SQL, where first we have to go and create the table and define the
structure, and after that we insert our data into the database table. The
other method is CTAS, create table as select. This one is going to create a new
table as well, but this time based on the result of a SQL query. So let's
understand what this means. Okay. So now to the first method, create insert. So
here we have two steps. The first step is a DDL statement where we use the
command CREATE. So once we execute the first step, what's going to happen? The
database engine is going to go and create an empty table for us. It is a brand
new table where we can hold our data. So with that we have defined the
structure of our table, but it's still an empty table. So now in the next step
we have to go and insert our data into this new table. Our data can come from
multiple sources, like a CSV file or an application, or maybe you are
inserting your data manually, or you are doing a data migration from one
database to another. So at the end, once you execute the INSERT, your data is
going to be inserted into this new table. So in this method we have two steps.
First we define the structure of the table, and in the second step we take care
of inserting our data into the table. And now this new table and your data are
going to be persisted permanently. Now let's check the other method, the CTAS.
Here there is only one step, where you define a query. Once you execute this
query, the database has to retrieve the data from another table. So it might
retrieve data from our new table that we just created using create insert. So
once the query is executed, we will get a result. So now what is the database
going to do? It's going to create a brand new table, but this time the
definition and the data of this new table don't come from any definition that
we specify. They come from the result of the query. So whatever structure we
have in the result is going to be reflected in our new table. So again, the
definition and the data that we see in this new table come one to one from the
result of our query. So in this method we don't have to define anything or
insert any data. We are just writing a query, and the output of this query is
going to define the table. But as you can see, this method always needs a
database table in order to execute the query, whereas with the create insert
method we are creating something from scratch. So these are the two different
ways to create tables in SQL and the differences between
them. Okay. So now you might ask you know what the CTAs are very similar to
know what the CTAs are very similar to the views. We have a query and the
the views. We have a query and the output of this query going to be like an
output of this query going to be like an object in the database. So what are the
object in the database. So what are the differences between them? Let's check
differences between them? Let's check this. Now let's say that in our database
this. Now let's say that in our database we have a table that has three columns
we have a table that has three columns A, B, C. And now what we can do, we can
A, B, C. And now what we can do, we can go and create view based on a query. So
you create the DDL statement in order to create the view in the database. And of course the database is going to store the query, and the view is going to be empty. So there will be no data, because views do not store any data, and the query of the view will not be executed yet. But on the other hand, if you create a table using CTAS, here again we have a query attached to the object, to the table. So what happens here is that the database has to execute the query in order to understand the structure and also the data that should be inserted into the table. So our SQL query is going to be executed and the result of the query is going to be inserted into the table. That means this new table is already storing the result of the query. So this is the first difference between the table and the view: when you create a view, the query is not executed and we don't have anything of its result, whereas with CTAS the result of the query is already stored inside the table and everything is prepared.
So now let's see what's going to happen once the user selects something from the view. Now the database is going to execute the query of the view for the first time, in order to fetch the data from the original table and then present it as a result to the user. But on the other hand, if the user queries the table that was created with CTAS, what happens? SQL will not execute the CTAS query again, because the database has already done that and prepared everything. That means we are not querying anything from the original table, and the data can be fetched directly from the new table. So the user is going to get the result immediately from our table that was created with CTAS. So here comes the second difference between tables and views. Views are slower than CTAS tables, and that's because the database has an extra task here: it must execute the query of the view in order to get the data. But with CTAS the query is going to be faster than the view, because we have already executed everything and prepared it for the user. So that's why tables from CTAS are way faster than views.
And now there is another difference, another perspective on this, which from my point of view is more important than the performance. Let's say that on the next day we are doing data updates on the original table, like updates on column C and also on column P. So now let's see what this means for the users if they are using views. On the next day the user executes the same query again, and again the database has to execute the query of the view in order to fetch the data from the original table. That means today, with the view, we are getting different data than yesterday, because we have new data and new updates, and the user is going to see the new updates and the fresh data in the result as well. So the user is seeing exactly the current state of the data in the original tables. But now let's see what's going to happen if the user queries the table from the CTAS. In the CTAS table we still have the data from yesterday. All those new updates to the original data will not be reflected in this new table, because once the user selects something from this table, the database will not go and fetch the new changes from the original table: we have already prepared the data yesterday. That means our user is now getting old data from the CTAS table, and the only way to get fresh data from the CTAS is to re-execute the CTAS query. And of course this is another step, so the CTAS table is harder to maintain, and this is a big difference for the users between views and tables from CTAS.
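The freshness difference described above can be sketched in a few statements. This is only an illustration under assumptions: the table, view, and column names (dbo.DemoOrders, dbo.V_DemoTotal, dbo.T_DemoTotal, Amount) are hypothetical and not from the course database, and the CTAS is written in SQL Server's SELECT ... INTO form.

```sql
-- Hypothetical demo table; all names here are illustrative only.
CREATE TABLE dbo.DemoOrders (OrderID INT, Amount INT);
INSERT INTO dbo.DemoOrders (OrderID, Amount) VALUES (1, 10), (2, 20);
GO

-- View: only the query text is stored; nothing is executed yet.
CREATE VIEW dbo.V_DemoTotal AS
SELECT SUM(Amount) AS TotalAmount
FROM dbo.DemoOrders;
GO

-- CTAS (SQL Server form): the query runs now and its result is stored.
SELECT SUM(Amount) AS TotalAmount
INTO dbo.T_DemoTotal
FROM dbo.DemoOrders;
GO

-- "Next day": the original table changes.
UPDATE dbo.DemoOrders SET Amount = 50 WHERE OrderID = 2;

-- The view re-runs its query against the original table: 60 (fresh).
SELECT TotalAmount FROM dbo.V_DemoTotal;

-- The CTAS table still holds the result from creation time: 30 (stale).
SELECT TotalAmount FROM dbo.T_DemoTotal;
```

The view pays the cost of its query on every SELECT, while the CTAS table pays it once, at creation time, and then stays frozen until it is rebuilt.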
Now think about views like ordering a pizza at a restaurant. Every time you query the view you are placing an order, and the chef is going to make a pizza from scratch using the freshest ingredients. That means you are always getting a fresh, hot pizza. And think about CTAS like a frozen pizza from a grocery store. The pizza was prepared earlier and stored in the freezer, and if you want to eat it, you have to heat it up in the oven. But it's still not like a fresh pizza that is made on the spot and from scratch. Now I made myself hungry because I love pizza, so I think I'm going to go for a quick break.
[Music]
Okay, so now let's quickly check the syntax of those two methods. The first one is create and insert. As a first step we have to create a table using a DDL statement. So we use the command CREATE, and then we have to tell SQL whether we are creating a table or a view. In this scenario we are creating a table, and then we specify the name of the table. After that we have two parentheses, and inside them we list all the columns that we need in this table. So we have two columns, the ID and the name, and then we define the data type of those columns and maybe the length as well. There are a lot of options that we can add to this syntax, but for now we are just checking the simplest form of creating a table. The next step is that we need an INSERT statement. So we are saying: insert into our new table the following values. We are inserting the ID number one, and the value for the name is going to be Frank. So this is the classical way of creating a new table and inserting data into it.
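As a minimal sketch of the create-and-insert method just described: the table name Persons and the data types are assumptions chosen for illustration; the transcript only mentions the two columns ID and name and the value Frank.

```sql
-- Step 1: DDL. Create an empty table by listing its columns and data types.
CREATE TABLE Persons (
    ID   INT,
    Name VARCHAR(50)   -- type and length are assumed for this sketch
);

-- Step 2: DML. Insert the data in a separate statement.
INSERT INTO Persons (ID, Name)
VALUES (1, 'Frank');
```

Here the structure and the data come from two separate statements, which is the contrast with CTAS, where a single query supplies both.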
Now let's move to the second method, the CTAS. This time we have an SQL query, like SELECT, FROM, WHERE and some extra logic. So this is our query, and then we put our query inside a DDL statement. It's exactly like we have done it with the views, but this time instead of saying VIEW we say TABLE. So again we have the CREATE command and we are creating a table, then the name of the table, then we say AS, and then we have two parentheses, and inside them we have our query. And this is where the name comes from: CREATE TABLE AS SELECT, CTAS. So it is very simple: in one statement you have everything. You are creating a new table and you are also inserting the data that comes from this query. Now this syntax is used in databases like MySQL, Postgres and Oracle. But in SQL Server we have a shorter way of doing it. Again we have our query, SELECT, FROM, WHERE, but now in SQL Server we can insert a clause between the SELECT and the FROM, like this. So we are saying: select the following columns into a new table. We have this keyword INTO, then the table name, and after that you continue with your query: FROM, WHERE, aggregations and so on. So here it's like the DDL is inside your query itself, whereas in the other databases the query is separated from the DDL statement. Personally, I prefer that syntax over having this INTO, because if you have a big, complex query, the INTO can be really hard to see and easy to miss after the column selection. So this is the syntax for creating a new table from a query, the CTAS, in different databases.
[Music]
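The two syntaxes described above can be put side by side as follows. This is a sketch under assumptions: the tables orders, big_orders_a and big_orders_b and their columns are hypothetical, and each form is meant to be run only on a database that accepts it.

```sql
-- Hypothetical source table used by both forms.
CREATE TABLE orders (id INT, amount INT);
INSERT INTO orders (id, amount) VALUES (1, 50), (2, 150), (3, 300);

-- Form A: CREATE TABLE ... AS SELECT (MySQL, PostgreSQL, Oracle).
-- The DDL wraps the query.
CREATE TABLE big_orders_a AS
SELECT id, amount
FROM orders
WHERE amount > 100;

-- Form B: SELECT ... INTO (SQL Server).
-- The INTO clause sits between the column list and FROM.
SELECT id, amount
INTO big_orders_b
FROM orders
WHERE amount > 100;
```

In both forms the new table takes its column names and data types from the query result, so no column list is declared by hand.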
Okay, so now we're going to check the scenarios and use cases where it makes sense to use CTAS. Let's start with the first one. We have learned before that it makes sense to have complex logic stored inside the database, so that our end users don't have to keep repeating the same logic over and over, and the logic may also be too complicated for some users. That's why we have used views, and the result of the view is going to be used by our users, so everything stays easy and friendly to consume for them. But what might happen is that the logic of the view is very complicated and needs a lot of time to be executed by the database. So it takes a really long time until we get the intermediate result from the database. That means if it takes 30 minutes, then each user has to wait 30 minutes until the query is executed, and none of your users are going to be happy with this situation. If this happens, you should first try to optimize the query. But if you cannot do anything about it, you have to switch the view to a CTAS table.
So what you have to do is take the same logic and put it in a CTAS, so that the intermediate results are stored in a table. And of course at the moment of creating the table it will take 30 minutes. It will take a long time, because it is the same query and the database needs that time to create the intermediate results. But the big advantage is that everything can be prepared during the night, so in the morning, once your users are online and start querying the data, they have everything prepared. The users start selecting and analyzing the intermediate result, but this time using the table that you have created with the CTAS, and the response time is going to be normal and fast again for all users. So if you have a scenario where your views are very slow, you can prepare the data at night using CTAS and have the tables ready to be analyzed by the end users. This is the most common use case for CTAS, and this scenario happens a lot in projects, where you decide to go with CTAS instead of views in order to have persistent data and gain performance.
Okay, so finally back to SQL. Let's create a table using CTAS. We're going to create a table that shows the total number of orders for each month. Let's go and do it. So first, what do we need? We need a query, so let's write it. SELECT: I'm going to go with DATENAME in order to get the name of the month from our order date, and we're going to call it order month. And then we're going to aggregate the data by counting the order ID, for the total orders, from our table Sales.Orders. And don't forget to group by our month. So something like this. Let's execute it. The result is very simple: we have the order month and the total orders, so two columns and three rows. So we have our query, and of course we didn't create anything yet. Now in SQL Server, in order to create a table from the query, exactly before the FROM we write INTO, and then we have to specify the schema and the table name. I'm going to stay with the schema Sales and I'm going to call it MonthlyOrders, like this. So that means we have our query, and the DDL is exactly between the SELECT and the FROM. Now if I execute this, what is going to happen? We will not see the result of the query here. We're going to get something like "three rows affected", because this is a DDL statement. It is not a query anymore, and the database is telling us: I have now created a table with three rows.
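The statement built up in this walkthrough would look roughly like the sketch below. The exact identifiers are assumptions reconstructed from the spoken names (Sales.Orders, OrderID, OrderDate, Sales.MonthlyOrders), so they may differ from what is on screen in the video.

```sql
-- Assumes a table Sales.Orders with columns OrderID and OrderDate
-- already exists (spellings reconstructed from the spoken transcript).
SELECT
    DATENAME(month, OrderDate) AS OrderMonth,   -- month name, e.g. 'January'
    COUNT(OrderID)             AS TotalOrders   -- number of orders in that month
INTO Sales.MonthlyOrders                        -- new table, created by this statement
FROM Sales.Orders
GROUP BY DATENAME(month, OrderDate);
```

Without the INTO line this is an ordinary query that returns rows; with it, the client reports only a rows-affected message and the rows land in the new table.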
So now if you check our tables, we don't see it yet. Let's refresh and check the tables again. Now we can see our table here, Sales.MonthlyOrders. Of course we have to check whether everything is fine, so let's select the rows from our new table, Sales.MonthlyOrders. Let's select the statement first and execute it. And now we can see the result of our query again. But we are not writing the query here; we are just selecting from the table. So our data is stored in our table. We can also check the structure of this table. If you go to the columns, you can see we have the order month and the total orders, and that information comes from our query. SQL is saying here that the order month is an nvarchar, which is correct, because here we have the names of the months. So SQL is able to derive the data types of the table from our query. And the second column, the total orders, is an integer, and that's because we have numbers here. So as you can see, SQL is defining the structure of the table based on the result of our query over here, and of course the data inside the table comes from the query as well. The content of this table is going to stay like this as long as you don't change anything. So if you close this and open it after one year, it's going to show the exact same results. It's going to live in the database as long as you don't drop this table. But if things change in the table Orders, this table will not be updated automatically, which is different from what we have learned about views.
So now if you say, you know what, I would like to drop this table: well, it is very simple. Just say DROP TABLE and the table name over here. Make sure you select it and execute it. And now if you go over here and refresh and check the tables, you can see that the table is dropped. And now if you say, you know what, let's refresh the table that comes from the CTAS every day, so that we always get fresh data inside this table. So let's execute our CTAS again. And with that, if we refresh, we're going to find our table again. Now if you execute it one more time in order to refresh the data of the table, what are you going to get? You're going to get an error. The database is going to tell you: we already have this table, so we cannot create it again. So now the question is how we can update the content of this table. Well, we have to drop it first and then recreate it. And if you want to put everything in one script, we have to use T-SQL. It is Transact-SQL, an extension where you can do some programming inside SQL. So in order to do that, we go to the start over here and we write an IF logic. We're going to search for the object. So we say IF OBJECT_ID, and now we have to specify the name of this object together with the schema. Make sure to select everything, Sales.MonthlyOrders, and put it inside here. Then we have to define the type of this object, and here we go with 'U', which stands for user-defined table. So we are saying: if the object Sales.MonthlyOrders is not null, that means it exists. And what do we want to do then? We have to drop it. I'm going to take the statement from here and put it after the IF over here. So we are saying: if this table exists, then drop the table; otherwise don't do anything, because there is no table yet and the query is going to work. At the end of the T-SQL we have GO, to mark the end of this batch, and then our usual query after all that. So let's execute the whole thing, and as you can see, it is working. So what happens? The database found this table and dropped it, and then executed our query. So if you keep executing this, you are just refreshing the content of this table.
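Put together, the refresh script described above might look like the following sketch. As before, the identifiers (Sales.Orders, OrderID, OrderDate, Sales.MonthlyOrders) are assumptions reconstructed from the spoken names.

```sql
-- Drop the table only if it already exists.
-- 'U' restricts the lookup to user-defined tables.
IF OBJECT_ID('Sales.MonthlyOrders', 'U') IS NOT NULL
    DROP TABLE Sales.MonthlyOrders;
GO

-- Recreate the table from the current state of Sales.Orders.
SELECT
    DATENAME(month, OrderDate) AS OrderMonth,
    COUNT(OrderID)             AS TotalOrders
INTO Sales.MonthlyOrders
FROM Sales.Orders
GROUP BY DATENAME(month, OrderDate);
```

Because the drop is guarded by the IF, the script can be run repeatedly, for example from a nightly job, without hitting the "table already exists" error. On SQL Server 2016 and later, DROP TABLE IF EXISTS Sales.MonthlyOrders; is a shorter way to write the same guard.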
So this is how we work with CTAS in SQL. All right, moving on to another common use case for CTAS that I usually use in my projects as well. We use CTAS in order to create a persistent snapshot of the data at a specific time, in order to analyze data quality issues. So let's understand what this means. In some scenarios you have a table and you are analyzing an issue. There is a data quality issue in your data, and you are analyzing this scenario in order to understand why it happens. But the problem is that at the same time there are updates on the table and your data is changing. So there will be updates on some fields, or you are getting new records, and everything is getting mixed up, and you will not be able to analyze the scenario in which the data quality issue happened. So now it's almost impossible to find the root cause of your issue. Instead, what we do if we have an issue in the data is create a fixed, persisted snapshot of the data in a separate table using CTAS, so that we make sure nothing is changing and everything is fixed.
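A snapshot of this kind can be as small as the sketch below, written in SQL Server's SELECT ... INTO form. The names Sales.Orders, Sales.Orders_Snapshot and SnapshotTakenAt are hypothetical and used only for illustration.

```sql
-- Copy the current state of the table into a separate, fixed table.
-- Later updates to Sales.Orders will not touch the snapshot.
SELECT
    *,
    GETDATE() AS SnapshotTakenAt   -- records when the copy was made
INTO Sales.Orders_Snapshot
FROM Sales.Orders;
```

The investigation then runs against Sales.Orders_Snapshot, and the table can be dropped once the root cause has been found.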
And with that I can keep doing my analysis on the same data without worrying that the data is getting changed. So this is another reason why we use CTAS in projects: to make sure that we have a snapshot of the data, so that our analyses are done on the same scenario that caused the bug, and the snapshot can be used as a foundation for finding the problem and fixing it. All right, moving on to another use case of CTAS.
We can use it in order to create our data marts, to make them physical data marts instead of virtual data marts using views. So let's understand what this means. As we learned before, if you have a data warehouse system, the data warehouse layer is going to store the data inside tables. But for the second layer, the data marts, we can use views in order to stay dynamic and flexible when generating multiple data marts, and we called it the virtual layer. But in some scenarios, if things get complicated, your data marts and reports are going to be slow, because for each action you are generating a query. The Power BI reports and dashboards are creating queries on your data marts, and your data marts always have to go to the data warehouse in order to retrieve the data for the reports, and the whole thing could take minutes or sometimes maybe hours. So in these scenarios we cannot keep using views, because they are slowing everything down. Instead, we have to convert our data marts to a physical layer. That means instead of using views we have to use tables. And one very common way to generate the tables of the data marts on a daily basis is to use CTAS queries between the data warehouse layer and the data mart layer. It's still going to take maybe 30 minutes; that's why you can prepare the data at night. But at the reporting layer, where performance really matters, the performance is going to be better, because the response time from tables is way faster than from views, and the reports don't always have to waste time waiting for the data marts to get data from the warehouse. So this is another use case for CTAS: the views at the data marts are slow, and we replace them with tables using CTAS to speed things up.
CTAs to speed up things. But still my recommendation here is that start first
recommendation here is that start first with the views. So create a virtual data
with the views. So create a virtual data mart using views because the
mart using views because the implementation going to be very dynamic
implementation going to be very dynamic and fast and you are always getting
and fast and you are always getting fresh data from the warehouse but maybe
fresh data from the warehouse but maybe later if you notice okay some data ms
later if you notice okay some data ms and models are complex then maybe go and
and models are complex then maybe go and replace few marts from views to tables
replace few marts from views to tables using cis. So this is another use case
using cis. So this is another use case for the and it is nice workaround for
for the and it is nice workaround for your data warehouse system. All right
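A minimal sketch of this view-to-table switch in SQL Server, where CTAS is written as SELECT ... INTO. The view Sales.V_OrderStatusSummary, the table Sales.OrderStatusSummary and the column OrderStatus are hypothetical names chosen for illustration, and DROP TABLE IF EXISTS and CREATE OR ALTER assume SQL Server 2016 or later.

```sql
-- Hypothetical data mart view (the "virtual" version).
CREATE OR ALTER VIEW Sales.V_OrderStatusSummary AS
SELECT
    OrderStatus,
    COUNT(*) AS TotalOrders
FROM Sales.Orders
GROUP BY OrderStatus;
GO

-- Nightly job: rebuild the physical copy from the view.
DROP TABLE IF EXISTS Sales.OrderStatusSummary;

SELECT *
INTO Sales.OrderStatusSummary        -- hypothetical table name
FROM Sales.V_OrderStatusSummary;
GO

-- Reports read the table, so they no longer re-run the view logic.
SELECT * FROM Sales.OrderStatusSummary;
```

The trade-off is the one described above: the table is only as fresh as the last rebuild, while the view always reads current data.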
All right friends, so with that we have now covered the first type of tables
that we have in databases: the permanent tables, where you create a table and
it's going to live forever until you go and drop it. Now we're going to talk
about another type of tables in databases, the temporary tables. So let's
understand what temporary tables are.
So temporary tables, or as a shortcut temp tables, store intermediate results
in temporary storage in the database during a session, and the database
automatically drops these tables after the session ends. So let's understand
what this means. We have learned that with CTAS we can use a query to
retrieve data from one table, and then the database puts the result in a
brand-new table in the database. So with that we are creating another table
based on a query. The same thing goes for temporary tables. We have as well a
query that goes and retrieves the data from a table, and as well the database
is going to go and create a brand-new table that has the structure and the
data from the result of the query. So it is exactly like CTAS. What is the
difference here? Well, it is about the lifetime of the table. The database
tables that you have created using CREATE/INSERT or CTAS are going to stay
permanent, and they're going to live in the database as long as you don't
drop them. So even if the system goes completely offline, the data is still
going to be in the database once it is online again. But temporary tables are
going to get dropped from the database automatically once the session ends.
So what does session mean? Once you open the client, connect to the database
and start running queries, we call the time between connecting to the
database and disconnecting from it a session. So that means once you close
the client and disconnect from the database, and maybe shut down your PC and
do something else, what is going to happen? The database is going to go and
delete all the temporary tables that you have created during the session. So
that means the table is going to live as long as you have a session, and
during this time you can access the table like any other permanent table. So
this is what we mean by temporary tables, or as a shortcut, temp tables.
Okay. So now let's check the easiest syntax ever. For the temporary table the
syntax is going to look like this: you're going to have a query, SELECT, FROM,
WHERE, and as we learned with CTAS, if you say INTO and then the table name,
it's going to create a new physical table. But now, if you want it as a
temporary table, what are you going to do? You just put a hash (#) before the
name of the table. Then SQL understands, okay, now we are talking about a
temporary table, and the database is going to store it in the temporary
storage. So it is very simple. This is the syntax of temporary tables.
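The difference between the two forms can be sketched side by side. This is only an illustration in SQL Server syntax; the table names OrdersCopy and #OrdersCopy and the columns OrderID and OrderStatus are assumptions, not names confirmed by the transcript.

```sql
-- Permanent table: INTO <name> creates a physical table in the current database.
SELECT OrderID, OrderStatus
INTO Sales.OrdersCopy              -- hypothetical permanent table
FROM Sales.Orders
WHERE OrderStatus = 'Shipped';

-- Temporary table: the same statement, but the name starts with #,
-- so the table is created in temporary storage for this session only.
SELECT OrderID, OrderStatus
INTO #OrdersCopy                   -- hypothetical temporary table
FROM Sales.Orders
WHERE OrderStatus = 'Shipped';
```

The only change between the two statements is the leading # in the target name.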
So far we have learned that we have a database called SalesDB, and inside it
we can find the tables that we have created: customers, employees, orders and
so on. Those are our tables and they are always there. If you go and close
everything and then start it again, or come back the next day, you're always
going to find those tables with the same data. So they're going to exist as
long as we don't drop them. Now the question is, where do we find the
temporary tables? Well, as we learned, if you go over here to the system
databases, you will find multiple databases from SQL Server, and normally
only the database administrator has access to them. One of those databases is
called tempdb, the temporary database. So let's go inside it. Here we can find
multiple objects, and one of them is the temporary tables. And now, of course,
we don't have anything inside it because we didn't create anything. So let's
go and create one. We already have an open, active session with SQL Server.
As you can see here, we are connected to the database and we can start
creating temporary tables. So now what is the plan? I would like to do a few
modifications on the table orders, but I will not do it directly on the table
orders. I would like to take a copy from SalesDB and create a temporary table
from it. So let's go and do that. What do we need first? We need a query. I
would like to select everything, all the columns and all the rows, from the
table orders, so FROM Sales.Orders. So this is my query. So far nothing is
created; we have only a SELECT statement. But now, in order to create a
temporary table, we're going to put a statement between the SELECT and the
FROM. So exactly before the FROM, go over here and say INTO, then, to make
sure it is a temporary table, we use the hash and then the table name. We're
going to call it Orders. So that's it. We have our query, and in between we
have the INTO, and make sure you are using the hash so that it is a temporary
table. So let's go and execute it. And now we can see that 10 rows are
affected and we don't have any error. Of course we cannot see it yet, because
we have to go and refresh the Object Explorer. So let's go and do that, and
now let's expand it. And now we can see our temporary table. As you can see,
it is in the schema dbo because we haven't defined any schema, and this is
the default one from the database. So nice. Now we have the table, and let's
go and check a few things. Let's go and select from the table itself, so
SELECT star FROM, and make sure to say hash Orders (#Orders). Let's go and
execute it. And now we are getting the data from the temporary table and not
from the original table, the orders in the database SalesDB. So all this
information comes from the temporary table.
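The two statements from this demo would look roughly like this. The database name SalesDB is spelled as it is spoken in the transcript, and the DROP TABLE IF EXISTS guard is an addition (assuming SQL Server 2016 or later) so the script can be re-run in the same session.

```sql
USE SalesDB;   -- database name as spoken in the transcript; spelling assumed

-- Added guard: lets the script run again without an "already exists" error.
DROP TABLE IF EXISTS #Orders;

-- Copy all columns and rows of Sales.Orders into a temporary table.
SELECT *
INTO #Orders
FROM Sales.Orders;

-- Read from the temporary copy, not from the original table.
SELECT *
FROM #Orders;
```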
Now, of course, you can do whatever you want to this temporary table, because
it's not that important and it's going to get deleted anyway. So let's say
that I would like to delete all the orders where the order status is equal to
Delivered. So let's go and do that. What we're going to do is DELETE FROM our
#Orders, so make sure we are selecting the temporary table, and then WHERE,
we're going to say the order status is equal to Delivered. Yeah, Delivered,
like this. Let's go and execute it. Okay, with that it says five rows are
affected. Let's go and select it again, so SELECT star FROM #Orders, and let's
check that. So as you can see, now we don't have all the orders. We have only
the orders where the status is equal to Shipped. So all delivered orders are
removed.
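A sketch of this delete step. The column name OrderStatus and the exact value 'Delivered' are assumptions based on how they are spoken in the transcript; the first statement is an added guard that creates #Orders only if it does not already exist in the current session.

```sql
-- Added guard: create the temporary copy if this session does not have it yet.
IF OBJECT_ID('tempdb..#Orders') IS NULL
    SELECT * INTO #Orders FROM Sales.Orders;

-- Remove the delivered orders from the temporary copy only.
DELETE FROM #Orders
WHERE OrderStatus = 'Delivered';

-- Sales.Orders is untouched; only the copy has fewer rows now.
SELECT *
FROM #Orders;
```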
And now we can do whatever we want to this copy. We can analyze it, we can
modify it, we can go and insert new data. So we can do whatever manipulation
we want on this copy. And now if you say, you know what, I like this result
and I would like to have it not only during the session, maybe I'm going to
need it tomorrow or something. So now we're going to do the exact opposite.
We're going to store the result of the temporary table back in our database,
so that we don't lose this intermediate result. In order to do that, we're
going to say INTO, and then make sure to specify Sales dot, because we want
to select the correct schema, and then let's say it is Orders, and I'm going
to call it Test, like this. So let's go and execute it. It says five rows are
affected. Now we have to see this information in SalesDB. We still don't see
this table over here, so right-click on the database and then refresh it.
Let's go again to the tables, and now you can see we have our new table
OrdersTest.
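Persisting the temporary result is the same SELECT ... INTO pattern in the opposite direction. A minimal sketch, assuming #Orders still exists in the current session and that the target is spelled Sales.OrdersTest:

```sql
-- Copy the current content of the temporary table into a permanent table.
-- The Sales. prefix places it in the Sales schema instead of the default dbo.
SELECT *
INTO Sales.OrdersTest
FROM #Orders;
```

Because the target name has no leading #, the new table survives the end of the session.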
So it is amazing, right? What we have done is we took a copy of the original
table orders into a temporary space. We made some modifications, played with
the data and did some analysis, and then we loaded the end result of our
temporary table back into another new table called OrdersTest, maybe in order
to keep working on it the next day. So it is a really nice way to make
changes in a place where you say, you know what, it is temporary, and
whatever mistakes you make, it's okay. It is like a playground. So now we
still have an active session with the database, and our temporary table is
going to stay here. Now let's see what is going to happen if we end our
session. In order to do that, let's go and just close everything. So I will
just close, and we will not save anything. With that we have now ended the
session. Let's go and start it again and see whether we still have the
temporary table. So now we have to connect to SQL Server again, and now we
have another session. That means the old session is already gone. Let's go to
the databases, to the system databases, to tempdb, and let's go to the
temporary tables. As you can see, the database has already cleaned up
everything, and this space is empty again for any new temporary table that
I'm going to create. So as you can see, once you close the session,
everything is going to get lost.
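The same check can be sketched as a query instead of a look at the Object Explorer. This assumes it runs in a new session, after the connection that created #Orders was closed, and that the permanent table is named Sales.OrdersTest.

```sql
USE SalesDB;   -- database name as spoken in the transcript; spelling assumed

-- Run in a NEW session. OBJECT_ID returns NULL when the object does not exist,
-- so NULL here means this session has no #Orders temporary table.
SELECT OBJECT_ID('tempdb..#Orders') AS TempTableId;

-- A permanent table created from the temporary one is not affected by the
-- session ending, so this returns an id rather than NULL if it was created.
SELECT OBJECT_ID('Sales.OrdersTest', 'U') AS PermanentTableId;
```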
Now let's go back to our SalesDB, over here to the tables. We can see the
table that we have created, OrdersTest. It is still living here and still has
the data that we have created. So this is how things work with temporary
tables in SQL. Now let's see how the database server executes a temporary
table. Let's say that you are a data analyst. You have created a query and
then you say INTO a temporary table. Now the database engine is going to
identify the query and first go and execute it, and maybe we're going to get
the data from the table orders. After the query is executed, the database
engine has the results. Now two things happen. First, the database engine is
going to go and store the metadata information in the system catalog. And
second, the database engine is going to create a table, but this time not in
the user storage but in the temporary storage on the disk. So the table is
going to live there for a short time. And now what you can do is write
multiple SQL queries that are doing maybe multiple analyses on top of this
table. Each time you select something, the database engine has to go to the
temporary storage and fetch the data from there. And once you are finished,
and let's say you close your client, the session between you and the database
is going to end. Now the database is going to understand, okay, there is no
more connection to this user, and it is going to go and clean up the
temporary storage, with any tables that were created from this session. So
that means the database is automatically cleaning up the storage, maybe for
other sessions. So this is how the database engine works with temporary
tables.
So now the question is, why do we need temporary tables? Let's see the
following scenario. Let's say that in our source database we have a table
called orders, and now we would like to go and load the table into our data
warehouse. We have to do several transformations in order to prepare the data
for the analyses in the data warehouse. So maybe you have one query to remove
the duplicates and another one to handle the nulls, and maybe you are doing
filtering and cleaning up, and in the last step you would like to aggregate
the data. Now of course those queries, those transformations, want to change
the content of the table orders, and there is no scenario where you can do
that directly on the source database. Of course this is not allowed. That's
why in data warehousing we have to go and get our own copy of the data, and
then on top of this copy we can do our transformations. Now one way to do
this is using temporary tables. So you have one script to extract the data
from the table orders and put it in a temporary table as an intermediate
result. Then you come with the transformations, and all those queries start
manipulating and changing the data of this extra copy in the temporary table.
And in the last step you have the load, where you go and load the final
version of the intermediate result into the database. This is if you would
like to do the whole ETL before inserting the data into the database.
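The extract, transform and load steps described here could be sketched as one script like the following. It simplifies the scenario by keeping source and target in the same database, and every name other than Sales.Orders is an assumption: the staging table #OrdersStaging, the target Sales.CustomerSalesSummary, and the columns OrderID, CustomerID, OrderStatus and Sales.

```sql
-- Extract: take our own copy of the source table.
DROP TABLE IF EXISTS #OrdersStaging;

SELECT OrderID, CustomerID, OrderStatus, Sales
INTO #OrdersStaging
FROM Sales.Orders;

-- Transform 1: remove duplicates, keeping one row per OrderID.
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY OrderID) AS RowNum
    FROM #OrdersStaging
)
DELETE FROM Ranked
WHERE RowNum > 1;

-- Transform 2: handle NULLs in the measure.
UPDATE #OrdersStaging
SET Sales = 0
WHERE Sales IS NULL;

-- Transform 3: filter out rows that cannot be assigned to a customer.
DELETE FROM #OrdersStaging
WHERE CustomerID IS NULL;

-- Load: aggregate the cleaned copy into a permanent table.
DROP TABLE IF EXISTS Sales.CustomerSalesSummary;

SELECT CustomerID,
       COUNT(*)   AS TotalOrders,
       SUM(Sales) AS TotalSales
INTO Sales.CustomerSalesSummary
FROM #OrdersStaging
GROUP BY CustomerID;
```

None of the transformations touch Sales.Orders; they all work on the staging copy.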
So now the orders table and the final table in the data warehouse are both
tables. They are permanent tables, and they will stay there as long as we
don't drop them. So they are very important tables. But the intermediate
result is not that important. It is just an intermediate step that we have
done in order to have our extra copy of the data, to manipulate it and so on,
in order to prepare it to be inserted into the data warehouse. So after we
have loaded it into the data warehouse, this copy of the data is not
important anymore. It shouldn't stay for a long time. That's why in this
scenario maybe we can go and use temporary tables instead of normal tables
for the intermediate results. And that's because of only one advantage: the
database is going to go and do an automatic cleanup after the whole session
ends. It comes out of the box, automatically, from the database. So that
means I don't have to deal with the dropping mechanism of this table for the
next load. But if there is something wrong in the data warehouse, you would
always like to check the copy where the transformations are done, in order to
debug and find issues, and a temporary table is already gone by then. So I
don't normally use temporary tables in these scenarios; I just use normal
tables. But for other, small projects, maybe this makes sense. So this is one
use case for when to use temporary tables in your projects. We use them in
order to store intermediate results temporarily until we are done with the
session, and then once we are done the database can go and drop that
temporary table. All right guys, now a quick talk about temporary tables. To
be honest, I never use them in my projects. If I need an intermediate result
in one query, I can go and use CTEs. And if my intermediate result is very
important, then I put it in either a view or CTAS.
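For comparison, an intermediate result that is needed in only one query can be written as a CTE. This is a small sketch; the CTE name ShippedOrders and the columns OrderID, CustomerID, OrderStatus and Sales are assumptions for illustration.

```sql
-- The intermediate result exists only for the duration of this one statement.
WITH ShippedOrders AS (
    SELECT OrderID, CustomerID, Sales
    FROM Sales.Orders
    WHERE OrderStatus = 'Shipped'
)
SELECT CustomerID,
       SUM(Sales) AS TotalSales
FROM ShippedOrders
GROUP BY CustomerID;
```

Unlike a temporary table, nothing is stored, so there is nothing to clean up and nothing to inspect afterwards.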
Still, temporary tables are a nice technique to learn; maybe you can utilize
them in one of your projects. All right guys, so now let's have a quick
summary about tables. Tables in a database are like a spreadsheet or grid
that contains columns and rows, and your actual data is stored in these
tables. And we have learned there are two types of tables: permanent tables
and temporary tables. Permanent tables live in the database forever, as long
as you don't drop them. On the other hand, temporary tables have a short
lifetime. They will be dropped from the database once you end the session.
Now we have learned as well that there are two methods to create tables in
databases. The first method is CREATE/INSERT. This method involves two steps.
The first one is defining and creating the table, and the second step is
inserting the data into this new table. So you are creating something from
scratch. And the second method we call CTAS. It creates as well a brand-new
table, but based on the result of a query. So this type is done in only one
step, but it always needs another existing table.
existing table. And we have learned as well the difference between tables and
well the difference between tables and views where the main advantage of using
views where the main advantage of using tables created from CTIS is that to
tables created from CTIS is that to ensure the performance is fast enough at
ensure the performance is fast enough at the end of the users or your reporting
the end of the users or your reporting system. So we use CIS instead of views
system. So we use CIS instead of views if the logic of the view is very complex
if the logic of the view is very complex and takes a lot of time to be executed
and takes a lot of time to be executed in the database. And one more nice use
in the database. And one more nice use case for the CIS is that we can go and
case for the CIS is that we can go and persist a snapshot of the data in order
persist a snapshot of the data in order to analyze a bug and data quality issue
to analyze a bug and data quality issue and to ensure that we have the exact
and to ensure that we have the exact data in order to find a solution for the
data in order to find a solution for the bug and the issue. Now we have learned
bug and the issue. Now we have learned as well that we can use temporary tables
as well that we can use temporary tables in order to store intermediate results
in order to store intermediate results in a temporary storage and the main
in a temporary storage and the main advantage of the temporary table is the
advantage of the temporary table is the database automatically drops all that
database automatically drops all that temporary tables when the session ends
temporary tables when the session ends and that's because for you the
and that's because for you the intermediate results are not that
intermediate results are not that important to live long
time. Hey my friends. So we have learned that in real data projects if you have a
database, there will be a lot of analytical use cases that want to access your
data and do analytics. And what is going to happen? People are going to write
complex queries, because in many scenarios they are doing complex analyses.
If you don't do anything about it in your projects, you are going to face a
lot of challenges: complexity, a lot of redundancy where multiple users repeat
the same complex logic, and maybe performance and security issues. We have
learned five amazing techniques to solve those problems: subqueries and CTEs,
and how to create objects like views, CTAS tables and temporary tables. So now
we are going to compare them side by side in order to get a big picture of the
advantages and disadvantages of each method. So let's go and compare them.
Okay. So now we have our five methods, and the first criterion I would like to
compare them on is the storage type. We have learned that if you are using
subqueries and CTEs, the database puts the result of those two techniques in
memory, in the cache, so that later the main query has fast access to those
intermediate results. On the other hand, if you are using temporary tables or
tables from CTAS, the newly created table is stored on disk. For views, as we
understood, there is no data storage, which means we are not using any storage
from the database.
Now let's talk about the lifetime, meaning how long the object is going to
live or persist in the database. Three of our techniques, subqueries, CTEs and
temporary tables, live only a short time in the database. So all of them are
temporary. But objects created with CTAS and views are permanent. That means
they are going to live in the database as long as you don't drop them.
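
To make the lifetime difference concrete, here is a minimal sketch in Python using the built-in sqlite3 module. It is only an illustration under assumptions: the table names (orders, customer_totals, big_orders) and the sample rows are made up, a database connection stands in for a "session", and SQLite's syntax may differ from the database used in the course.

```python
import os
import sqlite3
import tempfile


def table_exists(conn, name):
    """Return True if `name` can be queried in this session (connection)."""
    try:
        conn.execute(f"SELECT 1 FROM {name} LIMIT 1").fetchall()
        return True
    except sqlite3.OperationalError:  # raised for "no such table"
        return False


def lifetime_demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.db")

        # Session 1: build one permanent base table with CREATE + INSERT.
        session1 = sqlite3.connect(path)
        session1.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        session1.executemany(
            "INSERT INTO orders VALUES (?, ?)",
            [("Ana", 120.0), ("Ben", 40.0), ("Ana", 75.0)],
        )
        # CTAS: a permanent table created from the result of a query.
        session1.execute(
            "CREATE TABLE customer_totals AS "
            "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
        )
        # Temporary table: same idea, but it only lives for this session.
        session1.execute(
            "CREATE TEMP TABLE big_orders AS "
            "SELECT * FROM orders WHERE amount > 50"
        )
        session1.commit()
        during = (table_exists(session1, "customer_totals"),
                  table_exists(session1, "big_orders"))
        session1.close()  # ending the session drops the temporary table

        # Session 2: only the permanent objects are still there.
        session2 = sqlite3.connect(path)
        after = (table_exists(session2, "customer_totals"),
                 table_exists(session2, "big_orders"))
        session2.close()
    return during, after


if __name__ == "__main__":
    print(lifetime_demo())
```

Inside the first session both tables can be queried; after reconnecting, only the CTAS table is still there, so the function returns ((True, True), (True, False)).
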
Now we are going to compare them on something similar: when the database is
going to drop or delete those objects. We have learned that subqueries and
CTEs have a short lifetime. They live only during the execution of the query.
So once the query ends, the database goes to the cache and deletes everything.
Temporary tables live a little bit longer, as long as you are in the session.
But once you end the session, the database drops and deletes your table as
well. Objects that come from CTAS and views, as we learned, are persistent and
permanent, and the database only deletes them if you ask it to by using the
DDL command DROP. So the database will not delete anything on its own for
these two.
The next one is the query scope: how we can access those objects. For the
subquery and the CTE the scope is very small. They can be accessed only from
one single query, the query itself where you write the CTE or subquery. So you
cannot access them from external queries. But we have learned that temporary
tables, CTAS tables and views can be accessed from multiple queries. That
means you can access those objects from multiple external queries.
Now the next one, reusability. If you look at subqueries, they are very
limited. A subquery is used only in one query and only in one place. So if you
need it in multiple places, you have to repeat the same logic. So subqueries
are the worst when it comes to reusability. The CTE is a little bit better.
You can still access it only from one single query, but you can access it from
multiple places in that same query. So you can use it multiple times in
different joins, and you don't have to repeat the same logic over and over.
But still it is limited, because you have only one query that is using the
logic.
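
The reusability gap between a subquery and a CTE can be sketched like this, again with Python's built-in sqlite3 module. The orders table, its rows and the question being asked (customers whose total is above the average customer total) are assumptions made up for the illustration. Both queries return the same rows; the subquery version has to spell out the totals logic twice, while the CTE version defines it once and references it in two places.

```python
import sqlite3

# Subquery version: the "total per customer" logic is written out twice.
TOTALS_WITH_SUBQUERIES = """
SELECT t.customer, t.total
FROM (SELECT customer, SUM(amount) AS total
      FROM orders GROUP BY customer) AS t
WHERE t.total > (
    SELECT AVG(u.total)
    FROM (SELECT customer, SUM(amount) AS total
          FROM orders GROUP BY customer) AS u
)
ORDER BY t.customer
"""

# CTE version: the logic is defined once and referenced in two places.
TOTALS_WITH_CTE = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders GROUP BY customer
)
SELECT customer, total
FROM customer_totals
WHERE total > (SELECT AVG(total) FROM customer_totals)
ORDER BY customer
"""


def run(sql):
    """Run `sql` against a small in-memory sample table and return all rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("Ana", 120.0), ("Ben", 40.0), ("Ana", 75.0),
         ("Cem", 90.0), ("Cem", 30.0)],
    )
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run(TOTALS_WITH_SUBQUERIES))
    print(run(TOTALS_WITH_CTE))
```

Neither form can be reached from another query: the CTE name customer_totals exists only inside the statement that defines it.
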
Now if you think about temporary tables, I would say the reusability here is
medium. That's because you can access the data from multiple queries, but only
during this session. Once the session has ended you cannot access it anymore,
which means you have to recreate it in order to reuse it. So it is more
reusable than the CTE and the subquery, but not as good as CTAS and views.
Those techniques offer the highest reusability for you.
So they are always there for multiple users and multiple queries. That can
eliminate a lot of redundancy, and you have to do the job only once.
Now moving on to the next one. If you are thinking about the intermediate
result of those techniques, the question is: how fresh is the data? Is the
data from these objects always up to date? For subqueries and CTEs, it is
always up to date, because SQL executes the logic on the fly and stores the
data in memory, and immediately after that the main query comes and gets the
data. So the intermediate results in memory are always up to date. But if you
think about temporary tables and CTAS, the query is executed only once. If
there are any updates or changes to the original table, you will not find
those changes in those objects, because SQL executed the query once and that's
all. So if you query those tables, there is no guarantee that the data is up
to date. If you want fresh data, you always have to drop the table and create
it again from the query. Now if you are talking about views, they are amazing.
They are always up to date, because views do not store any data. Each time you
ask a view for data, the database goes to the original table and fetches the
data for the view. So your data is always fresh and up to date.
So this is the big picture of how those advanced techniques behave in SQL
projects. And if you ask my opinion, my favorite is views, in first place.
Second on my list are CTEs. They are amazing, but don't use more than five
CTEs in one query. Otherwise, it's going to be really annoying and hard to
read.
Then in third place I'm going to say subqueries. And then CTAS. I use CTAS if
the views are slow. If that's the scenario, I jump to CTAS and create a
permanent physical table from my query. And the last one, which I rarely use,
is temporary tables. So this is how I rank those techniques in my SQL
projects.
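
The freshness comparison can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 module; the names orders, v_total and t_total and the sample amounts are hypothetical. It builds a view and a CTAS table from the same query, changes the source table, and then reads both.

```python
import sqlite3


def scalar(conn, sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchall()[0][0]


def freshness_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("Ana", 120.0), ("Ben", 40.0)],
    )

    total_query = "SELECT SUM(amount) AS total FROM orders"
    # The view stores only the query; the CTAS table stores its result.
    conn.execute(f"CREATE VIEW v_total AS {total_query}")
    conn.execute(f"CREATE TABLE t_total AS {total_query}")

    # The source table changes after both objects were created.
    conn.execute("INSERT INTO orders VALUES ('Cem', 90.0)")

    from_view = scalar(conn, "SELECT total FROM v_total")   # re-runs the query
    from_ctas = scalar(conn, "SELECT total FROM t_total")   # stale snapshot

    # Refreshing a CTAS table means dropping it and creating it again.
    conn.execute("DROP TABLE t_total")
    conn.execute(f"CREATE TABLE t_total AS {total_query}")
    refreshed = scalar(conn, "SELECT total FROM t_total")

    conn.close()
    return from_view, from_ctas, refreshed


if __name__ == "__main__":
    print(freshness_demo())
```

With these sample rows the view reports 250.0 right away, the CTAS table still reports the old 160.0, and only after the drop and re-create does it report 250.0 as well.
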
Now I would also like to show you the big picture of how things work in my
projects, so you can see all those different techniques and possibilities that
you can use. It's like a big picture and a recap. So, story time. You have a
database, and things start with a database administrator, or let's say a data
engineer, who is creating a new table from scratch. So he writes a DDL
statement in order to create one physical table in our database. Now our
database table is empty. That's why in the second step he writes an INSERT
statement in order to fill our new table with data.
Once we have a table, we give access to maybe a data scientist or a data
analyst so that she can start writing SQL queries. The first thing that could
happen is that the logic is complex and she has to do it in two steps. The
first step is a query that prepares the data for the second step. That's why
she uses a subquery, and the main query retrieves the data from the
intermediate results in order to prepare the final results for the analyst.
Now what could happen is that some SQL logic keeps being repeated in the
query. So instead of writing another subquery for it, she puts this logic in a
CTE, and then in the main query she uses the result of the CTE in multiple
places in the same query. All of this, the subqueries, the CTEs and the main
query, happens in one single query.
Now what could happen is that she has written some amazing code. So instead of
using it only in her own query, she persists this logic in the database. She
puts it in the database as a view, so that all other users and analysts can
benefit from this logic and don't have to write it again. Instead they just
query the view, and this makes life easier. And of course our data analyst can
also use this view in her main query.
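
The step where the analyst persists her logic as a view, so that other people can query it, might look like the following sketch in Python's built-in sqlite3 module. The view name v_customer_totals, the two helper functions and the sample rows are hypothetical; the point is only that two different queries reuse the same stored logic without repeating it.

```python
import sqlite3


def build_database():
    """Create the sample table and persist the shared logic as a view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("Ana", 120.0), ("Ben", 40.0), ("Ana", 75.0),
         ("Cem", 90.0), ("Cem", 30.0)],
    )
    # The analyst's aggregation, stored once in the database.
    conn.execute(
        "CREATE VIEW v_customer_totals AS "
        "SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count "
        "FROM orders GROUP BY customer"
    )
    return conn


def top_customer(conn):
    """One user's query: who spent the most?"""
    rows = conn.execute(
        "SELECT customer FROM v_customer_totals ORDER BY total DESC LIMIT 1"
    ).fetchall()
    return rows[0][0]


def repeat_customers(conn, min_orders):
    """Another user's query: who placed at least `min_orders` orders?"""
    rows = conn.execute(
        "SELECT customer FROM v_customer_totals "
        "WHERE order_count >= ? ORDER BY customer",
        (min_orders,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_database()
    print(top_customer(conn), repeat_customers(conn, 2))
    conn.close()
```

Both helper queries read from the view as if it were a table, so neither has to know how the totals are calculated.
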
And now one more thing: she has another piece of logic that is really complex,
and everyone can benefit from it as well. But the issue is that this query is
very slow. So now she has to decide: do I put it in a view, or do I create a
new table based on the query using CTAS? Because of the performance, since the
view takes around 30 minutes to execute, she decides to run the query using
CTAS, which generates a physical table, so that all the other analysts can
access this new table and reuse the results. And of course she can use it in
her main query as well.
With that, you now have a feeling for how things work in real projects. It is
not a simple SELECT query from a table. It is like this: people are creating
subqueries, CTEs, views, temporary tables and CTAS tables for different
purposes.
All right my friends. So that's all about CTAS and temporary tables. And with
that we have learned all the techniques for organizing our complex projects.
Next we are going to start talking about something completely different. We
are going to talk about stored procedures and how to put our code inside the
database. This is all about programmability and how to add stuff like
parameters, variables and error handling. So it's like programming. So let's
uncover this world of stored procedures, and let's go.
Now think about stored procedures like this. Every time you go to a coffee
shop, you say, "I would like a large coffee with coconut milk, no sugar, and
extra whipped cream." And you repeat this over and over, each time you go to
this coffee shop. If you are working with stored procedures, it's going to be
like this: whenever you go to the coffee shop, you just say, "Give me my
usual," and the barista knows exactly what you mean by that. You get exactly
your order without specifying and repeating everything word by word. This is
exactly what happens if you work with stored procedures. So let's have some
coffee, right? All right, so now we can continue.
Let's start again from scratch. We always have these two sides: the client
side and the server side of the database. As we have learned, we have a
database, and you as a user can write different SQL statements. For example,
you can write a SELECT statement in order to retrieve data from the database,
another statement where you are inserting data into the database, another one
where you are updating the content of your tables, and so on. So you have
different statements for interacting with the database.
Now let's say that what you are doing is not a one-time job; you keep
repeating those steps over and over. You are always doing an insert, then an
update, and then a select, and you keep repeating that day after day. Now
imagine that you do something crazy and go on vacation, but the job still has
to be done. So you hand over all those SQL statements to your colleagues, and
they have to run them every day while you are gone. You give them all those
SQL scripts and tell them: okay, you have to execute the first query, then the
second query, and then the third query.
This is of course not a good way to do things, because there will be human
errors, for example the scripts get executed in the wrong order, like updating
first and then inserting, and things can go wrong. That's exactly why we have
stored procedures in SQL. What we can do is put all those SQL statements
together in one frame, in one program, and we call it a stored procedure. Once
you do that, your SQL statements no longer stay on the client side; they are
now stored on the server side of the database. That means with stored
procedures we are storing our SQL statements inside the database. So you don't
have to hand over your SQL statements to your colleagues. And now all you have
to do in order to run your SQL statements is execute the stored procedure.
So you write a very simple command, for example EXECUTE SP. With that you are
calling your stored procedure that is stored on the server. Once you execute
it, the database goes to the stored procedure and starts executing all the SQL
statements that you have inside it, and it does that exactly in the order that
you have defined, from top to bottom. Once the database has gone through all
your SQL statements, it returns to the user the data from the SELECT
statements. With that, things are really easy, and you can tell your
colleagues: okay, just execute this stored procedure and the rest will be done
by the database. So you minimize the human errors and you make sure that
everything is executed as you wish. And when you are back from your vacation,
things are easier for you as well. You just have to execute the stored
procedure.
So this is what we mean by a stored procedure. You can store multiple SQL
statements inside it in a specific order, you can save it inside the database,
and each time you need your SQL statements you can simply execute them.
So now let's have a quick comparison between a normal query, a normal SQL
statement, and a stored procedure. In a normal SQL query you have SELECT,
FROM, WHERE and so on. This is like a one-time interaction. You are asking the
database for one thing and the database is answering. So it is a one-time
request. In a stored procedure, on the other hand, you have multiple SQL
statements, and once you execute the stored procedure there are many
interactions with the database in one go. That means multiple operations are
happening inside your stored procedure. So an SQL query is like a simple
request.
You need one thing and you are getting it. A stored procedure, on the other
hand, is like a program. As when you are writing code in any programming
language, it is more than one request and it has a lot of stuff in it. For
example, you can build looping logic where you iterate through something, or
you can build control flow where you have logic like IF...ELSE statements.
So there are different paths in your code. As well, in programming we
have parameters and variables to make our code dynamic and flexible, and we
can build error handling into our code to customize what happens if there is
an issue. So a stored procedure is like having code in, for example,
Python. That means you can do more complicated stuff compared to a simple
query, where you have only one request. With stored procedures you
are programming and coding, and it is more advanced than just having a
query. So if you are working with stored procedures, things are
going to get more complicated and advanced, but of course you get a
lot of flexibility and reusability compared to a simple
query. Now, there is an alternative to stored procedures: you can put all
your SQL statements in Python code, and things can work as well. So either
you put your SQL statements inside a stored procedure or in Python code. But
now the big question is, what are the differences between them? Well, there
is a disadvantage if you have Python on a different server, because you have
to build a connection between your server and the database server, and a
connection always means networking, so you might get slightly worse
performance. So this is one advantage for the stored procedure. Another
advantage for the stored procedure is that all the scripts you store inside
it in the database are going to be precompiled. Precompiled means the
database server already knows about your SQL statements: the syntax was
already checked when the procedure was created, and the database can
prepare what it needs to execute the stored procedure, like building an
execution plan and keeping it ready for the next runs. So if you store your
SQL statements inside a stored procedure in the database, they are very
close to the database, and the database knows everything about your scripts
and is ready to execute them.
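As a rough illustration for SQL Server specifically: one way to see that the server keeps a prepared plan for a procedure is the dynamic management view sys.dm_exec_procedure_stats. This is a sketch under assumptions: it typically requires the VIEW SERVER STATE permission, and a procedure only shows up after it has been executed at least once and its plan is still in the cache.

```sql
-- Sketch (SQL Server only): list stored procedures of the current database
-- that currently have a cached execution plan.
SELECT
    OBJECT_NAME(ps.object_id, ps.database_id) AS ProcedureName,
    ps.cached_time,            -- when the plan was added to the cache
    ps.last_execution_time,    -- when the procedure last ran
    ps.execution_count         -- runs since the plan was cached
FROM sys.dm_exec_procedure_stats AS ps
WHERE ps.database_id = DB_ID();
```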
But if you put all your SQL statements outside of the database, of course
the database has no chance to understand what is coming. It cannot compile
anything until Python sends the code to the database. So this is another
advantage for the stored procedure. But if you build your SQL statements in
Python, you get a lot of advantages as well. For example, you can build very
flexible Python code where you use Python features together with SQL, and
with that you open the door to many possibilities and a lot of flexibility.
Another thing: with Python you can have great version control, since
everything integrates with the Python tooling. And one more advantage is
that if you have a complex requirement in your project, it's going to be
really hard to implement it in stored procedures.
It's going to cost you a lot of lines of code, and things are not going to
be comfortable. But if you implement complex logic in Python, things are
going to be way easier. So with Python, you can implement complex logic very
easily compared to a stored procedure. Those are the big differences between
stored procedures and Python. Now I have to be honest with you about having
your code in stored procedures or in Python. If you are working as a team on
a data project, I would never recommend stored procedures if you have the
possibility to have your code in Python. That's because I saw a lot of
projects using stored procedures, and most of them ended in chaos. It is
really hard to debug. It is really hard to test. It's catastrophic. So
really, don't use stored procedures in your projects, especially if you have
a big project with a lot of data and tables and so on. You can manage
everything perfectly using Python. Especially if you have a platform like
Databricks or Snowflake, then of course the best way to control your data
projects is using Python. But of course, if you don't have this possibility
and you have only a database server and can only work with that, then you
don't have any other option. You have to work with stored procedures. But if
you have the possibility to put your project inside Python and run your
scripts from there, then it is way better than having stored procedures.
Well, this is my opinion, and I'm just talking about working in big
projects. If you have small projects, a few tables and so on, then it's fine
to stay with stored procedures. But never build a big project using stored
procedures, because I tell you, it will never work. So always try to think
about having the right platform to run your projects.
And now that I'm thinking about it, maybe I should have put this tip at the
end of the video, not in the middle. So whatever. If you still want to learn
stored procedures, we're going to continue on that, and I'm going to have a
really nice example of how to build stored procedures step by step, like a
mini project. So why not learn both of them? So let's go.

Okay. So now let's have a quick look at the syntax of the stored procedure.
It is very simple. It always has two parts. First we have to define the
stored procedure. So we can do it like this.
CREATE PROCEDURE. Then we have to define the procedure name, and then we say
AS, and then we have BEGIN and END. It's very important for SQL to
understand where the definition starts and where it ends. Between the BEGIN
and END we're going to have a set of SQL statements. So here you can put
whatever you want: INSERT, UPDATE, queries, anything. And once you have
defined the stored procedure, the next step is to execute it. The syntax is
very simple. We're going to say EXECUTE and then the procedure name.
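The two parts described above can be sketched as the following T-SQL template. ProcedureName and the placeholder SELECT are stand-ins for illustration, not names from the course. GO is the batch separator used by SQL Server client tools; it is needed here because CREATE PROCEDURE has to be the first statement in its batch.

```sql
-- Part 1: define the stored procedure (ProcedureName is a placeholder)
CREATE PROCEDURE ProcedureName AS
BEGIN
    -- any set of SQL statements goes here
    SELECT 1 AS Placeholder;
END;
GO

-- Part 2: execute it
EXECUTE ProcedureName;   -- EXEC is the short form
```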
So that's it. With that, SQL is going to go to the stored procedure and
start executing all the SQL statements that you have in the definition. So
this is the syntax of the stored procedure. As I said, it is very simple.

All right guys. So now let's do it step by step. The first step is to write
a query. Let's say that we have a very simple task, and it says: for US
customers, find the total number of customers and the average score. So
let's go and do it.
It's very simple. So SELECT COUNT(*) as total customers, and then the
average of the score as average score, from our table Sales.Customers. And
since it says US customers, we have to filter the data: the column country
is equal to USA. So that's it. This is our query. Let's go and execute it.
So we have a very quick, nice report about the total number of customers and
the average score. Now let's say that I have a weekly meeting and I have to
present this report over and over. That means I have to execute this query
frequently, on a weekly basis, in order to get the data for the report.
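For reference, the query dictated above might look like this. The table Sales.Customers comes from the narration; the exact column names Score and Country and the output aliases are assumptions, since the transcript only gives them in spoken form.

```sql
-- Step 1: the report query for US customers
-- (column names and aliases are assumed from the narration)
SELECT
    COUNT(*)   AS TotalCustomers,
    AVG(Score) AS AvgScore
FROM Sales.Customers
WHERE Country = 'USA';
```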
So what this means is that I have to save this query in order to use it
later, so that I don't have to rewrite it each time. I have to store this
text somewhere so that I don't rewrite the query over and over. So what I
usually do: we copy the whole query, and then we create a new text file.
Let's say it's going to be my weekly query, and it's going to be an SQL
file. I'm going to edit it, and here I'm going to save my query. Each time I
need this query, I have to copy it, go back to my SQL editor, and then paste
it in order to execute it. So I'm either going to write it each time or copy
and paste it. Well, we don't have to do that. We have stored procedures. So
that means we're going to go to step two, where we're going to turn this
query into a stored procedure.
So let's do that. It's very simple. We're going to say CREATE PROCEDURE, and
now we have to give it a name. It's going to be GetCustomerSummary. After
that we're going to say AS, and then we need the BEGIN and END, and in
between we're going to put our query. So let's copy our query and just put
it in between. That's it. Let's go and execute it, and with that we have
created our stored procedure. Now, in order to see our stored procedure, we
can go to the Object Explorer, to our database SalesDB. Here we have a
folder called Programmability. So let's go inside it. Here we have a lot of
stuff like functions and triggers, and we have Stored Procedures. So let's
go inside it, and we can see over here our newly created stored procedure.
So we are almost there. The next step is to call our stored procedure, and
this is the easiest part.
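Step two, together with the call that is walked through next, can be sketched roughly as follows. The spelling GetCustomerSummary and the column names are assumptions based on the spoken names. The query against sys.procedures is an optional alternative to browsing the Object Explorer and is not something the video shows.

```sql
-- Step 2: wrap the query in a stored procedure
-- (identifier spellings are assumed from the narration)
CREATE PROCEDURE GetCustomerSummary AS
BEGIN
    SELECT
        COUNT(*)   AS TotalCustomers,
        AVG(Score) AS AvgScore
    FROM Sales.Customers
    WHERE Country = 'USA';
END;
GO

-- Optional check: list the stored procedures in the current database
SELECT name, create_date
FROM sys.procedures;
GO

-- Step 3: call the stored procedure
EXEC GetCustomerSummary;
```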
So it's going to be: execute the stored procedure. The syntax is very
simple: EXECUTE and then the name of the stored procedure, so
GetCustomerSummary. Let's go and execute it. And with that, as you can see,
we get the result of our query. So as you can see, it is very simple. In
just a few steps we created a stored procedure, and in the future you don't
need the whole thing. You just execute the stored procedure. I don't have to
store the query locally on my PC or copy and paste anything. If I want this
report now, I just have to execute the stored procedure like this and I will
get the
results.

Okay. So now let's keep moving. Now we're going to talk about parameters
inside stored procedures. So what is a parameter? It is like a placeholder
through which you can pass information into the stored procedure while
running it, and using parameters in a stored procedure makes it flexible,
reusable, and dynamic. Let's understand what this means. Let's say that you
got a new task. It says: for German customers, find the total number of
customers and the average score. So that means we now have to generate two
reports, one for USA and one for Germany, and in both of them you are doing
the same aggregation. Again we have to start writing the query. It's going
to be very similar to the one we have in the previous example. We are doing
the same stuff, the same aggregations, and the only change here is that
we're going to use another value to filter the data. So instead of USA we're
going to say Germany here. Let's go and execute this one over here. And with
that we can see that the total number of customers is two. So this is the
report that we have to provide on a weekly basis. And again, in order not to
copy and paste stuff, we're going to create a stored procedure for that,
with an END at the end. But of course we cannot have the same name, so we're
going to add Germany to it here. Let's go and execute it. In the next step
we have to execute the stored procedure, like this. Let's go and execute it.
And the whole logic is now stored inside the database. Let's go and refresh
the Object Explorer over here.
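A sketch of what this second procedure probably looks like. The name GetCustomerSummaryGermany is a guess (the narration only says that Germany is added to the name), and the column names are assumed as before.

```sql
-- Second, nearly identical procedure (name is hypothetical)
CREATE PROCEDURE GetCustomerSummaryGermany AS
BEGIN
    SELECT
        COUNT(*)   AS TotalCustomers,
        AVG(Score) AS AvgScore
    FROM Sales.Customers
    WHERE Country = 'Germany';   -- the only difference from the USA version
END;
GO

EXEC GetCustomerSummaryGermany;
```

Apart from the name and the filter value, nothing differs from the first procedure.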
And you can see now we have two stored procedures. But by now you should
feel there is something wrong. Always, in programming and coding, if you
find yourself repeating the same task over and over, there is a smarter way
to optimize that. Repeating stuff in coding is always a bad thing. And
clearly we are repeating the same query in two different stored procedures.
If you compare them, you see it's because of the value. We have here the
value for the filter, once Germany and once USA, and those values are static
values. So it's always going to stay USA inside the stored procedure. But
instead of that, we can replace those static values with a parameter. Then
you decide, as you are executing the stored procedure, for which country you
want to run it. So let's go and do that. I'm just going to remove everything
from here and focus only on the first stored procedure. Now, after giving
the name of our stored procedure, we have to define our parameter. It starts
with an @, and with that SQL understands: aha, now we are talking about
parameters. Next we need the name of the parameter. It's going to be
Country. It could be any name that you want. And after that we have to
define the data type for SQL.
It's like when you are creating a table and defining columns: you assign a
data type to each column. The same thing here, you have to assign a data
type to each parameter as well. So we're going to use the data type
NVARCHAR, and for the countries a length of 50 is enough. With that we are
telling SQL that for this stored procedure we can pass information in, and
this value is going to be held in this parameter. Now that we have defined
the parameter over here, we can use it anywhere inside our query. And of
course we want to use it instead of this static value. So we're going to
remove the static value and instead put the parameter. Now we are saying:
filter the table based on the value that comes from the user, and no longer
on the static USA. And as I said, you can use this parameter everywhere,
like even here in the SELECT statement. It is a value that can be used
everywhere in your query. So that's it. We have defined our new parameter
and we have used it in our query. Now we have to update the stored
procedure. We cannot leave it as CREATE, because the procedure already
exists. Instead, we're going to say ALTER. So we are saying ALTER PROCEDURE
with the new information. Let's run it to apply the change. And now we have
to execute the procedure. So we're going to say EXECUTE GetCustomerSummary.
But now our stored procedure is expecting a value from you, from the input.
We're going to write it using the same name we defined over here: the
parameter @Country is equal to Germany. That means the value of this
parameter comes from me, from the input, and this information is going to be
passed to my query in the stored procedure. So let's go and execute it.
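The parameterized version described above can be sketched like this. As before, the identifier spellings (GetCustomerSummary, @Country, Score, Country) are assumptions taken from the spoken names; NVARCHAR(50) is the type and length mentioned in the narration.

```sql
-- Replace the static filter value with a parameter
ALTER PROCEDURE GetCustomerSummary @Country NVARCHAR(50)
AS
BEGIN
    SELECT
        COUNT(*)   AS TotalCustomers,
        AVG(Score) AS AvgScore
    FROM Sales.Customers
    WHERE Country = @Country;    -- value supplied by the caller
END;
GO

-- Pass the value by parameter name when calling the procedure
EXEC GetCustomerSummary @Country = 'Germany';
```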
And with that, as you can see, we are getting the report of customers for
Germany. Now if you say, okay, let's generate the report for USA, all you
have to do is replace the parameter value. So instead of Germany we're going
to say USA. Let's go and execute it. Great. Now we are getting the report
for US customers as well. So that means, my friends, for those two reports I
need just one stored procedure, and with the help of the parameter I made my
stored procedure more flexible and professional. This is exactly the power
of parameters: they make everything reusable and dynamic. And now of course
we don't need the stored procedure for Germany anymore, so we can drop it.
We're going to say DROP PROCEDURE, and it was named like this, with Germany.
We don't need this stored procedure, and we're going to stay with only one
dynamic stored procedure. So this is how to use parameters in stored
procedures and why
parameters in store procedure and why it's important. Okay. So now to the next
it's important. Okay. So now to the next step is that we can go and add default
step is that we can go and add default values for the parameters. So let's say
values for the parameters. So let's say that I execute very frequently this
that I execute very frequently this report where I say the country equal to
report where I say the country equal to USA and I don't want each time to define
USA and I don't want each time to define the parameter value equal to USA. So if
the parameter value equal to USA. So if you are using a value very frequently
you are using a value very frequently you can add it as a default inside the
you can add it as a default inside the definition of the store procedure and it
definition of the store procedure and it is very simple. So if you go to the
is very simple. So if you go to the definition again over here after the
definition again over here after the parameter and you say equal to USA. So
parameter and you say equal to USA. So now it's very important to understand
now it's very important to understand that the country will not be always
that the country will not be always equal to USA. It is just you are saying
equal to USA. It is just you are saying if I don't get from the user any value
if I don't get from the user any value then as a default I'm going to go and
then as a default I'm going to go and use the USA. So let's go and again
use the USA. So let's go and again change the definition of our stored
change the definition of our stored procedure using alter. So execute and
procedure using alter. So execute and now we can go to our store procedure and
now we can go to our store procedure and I can skip the whole thing over here and
I can skip the whole thing over here and execute it. So now as a default I'm
execute it. So now as a default I'm getting the report of USA without
getting the report of USA without passing an information to the store
passing an information to the store procedure because I know it is as a
procedure because I know it is as a default USA. But if you need it as a
default USA. But if you need it as a Germany of course you have to go and
Germany of course you have to go and define it. So you say execute the store
define it. So you say execute the store procedure where the country equal to
procedure where the country equal to Germany. So if you execute it like this
Germany. So if you execute it like this SQL still going to use your value. So
SQL still going to use your value. So the value that comes as an input from
the value that comes as an input from the user has more priority of course as
the user has more priority of course as the defaults. And with that we are
the defaults. And with that we are getting the Germany reports. So as you
So as you can see, it's really nice using parameters in a stored procedure.

All right, moving on to the next step. Now we can work with multiple queries inside one stored procedure. This is what we learned at the start: we can have multiple SQL statements in one stored procedure. And now we have a new report and query to generate. It says: find the total number of orders and the total sales. So let's do it quickly. We can write it like this: SELECT, then COUNT of the order ID, which is the total orders, and then the SUM of sales, which is the total sales, from our table Sales.Orders. And of course we are always creating a report based on a specific country. That means we have to join it with the customers table in order to filter the data, so ON customer ID equal to customer ID. And now we're going to filter the data, so country equal to USA. So something like this.
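The query as described might look roughly like the following. The exact table and column names (Sales.Orders, Sales.Customers, OrderID, Sales, CustomerID, Country) are assumptions based on how they are spoken in the walkthrough.

```sql
-- Sketch only: object and column names are assumed, not confirmed by the transcript.
SELECT
    COUNT(o.OrderID) AS TotalOrders,   -- total number of orders
    SUM(o.Sales)     AS TotalSales     -- total sales amount
FROM Sales.Orders AS o
JOIN Sales.Customers AS c
    ON c.CustomerID = o.CustomerID     -- needed only to reach the Country column
WHERE c.Country = 'USA';               -- static value for now
```

The join exists purely for filtering: the country lives on the customer, not on the order.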
Let's go and execute it. With that, for the US customers we have six orders and total sales of 180. And of course we're going to do the same thing for Germany. Now, we will not create an extra stored procedure for this. We're going to put everything in one stored procedure. So let's copy the whole thing and put it inside, so after the first report we have the second report.

The best practice here: if you have multiple queries in a stored procedure, add a semicolon at the end of each query. It is just easier to see that this is the end of the query, especially if you have big, complex queries with CTEs, unions, and so on, where it's really hard to see that we are now talking about a completely new query. It is not something the database requires, it's just easier to read. So just add semicolons at the end of each query.
So now let's execute the whole thing in order to change the definition of our stored procedure. And one more thing, of course: don't forget that we don't need static values over here. We're going to add our nice parameter, so @Country. I think with that everything is ready to be executed. So let's change the definition of our stored procedure.

Now let's start with the default, where the country is equal to USA. Let's execute it. In the output, as you can see, we have two results, and that's because we have two queries. The first result is for the first query and the second one is for the new query that we just created. And the same thing if you execute the stored procedure for Germany: we get two results as well. Here we can see we have four orders and 200 of total sales for Germany.

So as you can see, it's very simple. You can now add multiple SQL statements, and not only queries: you can do an update, an insert, a delete. Any kind of SQL statement can go inside your program. And as usual, SQL is going to execute it from top to bottom. Since this is the first SQL statement, it's going to execute it first, and after that it goes to the next one.
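Putting the two reports into one procedure could be sketched as follows. As before, GetCustomerSummary and the Sales.* object names are assumptions for illustration.

```sql
-- Sketch only: procedure, table, and column names are assumed.
ALTER PROCEDURE GetCustomerSummary
    @Country NVARCHAR(50) = 'USA'
AS
BEGIN
    -- Report 1: customers and average score for the country
    SELECT
        COUNT(*)   AS TotalCustomers,
        AVG(Score) AS AvgScore
    FROM Sales.Customers
    WHERE Country = @Country;          -- semicolon marks the end of the first query

    -- Report 2: orders and sales for the same country
    SELECT
        COUNT(o.OrderID) AS TotalOrders,
        SUM(o.Sales)     AS TotalSales
    FROM Sales.Orders AS o
    JOIN Sales.Customers AS c
        ON c.CustomerID = o.CustomerID
    WHERE c.Country = @Country;        -- parameter instead of a static value
END
GO

EXEC GetCustomerSummary @Country = 'Germany';   -- returns two result sets, in statement order
```

One caveat on the semicolon advice: SQL Server tolerates missing terminators in most places, but a statement that comes directly before a CTE's WITH must be terminated, so the habit also avoids that error.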
So this is how you can add multiple SQL statements to your stored procedure.

All right everyone. So now we're going to talk about variables. What is a variable? It is like a placeholder where you store a value in order to use it later inside your stored procedure. That means a variable holds a value in memory, and you can reuse it anywhere you want inside your stored procedure. But it's not like a parameter. A parameter is something that comes from outside the stored procedure. It's an input from whoever is executing the stored procedure, and the stored procedure has to adapt to the parameter. A variable is something that lives inside the stored procedure, and we as developers use it to make our code dynamic and to move a value from one place to another.
So let's have a very simple example now. Let's say that we don't want our report about the total customers as a query, so I don't want it as a result in the output. Let's say I always generate the report like this: the total customers from Germany is equal to two, and the average score from Germany is equal to 425. So I need it as text, not as a table like here.

In order to do that we can use the T-SQL PRINT to give a message after executing the stored procedure. The syntax of PRINT is very simple. We can go over here and say PRINT, then we have single quotes, and let's take the whole message from the comment here, without the comment marker, and then the semicolon. We can repeat that for the second message, for the average score, and put a semicolon there as well.

Now if you do it like this, the message is always going to be static. We will always have two for the total customers, and the average score will always be like this even though the data is changing. So we cannot have it static like this. We have to make it dynamic, especially if we are calling this procedure for USA: then we cannot have Germany in here. So let's see how we can make this dynamic.
Now let's start with the easy stuff. Instead of Germany over here, we can put our parameter, right? So instead of this we're going to say @Country. But now the problem is that it is part of the whole string, and we cannot do that. So we're going to end the text with a quote, and you can see the coloring is changing, and then have a plus in order to concatenate. So this text comes first, then the value of the country, then we have the colon as static text, again a concatenation, and then we have the two, which we can talk about later.

Let's do the same over here. We're going to say plus @Country. The coloring is not changing because of this quote, so let me just remove it, and then afterward a plus, make the text static again, a plus, and remove the final quote. With that, the message is now dynamic, and we get the value of the country from the parameter.
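The concatenation step can be tried on its own. In this sketch a local variable stands in for the procedure's @Country parameter, and the message wording is illustrative rather than the exact text from the video.

```sql
-- Standalone sketch: a local variable stands in for the @Country parameter.
DECLARE @Country NVARCHAR(50) = 'Germany';

-- Static text + parameter value + static text; the numbers are still hard-coded here.
PRINT 'Total Customers from ' + @Country + ': 2';
PRINT 'Average Score from ' + @Country + ': 425';
```

Each quoted piece is a separate string literal, and the plus sign joins them with the parameter's value at run time.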
And now we come to the interesting part. We have an issue here: those two values come from this query, and of course we cannot use a parameter for that. We have to use variables. To make a variable we have three steps.

The first step is that we have to tell SQL about our new variable, so that SQL can prepare a placeholder for it in memory. Usually we do all the declarations of our variables at the start of the stored procedure, immediately after BEGIN. So we're going to go over here and say DECLARE, and after that it's like the parameters, very simple: @TotalCustomers. This is the name of the variable. After that we have to define the data type. Of course you have to work out the data type from the query. Since we are saying COUNT(*), the output is going to be an integer, so we write INT. Now we need another one for the average. We add a comma, and now we are declaring another variable: @AvgScore. The data type of this one is going to be FLOAT because we have an average. That's it for the first step. We are telling SQL we have two variables, and SQL is going to create an empty placeholder for each.
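The declaration step on its own might look like this. The variable names @TotalCustomers and @AvgScore follow the spoken names; the exact spelling used on screen is an assumption.

```sql
-- Step 1: declare the variables (names follow the walkthrough; spelling is assumed).
DECLARE @TotalCustomers INT,
        @AvgScore       FLOAT;

-- A freshly declared variable holds NULL until a value is assigned.
SELECT @TotalCustomers AS TotalCustomers, @AvgScore AS AvgScore;
```

Inside a stored procedure, the DECLARE line would sit right after BEGIN, as described above.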
So now in the second step we have to give our variables a value. Where are we going to get the values? We're going to get them from the query. So let's do that. Let's start with the first column. As you can see, we have COUNT(*) here. As we learned, anything that we write on the right side is going to be an alias for the column. But in SQL, if you write something before it, it is going to be the variable. So we can do it like this: @TotalCustomers and then an equals sign. Now we are saying that whatever value this query returns should be stored inside my new variable, so I'm assigning a value to my variable.

But there is one thing here: we cannot have aliases anymore, because our query will not return any results. Our query now has only one task, to assign values to my variables. That's why we cannot have it like this, and we have to remove the alias. We do the same thing for the average: @AvgScore equal to the average of the score, and we remove the alias. So that's it. Now our query has a different purpose. It is not for returning a result. It is to assign values to our variables.
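A sketch of the assignment step, written as a standalone batch. Sales.Customers, Score, and Country are assumed names, and the local @Country variable stands in for the procedure parameter.

```sql
-- Sketch only: object names are assumed; @Country stands in for the parameter.
DECLARE @Country NVARCHAR(50) = 'USA';
DECLARE @TotalCustomers INT, @AvgScore FLOAT;

-- Step 2: the SELECT assigns to the variables instead of returning rows.
-- No column aliases are allowed in an assignment SELECT.
SELECT
    @TotalCustomers = COUNT(*),
    @AvgScore       = AVG(Score)
FROM Sales.Customers
WHERE Country = @Country;
```

Running this produces no result grid, which matches the point that the query's only job now is to fill the variables.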
So now we have values, and in the next step we have to use them. We can use our variables anywhere inside our stored procedure. It could be in the PRINT, it could be in the next query, in any SELECT statement, in any place. Sometimes we use variables in order to pass information from one query to another. But in this example we want to use our variables inside the PRINT. It is very simple: we replace the static number, and it's like the parameter. We say @TotalCustomers, and the same thing for the average, @AvgScore. So that's it.

So again: in step one we have to declare them, to define them for SQL, and with that we get an empty variable. In the second step we have to assign values to those variables. And in the last step we have to use those variables. Makes sense, right? Now if you check our message over here, you can see that everything is dynamic and we don't have any static values.

But there is one more thing: in the PRINT, everything should be a string. We cannot have dates, numbers, floats, and so on. That's why you have to check, if you're adding any parameters and variables, that all of them are strings. The country is okay because it has the data type varchar, but the total number and the average score are not, because they have different data types, and we have to cast them. So we're going to say CAST, and here AS NVARCHAR, so that we don't get any errors from SQL. And CAST here as well AS NVARCHAR, like this.

All right. I think we are ready. Let's change the definition of our stored procedure in order to test. So let's go and execute.
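The PRINT statements with the casts described above might look like this as a standalone batch. Object names and message wording are assumptions. The casts are needed because, when text is combined with a number using plus, SQL Server tries to convert the text to the numeric type and fails.

```sql
-- Sketch only: object names and message text are assumed.
DECLARE @Country NVARCHAR(50) = 'USA';          -- stands in for the parameter
DECLARE @TotalCustomers INT, @AvgScore FLOAT;

SELECT
    @TotalCustomers = COUNT(*),
    @AvgScore       = AVG(Score)
FROM Sales.Customers
WHERE Country = @Country;

-- Step 3: use the variables; numeric values are cast to text before concatenation.
PRINT 'Total Customers from ' + @Country + ': ' + CAST(@TotalCustomers AS NVARCHAR);
PRINT 'Average Score from '   + @Country + ': ' + CAST(@AvgScore AS NVARCHAR);
```

The output appears in the Messages tab rather than in the result grid.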
Perfect. And now let's test. Let's start with the default, where the parameter is USA. As you can see, we are getting one result, and this is from the second query. The first query is not returning anything in the output anymore. But if you go to the messages over here, you can see we have a new message. It says the total customers from USA is equal to three and the average score from USA is equal to 825. And this is exactly what we wanted for our report.

Now let's execute with the parameter equal to Germany. Again, we have only one result. And in the messages we get: total customers from Germany is equal to two, and the average score from Germany is equal to 425. So this is exactly how we work with variables. We use them to hold a piece of information in one place in order to reuse it later in a different place. So that's it for variables.

All right everyone. Now we're going to talk about how to control the
flow in your stored procedure, and we're going to learn how to do that using IF ELSE statements. Let's have the following scenario. If you check our query over here, we are taking the average of the score, and if you check the data you can see that we have nulls in the scores, and nulls are really bad for aggregations. So we usually have to clean up our data before doing any aggregations. In this scenario we can treat null as zero.

How are we going to clean up and handle the data? We're going to make an update on our table where we say: if there is a null, then make it a zero. And we will do this as a pre-step inside our stored procedure. That means first we have to clean up the data, and afterward we generate the reports. This is what we usually do in SQL projects.

The logic is going to be very simple. We have to check first: do we have nulls in the score? If the answer is yes, then we have to update the null values to zero. But if the answer is no, we don't have any nulls, then we can skip everything. So now we're going to build this logic inside our stored procedure in order to clean up and prepare the data. So let's go.

Okay. This part we're going to call "generating reports", and we're going to have another part called "prepare and clean up data".
So now let's first prepare the structure of the IF statement. The syntax is going to look like this: IF, and then BEGIN and END. This is the block of the IF. We do the same thing for the ELSE: we have ELSE, and we have BEGIN and END. Let me just separate them.

So how does this work? We have to create a condition. If the condition is met, then the IF block is going to be executed. But if the condition is not fulfilled and we get false, then the ELSE block is going to be executed.

So what is the condition? We have to check whether there is a null in the scores. Let's write a very simple query. It says SELECT 1 FROM Sales.Customers WHERE score IS NULL, and as always we have to check the country, equal to, let's say, USA. Let's execute this one. Now we are getting a result in the output. If we are getting a result, that means there are nulls somewhere. But if you say Germany here, for example, and execute the same query, you see in the output that we don't have any results. That means for the German customers we don't have any nulls in their scores. So if this query returns something, we have nulls. If it doesn't return anything, then there are no nulls. And we're going to use exactly this query as a condition.
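The check query described here could be sketched as follows, with Sales.Customers, Score, and Country again being assumed names.

```sql
-- Sketch only: returns one row (containing the constant 1) per matching customer,
-- and no rows at all when the country has no NULL scores.
SELECT 1
FROM Sales.Customers
WHERE Score IS NULL
  AND Country = 'USA';
```

The selected value itself does not matter; only whether any row comes back.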
So we're going to take our check and say IF EXISTS, then two parentheses, and inside them we put our query. What we are saying is: if this query returns anything, then execute the next block, and if it does not exist, meaning the query returns nothing, then execute the second block. It's logical, right? Very simple. Now of course, instead of having a static value over here, we can use our parameter, so @Country.

Now we have to tell SQL what to do if it exists. In between we can have an UPDATE statement: UPDATE Sales.Customers, and we're going to SET the score equal to zero. But, very important, we have to use a WHERE condition, otherwise it is going to update everything: WHERE the score IS NULL and the country is equal to our parameter @Country. With that we are updating exactly the nulls for the specific country. Let's have a semicolon at the end. And at the start, just to have a nice message in the output, I'm going to say PRINT with the message "updating null scores to zero", and a semicolon at the end as well. So if there are any nulls, execute the whole thing: print the message and update the table.

The next step is to tell SQL what should happen if the condition is not fulfilled, meaning we don't have any nulls. Well, we don't have to update the table at all, because we don't have to clean up anything. But I'm going to put a PRINT here with the message "no null scores found", and at the end I'm going to put a semicolon.

So that's it. This is our logic. We are checking our condition. If the condition is met, we update the table with zero instead of null. And if the condition is not met, then we don't do anything. Just print a message.
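The complete clean-up block might look like this as a standalone batch. Object names and message text are assumptions, and a local variable stands in for the parameter. Note that this changes data, so it is best tried on a practice database.

```sql
-- Sketch only: object names and messages are assumed; @Country stands in for the parameter.
DECLARE @Country NVARCHAR(50) = 'USA';

-- Prepare and clean up data
IF EXISTS (SELECT 1 FROM Sales.Customers WHERE Score IS NULL AND Country = @Country)
BEGIN
    PRINT 'Updating NULL scores to 0';
    UPDATE Sales.Customers
    SET Score = 0
    WHERE Score IS NULL          -- without the WHERE clause every row would be updated
      AND Country = @Country;
END
ELSE
BEGIN
    PRINT 'No NULL scores found';
END;
```

Inside the stored procedure, this block would come before the report queries so that the averages are computed on the cleaned data.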
message. Now you might say you know what why you are doing this? we just can use
why you are doing this? we just can use this update statements and we don't need
this update statements and we don't need the whole if else statements. So why we
the whole if else statements. So why we are checking in the first place? I can
are checking in the first place? I can like each time I run this store
like each time I run this store procedure I go and update all the nulls
procedure I go and update all the nulls if they exist to a zero. Well, this is
if they exist to a zero. Well, this is not really professional because you are
not really professional because you are wasting resources. So each time you run
wasting resources. So each time you run an update statement like this. So
an update statement like this. So imagine that you have a big table and
imagine that you have a big table and each time you run your store procedure,
each time you run your store procedure, SQL have to go and check whether there
SQL have to go and check whether there is any nulls and so on. And this is of
is any nulls and so on. And this is of course consume resources. It's way
course consume resources. It's way better if you go and check first whether
better if you go and check first whether it's really needed. So that's why we are
it's really needed. So that's why we are doing this logic. Now as you can see our
stored procedure is getting bigger and bigger. So we have two parts: in the
first part we are preparing and cleaning up the data, and in the second part we
are generating reports. Let's go and update the whole thing and execute it. And
now we have to do it step by step. So let's check our query over here, and you
can see we have a null for the USA customers. So let's first execute it for the
USA as the default. And now let's go and check the messages. It's saying
"updating null scores to zero". That means the first block was executed, because
SQL did find a customer with a null. And with that, the average of the scores is
going to be different than previously. So we now have a more accurate average in
our reports. If you go and check our query again, you can see we now have a zero
instead of the null. Let's go and execute it for Germany like this, and let's
check the messages. It says "no null scores found". And that is correct, because
for Germany we don't have any nulls. So with that we have created a control flow
using the IF ELSE statements. And as you can see, we are no longer doing just simple queries.
We are creating something like a mini program. And now it's like an ETL, where
first we prepare the data and second we generate reports. And you can imagine
how big those stored procedures are going to get in a real project, where you
have a lot of tables and a lot of things to
do. Okay. So now we're going to talk about error handling in stored procedures.
Error handling is an essential thing to do while programming, because it gives
you control over what happens once you have an error. And there are a lot of
things that you can do, like maybe deleting data, printing a well-structured
message, or maybe doing some logging and so on. So you have full control over
what to do if there is an error, and of course we can do that in a stored
procedure. So now let's quickly check the syntax. It usually has two parts. The
first part is the TRY part. The syntax is like this: BEGIN TRY, END TRY. So you
are defining the boundaries of the TRY, and in between you are going to have all
your SQL statements and your code. And the second part is going to be the CATCH
part. So you say BEGIN CATCH and END CATCH. Again you are defining the
boundaries, and in between you can tell SQL what to do if there is an error. So what is
TRY and CATCH? As the word says, with TRY you are attempting to do something
that might fail. So you are telling SQL: try to execute this code. SQL is going
to go and try to execute your code, and if any error happens while executing it,
SQL is going to jump to the second block and start doing whatever you have
defined in the CATCH. But if there are no errors at all, this part will not be
executed. So the CATCH is like your backup plan: if something goes wrong here,
then go to plan B and do something. So let's see the workflow of the TRY CATCH.
First, SQL is going to execute the TRY, and then it is going to check: is there
any error? If we don't have any error, then everything ends and that's it. But
if SQL faces any error during execution, what is going to happen is that it goes
and executes the CATCH. So as you can see, the workflow is very simple, and
this is what we mean by TRY and CATCH.
So let's go back to SQL to have an example. All right. So now back to our stored
procedure. Let's introduce an error inside our code. So let's go over here, and
maybe in our query we're going to divide by zero, which is of course a problem.
So we have this error over here, and let's go and update the logic of our stored
procedure. And now if you execute it, so let's go and do that, we will get an
error saying, yeah, you cannot divide by zero. But now what I would like to have
is something else, where we get a customized message when an error happens. So I
would like to have control over which information is displayed if there is an
issue. And in order to do that, we have to use TRY and CATCH. It's going to be
very simple. Now this is my whole code. So the whole thing, from preparing the
data to generating the report, is my code, and we have to put the whole thing in
a TRY. So how do we do that? Exactly after the first BEGIN we're going to have
another BEGIN, but for the TRY. And then we're going to go to the last END over
here and add an END TRY just before it. So with that we have now put the whole
code inside the TRY. And after that we're going to introduce the CATCH. So BEGIN
CATCH and END CATCH. And now in between we have to tell SQL what should happen
if we encounter an error. Here we can do many things, but I would like to focus
now on customizing the error message. Let's
start with the first one. So I'm going to say PRINT, let's say, "an error
occurred". This is the first thing. Then on the next line I'm going to print
more information. And now we're going to print the error message. So "error
message", colon, space. And now we can go and use some predefined functions from
SQL, like for example ERROR_MESSAGE. This function is going to return the
description of the error, like the one we have here, "divide by zero error
encountered". And we can keep adding things the way we need, like maybe the
error number. So we can have it like this, and for that we have as well a
function called ERROR_NUMBER. And I think we have to cast this one, because it
is a number and in the messages we can only have VARCHAR. So this is going to be
AS NVARCHAR, like this. And we can keep adding things to our message. For
example, let's take the error line, and for that we have as well a function, so
it's going to be ERROR_LINE, like this. And we have to cast it, because it is a
number as well. And what is also really important is the name of the stored
procedure. So "error procedure", and we have a function for that,
ERROR_PROCEDURE, like this. It returns a string, so that's why I don't have to
cast it. So now with that we have defined for SQL what to do if there is an
error in our code. So let's go and execute the whole thing.
And now let's go and execute our stored procedure. So let's do that. Now as you
can see, in the output we are not getting any results, and it is not raising an
error. But if you go to the messages, you will see a very nice message. It says
an error occurred, the error message is "divide by zero error encountered", and
we have the error number, the line, and as well the stored procedure name. So as
you can see, it's amazing. This is how we use TRY and CATCH in order to have
more options to control what happens if there is
an
error. Now for the next step, we have to go and organize our stored procedure.
As you can see, everything is getting bigger. So what we usually do is use tabs
in order to indent each section. The first section is between the first BEGIN
and the last END. So we have to mark everything and hit Tab once. Now it is
easier to read: the whole thing is our code. The next level is the block of the
TRY. The whole thing over here is the TRY, so let's go and do that. I'm just
going to mark everything until here and then hit Tab. So now we can see it
better, right? And the same thing for the CATCH. I think I have already done
that, so it's already indented. Now we go to the next level. Between this BEGIN
and END, everything is indented, so this looks nice. The same thing over here,
it's indented as well. And then here we don't have any BEGIN and END, so it
looks okay. And the same thing over here. So all our BEGIN and END blocks are
now aligned correctly. Now the next step is that we
can go and improve the comments a little bit, so we can split our code into
multiple sections. What we're going to do is go over here and say this is step
one. And what I like to do is add a separator line using equals signs, or any
special character that you like, and the same here. So with that we have the
first step: we are preparing the data. And then let's copy the whole thing, go
over here, and say this is step two. And we're going to say this is generating
summary reports, something like this. And of course below that we can say what
this report is about: calculate total customers and average score for a specific
country. And as well we can go over here and add a comment: calculate total
number of orders and total sales for a specific country. And of course we have
to remove this error over here, otherwise we'll get an error. And we can add
something about the CATCH as well, where again we add a few comments, and we're
going to say error handling. So let's go and execute it again in order to make
sure we have the newest version. And with that we are done. We have a really
nice stored procedure with multiple steps, and it is professional: we have error
handling inside it, and everything looks well organized and easy to read. So
this is how we build stored procedures. All right, my friends. So that's all
about stored procedures. That was an amazing feature to add programmability to
SQL. Now in the next step we're going to quickly cover the topic of triggers. So let's
go. All right. So previously we have understood that we can put all our SQL
statements in one stored procedure, and you have to go and manually execute the
stored procedure. So that means in order to trigger the stored procedure, you
have to execute it manually, and this is of course a problem. How about doing
that automatically? So triggers in SQL are special stored procedures that
automatically run, or let's say are fired, in response to a specific event that
happens on a table. So what exactly does this mean? Let's say that we have a
table in our database, and now something could happen to this table, like
inserting data, deleting data, or updating data. All those things that are
happening we call events. And now what we can do is attach a trigger on top of
this table, and each time an event happens, like an insert, update, or delete,
something else is going to be triggered, like maybe inserting data somewhere
else in another table, or doing a check whether we are allowed to delete the
data in the first place, or maybe sending a warning message or something.
So based on any changes to the table we can trigger another events and we can do
can trigger another events and we can do that using the SQL triggers and for the
that using the SQL triggers and for the SQL triggers we have like multiple types
SQL triggers we have like multiple types like the DML triggers and this type of
like the DML triggers and this type of trigger going to respond once we have
trigger going to respond once we have like insert update delete statements.
like insert update delete statements. Another type of triggers we have the DDL
Another type of triggers we have the DDL triggers like you can make a trigger to
triggers like you can make a trigger to respond to any schema changes like
respond to any schema changes like creating altering or dropping a table or
creating altering or dropping a table or even view by the way not only tables.
even view by the way not only tables. And the third type of triggers we have
And the third type of triggers we have the login trigger. So the trigger can
the login trigger. So the trigger can respond to login events. Now in this
respond to login events. Now in this tutorial we're going to focus on the DML
tutorial we're going to focus on the DML triggers the insert update delete. And
triggers the insert update delete. And for the DML triggers we have two types.
for the DML triggers we have two types. We have after triggers and as well we
We have after triggers and as well we have instead of triggers. So as the name
have instead of triggers. So as the name suggest if you use after so it can be
suggest if you use after so it can be executed after the event and the other
executed after the event and the other type that instead of it's something that
type that instead of it's something that cannot wait until everything happens. So
cannot wait until everything happens. So this time the trigger going to be
this time the trigger going to be executed during the event not after it.
So now in order to understand all of this, we're going to have a really nice use
case. And the use case is about maintaining audit logs. So what do we mean by
that? Let's take for example the table employees. Employee data is usually very
sensitive information, because there we can see which employees are added, the
salary updates, and the employee terminations. This makes the table very
important, and we would like to track all the changes that are happening to this
table. So each time we are inserting, updating, or deleting, we would like to
maintain a log of all those changes in order to analyze them later. Such logs
are of course very important for compliance and for the auditors, and in case
there is a problem, we can go to the logs to understand when it happened, who
made the changes, and what exactly changed. And now in order to maintain the
logs, we're going to use the power of triggers. So what we're going to do is
attach a trigger to the table employees, and each time we insert new data into
employees, we are triggering another event. What happens is that this new
employee is going to be inserted into the audit logs, in order to have a record
of this activity in the logs. So that means each time you are inserting data
into the table employees, you are automatically inserting data into the logs,
and this is a really amazing use case for triggers. So let's go and
implement it. Okay. So now let's check quickly the syntax of the triggers. So
quickly the syntax of the triggers. So we start with the usuals create trigger
we start with the usuals create trigger then the trigger name and then we have
then the trigger name and then we have to specify on which table this trigger
to specify on which table this trigger going to be built in. So now we are
going to be built in. So now we are attaching like a trigger on top of one
attaching like a trigger on top of one table and after that we have to define
table and after that we have to define for SQL when this trigger going to
for SQL when this trigger going to happen. So what is actually triggering
happen. So what is actually triggering the trigger and here you can define
the trigger and here you can define after or instead then you have to define
after or instead then you have to define the operator. So first you have to
the operator. So first you have to define like after or instead of and then
define like after or instead of and then we have to define the operation. So
we have to define the operation. So insert, update, delete or one of them.
insert, update, delete or one of them. And with that you are telling SQL when
And with that you are telling SQL when exactly this should happen. And now
exactly this should happen. And now after that we have to tell SQL what
after that we have to tell SQL what going to happen if the trigger is
going to happen if the trigger is triggered. So here we have like begin
triggered. So here we have like begin and end. And then we have like several
and end. And then we have like several skill statements that's going to
skill statements that's going to describe what's going to happen once we
describe what's going to happen once we have the trigger. So that's it. As you
have the trigger. So that's it. As you can see the syntax is very simple. Okay.
So now let's do it step by step. First I would like to create a table where
we're going to store the log information. It's going to be a very simple table.
We're going to say CREATE TABLE, then we're going to call it Sales.EmployeeLogs,
and we're going to have the following columns inside it. Let's start with the
primary key. It's going to be the log ID with the data type INT, and then we
want something like a sequence, so we're going to have IDENTITY, and this is the
primary key. Let's go to the next one. It's going to be the employee ID, and the
data type is going to be INT. The next one is going to be the log message. So
let's have it as a VARCHAR, and I'm going to make it 255. And then for the next
one we're going to have the log date, and for that let's say a DATE or a
DATETIME. So that's it. Let's go and execute it, and with that we have a new
table inside our database. Now the next
step is that we're going to go and create our trigger. So we're going to say
CREATE TRIGGER, and I'm going to name it like this: trg, which is just a prefix
to indicate that this is a trigger, and then I'm just going to call it after
insert employee. And now we have to define the table, so it's going to be ON
Sales.Employees. With that we are saying we now have a trigger on the table
employees. And now we have to define the logic. We're going to use AFTER INSERT.
That means after we insert any record into the table employees, the following
things should happen. So we're going to say AS, and then BEGIN and END, and in
between we can have our logic. So what should happen after a new record is
inserted into employees? We're going to insert a new record into the employee
logs. So we're going to have INSERT INTO Sales.EmployeeLogs, and here we're
going to list the three columns: the employee ID, the log message, and the log
date. So now which values are going to be inserted? They are going to come from
a query. So we're going to say SELECT, and again the employee ID. For the log
message we can have a customized one, let's say "new employee added", and it's
going to be "equal to" the employee ID. So in order to have the employee ID in
there, it's going to be like this. So that's it. Now for the next one we need
the log date. It's going to be GETDATE. And now you might say, okay, but
get date. And now you might say okay but where this employee ID is coming from?
where this employee ID is coming from? Well, it going to come from the table
Well, it going to come from the table from inserted. So what is actually
from inserted. So what is actually inserted? It is like special virtual
inserted? It is like special virtual table that holds all the new inserted
table that holds all the new inserted data to our table employees. So anything
data to our table employees. So anything we are inserting inside the employees
we are inserting inside the employees will be available inside this table. And
will be available inside this table. And of course this is only available during
of course this is only available during the execution of this trigger. So you
the execution of this trigger. So you cannot go now outside of this query and
cannot go now outside of this query and start querying the table inserted
start querying the table inserted because you will not find anything. This
because you will not find anything. This is only like a virtual table that
is only like a virtual table that contains anything that you are doing to
contains anything that you are doing to the table employees and you find a lot
the table employees and you find a lot of informations like the salary, the age
of informations like the salary, the age and so on. So that's it for the
and so on. So that's it for the inserted. Now we have to make sure that
inserted. Now we have to make sure that in our message we have everything as a
in our message we have everything as a string because the employee ID is an
string because the employee ID is an integer. So we have to cast it. So cast
integer. So we have to cast it. So cast and then we're going to say as far char
and then we're going to say AS VARCHAR, like this, otherwise we'll get an error.
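Put together, the trigger described above might look like the sketch below. It is a reconstruction from the spoken walkthrough, not a copy of the on-screen code. The column names EmployeeID, LogMessage and LogDate, the exact message text, and the VARCHAR(20) length are assumptions. The log table Sales.EmployeeLogs is assumed to exist already.

```sql
-- Sketch reconstructed from the narration (T-SQL / SQL Server).
-- Assumed: Sales.Employees has an integer EmployeeID column, and
-- Sales.EmployeeLogs (EmployeeID, LogMessage, LogDate) already exists.
CREATE TRIGGER trg_AfterInsertEmployee ON Sales.Employees
AFTER INSERT
AS
BEGIN
    INSERT INTO Sales.EmployeeLogs (EmployeeID, LogMessage, LogDate)
    SELECT
        EmployeeID,
        -- EmployeeID is an integer, so cast it before concatenating
        'New Employee Added = ' + CAST(EmployeeID AS VARCHAR(20)),
        GETDATE()
    FROM INSERTED;  -- virtual table, visible only while the trigger runs
END;
```

Because the logic selects from INSERTED as a set, an INSERT that adds several employees at once produces one log row per new employee. CREATE TRIGGER has to be the only statement in its batch, so run it on its own.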
So I think we have our trigger ready. We have a new trigger on the table
Employees. And now the first question is: when is this trigger going to fire?
Well, it fires after inserting data into Employees. And then the second
question: what's going to happen? Well, once we have this event, the whole thing
here is going to be executed, where we are saying insert into the logs the
employee ID, the message and as well the date when this happens. And we can get
all that information from the virtual table INSERTED. So I think we are ready.
Let's go and execute it. And now if you go to the Object Explorer, to our
database, let's go to our table Employees and then to the triggers. If you
refresh over here, you can see our new trigger that we just created. So with
that we have defined our trigger and we are ready. Now the next step is that
we're going to go and trigger our trigger. So let's go and do that. Let's have a
new query. But first I'm going to have a look at our logs, so
Sales.EmployeeLogs. Let's query this one. And as you can see, our log is empty
because we didn't insert anything into the table Employees. Let's go and do
that. Let's trigger our trigger. So what we're going to do, we're going to say
INSERT INTO Sales.Employees, and we're going to have the following values. So we
are at the counter, I think, six. Let's have the first name Maria, then the last
name, and then we're going to have the position. It's going to be HR, for
example. The birth date, let's pick something, I don't know. We have a female
here. And the salary, let's go and get this salary. And the hierarchy, it can be
for example three. So let's go and execute it. And with that, as you can see, we
have inserted new data into Employees. Let's check the logs now. So let's query
it. So we have here a nice log about the employee number six. And we have here a
nice message and when this happened. Of course, you can go and insert another
employee, let's say seven, with the same data. So let's do that and check the
logs. And with that we have another log for the new employee. So this is a
really amazing use case in order to maintain a log for your data, and you can go
and make some analyses on how many inserts happen. And of course not only on the
insert, you can have it on the update and delete. So as you can see it is very
simple. This is how we create
the triggers in SQL. All right my friends. So that's all about the
triggers. With that, we have now covered all the concepts and topics that you
have to learn about SQL. Now the next chapter is going to be about performance.
As you start writing queries and so on, you will start noticing some queries are
really slow. What we're going to do in this chapter is learn different
techniques on how to optimize the performance. And the first and the very famous
one is to go and build indexes in databases. So let's understand what this
means. So what is an index? An index is a data structure that provides quick
access to the rows to improve the speed of your queries. So an index is like a
guide for your database in order to speed up the process of searching for data,
especially if you have big tables. So now, in order to understand what indexes
are, imagine you have a huge book and you want to find a specific topic or a
chapter. Instead of flipping through every single page in order to find the
topic that you are searching for, you would use the index at the back of the
book in order to jump straight to the right page. And that's exactly what an
index does, but for your data. Another analogy that I use in order to understand
indexes is to think about a big hotel. Now let's say that in the hotel we don't
have any guide and you would like to find the room number, let's say, 5001. Now
what are you going to do? You're going to go and search for your room floor by
floor, checking each room until you find your room. But instead of that,
thankfully, hotels have a numbering system. And you can ask for a map from the
reception in order to understand in which building and on which floor you can
find your room. So by just following the map and maybe some signs, it's going to
be very quick to locate and find your room in such a big hotel. And that's
exactly what each database needs. It needs an index in order to help the
database find and locate the right data without having to scan
everything. And now let's say that you ask me: you know what, I have this big
table and I would like to speed up the queries using indexes. My first question
is going to be: what exactly are you doing with this table? Are you using this
table to search for text, or are you doing complex analysis with this table? And
the reason why I'm asking this is that we have different indexes in databases
for different purposes. So now let's have a quick look at the different types of
indexes that we have in databases. I divide the indexes in databases into three
categories. The first one is by the structure: how the database is organizing
and referencing the data. And here we have two types, the clustered index and
the non-clustered index. Those are very important to understand. Now we have
another category for the indexes: we can divide them by the storage. In this
category we are talking about how the data is stored physically in the database.
So we have two types: the rowstore index and the columnstore index. And the
third category is the function, and here we have two types: the unique index and
the filtered index. Now each index type has its own strengths, but as well there
is always a tradeoff. Some might improve the read performance, others might
improve the insert and update operations. So it's all about choosing the right
type of index for the job. So now what we're going to do, we're going to go and
deep dive into each of those types in order to understand how they work and how
we can create them. And we will start with the first category, the structure. We
have the clustered index and the non-clustered index.
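As a rough preview of the three categories, the sketch below shows what each index type looks like in T-SQL. The table dbo.DemoCustomers, its columns and all index names are hypothetical and used only for illustration. The columnstore statement assumes a reasonably recent SQL Server version.

```sql
-- Hypothetical demo table (not from the course database).
CREATE TABLE dbo.DemoCustomers (
    CustomerID INT          NOT NULL,
    Email      VARCHAR(100) NOT NULL,
    Country    VARCHAR(50)  NULL,
    Score      INT          NULL
);

-- 1) By structure: clustered vs. non-clustered
CREATE CLUSTERED INDEX idx_DemoCustomers_ID
    ON dbo.DemoCustomers (CustomerID);
CREATE NONCLUSTERED INDEX idx_DemoCustomers_Country
    ON dbo.DemoCustomers (Country);

-- 2) By storage: the two indexes above are rowstore (the default);
--    a columnstore index has to be requested explicitly.
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_DemoCustomers_CS
    ON dbo.DemoCustomers (Country, Score);

-- 3) By function: unique and filtered
CREATE UNIQUE NONCLUSTERED INDEX idx_DemoCustomers_Email
    ON dbo.DemoCustomers (Email);
CREATE NONCLUSTERED INDEX idx_DemoCustomers_HighScore
    ON dbo.DemoCustomers (Score)
    WHERE Score >= 500;  -- only rows matching the filter are indexed
```

In practice you would rarely put all of these on one small table. They are shown together only to make the categories concrete.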
Now before we dive into how indexes work in databases, let's understand first
what happens to the database tables if you don't use any index. When you create
a new table in your database, like for example the customers table, where you
have let's say 20 customers inside this table, what you're going to see on the
client side is like a spreadsheet, a table with rows and columns. But behind the
scenes the database stores it a bit differently. It's going to store the data in
a data file on the disk, and inside this file the data is stored inside blocks
called pages. So it's not rows and columns that are stored inside the data
files; inside the data files we have pages. So what is a page? A page is the
unit of data storage in a database, and it has a fixed size of 8 kilobytes,
where the SQL database can store anything inside it. It can store the rows of
your tables, or column metadata, or indexes, and every time you are interacting
with your data, SQL is reading and writing to those pages. So
as you can see, SQL is not storing the data as rows and columns.
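If you want to see those pages for one of your own tables, one possible way is the documented function sys.dm_db_index_physical_stats, sketched below. The table name dbo.Customers is a placeholder, and the query needs permission to view database state.

```sql
-- Sketch: count the 8 KB pages behind a table.
-- dbo.Customers is a placeholder; use any table in the current database.
SELECT
    index_type_desc,           -- HEAP, CLUSTERED INDEX, NONCLUSTERED INDEX
    index_level,               -- 0 is the leaf level
    page_count,                -- number of pages at this level
    record_count,              -- rows stored at this level
    page_count * 8 AS size_kb  -- each page is 8 KB
FROM sys.dm_db_index_physical_stats(
    DB_ID(),                      -- current database
    OBJECT_ID('dbo.Customers'),   -- table to inspect
    NULL,                         -- all indexes
    NULL,                         -- all partitions
    'DETAILED'                    -- DETAILED mode fills in record_count
);
```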
So if you are running a query, SQL is not selecting a specific column. It always
fetches a data page in order to read the rows inside this page. And the two main
page types that we're going to learn about are the data page and the index page.
So what does the data page look like? It is divided into multiple sections. The
first section is the page header, where the database stores key information
about the page, the metadata, like the page ID, and it has the following format:
it starts with the file ID, like 1, and then we have a unique number for each
page, for example 150, so 1:150. The page header has a fixed size of 96 bytes.
Now to the next section, which has a variable size. This is where your data rows
are going to be stored. So your actual data, your rows, are going to be stored
in this section. And SQL is going to try and fit as many rows as it can in one
single page. And this of course depends on the size of each row. So if you have
a large table where the rows are really big, SQL can fit only a few rows in one
single page. And now moving on to the last section in the data page, we have the
offset array. This is like a quick index for the rows stored inside this page.
It keeps track of where each row begins, so that SQL can easily locate a
specific row without scanning the entire page in order to find a row. So this is
the structure of the data page, and this is exactly how
SQL stores data inside the database. So now back to our example
where we have the customers table and 20 rows. So let's see how SQL is going to
create those pages if you are not using any index on this table. So what is
going to happen? SQL is going to insert the data into those pages as you are
inserting the data into the customers table. So maybe first you are inserting
the customers like 1, 2, 5, 6, 7, and SQL is going to insert them into the data
pages exactly like that. So that means SQL is just inserting the data as you
insert it into the table. Let's say each data page fits only five rows. So after
we insert five customers, SQL is going to go and create another data page for
the next rows. In the next page, SQL is going to insert the next five customers.
And once it's full, it's going to create another data page in order to start
adding the next customers, until we have, for example, four pages for the 20
customers. So now if you check the customers inside those four pages, you see
that they are not sorted at all, and that's because in this scenario we are not
using any index. We call this structure a heap structure. So a heap table is a
table without a clustered index. That means the rows are stored randomly,
without any particular order. This is not really bad, because it's going to be
very quick to insert data into this table. But of course finding something in
this table is going to be very slow. So this is the first tradeoff: you have
very fast writes but very slow reads. Think about it like you are throwing all
your papers in a drawer without organizing them. You can toss things very
quickly into this drawer. But if you want to search for a specific paper later,
it's going to be a very long process until you find it
because nothing's in order. So now let's see how SQL is going to handle it if you
read something from this table. Let's say that you are searching for the
customer with the ID 14. Now SQL has totally no idea where to find this
customer. So SQL is going to start fetching each data page and scanning each
row. It's going to start with the first data page and start scanning. Well, SQL
will not find 14 here. So SQL is going to go to the next page and start scanning
as well, searching for the ID 14, and nothing is going to be found. The same
thing for the third page as well: SQL will not find 14. So SQL is going to go to
the last data page, and there, after scanning four rows in this data page, SQL
is finally going to find the customer number 14 and return it to the client. So
as you can see, in order to find one customer SQL read four different pages and
scanned 19
rows in order to find the customer, and this process we call a full table scan.
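To try the heap scenario yourself, a minimal sketch could look like this. The table dbo.CustomersHeap and its sample rows are hypothetical. The point is only that a table created without a primary key or clustered index is stored as a heap.

```sql
-- Hypothetical demo table: no PRIMARY KEY and no clustered index,
-- so SQL Server stores it as a heap.
CREATE TABLE dbo.CustomersHeap (
    ID   INT         NOT NULL,
    Name VARCHAR(50) NOT NULL
);

-- Rows are written in arrival order, not sorted by ID.
INSERT INTO dbo.CustomersHeap (ID, Name)
VALUES (1, 'A'), (2, 'B'), (5, 'C'), (6, 'D'), (7, 'E'), (14, 'F');

-- A heap appears in sys.indexes with index_id = 0 and type_desc = 'HEAP'.
SELECT index_id, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.CustomersHeap');

-- With no index on ID, the execution plan for this lookup is a Table Scan.
SELECT ID, Name
FROM dbo.CustomersHeap
WHERE ID = 14;
```

Turning on the actual execution plan in SQL Server Management Studio before running the last query shows the Table Scan operator.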
So a full table scan means SQL is scanning the entire table, page by page and
row by row, in order to find a specific row. And of course for this table maybe
it's not a big deal. But if you have a big table where you have hundreds of
thousands or maybe millions of rows, searching through the heap structure is
going to be very painful and slow in order to locate one row. And this is
exactly why we need indexes in SQL databases. So let's understand the first type
of indexes, the clustered
index. All right. So now let's understand what happens if you create a clustered
index on your table. Say you create a clustered index on the ID column of the
customers. The first thing that's going to happen is that SQL is going to
physically sort all the data based on the column ID. So the rows are going to be
rearranged in the data pages from the lowest to the highest. In the first page
we're going to have customer ID number 1, then 2, 3, 4, 5, until we reach in the
last page the last customer, number 20. So as you can see, the first page has
the lowest values and the last page has the highest values. But that's not all.
The next step is that SQL is going to go and start structuring and building the
B-tree. So what is a B-tree? B-tree is short for balanced tree. It is a
hierarchical structure that stores the data as a tree upside
down. It starts with the root, the root node, and then it keeps branching out
until we eventually reach the leaves. Between the leaf nodes and the root node,
we call this section the intermediate nodes. It could be one level or multiple
levels between the root and the leaves. And once SQL constructs the B-tree, it's
going to be very easy for SQL to navigate through the B-tree in order to find
specific information. So let's see how SQL is building the B-tree for the
clustered index. Now it's very important to understand that the leaves, the leaf
nodes in the B-tree of the clustered index, contain the actual data, the data
pages. So all your nicely sorted data pages, your data, are stored at the leaf
level. After that, SQL is going to start building the intermediate nodes, and
here the database is going to use a different type of page. We
have the index page. So in the index page we cannot find the actual data the
entire rows, but instead the index page stores a key value with a pointer to
another index page or to a data page. So for example, we have here the value 1
as the key, and then the value is going to be the ID of the data page. So here
we don't have the whole row of data, we have only a pointer to a data page. So
here we are telling SQL: if you are searching for IDs between 1 and 5, you can
locate them at the data page with the ID 1:100. And then we can store in this
index page another pointer, where we tell SQL: if you are searching between 6
and 10, then you can locate them at the second data page. So this is the
structure of the index page: it contains only pointers to other pages. And the
same thing for the other two pages. SQL is going to create another index page,
where it's going to say: if you are searching for IDs between 11 and 15, you can
find them at the third page, 1:102. And for the last group, between 16 and 20,
we have another pointer to the last page, the page number 1:103. So as you can
see, inside those index pages we have a pointer for each group of IDs, for each
cluster. For the group of customers between 1 and 5 we have one pointer, and for
the second group between 6 and 10 we have another pointer. So that means we
don't have here a pointer for each row. We have a pointer for each group, for
each cluster. That's why we call it clustered
index. And now once SQL is done building the intermediate nodes, SQL going to go
the intermediate nodes, SQL going to go and build the last node, the root node
and build the last node, the root node where it says if you are searching for
where it says if you are searching for customers between 1 and 10, then go to
customers between 1 and 10, then go to the index page with the ID
the index page with the ID 1.200. So that means the route node here
1.200. So that means the route node here is pointing to another index page, not
is pointing to another index page, not directly to the data page. And the same
directly to the data page. And the same thing, we need another pointer for the
thing, we need another pointer for the second index page. So the customers
second index page. So the customers between 11 and 20 go to the index page
between 11 and 20 go to the index page with the ID
with the ID 1.201 and this is exactly what going to
1.201 and this is exactly what going to happen if you create a clustered index
happen if you create a clustered index in SQL. First it going to go and
in SQL. First it going to go and physically sort all your data in the
physically sort all your data in the databases. So if it's from the first
databases. So if it's from the first time sorted randomly SQL has to arrange
time sorted randomly SQL has to arrange everything and sort the data from the
everything and sort the data from the scratch. And then it's going to go and
scratch. And then it's going to go and build this structure where you have in
build this structure where you have in the root node and index page in the
the root node and index page in the intermediate nodes the index pages but
intermediate nodes the index pages but at the leaf level at the leaves we have
at the leaf level at the leaves we have the actual data the data pages. So now
let's see what is going to happen if you query the table and search for ID
number 14. So it's going to check which pointer to use. Since 14 is in the
group between 11 and 20, it's going to go and use the second pointer, to the
index page with the ID 1:201. And here SQL is going to open this index page
and check the pointers. Since 14 is between 11 and 15, it's going to go and use
the pointer to the data page 1:102. And with that, SQL has located the correct
data page, the third page. And now SQL is going to open this data page and find
customer ID number 14. So as you can see, it was very fast for SQL to locate
the correct data page with only three jumps. From the root node to the
intermediate node, SQL was able to quickly find the correct data page. And here
SQL needs to read only one data page, instead of reading four different data
pages as we saw in the heap structure. And of course you might say: but still,
here we are reading three pages. Well, reading an index page is very fast
compared to reading a data page, because reading a data page is always slower
than reading an index page. So, as you can see, this B-tree structure, the
clustered index structure, did help SQL and the database to locate the right
data in the right data page. And this is exactly how the clustered index works
in a SQL database.
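To make the walkthrough above a bit more concrete, here is a minimal T-SQL
sketch for SQL Server. The table dbo.DemoCustomers, its columns, and the index
name are made up for illustration and are not from the course database. The
sketch assumes SQL Server 2016 or later and permission to query the index
statistics function. The real page IDs and the number of B-tree levels will
differ from the animated example.

```sql
-- Hypothetical demo table (not from the course). It starts as a heap because
-- it has no primary key and no clustered index.
DROP TABLE IF EXISTS dbo.DemoCustomers;   -- DROP ... IF EXISTS needs SQL Server 2016+
CREATE TABLE dbo.DemoCustomers (
    CustomerID INT         NOT NULL,
    FirstName  VARCHAR(50) NOT NULL
);

-- Generate 100,000 rows so the index is likely to need more than one level.
INSERT INTO dbo.DemoCustomers (CustomerID, FirstName)
SELECT TOP (100000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       'Customer'
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

-- Building the clustered index sorts the rows by CustomerID and makes the
-- data pages the leaf level of the B-tree.
CREATE CLUSTERED INDEX idx_DemoCustomers_CustomerID
    ON dbo.DemoCustomers (CustomerID);

-- One result row per B-tree level: index_level 0 is the leaf level
-- (the data pages), the highest index_level is the root.
SELECT index_type_desc, index_depth, index_level, page_count, record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.DemoCustomers'), NULL, NULL, 'DETAILED');

-- A point lookup like the "ID number 14" example. STATISTICS IO reports the
-- number of pages read (logical reads) in the Messages output.
SET STATISTICS IO ON;
SELECT CustomerID, FirstName
FROM dbo.DemoCustomers
WHERE CustomerID = 14;
SET STATISTICS IO OFF;
```

If you run the lookup before and after creating the index, the logical reads
reported for the query should drop noticeably, which is the effect described
above.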
All right. So now we're going to move to the second type, and we're going to
understand how exactly SQL builds and creates the nonclustered index. So let's
go. So now we are back to the heap structure, where our table doesn't have any
index and our data is stored randomly inside the data pages. And now, if you
go and create a nonclustered index on the customer ID, what happens? And
here's the big difference: SQL will not touch or change anything in the actual
physical data in the data pages. So the data pages are going to stay as they
are, nothing is going to be changed, and SQL starts immediately building the
B-tree structure. So it's going to start immediately building an index page,
and this index page is a little bit different from the one that we learned
about previously. Since it's an index page, it's going to store pointers. But
this time SQL is going to store the customer ID in the key. So 1 is the
customer ID. And now the value, the pointer, will not be the data page ID. We
will be more specific. So we're going to have an address of where exactly the
row is stored. So it's going to start with the file ID and the page number,
because customer ID 1 is stored in the page 1:102. But SQL is going to add as
well the offset number of the row, where exactly in the page we can find this
ID. And the whole thing we can call an RID, the row identifier. So now let's
see quickly
how the index page is pointing exactly to the row inside the data page. So the
first part of the row identifier maps to the data page ID, and then the 96 is
going to take us to the offset, and that's exactly the location of row number
1. So 96 is the point where we're going to start finding row number 1, and
that's going to take us exactly to the place where we can read the information
about row ID number 1. So this is how the index page locates the exact place
of the rows. So SQL is going to go and continue and assign each customer ID a
pointer to the exact location. So as you can see, now in the index page we
don't have a pointer for each group of customers, like we learned in the
clustered index. We now have a pointer for each ID, and this type of index
page we call a row locator page. So now SQL is going to go and continue and map
a pointer for each customer ID that we have inside our table. So we will have
multiple index pages pointing to our data pages. So as you can see, we have a
lot of pointers, and the data inside the index pages is of course sorted, but
inside the data pages it is left as it is. And now those index pages that have
the row identifiers are going to be stored at the leaf level of the B-tree. So
at the leaf level we don't have the actual data, the data pages. We have index
pages, where we have pointers to the actual data. And then it's going to go
and start building the intermediate nodes. It's exactly like the clustered
index, where it's going to point to another index page. So customers between
1 and 5 are going to be in the index page number 200. So the next step is to
go and build the intermediate nodes. It's going to be exactly like the
clustered index.
Nothing is going to be changed. It's the same structure. So it is an index
page pointing to another index page, but this time for a group of customers.
And then we're going to have the root node as well. So again, we call this
structure a B-tree structure, where the leaf pages point to the data pages,
but the data pages are not part of the B-tree. So now let's say we are
searching for customer ID number 14. What's going to happen? It's going to
start again from the root node, and then it's going to find the pointer to the
intermediate node and jump to the next step, to the intermediate node. And
then it's going to find the pointer to the index page for IDs between 11 and
15, and then it's going to go and scan this index page and find: okay, for
customer ID number 14 we have the following address. So it's going to go and
locate the exact data page and the exact place of the row as well. So it can
go and jump immediately to the row without scanning anything else. So here,
this time with the nonclustered index, SQL did read three different index
pages, and finally the one data page, in order to find the data. So if you
compare it to the clustered index, you can see that we have here one extra
layer, one extra index page to be scanned in order to find the right place of
the row. And this is how SQL creates the B-tree for the nonclustered index and
how it scans it in order to find the information.
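As a rough illustration of the row identifier idea, here is a small T-SQL
sketch for SQL Server. The table dbo.DemoCustomersHeap and the index name are
hypothetical. Note that %%physloc%% and sys.fn_PhysLocFormatter are
undocumented SQL Server features, so treat that part as an exploration aid
under that assumption, not something to rely on in real projects. They print a
location as file, page and slot, which corresponds to the file ID, page number
and row position described above.

```sql
-- Hypothetical heap table: no primary key, no clustered index.
DROP TABLE IF EXISTS dbo.DemoCustomersHeap;   -- needs SQL Server 2016+
CREATE TABLE dbo.DemoCustomersHeap (
    CustomerID INT         NOT NULL,
    FirstName  VARCHAR(50) NOT NULL
);

-- Rows are inserted in no particular order, like the heap in the example.
INSERT INTO dbo.DemoCustomersHeap (CustomerID, FirstName)
VALUES (14, 'Nina'), (3, 'Omar'), (20, 'Lena'), (7, 'Paul'), (1, 'Maria');

-- The nonclustered index is a separate structure. The rows in the heap
-- are not rearranged.
CREATE NONCLUSTERED INDEX idx_DemoCustomersHeap_CustomerID
    ON dbo.DemoCustomersHeap (CustomerID);

-- Undocumented helpers: show where each row physically lives,
-- formatted as (file:page:slot).
SELECT sys.fn_PhysLocFormatter(%%physloc%%) AS RowLocation,
       CustomerID,
       FirstName
FROM dbo.DemoCustomersHeap;

-- The lookup from the walkthrough. On a table this small the optimizer may
-- still decide to scan the heap instead of using the index.
SELECT CustomerID, FirstName
FROM dbo.DemoCustomersHeap
WHERE CustomerID = 14;
```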
All right. So now when I think about the clustered index and the nonclustered
index, I think about a book. You can think of the clustered index like the
table of contents at the front of the book. So the table of contents kind of
tells you where to find each chapter, and the chapters are sorted exactly like
the table of contents. And this is exactly what the clustered index does. But
now, on the other hand, think about the nonclustered index as the index that
you can find at the end of the book. The index of the book is a very detailed
list of topics, terms and keywords, where it points exactly to the location
where you can find each one in the book. And the content of the book is not
sorted like the index of the book. And this is exactly what the nonclustered
index does. It coexists with the data. It is an extra list that points exactly
to where we can find the data inside our table. All right. So now let's put
those two indexes side by side to understand the differences between them.
So the structure of the clustered index is a B-tree, where it starts with the
root node, where we have an index page. This index page is pointing to the
intermediate nodes, where we have index pages as well, and those index pages
are pointing to the actual data, to the data pages. So at the leaf level of
the clustered index we have the data pages, the actual data. What's special
about the clustered index is that it physically sorts the data inside those
pages. So everything here is physically rearranged and sorted. Now, if we are
talking about the nonclustered index, we have a B-tree as well. So the same
thing: at the root node we have an index page pointing to an intermediate
index page. But this time the intermediate nodes are pointing to another index
page. They are not pointing to a data page like in the clustered index. They
are pointing to an index page. So now if you check this structure, you can see
that at the leaf level of the clustered index we have the actual data, the
data pages. But on the other side, at the leaf level of the nonclustered
index, we don't have the actual data. We have index pages, and those index
pages are pointing to the actual data, to the data pages. But the big
difference is that the data pages are not part of the B-tree. The B-tree of
the nonclustered index is just a separate structure that does not involve any
data. So we have only index pages, and it just points to the data pages
without changing anything physically in your data. But in reality, what
happens is that
you can have those two types of indexes, the clustered and the nonclustered
indexes, in one table. So what can happen is that the leaf level of the
nonclustered index is going to be pointing to the data pages of the clustered
index, because those index pages don't care whether the data pages are sorted
or not. It's just going to go and point to the correct page and to the correct
row. So that means we now have two different B-tree structures that are
pointing to the data. And here there is one thing that you have to understand:
you can create only one clustered index on a table. And this rule really makes
sense, because you can sort the data only in one way in SQL. And that of
course makes sense, because you can sort the data physically only once. And
that's why in SQL databases you are allowed to create only one clustered
index: physically, the data can be sorted in only one way. But on the other
hand, with the nonclustered index, you can create as many nonclustered indexes
as you need. So you can create three, four, and all of them are pointing to
the same data pages, because in the B-tree of the nonclustered index you don't
store any data pages. We store only pointers to the data, and you can have
multiple pointers. So this is the most important, the main difference between
those two indexes.
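Here is a small T-SQL sketch of that rule for SQL Server. The table
dbo.DemoOrders and all index names are invented for this example. One caveat
beyond what is said above: SQL Server does have an upper bound on nonclustered
indexes per table (999 in current versions), which is far more than you would
normally create.

```sql
-- Hypothetical table used only for this illustration.
DROP TABLE IF EXISTS dbo.DemoOrders;   -- needs SQL Server 2016+
CREATE TABLE dbo.DemoOrders (
    OrderID     INT         NOT NULL,
    CustomerID  INT         NOT NULL,
    OrderDate   DATE        NOT NULL,
    OrderStatus VARCHAR(20) NOT NULL
);

-- The one and only clustered index: it defines the physical sort order.
CREATE CLUSTERED INDEX idx_DemoOrders_OrderID
    ON dbo.DemoOrders (OrderID);

-- Uncommenting the next statement is expected to raise an error, because the
-- table already has a clustered index and the rows can be sorted only one way.
-- CREATE CLUSTERED INDEX idx_DemoOrders_OrderDate_Clustered
--     ON dbo.DemoOrders (OrderDate);

-- Several nonclustered indexes can coexist. Each one is its own B-tree
-- holding pointers to the same rows.
CREATE NONCLUSTERED INDEX idx_DemoOrders_CustomerID
    ON dbo.DemoOrders (CustomerID);
CREATE NONCLUSTERED INDEX idx_DemoOrders_OrderDate
    ON dbo.DemoOrders (OrderDate);
CREATE NONCLUSTERED INDEX idx_DemoOrders_OrderStatus
    ON dbo.DemoOrders (OrderStatus);

-- List the indexes on the table: one CLUSTERED row, three NONCLUSTERED rows.
SELECT name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.DemoOrders');
```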
Now if you put them side by side: we have learned that the clustered index is
going to go and physically sort and store the rows in the B-tree. But the
nonclustered index is going to go and create a separate B-tree structure with
pointers to the actual data. And by the way, we call the clustered index the
main index that we could use in each table. So the clustered index is the main
one, the most important one that you can go and use in each table in your
database. Now, as we learned, if we are talking about the number of indexes,
you can create a maximum of one clustered index for each table. But for the
nonclustered index there is no such limitation. You can go and create multiple
indexes for each table. And now, if you go and compare them on read
performance, how fast we can get data using the clustered index: well, it is
faster than the nonclustered index.
And that's because in the nonclustered index we have this extra layer at the
leaf level of the B-tree, and having this extra layer means SQL has to do
extra work in order to find the data. That's why the clustered index is faster
than the nonclustered index. But now, on the other hand, if we are talking
about write performance, how fast we can insert data into the tables: well,
writing data to a table with a clustered index is slower than with a
nonclustered index. And that's because, as you are inserting data into the
table, SQL always has to check the data pages: is everything sorted correctly?
And if not, SQL has to go and start physically sorting the data again in order
to have the correct order. So there is a lot of stress in keeping the data
sorted with the clustered index. But on the other hand, in the nonclustered
index we don't have this. So the physical data is going to stay as it is. We
are just creating new pointers. So if you are writing to a table where you
have a clustered index, it's going to be slower than writing to a table where
you have a nonclustered index.
And of course the fastest way to write data to a table is to not have indexes
at all, so a heap structure. SQL just goes and starts inserting data inside
those data pages without creating any extra structures. So as you can see,
it's always a trade-off. You can read fast, but you're going to write slower.
So you cannot have everything. Now, if we are talking about storage
efficiency, the clustered index is going to be better with storage than the
nonclustered index, and that's for the same reason: with the nonclustered
index we have this extra layer of index pages, and index pages need storage.
That's why it can use more storage than the clustered index. Now if you're
talking
about the use cases, when to use a clustered index: well, a column has to
meet a few criteria in order to be a good candidate for the clustered index.
First, it's good if the values inside the column are unique. And second, and
this is way more important, the values of this column should not change a lot.
Because if this column has a lot of update operations and the data keeps
changing, that means each time SQL is going to go and start sorting the data
again, left and right. So a column that is frequently changing is not good for
a clustered index. And that's why the primary keys of tables are a perfect
candidate: first, they are unique, and second, we will never go and update a
primary key value. We always append a new primary key value. And that's why
primary keys are perfect for the clustered index. And one more place where I
go and use a clustered index is to optimize the performance of a range query.
If you are querying the data between one value and another, the clustered
index works really well. Now, on the other hand, if we are talking about the
nonclustered index, we could use it on columns that are used in search
conditions. Or if you are joining tables without using the primary keys, then
you can go and apply a nonclustered index in order to have faster joins. Or
you can go and use it to optimize the performance if you are searching for an
exact value, an exact match. So those are the main and important differences
between the clustered and the nonclustered indexes.
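The use cases above could look like this in T-SQL for SQL Server. This is only
a sketch under assumptions: the table dbo.DemoUseCases, its columns, and the
sample filter values are hypothetical, and whether the optimizer really picks
each index depends on the data volume and statistics.

```sql
-- Hypothetical table: the primary key is unique and never updated,
-- so it is used as the clustered index.
DROP TABLE IF EXISTS dbo.DemoUseCases;   -- needs SQL Server 2016+
CREATE TABLE dbo.DemoUseCases (
    CustomerID INT         NOT NULL PRIMARY KEY CLUSTERED,
    Country    VARCHAR(50) NOT NULL,
    Score      INT         NULL
);

-- Nonclustered index on a column that is used in search conditions.
CREATE NONCLUSTERED INDEX idx_DemoUseCases_Country
    ON dbo.DemoUseCases (Country);

-- Range query: the rows are stored in CustomerID order, so the matching
-- rows sit next to each other in the clustered index.
SELECT CustomerID, Country, Score
FROM dbo.DemoUseCases
WHERE CustomerID BETWEEN 100 AND 200;

-- Exact match on a non-key column: a candidate for the nonclustered index.
SELECT CustomerID, Country
FROM dbo.DemoUseCases
WHERE Country = 'Germany';
```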
All right. So now before we go to SQL and start practicing, I would like to
and start practicing, I would like to show you the syntax of the index. So
show you the syntax of the index. So it's very very simple. It start with
it's very very simple. It start with create and then we can define whether it
create and then we can define whether it is clustered or nonclustered and then
is clustered or nonclustered and then the keyword index. But this section is
the keyword index. But this section is optional. So if you don't define
anything, the default is going to be nonclustered. So if you say CREATE INDEX, SQL Server is going to
go and create a nonclustered index. Then after that we have to define the name of the index, and then
we have to tell SQL which table to create the index on: ON and the table name. And then we can define
one column or multiple columns for the index, and we call an index with multiple columns a composite
index. So for example, we can create a clustered index using this command: CREATE CLUSTERED INDEX, the
index name, and then we specify the table and the ID. So we are saying create a clustered index based
on this column, the ID, from the table customers. And if you want to create a nonclustered index, you
say CREATE NONCLUSTERED INDEX and the same thing. So far we are using one column in the index, but we
can go and create a composite index with multiple columns, like the following example. So we can say
CREATE INDEX, and as you can see, we skipped defining the type here, and that's because the default is
going to be a nonclustered index. And now here we are specifying two columns, the last name and the
first name. And as you can see, we are specifying as well for SQL how to sort the data. So we are
saying the last name should be sorted inside the data page ascending, lowest to highest, but the first
name should be the other way around, from highest to lowest. So you can control how the data is going
to be sorted physically in the data page. So as you can see, it is very simple. This is the syntax for
creating an index in SQL.
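The syntax described here could be written out as follows. This is a minimal sketch in T-SQL: the table Customers, its columns (ID, City, LastName, FirstName) and the index names are placeholders chosen for illustration, not objects confirmed by the course material.

```sql
-- Clustered index on one column (all names are illustrative placeholders).
CREATE CLUSTERED INDEX idx_Customers_ID
ON Customers (ID);

-- Nonclustered index with the type written out explicitly.
CREATE NONCLUSTERED INDEX idx_Customers_City
ON Customers (City);

-- Type omitted, so SQL Server creates a nonclustered index.
-- Composite index: LastName sorted ascending, FirstName descending.
CREATE INDEX idx_Customers_Name
ON Customers (LastName ASC, FirstName DESC);
```

Each statement assumes the table already exists and does not yet have an index with that name.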
All right. So back to SQL, and the first question is: where do we find indexes in the database? Well,
you can go and explore it. If you go to the Object Explorer over here and check any table from our
SalesDB, for example customers, here you have a folder called Indexes. So if you expand it, you will
find an index here. I didn't create any of those indexes in the database. But in SQL Server, if you
define any of the columns as a primary key, SQL Server is going to create by default a clustered index
for the primary key, because it usually makes sense to create a clustered index on the primary key. So
this one is created by default, and as you can see, at the start we have a key, the primary key of
customers, and then it says it is clustered. Now I would like to start from scratch. That's why I
would like to create a new table without any indexes. So what we're going to do, we're going to load
the table customers into a new table. So how are we going to do that? We're going to say
SELECT * FROM Sales.Customers, and before the FROM we're going to say INTO a new table. So it's going
to be DBCustomers. So like this.
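Written out, the copy step could look like the sketch below. The names Sales.Customers and Sales.DBCustomers are reconstructed from the spoken description, so treat them as assumptions. The second query is an extra check that is not part of the walkthrough: it lists the indexes of the new table from the sys.indexes catalog view, where a table without a clustered index shows up as a single row of type HEAP.

```sql
-- Copy all rows into a new table (assumed name). SELECT ... INTO copies
-- columns and data, but not the primary key or any indexes.
SELECT *
INTO Sales.DBCustomers
FROM Sales.Customers;

-- Optional check: a table with no clustered index is reported as a HEAP.
SELECT name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('Sales.DBCustomers');
```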
Let's go ahead and execute it. So now if you go to the left side and refresh the tables, you can find
we now have a new table called DBCustomers. Now let's check whether we have any indexes inside it. So
Indexes is empty. So we don't have anything, no clustered index or anything else. And this table has a
heap structure. So the data is inserted there in no particular order. It is not sorted. And if I go
over here and, for example, select from this new table where customer ID equals one and execute it,
SQL Server does a full scan on the table in order to find this customer ID. So our new table
DBCustomers is a heap. But let's go and change that. What we're going to do, we're going to create a
new clustered index. So we're going to say CREATE CLUSTERED INDEX, and then we're going to give the
index a name. We usually follow this naming convention: we have index as a prefix, then after that we
specify the table name, so DBCustomers, and then the key for the index, so the column that we are
using in order to index the table. It is important to stick with the same naming convention for index
names, because later, as you are monitoring your indexes, it's going to be really easy to understand:
okay, this index is for the table DBCustomers and we are using the customer ID to index. So now after
that we're going to specify on which table we are building the index, so ON Sales.DBCustomers, and
then we're going to specify the column name. So we are saying: build for me a clustered index based on
the customer ID. So now let's go and execute it. So as you can see, it's very fast because we have
only five rows. So the database just switched all the data pages very fast. Now let's check our new
index. So let's refresh and go inside it. And now we can see that we have our new index, a clustered
index based on the customer ID.
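The lookup and the clustered index from this step might look like this. It is a sketch only: the index name idx_DBCustomers_CustomerID follows the naming convention described above (prefix, table, column) but its exact spelling is an assumption, as are the table and column names.

```sql
-- On a heap, this lookup has to scan the whole table.
SELECT *
FROM Sales.DBCustomers
WHERE CustomerID = 1;

-- Build a clustered index on the customer ID (assumed names).
-- The rows are physically reorganized in CustomerID order.
CREATE CLUSTERED INDEX idx_DBCustomers_CustomerID
ON Sales.DBCustomers (CustomerID);
```

To see whether a query scans or seeks, the actual execution plan in the query tool is the place to look.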
Now as we learned, we cannot create multiple clustered indexes. But let's go and test that. So I will
just take the whole thing, and let's say I would like to create a clustered index based on the first
name as well. So let's go and execute it. So as you can see, it is saying you cannot create more than
one clustered index on this table. That means we can create only one clustered index. And let's say
that after you created the index, you realize you chose the wrong column and you would like to change
it to the first name. So what we have to do is drop the index. So we say DROP INDEX, and then you need
the index name. It was this one. And then you have to specify which table, so it's going to be
ON Sales.DBCustomers, like this. So if I do it like this and refresh again, you can see that we don't
have any indexes anymore and the table is back to a heap structure. And now you can go and create the
correct clustered index for this table. But to be honest, I'm going to stick with the customer ID. So
I will not create a clustered index on the first name, because the first name of course is not unique.
You can have multiple customers with the same name. And as well, updates could happen on the first
name, and that's going to be very expensive. So that means I'm going to stick with my index on the
customer ID. Let's go and execute it. And now I have my index on my table again.
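The drop-and-recreate sequence could be sketched as follows, assuming the clustered index was created as idx_DBCustomers_CustomerID on Sales.DBCustomers (both names are assumptions reconstructed from the narration). The second clustered index is left commented out because SQL Server rejects it while the first one exists.

```sql
-- This would fail: a table can have only one clustered index.
-- CREATE CLUSTERED INDEX idx_DBCustomers_FirstName
-- ON Sales.DBCustomers (FirstName);

-- Drop the existing clustered index; the table becomes a heap again.
DROP INDEX idx_DBCustomers_CustomerID
ON Sales.DBCustomers;

-- Recreate it on the customer ID, which is unique and rarely updated.
CREATE CLUSTERED INDEX idx_DBCustomers_CustomerID
ON Sales.DBCustomers (CustomerID);
```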
Now let's say that I have the following SELECT statement on our table, customers, and I'm searching by
the last name, where let's say we are searching for Brown. So let's go and execute it. So let's say
that we are getting more and more customers, our table is getting bigger, and I frequently use this
query. So I'm searching for specific customers using the last name. So what we can do, we can create a
nonclustered index for the last name in order to improve the performance of this query. So let's go
and create that. So we're going to say CREATE NONCLUSTERED INDEX. And now we're going to give it a
name using the naming convention, so DBCustomers, and we're going to use the last name for this index.
So ON Sales.DBCustomers, and we will use the column last name for the index. So let's go and execute
it. And now if you go to our indexes and refresh, we will find our new index over here. And as you can
see, it says it is nonclustered and also non-unique. We will talk about uniqueness later. So as you
can see, it's very easy. We have just created a nonclustered index on the last name. And now, as we
learned, we can create multiple nonclustered indexes on the same table. Let's say, for example, our
query now looks like this: we are searching by the first name using, for example, the value Anna. And
now this query happens a lot and may be slow. So we can go and create a new nonclustered index. So let
me just have it like this. And for a nonclustered index, you don't always have to specify
NONCLUSTERED. By default it's going to be nonclustered, so we can skip that. And here let's call it
first name. And the column that we are using is the first name. So let's go and create this index.
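The two nonclustered indexes from this part could be sketched like this. Table, column and index names (Sales.DBCustomers, LastName, FirstName, idx_DBCustomers_...) are assumptions based on the naming convention described in the narration.

```sql
-- Frequent lookup by last name.
SELECT * FROM Sales.DBCustomers WHERE LastName = 'Brown';

-- Nonclustered index with the type written out.
CREATE NONCLUSTERED INDEX idx_DBCustomers_LastName
ON Sales.DBCustomers (LastName);

-- Frequent lookup by first name.
SELECT * FROM Sales.DBCustomers WHERE FirstName = 'Anna';

-- Type omitted: SQL Server creates a nonclustered index by default.
CREATE INDEX idx_DBCustomers_FirstName
ON Sales.DBCustomers (FirstName);
```

A table can carry several nonclustered indexes next to its single clustered index, which is why both statements succeed.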
And now let's go and refresh our indexes. And as you can see, SQL did create a nonclustered index for
the first name. So if you don't specify the type of the index, by default it's going to be a
nonclustered index.
All right. So now let's talk about the composite index. It is an index that has multiple columns
inside the same index. So far we have used only one column in the index, but we can go and specify
multiple columns, and that's because sometimes our WHERE conditions are complicated and based on
multiple columns. So for example, let's say that we are searching for country equal to USA, and at the
same time we are saying the score should be higher than 500. So that means in this condition we are
using two columns, and we would like to speed up this query. So how are we going to do it? So we're
going to create, let's say, an index and give it a name, DBCustomers and let's say country score,
ON Sales.DBCustomers. And now it is very important to do the following thing. Now we have to go and
define a list of columns that we want to be included in
this index. And it is very crucial that the column order in the index fits your query. Our query
filters on an exact country and then on a range of scores, so the first column is going to be the
country and then the score. So let's go and create this index. And if you go to the indexes over here,
you can see that we have created our new index.
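The query and the composite index described here might look like the following sketch. The column names Country and Score and the index name idx_DBCustomers_CountryScore are assumptions derived from the narration.

```sql
-- The query we want to speed up: two columns in the WHERE condition.
SELECT *
FROM Sales.DBCustomers
WHERE Country = 'USA' AND Score > 500;

-- Composite nonclustered index: Country is the first key column,
-- Score the second (assumed names).
CREATE INDEX idx_DBCustomers_CountryScore
ON Sales.DBCustomers (Country, Score);
```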
So now once you create such an index, and your table is going to be always updating this index, you
have to be committed and responsible. So in your queries, if you want to filter the data using country
and score, always include the country, the first column of the index, so that the optimizer is able to
use the index. The order in which you write the two conditions in the WHERE clause does not matter;
the optimizer matches them to the index columns either way. What matters is the order of the columns
inside the index. So if your queries change and mainly filter on the score, you have to go and
recreate the index with the columns switched. So be very careful with composite indexes. The column
order is very crucial. And now you
might say, you know what, now we have a nice index for those two columns. What is going to happen if I
use only one of them in my query, for example the country? So now the question is, if I execute this
query, is SQL using this index even though I don't have the score? Well, yes, because it follows the
leftmost prefix rule. So this means SQL can use the index as long as you are using the left columns.
So here in our index, country is on the left. That's why it is working over here. But if you go and
skip the left column, it will not work. So if you go over here, for example, and filter only on the
score, say higher than 500, what we have done is skip the country in this query, and that's why SQL
cannot seek on the index. So as long as you are including the left columns, it will work, even if it
is only one column. So in this scenario, the first query is going to use the index, and the second one
will not be using it. So now let me give you a very simple example in order to understand how this
works. So let's say that we have an index using four columns: A, B, C, D. Now in your query, if you
target the column A, the index is going to be used. Now the same thing is going to happen if you use A
and B. So if you're using those two columns, you will be using the index. So those are cases where the
index will be used. So now let's have the scenarios where the index won't be used, or not fully. So
for example, if you just jump immediately to the column B, you are not using the left column, the A.
That's why you will not be using the index. And as well, if in your query you are using A and skipping
the B, so you have A and then C, the index can help only with A and not with C. So you always have to
use the left columns without gaps. So here, if you are using A, B, C, you will be using the index. And
let's say here you are using A, B and then you jump and skip to the D: the index helps with A and B,
but not with D. So this is what we mean by the leftmost prefix rule when using a composite index. So
if you're using multiple columns inside one index, be careful with the order of the columns that you
are defining.
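The leftmost prefix rule can be illustrated with three queries against a composite index on (Country, Score). This is a sketch under assumptions: the table Sales.DBCustomers, the column names, and the existence of such an index are taken from the narration, and the comments describe what the index makes possible, not what the optimizer is guaranteed to choose on a tiny table.

```sql
-- Assumed index: (Country, Score) on Sales.DBCustomers.

-- Uses the leading column only: the index can be used for a seek.
SELECT * FROM Sales.DBCustomers
WHERE Country = 'USA';

-- Uses both key columns, starting from the left: the index can be used.
SELECT * FROM Sales.DBCustomers
WHERE Country = 'USA' AND Score > 500;

-- Skips the leading column: no seek on this index is possible.
SELECT * FROM Sales.DBCustomers
WHERE Score > 500;
```

For an index on (A, B, C, D) the same logic applies: filters on A, on A and B, or on A, B and C line up with a prefix of the key, while a filter that starts at B does not.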
All right. So that's all for this category, clustered and nonclustered indexes. Now we're going to
move to the second category, where we talk about the indexes by storage: the rowstore and the
columnstore.
So now let's say that we have a table with multiple rows and multiple columns. Now if we use a
rowstore index, this is the classical one. What is going to happen? Our table is going to be split
into multiple rows. And as we learned, each group of rows is going to be stored inside a data page. So
that means we are organizing the data row by row, which means all the columns for each row are going
to be stored together. This is the traditional way databases organize their data, where the
information is stored row by row. But now, on the other side, if you use a columnstore index, SQL is
going to split your table into multiple separate columns, and then SQL is going to store the values of
one column together in a data page. So that means if you open a data page, you will find only the
values of one column. You will not find the entire row. So if it's the first name, you will see only
the first name information; you will not see the last name information in this data page. So if you
compare them, the rowstore index stores the data row by row, and the columnstore index stores the data
column by column. So this is a very high-level representation of how the columnstore index is stored.
As you know me, we go into detail in order to understand exactly how SQL works with the columnstore
index. So let's go.
All right. So now let's say that we have a table for the customers. We have three columns: ID, name
and status. And as well we have around 2 million rows, 2 million customers. And as we learned, by
default the table is going to be built as a heap structure, where the rows are stored row by row
inside data pages. But now we go and create a columnstore index on top of this table. So now once you
do that, SQL is going to go through a process in order to build the columnstore. So the first step is
that SQL is going to divide the rows into row groups. Now in SQL Server, each row group can contain
around 1 million rows. So in this example our table is going to be split into two row groups: the
first 1 million rows in one group and the second 1 million in another row group. Now you might ask me,
we are talking about columns, why are we splitting the rows? Well, this is just a pre-step in order to
optimize the performance and to do parallel processing. And of course, the data will not be stored
like this, because we have the second step. Now, in the next step, SQL is going to segment the
columns. So now SQL will go through each row group and start splitting the data by column. And that's
why we call it a columnstore, because we are separating the columns from each other. So that means we
have one segment for the ID, another one for the name and a third one for the status. And this happens
for each row group.
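To make the first two steps more concrete, here is a small sketch of creating a columnstore index and then looking at its row groups. The table dbo.CustomersDemo and the index name are hypothetical and only mirror the three-column example (ID, name, status); the catalog view sys.column_store_row_groups is a real SQL Server view, but which rows and states it reports depends on the data volume and the server version.

```sql
-- Hypothetical heap table mirroring the example: ID, name, status.
CREATE TABLE dbo.CustomersDemo (
    ID     INT          NOT NULL,
    Name   NVARCHAR(50) NOT NULL,
    Status VARCHAR(10)  NOT NULL
);

-- Build a clustered columnstore index on top of the table.
CREATE CLUSTERED COLUMNSTORE INDEX idx_CustomersDemo_CS
ON dbo.CustomersDemo;

-- One row per row group; a row group holds at most 1,048,576 rows.
SELECT row_group_id, state_description, total_rows
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.CustomersDemo');
```

With around 2 million rows loaded, the last query would be expected to list more than one row group, which matches the split described above.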
And now it's going to move to the third step in this process. We have the data compression. And this
is the most important step in this process, because it is the reason why the columnstore is very fast
compared to the rowstore. So in this process there are different techniques for how to do data
compression, and the most famous one is that it's going to create a dictionary. Let's take for example
the column status, the status of the customer, whether it is active or inactive. So the words active
and inactive are going to be repeated 2 million times, because we have 2 million customers, and since
it is a string, it is taking a lot of space and storage. But now instead of that, we're going to
compress the data. So first it's going to create a dictionary by replacing the values active and
inactive with smaller values like one and two. So we have a mapping from the long value to a small
value. And after that, SQL is going to store a data stream where we have only two values: one, two,
one, two. So we're going to have a big stream of 2 million values. So it's going to do this for each
column, and with that the size of each column is going to change, depending of course on how many
different values you have in each column. So this step is very important in order to reduce the size
of the data and as well to increase the performance.
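The dictionary idea can be imitated in plain T-SQL. The sketch below is only an illustration of the concept, not how SQL Server implements columnstore compression internally: the table variable @Customers and its five rows are made-up sample data, and DENSE_RANK is used here simply to hand out small integer IDs to the distinct status values.

```sql
-- Made-up sample data standing in for the 2 million customers.
DECLARE @Customers TABLE (ID INT PRIMARY KEY, Status VARCHAR(10));
INSERT INTO @Customers (ID, Status)
VALUES (1, 'Active'), (2, 'Inactive'), (3, 'Active'),
       (4, 'Active'), (5, 'Inactive');

-- Step 1: the dictionary maps each distinct long value to a small ID.
WITH Dictionary AS (
    SELECT Status,
           DENSE_RANK() OVER (ORDER BY Status) AS DictID
    FROM (SELECT DISTINCT Status FROM @Customers) AS d
)
-- Step 2: the "data stream" keeps only the small ID for every row.
SELECT c.ID, d.DictID
FROM @Customers AS c
JOIN Dictionary AS d ON d.Status = c.Status
ORDER BY c.ID;
```

The fewer distinct values a column has, the smaller the dictionary and the better the column compresses, which is the point made above about the size depending on the number of different values.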
to increase the performance. So now once everything is organized and compressed,
everything is organized and compressed, SQL going to go and start storing the
SQL going to go and start storing the results in databases. But TSQL will not
results in databases. But TSQL will not use the standard databases that we have
use the standard databases that we have learned previously. But instead going to
learned previously. But instead going to use a special database called LOB large
use a special database called LOB large object page. So now let's quickly
So now let's quickly compare the structure of the normal data page that we have
learned in the row store with the new one, the column store, the LOB data page.
So as usual each page has a header. This is the same as any data page. But the
next section is going to be the segment header. It has metadata information about
the column segment that is stored in this page. Like we have the segment ID, the
row group ID, the column ID, and it has as well a very important piece of
information: the ID of the dictionary page. So the dictionary page is as well a
type of page in SQL. It has as well a header, but inside it we have like a
mapping. So it maps the original value, the long one, inactive, to the smaller
version of this value, for example, one. And that's all for the dictionary page.
It has the mapping between the original values and the smaller values. And beneath
the segment header we now have the important place where our data is stored. We
have the data stream. So it is like a sequence of IDs from the dictionary that
represents the values of the column side by side. And of course, we cannot fit the
whole 1 million rows inside this data stream. We're going to have like multiple
LOB data pages. So this is how exactly SQL stores your data if you decide to go
with the column store. So let's go back to the process. As you can see, SQL is
storing the data as LOB data pages. So this is the last step, and with that SQL
did convert your table into a column store.
So now we cannot just create a column store without defining whether it is a
clustered index or a nonclustered index. So let's start with the first one, the
clustered column store index. So if you create such an index, SQL of course will
not be building a B-tree structure. SQL is going to use exactly this structure,
the column store structure. So as we learned, the clustered index is a complete
makeover of your table. When you apply it, SQL is going to format everything
column-wise, and it fully replaces the old row-based table structure that we had
at the start. So once you apply the clustered column store index, it will not
leave anything behind and your table is going to be completely structured as a
column store. And one more thing, which makes sense of course: all the columns
from the original table are going to be converted to a column store. So it is not
leaving anything behind. But on the other hand, if you are using a nonclustered
column store index, as we learned, it is like a companion to your existing table.
So it coexists with the table and it will not replace anything. So the column
store index is an additional thing that is stored beside your table. So that means
the original table will not be replaced at all, unlike with the clustered column
store index. The regular table stays in the old row-based storage, and your data
is going to be as well stored in a separate structure, in the column store index.
And of course, in the nonclustered column store index, since we are creating an
extra index outside of your original table, you can go and define which columns
should be included in this process. It does not have to be all the columns. You
can go for example with only the status. So that means you build a column store
index for only one column, the status of the customers. So this is what we mean
with the clustered column store index and the nonclustered column store index.
All right friends, so now you might ask me why we are doing all this stuff. Why
would I split my data by the columns? Well, it's all because of analytics. Because
in analytics we have big complex queries with a lot of data aggregations on big
tables. And the column store index is perfectly designed to improve the
performance of such big queries. And that's why SQL databases like SQL Server and
as well BI tools like Tableau and Power BI did adopt this method in order to
offer a fast platform for data analysis. So now let's understand exactly why the
column store index is way faster for data analysis than the row store index. So
let's go.
So again we have the customers table, and let's say we have like five customers
where we have ID, name and status. And as we learned before, if we are using the
row store index the data is stored in multiple data pages, and in each data page
we're going to have the whole record, the whole information about one customer.
So for this example we're going to have like three data pages. But if you are
using the column store index, it's going to be stored a little bit differently.
So the first column, the ID, is going to be stored in one data page, and here SQL
will not go and build a dictionary because the IDs are already short. So we're
going to have like one data stream with all IDs. And now the next column, the
name, is going to be stored in a separate data page, where we're going to have an
extra dictionary page where each name is going to be mapped to one small value.
So the data is going to be compressed and we're going to save storage. Now the
database is going to create for the third column, the status, one more data page,
and the dictionary here is going to be very small. So for active we're going to
have one and for inactive we're going to have two, and in the data stream we will
be storing only the IDs of the dictionary.
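
To make the dictionary idea a bit more concrete, here is a small sketch in T-SQL. It is only an imitation: SQL Server builds and stores these dictionaries internally, and you never write this yourself. The table variable, the customer names, and the use of DENSE_RANK to produce the small IDs are all assumptions made for this illustration.

```sql
-- Illustration only: mimics dictionary encoding of the Status column.
-- The table, names, and values are hypothetical sample data.
DECLARE @Customers TABLE (
    ID     INT,
    Name   VARCHAR(50),
    Status VARCHAR(20)
);

INSERT INTO @Customers (ID, Name, Status) VALUES
    (1, 'Maria',  'Active'),
    (2, 'John',   'Inactive'),
    (3, 'Georg',  'Active'),
    (4, 'Martin', 'Active'),
    (5, 'Peter',  'Inactive');

-- "Dictionary": each distinct status gets one small ID (Active = 1, Inactive = 2).
SELECT DISTINCT
    Status,
    DENSE_RANK() OVER (ORDER BY Status) AS DictionaryID
FROM @Customers;

-- "Data stream": the column stored as a sequence of those small IDs.
SELECT
    ID,
    DENSE_RANK() OVER (ORDER BY Status) AS DictionaryID
FROM @Customers
ORDER BY ID;
```

With this sample data the stream for the status column is 1, 2, 1, 1, 2, which is much shorter to store than repeating the full text in every row.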
So now let's understand why the column store is faster. Let's have the following
query. We want to find the total number of customers that are active. So we have
the query SELECT COUNT(*) FROM customers, and we're going to filter the data by
the status where it is equal to active. So now if we query the table with the row
store, what happens? SQL first has to go and collect the data. So it's going to go
to the first data page and collect the first two customers, then to the second, to
the third, and so on. And as you can see, SQL here is reading everything, the
whole row, the ID, the name, the status, even though for this query we actually
don't need all that information. We just need to count how many customers we have
with the status active. But still SQL cannot go and selectively read only the
status; it has to read the whole record. So after SQL has all the data, it's going
to go and filter the data. So it's going to go and remove the inactive rows, so
we are left with three rows, and then SQL is going to do the aggregate operation.
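
The query traced in this example can be written roughly like this. The table variable and the five sample customers are assumptions for illustration; only the shape of the query (a COUNT with a filter on the status) comes from the walkthrough.

```sql
-- Hypothetical sample data: five customers, three of them active.
DECLARE @Customers TABLE (
    ID     INT,
    Name   VARCHAR(50),
    Status VARCHAR(20)
);

INSERT INTO @Customers (ID, Name, Status) VALUES
    (1, 'Maria',  'Active'),
    (2, 'John',   'Inactive'),
    (3, 'Georg',  'Active'),
    (4, 'Martin', 'Active'),
    (5, 'Peter',  'Inactive');

-- Only the Status column is needed to answer this query.
SELECT COUNT(*) AS TotalActiveCustomers
FROM @Customers
WHERE Status = 'Active';
```

A table variable is always stored row by row, so this only shows the query itself, not the storage difference. The point is that nothing in the query touches ID or Name.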
So that's why the total count of active customers is going to be three. But now
let's see how SQL is going to query the column store. So SQL first has to analyze:
okay, which columns do I actually need for this query? Well, we need only the
status. So SQL will not go and open all three data pages and read them. SQL will
target only one data page, the data page where we have the column status. So it's
going to take this very simple data stream, and then it's going to go and look up
the dictionary, and it's going to go and remove all the values that are equal to
two. So with that, in the output we have only three values, and SQL is going to go
and do a very quick count of those values. So in the output we will get as well
three as the total number of active customers. So now if you compare the
intermediate result sets from the row store and the column store, you can see that
in the row store we have fetched and retrieved a lot of unnecessary information
for this query, and this of course is going to make the query slow. But the column
store reads exactly what it needs for this aggregation, and we didn't read any
extra information about the names of the customers or the IDs. It didn't open any
extra data pages. It gets exactly the data that it needs for the aggregation, and
that's exactly why the performance of queries where we have aggregations and data
analysis is going to be very fast if you are using the column store compared to
the row store. So that's why we use the column store for big data and data
analytics.
All right. So now let's summarize the differences between the row store and the
column store indexes side by side. So let's start with the definition. The row
store is going to go and organize and store the data row by row. It is a really
nice method if you need a lot of columns in one row. But on the other hand, the
column store index is going to go and store the data and organize it column by
column, which is really great if you're focusing on specific columns. Now if we
are talking about storage efficiency, the row store index is going to take more
space compared to the column store index, and that's because, as we learned, the
column store is going to go and compress the data, which is going to save a lot of
storage if you have large tables. Now to the next point, which is more important:
the performance, the read and write optimization. We can say for the row store
things are more balanced. So you will get a decent speed for both write and read
operations. But things in the column store are different. It is fast for reading,
especially if you are doing data analytics, but writing data, like inserting and
updating, is slower because, as we learned, there are multiple steps until the
data is written to the pages. So on one hand you are optimizing the speed of your
analytical queries, but on the other hand changing data is slower than with the
row store index. Now let's talk about the next point, input and output efficiency.
Well, the row store index is not really good here because you are retrieving a lot
of columns. So a lot of data has to be read from the disk storage in order to
answer your queries. But on the other hand, for the column store the I/O is lower,
and that's because it targets exactly the data and columns that are needed for the
query. So there will generally be less data read from the disk storage, and of
course that's why we are getting fast read performance. So now if you are
wondering which systems are best for the row store index: well, the row store
index is very suitable for OLTP systems, online transaction processing systems
like banking and commerce systems, where the full records are accessed very
frequently. But on the other hand, the column store index is great for OLAP. OLAP
systems are online analytical processing systems, where you have data warehouses,
data lakes, business intelligence. You are building reports and analyses. You have
large data sets and very complicated aggregate queries. So if you have such a
project, then the column store index is the way to go. So that means the use case
for the row store index is high-frequency transactions where the system has to
quickly access records, and the use case for the column store is big data
analytics where SQL has to scan large data sets. So those are the main differences
between the row store index and the column store index. All right. So now let's
check the syntax of the column store index.
Well, it is really easy. What we're going to do is just put the COLUMNSTORE
keyword between CLUSTERED or NONCLUSTERED and INDEX. So once you specify that, you
are telling SQL you want to create a column store index, and the rest is going to
stay as it is. Now if you want to create a row store index, then you don't have to
specify anything. There is no keyword for the row store. So as we learned before,
we can go and create a nonclustered index or a clustered index, and both of those
syntaxes are going to tell SQL we are creating a row store index. But if you go
and use the COLUMNSTORE keyword, then you are telling SQL that you want to create
either a clustered or a nonclustered column store index. And here there is a
syntax rule: if you are creating a clustered column store index, then you must not
specify anything for the columns. So you cannot go and specify anything like an ID
or country or any columns over here, because it makes no sense. Once you say
clustered column store, all the columns are going to be included in the new
structure. So this is the syntax of the column store index.
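
The four variants described above can be sketched side by side like this. The table Sales.ExampleTable, its columns, and the index names are hypothetical placeholders. These are alternative statements to compare, not a script to run from top to bottom, because a table can only have one clustered index.

```sql
-- Row store (the default, no extra keyword needed)
CREATE CLUSTERED INDEX idx_Example_ID
    ON Sales.ExampleTable (ID);

CREATE NONCLUSTERED INDEX idx_Example_Country
    ON Sales.ExampleTable (Country);

-- Column store: add the COLUMNSTORE keyword before INDEX

-- Clustered column store: no column list, all columns are included
CREATE CLUSTERED COLUMNSTORE INDEX idx_Example_CS
    ON Sales.ExampleTable;

-- Nonclustered column store: you choose the columns
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_Example_NCCS
    ON Sales.ExampleTable (Country);
```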
All right. So back to SQL, let's check how we can create a column store index. Now
if you check our table here, DBCustomers, that we have created previously, and we
go to the indexes, you can see that we have created a few indexes, and one of them
is the clustered index. This one is a row store index. So our table is split by
the rows. Now let's go and change that. Let's make our table split by the columns
using the column store. So we're going to say CREATE CLUSTERED COLUMNSTORE INDEX,
and we're going to give it the name index DBCustomers, and it's going to be on the
table Sales.DBCustomers. And here, if you go and specify a column, it's going to
be a mistake. So let's go and check that. So if you go and execute, it fails
because a key list, the columns, is not allowed. So we cannot have this. So let's
remove it. And now we have the correct syntax. Let's execute it again. We will get
another error, because it says in one table you cannot have more than one
clustered index. We already have one. You have to decide: do you want to split
your table by columns or by rows? That's why we have to go and drop the previous
index. So we're going to do it like this: DROP INDEX, and I need the name of the
index, like this, and then we have to specify the table name. So that's it. Let's
drop the index. Now if you refresh, we cannot see our clustered index anymore, and
our query should be working. So let's do that. Now let's check the indexes again.
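
The steps from this demo look roughly like the following. The exact index names are not readable from the walkthrough, so idx_DBCustomers_CustomerID (the old clustered row store index), idx_DBCustomers_CS, and the CustomerID column are placeholders you would replace with your own names.

```sql
-- Attempt 1 (fails): a clustered column store index must not have a column list.
-- CREATE CLUSTERED COLUMNSTORE INDEX idx_DBCustomers_CS
--     ON Sales.DBCustomers (CustomerID);

-- Attempt 2 (fails while the old clustered row store index still exists):
-- a table can have only one clustered index.
-- CREATE CLUSTERED COLUMNSTORE INDEX idx_DBCustomers_CS
--     ON Sales.DBCustomers;

-- Drop the existing clustered row store index first (placeholder name).
DROP INDEX idx_DBCustomers_CustomerID ON Sales.DBCustomers;

-- Now the table can be restructured as a column store.
CREATE CLUSTERED COLUMNSTORE INDEX idx_DBCustomers_CS
    ON Sales.DBCustomers;
```

One thing to keep in mind: if the clustered index was created by a PRIMARY KEY constraint, DROP INDEX will be rejected and you would have to drop the constraint instead.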
And now as you can see, we got a new clustered index, but this time it is a column
store. Now you can see at the start we have like an icon. It looks like a bar
chart, like analytics and reports, and that's because the main purpose of creating
a column store is analytics. So now of course we cannot go and create multiple
clustered column store indexes. We can have a maximum of one. So now if you say,
you know what, let's go and create another index for the first name, and this one
is going to be a column store as well. So if I go and copy the whole thing over
here, and let's say it is a NONCLUSTERED COLUMNSTORE INDEX, and let's call it for
example first name, and we define over here the first name. So that's it. Let's go
and execute it. You will see that we get an error where SQL tells us you cannot
create multiple column store indexes. That means you can create only one column
store index for each table, and you have to decide whether it is clustered or
nonclustered. You cannot, like with the row store, create multiple nonclustered
indexes. So you are allowed only one column store index per table,
but this limitation is what we have here in SQL Server. Other database systems may
handle multiple column store indexes differently.
So now in order to practice, if you would like to create a nonclustered column
store index, you can drop the first one and then go and create the one that you
need as a nonclustered index. So actually let's go and do that. Let's drop the
first one. So DROP INDEX, and this is our index on this table. Let's do that. And
once you execute, the nonclustered column store index is going to work. And if you
refresh over here, you will see that we have a nonclustered column store index for
the first name.
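
As a sketch, swapping the clustered column store index for a nonclustered one on the first name could look like this. The index names and the FirstName column are assumed spellings, since the exact names are not readable from the walkthrough.

```sql
-- Remove the clustered column store index created earlier (assumed name).
DROP INDEX idx_DBCustomers_CS ON Sales.DBCustomers;

-- Create a nonclustered column store index on a single column.
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_DBCustomers_CS_FirstName
    ON Sales.DBCustomers (FirstName);
```

After the drop, the table no longer has a clustered index, so the base table is stored as a heap and the column store index sits beside it.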
Okay. So now, as we learned, the column store is going to go and compress the
data, and the storage that is needed for the entire table is going to be less than
with the row store. So let's see whether that is really true. Now in order to
check this, I will not do that in the database SalesDB, because everything here is
already small. We're going to go and use another database. We have
AdventureWorksDW2022, and if you have a newer version that's okay. So now what is
the plan? We're going to go and create three identical copies of one table, and
we're going to have different structures. So the first one is going to be the heap
structure, the second one is going to be the row store structure, and the third
one is going to be the column store structure, and then we're going to go and
compare the storage of those three.
So now we have to go and pick one of those tables. We need one big table, so for
example FactInternetSales. So let's see how we can do that. Let's start with the
heap structure. We're going to say SELECT * INTO a new table, so it's going to be
FactInternetSales_HP, HP for the heap, and we're going to get it from the table
FactInternetSales. So like this. And here it's very important: if you are
switching databases, you have to go and use the database. So it's going to be USE
AdventureWorksDW2022. So execute this at the start to make sure that you are
switching to the new database. And now let's go and execute our heap structure. So
with that we have created the heap table, as you can see, 60,000 rows. And since
we didn't define any clustered index, this table is going to be a heap structure.
heap structure. Now let's go and create another table where we use clustered row
another table where we use clustered row store index. So what we're going to do,
store index. So what we're going to do, we're going to copy the whole thing over
we're going to copy the whole thing over here and we're going to call this row
here and we're going to call this row store and we're going to go of course
store and we're going to go of course change the name to RS but still we are
change the name to RS but still we are targeting the same table. So let's go
targeting the same table. So let's go and execute this at the start. But now
and execute this at the start. But now in order to make it as clustered row
in order to make it as clustered row store we have to go and create an index.
store we have to go and create an index. So it going to be like this create
So it going to be like this create clustered index. We don't have to
clustered index. We don't have to specify the row store because it is as a
specify the row store because it is as a default. It's going to be ro store. So
default. It's going to be ro store. So let's call it
let's call it index facts
index facts internet sales RS and then the primary
internet sales RS and then the primary key. So B key and now we need the
key. So B key and now we need the table fact internet sales RS and now we
table fact internet sales RS and now we need the columns the primary key well
need the columns the primary key well actually I don't know what is the
actually I don't know what is the primary key so let's go and check that
primary key so let's go and check that so it is a composite primary keys so
so it is a composite primary keys so it's going to be the sales order number
it's going to be the sales order number and sales order line number like this.
So let's execute this. And with that we have a clustered rowstore index. I'm
going to check what we have over here, so let's refresh everything. We now have
two tables, the heap and the rowstore. So let's expand it and check the
indexes. And as you can see, we have the clustered index.
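The two steps so far can be sketched in T-SQL roughly as follows. This is a reconstruction from the spoken walkthrough, so the exact spelling of the table suffixes and of the index name is an assumption, not the script shown on screen.

```sql
-- Switch to the sample database first (name as spoken in the video)
USE AdventureWorksDW2022;
GO

-- Heap: SELECT ... INTO creates a new table without any index
SELECT *
INTO FactInternetSales_HP
FROM FactInternetSales;

-- Rowstore: make the same copy, then add a clustered index on top.
-- CLUSTERED without COLUMNSTORE means rowstore, which is the default.
SELECT *
INTO FactInternetSales_RS
FROM FactInternetSales;

-- Index name is an assumed spelling of what is said in the video
CREATE CLUSTERED INDEX idx_FactInternetSales_RS_PK
ON FactInternetSales_RS (SalesOrderNumber, SalesOrderLineNumber);
```

Because SELECT ... INTO does not copy indexes or keys from the source table, both copies start out as heaps; only the explicit CREATE CLUSTERED INDEX turns the second one into a clustered rowstore table.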
Now we need the third table. It's going to be the columnstore index. I'm just
going to copy the whole thing over here. So this is the columnstore; it's going
to be CS here and CS here, and of course we don't need any columns for the
columnstore. And don't forget to add the COLUMNSTORE keyword, so CREATE
CLUSTERED COLUMNSTORE INDEX, and we have to rename it over here as well. So
let's execute our new stuff. We first create the table and then we convert it
to a columnstore index. So let's do that, then refresh and check our tables.
This is our third table; let's check the indexes, and we have the clustered
columnstore. All right.
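A sketch of the third copy, under the same assumptions as before (the _CS suffix and the index name are reconstructed from the audio):

```sql
-- Columnstore: copy the table, then convert it to a clustered columnstore index
SELECT *
INTO FactInternetSales_CS
FROM FactInternetSales;

-- No column list here: a clustered columnstore index covers all columns
CREATE CLUSTERED COLUMNSTORE INDEX idx_FactInternetSales_CS
ON FactInternetSales_CS;
```

The only differences from the rowstore version are the COLUMNSTORE keyword and the missing column list.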
So now we are done. We have our three different tables. Now let's check the
storage of those three tables. Let's check our first table, the heap table.
Right-click on it and go to Properties. Now we can see a lot of information
about our table, but we are interested in the storage, so click here on the
Storage page. Now we can see some information about the storage, and one item
is the data space. It is around 9 MB, and the index space is almost nothing, so
we don't have anything over here. So this is the storage of the heap structure.
We don't have any indexes.
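The video reads these numbers from the Properties dialog. As an alternative that is not shown in the video, the system procedure sp_spaceused reports data and index size per table; a minimal sketch, assuming the three table names used in this demo:

```sql
-- Each call returns one row with rows, reserved, data, index_size and unused
EXEC sp_spaceused 'FactInternetSales_HP';
EXEC sp_spaceused 'FactInternetSales_RS';
EXEC sp_spaceused 'FactInternetSales_CS';
```

The figures may not match the dialog to the kilobyte, but they are enough to compare the three structures side by side.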
Let's go now to the rowstore. We're going to go to the RS table and Properties.
Then let's go to the storage. And now, as you can see, the data space is
exactly the same. That's because, whether it is a heap or a rowstore index, we
store the data in data pages as rows. So the size of the data itself will not
change. It will just be sorted differently.
But what changed here is the size of the index. Now we are consuming more
storage for the index. So that means the overall storage of the table with a
clustered rowstore index is more than the heap structure. Let's now check our
columnstore index. So go to the CS table and open the Properties. Now it is
interesting to see whether our table is getting smaller. So let's go to the
storage. And as you can see, the data space is around 1 megabyte compared to
the 9 megabytes. I know those are small numbers, but still it is a massive
reduction in space because everything is compressed. And of course we are not
using any index space because we don't have this B-tree structure in the
columnstore. So as you can see, if you compare it to the others, it is the
winner. The table that is using the columnstore is consuming way less storage
than the others. So now, if you want to rank them based on storage, the best
one is the columnstore index table, the next one is the table with the heap
structure, and the worst one is the table with the rowstore clustered index. So
that's true: the columnstore index is consuming less space than the other types
of indexes.
All right. So now what is a unique index? A unique index is a special
type of index that makes sure there are no duplicates in your data. And there
are a couple of reasons why it is important to have a unique index. The first
and most obvious reason is data integrity. The unique index is going to enforce
uniqueness in your data, and that is very helpful. For example, if you have a
column like an email address or a product ID, having duplicates in such columns
can mess up your data very badly. So having a unique index on a column like an
email is going to make sure there are no sneaky duplicates inside your data.
And the second reason why a unique index is important is to improve the
performance.
So for example, if you are searching for a specific email, SQL is going to
start searching for the email value, and once SQL finds the value, it will stop
searching because we are sure that there are no duplicates in the data. So with
that you are improving the performance of your queries. So if you are creating
an index and you know this column is unique, then make sure to make the index a
unique index. Now, if you have a look again at our clustered index, where we
have the B-tree structure, if you make this index unique, then you are giving
SQL an extra task: it has to make sure that all those customer IDs are unique.
So SQL has to guarantee that there are no duplicates at all inside your data in
the database. Now, since we are giving SQL an extra task to prove the
uniqueness of the data, building the clustered index is going to be a little
bit slower. That means inserting new data, writing data, is going to be slower
than with a normal clustered index. But if you are talking about the read
performance, the performance of our queries, it's going to be a little bit
faster than with a normal clustered index. So again we are making this
trade-off: writing data is slower, but we are gaining more speed on the query
performance. So this is what we mean by a unique index. Okay. So
let's keep extending the syntax of the index. In order to tell whether it is
unique or not, we specify it right at the start. So we say CREATE, then UNIQUE
just before CLUSTERED or NONCLUSTERED, and then afterward COLUMNSTORE, and
nothing changes for the rest. So with this keyword we tell T-SQL that it should
be unique. And if you don't write anything before the clustered index, it's
going to be not unique. So for example, this one says create an index. We
didn't specify anything here, so duplicates are allowed in the index. But if
you specify a unique index, then duplicates are not allowed. So it is very
simple. Okay.
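To make the keyword position concrete, here is a small sketch. The table dbo.DemoUsers and both index names are hypothetical and exist only for this illustration; they are not part of the course database.

```sql
-- Hypothetical table, for illustration only
CREATE TABLE dbo.DemoUsers (
    UserID INT           NOT NULL,
    Email  NVARCHAR(100) NOT NULL
);

-- No UNIQUE keyword: duplicate values are allowed in the indexed column
CREATE NONCLUSTERED INDEX idx_DemoUsers_UserID
ON dbo.DemoUsers (UserID);

-- UNIQUE right after CREATE: duplicate values are rejected
CREATE UNIQUE NONCLUSTERED INDEX idx_DemoUsers_Email
ON dbo.DemoUsers (Email);
```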
So now let's create a unique index. Let's target the table Products. First
let's select the data from the table, so Sales.Products, and execute it. Now
let's say that I'm going to create a unique index on the column Category. Let's
try it. So CREATE UNIQUE NONCLUSTERED INDEX, and let's give it the name
idx_Products_Category, on the table Sales.Products, and we are targeting the
column Category. So let's execute it. Now we get an error because the category
has duplicates. If you query our table again, you can see we have duplicate
values here, and SQL cannot create a unique index for this table.
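A sketch of this failing attempt, assuming the course table Sales.Products has a Category column as described. The duplicate check in front is an addition that is not in the video; it simply lists the values that block the index.

```sql
-- Optional check: which values appear more than once?
SELECT Category, COUNT(*) AS Occurrences
FROM Sales.Products
GROUP BY Category
HAVING COUNT(*) > 1;

-- Expected to fail as long as the query above returns any rows
CREATE UNIQUE NONCLUSTERED INDEX idx_Products_Category
ON Sales.Products (Category);
```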
It's too late. But you can still create this index if the table is empty, and
then SQL will not allow you to insert any duplicate categories. And of course
it makes no sense to have a unique index on the categories, because of course
we're going to get duplicates here. But maybe you say, you know what, my
products are unique. The product name should be unique, and we are not allowed
to have two products with the same name in this table. So if you have such a
rule in your business, you can define a unique index for the products. So let's
do that. Now we're going to replace the category with the product, and the same
thing over here. So we are targeting the column Product.
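The adjusted statement would look roughly like this. The column name Product and the index name are assumptions based on the spoken description.

```sql
-- Unique index on the product name instead of the category
CREATE UNIQUE NONCLUSTERED INDEX idx_Products_Product
ON Sales.Products (Product);

-- Optional: list the indexes on the table and whether each one is unique
SELECT name, type_desc, is_unique
FROM sys.indexes
WHERE object_id = OBJECT_ID('Sales.Products');
```

The sys.indexes query is an alternative to expanding the Indexes folder in the object explorer.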
Let's execute it. As you can see, now it is working because we don't have any
duplicates inside the Products table. And if you check the indexes over here,
we can see our new index. And as you can see at the start here, it says it is a
unique nonclustered index. Now let's test the data integrity. Are we really not
allowed to add any duplicates to this table? So let's try that out. Let's have
an INSERT statement. Let's say INSERT INTO Sales.Products, and I would like to
insert only the product ID and the product name, so we're going to insert two
values. VALUES: let's say we're going to have a new ID, 106, but we're going to
insert a duplicate for the product name. So we're going to say Caps. We already
have a product called Caps over here, so we are now inserting a duplicate.
Let's try it. Now you get an error saying you cannot insert duplicates into
this table because we have a unique index. So as you can see, this index is now
helping us and improving the quality of my table.
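The integrity test can be sketched like this, assuming the column names ProductID and Product and an existing row named Caps, as described in the video. The TRY...CATCH wrapper is an addition so the error text is returned as a result row instead of stopping the script.

```sql
BEGIN TRY
    -- New ID, but a product name that already exists in the table
    INSERT INTO Sales.Products (ProductID, Product)
    VALUES (106, 'Caps');
END TRY
BEGIN CATCH
    -- Show why the insert was rejected
    SELECT ERROR_MESSAGE() AS ErrorMessage;
END CATCH;
```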
This is how we work with the unique index in SQL.
Okay. So now what is a filtered index? A filtered index is a regular
index but with a twist: it only includes rows that meet a specific condition.
So let's understand what this means. Again we have our nonclustered index and
the B-tree structure. Now at the leaf nodes we will get only the IDs, the data,
that fulfill a specific condition. For example, if we are saying we want only
the active customers, this is the condition. That means on the leaf nodes we
will have only the IDs of customers that are active, and any inactive customer
will not be included at all, neither at the data pages nor at the nodes. So our
B-tree structure is going to be a little bit smaller than usual because we have
less data included in the structure, and our index is going to be smaller than
the regular nonclustered index. So now the question is why is it important to
have a filtered index? Well, the biggest benefit is that we get targeted
optimization. For example, say our analyses always focus on the active users
and the inactive users are totally irrelevant. Having only the relevant subset
of data in the index makes the whole index much smaller, which leads to faster
performance. It's going to be faster to query this filtered B-tree structure.
So that means we are doing targeted optimization and we are improving the query
performance. Now the second benefit, if you think about the storage: since the
B-tree structure is going to be smaller, we're going to need less storage space
to store the index, which is a great thing if you have large tables in your
database.
So the filtered index makes the structure of the index smaller, which improves
the speed and the performance and also reduces the storage that is needed for
your index. Okay. So now let's check the syntax of the filtered index. It's
very simple. It's like any query: at the end of creating the index you can add
the WHERE clause and then the condition, as you do in any SELECT statement.
But SQL Server is very restrictive with this type of index. You cannot use a
filter on a clustered index; it is only allowed for the nonclustered index,
because anything else makes no sense: if you create a clustered index, the
entire table has to be reorganized and ordered, so it will not work for only a
subset of the data. And as well you cannot create a filtered index on a
clustered columnstore, so here we use it only with rowstore. But you can
combine the unique index together with the filtered index; there are no
restrictions. So it's going to be like this: CREATE UNIQUE NONCLUSTERED INDEX
on the table, and then you specify the WHERE condition. So this is the syntax
of the filtered index, and we have these restrictions. All right.
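A minimal sketch of this syntax, including the unique plus filtered combination. The table dbo.DemoAccounts and the index names are hypothetical and only serve the example.

```sql
-- Hypothetical table, for illustration only
CREATE TABLE dbo.DemoAccounts (
    AccountID INT           NOT NULL,
    Email     NVARCHAR(100) NULL,
    IsActive  BIT           NOT NULL
);

-- Filtered index: only rows of active accounts are stored in the index
CREATE NONCLUSTERED INDEX idx_DemoAccounts_Active
ON dbo.DemoAccounts (AccountID)
WHERE IsActive = 1;

-- Unique and filtered combined: uniqueness is enforced only for rows
-- that match the filter, here every row that has an email address
CREATE UNIQUE NONCLUSTERED INDEX idx_DemoAccounts_Email
ON dbo.DemoAccounts (Email)
WHERE Email IS NOT NULL;
```

The second index shows one reason to combine the two: rows without an email are left out of the index, so they are not checked for uniqueness at all.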
So now let's say that we have the following query where we are selecting data
from customers, but in our program or in our report we are always selecting
only the customers from the USA. So we have the following condition: it says
WHERE Country = 'USA', and execute. This is the basis of many queries that we
have in our project, and we are always filtering the customers based on the
country. In one query we are maybe finding the top customers, in another query
we are finding the average of the scores, and so on. But we are always
filtering the data like this, WHERE Country = 'USA'. Now, since we are using
this column a lot and our table may grow to millions of records, we can create
a nonclustered index on this column. The usual way: we go over here and say
CREATE NONCLUSTERED INDEX, we call it idx_Customers_Country, and then it's
going to be on the table Sales.Customers, and we select the column Country,
like this. If you do it like this, SQL is going to create a nonclustered index
for all customers, not only the ones from the USA but for everything, so even
for the customers from Germany, which is not really necessary because in our
project we only focus on the customers from the USA. So instead of that, we can
include the WHERE condition inside our index. It's very simple: we're going to
say WHERE Country = 'USA', exactly like in our query.
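Put together, the query and the filtered index might look like this. The table Sales.Customers, the column Country and the index name are taken from the spoken description, so their exact spelling is an assumption.

```sql
-- The recurring query pattern in the project
SELECT *
FROM Sales.Customers
WHERE Country = 'USA';

-- Regular version (not executed here): would index every customer
-- CREATE NONCLUSTERED INDEX idx_Customers_Country
-- ON Sales.Customers (Country);

-- Filtered version: only the rows with Country = 'USA' go into the index
CREATE NONCLUSTERED INDEX idx_Customers_Country
ON Sales.Customers (Country)
WHERE Country = 'USA';

-- Optional: confirm that the index is filtered and show its predicate
SELECT name, is_unique, has_filter, filter_definition
FROM sys.indexes
WHERE object_id = OBJECT_ID('Sales.Customers');
```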
Now the index that's going to be built will be focused and targeted only on a
subset of the data, only on the data that fulfills this condition. So now let's
create our filtered index, and it is working. Let's check our indexes on the
customers. So let's go to the indexes over here and refresh. Now we can see our
index over here. It says it is not unique because we didn't define anything at
the start, so duplicates are allowed of course, which is what we defined here.
And it is also filtered, so it doesn't contain all the rows from the table. It
contains only the rows that fulfill our condition. That means if I now execute
this query, the index is going to be used because the rows of this query are
included in the index. But if I go over here and say Germany and execute the
query, it's going to be slower because those rows are not part of our index. So
this index will not be used at all to improve that query. So this is how we
work with the filtered index in SQL.
All right. So now we're going to summarize and talk quickly about how to
use the right index. So when to use which type? Let's start with the first one,
the heap structure. As we learned, it is a table without any index. So in which
scenario do we not use any indexes? In case you want fast inserts. If you want
fast write performance, then don't use any index; you stay with the default,
the heap structure of your table. We usually use it for not very important
tables, like staging tables or temporary tables, where we want to insert the
data fast and then get rid of the data later. So here there is no need to use
any index. Now, if you are talking about the clustered index, we usually use
the clustered index for primary keys. It is even the default in the database:
if you create a primary key, SQL is going to create a clustered index. So this
is the main usage of the clustered index; you use it on the primary keys. And
if there is no primary key in your table, then you can pick another column
where sorting the data is important, for example a date column. It could be a
good candidate for your clustered index.
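Both cases can be sketched as follows. The tables dbo.DemoOrders and dbo.DemoEvents are hypothetical and only illustrate the two choices.

```sql
-- Case 1: a primary key gets a clustered index by default
-- (as long as the table has no other clustered index)
CREATE TABLE dbo.DemoOrders (
    OrderID   INT  NOT NULL PRIMARY KEY,
    OrderDate DATE NOT NULL
);

-- Case 2: no primary key, so cluster on a column where order matters
CREATE TABLE dbo.DemoEvents (
    EventDate DATE          NOT NULL,
    Payload   NVARCHAR(200) NULL
);

CREATE CLUSTERED INDEX idx_DemoEvents_EventDate
ON dbo.DemoEvents (EventDate);
```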
Now moving on to another type, we have the columnstore index. When I said
clustered index here, I meant the clustered rowstore index, of course. But
now the question is: when do we use the columnstore index? If you have big,
complex analytical queries where you are aggregating a lot of data, then go
for the columnstore index, because it is going to give you amazing
performance. And also, if you are struggling with the size of your tables,
so if you have a super large table, you can use the columnstore index,
because it can compress the data and reduce the size of the whole table.
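A minimal sketch of creating a columnstore index, assuming SQL Server and a large analytical table that has no clustered index yet (Sales.FactSalesDemo and its columns are hypothetical names):

```sql
-- Store the whole table in compressed, column-oriented form.
CREATE CLUSTERED COLUMNSTORE INDEX idx_FactSalesDemo_CS
    ON Sales.FactSalesDemo;

-- The kind of aggregation query that typically benefits from it.
SELECT
    Country,
    SUM(SalesAmount) AS TotalSales
FROM Sales.FactSalesDemo
GROUP BY Country;
```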
So for those scenarios we use the columnstore index. So again, the rowstore
clustered index we usually use for OLTP systems, where you have a lot of
transactions and so on, but the columnstore we usually use for OLAP systems,
where you have a data warehouse, a reporting system, business intelligence,
and so on. Now moving on to another type, we have the nonclustered index. We
usually use this index for non-primary-key columns. So that means the rest
of the columns of your tables could be candidates for a nonclustered index.
And there are a lot of reasons why you would do that. For example, for the
foreign keys, or for the columns that are used to join two tables. Another
place where you can use the nonclustered index is on the columns that are
used in the WHERE clause. So there are many scenarios where we can use the
nonclustered index, but not for the primary keys.
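The scenarios above could look like the following sketch, assuming SQL Server and a hypothetical Sales.Orders table with a CustomerID foreign key and an OrderStatus column that is often filtered on:

```sql
-- Nonclustered index on a foreign key / join column.
CREATE NONCLUSTERED INDEX idx_Orders_CustomerID
    ON Sales.Orders (CustomerID);

-- Nonclustered index on a column that is frequently used in the WHERE clause.
CREATE NONCLUSTERED INDEX idx_Orders_OrderStatus
    ON Sales.Orders (OrderStatus);
```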
Now moving on to another type, we have the filtered index. We use it to
target a subset of data. So if our queries and analyses only ever focus on a
subset of the data, it makes no sense to have one big index for all the
data. We can use the filtered index to have a focused index. And of course,
if the size of the index is a problem, then you can use a filtered index to
reduce the overall storage of the index. And then the last type: we have the
unique index. You can use the unique index to ensure the data integrity of
your table, and it might also slightly improve the performance of your
query. That's because SQL Server has less work to do if the index is unique:
once it finds a match, it is going to stop the search.
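A minimal sketch of both index types, assuming SQL Server; the tables, columns, filter value, and index names are hypothetical and only illustrate the syntax:

```sql
-- Filtered index: only rows matching the WHERE condition are indexed,
-- which keeps the index small and focused on the subset we query.
CREATE NONCLUSTERED INDEX idx_Customers_Country_USA
    ON Sales.Customers (Country)
    WHERE Country = 'USA';

-- Unique index: rejects duplicate values in the indexed column.
-- Creation fails if the column already contains duplicates.
CREATE UNIQUE NONCLUSTERED INDEX idx_Products_Product
    ON Sales.Products (Product);
```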
So this is a quick summary and guide on when to use which index type, which
usually helps me find the right index. All right, friends. So now let's say
that you have created your indexes in your database, your query is
optimized, and you have fast performance, but the job is not done yet.
No, God, no! God, please, no! No, no, no!
Because over time the indexes get fragmented, outdated, and unused, and this
leads to poor performance in your queries, increases the storage costs, and
the overall performance of your database drops. Indexes are like a car: they
need maintenance. You need to change the oil and the tires of the car, and
the same thing goes for indexes. You have to maintain them. They need
attention to keep everything running smoothly. So now I'm going to show you
how I manage, maintain, and monitor the indexes of my SQL projects.
So let's go. The first and most important task is to monitor the usage of
your indexes. Of course, the first question we have to ask ourselves over
time is: are we really using the indexes that we have created? Are they
really helping to improve the speed of my queries, or were they just a good
idea at the start of the project, and later no one used them? This is
crucial, because if you have an unused index, you are consuming unnecessary
storage space, and the write performance on the tables is going to be
slower, which is completely unnecessary if you are not using the index. So
now our task is to find out the usage of each index that you have in the
project. So let's see how we can do that. Previously we created multiple
indexes on the table DBCustomers. So if you go to DBCustomers and then to
Indexes, you can see that we have four indexes. Now we can show that
information by using a special stored procedure from SQL Server called
sp_helpindex. Let's go and do that. So: sp_helpindex.
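A minimal sketch of the call, assuming the table is Sales.DBCustomers as used in this course:

```sql
-- sp_helpindex is a system stored procedure; it takes the table name
-- and returns one row per index: index_name, index_description, index_keys.
EXEC sp_helpindex 'Sales.DBCustomers';
```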
So it is a system stored procedure that comes with the database. This stored
procedure needs only one value, and that is the table name. So we have it
over here: Sales.DBCustomers. Let's go and query it. So we have four
indexes. Then we have a nice description of the index: it says whether it is
nonclustered and whether it is columnstore. And it says where it is located:
it is located on PRIMARY. PRIMARY is the name of the filegroup where the
data is stored, and by default the data is stored on PRIMARY. The next piece
of information is the index keys. It is nice to understand which keys, or
which columns, are used for the index. For the first one you can see we have
two columns, which means it is a composite index. And of course for the
columnstore we don't have any columns, and then we have the first name, last
name. So this is a really nice, quick stored procedure to see information
about our indexes. Okay. So now let's focus on our task: how to monitor the
usage of the indexes. In databases we have a lot of schemas and tables that
record the metadata of our database. And in SQL Server we have a special
schema called sys, where you can find a lot of metadata about the SQL
Server: metadata like the description of the tables, views, columns, and
also the indexes. So now let's check what we can find inside the table
indexes. So let's go and do it.
SELECT * FROM sys. This is the schema name. And then, as you can see, we
have a long list of objects, but we want to focus on indexes. Now let's go
and execute it. Now we get a huge list of all the indexes that we have and a
lot of information for each index. We don't have to understand each column
now, but I'm going to select the most important information from this table.
So what do we need? The object ID. This is the table ID. So the object_id,
and we have the name. It is the index name. And then here we have nice
information on whether it is clustered or nonclustered. So let's select it:
type_desc AS, let's call it IndexType. And we can check whether it is a
primary key or not. So let's get this information as well: is_primary_key. I
will just rename it IsPrimaryKey. And what else do we need? Whether it is
unique. It is also nice information to have. So is_unique. Of course you can
grab a lot of other things. It really depends on what you are monitoring.
For example, I'm going to check whether it is disabled or not. So
is_disabled, and I'll just rename it. So with that I have focused
monitoring. I don't need all of that information. So let's go and execute.
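The column selection described here can be sketched as follows. The catalog view and its columns are standard in SQL Server; the aliases are the ones chosen in the walkthrough, with IsUnique and IsDisabled assumed to follow the same naming pattern:

```sql
-- Focused list of index metadata from the sys.indexes catalog view.
SELECT
    object_id,                        -- ID of the table the index belongs to
    name           AS IndexName,
    type_desc      AS IndexType,      -- e.g. CLUSTERED, NONCLUSTERED, HEAP
    is_primary_key AS IsPrimaryKey,
    is_unique      AS IsUnique,
    is_disabled    AS IsDisabled
FROM sys.indexes;
```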
But now I would like to change a few things. For example, I don't want the
object ID. I would like to have the full name of the table. And also there
are a lot of indexes that are irrelevant for my database. In order to do
that, we have to get the information from another metadata table. So let's
give this one the alias idx and join it with another metadata table. It's
called tables. So tbl, and we're going to join it using the index object_id
equal to the table object_id. And now, if you'd like to see the content of
this table, we can query it separately. So SELECT * FROM our new table.
Let's see the content of this table. You can see we have the name, which is
the table name, and I think that's all we need. We have a lot of other
information about the table, but I just need the table name. So let's put it
at the start: tbl.name AS TableName, and I don't need the object ID anymore.
But of course we have to use the alias for each of those columns to make it
clear that they come from the index. So let's go and do that. All right. So
my query is ready. Let's go and execute it again. So now, as you can see, we
are getting the table name, and the list is very short, because it only
covers the tables that you have in the database. This filter happens because
of the inner join. But one more thing: I would like to sort the data. So I'm
going to say ORDER BY. I would like to sort it by the table name and then
the index name.
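Put together, the query at this point might look like the sketch below. The catalog views sys.indexes and sys.tables are standard in SQL Server; the aliases idx and tbl and the output column names are assumed from the walkthrough:

```sql
-- List every index on user tables, with the table name instead of the object ID.
SELECT
    tbl.name           AS TableName,
    idx.name           AS IndexName,
    idx.type_desc      AS IndexType,
    idx.is_primary_key AS IsPrimaryKey,
    idx.is_unique      AS IsUnique,
    idx.is_disabled    AS IsDisabled
FROM sys.indexes AS idx
INNER JOIN sys.tables AS tbl            -- inner join drops indexes on system objects
    ON idx.object_id = tbl.object_id
ORDER BY tbl.name, idx.name;
```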
All right. So now let's check, for example, the table Customers. You can see
that we have two nonclustered indexes, and one of them is a columnstore
index. Those two we created in the previous tutorial. And we have an index
on the primary key: as you can see here, is_primary_key equals 1, and it is
unique as well. So with that we have a really nice list of all the indexes
that we have in our database. But we are not there yet, because our task is
to monitor the usage of the indexes. Now, in order to get the usage of each
of those indexes, we have to go to a special kind of view called a dynamic
management view. There SQL Server provides a lot of statistics about the
usage of each index. And we can find it in the same schema. So let's go and
query it. It's going to be SELECT * FROM, the same schema,
sys.dm_db_index_usage_stats. So let's explore this view and execute it.
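Written out, the statement being dictated here is simply:

```sql
-- Dynamic management view with usage counters per index.
-- Rows appear only for indexes that have been used since the counters were last reset.
SELECT *
FROM sys.dm_db_index_usage_stats;
```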
Now in those statistics we can find the usage of two indexes, index number
three and index number one. And we can see there are three usage entries for
index number one. Next we have user_seeks, user_scans, and user_lookups.
This is how many times the index was used for seeks, scans, or lookups. We
will understand these as we learn about the execution plan. And here we have
very nice information about how many times our index got updated. As you can
see, here it is zero, because I didn't add any new data after creating the
index. But of course all those numbers might be different on your side,
because it depends on whether you are running more queries and practicing.
And you can find more information here about exactly when those indexes were
last used, and many, many other nice details.
So now let's integrate this view into our query. What I'm going to do is a
left join, because if I do an inner join, I will only find the used indexes.
But I don't want that, because I want to see the full picture of all my
indexes in the database. So LEFT JOIN, and we're going to get our view and
call it s. And then we have to join it on the keys. So s ON: let's grab the
object_id equal to the index object_id. And of course we have to join on the
index ID as well. So it's going to be the index_id equal to the index_id,
like this. Now we have to select a few pieces of information from this view.
I'm going to select all those usage counts. So from s, let's get the
user_seeks, the user_scans, and the user_lookups, and maybe the user_updates
as well. And it is really nice to understand when the index was last used.
So last_user_seek and last_user_scan. Let me just correct it over here. And
actually I can put those two dates together in one column, because when we
have a value in one of them, the other is going to be NULL, and vice versa.
We can do that using the NULL function COALESCE, like this, and we can call
the whole thing LastUpdate. And maybe I'm going to rename all those columns.
All right. So now we are done.
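The finished monitoring query might look like the sketch below. The views and columns are standard in SQL Server and the aliases follow the walkthrough; the database_id filter is an addition that is not part of the walkthrough, included here because the usage view covers every database on the server:

```sql
-- Index inventory plus usage counters; unused indexes show NULL counters.
SELECT
    tbl.name           AS TableName,
    idx.name           AS IndexName,
    idx.type_desc      AS IndexType,
    idx.is_primary_key AS IsPrimaryKey,
    idx.is_unique      AS IsUnique,
    idx.is_disabled    AS IsDisabled,
    s.user_seeks       AS UserSeeks,
    s.user_scans       AS UserScans,
    s.user_lookups     AS UserLookups,
    s.user_updates     AS UserUpdates,
    -- COALESCE returns the first non-NULL value of the two dates.
    COALESCE(s.last_user_seek, s.last_user_scan) AS LastUpdate
FROM sys.indexes AS idx
INNER JOIN sys.tables AS tbl
    ON idx.object_id = tbl.object_id
LEFT JOIN sys.dm_db_index_usage_stats AS s   -- left join keeps unused indexes
    ON  s.object_id   = idx.object_id
    AND s.index_id    = idx.index_id
    AND s.database_id = DB_ID()              -- assumption: restrict to the current database
ORDER BY tbl.name, idx.name;
```

If both dates happen to be filled, COALESCE returns the last seek, so the column is best read as an approximate "last used" date rather than an exact one.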
Let's go and execute it. Okay. So let's check our new report over here. This
is our query, and let's start with the first table, for example Customers,
and go to the right side. Now we can see that we have three indexes, and of
these only one index is not used at all. We can see over here that the
nonclustered index on the country is not being used, and that's because we
have another index covering the country that comes from the columnstore. So
it could be that you are querying the table using the country, but SQL
Server is saying: I would like to use this index instead of the first one.
So we can say, okay, this one is not really useful, maybe we can drop it,
right? And for the rest you can see this columnstore index is used twice,
and the next one once. Again, the numbers on your side might be different.
And if we have a look at all the other tables, we have a lot of NULLs. So of
all those indexes that you have created on DBCustomers, let me check, only
one is used. But now you might say: I've used the index, so why am I not
seeing any numbers for it here? Well, that's because those numbers do not
live forever. We are using the Express edition locally on our PC. So each
time you shut down your PC and close the client, the database is going to
shut down as well, and those statistics are going to be lost, because they
are kept in memory. But in real projects the numbers are going to be totally
different from these, and of course you're going to get realistic numbers.
Now let's try to target one of those unused indexes. For example, let's go
with this index. It is the nonclustered index on the product. So let's go
and query that. Currently it is completely unused. So if I go and select
from it: SELECT * FROM Sales.Products WHERE Product equals 'Caps'.
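Written out, the test query is roughly the following, assuming the table is Sales.Products with a Product column, as the dictation suggests:

```sql
-- Filtering on the indexed column gives the optimizer a reason to use the index.
SELECT *
FROM Sales.Products
WHERE Product = 'Caps';
```

Whether the optimizer actually picks the index depends on the data and the plan it chooses, which is why the usage report is checked again afterwards.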
So with that we have used the index, I think. Let's go back and run the
query again, and let's go to our index and check whether it is used. Well,
it is correct. Our query did use this index, and we can see here it has been
used once. And now you can analyze all the indexes that you have on your
tables in your project, and you can see whether you are really using them in
your queries or not. And if you are not using an index, of course you have
to make a decision about it. Maybe, if you are working in a team, ask who
created it and why. Maybe there is a task in the database that is not run
frequently, maybe something that runs once a month or so. So the index is
needed, but not that frequently. But still, now we have insights into what
is going on with those indexes and whether we need them or not. And if you
don't need them, go and drop them.
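Dropping an index that the team has agreed is not needed is a single statement. The index name below is hypothetical; use the name shown in your own usage report:

```sql
-- Remove an unused index; the table data itself is not affected.
DROP INDEX idx_DBCustomers_Country
    ON Sales.DBCustomers;
```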
All right, my friends. So here is the secret that 90% of SQL developers
don't do, and that's going to make you the hero of the project in one
minute. Once I join a project, and after saying hello to everyone, I open
the database of the project and run one query: I check the usage of the
indexes of the project. And I can tell you, after working 15 years with SQL,
that 90% of the indexes created in projects are totally untouched and
unused. So I collect all the unused indexes and discuss them with the team.
And if we don't find a real use for those indexes, we go and drop them.
After dropping all those unused indexes, you have done two great things for
the project. First, you have saved a lot of storage in the database. And
second, which is way more important, you have improved the write performance
on the database. So on your first day, with one query, you have optimized
the performance of the database, you have saved storage, and you're going to
shine like an expert in your project. So if you haven't done that, do that
now. All right. And now moving on to the next one. As we learned,
identifying an unused index is an important task. But on the other hand,
identifying a missing index is also very important for improving the
performance of your queries.
So in SQL Server, you can get recommendations from the database itself about
missing indexes for your queries. So let's see where we can find those
recommendations. All right. So now let's say that you are running multiple
queries and doing analysis and so on. For example, I have this query over
here. It is a query on the database AdventureWorksDW, and I'm joining just
two tables, the fact with the dimension, and then filtering the data based
on the color and also on the date key, where I have a range over here.
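The exact query is only visible on screen, so the following is a hypothetical reconstruction of its shape, not the original. It assumes the FactInternetSales and DimProduct tables of the AdventureWorksDW sample database, and the filter values are made up for illustration:

```sql
-- Fact table joined to a dimension, filtered on a color and a date-key range.
SELECT
    fs.SalesOrderNumber,
    dp.EnglishProductName,
    dp.Color
FROM dbo.FactInternetSales AS fs
INNER JOIN dbo.DimProduct AS dp
    ON fs.ProductKey = dp.ProductKey
WHERE dp.Color = 'Black'                               -- assumed filter value
  AND fs.OrderDateKey BETWEEN 20101229 AND 20101231;   -- assumed range (YYYYMMDD integer keys)
```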
I have like a range over here. So once I executed I got the following
executed I got the following informations. It could be any query that
informations. It could be any query that you are doing while practicing and
you are doing while practicing and analyzing and so on. So now if you have
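
The query itself is only shown on screen, so here is a rough sketch of what it might look like. The table and column names come from the AdventureWorksDW sample database; the selected columns and the date-key range are assumptions chosen for illustration.

```sql
-- Run inside your AdventureWorksDW database (the exact name depends on the version).
SELECT
    fs.SalesOrderNumber,
    dp.EnglishProductName,
    dp.Color,
    fs.OrderDateKey
FROM dbo.FactInternetSales AS fs
INNER JOIN dbo.DimProduct AS dp
    ON fs.ProductKey = dp.ProductKey                  -- join the fact with the dimension
WHERE dp.Color = 'Black'                              -- filter on the color
  AND fs.OrderDateKey BETWEEN 20101229 AND 20101231;  -- date-key range (placeholder values)
```
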
So now if you have a slow query, you can go and check the recommendations from the database
about missing indexes. In order to do that, we can again check the metadata of the database
system. So let's go and do that. We're going to select from, and now we have to target the
dynamic management views, and it is this one: sys.dm_db_missing_index_details. So let's go
and explore the content over here. And don't forget that this information is kept in the
cache of the server, so if there's a restart or something on the server, you will lose all
of it.
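
Written out, the statement described here would be something like the following minimal sketch; nothing is assumed beyond the view name.

```sql
-- List the missing-index recommendations collected since the last server restart.
SELECT *
FROM sys.dm_db_missing_index_details;
```
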
So now, from my query, there are a few suggestions and recommendations from the database.
Let's go and check them. We can see here there are four recommendations about missing
indexes. So now let's check the first recommendation over here. You can get the table name
from the object ID, or you can find it here in the statement column. So here the database is
suggesting an index for the table DimProduct, and it is recommending that we make an index
on the column Color. That's because, if you check our query, we have a filter here, the
WHERE condition, where we are saying the color equals black, and since we don't have an
index on the color, SQL is suggesting one. And of course in this situation we can use a
nonclustered index. Now after that we have three recommendations for the same table,
FactInternetSales. For example, here it is suggesting an index on the order date key,
because we are using it in the filter over here, and as well an index on the product key,
since we are using it for the join.
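
To make the recommendations easier to read, one option is to resolve the table name and pick out the suggested columns, as in the sketch below. The CREATE INDEX at the end only shows what following the first recommendation could look like; the index name idx_DimProduct_Color is hypothetical, and it should be created only after the recommendation has been evaluated.

```sql
-- Readable view of the recommendations for the current database.
SELECT
    OBJECT_NAME(mid.object_id, mid.database_id) AS TableName, -- table name from the object ID
    mid.[statement]        AS FullyQualifiedTable,            -- [database].[schema].[table]
    mid.equality_columns,                                     -- columns used with = filters
    mid.inequality_columns,                                   -- columns used with range filters
    mid.included_columns                                      -- columns suggested for INCLUDE
FROM sys.dm_db_missing_index_details AS mid
WHERE mid.database_id = DB_ID();

-- Possible follow-up for the Color recommendation (hypothetical index name).
CREATE NONCLUSTERED INDEX idx_DimProduct_Color
    ON dbo.DimProduct (Color);
```
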
So this is a really nice report about missing indexes in the database, and it can help you
find things that you didn't think about. But here my recommendation is to evaluate this
information very carefully. Don't go and create an index for each suggestion from the
database. You still have to think about it. Is it really necessary? Do we really use this
query very frequently, and so on? So don't go blindly creating indexes for each
recommendation from the database. It is a really nice tool and assistant for making a good
indexing strategy. So this is how you find the recommendations of missing indexes from the
SQL database.
Okay. So now to the next step: we have to monitor the duplicates in indexing. If you are
working in a team with multiple developers, and you are working in parallel to optimize the
performance of the queries, what might happen is that different developers create different
indexes for the same column in the same table. Of course, this should not happen if you have
a clean and solid review process in the project. But we are human and those things can
happen. That's why you have to monitor whether there are duplicates. So the mission is to
find whether there is a column that is involved in multiple indexes. So let's see how we can
monitor that in SQL.
Okay. So now it's very simple to find the duplicate indexes inside your database. We have
learned before that we can find the list of all indexes in the table indexes in the system
schema (sys.indexes), and then we join it with the tables (sys.tables) in order to get the
table name. Then we have another table to find the columns that are involved in the index;
this information we can find inside the index columns (sys.index_columns). And now, in order
to get the full name of the columns, we're going to join it with the columns table
(sys.columns). So it's very simple and makes sense. Let's go and execute the whole query.
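
The joins described above could be sketched as follows. The four catalog views are the standard SQL Server ones; the column aliases and the exact select list are assumptions, since the original query is only shown on screen.

```sql
-- One row per (table, index, column), sorted so repeated columns sit next to each other.
SELECT
    tbl.name      AS TableName,
    col.name      AS IndexColumn,
    idx.name      AS IndexName,
    idx.type_desc AS IndexType
FROM sys.indexes AS idx
JOIN sys.tables AS tbl
    ON idx.object_id = tbl.object_id
JOIN sys.index_columns AS ic
    ON idx.object_id = ic.object_id
   AND idx.index_id  = ic.index_id
JOIN sys.columns AS col
    ON ic.object_id = col.object_id
   AND ic.column_id = col.column_id
ORDER BY tbl.name, col.name;
```
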
Now as you can see, it is sorted by the table name and the column name, and that's because
it makes the duplicates easier to find. So let's check the first table. The country is part
of this index, which is a columnstore nonclustered index, and again the country is involved
in another index, the one for the customers' country, and this is a rowstore nonclustered
index. So this is of course a bad thing. We have to decide now: do we want it as a
columnstore or a rowstore? And if we check this table as well, we can find the first name in
two different indexes, the same story. And that's because we were practicing and creating
those indexes. And that's it.
But now if you have a large schema and a lot of indexes, I would make a flag in order to
understand whether we have a duplicate or not. And that's by counting the number of rows for
each combination of table name and column name. We can do that very easily using the window
functions. So let's add a new line to the query. We're going to use the function COUNT,
since we want to find the number of rows, then OVER, and then we're going to PARTITION BY
the table name and as well the column name. Our expectation for this column is one. If we
have more than one, then there is an issue, and that means the column is inside two
different indexes. And now let's sort by this column, descending. So let's go and execute
it. And now we have here a nice flag where we can see how many rows we have for a specific
column in a table. If it's one, like for those columns, they are fine: those columns are
involved only once, in one index. But for the first four rows we have an issue, because the
count here is two. That means we have two indexes for the same column. So as you can see,
the query is very simple, and with that we have a nice report about the duplicates of
indexes inside our database.
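
A sketch of the same report with the flag added through a window function. The alias ColumnCount is an assumed name; any value above 1 means the column takes part in more than one index on that table.

```sql
SELECT
    tbl.name      AS TableName,
    col.name      AS IndexColumn,
    idx.name      AS IndexName,
    idx.type_desc AS IndexType,
    -- Number of index entries for the same table and column.
    COUNT(*) OVER (PARTITION BY tbl.name, col.name) AS ColumnCount
FROM sys.indexes AS idx
JOIN sys.tables AS tbl
    ON idx.object_id = tbl.object_id
JOIN sys.index_columns AS ic
    ON idx.object_id = ic.object_id
   AND idx.index_id  = ic.index_id
JOIN sys.columns AS col
    ON ic.object_id = col.object_id
   AND ic.column_id = col.column_id
ORDER BY ColumnCount DESC;   -- duplicates first
```
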
Okay, one more thing in order to maintain our indexes is updating the statistics. The
database engine uses statistics in order to understand which index should be used for our
query. And if these statistics are not up to date, SQL is going to make wrong decisions. So
let's understand what this means. Let's say that you have created a table and you start
inserting data into this new table. The database engine is going to create your new table
and insert the data. Behind the scenes, the database engine is also going to create
statistics for your new table. It's like metadata about your data, a report or insights
about your table, where you can find a lot of information like the number of rows, the
distribution of values in a column, the number of distinct values, a histogram, patterns and
many other details about your table.
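
If you want to look at such a statistics object yourself, SQL Server has a DBCC command for it. The sketch below assumes a table Sales.DBCustomers and a statistics or index object named idx_DBCustomers_CustomerID; the latter is a hypothetical name, so replace it with one that exists in your database.

```sql
-- Shows the header (rows, last update), the density vector and the histogram
-- of one statistics object. The statistics name is a placeholder.
DBCC SHOW_STATISTICS ('Sales.DBCustomers', idx_DBCustomers_CustomerID);
```
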
So now of course the question is: why do we have this information in the database? Imagine
that you are doing a SELECT ... FROM ... WHERE. What is going to happen is that the database
engine has to create an execution plan. We're going to learn about this later in detail. It
is just a road map for how to execute this query. So here, for example, in order to load the
data from the table, there are different ways to do it: there is a table scan, an index scan
and an index seek. That means the database engine has three different ways to do it. And in
order for the database to decide which way to use, it's going to read the statistics of the
table. So it's going to collect information: okay, how many rows do we have? Are the values
unique? How is the data distributed, and so on. And based on those statistics and numbers,
the database can make a good decision about which method to use in order to load the data.
So for example, here the index scan is the best way to load our table. This is exactly why
the database needs the statistics: in order to make the correct decision and to use the
correct index.
So now you might ask: okay, this is something internal to the database, why do we have to
care about it? Well, there is an issue. For example, in our table we have 50 rows, and let's
say that on the next day you inserted around 1 million rows into this table. The issue that
could happen is that the statistics for this table do not get updated, and they still say
that we have only 50 rows. That means the statistics of this table are now outdated. And the
big issue is that once you query this table, the SQL engine doesn't know at all about the
1 million rows that you have inserted, because it's going to ask the statistics, and they
are going to answer with only 50 rows. The database is going to say: okay, this is a very
small table, so let's maybe skip an index or something. That means the database is going to
make wrong decisions because the statistics are outdated. And now your task is to monitor
those statistics and to keep updating them. So let's see how we can do that.
Okay. So now the first thing that we have to do is to find out whether our statistics are up
to date or outdated. In order to do that, we again have to access the metadata about our
database. And for that as well we have tables and dynamic management functions in the system
schema, where we can find a lot of details about the statistics. In order to monitor the
statistics, I have prepared a query like this. Here I'm using a table called stats, where
you get a list of all statistics inside our database and the name of each one, and then I'm
joining it with the tables in order to get the table name. And what is very important is the
dynamic management function: here we're going to get very important information like the
last update, the number of rows and the number of modifications. So let's go and query it.
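
The prepared query is not spelled out in the narration, so the following is a sketch of one way to write it. It assumes the function meant here is sys.dm_db_stats_properties, which returns the last update, the row count and the modification counter; the aliases are illustrative.

```sql
SELECT
    SCHEMA_NAME(t.schema_id) AS SchemaName,
    t.name                   AS TableName,
    s.name                   AS StatisticName,
    sp.last_updated          AS LastUpdate,
    DATEDIFF(day, sp.last_updated, GETDATE()) AS DaysSinceUpdate,
    sp.rows                  AS [Rows],
    sp.modification_counter  AS ModificationsSinceLastUpdate
FROM sys.stats AS s
JOIN sys.tables AS t
    ON s.object_id = t.object_id
-- The function is evaluated once per statistics object.
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
ORDER BY sp.modification_counter DESC;
```
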
So here we can see information like the table name and the statistics name, and, what is
very important, the last time the statistics got updated. So now let's check our table
DBCustomers. We can see here the statistics name, and what is very important is the last
update. This tells us how old the statistics are. For me it is 4 days. Then we can find the
total number of rows in this table. And what is very important is the number of
modifications that have been made to the table. So after the statistics were updated on the
19th of October, around 15 rows got modified. This could be an insert, update or delete; any
operation on the table is considered a modification. So you can see there were a lot of
modifications, and these statistics should be updated. Now for the table Customers, you can
see that the statistics are up to date. We have zero modifications here, and there is no
need to update the statistics. So this is how you can check the statistics information
inside your database in order to decide: should I update the statistics or not?
So now let's say that I would like to update the statistics of our table DBCustomers. As you
can see, we have multiple statistics here. Over here we have statistics on the table's
columns, and as well we have the statistics on the index. So we have multiple statistics in
one table: one on the table's own columns and one for each index that we have in this table.
Now let's say that I would like to update only one of them. I don't want to update
everything in this table, only one statistics object. Let's go and do that. It's going to be
very simple: UPDATE STATISTICS, and then we have to mention the table name, so it's going to
be Sales.DBCustomers, and then we have to specify the name of the statistics. So let's get
it from over here and execute it. It was very fast. Let's re-execute our query and check the
data. So now let's find it. It was exactly this one. And as you can see, it just got
updated, the number of rows is five and the number of modifications is zero. So we now have
up-to-date statistics for this table.
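
As a sketch, the statement just described would look like this. The statistics name is copied from the screen in the video and is not part of the narration, so idx_DBCustomers_CustomerID below is a hypothetical placeholder.

```sql
-- Update one statistics object on the table (placeholder statistics name).
UPDATE STATISTICS Sales.DBCustomers idx_DBCustomers_CustomerID;
```
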
But let's say that I would like to update the rest, and I don't want to do it one by one. So
what we can do is copy the same thing over here, but without specifying any statistics name.
So we are saying UPDATE STATISTICS and then only the table name. Let's go and execute it.
Now what is going to happen is that SQL is going to update all the statistics that belong to
this table. So let's check our query again. Now you can see everything disappeared, and
DBCustomers is completely up to date, with no pending modifications. So this is how you can
update your table, and you can then do the same for the rest as well.
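
The table-wide variant, written out:

```sql
-- No statistics name given: every statistics object on the table is updated.
UPDATE STATISTICS Sales.DBCustomers;
```
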
But now there is one more thing: you can update the statistics of the whole database. But
beware, this might take a really long time. We're going to do that by executing a special
stored procedure: EXECUTE sp_updatestats, this one over here. Let's go and do that. And now
it is done, and we have a pretty long log here. It was fast because we don't have a big
database. It is a very small database, so it's not comparable to any real database. Now we
can see over here that SQL is going through everything that you have in the database and
trying to update the statistics. In many cases it's not necessary, because there is nothing
to update: there were no modifications and so on. That's why the database is smart enough to
say no, it is not required, and it skips it.
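
The database-wide call mentioned here is a single statement; it runs against the current database.

```sql
-- Updates statistics across the current database, skipping the ones
-- that have had no modifications since their last update.
EXEC sp_updatestats;
```
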
So now, how I usually do it in my projects is that I have a job on the weekend that updates
the statistics of the whole database. With that I make sure all my tables and indexes have
up-to-date statistics. Of course, if you have a small database you can run this every day,
but if it takes a long time then you can schedule it on the weekend. And as well, if I know
in the project that on one day there will be a lot of new incoming data, for example because
we are doing some kind of data migration, I update the statistics after the data migration
is done, just to make sure we have up-to-date statistics. So this is how we monitor and
update the statistics of the database.
Okay, so now moving on to the final task that I usually do in order to monitor and manage
the indexes: monitoring the index fragmentation. Over time, as your data is inserted,
updated and deleted in your tables, indexes can become fragmented. So what is fragmentation?
It means that there are unused spaces in your database that the database is not filling, or
that your data is no longer sorted correctly in the index. This of course leads to
inefficient use of the storage, and it is going to slow down your queries as well.
And in SQL, in order to get everything organized again, we have two methods. The first
method is reorganize. It defragments the leaf level of the index in order to get it
organized and sorted again in the logical order. It is a very light operation, and it will
not block users from using your table. The second method is called rebuild, and this is a
heavyweight operation. It drops the whole index and recreates it from scratch. This means of
course that not only is the data sorted again, but the fragmentation inside your database
and the index is eliminated as well. So let's see how we can do that in SQL.
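
As a rough sketch of the two methods just described, using T-SQL's ALTER INDEX. The index name idx_DBCustomers_CustomerID is hypothetical and the table Sales.DBCustomers is assumed from this course's examples; in practice you would run one statement or the other, not both.

```sql
-- Light operation: defragments the leaf level of the index.
ALTER INDEX idx_DBCustomers_CustomerID ON Sales.DBCustomers REORGANIZE;

-- Heavy operation: recreates the index from scratch.
ALTER INDEX idx_DBCustomers_CustomerID ON Sales.DBCustomers REBUILD;
```
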
let's see how we can do that in SQL. Okay. So now back to our database and
Okay. So now back to our database and the first question that you have to ask
the first question that you have to ask do we have an issue with the
do we have an issue with the fragmentations in our indexes. So we
fragmentations in our indexes. So we have to check the health of our indexes
have to check the health of our indexes in the database. And in order to do
in the database. And in order to do that, we have again to go to the system
that, we have again to go to the system metadata that we have and we're going to
metadata that we have and we're going to check their dynamic management
check their dynamic management functions. So there is like a special
functions. So there is like a special functions in order to get an answer in
functions in order to get an answer in the SQL server. Let's go and do that. So
the SQL server. Let's go and do that. So we're going to go and select star from
we're going to go and select star from the function. So it is sis dot so it's
the function. So it is sis dot so it's going to be sis dot dm db index physical
going to be sis dot dm db index physical states this one. And this is a function
states this one. And this is a function that we have to pass few parameters. We
that we have to pass few parameters. We will not go in details just follow me
will not go in details just follow me with this. So we have to give it the DB
with this. So we have to give it the DB ID and a null another null and a third
ID and a null another null and a third null and the last one going to be
null and the last one going to be limited. So we have to do it like this.
So let's go and query it. Now what do we find? We have the object_id, the
index_id, and some other information, but the most important one is
avg_fragmentation_in_percent. This column gives us the degree of
fragmentation in our index. If it is zero, then it is perfect: we have no
fragmentation in the index and our index is very healthy. But if it is
100, that means it is completely out of order and we have to do something
about it. And now you might say, you know what, I don't know which object
this is and which index. Well, you have to join a few tables, like
sys.tables and sys.indexes, in order to get that information. So we have
to do that like we did in the first query. Okay, so offline I have done
that. I joined with the tables and the indexes, and I'm sorting the data
by the average fragmentation in percent, descending, in order to get the
problems at the start, because we are interested in where we have a high
percentage. So let's go and execute this. Now, since it is a practice
database, I didn't insert any data and so on. But in real projects you
will get different numbers here.
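
The joined query is not spelled out in the transcript, so the following is
only a sketch of what it could look like. The column aliases and the extra
page_count column are assumptions added for readability.

```sql
-- Resolve object_id / index_id to readable names and list the most
-- fragmented indexes first.
SELECT
    tbl.name AS TableName,
    idx.name AS IndexName,          -- NULL for a heap (index_id = 0)
    s.avg_fragmentation_in_percent,
    s.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS s
INNER JOIN sys.tables AS tbl
    ON s.object_id = tbl.object_id
INNER JOIN sys.indexes AS idx
    ON s.object_id = idx.object_id
   AND s.index_id  = idx.index_id
ORDER BY s.avg_fragmentation_in_percent DESC;
```
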
And here are my recommendations about the percentage. If the fragmentation
is between 0 and 10, that means everything is okay and you don't have to
do anything about it. But if the percentage is between 10 and 30, then we
have to do something about it. Here I recommend using the reorganize
method in order to sort the data correctly again. But if you have more
than 30%, then my recommendation is to rebuild the whole index, because
not only is the data in the wrong order, but there are also unused spaces
in the data pages of the index. So you have to do something about it.
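
These thresholds can be turned into a small report. This is only a sketch:
the SuggestedAction labels are made up for illustration, and treating
exactly 10 as "reorganize" is an assumption, since the recommendation
above does not say which side the boundary belongs to.

```sql
-- Classify every index by the 10% / 30% rule of thumb.
SELECT
    OBJECT_NAME(s.object_id) AS TableName,
    s.index_id,
    s.avg_fragmentation_in_percent,
    CASE
        WHEN s.avg_fragmentation_in_percent > 30  THEN 'REBUILD'
        WHEN s.avg_fragmentation_in_percent >= 10 THEN 'REORGANIZE'
        ELSE 'OK'
    END AS SuggestedAction
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS s
ORDER BY s.avg_fragmentation_in_percent DESC;
```
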
So now let's imagine one of those indexes, for example this one over here,
has a fragmentation of 15%. So what we have to do is reorganize this
index. Let's see how we can do that. Let's go over here and say the
following: ALTER INDEX, and then we need the index name, so let's get it
from here. Then you have to mention the table name where the index exists,
so ON Sales.Customers. So now we are editing the index and we have to tell
SQL what to do. We just want to reorganize the index, so you use the
keyword REORGANIZE, and that's it. This is very simple. So let's go and do
that. And as you can see, it is completed, and it was very fast because we
have a small database. But sometimes it takes a little more time if you
have a big index and a big table. After reorganizing, you can check the
table over here again and see the results, and here it should be zero.
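
Written out, the statement looks roughly like this. The index name is
hypothetical, because the transcript only says it is copied from the query
result; replace it with the name of your own index.

```sql
-- Reorganize: re-sorts the leaf pages of the existing index in place.
ALTER INDEX idx_Customers_CustomerID   -- hypothetical index name
ON Sales.Customers
REORGANIZE;
```
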
Now let's say that we have another index where the fragmentation is around
50%. So let's copy it, and this time instead of reorganize we're going to
do a rebuild. I'm going to take the whole thing, and this time we're going
to rebuild this index over here on the same table, and instead of
REORGANIZE we're going to say REBUILD. So let's go and execute that. And
with that, SQL dropped the whole index and created it from scratch. This
usually takes more time than reorganize, of course. And the next step, of
course, is to check the fragmentation again and so on. So that's all about
how to make your index healthy and remove the fragmentation from your
index.
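
The rebuild variant differs only in the last keyword. Again, the index
name below is hypothetical and stands in for the one picked from the
fragmentation report.

```sql
-- Rebuild: recreates the index from scratch, which removes both the
-- wrong ordering and the unused space in its pages.
ALTER INDEX idx_Customers_Country      -- hypothetical index name
ON Sales.Customers
REBUILD;
```

After either statement, rerunning the fragmentation query is the way to
confirm that the percentage actually dropped.
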
All right, my friends. So as you can see, improving the performance of
your queries doesn't end with creating indexes. It's all about staying
proactive. So monitor the usage of the indexes, check whether there are
any missing indexes, always make sure the statistics of the database are
up to date, and keep your eyes on the fragmentation to make sure you have
healthy indexes. So with that, you have learned how I manage and monitor
the indexes once I create them, and I really recommend you to follow those
steps. All right, friends, so now let's say that you have a large, complex
analytical SQL query. It involves a lot of joins, aggregations and so on,
but it is slow, and of course you want to optimize the performance of your
query, maybe by using indexes. And now the big question is: where exactly
am I going to build this index, on which table, on which columns? That
means you have to understand where exactly the problem is. Is it in
joining tables, in sorting data, or in the aggregations? In order to
answer all those questions, we have something called the execution plan.
So what is that? The execution plan is going to show you exactly how the
database processes your query, step by step. And this is what we need.
It's going to show us where exactly we have a performance issue. In other
words, the execution plan is like your window into how the SQL database
thinks, and once you understand that, you're going to make the right
decision on building an index. So let's understand exactly what this
means.
Okay. So now let's imagine that you are doing a query like selecting from
a table and then joining the data with another table. Once you execute
this query, the database engine will not immediately start fetching data
from the disk. Instead, SQL first has to make a plan. It's like planning a
trip, where you check Google Maps in order to find the best route to reach
the destination, and the execution plan is exactly the same thing. The
database first has to plan how to execute your query, and it's going to
build this plan step by step based on your query and on the statistics. So
the first step, for example, is how to get the data from the tables, and
there are multiple ways, like an index scan or a full table scan. After
that, it needs to decide which type of join is going to be done, like a
hash join or a loop join, and then at the end of this plan is the SELECT
statement. Once the execution plan is ready, the database engine is going
to start implementing the steps. So it's going to start reading your
tables, for example from the disk, then after that it's going to join the
tables, select the columns, and at the end send the results to the end
user. And now once everything is done, the database engine is going to do
one more thing: it's going to take this execution plan and store it in the
cache. That's because the database engine can reuse this plan if we have a
similar query. So for example, if you execute the same query again, the
database engine is going to understand: ah, this is the same query, we
have already built an execution plan for that. So it's going to check the
cache, and it is way faster to get the plan immediately from the cache
instead of building it. In this scenario, the database engine doesn't have
to make any decisions. It's going to get the plan from the cache and
immediately start executing it. And of course, the database engine will
not hide the execution plan from the users. You can check it to see how
the database loaded the data, how the tables are joined and so on. And
then you can make a correct decision on how to optimize your query, maybe
by adding indexes. So let's go back to SQL and see how we can do that.
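
The plan cache mentioned above can be inspected directly. The query below
is not shown in the course; it is a sketch that assumes your login has the
VIEW SERVER STATE permission, and the TOP (20) limit is an arbitrary
choice.

```sql
-- Cached plans with how often each one has been reused.
SELECT TOP (20)
    cp.usecounts,                  -- number of times the plan was reused
    cp.objtype,                    -- e.g. Adhoc, Prepared, Proc
    cp.size_in_bytes,
    st.text AS QueryText
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;
```

Running the same query text several times and watching usecounts grow is a
simple way to see the reuse described here.
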
Okay, so now we're going to work with the database AdventureWorksDW2022.
We're going to go to our tables and focus on the fact table
FactResellerSales. Now let's check the type of this table. If you go
inside it and go to the indexes, you can see that we have an index on the
primary key. So we have a clustered rowstore index, which means the data
is structured in a B-tree. So now what we're going to do is create a
mirror of this table, but without any indexes. It's going to be very
simple: SELECT * FROM our FactResellerSales, and we're going to insert it
into a new table. So INTO FactResellerSales, and I'm going to call it HP,
for heap. So let's go and execute it. And now you can see we have inserted
around 60,000 rows into the new table. Now we can refresh our tables in
order to find our new table. It is over here, the FactResellerSales HP
table, and if you check the indexes, you will not find any. So that means
it is a heap table.
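
As a sketch, the copy could be written like this. The exact name of the
new table is an assumption (FactResellerSales_HP with an underscore), as
is the dbo schema; the transcript only says the table is called "HP" for
heap.

```sql
USE AdventureWorksDW2022;
GO

-- SELECT ... INTO copies the rows but not the primary key or indexes,
-- so the new table is stored as a heap.
SELECT *
INTO dbo.FactResellerSales_HP        -- assumed table name
FROM dbo.FactResellerSales;
GO

-- A heap appears in sys.indexes as index_id 0 with type_desc 'HEAP'.
SELECT name, index_id, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.FactResellerSales_HP');
```
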
Now let's do a very simple query on top of our new table. So SELECT * FROM
the FactResellerSales HP table, like this. Let's execute it, and we get
the results. So now the question is: I would like to see the execution
plan of this query. In order to see the execution plan, we're going to go
to the toolbar over here, and we have three things. The first one says
Display Estimated Execution Plan, another one says Include Actual
Execution Plan, and a third one says Include Live Query Statistics. So now
the question is, what are the differences between them? Let's start with
the first one, Display Estimated Execution Plan. What's going to happen
here? SQL is going to guess the execution plan without executing the
query. So this is only a guess, an estimation. The second one is the
actual one. This is going to show you the execution plan that is used in
order to process your query. So after executing your query, SQL is going
to show you which plan was used. That means the estimated plan is
something before executing your query, and the actual plan is something
after executing your query.
And the third one is while executing the query. You're going to get a
real-time view of your query's execution, and you can see how your
execution plan is working. So now we can try that. Let's activate the
estimated execution plan. Now we can see over here that we have a new
output where you can see a few boxes. So this is an estimated execution
plan, without executing your query. But now if you go over here and switch
to the actual execution plan, nothing is going to happen, because first
you have to execute your query. So let's go and do that. Once we have
executed it, we get the results, the messages, and here we have a new tab
called Execution plan. If you go over here, you will find the real
execution plan that was used to process your query. And let's try the
third one. Let's go and execute. It was pretty fast because the query is
very fast. But here we can see how the data and the plan are working
during the execution. So this is the live execution plan. And of course we
have the last one, which is the current execution plan. So those are the
differences between them.
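
The estimated and actual plans can also be requested in T-SQL instead of
through the toolbar. This is a sketch outside what the video shows: the
table name dbo.FactResellerSales_HP is assumed, and GO is the batch
separator understood by SSMS and sqlcmd.

```sql
-- Estimated plan: the query is compiled but NOT executed.
SET SHOWPLAN_XML ON;
GO
SELECT * FROM dbo.FactResellerSales_HP;   -- assumed table name
GO
SET SHOWPLAN_XML OFF;
GO

-- Actual plan: the query runs, and the plan that was used is returned
-- together with the results.
SET STATISTICS XML ON;
GO
SELECT * FROM dbo.FactResellerSales_HP;
GO
SET STATISTICS XML OFF;
GO
```

SET SHOWPLAN_XML has to be the only statement in its batch, which is why
each step sits between GO separators.
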
Now you might ask: why do we have both the estimated and the actual
execution plans? Well, it is a really nice tool to understand whether
everything is healthy in your database, because if the guess is something
other than the actual execution plan, that is an indicator that something
is wrong with the statistics or the indexes in your database. If the
estimated and the actual plans match, then everything looks good. But now
we're going to focus on only one type of those execution plans: we're
going to stick with the actual execution plan. So what we're going to do
is open two queries side by side, one from the clustered index and another
one from the heap structure. So it's going to be one to one.
Let's go and query both of them. And now let's try to read the execution
plan. But make sure that you have activated the actual execution plan. So
now we have two plans here. We are at the heap table, where we don't have
any indexes. So now the question is: how do we read this execution plan?
Well, the plan is very simple because we have a very simple query, but we
read it from right to left. The first operation is the table scan, and
then we have a very small arrow here to the next one, where we have the
SELECT. So, from right to left. Now of course the first operator is about
how to read your data inside the table, and here we have different types
of scans, and one of them is the table scan. A table scan actually scans
the entire table.
So it's going to scan all the rows inside your table in order to execute
this query. Now if you hover the mouse over the table scan, you will find
a lot of details about what is happening while loading the data or
scanning the table. But that is a little bit annoying. Better than that,
if you right-click on it and go to Properties, you will get the same
details on the right side, but easier to read. The first thing that we
have to read is the number of rows that have been read. We can see that we
have read all the rows inside the table, which is not really good. And we
have other important information about the resources and the cost. So we
have the CPU cost and the I/O cost, and what is interesting is the logical
operation, the table scan, and we can see some nice information about the
storage: it says it is RowStore. Now let's check the execution plan of the
other table, where we have a clustered index. So let's go to the execution
plan. And now you can see that we have something else on the right side.
We don't have a table scan. We have something called a clustered index
scan.
It is either scanning the entire table again, or only a range or a part of
the index. And of course in the details we can see whether it read all the
data or not. Now if you check the number of rows, again the whole index is
read in order to get these results. So again we have here the total number
of rows inside our table. And you can also see over here that the logical
operation is a clustered index scan, so it is not a table scan. Now of
course we have to check the CPU and the I/O costs, whether we are
consuming the same effort or not. So we can compare them. Here we have
something like 0.07. And if you go over here, you can see we didn't gain a
lot by having an index on this table. And that's of course logical,
because this query doesn't really make use of the index. It is just
selecting everything from the whole
table. So now let's extend it, where we're going to sort the data by the
primary key, the sales order number. Let's get this one, and the same for
the heap structure. So let's execute it and check the execution plan, and
do the same thing for our clustered table. Now let's first check the heap
structure. As you can see here, we have two steps. First, it's going to
scan the whole table, and then we have a sort operator in order to sort
all the data and present it in the output. And at the end we have the
SELECT, which is not really important. So here we have two operators. But
now if you go to our clustered index, you can see that we have only the
scan and then the SELECT. There is no sort step, right? And that's because
the clustered index is already sorted, and SQL doesn't have to sort the
data again. So this is the first win that you get if you have an index:
everything is already sorted, and if you have an ORDER BY on this column,
then SQL doesn't have to do it during the query. So now if you want to
compare the costs, you can see here we still have the same cost for the
CPU and the I/O. In the heap structure, without any index, we have about
double the cost. The first cost is for the table scan. It is the exact
same amount of CPU and I/O as the clustered one, but on top of it we have
a high cost for sorting the data. So we are consuming more CPU and I/O.
And if you sum up those costs, of course this query is going to be slower
and worse compared to the clustered index.
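
The comparison above can be reproduced with two queries, run with Include
Actual Execution Plan switched on. The heap table name is an assumption,
and SET STATISTICS IO is an optional addition that is not used in the
video.

```sql
-- Optional: print logical reads per table in the Messages tab.
SET STATISTICS IO ON;

-- Clustered table: rows can be read in index order, so the plan is
-- expected to show a clustered index scan without a separate sort.
SELECT *
FROM dbo.FactResellerSales
ORDER BY SalesOrderNumber;

-- Heap copy (assumed name): table scan followed by a Sort operator.
SELECT *
FROM dbo.FactResellerSales_HP
ORDER BY SalesOrderNumber;
```
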
So with that, in the execution plan you can understand exactly the benefit
of your index. And one more thing about this plan. If you go over here to
the Object property, and let me just expand it like this, you can see the
name of the index that has been used for your query. It says the index
name starts with PK, for primary key, and then we have the whole thing. So
now if you go to our table on the left side and check the indexes, it is
going to be exactly this index. So in the execution plan you can also find
which index has been used in your query.
which index has been used in your query. And this is very important to check. If
And this is very important to check. If you create a new index then run your
you create a new index then run your query and check whether the database is
query and check whether the database is using your new created index. And if not
using your new created index. And if not then you are making the wrong decisions
then you are making the wrong decisions about your index. So each time you
about your index. So each time you create a new index, make sure to check
create a new index, make sure to check whether in the execution plan the
whether in the execution plan the database is using your new
database is using your new [Music]
[Music] index. Okay, so now let's keep going.
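The same check can also be scripted instead of read off the plan's Object property. This is a minimal sketch that lists the indexes on a table together with their usage counters. The table name `dbo.FactResellerSales` is an assumption based on the table spoken of in this walkthrough, and reading the usage view normally requires the VIEW SERVER STATE permission.

```sql
-- Assumption: the fact table is dbo.FactResellerSales (AdventureWorksDW-style name).
-- The usage counters are reset when the SQL Server instance restarts.
SELECT
    i.name          AS index_name,
    i.type_desc     AS index_type,   -- HEAP, CLUSTERED, NONCLUSTERED, ...
    us.user_seeks,                   -- how often the index was used for a seek
    us.user_scans,                   -- how often it was scanned
    us.user_lookups                  -- how often it served a lookup
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id   = i.object_id
      AND us.index_id    = i.index_id
      AND us.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.FactResellerSales');
```

If a newly created index still shows no seeks or scans after the query has run, that matches the situation described here: the database is not using it.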
Now instead of using the primary key, I'm going to go and filter the data
based on one of those columns that we have in this table. So let me check the
results, and let's take for example the carrier tracking number. So carrier
tracking number, and let's go and pick a value, the first one here, like this.
And let's do the same thing for the heap table and execute it. And now in the
execution plan you see we still have a table scan. And on this table, let's
see the execution plan with the clustered index.

Now let's say that I would like to go and create a nonclustered index for this
column. So let's go and do it. So CREATE NONCLUSTERED INDEX, and I'm going to
call it index, fact reseller, and then the column name. So on our table fact
reseller, and the column is going to be carrier tracking number. So I'm going
to take it from here, and let's go and create it.

Now let's see whether our query is going to use this index. So let's go and
execute it, and let's go to the execution plan. Now things look completely
different than before. So what is going on? We can see that we have now
something new. We don't have a clustered index scan. We have something called
index seek. Index seek is an amazing sign in your execution plan, because it
tells us that SQL Server did find a way to use the index in order to find the
exact data that we need without scanning a lot of stuff.

So that means now we have like three ways of reading the data. We have the
table scan, where SQL is going to go and scan the whole table, and this can
happen in the heap structure. The second one is the index scan, and here we
don't know whether it is scanning the whole index or a part of the index. And
the last one is the index seek, where the database is able to find the data
directly without scanning a lot of stuff. So the worst type is the table scan,
then we have the index scan, and the best one is the index seek.
So if you check here the details, you can see the number of rows that has been
read is only 12. This is amazing. Let's go and check the heap scan over here.
So go to the execution plan, and if you go over here you can see that we are
reading around 60,000 rows in order to get 12. But with the index we are
reading only 12 in order to get 12, and this is amazing and very fast, of
course. And the cost of this is very, very small. So if you check the CPU and
the input/output, you can see those numbers are almost nothing. And if you go
to the object over here, you can see which index has been used, and this is
exactly the index that we have just created. So that means it was a really
good decision to create this index. SQL was very happy about it and used it in
order to find our data fast.

So now let's go and check the rest of the plan. You can see over here we have
a key lookup. The key lookup is an operation that we need in order to get the
rest of the columns, because from this index we are getting the data of only
one column, the carrier tracking number. But since in our query we are saying
SELECT *, that means we have a lot of columns, and those columns are not part
of the index. So in this index, SQL doesn't know anything about the rest.
That's why it has to go and search for the other columns. And of course it is
called a lookup, not a scan or something like that, and that's why we have
here as well only 12 rows, but from this step we will get the rest of the
columns.
So now the next step is that SQL is going to go and join those two pieces of
information. From the first one we have the carrier tracking number, and from
the second one we have the rest. Of course SQL has to go and merge all of that
into one in order to have it as a result. And this operation is called nested
loops. Behind the scenes there are different types of joins, not the ones that
we know, the inner, left and so on, but other types of joins. We have the
nested loop, the merge join, and the hash join. The nested loop is very good
for small stuff. If you have large tables, then the merge and the hash joins
are way better than the nested loop. So that means if you are getting here a
lot of data from the index and the lookups, and SQL is using a nested loop,
this is not good. But for now it is okay, because we are getting only 12 rows
and the operation is going to be fast enough.

And now one more thing that we can see inside our execution plan is the cost
in percentage. So from checking this plan, you can see the SELECT is costing
almost nothing. The cost of the nested loop is as well like 0%. And then we
have like 6% for the index seek. That's because it is pretty fast. And the
most expensive operation in our query is the key lookup, of course, because
it's going to go and get all the columns. And now if you go and compare to the
heap structure: even though the execution plan of the heap structure looks
very small, that doesn't mean it is faster than the indexes that we have.
Still, if you go and add up all those numbers, it is way, way faster than the
heap structure.
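The index and the filter query from this walkthrough can be sketched roughly as follows. The object names (`dbo.FactResellerSales`, `CarrierTrackingNumber`) and the index name are assumptions reconstructed from what is said aloud, and the filter value is a placeholder, not a value from the video.

```sql
-- Assumed names: table dbo.FactResellerSales, column CarrierTrackingNumber.
-- The index name is illustrative only.
CREATE NONCLUSTERED INDEX idx_FactResellerSales_CarrierTrackingNumber
    ON dbo.FactResellerSales (CarrierTrackingNumber);

-- Filtering on the indexed column while selecting every column:
-- the plan described above is an Index Seek on the new index, a Key Lookup
-- for the remaining columns, and a Nested Loops join combining the two.
SELECT *
FROM dbo.FactResellerSales
WHERE CarrierTrackingNumber = N'<tracking number>';  -- placeholder value

-- Selecting only the indexed column: the index alone can answer the query,
-- so the optimizer typically has no reason to add a Key Lookup.
SELECT CarrierTrackingNumber
FROM dbo.FactResellerSales
WHERE CarrierTrackingNumber = N'<tracking number>';  -- placeholder value
```

Whether the optimizer actually picks a seek depends on the data and statistics, so the plan should still be checked after running each statement.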
Now I would like to show you one more thing. If you want to get rid of this
key lookup, and in your query you are selecting only the carrier tracking
number, let's go and execute it and go to the execution plan. As you can see,
there is no need for the lookup, because we have only one column and we can
get this data completely from our index. So as you can see, it is interesting
to understand how SQL is working with your table and with your index, and this
is how to validate whether you are making correct decisions about your
indexes.

Okay. So now let's go and add more stuff where we are doing aggregations,
joins and so on.
Let's extend our query. So I'm going to go and join it with another dimension,
like for example the dim products, and the join is going to be on the product
key. So product key equal to product key as well. Now after that we're going
to go and aggregate a few things. So we're going to aggregate by the product
name. So I'm going to take the product name; it's going to be the English
product name, and let's go and call it product name. And let's go and
aggregate the sales. So SUM, and we're going to get it from the fact table.
It's going to be sales amount, AS total sales. And of course we have to go and
do GROUP BY, and not the French name, it's going to be the English name. So
let's group by the product name. And that's it. Let's go and execute it. Now
we have a nice list of products and total sales.

But let's go and check the execution plan. And oh my god, we have a lot of
stuff. So let's start from the right side and do it quickly from the right to
the left. So the first thing is that it's going to go and get the data from
the fact table, so it is using the clustered index. And then after that it's
going to go and do a hash match for the aggregation. And after that it's going
to go and sort the data, because it is doing a merge join later. So all those
steps are preparing the fact table. And then we have another clustered index
scan for the dimension. So it's going to go and select the information from
the dimension as well. And we have here not a lot of rows, so it is a very
small table, 600 rows. And now of course the result of the clustered index
scan is sorted as well, right? As we learned, the clustered index is going to
sort the data. So we have here a sorted output together with another sorted
output. So we have like two data sets that are sorted, and SQL here decided to
go with the merge join, which is a good join in order to join two sorted data
sets. It is way faster than joining using the nested loop. So everything is
fine, and then the data is going to be sorted and presented at the output. And
now if you are checking this plan, you can see the most expensive thing
happened at the fact table. So 71% of the total cost happened in this step.
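The join-and-aggregate query dictated above might look like this when written out. This is a sketch: the identifiers (`dbo.FactResellerSales`, `dbo.DimProduct`, `ProductKey`, `EnglishProductName`, `SalesAmount`) are assumed from the spoken names and follow the AdventureWorksDW sample schema.

```sql
-- Assumed AdventureWorksDW-style names, reconstructed from the narration.
SELECT
    p.EnglishProductName AS ProductName,
    SUM(s.SalesAmount)   AS TotalSales
FROM dbo.FactResellerSales AS s
JOIN dbo.DimProduct AS p
    ON p.ProductKey = s.ProductKey
GROUP BY p.EnglishProductName;   -- group by the English name, not the French one
```

With the actual execution plan enabled, the operators are read right to left in the order described: the scan of the fact table, the Hash Match aggregate, the Sort, the scan of the dimension, and then the Merge Join.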
Now let's say that the query is slow and I would like to go and optimize it.
We have learned that if you are doing aggregations on big tables, then the
columnstore index is a good idea. So let's go and find out whether that is
true. So I'm going to go to our other table. Our sales table was with the heap
structure. And now you say, you know what, let's go and convert this heap
structure to a columnstore. So let's go and do that. So we're going to say
CREATE CLUSTERED COLUMNSTORE INDEX, and we're going to call it index and then
the whole name, fact reseller sales HP, and we don't have to specify any
columns. So it's going to be ON our table, and that's it. Let's go and execute
it. So now our table is not a heap structure anymore. It should be a
columnstore. So if you go and check the information, we can see we have a
clustered columnstore index on it. So now let's go and run the same query and
check whether we have better performance. Let's go and execute it.
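Converting the heap copy of the table into a columnstore can be sketched like this. The table name `dbo.FactResellerSales_HP` is a guess at the spoken "fact reseller sales HP", and the index name is illustrative.

```sql
-- Assumption: the heap copy of the fact table is named dbo.FactResellerSales_HP.
-- A clustered columnstore index stores the whole table, so no column list is given.
CREATE CLUSTERED COLUMNSTORE INDEX idx_FactResellerSales_HP_CS
    ON dbo.FactResellerSales_HP;

-- Confirm the change: type_desc should now read CLUSTERED COLUMNSTORE, not HEAP.
SELECT name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.FactResellerSales_HP');
```

This form applies to a heap, as in the walkthrough. A table that already has a clustered rowstore index needs that index dropped or replaced first.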
And of course you have to go and activate the execution plan. So I'm going to
do that, and now let's go and check from the right again. So this is our fact
table, and as you can see it is already costing only 6%. Interesting. So let's
go and compare what happened to our fact table. First of all, we can see that
the physical operation is a columnstore index scan. And if you go to the
object over here, you can see that SQL did use the columnstore. And that is of
course going to happen, because the whole data is stored only in the index, so
there is no way around it; it has to go and use the index. But now what is
interesting: maybe we have to go and compare the CPU costs. So if we check
over here, it is like 0,000.67, and almost the same thing for the
input/output. Let's go to the previous plan, where we don't have a
columnstore, and check our fact table. So as you can see here, it is way more
expensive reading the fact table than with the columnstore, and we have
reduced the input/output costs as well. So as you can see, we went from 71% of
total cost for the fact table to only 6%. And the resources used to execute
the query are way less than with a normal clustered rowstore. And this is
exactly the power of this index, the columnstore index. If you use it on big
tables, like the fact tables, like we are doing here in this query, you will
be getting amazing performance for this scenario.

So of course you can go and compare the execution plans by moving left and
right. So as you can see, if I click over here and I just switch to the other
tab, I can quickly compare the numbers. But there is another way to compare
execution plans, and that is if you go to the execution plan and right-click
on it, then go to Save Execution Plan As, and then you have to go and give it
a name, for example query pro store. So let's go and save it. And then you can
go to the second query, where we have the rowstore, and then right-click on
the execution plan and say Compare Showplan. Once you click on that, you have
to go and select the one that you want to compare with. So open, and now on
top you have your query, and at the bottom you have the execution plan that
you have saved. And then you have here a lot of information where they compare
both of the execution plans, and with that you can go into more detail in
order to understand which plan is better.
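As a complement to comparing plans in the SSMS interface, the two variants can also be compared by their measured reads and timings. This sketch is not shown in the video; it assumes the rowstore and columnstore copies are named `dbo.FactResellerSales` and `dbo.FactResellerSales_HP`, with the same assumed AdventureWorksDW-style columns as the aggregation query.

```sql
-- Report logical reads and CPU/elapsed time in the Messages tab for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Rowstore (clustered index) version of the aggregation. Names are assumptions.
SELECT p.EnglishProductName AS ProductName, SUM(s.SalesAmount) AS TotalSales
FROM dbo.FactResellerSales AS s
JOIN dbo.DimProduct AS p ON p.ProductKey = s.ProductKey
GROUP BY p.EnglishProductName;

-- Columnstore version of the same aggregation.
SELECT p.EnglishProductName AS ProductName, SUM(s.SalesAmount) AS TotalSales
FROM dbo.FactResellerSales_HP AS s
JOIN dbo.DimProduct AS p ON p.ProductKey = s.ProductKey
GROUP BY p.EnglishProductName;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The plan percentages are optimizer estimates, so measured numbers like these are a useful cross-check when deciding which version is really cheaper.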
All right friends, so as you can see, having the execution plan is amazing. We
can see how SQL is working behind the scenes, and we can understand how SQL is
processing my query step by step: how many resources it is consuming, and
whether my indexes are useful or useless. And I can go and experiment. I can
go and add an index, then test and check whether I gained some performance or
not. And we can go and compare multiple execution plans, before and after,
until you get the right index for the right table and the right column. So the
execution plans are amazing in order to help us understand whether our
indexing strategy is correct or not.

All right friends, so so far we have learned that SQL Server is going to make
its own decisions on how to execute your queries, and SQL makes those plans
based on the statistics. But sometimes the plan that you are getting from the
database might not be the best one for your query, and there could be many
reasons why this could happen. Maybe the statistics are not up to date, or you
have a lot of indexes and the database engine gets confused. And here exactly
is where we need the SQL hints. So you can use the SQL hints in order to force
the SQL database on how exactly your SQL query should be executed. So you can
intervene and change the steps in the execution plan. So let's see how we can
do that.
All right. So now let's have a very simple query. We are just joining the
table orders with the customers, and we are showing a few columns. Now if you
go and execute it and we go and check the execution plan, we can see in this
plan that it is using the clustered index in order to read the data from the
orders and the customers, and then it is using the nested loop in order to do
the join. Now let's say that our tables are really big, but SQL is still using
the nested loops, and of course this is not good for large tables. Maybe SQL
was confused with the indexes and statistics and so on, and it decided to use
the nested loops. So now in order to force SQL to use another type of join, we
can go and give a hint in our query for SQL to use a different type of join.
So let's go and do that. We're going to go to the end of our query and we're
going to say OPTION, and inside it we're going to say use the hash join, like
this. So that's it. This is our query, and at the end we are giving the
database a hint for the execution plan. So let's go and try that out. So let's
check the execution plan. And now as you can see, it is using a different type
of join. So with that we are intervening in the execution plan and we are
making choices. So with that we have changed the technicality of how SQL is
joining those two tables.
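A query-level join hint of the kind described here goes in an OPTION clause at the very end of the statement. In this sketch the schema, table and column names (`Sales.Orders`, `Sales.Customers`, `OrderID`, `CustomerID` and so on) are hypothetical stand-ins for the orders and customers tables on screen.

```sql
-- Hypothetical table and column names; only the OPTION clause is the point here.
SELECT
    o.OrderID,
    o.OrderDate,
    c.FirstName,
    c.LastName
FROM Sales.Orders AS o
JOIN Sales.Customers AS c
    ON c.CustomerID = o.CustomerID
OPTION (HASH JOIN);   -- ask the optimizer to use hash joins for this query
```

`OPTION (MERGE JOIN)` and `OPTION (LOOP JOIN)` follow the same pattern for the other two physical join types mentioned earlier.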
All right. So now let's go and change something else. Like for example,
instead of having an index scan, I would like to have an index seek. So if you
have the right index on your table, you can go and tell SQL how to read your
data in the table. So let's go and do that. Currently here we have an index
scan on the table customers. So we can go over here next to the table and
we're going to say WITH, and inside it we're going to say FORCESEEK. So we are
forcing SQL to use the index seek. So we can use those keywords next to the
table in order to specify for SQL how to load the data. If you are not
specifying anything, like here with the orders, where we don't have any hints,
that means we are counting on the execution plan that is generated by SQL. But
if you don't want its recommendations, you can go and specify which one should
be used. So now let's go and execute it. Now we got an error, because SQL is
not able to process what we are asking for, and I think maybe it is because we
are using the force seek together with the hash join. Let me just comment this
out, and let's go and give it another try. And now it is working. So let's go
to the execution plan. So you can see we got the nested loop again. And now if
you go to the customers table, you can see now it is using the index seek. So
it is not using the index scan anymore. So as you can see, again we are
intervening and forcing SQL to use the method that might be better for our
query.
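A table hint sits next to the table it applies to, inside WITH (...). In this sketch the table and column names are again hypothetical, and the hash join hint is left commented out because, as seen above, combining the two produced an error in the demo.

```sql
-- Hypothetical names; the FORCESEEK table hint applies to Sales.Customers only.
SELECT
    o.OrderID,
    o.OrderDate,
    c.FirstName,
    c.LastName
FROM Sales.Orders AS o                        -- no hint: the optimizer decides
JOIN Sales.Customers AS c WITH (FORCESEEK)    -- require an index seek on this table
    ON c.CustomerID = o.CustomerID
-- OPTION (HASH JOIN)   -- disabled: together with FORCESEEK it failed in the demo
;
```

A plausible reason for the error is that the only seek available on the customers table is driven by the join column, which needs a nested loops join, so the two hints cannot both be satisfied.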
better for our query. Now if you are creating a lot of indexes in one table
creating a lot of indexes in one table and the SQL is still not targeting the
and the SQL is still not targeting the right index. So if you check the object
right index. So if you check the object you can see it is targeting specific
you can see it is targeting specific index. But if you have a better index
index. But if you have a better index than that you can give a hint for the
than that you can give a hint for the SQL to use a specific index. And we can
SQL to use a specific index. And we can do that like this. If you go over here
do that like this. If you go over here and remove the force seek and you say
and remove the force seek and you say use index and then we have to go and
use index and then we have to go and specify the index name. So let's go and
specify the index name. So let's go and get again the primary key over here. Now
get again the primary key over here. Now I'm telling SQL you have to go and use
I'm telling SQL you have to go and use this index in order to scan the table
this index in order to scan the table customers. So let's go and try this out.
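As a rough sketch, the hints discussed here might look like this in SQL Server (T-SQL). The table and column names (Sales.Orders, Sales.Customers, CustomerID, OrderID, FirstName) and the index name PK_customers are assumptions for illustration and are not taken from the recording.

```sql
-- Assumed tables: Sales.Orders and Sales.Customers, joined on CustomerID.

-- 1) Table hint: ask the optimizer to seek (not scan) on Customers.
SELECT o.OrderID, c.FirstName
FROM Sales.Orders AS o
LEFT JOIN Sales.Customers AS c WITH (FORCESEEK)
    ON o.CustomerID = c.CustomerID;

-- 2) Table hint: ask the optimizer to use one specific index.
--    PK_customers is a hypothetical name; use the real index name of your table.
SELECT o.OrderID, c.FirstName
FROM Sales.Orders AS o
LEFT JOIN Sales.Customers AS c WITH (INDEX([PK_customers]))
    ON o.CustomerID = c.CustomerID;

-- 3) Query hint: force one join algorithm for the whole statement.
--    Combining this with FORCESEEK may leave the optimizer without a valid
--    plan, which would produce the kind of error described above.
SELECT o.OrderID, c.FirstName
FROM Sales.Orders AS o
LEFT JOIN Sales.Customers AS c
    ON o.CustomerID = c.CustomerID
OPTION (HASH JOIN);
```

After each variant, the execution plan shows whether the hint was actually honored.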
And if you go to the execution plan, you can see it is targeting this index as
well. So not only can you force SQL to use a specific type of loading or
joining, you can also force SQL to use a specific index that you created.
All right friends, so as you can see, SQL hints are very powerful, but we have
to be very careful with them, because I really had a bad experience using them
in my projects. So here are my recommendations. What could happen is that you
are optimizing the performance in the development database, you start using
hints, and the speed is really good. But once you roll that out to another
database, the production database, the hint does not work correctly. The same
hint might not improve the performance, and one reason is that the production
database sometimes has much more data than the development database. So you
really have to test the hint in each database that you have. If your hint works
in one environment, that doesn't mean it is going to work in the other one. So
always make sure to test.
And the second recommendation is: don't use a hint as a permanent fix for your
queries. What does this mean? Let's say you are working in a project and one of
your queries is very slow. Now, if it's not clear why the execution plan is
really bad, you can use hints as a workaround in order to speed up your query
again, but it is still a temporary workaround. You still have to invest time in
order to analyze the root cause. Maybe the statistics are outdated, or you have
the wrong indexing, and so on. So use hints only as a workaround to speed up
your queries, but don't use them as a permanent fix. So friends, SQL hints are
really amazing for controlling the execution plan, but use them very carefully
and only if there is an emergency.
All right friends, so now for each SQL data project, we have to make sure that
we create clear guidance about the index strategy, and everyone in the team has
to commit to it and follow it, so that each index created in the project
fulfills a purpose. That's because without a clear indexing strategy, I promise
you there will be a lot of redundancy, unused indexes, wasted storage, and the
whole system of your project is going to be slow. So now I'm going to show you
the indexing strategy that I usually follow in my projects. But I'm telling you
right away, there is no single strategy that fits every project and every
scenario. That's why the team of each project should brainstorm in order to
make their own strategy. So now let's have a look at my indexing strategy.
And now, if I have to pick only one recommendation for you from this indexing
tutorial, it is this advice: avoid overindexing. Overindexing is the biggest
mistake and trap that a lot of developers fall into, where they think that
adding more indexes speeds things up and makes the queries fast. But I have to
tell you, this leads to exactly the opposite. And here's why. As we learned,
each time you add new data to your table, your indexes have to get updated,
sorted, and rearranged. So with too many indexes, what's going to happen? Your
insert, update, and delete operations are going to be slow. And this means your
database is slower and not faster. And one more very important reason why
overindexing is bad is that you confuse the database while it is creating the
execution plan. As we learned, the SQL database has to create the best
execution plan for your query.
And if you have a lot of indexes in your database, creating an execution plan
becomes complicated, which of course makes it harder for the database to choose
the best path and index. You also open the door for bad execution plans. And it
slows the query down, because the database first has to create the execution
plan before executing your query. So again, it has a bad effect on performance.
Having too many indexes might make the SQL database choose a really bad
execution plan. So overindexing confuses the execution plan and also makes the
query slower. That's why I call this a golden rule, and you have to commit to
it. Just avoid overindexing, because it is a double-edged sword, and you have
to have the mindset of less is more. Having a few effective indexes is way
better than having a lot of indexes. So keep it in mind and write it in your
development guideline for the team as a big statement: avoid overindexing. This
is the first statement in your indexing strategy. So now let's check the rest.

All right. So now we can split the indexing strategy into four phases, and each
phase has multiple steps. The first phase is to create an initial indexing
strategy. Once you start a new SQL project, you have to define the objectives
of the project very clearly. That means we have to make it clear what we are
focusing on and what we want to achieve. And in order to define the goal of
your indexing strategy, you have to understand your system. We have mainly two
types of databases. On the one hand we have OLAP databases. OLAP stands for
online analytical processing. The purpose of this kind of database is data
analytics, and an example of that is the data warehouse. In data warehousing we
extract the data from multiple sources, then we prepare it, transform it, and
put it in one big storage, and we call this an ETL process. And at the front
end we have reports and dashboards where the data is summarized, aggregated,
and presented for the end user. Users can use these reports to analyze the data
and get insights from it. And in order to generate those reports, there will be
heavy reading on the data warehouse database.
So that means there will be huge queries that's going to access the database in
that's going to access the database in order to aggregate and prepare the data
order to aggregate and prepare the data for the visualization. But now in the
for the visualization. But now in the other hand we have the OLTP systems
other hand we have the OLTP systems online transactional processing. It is
online transactional processing. It is like an e-commerce finance banking where
like an e-commerce finance banking where you have at the back end a database
you have at the back end a database where the data is stored and on the
where the data is stored and on the front end we have like an applications
front end we have like an applications for the end users. So now as the users
for the end users. So now as the users are interacting with the app this can
are interacting with the app this can cause write operations on the database.
cause write operations on the database. So inserting new data or changing data
So inserting new data or changing data and as well there will be read
and as well there will be read operations on the database in order to
operations on the database in order to show the data in the app. So we have
show the data in the app. So we have both write and read. So now of course we
both write and read. So now of course we have to ask ourself what is the goal
have to ask ourself what is the goal what do we want to achieve and here
what do we want to achieve and here mainly there is like two strategy either
mainly there is like two strategy either you want to improve the read performance
you want to improve the read performance or the right performance. Now if you are
or the right performance. Now if you are looking to the OLAP system here it's
looking to the OLAP system here it's really you have to understand the
really you have to understand the project where is the struggle sometimes
project where is the struggle sometimes it could be like the ATL process itself
it could be like the ATL process itself it's slow and mainly the ATL is writing
it's slow and mainly the ATL is writing data from the sources in the data
data from the sources in the data warehouse and maybe you have scenario
warehouse and maybe you have scenario where it takes like every day 10 hours
where it takes like every day 10 hours and 10 hours is of course a problem
and 10 hours is of course a problem because you cannot wait so long in order
because you cannot wait so long in order to get a new data fresh data to the
to get a new data fresh data to the report every day. So you can make the
report every day. So you can make the goal of the project is to optimize the
goal of the project is to optimize the right performance. You want to speed up
right performance. You want to speed up the ETL. But actually most of those
the ETL. But actually most of those projects having another issue. Well, it
projects having another issue. Well, it is the read operation on the database
is the read operation on the database because data warehouses normally have
because data warehouses normally have really big data sets and at the front
really big data sets and at the front end the reports generate large complex
end the reports generate large complex queries on the database. So that means
queries on the database. So that means the rate process going to be the pain
the rate process going to be the pain point in each OLAP system. So normally
point in each OLAP system. So normally the big goal in each OLAP system going
the big goal in each OLAP system going to be how to optimize the read
to be how to optimize the read performance. But now in the right hand
performance. But now in the right hand with the OLTB we have different nature
with the OLTB we have different nature of database and scenario. What going to
of database and scenario. What going to happen? You will not have like big
happen? You will not have like big queries from the apps. You're going to
queries from the apps. You're going to have like many query many transactions
have like many query many transactions happening between the application and
happening between the application and the database. So you're going to have
the database. So you're going to have like massive amount of read and write
like massive amount of read and write transactions. So the whole time we are
transactions. So the whole time we are reading, writing, reading, writing and
reading, writing, reading, writing and so on. But with the OL app we have like
so on. But with the OL app we have like something bigger and slower because in
something bigger and slower because in the ATL we usually run it only once.
the ATL we usually run it only once. That means we are writing only once new
That means we are writing only once new data to the database and this happen
data to the database and this happen usually at the night but on the
usually at the night but on the transactional systems you have a lot of
transactional systems you have a lot of readrs all time. Again depend on the
readrs all time. Again depend on the project but usually the main pain point
project but usually the main pain point in the OLTP is the right operation. So
in the OLTP is the right operation. So it could be like this. If you are
it could be like this. If you are building OTP system, the main goal is to
building OTP system, the main goal is to optimize the right performance. Now of
optimize the right performance. Now of course the question is how to do that?
How are we going to optimize that? Well, again we have to understand the nature
of the database. What we usually have in OLAP systems is a data model with very
big fact tables, and around each fact we have multiple dimensions that are
connected to it. Those fact tables are really big tables in the database, and
they are used each time a report is built. The reports are going to use those
facts all the time in order to prepare the data for the visualizations, and a
lot of aggregation queries are going to run on the facts. And now of course you
have to answer the question: which type of index should we use in this
scenario? Well, we have a perfect one called a columnstore index.
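A minimal sketch of what such a columnstore index might look like in SQL Server. The table dbo.FactInternetSales, its columns, and the index name are assumptions for illustration, and the statement assumes the table does not already have a clustered rowstore index.

```sql
-- Assumed fact table: dbo.FactInternetSales; the index name is hypothetical.
-- Assumes the table has no clustered rowstore index yet (for example, a heap).
CREATE CLUSTERED COLUMNSTORE INDEX idx_FactInternetSales_CS
ON dbo.FactInternetSales;

-- A typical OLAP-style aggregation that this kind of index is meant to serve.
SELECT OrderDateKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactInternetSales
GROUP BY OrderDateKey;
```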
So the best practice here, and you can make it a strategy for the whole
project, is that we give all fact tables a columnstore index, because this is
what we are doing in OLAP: we are aggregating large data sets. But the data
model and the scenario are completely different on the right side here. We are
going to have a lot of tables of different sizes, and there are a lot of
relationships between all those tables. So it is completely connected. You have
a lot of primary key and foreign key relationships between them, and normally
those tables are completely normalized. So they are like small pieces, but on
the left side we have denormalized tables as facts. So one strategy that we can
follow for indexing in OLTP is that we create a clustered index for each
primary key of our tables. This of course can improve a lot of things like
searching, sorting, and joining tables together. But since we are focusing on
optimizing the write performance in OLTP, you have to be more sensitive about
adding new indexes compared to OLAP, because each index you add could be a
reason why the data is written very slowly. So in OLTP you have to be way more
careful when adding indexes. As you can see, you have to understand the nature
of your project. You have to understand what the main issue is. Once you
understand your project, you can define a goal for optimizing the system,
either read or write or maybe both of them, and with that you are making the
initial strategy for indexing your system.

All right. So with that we have an initial strategy for our indexing and we
have a rough plan. Now in the next phase we have usage patterns indexing. So
now we are going to do a deep dive into our project. The first thing we have to
do is identify the frequently used tables and columns. That means you have to
check the queries used in your project in order to understand which is the most
important table that is used in many queries. For example, here we have the
fact internet sales. It is used in many, many queries in our scripts. So here
you are developing a feeling for what the most important, frequently used
tables are, and not only that, you can also check how we are filtering the data
in those queries.
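As an illustration of the kind of query being reviewed in this step, here is a sketch with the role of each column marked in comments. The table and column names follow an AdventureWorksDW-style schema and are assumptions, not copied from the scripts shown in the recording.

```sql
-- Assumed AdventureWorksDW-style names; adjust to your own schema.
SELECT p.EnglishProductName,
       SUM(f.SalesAmount) AS TotalSales                  -- SalesAmount: aggregating
FROM dbo.FactInternetSales AS f
JOIN dbo.DimProduct AS p
    ON f.ProductKey = p.ProductKey                       -- ProductKey: joining
WHERE f.OrderDateKey BETWEEN 20130101 AND 20131231       -- OrderDateKey: filtering
GROUP BY p.EnglishProductName;
```

Counting how often each column shows up in each role across all scripts gives the usage pattern.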
So for example, over here we are filtering by the order date key, and this kind
of filtering is used in multiple queries. As you can see, we have a couple of
queries here where we are always doing the same thing: we are filtering the
data by the dates. So with that we understand there is a pattern inside our
project where this column is used mainly for filtering and also for
aggregating. So you do a deep dive in order to understand which are the most
frequently used tables and columns inside your scripts.
And now of course what I usually do is use the help of AI and ChatGPT, where I
give it my code and then ask questions about it. For example, this prompt says:
"Analyze the following SQL queries and generate a report on table and column
usage statistics. For each table, provide the total number of times the table
is used across all queries, and a breakdown for each column in the table
showing the number of times each column appears. I would also like to see the
primary usage of each column: filtering, joining, grouping, and so on."
And in the output, as you can see, we got nice statistics about my scripts. The
most used fact table is fact internet sales. It is used 13 times in the
project, and then we can see statistics about each column inside this fact.
Most of the time the sales column is used for aggregating, and as we saw, the
order date key is used five times for filtering, and the other keys are used
for joining tables. So as you can see, it's amazing, right? Now we can identify
which tables are important and which columns are important, and based on that
information we can derive the indexing for our database.
So with that we have identified our frequently used tables and columns, and in
the next step we have to choose the right index type. As we learned before, we
have multiple types of indexes, and the choice really depends on the usage and
the scenario. For example, if your columns are primary keys, then go with the
clustered index. If you are using columns that are not primary keys for joining
and filtering, then think about the nonclustered index. And of course, if the
table is very big, as we said, you can use the columnstore index. And if you
are always targeting a subset of the data, like only one year of information,
then you can think about the filtered index.
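The filtered index just mentioned could be sketched like this in SQL Server. The table, columns, and index name are assumptions for illustration, and the date key is assumed to be an integer in YYYYMMDD form.

```sql
-- Hypothetical index name; assumes OrderDateKey is an integer like 20130101.
-- Only the rows of one year are stored in the index, so it stays small.
CREATE NONCLUSTERED INDEX idx_FactInternetSales_OrderDate_2013
ON dbo.FactInternetSales (OrderDateKey)
INCLUDE (SalesAmount)
WHERE OrderDateKey >= 20130101 AND OrderDateKey <= 20131231;
```

Queries that filter on the same year can use this index, while queries on other years cannot.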
And the last one: if you have a unique column where you don't have any
duplicates, then you can apply a unique index. So it depends on the scenario
and the usage. You have to choose the right index. And of course the last step
in this phase is that you have to test your index to check whether everything
is working fine. So that's all for phase two.
Then we go to phase three, scenario-based indexing. Here we have to focus on
specific issues and specific pain points. That means we first have to identify
the slow queries. They could be reported by users, or the team analyzes the
logs to understand which queries are causing performance issues. And once you
have a list of slow queries, you have to analyze them one by one, and it is
time to dig into the execution plans.
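One possible way to build the list of slow queries in SQL Server is to query the plan cache statistics. This is an assumption about tooling, since the recording only mentions user reports and logs; the sketch uses the system views sys.dm_exec_query_stats and sys.dm_exec_sql_text.

```sql
-- Top 10 cached queries by average elapsed time (reported in microseconds).
-- Only covers plans still in the cache; requires permission to view server state.
SELECT TOP (10)
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_microseconds,
       qs.total_logical_reads,
       st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_microseconds DESC;
```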
As we learned, we can check how SQL is executing our queries and start looking
for problem areas, for example where SQL is doing a full scan of a table or
maybe using expensive operations like nested loop joins, and so on. Once you
understand exactly where the pain point is, the next step is to choose the
right index, so which type of index we are going to use in order to optimize
the query. And once you create the index, the last step is that you have to go
and test it.
step is that you have to go and test it. So you're going to run again the
So you're going to run again the execution plan in order to make sure
execution plan in order to make sure that your query is using the index that
that your query is using the index that you have just created. So that means you
you have just created. So that means you have to go and compare the execution
have to go and compare the execution plans before and after. And if you see
plans before and after. And if you see that there is no benefit, then something
that there is no benefit, then something is wrong. That means you have to go and
is wrong. That means you have to go and investigate more and analyze the
investigate more and analyze the execution query and maybe choose a
execution query and maybe choose a better index way. And you have to do
better index way. And you have to do this process for each slow query until
this process for each slow query until you get all your queries fast. But of
you get all your queries fast. But of course, don't forget indexing is not the
course, don't forget indexing is not the only methods on how to optimize the
only methods on how to optimize the speed of queries. So as you can see
speed of queries. So as you can see through these three phases, we went from
through these three phases, we went from a very generic methods on how to index
a very generic methods on how to index our system to something very specific
our system to something very specific and scenario based. So as you can see as
and scenario based. So as you can see as we moving in the phases, we are doing
we moving in the phases, we are doing more deep dive into our projects.
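The before-and-after check from phase three can be sketched like this. It is a minimal example under assumptions: the table Sales.Orders, its columns, the filter value, and the index name are hypothetical and only stand in for one of your slow queries.

```sql
-- Hypothetical table, columns, and index name, used only for illustration.
SET STATISTICS IO ON;    -- report logical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time

-- 1) Baseline: run the slow query and note the reads and the plan.
SELECT OrderID, OrderDate, Sales
FROM Sales.Orders
WHERE CustomerID = 42;

-- 2) Create the candidate index for the pain point found in the plan.
CREATE NONCLUSTERED INDEX idx_Orders_CustomerID
    ON Sales.Orders (CustomerID);

-- 3) Run the same query again and compare reads, time, and the plan.
SELECT OrderID, OrderDate, Sales
FROM Sales.Orders
WHERE CustomerID = 42;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

If the second run shows the same number of logical reads and the plan still scans the table, the index is not helping this query and it is worth dropping it and trying a different one.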
All right. So now, moving to the last phase, we have the monitoring and
maintenance of our indexes. As we learned, the job doesn't stop at just
creating and implementing indexes. We have to be responsible and keep an eye
on the health of our indexes. And here the database offers a lot of statistics
and metadata that you can use in this phase. So the first step is to monitor
the usage of the indexes. As we learned, we can use the dynamic management
views or functions that we can find in the system schema, where we can see how
many times each index was used and when our queries last used it. With that we
can go and find all those indexes that we have created and that have never
been used in our projects. The next step is that we can go and monitor the
missing indexes. Here we can go and check the recommendations from the
database, where the database is reporting missing indexes from the execution
plans, and again we can use those dynamic management views or functions in
order to see more details. And as well, we can go and monitor whether we have
duplicates in the indexing. It happens a lot if you have a lot of developers in
your team. It could be that they are working in parallel to optimize the
performance of slow queries and then go and create multiple indexes for the
same column. So this is something that we can go and check, and if you have
duplicates, then you have to go and find out how you can consolidate them.
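As a hedged sketch of the first two monitoring steps, the two queries below read the index usage counters and the missing-index recommendations from SQL Server's dynamic management views. The column selection and the ordering are my own choices for illustration.

```sql
-- 1) Index usage: indexes with few or no reads show up first.
SELECT
    OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
    OBJECT_NAME(i.object_id)        AS table_name,
    i.name                          AS index_name,
    COALESCE(s.user_seeks, 0) + COALESCE(s.user_scans, 0)
        + COALESCE(s.user_lookups, 0) AS total_reads,
    COALESCE(s.user_updates, 0)     AS total_writes,
    s.last_user_seek,
    s.last_user_scan
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id   = i.object_id
      AND s.index_id    = i.index_id
      AND s.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND i.index_id > 0                -- skip heaps
ORDER BY total_reads ASC;

-- 2) Missing indexes recommended by the optimizer.
SELECT
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    gs.user_seeks,
    gs.avg_user_impact              -- estimated improvement in percent
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
  ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS gs
  ON gs.group_handle = g.index_group_handle
ORDER BY gs.avg_user_impact DESC;
```

Keep in mind that these counters are reset when the SQL Server instance restarts, so an index with zero reads right after a restart is not necessarily unused.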
Then, as the next step, we have to go and update the statistics. As we learned,
statistics are very important for the execution plan, because the database
engine uses that information to decide the best execution plan for your query.
If the statistics are old, the database is going to make wrong decisions
about how to execute your query, which might lead to bad performance. Here
again we have special functions in order to monitor the statistics, but my
recommendation is to have a job each weekend that goes and updates all the
statistics of your database.
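A minimal sketch of that statistics step in SQL Server: first look at how stale the statistics are, then refresh them. The table name Sales.Orders is a hypothetical example, and scheduling the refresh (for instance as a weekend job) is left to whatever scheduler you use.

```sql
-- Check when each statistics object was last updated and how many
-- row modifications have happened since then.
SELECT
    OBJECT_NAME(s.object_id) AS table_name,
    s.name                   AS stats_name,
    sp.last_updated,
    sp.rows,
    sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.modification_counter DESC;

-- Refresh the statistics of one table (hypothetical table name).
UPDATE STATISTICS Sales.Orders;

-- Or refresh the statistics of all user tables in the current database.
EXEC sp_updatestats;
```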
And the last step: we shouldn't forget about monitoring the fragmentation. As
we learned, over time, as you are doing modifications on the tables, the order
of the data could get wrong, or there could be free space in the database that
is not used. So we have fragmentation in the index, and in the same way we have
to monitor the fragmentation for each table. Here, if the percentage is
between 0 and 10, then there is no issue. If the fragmentation is between 10
and 30, then we have to go and reorganize the index. And if it's more than 30,
then this is alarming: you have to go and rebuild the whole index.
monitoring I go and build like automated dashboard in PowerBI or Tableau where I
dashboard in PowerBI or Tableau where I go and extract all those metad data and
go and extract all those metad data and create a nice dashboards in order to
create a nice dashboards in order to monitor the health of the database or
monitor the health of the database or you can go and buy some other tools that
you can go and buy some other tools that are advanced in order to do those
stuff. All right. So this is my indexing strategy that I usually follow in my
strategy that I usually follow in my projects. And as you can see, each phase
And as you can see, each phase builds upon the previous one, moving from a
general strategy to a more targeted, refined, specific strategy, where we
first define the goal of the indexing strategy of the project. And as we move
through the phases, we're going to be targeting more specific scenarios. And
this cycle keeps repeating. It's not a one-time thing. So you have to keep
discussing whether the goal is still suitable for the project. You have to
keep analyzing the frequently used tables and columns, keep searching for
those slow queries, and always keep an eye on the indexes. And of course, I
can only keep repeating this: avoid over-indexing.

All right, my friends, so that's all about indexes. That was a lot of
information and a lot of techniques. So now you know everything about indexing
in SQL. Next, there is another important technique to optimize performance:
we're going to talk about partitions, so how to divide our data in order to
optimize the performance. So let's go.

All right. So what is SQL partitioning? It's a technique to divide a large
table into small pieces, and each piece we call a partition. Well, this sounds
like we are dividing one big table into smaller tables, but it's not like
that. We are just dividing one table into smaller partitions. So we're going
to see it in the database still as one solid table, but behind the scenes it
is split into multiple partitions.
So now let's go and understand what this means. Okay. Let's say that you have
a table in your database, and over time this table is getting bigger and
bigger, so you have hundreds of millions of rows. Once you have such a big
table, what's going to happen? Everything is going to be slow. For example, if
you are reading the table and the execution plan is doing a full scan of the
table, it can take SQL a long time until all the rows are fetched. And if you
decide to create an index for this table, what's going to happen? SQL is going
to go and build a very big B-tree index with a lot of branches and leaves and
so on. And having a big index is not always a good thing, because if you do
operations like deleting, updating, or inserting rows, these operations are
going to need a long time to process. So having a big index doesn't mean that
you get good performance for your big table. That means having a big table is
problematic, because everything is going to be slow.
So now, what can we do in order to optimize the performance of this big table?
Well, we can use SQL partitioning. In order to do that, we have to understand
the behavior and the transactions that are happening on our table. What
usually happens is that the table grows over time. So you can have a subset of
data that belongs to 2023, another one that was created and updated in 2024,
and then something more current in 2025. That means we have old data and new
data in our table, and we usually interact with the new data more often than
with the old data. So maybe, for example, for 2023 there is only one read
transaction, and for the data in 2024 we have done two reads and one write. So
it is a little bit more than 2023. But for the new data of the current year
there will be heavy transactions: a lot of reads, a lot of writes. We are
updating, inserting, reading, so a lot of things are going on for the new
data. That means we are frequently accessing the big table only to interact
with the new data, and we rarely need the old data.
So what can we do? We can go and divide this big table, and we usually divide
it by a date. That means we can go and split this table by year and put each
year in one partition. So at the end we're going to have three partitions. And
now it's really important to understand that those are three partitions, not
three tables. That means on the client side the users see only one table, but
behind the scenes we have three partitions. Now let's say that you have a
query that reads the data from 2025. What's going to happen? SQL will not go
and scan all the data in the table. It's going to target only one partition,
the 2025 one. That means SQL is scanning only the relevant information, the
relevant partition, and not the entire table.
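To make that concrete, here is a small sketch of such a query. It assumes a hypothetical table Sales.Orders_Partitioned that is already partitioned by year on its OrderDate column; the table and column names are illustrations only.

```sql
-- Hypothetical table, assumed to be partitioned by year on OrderDate.
-- Because the filter is on the partition key, the engine can skip the
-- partitions that cannot contain rows from 2025.
SELECT OrderID, OrderDate, Sales
FROM Sales.Orders_Partitioned
WHERE OrderDate >= '2025-01-01'
  AND OrderDate <  '2026-01-01';
```

The important detail is that the filter is written directly on the partition key. A filter on some other column gives the engine no way to rule out partitions, and you can confirm which partitions were touched in the execution plan.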
And now we have another benefit of having partitions. Let's say that you're
using a modern database, and normally they support parallel processing. If
you have the infrastructure for that, the database engine can process each
partition independently and in parallel, whether you are reading or writing
data. So SQL is going to process your queries in parallel, which of course can
reduce the overall execution time. That means if you have a modern
infrastructure, for example Azure Synapse and so on, go with partitions,
because the partitions could then be stored on different servers, and this of
course helps the SQL engine to use all the resources at once. So partitions
allow scalability and, as well, parallel processing.
Partitions are also going to make the indexing more efficient. Instead of
having one very big index for the whole table, if you put an index on a
partitioned table, what's going to happen? Each partition is going to get its
own index, which means the indexes are going to be smaller. And of course,
this helps a lot with searching for data, and as well with extending the index
itself. For example, if you are inserting data into the partition 2025, SQL
will not go and change anything in the other indexes. It's going to go and
change only the index of the partition 2025. So you can see the power of
partitioning. It can significantly improve the performance of your table,
whether you are reading or writing data to this big table. So this is what we
mean by partitioning and why we need it.

All right, friends. So now we're going to go through the process of creating
partitions in SQL.
At the start it might sound a little bit complicated, but we're going to do it
step by step, and I have a sketch for that. We have four steps, because we
have multiple layers in the database. So let's see how we can do that. Let's
go. The first step is that we're going to go and define the partition
function. So what is that? In the function we define the logic of how to
divide the table into partitions. This is based on the partition key, which
means we need a column in order to define the logic. We usually use date
columns, for example the order date, or in other scenarios we can use the
region or country and so on. But the most common one is the date, and that's
because our tables get bigger over time. There are multiple types of
functions, and we're going to focus on the range function. So how is it going
to work? We're going to have a range of dates, and then we have to define
boundary values. Let's say that I would like to make a partition for each
year. In order to do that, we have to define the partition boundaries. A
boundary is a value: the boundary of a year could be the first day of the
year or the last day of the year. Here in this example we're going to take the
last day of the year as the boundary, so the last day of 2023, 2024, and 2025.
We call those values the boundaries of our function.
Now, between the boundaries we're going to have our partitions. For example,
all the rows for 2023 and earlier years are going to be partition one. So the
first boundary and everything before it is one partition. After that, between
the two boundaries, we have partition two, and this partition is going to hold
all rows of 2024. Then we have another section, partition three, where we have
all rows of 2025. And then from the last boundary onwards we have partition
four, and here we're going to have all the rows from 2026 onward. So with that
we now have a logic: we are telling SQL how to divide our data into multiple
partitions.
And here there are two methods, left and right. So what are those two methods?
Again, we have our boundary, and now the big question is: to which partition
does this boundary belong? Is it partition one or partition two? That's why we
have those two methods. If you say it is left, that means the boundary belongs
to partition number one. On the other hand, if you say it is right, then the
boundary is going to belong to partition number two. So you have to decide
whether the boundaries belong to the left partition or to the right partition.
With left, in partition one we're going to have all the rows of 2023,
including the last day of 2023, because in partition two we only focus on
2024. So the boundary just belongs to the left partition. It's very simple.
Now let's go and implement that in SQL. So let's do it. The syntax is very
simple. We're going to say CREATE PARTITION FUNCTION, and then we have to give
it a name. It's going to be partition by year, since we are dividing the data
by year. After that we have to define the data type. We are splitting the data
by a date, so it's going to be DATE. After that we have to define the
partition function type, so in our example we are using the range. And now we
have to define whether it is left or right. We're going to stick with left.
And now comes the very important step: we have to define the boundaries. So
we're going to say FOR VALUES, and we're going to enter three boundaries here,
like in our example. For each year we're going to define a date: 2023 and the
last day of the year, and the same goes for 2024 and for the last one, 2025.
So with that we have defined the logic, the range, we have defined the
boundaries, and we have told SQL that the boundaries are dates. So let's go
and execute our function. Okay, so that's it. As you can see, it's very
simple. We just created a function that splits the data by date using range
left. And of course, this function is not yet attached to any table or
anything. It is just a logic that is stored in the database.
All right. So now, since our partition function is stored inside the database,
we have metadata about those functions stored in the system schema. There we
have a dedicated catalog view called partition functions, where we're going to
find information about all the partition functions that we have inside our
database. So let's go and execute it. And as you can see, we now find our
newly created partition function: partition by year, it is a range, it has an
ID, and so on. And I really recommend that you check it before creating any
new partition function.
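A sketch of that metadata check in SQL Server: sys.partition_functions lists the functions, and joining sys.partition_range_values adds the boundary values. The join and the column selection are my additions for illustration.

```sql
-- List all partition functions in the current database with their boundaries.
SELECT
    pf.name,
    pf.function_id,
    pf.type_desc,                 -- RANGE
    pf.boundary_value_on_right,   -- 0 = RANGE LEFT, 1 = RANGE RIGHT
    pf.fanout,                    -- number of partitions the function creates
    prv.boundary_id,
    prv.value AS boundary_value
FROM sys.partition_functions AS pf
LEFT JOIN sys.partition_range_values AS prv
       ON prv.function_id = pf.function_id
ORDER BY pf.name, prv.boundary_id;
```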
Maybe you already have one in the project. Okay. So now let's check the next
step in our process. We're going to go and build the file groups now. So what
is a file group? It is like a logical container for one or more data files. So
it's very simple, it's like folders. We're going to go and create multiple
folders now, so later we can put files inside them. And this is really nice,
because it gives us freedom and flexibility: we can decide how the data files
are organized for each partition. What we usually do is go and create a file
group for each partition. So we're going to have four folders, or four file
groups, for 2023, 2024, and so on. So now let's go back to SQL in order to do
that.
that. All right. So now let's go and create those file groups. The syntax is
create those file groups. The syntax is very simple. So it's going to say alter
very simple. So it's going to say alter database. And now we have to tell the
database. And now we have to tell the database where these file groups should
database where these file groups should be stored in which database. So I'm
be stored in which database. So I'm going to stay with the sales DB. And
going to stay with the sales DB. And then we have to tell okay add file group
then we have to tell okay add file group and after that we have to define the
and after that we have to define the name of the file group. So the first one
name of the file group. So the first one going to be for
going to be for 2023. So the syntax is very simple.
2023. So the syntax is very simple. Let's go and do it for the other years.
Let's go and do it for the other years. So we need
So we need 2024 5 and six. Okay. So that's all. We
2024 5 and six. Okay. So that's all. We can just select everything and execute.
So as you can see, it's very simple. We have just created four file groups
and they are empty, so we don't have anything inside those containers. Now
let's say that you have made a mistake with the naming and you would like to
drop one of them. The syntax is very easy as well. It's going to say ALTER
DATABASE SalesDB, and instead of ADD you're going to say REMOVE. So once you
execute this, the file group will be dropped. But we need it, so let's go
and recreate it. Now, as usual after creating stuff, let's check whether
everything is created correctly and whether we have any duplicates or
anything wrong. For that we have a file groups table inside the system
schema, so let's go and query it. I'm just filtering with the type FG for
file group. So let's execute it. And now we can see in our database we have
five file groups.
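The drop-and-recreate step and the metadata check just described could be sketched as follows. The filegroup and database names are the same assumed spellings as before; sys.filegroups and its columns are standard SQL Server catalog metadata.

```sql
-- Drop a filegroup (this only succeeds while it contains no files),
-- then recreate it because it is still needed.
ALTER DATABASE SalesDB REMOVE FILEGROUP FG_2023;
ALTER DATABASE SalesDB ADD FILEGROUP FG_2023;

-- sys.filegroups is scoped to the current database, so switch to it first.
USE SalesDB;

-- List all filegroups; is_default = 1 marks the default one (PRIMARY
-- unless it has been changed).
SELECT name, type, type_desc, is_default
FROM sys.filegroups
WHERE type = 'FG';
```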
Now four of those file groups we just created, right? So we have 2023, 2024
and so on. But we also have something called the primary file group. This is
the default file group that is created for each database. It is the
container for the database's primary data file and for anything that is not
assigned to another file group. And as you can see, we have here a flag
saying whether it is the default. So for primary we have a one, and the rest
are not the default. So this is really nice to see all the file groups
inside your database, to check that you don't have duplicates and so
on. Okay. Now moving on to the third step, where things get more physical.
So far, the partition function, the file groups and all that stuff are
logical. We don't have data yet. In order to have data, we have to go and
create data files. So, as we learned before, data files contain our actual
data and they are stored physically on disk. You can assign one or multiple
data files to each file group. And the file format here is NDF, which is
the extension for secondary data files. We have primary data files (MDF)
and secondary ones, and for partitions we usually go with this format, the
NDF. So again, the file groups are logical containers and the data files
are physical files where our actual data is going to be stored. So now
let's go back to SQL in order to create some data files. Okay. So now
we're going to come to the slightly annoying part, where we create the
files. But the syntax is very simple as well. We're going to say the same
thing, ALTER DATABASE, and our database is SalesDB. And then this time
we're going to say ADD FILE. And now we have to give SQL not only the name
but also the physical location of the file. So let's do it step by step.
We're going to open two parentheses. First we have to define the logical
name. It is not the file name, it is the logical name of the file. So let's
give it a name, for example P 2023, and then a comma. So this is the
logical name. And next we're going to give the physical name of the file
together with the path. So we're going to say FILENAME equals, and now we
have to give SQL the complete path of the file. In SQL Server there is a
default path where the data is stored, and I'm going to use the same path.
The path really depends on the version and the edition of SQL Server that
you are using. For the version that I'm using for this tutorial, we can
find it over here in this path. So if you go to C, then Program Files,
Microsoft SQL Server, MSSQL, and the version for me is 16, SQL Express, and
then inside MSSQL, DATA and so on. So
we're going to go inside this folder, and now we can see over here all the
database files. We can see for example the SalesDB file, the SalesDB log,
and we have here AdventureWorks and so on. So you're going to see all the
files of your databases. And what we're going to do is put our partition
files inside the default folder as well. But in a real project, you have to
ask the database administrators about the exact location where you can put
your partition files. So let's go back to SQL, and I'm going to put this
path over here. And then we have to specify the file name. So it's going to
be P 2023, dot, and now we have to specify the file extension, so NDF. And
with that we have a complete path with the file name. So we are almost
there, but we are not done yet. We have to tell SQL where to put this file,
in which container, in which file group. So we're going to go over here and
say TO FILEGROUP, and here make sure to select the correct one, so FG 2023.
All right. So that's all. Let's go and execute it. And with that we have
created a file inside a file group. I will not be creating multiple files
inside one file group. It's going to be one to one.
So now what we're going to do is create the other files, one for each file
group, for each year. We just have to copy and paste and change the names.
So for 2024 it's going to be like this. So that's it. And the same thing
for 2025 and 2026. And now we can select everything and execute it. So
that's it. With that we have created four different files, and we have
mapped each file to the correct file group. And I usually don't create a lot
of files. I just create one for each year, or maybe for a bunch of years. So
you don't have to make a partition for each day or something like that.
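A minimal sketch of the four ADD FILE statements, assuming the logical names are spelled P_2023 to P_2026 and the filegroups FG_2023 to FG_2026. The folder is reconstructed from the spoken description of a default SQL Server 16 Express install and is an assumption; replace it with the data directory of your own instance.

```sql
-- Step 3: add one secondary data file (.ndf) to each filegroup.
-- NAME is the logical file name; FILENAME is the full physical path.
-- The folder below is an assumed default for a version 16 SQLEXPRESS
-- instance and must be changed to match your installation.
ALTER DATABASE SalesDB ADD FILE
(
    NAME = P_2023,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2023.ndf'
) TO FILEGROUP FG_2023;

ALTER DATABASE SalesDB ADD FILE
(
    NAME = P_2024,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2024.ndf'
) TO FILEGROUP FG_2024;

ALTER DATABASE SalesDB ADD FILE
(
    NAME = P_2025,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2025.ndf'
) TO FILEGROUP FG_2025;

ALTER DATABASE SalesDB ADD FILE
(
    NAME = P_2026,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL16.SQLEXPRESS\MSSQL\DATA\P_2026.ndf'
) TO FILEGROUP FG_2026;
```

Keeping one file per filegroup, as the video does, makes the mapping easy to read; a filegroup can hold more files if one of them grows too large.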
Okay. As usual after creating stuff, we have to go and check the metadata.
I have prepared here a query where we query the file groups together with
the files. All the file information can be found inside the table master
files, and then we join those tables and filter on our database. So let's
go and run this one. And now we get a list of all files inside the
database. We see over here the primary one for the database itself, and you
can see the path of the file and its size. And we can see over here our
four files, the file group each one is assigned to, and the complete path
of each file. And of course you can monitor over here how the size of each
file is growing over time. Maybe one of them is getting really big, and
then you can think about splitting it into multiple files. So that's it
about how to create data files.
All right. So now we're going to move to the last step, where we define the
partition scheme. Now if you have a look at this picture, you see that
there is something missing. On one side, we have defined how to divide our
data into multiple partitions. And on the other side, we have prepared all
the files and the file groups and so on. What is missing is the connection:
how to connect those partitions to the file groups. And we can do that by
using the partition scheme. So all we are doing now is defining which
partition belongs to which file group. For example, we're going to map
partition one to the file group 2023. And with that, all the data of 2023
and earlier is going to go to the file group 2023. And of course we have to
map each partition to a file group. If you don't do that, you will get an
error in SQL. And once we build the partition scheme, we have all the
components ready in order to have a partitioned table. So now let's have a
quick summary. The partition function decides how to split your data into
multiple partitions. The partition scheme maps the partitions to file
groups. The file groups are like folders to organize your files. And each
file group has one or more data files where your actual data is going to be
stored physically. All these layers might be confusing at the start, but
now that you understand each layer, it's going to be easier for you to
build partitions. So now let's go back to SQL in order to build the
partition scheme.
Okay, so now we have the easiest part, where we connect everything
together. The syntax is very simple as well. It's going to say CREATE
PARTITION SCHEME, and now we have to give it a name. So let's go with
scheme partition by year. And now we have to map the partition function to
the file groups. First we say AS PARTITION, and then we need the partition
function that we have created, so AS PARTITION partition by year. And after
that we map it to the file groups. And here it is very important to map
them in the correct order. So the first one is file group 2023, the second
one 2024, then we have 2025, and the last one 2026. So again, the order is
very important, and it's also a little bit tricky. Sometimes when you are
creating the functions, you make the mistake of not knowing how many
partitions they are going to create. In our example we have three
boundaries, and SQL is going to create four partitions.
So it happens sometimes that you think, okay, I have three boundaries, so
I'm going to get three partitions, which is not correct. So for example,
let me just remove one of those, and let's say I have only three file
groups, and let's go and execute this one over here. Now we are getting an
error. It says the partition function generates more partitions than there
are file groups. And that is correct, because our logic splits the data
into four partitions, and now we are giving SQL only three file groups. So
we have to add the plus one. And one more thing: SQL will not check whether
you are mapping things correctly to the file groups, because it doesn't
care about the naming of those file groups. So for example, if you go and
put this one at the end, what's going to happen? It's going to be a big
problem. All the data of 2023 is going to be stored inside 2024, 2024 is
going to be in 2025, so everything is going to be mixed up, and SQL will do
exactly what you tell it. So that's why you should make sure you have the
correct order. So that's it. Let's go and create our scheme. So it is
working.
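The scheme described above, together with a metadata check of the kind the video runs afterwards, might be sketched like this. The names SchemePartitionByYear and PartitionByYear are assumed spellings of "scheme partition by year" and "partition by year". The partition function is created earlier in the course, outside this passage; the commented-out definition is an assumption that is consistent with three boundaries and with 2023-and-earlier data landing in the first partition.

```sql
-- Assumed definition from an earlier step (shown for context only):
-- CREATE PARTITION FUNCTION PartitionByYear (DATE)
-- AS RANGE LEFT FOR VALUES ('2023-12-31', '2024-12-31', '2025-12-31');

-- Step 4: map partitions 1..4 to filegroups. The list is positional:
-- three boundaries produce four partitions, so four filegroups are needed,
-- and SQL Server does not check that the names match the years.
CREATE PARTITION SCHEME SchemePartitionByYear
AS PARTITION PartitionByYear
TO (FG_2023, FG_2024, FG_2025, FG_2026);

-- Check which partition number is mapped to which filegroup.
SELECT
    ps.name            AS partition_scheme,
    pf.name            AS partition_function,
    dds.destination_id AS partition_number,
    fg.name            AS filegroup_name
FROM sys.partition_schemes AS ps
JOIN sys.partition_functions AS pf
    ON pf.function_id = ps.function_id
JOIN sys.destination_data_spaces AS dds
    ON dds.partition_scheme_id = ps.data_space_id
JOIN sys.filegroups AS fg
    ON fg.data_space_id = dds.data_space_id
ORDER BY ps.name, dds.destination_id;
```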
This is very simple. We just mapped the partitions to the file groups. And
as usual, we check things after creating them, and I have prepared here a
really nice query on the metadata in order to see the whole thing: the
functions, the file groups, the schemes. You can of course add the data
files to it, but I'm just going to stick with this. So again, in SQL Server
we have a dedicated table for the partition schemes. Then I'm just joining
it with the functions, and then with the destination data spaces in order
to get the partition number and the file groups. So let's go and execute
it. And now we can see very nicely the scheme that we have created and the
name of the partition function. And then we can see the partition number
and the file group name. So we can see how things are mapped together. So
if you get it like this, then so far everything is
good. All right. So far we have prepared all the layers, so the setup is
ready to be used in any table. We have the function, the files, the file
groups and the scheme, and everything is ready. But we are still not using
it. The logic just exists and the files are empty. So now what we're going
to do is create a table, but not a normal one, a partitioned table. So
let's go and do that. It's very simple as well. So CREATE TABLE, and we
have to give it a name. Let's put it in the Sales schema as well: orders,
and I'm just going to add partitioned to the name. Now we just have to
define a few columns inside this table. So let's get an order ID with the
data type INT. And let's get an order date with the data type DATE. And
maybe just one more called sales, with the data type INT. So this is a very
normal table that we create in databases. But it's not yet partitioned. Now
in order to use
everything that we have defined, we're going to do the following. We're
going to say ON, and now we have to tell SQL only the name of the partition
scheme. Everything else is connected and mapped together, because the
scheme maps the function to the file groups, and the file groups are mapped
to the data files. So here in the table we just have to give the name of
the scheme. The name of the partition scheme is scheme partition by year.
And now it's very important to give a column. Since the whole logic of the
function is based on a date, we cannot specify here, for example, the order
ID or sales, because it makes no sense. We're going to pick the order date
and put it over here. And with that, we have created a partitioned table.
So now what we're going to do is start inserting data into our table.
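The table definition described above might look like the following. The table and column names (Sales.Orders_Partitioned, OrderID, OrderDate, Sales) are assumed spellings of the spoken names, and the Sales schema is assumed to exist already.

```sql
-- Step 5: an ordinary table definition; the ON clause is what makes it
-- partitioned. It names the partition scheme and the column whose value
-- is passed to the partition function (it must match the function's
-- DATE parameter).
CREATE TABLE Sales.Orders_Partitioned
(
    OrderID   INT,
    OrderDate DATE,
    Sales     INT
) ON SchemePartitionByYear (OrderDate);
```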
So let's go and do that. We're going to say INSERT INTO sales orders
partitioned, and we're going to pick values like this. So one, and then
let's take any date in 2023, for example May, the middle of the month, and
the sales could be anything, let's say 100. So let's go and execute this,
and let's go and query our table. So it is this one over here. All right.
So now we have one record inside our partitioned table. And now the big
question is: in which partition, in which data file, did SQL store this
record? We have to test whether everything is working fine. In order to do
that, I have prepared a query as well. We are again querying the partitions
table together with the destination data spaces, where we get the number of
rows in each partition, and then we have the file group, and we are
focusing on our table orders partitioned. So let's go and execute this one.
And now we can see very easily that we have the four partitions. Our new
record is inserted in the correct place, in the 2023 file group and in the
correct partition.
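A sketch of the first insert and of a row-count check per partition. The exact date is not stated in the video beyond mid-May 2023, so '2023-05-15' is an assumption. The check query is a variant of the one described: it additionally goes through sys.indexes so that the partition numbers are matched against the scheme this table actually uses.

```sql
-- First test row: a date in 2023 should land in partition 1 / FG_2023.
INSERT INTO Sales.Orders_Partitioned (OrderID, OrderDate, Sales)
VALUES (1, '2023-05-15', 100);

-- Rows per partition, with the filegroup each partition is stored in.
SELECT
    p.partition_number,
    fg.name AS filegroup_name,
    p.rows  AS row_count
FROM sys.partitions AS p
JOIN sys.indexes AS i
    ON i.object_id = p.object_id
   AND i.index_id  = p.index_id
JOIN sys.destination_data_spaces AS dds
    ON dds.partition_scheme_id = i.data_space_id
   AND dds.destination_id      = p.partition_number
JOIN sys.filegroups AS fg
    ON fg.data_space_id = dds.data_space_id
WHERE p.object_id = OBJECT_ID('Sales.Orders_Partitioned')
ORDER BY p.partition_number;
```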
So with that we make sure our function and the whole logic that we have
built is working correctly. So now let's go and add more records. I'm just
going to duplicate it. Record number two. And I'm going to pick a date in
2024. And this one is going to be like 20. Let's just change the value, so
50. Let's go and execute it.
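Besides inserting rows and reading the metadata, one way to predict where a date will land is the built-in $PARTITION function, which the video does not use in this passage. The sketch below assumes the partition function is named PartitionByYear and is defined as RANGE LEFT with year-end boundaries; the 2024 date is an arbitrary example.

```sql
-- Ask the partition function directly which partition a value maps to.
-- Under the assumed RANGE LEFT definition with boundaries 2023-12-31,
-- 2024-12-31 and 2025-12-31, these return 2, 3 and 4.
SELECT
    $PARTITION.PartitionByYear('2024-07-20') AS mid_2024,
    $PARTITION.PartitionByYear('2025-12-31') AS last_day_2025,
    $PARTITION.PartitionByYear('2026-01-01') AS first_day_2026;

-- The same function can group the rows already stored in the table.
SELECT
    $PARTITION.PartitionByYear(OrderDate) AS partition_number,
    COUNT(*)                              AS row_count
FROM Sales.Orders_Partitioned
GROUP BY $PARTITION.PartitionByYear(OrderDate)
ORDER BY partition_number;
```

Checking the exact boundary dates this way is a cheap test of whether RANGE LEFT or RANGE RIGHT behaves the way you expect.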
And now we have a second row inside our table. And again the big question
is whether it is working. So let's go and execute this again. And now we
can see our record is inserted in partition 2, in the file group 2024,
which is correct. Now let's go and check whether the boundaries are working
correctly. Here in the third row I'm going to use the last day of 2025. So
it's going to be month 12 and the last day. So 20. Let's go and insert it
and check our table. So we have a new record. And now let's go and check.
My expectation here is that this row is going to be inserted in the file
group 2025. So let's go and execute. And that is correct. As you can see,
the record is inserted in the correct partition. And it is really important
to test whether the boundaries are working correctly, because it's a little
bit tricky. You have this range left, range right, the boundaries and so
on. So you can do it like this to check whether your logic is working as
expected. And the last one, I'm just going to do it very fast. So let's do
2026. And I'm going to pick the first day of this
year. So let's go and insert it. And now what is the expectation? I think
it is pretty simple. So let's go and query. And the first day of this year
is inserted in partition number four. So I can say everything is working
correctly. If you get it like this, then you have successfully created a
partitioned table and you have prepared all the layers of this partitioning
correctly. I know this is a lot of work, but to be honest it is fun,
because for the first time in a database you feel like you are controlling
stuff. Usually in databases everything happens behind the scenes, and you
don't know exactly where the files of your tables are stored and so on.
There is a lot of abstraction in databases, but here we are getting deep
into the database and we are controlling and managing all those files, and
sometimes it's nice to have this freedom and flexibility. All right. One
quick thing that I would like to show you: if you go to the database in the
Object Explorer and then to Storage over here and expand it, you can easily
find information about the partitions. So over here we can find our
partition scheme and the partition function that we have created. It is
just a quick access instead of querying the
metadata. So now let's have a quick summarize how everything is connected
summarize how everything is connected together. So we have a table and then we
together. So we have a table and then we specify for scale that is connected to a
specify for scale that is connected to a partition scheme and in the partition
partition scheme and in the partition scheme we have everything connected. It
scheme we have everything connected. It is linked to a specific partition
is linked to a specific partition function and there we have the
function and there we have the partitions and at the same time it is
partitions and at the same time it is connected to file groups and the file
connected to file groups and the file groups are connected to the data files.
groups are connected to the data files. So as you can see all those layers and
So as you can see all those layers and elements are connected together. Now
elements are connected together. Now let's see how this works. So we have
let's see how this works. So we have inserted the last day of 2025 and now
inserted the last day of 2025 and now the first thing that's going to happen
the first thing that's going to happen the partition function going to decide
the partition function going to decide to which partition it belongs. So as you
to which partition it belongs. So as you can see it is a boundary value and since
can see it is a boundary value and since we have defined it as a lift it going to
we have defined it as a lift it going to target the left partition the partition
target the left partition the partition three and then the partition scheme
three and then the partition scheme going to connect it to the right file
going to connect it to the right file group and in this scenario it's going to
group and in this scenario it's going to be the file group 2025 and we have here
be the file group 2025 and we have here only one file so it going to as well go
only one file so it going to as well go to the correct data file and in this
to the correct data file and in this file the SQL going to store this row so
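The boundary behavior described here can be checked directly with the
$PARTITION function, without any table. This is a minimal sketch: the function
names DemoRangeLeft and DemoRangeRight are hypothetical, and the boundaries are
assumed to be the last day of each year. The RANGE RIGHT variant is added only
for contrast.

```sql
-- Same boundaries, defined once as LEFT and once as RIGHT.
CREATE PARTITION FUNCTION DemoRangeLeft (DATE)
AS RANGE LEFT FOR VALUES ('2023-12-31', '2024-12-31', '2025-12-31');
GO
CREATE PARTITION FUNCTION DemoRangeRight (DATE)
AS RANGE RIGHT FOR VALUES ('2023-12-31', '2024-12-31', '2025-12-31');
GO
SELECT
    -- LEFT: the boundary value belongs to the partition on its left.
    $PARTITION.DemoRangeLeft(CAST('2025-12-31' AS DATE))  AS BoundaryWithLeft,   -- 3
    -- RIGHT: the same value would belong to the partition on its right.
    $PARTITION.DemoRangeRight(CAST('2025-12-31' AS DATE)) AS BoundaryWithRight,  -- 4
    -- One day later falls into the last partition under LEFT.
    $PARTITION.DemoRangeLeft(CAST('2026-01-01' AS DATE))  AS NextDayWithLeft;    -- 4
GO
DROP PARTITION FUNCTION DemoRangeLeft;
DROP PARTITION FUNCTION DemoRangeRight;
```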
So it is pretty easy. And now we come to a very important part, where we can
understand how the partitions are really improving the performance of my
query, and of course we can do that by checking the execution plan. So now, in
order to compare the behavior with and without the partition, what we have to
do is create a mirror table without partitions. We have our table here, the
partitioned one. What I'm going to do is go over here and say INTO, and we're
going to call it sales orders no partition. So we are taking the data and the
structure from the partitioned orders table, and of course the new table will
not be partitioned. So let's go and execute it. Now if you go over here, we can
see that we have two tables: the no-partition one and the partitioned one. So
now what we're going to do is write a query on both tables and then compare the
execution plans. First let's start with the no-partition table. We select from
it, and now, in order to see the effect of the partition, we're going to say
WHERE the order date is equal to, and now we're just going to pick a value like
2026, the 1st of January. So let's go and query it, and we're going to do the
same thing in a new query, but this time for the partitioned table. Now, in
order to see the execution plan, make sure to activate it: we go to the action
bar over here and choose Include Actual Execution Plan. So let's click on it
and execute. And with that we have an execution plan here. And let's do the
same thing for the no-partition table. So execute, and we have an execution
plan here. So now let's check what we have in the execution plan. We're going
to focus on this one over here. So right-click on it and then go to Properties.
And now we can see a lot of details about the execution plan, but what is
interesting is the number of rows. As you can see, we are reading four rows.
That means the whole table. And of course we have here the CPU and the other
costs. Now let's go and check the partitioned table. So let's click over here.
Now if you check over here, you can see that the total number of rows is one.
So SQL didn't read all four rows. It read only one row, and that's because we
have only one row in this partition. And as you can see, the number of
partitions that are used is as well only one. So as you can see, using
partitions we have reduced the number of rows that are retrieved from the
files.
Now let's go and retrieve data from two different partitions and check the
execution plan. So let's also target 2025, the last day of the year, like this.
So let's go and execute it, and the same thing for the other query. Let's
check the one without partitions: we are still reading four rows. But now if
you go to the other one, check the execution plan, and check the table scan,
you can see we are reading only two rows, and this time the number of
partitions involved in this query is two, and that's because we have the
partitions for 2025 and 2026. So as you can see, it's worth the effort. We have
optimized our queries, and this has a great impact on big tables: the amount of
resources and the number of reads are going to be reduced massively.
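The comparison from this walkthrough can be reproduced roughly as follows. All
object names and sample rows are assumptions for illustration, and all
partitions sit on PRIMARY to keep the script self-contained. Run it in SSMS
with Include Actual Execution Plan enabled and compare the rows read by the
scans on the two tables.

```sql
-- Setup: a small partitioned table (hypothetical names and data).
CREATE PARTITION FUNCTION DemoPartitionByYear (DATE)
AS RANGE LEFT FOR VALUES ('2023-12-31', '2024-12-31', '2025-12-31');
GO
CREATE PARTITION SCHEME DemoSchemeByYear
AS PARTITION DemoPartitionByYear ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.DemoOrdersPartitioned (
    OrderID INT, OrderDate DATE, Sales INT
) ON DemoSchemeByYear (OrderDate);
GO
INSERT INTO dbo.DemoOrdersPartitioned (OrderID, OrderDate, Sales)
VALUES (1, '2023-05-15', 100), (2, '2024-07-20', 50),
       (3, '2025-12-31', 20),  (4, '2026-01-01', 100);
GO
-- Mirror table: SELECT ... INTO copies structure and data, but the new table
-- is created on the default filegroup, so it is not partitioned.
SELECT OrderID, OrderDate, Sales
INTO dbo.DemoOrdersNoPartition
FROM dbo.DemoOrdersPartitioned;
GO
-- One date: the partitioned table only needs to touch one partition.
SELECT OrderID, OrderDate, Sales FROM dbo.DemoOrdersNoPartition
WHERE OrderDate = '2026-01-01';
SELECT OrderID, OrderDate, Sales FROM dbo.DemoOrdersPartitioned
WHERE OrderDate = '2026-01-01';

-- Two dates in two different partitions.
SELECT OrderID, OrderDate, Sales FROM dbo.DemoOrdersNoPartition
WHERE OrderDate IN ('2025-12-31', '2026-01-01');
SELECT OrderID, OrderDate, Sales FROM dbo.DemoOrdersPartitioned
WHERE OrderDate IN ('2025-12-31', '2026-01-01');
GO
-- Clean up the demo objects.
DROP TABLE dbo.DemoOrdersNoPartition;
DROP TABLE dbo.DemoOrdersPartitioned;
DROP PARTITION SCHEME DemoSchemeByYear;
DROP PARTITION FUNCTION DemoPartitionByYear;
```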
All right, my friends. So that's all about partitions in SQL. It is amazing,
and you can use it not only in databases but as well in many other data
platforms and tools, where you can always divide your data in order to optimize
the performance. Now, for the next step, here is what I have prepared for you:
after 15 years of working on real projects using SQL, I have a lot of best
practices and tips for you. So I have collected everything that I know, and now
I'm going to show you the best practices, tips, and tricks that I can give you
in order to optimize the performance in SQL. So let's go.
And now, before we deep dive into the 30 best practices, I'm going to give you
the golden rule. The SQL optimizer responds differently to different sizes of
tables. That means if you have small and medium tables, like hundreds of
thousands of rows, you might not notice any performance difference if you are
following the best practices, and that's because the size of the data is small.
But if you have millions or hundreds of millions of records in your tables,
you will immediately notice how much faster things can be if you follow the
best practices. And here is my golden rule. If you get any best practice from
me, or let's say you are reading something on the internet, you always have to
test it using the execution plan. For example, if you have two queries that
return the same result, I recommend that you check the execution plans. If you
notice there are no differences between them in the execution plan, then pick
the one that is easier to read and to understand, because sometimes if you are
following the best practices for performance, your query might be a little bit
more complicated. So always write the query to be understandable, and only
optimize it if you notice it is slow. So the golden rule here is: always test.
If you find that you are improving the performance with the new query, then
pick that one, and if there is no gain in performance, then focus on making
your queries readable. So this is the golden rule: always test, test, test
using the execution plan.
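One possible way to put the golden rule into practice, next to the graphical
execution plan, is to let SQL Server print the reads and timings for each
variant. This is a complementary sketch and not the method shown in the video:
SET STATISTICS IO and SET STATISTICS TIME are standard T-SQL settings, while
the temp table and its rows are hypothetical.

```sql
-- Hypothetical sample data.
CREATE TABLE #Orders (
    OrderID     INT PRIMARY KEY,
    CustomerID  INT,
    OrderStatus VARCHAR(20)
);
INSERT INTO #Orders VALUES
    (1, 1, 'Delivered'), (2, 2, 'Shipped'), (3, 3, 'Delivered');

-- Print logical reads and CPU/elapsed time on the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant A
SELECT OrderID FROM #Orders WHERE CustomerID = 1 OR CustomerID = 2;

-- Variant B: same result, written differently
SELECT OrderID FROM #Orders WHERE CustomerID IN (1, 2);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

DROP TABLE #Orders;
```

If both variants show the same reads and the same plan, the rule above says to
keep the one that is easier to read.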
So let's deep dive into the best practices, and we're going to start by
optimizing the performance of our queries. All right, let's start with the easy
stuff. The first step is: select only what you need. What I usually see in many
queries is that developers just go and select all the columns from a table,
and I can tell you, I cannot think of one scenario where you need all the
columns of a table in one query. So for sure we will get unnecessary columns in
the result, and of course reading unnecessary information is going to make your
query slower. So this is usually a bad practice. Don't use SELECT *; instead,
list all the columns that you need for your query. So make sure that you only
select what you need. Don't go and select all the columns from a table, and
with that you don't risk reading unnecessary information from the database. So
always make sure that you select exactly what you need for a query; don't go
with a star.
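As a small illustration of this tip, here is a sketch on a hypothetical
#Customers temp table (the table, columns, and rows are assumptions, not the
course's data):

```sql
CREATE TABLE #Customers (
    CustomerID INT PRIMARY KEY,
    FirstName  VARCHAR(50),
    LastName   VARCHAR(50),
    Country    VARCHAR(50),
    Score      INT
);
INSERT INTO #Customers VALUES
    (1, 'Anna', 'Goldberg', 'Germany', 350),
    (2, 'Max',  'Lang',     'USA',     900);

-- Bad practice: returns every column, needed or not.
SELECT * FROM #Customers;

-- Good practice: list only the columns the query needs.
SELECT CustomerID, FirstName, LastName
FROM #Customers;

DROP TABLE #Customers;
```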
Okay. Tip number two: avoid unnecessary DISTINCT and ORDER BY. I have noticed
that many developers, as they are writing a lot of queries, tend to add
DISTINCT and ORDER BY to each query by default. And as we review the code and
discuss it with the developer, we see that we really don't need to remove any
duplicates in the query, because there are no duplicates, and it was only a
habit to remove the duplicates using DISTINCT. And the same thing goes for the
ORDER BY: in many situations there is no need to sort the data at all. And
those operations, removing the duplicates with DISTINCT and sorting the data,
are very expensive operations in your execution plan. So they're going to take
a lot of resources and slow down your query. So it is considered a bad
practice if you always go and use DISTINCT even though it's not needed, or you
are using ORDER BY to sort the data when it is not necessary. So the best
practice here is to avoid them. Use DISTINCT or ORDER BY only if it is
necessary.
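A minimal sketch of this tip, on a hypothetical #Orders temp table (names and
rows are assumptions). OrderID is the primary key here, so the result cannot
contain duplicates and DISTINCT has nothing to remove.

```sql
CREATE TABLE #Orders (
    OrderID     INT PRIMARY KEY,
    CustomerID  INT,
    OrderStatus VARCHAR(20)
);
INSERT INTO #Orders VALUES
    (1, 2, 'Delivered'), (2, 1, 'Shipped'), (3, 2, 'Delivered');

-- Bad practice: DISTINCT and ORDER BY added out of habit.
SELECT DISTINCT OrderID, CustomerID
FROM #Orders
ORDER BY CustomerID;

-- Good practice when neither deduplication nor sorting is required.
SELECT OrderID, CustomerID
FROM #Orders;

DROP TABLE #Orders;
```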
Okay. The next one: for exploration purposes, limit the rows. Sometimes,
especially if you are working with a new database, you would like to explore
the tables, just to have a quick peek at the content of the tables. And if your
database has a lot of big tables with millions of rows and so on, you will be
consuming a lot of resources if you just select the data like this. So now
imagine that the orders table has like 100 million rows. As you run this query,
the database has to fetch all 100 million for you. And usually for exploration
it's enough to see like 10 rows. That's why it is considered a bad practice to
explore tables without a LIMIT or TOP. So a good practice would be to say
SELECT TOP 10 and then have the same query. So if you go over here, you will
get only 10 rows, and the database will not fetch 100 million; it can fetch
only 10 rows. And now if you are exploring a lot of tables, you will not
consume a lot of resources from the database.
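For illustration, a quick peek with TOP could look like this. The #Orders temp
table and its rows are hypothetical; TOP is the SQL Server syntax, while
databases such as MySQL and PostgreSQL use LIMIT at the end of the query
instead.

```sql
CREATE TABLE #Orders (
    OrderID     INT PRIMARY KEY,
    CustomerID  INT,
    OrderStatus VARCHAR(20)
);
INSERT INTO #Orders VALUES
    (1, 2, 'Delivered'), (2, 1, 'Shipped'), (3, 2, 'Delivered');

-- Bad practice for exploration: fetches the whole table.
SELECT OrderID, CustomerID, OrderStatus
FROM #Orders;

-- Good practice: stop after 10 rows. Without ORDER BY, which rows come back
-- is not guaranteed, which is fine for a quick look at the content.
SELECT TOP 10 OrderID, CustomerID, OrderStatus
FROM #Orders;

DROP TABLE #Orders;
```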
So if you are exploring, always limit the number of rows that you are
retrieving. All right. So now we're going to talk about how to optimize the
filtering in SQL. The tip here is to create a nonclustered index on columns
that are frequently used in the WHERE clause. Now of course you have to check
your queries and so on, and if you see that you are frequently filtering the
data using the order status, then it makes sense to create a nonclustered index
for this column in order to improve the performance of your query. So for this
situation I'm going to go and create a nonclustered index on the sales orders
table for the order status. Once you create it, you are improving the
performance of your query.
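A sketch of this tip on a hypothetical #Orders temp table. The index name
Idx_Orders_OrderStatus and the sample rows are assumptions; on the real table
the same statement would target the sales orders table instead.

```sql
CREATE TABLE #Orders (
    OrderID     INT PRIMARY KEY,
    CustomerID  INT,
    OrderStatus VARCHAR(20)
);
INSERT INTO #Orders VALUES
    (1, 2, 'Delivered'), (2, 1, 'Shipped'), (3, 2, 'Delivered');

-- Nonclustered index on the column that is filtered frequently.
CREATE NONCLUSTERED INDEX Idx_Orders_OrderStatus
ON #Orders (OrderStatus);

-- A filter on OrderStatus is now a candidate for using that index.
SELECT OrderID, OrderStatus
FROM #Orders
WHERE OrderStatus = 'Delivered';

DROP TABLE #Orders;
```

On a table this small the optimizer may still prefer a scan, so the effect is
best checked in the execution plan of a realistically sized table.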
Okay. The next one is: avoid applying functions to columns in the WHERE
clause. In many cases what we usually do is transform the columns before
filtering the data. For example, here I'm applying the function LOWER to the
order status, because I'm searching for the value 'delivered' and I'm not sure
about the values in the table, whether they are in camel case or uppercase or
anything. So in order to make sure that I'm going to find the value, I go and
say LOWER of the order status and then give it a lowercase value, and of course
it's going to work. So if we go and search for it, as you can see, we have here
the status Delivered, and the value is different from the one I used, because
here we have a capital first character. But here we have a problem. We have an
index on the order status, and now if you use any function, like for example
LOWER here, SQL will not use the index. That means the whole index is now
useless, and SQL is not using it. That's why we consider it a bad practice to
use functions in the WHERE clause. Instead, the good practice is to not use any
function and to write exactly the value that is used inside your data, and with
that SQL is going to be happy and use the index that you have created.
Okay, let's have another example of this rule. Here we are selecting all the
customers whose first name starts with an A. For that we can go and use the
function SUBSTRING in order to get the first character of the first name, and
once you match it with 'A', you will get the result, and here we have Anna. And
this is again bad if you have an index on the first name, and that's because we
are applying a function to the column. So this is considered a bad practice,
and instead we can go and use the help of LIKE. We can search for this pattern
where it starts with an A and then we have a wildcard; we don't care about the
rest. So it must start with A. If you go and execute it, you will get the same
result. So try as much as you can to avoid functions in the WHERE clause in
order to hit the index and get it working. In many scenarios we have a
workaround in order to filter without transforming the column. So try your best
to avoid using functions if your columns have an index.
All right, one more example that you see a lot in queries: filtering by the
year. We are searching for the orders that happened in 2025, and we usually go
and use YEAR of the order date. Now, if you have an index on the order date,
this again will not be working, because you are using the function YEAR. So
this is considered a bad practice. Instead of using the YEAR function, you can
go and use BETWEEN. So we don't apply a function to the order date, and we say
the order date is between the boundaries of the year. Of course, now our query
is not looking as cool and easy as the first one, but still, with the second
one we are hitting the index. So again, while you are filtering, try not to use
functions on the columns, because it is really a waste if you have an index and
you are not using it, and in most cases you have a workaround for your
function. So those are the three examples that I wanted to show you about this
tip.
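The three examples of this tip can be sketched side by side as follows. The
temp tables, index names, and rows are hypothetical stand-ins for the orders
and customers tables used in the video.

```sql
CREATE TABLE #Orders (
    OrderID     INT PRIMARY KEY,
    OrderDate   DATE,
    OrderStatus VARCHAR(20)
);
CREATE TABLE #Customers (
    CustomerID INT PRIMARY KEY,
    FirstName  VARCHAR(50)
);
INSERT INTO #Orders VALUES
    (1, '2025-03-10', 'Delivered'), (2, '2026-01-01', 'Shipped');
INSERT INTO #Customers VALUES (1, 'Anna'), (2, 'Max');

CREATE NONCLUSTERED INDEX Idx_Orders_OrderStatus ON #Orders (OrderStatus);
CREATE NONCLUSTERED INDEX Idx_Orders_OrderDate   ON #Orders (OrderDate);
CREATE NONCLUSTERED INDEX Idx_Customers_First    ON #Customers (FirstName);

-- Example 1. Bad: the column is wrapped in LOWER().
SELECT OrderID FROM #Orders WHERE LOWER(OrderStatus) = 'delivered';
-- Good: compare the bare column with the value as it is stored.
SELECT OrderID FROM #Orders WHERE OrderStatus = 'Delivered';

-- Example 2. Bad: SUBSTRING() on the column.
SELECT CustomerID FROM #Customers WHERE SUBSTRING(FirstName, 1, 1) = 'A';
-- Good: LIKE with a trailing wildcard.
SELECT CustomerID FROM #Customers WHERE FirstName LIKE 'A%';

-- Example 3. Bad: YEAR() on the column.
SELECT OrderID FROM #Orders WHERE YEAR(OrderDate) = 2025;
-- Good: a range on the bare column. BETWEEN is inclusive, which fits a DATE
-- column; for columns with a time part, >= start AND < next year is safer.
SELECT OrderID FROM #Orders
WHERE OrderDate BETWEEN '2025-01-01' AND '2025-12-31';

DROP TABLE #Orders;
DROP TABLE #Customers;
```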
All right, moving on to a similar one. It says: avoid leading wildcards, as
they prevent index usage. Let's say, for example, I'm searching for the word
'gold' inside the last name. Here we have to be careful about what we are
searching for. Should 'gold' exist somewhere in the last name, or are we only
searching for last names that start with 'gold'? If we are searching only for
last names that start with 'gold', then we are doing it wrong here. In SQL, if
you are using a leading wildcard, SQL will not be using the index. But if you
are using the wildcard at the end, the trailing one, this is fine and will not
prevent SQL from using the index. So this is considered a bad practice, because
you will not be hitting the index. It is better not to use the wildcard as a
leading one, and if that's enough for your search, then with that you are
hitting and using the index.
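A short sketch of the two patterns, using a hypothetical #Customers temp table
and index name:

```sql
CREATE TABLE #Customers (
    CustomerID INT PRIMARY KEY,
    LastName   VARCHAR(50)
);
INSERT INTO #Customers VALUES
    (1, 'Goldberg'), (2, 'Marigold'), (3, 'Lang');

CREATE NONCLUSTERED INDEX Idx_Customers_LastName ON #Customers (LastName);

-- Leading wildcard: matches 'gold' anywhere, so the sorted order of the
-- index cannot be used to narrow the search.
SELECT CustomerID, LastName
FROM #Customers
WHERE LastName LIKE '%Gold%';

-- Trailing wildcard only: matches names that start with 'Gold', which lets
-- the index narrow the search to one range of values.
SELECT CustomerID, LastName
FROM #Customers
WHERE LastName LIKE 'Gold%';

DROP TABLE #Customers;
```

The two queries answer different questions, so the trailing form is only a
replacement when "starts with" is what the search really needs.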
Okay, moving on to the next one. It says: use IN instead of multiple ORs. The
OR operator is very evil for performance, so try to avoid using it. It really
kills your performance, whether it is in the filters or joins and so on. So now
we want to show the orders where the customer is equal to one or two or three.
And of course this is considered a bad practice, and it is hard to read and so
on. Please don't do that. Instead we have the IN operator, and we are saying:
if the customer is one of those values, then show the orders. If you go and run
it, you will get exactly the same results, and it not only looks nicer than the
first query, it can as well give you better performance. So if you find
yourself writing a lot of ORs, think about the IN operator. So those are the
best practices for filtering data to improve the performance.
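The two forms from this tip, sketched on a hypothetical #Orders temp table.
Whether IN is actually faster depends on the optimizer, which may treat both
forms the same way, so in the spirit of the golden rule it is worth comparing
the execution plans of the two queries.

```sql
CREATE TABLE #Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT
);
INSERT INTO #Orders VALUES (1, 1), (2, 2), (3, 3), (4, 4);

-- Harder to read: one OR per value.
SELECT OrderID, CustomerID
FROM #Orders
WHERE CustomerID = 1 OR CustomerID = 2 OR CustomerID = 3;

-- Same result with IN.
SELECT OrderID, CustomerID
FROM #Orders
WHERE CustomerID IN (1, 2, 3);

DROP TABLE #Orders;
```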
Okay, so now we're going to focus on how
to optimize joining tables in SQL. So
the first tip here is to understand the
speed of joins and to use inner join
when it's possible. Well, as we learned
before, we have different types of
joins. We have the inner, left, right,
and full outer join. And if we talk about
performance, the best performance you
will get is from the inner join. And that's
because SQL is going to work only on the
matching rows. That means the effort and
the processing time are better than for the
other joins. Now next in the
ranking we have the left and right
joins. They are slightly slower than the
inner join because usually they process
more data and more rows than the inner
join, because SQL will work not only with
the matching rows but also with the
unmatching rows. So for right and left,
SQL has to do more stuff than for the inner
join. And now the worst type of join: we
have the full outer join. And that's
because this type works with the biggest
number of rows compared to the other
types. It's going to present unmatching
rows from the left and from the right
tables. So that means SQL has a lot
to do, and that's why this join has the
worst performance. So here my advice is:
always try to use the inner join if it's
enough to work with the matching rows,
and if the matching rows are not enough,
then go with the left join maybe. But
try your best always to bring the inner
join instead of the left join. But don't
forget: inner join filters the data.
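To make the difference between the join types concrete, here is a minimal sketch. The tables `Sales.Customers` and `Sales.Orders` and their columns are assumed names, not confirmed by this part of the transcript.

```sql
-- Assumed tables: Sales.Customers(CustomerID, FirstName), Sales.Orders(OrderID, CustomerID)

-- INNER JOIN: only customers that have at least one matching order
SELECT c.CustomerID, c.FirstName, o.OrderID
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID;

-- LEFT JOIN: every customer; OrderID is NULL when there is no match
SELECT c.CustomerID, c.FirstName, o.OrderID
FROM Sales.Customers AS c
LEFT JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID;

-- FULL OUTER JOIN: unmatched rows from both sides are kept
SELECT c.CustomerID, c.FirstName, o.OrderID
FROM Sales.Customers AS c
FULL OUTER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID;
```

The three queries are not interchangeable: the inner join drops customers without orders, which is the filtering effect the speaker warns about.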
Okay. The next one says: use the explicit
join, the ANSI join, instead of the implicit
join. Well, it is considered a bad
practice if you join tables like this,
the implicit join or the non-ANSI join.
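The query shown on screen is not part of the transcript. A sketch of the two styles, using assumed table and column names, could look like this:

```sql
-- Assumed tables: Sales.Customers(CustomerID, FirstName), Sales.Orders(OrderID, CustomerID)

-- Implicit (non-ANSI) join: tables listed in FROM, join condition hidden in WHERE
SELECT o.OrderID, c.FirstName
FROM Sales.Customers AS c, Sales.Orders AS o
WHERE c.CustomerID = o.CustomerID;

-- Explicit (ANSI) join: join type and join condition stated in the FROM clause
SELECT o.OrderID, c.FirstName
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID;
```

In the explicit form the join condition stays in `ON`, so the `WHERE` clause is left free for real filters.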
It's better to use the normal modern
join, where you use the inner join, for
example. About the performance, there are
no differences between them. And
for this scenario, it's very simple. But
if you have a complex query, then
joining tables like this might be very
confusing and really hard to read, and
complex to optimize as well. That's why the
best practice says go with the normal
inner join. So go with the ANSI join
instead of the non-ANSI join. Okay. To the
next tip: make sure to index the columns
used in the ON clause. So we have to go
and make sure that both of those columns
have an index, because indexes speed up
the lookup process. Without an index,
SQL might go and do a full table
scan. Without an index on those columns,
the database might go and scan the
entire tables in order to find a match.
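A minimal sketch of this tip, assuming the join runs on `CustomerID` and that only the orders side is missing an index. The index name `IX_Orders_CustomerID` and the table names are hypothetical.

```sql
-- The join whose ON columns should be indexed on both sides
SELECT o.OrderID, c.FirstName
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID;

-- Assumption: Sales.Customers.CustomerID already has a clustered index,
-- while the foreign key column in Sales.Orders has none yet.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON Sales.Orders (CustomerID);
```

After creating the index, checking the execution plan again shows whether the optimizer actually uses it.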
And that is really slow if you have big
tables. So now if you go to the
customers over here and then to the
indexes, we can see that we have an
index, a clustered index, for the
customer ID. But if you check the
customer ID in the orders, we don't have
an index for that. So this one doesn't
have an index. So in order to fix that,
we're going to go and create a
nonclustered index on the table orders
for the customer ID, since it is a
foreign key. So once we do that, we
now have an index for both of those columns,
and with that our join is going to be
faster. Okay. So now we come to a tip
where we say it really depends; there
is not one clear way to do
it. But let's say if you have
big tables, it is better to filter data
before joining. And here we have
three different scenarios that are going to
deliver the same results. But of course
the question is which one is the best
for performance. So now let's have a
look at them. What we are doing here:
we are just joining two tables, and then we
are filtering the result based on the
order status that comes from the orders.
So in the first query, what we are doing:
we are first joining the tables, and at the
end we are using the WHERE clause in order
to filter the data. So by looking at
this, we are just filtering the data
after joining the tables. But there is
another way to do it. You can go
and join the tables, but in the join
condition you can go and add this: order
status equals delivered. So we are
matching the data by the customer ID and
at the same time we are filtering the
data by the order status, since we are
using the inner join. So the filtering
is happening during the join. Or you can
do it like this, where we have more
stuff to add, where we don't join
the table directly with the orders. We
first prepare the table orders before
joining it with the customers. And here
our preparation is: we are just selecting
the columns that we need, and we are
already filtering the data before doing
the join, using the subquery. But if you
run all those queries, you will get the
exact same results. And of course there
is another way to do it: you can
go and prepare the data not in a subquery;
you can go and use a CTE and then join
the result of the CTE with the table
customers. So now about the performance:
if your query is small, not that
complex, and you don't have
big data inside your tables, all those three
queries are going to deliver the same
performance. I know it might sound
weird, because here we are filtering
after joining, or here we are filtering
during the join. Normally in databases
the SQL optimizers are now very smart; they
can understand that there is a filter
here and decide on the best execution
plan for you. So actually wherever you
put your filter, after, during, or before,
SQL is smart enough to do it
correctly. So if you don't have a complex
query and you don't have big
tables, go with the one that suits you.
And I really recommend you go with
the first one because it's logical and
easier to understand. But if you have
big tables and complex queries, the best
practice says try always to prepare the
data before joining it. So try to
isolate and abstract the pre-step in a
subquery or in a CTE before joining it
with any other tables. And in many
scenarios in my projects where I have a
big table, this did help: the
execution plan was better if I isolated
and prepared the data before joining it.
So if you have small or medium tables,
go with the normal way, use the WHERE
clause. But if you have complex big
tables, prepare the data in a subquery or
CTE and then join it with the tables.
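The variants discussed in this tip can be sketched as follows. The table and column names and the status value `'Delivered'` are assumptions based on the narration, and the CTE name `DeliveredOrders` is made up for the example.

```sql
-- 1) Filter after the join (WHERE)
SELECT c.FirstName, o.OrderID
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
WHERE o.OrderStatus = 'Delivered';

-- 2) Filter during the join (ON); same result only because this is an INNER JOIN
SELECT c.FirstName, o.OrderID
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
   AND o.OrderStatus = 'Delivered';

-- 3) Filter before the join (subquery prepares the orders first)
SELECT c.FirstName, o.OrderID
FROM Sales.Customers AS c
INNER JOIN (
    SELECT OrderID, CustomerID
    FROM Sales.Orders
    WHERE OrderStatus = 'Delivered'
) AS o
    ON c.CustomerID = o.CustomerID;

-- 4) Filter before the join (CTE instead of a subquery)
WITH DeliveredOrders AS (
    SELECT OrderID, CustomerID
    FROM Sales.Orders
    WHERE OrderStatus = 'Delivered'
)
SELECT c.FirstName, o.OrderID
FROM Sales.Customers AS c
INNER JOIN DeliveredOrders AS o
    ON c.CustomerID = o.CustomerID;
```

With a LEFT JOIN, moving the filter from `WHERE` into `ON` changes the result, so variants 1 and 2 are interchangeable only for inner joins.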
Okay. And now moving on to tip number
12. It is similar to the previous one,
but this time it says: aggregate data
before joining tables. And again it is a
special case to improve the performance
of big tables. So now we have the
following scenario, where we are joining
the orders and the customers and we are
aggregating the data by the customer ID,
but we are only joining the table
customers because we need the first
name. So as a result we have the
customer ID, the first name, and the
order count. So the standard way is to
join the tables and then do a GROUP BY
in order to summarize the data. Now if
you look at this query, we actually
don't need the join in order to do the
aggregations. We can first do the
aggregation, like preparing the orders
with the aggregated data, and then join
the result with the customers in order
to get the first name. So again we
prepare first and then we do the join,
and we can do that using either
subqueries or a CTE. So in this
scenario, first we are doing the GROUP BY,
we are aggregating the data, and the
result of this is joined with the
customers table in order to get the
first name. Now of course there are
many ways to do it, like for
example using correlated
queries as well, where we can go and use the
subquery in the SELECT statement and
then use the WHERE condition over here
to make the correlated query. Now all
those three are going to deliver the same
results, but the question here again is:
which one has the best performance?
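The three approaches described above might look like this sketch. The table and column names are assumed, and `OrderCount` is simply a label for the order count mentioned in the narration.

```sql
-- 1) Standard way: join first, then GROUP BY
SELECT c.CustomerID, c.FirstName, COUNT(o.OrderID) AS OrderCount
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID, c.FirstName;

-- 2) Pre-aggregate the orders in a subquery, then join to get the first name
SELECT c.CustomerID, c.FirstName, o.OrderCount
FROM Sales.Customers AS c
INNER JOIN (
    SELECT CustomerID, COUNT(OrderID) AS OrderCount
    FROM Sales.Orders
    GROUP BY CustomerID
) AS o
    ON c.CustomerID = o.CustomerID;

-- 3) Correlated subquery in the SELECT list: evaluated per customer row
SELECT c.CustomerID,
       c.FirstName,
       (SELECT COUNT(o.OrderID)
        FROM Sales.Orders AS o
        WHERE o.CustomerID = c.CustomerID) AS OrderCount
FROM Sales.Customers AS c;
```

One difference to keep in mind: variant 3 also returns customers without any orders, with a count of 0, while the inner joins in variants 1 and 2 drop them. The results are identical only when every customer has at least one order.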
Well, I can go immediately and tell you
that correlated subqueries are the worst
ones. Always avoid using correlated
subqueries. They have really bad
performance. And that's because SQL is
going to go and do the aggregations for
each customer individually. So it's
going to go for each row, doing the
aggregation, then to the next row, and so
on. So it takes a long time. So this is
bad practice. Don't use it. Now we are
left again with the first option and the
second option. And here my tip is going to
be like the previous one. I'm going to
say: if you have small to medium size
tables, then go with this one because it
is easier to read and to understand, and
you will gain exactly the same
performance as with this subquery. But if
your tables are big, the best practice
is to first prepare the data, to group up
the data, to filter the data, and to
isolate it in a subquery or a CTE before
joining it with the final table in the
final query. But again, this is only for big
tables, and always test: check the
execution plan to see whether you are really
getting any benefits from it. All right.
So if you have big tables, try to prepare
the data first in a CTE or subquery and then
join. Okay, moving on to the next tip. It
says: use UNION instead of the OR operator in
joins. So what this means: sometimes, let's
say that you are joining two tables, the
customers and the orders. And now about
the join key, you can see over here it
says the customer ID should be equal to
the customer ID from the orders, or the
customer ID should be equal to the
salesperson's ID. If one of these two
conditions is fulfilled, then we have a
match. And I can tell you, the OR
operator over here is a performance
killer. It has really bad performance.
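A sketch of the join condition just described, together with the split-and-UNION rewrite that this tip recommends. The column `SalesPersonID` and the other names are assumptions based on the narration.

```sql
-- OR inside the join condition (the pattern this tip warns about)
SELECT o.OrderID, c.CustomerID, c.FirstName
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
    OR c.CustomerID = o.SalesPersonID;

-- Rewrite: one join per condition, merged with UNION
SELECT o.OrderID, c.CustomerID, c.FirstName
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
UNION
SELECT o.OrderID, c.CustomerID, c.FirstName
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.SalesPersonID;
```

`UNION` removes duplicate rows, so the select list here includes both key columns. That keeps a pair that satisfies both conditions as a single row, the same as in the OR version.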
So try to avoid it. Don't use OR in
joins. It has a lot of problems: it
avoids indexes, it creates loop joins,
and so on. That's why we consider it
a bad practice. And now in order to get
the same results, we can go and split
the joins. So we can go and have two
queries. The first query is joining the
data based on the customer ID and the
second query based on the salesperson, and
then we go and merge those two results
using the UNION. It sounds bigger
and like too much for SQL, but with this
you will get better performance than
using this simple OR operator. So again,
if you have big tables, try to avoid
using OR, and instead of that go and use
UNION. Okay, the next tip says: check for
nested loops and use SQL hints. Now
imagine that we have big tables and
we are joining tables. So now if you are
checking the execution plan, you always have to
check the join type. So for
example here it is using the nested
loops, which of course is okay because
we have small tables. But if you have big
tables and SQL is still using, for some
reason, the nested loops, then this is
alarming. So in order to change this,
what we can do is go and use the SQL
hints in order to force SQL to use the
hash join. Hash join is really good if
you have a big table, like for example
the orders, that is joined with a small
table like the customers. So now what we
can do: at the end we can write over here
OPTION (HASH JOIN). So let's go and
execute it and let's check the execution
plan, and with that we have forced SQL to
use the hash join, or hash match. Again,
you really have to evaluate your
tables here. If you have small tables,
don't bother with that. But if you have
big tables and SQL is still doing the
nested loops: nested loops are usually
very slow because you have a lot of
iterations and so on, and with the hash
join that small table is going to be stored
in memory and then you have a really
quick matching between the two tables.
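A minimal sketch of the hint described here, in SQL Server syntax. The query itself is an assumed example; the speaker's actual query is not in the transcript.

```sql
-- Ask SQL Server to use hash joins for this query instead of the
-- join type the optimizer would pick on its own.
SELECT o.OrderID, c.FirstName
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
OPTION (HASH JOIN);
```

A query hint overrides the optimizer for every join in the statement, so it is worth re-checking the execution plan and the runtime before keeping it.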
So those are all the best practices and
tips on how to optimize joining tables
in SQL. All right, so now we're going to
talk about UNION, and here is the best
practice. It says: use UNION ALL instead
of using UNION if duplicates are
acceptable. So it's very simple. If the
duplicates are acceptable, or let's say
that there are no duplicates, then don't
go with the UNION, because it needs more
time to be executed. SQL has to go and
check row by row whether we have
duplicates or not, and this usually takes
a longer time than using the UNION ALL. So
if duplicates are acceptable or you
don't have any duplicates in your data,
go with the UNION ALL. It just has to go
and merge all the data without checking
anything, and the performance is going to be
faster. All right, the next one is a
little bit tricky. So it says: use UNION
ALL together with DISTINCT instead
of using UNION if the duplicates are not
acceptable. So you want to remove the
duplicates. So we have learned that in
order to do that we're going to go and
use the UNION. It's going to go and
merge the data and remove the
duplicates as well, which is really okay to use
if you have small or
medium data. But let's say that you have
millions of rows, which is still really okay if
you have medium and small tables.
But again, here if you have huge tables,
big tables, hundreds of millions, the best
practice says go with the UNION ALL and
afterwards use a DISTINCT. So in the
subquery we are using UNION ALL, but in
order to remove the duplicates we use
the DISTINCT. But again, here you have
to test it, to check the execution plan.
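The three options from the two UNION tips can be sketched like this. The second table `Sales.OrdersArchive` and the alias `CombinedData` are assumed names for illustration.

```sql
-- UNION: merges both results and removes duplicates
SELECT CustomerID FROM Sales.Orders
UNION
SELECT CustomerID FROM Sales.OrdersArchive;

-- UNION ALL: merges both results without any duplicate check
SELECT CustomerID FROM Sales.Orders
UNION ALL
SELECT CustomerID FROM Sales.OrdersArchive;

-- UNION ALL in a subquery, duplicates removed afterwards with DISTINCT
SELECT DISTINCT CustomerID
FROM (
    SELECT CustomerID FROM Sales.Orders
    UNION ALL
    SELECT CustomerID FROM Sales.OrdersArchive
) AS CombinedData;
```

The first and third queries return the same set of rows. Whether the third one is faster on very large tables is something to confirm in the execution plan, as the speaker says.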
If you are getting a benefit, then go with
this version. But if your data is not
really big, you have hundreds of
thousands, just go with the normal
UNION. The code is smaller and you will
get the same effect. Only for large
tables can you go with this best
practice. So that's all I have for
you for the
union.
[Music]
Okay. So now let's talk about
aggregations, and here the tip says: use a
columnstore index for aggregations on
large tables, like for example fact
tables. And that's because a columnstore
index is going to compress the data. So the
size of the data is going to be smaller, and
the aggregation is super fast as well
because we are selecting only the
relevant information, only the relevant
columns. So it makes it a perfect setup
for aggregating large tables. And now
let's say that we have hundreds of
millions of orders and we have this
query over here. So the best practice
says: convert this table to a clustered
columnstore index. So if you go and
create this clustered index over here,
the whole table is going to have amazing
performance for aggregations like this.
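As a sketch in SQL Server syntax: the index name `Idx_Orders_Columnstore` and the aggregation query are hypothetical stand-ins for what is shown on screen.

```sql
-- An aggregation of the kind this tip targets
SELECT CustomerID, COUNT(OrderID) AS OrderCount
FROM Sales.Orders
GROUP BY CustomerID;

-- Convert the table to clustered columnstore storage
CREATE CLUSTERED COLUMNSTORE INDEX Idx_Orders_Columnstore
ON Sales.Orders;
```

A table can have only one clustered index, so if `Sales.Orders` already has a clustered rowstore index, that one has to be dropped or converted before this statement succeeds.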
All right. So to the next one. It says:
pre-aggregate data and store it in a new
table for reporting. So let's say that
we have a big query where we are
aggregating the data and so on. And this
query takes a really long time. Let's say
like 5 minutes or something like that.
But now the problem with that: I would
like to show the results as a report,
maybe to my manager, or let's say during
a meeting. It's going to be really bad if
everyone has to wait until the query is
done. So the best practice here, if you
have a query that runs very slowly:
what you can do is go and store the
results in a table. So if I go over here
and say INTO sales summary, what is going to
happen: it is going to store the result inside
this table. So let's go and execute it.
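A sketch of this step, assuming the summary table is called `Sales.SalesSummary` (the speaker only says "sales summary") and using a small aggregation as a placeholder for the slow query:

```sql
-- SELECT ... INTO fails if the target table already exists,
-- so drop the old copy first (DROP TABLE IF EXISTS needs SQL Server 2016+).
DROP TABLE IF EXISTS Sales.SalesSummary;

-- Store the result of the (placeholder) aggregation in a new table
SELECT c.CustomerID, c.FirstName, COUNT(o.OrderID) AS OrderCount
INTO Sales.SalesSummary
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
    ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID, c.FirstName;

-- Reports then read the prepared table
SELECT CustomerID, FirstName, OrderCount
FROM Sales.SalesSummary;
```

The summary table is a snapshot, so the first two statements need to be rerun whenever the report should reflect new orders.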
And now with that we have a nice table
where everything is prepared. So all
that you have to do is to go and query
this table. And of course it's going to
be very fast because it's only a SELECT
statement. And with that you have
prepared and pre-aggregated the data to
have fast reports. So don't forget
about this. If you have a big query, you
can insert the result of this query in a
new table in order to use it later for
reporting. But one thing that you have
to make sure of is that you always have to
update this table. So if we have new
orders, they will not be present inside
the sales summary. You have to go and
run this query again in order to get new
data inside the sales summary. So those
are the tips on how to improve the
performance of your aggregations in
SQL. So now what is happening here? I
would like to show the orders but only
from customers from USA. So if you check
this query over here, we are joining the
tables orders and customers, but mainly we
are showing only the orders information,
and that means we are using the
customers only to filter the table
orders. And there are multiple ways
to do this task. So it's not only
the joins: you can go and use EXISTS
as a subquery, and you can also go and
use the IN operator with a subquery. And
now comes the old but gold question.
Which one is better? Should we join or
use EXISTS or IN? And oh my god, if you
go to the forums, you will see people
fighting about which one is the best.
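The three candidates being compared could be written as in the sketch below. The column `Country`, the value `'USA'`, and the table names are assumptions based on the narration.

```sql
-- JOIN: customers are joined only to filter the orders
SELECT o.*
FROM Sales.Orders AS o
INNER JOIN Sales.Customers AS c
    ON o.CustomerID = c.CustomerID
WHERE c.Country = 'USA';

-- EXISTS: keep an order if a matching USA customer exists
SELECT o.*
FROM Sales.Orders AS o
WHERE EXISTS (
    SELECT 1
    FROM Sales.Customers AS c
    WHERE c.CustomerID = o.CustomerID
      AND c.Country = 'USA'
);

-- IN: compare against the list of USA customer IDs
SELECT o.*
FROM Sales.Orders AS o
WHERE o.CustomerID IN (
    SELECT CustomerID
    FROM Sales.Customers
    WHERE Country = 'USA'
);
```

All three return the same orders as long as `CustomerID` is unique in the customers table. If it were not, the join could return an order more than once.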
Clean tech. Come on, do that again. Do that again. I dare you. Okay, bring it.
Oh, you can't say you can't say one point. Two point. Now, about the best
practices: everyone agrees that you shouldn't use the IN operator. So this is
the bad practice. Avoid it, don't use it. And of course I'm always speaking
about big tables, okay? Not small tables. So we don't use IN in order to filter
one table based on the result of another table. Don't use the IN operator in
this scenario. Now here comes the conflict: we have JOIN and EXISTS. Well, the
performance of those two is very similar for medium tables, like I'm speaking
about hundreds or thousands of rows and
so on. But still you have to test it. You have to go and compare the execution
plans, and if you are getting identical results and both have the same speed,
then I prefer to go with the join, because to be honest it is easier to write
than EXISTS. So from my point of view, JOIN is the best practice if its
performance is equal to EXISTS. But what happens for me is that sometimes I get
better performance using EXISTS, so in that case, from my point of view, EXISTS
is the best practice. And now you might ask why we are getting better
performance with EXISTS than with the inner join. That's because SQL only has
to check the existence of data from the subquery. On the other hand, with the
inner join SQL has to start matching between the two tables, so it evaluates
all matching records; it is not just evaluating whether a match exists or not.
As well, sometimes SQL has to deal with more rows, because you might introduce
duplicates as you are joining tables, and this will not happen using EXISTS. So
for some scenarios, if you are using EXISTS you might get better performance
than using a join, but everyone agrees to not use the IN operator. Okay,
the next tip is to avoid redundant logic in your query. This happens a lot if you
have a lot of subqueries; if you analyze them, you might sometimes find
redundancy. For example, in this query I would like to have a tag for each
employee saying whether the salary is above or below the average. We might do
it like this: we say, okay, let's get the employees where the salary is higher
than the average, and you calculate the average in a subquery. If it's higher,
you write 'Above Average'. Then we say, okay, let's go for the below average,
so we do a UNION ALL, and the condition is going to be that the salary is less
than the average. Now, by checking this you see that there's a problem. First
of all, we are querying the employees four times. We have 1, 2, 3, 4. So we are
scanning the table employees four times, and as well we have the same logic
over here: we are calculating the average salary twice. So this is, I can say,
a bad practice, and there are many ways to do it better. For example, you can
put this subquery in a CTE and then use it multiple times. But there is a
better solution using a window function. If you check this, it is very simple.
Let me execute it. We are reading the table employees only once, and then we
are using a CASE statement: if the salary is higher than the window function,
where we are calculating the average over the whole table employees, then
write 'Above Average'; if it's lower, then 'Below Average'. So as you can see,
it is easier to read, it is smaller, and the performance here is way
better than reading four times the employees and repeating the same logic.
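The two versions described above can be sketched side by side as follows. This is only an illustration on SQLite via Python's sqlite3 module (window functions need SQLite 3.25 or newer); the employees table, its column names and the salaries are made up for the example.

```python
import sqlite3

# Hypothetical employees table with four invented salaries (average = 6000).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, salary INTEGER NOT NULL);
    INSERT INTO employees VALUES (1, 3000), (2, 5000), (3, 7000), (4, 9000);
""")

# Redundant version: employees is referenced four times and AVG is computed twice.
redundant_sql = """
    SELECT employee_id, salary, 'Above Average' AS status
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    UNION ALL
    SELECT employee_id, salary, 'Below Average' AS status
    FROM employees
    WHERE salary < (SELECT AVG(salary) FROM employees)
    ORDER BY employee_id
"""

# Single-pass version: one read of employees, the average comes from a window function.
window_sql = """
    SELECT employee_id,
           salary,
           CASE
               WHEN salary > AVG(salary) OVER () THEN 'Above Average'
               WHEN salary < AVG(salary) OVER () THEN 'Below Average'
           END AS status
    FROM employees
    ORDER BY employee_id
"""

redundant_rows = conn.execute(redundant_sql).fetchall()
window_rows = conn.execute(window_sql).fetchall()

for row in window_rows:
    print(row)
print("same result:", redundant_rows == window_rows)
```

One difference to keep in mind: a salary exactly equal to the average is dropped by the UNION ALL version, while the CASE version keeps the row with an empty tag. The sample data avoids that case so that both queries return the same rows.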
So here you always have to look at your queries, and if you see that you are
repeating the same things over and over, then you are writing a bad query.
Think about alternatives like CTEs and window functions, and I'm sure you will
find a better way than reading the table several times or repeating the same
logic several times. So as you can see, optimizing queries is not always about
using indexes and partitions. It's all about using best practices. All right
guys, so with that we have covered a lot of best practices on how to
optimize the performance of your query. And as you can see it's not always
creating indexes, right? In many scenarios it's about how you write the query.
And now in the next section I'm going to show you the best practices on how to
create tables, so the best practices of DDL, the data definition language. If
you have a poor definition
of your tables, this has a great impact on the performance of your queries. All
right. So now we have here a DDL to create a table, customer info, and it is
not really following best practices. So let's go through it one by one. The
first tip is to avoid the data types VARCHAR and TEXT if possible. VARCHAR and
TEXT are among the worst data types for performance, because they consume a
lot of resources whatever you do. For example, if you are sorting the data by a
column that is VARCHAR or TEXT, it is a very expensive operation. The same
thing if you create an index on top of such a column: it's going to be
expensive as well. And they cause a lot of problems with data fragmentation and
many other issues. So try as much as you can to skip those data types if it's
possible. So now let's go and review all those columns in order to see whether we
can change something about them, because the table has a lot of VARCHARs. The
first one over here is a VARCHAR because it is the first name. Well, it is
okay. Now moving on to the next one. We have the last name as TEXT, which is
not really good because TEXT is worse than VARCHAR. So it's better to use
VARCHAR than TEXT, and here we have to fix it: VARCHAR, and I'm going to go
with the length 50. Now moving on to the country. The country is going to be a
VARCHAR. We cannot change that, since it contains characters. The next one is
the score of the customer. Here we can do something about it, because scores
are only numbers. That's why we can skip the VARCHAR here. So let's remove it
and say it is an integer, and with that we have avoided using the VARCHAR. And
the same thing goes for the birthday. The birthday is a date, and here we have
it as a VARCHAR. Well, this is not really good, and we can avoid that by having
this column as a DATE. A DATE is way better than a VARCHAR. All right. And the
next one is an integer. So with that we have fixed a few things: we have fixed
the score and the birthday. And with that we have saved some storage. If we
have an index on the score, it's going to be way better than having it on a
VARCHAR. And if you are filtering the data based on the birthday, it's going to
be faster. So again, try your best to avoid VARCHAR and TEXT. I have seen in
many projects that a lot of developers tend to use VARCHAR, and I understand it
is easier to make everything a VARCHAR than deciding whether it is an integer,
date, float and so on, because you can fit everything in VARCHAR and TEXT. But
this is lazy. Take time to understand the content of each column and try to
assign it the correct data type, because this really has an impact on
performance. Okay, on to the next one. It says: avoid using MAX or overly large
lengths. So now we have to keep our eyes on the length of each data type,
especially the VARCHARs. Not only is it going to waste a lot of storage, it's
also going to mislead SQL into creating large indexes, which is totally
unnecessary because the data itself is small. But because you have defined a
large length, SQL is going to check this information and decide to make a big
index. And large indexes are always problematic because they slow everything
down: sorting the data, retrieving data, updating the index. So it is a really
bad practice if you go blindly and define MAX or 255 everywhere. Again, take
the time to think
about each column and predict a length for it. So for example if you check over
here, we are saying first name VARCHAR(MAX). Well, most first names are short,
so we don't need the maximum size of a VARCHAR to fit a first name. Here we can
easily go with 50 instead of MAX. And the same thing goes for the column
country. We don't need 255 characters for the country name. We can go with
something more realistic, like around 50. I think you can even go smaller, but
it's fine to have 50. So the best practice here is to analyze your data and to
predict the size of each column. And don't be lazy by just
defining max everywhere. I know it's faster, but it's bad for performance.
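To make the data type and length fixes concrete, here is a minimal sketch of what the revised table definition might look like. It is executed on SQLite through Python's sqlite3 module only so that it can be run anywhere; SQLite records the declared types but does not enforce VARCHAR lengths the way SQL Server does. The table name customers_info, the column names, and the types of customer_id and total_purchases are assumptions, since the transcript does not spell them out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Before, as described in the transcript (not executed here):
#   first_name VARCHAR(MAX), last_name TEXT, country VARCHAR(255),
#   score VARCHAR, birthday VARCHAR
# After: realistic lengths, and numeric/date types where the content allows it.
conn.execute("""
    CREATE TABLE customers_info (
        customer_id     INT,          -- type assumed
        first_name      VARCHAR(50),  -- was VARCHAR(MAX)
        last_name       VARCHAR(50),  -- was TEXT
        country         VARCHAR(50),  -- was VARCHAR(255)
        total_purchases INT,          -- type assumed, not stated in the transcript
        score           INT,          -- was VARCHAR
        birth_date      DATE,         -- was VARCHAR
        employee_id     INT
    )
""")

# Read back the declared type of every column (row[1] = name, row[2] = type).
declared_types = {
    row[1]: row[2] for row in conn.execute("PRAGMA table_info(customers_info)")
}
for column, declared in declared_types.items():
    print(f"{column:16} {declared}")
```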
Okay, what else do we have? Use the NOT NULL constraint as much as possible.
NOT NULL is amazing. It has a lot of advantages. Of course, the biggest
advantage is the data integrity of your table: with it, you make sure no nulls
are inserted in a specific column. But it is also good practice to use it for
improving performance, because if you are creating an index, you're going to
get better index performance, since SQL knows there are no nulls inside the
tree of the index. On the other side, when writing queries we tend to use a
filter where we say a specific column should not be null. But if you make sure
in the DDL that it is not null, then you can skip this filter, and with that
you are reducing the size of your query. So what we're going to do is go
through all the columns and decide whether each one is NOT NULL or nullable.
For example, the first name and the last name should not be null. That's why
I'm going to say NOT NULL for the first name, and the same thing for the last
name: NOT NULL. For the customer ID, we're going to talk about it soon, because
we're going to convert it to a primary key, and primary keys are usually not
null. Now for the country, we may have a business rule that it should not be
null, so we make a constraint for it. Now about the total purchases and scores:
if it is a new customer, maybe we can have a null inside our data, so we're
going to leave them nullable. And I think birthday is usually going to be
optional, so we're going to leave it as well. And whether the customer is an
employee or not could be a null as well. So with that we have found three
columns where we can have a NOT NULL constraint. And if we create an index on
the
country, it's going to be a better index. Okay. Moving on to the next one.
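The NOT NULL decisions from this tip can be tried out with a small sketch like the one below. It again uses SQLite via Python's sqlite3 module as a stand-in for SQL Server, with a shortened, hypothetical version of the table (the name customers_info, the column names and the sample rows are invented for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers_info (
        customer_id INT,
        first_name  VARCHAR(50) NOT NULL,
        last_name   VARCHAR(50) NOT NULL,
        country     VARCHAR(50) NOT NULL,
        score       INT,   -- nullable: a new customer may not have a score yet
        birth_date  DATE   -- nullable: treated as optional
    )
""")

# A new customer without score or birthday is accepted.
conn.execute(
    "INSERT INTO customers_info VALUES (?, ?, ?, ?, ?, ?)",
    (1, "Maria", "Schmidt", "Germany", None, None),
)

# A row without a country is rejected by the NOT NULL constraint.
try:
    conn.execute(
        "INSERT INTO customers_info VALUES (?, ?, ?, ?, ?, ?)",
        (2, "John", "Doe", None, 500, None),
    )
    rejected = False
except sqlite3.IntegrityError as error:
    rejected = True
    print("rejected:", error)

row_count = conn.execute("SELECT COUNT(*) FROM customers_info").fetchone()[0]
print("rows stored:", row_count)
```

Because the constraint guarantees that country is never null, queries on this table no longer need a filter such as `WHERE country IS NOT NULL`.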
It says: make sure that all the tables inside your database have a clustered
primary key. A primary key helps you build the relationships between tables,
where you have primary keys and foreign keys, and then you can join tables very
easily. As well, a primary key is important for performance. In SQL Server the
default is going to be a clustered index, and it is really good to have an
index on the primary key, because when you are doing update or delete
operations, or lookups while joining tables, it's going to help. So there are a
lot of performance benefits to having a primary key; make sure that all your
tables have one. As you can see, the issue with our table is that we don't have
a primary key, and our primary key is going to be the customer ID. So let's do
that: PRIMARY KEY. As I said, by default it will be clustered, but I'm going to
write it down in case you are working with different databases, to make sure it
is clustered. Okay, moving
on to the next one. It's not only about the primary key we have to take care of
our foreign keys. The best practice says: create a nonclustered index for the
foreign keys if they are frequently used. Foreign keys are important for
connecting and joining two tables, and usually we use them frequently. Not only
that, we sometimes use them to filter the data, and if you create a
nonclustered index for that, it can improve the speed. So what we can do is
very simple: we're going to create a nonclustered index on our table customers
info for the foreign key employee ID. How to do it is very simple. We say
CREATE NONCLUSTERED INDEX on our table, customers info, on our foreign key, the
employee ID. But again, make sure that this is an important foreign key that is
used frequently by your queries. All right friends, so as you can see there are
a lot of best practices on how to improve and optimize the DDL. Having a
healthy DDL can improve the performance of your queries. Now in the next
section I'm going to show you the best practices
and tips and tricks about indexing. So let's go.
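The primary key and foreign key tips from the DDL section can be sketched as follows. SQLite (used here through Python's sqlite3 module so the example runs on its own) has no CLUSTERED or NONCLUSTERED keywords; in SQL Server the corresponding syntax would be `PRIMARY KEY CLUSTERED` and `CREATE NONCLUSTERED INDEX`. The table name, column names and index name are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_info (
        customer_id INTEGER PRIMARY KEY,   -- SQL Server: PRIMARY KEY CLUSTERED
        first_name  VARCHAR(50) NOT NULL,
        employee_id INT                    -- foreign key to a hypothetical employees table
    );

    -- SQL Server: CREATE NONCLUSTERED INDEX ... ON customers_info (employee_id)
    CREATE INDEX idx_customers_info_employee_id
        ON customers_info (employee_id);
""")

# Ask the planner how it would run a typical filter on the foreign key.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT first_name FROM customers_info WHERE employee_id = ?",
    (7,),
).fetchall()

# The human-readable plan step is the fourth column of each row.
plan_details = [row[3] for row in plan]
for detail in plan_details:
    print(detail)
```

If the foreign key index is picked up, the printed plan step names idx_customers_info_employee_id instead of describing a full table scan. On SQL Server, the same check would be done by comparing execution plans.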
All right, the fifth best practice, and the most important one, is to avoid
over-indexing, because too many indexes are going to slow down the insert,
update and delete operations, and they are going to confuse the execution plan
about choosing the right index, so the performance of the whole system is going
to go down. Another tip is to monitor the usage of the indexes. I can tell you
that 90% of the indexes that are created are usually not used at all, so they
are taking a lot of space and slowing
down everything. So go and drop those unused indexes in your system. The next
best practice is to have a regular job, like maybe a weekly job. First, you
have to update the statistics regularly. As you are inserting new data and
modifying data inside your database, the statistics and the metadata of your
tables might get outdated, and this is really bad because you will not get an
optimal execution plan for your queries, and this can of course slow down your
queries. So regularly make sure that all the statistics are updated in order to
have an optimal execution plan. What else we can do in this weekly job is
rebuild and reorganize our indexes, to make sure that we are preventing
fragmentation in our indexes. Fragmentation in your indexes is really bad
because there will be a lot of unused space, and the order of your clustered
index will not be correct. So make sure that at least weekly you are rebuilding
and reorganizing all your indexes. So those are the best practices
of improving the performance and optimizing your indexing. If you are
struggling with very large tables in your projects, like fact tables, then use
SQL partitioning to divide these tables into smaller pieces, which can improve
the performance whether you are reading data from the table or writing data.
And of course you can mix things: if you apply a columnstore index on this
partitioned table, then you will get the best performance for large tables.
All right friends, so that's all. Those are the best practices, tips and tricks
that I've collected in the many years working with SQL. And now my final
thought about this: always try to focus on making clear queries. Make them easy
to read and easy to understand, and try to optimize the performance only if
it's needed. So if you have a small database, don't worry a lot about the
performance, because the SQL optimizer is going to pick the best plan for you;
focus only on having simple queries. And if there is a performance
problem always test using the execution plan. It should be your judge. So if you
are applying any index or rewriting your queries, always compare before and
after using the execution plan. And if you are gaining more performance, then
adopt the new query or the new index. All right my friends, so that's all the
tips, tricks and best practices that I have for you in order to optimize the
performance. And with that we have now covered everything about this chapter,
performance optimization. Now in the next chapter I'm going to show you how I
use AI to assist me while I'm using SQL. So let's go.
All right. So now I would like to share something important with you,
especially as a future developer who is working with AI. One of the best ways
to truly build skill and to grow as a developer is by working on complex tasks
and issues on your own. When you are stuck on a complex task and you are
pushing yourself to find a solution for it, and you are writing the code
yourself, here the magic happens and the real learning can happen. If you jump
too quickly and ask the AI for a solution, you are skipping an essential step
toward becoming an expert. And more important than that, you won't develop the
skills to understand when and where the AI was wrong. So my recommendation here
is to have discipline. Always try to solve the task on your own, and only turn
to AI if you don't have any more ideas
on how to solve the task. So that's my opinion and my advice for you.
So quickly, what is ChatGPT? It is an AI program developed by OpenAI that is
trained to understand questions and provide humanlike answers. So what does GPT
stand for? The G stands for generative: that means the model can generate new
content, new text. The P stands for pre-trained: the model is already trained on
a huge amount of data. And the T stands for transformer: it is a type of neural
network architecture that processes the sentences in your prompts in order to
understand the context behind them quickly and accurately. And on the other hand
we have GitHub Copilot. It is developed by GitHub and it also uses models from
OpenAI. So that means ChatGPT and Copilot are both built on language models
developed by OpenAI. GitHub Copilot was trained on tons of code that is
available on GitHub. So how does it work? As you are writing code in a code
editor, like for example Visual Studio Code, it is going to provide real-time
suggestions as you type. So now if we compare those two, ChatGPT and Copilot, we
can say that ChatGPT is a standalone application that you interact with using a
website or an app, where you go and start a conversation with the AI. On the
other hand, Copilot is directly integrated in your code editor, like for example
Visual Studio Code. For coding this is way better than ChatGPT, because you have
real-time interaction with the AI. This is a great advantage for Copilot,
because everything is in one place: you are getting real-time assistance while
you code. So the main purpose of ChatGPT is to have a conversation with the AI
about any topic that you like, not limited only to software development. But on
the other hand, Copilot focuses only on assisting software development: as you
are writing your code, you are getting autocompletion of the code or maybe a
whole block of code as a suggestion. So these are the key differences between
ChatGPT and Copilot.
Now if you are doing software development or working on data projects, then, of
course depending on your role in the project, there will be many different types
of tasks and activities that have to be done: a lot of brainstorming about new
ideas, coding solutions, debugging, generating documentation, discussing the
different types of architecture, doing root cause analyses. So the spectrum of
activities and tasks in each project is usually huge. And of course we can use
the help of different AI tools to assist us with those tasks and activities, and
there is not one AI tool that can cover all of that. I tend to jump between
Copilot and something like ChatGPT. Okay. So now I'm going to map those
different tasks to either ChatGPT or Copilot. Let's focus first on ChatGPT. The
first one is brainstorming and ideas. If we have in our project a big task, or
let's say a big issue that we want to find a solution for, I tend to use tools
like ChatGPT to have a discussion about the topic, to explore and discuss
multiple ideas and then start evaluating them. The next one where I find myself
using ChatGPT is project planning. It is as well something high level: you can
discuss with ChatGPT the design of your project, and you can discuss the
milestones and the roadmap of the project. The next thing I find myself using
ChatGPT for is learning, knowledge and research. If you are working on big data
projects, you will be overwhelmed by the amount of cloud services and AI
analytics tools, and of course you can learn new stuff and gather information
and knowledge using ChatGPT. Okay, moving on to the next task: generating
documentation. Writing documentation is always a painful process and consumes a
lot of time, and I tend to use tools like ChatGPT to generate it. But of course,
I always review the documentation and make it short. Okay, moving on to another
topic where I use ChatGPT: discussing architecture. If you are starting a new
project, there will be different types of architecture you could use to
implement it. You can discuss the different types of architecture with ChatGPT,
and if you give it the specifications of your project, you can discuss which
architecture is suitable for the project. Another task that I always find myself
researching is exploring best practices, tips and tricks. You can have a
discussion with ChatGPT about the recommendations, what the best practices are
and what the common pitfalls are, in order to make sure that your code and your
solution are always up to date with the best practices. And one more thing: if
there is a very complex task in the project, I tend to have a discussion with a
tool like ChatGPT in order to break this complex task into small pieces and
start finding the solution for each piece.
And now, on the other hand, I use Copilot to solve a different type of task.
This is where I get my hands dirty in the code. While I'm coding, I use Copilot
all the time to assist me, because it provides inline suggestions directly,
helps me code faster and reduces the human errors that I might make. So while
I'm writing code or debugging, I tend to use Copilot, and I don't find myself
going to ChatGPT to ask about code or syntax; we can do it directly in Copilot.
And one task that is very well known in any software development is refactoring.
If you have code that is slow and badly designed and you want to refactor the
whole thing, you can do it directly in your code together with Copilot in order
to find optimizations. I use Copilot as well to add inline comments, so I don't
find myself going to ChatGPT and asking it to add comments to my code; you can
do it directly in your code using Copilot. And of course, even if everything is
working perfectly, with the best practices, good performance and comments, you
still have to maintain a nice style and format in your code. And now we can do
that directly using Copilot too; we don't have to jump to ChatGPT in order to
style and format the code. As you can see, I'm currently using both of them for
different types of tasks. So again, if I have the feeling that I have to discuss
something, I go to ChatGPT. But once the idea is clear and I know the solution,
I start using Copilot to write the code, and with the help of Copilot I can
deliver clean and professional code. So this is how I currently use both ChatGPT
and
Copilot. Okay friends, so now I'm going to show you a quick guide to GitHub
Copilot in Visual Studio Code. Once you create a profile and connect it to
Visual Studio Code, you will get a new icon for Copilot. Once you go there, you
can quickly see the status, and you can also disable Copilot. If it looks like
this, that means your Copilot is active. Once you have everything up and
running, what you have to do is very simple: just start writing your code. So
start typing any SELECT statement, and now you can see that we have gray text.
This gray text is called the ghost text; it is an autocompletion from Copilot,
and here it says SELECT * FROM table. As you can see, when I hover the mouse
over it, I can switch between different suggestions. Here we have three
suggestions, one, two, three, and I'm going to go with the third one. As it says
here, if you want to accept the suggestion, all you have to do is press Tab. So
let's do it, and with that you are accepting the whole thing. But now say you
want to accept only part of the code. So let's write SELECT again, and this time
we're going to be selective. In order to do that, hold Ctrl and press the right
arrow, and with that we are accepting part of the ghost text, not everything.
But of course, if you are accepting the whole thing, just go with Tab. Now there
is another way to trigger the ghost text, and that's by first writing a comment.
For example, we want to select the top three customers based on the score. Once
you start writing the query, Copilot is going to write a query that is relevant
to the comment. As you can see, we are getting TOP 3 from customers, because we
want the top three customers, and here we have two suggestions, with the ORDER
BY or without it. So I will go with ORDER BY and hit Tab. And now here is
another suggestion, which is correct, in order to sort the data from the highest
to the lowest.
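A minimal sketch of the comment-first workflow described above, written in Python with the built-in sqlite3 module so that it runs anywhere. The customers table, its columns and the sample rows are assumptions made for illustration, and SQLite uses LIMIT where the SQL Server query in the video uses TOP 3. What Copilot actually suggests may differ.

```python
import sqlite3

# Hypothetical sample data: table name, columns and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (id, first_name, score) VALUES (?, ?, ?)",
    [(1, "Maria", 350), (2, "John", 900), (3, "Georg", 750),
     (4, "Martin", 500), (5, "Peter", 0)],
)

# The leading SQL comment states the task in plain words; this is the kind of
# comment that steers the suggestion. The query below is one plausible result.
top_three_sql = """
-- Select the top three customers based on the score
SELECT id, first_name, score
FROM customers
ORDER BY score DESC   -- highest score first
LIMIT 3               -- SQL Server would use SELECT TOP 3 instead of LIMIT
"""

top_three = conn.execute(top_three_sql).fetchall()
print(top_three)
```

Without the ORDER BY, the three rows returned would be arbitrary, which is why the variant with ORDER BY is the one worth accepting.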
All right, moving on to the next one. As we learned, in SQL there can be
multiple solutions to a task and multiple variants of queries that solve the
same task. So let's say that we have this task: rank customers based on their
total order sales. If you start writing the query, we are getting the ghost
text. But now we can hit Ctrl+Enter, and on the right side you will get
different suggestions. Here we have nine suggestions on how to solve this task
in SQL. What you have to do is go through all those suggestions and pick one.
For example, I can go with suggestion number three and say accept suggestion,
and you will get it in your code editor. So this is what we mean by Copilot
autocompletion and integrating the AI directly as you are developing and writing
code.
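As an illustration of the task above, here is one possible query for "rank customers based on their total order sales", run from Python against an in-memory SQLite database. The orders table and its rows are made up for this sketch, and this is only one of the many variants a tool might suggest, not the specific suggestion shown in the video. It needs a SQLite build with window function support (3.25 or newer).

```python
import sqlite3

# Hypothetical orders table; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
INSERT INTO orders VALUES (1, 1, 35), (2, 2, 15), (3, 1, 20), (4, 3, 90), (5, 2, 10);
""")

rank_sql = """
-- Rank customers based on their total order sales
WITH totals AS (
    -- Step 1: one row per customer with the summed sales
    SELECT customer_id, SUM(sales) AS total_sales
    FROM orders
    GROUP BY customer_id
)
-- Step 2: rank those totals from highest to lowest
SELECT customer_id,
       total_sales,
       RANK() OVER (ORDER BY total_sales DESC) AS sales_rank
FROM totals
ORDER BY sales_rank
"""

ranking = conn.execute(rank_sql).fetchall()
for customer_id, total_sales, sales_rank in ranking:
    print(sales_rank, customer_id, total_sales)
```

The same result could be written with a subquery instead of a CTE, or with DENSE_RANK() if ties should not leave gaps, which is exactly why it pays to read through several suggestions before accepting one.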
Now, in Copilot we can not only use the ghost text and the autocompletion, we
can also interact with the AI using inline chat, so it's something like ChatGPT.
In order to trigger the chat, hit Ctrl+I, and you get a place to ask Copilot any
question, for example: join the query with the table sales orders. So let's hit
it. And now, as you can see, we got a full query where customers is joined with
orders, and how the tables are joined is totally correct. So that means Copilot
already knows all the tables that I have in the database, as well as the columns
and how to join them. This is amazing. If you like it, you accept it, of course.
And this is way faster than ChatGPT, because in ChatGPT you have to introduce
your database, your columns and so on before even asking anything. This is
exactly the power of Copilot.
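To make the result of that inline chat request concrete, here is a small sketch of a customers-to-orders join, run from Python with sqlite3. The schema, the sample rows and the choice of LEFT JOIN are assumptions; the video does not show which join type or column names were generated.

```python
import sqlite3

# Hypothetical schema: customers.id is referenced by orders.customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
INSERT INTO customers VALUES (1, 'Maria'), (2, 'John'), (3, 'Georg');
INSERT INTO orders VALUES (1001, 1, 35), (1002, 2, 15), (1003, 1, 20);
""")

join_sql = """
SELECT c.id, c.first_name, o.order_id, o.sales
FROM customers AS c
LEFT JOIN orders AS o          -- keep customers that have no orders
       ON o.customer_id = c.id -- the join key is the part worth double-checking
ORDER BY c.id, o.order_id
"""

joined = conn.execute(join_sql).fetchall()
for row in joined:
    print(row)
```

Even when the editor can see the schema, the join condition and the join type are worth checking by hand, since a wrong key still produces a query that runs.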
Now what else can we do with that? We can highlight part of our code and then
start the chat again, and here we can say: replace this column with an
aggregation of the sales. So let's go and hit okay. As you can see, it replaced
it with an aggregate function. And one thing that is very important: the code is
not changed yet. It is highlighted and shown as a suggestion, and now you have
to accept it or discard it. If you discard, nothing is going to change in your
code. But once you say accept, it is going to replace your original code. So if
you do that, now your code is replaced with the AI suggestion.
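The before and after of such an edit might look like the sketch below, again using Python and sqlite3 with a made-up schema. The original query selects the raw sales column; the edited one replaces it with SUM(sales), which also requires a GROUP BY. The exact query in the video is not shown, so both statements are assumptions.

```python
import sqlite3

# Hypothetical schema and rows, used only to run the two queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
INSERT INTO customers VALUES (1, 'Maria'), (2, 'John'), (3, 'Georg');
INSERT INTO orders VALUES (1001, 1, 35), (1002, 2, 15), (1003, 1, 20);
""")

# Before: one row per order, with the raw sales column.
before_sql = """
SELECT c.first_name, o.sales
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
ORDER BY c.id, o.order_id
"""

# After: the sales column is replaced by an aggregate, so the query
# now returns one row per customer and needs a GROUP BY.
after_sql = """
SELECT c.first_name, SUM(o.sales) AS total_sales
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.first_name
ORDER BY c.id
"""

before_rows = conn.execute(before_sql).fetchall()
after_rows = conn.execute(after_sql).fetchall()
print(before_rows)
print(after_rows)
```

Note that the edit changes the shape of the result, not just one column, which is a good reason to review the highlighted suggestion before accepting it.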
Okay. Another thing about Copilot: it tries to fix issues that you have in your
code. For example, we have an error here. If you hover the mouse over it, you
can see a menu from Copilot to view the error or to fix it. And another way to
do that: if you right-click on it and go to Copilot, you can see here that we
can explain or fix. If you go with explain, you will get another window with an
explanation of the issue in your code. And once you understand it, you can ask
Copilot to fix it. So let's go over here and go to fix. And with that, Copilot
fixed the issue. It was all about the order of the clauses in the SELECT
statement: first you have to do the GROUP BY, then the ORDER BY. So it helps you
to find issues and to fix them as well.
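The clause-order error mentioned above is easy to reproduce. This sketch uses Python's sqlite3 with a hypothetical orders table: the first query puts ORDER BY before GROUP BY and is rejected by the parser, the second uses the correct order. The error text comes from SQLite and will read differently in SQL Server.

```python
import sqlite3

# Hypothetical orders table; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, sales INTEGER);
INSERT INTO orders VALUES (1, 1, 35), (2, 2, 15), (3, 1, 20), (4, 3, 90), (5, 2, 10);
""")

# Wrong: ORDER BY is written before GROUP BY.
broken_sql = """
SELECT customer_id, SUM(sales) AS total_sales
FROM orders
ORDER BY total_sales DESC
GROUP BY customer_id
"""

# Fixed: GROUP BY first, then ORDER BY.
fixed_sql = """
SELECT customer_id, SUM(sales) AS total_sales
FROM orders
GROUP BY customer_id
ORDER BY total_sales DESC
"""

try:
    conn.execute(broken_sql)
    broken_error = None
except sqlite3.OperationalError as exc:
    # The statement never runs; the parser rejects the clause order.
    broken_error = str(exc)

fixed_rows = conn.execute(fixed_sql).fetchall()
print("broken query error:", broken_error)
print("fixed query rows:", fixed_rows)
```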
And now, as you might have already noticed, as we are writing the code and
interacting with Visual Studio Code, you will often get a sparkle, this little
yellow sparkle on the left side. You will see this icon each time Copilot thinks
it can help. If you click on it, you will get a menu of different things that
Copilot can do for you, like fixing, explaining, modifying and so on. Well, my
friends, that's it. This is Copilot, and it is very simple yet very powerful for
developers. And of course it is not only for SQL but for anything, like Python
and so on. Everything is integrated in one place. I don't have to jump to
ChatGPT and ask stuff; it is live, and I can do it directly as I'm writing my
code. So that's all for Copilot. All right friends, so now let's switch to
ChatGPT. Let's start first by understanding the structure and the basic
components of ChatGPT prompts.
So the first component, and the most important one, is the task. You have to be
very clear in defining what the AI should do; without a clear task, the AI will
not understand what to do. So this is mandatory in each prompt. Then after that
you have to provide some context. You give some background information, like
for example you say I am a student, or I am a data engineer, and so on. Another
component we have to add is the specifications. In the task you give the main
thing that the AI should do, but with the specifications you go into details,
like for example which topics should be included or maybe excluded, or the word
count. So here you are specifying a lot of wishes and small details in order to
get an answer that meets your expectations. Both the context and the
specifications are important. Then after that we have some nice-to-have
components, like for example specifying a role. Here you give the AI a role:
for example, you tell it to act as an expert, as a teacher, as an interviewer.
So you are setting the AI to play a role. And the last component that you can
add as well is the tone. Here you are defining the voice of the answer, just to
make it more friendly, easy to read and engaging. So the role and the tone are
nice to have, and if you use all those components you will get better results
from the AI.
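The five components above can be sketched as a small helper that assembles a prompt. The function name build_prompt and its parameters are hypothetical, invented for this illustration; it is just string assembly and does not call any AI service. The sample values follow the window-function example prompt used in this chapter.

```python
def build_prompt(task, context="", specifications=(), role="", tone=""):
    """Assemble a prompt from task, context, specifications, role and tone.

    Hypothetical helper: only the task is mandatory, the rest is optional.
    """
    if not task.strip():
        raise ValueError("the task is the one mandatory component")
    parts = [role, context, task]
    # Each specification becomes its own bullet under the task.
    parts.extend(f"- {spec}" for spec in specifications)
    parts.append(tone)
    # Skip any optional component that was left empty.
    return "\n".join(part for part in parts if part)


prompt = build_prompt(
    role="You are a senior SQL expert.",
    context="I am a data analyst working on SQL projects using SQL Server.",
    task="Explain the concept of SQL window functions and do the following:",
    specifications=[
        "Explain each window function and show the syntax.",
        "Describe why they are important and when to use them.",
        "List the top three use cases.",
    ],
    tone="The tone should be conversational and direct, "
         "as if you are speaking to me one-to-one.",
)
print(prompt)
```

Calling build_prompt("Explain SQL window functions") with nothing else gives the short, task-only prompt, which is the version that leaves everything up to the AI.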
So let's take, for example, the following prompt: explain SQL window functions.
This is very simple and very short, and here we have only one component, the
task. You are not giving any context, whether it is for data analytics or for
data engineering. So you leave it up to the AI, and maybe the answer that you
get will not meet your expectations. Now if you want to shape the answer the way
you want, you have to add more components, like for example in this prompt. You
are saying: you are a senior SQL expert. Here we are defining the role for the
AI, so the AI should now act as an SQL expert. Then in the next section we are
adding context to the prompt. We are saying: I'm a data analyst working on SQL
projects using SQL Server. So now the answer that you get from the AI is going
to use the syntax of SQL Server and focus on the topic of analytics. That's why
the context is very important. Then we specify in the prompt the main task. We
say: explain the concept of SQL window functions and do the following. And now
we give finer details about what the AI should provide. We are saying: explain
each window function and show the syntax, describe why they are important and
when to use them, and list the top three use cases. So you are now specifying
what you are expecting from the AI. And after that, as a nice-to-have, we
specify the tone of the explanation. We say: the tone should be conversational
and direct, as if you are speaking to me one-to-one, so that it is not like you
are reading a document; you are reading something that is engaging. I know this
prompt is really big, but still you will get way better results than by only
saying explain the concept. So those are the main components that I usually use
if I'm starting a conversation and a discussion with
shajuti. Okay. Next I'm going to show you the frequently used prompts that I
you the frequently used prompts that I use in my projects. Now little bit
use in my projects. Now little bit awareness about using shajib in
awareness about using shajib in companies. If you are working in new
companies. If you are working in new company, make sure to ask about the
company, make sure to ask about the rules of using Shia Gibbt because some
rules of using Shia Gibbt because some companies offer their own chatbots for
companies offer their own chatbots for few security reasons. So make sure
few security reasons. So make sure always to check with the rules before
always to check with the rules before jumping immediately to sht. All right.
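The prompt components described earlier (role, context, task, details, tone) can be assembled mechanically. The following is a minimal sketch, not the exact prompt from the video: the function name `build_prompt` is hypothetical and the wording of each component is paraphrased from the description above.

```python
def build_prompt(role, context, task, details, tone):
    """Assemble a prompt from role, context, task, detail bullets and tone."""
    detail_lines = "\n".join(f"- {item}" for item in details)
    return (
        f"{role}\n\n"
        f"{context}\n\n"
        f"{task}\n"
        f"{detail_lines}\n\n"
        f"{tone}"
    )


# Component texts are paraphrased from the spoken description (assumption).
prompt = build_prompt(
    role="Act as an SQL expert.",
    context="I'm a data analyst working on SQL projects using SQL Server.",
    task="Explain the concept of SQL window functions and do the following:",
    details=[
        "Explain each window function and show the syntax.",
        "Describe why they are important and when to use them.",
        "List the top three use cases.",
    ],
    tone=(
        "The tone should be conversational and direct, "
        "as if you are speaking to me one-to-one."
    ),
)
print(prompt)
```

Keeping the components as separate arguments makes it easy to reuse the same role and context while swapping only the task and its details.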
So let's start with the first prompt. We can use ChatGPT to solve an SQL
task that you have in a project. Let's look at this prompt. It starts with
the context. I'm saying that I have an SQL Server database with two tables.
Now I have to describe my database to ChatGPT, so I'm saying we have a
table called orders with the following columns, and another table called
customers, and here are the columns for customers. With that I gave ChatGPT
context about the tables in my database, and I was also precise about the
database: it is SQL Server. Now that we have the context, the next step is
to tell the AI what to do. So I'm telling the AI: do the following. Write a
query to rank customers based on their sales. Then I detail what I expect
in the output: the result should include customer ID, full name, country,
total sales and so on. And here I'm adding more tasks. It's not enough to
have a query; I would like comments as well. So I'm saying: include
comments, but avoid commenting on obvious parts, because if you just say
"include comments" you will get a lot of unnecessary comments. Now, of
course, in SQL there is rarely just one solution for a task; there are
always different variants that achieve the same thing. I usually want to
understand my options. That's why I'm telling ChatGPT: write three
different versions of the query to achieve this task. Then I would like to
evaluate each of those versions, so I give the AI the task of evaluating
them with a focus on two things: being easy to read and having good
performance. Okay.
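To make the task concrete, here is a minimal sketch of what three versions of such a ranking query might look like. It is not the output shown in the video: the table layout, column names and sample rows are assumptions, and SQLite (through Python's `sqlite3` module) stands in for SQL Server so the example can run anywhere. In SQL Server the name concatenation would typically use `CONCAT` or `+` instead of `||`.

```python
import sqlite3

# SQLite stands in for SQL Server; window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                            first_name TEXT, last_name TEXT, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, sales INTEGER);
    INSERT INTO customers VALUES (1, 'Maria', 'Lopez', 'Germany'),
                                 (2, 'John', 'Smith', 'USA'),
                                 (3, 'Anna', 'Berg', 'Germany');
    INSERT INTO orders VALUES (1, 1, 50), (2, 1, 30), (3, 2, 100), (4, 3, 20);
""")

versions = {
    # Version 1: join and aggregate in a CTE, then rank in the main query.
    "cte": """
        WITH customer_sales AS (
            SELECT c.customer_id,
                   c.first_name || ' ' || c.last_name AS full_name,
                   c.country,
                   SUM(o.sales) AS total_sales
            FROM customers c
            JOIN orders o ON o.customer_id = c.customer_id
            GROUP BY c.customer_id, c.first_name, c.last_name, c.country
        )
        SELECT customer_id, full_name, country, total_sales,
               RANK() OVER (ORDER BY total_sales DESC) AS sales_rank
        FROM customer_sales
        ORDER BY sales_rank
    """,
    # Version 2: aggregate orders in a subquery first, then join.
    "subquery": """
        SELECT c.customer_id,
               c.first_name || ' ' || c.last_name AS full_name,
               c.country,
               s.total_sales,
               RANK() OVER (ORDER BY s.total_sales DESC) AS sales_rank
        FROM customers c
        JOIN (SELECT customer_id, SUM(sales) AS total_sales
              FROM orders
              GROUP BY customer_id) s ON s.customer_id = c.customer_id
        ORDER BY sales_rank
    """,
    # Version 3: one query, GROUP BY together with the window function.
    "single": """
        SELECT c.customer_id,
               c.first_name || ' ' || c.last_name AS full_name,
               c.country,
               SUM(o.sales) AS total_sales,
               RANK() OVER (ORDER BY SUM(o.sales) DESC) AS sales_rank
        FROM customers c
        JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.first_name, c.last_name, c.country
        ORDER BY sales_rank
    """,
}

results = {name: conn.execute(sql).fetchall() for name, sql in versions.items()}
for name, rows in results.items():
    print(name, rows)
```

All three versions return the same ranking on this sample data; which one is faster on real tables depends on the engine and the data, so it should be checked against the actual execution plan.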
So let's see the results ChatGPT gives us. We can see the first solution
here, where ChatGPT uses a CTE. In the CTE the tables are joined first, and
then we have a GROUP BY to aggregate the sales. In step two we have the
RANK window function to rank the sales. So of course you can do that. Let's
check version number two. Here it used a subquery, and it is also a nice
solution, where ChatGPT prepared the data first: it did the aggregation
before joining the data. Let's look at the last solution. Here we have a
single query using a window function which, as you can see, is the smallest
one. We don't have a CTE and we don't have any subqueries. It joins the
tables and does the GROUP BY together with the window function. After that
we get an evaluation from the AI which, as you can see, focuses on two
things: readability and performance. It says that with the CTE the
readability is really high compared to the subquery and to the last
version, where you have the GROUP BY together with the window function. I
totally agree with ChatGPT: the first version is the best one for
readability. Now checking the performance: for the first one the
performance is moderate, the second one, the subquery, is good, and the
last one is the best for performance. But of course, always test with the
execution plan. So as you can see, there is a trade-off between readability
and performance. If the priority is readability, go with version one. But
if the priority is performance, go with version three. As you can see, we
got three solutions for our one task, and you can now evaluate which one
you want to use. And this is really amazing, right?
All right, moving on to the next one that I frequently use: improving
readability. As you are creating an SQL query for a complex task, you might
end up with a lot of joins, subqueries, CTEs, hundreds of lines, and you
might lose the big picture. So what I always do is give the query to
ChatGPT and ask it to optimize it so that it is more readable, and to find
any redundancy in my query in order to consolidate it. So now let's check
the prompt. It says: the following SQL Server query is long and hard to
understand. And then we give the AI tasks. The first task is to improve its
readability. The next one is to detect any redundancy in the code in order
to remove it and consolidate the query, so to make our query compact and
small. And of course to include some comments, but not to comment on the
obvious parts. And whenever there are optimizations, there should be a
learning process. So I'm asking the AI to explain each improvement so I
understand the reasons behind it, so that the next time I'm writing queries
I can avoid those mistakes. And of course you have to give the query to the
AI.
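As an illustration of the kind of redundancy such a prompt targets, here is a small hypothetical before-and-after. It is not the query from the video: the `orders` table, its columns and the sample rows are assumptions, and SQLite (through Python's `sqlite3` module) stands in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, sales INTEGER);
    INSERT INTO orders VALUES (1, 1, 50), (2, 1, 30), (3, 2, 100), (4, 3, 20);
""")

# Redundant: two CTEs read the same table with the same grouping,
# and then have to be joined back together.
redundant = """
    WITH total_sales AS (
        SELECT customer_id, SUM(sales) AS total_sales
        FROM orders GROUP BY customer_id
    ),
    order_count AS (
        SELECT customer_id, COUNT(*) AS order_count
        FROM orders GROUP BY customer_id
    )
    SELECT t.customer_id, t.total_sales, c.order_count
    FROM total_sales t
    JOIN order_count c ON c.customer_id = t.customer_id
    ORDER BY t.customer_id
"""

# Consolidated: one CTE computes both aggregates in a single pass.
consolidated = """
    WITH customer_summary AS (
        SELECT customer_id,
               SUM(sales) AS total_sales,
               COUNT(*)   AS order_count
        FROM orders GROUP BY customer_id
    )
    SELECT customer_id, total_sales, order_count
    FROM customer_summary
    ORDER BY customer_id
"""

redundant_rows = conn.execute(redundant).fetchall()
consolidated_rows = conn.execute(consolidated).fetchall()
print(redundant_rows)
print(consolidated_rows)
```

Comparing the two result sets, as done here, is a simple way to confirm that a consolidated query suggested by the AI still returns what the original did.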
All right. So now let's check the answer from ChatGPT for my prompt. As you
can see, we have a really long query, and here in the result we have the
improved query. We can see that we have only one CTE. Well, that is crazy.
Before, we had five or six CTEs, and we can see here that ChatGPT managed
to put everything in one CTE, then do all the aggregations and the window
function, and then we have the final SELECT. Well, this is a huge
improvement over the previous query. Let's check the explanation. It says
it consolidated the CTEs, so it combined all the CTEs into one, and many
other things, like there were a lot of unnecessary joins and so on. And
here is a small improvement, where it uses CONCAT instead of the plus
operator, because CONCAT is supported across multiple databases. And here
we have the final benefits. We have a shorter query: instead of five CTEs
we have only one, and by combining the logic you can reduce the number of
scans of the tables, which is correct. So as you can see, it is the magic
of the AI. It found the issues in my code, improved the readability, and
reduced all the redundancy, unnecessary joins and so on in the SQL script.
Okay, moving on to the next prompt. It is about optimizing the performance
of my query. If you are working in big projects where you have millions of
rows in your tables, it can be an issue if you write queries that don't
follow the best practices for performance. That's why I double-check with
the AI whether my script follows the best practices for performance. As
usual, in the prompt we have to give the context: the following SQL Server
query is slow. Then we start giving the AI some tasks: propose
optimizations to improve its performance and provide me with the improved
SQL query. And I always like to understand the reason why it's better to
write it another way, so that next time I do better while writing the
query. So: explain each improvement so I understand the reasoning behind
it. And then at the end we give our query. Okay. So now let's run the
prompt on the following query. In this query we have a lot of bad
practices, for example doing aggregations using a correlated subquery. We
are using a lot of functions inside the WHERE clause, which is not really
good for indexing, we are using a lot of OR operators, and here we again
have a subquery. So let's check whether ChatGPT is going to find all those
bad practices.
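A hypothetical query with some of these bad practices, next to a rewrite that avoids them, might look like the sketch below. Neither query is the one from the video: the tables, columns and rows are assumptions, SQLite (through Python's `sqlite3` module) stands in for SQL Server, and `strftime('%Y', ...)` plays the role that `YEAR(...)` would play in SQL Server. The rewrite is only equivalent under the assumptions stated in the comments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         order_date TEXT, order_status TEXT, sales INTEGER);
    INSERT INTO customers VALUES (1, 'USA'), (2, 'Germany');
    INSERT INTO orders VALUES
        (1, 1, '2025-01-15', 'delivered', 10),
        (2, 1, '2024-12-31', 'delivered', 20),
        (3, 1, '2025-06-01', 'shipped',   30),
        (4, 2, '2025-03-03', 'delivered', 40),
        (5, 1, '2025-12-31', 'delivered', 50);
""")

# Bad practices: correlated subquery for the aggregate, functions wrapped
# around filtered columns, and IN with a subquery.
slow = """
    SELECT o.order_id, o.sales,
           (SELECT SUM(o2.sales) FROM orders o2
            WHERE o2.customer_id = o.customer_id) AS customer_total
    FROM orders o
    WHERE LOWER(o.order_status) = 'delivered'
      AND strftime('%Y', o.order_date) = '2025'
      AND o.customer_id IN (SELECT customer_id FROM customers
                            WHERE country = 'USA')
    ORDER BY o.order_id
"""

# Rewrite: pre-aggregated LEFT JOIN, bare columns in the filters, EXISTS.
# Assumes statuses are stored in lowercase and dates as ISO 'YYYY-MM-DD' text.
fast = """
    SELECT o.order_id, o.sales, t.customer_total
    FROM orders o
    LEFT JOIN (SELECT customer_id, SUM(sales) AS customer_total
               FROM orders GROUP BY customer_id) t
           ON t.customer_id = o.customer_id
    WHERE o.order_status = 'delivered'
      AND o.order_date BETWEEN '2025-01-01' AND '2025-12-31'
      AND EXISTS (SELECT 1 FROM customers c
                  WHERE c.customer_id = o.customer_id
                    AND c.country = 'USA')
    ORDER BY o.order_id
"""

slow_rows = conn.execute(slow).fetchall()
fast_rows = conn.execute(fast).fetchall()
print(slow_rows)
print(fast_rows)
```

The sketch only shows that both forms return the same rows on the sample data. Whether the rewrite is actually faster depends on the indexes and the optimizer, which is why each change should be tested on the real database.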
So let's check the results from ChatGPT. As you can see, we now have an
optimized query. It is a little bit longer, but I think it follows better
practices. We have a lot of changes here, so let's check what it did.
First, it replaced the LOWER in the query. It says it's not really good to
use functions in the WHERE clause, because they can keep the index from
being used. So it replaced the LOWER call with the order status column
without the function. The next one: it avoids the correlated subquery.
Instead of that it uses a LEFT JOIN, so it joins the table normally without
any correlated queries. It also avoids the YEAR function in the WHERE
clause, and instead of that it uses a range with BETWEEN. And the next one:
it uses EXISTS rather than IN, which is often better for performance. So as
you can see, you can use the AI to optimize the performance of your query
and convert it to a script that follows the best practices. Of course, my
recommendation is always: don't go blindly with all the changes that
ChatGPT suggests. Always take each recommendation one by one, test it, and
evaluate it using your own knowledge.
Okay, on to the next one. It is an interesting one: we can use the AI to
understand the execution plan. Execution plans are usually an advanced
topic. You need a lot of know-how and experience in order to read and
understand an execution plan, and if you have a big query it's going to be
a real nightmare to understand the flow and where exactly the issue is. But
now we are not alone. We have an assistant, the AI, to help us understand
this complex stuff. So what we can do is take a screenshot of the execution
plan, upload it to ChatGPT, and say: the image is the execution plan of an
SQL Server query. And now we give it the following tasks. First: describe
the execution plan step by step. After that I'm going to tell it to
identify the performance bottlenecks: where exactly is the issue, what
makes my query slow? This is of course the hardest part of reading an
execution plan. And once it has identified the performance issues, I'm
going to ask it to suggest ways to improve the performance and optimize the
execution plan. So: first understand the execution plan, then identify the
issues, then how to optimize it. Okay.
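The plan in the video comes from SQL Server and is shared as a screenshot. As a small stand-in, the difference between scanning a whole table and looking rows up through an index can be observed with SQLite's `EXPLAIN QUERY PLAN`, run here through Python's `sqlite3` module. This is a sketch under assumptions: the table name `orders_archive` echoes the one mentioned in the video, but its columns, the data and the index name are hypothetical, and SQLite's plan output is far simpler than SQL Server's (there is no clustered versus nonclustered distinction, for example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_archive "
    "(order_id INTEGER, customer_id INTEGER, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO orders_archive VALUES (?, ?, ?)",
    [(i, i % 100, i) for i in range(1000)],
)

query = "SELECT order_id, sales FROM orders_archive WHERE customer_id = 42"


def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Without an index the engine has to read every row of the table.
before = plan(query)

# With an index on the filtered column it can search instead of scan.
conn.execute(
    "CREATE INDEX idx_orders_archive_customer ON orders_archive (customer_id)"
)
after = plan(query)

print("before:", before)
print("after: ", after)
```

The exact wording of the plan lines varies between SQLite versions, but the first plan reports a scan of the table and the second a search using the new index.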
So now, after uploading the screenshot and asking the AI, we have the
following results. We can see a detailed explanation of the execution plan,
and there are a lot of details; I will not go through everything. We start
with the table scans, then the clustered index scan and the nested loops,
so we have several nested loops, and then the aggregation and the final
step. So now we have a nice explanation of what SQL is doing behind the
scenes for my query, and you don't have to be an expert in execution plans;
you can ask the AI about it. Now what is very important is to understand
where the bottlenecks are, what the problems are. So let's see what we have
here. The first one: we have a table scan, which is really bad. That means
this table, the orders archive, does not have any index. It says the table
scan indicates a lack of a useful index on the table, which forces the
engine to scan all the rows of the table. And what is also very important
is the nested loops in the joins. This is really bad if you have big
tables. Here it says it's fine if you have small data sets, but it is going
to be really problematic if you have many rows. So as you can see, we are
getting more knowledge about the issues in our execution plan. And the last
step is the suggestions. The first and most obvious one is to add an index
to the orders archive, a nonclustered index. Well, if there's no index at
all, I would go first with a clustered index, not immediately with a
nonclustered index. And then there are some other best practices, but I
think this one is very relevant: changing the join type. You can use hints
in order to get a merge join or a hash join. So now we understand how it
works, where the issues are, and what the suggestions are to fix them.
All right, the next prompt is about debugging. As you are writing a complex
SQL query, you might get an error from the database when you execute it,
and sometimes it is challenging to find the root cause of the issue. So we
have the following prompt. First, the context says: the following SQL
Server query is causing this error. Then we paste the error message that we
are getting, and then we ask the AI to do the following. First, explain the
error message, since I would like to have a better understanding of the
error. Then we ask the AI to find the root cause of the issue in my script.
And after finding the problem, we ask the AI to suggest how to fix it. And
of course we also have to give our SQL query in the prompt. All right. So
now I have the following query, and if I execute it, I get the following
error. It says the column Sales.Orders.Sales is invalid in the select list
because it is not contained in the aggregations, and so on. So I don't
really understand what's going on. Let's ask the AI about it.
So let's check what ChatGPT answered. When you are using GROUP BY, every
non-aggregated column in the SELECT must appear in the GROUP BY as well.
And it says that in your query you are selecting a few columns: this one is
valid, the other two are valid as well, but we have one inside the RANK
function, and it is invalid. Okay. So now we can see more details about the
root cause. It says that when you are using a window function like RANK, it
doesn't work directly with the aggregate functions. So here it indicates
clearly that the sales inside the RANK function is the issue. So let's see
the fix. Since the plain sales column is not available here at all, you
cannot have sales in the partition. That's why the fix is to use the sum of
sales, because we have it in the SELECT. And here you also have a nice
explanation of the fix. So you can see we have an explanation of the error
message and the root cause, it points to exactly where the issue is, it
suggests a fix, and it explains the fix. And these are exactly the steps
that you have to follow when you are debugging code.
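The fix described above can be sketched as follows. The query from the video is not shown in the transcript, so the `orders` table, its columns and the exact shape of the window function are assumptions. SQLite (through Python's `sqlite3` module) stands in for SQL Server, and because SQLite is more permissive than SQL Server about non-aggregated columns in a grouped query, only the corrected form is executed; the rejected form appears as a comment.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, sales INTEGER);
    INSERT INTO orders VALUES (1, 1, 50), (2, 1, 30), (3, 2, 100), (4, 3, 20);
""")

# Shape of the failing query (assumed): the window function refers to the
# raw sales column, which is neither grouped nor aggregated.
#
#   SELECT customer_id, SUM(sales) AS total_sales,
#          RANK() OVER (ORDER BY sales DESC) AS sales_rank
#   FROM orders
#   GROUP BY customer_id

# Corrected query: the window function uses the aggregate instead.
fixed = """
    SELECT customer_id,
           SUM(sales) AS total_sales,
           RANK() OVER (ORDER BY SUM(sales) DESC) AS sales_rank
    FROM orders
    GROUP BY customer_id
    ORDER BY sales_rank
"""

rows = conn.execute(fixed).fetchall()
print(rows)
```

The window function is evaluated after the grouping, so at that point only the grouped columns and the aggregates exist, which is why `SUM(sales)` works where the raw column does not.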
All right, moving on to the next prompt. We can use AI to explain the
result that I'm getting from SQL. Sometimes you have an SQL query in a
project and you don't understand why you are getting specific results. As
usual, we start with the context. We tell the AI: I didn't understand the
result of the following SQL Server query. And then we ask the AI to do the
following. First, break down how SQL processes the query step by step. I
would also like to get an explanation of each stage and how the result is
formed. So as you can see, I don't need any optimizations here and I don't
need any query in the output; I just need an explanation. And then at the
end you paste your query. Okay. So now we have the following query. We have
a recursive CTE where we are generating numbers between 1 and 20. I can
tell you, recursive CTEs are usually complicated to understand. So maybe we
are having a hard time understanding the result of this query.
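A recursive CTE of the kind described here can be sketched as follows. The exact query from the video is not in the transcript, so the CTE name `series` and the column name `n` are assumptions. SQLite (through Python's `sqlite3` module) stands in for SQL Server; SQL Server writes the same thing with plain `WITH`, without the `RECURSIVE` keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

sql = """
    WITH RECURSIVE series(n) AS (
        SELECT 1                 -- anchor query: runs once, yields 1
        UNION ALL
        SELECT n + 1             -- recursive query: adds 1 to the current value
        FROM series
        WHERE n < 20             -- filter: stops the recursion once n reaches 20
    )
    SELECT n FROM series ORDER BY n
"""

numbers = [row[0] for row in conn.execute(sql)]
print(numbers)
```

The anchor query produces the first row, each recursive step derives the next row from the previous one, and the `WHERE` filter is what ends the loop.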
understanding the result of this query. After asking the AI about it, we got the
After asking the AI about it, we got the explanation first about the query
explanation first about the query structure. So it says you are using the
structure. So it says you are using the CTE with the main query. Well, okay. But
CTE with the main query. Well, okay. But what is very interesting is to
what is very interesting is to understand step by step how SQL executed
understand step by step how SQL executed this query. So it tells the step one
this query. So it tells the step one it's going to go and execute the anchor
it's going to go and execute the anchor query and that's why we will get first
query and that's why we will get first the one and then the next step the
the one and then the next step the recursive query going to be executed for
recursive query going to be executed for the first time. So it is saying okay we
the first time. So it is saying okay we are adding one to the current value. So
are adding one to the current value. So as you can see 1 + 1 we will get two and
as you can see 1 + 1 we will get two and then in the iteration two we will get 2
then in the iteration two we will get 2 + 1 3 and it will keep repeating this
+ 1 3 and it will keep repeating this process until we get all the result from
process until we get all the result from 1 to 20. And then as well we have here
1 to 20. And then as well we have here an explanation about the termination of
an explanation about the termination of the recursive query. So it's saying the
the recursive query. So it's saying the filter is the way out of the loop. So
filter is the way out of the loop. So once we reach the 20 it will stop. And
once we reach the 20 it will stop. And then a few informations about the main
then a few informations about the main query and with that you will get a deep
query and with that you will get a deep knowledge about how works and why you
knowledge about how works and why you are seeing those results. This is really
are seeing those results. This is really amazing use case for the GBT. All right
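
The steps described here (anchor query, recursive query, filter as the way out) can be sketched as follows in SQL Server. The names Series and MyNumber are assumptions for illustration; the exact query from the video is not shown in the transcript.

```sql
-- Recursive CTE that generates the numbers 1 to 20 (SQL Server syntax).
-- Series and MyNumber are illustrative names, not taken from the video.
WITH Series AS (
    -- Anchor query: runs once and returns the starting value 1
    SELECT 1 AS MyNumber

    UNION ALL

    -- Recursive query: adds 1 to the value produced by the previous iteration
    SELECT MyNumber + 1
    FROM Series
    WHERE MyNumber < 20   -- the filter is the way out of the loop
)
-- Main query: reads the rows collected from all iterations
SELECT MyNumber
FROM Series;
```

When MyNumber reaches 20, the filter no longer lets any row through, the recursive query returns nothing, and the recursion ends.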
All right, friends. So now we're going to talk about my favorite prompts. We
can use the AI to style and format my code. Once you are done writing a
complex query to solve a task, and everything is correct and optimized for
performance as well, it's time to review your code in order to style and
format your script. So we have the following prompt. It says: the following
SQL Server query is hard to understand. Then we ask the AI to do the
following: restyle the code to make it easier to read. The next task for the
AI is to align all the column aliases. Sometimes, if you use a tool to style
and format your code, you will find that it introduces a lot of new lines. So
I tell the AI: keep it compact, do not introduce unnecessary new lines. And
the last task for the AI is to make sure it follows best practices. And of
course, what do we need at the end? Our query.
Okay, so now we have the following query. As you can see, it is a very
annoying query that is really hard to read, and that's because the formatting
and styling of the query are really bad. I don't even want to talk about the
alignment and so on. But as you can see, the keywords are sometimes lowercase
and sometimes uppercase. And of course, if you are developing and writing code
and you deliver something like this, it is really not nice. So let's see how
ChatGPT can fix it. Okay. So now, after executing the prompt, as you can see,
my query looks way nicer. First of all, all the keywords are uppercase, and
then you can see our CTEs are really nice to read. We have enough spacing
here. The alignment of everything looks really nice, the CASE statement is
very clear, and the main query over here is easy to read as well. So it has
done a wonderful job styling and formatting my code, and here you have an
explanation of what changed. First it is saying, okay, all the keywords are
capitalized, then the alignment of the aliases and the columns, and so on. So
with that we got a really nicely styled and formatted query that we can share
with others.
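
To make the difference concrete, here is a small before-and-after sketch. The table variable @Orders, its columns and the sample rows are assumptions made only so the example runs on its own in SQL Server; it is not the query from the video.

```sql
-- Hypothetical sample data so that both queries can be executed as they are.
DECLARE @Orders TABLE (OrderID INT, CustomerID INT, Sales INT);
INSERT INTO @Orders (OrderID, CustomerID, Sales)
VALUES (1, 1, 40), (2, 1, 80), (3, 2, 30);

-- Before: mixed keyword casing, no spacing, everything on one line.
select CustomerID,sum(Sales) as TotalSales,Count(*) AS OrderCount from @Orders where Sales>0 group BY CustomerID;

-- After: uppercase keywords, one column per line, aligned aliases,
-- and no unnecessary new lines.
SELECT
    CustomerID,
    SUM(Sales) AS TotalSales,
    COUNT(*)   AS OrderCount
FROM @Orders
WHERE Sales > 0
GROUP BY CustomerID;
```

Both statements return the same result; only the readability changes.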
Okay, moving on to the next one. We can use AI to generate documentation and
also to add comments to my code. Creating documentation and adding comments to
code is usually something very annoying for developers. And sadly, I see a lot
of developers who tend not to add any comments or anything to their code. Of
course, this is really bad, because you are not thinking about the other
developers who are reading your code.
No, God! No, God, please, no!
And since this process is annoying and takes time, we can use the help of AI
to speed up creating this stuff. So let's check the following prompt. It says:
the following SQL Server query lacks comments and documentation. First we are
saying: insert a leading comment at the start of the query describing its
overall purpose. This is what we usually do: we add a short description of the
following code at the start. Then it should add comments only where
clarification is necessary, and, very important, it should avoid obvious
statements. So it's like with indexing: don't over-comment your code. And
usually, if you are creating a query for data analytics, it's really good to
explain the business rules and transformations that you are doing inside your
query, and maybe to have another document describing how the query works. So
for now we are asking it to add comments and documentation, and of course you
have to add your query.
Okay. So now I just used this prompt on one of my queries. Let's check the
results. The first comment is the most important one, because it gives the
overall purpose of the whole query. So let's see what it's saying. It's saying
this query segments customers based on their total sales and provides a list
of customers with their total sales and their assigned segments. So we have
here a customer segmentation: high value, medium value and low value. With
this comment we have the overall purpose of the query, and then we have the
inline comments, like here. For the first CTE it says: calculate the total
sales for each customer. And for the second CTE we have a full description of
how the segment is built, and this of course comes from the business rule for
the customer segments. So it says high value is for total sales above, like,
100, and between, and so on. Well, this CASE WHEN is really easy, so you can
actually read it from the CASE WHEN itself. But if you have complex queries,
it's really nice to have the full text of the CASE WHEN. And then at the main
query you can see the final output and the inline comments. So as you can see,
we have really nice comments inside our code. Next, we have a document about
the business rule. And I totally agree with the AI that the business rule here
is about the customer segmentation. So again we have very nice, short
documentation of our business rules, and then we have another document about
how the query works. Well, I think this is too much for a small query. We can
ask ChatGPT to make the documentation shorter. So as you can see, we have full
documentation of our query and our business rules, and we have really nice
comments in our code.
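
A commented query of the kind described here might look like the following sketch. The table variable @Orders, the sample rows and especially the segment thresholds (above 100, 50 to 100, below 50) are assumptions for illustration; the transcript does not give the exact rule.

```sql
/* Purpose: segment customers by their total sales and return each customer
   with the total sales and the assigned segment (High/Medium/Low Value). */

-- Hypothetical sample data so the query can be executed as it is.
DECLARE @Orders TABLE (OrderID INT, CustomerID INT, Sales INT);
INSERT INTO @Orders (OrderID, CustomerID, Sales)
VALUES (1, 1, 90), (2, 1, 60), (3, 2, 70), (4, 3, 20);

WITH CustomerSales AS (
    -- Total sales for each customer
    SELECT
        CustomerID,
        SUM(Sales) AS TotalSales
    FROM @Orders
    GROUP BY CustomerID
),
CustomerSegments AS (
    -- Business rule (thresholds are assumed, not taken from the video):
    --   High Value   : total sales above 100
    --   Medium Value : total sales between 50 and 100
    --   Low Value    : total sales below 50
    SELECT
        CustomerID,
        TotalSales,
        CASE
            WHEN TotalSales > 100 THEN 'High Value'
            WHEN TotalSales BETWEEN 50 AND 100 THEN 'Medium Value'
            ELSE 'Low Value'
        END AS CustomerSegment
    FROM CustomerSales
)
-- Final output: one row per customer with total sales and segment
SELECT
    CustomerID,
    TotalSales,
    CustomerSegment
FROM CustomerSegments
ORDER BY TotalSales DESC;
```

Note that the comments explain the purpose and the business rule, not what each keyword does.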
All right. Now moving on to the next prompt. It is very important for
improving the whole project, the whole database. What we're going to do is
take our DDL scripts, give them to the AI, and ask the AI to optimize our
database DDL. There are a lot of things that you can optimize in a database.
So let's check this prompt. It says: the following SQL Server DDL script has
to be optimized. And we ask the AI for the following tasks. The first one is
to check the naming. If you have a database with a lot of tables and columns
and so on, you should always be working with a specific naming convention. So
this is just to make sure that the naming you are using is consistent. Then,
what is very important in DDLs is the data types. Data types play a very
crucial role in optimizing your queries. So we are telling the AI to check
the data types and whether they are optimized as well. The next point is
about data integrity. If you are building a relational database, you will
have a lot of primary keys and foreign keys, and you can tell the AI to check
the integrity of all those keys. The next point is about indexes. Here you can
tell the AI to check the overall indexing in the DDL scripts, just to make
sure that you are not missing anything, and also to check whether we have
duplicates. So it is a really great check. And the last check is the
normalization of the tables: checking the data model and whether there are
any suggestions about splitting and normalizing tables, or whether there is
some weird redundancy.
Okay. So now what we're going to do is let ChatGPT optimize the DDL of the
SalesDB. We have here the DDL of the customers, employees, orders and so on.
And after running it, we have the following results. So here we have the DDL
again, but the optimized one, and the AI is adding comments about the changes.
Here it added auto-increment to the primary key. And here, for example, there
is a check that the score is not negative. For the employees, here is another
check to make sure that the birth date is not in the future. All those
constraints are there to make sure that the quality of the data in the table
is good. And here, for the gender, it is restricting the valid values that can
be used in this column, and many other things. At the end we have the key
changes. About the naming, it's saying that we have to stick with one naming
convention. It understood that we are using Pascal case, and for those two
columns we have an issue: for example, this Product should be called
ProductName. For the data types, I don't want to go into all the details. For
example, it says don't use INT, use DECIMAL for the price and sales. For the
integrity, it is saying to add foreign keys. I think for the orders we don't
have any foreign keys in the DDL, so ChatGPT added all the foreign keys to the
DDL. That was good. And about the indexing, it says that since we have primary
keys, we automatically get clustered indexes, and the foreign keys should get
an index as well in order to improve the queries, and so on. So as you can
see, there are a lot of optimizations that can be done in our DDL. If you are
working on a project and you have a DDL, go and ask the AI what could be
optimized. I'm sure you will find something, and this is very critical,
because having a solid and optimized DDL of course improves the speed of the
queries.
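
As a rough sketch of what such an optimized DDL can look like in SQL Server, here is a reduced version with the kinds of changes mentioned: auto-increment keys, check constraints, DECIMAL for money values, foreign keys and indexes on them. Table names, column names, sizes and the allowed gender values are assumptions for illustration, not the DDL from the video.

```sql
-- Run in an empty scratch database. All names and sizes are illustrative.
CREATE TABLE Customers (
    CustomerID INT IDENTITY(1,1) PRIMARY KEY,  -- auto-increment; clustered index by default
    FirstName  NVARCHAR(50) NOT NULL,
    Score      INT NULL,
    CONSTRAINT CK_Customers_Score CHECK (Score >= 0)  -- no negative scores
);

CREATE TABLE Employees (
    EmployeeID INT IDENTITY(1,1) PRIMARY KEY,
    FirstName  NVARCHAR(50) NOT NULL,
    BirthDate  DATE NULL,
    Gender     CHAR(1) NULL,
    -- Birth date must not be in the future
    CONSTRAINT CK_Employees_BirthDate CHECK (BirthDate <= CAST(GETDATE() AS DATE)),
    -- Restrict the valid values (the value list is an assumption)
    CONSTRAINT CK_Employees_Gender CHECK (Gender IN ('M', 'F'))
);

CREATE TABLE Products (
    ProductID   INT IDENTITY(1,1) PRIMARY KEY,
    ProductName NVARCHAR(100) NOT NULL,         -- consistent PascalCase naming
    Price       DECIMAL(10, 2) NOT NULL         -- DECIMAL instead of INT for money
);

CREATE TABLE Orders (
    OrderID       INT IDENTITY(1,1) PRIMARY KEY,
    ProductID     INT NOT NULL,
    CustomerID    INT NOT NULL,
    SalesPersonID INT NOT NULL,
    OrderDate     DATE NOT NULL,
    Sales         DECIMAL(10, 2) NOT NULL,
    -- Foreign keys protect the integrity between the tables
    CONSTRAINT FK_Orders_Products  FOREIGN KEY (ProductID)     REFERENCES Products (ProductID),
    CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerID)    REFERENCES Customers (CustomerID),
    CONSTRAINT FK_Orders_Employees FOREIGN KEY (SalesPersonID) REFERENCES Employees (EmployeeID)
);

-- SQL Server does not index foreign key columns automatically,
-- so they get their own nonclustered indexes.
CREATE NONCLUSTERED INDEX IX_Orders_ProductID     ON Orders (ProductID);
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID    ON Orders (CustomerID);
CREATE NONCLUSTERED INDEX IX_Orders_SalesPersonID ON Orders (SalesPersonID);
```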
All right, so now we come to a very useful use case of AI for your SQL
projects, and that is using AI to generate test data sets. It is always really
nice to have small data sets in order to test the logic of your query.
Sometimes you are building logic for a scenario that does not exist yet in
your database, and of course, if you are not able to test the scenario that
you are developing, it can be really bad. Generating data sets for your code
is always a very painful process, but of course now it is easier, because we
have the help of AI. So let's check the following prompt. It says: I need data
sets for testing the following SQL Server DDL. Next we have to specify
different tasks for the AI. The first one is to define the shape of the data
sets. So how do you want the output? Do you want it as INSERT statements, or
do you want it as an Excel file or another file, and so on. For the next
specification, I always like to have a data set that is relevant and
realistic, and not to get dummy data. So again, these are only configurations
for the data set. The next configuration is that I would like to have small
data sets. Of course, you can specify for ChatGPT the exact size of your data
sets. You can say, I would like to have 100,000 rows, or millions of rows, and
so on. So you can define the size that you want. For me, I would like to have
small data sets. And what is very important: if you have multiple tables in
your DDL, and those tables have primary keys and foreign keys, the data set
should be correct. So the AI should generate keys that are joinable, so that
if you join the data together, you will not get weird results. And of course,
you can keep adding specifications, such as whether you want to have nulls or
no nulls in your data set. Here, for example, I'm saying: don't introduce any
null values. And of course, at the end you have to give the DDL to the AI. It
could be one table or the whole database, so you could generate a data set for
one table or for hundreds of tables.
Okay. So now I'm asking ChatGPT to create test data sets for two tables, the
employees and the orders. Let's check the results. We can see very small, nice
INSERT statements for the table employees. We have here five employees with
different information. And for the table orders we have a lot of columns. As
you can see, we have four orders. What is very important is that the
salesperson ID comes from the table employees. So as you can see, we have 2
and 1, which already exist in the employees. For the rest of the information,
we have fake addresses and so on. So with that we have very nice test data
sets that can be inserted into our database to test our queries. Of course, we
can ask to extend it: instead of only four orders we can go with 20 orders,
and so on. So we can change the size of it. And here we have some notes about
the data itself. So it is really amazing: we are now generating this data
from our DDLs.
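
A generated test data set of this shape could look like the sketch below: five employees, four orders, no NULL values, and every SalesPersonID pointing to an existing employee. The reduced table definitions and all names, dates and addresses are made up for illustration.

```sql
-- Run in an empty scratch database (SQL Server). All values are fictitious.
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName  NVARCHAR(50) NOT NULL,
    LastName   NVARCHAR(50) NOT NULL,
    Department NVARCHAR(50) NOT NULL,
    BirthDate  DATE NOT NULL
);

CREATE TABLE Orders (
    OrderID       INT PRIMARY KEY,
    SalesPersonID INT NOT NULL REFERENCES Employees (EmployeeID),
    OrderDate     DATE NOT NULL,
    ShipAddress   NVARCHAR(100) NOT NULL,
    Sales         DECIMAL(10, 2) NOT NULL
);

-- Five employees
INSERT INTO Employees (EmployeeID, FirstName, LastName, Department, BirthDate)
VALUES
    (1, 'Anna',  'Keller',  'Sales',     '1988-04-12'),
    (2, 'David', 'Nguyen',  'Sales',     '1991-09-30'),
    (3, 'Maria', 'Santos',  'Marketing', '1985-01-22'),
    (4, 'Omar',  'Hassan',  'Finance',   '1979-11-05'),
    (5, 'Julia', 'Becker',  'Sales',     '1994-06-17');

-- Four orders; SalesPersonID only uses IDs that exist in Employees
INSERT INTO Orders (OrderID, SalesPersonID, OrderDate, ShipAddress, Sales)
VALUES
    (1, 2, '2025-01-10', '12 Oak Street, Springfield',   120.00),
    (2, 1, '2025-01-15', '48 Pine Avenue, Riverton',      75.50),
    (3, 2, '2025-02-02', '7 Maple Road, Lakeside',       210.00),
    (4, 5, '2025-02-20', '301 Elm Boulevard, Fairview',   42.90);

-- Sanity check: lists orders without a matching employee (returns no rows here)
SELECT o.OrderID
FROM Orders AS o
LEFT JOIN Employees AS e
    ON e.EmployeeID = o.SalesPersonID
WHERE e.EmployeeID IS NULL;
```

The foreign key would already reject an order with an unknown SalesPersonID; the final query is just an extra check that the keys are joinable.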
All right. So now we have the following query, and of course we are using SQL
Server. Let's say that you are migrating from SQL Server to MySQL. So let's
ask ChatGPT to convert my code to MySQL. All right. After running it, as we
can see, we now have the same query but in MySQL. Instead of ISNULL we are
using COALESCE, and here we are using the CONCAT function instead of the plus
operator. Instead of GETDATE, in MySQL we use the NOW function. And the last
thing: here we are using TOP 10, but in MySQL we use LIMIT 10. And here we
have a really nice explanation of the translation. So as you can see, it is
amazing. If you are working in companies and on projects, it might happen
that there is a decision to start migrating from one database to another
database, and then your project is going to get the big task of migrating the
data, the DDLs, the queries and everything. I really recommend using ChatGPT
to help with the migration; otherwise this big task might take a really long
time. So as you can see, it is really amazing how ChatGPT can improve the
speed of your projects.
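
The four differences mentioned for the SQL Server to MySQL migration (ISNULL vs. COALESCE, + vs. CONCAT, GETDATE vs. NOW, TOP vs. LIMIT) can be seen side by side in this sketch. The table, columns and sample rows are assumptions; it is not the query from the video.

```sql
-- SQL Server version. @Customers and its rows are made up so the query
-- runs on its own.
DECLARE @Customers TABLE (
    CustomerID INT,
    FirstName  NVARCHAR(50),
    LastName   NVARCHAR(50)
);
INSERT INTO @Customers (CustomerID, FirstName, LastName)
VALUES (1, 'Maria', 'Lopez'), (2, 'John', NULL);

SELECT TOP 10                                            -- row limit: TOP
    CustomerID,
    FirstName + ' ' + ISNULL(LastName, '') AS FullName,  -- + and ISNULL
    GETDATE() AS LoadedAt                                -- current date and time
FROM @Customers
ORDER BY CustomerID;

/* MySQL version of the same SELECT, assuming a regular table named Customers
   with the same columns:

SELECT
    CustomerID,
    CONCAT(FirstName, ' ', COALESCE(LastName, '')) AS FullName,  -- CONCAT and COALESCE
    NOW() AS LoadedAt                                            -- current date and time
FROM Customers
ORDER BY CustomerID
LIMIT 10;                                                        -- row limit: LIMIT
*/
```

COALESCE also exists in SQL Server, so using it instead of ISNULL from the start makes a later migration a little easier.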
Okay. Now in the next section I'm going to show you the prompts that you can
use as a student, or if you are learning any new programming language. Okay.
So the first thing that you can do with ChatGPT is ask it to generate an SQL
course. You can ask ChatGPT to guide you step by step in your journey of
learning any programming language, and do it completely one-to-one with the
AI. First, it is very important when creating a course to give enough context.
In this example it is very short. I'm saying: create an SQL course with a
detailed roadmap and agenda. But of course you can give more specifications.
You can describe your current knowledge. You can specify which database type
you would like to work with, MySQL or SQL Server. The more context and
details you give the AI, the better the results you're going to get. Then you
configure your course. You can say, for example: start with SQL fundamentals
and advance to complex topics. We can also say: make it beginner-friendly,
which is important if it is the first time you are learning about the topic.
Now we have to shape the focus of the course. Here I'm saying: include topics
that are relevant for data analytics, because SQL is widely used in different
areas, such as data engineering and data analytics. It's really important in
each course to focus on use cases, so we are saying: focus on real-world data
analytics use cases and scenarios. And of course you can add more details
about your course.
Okay. So now I just asked ChatGPT to make this course. Let's see the roadmap
and the structure of our course. Let's start with phase one, the SQL
fundamentals. It starts with the basics: SELECT, WHERE and so on. In the next
section we are talking about ORDER BY, GROUP BY and INSERT, UPDATE, DELETE.
So the basic stuff. Next in the roadmap you get phase two, intermediate SQL.
Here we are talking about inner joins, a few functions for text and dates, and
the CASE statements and views. Then in phase three we have advanced SQL for
analytics. So we have the window functions, the CTEs, and data cleaning using
the NULL functions and a few transformations. Then we go to phase number four.
Here in your roadmap you start talking about real-world use cases, and you
have multiple projects. So as you can see, this is a really solid roadmap for
learning SQL. In the next step, you can start deep diving into each of those
chapters and tell the AI to start, okay, with phase number one, with week one,
and give more details.
All right. So now the next one. Once you have the agenda and the roadmap for
learning SQL, you can focus on a specific chapter, on specific SQL concepts.
In this prompt we are giving the context first: I want a detailed explanation
of SQL window functions. After that we are specifying for the AI the exact
structure of the explanation. First it should explain what window functions
are, and maybe also give an analogy in order to understand exactly what
window functions are. After that it should explain why we need them and when
to use window functions. Once you understand the basics, you can start
learning about the syntax of the window functions, and it should provide a few
simple examples as well. At the end, the AI should show you the best or most
frequent use cases for SQL window functions. So this is the pattern that I
like for learning something new.
AI is going to explain the SQL window functions. As you can see, it starts
with the big title "Understanding SQL Window Functions". We have here a
quick definition, and then an analogy about a teacher grading students.
Well, that's nice, because we have the RANK function. So you have a nice
analogy for the window function, and then we understand why we need window
functions. Well, I totally agree: in order to have row-level details
together with the aggregations. So you can do aggregations while
maintaining the row-level details, and you can do complex calculations as
well, because you cannot do everything with GROUP BY; there are functions
that only work with a window. Then we have some explanation of when to use
them. Next we see the syntax of the window function. It is divided into the
function, OVER, PARTITION BY and ORDER BY, with a few explanations about
that.
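The parts of that syntax can be sketched as a small runnable example. This is only an illustration, not what the AI produced in the video: the `sales` table, its columns and its rows are made up, and Python's built-in `sqlite3` module (with an SQLite version that supports window functions, 3.25 or newer) is assumed just so the query can be executed.

```python
import sqlite3

# Hypothetical sample data; table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (employee TEXT, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Anna", "North", 100), ("Ben", "North", 300), ("Cara", "South", 200)],
)

query = """
SELECT
    employee,
    region,
    amount,
    SUM(amount) OVER (          -- the function, followed by OVER
        PARTITION BY region     -- one window per region
        ORDER BY amount         -- row order inside each window
    ) AS running_amount
FROM sales
ORDER BY region, amount
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every input row is still present in the output; the window function only adds a calculated column next to the row-level details.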
Then we have a few simple examples with queries, explaining the different
functions, but not all of them. Of course, you can go and ask ChatGPT to
extend the examples to all functions. And now we can see the top three use
cases for the window functions. We use them in order to rank the data, and
as well to build running totals and moving averages. And at the end we have
a summary. So as you can see, we have a wonderful explanation of the
concept of the SQL window functions.
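Those three use cases (ranking, running total, moving average) can be sketched in one query. This is a minimal example under assumptions: the `daily_sales` table and its values are invented, the two-row moving average window is an arbitrary choice, and `sqlite3` is used only to make the SQL runnable.

```python
import sqlite3

# Hypothetical data set: one sales amount per day.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [(1, 10), (2, 30), (3, 20), (4, 40)],
)

query = """
SELECT
    day,
    amount,
    -- Use case 1: rank the days by amount, highest first
    RANK() OVER (ORDER BY amount DESC) AS amount_rank,
    -- Use case 2: running total up to and including the current day
    SUM(amount) OVER (ORDER BY day) AS running_total,
    -- Use case 3: moving average over the previous and the current day
    AVG(amount) OVER (
        ORDER BY day
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS moving_avg
FROM daily_sales
ORDER BY day
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```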
Okay, moving on to the next one. And this one I use very frequently in my
projects. In programming there are always different concepts that are very
close to each other, and sometimes it is confusing and not really clear
what the big differences between them are. So here I have for you a prompt
in order to compare different SQL concepts. The prompt says: "I want to
understand the differences between SQL window functions and GROUP BY." Both
of them are usually used to aggregate data in SQL, and I would like to
understand more about the differences between them. So we define for the AI
the following tasks. Explain the key differences between the two concepts.
Then it's really important to understand when to use what, so describe when
to use each concept, with examples. It's really nice to understand as well
the advantages and the disadvantages of each concept. And at the end you
would maybe like to get a quick summary of the differences between those
two concepts side by side in one table. Okay. So now
let's see how ChatGPT can compare those two concepts. First we have a
really nice table in order to see the differences between those two. For
example, for the output granularity it says the window function provides
calculations at the row level of detail, whereas GROUP BY provides
aggregated results at the group level of detail. And if you are talking
about the functions, the window function allows ranking, running totals and
moving averages, and GROUP BY allows only the basic aggregations like SUM,
AVG and COUNT. So this is a really nice overview of the differences. Then
we have when to use which concept. It's telling us the window function is
used if you want row-level details together with the aggregations, and here
you have a nice example. For GROUP BY it says you can use it, for example,
when summarizing data into categories, like here, grouping the data by
region. After that we have pros and cons for each concept. The advantage of
the window function is that we get all the rows, and for GROUP BY it is
easier to understand and to use. The disadvantage of the window function is
that it is more complex. For GROUP BY the disadvantage is that it removes
the details about the rows. And at the end we have a side-by-side
comparison between those two concepts. So as you can see, we have a really
nice, fully detailed comparison between those two SQL concepts.
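The granularity difference is easy to see when both approaches run over the same rows. The sketch below is an illustration only, with an invented `sales` table; `sqlite3` is assumed simply as a convenient way to execute the two queries.

```python
import sqlite3

# Hypothetical sample data: three sales rows in two regions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sales_id INTEGER, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "North", 100), (2, "North", 300), (3, "South", 200)],
)

# GROUP BY: one output row per group, the row-level details are gone.
grouped = conn.execute(
    """
    SELECT region, SUM(amount) AS region_total
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()

# Window function: every input row is kept, the aggregate sits next to it.
windowed = conn.execute(
    """
    SELECT sales_id, region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY sales_id
    """
).fetchall()

print(grouped)   # 2 rows, one per region
print(windowed)  # 3 rows, one per sale
```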
Practicing SQL with the AI. Well, it is really not enough to just read
about something, or to follow and watch a course, in order to learn
something. You always have to practice. And of course, it is really hard to
find materials in order to practice a new programming language. So, we can
do it like this. We give a role, "act as an SQL trainer", and then a
context where we say "and help me practice SQL window functions". Then we
go and configure this practice by doing the following. We tell it to make
the practice interactive, so the AI provides a task and you give a
solution. What else is important is that it provides you with a simple data
set, and of course you can specify which data set you want: an industrial
data set, healthcare, or anything you want. Then we tell the AI to give SQL
tasks that gradually increase in difficulty, so we start with the basics
and work up to advanced tasks. And you can tell the AI to act as an SQL
server and show the results of your query. So you would like to get as a
result not only the correct solution or feedback; you want to see the
result of the query that you give. And then finally the AI should review
your queries, provide feedback and suggest improvements. Okay, so now let's
start practicing. I gave the prompt to ChatGPT, and now we have a simple
data set. It is very simple: we have the sales ID, employee, region, sales
date and amount. Then we have the first task. It says: write a query to
rank employees by their total sales. Here you have an example output, and
now it says "your turn". So ChatGPT is waiting for your answer. Okay. So
now I just prepared a query for it. Let's see what happens once I post it.
Oh no, I got some errors in the query. So let's see what we have. It says
there is an error in the aggregation: you should use the amount instead of
sales. And it says there is an unnecessary PARTITION BY in the rank, and so
on. So let's check the correct query. We have here the GROUP BY, and then
we have to do the window function without using PARTITION BY. So that was a
mistake, and the result of this query is going to be this one. And here I
have really nice feedback about the first task.
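The corrected approach (aggregate with GROUP BY, then rank without PARTITION BY) might look roughly like the sketch below. The exact query and data from the video are not in the transcript, so the table name, the column names and all values here are assumptions; the query is also split into a CTE and an outer query for readability, which may differ from the AI's version.

```python
import sqlite3

# Hypothetical data set with the columns mentioned in the task:
# sales ID, employee, region, sales date and amount.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sales_id INTEGER, employee TEXT, region TEXT, sale_date TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Anna", "North", "2024-01-01", 100),
        (2, "Anna", "North", "2024-01-02", 250),
        (3, "Ben", "South", "2024-01-01", 300),
        (4, "Cara", "South", "2024-01-03", 200),
        (5, "Cara", "North", "2024-01-04", 50),
    ],
)

query = """
WITH totals AS (
    -- Step 1: aggregate the amount (not a "sales" column) per employee
    SELECT employee, SUM(amount) AS total_sales
    FROM sales
    GROUP BY employee
)
-- Step 2: rank across all employees, so no PARTITION BY is needed
SELECT employee,
       total_sales,
       RANK() OVER (ORDER BY total_sales DESC) AS sales_rank
FROM totals
ORDER BY sales_rank
"""

ranking = conn.execute(query).fetchall()
for row in ranking:
    print(row)
```

Adding PARTITION BY employee here would put each employee in a window of their own, so everyone would get rank 1, which is why the feedback called it unnecessary.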
So now it asks me about the next task, and I'm going to say yes. Now we
have task number two, about the running total. We have a task and we have
the data, and now we have to write a query in order to solve the task. So
my friends, it is nice, right? It is interactive, and it is not only for
SQL: you can go and practice any programming language. Now moving on to
the last prompt: you can use AI in order to prepare for an SQL interview.
Let's say that you are invited to an interview and you would like to
prepare yourself for it. You can do a quick preparation together with the
AI. You can say the following: "Act as an interviewer and prepare me for an
SQL interview." Now you can go and configure the interview, where you can
say "ask common SQL interview questions" and "make it interactive", so it
provides a question and then waits for you to answer. Then you can say
"gradually progress to advanced topics", so from basics to advanced. And it
is very important that it evaluates your answer and gives you feedback. So
it is a really great way to prepare for interviews and I really recommend
doing it. And you can prepare yourself not only for an SQL interview, you
can prepare yourself for an SQL exam as well. Okay. So now let's prepare
for an SQL interview. Here we have the first question. ChatGPT asks: what
is the difference between WHERE and HAVING? So now it is waiting for an
answer. We can say: WHERE filters data before aggregation and HAVING
filters data after aggregation. So let's check the answer. Here it is
giving me an example of a very solid answer, but in general I have answered
correctly. It says the answer is correct. But the feedback says that maybe
the interviewer needs more details, not only one sentence about the
differences. So it is encouraging me to speak more and to give more details
in my answer, but still the answer is correct.
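That one-sentence answer can be backed up with a small runnable example. It is a sketch with an invented `orders` table, using `sqlite3` only to execute the SQL; the threshold values are arbitrary.

```python
import sqlite3

# Hypothetical orders: customer B has two small orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A", 50), ("A", 200), ("B", 60), ("B", 30), ("C", 500)],
)

# WHERE filters individual rows BEFORE they are aggregated:
# B's 30 order is dropped, but B still appears with a total of 60.
where_rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= 50
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

# HAVING filters whole groups AFTER aggregation:
# B's total of 90 is below 100, so the entire group is removed.
having_rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) >= 100
    ORDER BY customer
    """
).fetchall()

print(where_rows)
print(having_rows)
```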
So now let's go to the next question. What do we have here? "Can you
explain the differences between INNER JOIN and LEFT JOIN?" I hope you know
the answer, but as you can see it is very interactive and nice, and I think
those questions are really relevant. If I'm interviewing someone, I'm going
to ask these questions: what is the difference between WHERE and HAVING,
and as well the differences between the join types.
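For the join question, a short example makes the difference visible. This is only a sketch: the `customers` and `orders` tables and their rows are made up, and `sqlite3` is assumed as a convenient way to run the queries.

```python
import sqlite3

# Hypothetical data: Cara is a customer without any orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Anna"), (2, "Ben"), (3, "Cara")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 100), (11, 1, 50), (12, 2, 70)],
)

# INNER JOIN: only customers that have a matching order.
inner_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
    """
).fetchall()

# LEFT JOIN: all customers; missing orders show up as NULL (None in Python).
left_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
    """
).fetchall()

print(inner_rows)
print(left_rows)
```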
So this is amazing, right? I really recommend that if you have an
interview, you go and prepare yourself using ChatGPT, so you can practice
before going to the interview. All right. So with that you have learned how
I use AI in order to assist me while I'm coding using SQL.
And now, my friends, we come to the most important chapter of the whole
course. You have now learned a lot of things about SQL: a lot of advanced
techniques, a lot of functions, how to transform data, how to aggregate
data. But now what you have to do is take everything and apply it in SQL
projects. And those projects are not just easy projects. I brought projects
for you that are very similar to the real projects that I do in the
industry. So you will learn not only how to do a project in SQL, but as
well what the main steps are and how we implement projects in the real
world. Here I have for you three projects: data warehousing, data
exploration and advanced data analytics. We're going to start with the
first one, the data warehousing project. This one is going to be amazing.
So let's go and deep dive into that. All right, my friends. So now, if you
want to do data analytics projects
using SQL, we have three different types. The first type of project you can
do is data warehousing. It's all about how to organize, structure and
prepare your data for data analysis. It is the foundation of any data
analytics project. In the next step, you can do exploratory data analysis,
EDA. All you have to do is understand and uncover insights about your data
sets. In this kind of project, you can learn how to ask the right questions
and how to find the answers using just basic SQL skills. Now moving on to
the last stage, where you can do advanced analytics projects. Here you're
going to use advanced SQL techniques in order to answer business questions,
like finding trends over time, comparing performance, segmenting your data
into different sections and as well generating reports for your
stakeholders. So here you will be solving real business questions using
advanced SQL techniques. Now what we're going to do is start with the first
type of project, SQL data warehousing, where you will gain the following
skills. First, you will learn how to do ETL/ELT processing using SQL in
order to prepare the data. You will learn as well how to build a data
architecture, how to do data integration, where we're going to merge
multiple sources together, and as well how to do data loads and data
modeling. So if I got you interested, grab your coffee and let's jump into
the projects. All right, my friends. So now, before we deep dive into the
tools and
the cool stuff, we first have to have a good understanding of what exactly
a data warehouse is and why companies try to build such a data management
system. So now the question is: what is a data warehouse? I will just use
the definition from the father of the data warehouse, Bill Inmon: a data
warehouse is a subject-oriented, integrated, time-variant and nonvolatile
collection of data designed to support management's decision-making
process. Okay, I know that might be confusing. Subject-oriented means that
the warehouse always focuses on a business area like sales, customers,
finance and so on. Integrated, because it integrates multiple source
systems. Usually you build a warehouse not only for one source but for
multiple sources. Time-variant means you can keep historical data inside
the data warehouse. Nonvolatile means that once the data enters the data
warehouse, it is not deleted or modified. So this is how Bill Inmon defined
a data warehouse.
Okay. So now I'm going to show you the scenario where your company doesn't
have real data management. Let's say that you have one system, and you have
one data analyst who has to go to this system and start collecting and
extracting the data. Then he is going to spend days and sometimes weeks
transforming the raw data into something meaningful. Once the report is
ready, he is going to share it, and this data analyst is sharing the report
using Excel. Then you have another source of data, and you have another
data analyst who is doing maybe the same steps: collecting the data,
spending a lot of time transforming the data and then sharing a report at
the end. This time she is sharing the data using PowerPoint. And then a
third system and the same story, but this time he is sharing the data using
maybe Power BI. So
now if the company works like this, then there are a lot of issues. First,
this process takes way too long. I saw a lot of scenarios where it
sometimes takes weeks and even months until the employees have manually
generated those reports. And of course, what happens for the users? They
are consuming multiple reports with multiple states of the data. One report
is 40 days old, another one 10 days and a third one is 5 days. So it's
going to be really hard to make a real decision based on this structure. A
manual process is always slow and stressful, and the more employees you
involve in the process, the more you open the door for human errors. And
errors in reports of course lead to bad decisions. Another issue, of
course, is handling big data. If one of your sources is generating a
massive amount of data, then the data analyst is going to struggle
collecting the data, and maybe in some scenarios it will not be possible
anymore to get the data. So the whole process can break, and you cannot
generate fresh data for specific reports anymore. And one last very big
issue with that: if one of your stakeholders asks for an integrated report
from multiple sources, well, good luck with that, because merging all that
data manually is very chaotic, time-consuming and full of risk. So this is
just a picture of a company working without proper data management, without
a data lake, data warehouse or data lakehouse. So in order to make real and
good decisions, you need data management.
So now let's talk about the scenario of a data warehouse. The first thing
that's going to happen is that you will not have your data team collecting
the data manually. You're going to have a very important component called
ETL. ETL stands for extract, transform and load. It is a process that you
run in order to extract the data from the sources, then apply multiple
transformations to that data, and at the end load the data into the data
warehouse. This one is going to be the single point of truth for analysis
and reporting, and it is called the data warehouse.
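The three ETL steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline built later in the course: the two "source systems" are hard-coded Python lists, every name (`crm_rows`, `erp_rows`, `fact_sales`, the functions) is hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical extracts from two source systems (assumed data, for illustration).
crm_rows = [{"customer_id": 1, "name": " anna "}, {"customer_id": 2, "name": "BEN"}]
erp_rows = [
    {"cust": 1, "sales": "100.50"},
    {"cust": 2, "sales": "70"},
    {"cust": 1, "sales": "20"},
]


def extract():
    """Extract: collect the raw rows from each source as they are."""
    return crm_rows, erp_rows


def transform(crm, erp):
    """Transform: clean the names, convert types and integrate both sources."""
    names = {row["customer_id"]: row["name"].strip().title() for row in crm}
    return [(row["cust"], names[row["cust"]], float(row["sales"])) for row in erp]


def load(conn, rows):
    """Load: write the prepared rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(customer_id INTEGER, customer_name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(warehouse, transform(*extract()))

# A report can now read one integrated table instead of two separate sources.
report = warehouse.execute(
    "SELECT customer_name, SUM(amount) FROM fact_sales "
    "GROUP BY customer_name ORDER BY customer_name"
).fetchall()
print(report)
```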
So now what happens is that
all your reports are going to be consuming
this single point of truth. So with that
you create your multiple reports, and as
well you can create integrated reports
from multiple sources, not only from one
single source. So now by looking at the
right side, it already looks organized,
right? And the whole process is
completely automated. There are no more
manual steps, which of course reduces
human error, and as well it is pretty
fast. So usually you can load the data
from the sources to the reports in a
matter of hours or sometimes in minutes.
So there is no need to wait like weeks
and months in order to refresh anything.
And of course the big advantage is that
the data warehouse itself is
completely integrated. So that means it
goes and brings all those sources
together in one place, which makes it
much easier for reporting. And not only
integrated: you can build history in the data
warehouse as well. So we have
now the possibility to access historical
data, and what is also amazing is that
all those reports have the same data
status. So all those reports can have
the same status, maybe sometimes one day
old or something. And of course if you
have a modern data warehouse on a cloud
platform, you can really easily handle
any big data sources. So no need to
panic if one of your sources is
delivering a massive amount of data.
And of course in order to build the data
warehouse you need different types of
developers. So usually the one that
builds the ETL component and the data
warehouse is the data engineer. So they
are the ones accessing the
sources, scripting the ETLs and building
the database for the data warehouse. And
now for the other part, the one that is
responsible for that is the data
analyst. They are the ones
consuming the data warehouse, building
different data models and reports and
sharing them with the stakeholders. So
they are usually contacting the
stakeholders, understanding the
requirements and building multiple
reports based on the data warehouse. So
now if you have a look at those two
scenarios, this is exactly why we need
data management. Your data team is not
wasting time and fighting with the data.
They are now more organized and more
focused, and with something like a data warehouse
you are delivering professional and
fresh reports that your company can
count on in order to make good and fast
decisions. So this is why you need
data management like a data warehouse.
Think about a data warehouse as a busy
restaurant. Every day different
suppliers bring in fresh ingredients,
vegetables, spices, meat, you name it.
They don't just use it immediately and
throw everything in one pot, right? They
clean it, chop it, and organize
everything and store each ingredient in
the right place, fridge or freezer. So,
this is the preparing phase. And when
the order comes in, they quickly grab
the prepared ingredients and create a
perfect dish and then serve it to the
customers of the restaurant. And this
process is exactly like the data
warehouse process. It is like the
kitchen where the raw ingredients, your
data, are cleaned, sorted and stored. And
when you need a report or analysis, it
is ready to serve up exactly like what
you need.
Okay. So now we're going to zoom
in and focus on the component ETL. If
you are building such a project, you're
going to spend almost 90% of your time just building
this component, the ETL. So it is the
core element of the data warehouse, and I
want you to have a clear understanding of
what exactly an ETL is. So our data
exists in a source system. And now what
we want to do is to get our data from
the source and move it to the target.
Source and target could be like database
tables. So now the first step that we
have to do is to specify which data we
have to load from the source. Of course
we can say that we want to load
everything, but let's say that we are
doing incremental loads. So we're going
to go and specify a subset of the data
from the source in order to prepare it
and load it later to the target. So this
step in the ETL process we call
extract. We are just identifying the
data that we need. We pull it out and we
don't change anything. It's going to be
like one to one like the source system.
So the extract has only one task: to
identify the data that we have to pull
out from the source and to not change
anything. So we will not manipulate the
data at all. It can stay as it is. So
this is the first step in the ETL
process, the extract.
Now moving on to
stage number two. We're going to
take this extracted data and we will do
some manipulations, transformations, and
we're going to change the shape of that
data. And this process is really heavy
work. We can do a lot of stuff like
data cleansing, data integration and a
lot of formatting and data
normalization. So a lot of stuff we can
do in this step. So this is the second
step in the ETL process, the
transformation. We're going to take the
original data and reshape it, transform
it into exactly the format that we need,
into a new format and shape that we
need for analysis and reporting.
Now, finally, we get to the last step in the
ETL process. We have the load. So, in
this step, we're going to take this new
data and we're going to insert it into
the target. So, it is very simple. We're
going to take this prepared data from
the transformation step and we're going
to move it into its final destination,
the target, like for example a data
warehouse. So that's ETL in a nutshell.
First extract the raw data, then
transform it into something meaningful
and finally load it to a target where
it's going to make a difference. So
that's it. This is what we mean with the
ETL process.
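The three steps above can be sketched in a few lines. This is only a minimal illustration in Python, using the built-in sqlite3 module as a stand-in for a real source and target. The table names customers and dim_customers, the columns, and the country mapping are made up for the example and are not taken from the course project.

```python
import sqlite3

def extract(source):
    # Extract: pull the rows out one to one, without changing anything.
    return source.execute("SELECT id, name, country FROM customers").fetchall()

def transform(rows):
    # Transform: clean and reshape the raw rows for reporting.
    country_names = {"DE": "Germany", "US": "United States"}  # hypothetical mapping
    cleaned = []
    for cust_id, name, country in rows:
        cleaned.append((cust_id, name.strip().title(), country_names.get(country, "n/a")))
    return cleaned

def load(target, rows):
    # Load: insert the prepared rows into the target table.
    target.executemany("INSERT INTO dim_customers VALUES (?, ?, ?)", rows)
    target.commit()

# Hypothetical source system with raw, untidy data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "  maria ", "DE"), (2, "JOHN", "US")],
)

# Hypothetical target, playing the role of the data warehouse.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dim_customers (id INTEGER, name TEXT, country TEXT)")

load(target, transform(extract(source)))
result = target.execute("SELECT * FROM dim_customers ORDER BY id").fetchall()
print(result)
```

Keeping each step in its own function mirrors the idea that the extract changes nothing, and all reshaping happens in the transform.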
Now in real projects, we
don't have only a source and a target.
Our data architecture is going to have
multiple layers, depending on your design,
whether you are building a data warehouse,
a data lake or a data lakehouse. And
usually there are different ways
to load the data between all those
layers. And in order now to load the
data from one layer to another one, there
are multiple ways to use the
ETL process. So usually if you are
loading the data from the source to
layer number one, you only extract the
data from the source and load it
directly to layer number one without
doing any transformations, because I want
to see the data as it is in the first
layer. And now between layer number
one and layer number two you might
go and use the full ETL. So we're going
to extract from layer one, transform
it and then load it to layer number
two. So with that we are using the whole
process, the ETL. And now between layer
two and layer three we can do only
transformation and then load. So we
don't have to deal with how to extract
the data, because it is maybe using the
same technology and we are taking all
data from layer two to layer three. So we
transform the whole layer two and then
load it to layer three. And now between
three and four you can use only the L and the T.
So maybe it's something like duplicating
and replicating the data, and then you
are doing the transformation. So you
load to the new layer and then transform
it. Of course, this is not a real
scenario. I'm just showing you that in
order to move from a source to a target,
you don't always have to use a complete
ETL. Depending on the design of your data
architecture, you might use only a few
components from the ETL. Okay. So this
is how ETL looks in real projects.
Okay. So now I would like to show you an
overview of the different techniques and
methods in ETL. We have a wide range
of possibilities where you have to make
decisions on which one you want to apply
to your projects. So let's start first
with the extraction. The first thing
that I want to show you is we have
different methods of extraction. Either
you are going to the source system and
pulling the data from the source, or the
source system is pushing the data to the
data warehouse. So those are the two
main methods on how to extract data. And
then we have in the extraction two
types. We have a full extraction,
everything, all the records from the tables,
and every day we load all the data to
the data warehouse. Or we make a
smarter one where we say we're going to
do an incremental extraction, where every
day we're going to identify only the new and
changed data. So we don't have to load
the whole thing, only the new data. We go
extract it and then load it to the data
warehouse. And in data extraction we
have different techniques. The first one
is manual, where someone has to
access a source system and extract the
data manually. Or we connect ourselves to
a database and we then have a query in
order to extract the data. Or we have a
file that we have to parse into the
data warehouse. Or another technique is
to connect ourselves to an API and do
calls in order to extract the data. Or if
the data is available in streaming, like
in Kafka, we can do event-based streaming
in order to extract the data. Another
way is to use change data capture.
CDC is as well something very similar to
streaming. Or another way is by using web
scraping, where you have code that is
going to run and extract all the
information from the web. So those are
the different techniques and types that
we have in the extraction.
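The difference between a full and an incremental extraction can be sketched as two queries against the source. This is a minimal sketch in Python with the built-in sqlite3 module. It assumes the source table has an updated_at column that can serve as a marker for new and changed rows. The table orders and all values are hypothetical.

```python
import sqlite3

# Hypothetical source system with a change timestamp on every row.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01"),
        (2, 25.0, "2024-01-02"),
        (3, 40.0, "2024-01-03"),
    ],
)

def full_extract(conn):
    # Full extraction: every record, on every run.
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()

def incremental_extract(conn, last_loaded):
    # Incremental extraction: only rows changed after the last load.
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_loaded,),
    ).fetchall()

all_rows = full_extract(source)
new_rows = incremental_extract(source, "2024-01-01")

# Remember the newest timestamp so the next run can start from there.
watermark = max(row[2] for row in new_rows)
print(len(all_rows), len(new_rows), watermark)
```

Both functions are pull extractions with a database query. The ISO date strings compare correctly as text, which is why a plain greater-than filter is enough here.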
Now if we
are talking about the transformation, there
is a wide range of different
transformations that we can do on our
data, like for example doing data
enrichment, where we add values to our
data sets. Or we do a data integration,
where we have multiple sources and we
bring everything into one data model. Or we
derive new columns based on already
existing ones. Another type of data
transformation we have is data
normalization. So the sources have values
that are like a code, and you go and map
them to more friendly values for the
analysts, which are easier to
understand and to use. Another
transformation we have is the business
rules and logic. Depending on the business,
you can define different criteria in
order to build new columns. And
what also belongs to transformations is
data aggregation. So here we aggregate
the data to a different granularity. And
then we have a type of transformation
called data cleansing. There are many
different ways to clean our data.
For example, removing the duplicates,
doing data filtering, handling the
missing data, handling invalid values,
removing unwanted spaces, casting the
data types, detecting the outliers
and many more. So we have different
types of data cleansing that we can do
in our data warehouse, and this is a very
important transformation. So as you can
see we have different types of
transformations that we can do in our
data warehouse.
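A few of these transformation types can be shown on a tiny data set. This is a minimal sketch in plain Python. The rows, the gender codes, the fallback value n/a, and the rule for is_big_customer are all assumptions chosen for illustration.

```python
# Hypothetical raw rows as they might arrive from a source file.
raw_rows = [
    {"id": "1", "name": "  Maria ", "gender": "F", "sales": "100"},
    {"id": "1", "name": "  Maria ", "gender": "F", "sales": "100"},  # duplicate
    {"id": "2", "name": "John", "gender": "M", "sales": ""},         # missing value
    {"id": "3", "name": "Anna ", "gender": "X", "sales": "-5"},      # invalid values
]

# Normalization: map source codes to friendlier values.
GENDER_NAMES = {"F": "Female", "M": "Male"}

def clean(rows):
    seen = set()
    cleaned = []
    for row in rows:
        cust_id = int(row["id"])          # cast the data type
        if cust_id in seen:               # remove duplicates
            continue
        seen.add(cust_id)
        sales = int(row["sales"]) if row["sales"] else 0  # handle missing data
        if sales < 0:                     # handle invalid values
            sales = 0
        cleaned.append({
            "id": cust_id,
            "name": row["name"].strip(),  # remove unwanted spaces
            "gender": GENDER_NAMES.get(row["gender"], "n/a"),
            "sales": sales,
            "is_big_customer": sales >= 100,  # derived column from a business rule
        })
    return cleaned

cleaned = clean(raw_rows)
total_sales = sum(row["sales"] for row in cleaned)  # aggregation
print(cleaned)
print(total_sales)
```

In a SQL-based warehouse the same steps would typically be written as queries, but the logic per step stays the same.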
Now moving on to the
load. So what do we have over here? We
have different processing types. So
either we are doing batch processing or
stream processing. Batch processing
means we are loading the data warehouse
in one big batch of data that's going to
run and load the data warehouse. So it
is one single job in order to refresh
the content of the data warehouse and as
well the reports. So that means we are
scheduling the data warehouse in order
to load it once or twice a day. And
the other type we have is stream
processing. So this means if there is
a change in the source system,
we're going to process this change as
soon as possible. So we're going to
process it through all the layers of the
data warehouse once something changes
in the source system. So we are
streaming the data in order to have a real-
time data warehouse, which is a very
challenging thing to do in data
warehousing.
And if we are talking
about the loads, we have two methods:
either we are doing a full load or an
incremental load. It's the same thing as
extraction, right? So for the full load
in databases there are different
methods on how to do it. For example,
we truncate and then insert. That means
we make the table completely empty and
then we insert everything from
scratch. Or another one, you are doing an
update and insert, we call it upsert. So we
can go and update all the records and
then insert the new ones. And another way
is to drop, create and insert. So that
means we drop the whole table and then
we create it from scratch and then we
insert the data. It is very similar to
the truncate, but here we are as well
removing and dropping the whole table.
So those are the different methods of
full loads.
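The truncate and insert method can be sketched like this. The example uses Python's built-in sqlite3 module as a stand-in database. SQLite has no TRUNCATE TABLE statement, so a DELETE without a WHERE clause plays that role here. The table customers and its rows are hypothetical.

```python
import sqlite3

# Hypothetical warehouse table that still holds yesterday's load.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
dwh.execute("INSERT INTO customers VALUES (1, 'Old Maria'), (9, 'Removed in source')")

def full_load(conn, rows):
    # Truncate: make the table completely empty.
    # (In SQL Server this step would be TRUNCATE TABLE.)
    conn.execute("DELETE FROM customers")
    # Insert: load everything from scratch.
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    conn.commit()

full_load(dwh, [(1, "Maria"), (2, "John")])
loaded = dwh.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(loaded)
```

After the load, the table contains exactly what the source delivered, so rows that disappeared from the source disappear from the target as well.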
For the incremental load we can
use the upsert as well. So update and
insert. So we're going to do update
or insert statements on our tables. Or
if the source is something like a log,
we can do only insert. So we can go and
always append the data to the table
without having to update anything.
Another way to do an incremental load is to
do a merge. And here it is very similar
to the upsert, but as well with a delete.
So update, insert, delete. So those are
the different methods on how to load the
data to your tables.
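Upsert and merge can be sketched as small functions on top of plain UPDATE, INSERT and DELETE statements. This is a minimal sketch with Python's built-in sqlite3 module, not the MERGE statement of a specific database. It assumes the source delivers the changed rows and, for the merge, a list of deleted ids. All names and values are hypothetical.

```python
import sqlite3

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
dwh.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Maria"), (2, "John"), (3, "Anna")],
)

def upsert(conn, rows):
    for cust_id, name in rows:
        # Update the record if it already exists ...
        cur = conn.execute(
            "UPDATE customers SET name = ? WHERE id = ?", (name, cust_id)
        )
        # ... otherwise insert it as a new one.
        if cur.rowcount == 0:
            conn.execute("INSERT INTO customers VALUES (?, ?)", (cust_id, name))
    conn.commit()

def merge(conn, changed_rows, deleted_ids):
    # Merge: the upsert plus a delete of rows that were removed in the source.
    upsert(conn, changed_rows)
    conn.executemany(
        "DELETE FROM customers WHERE id = ?", [(i,) for i in deleted_ids]
    )
    conn.commit()

def snapshot(conn):
    return conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()

upsert(dwh, [(2, "Johnny"), (4, "Omar")])
after_upsert = snapshot(dwh)

merge(dwh, [(1, "Maria M.")], deleted_ids=[3])
after_merge = snapshot(dwh)
print(after_upsert)
print(after_merge)
```

For a log-like source, the append-only variant is just the INSERT part without the UPDATE.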
And one more thing
in data warehousing, we have something
called slowly changing dimensions (SCD). So
here it's all about the historization
of your table. And there are many
different ways to handle the
historization in your table. The first
type is SCD 0. We say there is
no historization and nothing should be
changed at all. So that means you are
not going to update anything. The second
one, which is more famous, is SCD
1. You are doing an overwrite. So that
means you are updating the records with
the new information from the source
system by overwriting the old value. So
we are doing something like the upsert,
so update and insert, but you are losing
history, of course. Another one we have is
SCD 2, and here you want to add
historization to your table. So what
we do is, for each change that we get from the
source system, we are
inserting new records and we are not
going to overwrite or delete the old
data. We are just going to make it
inactive, and the new record is going to be
the active one. So there are different
methods on how to do historization as
well while you are loading the data to
the data warehouse.
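The difference between SCD 1 and SCD 2 is easiest to see side by side. This is a minimal sketch with Python's built-in sqlite3 module. The columns valid_from, valid_to and is_active are one common way to mark active and inactive records and are an assumption here, as are the table names and values.

```python
import sqlite3

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE customers_scd1 (id INTEGER PRIMARY KEY, city TEXT)")
dwh.execute(
    "CREATE TABLE customers_scd2 ("
    "id INTEGER, city TEXT, valid_from TEXT, valid_to TEXT, is_active INTEGER)"
)
dwh.execute("INSERT INTO customers_scd1 VALUES (1, 'Berlin')")
dwh.execute("INSERT INTO customers_scd2 VALUES (1, 'Berlin', '2024-01-01', NULL, 1)")

def apply_scd1(conn, cust_id, city):
    # SCD 1: overwrite the old value, so the history is lost.
    conn.execute("UPDATE customers_scd1 SET city = ? WHERE id = ?", (city, cust_id))

def apply_scd2(conn, cust_id, city, change_date):
    # SCD 2: keep the old record but mark it as inactive ...
    conn.execute(
        "UPDATE customers_scd2 SET valid_to = ?, is_active = 0 "
        "WHERE id = ? AND is_active = 1",
        (change_date, cust_id),
    )
    # ... and insert the change as a new, active record.
    conn.execute(
        "INSERT INTO customers_scd2 VALUES (?, ?, ?, NULL, 1)",
        (cust_id, city, change_date),
    )

# The same change from the source system, applied in both styles.
apply_scd1(dwh, 1, "Munich")
apply_scd2(dwh, 1, "Munich", "2024-06-01")
dwh.commit()

scd1_rows = dwh.execute("SELECT id, city FROM customers_scd1").fetchall()
scd2_rows = dwh.execute(
    "SELECT city, valid_from, valid_to, is_active "
    "FROM customers_scd2 ORDER BY valid_from"
).fetchall()
print(scd1_rows)
print(scd2_rows)
```

The SCD 1 table ends up with one row and no trace of Berlin, while the SCD 2 table keeps both rows and only the newest one is active.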
All right. So those
are the different types and techniques
that you might encounter in data
management projects. So now
I'm going to show you quickly which of those
types we will be using in our project.
So now if we are talking about the
extraction, over here we will be doing a
pull extraction, and about the full or
incremental, it's going to be a full
extraction. And about the technique, we
are going to be parsing files into the
data warehouse. And now about the data
transformations. Well, here we will
cover everything. All those types of
transformations that I'm showing you now
are going to be part of the project,
because I believe in each data project
you will be facing those
transformations. Now if you have a look
at the load, our project is going to be
batch processing, and about the load
methods, we will be doing a full load
since we have a full extraction, and it's
going to be truncate and insert. And
now about the historization, we will be
doing SCD 1. So that means we
will be updating the content of the data
warehouse. So those are the different
techniques and types that we will be
using in our ETL process for this
project.
All right. So with that we have
now a clear understanding of what a data
warehouse is, and we are done with the
theory part. So now as the next step we're
going to start with the project. The
first thing that we have to do is to
prepare our environment to develop the
project. So let's start with
that.
All right. So now we go to the
link in the description and from there
link in the description and from there we're going to go to the downloads and
we're going to go to the downloads and you can find all the materials of all
you can find all the materials of all courses and projects. But the one that
courses and projects. But the one that we need now is the SQL data warehouse
we need now is the SQL data warehouse projects. So let's go to the link and
projects. So let's go to the link and here we have bunch of links that we need
here we have bunch of links that we need for the projects. But the most important
for the projects. But the most important one to get all data and files is this
one to get all data and files is this one download all project files. So let's
one download all project files. So let's go and do that. And after you do that
go and do that. And after you do that you're going to get a zip file where you
you're going to get a zip file where you have there a lot of stuff. So let's go
have there a lot of stuff. So let's go and extract it. And now inside it if you
and extract it. And now inside it if you go over here you will find the
go over here you will find the repository structure from git. And the
repository structure from git. And the most important one here is the data
most important one here is the data sets. So you have two sources the CRM
sets. So you have two sources the CRM and the ARP. And in each one of them
and the ARP. And in each one of them there are three CSV files. So those are
there are three CSV files. So those are the data set for the projects. For the
the data set for the projects. For the other stuffs don't worry about it. We
other stuffs don't worry about it. We will be explaining that during the
will be explaining that during the project. So go and get the data and put
project. So go and get the data and put it somewhere at your PC where you don't
it somewhere at your PC where you don't lose it. Okay. So now what else do we
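As a quick sanity check after extracting the zip file, you could list the CSV files per source folder. This is only a minimal sketch in Python: the function name `list_csv_files` is made up for illustration, and the sketch assumes the datasets folder contains one sub-folder per source system. The real folder and file names are whatever you find in the download.

```python
from pathlib import Path


def list_csv_files(datasets_root):
    """Map each source folder name to the sorted names of the CSV files inside it.

    Assumption: every direct sub-folder of `datasets_root` is one source system.
    """
    root = Path(datasets_root)
    files_by_source = {}
    for source_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Only files ending in .csv directly inside the source folder are counted.
        files_by_source[source_dir.name] = sorted(
            f.name for f in source_dir.glob("*.csv")
        )
    return files_by_source
```

Calling `list_csv_files(...)` with the path of your extracted datasets folder should show two sources with three CSV files each, if the download matches what is described above.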
Okay. So now what else do we have? We have here a link to the Git repository.
This is the link to the repository that I created during the project, so you can
go and access it. But don't worry about it: we're going to explain the whole
structure during the project, and you will be creating your own repository. We
also have the link to Notion, where we do the project management. There you will
find the main steps, the main phases of the SQL project that we will do, as well
as all the tasks that we will be doing together during the project.
And now we have links to the project tools. If you don't have it already, go
and download SQL Server Express. It's a server that runs locally on your PC, and
it is where your database is going to live. Another one that you have to download
is SQL Server Management Studio. It is just a client for interacting with the
database, and that is where we're going to run all our queries. Then there is a
link to GitHub and a link to draw.io. If you don't have it already, go and
download it. It is a free and amazing tool for drawing diagrams. Throughout the
project we will be drawing data models, the data architecture and the data
lineage, so we'll be doing a lot of stuff with this tool. So go and download it.
The last one is nice to have: a link to Notion, where you can create a free
account if you want to build the project plan and follow along with me by
creating the project steps and the project tasks. Okay, so those are all the
links for the project. Go and download all those tools and create the accounts,
and once you are ready we continue with the project.
All right. So now I hope that you have downloaded all the tools and created the
accounts. Now it's time to move to a very important step that almost everyone
skips while doing projects, and that is creating the project plan. For that we
will be using Notion. Notion is a free tool that helps you organize your ideas,
your plans and your resources all in one place. I use it very intensively for my
private projects, for example for creating this course, and I can tell you that
creating a project plan is the key to success. Building a data warehouse is
usually very complex, and according to Gartner reports, over 50% of data
warehouse projects fail. In my opinion, for any complex project the key to
success is a clear project plan. At this phase of the project we're going to
create a rough project plan, because at the moment we don't yet have a clear
understanding of the data architecture. So let's go.
Okay. So now let's create a new page and call it "Data Warehouse Project". The
first thing we have to do is create the main phases and stages of the project,
and for that we need a table. To do that, hit slash, type "database inline", and
call it something like "Data Warehouse Epics". We're going to hide it because I
don't like it. Then on the table we can rename it to, for example, "Project
Epics", something like that. And now we're going to list all the big tasks of the
project. An epic is usually a large task that needs a lot of effort to complete.
You can call them epics, stages or phases of the project, whatever you want. So
we're going to list our project steps. Let's start with the requirements
analysis, then designing the data architecture, and another one is the project
initialization. So those are the first three big tasks in the project.
And now what do we need? We need another table for the small chunks of the
tasks, the subtasks, and we're going to do the same thing. So we hit slash,
search for the inline table, and do the same as before. First we call it "Data
Warehouse Tasks", then we hide it, and over here we rename it to "Project
Tasks". Now we go to the plus icon over here and search for "relation", this one
over here with the arrow. Then we search for the name of the first table. We
called it "Data Warehouse Epics", so let's click it, and we're going to say that
it is a two-way relation as well. So let's go and add the relation. With that we
got a field in the new table called "Data Warehouse Epics", which comes from this
table, and here we also have "Data Warehouse Tasks", which comes from the table
below. So as you can see, we have linked them together. Now I'm going to move
this to the left side, and then we're going to select one of those epics, for
example "Design the data architecture". What we do now is break this epic down
into multiple tasks, for example "Choose data management approach". Then we have
another task, and for it we select the same epic. Maybe the next step is
"Brainstorm and design the layers". Then let's go to another epic, for example
the project initialization, and here we say, for example, "Create Git repo and
prepare the structure". We can make another one in the same epic: let's say
"Create the database and the schemas". So as you can see, I'm just defining the
subtasks of those epics.
So now we're going to add a checkbox in order to track whether we have done the
task or not. We go to the plus and search for "check". We need a checkbox, and
we're going to make it really small, like this. With that, each time we are done
with a task, we click on it to mark the task as done. Now, there is one more
thing that is not working nicely, and that is here: we're going to have a long
list of tasks, and it's really annoying. So we go to the plus over here and
search for "rollup". Let's go and select it. Now we have to select the relation,
which is going to be "Data Warehouse Tasks", and after that we go to the property
and set it to the checkbox. Now, as you can see, the first table shows how many
tasks are closed. But I don't want to show it like this. Instead we go to the
calculation, then to percent, and then to "Percent checked". With that we can see
the progress of our project, and now instead of the numbers we can have a really
nice bar. Great. We can also give it a name like "Progress". So that's it, and
we can hide "Data Warehouse Tasks". With that we have a really nice progress bar
for each epic, and if we close all the tasks of an epic, we can see that it has
reached 100%. So this is the main structure.
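Since this is a SQL course, it may help to see that the two Notion tables, the relation and the "Percent checked" rollup map onto a plain relational model. The following is only a sketch under stated assumptions: it uses SQLite through Python's built-in `sqlite3` module so that it runs anywhere (not SQL Server, which the project itself uses), the table and column names (`epics`, `tasks`, `is_done`) are made up for illustration, and the checked state of the sample tasks is invented.

```python
import sqlite3

# Two tables linked by a foreign key, like the Notion relation between
# "Data Warehouse Epics" and "Data Warehouse Tasks". Names are hypothetical.
SCHEMA_AND_DATA = """
CREATE TABLE epics (
    epic_id   INTEGER PRIMARY KEY,
    epic_name TEXT NOT NULL
);
CREATE TABLE tasks (
    task_id   INTEGER PRIMARY KEY,
    epic_id   INTEGER NOT NULL REFERENCES epics (epic_id),
    task_name TEXT NOT NULL,
    is_done   INTEGER NOT NULL DEFAULT 0   -- the checkbox: 0 = open, 1 = done
);
INSERT INTO epics VALUES
    (1, 'Requirements analysis'),
    (2, 'Design data architecture'),
    (3, 'Project initialization');
INSERT INTO tasks (epic_id, task_name, is_done) VALUES
    (2, 'Choose data management approach', 1),
    (2, 'Brainstorm and design the layers', 0),
    (3, 'Create Git repo and prepare the structure', 0),
    (3, 'Create the database and the schemas', 0);
"""

# The rollup: per epic, the share of related tasks whose checkbox is ticked.
# LEFT JOIN keeps epics that have no tasks yet; they report 0.0 percent.
PROGRESS_QUERY = """
SELECT e.epic_name,
       COUNT(t.task_id) AS total_tasks,
       CASE WHEN COUNT(t.task_id) = 0 THEN 0.0
            ELSE ROUND(100.0 * SUM(t.is_done) / COUNT(t.task_id), 1)
       END AS percent_checked
FROM epics AS e
LEFT JOIN tasks AS t ON t.epic_id = e.epic_id
GROUP BY e.epic_id, e.epic_name
ORDER BY e.epic_id;
"""


def epic_progress():
    """Build the sample plan in an in-memory database and return the rollup rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA_AND_DATA)
        return conn.execute(PROGRESS_QUERY).fetchall()
    finally:
        conn.close()
```

With the sample data above, the epic with one of two tasks ticked comes out at 50.0 percent, which is the same number Notion would draw as a half-filled progress bar.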
Now we can add some cosmetics and rename stuff to make things look nicer. For
example, if I go to the tasks over here, I can call it "Tasks" and also change
the icon to something like this. If you'd like to have an icon for all those
epics, go to the epic, for example "Design data architecture", and if you hover
over the title you can see "Add icon". You can pick any icon that you want, for
example this one. As you can see, we have defined it here at the top, and the
icon also shows up in the table below. Okay. One more thing that we can do for
the project tasks is group them by epic. If you go to the three dots, then to
"Group", we can group by the epics. As you can see, we now have a section for
each epic, and you can sort the epics if you want: go over here to "Sort", then
"Manual", and start sorting the epics as you like. With that you can expand and
collapse each section if you don't always want to see all the tasks at once. So
this is a really nice way to build something like project management for your
projects. Of course, in companies we use professional tools for this, for
example Jira. But for the private, personal projects that I do, I always do it
like this, and I really recommend that you do it not only for this project but
for any project that you are doing. If you see the whole project at once, you
can see the big picture, and closing tasks like this, these small things, makes
you really satisfied, keeps you motivated to finish the whole project, and makes
you proud.
Okay friends, so now I just went and added a few icons, renamed stuff and added
more tasks for each epic, and this is going to be our starting point in the
project. Once we have more information, we're going to add more details on how
exactly we're going to build the data warehouse. At the start we're going to
analyze and understand the requirements, and only after that do we start
designing the data architecture. Here we have three tasks. First we have to
choose the data management approach, after that we brainstorm and design the
layers of the data warehouse, and at the end we draw the data architecture. With
that we have a clear understanding of what the data architecture looks like.
After that we go to the next epic, where we start preparing our project. Once we
have a clear understanding of the data architecture, the first task here is to
create detailed project tasks, so we're going to add more epics and more tasks.
Once we are done, we create the naming conventions for the project, just to make
sure that we have rules and standards across the whole project. Next we create a
repository in Git and prepare the structure of the repository, so that we always
commit our work there. Then we start with the first script, where we create the
database and the schemas. So my friends, this is the initial plan for the
project.
Now let's start with the first epic. We have the requirements analysis.
Now, when analyzing the requirements, it is very important to understand which
type of data warehouse you're going to build, because there is not just one
standard way to build it. If you go and implement the data warehouse blindly, you
might do a lot of stuff that is totally unnecessary, and you will burn a lot of
time. That's why you have to sit with the stakeholders, with the department, and
understand what exactly we have to build. Depending on the requirements, you
design the shape of the data warehouse.
So now let's go and analyze the requirements of this project. The whole project
is split into two main sections. In the first section we have to build a data
warehouse. This is a data engineering task, and we will develop the ETLs and the
data warehouse. Once we have done that, we have to build the analytics and
reporting, the business intelligence, so we're going to do data analysis. But
first we will focus on the first part, building the data warehouse. So what do
we have here? The statement is very simple. It says: develop a modern data
warehouse using SQL Server to consolidate sales data, enabling analytical
reporting and informed decision-making.
So this is the main statement, and then we have the specifications. The first
one is about the data sources. It says: import data from two source systems, ERP
and CRM, which are provided as CSV files. The second one is about data quality.
We have to clean and fix data quality issues before we do the data analysis,
because let's be real, no raw data is perfect. It is always messy, and we have to
clean it up. The next one is about integration. It says we have to combine both
sources into one single, user-friendly data model that is designed for analytics
and reporting. That means we have to merge those two sources into one single
data model. Then we have another specification. It says: focus on the latest
dataset only, so there is no need for historization. That means we don't have to
build history in the database. The final requirement is about documentation. It
says: provide clear documentation of the data model, meaning the final product
of the data warehouse, to support the business users and the analytical teams.
That means we have to produce a manual that helps the users and makes life
easier for the consumers of our data. So as you can see, these may be very
generic requirements, but they already contain a lot of information for you.
They say that we have to use SQL Server as the platform, that we have two source
systems delivering CSV files, and it sounds like we really have bad data quality
in the sources. They also want us to focus on building a completely new data
model that is designed for reporting, they say we don't have to do historization,
and we are expected to produce documentation of the system. So these are the
requirements for the data engineering part, where we're going to build a data
warehouse that fulfills these requirements.
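To make the data quality, integration and "no historization" requirements a bit more concrete, here is a small sketch of what they could look like in SQL. Everything in it is an assumption for illustration: the tables `crm_customers` and `erp_customers`, their columns and the sample rows are invented and are not the project's real data, and the sketch runs on SQLite through Python's built-in `sqlite3` module rather than on SQL Server.

```python
import sqlite3

# Hypothetical raw tables, one per source system, with typical quality issues:
# stray spaces, a duplicated row and a missing value.
RAW_DATA = """
CREATE TABLE crm_customers (customer_key TEXT, first_name TEXT);
CREATE TABLE erp_customers (customer_key TEXT, country TEXT);
INSERT INTO crm_customers VALUES ('C1', '  Anna '), ('C2', 'Ben'), ('C2', 'Ben');
INSERT INTO erp_customers VALUES ('C1', 'Germany'), ('C2', NULL);
"""

# One integrated, cleaned model on top of both sources. It holds only the
# current state per customer, so no history is kept.
CUSTOMER_MODEL = """
CREATE VIEW customers AS
SELECT DISTINCT                                   -- removes exact duplicates
       c.customer_key,
       TRIM(c.first_name)         AS first_name,  -- strips unwanted spaces
       COALESCE(e.country, 'n/a') AS country      -- replaces missing values
FROM crm_customers AS c
LEFT JOIN erp_customers AS e
       ON e.customer_key = c.customer_key;        -- merges the two sources
"""


def build_customer_model():
    """Load the sample rows, create the integrated view and return its rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(RAW_DATA)
        conn.executescript(CUSTOMER_MODEL)
        return conn.execute(
            "SELECT customer_key, first_name, country "
            "FROM customers ORDER BY customer_key"
        ).fetchall()
    finally:
        conn.close()
```

The point of the sketch is only the shape of the work: clean each source, then join the sources into one model that users can query without knowing where each column came from.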
All right. So with that we have analyzed the requirements, and we have also
closed the first and easiest epic. We are done with this, so let's go and close
it. Now let's open another one. Here we have to design the data architecture,
and the first task is to choose the data management approach. So let's go.
Now, designing the data architecture is exactly like building a house. Before
construction starts, an architect designs a plan, a blueprint for the house: how
the rooms will be connected, and how to make the house functional, safe and
wonderful. Without this blueprint from the architect, the builders might create
something unstable, inefficient or maybe unlivable. The same goes for data
projects. A data architect is like a house architect: they design how your data
will flow, integrate and be accessed. So as data architects we make sure that
the data warehouse is not only functioning but also scalable and easy to
maintain. And this is exactly what we will do now. We will play the role of the
data architect and start brainstorming and designing the architecture of the
data warehouse. So now I'm going to show you a sketch to explain the different
approaches to designing a data architecture. This phase of a project is usually
very exciting for me, because this is my main role in data projects. I am a data
architect, and I discuss a lot of different projects where we try to find the
best design for the project.
let's go. Now the first step of building a data architecture is to make a very
important decision: choosing between four major types. The first approach is to
build a data warehouse. It is very suitable if you have only structured data and
your business wants to build solid foundations for reporting and business
intelligence. Another approach is to build a data lake. This one is way more
flexible than a data warehouse: you can store not only structured data but also
semi-structured and unstructured data. We usually use this approach if you have
mixed types of data, like database tables, logs, images, and videos, and your
business wants to focus not only on reporting but also on advanced analytics or
machine learning. But it's not as organized as a data warehouse, and a data lake
that gets too unorganized turns into a data swamp. This is where we need the
next approach. So the next one: we can go and build a data lakehouse. It is like
a mix between a data warehouse and a data lake. You get the flexibility of
having different types of data from the data lake, but you still structure and
organize your data like we do in the data warehouse. So you mix those two worlds
into one. This is a very modern way to build a data architecture, and it is
currently my favorite way of building a data management system. Now the last and
very recent approach is to build a data mesh. This is a little bit different.
Instead of having a centralized data management system, the idea in the data
mesh is to make it decentralized. You cannot have one centralized data
management system, because centralized always means a bottleneck. So instead you
have multiple departments and multiple domains, where each one of them builds a
data product and shares it with the others. So now you have to go and pick one
of those approaches, and in this project we will be focusing on the data
warehouse. So now the question is how to build the data warehouse. Well, there
are again four different approaches to building it. The first one is the Inmon
approach. Again you have your sources, and in the first layer you start with the
staging, where the raw data is landing. Then in the next layer you organize your
data in something called the enterprise data warehouse, where you model the data
using the third normal form. It's about how to structure and normalize your
tables. So you are building a new integrated data model from the multiple
sources. Then we go to the third layer. It's called the data marts, where you
take a small subset of the data warehouse and design it in a way that is ready
to be consumed by reporting, and it focuses on only one topic, for example
customers, sales, or products. After that you connect your BI tool, like Power
BI or Tableau, to the data marts. So with that you have three layers to prepare
the data before reporting. Now moving on to the next one, we have the Kimball
approach. He says: you know what, building this enterprise data warehouse wastes
a lot of time. So what we can do is jump immediately from the stage layer to the
final data marts, because building this enterprise data warehouse is a big
struggle and usually wastes a lot of time. So he always wants you to focus on
building the data marts as quickly as possible. It is a faster approach than
Inmon, but over time you might get chaos in the data marts, because you are not
always focusing on the big picture and you might be repeating the same
transformations and integrations in different data marts.
So there is a trade-off between speed and a consistent data warehouse. Now
moving on to the third approach, we have the Data Vault. We still have the stage
and the data marts, and it says we still need this central data warehouse in the
middle, but to this middle layer we're going to bring more standards and rules.
So it tells you to split this middle layer into two layers: the raw vault and
the business vault. In the raw vault you have the original data, and in the
business vault you have all the business rules and transformations that prepare
the data for the data marts. So the Data Vault is very similar to Inmon, but it
brings more standards and rules to the middle layer. Now I'm going to add a
fourth one, which I'm going to call the medallion architecture, and this one is
my favorite because it is very easy to understand and to build. It says you're
going to build three layers: bronze, silver, and gold. The bronze layer is very
similar to the stage, but we have understood over time that the stage layer is
very important, because having the original data as it is helps a lot with
traceability and finding issues. Then the next layer is the silver layer. It is
where we do transformations and data cleansing, but we don't apply any business
rules yet. Now moving on to the last layer, the gold layer. It is also very
similar to the data marts, but there we can build different types of objects,
not only for reporting but also for machine learning, for AI, and for many other
purposes. They are business-ready objects that you want to share as data
products. So those are the four approaches that you can use to build a data
warehouse. Again, if you are building a data architecture, you have to specify
which approach you want to follow. At the start we said we want to build a data
warehouse, and then we have to decide between those four approaches on how to
build it, and in this project we will be using the medallion architecture. So
this is a very important question that you have to answer as the first step of
building a data architecture. All right. So with that we have decided on the
approach, and we can go and mark it as done. In the next step we're going to
design the layers of the data warehouse.
Now there is no 100% standard way and set of rules for each layer. What you
have to do as a data architect is define exactly what the purpose of each layer
is. We start with the bronze layer. We say it's going to store raw and
unprocessed data as it is from the sources. And why are we doing that? It is for
traceability and debugging. If you have a layer where you keep the raw data, it
is very important to have the data as it is from the sources, because we can
always go back to the bronze layer and investigate the data of a specific source
if something goes wrong.
So the main objective is to have raw, untouched data that helps you as a data
engineer analyze the root cause of issues. Now moving on to the silver layer.
It is the layer where we store clean and standardized data, and this is the
place where we do basic transformations in order to prepare the data for the
final layer. Now the gold layer is going to contain business-ready data. The
main goal here is to provide data that can be consumed by business users and
analysts in order to build reporting and analytics. So with that we have defined
the main goal for each layer. Next, what I would like to do is define the object
types, and since we are talking about a data warehouse in a database, we
generally have two types here: either a table or a view. For the bronze layer
and the silver layer we are going with tables, but for the gold layer we are
going with views. The best practice says: for the last layer in your data
warehouse, make it virtual using views. It gives you a lot of flexibility and of
course speed in building it, since we don't have to create a load process for
it. And now the next step is to define the load method. In this project I have
decided to go with the full load, using the method of truncating and inserting.
It is just faster and way easier. So for the bronze layer we're going with the
full load, and you have to specify it for the silver layer as well: there too
we're going with the full load. And of course for the views we don't need any
load process. So each time you decide to go with tables, you have to define the
load method: full load, incremental load, and so on.
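To make the object types and the load method concrete, here is a minimal
sketch, not the course's actual scripts. It uses Python's built-in sqlite3
module so that it runs on its own, and the table, view, and column names are
made up for illustration. SQLite has no TRUNCATE TABLE statement, so a DELETE
without a WHERE clause stands in for the truncate step.

```python
import sqlite3


def full_load(conn, table, rows):
    """Full load: empty the target table, then insert the complete batch."""
    with conn:  # one transaction: either both steps are committed or neither
        # SQLite has no TRUNCATE TABLE; DELETE without WHERE stands in for it.
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} (customer_id, country) VALUES (?, ?)", rows
        )


conn = sqlite3.connect(":memory:")

# Bronze and silver are physical tables (hypothetical names and columns).
conn.execute("CREATE TABLE bronze_customers (customer_id INTEGER, country TEXT)")
conn.execute("CREATE TABLE silver_customers (customer_id INTEGER, country TEXT)")

# Gold is a view: there is nothing to load, it is computed on every query.
conn.execute(
    """
    CREATE VIEW gold_customers_per_country AS
    SELECT country, COUNT(*) AS customer_count
    FROM silver_customers
    GROUP BY country
    """
)

batch = [(1, "Germany"), (2, "Germany"), (3, "France")]
for _ in range(2):  # running the load twice must not duplicate rows
    full_load(conn, "bronze_customers", batch)
    full_load(conn, "silver_customers", batch)

silver_count = conn.execute("SELECT COUNT(*) FROM silver_customers").fetchone()[0]
gold_rows = conn.execute(
    "SELECT country, customer_count FROM gold_customers_per_country ORDER BY country"
).fetchall()
print(silver_count, gold_rows)
```

Running the load twice still leaves three rows in each table, which is the
point of truncate and insert, and the gold view needs no load step at all
because it is evaluated against the silver table at query time.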
Now we come to the very interesting part: the data transformations. For the
bronze layer this is the easiest topic, because we don't have any
transformations. We have to commit ourselves to not touching the data: do not
manipulate it, don't change anything. It's going to stay as it is. If it comes
in bad, it's going to stay bad in the bronze layer. And now we come to the
silver layer, where we have the heavy lifting. As we committed in the objective,
we have to produce clean and standardized data, and for that we have different
types of transformations. We have to do data cleansing, data standardization,
data normalization, deriving new columns, and data enrichment. So there is a
bunch of transformations that we have to do in order to prepare the data. Our
focus here is to transform the data to make it clean and standards-compliant,
and to try to push all business transformations to the next layer. That means in
the gold layer we will be focusing on the business transformations that are
needed for the consumers, for the use cases. What we do here: we do data
integration between source systems, we do data aggregations, we apply a lot of
business logic and rules, and we build a data model that is ready for, for
example, business intelligence.
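The split between silver and gold transformations described above can be
sketched as follows. This is only an illustration under assumptions: it again
uses Python's built-in sqlite3 module, and the two source tables, their columns,
and the gender codes are hypothetical. Silver only cleans and standardizes
values; the gold view does the integration and aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Bronze: raw rows exactly as delivered (hypothetical tables and columns).
    CREATE TABLE bronze_crm_customers (customer_id INTEGER, name TEXT, gender TEXT);
    CREATE TABLE bronze_erp_sales (customer_id INTEGER, amount REAL);
    INSERT INTO bronze_crm_customers VALUES
        (1, '  Anna ', 'F'), (2, 'Ben', 'm'), (3, ' Cleo', NULL);
    INSERT INTO bronze_erp_sales VALUES (1, 100.0), (1, 50.0), (2, 20.0);

    -- Silver: cleansing and standardization only, no business rules.
    CREATE TABLE silver_crm_customers AS
    SELECT customer_id,
           TRIM(name) AS name,
           CASE UPPER(TRIM(gender))
                WHEN 'F' THEN 'Female'
                WHEN 'M' THEN 'Male'
                ELSE 'n/a'
           END AS gender
    FROM bronze_crm_customers;

    CREATE TABLE silver_erp_sales AS
    SELECT customer_id, amount FROM bronze_erp_sales;

    -- Gold: business transformations (integrate both sources, aggregate).
    CREATE VIEW gold_customer_sales AS
    SELECT c.customer_id, c.name, c.gender,
           COALESCE(SUM(s.amount), 0) AS total_sales
    FROM silver_crm_customers AS c
    LEFT JOIN silver_erp_sales AS s ON s.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, c.gender;
    """
)

silver = conn.execute(
    "SELECT name, gender FROM silver_crm_customers ORDER BY customer_id"
).fetchall()
gold = conn.execute(
    "SELECT name, total_sales FROM gold_customer_sales ORDER BY customer_id"
).fetchall()
print(silver)
print(gold)
```

Note that the bronze rows keep their stray spaces and mixed-case codes, so the
original values are still available if a cleaned value ever needs to be traced
back.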
So here we do a lot of business transformations, and in the silver layer we do
basic data transformations. It is really very important here to make fine
decisions about what types of transformations are done in each layer, and to
make sure that you commit to those rules. Now the next aspect is the data
modeling. In the bronze layer and the silver layer we will not break the data
model that comes from the source system. So if the source system delivers five
tables, we're going to have five tables here, and in the silver layer as well.
We will not denormalize or normalize or make something new. We're going to leave
it exactly as it comes from the source system, because we're going to build the
data model in the gold layer. And here you have to define which data model you
want to follow. Are you following the star schema, the snowflake schema, or are
you just making aggregated objects? So you have to make a list of all the data
model types that you're going to follow in the gold layer. And finally, what you
can specify for each layer is the target audience, and this is of course a very
important decision. In the bronze layer, you don't want to give access to any
end user. It is really important to make sure that only data engineers access
the bronze layer. It makes no sense for data analysts or data scientists to go
to the bad data, because you have a better version of it in the silver layer. In
the silver layer, of course, the data engineers have to have access, and so do
the data analysts, the data scientists, and so on. But still, you don't give it
to business users who can't deal with the raw data model from the sources,
because for the business users you're going to have a better layer, and that is
the gold layer. The gold layer is suitable for the data analysts and also the
business users, because business users usually don't have deep knowledge of the
technicalities of the silver layer. So if you are designing multiple layers, you
have to discuss all those topics and make a clear decision for each layer. All
right, my friends. So now, before we proceed with the design, I want to tell you
a secret principle that each data architect must know, and that is the
separation of concerns. So what is that?
As you are designing an architecture, you have to make sure to break down the
complex system into smaller, independent parts, where each part is responsible
for a specific task. And here comes the magic. The components of your
architecture must not be duplicated. You cannot have two parts doing the same
thing. So the idea here is to not mix everything. This is one of the biggest
mistakes in any big project, and I have seen it almost everywhere. A good data
architect follows this principle. For example, if you look at our data
architecture, we have already done that. We have defined a unique set of tasks
for each layer. For example, we have said that in the silver layer we do data
cleansing, but in the gold layer we do business transformations, and with that
you will not allow any business transformations in the silver layer. The same
thing goes for the gold layer: you don't do any data cleansing in the gold
layer. So each layer has its own unique tasks. And the same thing goes for the
bronze layer and the silver layer. You do not allow data to be loaded from the
source systems directly into the silver layer, because we have decided that the
landing layer, the first layer, is the bronze layer. Otherwise you will have one
set of source systems that are loaded first into the bronze layer and another
set that skips that layer and goes straight to silver, and with that we have an
overlap: you are doing data ingestion in two different layers. So my friends, if
you have this mindset, separation of concerns, I promise you, you're going to be
a top data architect.
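One way to keep yourself honest about separation of concerns is to write the
layer design down as data and check it. The sketch below is a hypothetical
helper in plain Python, not something from the project itself: the task names
are illustrative, and the two checks simply look for a task claimed by more than
one layer and for any layer other than bronze that ingests from the sources.

```python
import copy

# Hypothetical summary of the layer design; task wording is illustrative.
LAYERS = {
    "bronze": {"loads_from": "sources", "tasks": {"ingest raw data"}},
    "silver": {"loads_from": "bronze", "tasks": {"data cleansing", "standardization"}},
    "gold": {"loads_from": "silver", "tasks": {"data integration", "business rules"}},
}


def overlapping_tasks(layers):
    """Return the tasks claimed by more than one layer (ideally none)."""
    first_owner = {}
    overlaps = {}
    for name, spec in layers.items():
        for task in spec["tasks"]:
            if task in first_owner:
                overlaps.setdefault(task, [first_owner[task]]).append(name)
            else:
                first_owner[task] = name
    return overlaps


def ingestion_layers(layers):
    """Return the layers that read directly from the source systems."""
    return [name for name, spec in layers.items() if spec["loads_from"] == "sources"]


# The design as described: no shared tasks, a single landing layer.
print(overlapping_tasks(LAYERS), ingestion_layers(LAYERS))

# A design that breaks the principle in both ways mentioned above.
bad = copy.deepcopy(LAYERS)
bad["gold"]["tasks"].add("data cleansing")  # cleansing in two layers
bad["silver"]["loads_from"] = "sources"     # a source skipping bronze
print(overlapping_tasks(bad), ingestion_layers(bad))
```

For the second design the checks report data cleansing as owned by both silver
and gold, and two ingestion layers instead of one, which are exactly the two
overlaps described above.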
So think about it. All right, my friends. So with that, we have designed the
layers of the data warehouse. We can go ahead and close it. In the next step
we're going to go to Draw.io and start drawing the data architecture. There is
no one standard for how to build a data architecture. You can add your own style
and do it the way that you want. Now the first thing that we have to show in the
architecture is the different layers that we have. The first layer is the source
system layer. So let's take a box like this and make it a little bit bigger. And
I'm just going to style it: I'm going to remove the fill and make the line a
dotted one, and after that I'm going to change the color to something like this
gray. So now we have a container for the first layer. Then we have to add a text
on top of it. So what I'm going to do: I'm going to take another box and type
"Sources" inside it. And now I'm going to style it. I'm going to go to the text
and make it maybe 24, then remove the lines like this, make it a little bit
smaller, and put it on top. So this is the first layer. This is where the data
comes from. Then the data is going to go inside a data warehouse. So I'm just
going to duplicate this one. This one is the data warehouse. All right. So now
the third layer, what is it going to be? It's going to be the consumers who will
be consuming this data warehouse. So I'm going to put another box and say this
is the consume layer. Okay. So those are the three containers. Now inside the
data warehouse, we have decided to build it using the medallion architecture. So
we're going to have three layers inside the warehouse. I'm going to take another
box again and call this one the bronze layer. And now we have to give it a
design. I'm going to go with this color over here, and then the text, maybe
something like 20, and then make it a little bit smaller and just put it here.
And beneath that we're going to have the components. So this is just the title
of a container. I'm going to have it like this: remove the text from inside it
and remove the filling. So this container is for the bronze layer. Let's
duplicate it for the next one. This one is going to be the silver layer. And of
course, we can change the coloring to gray because it is silver.
And the lines as well, and remove the filling. Great. And now maybe I'm going to
make the font bold. All right. Now the third layer is going to be the gold
layer, and we have to pick a color for that. So, style, and here we have
something like yellow. The same thing for the container: I remove the filling.
So with that we are now showing the different layers inside our data warehouse.
Now those containers are empty. What we're going to do is go inside each one of
them and start adding content. Now in the sources, it is very important to make
clear what the different types of source systems are that you are connecting to
the data warehouse, because in a real project there are multiple types. You
might have a database, an API, files, Kafka, and it's important here to show
those different types. In our project we have folders, and inside those folders
we have CSV files. So what we have to do is make it clear in this layer that the
input for our project is CSV files. It really depends on how you want to show
that. I'm going to go over here and say maybe "folder", and then I'm going to
take the folder and put it here inside, and then maybe search for "file", more
results, and pick one of those icons. For example, I'm going to go with this one
over here. I'm going to make it smaller and add it on top of the folder. So with
that we make it clear for everyone seeing the architecture that the source is
not a database, is not an API, it is a file inside a folder. Now what is also
very important to show here is the source systems: what are the sources that are
involved in the project? So here what we're going to do is give it a name. For
example, we have one source called CRM, like this, and maybe make the icon, and
we have another source called ERP. So we're going to duplicate it, put it over
here, and then rename it ERP.
So now it is clear for everyone: we have two sources for this project, and the
technology used is simply files. We can also add some descriptions inside this
box to make it clearer. I'm going to take a line, because I want to separate
the description from the icons, something like this, and make it gray. Below
it, we're going to add some text, and we're going to say it is CSV files. As
the next point, we can say the interface is simply files in folders. Of course,
you can add any specifications and explanations about the sources. If it is a
database, you can state the type of the database, and so on.
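To make the "files in folders" interface concrete, here is a minimal Python
sketch that lists the CSV files per source system. The folder layout (one
sub-folder per source, such as crm and erp) and the function name
list_source_files are assumptions for illustration; the video only draws the
sources in the diagram.

```python
from pathlib import Path


def list_source_files(root: Path) -> dict[str, list[str]]:
    """Map each source-system folder under `root` to its CSV file names."""
    sources: dict[str, list[str]] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # Each sub-folder is treated as one source system (for example crm, erp).
        sources[folder.name] = sorted(f.name for f in folder.glob("*.csv"))
    return sources
```

Calling list_source_files(Path("datasets")) on such a layout would return one
entry per source folder, which is the same information the sources box in the
diagram is meant to convey.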
So with that we made it clear in the data architecture what the sources of our
data warehouse are. Now the next step is to design the content of the bronze,
silver, and gold layers. I'm going to start by adding an icon in each
container to show that we are talking about a database. So we're going to
search for "database" and then open more results. I'm going to go with this
icon over here. Let's make it bigger, something like this, and maybe change
its color. So we're going to have the bronze, and here as well the silver and
the gold. Now we're going to add some arrows between those layers. We can
search for "arrow" and pick one of those. Let's put it here. We can pick a
color for it, maybe something like this, and adjust it. So now we have this
nice arrow between all the layers, just to explain the direction of our
architecture, right? We can read it from left to right. And we add one as well
between the gold layer and the consumers.
Okay. What I'm going to do next is add one statement about each layer: its main
objective. So let's grab a text box and put it beneath the database icon. For
the bronze layer, for example, it's going to be the raw data. Maybe make the
text bigger. So: raw data. Then the next one, in the silver: cleaned,
standardized data. And the last one, for the gold, we can say business-ready
data. With that we make the objective clear for each layer. Now, below all
those icons, we're going to have a separator again, like this, and make it
colored. Beneath it we're going to add the most important specifications of
the layer. So let's add those separators in each layer. Okay, now we need a
text below it. Let's take this one here. So what is the object type of the
bronze layer? That's going to be tables. Then we can add the load method. We
say this is batch processing, since we are not doing streaming. We can say it
is a full load, since we are not doing an incremental load. So we can say
here: truncate and insert.
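A full load with truncate and insert can be sketched as follows. This is only
an illustration in Python with the built-in sqlite3 module, not the SQL Server
scripts used in the course. The table name crm_customer_info and the helper
full_load are hypothetical, and because SQLite has no TRUNCATE TABLE statement,
a DELETE without a WHERE clause stands in for it.

```python
import sqlite3


def full_load(conn: sqlite3.Connection, table: str, rows: list[tuple]) -> int:
    """Replace the whole content of `table` with `rows` and return the row count.

    `table` is interpolated into the SQL text, so it must be a trusted name.
    """
    with conn:  # one transaction: both steps succeed or neither does
        # Step 1, "truncate": empty the target table completely.
        conn.execute(f"DELETE FROM {table}")
        # Step 2, "insert": load the complete data set again.
        if rows:
            placeholders = ", ".join("?" for _ in rows[0])
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crm_customer_info (id INTEGER, name TEXT)")
full_load(conn, "crm_customer_info", [(1, "Ann"), (2, "Bob")])
# A second run replaces the previous content instead of appending to it.
count = full_load(conn, "crm_customer_info", [(3, "Cem")])
print(count)  # 1
```

The point of the pattern is that every run leaves the table with exactly the
rows of the latest extract, which is what separates a full load from an
incremental load.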
And then we add one more section, maybe about the transformations. Here we can
say: no transformations. And one more about the data model, where we're going
to say: none, as-is. Now I'm going to add those specifications for the silver
and gold as well. So here is what we have discussed: the object type, the load
process, the transformations, and whether we are changing the data model or
not. The same thing for the gold layer. With that we have a really nice
layering of the data warehouse, and what is left is the consumers. Over here
you can add the different use cases and tools that can access your data
warehouse. For example, I'm adding business intelligence and reporting, maybe
using Power BI or Tableau. Or you can say the data warehouse can be accessed
for ad hoc analyses using SQL queries, and this is what we're going to focus
on in the projects after we build the data warehouse. You can also offer it
for machine learning purposes. And of course, it's really nice to add some
icons to your architecture. I usually use this nice website called Flaticon.
It has really amazing icons that you can use in your architecture. Now, of
course, we can keep adding icons and other elements to explain the data
architecture and the system as well. For example, it is very important to say
which tools you are using to build this data warehouse. Is it in the cloud?
Are you using Azure, Databricks, or maybe Snowflake? For our project we're
going to add the icon of SQL Server, since we are building this data warehouse
completely in SQL Server. So for now I'm really happy with it. As you can see,
we now have a plan, right? All right guys, with that we have designed the data
architecture using Draw.io, and we have done the last step in this epic. We
now have a design for the data architecture, and we can say we have closed
this epic. Now let's go to the next one. We will start with the first step to
prepare our project, and the first task here is to create a detailed project
plan. All right, my friends. So now it's clear for us that we have three
layers and we have to build them. That means our big epics are going to follow
the layers. So here I have added three more epics: build bronze layer, build
silver layer, and build gold layer. After that I went and started defining all
the different tasks that we have to follow in the project. At the start we
will be analyzing, then coding, and after that we're going to do the testing.
Once everything is ready we're going to document it, and at the end we have to
commit our work to the Git repo. All those epics follow the same pattern in
their tasks. So as you can see, we now have a very detailed project structure,
and it is much clearer for us how we're going to build the data warehouse.
With that we are done with this task, and as the next task we have to define
the naming conventions of the project. All right. At this phase of a project
we usually define the naming conventions. So what is that?
It is a set of rules that you define for naming everything in the project,
whether it is a database, schema, tables, stored procedures, folders,
anything. And if you don't do that at an early phase of the project, I promise
you chaos can happen. Why? You will have different developers in your project,
and each of those developers has their own style, of course. So one developer
might name a table dimension_customers, where everything is lowercase with an
underscore between the words. Another developer creates another table called
DimensionProducts, using camel case (more precisely, Pascal case), so there is
no separator between the words and the first character of each word is
capitalized. And maybe another one uses prefixes, like dim_categories, so here
we have an abbreviation of "dimension". As you can see, there are different
designs and styles, and if you leave the door open, in the middle of the
project you will notice that everything looks inconsistent, and you will have
to define a big task to rename everything according to a specific rule. So
instead of wasting all that time, you define the naming conventions at this
phase. Let's go and do that. We usually start with a very important decision:
which naming convention are we going to follow in the whole project? You have
different cases, like camel case, Pascal case, kebab case, and snake case. For
this project, we're going to go with snake case, where all the letters of a
word are lowercase and the separator between words is an underscore. For
example, a table named customer_info: "customer" is lowercase, "info" is
lowercase as well, and between them there is an underscore.
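The snake case rule can be checked and applied mechanically. The following
Python sketch is one possible way to do it; the function names is_snake_case
and to_snake_case and the exact regular expression are assumptions for
illustration, not part of the course material.

```python
import re

# Lowercase words made of letters and digits, joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True if `name` follows the snake_case rule."""
    return SNAKE_CASE.fullmatch(name) is not None


def to_snake_case(name: str) -> str:
    """Convert camelCase, PascalCase, kebab-case, or spaced names to snake_case."""
    # Insert an underscore at each lower-to-upper boundary: customerInfo -> customer_Info
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Hyphens and whitespace also separate words.
    name = re.sub(r"[-\s]+", "_", name)
    return name.lower()


print(is_snake_case("customer_info"))      # True
print(to_snake_case("DimensionProducts"))  # dimension_products
```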
So this is always the first thing that you have to decide for your data
projects. The second thing is to decide on the language. For example, I work
in Germany, and there is always a decision to make whether we use German or
English. So we have to decide for our project which language we're going to
use. And a very important general rule is to avoid reserved words. Don't use a
SQL reserved word as an object name. For example, don't name a table "table".
So those are the general principles, the general rules that you have to follow
in the whole project. They apply to everything: tables, columns, stored
procedures, any names that you are giving in your scripts.
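These general rules can be combined into one small check. The sketch below is
only an illustration: check_name is a hypothetical helper, and RESERVED_WORDS
is a deliberately tiny subset of SQL keywords, since the complete list depends
on the database engine.

```python
import re

# Illustrative subset only; consult the engine's documentation for the full list.
RESERVED_WORDS = {"table", "select", "from", "where", "order", "group", "user", "index", "key"}


def check_name(name: str) -> list[str]:
    """Return the general-rule violations for an object name (empty list means OK)."""
    problems = []
    if not re.fullmatch(r"[a-z][a-z0-9]*(_[a-z0-9]+)*", name):
        problems.append("not snake_case")
    if name.lower() in RESERVED_WORDS:
        problems.append("reserved word")
    return problems


print(check_name("customer_info"))  # []
print(check_name("table"))          # ['reserved word']
```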
Now moving on, we have specifications for the table names, and here we have a
different set of rules for each layer. For the bronze layer the rule says
<sourcesystem>_<entity>. So all the tables in the bronze layer should start
with the source system name, for example crm or erp, followed by an
underscore, and at the end comes the entity name, that is, the table name.
For example, we have the table name crm_customer_info. That means this table
comes from the source system CRM, and the entity name is customer_info. This
is the rule that we're going to follow in naming all tables in the bronze
layer. Moving on to the silver layer, it is exactly like the bronze, because
we are not going to rename anything and we are not going to build any new data
model. So the naming is going to be one to one like the bronze: exactly the
same rules as the bronze.
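The bronze and silver rule, <sourcesystem>_<entity>, can be sketched as a small
parser. The helper name parse_source_table is hypothetical; the set of source
systems contains the two sources named for this project.

```python
SOURCE_SYSTEMS = {"crm", "erp"}  # the two sources of this project


def parse_source_table(name: str) -> tuple[str, str]:
    """Split '<sourcesystem>_<entity>' into (source system, entity)."""
    # Only the first underscore separates the source system from the entity,
    # because the entity itself may contain underscores.
    source, sep, entity = name.partition("_")
    if not sep or not entity:
        raise ValueError(f"{name!r} does not follow <sourcesystem>_<entity>")
    if source not in SOURCE_SYSTEMS:
        raise ValueError(f"unknown source system {source!r} in {name!r}")
    return source, entity


print(parse_source_table("crm_customer_info"))  # ('crm', 'customer_info')
```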
But if we go to the gold layer, since we are building a new data model, we
have to rename things. And since we are also integrating multiple sources
together, we will not be using the source system name in the table names,
because one table could contain data from multiple sources. So the rule says
all table names must be meaningful, business-aligned names, starting with the
category prefix: <category>_<entity>. Now what is the category? In the gold
layer we have different types of tables. We could build a fact table, another
one could be a dimension, and a third type could be an aggregation or a
report. So we have different types of tables, and we can specify the type as a
prefix at the start. For example, fact_sales: the category is fact and the
table name is sales. And here I made a table with the different patterns. A
dimension starts with dim_, for example dim_customers or dim_products. Then we
have another type, the fact table, which starts with fact_. Or an aggregated
table, where we take the first three characters, agg_, for example agg_customers
or agg_sales_monthly.
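The gold rule, <category>_<entity>, can be sketched in the same way. The helper
name gold_table_category is hypothetical, and the mapping only contains the
three prefixes spelled out in the rule (dim, fact, agg); no prefix is assumed
for report tables.

```python
GOLD_CATEGORIES = {"dim": "dimension", "fact": "fact", "agg": "aggregation"}


def gold_table_category(name: str) -> str:
    """Return the table category encoded in a gold-layer table name."""
    prefix, sep, entity = name.partition("_")
    if not sep or not entity or prefix not in GOLD_CATEGORIES:
        raise ValueError(f"{name!r} does not follow <category>_<entity>")
    return GOLD_CATEGORIES[prefix]


print(gold_table_category("fact_sales"))         # fact
print(gold_table_category("agg_sales_monthly"))  # aggregation
```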
So as you can see, when you are creating a naming convention, you first have
to make clear what the rule is, describe each part of the rule, and then give
examples. With that we make it clear for the whole team which names they
should follow. So far we talked about the table naming convention. You can
also define naming conventions for the columns. For example, in the gold layer
we're going to have surrogate keys, and we can define them like this: the
surrogate key should start with the table name, followed by _key. For example,
we can call it customer_key, which is the surrogate key in the customers
dimension. The same goes for technical columns. As data engineers, we might
add our own columns to the tables, columns that don't come from the source
system. Those are the technical columns, or sometimes we call them metadata
columns. In order to separate them from the original columns that come from
the source system, we can use a prefix. The rule says: if you are building any
technical or metadata column, the column name should start with dwh_, followed
by the column name. For example, if you want the load date as metadata, we can
have dwh_load_date. So if anyone sees a column that starts with dwh_, they
understand that this data comes from a data engineer. And we can keep adding
rules, for example for the stored procedures over here. If you are writing an
ETL script, its name should start with the prefix load_, followed by the
layer. For example, the stored procedure that is responsible for loading the
bronze layer is going to be called load_bronze, and for the silver,
load_silver. So those are currently the rules for the stored procedures.
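The column and stored procedure rules can be sketched as three small helpers.
All function names here are hypothetical, and LOAD_LAYERS only lists the two
layers for which a load procedure was named.

```python
TECHNICAL_PREFIX = "dwh_"
LOAD_LAYERS = ("bronze", "silver")  # layers with a named load procedure


def surrogate_key_name(entity: str) -> str:
    """Surrogate key rule: <table_name>_key, e.g. 'customer' -> 'customer_key'."""
    return f"{entity}_key"


def is_technical_column(column: str) -> bool:
    """Technical (metadata) columns are the ones that start with 'dwh_'."""
    return column.startswith(TECHNICAL_PREFIX)


def load_procedure_name(layer: str) -> str:
    """Stored procedure rule: load_<layer>, e.g. 'bronze' -> 'load_bronze'."""
    if layer not in LOAD_LAYERS:
        raise ValueError(f"no load procedure defined for layer {layer!r}")
    return f"load_{layer}"


columns = ["customer_key", "first_name", "dwh_load_date"]
technical = [c for c in columns if is_technical_column(c)]
print(technical)  # ['dwh_load_date']
```

Filtering on the prefix like this is the practical benefit of the dwh_ rule:
columns added by the data engineers can be told apart from source columns
without looking anything up.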
So this is how I usually do it in my projects. All right, my friends. With
that we have solid naming conventions for our project. So this is done, and
the next step is that we're going to go to Git, create a brand new repository,
and prepare its structure. So let's go. All right. Now we come to another
important step in any project, and that is creating the Git repository. If you
are new to Git, don't worry about it. It is simpler than it sounds.
It's all about having a safe place where you can put the code that you are
developing. You have the possibility to track everything that happens to the
code, you can use it to collaborate with your team, and if something goes
wrong you can always roll back. And the best part: once you are done with the
project, you can share your repository as part of your portfolio. It is a
really great thing when you are applying for a job to showcase your skills by
showing that you have built a data warehouse, using a well-documented Git
repository. So now let's create the repository for the project. We are at the
overview of our account. The first thing we have to do is go to the
repositories over here, and then click on this green button, New. First we
have to give the repository a name, so let's call it
sql-data-warehouse-project. Then here we can give it a description. For
example, I'm saying: building a modern data warehouse with SQL Server. The
next option is whether you want to make it public or private. I'm going to
leave it as public. Then let's add a README file here. And for the license, we
can go over here and select MIT. The MIT license gives everyone the freedom to
use and modify your code. Okay, I think I'm happy with the setup. Let's create
the repository. And with that we have our brand new repository.
The next step that I usually do is to create the structure of the repository,
and I always follow the same pattern in any project. Here we need a few
folders to put our files in, right? So what I usually do is go over here to
Add file, Create new file, and I start creating the structure over here. The
first thing we need is datasets, followed by a slash. With that the repository
understands that this is a folder, not a file. Then you can add anything, like
here "placeholder", just an empty file. This just helps me to create the
folder. So let's commit the changes. Now if you go back to the main project,
you can see we have a folder called datasets. I'm going to keep creating the
rest. I will create the documents placeholder, commit the changes, then create
the scripts placeholder, and the final one that I usually add is the tests,
something like this. As you can see, we now have the main folders of our
repository.
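If you prefer to prepare the same structure locally instead of through the web
interface, a minimal Python sketch could look like this. The function name
create_structure is hypothetical, and the exact folder spellings (datasets,
docs, scripts, tests) are assumptions based on the spoken names. The empty
placeholder file is needed because Git does not track empty directories.

```python
from pathlib import Path

FOLDERS = ("datasets", "docs", "scripts", "tests")  # assumed spellings


def create_structure(repo_root: Path) -> list[str]:
    """Create the main folders, each with an empty placeholder file."""
    created = []
    for folder in FOLDERS:
        path = repo_root / folder
        path.mkdir(parents=True, exist_ok=True)
        placeholder = path / "placeholder"
        placeholder.touch()  # empty file so the folder shows up in Git
        created.append(placeholder.relative_to(repo_root).as_posix())
    return created
```

Running create_structure(Path(".")) inside a cloned repository and committing
the result would give the same four folders that are created by hand in the
video.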
Now what I usually do next is edit the main readme. You can see it over here as well. We go inside the readme, click the edit button, and start writing the main information about our project. This really depends on your style, so you can add whatever you want. This is the main page of your repository.
As you can see, the file extension here is .md. It stands for Markdown. It is just an easy and friendly format for writing text. So if you are writing documentation, it is a really nice way to organize and structure it.
At the start I give a short description of the project. We have the main title, then a welcome message and what this repository is about. In the next section we can start with the project requirements, and at the end you can say a few words about the licensing and a few words about you. So it's like the homepage of the project and the repository.
Once you are done, commit the changes. Now if you go to the main page of the repository, you always see the folders and files at the top, and below them the information from the readme. So again we have the welcome statement, then the project requirements, and at the end the licensing and about me.
So my friends, that's it. We now have a repository with the main structure of the project, and throughout the project, as we build the data warehouse, we're going to commit all our work in this repository.
Nice, right? All right. So with that your repository is ready, and as we go through the project we will be adding stuff to it. This step is done, and now for the last step we're finally going to go to SQL Server and write our first script, where we're going to create a database and schemas.
All right. Now the first step is to create a brand new database. In order to do that, we first have to switch to the master database. You can do it like this: USE master, and a semicolon. If you execute it, we are switched to the master database. It is a system database in SQL Server from which you can create other databases. You can see in the toolbar that we are now connected to the master database.
The next step is to create our new database. So we say CREATE DATABASE, and you can call it whatever you want. I'm going to go with DataWarehouse, then a semicolon. Let's execute it, and with that we have created our database. Let's check it in the Object Explorer: refresh, and you can see our new DataWarehouse. This is our new database. Awesome, right?
Now to the next step: we switch to the new database. So we say USE DataWarehouse and a semicolon. Let's switch to it, and you can see that we are now connected to the DataWarehouse database.
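
The three statements from this step can be collected into one short T-SQL sketch. The database name is only spoken in the video, so the spelling DataWarehouse is an assumption. Each statement is followed by GO, the batch separator understood by SQL Server Management Studio and sqlcmd, so that it runs on its own, the same way as executing the statements one at a time.

```sql
-- Switch to the system database from which other databases are created.
USE master;
GO

-- Create the new database (the exact spelling of the name is an assumption).
CREATE DATABASE DataWarehouse;
GO

-- Switch to the new database so that later objects are created inside it.
USE DataWarehouse;
GO
```
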
And now we can start building stuff inside this data warehouse. The first thing I usually do is create the schemas. So what is a schema? Think of it as a folder or a container that helps you keep things organized. As we decided in the architecture, we have three layers: bronze, silver, and gold. So we're going to create a schema for each layer.
Let's start with the first one: CREATE SCHEMA, and the first one is bronze, then a semicolon. Let's execute it to create the first schema. Nice, we have a new schema. To check the schemas, go to the database, then to Security, and then to Schemas. As you can see, we have bronze. If you don't find it, refresh the Schemas folder and then you will find the new schema. Great, so now we have the first schema.
Next we create the other two. I'm just going to duplicate the statement: the next one is silver and the third one is gold. If we execute those two together, we get an error, and that's because we don't have a GO in between; CREATE SCHEMA has to be the first statement in its batch. So after each command, let's add a GO. Now if I highlight silver and gold and execute, it works. GO in SQL Server is like a separator: it tells the tool to execute the first command completely before going to the next one.
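
The schema step can be sketched as follows. The schema names come from the three layers named above; the only thing to keep in mind is that GO is a command for the client tool (SSMS or sqlcmd), not a T-SQL statement, so it has to sit on its own line.

```sql
-- One schema per layer of the architecture.
-- CREATE SCHEMA must be the first statement in its batch,
-- so each statement is closed with GO.
CREATE SCHEMA bronze;
GO

CREATE SCHEMA silver;
GO

CREATE SCHEMA gold;
GO
```

Without the GO lines, running the three statements together fails, which is the error shown in the video.
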
So it is just a separator. Now let's go to our schemas and refresh, and we can see that we have gold and silver as well. So with that we have a database and the three layers, and we can start developing each layer individually.
Okay. So now let's commit our work in Git. Since it is a script, we go to the scripts folder and add a new file. Let's call it init_database.sql, and paste our code into it.
I have made a few modifications. For example, before we create the database we have to check whether the database exists. This is an important step if you are recreating the database; otherwise you get an error saying the database already exists. So the script first checks whether the database exists, and if it does, it drops it.
I have also added a few comments, like here, where we say that we are creating the data warehouse and creating the schemas.
And now we have a very important step: adding a header comment at the start of each script. To be honest, three months from now you will not remember all the details of this script. A comment like this is a sticky note for you when you visit the script again. It is also very important for the other developers in the team, because whenever you or anyone in the team opens the file, the first question is going to be: what is the purpose of this script, and why are we doing this?
So as you can see, we have a comment saying that this script creates a new data warehouse after checking whether it already exists. If the database exists, it is dropped and recreated. Additionally, the script creates three schemas: bronze, silver, and gold. That makes it clear what this script is about, and it makes everyone's life easier.
The second reason why the header is important is that you can add warnings. For this script in particular it is very important, because running it destroys the whole database. Imagine someone opens this script and runs it. Imagine an admin runs it on your database. Everything is destroyed and all the data is lost, and that can be a disaster if you don't have a backup.
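
A minimal sketch of what the top of such a script could look like, combining the header comment, the warning, and the existence check described above. The wording of the header and the database name DataWarehouse are assumptions, and the ALTER DATABASE line is an addition that is not mentioned in the video: it disconnects other sessions so that the drop is not blocked by an open connection.

```sql
/*
=============================================================
Create Database and Schemas
=============================================================
Purpose:
    Creates the data warehouse database after checking whether it
    already exists. If it exists, it is dropped and recreated.
    The script then creates three schemas: bronze, silver, gold.

WARNING:
    Running this script drops the whole database if it exists.
    All data in it is permanently deleted. Make sure you have a
    backup before running it.
*/

USE master;
GO

-- Drop the database if it already exists (name is an assumed spelling).
IF EXISTS (SELECT 1 FROM sys.databases WHERE name = N'DataWarehouse')
BEGIN
    -- Assumption, not from the video: force other sessions off the database.
    ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE DataWarehouse;
END;
GO

-- The CREATE DATABASE and CREATE SCHEMA statements follow from here.
```
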
So with that we have a nice header comment and a few comments in our code, and we are ready to commit. Let's commit it, and now we have our script in Git as well. And of course, if you make any modifications, make sure to update the changes in Git.
Okay, my friends. So with that we have an empty database and schemas, and we are done with this task, and with the whole epic as well. We have completed the project initialization, and now we go to the interesting stuff: building the bronze layer. The first task is to analyze the source systems, so let's go.
All right. So now the big question is how to build the bronze layer. First things first: we analyze. When you are developing anything, you don't immediately start writing code. Before we start coding the bronze layer, we have to understand the source system. What I usually do is interview the source system experts and ask them many, many questions in order to understand the nature of the source system that I'm connecting to the data warehouse.
Once you know the source systems, we can start coding. The main focus here is data ingestion. That means we have to find a way to load the data from the source into the data warehouse. It's like building a bridge between the source and our target system, the data warehouse.
Once we have the code ready, the next step is data validation. Here comes the quality control. In the bronze layer it is very important to check data completeness. That means we compare the number of records between the source system and the bronze layer, just to make sure we are not losing any data in between. Another check that we will be doing is the schema check, to make sure that the data is placed in the right position.
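
The two checks can be sketched with plain queries. The table name bronze.crm_cust_info is a hypothetical example of a bronze table; the row count on the source side has to come from the source system or the extract file itself and is compared by hand here.

```sql
-- Completeness check: count the rows that arrived in the bronze table
-- and compare the number with the record count of the source extract.
SELECT COUNT(*) AS bronze_row_count
FROM bronze.crm_cust_info;   -- hypothetical table name

-- Schema check: list the columns in their defined order, so that names,
-- types, and positions can be compared with the source structure.
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, ORDINAL_POSITION
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'bronze'
  AND TABLE_NAME = 'crm_cust_info'
ORDER BY ORDINAL_POSITION;
```

A quick look at a few rows of the loaded table is a useful complement, since values shifted into the wrong column are usually obvious to the eye.
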
And finally, we must not forget about documentation and committing our work in Git. So this is the process that we're going to follow to build the bronze layer.
All right, my friends. Before connecting any source system to our data warehouse, there is a very important step: understanding the sources. The way I usually do it is to set up a meeting with the source system experts and interview them, asking a lot of questions about the source. Gaining this knowledge is very important, because asking the right questions helps you design the correct scripts to extract the data and avoid a lot of mistakes and challenges. Now I'm going to show you the most common questions that I usually ask before connecting anything.
Okay. We start by understanding the business context and the ownership. I would like to understand the story behind the data. I would like to understand who is responsible for the data, which IT department, and so on. It's also good to understand which business process it supports. Does it support customer transactions, the supply chain, logistics, or maybe finance reporting? With that you can understand the importance of your data.
Then I ask about the system and data documentation. Documentation from the source is your learning material about your data, and it saves you a lot of time later when you are designing new data models. I also always like to understand the data model of the source system. If they have descriptions of the columns and the tables, a data catalog is nice to have. It helps me a lot in the data warehouse when I work out how to join the tables together. With that you get a solid foundation about the business context, the processes, and the ownership of the data.
In the next step we start talking about the technical side. I would like to understand the architecture and the technology stack. The first question I usually ask is how the source system stores the data. Is the data on-prem, like in SQL Server or Oracle, or is it in the cloud, like Azure, AWS, and so on?
Once we understand that, we can discuss the integration capabilities: how am I going to get the data? Does the source system offer APIs, maybe Kafka, or only file extracts, or are they going to give you a direct connection to the database?
Once you understand the technology that you're going to use to extract the data, we dive deeper into more technical questions. Here we want to understand how to extract the data from the source system and then load it into the data warehouse.
So the first thing we have to discuss with the experts is whether we can do an incremental load or a full load. After that we discuss the data scope and the historization. Do we need all the data, or only maybe 10 years of it? Is the history already kept in the source system, or should we build it in the data warehouse?
Then we discuss the expected size of the extracts. Are we talking about megabytes, gigabytes, or terabytes? This is very important for understanding whether we have the right tools and platform to connect that source system.
Then I try to understand whether there are any data volume limitations. Some old source systems struggle a lot with performance. If you have an ETL that extracts a large amount of data, you might bring down the performance of the source system. That's why you have to understand whether there are any limitations for your extracts, and any other aspects that might impact the performance of the source system. This is very important. If they give you access to the database, you are responsible for not bringing the performance of the database down.
And of course, a very important question is about authentication and authorization: how are you going to access the data in the source system? Do you need tokens, keys, passwords, and so on?
So those are the questions that you have to ask when you are connecting a new source system to the data warehouse. Once you have the answers, you can proceed with the next steps to connect the sources to the data warehouse.
All right, my friends. So with that you have learned how to analyze a new source system that you want to connect to your data warehouse. This step is done, and now we go back to coding, where we're going to write scripts to do the data ingestion from the CSV files into the bronze layer.
Let's have a quick look again at our bronze layer specifications. We just have to load the data from the sources into the data warehouse. We're going to build tables in the bronze layer. We are doing a full load, which means we truncate and then insert the data. There will be no data transformations at all in the bronze layer, and we will not be creating any data model. So these are the specifications of the bronze layer.
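
As a rough illustration of "truncate, then insert", here is one way a full load from a CSV file can look in T-SQL. This is a sketch under assumptions: the table name and the file path are hypothetical, and BULK INSERT is used as one possible loading method for CSV files on SQL Server.

```sql
-- Full load: empty the target table, then reload everything from the file.
TRUNCATE TABLE bronze.crm_cust_info;       -- hypothetical table name

BULK INSERT bronze.crm_cust_info
FROM 'C:\data\source_crm\cust_info.csv'    -- hypothetical file path
WITH (
    FIRSTROW = 2,            -- assume row 1 is a header and skip it
    FIELDTERMINATOR = ',',   -- CSV column separator
    TABLOCK                  -- lock the table for the duration of the load
);
```

Because the table is emptied first, the script can be rerun safely and always leaves the table matching the file, with no transformations applied on the way in.
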
All right. Now, in order to create the DDL script for the bronze layer, which creates the bronze tables, we have to understand the metadata: the structure, the schema of the incoming data. Here you either ask the technical experts from the source system for this information, or you explore the incoming data and try to define the structure of your tables yourself.
So we're going to start with the first source system, the CRM. Let's go inside it and start with the first table, the customer info. If you open the file and check the data inside it, you see that we have a header row. That is very good, because now we have the names of the columns coming from the source, and from the content you can of course define the data types. So let's go and do that.
do that. First we're going to say create table and then we have to define the
table and then we have to define the layer. It's going to be the bronze. And
layer. It's going to be the bronze. And now very important we have to follow the
Now, very important: we have to follow the naming convention. So we start with
the name of the source system, which is crm_, and after that comes the table
name from the source system, so it's going to be cust_info. So crm_cust_info is
the name of our first table in the bronze layer. In the next step we have to
define the columns, of course. And here again, the column names in the bronze
layer are going to be one to one, exactly like the source system. So the first
one is going to be the ID, and I will go with the data type INT. Then the next
one is going to be the key, as NVARCHAR, and for the length I will go with 50.
[Music] And the last one is going to be the create date. It's going to be a
DATE. So with that we have covered all the columns available from the source
system. Let's go and check. And yes, the last one is the create date. So that's
it for the first table. Now a semicolon at the end, of course. Let's go and
execute it. And now we go to the Object Explorer over here and refresh, and we
can see the first table inside our data warehouse. Amazing, right?
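
As a rough sketch of the table just described (not the exact script from the video): the column names cst_id, cst_key and cst_create_date are assumptions, only the three columns that are narrated are spelled out, and the bronze schema is assumed to exist already.

```sql
-- Sketch only: the cst_* column names are assumed, not taken from the video.
-- Assumes the schema "bronze" already exists in the current database.
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,           -- ID from the source file
    cst_key         NVARCHAR(50),  -- key from the source file
    -- ... any further columns of the source file go here, one to one ...
    cst_create_date DATE           -- create date, the last column
);
```

The table name follows the convention from the text: source system first (crm_), then the source table name (cust_info).
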
So next, what you have to do is create a DDL statement for each file from those
two systems. For the CRM we need three DDLs, and for the other system, the ERP,
we also have to create three DDLs for the three files. So at the end we're
going to have six tables, six DDLs, in the bronze layer. So now pause the video
and go create those DDLs. I will be doing the same, and I will see you soon.
[Music] All right. So I hope you have created all those DDLs. I'm going to show
you what I have just created. The second table in the source CRM is the product
information, and the third one is the sales details. Then we go to the second
system, and here we make sure that we are following the naming convention: first
the source system, ERP, and then the table name. The second system was really
easy. You can see we have only two columns here, only three for the customers,
and only four columns for the categories. All right. After defining all of
that, of course, we have to go and execute it. So let's go and do that. Then we
go to the Object Explorer over here and refresh the tables. And with that you
can see we have six empty tables in the bronze layer. So we have all the tables
from the two source systems inside our database, but we still don't have any
data. And you can see our naming convention is really nice. The first three
tables come from the CRM source system and the other three come from the ERP.
So in the bronze layer things are split nicely, and you can quickly identify
which table belongs to which source system.
Now there is something else that I usually add to the DDL script: a check
whether the table exists before creating it. For example, let's say that you
are renaming a field or you would like to change the data type of a specific
field. If you just go and run this query again, you will get an error, because
the database is going to say that we already have this table. In some other
databases you can say CREATE OR REPLACE TABLE, but in SQL Server you have to
build the logic in T-SQL. It is very simple. First we have to check whether the
object exists in the database. So we say IF OBJECT_ID and then we have to
specify the table name. So let's go and copy the whole thing over here, and
make sure you get exactly the same name as the table name. There you see a
space, so I'm just going to remove it. Then we define the object type. It's
going to be 'U'. It stands for user; it means user-defined tables. Then we say
IS NOT NULL. If the result is not null, that means the database did find this
object. So what's going to happen? We say go and drop the table, so DROP TABLE,
the whole name again, and a semicolon. So again: if the table exists in the
database, meaning the result is not null, then drop the table, and after that
create it. So now if you highlight the whole thing and execute it, it will
work. First drop the table if it exists, then create the table from scratch.
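
The drop-then-create check described above might look roughly like this. It is a sketch: the column list is abbreviated and the cst_* column names are assumptions.

```sql
-- Drop the table only if it already exists as a user-defined table ('U').
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;

-- Then recreate it from scratch (columns abbreviated; names are assumptions).
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,
    cst_key         NVARCHAR(50),
    cst_create_date DATE
);
```

Dropping the table also discards any rows it holds, which is fine here because the bronze tables are reloaded from the files. On SQL Server 2016 and later, DROP TABLE IF EXISTS is a shorter way to write the same check.
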
Now what you have to do is add this check before creating any table inside our
database. So it's going to be the same thing for the next table and so on. I
went and added all those checks for each table, and if I go and execute the
whole thing, it is going to work. So with that I'm recreating all the tables in
the bronze layer from scratch. Now, the method that we're going to use in order
to load the data from the source into the data warehouse is the bulk insert.
Bulk insert is a method of loading massive amounts of data very quickly from
files, like CSV files or maybe a text file, directly into a database. It is not
like the classical, normal insert, which inserts the data row by row. Instead,
the bulk insert is one operation that loads all the data into the database in
one go, and that's what makes it very fast. So let's go and use this method.
Okay.
Okay, so now let's start writing the script to load the first table from the
source CRM. We're going to load the table customer info from the CSV file into
the database table. The syntax is very simple. We start by saying BULK INSERT.
With that, SQL understands that we are not doing a normal insert, we are doing
a bulk insert. Then we have to specify the table name, so it is
bronze.crm_cust_info. Now we have to specify the full location of the file that
we are trying to load into this table. So what we have to do is get the path
where the file is stored. I'm going to copy the whole path and add it to the
bulk insert, exactly matching where the data is. For me it is on the C drive,
in the SQL data warehouse project, under the data sets, in the source CRM
folder. And then I have to specify the file name, so it's going to be
cust_info.csv. It has to match the path of your files exactly, otherwise it
will not work. After the path we come to the WITH clause. Now we have to tell
SQL Server how to handle our file. So here come the specifications. There is a
lot of stuff that we can define, so let's start with a very important one, the
header row. If you check the content of our files, you can see that the first
row always holds the header information of the file. That information is
actually not data, it's just the column names. The actual data starts from the
second row, and we have to tell the database about this. So we're going to say
FIRSTROW = 2: the first row is actually the second row. With that we are
telling SQL to skip the first row in the file. We don't need to load that
information, because we have already defined the structure of our table. So
this is the first specification.
The next one, which is also very important when loading any CSV file, is the
separator between fields, the delimiter. It really depends on the file
structure that you are getting from the source. As you can see, all those
values are split with a comma, and we call this comma the field separator or
the delimiter. I have seen a lot of different CSVs; sometimes they use a
semicolon or a pipe or a special character like a hash, and so on. So you have
to understand how the values are split. In this file they are split by a comma,
and we have to tell SQL about this. It's very important. So we're going to say
FIELDTERMINATOR and then we say it is the comma. Basically, those two pieces of
information are very important for SQL to be able to read your CSV file. Now,
there are many other options that you can add. For example, TABLOCK. It is an
option to improve the performance, where you lock the entire table while
loading it. So as SQL is loading the data into this table, it is going to lock
the whole table.
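
Putting the pieces above together, the statement might look like the following sketch. The file path is a placeholder, not the path from the video; replace it with the full location of your own file.

```sql
BULK INSERT bronze.crm_cust_info
FROM 'C:\your\project\path\source_crm\cust_info.csv'  -- placeholder path
WITH (
    FIRSTROW = 2,           -- skip the header row; data starts on row 2
    FIELDTERMINATOR = ',',  -- values in the file are separated by commas
    TABLOCK                 -- lock the whole table for the duration of the load
);
```

Keep in mind that the path is resolved by the SQL Server instance, so the file has to be reachable from the machine where the server runs.
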
So that's it for now. I'm just going to add the semicolon, and let's insert the
data from the file into our bronze table. Let's execute it. And now we can see
that SQL did insert around 18,000 rows into our table. So it is working. We
just loaded the file into our database.
But it is not enough to just write this script. You have to test the quality of
your bronze table, especially if you are working with files. So let's just do a
simple SELECT from our new table and run it. The first thing that I check is:
do we have data in each column? Well, yes, as you can see we have data. And the
second thing is: do we have the data in the correct column? This is very
critical when you are loading data from a file into a database. So for example,
here we have the first name, which of course makes sense, and here we have the
last name. But what could happen, and this mistake happens a lot, is that you
find the first name inside the key, the last name inside the first name, and
the status inside the last name. So the data is shifted. This data engineering
mistake is very common if you are working with CSV files, and there are
different reasons why it happens. Maybe the definition of your table is wrong,
or the field separator is wrong, so maybe it's not a comma but something else.
Or the separator is a bad separator, because sometimes there is a comma inside
the keys or inside the first name, and SQL is not able to split the data
correctly. Then the quality of the CSV file is not really good. So there are
many different reasons why you might not get the data in the correct column.
But for now everything looks fine for us.
And the next step is that I'll count the rows inside this table. So let's go
and select that. We can see we have 18,493. And now we can go to our CSV file
and check how many rows we have inside the file. As you can see, we have
18,494. We are almost there. There is one extra row inside the file, and that's
because of the header. The header row is not loaded into our table, and that's
why our tables are always going to have one row less than the original files.
So everything looks nice, and we have done this step correctly.
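
The two checks described above can be sketched as a pair of simple queries. The alias row_count is just an illustrative name.

```sql
-- 1) Look at the rows: is every column populated, and does each value
--    sit in the column where it belongs (no shifted data)?
SELECT * FROM bronze.crm_cust_info;

-- 2) Count the rows and compare with the number of lines in the CSV file.
--    The table should hold one row less than the file, because the header
--    line is skipped during the load.
SELECT COUNT(*) AS row_count FROM bronze.crm_cust_info;
```

In the video the count is 18,493 against 18,494 lines in the file, which matches the one skipped header line.
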
Now, if I run it again, what's going to happen? We will get duplicates inside
the bronze layer. So now we have loaded the file twice into the same table,
which is not correct. The method that we discussed is to first make the table
empty and then load: truncate and then insert. In order to do that, before the
bulk insert we're going to say TRUNCATE TABLE, then our table name, and that's
it, with a semicolon. So what we are doing now is first making the table empty
and then loading from scratch. We are loading the whole content of the file
into the table, and this is what we call a full load. So let's mark everything
together and execute. And again, if you check the content of the table, you can
see we have only 18,000 rows. Let's run it again. If you check the count of the
bronze table, you can see we still have the 18,000. So each time you run this
script now, we are refreshing the table customer info from the file into the
database table. We are refreshing the bronze layer table. That means if there
are any changes in the file, they will be loaded into the table. So this is how
we do a full load in the bronze layer: by truncating the table and then doing
the inserts.
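
A minimal sketch of this full load for one table, again with a placeholder file path:

```sql
-- Step 1: empty the table so that a rerun does not create duplicates.
TRUNCATE TABLE bronze.crm_cust_info;

-- Step 2: load the whole file again from scratch.
BULK INSERT bronze.crm_cust_info
FROM 'C:\your\project\path\source_crm\cust_info.csv'  -- placeholder path
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    TABLOCK
);

-- Optional check: the count should stay the same however often this runs.
SELECT COUNT(*) AS row_count FROM bronze.crm_cust_info;
```

Because the table is emptied first, running the script twice gives the same result as running it once.
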
And now, of course, what you have to do is pause the video and write the same
script for all six files. So let's go and do that. [Music] Okay, back. I hope
that you have written all those scripts as well. So I have three sections to
load the first source system and then three sections to load the second source
system. As you are writing those scripts, make sure to have the correct path.
For the second source system, you have to change the path to the other folder.
And also don't forget that the table name in the bronze layer is different from
the file name, because we always start with the source system name; in the file
names we don't have that. So now I think I have everything ready. Let's go and
execute the whole thing. Perfect.
Awesome. So everything is working. Let me check the messages. We can see from
the messages how many rows were inserted into each table. And now, of course,
the task is to go through each table and check the content. So that means we
now have a really nice script to load the bronze layer, and we will use this
script on a daily basis. Every day we have to run it in order to get new
content into the data warehouse. And as we learned before, if you have a SQL
script that is frequently used, you can create a stored procedure from it. So
let's go and do that. It's going to be very simple.
We're going to go over here and say CREATE OR ALTER PROCEDURE. Now we have to
define the name of the stored procedure. I'm going to put it in the schema
bronze, because it belongs to the bronze layer. Then we follow the naming
convention: the stored procedure starts with load_ and then the layer, bronze,
so bronze.load_bronze. That's it for the name. And then, very important, we
have to define the BEGIN and the END of our SQL statements. So here is the
BEGIN, and let's go to the end and say END. Then let's highlight everything in
between and give it one push with Tab, so it is easier to read. Next, we're
going to execute it. So let's create this stored procedure. And now, if you
want to check your stored procedure, you go to the database, and here we have a
folder called Programmability, and inside it we have Stored Procedures. If you
refresh, you will see our new stored procedure. Let's go and test it. I'm going
to open a new query, and we're going to say EXECUTE bronze.load_bronze. So
let's execute it. And with that, we have just loaded the complete bronze layer.
As you can see, SQL did insert all the data from the files into the bronze
layer. It is way easier than running those scripts each time, of course.
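
Wrapped into a stored procedure, the load script might look like this sketch. Only the first table is written out, and the file path is a placeholder.

```sql
-- CREATE OR ALTER creates the procedure, or replaces it if it already exists.
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    TRUNCATE TABLE bronze.crm_cust_info;

    BULK INSERT bronze.crm_cust_info
    FROM 'C:\your\project\path\source_crm\cust_info.csv'  -- placeholder path
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

    -- ... same TRUNCATE + BULK INSERT pair for the other five bronze tables ...
END;
GO  -- batch separator understood by SSMS and sqlcmd, not a T-SQL statement

-- Run the whole bronze load with a single call.
EXECUTE bronze.load_bronze;
```

The GO line matters because CREATE PROCEDURE has to be the only statement in its batch; without it, the EXECUTE call would become part of the procedure body.
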
All right. So the next step: as you can see, the output message does not have a
lot of information. The messages of your ETL stored procedure will not be
really clear. That's why, if you are writing an ETL script, always take care of
the messaging of your code. So let me show you a nice design. Let's go back to
our stored procedure. What we can do is divide the messages based on our code.
We can start with a message over here, for example. Let's say PRINT, and we say
what we are doing with this stored procedure: we are loading the bronze layer.
This is the main message, the most important one, and we can play with
separators like this. So we say PRINT again and add some nice separators, for
example equals signs, at the start and at the end, just to have a section. So
this is just a nice message at the start.
Now, by looking at our code we can see that it is split into two sections. In
the first section we are loading all the tables from the source system CRM, and
the second section is loading the tables from the ERP. So we can split the
prints by source system. Let's go and do that. We're going to say PRINT and
then "loading CRM tables". This is for the first section. Then we can add some
nice separators again; let's take the minus. And of course, don't forget the
semicolons like I did. We're going to have a semicolon for each print. Same
thing over here: I will copy the separator, because we're going to have it at
the start and at the end. Let's copy the whole thing for the second section.
For the ERP, it starts over here, and we're going to have it like this and call
it "loading ERP". So with that, in the output we can see a nice separation
between loading each source system.
Now we go to the next step, where we add a print for each action. For example,
here we are truncating the table. So we say PRINT, and now we can add two
arrows and say what we are doing: we are truncating the table. And we can add
the table name to the message as well. So this is the first action, and we can
add another print for inserting the data. We can say "inserting data into" and
then the table name. With that, in the output we can understand what SQL is
doing. So let's repeat this for all the other tables. Okay. So I just added all
those prints, and don't forget the semicolon at the end. I would say let's
execute it and check the output. So let's do that, and then, just to have quick
output, execute our stored procedure right at the start, like this. So let's
see. Now if you check the output, you can see things are more organized than
before. At the start we read: okay, we are loading the bronze layer. First we
are loading the source system CRM, and then the second section is for the ERP.
And we can see the actions: truncating, inserting, truncating, inserting for
each table, and the same thing for the second source. So as you can see, it is
nice and cosmetic, but it's very important when you are debugging any errors.
errors. And speaking of errors, we have to go and handle the errors in our store
to go and handle the errors in our store procedure. So let's go and do that. It's
procedure. So let's go and do that. It's going to be the first thing that we do.
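Before moving on to error handling, the print statements described so far might look roughly like this. This is only a sketch: the table name bronze.crm_cust_info, the file path, and the BULK INSERT options are assumptions for illustration and are not taken from the transcript.

```sql
-- Section header for one source system.
PRINT '------------------------------------------------';
PRINT 'Loading CRM Tables';
PRINT '------------------------------------------------';

-- One print per action, prefixed with two arrows.
PRINT '>> Truncating Table: bronze.crm_cust_info';   -- hypothetical table name
TRUNCATE TABLE bronze.crm_cust_info;

PRINT '>> Inserting Data Into: bronze.crm_cust_info';
BULK INSERT bronze.crm_cust_info
FROM 'C:\path\to\source_crm\cust_info.csv'           -- hypothetical path
WITH (
    FIRSTROW = 2,            -- assumes the CSV file has a header row
    FIELDTERMINATOR = ',',   -- CSV column separator
    TABLOCK                  -- take a table lock for the duration of the load
);
```

The same two prints would be repeated for every table, with a second header for the ERP section.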
We say BEGIN TRY, then we go to the end of our script and, before the last END,
we say END TRY. Next we have to add the catch, so we say BEGIN CATCH and END
CATCH. Now first let's organize our code: I'm going to take the whole code and
indent it one more level, and the BEGIN TRY as well, so it is more organized.
As you know, with TRY and CATCH, SQL executes the TRY block, and if there are
any errors while executing it, the second section is executed. So the CATCH is
executed only if SQL fails to run the TRY. Now what we have to do is define for
SQL what to do if there is an error in your code.
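The placement of the TRY and CATCH blocks inside the procedure can be sketched as follows. The procedure name bronze.load_bronze is an assumption, and CREATE OR ALTER needs SQL Server 2016 SP1 or later; on older versions you would use ALTER PROCEDURE instead.

```sql
-- Minimal skeleton; assumes the bronze schema already exists.
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    BEGIN TRY
        -- The whole load logic (PRINT, TRUNCATE, BULK INSERT) goes here,
        -- indented one level inside the TRY block.
        PRINT 'Loading Bronze Layer';
    END TRY
    BEGIN CATCH
        -- Runs only if a statement inside the TRY block raises an error.
        PRINT 'Error occurred during loading bronze layer';
    END CATCH
END
```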
And here we can do multiple things, like creating a logging table and adding
the messages to it, or adding some nice messaging to the output. For example, we
can add a section again over here: some equals signs, repeated below, with some
content in between. We can start with something like "Error occurred during
loading bronze layer", and then we can add many things. For example, we can add
the error message, and here we call the function ERROR_MESSAGE(), and we can add
the error number as well, so ERROR_NUMBER(). Of course the output of this is a
number, but the error message here is text, so we have to change the data type:
we do a CAST as NVARCHAR, like this. There are many more functions that you can
add to the output, for example ERROR_STATE() and so on. So you can design what
happens if there is an error in the ETL.
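A CATCH block along these lines could look like the sketch below. The division by zero is only there to force an error for the demo, and the exact message wording is an assumption.

```sql
BEGIN TRY
    -- Force a runtime error so the CATCH block runs (demo only).
    SELECT 1 / 0;
END TRY
BEGIN CATCH
    PRINT '==========================================';
    PRINT 'ERROR OCCURRED DURING LOADING BRONZE LAYER';
    -- ERROR_MESSAGE() already returns text, so it can be concatenated directly.
    PRINT 'Error Message: ' + ERROR_MESSAGE();
    -- ERROR_NUMBER() and ERROR_STATE() return integers, so cast them first.
    PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR);
    PRINT 'Error State: ' + CAST(ERROR_STATE() AS NVARCHAR);
    PRINT '==========================================';
END CATCH
```

Without the CAST, SQL Server would try to convert the message text to a number instead, and the concatenation would fail.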
Now what else is very important in each ETL process is to add the duration of
each step. For example, I would like to understand how long it takes to load
this table over here, but looking at the output, I don't have any information
about how long it takes to load my tables. This is very important because, as
you build a big data warehouse, the ETL process is going to take a long time,
and you would like to understand where the issue is, where the bottleneck is,
which table is consuming a lot of time to load. That's why we have to add this
information to the output as well, or maybe even log it in a table. So let's
add this step as well.
So we go to the start. In order to calculate the duration, you need the start
time and the end time, so we have to know when we started loading and when we
finished loading the table. The first thing is to declare the variables. We say
DECLARE, and let's make one called start time, and its data type is DATETIME; I
need exactly the second when it started. Then another variable for the end time,
with the same DATETIME type. With that we have declared the variables, and the
next step is to use them. So now let's go to the first table, the customer info,
and at the start we say SET start time equal to GETDATE(). So we get the exact
time when we start loading this table. Then let's copy the whole thing and go to
the end of the load over here, and this time we set the end time equal to
GETDATE() as well. With that we now have the values for when we started loading
this table and when we completed loading the table.
And now the next step is to print the duration. So over here we say PRINT,
again with the same design: two arrows, then very simply "Load Duration", then a
colon and a space. Now we have to calculate the duration, and we can do that
using the date and time function DATEDIFF, which finds the interval between two
dates. So we say plus over here and then use DATEDIFF. Here we have to define
three arguments. The first one is the unit; here you can define second, minute,
hour and so on, and we go with second. Then we define the start of the interval,
which is the start time. And the last argument is going to be the end of the
boundary.
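To make the three DATEDIFF arguments concrete, here is a small sketch with fixed example timestamps (the values are made up for illustration):

```sql
-- Fixed timestamps, 90 seconds apart, in unambiguous ISO 8601 format.
DECLARE @start_time DATETIME = '2024-01-01T10:00:00';
DECLARE @end_time   DATETIME = '2024-01-01T10:01:30';

SELECT
    -- DATEDIFF(unit, start, end)
    DATEDIFF(second, @start_time, @end_time) AS seconds_elapsed,  -- 90
    DATEDIFF(minute, @start_time, @end_time) AS minutes_elapsed;  -- 1
```

DATEDIFF counts how many boundaries of the chosen unit are crossed between the two values, which is why the minute result is 1 and not 1.5.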
That last argument is going to be the end time. Now of course the output of
this is a number, which is why we have to cast it. So we say CAST ... AS
NVARCHAR and close it like this, and at the end we say plus " seconds" in order
to have a nice message. So again, what have we done? We have declared the two
variables and we are using them: at the start we get the current date and time,
at the end of loading the table we get the current date and time again, and then
we find the difference between them to get the load duration. In this case we
are just printing this information. Now we can of course add a nice separator
between the tables, so I'm going to do it like this: just a few dashes, nothing
more. What we have to do now is add this mechanism for each table, in order to
measure the speed of the ETL for each one of them.
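Put together, the timing pattern for one table might look like the sketch below. The variable names are assumed from the spoken "start time" and "end time", and WAITFOR DELAY is only a stand-in for the real TRUNCATE and BULK INSERT so that the snippet runs on its own.

```sql
DECLARE @start_time DATETIME, @end_time DATETIME;

SET @start_time = GETDATE();        -- moment the load of this table starts

WAITFOR DELAY '00:00:02';           -- stand-in for TRUNCATE TABLE + BULK INSERT

SET @end_time = GETDATE();          -- moment the load of this table ends

-- DATEDIFF returns an integer, so cast it before concatenating.
PRINT '>> Load Duration: '
    + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR)
    + ' seconds';
PRINT '>> -------------';           -- separator between tables
```

In the procedure, the two SET statements and the two prints would wrap each table's load, reusing the same pair of variables.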
Okay. So now I have added all those configurations for each table, and let's
run the whole thing now. So let's update the stored procedure, and then run it.
So let's execute. Now, as you can see, we have one more piece of info about the
load durations, and everywhere I can see zero seconds. That's because loading
this data is super fast: we are doing everything locally on one PC, so loading
the data from files into the database is going to be mega fast. But of course in
real projects you have different servers with networking between them, and you
have millions of rows in the tables, so the duration is not going to be 0
seconds; things are going to be slower. And now you can easily see how long it
takes to load each of your tables. And of course what is very interesting is to
understand how long it takes to load the whole bronze layer. So now your task is
to also print, at the end, information about the whole batch: how long it took
to load the bronze layer.
Okay, I hope we are done. I have done it like this. We have to define two new
variables: the start time of the batch and the end time of the batch. The first
step in the stored procedure is to get the date and time for the first variable.
And exactly at the end, as the last thing we do in the stored procedure, we get
the date and time for the end time; so we again say SET ... GETDATE() for the
batch end time. Then all we have to do is print a message. We say "Loading
Bronze Layer is Completed", and then we print the total load duration, the same
thing with a DATEDIFF between the batch start time and the batch end time,
calculating the seconds and so on. Now we have to execute the whole thing. So
let's refresh the definition of the stored procedure and then execute it. In the
output we go to the last message, and we can see that loading the bronze layer
is completed and the total load duration is also 0 seconds, because the
execution time is less than 1 second.
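The batch-level timing described here could be sketched like this. The variable names and the exact message layout are assumptions based on the spoken description.

```sql
DECLARE @batch_start_time DATETIME, @batch_end_time DATETIME;

SET @batch_start_time = GETDATE();   -- first step in the stored procedure

-- ... load all bronze tables here ...

SET @batch_end_time = GETDATE();     -- last step in the stored procedure

PRINT '==========================================';
PRINT 'Loading Bronze Layer is Completed';
PRINT '   - Total Load Duration: '
    + CAST(DATEDIFF(second, @batch_start_time, @batch_end_time) AS NVARCHAR)
    + ' seconds';
PRINT '==========================================';
```

Because DATEDIFF with the second unit only counts whole-second boundaries, a sub-second load will usually show 0 seconds. If finer resolution matters, one option is to use millisecond as the unit instead.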
So with that you are now getting a feeling for how to build an ETL process. As
you can see, data engineering is not only about how to load the data. It's about
how to engineer the whole pipeline: how to measure the speed of loading the
data, what happens if there is an error, and how to print each step in your ETL
process and make everything organized and clear in the output, and maybe in the
logging, just to make debugging and optimizing the performance way easier. And
there are a lot of things that we can add. We can add quality measures and so
on. So we can add many things to our ETL script to make our data warehouse
professional.
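The idea of keeping the durations in a logging table, which is mentioned but not built in this part of the course, could be sketched as follows. The table bronze.etl_load_log and all of its columns are hypothetical and are not part of the project shown here.

```sql
-- Hypothetical logging table; assumes the bronze schema already exists.
CREATE TABLE bronze.etl_load_log (
    log_id        INT IDENTITY(1,1) PRIMARY KEY,
    table_name    NVARCHAR(128) NOT NULL,
    start_time    DATETIME      NOT NULL,
    end_time      DATETIME      NOT NULL,
    -- Computed column: duration is derived from the two timestamps.
    duration_sec  AS DATEDIFF(second, start_time, end_time)
);

DECLARE @start_time DATETIME = GETDATE();
-- ... load one table here ...
DECLARE @end_time DATETIME = GETDATE();

-- Record the load instead of (or in addition to) printing it.
INSERT INTO bronze.etl_load_log (table_name, start_time, end_time)
VALUES ('bronze.crm_cust_info', @start_time, @end_time);   -- hypothetical table name
```

A table like this makes it possible to compare load times across runs, which the printed output alone does not allow.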
All right, my friends. So with that we have developed the code to load the
bronze layer, and we have tested it as well. In the next step we're going back
to draw.io, because we want to draw a diagram of the data flow. So let's go.

So now, what is a data flow diagram? We're going to draw a simple visual in
order to map the flow of your data: where it comes from and where it ends up. We
just want to make clear how the data flows through the different layers of your
project. That helps us create something called the data lineage, and this is
really nice, especially if you are analyzing an issue. If you have multiple
layers and you don't have a real data lineage or flow, it's going to be really
hard to analyze the scripts in order to understand the origin of the data, and
having this diagram is going to improve the process of finding issues.
So now let's create one. Okay. So now we're back in draw.io, and we're going to
build the flow diagram. We start with the source systems, so let's build the
layer. I'm going to remove the fill and make the line dotted. Then we add a box
saying "Sources" and put it over here, increase the size to 24, and again
without any lines. Now, what do we have inside the sources? We have folders and
files. So let's search for a folder icon. I'm going to take this one over here
and say: you are the CRM. We can increase the size as well. And we have another
source: the ERP. Okay. So, this is the first layer.
Now let's build the bronze layer. We grab another box and set the coloring like
this, and instead of auto, maybe take the hatch fill, something like this,
whatever you like, and make it rounded. Then we can put the title on top of it,
so we say: you are the bronze layer, and increase the font size as well. Now
we're going to add a box for each table that we have in the bronze layer. For
example, we have the sales details. We can make it a little bit smaller, so
maybe 16 and not bold. And we have the other two tables from the CRM: the
customer info and the product info. Those are the three tables that come from
the CRM. Now we connect the source CRM with those three tables: we go to the
folder and draw arrows from the folder to the bronze layer, like this. And now
we have to do the same thing for the ERP source.
So as you can see, the data flow diagram shows us in one picture the data
lineage between the two layers. Here we can easily see that those three tables
come from the CRM, and those three tables in the bronze layer come from the ERP.
I understand that if we have a lot of tables, it's going to be a huge mess. But
if you have a small or medium data warehouse, building those diagrams makes it
much easier to understand how everything flows from the sources into the
different layers of your data warehouse. All right. So with that we have the
first version of the data flow. This step is done, and the final step is to
commit our code to the Git repo.

Okay. So now let's commit our work. Since these are scripts, we go to the
scripts folder. Here we're going to have scripts for the bronze, silver, and
gold layers, which is why it makes sense to create a folder for each layer. So
let's start by creating the bronze folder. I'm going to create a new file and
say bronze, slash, and then the DDL script of the bronze layer, as a .sql file.
Now I'm going to paste the DDL code that we have created, so those six tables.
As usual, at the start we have a comment where we explain the purpose of this
script. We are saying: this script creates tables in the bronze schema, and by
running this script you are redefining the DDL structure of the bronze tables.
So let's have it like that, and I'm going to commit the changes.
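A header comment of the kind described, followed by one table definition, might look like the sketch below. The table name, the two columns, and the drop-and-recreate pattern are assumptions used only to show the shape of such a script; the real script defines six tables.

```sql
/*
===============================================================================
DDL Script: Create Bronze Tables
===============================================================================
Script Purpose:
    This script creates tables in the 'bronze' schema.
    Running this script redefines the DDL structure of the bronze tables.
===============================================================================
*/

-- Drop the table if it already exists, so the script can be rerun.
-- (Assumed pattern; table and column names are hypothetical placeholders.)
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;

CREATE TABLE bronze.crm_cust_info (
    cst_id   INT,
    cst_key  NVARCHAR(50)
);
```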
All right. So now, as you can see, inside scripts we have a folder called
bronze, and inside it we have the DDL script for the bronze layer. We're also
going to put our stored procedure in the bronze folder. So we create a new file,
let's call it proc_load_bronze.sql, and then paste our script. As usual, I have
put an explanation of the stored procedure at the start. We are saying: this
stored procedure loads the data from the CSV files into the bronze schema. It
first truncates the tables and then does a bulk insert. About the parameters:
this stored procedure does not accept any parameters or return any values. And
here is a quick example of how to execute it. All right. So I think I'm happy
with that. So let's commit it.
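The documentation header described here could be sketched like this. The procedure name bronze.load_bronze is an assumption, and the body is reduced to a single placeholder statement so the snippet stays short; GO is the batch separator used by SSMS and sqlcmd.

```sql
/*
===============================================================================
Stored Procedure: Load Bronze Layer (Source -> Bronze)
===============================================================================
Script Purpose:
    Loads data from the CSV files into the 'bronze' schema.
    - Truncates the bronze tables before loading.
    - Uses BULK INSERT to load each CSV file.

Parameters:
    None. This stored procedure does not accept any parameters
    or return any values.

Usage Example:
    EXEC bronze.load_bronze;
===============================================================================
*/
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    -- Placeholder for the real load logic (TRUNCATE + BULK INSERT per table).
    PRINT 'Loading Bronze Layer';
END
GO

-- Run the procedure; it takes no arguments.
EXEC bronze.load_bronze;
```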
All right, my friends. So with that we have committed our code to Git, and with
that we are done building the bronze layer. So the whole epic is done. Now we go
to the next one. This one is going to be more advanced than the bronze layer,
because there will be a lot of struggle with cleaning the data and so on. We
start with the first task, where we analyze and explore the data in the source
systems. So let's go.

Okay. So now we start with the big question: how to build the silver layer?
What is the process?
Okay. As usual, first things first, we have to analyze. The task, before
building anything in the silver layer, is to explore the data in order to
understand the content of our sources. Once we have that, we start coding, and
here the transformation we're going to do is data cleansing. This is usually a
process that takes a really long time, and I usually do it in three steps. The
first step is to check the data quality issues that we have in the bronze layer.
So before writing any data transformations, we first have to understand what the
issues are. Only then do I start writing data transformations in order to fix
all those quality issues that we have in the bronze layer. And in the last step,
once I have clean results, we insert them into the silver layer. Those are the
three phases that we will go through as we write the code for the silver layer.
Then, as the third step of the process, once we have all the data in the silver
layer, we have to make sure that the data is now correct and that we don't have
any quality issues anymore. If you find any issues, of course, you go back to
coding, do the data cleansing, and validate again. So it is like a cycle between
validating and coding. Once the quality of the silver layer is good, we cannot
skip the last phase, where we document and commit our work to Git. Here we're
going to have two new pieces of documentation: we're going to build the data
flow diagram and also the data integration diagram, now that we have understood
the relationships between the sources from the first step. So this is the
process, and this is how we're going to build the silver layer.
All right. So now, exploring the data in the bronze layer. Why is it so
important? Because understanding the data is the key to making smart decisions
in the silver layer. In the bronze layer, the focus was not on understanding the
content of the data at all; we focused only on how to get the data into the data
warehouse. That's why we now have to take a moment to explore and understand the
tables, and also how to connect them: what are the relationships between these
tables? And it is very important, as you are learning about a new source system,
to create some kind of documentation. So now let's go and explore the sources.
Okay.
let's go and explore the sources. Okay. So now let's go and explore them one by
So now let's go and explore them one by one. We can start with the first one
one. We can start with the first one from the CRM. We have the customer info.
from the CRM. We have the customer info. So right click on it and say select top
So right click on it and say select top thousand rows. And this is of course
thousand rows. And this is of course important if you have like a lot of
important if you have like a lot of data. Don't go and explore millions of
data. Don't go and explore millions of rows. Always limit your query. So for
rows. Always limit your query. So for example here we are using the top
example here we are using the top thousands just to make sure that you are
thousands just to make sure that you are not impacting the system with your
not impacting the system with your queries. So now let's have a look to the
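The row limit described here corresponds to a query like the one below in SQL Server. This is only a minimal sketch: the table name bronze.crm_cust_info is an assumed spelling of the "CRM customer info" table and is not confirmed by this part of the transcript.

```sql
-- Preview at most 1000 rows instead of reading the whole table.
-- Assumption: the customer table is named bronze.crm_cust_info.
SELECT TOP (1000) *
FROM bronze.crm_cust_info;
```

Without an ORDER BY, the rows returned by TOP are not guaranteed to be in any particular order, which is usually fine for a quick look at the content.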
So now let's have a look at the content of this table. We can see that we have
customer information here. We have an ID, we have a key for the customer, we
have first name, last name, marital status, gender, and the creation date of the
customer. So simply, this is a table for the customer information, with a lot of
details about the customers. And here we have two identifiers: one is like a
technical ID, and the other one is like the customer number. So maybe we can use
either the ID or the key in order to join it with other tables. Now, what I
usually do is go and draw a data model, or let's say an integration model, just
to document and visualize what I am understanding, because if you don't do that,
you're going to forget it after a while.
So now we go and search for a shape. Let's search for a table, and I'm going to
pick this one over here. Here we can change the style: for example, we can make
it rounded, or you can make it sketch, and so on. And we can change the color.
I'm going to make it blue. Then go to the text. Make sure to select the whole
thing, and let's make it bigger: 26. Then, for those items, I'm just going to
select them, go to Arrange, and maybe make it 40. Something like this. So now
we're going to put in the table name. This is the one that we are now learning
about. And I'm just going to put the primary key here. I will not list all the
information. So the primary key was the ID, and I will remove all that other
stuff. I don't need it. Now, as you can see, the table name is not really
friendly. So I can bring in a text, put it here on top, and say this is the
customer information, just to make it friendly and to not forget about it. And
as well I'm going to increase the size to maybe 20, something like this. Okay.
With that, we have our first table, and we're going to keep exploring.
So let's move to the second one. We're going to take the product information,
right-click on it, and select the top thousand rows. I will just put it below
the previous query and run it. Now, looking at this table, we can see we have
product information. We have here a primary key for the product, then we have a
key, or let's say a product number, and after that we have the full name of the
product, the product cost, then the product line, and then we have a start and
an end. Well, it is interesting to understand why we have start and end. Let's
have a look, for example, at those three rows. All three have the same key, but
they have different IDs. So it is the same product but with different costs. For
2011 we have a cost of 12, then for 2012 we have 14, and for the last year,
2013, we have 13. So it's like we have a history of the changes. This table
holds not only the current information of the product but also its historical
information, and that's why we have those two dates, start and end.
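One way to confirm this kind of history is to look for product keys that occur in more than one row and then list their versions in date order. The sketch below assumes a table bronze.crm_prd_info with columns prd_id, prd_key, prd_cost, prd_start_dt and prd_end_dt; these names are illustrative and are not spelled out in this part of the transcript.

```sql
-- Assumed names: bronze.crm_prd_info, prd_id, prd_key, prd_cost,
-- prd_start_dt, prd_end_dt.

-- Product keys that appear in more than one row, i.e. that have history.
SELECT prd_key, COUNT(*) AS version_count
FROM bronze.crm_prd_info
GROUP BY prd_key
HAVING COUNT(*) > 1;

-- Every version of every product, oldest first, to read the cost changes.
SELECT prd_id, prd_key, prd_cost, prd_start_dt, prd_end_dt
FROM bronze.crm_prd_info
ORDER BY prd_key, prd_start_dt;
```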
Now let's go back and draw this information over here. I'm just going to
duplicate the table. The name of this table is going to be the PRD info, and
let's give it a short description: current and history product information,
something like this, just to not forget that we have history in this table. And
here we have as well the PRD ID. And there is nothing that we can use in order
to join those two tables: we don't have a customer ID here, and in the other
table we don't have any product ID.
Okay, so that's it for this table. Let's jump to the third table, the last one
in the CRM. So let's go and select. I just made the other queries shorter as
well. Let's execute. So what do we have over here? We have a lot of information
about the orders, the sales, and a lot of measures. We have the order number. We
have the product key, so this is something that we can use in order to join it
with the product table. We have the customer ID; we don't have the customer key.
So here we have an ID and here we have a key, so there are two different ways to
join the tables. And then we have dates here: the order date, the shipping date,
the due date. And then we have the sales amount, the quantity, and the price. So
this is like an event table. It is a transactional table about the orders and
sales, and it is a great table for connecting the customers with the products
and as well with the orders.
So let's document this new information that we have. The table name is the sales
details, and we can describe it like this: transactional records about sales and
orders. And now we have to describe how we can connect this table to the other
two. We are not using the product ID; we are using the product key. And now we
need a new row over here. You can hold Control and Enter, or you can go over
here and add a new row. And the other row is going to be the customer ID. Now,
for the customer ID it is easy: we can grab an arrow in order to connect those
two tables. But for the product key, we are not using the ID. So that's why I'm
just going to remove this one and say product key. Let's check again. So this is
a product key; it's not the product ID. And if we check the other table, the
product info, you can see we are using this key and not the primary key. So what
we're going to do now is just link it like this, and maybe switch those two
tables, so I will put the customers below. Perfect. It looks nice.
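The two arrows in the diagram translate into two joins: sales to customers on the customer ID, and sales to products on the product key. The following is a rough sketch only. All table and column names (bronze.crm_sales_details, sls_ord_num, sls_cust_id, sls_prd_key, sls_sales, cst_id, cst_key, prd_key, prd_nm) are assumptions, and the raw key values in the bronze layer may need cleaning before they actually match.

```sql
-- Assumed table and column names throughout; adjust to the real DDL.
SELECT TOP (1000)
    s.sls_ord_num,
    c.cst_key,
    p.prd_nm,
    s.sls_sales
FROM bronze.crm_sales_details AS s
LEFT JOIN bronze.crm_cust_info AS c
    ON s.sls_cust_id = c.cst_id      -- customers are linked by the ID
LEFT JOIN bronze.crm_prd_info AS p
    ON s.sls_prd_key = p.prd_key;    -- products are linked by the key
```

Because the product table keeps several historical rows per product key, a join on the key alone can return more than one row per sale. LEFT JOIN is used here so that sales rows without a match still show up during exploration.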
Okay. So, let's keep moving. Let's go now to the other source system. We have
the ERP, and the first one is ERP cust, and we have this cryptic name. Let's
select the data. Here it's a small table, and we have only three pieces of
information. We have something called CID, then we have something that I think
is the birthday, and the gender information, so we have here male, female, and
so on. So it looks again like customer information, but here we have extra data
about the birthday. And now, if you compare it to the customer table that we
have from the other source system (let's go and query it), you can see the new
table from the ERP doesn't have IDs. It actually has the customer number, or the
key. So we can join those two tables using the customer key.
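As a hedged illustration of that join, the query below matches the ERP table's CID against the CRM customer key. The names bronze.erp_cust, cid, bdate and gen are placeholders, since the transcript only gives the spoken names, and the raw values on both sides may need cleaning before every row lines up.

```sql
-- Placeholder names: bronze.erp_cust (cid, bdate, gen) and
-- bronze.crm_cust_info (cst_id, cst_key).
SELECT TOP (1000)
    c.cst_id,
    c.cst_key,
    e.bdate,   -- birth date, only available in the ERP source
    e.gen      -- gender as recorded in the ERP source
FROM bronze.crm_cust_info AS c
LEFT JOIN bronze.erp_cust AS e
    ON e.cid = c.cst_key;   -- join on the customer key, not the ID
```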
Let's document this information. I will just copy, paste, and put it here on the
right side. I will change the color, since we are now talking about a different
source system. And here the table name is going to be this one, and the key is
called CID. Now, in order to join this table with the customer info, we cannot
join it with the customer ID. We need the customer key. That's why we have to
add a new row here. So, Ctrl+Enter, and we're going to say customer key. Then we
have to make a nice arrow between those two keys. And we're going to give it a
description: customer information. And here we have the birth date.
Okay. So now let's keep going. We're going to the next one. We have the ERP
location. Let's query this table. So, what do we have over here? We have the CID
again. And as you can see, we have country information. And this is of course
again the customer number, and we have only this one piece of information, the
country. So, let's document this. This is the customer location. The table name
is going to be like this. And we still have the same ID, so we still have the
customer ID (CID) here, and we can join it using the customer key. And we have
to give it the description, location of customers, and we can say here: the
country.
Okay. So now let's go to the last table and explore it. We have the ERP PX
catalog. So let's query that information. What do we have here? We have an ID, a
category, a subcategory, and the maintenance, where we have either yes or no.
So, looking at this table, we have all the categories and subcategories of the
products, and here we have a special identifier for them. Now the question is
how to join it. I would actually like to join it with the product information,
so let's check those two tables together. Okay. In the product table we don't
have any ID for the categories, but we actually have this information in the
product key. The first five characters of the product key are actually the
category ID. So we can use this information over here in order to join it with
the categories.
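Taking the first five characters of the product key can be sketched with LEFT in T-SQL. The names below (bronze.erp_px_cat with id, cat, subcat, and bronze.crm_prd_info with prd_key) are assumptions for illustration, and the two sides may use slightly different formatting, so the raw join may not match until the values are standardized.

```sql
-- Assumed names: bronze.crm_prd_info (prd_key) and
-- bronze.erp_px_cat (id, cat, subcat).
SELECT
    p.prd_key,
    LEFT(p.prd_key, 5) AS cat_id_from_key,  -- first five characters of the key
    cat.cat,
    cat.subcat
FROM bronze.crm_prd_info AS p
LEFT JOIN bronze.erp_px_cat AS cat
    ON cat.id = LEFT(p.prd_key, 5);
```

Rows where cat and subcat come back as NULL are the ones whose derived category ID has no match in the category table.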
So we can describe this information like this, and then we have to give it a
name. And then here we have the ID, and the ID can be joined using the product
key. So that means for the product information we don't need the product ID, the
primary key, at all. All we need is the product key, or the product number. And
what I would like to do is group this information in a box. So, let's grab any
box here on the left side, make it bigger, and then make the edges a little bit
smaller. Let's remove the fill, and for the line I will make it dotted. Then
let's grab another box over here and say this is the CRM. And we can increase
the size, maybe something like 40, or smaller, 35, bold, change the color to
blue, and just place it here on top of this box. With that we can understand
that all those tables belong to the source system CRM, and we can do the same
for the right side as well. Now, of course, we have to add the description here.
It's going to be the product categories.
All right. So with that we now have a clear understanding of how the tables are
connected to each other. We now understand the content of each table, and of
course that can help us clean up the data in the silver layer in order to
prepare it. So as you can see, it is very important to take time to understand
the structure of the tables and the relationships between them before you start
writing any code. All right. So with that we now have a clear understanding of
the sources, and we have as well created a data integration diagram in Draw.io.
So with that we have a better understanding of how to connect the sources. And
now, in the next two tasks, we will go back to SQL, where we're going to start
checking the quality and as well doing a lot of data transformations. So let's
go.
Okay, so now let's have a quick look at the specifications of the silver layer.
The main objective is to have clean and standardized data. We have to prepare
the data before going to the gold layer. And we will be building tables inside
the silver layer. And the way of loading the data from the bronze to the silver
is a full load. So that means we're going to truncate and then insert.
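A full load in this sense can be sketched as two statements per table: empty the target, then reload everything from the source. The table and column names below are assumed for illustration, and the plain SELECT stands in for the cleansing logic that would be written later.

```sql
-- Full load: empty the silver table, then reload all rows from bronze.
-- Assumed names: silver.crm_cust_info / bronze.crm_cust_info and columns.
TRUNCATE TABLE silver.crm_cust_info;

INSERT INTO silver.crm_cust_info (cst_id, cst_key, cst_firstname, cst_lastname)
SELECT cst_id, cst_key, cst_firstname, cst_lastname
FROM bronze.crm_cust_info;  -- transformations would go in this SELECT
```

Truncating first means the load can be rerun at any time without producing duplicate rows in the target.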
And here we're going to have a lot of data transformations. We're going to clean
the data. We're going to apply normalization and standardization. We're going to
derive new columns. We will be doing data enrichment as well. So there are a lot
of things to be done in the data transformation, but we will not be building any
new data model. Those are the specifications, and we have to commit ourselves to
this scope. Okay. So now, building the DDL script for the silver layer is going
to be way easier than for the bronze, because the definition and the structure
of each table in the silver layer is going to be identical to the bronze layer.
We are not doing anything new. So all you have to do is take the DDL script from
the bronze layer and just search and replace the schema. I'm just using
Notepad++ for the scripts. So I'm going to go over here and say: replace
"bronze." with "silver.", and I'm going to replace all. With that, all the DDL
is now targeting the silver schema, which is exactly what we need.
All right. Now, before we execute our new DDL script for the silver layer, we
have to talk about something called metadata columns. They are additional
columns or fields that data engineers add to each table and that don't come
directly from the source systems. The data engineers use them in order to
provide extra information for each record. For example, we can add a column
called create date, for when the record was loaded, or an update date, for when
the record got updated. Or we can add the source system in order to understand
the origin of the data that we have. Or sometimes we can add the file location
in order to understand the lineage, that is, which file the data came from.
Those are a great tool if you have a data issue in your data warehouse, if there
is corrupt data and so on. They can help you track exactly where the issue
happened and when. And as well they are great for understanding whether you have
a gap in your data, especially if you are doing incremental loads. It is like
putting labels on everything, and you will thank yourself later when you start
using them in hard times, when you have an issue in your data warehouse.
So now, back to our DDL script. All you have to do is the following. For
example, for the first table, I will add one more extra column at the end. It
starts with the prefix DWH, as we defined in the naming convention, then an
underscore, and let's have the create date. The data type is going to be
DATETIME2. And now what we can do is add a default value for it. I want the
database to generate this information automatically, so we don't have to specify
it in any script. So which value? It's going to be GETDATE(). So each record
that is inserted into this table will automatically get a value from the current
date and time. As you can see, the naming convention is very important: all
those columns come from the source system, and only this one column comes from
the data engineer of the data warehouse.
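Put together, the DDL for one silver table might look roughly like this. The source columns and their data types are assumptions based on the fields listed earlier for the customer table; only the last column, with its DATETIME2 type and GETDATE() default, reflects what is described in this step.

```sql
-- Sketch of a silver table with one metadata column at the end.
-- Source column names and types are assumed, not taken from the real DDL.
CREATE TABLE silver.crm_cust_info (
    cst_id             INT,
    cst_key            NVARCHAR(50),
    cst_firstname      NVARCHAR(50),
    cst_lastname       NVARCHAR(50),
    cst_marital_status NVARCHAR(50),
    cst_gndr           NVARCHAR(50),
    cst_create_date    DATE,
    -- Metadata column: filled in by the database when a row is inserted
    dwh_create_date    DATETIME2 DEFAULT GETDATE()
);
```

For the default to apply, the INSERT statement simply leaves dwh_create_date out of its column list.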
Okay. So that's it. Let's repeat the same thing for all the other tables. I will
just add this piece of information to each DDL. All right. So I think that's it.
All you have to do now is execute the whole DDL script for the silver layer.
Let's do that. All right, perfect. There are no errors. Let's refresh the tables
in the Object Explorer. And with that, as you can see, we have six tables for
the silver layer. It is identical to the bronze layer, but we have one extra
column for the
metadata.
All right. So now, in the silver layer, before we start writing any data
transformations and cleansing, we first have to detect the quality issues in the
bronze layer. Without knowing the issues, we cannot find a solution, right? We
will first explore the quality issues; only then do we start writing the
transformation scripts. So let's
go.
Okay. So now what we're going to do is go through all the tables of the bronze
layer, clean up the data, and then insert it into the silver layer. So let's
start with the first table, the first bronze table from the source CRM. We're
going to go to the bronze CRM customer info. Let's query the data over here.
Now, of course, before writing any data transformations, we have to detect and
identify the quality issues of this table. Usually I start with the first check,
where we check the primary key. We have to check whether there are nulls inside
the primary key and whether there are duplicates.
duplicates. So now in order to detect the duplicates in the primary key what
the duplicates in the primary key what we have to do is to go and aggregate the
we have to do is to go and aggregate the primary key. If we find any value in the
primary key. If we find any value in the primary key that exist more than once
primary key that exist more than once that means it is not unique and we have
that means it is not unique and we have duplicates in the table. So let's go and
duplicates in the table. So let's go and write query for that. So what we're
write query for that. So what we're going to do, we're going to go with the
going to do, we're going to go with the customer ID and then we're going to go
customer ID and then we're going to go and count and then we have to group up
and count and then we have to group up the data. So group by based on the
the data. So group by based on the primary key and of course we don't need
primary key and of course we don't need all the results. We need only where we
all the results. We need only where we have an issue. So we're going to say
have an issue. So we're going to say having
having count higher than one. So we are
count higher than one. So we are interested in the values where the count
interested in the values where the count is higher than one. So let's go and
is higher than one. So let's go and execute it. Now as you can see we have
execute it. Now as you can see we have issue in this table. we have duplicates
issue in this table. we have duplicates because all those ids exist more than
because all those ids exist more than one in the table which is completely
one in the table which is completely wrong. We should have the primary key
wrong. We should have the primary key unique and you can see as well we have
unique and you can see as well we have three records where the primary key is
three records where the primary key is empty which is as well a bad thing. Now
empty which is as well a bad thing. Now there is an issue here. If we have only
there is an issue here. If we have only one null it will not be here at the
one null it will not be here at the result. So what I'm going to do I'm
result. So what I'm going to do I'm going to go over here and say or the
going to go over here and say or the primary key is null just in case if we
primary key is null just in case if we have only one null I'm still interested
have only one null I'm still interested to see the results. So if I go and run
to see the results. So if I go and run it again, we'll get the same results. So
it again, we'll get the same results. So this is equality check that you can do
this is equality check that you can do on the table. And as you can see, it is
on the table. And as you can see, it is not meeting the expectation. So that
not meeting the expectation. So that means we have to do something about it.
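The primary key check described above can be sketched as follows. This is a minimal sketch that runs the query against an in-memory SQLite database through Python's `sqlite3` module rather than the SQL Server instance used in the video. The table name `bronze_crm_cust_info`, the column names `cst_id` and `cst_firstname`, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the bronze CRM customer info table.
conn.execute("CREATE TABLE bronze_crm_cust_info (cst_id INTEGER, cst_firstname TEXT)")
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (2, "Bob"), (3, "Cy"), (None, "Dee")],
)

# Expectation: no rows. Every row returned is a duplicated or NULL key.
pk_issues = conn.execute(
    """
    SELECT cst_id, COUNT(*) AS cnt
    FROM bronze_crm_cust_info
    GROUP BY cst_id
    HAVING COUNT(*) > 1 OR cst_id IS NULL
    ORDER BY cst_id
    """
).fetchall()
print(pk_issues)
```

With this sample data the query reports the duplicated ID 2 and the single NULL key. The NULL row shows up only because of the `OR cst_id IS NULL` condition, since its count is one.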
So let's go and create a new query. So here what we're going to do: we can start writing the query
that is doing the data transformation and the data cleansing. So let's start again by selecting the
data and execute it again. So now what I usually do is I go and focus on the issue.
So for example, let's go and take one of those values, and I focus on it before starting to write
the transformation. So we're going to say WHERE customer ID equal to this value. All right. So now,
as you can see, we have here the issue where the ID exists three times, but actually we are
interested in only one of them. So the question is how to pick one of those. Usually we search for a
timestamp or date value to help us. So if you check the creation date over here, we can understand
that this record, this one over here, is the newest one, and the previous two are older than it. So
that means if I have to go and pick one of those values, I would like to get the latest one, because
it holds the freshest information. So what we have to do is we have to go and rank all those values
based on the create date and only pick the highest one. So that means we need a ranking function,
and for that in SQL we have the amazing window functions. So let's go and do that. We will use the
function ROW_NUMBER, OVER, and then PARTITION BY, and here we have to divide the table by the
customer ID. So we're going to divide it by the customer ID, and in order now to rank those rows we
have to sort the data by something. So ORDER BY, and as we discussed, we want to sort the data by
the creation date. So create date, and we're going to sort it descending. So the highest first, then
the lowest. So let's go and do that. And now we're going to go and give it a name: flag last. So now
let's go and execute it. Now the data is sorted by the creation date. And you can see over here that
this record is number one. Then the one that is older is two, and the oldest one is three. Of course
we are interested in the rank number one. Now let's go and remove the filter and check everything.
So now if you have a look at the table, you can see that in the flag we have one everywhere, and
that's because those primary keys exist only once. But sometimes we will not have one; we'll have
two, three, and so on, if there are duplicates. We can go of course and do a double check. So let's
go over here and say SELECT star from this query, and we can say WHERE flag last is not equal to
one. So let's go and query it. And now we can see all the data that we don't need, because they are
causing duplicates in the primary key and they have like an old status. So what we're going to do:
we're going to say equal to one. And with that we guarantee that our primary key is unique and each
value exists only once. So if I go and query it like this, you will see we will not find any
duplicates inside our table. And we can go and check that, of course. So let's go and check this
primary key. And we're going to say AND customer ID equal to this value. And you can see it exists
now only once, and we are getting the freshest data for this primary key. So with that we have
defined a transformation in order to remove any duplicates.
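The deduplication step described above (rank the rows per customer ID by creation date, newest first, and keep rank one) can be sketched like this. It is a hedged illustration on SQLite via Python's `sqlite3`, which needs SQLite 3.25 or newer for window functions; the names `bronze_crm_cust_info`, `cst_id`, `cst_firstname`, `cst_create_date` and the sample rows are assumptions, while `flag_last` follows the name used in the narration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bronze_crm_cust_info "
    "(cst_id INTEGER, cst_firstname TEXT, cst_create_date TEXT)"
)
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?, ?)",
    [
        (1, "Ann", "2025-01-05"),
        (2, "Bob", "2025-01-10"),     # oldest version of customer 2
        (2, "Bobby", "2025-01-20"),
        (2, "Robert", "2025-01-27"),  # newest version of customer 2
    ],
)

# Number the rows of each customer from newest to oldest, then keep number 1.
latest_rows = conn.execute(
    """
    SELECT cst_id, cst_firstname, cst_create_date
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY cst_id
                   ORDER BY cst_create_date DESC
               ) AS flag_last
        FROM bronze_crm_cust_info
    )
    WHERE flag_last = 1
    ORDER BY cst_id
    """
).fetchall()
print(latest_rows)
```

Changing the filter to `flag_last != 1` lists the older rows that get dropped, which is the double check mentioned in the narration.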
Okay. So now moving on to the next one. As you can see, in our table we have a lot of values that
are string values. Now for these string values we have to check for unwanted spaces. So now let's go
and write a query that's going to detect those unwanted spaces. So we're going to say SELECT this
column, the first name, from our table bronze customer information. So let's go and query it. Now,
by just looking at the data it's going to be really hard to find those unwanted spaces, especially
if they are at the end of the word. But there is a very easy way to detect those issues. So what
we're going to do: we're going to do a filter. So now we're going to say the first name is not equal
to the first name after trimming the values. So if you use the function TRIM, what is it going to
do? It's going to go and remove all the leading and trailing spaces. So the first name. So if this
value is not equal to the first name after trimming it, then we have an issue. So it is very simple.
Let's go and execute it. So now in the result we will get a list of all first names where we have
spaces either at the start or at the end. So again, the expectation here is no results. And in the
same way we can go and check something else, like for example the last name. So let's go and do that
over here and here. Let's go and execute it. We see in the results we have as well 17 customers
where they have like a space in their last name, which is not really good. And we can go and keep
checking all the string values that we have inside the table. So for example the gender. So let's go
and check that and execute. Now, as you can see, we don't have any results. That means the quality
of the gender is better and we don't have any unwanted spaces. So now we have to go and write a
transformation in order to clean up those two columns.
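The unwanted-spaces check above compares each value with its trimmed form and expects no rows back. A minimal sketch, again on SQLite through Python's `sqlite3`; the table and column names (`bronze_crm_cust_info`, `cst_firstname`, `cst_lastname`, `cst_gndr`), the helper `untrimmed_values`, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bronze_crm_cust_info "
    "(cst_firstname TEXT, cst_lastname TEXT, cst_gndr TEXT)"
)
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?, ?)",
    [("Ann", "Lee", "F"), (" Bob", "Ray ", "M"), ("Cy ", "Poe", "M")],
)


def untrimmed_values(column):
    """Return the values of `column` that differ from their trimmed form."""
    # `column` is interpolated into the SQL text, so it must come from the
    # fixed tuple below and never from user input.
    query = (
        f"SELECT {column} FROM bronze_crm_cust_info "
        f"WHERE {column} != TRIM({column}) ORDER BY rowid"
    )
    return [row[0] for row in conn.execute(query)]


# Expectation per column: an empty list.
space_issues = {
    col: untrimmed_values(col)
    for col in ("cst_firstname", "cst_lastname", "cst_gndr")
}
print(space_issues)
```

SQLite compares these strings byte for byte, so both leading and trailing spaces are caught. Engines that pad strings before comparing them (SQL Server's `=` and `<>` behave this way for trailing spaces) may not flag trailing-only spaces with this exact filter.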
Now what I'm going to do: I'm just going to go and list all the columns in the query instead of the
star. All right. So now I have a list of all the columns that I need. And now what we have to do is
to go to those two columns and start removing the unwanted spaces. So we will just use TRIM. It's
very simple. And give it a name, of course, the same name. And we will trim as well the last name.
So let's go and query this. And with that we have cleaned up those two columns from any unwanted
spaces. Okay. So now moving on, we have those two pieces of information. We have the marital status
and as well the gender. If you check the values inside those two columns, as you can see we have
here low cardinality. So we have a limited number of possible values that are used inside those two
columns. So what we usually do is to go and check the data consistency inside those two columns. So
it's very simple what we're going to do. We're going to do the following. We're going to say
DISTINCT and we're going to check the values. Let's go and do that. And now, as you can see, we have
only three possible values, either null, F, or M, which is okay. We can stay like this of course.
But we can make a rule in our project where we say we will not be working with data abbreviations.
We will go and use only friendly full names. So instead of having an F, we're going to have like a
full word, Female. And instead of M we're going to have like Male, and we make it a rule for the
whole project. So each time we find the gender information, we try to give the full name of it. So
let's go and map those two values to a friendly one. So we're going to go to the gender over here
and say CASE WHEN, and we're going to say the gender is equal to F, then make it Female. And when it
is equal to M, then map it to Male. And now we have to make a decision about the nulls. As you can
see over here, we have nulls. So do we want to leave it as a null, or do we want to always use the
value unknown? So with that we are replacing the missing values with a standard default value, or
you can leave it as null. But let's say in our project that we are replacing all the missing values
with a default value. So let's go and do that. We're going to say ELSE; I'm going to go with the NA,
not available, or you can go with the unknown, of course. So that's for the gender information, like
this. And we can go and remove the old one. And now there is one thing that I usually do in this
case. Currently we are getting the capital F and the capital M, but maybe in time something changes
and you will get like lowercase m and lowercase f. So we want to make sure that in those cases we
are still able to map those values to the correct value.
What we're going to do: we're going to just use the function UPPER, just to make sure that if you
get any lowercase values we are able to catch them. So the same thing over here as well. And now one
more thing that you can add as well. Of course, if you are not trusting the data, because we saw
some unwanted spaces in the first name and the last name, you might not trust it in the future
either: you may get unwanted spaces here as well. You can go and make sure to trim everything, just
to make sure that you are catching all those cases. So that's it for now. Let's go and execute. Now,
as you can see, we don't have an M and an F. We have a full word, Male and Female. And if we don't
have a value, we don't have a null; we are getting here not available. Now we can go and do the same
stuff for the marital status. You can see as well we have only three possibilities: the S, null, and
an M. We can go and do the same stuff. So I will just go and copy everything from here. And I will
go and use the marital status and just remove this one from here. And now what are the possible
values? We have the S, so it's going to be Single. We have an M for Married. And we have as well a
null, and with that we are getting the not available. So with that we are doing data standardization
for this column as well. So let's go and execute it.
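The standardization of the two low-cardinality columns can be sketched as below: first a DISTINCT check of the raw codes, then a CASE WHEN mapping wrapped in UPPER and TRIM so that lowercase or padded codes still match. This is an illustration on SQLite via Python's `sqlite3`; the names `bronze_crm_cust_info`, `cst_marital_status`, `cst_gndr`, the exact spelling `'n/a'` for "not available", and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bronze_crm_cust_info "
    "(cst_id INTEGER, cst_marital_status TEXT, cst_gndr TEXT)"
)
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?, ?)",
    [(1, "S", "F"), (2, "M", "M"), (3, "m ", " f"), (4, None, None)],
)

# Consistency check: which raw codes occur in the gender column?
distinct_genders = {
    row[0] for row in conn.execute("SELECT DISTINCT cst_gndr FROM bronze_crm_cust_info")
}

# Map codes to friendly names; anything else, including NULL, becomes 'n/a'.
standardized = conn.execute(
    """
    SELECT cst_id,
           CASE WHEN UPPER(TRIM(cst_marital_status)) = 'S' THEN 'Single'
                WHEN UPPER(TRIM(cst_marital_status)) = 'M' THEN 'Married'
                ELSE 'n/a'
           END AS cst_marital_status,
           CASE WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
                WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
                ELSE 'n/a'
           END AS cst_gndr
    FROM bronze_crm_cust_info
    ORDER BY cst_id
    """
).fetchall()
print(distinct_genders)
print(standardized)
```

A NULL code never equals `'S'`, `'M'`, or `'F'`, so it falls through to the ELSE branch, which is how the missing values pick up the default.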
Now, as you can see, we don't have those short values. We have a full friendly value for the status
and as well for the gender. And at the same time we are handling the nulls inside those two columns.
So with that we are done with those two columns. And now we can go to the last one, the create date.
For this type of information, we make sure that this column is a real date and not a string or
varchar. And as we defined it in the data type, it is a date, which is completely correct. So
nothing to do with this column. And now the next step is that we're going to go and write the insert
statement. So how are we going to do it? We're going to go to the start over here and say INSERT
INTO silver CRM customer info. Now we have to go and specify all the columns that should be
inserted. So we're going to go and type it. So something like this. And then we have the query over
here. Let's go and execute it. So let's do that. So with that we have inserted clean data inside the
silver table. So now what we're going to do: we're going to go and take all the queries that we have
used in order to check the quality of the bronze, and let's go and take them to another query, and
instead of having bronze we're going to say silver. So this is about the primary key. Let's go and
execute it. Perfect. We don't have any results, so we don't have any duplicates. The same thing for
the next one. So the silver, and it was for the first name. So let's go and check the first name and
run it. As you can see, there are no results. It is perfect. We don't have any issues. You can of
course go and check the last name and run it again. We don't have any results over here. And now we
can go and check those low-cardinality columns, like for example the gender. Let's go and execute
it. So as you can see, we have the not available, or the unknown, Male, and Female. So perfect. And
you can go and have a final look at the table, the silver customer info. Let's go and check that. So
now we can have a look at all those columns. As you can see, everything looks perfect, and you can
see it is working, this metadata information that we have added to the table definition.
Now it says when we inserted all those records into the table, which is really amazing information
to have for tracking and auditing. Okay. So now, by looking at this script, we have done different
types of data transformations. The first one is with the first name and the last name. Here we have
done trimming, removing unwanted spaces. This is one of the types of data cleansing. So we remove
unnecessary spaces or unwanted characters to ensure data consistency.
Now moving on to the next transformation. We have this CASE WHEN. So what we have done here is data
normalization, or we sometimes call it data standardization. So this transformation is a type of
data cleansing where we map coded values to meaningful, user-friendly descriptions. And we have done
the same transformation as well to the gender. Another type of transformation that we have done as
well in the same CASE WHEN is that we have handled the missing values. So instead of nulls, we're
going to have not available. So handling missing data is as well a type of data cleansing, where we
are filling the blanks by adding, for example, a default value. So instead of having an empty string
or a null, we're going to have a default value like not available or unknown. Another type of data
transformation that we have done in this script is that we have removed the duplicates. So removing
duplicates is as well a type of data cleansing, where we ensure only one record for each primary key
by identifying and retaining only the most relevant row, to ensure there are no duplicates inside
our data. And as we are removing the duplicates, of course we are doing data filtering. So those are
the different types of data transformations that we have done in this script.
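To tie the load step together, here is a small sketch of inserting cleaned bronze rows into a silver table that carries an audit column, then re-running the primary key check on silver. It uses SQLite via Python's `sqlite3` (3.25+ for window functions). The table names, the column names, and in particular the audit column name `dwh_create_date` are assumptions, and only two of the transformations are included to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bronze_crm_cust_info (
        cst_id INTEGER, cst_firstname TEXT, cst_create_date TEXT
    );
    CREATE TABLE silver_crm_cust_info (
        cst_id INTEGER, cst_firstname TEXT, cst_create_date TEXT,
        dwh_create_date TEXT DEFAULT CURRENT_TIMESTAMP  -- filled in on insert
    );
    INSERT INTO bronze_crm_cust_info VALUES
        (1, ' Ann', '2025-01-05'),
        (1, 'Ann ', '2025-01-09'),
        (2, 'Bob',  '2025-01-10');
    """
)

# List the target columns explicitly; the audit column is left to its default.
conn.execute(
    """
    INSERT INTO silver_crm_cust_info (cst_id, cst_firstname, cst_create_date)
    SELECT cst_id, TRIM(cst_firstname), cst_create_date
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY cst_id ORDER BY cst_create_date DESC
               ) AS flag_last
        FROM bronze_crm_cust_info
    )
    WHERE flag_last = 1
    """
)

# Same quality check as on bronze, now pointed at silver. Expectation: no rows.
silver_pk_issues = conn.execute(
    """
    SELECT cst_id, COUNT(*) FROM silver_crm_cust_info
    GROUP BY cst_id
    HAVING COUNT(*) > 1 OR cst_id IS NULL
    """
).fetchall()

# The last value is 1 when the audit timestamp was populated.
silver_rows = conn.execute(
    "SELECT cst_id, cst_firstname, dwh_create_date IS NOT NULL "
    "FROM silver_crm_cust_info ORDER BY cst_id"
).fetchall()
print(silver_pk_issues, silver_rows)
```

Leaving the audit column out of the insert's column list is what lets the table default record the load time.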
All right, moving on to the second table in the bronze layer from the CRM. We have the product info.
And of course, as usual, before we start writing any transformations, we have to search for data
quality issues. And we start with the first one: we have to check the primary key. So we have to
check whether we have duplicates or nulls inside this key. So what we have to do: we have to group
up the data by the primary key or check whether we have nulls. So let's go and execute it. So as you
can see, everything is safe. We don't have duplicates or nulls in the primary key. Now moving on to
the next one, we have the product key. Here we have in this column a lot of information. So now what
we have to do is to go and split this string into two pieces of information. So we are deriving two
new columns. So now let's start with the first one, the category ID. The first five characters are
actually the category ID, and we can go and use the SUBSTRING function in order to extract part of a
string. It needs three arguments. The first one is going to be the column that we want to extract
from. And then we have to define the position where to start extracting. And since the first part is
on the left side, we're going to start from the first position. And then we have to specify the
length. So how many characters do we want to extract? We need five characters. So 1 2 3 4 5. So
that's it for the category ID. We name it category ID. Let's go and execute it.
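The three-argument extraction described above (source column, start position, length) can be sketched like this. SQLite's function is spelled `SUBSTR`, which takes the same three arguments as the SUBSTRING function mentioned in the narration; the names `bronze_crm_prd_info`, `prd_id`, `prd_key`, `cat_id` and the sample keys are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_crm_prd_info (prd_id INTEGER, prd_key TEXT)")
conn.executemany(
    "INSERT INTO bronze_crm_prd_info VALUES (?, ?)",
    [(1, "CO-RF-FR-R92B-58"), (2, "AC-HE-HL-U509-R")],  # made-up product keys
)

# Start at position 1 (positions are 1-based) and take 5 characters.
category_ids = conn.execute(
    """
    SELECT prd_key, SUBSTR(prd_key, 1, 5) AS cat_id
    FROM bronze_crm_prd_info
    ORDER BY prd_id
    """
).fetchall()
print(category_ids)
```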
Now, as you can see, we have a new column called the category ID, and it contains the first part of
the string. And in our database, from the other source system, we have as well the category ID. Now
we can go and double check, just in order to make sure that we can join the data together. So we're
going to go and check the ID from the bronze ERP table, the one for the category. So in this table
we have the category IDs, and you can see over here those are the IDs of the category, and in the
gold layer we have to go and join those two tables. But here we still have an issue. We have here an
underscore between the category and the subcategory. But in our table we have actually a minus. So
we have to replace that with an underscore in order to have matching information between those two
tables. Otherwise we will not be able to join the tables. So we're going to use the function
REPLACE. And what are we replacing? We are replacing the minus with an underscore, something like
this. And if you go now and execute it, we will get an underscore exactly like the other table.
table. And of course we can go and check whether everything is matching by having
whether everything is matching by having very simple query where we say this new
very simple query where we say this new information not in. And then we have
information not in. And then we have this nice subquery. So we are trying to
this nice subquery. So we are trying to find any category ID that is not
find any category ID that is not available in the second table. So let's
available in the second table. So let's go and execute it. Now as you can see we
go and execute it. Now as you can see we have only one category that is not
have only one category that is not matching. We are not finding it in this
matching. We are not finding it in this table which is maybe correct. So if you
table which is maybe correct. So if you go over here you will not find this
go over here you will not find this category. I just make it a little bit
category. I just make it a little bit bigger. So we are not finding this one
bigger. So we are not finding this one category from this table which is fine.
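
A minimal sketch of the REPLACE step and the NOT IN check, assuming made-up sample rows: `p` stands in for the product table and `c` for the ERP category table, and the column names `prd_key`, `cat_id`, and `id` are assumptions rather than names confirmed here.

```sql
SELECT
    p.prd_key,
    -- Swap the minus for an underscore so the value matches the category table.
    REPLACE(SUBSTRING(p.prd_key, 1, 5), '-', '_') AS cat_id
FROM (VALUES ('CO-RF-FR-R92B-58'),
             ('AC-HE-HL-U509'),
             ('CO-PE-PD-M282')) AS p (prd_key)
-- Keep only category IDs that have no match on the other side.
WHERE REPLACE(SUBSTRING(p.prd_key, 1, 5), '-', '_') NOT IN (
    SELECT c.id
    FROM (VALUES ('CO_RF'), ('AC_HE')) AS c (id)
);
```

With these sample rows only the `CO_PE` key comes back, because it is the one category missing from the second list. One caveat worth keeping in mind: if the subquery returns a NULL, NOT IN yields no rows at all, so a NOT EXISTS check is the safer choice when the lookup column can contain NULLs.
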
So our check is okay, and with that we have the first part. Now we have to
extract the second part, and we're going to do the same thing: SUBSTRING with
three arguments, starting with the product key. But this time we will not
start cutting from the first position; we have to start in the middle. So 1,
2, 3, 4, 5, 6, 7: we start from position seven. Now we have to define the
length, how many characters to extract. But if you look over here, you can see
that the product keys have different lengths. It is not fixed like the
category ID, so we cannot specify a number here. We have to make it dynamic,
and there is a trick for that: we use the length of the whole column. With
that we make sure that we always ask for enough characters and do not lose any
information. So the length is dynamic rather than fixed, and with that we have
the product key. Let's execute it. As you can see, we are now extracting the
second part of the string. Now why do we need the product key? We need it in
order to join with another table called sales details. Let me just check the
column name there: it is the SLS product key, from the bronze CRM sales
details table. Let's check the data over here. It looks wonderful, so it seems
we can join this information together. But of course we're going to check
that. We say WHERE, take our new column, and say NOT IN the subquery, just to
make sure that we are not missing anything.
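
The dynamic-length trick and the follow-up check could be sketched like this. The sample rows are invented, and `raw_key`, `prd_key`, and `sls_prd_key` are assumed names; `s` stands in for the sales details table.

```sql
SELECT
    p.prd_key                               AS raw_key,
    -- Start at position 7. LEN(prd_key) is always at least as long as what
    -- remains, so SUBSTRING returns everything up to the end of the string.
    SUBSTRING(p.prd_key, 7, LEN(p.prd_key)) AS prd_key
FROM (VALUES ('CO-RF-FR-R92B-58'),
             ('AC-HE-HL-U509')) AS p (prd_key)
-- Products whose extracted key never appears in the sales rows.
WHERE SUBSTRING(p.prd_key, 7, LEN(p.prd_key)) NOT IN (
    SELECT s.sls_prd_key
    FROM (VALUES ('FR-R92B-58')) AS s (sls_prd_key)
);
```

Here only the second sample row is returned, with `HL-U509` as the extracted key. Asking SUBSTRING for more characters than remain is safe in SQL Server: it stops at the end of the string.
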
So let's execute. It looks like we have a lot of products that don't have any
orders. Well, I don't have a good feeling about that. Let's try something like
this: we say WHERE the SLS product key LIKE this value over here. I'll just
cut off the last three characters to search inside this table. So we really
don't have such keys. Let me cut the second one and search for it. We don't
have it either. So for anything that starts with FK, we don't have any order
for a product whose key starts with FK. Let's remove that filter. But we are
still able to join the tables, right? If I say IN instead of NOT IN, you can
see that we are able to match all those products. So that means everything is
fine; these are just products that don't have any orders. With that I'm happy
with this transformation. Now moving on to the next one, we have the name of
the product. We can check whether there are unwanted spaces. So let's go to
our quality checks, make sure to use the same table, take the product name,
and check whether any value no longer matches after trimming. Let's do it.
Well, it looks really fine, so we don't have to trim anything. This column is
safe.
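
A sketch of the unwanted-spaces check, assuming a column called `prd_nm` and invented sample names. The second condition, using DATALENGTH, is an addition that is not part of the check described above: SQL Server ignores trailing spaces when it compares strings, so the plain comparison alone only flags leading spaces.

```sql
-- TRIM requires SQL Server 2017 or later.
SELECT p.prd_nm
FROM (VALUES ('HL Road Frame'),
             (' Sport-100 Helmet'),
             ('Road Tire Tube ')) AS p (prd_nm)
WHERE p.prd_nm != TRIM(p.prd_nm)                           -- flags leading spaces
   OR DATALENGTH(p.prd_nm) != DATALENGTH(TRIM(p.prd_nm));  -- also flags trailing-only spaces
```

An empty result means the column is clean; with these sample rows the two padded names are returned.
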
Now moving on to the next one, we have the costs. Here we have numbers, and we
have to check the quality of the numbers. What can we do? We can check whether
we have NULLs or negative numbers. Negative costs or negative prices are not
realistic, depending on the business of course. Let's say in our business we
don't have any negative costs. So it's going to be like this: we check whether
the cost is less than zero or whether the cost is NULL. Let's check that.
Well, as you can see, we don't have any negative values, but we do have NULLs.
We can handle that by replacing the NULL with a zero, if the business allows
that of course. In SQL Server, in order to replace a NULL with a zero, we have
a very nice function called ISNULL. We are saying: if it is NULL, then replace
this value with a zero. It is as simple as this, and we give it a name of
course. So let's go and execute it.
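
The cost check and the ISNULL fix might look like this sketch. The column name `prd_cost` and the three sample values are assumptions for illustration.

```sql
-- Quality check: any row returned here is a negative or missing cost.
SELECT p.prd_cost
FROM (VALUES (12), (0), (NULL)) AS p (prd_cost)
WHERE p.prd_cost < 0 OR p.prd_cost IS NULL;

-- Fix: ISNULL(value, replacement) returns the replacement when value is NULL.
SELECT ISNULL(p.prd_cost, 0) AS prd_cost
FROM (VALUES (12), (0), (NULL)) AS p (prd_cost);
```

Keep in mind that aggregate functions such as AVG skip NULLs, so substituting 0 changes their result. Whether a missing cost should count as zero is a business decision.
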
And as you can see, we don't have any more NULLs. We have zeros, which is
better for calculations if you later use aggregate functions like the average.
Now moving on to the next one, we have the product line. This is again an
abbreviation of something, and the cardinality is low. So let's check all
possible values inside this column. We just use DISTINCT on PRD line and
execute it. As you can see, the possible values are NULL, M, R, S, and T.
Again, those are abbreviations, but in our data warehouse we have decided to
use full, friendly names. So we have to replace those codes with friendly
values. And of course, in order to get that information, I usually ask an
expert from the source system or an expert from the process. So let's start
building our CASE WHEN, and let's use UPPER and TRIM as well, just to make
sure that we cover all the cases. So when the PRD line is equal to, let's
start with the first value, M, then we get the friendly value Mountain. Then
to the next one, I will just copy and paste here: if it is an R, then it is
Road. And another one, let me check what we have here: M, R, and then S. The S
stands for Other Sales. And we have the T, which stands for Touring. At the
end we have an ELSE for unknown, not available, so we don't get any NULLs. So
that's it, and we're going to name it as before: product line.
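
A sketch of the CASE WHEN mapping just described. The column name `prd_line`, the sample codes, and the literal `'n/a'` used for the ELSE branch are assumptions; the friendly names are the ones listed above.

```sql
SELECT
    p.prd_line AS raw_value,
    CASE
        -- UPPER and TRIM normalise the code so 'r ' is treated like 'R'.
        WHEN UPPER(TRIM(p.prd_line)) = 'M' THEN 'Mountain'
        WHEN UPPER(TRIM(p.prd_line)) = 'R' THEN 'Road'
        WHEN UPPER(TRIM(p.prd_line)) = 'S' THEN 'Other Sales'
        WHEN UPPER(TRIM(p.prd_line)) = 'T' THEN 'Touring'
        ELSE 'n/a'   -- NULLs and unexpected codes land here
    END AS prd_line
FROM (VALUES ('M'), ('r '), ('S'), ('T'), (NULL)) AS p (prd_line);
```
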
So let's remove the old one and execute it. As you can see, we no longer have
those shortcuts and abbreviations; we now have full, friendly values. I will
just make this a capital O here, it looks nicer. So now we have nice, friendly
values. Now, looking at this CASE WHEN, you can see that we are always mapping
one value to another value, and we keep repeating TRIM, UPPER, TRIM, UPPER,
and so on. There is a quick form of CASE WHEN for when it is just a simple
mapping. The syntax is very simple: we say CASE and then the column, so this
is the value we are evaluating, and then we just say WHEN without the equals
sign. So if it is an M, then make it Mountain, and the same thing for the next
ones. With that we have the functions only once, and we don't have to keep
repeating the same function over and over. This works only if you are mapping
values; if you have complex conditions, you cannot do it like this. But for
now I'm going to stay with the quick form of CASE WHEN, because it looks nicer
and shorter. Let's execute it, and we get the same results.
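
The quick form could be sketched as follows, again assuming a column named `prd_line`, invented sample codes, and `'n/a'` as the fallback label.

```sql
SELECT
    p.prd_line AS raw_value,
    -- Simple CASE: the expression is written once and compared to each WHEN value.
    CASE UPPER(TRIM(p.prd_line))
        WHEN 'M' THEN 'Mountain'
        WHEN 'R' THEN 'Road'
        WHEN 'S' THEN 'Other Sales'
        WHEN 'T' THEN 'Touring'
        ELSE 'n/a'   -- a NULL input never equals any WHEN value, so it falls through to here
    END AS prd_line
FROM (VALUES ('M'), ('r '), ('S'), ('T'), (NULL)) AS p (prd_line);
```

This form only covers equality comparisons against one expression. Ranges or conditions on several columns still need the searched form with a full condition after each WHEN.
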
Okay, so now back to our table. Let's go to the last two columns: we have the
start and end date. It's like defining an interval with a start and an end.
So let's check the quality of the start and end dates. We say SELECT star from
our bronze table, and we search like this: we are looking for an end date that
is smaller than the start date, so end date less than start date.
So let's go and query this. So you can see the start is always like after the
see the start is always like after the end which makes no sense at all. So we
end which makes no sense at all. So we have here data issue with those two
have here data issue with those two dates. So now for this kind of data
dates. So now for this kind of data transformations what I usually do is I
transformations what I usually do is I go and grab few examples and put it in
go and grab few examples and put it in Excel and try to think about how I'm
Excel and try to think about how I'm going to go and fix it. So here I took
going to go and fix it. So here I took like two products this one and this one
like two products this one and this one over here. And for that we have like
over here. And for that we have like three rows for each one of them. And we
three rows for each one of them. And we have this situation over here. So the
have this situation over here. So the question now how we going to go and fix
question now how we going to go and fix it? I will go and make like a copy of
it? I will go and make like a copy of one solution where we're going to say
one solution where we're going to say it's very simple. Let's go and switch
it's very simple. Let's go and switch the start date with the end date. So if
the start date with the end date. So if I go and grab the end date and put it at
I go and grab the end date and put it at the start, things going to look way
the start, things going to look way nicer, right? So we have the start is
nicer, right? So we have the start is always younger than the end. But my
always younger than the end. But my friends, the data now makes no sense
friends, the data now makes no sense because we say it start from 2007 and
because we say it start from 2007 and ends by 2011 the price was 12. But
ends by 2011 the price was 12. But between 2008 and 2012, we have 14. which
between 2008 and 2012, we have 14. which is not really good because if you take
is not really good because if you take for example the year 2010 for 2010 it
for example the year 2010 for 2010 it was 12 and at the same time 14. So it is
was 12 and at the same time 14. So it is really bad to have an overlapping
really bad to have an overlapping between those two dates. It should start
between those two dates. It should start from 2007 and end with 11 and then start
from 2007 and end with 11 and then start Feb from 12 and end with something else.
Feb from 12 and end with something else. There should be no overlapping between
There should be no overlapping between years. So it's not enough to say the
years. So it's not enough to say the start should be always smaller than the
start should be always smaller than the ends but as well the end of the first
ends but as well the end of the first history should be younger than the start
history should be younger than the start of the next records. This is as well a
of the next records. This is as well a rule in order to have no overlapping.
This one has no start but already has an end, which is not okay, because we
always have to have a start. Each new record in a historization has to have a
start. So this record over here is wrong as well. And of course it is okay to
have a start without an end. In that scenario it's fine, because it indicates
that this is the current information about the costs. So again, this solution
is not working at all. Now for solution two, we say: let's completely ignore
the end date and take only the start date. Let's paste it over here. But now
we rebuild the end date completely from the start date, following the rules
that we have defined. The rule says that the end date of the current record
comes from the start date of the next record. So here this end date comes from
this value over here, from the next record. That means we take the next start
date and put it as the end date of the previous record. With that, as you can
see, it is working: the end date is higher than the start date, and we are
also making sure this date is not overlapping with the next record. To make it
even nicer, we can subtract one, so we take the previous day like this. With
that we are making sure the end date is smaller than the next start. Now for
the next record, this one over here, the end date is going to come from the
next start date. So we take this one, put it as the end date, and subtract one
to get the previous day. If you compare those two, you can see it's still
higher than the start. And if you compare it with the next record, this one
over here, it is still smaller than the next one.
So there is no overlapping. And for the last record, since we don't have any
information here, it will be a NULL, which is totally fine. So as you can see,
I'm really happy with this scenario. Of course you can validate this with an
expert from the source system, but let's say I have done that and they
approved it, and now I can clean up the data using this new logic. This is how
I usually brainstorm about fixing issues. If I have something complex, I use
Excel and then discuss it with the expert using this example. It's way better
than showing database queries and so on. It just makes things easier to
explain and to discuss. Now, how I usually do it: I focus only on the columns
that I need and take only one or two scenarios while I'm building the logic,
and once everything is ready I integrate it into the query. So now I'm
focusing only on these columns and only on these products. Let's build our
logic. In SQL, if you are at a specific record and you want to access
information from another record, we have two amazing window functions for
that: LEAD and LAG. In this scenario we want to access the next record, which
is why we have to go with the LEAD function. So let's build it. LEAD, and then
what do we need? We need the LEAD of the start date, so we want the start date
of the next record. Then we say OVER, and we have to partition the data. The
window is going to focus on only one product, which is identified by the
product key and not the product ID, so we are dividing the data by product
key. And of course we have to sort the data: ORDER BY, and we are sorting by
the start date, ascending, so from the lowest to the highest. Let's give it
another name, say AS test, just to test the data. Let's execute. I think I
missed something here: it is PARTITION BY. So let's execute again. Now let's
check the results for the first partition over here. The start is 2011 and the
end is 2012, and this information came from the next record.
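
The LEAD expression described here could be sketched like this. The sample rows and the column names `prd_key` and `prd_start_dt` are assumptions; the alias `test` follows the throwaway name used above.

```sql
SELECT
    p.prd_key,
    p.prd_start_dt,
    -- Start date of the next row within the same product, ordered by start date.
    LEAD(p.prd_start_dt) OVER (
        PARTITION BY p.prd_key
        ORDER BY p.prd_start_dt ASC
    ) AS test
FROM (VALUES
        ('AC-HE-HL-U509',    CAST('2011-07-01' AS DATE)),
        ('AC-HE-HL-U509',    CAST('2012-07-01' AS DATE)),
        ('AC-HE-HL-U509',    CAST('2013-07-01' AS DATE)),
        ('CO-RF-FR-R92B-58', CAST('2012-07-01' AS DATE))
     ) AS p (prd_key, prd_start_dt);
```

Within each partition the last row gets NULL, because LEAD has no following row to read and no default value was supplied.
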
So this date is moved to the previous record over here, and the same thing for
this record: the end date comes from the next record. So our logic is working.
The last record over here is NULL because we are at the end of the window and
there is no next row. That's why we get NULL, and this is perfect of course.
So it looks really awesome. What is still missing is that we have to get the
previous day, and we can do that very simply using minus one: we are just
subtracting one day. So we have no overlap between those two dates, and the
same thing for those two dates. As you can see, we have just built a perfect
end date, which is way better than the original data that we got from the
source system. Now let's take this one and put it inside our query. We don't
need the original end date, we need our new end date. Let's remove that test
alias and execute. Now it looks perfect. All right, we are not done yet with
those two dates. We keep calling them dates, because we don't have any
information about the time here; it is always zero. So it makes no sense to
keep the time part inside our data. What we can do is a very simple CAST, and
we make this column a DATE instead of a DATETIME. So this is for the first
one, and the same for the next one, AS DATE. So let's try that out.
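
Putting the pieces together, a sketch of the rebuilt end date with the time part removed. The sample rows and column names are assumptions. It also swaps in DATEADD for the `- 1` shorthand: in SQL Server, subtracting an integer works on DATETIME values but raises a type clash on DATE values, while DATEADD works for both.

```sql
SELECT
    p.prd_key,
    CAST(p.prd_start_dt AS DATE) AS prd_start_dt,
    CAST(
        -- Next start date within the product, moved back by one day.
        DATEADD(DAY, -1,
                LEAD(p.prd_start_dt) OVER (PARTITION BY p.prd_key
                                           ORDER BY p.prd_start_dt))
        AS DATE
    ) AS prd_end_dt
FROM (VALUES
        ('AC-HE-HL-U509', CAST('20110701' AS DATETIME)),
        ('AC-HE-HL-U509', CAST('20120701' AS DATETIME)),
        ('AC-HE-HL-U509', CAST('20130701' AS DATETIME))
     ) AS p (prd_key, prd_start_dt);
```

For these rows the end dates come out as 2012-06-30, 2013-06-30, and NULL, so each interval ends the day before the next one starts and the latest record stays open.
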
And as you can see, it is nicer: we don't have the time information anymore.
Of course, we can tell the source systems about all those issues, but since
they don't provide a time, it makes no sense to have date and time. Okay, so
it was a long run, but we now have clean product information, and this is way
nicer than the original product information that we got from the source CRM.
Now if you grab the DDL of the silver table, you can see that we don't have a
category ID; we only have product ID and product key. And for those two date
columns we just changed the data type: it is DATETIME here, but we have
changed it to DATE. That means we have to make a few modifications to the DDL.
So we go over here and add category ID, and I will be using the same data
type. For the start and the end, this time it is going to be DATE and not
DATETIME. So that's it for now. Let's execute it in order to repair the DDL.
execute it in order to repair the DDL. And this is what happen in the silver
And this is what happen in the silver layer. Sometimes we have to adjust the
layer. Sometimes we have to adjust the metadata if the quality of the data
metadata if the quality of the data types and so on is not good or we are
types and so on is not good or we are building new derived informations in
building new derived informations in order later to integrate the data. So it
order later to integrate the data. So it will be like very close to the bronze
will be like very close to the bronze layer but with few modifications. So
layer but with few modifications. So make sure to update your DTL scripts.
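A minimal sketch of what the adjusted DDL could look like in T-SQL. The table name silver.crm_prd_info, all column names, the data types, and the drop-and-recreate approach are assumptions for illustration; only the new category ID column and the switch from datetime to date come from the transcript.

```sql
-- Assumed names and types throughout; adapt to your own DDL script.
-- Requires an existing schema named silver. Dropping the table deletes its data.
IF OBJECT_ID('silver.crm_prd_info', 'U') IS NOT NULL
    DROP TABLE silver.crm_prd_info;

CREATE TABLE silver.crm_prd_info (
    prd_id          INT,
    cat_id          NVARCHAR(50),                 -- new derived column
    prd_key         NVARCHAR(50),
    prd_nm          NVARCHAR(50),
    prd_cost        INT,
    prd_line        NVARCHAR(50),
    prd_start_dt    DATE,                         -- was DATETIME
    prd_end_dt      DATE,                         -- was DATETIME
    dwh_create_date DATETIME2 DEFAULT GETDATE()   -- assumed load-timestamp column
);
```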
And now the next step is to insert the result of this query, which cleans up
the bronze table, into the silver table. So as we have done before: insert into
silver, the product info, and then we have to list all the columns. I've just
prepared those columns. So with that, we can now run our query in order to
insert the data. As you can see, this did insert the data, and the very important
step now is to check the quality of the silver table. So we go back to our data
quality checks and switch to silver. Let's check the primary key.
There are no issues. We can check, for example, the trims here: there
is as well no issue. Now let's check the costs. It should not be
negative or null, which is perfect. Let's check the data standardization:
as you can see, the values are friendly and we don't have any nulls. And now, very
interesting, the order of the dates. Let's check that. As you can see,
we don't have any issues. And finally, what I do is have a final look at
the silver table. As we can see, everything is inserted correctly in the
correct columns. All those columns come from the source system, and the
last one is automatically generated from the DDL; it indicates when we loaded
this table. Now let's sit back and have a look at our script. What are the
different types of data transformations that we have done here? For example,
over here, with the category ID and the product key, we have derived new columns.
A derived column is when we create a new column based on calculations or
transformations of an existing one. Sometimes we need columns only for
analytics, and we cannot go to the source system each time and ask them to
create it. So instead of that, we derive our own columns that we need for the
analytics. Another transformation we have is the ISNULL over here. So we are
handling missing information here: instead of null we're going to have a zero.
And one more transformation we have over here is for the product line. We have
done data normalization here: instead of having a code value, we have a friendly
value. And as well we have handled the missing data.
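The transformation types named so far (derived columns, replacing nulls, normalizing codes) can be sketched in T-SQL roughly as follows. The column names, the way the key is split, the code-to-label mapping, and the sample rows are all assumptions for illustration, not the course's actual script.

```sql
-- Hypothetical sample rows and assumed column names.
-- TRIM requires SQL Server 2017 or later; use LTRIM(RTRIM(...)) on older versions.
SELECT
    prd_id,
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,    -- derived column (assumed split)
    SUBSTRING(prd_key, 7, LEN(prd_key))         AS prd_key,   -- derived column (assumed split)
    ISNULL(prd_cost, 0)                         AS prd_cost,  -- missing cost becomes 0
    CASE UPPER(TRIM(prd_line))                                -- code value -> friendly value
        WHEN 'M' THEN 'Mountain'                              -- assumed mapping
        WHEN 'R' THEN 'Road'                                  -- assumed mapping
        ELSE 'n/a'                                            -- null or unknown code
    END AS prd_line
FROM (VALUES
    (1, 'CO-RF-FR-R92B-58', NULL, 'R '),
    (2, 'AC-HE-HL-U509',    12,   NULL)
) AS src (prd_id, prd_key, prd_cost, prd_line);
```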
For example, over here, instead of having a null, we're going to have "not
available". All right, moving on to another data transformation. We have
done data type casting. So we are converting the data type from one to
another, and this is considered as well to be a data transformation. And now,
moving on to the last one: we are doing data type casting as well, but what's
more important, we are doing data enrichment.
This type of transformation is all about adding value to your data. So we
are adding new relevant data to our data sets. So those are the different types
of data transformations that we have done for this table.
Okay. So let's keep going. We have the sales details, and this is the
last table in the CRM. So what do we have over here? We have the order number,
and this is a string. Of course we can check whether we have an issue
with unwanted spaces. So we can search whether we're going to find
something. We can say TRIM, something like this, and let's
execute it. We can see that we don't have any unwanted spaces. That means we
don't have to transform this column, so we can leave it as it is. Now the next
two columns are keys and IDs in order to connect this table with the other
tables. As we learned before, we are using the product key in order to
connect it with the product information, and we are connecting the customer ID
with the customer ID from the customer info. So that means we have to
check whether everything is working perfectly. We can check the
integrity of those columns, where we say the product key NOT IN, and then we
make a subquery, and this time we can work with the silver layer, right? So we
can say the product key from silver dot product info. Let's query this,
and as you can see we are not getting any issues. That means all the product
keys from the sales details can be used and connected with the product info. The
same way, we can check the integrity of the customer ID. We
use not the product key but the customer ID, and go to the
customer info, and the name was CST ID.
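A sketch of these two integrity checks in T-SQL. The CTEs stand in for the real bronze sales table and the silver product and customer tables; their names, the column names, and the sample rows are assumptions for illustration.

```sql
-- Stand-in data; in practice these would be the bronze and silver tables.
WITH sales (sls_ord_num, sls_prd_key, sls_cust_id) AS (
    SELECT * FROM (VALUES
        ('SO100', 'BK-1', 11000),
        ('SO101', 'BK-2', 11001),
        ('SO102', 'BK-9', 11002)          -- product key with no match
    ) AS v (sls_ord_num, sls_prd_key, sls_cust_id)
),
products (prd_key) AS (
    SELECT * FROM (VALUES ('BK-1'), ('BK-2')) AS v (prd_key)
),
customers (cst_id) AS (
    SELECT * FROM (VALUES (11000), (11001), (11002)) AS v (cst_id)
)
-- Expectation for clean data: no rows. Here only SO102 is returned.
SELECT *
FROM sales
WHERE sls_prd_key NOT IN (SELECT prd_key FROM products)
   OR sls_cust_id NOT IN (SELECT cst_id  FROM customers);
```

One caveat with this pattern: if the subquery returns a NULL, NOT IN matches nothing, so the check silently comes back empty. NOT EXISTS avoids that when the key column can contain nulls.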
So let's go and query that, and the same thing: we don't have any issues here. So
that means we can connect the sales with the customers using the
customer ID, and we don't have to do any transformations for it. So things look
really nice for those three columns. Now we come to the challenging one. We have
here the dates. Now those dates are not actual dates. They are integers. So those
are numbers, and we don't want to have it like this. We would like to clean that
up: we have to change the data type from integer to date. Now, if you want to
convert an integer to a date, you have to be careful with the values that you
have inside each of those columns. So now let's check the quality, for example,
of the order date. Let's say where the order date is less than zero, for example
something negative. Well, we don't have any negative values, which is good. Let's
go and check whether we have any zeros.
Well, this is bad. We have a lot of zeros here. Now what can we do? We can
replace those values with a null. We can of course use the NULLIF
function like this. We can say NULLIF, and if it is zero then make it null. So
let's execute it. And as you can see, now all those values are null. Now
let's go and check the data again. This integer has the year
information at the start, then the month, and then the day. So if we count the
digits, 1, 2, 3, 4, 5 and so on, the length of each number should be eight. And
if the length is less than eight or higher than eight, then we have an issue.
Let's go and check that. So we're going to say: or the
length of the order date is not equal to eight, that means less or higher. Let's
execute it. Now let's check the results over here. Those two
values don't look like a date, so we cannot make a real date from
them. They are just bad data quality. And of course you
can check the boundaries of a date. For example, it should not be
higher than, let's take this value, 2050, and then anything for the
month and the day. So let's execute it. And if we just remove those
other conditions to make sure: we
don't have any date that is outside of the boundaries that you have in your
business. Or you can say, for example, the boundary should not be less than
some date, depending on when your business started. Maybe
something like this. We are of course getting those values because they are
less than that. But if you have values around these dates, you will get them as
well in the query. So we can add the rest back. So all those checks
validate a column that holds date information and has the data type
integer. So again, what are the issues over here? We have zeros, and sometimes
we have strange numbers that cannot be converted to a date. So let's
fix that in our query. We can say
CASE WHEN the order date is equal to zero, or the length of the order
date is not equal to 8, THEN NULL. Right?
We don't want to deal with those values. They are just wrong and they are
not real dates. Otherwise we say ELSE, and
it's going to be the order date. Now what we're going to do is
convert this to a date. We don't want this as an integer. So how can we
do that? We can cast it first to a varchar, because we cannot cast from
integer to date in SQL Server. First you have to convert it to a varchar, and
then from varchar you go to a date. Well, this is how we do it in SQL Server.
So we cast it first to a varchar and then we cast it to a date, like this.
That's it. So we have END, and we are using the same column
name. So this is how we transform an integer to a date. Let's go and query
this. And as you can see, the order date now is a real date. It is not a number.
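The checks and the conversion described above can be sketched in T-SQL like this. The column name sls_order_dt, the sample values, and the lower boundary are assumptions for illustration. The sketch uses <= 0 rather than = 0 so that negative numbers are also nulled, which is slightly stricter than what is said in the video.

```sql
-- Hypothetical sample values: one valid date, one zero, two malformed numbers.
DECLARE @sales TABLE (sls_order_dt INT);
INSERT INTO @sales (sls_order_dt) VALUES (20101229), (0), (32154), (5489);

-- Detection: rows that cannot be treated as a yyyymmdd date.
SELECT sls_order_dt
FROM @sales
WHERE sls_order_dt <= 0
   OR LEN(sls_order_dt) != 8          -- int is implicitly converted to text
   OR sls_order_dt > 20500101         -- upper boundary mentioned in the video
   OR sls_order_dt < 19000101;        -- assumed lower boundary

-- Transformation: invalid values become NULL, the rest go int -> varchar -> date.
SELECT
    sls_order_dt AS raw_value,
    CASE
        WHEN sls_order_dt <= 0 OR LEN(sls_order_dt) != 8 THEN NULL
        ELSE CAST(CAST(sls_order_dt AS VARCHAR(8)) AS DATE)
    END AS sls_order_dt
FROM @sales;
```

An eight-digit number that is not a real calendar date (for example a month of 13) would still make the CAST fail. On SQL Server 2012 and later, TRY_CONVERT(DATE, ...) returns NULL in that case instead of raising an error.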
So we can get rid of the old column. Now we have to do the
same for the shipping date. So we can go over here and replace
everything with the shipping date, and let's query. Well, as you can
see, the shipping date is perfect. We don't have any issue with this column.
But still, I don't like that we found a lot of issues with the order date. So
just in case this happens for the shipping date in the
future, I will apply the same rules to the shipping date. So let's
take the shipping date, like this. And if you don't want to
apply it now, you always have to build quality checks that run every day
in order to detect those issues. And once you detect them, then you can go and
do the transformations. But for now, I'm going to apply it right away. So that is
for the shipping date. Now we go to the due date, and we will do the same test.
Let's execute it. And as well, it is perfect. Still, I'm going to apply
the same rules. So let's put the due date everywhere here in the query. Just
make sure you don't miss anything here.
So let's go and execute now. Perfect. As you can see, we have the order date,
shipping date, and due date, and all of them are dates and don't have any wrong
data inside those columns. Now, still there is one more check that we can do:
the order date should always be smaller than the shipping date or
the due date, because it makes no sense, right, if you are delivering an item
without an order. First the order should happen, then we ship the
items. So there is an order of those dates, and we can check
that. So we are checking now for invalid date orders, where we can say the order
date is higher than the shipping date, or we are searching as well for an order
where the order date is higher than the due date. So we can have it like this,
with the due date. Let's go and check. Well, that's really good. We don't have
such a mistake in the data, and the quality looks good. The order date is always
smaller than the shipping date or the due date. So we don't have to do any
transformations or cleanup. Okay
friends, now moving on to the last three columns. We have the sales, the
quantity and the price. All this information is connected. We have a
business rule, or calculation. It says the sales must be equal to the quantity
multiplied by the price. And all sales, quantity and price values must be
positive numbers. So they are not allowed to be negative, zero or null. Those are
the business rules, and we have to check the data consistency in our table. Do
all three columns follow our rules? We're going to start first
with our rule, right? So we're going to say: the sales is not equal to the
quantity multiplied by the price. So we are searching where the result is not
matching our expectation. And as well we can check other stuff like the
nulls. So for example we can say: or sales is null, or quantity is
null, and the last one for the price. And as well we can check whether they
are negative numbers or zero. So we can go over here and say less than or equal
to zero, and apply it to the other columns as well. So with that we are checking
the calculation, and as well we are checking whether we have null, zero or
negative numbers. Let's check our data. I'm going to have here a
DISTINCT. So let's query it. And of course we have bad data here. But we
can sort the data by the sales, the quantity and the price. So let's do it.
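A sketch of that consistency check in T-SQL. The column names (sls_sales, sls_quantity, sls_price) and the sample rows are assumptions for illustration.

```sql
-- Hypothetical sample rows; only (10, 2, 5) satisfies every rule.
DECLARE @sales TABLE (sls_sales INT, sls_quantity INT, sls_price INT);
INSERT INTO @sales (sls_sales, sls_quantity, sls_price) VALUES
    (2,    1, 50),    -- wrong calculation
    (NULL, 1, 2),     -- missing sales
    (40,   1, 2),     -- wrong calculation
    (0,    1, 4),     -- zero sales
    (21,   1, -21),   -- negative price
    (10,   2, 5);     -- valid row

-- Expectation for clean data: no rows.
SELECT DISTINCT sls_sales, sls_quantity, sls_price
FROM @sales
WHERE sls_sales != sls_quantity * sls_price
   OR sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL
   OR sls_sales <= 0    OR sls_quantity <= 0    OR sls_price <= 0
ORDER BY sls_sales, sls_quantity, sls_price;
```

The explicit IS NULL conditions matter: a comparison such as sls_sales != sls_quantity * sls_price evaluates to unknown when any operand is null, so those rows would otherwise be missed.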
Now by looking at the data, we can see that in the sales we have nulls, we have
negative numbers and zeros. So we have all the bad combinations, and as well we
have bad calculations here. As you can see, the price here is 50, the quantity is
one, but the sales is two, which is not correct. And here we have as well wrong
calculations. Here we should have a 10 and here nine, or maybe the price is
wrong. And by looking at the quantity now, you can see we don't have any nulls.
We don't have any zeros or negative numbers. So the quantity looks better
than the sales. And if you look at the prices, we have nulls, we have negatives,
and yeah, we don't have zeros. So that means the quality of the sales and the
price is poor. The calculation is not working, and we have these scenarios. Now
of course, how I do it here: I don't try to transform everything on
my own. I usually go and talk to an expert, maybe someone from the business
or from the source system, and I show those scenarios and discuss. And usually
there are two answers. Either they are going to tell me: you know what, I will
fix it in my source. So I have to live with it: there is incoming bad data, and
the bad data is going to be present in the warehouse until the source system
cleans up those issues. And the other answer you might get: you know what, we
don't have the budget, and this data is really old, and we are not going to do
anything. So here you have to decide: either you leave it as it is, or you say,
you know what, let's improve the quality of the data. But here you have
to ask the experts to support you in solving these issues, because it really
depends on the rules. Different rules make different transformations. So now
let's say that we have the following rules. If the sales value is
null or negative or zero, then derive it using the formula, by multiplying
the quantity with the price. And if the price is wrong, for example we
have a null or zero here, then calculate it from the sales and the
quantity. And if you have a price that is a minus, like minus 21, a negative
number, then you have to convert it to a 21. So from a negative to a
positive, without any calculations. So those are the rules, and now we're going
to build the transformations based on those rules. So let's do it
step by step. I will go over here and we're going to start building the new
sales. So what does the rule say? CASE WHEN, of course, as usual, if the
sales is null, or let's say the sales is a negative number or equal to zero, or,
another scenario, we have a sales value but it is not following the
calculation. So we have wrong information in the sales. So we're going
to say the sales is not equal to the quantity multiplied by the price. But of
course we will not leave the price like this: we are using the function ABS. The
absolute is going to convert everything from negative to positive.
Then what we have to do is use the calculation. So it's going to be the
quantity multiplied by the price. That means we are not using the value
that comes from the source system. We are recalculating it. Now let's say the
sales is correct and not one of those scenarios. So we're going to say ELSE:
we will go with the sales as it is, as it comes from the source, because it is
correct. It's really nice. Let's say END and give it the same name. I
will rename the old one here as an old value, and the same for the price.
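The sales rule as just described, sketched in T-SQL. Column names and sample rows are assumptions for illustration; the old values are kept next to the new one for comparison.

```sql
-- Hypothetical sample rows.
DECLARE @sales TABLE (sls_sales INT, sls_quantity INT, sls_price INT);
INSERT INTO @sales (sls_sales, sls_quantity, sls_price) VALUES
    (NULL, 1, 2),     -- missing sales      -> recalculated
    (40,   1, 2),     -- wrong calculation  -> recalculated
    (0,    1, 4),     -- zero sales         -> recalculated
    (21,   1, -21),   -- matches quantity * ABS(price) -> kept
    (10,   2, 5);     -- valid              -> kept

SELECT
    sls_sales AS old_sls_sales,
    sls_quantity,
    sls_price AS old_sls_price,
    CASE
        WHEN sls_sales IS NULL
          OR sls_sales <= 0
          OR sls_sales != sls_quantity * ABS(sls_price)
        THEN sls_quantity * ABS(sls_price)   -- recalculate from quantity and price
        ELSE sls_sales                       -- source value is already consistent
    END AS sls_sales
FROM @sales;
```

If both the sales and the price are null, the recalculation also yields null, so such rows still need a decision from the business side.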
The quantity we will not touch, because it is correct. So like this. And now
let's transform the prices. So again, as usual, we go with CASE WHEN. So
what are the scenarios? The price is null, or the price is less than or equal to
zero. Then what are we going to do? We're going to do the calculation. So it's
going to be the sales divided by the quantity, the SLS quantity. But here we
have to make sure that we are not dividing by zero. Currently we don't
have any zeros in the quantity, but you don't know: in the future you might get a
zero, and the whole code is going to break.
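A sketch of the price rule with a guard against that divide-by-zero case, using NULLIF so a zero quantity produces a null instead of an error. Column names and sample rows are assumptions for illustration.

```sql
-- Hypothetical sample rows.
DECLARE @sales TABLE (sls_sales INT, sls_quantity INT, sls_price INT);
INSERT INTO @sales (sls_sales, sls_quantity, sls_price) VALUES
    (100, 2, NULL),   -- missing price      -> 100 / 2
    (100, 0, NULL),   -- zero quantity      -> NULL rather than an error
    (21,  1, -21),    -- negative price     -> 21 / 1
    (30,  3, 10);     -- valid              -> kept

SELECT
    sls_sales,
    sls_quantity,
    sls_price AS old_sls_price,
    CASE
        WHEN sls_price IS NULL OR sls_price <= 0
        THEN sls_sales / NULLIF(sls_quantity, 0)  -- NULLIF(x, 0) is NULL when x = 0
        ELSE sls_price
    END AS sls_price
FROM @sales;
```

With INT columns, as assumed here, the division is integer division and truncates any remainder; a decimal type would be needed to keep fractional prices.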
zero and the whole code going to break. So what you have to do is to go and say
So what you have to do is to go and say if you get any zero replace it with a
if you get any zero replace it with a null. So null if if it is zero then make
null. So null if if it is zero then make it null. So that's it. Now if the price
it null. So that's it. Now if the price is not null and the price is not
is not null and the price is not negative or equal to zero then
negative or equal to zero then everything is fine and that's why we're
everything is fine and that's why we're going to have now the else it going to
going to have now the else it going to be the price as it is from the source
be the price as it is from the source system. So that's it. We're going to say
system. So that's it. We're going to say end as price. So I'm totally happy with
that. Let's go and execute it and check, of course. So that is the old data
and that is the new, transformed, cleaned-up data. So here previously we had a
null, but now we have two. So two multiplied by one, we are getting two. So the
sales here is correct. Now moving on to the next one: we have 40 in the sales,
but the price is two. So two multiplied by one, we should get two. So the new
sales is correct. It is two and not 40. Now to the next one over here: the old
sales is zero. But if you go and multiply the four by the quantity, you will
get four. So the old sales here is not correct. That's why in the new sales we
have it correct, as a four. And let's go and get a minus. So in this case we
have a minus, which is not correct. So we are getting the price multiplied by
one. We should get here a nine. And this sales here is correct. Now let's go
and get a scenario where the price is null, like this here. So we don't have a
price here, but we calculated it from the sales and the quantity. So we divided
the 10 by two and we have five. So the new price is better. And the same thing
for the
minuses. So we have here minus 21 and in the output we have 21 which is correct.
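The sales and price rules walked through above can be sketched roughly as follows. This is a minimal sketch in T-SQL syntax, not the exact query from the video: the column names (sls_sales, sls_quantity, sls_price) and the sample rows are assumptions, and the sales rule is reconstructed from the results described here rather than quoted.

```sql
-- Hypothetical sample rows standing in for the bronze sales table.
SELECT
    s.sls_sales AS old_sls_sales,
    s.sls_quantity,
    s.sls_price AS old_sls_price,
    -- Assumed sales rule: recalculate when the value is missing, not positive,
    -- or inconsistent with quantity * |price|; otherwise keep the source value.
    CASE
        WHEN s.sls_sales IS NULL
          OR s.sls_sales <= 0
          OR s.sls_sales <> s.sls_quantity * ABS(s.sls_price)
            THEN s.sls_quantity * ABS(s.sls_price)
        ELSE s.sls_sales
    END AS sls_sales,
    -- Price rule: derive it from sales / quantity when it is missing or not
    -- positive. NULLIF turns a zero quantity into NULL, so the division
    -- returns NULL instead of raising a divide-by-zero error.
    CASE
        WHEN s.sls_price IS NULL OR s.sls_price <= 0
            THEN s.sls_sales / NULLIF(s.sls_quantity, 0)
        ELSE s.sls_price
    END AS sls_price
FROM (VALUES
    (CAST(NULL AS INT), 1, 2),   -- missing sales
    (40, 1, 2),                  -- sales inconsistent with quantity * price
    (0, 1, 4),                   -- zero sales
    (10, 2, CAST(NULL AS INT)),  -- missing price
    (21, 1, -21),                -- negative price
    (5, 0, CAST(NULL AS INT))    -- zero quantity: price stays NULL, no error
) AS s (sls_sales, sls_quantity, sls_price);
```

The last sample row is there only to show why NULLIF is used: without it, a quantity of zero would stop the whole query.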
So for now I don't see any scenario where the data is wrong. So everything
looks better than before. And with that we have applied the business rules from
the experts and we have cleaned up the data in the data warehouse. And this is
way better than before, because we are now presenting better data for analysis
and reporting. But it is challenging, and you have to understand the business
exactly. So now what we're going to do: we're going to go and copy that
information and integrate it in our query. So instead of the sales we're going
to get our new calculation, and instead of the price we will get our corrected
calculation. And here I'm missing the end. Let's go and run the whole thing
again. So with that we have now as well cleaned sales, quantity, and price,
and it
is following our business rules. So with that we are done cleaning up the sales
details. In the next step we're going to go and insert it into the sales
details. But we have to go and check the DDL again. So now all you have to do
is to compare those results with the DDL. So the first one is the order number.
It's fine, as are the product key and the customer ID. But here we have an
issue: all this information is now a date and not an integer. So we have to go
and change the data type. And with that we have a better data type than before.
Then the sales, quantity, and price: those are correct. Let's go and drop the
table and create it from scratch again. And don't forget to update your DDL
script. So that's it for this. And we're going to go now and insert the results
into our silver table, sales details. And we have to go and list all the
columns now. I have already prepared the list of all the columns. So make sure
that you have the correct order of the columns. So let's go now and insert the
data. And with that we can see that SQL did insert data into our sales details.
But now it is very important to check the health of the silver table. So what
we're going to do: instead of bronze here, we're going to go and switch it to
silver. So let's check over here. So here the order date is always earlier than
the shipping and the due date, which is really nice. But now I'm very
interested in the calculations. So here we're going to switch it from bronze to
silver. And I'm going to go and get rid of all those calculations, because we
don't need them. And now let's see whether we have any issue. Well, perfect.
Our data is
following the business rules. We don't have any nulls, negative values, zeros.
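A check like this one could look roughly as follows. It is only a sketch in T-SQL syntax: the CTE silver_sales is a hypothetical stand-in for the silver sales table, and the column names and sample rows are assumptions. The expectation for clean data is that the query returns no rows.

```sql
-- Hypothetical stand-in for the cleaned silver table.
WITH silver_sales (sls_sales, sls_quantity, sls_price) AS (
    SELECT * FROM (VALUES
        (2, 1, 2),
        (4, 1, 4),
        (10, 2, 5)
    ) AS v (sls_sales, sls_quantity, sls_price)
)
-- Return every combination that breaks one of the business rules:
-- sales = quantity * price, and no NULL, zero, or negative values.
SELECT DISTINCT
    sls_sales,
    sls_quantity,
    sls_price
FROM silver_sales
WHERE sls_sales <> sls_quantity * sls_price
   OR sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL
   OR sls_sales <= 0 OR sls_quantity <= 0 OR sls_price <= 0
ORDER BY sls_sales, sls_quantity, sls_price;
```

Pointing the same query at the bronze data instead would list the rows that still need cleaning.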
Now, as usual, the last step, the final check: we will just have a final look
at the table. So we have the order number, the product key, the customer ID,
those three dates, we have the sales, quantity, and the price, and of course we
have our metadata column. Everything is perfect. So now, by looking at our
code, what are the different types of data transformation that we are doing? So
in those three columns we are doing the following. At the start we are handling
invalid data, and this is as well a type of transformation, and at the same
time we are doing data type casting. So we are changing it to a more correct
data type. And if you look at the sales over here, then what we are doing is
handling the missing data and as well the invalid data by deriving the column
from already existing ones. And it is very similar for the price: we are
handling the invalid data as well by deriving it from a specific calculation
over here. So those are the different types of data transformations that you
have done in these
scripts. All right. Now let's keep moving to the next system. We have the
customer AZ12. So here we have only three columns, and let's start with the ID
first. So here again we have the customer's information, and if we go and check
our model again, you can see that we can connect this table with the CRM table
customer info using the customer key. So that means we have to go and make sure
that we can connect those two tables. So let's go and check the other table. We
can go and check, of course, the silver layer. So let's query it, and we can
query both of the tables. Now we can see there are extra characters here that
are not included in the customer key from the CRM. So let's go and search, for
example, for this customer over here: where C ID like. So we are searching for
a customer that has a similar ID. Now, as you can see, we are finding this
customer, but the issue is that we have those three characters, NAS. There is
no specification or explanation of why we have the NAS. So actually what we
have to do is to go and remove those characters. We don't need them. So let's
check the data again. So it looks like the old data has an NAS at the start,
and then afterward we have new data without those three characters. So we have
to clean up those IDs in order to be able to connect them with other tables. So
we're going to do it like this. We're
going to start with the case when since we have like two scenarios in our data.
So if the C ID is like the three characters NAS, so if the ID starts with
those three characters, then we're going to go and apply a transformation
function; otherwise it's going to stay as it is. So that's it. So now we have
to go and build the transformation. So we're going to use SUBSTRING, and then
we have to define the string. It's going to be the C ID. And then we have to
define the position where it starts cutting or extracting. So we can say 1, 2,
3, and then 4. So we have to define position number four. And then we have to
define how many characters should be extracted. I will make it dynamic, so I
will go with the length. I will not go and count how many. So we're going to
say the length of the C ID. So it looks good. If it's like NAS, then go and
extract from the C ID, at position number four, the rest of the characters. So
let's go and execute it. And I'm missing a comma here. Again. Now we don't have
any NAS at the start. And if you scroll down, you can see those are not
affected as well. So with that we now have a nice ID to be joined with the
other table. Of course we can go and test it like this: in the where, we take
the whole thing, the whole transformation, and say NOT IN. We remove, of
course, the alias name; we don't need it. And then we make a very simple
subquery: SELECT DISTINCT CST key, the customer key, from the silver table;
it's going to be silver CRM cust info. So that's it. So let's go and check. So
as you can see, it is working fine. We are not able to find any unmatching data
between the customer info from the ERP and the CRM. But of course that is after
the transformation. If you don't use the transformation, so if I just remove it
like this, we will find a lot of unmatching data. So this means our
transformation is working perfectly and we can go and remove the original value.
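The ID cleanup and the NOT IN test described above can be sketched like this. It is a sketch in T-SQL syntax under stated assumptions: the two CTEs are hypothetical stand-ins for the ERP customer table and the silver CRM customer info table, and the column names (cid, cst_key) and sample IDs are made up for illustration.

```sql
-- Hypothetical stand-ins for the two tables that must be joinable.
WITH erp_cust (cid) AS (
    SELECT * FROM (VALUES
        ('NASAW00011000'),   -- old-style ID with the NAS prefix
        ('AW00011001')       -- new-style ID without it
    ) AS v (cid)
),
crm_cust (cst_key) AS (
    SELECT * FROM (VALUES
        ('AW00011000'),
        ('AW00011001')
    ) AS v (cst_key)
)
SELECT
    cid AS old_cid,
    -- Cut from position 4 to the end; LEN(cid) is simply a length that is
    -- always long enough, so the characters do not have to be counted.
    CASE
        WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))
        ELSE cid
    END AS cid
FROM erp_cust
-- Keep only the IDs that still have no partner on the CRM side.
WHERE CASE
          WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))
          ELSE cid
      END NOT IN (SELECT DISTINCT cst_key FROM crm_cust);
```

With these sample rows the query should come back empty, which is the sign that every cleaned ID finds a match. One caveat worth keeping in mind: if the subquery ever returned a NULL key, NOT IN would return no rows at all, so an empty result is only meaningful when the key column has no NULLs.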
So that's it for the first column. Okay. Now moving on to the next field, we
have the birthday of the customers. So the first thing to do is to check the
data type. It is a date, so it's fine. It is not an integer or a string, so we
don't have to convert anything. But still there is something to check with the
birth date. So we can check whether we have something out of range. So, for
example, we can go and check whether we have really old dates in the birth
dates. So let's take 1900, or let's say 1924, and we can take the first day of
the month. So let's go and check that. Well, it looks like we have customers
that are older than 100 years. Well, I don't know. Maybe this is correct, but
it sounds, of course, strange for the business. "Hey, this is Creed, and he is
in charge of something." "That is correct." "Say hi to the kids." "Hi, kids."
"Yay." And then we can
go and check the other boundary where it is almost impossible to have a customer
whose birthday is in the future. So we can say birth date is higher than the
current date, like this. So let's go and query this information. Well, it will
not work, because we have to have an OR between them. And now if we check the
list over here, we have dates that are invalid for birth dates. So all those
dates are birthdays in the future, and this is totally unacceptable. So this is
an indicator of bad data quality. Of course you can go and report it to the
source system in order to correct it. So here it's up to you what to do with
those dates: either leave them as they are, as bad data, or we can go and clean
that up by replacing all those dates with a null, or maybe replacing only the
ones that are extreme, where it is 100% incorrect. So let's go and write the
transformation for that. As usual, we're going to start with case when: the
birth date is larger than the current date and time, then null. Otherwise, we
can have an else where we have the birth
date as it is and then we have an end as birth date. So, let's go and execute it.
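As a rough sketch, the range check and the birth date rule could be written like this in T-SQL syntax. The column names (cid, bdate), the sample rows, and the exact lower boundary of 1924-01-01 are assumptions for illustration; the out_of_range flag is not part of the transformation itself and is only there to show both boundaries side by side.

```sql
-- Hypothetical sample rows standing in for the ERP customer table.
WITH erp_cust (cid, bdate) AS (
    SELECT * FROM (VALUES
        ('AW00011000', CAST('1971-10-06' AS DATE)),  -- plausible
        ('AW00011001', CAST('1916-02-10' AS DATE)),  -- very old, left as it is
        ('AW00011002', CAST('9999-11-20' AS DATE))   -- in the future
    ) AS v (cid, bdate)
)
SELECT
    cid,
    bdate AS old_bdate,
    -- The two boundaries need an OR between them, not an AND.
    CASE
        WHEN bdate < '1924-01-01' OR bdate > GETDATE() THEN 1
        ELSE 0
    END AS out_of_range,
    -- Only the clearly impossible case is cleaned: future dates become NULL.
    CASE
        WHEN bdate > GETDATE() THEN NULL
        ELSE bdate
    END AS bdate
FROM erp_cust;
```

Very old birth dates are flagged here but deliberately not changed, which matches the choice to replace only the values that are certainly wrong.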
And with that, we should not get any customer with a birthday in the future.
So that's it for the birth date. Now let's move to the next one. We have the
gender. Now again, the gender information is low cardinality, so we have to go
and check all the possible values inside this column. So in order to check all
the possible values, we're going to use SELECT DISTINCT gen from our table. So
let's go and execute it. And now the data doesn't look really good. So we have
here a null, we have an F, we have here an empty string, we have Male, Female,
and again we have the M. So this is not really good. And what we're going to
do: we're going to go and clean up all those values in order to have only three
values: Male, Female, and not available. So we're going to do it like this.
We're going to say case when, and now we're going to go and trim the values,
just to make sure there are no empty spaces. And as well, I'm going to go and
use the UPPER function, just to make sure that in the future, if we get any
lowercase values and so on, we are covering all the different scenarios. So
case when this is in F or, let's say, FEMALE, then make it Female. And we can
go and do the same thing for the male, like this. So if it is an M or a MALE
(make sure it is in capital letters, because here we are using UPPER), then it
is Male. Otherwise, in all other scenarios, it
should be not available. So whether it is an empty string or nulls and so on.
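The gender mapping described here can be sketched as follows, in T-SQL syntax. The column name gen comes from the transcript; the sample values and the literal 'n/a' used for "not available" are assumptions. TRIM needs SQL Server 2017 or later.

```sql
-- Hypothetical sample values covering the cases seen in the column.
SELECT
    g.gen AS old_gen,
    -- TRIM removes surrounding spaces and UPPER makes the comparison
    -- case-insensitive, so 'm ', 'M' and 'Male' all end up as 'Male'.
    CASE
        WHEN UPPER(TRIM(g.gen)) IN ('F', 'FEMALE') THEN 'Female'
        WHEN UPPER(TRIM(g.gen)) IN ('M', 'MALE')   THEN 'Male'
        ELSE 'n/a'   -- NULL, empty string, spaces, anything unexpected
    END AS gen
FROM (VALUES
    (CAST(NULL AS VARCHAR(50))),
    ('F'),
    (''),
    ('Male'),
    ('Female'),
    ('M '),
    ('  ')
) AS g (gen);
```

Because NULL never satisfies an IN comparison, the NULL row falls through to the else branch together with the empty and blank strings.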
So we have to have an end, of course, as gen. So now let's go and test it and
check whether we have covered everything. So you can see the M is now Male. The
empty one is not available. The F is Female. The empty string, or maybe spaces,
here is not available. Female is going to stay as it is, and the same for the
Male. So with that we are covering all the scenarios, and we are following our
standards in the project. So I'm going to go and cut this and put it in our
original query over here. So let's go and execute the whole thing. And with
that we have cleaned up all those three columns. Now the question is: did we
change anything in the DDL? Well, we didn't change anything. We didn't
introduce any new column or change any data type. So that means the next step
is that we're going to go and insert it into the silver layer. So as usual
we're going to say here INSERT INTO silver ERP, the customer table, and then
we're going to go and list all the column names: so C ID, birth date, and the
gender. All right. So let's go and execute it. And with that we can see it
inserted all the data. And of course the very important next step is to check
the data quality. So let's go back to our query over here and change it from
bronze to silver. So
let's go and check the silver layer. Well of course we are getting those very
old customers, but we didn't change that. We only changed the birthdays that
are in the future, and we don't see them here in the results. So that means
everything is clean. So for the next one, let's go and check the different
genders. And as you can see, we have only those three values. And of course we
can go and take a final look at our table. So you can see the C ID here, the
birth date, the gender, and then we see our metadata column. And everything
looks amazing. So that's it. What are the different types of data
transformations that we have done? First, with the ID, we have handled invalid
values: we have removed the part that is not needed. And the same thing goes
for the birth dates: we have handled invalid values as well. And then for the
last one, for the gender, we have done data normalization by mapping the codes
to more friendly values. And as well, we have handled the missing values. So
those are the types that we have done in this code. Okay. Moving on to the
second table, we have the location information. So we have ERP location A101.
So now here the task is easy
because we have only two columns and if you go and check the integration model
we can find our table over here. So we can go and connect it with the customer
info from the other system, using the C ID with the customer key. So those two
values must match in order to join the tables. So that means we have to go and
check the data. So let's go and select the data: CST key from, let's go and get
the silver data, customer info. So let's go. Now if you go and check the
result, you can see over here that we have an issue with the C ID: there is a
minus between the characters and the numbers. But in the customer ID, the
customer number, we don't have anything that splits the characters from the
numbers. So if you go and join on those two values, it will not work. So what
we have to do is to go and get rid of this minus, because it is totally
unnecessary. So let's go and fix that. It's going to be very simple. So what
we're going to do: we're going to take the C ID, and we're going to go and
search for the minus and replace it with nothing. It's very simple, like this.
So let's go and query it again. And with that, things look very similar to each
other. And as well we can go and check it. So we're going to say where our
transformation is NOT IN, and then
we can go and use this as a subquery like this. So let's go and execute it.
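Put together, the minus removal and the matching test might look like the sketch below, in T-SQL syntax. The CTEs are hypothetical stand-ins for the ERP location table and the silver CRM customer info table; the column names (cid, cst_key) and sample IDs are assumptions.

```sql
-- Hypothetical stand-ins for the location table and the CRM customer info.
WITH erp_loc (cid) AS (
    SELECT * FROM (VALUES
        ('AW-00011000'),
        ('AW-00011001')
    ) AS v (cid)
),
crm_cust (cst_key) AS (
    SELECT * FROM (VALUES
        ('AW00011000'),
        ('AW00011001')
    ) AS v (cst_key)
)
SELECT
    cid AS old_cid,
    -- Replace the minus with an empty string so the key format matches.
    REPLACE(cid, '-', '') AS cid
FROM erp_loc
-- Any row returned here is an ID that cannot be joined to the CRM side.
WHERE REPLACE(cid, '-', '') NOT IN (SELECT cst_key FROM crm_cust);
```

Comparing the raw cid instead of the REPLACE expression in the where clause would return both sample rows, which is the quick way to confirm that the transformation is what makes the join possible.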
And as you can see, we are not finding any unmatching data now. So that means
our transformation is working, and with that we can go and connect those two
tables together. So if I take the transformation away, you can see that we will
find a lot of unmatching data. So the transformation is okay. We're going to
stay with it. And now let's speak about the countries. Now we have here
multiple values and so on. This is low cardinality, and we have to go and check
all possible values inside this column. So that means we are checking whether
the data is consistent. So we can do it like this: DISTINCT, the country, from
our table. I'm just going to go and copy it like this. And as well, I'm going
to go and sort the data by the country. So let's go and check the values. Now
you can see we have a null. We have an empty string, which is really bad. And
then we have a
full name of country and then we have as well an abbreviation of the countries.
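A consistency check like the one described might look like this in T-SQL. The table and column names are hypothetical placeholders.

```sql
-- Hypothetical names: bronze.erp_customer_location and country are placeholders.
-- List every distinct value of a low-cardinality column to spot NULLs,
-- empty strings, and mixed spellings of the same country.
SELECT DISTINCT
    country
FROM bronze.erp_customer_location
ORDER BY country;
```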
Well, this is a mix. This is not really good, because sometimes we have DE and sometimes we have Germany, and then we have the United Kingdom, and then for the United States we have like three versions of the same information, which is as well not really good. So the quality of the country column is not really good. Let's go and work on the transformation.
As usual, we're going to start with CASE WHEN. If TRIM(country) is equal to DE, then we're going to transform it to Germany. The next one is going to be about the USA. So if TRIM(country) is IN, and now let's go and get those two values, US and USA, then it's going to be United States. With that we have covered as well those three cases. Now we have to talk about the NULL and the empty string. So we're going to say: WHEN TRIM(country) is equal to an empty string OR country IS NULL, then it's going to be not available. Otherwise, I would like to get the country as it is, so TRIM(country), just to make sure that we don't have any leading or trailing spaces. So that's it. Let's go and say this is the country. It is working, and the country information is transformed.
And now what I'm going to do, I'm going to take the whole new transformation and compare it to the old one. Let me just call this old country, and let's go and query it. So now we can check. Some values stayed as before, so nothing changed there. The DE is now Germany. The empty string is not available, the NULL the same thing, and the United Kingdom stayed like before. And now we have one value for all those variants, so it's only the United States. So it looks perfect.
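Put together, the normalization described above could be sketched like this. The table and column names are hypothetical, and `'n/a'` is an assumed literal standing in for "not available". `TRIM` requires SQL Server 2017 or later; on older versions `LTRIM(RTRIM(...))` is the usual substitute.

```sql
-- Hypothetical names; 'n/a' is an assumed placeholder for "not available".
SELECT DISTINCT
    country AS old_country,          -- original value, kept for comparison
    CASE
        WHEN TRIM(country) = 'DE'                  THEN 'Germany'
        WHEN TRIM(country) IN ('US', 'USA')        THEN 'United States'
        WHEN TRIM(country) = '' OR country IS NULL THEN 'n/a'
        ELSE TRIM(country)           -- keep the value, minus surrounding spaces
    END AS country
FROM bronze.erp_customer_location
ORDER BY old_country;
```

Selecting the old and new values side by side with `DISTINCT` gives one row per mapping, which makes it easy to confirm that each raw value lands on the intended friendly value.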
And with that we have cleaned as well the second column, so now we have clean results. And now the question: did we change anything in the DDL? Well, we haven't changed anything. Both of them are VARCHAR. So we can go now immediately and insert it into our table. So INSERT INTO the silver customer location table. And here we have to specify the columns. It's very simple: the ID and the country. So let's go and execute it. And as you can see, we have now inserted all those values.
Of course, as a next step, we go and double check that information. I will just go and remove all that stuff here as well, and instead of bronze, let's go with the silver. So, as you can see, all the values of the country look good. And let's have a final look at the table, like this. We have the IDs without the separator, we have the countries, and as well our metadata information. So with that, we have cleaned up the data for the location.
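The load and the follow-up check could be sketched as below. All object names are hypothetical placeholders, and the sketch assumes the silver table has a metadata column filled by a default, which is why that column is not listed in the insert.

```sql
-- Hypothetical names; assumes silver.erp_customer_location has a metadata
-- column populated by a DEFAULT constraint, so it is omitted from the list.
INSERT INTO silver.erp_customer_location (cid, country)
SELECT
    REPLACE(cid, '-', '') AS cid,    -- handle invalid values: drop the separator
    CASE
        WHEN TRIM(country) = 'DE'                  THEN 'Germany'
        WHEN TRIM(country) IN ('US', 'USA')        THEN 'United States'
        WHEN TRIM(country) = '' OR country IS NULL THEN 'n/a'
        ELSE TRIM(country)
    END AS country
FROM bronze.erp_customer_location;

-- Re-run the consistency check, this time against the silver table.
SELECT DISTINCT country
FROM silver.erp_customer_location
ORDER BY country;

-- Final look at the loaded rows.
SELECT * FROM silver.erp_customer_location;
```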
Okay. So now, what are the different types of data transformation that we have done here? First, we have handled invalid values: we have replaced the minus with an empty string. For the country, we have done data normalization, so we have replaced codes with friendly values, and at the same time we have handled missing values by replacing the empty string and NULL with not available. And one more thing, of course: we have removed the unwanted spaces. So those are the different types of transformation that we have done for this table.
Okay guys, now keep the energy up, keep the spirit up.
We have to go and clean up the last table in the bronze layer. And of course, we cannot go and skip anything. We have to check the quality and detect all the errors. So now we have a table about the categories for the products, and here we have like four columns.
Let's go and start with the first one, the ID. As you can see in our integration model, we can connect this table together with the product info from the CRM using the product key. And as you remember, in the silver layer we have created an extra column for that in the product info. So if you go and select that data, you can see we have a column called category ID, and this one exactly matches the ID that we have in this table, and we have done the testing. So this ID is ready to be used together with the other table. There is nothing to do over here.
And now for the next columns: they are strings, and of course we can go and check whether there are any unwanted spaces. So let's go and check: SELECT * FROM, and we're going to go and get the same table, like this here. First we are checking the category. So the category is not equal to the category after trimming the unwanted spaces. Let's go and execute it. And as you can see, we don't have any results, so there are no unwanted spaces. Let's go and check the other column, for example the subcategory, the next one. So let's get the subcategory and run the query as well.
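The unwanted-spaces check described here could be written as follows. Table and column names are hypothetical placeholders. One caveat worth knowing: SQL Server ignores trailing spaces when comparing strings with `=` or `!=`, so this comparison catches leading spaces only; the second query is a stricter variant that compares stored lengths instead.

```sql
-- Hypothetical names: bronze.erp_product_category, category, subcategory,
-- and maintenance are placeholders.

-- Check as described: rows where a value differs from its trimmed version.
-- Expectation: no rows.
SELECT *
FROM bronze.erp_product_category
WHERE category    != TRIM(category)
   OR subcategory != TRIM(subcategory)
   OR maintenance != TRIM(maintenance);

-- Stricter variant (an addition, not from the transcript): DATALENGTH counts
-- stored bytes, so trailing spaces are detected as well.
SELECT *
FROM bronze.erp_product_category
WHERE DATALENGTH(category)    != DATALENGTH(TRIM(category))
   OR DATALENGTH(subcategory) != DATALENGTH(TRIM(subcategory))
   OR DATALENGTH(maintenance) != DATALENGTH(TRIM(maintenance));
```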
We don't have anything. So that means we don't have unwanted spaces for the subcategory. Let's go now and check the last column. I will just copy and paste. Now let's get the maintenance, and let's go and execute. And as well, no results. Perfect. We don't have any unwanted spaces inside this table.
So now the next step is that we're going to go and check the data standardization, because all those columns have low cardinality. What we can do is say SELECT DISTINCT, let's get the category, from our table. I'll just copy and paste it and check all values. So as you can see, we have accessories, bikes, clothing, and components. Everything looks perfect. We don't have to change anything in this column. Let's go and check the subcategory. And if you scroll down, all values are friendly and nice as well. Nothing to change here. And let's go and check the last column, the maintenance. Perfect. We have only two values, yes and no. We don't have any NULLs.
So my friends, that means this table has really nice data quality and we don't have to clean up anything. But still, we have to follow our process. We have to go and load it from the bronze to the silver, even if we didn't transform anything. So our job is really easy here. We're going to go and say INSERT INTO silver.erp_px and so on, and we're going to go and define the columns. So it's going to be the ID, the category, subcategory, maintenance. So that's it.
Let's go and insert the data. Now, as usual, what we're going to do is go and check the data in the silver ERP table. Let's have a look. All right. So we can see the IDs are here, the categories, the subcategories, the maintenance, and we have our metadata column. So everything is inserted correctly.
All right. So now I have all those queries and the insert statements for all six tables. And now, what is important: before inserting any data, we have to make sure that we are truncating and emptying the table, because if you run this query twice, what's going to happen? You will be inserting duplicates. So first truncate the data and then do a full load, inserting all the data. So we're going to have one step before, like in the bronze layer. We're going to say TRUNCATE TABLE, and then we will be truncating the silver customer info, and only after that do we go and insert the data. And of course, we can go and print this nice information at the start. So first we are truncating the table and then inserting. Let's go and run the whole thing. It is working. And if I run it again, we will not have any duplicates.
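A minimal sketch of the truncate-then-insert pattern for a single table. The object and column names are hypothetical, and the `SELECT` is reduced to a plain column list; in the real script it would carry the cleaning transformations built earlier.

```sql
-- Hypothetical names; the SELECT stands in for the full cleaning query.
PRINT '>> Truncating table: silver.crm_customer_info';
TRUNCATE TABLE silver.crm_customer_info;     -- empty the target first

PRINT '>> Inserting data into: silver.crm_customer_info';
INSERT INTO silver.crm_customer_info (customer_id, customer_key)
SELECT
    customer_id,
    customer_key
FROM bronze.crm_customer_info;               -- full load from bronze
```

Because the target is emptied before every load, running the script twice leaves the same rows in the table rather than doubling them.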
So we have to go and add this step before each insert. Let's go and do that. All right, I'm done with all tables. So now let's go and run everything. Let's go and execute it. And we can see in the messages that everything is working perfectly. With that we made all tables empty, and then we inserted the data. So perfect. With that we have a nice script that loads the silver layer.
But of course, like the bronze layer, we're going to put everything in one stored procedure. So let's go and do that. We'll go to the beginning over here and say CREATE OR ALTER PROCEDURE, and we're going to put it in the schema silver, using the naming convention load_silver. And we're going to go over here and say BEGIN, and take the whole code, and it is a long one, and give it one push with a Tab, and then at the end we're going to say END. Perfect. So we have our stored procedure, but we forgot the AS here. With that we will not have any error. Let's go and execute it. So the stored procedure is created. If you go to Programmability, you will find two procedures, load_bronze and load_silver. So now let's go and try it out.
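The wrapping step might look like the sketch below. The procedure name `silver.load_silver` follows the transcript; the table and column names inside the body are hypothetical placeholders. `CREATE OR ALTER` requires SQL Server 2016 SP1 or later, and `GO` is a batch separator understood by tools such as SSMS and sqlcmd rather than a T-SQL statement.

```sql
-- Table and column names in the body are hypothetical placeholders.
CREATE OR ALTER PROCEDURE silver.load_silver AS   -- AS is required before the body
BEGIN
    TRUNCATE TABLE silver.erp_customer_location;
    INSERT INTO silver.erp_customer_location (cid, country)
    SELECT
        REPLACE(cid, '-', ''),
        TRIM(country)
    FROM bronze.erp_customer_location;

    -- ...the same truncate-and-insert pair repeats for each remaining table...
END;
GO

-- Run the whole silver load with a single call.
EXEC silver.load_silver;
```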
All you have to do now is execute silver.load_silver. So let's execute the stored procedure, and with that we will get the same results. This stored procedure is now responsible for loading the whole silver layer.
Now of course, the messaging here is not really good, because we have learned in the bronze layer that we can go and add many things, like handling errors, doing nice messaging, and capturing the duration. So now your task is to pause the video, take this stored procedure, and transform it to be very similar to the bronze layer, with the same messaging and all the add-ons that we have added. So pause the video now. I will do it offline as well, and I will see you soon.
Okay. So I hope you are done, and I can show you the results. It's like the bronze layer. We have defined a few variables at the start in order to capture the duration. So we have the start time, the end time, the batch start time, and the batch end time. And then we are printing a lot of stuff in order to have nice messaging in the output. So at the start we are saying "loading the silver layer", and then we start splitting by the source system, so "loading the CRM tables". I'm going to show you only one table for now.
So we are setting the timer: we are saying start time, and assigning the current date and time to it. Then we are doing the usual. We are truncating the table, and then we are inserting the new information after cleaning it up. And we have this nice message, load duration, where we are finding the difference between the start time and the end time using the function DATEDIFF, and we want to show the result in seconds. So we are just printing how long it took to load this table. And we're going to go and repeat this process for all the tables.
And of course, we are putting everything in TRY and CATCH. So SQL is going to try to execute the TRY part, and if there are any issues, SQL is going to execute the CATCH. And here we are just printing a few pieces of information, like the error message, the error number, and the error state. And we are following exactly the same standard as the bronze layer.
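A sketch of what such a procedure might look like, showing one table only. The variable names, message texts, and table and column names are assumptions for illustration; only the overall structure (timers, `PRINT` messages, `DATEDIFF` in seconds, `TRY...CATCH`) comes from the description above.

```sql
-- Variable names, messages, and table/column names are hypothetical.
CREATE OR ALTER PROCEDURE silver.load_silver AS
BEGIN
    DECLARE @start_time DATETIME, @end_time DATETIME,
            @batch_start_time DATETIME, @batch_end_time DATETIME;

    BEGIN TRY
        SET @batch_start_time = GETDATE();
        PRINT 'Loading Silver Layer';
        PRINT 'Loading CRM Tables';

        -- One table: start the timer, truncate, insert, report the duration.
        SET @start_time = GETDATE();
        PRINT '>> Truncating Table: silver.crm_customer_info';
        TRUNCATE TABLE silver.crm_customer_info;
        PRINT '>> Inserting Data Into: silver.crm_customer_info';
        INSERT INTO silver.crm_customer_info (customer_id, customer_key)
        SELECT customer_id, customer_key
        FROM bronze.crm_customer_info;
        SET @end_time = GETDATE();
        PRINT '>> Load Duration: '
            + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR(20))
            + ' seconds';

        -- ...repeat the block above for each remaining table...

        SET @batch_end_time = GETDATE();
        PRINT 'Total Load Duration: '
            + CAST(DATEDIFF(SECOND, @batch_start_time, @batch_end_time) AS NVARCHAR(20))
            + ' seconds';
    END TRY
    BEGIN CATCH
        -- Runs only if a statement in the TRY block raises an error.
        PRINT 'Error occurred while loading the silver layer';
        PRINT 'Error Message: ' + ERROR_MESSAGE();
        PRINT 'Error Number: '  + CAST(ERROR_NUMBER() AS NVARCHAR(20));
        PRINT 'Error State: '   + CAST(ERROR_STATE() AS NVARCHAR(20));
    END CATCH
END;
```

`DATEDIFF(SECOND, ...)` returns whole seconds, so very fast loads on a local machine will typically print 0.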
So let's go and execute the whole thing. And with that we have updated the definition of the stored procedure. Let's go now and execute it. So EXECUTE silver.load_silver. Let's go and do that. It went very fast, like less than 1 second, again because we are working on a local machine. "Loading the silver layer", "loading the CRM tables", and we can see this nice messaging. So it starts with truncating the table, then inserting the data, and we are getting the load duration for this table. You will see that everything is below 1 second here; in real projects you will of course get more than 1 second. And at the end we have the load duration of the whole silver layer.
And now I have one more thing for you. Let's say that you are changing the design of this stored procedure for the silver layer. You are adding different types of messaging, or maybe you're creating logs and so on. For all those new ideas and redesigns that you are doing for the silver layer, you always have to think about bringing the same changes to the other stored procedure for the bronze layer as well. So always try to keep your code following the same standards. Don't have a new idea in one stored procedure and an old idea in another one. Always try to maintain those scripts and keep them all up to date, following the same standards. Otherwise, it can be really hard for other developers to understand the code. I know that needs a lot of work and commitment, but it is your job to make everything follow the best practices and the same naming conventions and standards that you set for your projects.
So guys, now we have two very nice ETL scripts: one that loads the bronze layer and another one for the silver layer. So now our data warehouse is very simple. All you have to do is first run the bronze layer. With that we are taking all the data from the CSV files from the source and putting it inside our data warehouse in the bronze layer, and with that we are refreshing the whole bronze layer. Once it's done, the next step is to run the stored procedure of the silver layer. Once you execute it, you are taking all the data from the bronze layer, transforming it, cleaning it up, and then loading it to the silver layer. And as you can see, the concept is very simple. We are just moving the data from one layer to another layer with different tasks.
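The run order described above comes down to two calls. The procedure names follow the transcript's "load bronze" and "load silver"; the `bronze` schema for the first one is an assumption.

```sql
-- bronze.load_bronze: schema name assumed; silver.load_silver as in the transcript.
EXEC bronze.load_bronze;   -- refresh the bronze layer from the source CSV files
EXEC silver.load_silver;   -- clean and transform bronze data into the silver layer
```

The order matters: the silver load reads from the bronze tables, so running it first would process whatever data the previous bronze refresh left behind.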
All right guys, so as you can see, in the silver layer we have done a lot of data transformations, and we have covered all the types that we have in data cleansing: removing duplicates, data filtering, handling missing data, invalid data, unwanted spaces, casting the data types, and so on. And as well we have derived new columns, we have done data enrichment, and we have normalized a lot of data. What we have not done yet, of course, is business rules and logic, data aggregations, and data integration. This is for the next layer.
All right, my friends. So finally we are done cleaning up the data and checking the quality of our data, so we can go and close those two steps. And now for the next step, we have to go and extend the data flow diagram. So let's go.
Okay. So now let's go and extend our data flow for the silver layer.
So what I'm going to do, I'm just going to go and copy the whole thing and put it side by side with the bronze layer. And let's call it the silver layer. The table names are going to stay as before, because we have a one-to-one mapping with the bronze layer. But what we're going to do is change the coloring. So I'm going to go and mark everything and make it gray, like silver. And of course, what is very important is to draw the lineage. So I'm going to go now from the bronze table and take an arrow and put it to the silver table. And with that we have a lineage between three layers. If you are checking this table, the customer info, you can understand: aha, this comes from the bronze layer, from the customer info, and this in turn comes from the source system CRM. So now we can see the lineage between different layers, and without looking at any scripts and so on, in one picture you can understand the whole project. So I don't have to explain a lot of stuff. By just looking at this picture, you can understand how the data is flowing between the sources, the bronze layer, the silver layer, and to the gold layer, of course, later. So as you can see, it looks really nice and clean.
All right. So with that, we have updated the data flow. Next, we're going to go and commit our work in the Git repo. So let's go.
Okay. So now let's go and commit our scripts.
our scripts. We're going to go to the folder scripts. And here we have a
folder scripts. And here we have a server layer. If you don't have it, of
server layer. If you don't have it, of course, you can go and create it. So,
course, you can go and create it. So, first we're going to go and put the DDL
first we're going to go and put the DDL scripts for the server layer. So let's
scripts for the server layer. So let's go and I will paste the code over here.
go and I will paste the code over here. And as usual, we have this commit as the
And as usual, we have this commit as the header explaining the purpose of this
header explaining the purpose of this script. So let's go and commit our work.
script. So let's go and commit our work. And we're going to do the same thing for
And we're going to do the same thing for the store procedure that loads the
the store procedure that loads the server layer. So I'm going to go over
server layer. So I'm going to go over here. I have already filed for that. So
here. I have already filed for that. So let's go and paste that. So we have here
let's go and paste that. So we have here our stored procedures. And as usual at
our stored procedures. And as usual at the start, we have as well. So this
the start, we have as well. So this script is doing the ATL process where we
script is doing the ATL process where we load the data from bronze into silver.
load the data from bronze into silver. So the action is to truncate the table
So the action is to truncate the table first and then insert transformed cleans
first and then insert transformed cleans data from bronze to silver. There are no
data from bronze to silver. There are no parameters at all. And this is how you
parameters at all. And this is how you can use the source procedure. Okay. So
can use the source procedure. Okay. So we're going to go and commit our work.
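As a rough illustration of the truncate-then-insert load described above, here is a minimal sketch. It is not the course's stored procedure: it uses Python's built-in sqlite3 module as a stand-in for the SQL Server database so that it can run on its own, and the table names, column names, and the `load_silver` function are hypothetical. SQLite has no `TRUNCATE TABLE`, so a `DELETE` without a `WHERE` clause takes its place.

```python
import sqlite3


def load_silver(conn: sqlite3.Connection) -> int:
    """Full load: empty the silver table, then refill it from bronze."""
    with conn:  # one transaction: commit on success, roll back on error
        # SQLite has no TRUNCATE TABLE; DELETE without WHERE stands in for it.
        conn.execute("DELETE FROM silver_customers")
        conn.execute(
            """
            INSERT INTO silver_customers (customer_id, first_name)
            SELECT customer_id, TRIM(first_name)   -- example cleansing step
            FROM bronze_customers
            WHERE customer_id IS NOT NULL          -- drop rows without a key
            """
        )
    return conn.execute("SELECT COUNT(*) FROM silver_customers").fetchone()[0]


# Hypothetical bronze and silver tables with a few messy rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bronze_customers (customer_id INTEGER, first_name TEXT);
    CREATE TABLE silver_customers (customer_id INTEGER, first_name TEXT);
    INSERT INTO bronze_customers VALUES (1, '  Anna '), (2, 'Ben'), (NULL, 'x');
    """
)

first_run = load_silver(conn)
second_run = load_silver(conn)  # running the load again does not duplicate rows
silver_rows = conn.execute(
    "SELECT customer_id, first_name FROM silver_customers ORDER BY customer_id"
).fetchall()
print(first_run, second_run, silver_rows)
```

Because the target is emptied before every insert, the load can be repeated safely, which is the main reason to pair the truncate with the insert.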
And now, one more thing that we want to commit in our project: all those queries that you have built to check the quality of the silver layer. So this time we will not put them in the scripts. We're going to go to the tests folder, and here we're going to make a new file called quality checks silver, and inside it we're going to paste all the queries that we have built. I just reorganized them here by table. So here we can see all the checks that we have done during the course, and in the header we have some nice comments. So here we are just saying that this script is going to check the quality of the silver layer, and we are checking for nulls, duplicates, unwanted spaces, invalid date ranges, and so on. So each time you come up with a new quality check, I recommend that you share it with the project and with the other team members, in order to make it part of the checks that you do after running the ETL.
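The kinds of checks listed above (nulls, duplicates, unwanted spaces, invalid date ranges) could look roughly like the sketch below. It is only an illustration, not the course's quality-check script: it runs on SQLite through Python's sqlite3 module, and the table, the columns, and the date boundaries are assumptions. Each query returns the offending rows, so the expectation after a clean load is an empty result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE silver_customers (
        customer_id INTEGER, first_name TEXT, create_date TEXT
    );
    INSERT INTO silver_customers VALUES
        (1, 'Anna',  '2024-01-10'),
        (2, 'Ben',   '2024-02-01'),
        (2, 'Ben',   '2024-02-01'),   -- duplicate key
        (NULL, 'Cy', '2024-03-05'),   -- missing key
        (3, ' Dee',  '2999-01-01');   -- leading space, implausible date
    """
)

# Each check selects the rows that violate a rule; no rows means the check passes.
checks = {
    "null_or_duplicate_key": """
        SELECT customer_id, COUNT(*) FROM silver_customers
        GROUP BY customer_id
        HAVING COUNT(*) > 1 OR customer_id IS NULL
    """,
    "unwanted_spaces": """
        SELECT first_name FROM silver_customers
        WHERE first_name != TRIM(first_name)
    """,
    "invalid_date_range": """
        SELECT create_date FROM silver_customers
        WHERE create_date < '1900-01-01' OR create_date > '2100-01-01'
    """,
}

failures = {name: conn.execute(sql).fetchall() for name, sql in checks.items()}
for name, rows in failures.items():
    print(name, "->", len(rows), "offending row(s)")
```

Keeping the checks in one named collection makes it easy to add a new one later and rerun the whole set after every load.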
That's it. I'm going to go and put those checks in our repo, and in case I come up with a new check, I'm going to go and update it. Perfect. So now we have our code in our repository.

All right. So with that, our code is saved and we are done with the whole epic. So we have built the silver layer. Now let's go and minimize it. And now we come to my favorite layer, the gold layer. So we're going to go and build it. The first step, as usual: we have to analyze. And this time we're going to explore the business objects. So let's go.

All right. So now we come to the big question: how are we going to build the gold layer? As usual, we start with analyzing. So what we're going to do here is explore and understand what the main business objects are that are hidden inside our source systems. So as you can see, we have two sources and six files, and here we have to identify what the business objects are. Once we have this understanding, then we can start coding, and here the main transformation that we are doing is data integration.
And here I usually split it into three steps. First, we're going to go and build those business objects that we have identified. Then, once we have a business object, we have to look at it and decide what type of table it is. Is it a dimension? Is it a fact? Or is it maybe a flat table? So, what type of table have we built? And the last step is, of course, to rename all the columns into something friendly and easy to understand, so that our consumers don't struggle with technical names. Once we have done all those steps, it's time to validate what we have created. The new data model that we have created should be connectable, and we have to check that the data integration is done correctly. And once everything is fine, we cannot skip the last step: we have to document and commit our work in Git. And here we will be introducing new types of documentation. So we're going to have a diagram of the data model, we're going to build a data dictionary where we describe the data model, and of course we're going to extend the data flow diagram.
So this is our process. Those are the main steps that we will do in order to build the gold layer.

Okay. So what exactly is data modeling? Usually the source system is going to deliver raw data to you: unorganized, messy, and not very useful in its current state. Data modeling is the process of taking this raw data and then organizing and structuring it in a meaningful way. So what we are doing is putting the data into new, friendly, easy-to-understand objects like customers, orders, and products. Each one of them is focused on specific information, and, what is very important, we're going to describe the relationships between those objects by connecting them using lines. What you have built on the right side we call a logical data model. If you compare it to the left side, you can see that the data model makes it really easy to understand our data, the relationships, and the processes behind them.

Now, in data modeling we have three different stages, or let's say three different ways to draw a data model. The first stage is the conceptual data model. Here the focus is only on the entities. So we have customers, orders, products, and we don't go into details at all. We don't specify any columns or attributes inside those boxes. We just want to focus on what entities we have and the relationships between them. So the conceptual data model doesn't focus on the details at all. It just gives the big picture.

The second data model that we can build is the logical data model. Here we start specifying the different columns that we can find in each entity, like the customer ID, the first name, the last name, and so on. We still draw the relationships between those entities, and we also make it clear which columns are the primary keys and so on. So as you can see, we have more details here, but we don't describe a lot of details for each column, and we don't worry about how exactly we're going to store those tables in the database.

The third and last stage is the physical data model.
This is where everything gets ready before we create it in the database. So here you have to add all the technical details, like the data type of each column, the length of each data type, and many other database techniques and details. So again: the conceptual data model gives us the big picture, in the logical data model we dive into the details of what data we need, and the physical data model prepares everything for the implementation in the database. And to be honest, in my projects I only draw the conceptual and the logical data models, because drawing and building the physical data model needs a lot of effort and time, and there are many tools, like Databricks, that automatically generate those models. So in this project we're going to draw the logical data model for the gold layer.

All right. So now, for analytics, and especially for data warehousing and business intelligence, we need a special data model that is optimized for reporting and analytics, and it should be flexible, scalable, and easy to understand. For that we have two special data models. The first type of data model is the star schema. It has a central fact table in the middle, surrounded by dimensions. The fact table contains transactions and events, and the dimensions contain descriptive information. The relationships between the fact table in the middle and the dimensions around it form something like a star shape, and that's why we call it a star schema. And we have another data model called the snowflake schema. It looks very similar to the star schema. So again we have the fact in the middle, surrounded by dimensions. But the big difference is that we break the dimensions into smaller sub-dimensions. And as you extend the dimensions, the shape of this data model is going to look like a snowflake. So now, if you compare them side by side, you can see that the star schema looks easier, right?
So it is usually easy to understand and easy to query. It is really perfect for analysis. But it has one issue: the dimensions might contain duplicates, and your dimensions get bigger with time. Now, if you compare it to the snowflake, you can see the schema is more complex. So you need a lot of knowledge and effort in order to query something from the snowflake. But the main advantage here comes with the normalization: as you break those redundancies out into small tables, you can optimize the storage. But to be honest, who cares about the storage? So for this project, I have chosen to use the star schema because it is very commonly used. It is perfect for reporting, for example if you're using Power BI, and we don't have to worry about the storage. So that's why we're going to adopt this model to build our gold layer.
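To make the star schema shape concrete, here is a small sketch with one fact table in the middle and two dimensions around it. It is an illustration only, run on SQLite through Python's sqlite3 module rather than the course's SQL Server setup, and every table name, column name, and value in it is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions: descriptive information about customers and products.
    CREATE TABLE dim_customers (customer_key INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_products  (product_key  INTEGER PRIMARY KEY, category TEXT);

    -- Fact: keys pointing to the dimensions, a date, and a measure.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customers (customer_key),
        product_key  INTEGER REFERENCES dim_products (product_key),
        order_date   TEXT,
        sales_amount INTEGER
    );

    INSERT INTO dim_customers VALUES (1, 'Germany'), (2, 'France');
    INSERT INTO dim_products  VALUES (10, 'Bikes'), (20, 'Helmets');
    INSERT INTO fact_sales VALUES
        (1, 10, '2024-01-05', 500),
        (1, 20, '2024-01-06', 40),
        (2, 10, '2024-02-01', 700);
    """
)

# A typical report needs a single join from the fact to each dimension it uses.
sales_by_category = conn.execute(
    """
    SELECT p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales AS f
    LEFT JOIN dim_products AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
print(sales_by_category)
```

In a snowflake variant, the category would move out of `dim_products` into its own sub-dimension table, and the same report would need one more join.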
Okay. So now, one more thing about those data models is that they contain two types of tables: facts and dimensions. So what do I mean when I say this is a fact table or a dimension table? Well, the dimension contains descriptive information, or categories, that give some context to your data. For example, for product info you have the product name, category, subcategories, and so on. This is a table that describes the products, and we call this a dimension. On the other hand, we have facts. They are events, like transactions. They contain three important kinds of information. First, you have multiple IDs from multiple dimensions. Then we have date information, like when the transaction or the event happened. And the third type of information is measures and numbers. So if you see those three types of data in one table, then this is a fact. If you have a table that answers how much or how many, then this is a fact. But if you have a table that answers who, what, where, then this is a dimension table. So this is what dimension and fact tables are.

All right, my friends. So far, in the bronze layer and in the silver layer, we didn't discuss anything about the business. The bronze and silver were very technical. We were focusing on data ingestion and on cleaning up the quality of the data. But the tables are still very much oriented to the source systems. Now comes the fun part, in the gold layer, where we're going to go and break up the whole data model of the sources. So we're going to create something completely new for our business that is easy to consume for business reporting and analysis. And here it is very important to have a clear understanding of the business and the processes. If you don't know them already, at this phase you really have to invest time in meeting the process experts and the domain experts in order to have a clear understanding of what we are talking about in the data. So now what we're going to do is try to detect the business objects that are hidden in the source systems.
So now let's go and explore that.

All right. Now, in order to build a new data model, I first have to understand the original data model. What are the main business objects that we have? How are things related to each other? This is a very important process in building a new model. So what I usually do is start giving labels to all those tables. If you go to the shapes over here, let's go and search for label. And if we go to more icons, I'm going to go and take this label over here. So, drag and drop it. And then I'm going to increase the size of the font. So, let's go with 20 and bold. Just make it a little bit bigger.
So now, by looking at this data model, we can see that we have product information in the CRM and in the ERP as well. And then we have customer information and a transactional table. So now let's focus on the product. The product information is over here. Here we have the current and the historical product information, and here we have the categories that belong to the products. So in our data model we have something called products. So let's go and create this label. It's going to be product, and let's go to the style and give it a color. Let's pick, for example, the red one. Now let's move this label and put it beneath this table over here. And with that I have a label saying this table belongs to the object called products. Now I'm going to do the same thing for the other table over here. So I'm going to tag this table as product as well, so that I can easily see which tables from the sources have information about the product business object.

All right. Now, moving on, we have here a table called customer information. So we have a lot of information about the customer. We also have customer information in the ERP, where we have the birthday and the country. So those three tables have to do with the object customer. That means we're going to label them like that. So let's call it customer, and I'm going to pick a different color for that. Let's go with green. So I will tag this table like this, and the same thing for the other tables. So copy, tag the second table and the third table. Now it is very easy for me to see which table belongs to which business object. And now we have the final table over here, and only one table about the sales and orders. In the ERP we don't have any information about that. So this one is going to be easy.
Let's call it sales. And let's move it over here, and maybe change the color as well, for example to this color over here. Now, this step is very important when building any data model in the gold layer. It gives you a big picture of the things that you are going to model. So the next step is that we're going to go and build those objects step by step. Let's start with the first object, our customers. Here we have three tables, and we're going to start with the CRM. So let's start with this table over here.

All right. So with that, we know what our business objects are, and this task is done. Now, in the next step, we're going to go back to SQL and start doing data integration and building a completely new data model. So let's go and do that.

Now let's have a quick look at the gold layer specifications. So this is the final stage. We're going to provide data to be consumed by reporting and analytics. And this time we will not be building tables. We will be using views.
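A small sketch of what using views instead of tables means in practice: the gold object is just a stored query on top of silver, so new silver data shows up without any load step. This is an assumption-laden illustration that runs on SQLite through Python's sqlite3 module; the view, table, and column names are hypothetical and not the course's actual objects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE silver_customers (customer_id INTEGER, first_name TEXT);
    INSERT INTO silver_customers VALUES (1, 'Anna');

    -- The gold object is only a stored query with friendlier column names.
    CREATE VIEW gold_dim_customers AS
    SELECT customer_id AS customer_number, first_name
    FROM silver_customers;
    """
)

before = conn.execute("SELECT COUNT(*) FROM gold_dim_customers").fetchone()[0]

# New data arrives in silver; nothing is loaded into gold.
conn.execute("INSERT INTO silver_customers VALUES (2, 'Ben')")

after = conn.execute("SELECT COUNT(*) FROM gold_dim_customers").fetchone()[0]
print(before, after)
```

The trade-off is that the transformation runs every time the view is queried, instead of once at load time.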
So that means we will not have a stored procedure or any load process for the gold layer. All that we are doing is data transformation, and the focus of the data transformation is going to be data integration, aggregation, business logic, and so on. And this time we're going to introduce a new data model: we will be doing a star schema. So those are the specifications for the gold layer, and this is our scope.

So this time we make sure that we are selecting data from the silver layer, not from the bronze, because the bronze has bad data quality, and in the silver everything is prepared and cleaned up. In order to build the gold layer, we're going to be targeting the silver layer. So let's start with SELECT * FROM, and we're going to go to the silver CRM customer info. So let's go and hit execute. And now we're going to select the columns that we need to be presented in the gold layer. So let's start selecting the columns that we want. So we have the ID, the key, the first name. I will not go and get the metadata information. That belongs only to the silver. Perfect. The next step is that I'm going to give this table an alias. So let's call it CI. And I'm going to make sure that we are selecting from this alias, because later we're going to join this table with other tables. So, something like this. So we're going to go with those columns.

Now let's move to the second table. Let's go and get the birthday information. So now we're going to jump to the other system, and we have to join the data by the CID together with the customer key. So now we have to go and join the data with another table. And here I try to avoid using the inner join, because if the other table doesn't have all the information about the customers, I might lose customers. So always start with the master table, and if you join it with any other table in order to get information, always try to avoid the inner join, because the other source might not have all the customers, and if you do an inner join you might lose customers.
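The point about losing customers with an inner join can be checked with a tiny example. This is a sketch only, run on SQLite through Python's sqlite3 module; the two tables, their columns, and the three customers are invented for the illustration, with the CRM table playing the role of the master table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crm_customers (customer_key TEXT, first_name TEXT);  -- master
    CREATE TABLE erp_customers (cid TEXT, birth_date TEXT);           -- extra info

    INSERT INTO crm_customers VALUES ('C1', 'Anna'), ('C2', 'Ben'), ('C3', 'Cy');
    -- The second source knows only two of the three customers.
    INSERT INTO erp_customers VALUES ('C1', '1990-04-02'), ('C2', '1985-11-30');
    """
)

query = """
    SELECT ci.customer_key, ci.first_name, ca.birth_date
    FROM crm_customers AS ci
    {join_type} JOIN erp_customers AS ca ON ci.customer_key = ca.cid
    ORDER BY ci.customer_key
"""

inner_rows = conn.execute(query.format(join_type="INNER")).fetchall()
left_rows = conn.execute(query.format(join_type="LEFT")).fetchall()

print(len(inner_rows), "rows with INNER JOIN")
print(len(left_rows), "rows with LEFT JOIN, last one:", left_rows[-1])
```

With the left join, the customer missing from the second source is kept and simply gets a NULL birth date, which can then be handled explicitly.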
I tend to start from the master table, and then everything else is a left join. So I'm going to say LEFT JOIN silver ERP customer a12. So let's give it the alias CA. And now we have to join the tables. So it's going to be CI, from the first table, the customer key, equal to CA, and there we have the CID. Now, of course, we're going to get matching data, because we checked the silver layer. If we hadn't prepared the data in the silver layer, we would have to do a preparation step here in order to join the tables. But we don't have to do that, because that was a pre-step in the silver layer. So now you can see the system that we have in this bronze, silver, gold. So now, after joining the tables, we have to go and pick the information that we need from the second table, which is the birth date. So, bdate. And from this table there is another nice piece of information: the gender information. So that's all we need from the second table.

Let's go and check the third table. The third table is about the location information, the countries, and again we connect the tables by the CID with the key. So let's go and do that. We're going to say LEFT JOIN again, silver ERP location, and I'm going to give it the name LA, and then we have to join by the keys. The same thing: it's going to be CI customer key
thing it's going to be CI customer key equal to LA CI ID again we have prepared
equal to LA CI ID again we have prepared those ids and keys in the server layer
those ids and keys in the server layer so the join should be working now we
so the join should be working now we have to go and pick the data from the
have to go and pick the data from the second table so what do we have over
second table so what do we have over here we have the ID the country and the
here we have the ID the country and the metadata information so let's go and
metadata information so let's go and just get the country Perfect. So now
just get the country Perfect. So now with that we have joined all the three
with that we have joined all the three tables and we have picked all the
tables and we have picked all the columns that we want in this object. So
columns that we want in this object. So again by looking over here we have
again by looking over here we have joined this table with this one and this
joined this table with this one and this one. So with that we have collected all
one. So with that we have collected all the customer informations that we have
the customer informations that we have from the two source systems. Okay. So
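
The join described above can be sketched roughly as follows. This is only an illustration, not the exact query from the video: the CTEs stand in for the three silver-layer tables, and every table name, column name (cst_id, cst_key, cst_gndr, cid, bdate, gen, cntry) and sample row is an assumption made so that the example runs on its own.

```sql
-- Stand-ins for the silver-layer tables (names and rows are assumptions).
WITH crm_cust_info AS (
    SELECT 1 AS cst_id, 'AW001' AS cst_key, 'Female' AS cst_gndr
    UNION ALL SELECT 2, 'AW002', 'n/a'
    UNION ALL SELECT 3, 'AW003', 'Male'
),
erp_cust AS (
    SELECT 'AW001' AS cid, '1980-05-01' AS bdate, 'Female' AS gen
    UNION ALL SELECT 'AW002', '1975-11-20', 'Male'
),
erp_loc AS (
    SELECT 'AW001' AS cid, 'Germany' AS cntry
    UNION ALL SELECT 'AW003', 'France'
)
SELECT
    ci.cst_id,
    ci.cst_key,
    ci.cst_gndr,
    ca.bdate,
    ca.gen,
    la.cntry
FROM crm_cust_info AS ci                           -- master table: every customer is kept
LEFT JOIN erp_cust AS ca ON ci.cst_key = ca.cid    -- birth date and gender
LEFT JOIN erp_loc  AS la ON ci.cst_key = la.cid    -- country
ORDER BY ci.cst_id;
```

With these sample rows all three customers come back. Customer 3 has no row in erp_cust and customer 2 has no row in erp_loc, so those columns are NULL instead of the customer disappearing, which is what an inner join would have done.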
Okay. So now let's go and query in order to make sure that we have
everything correct. And in order to know that your joins are correct,
you have to keep your eye on those three columns. If you are seeing that
you are getting data, that means you are doing the joins correctly, but
if you are seeing a lot of nulls or no data at all, that means your
joins are incorrect. But now it looks to me like it is working. And
another check that I do: if your first table has no duplicates, what
could happen is that after doing multiple joins you might now start
getting duplicates, because the relationship between those tables is not
a clear one-to-one. You might get a one-to-many or a many-to-many
relationship. So the check that I usually do at this stage is to make
sure that I don't have duplicates in the results, so we don't have
multiple rows for the same customer. In order to do that, we go and do a
quick GROUP BY. So we're going to group the data by the customer ID and
then do the count from this subquery. So this is the whole subquery, and
then after that we're going to say GROUP BY the customer ID, and then we
say HAVING COUNT higher than one. So this query actually tries to find
out whether we have any duplicates in the primary key. So let's go and
execute it. We don't have any duplicates, and that means after joining
all those tables with the customer info, those tables didn't cause any
issues and didn't duplicate my data. So this is a very important check
to make sure that you are on the right track.
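
A minimal sketch of that duplicate check is shown below. The table and column names are assumptions, and the sample data deliberately contains a duplicated location row (which the data in the video does not) so that the check has something to report.

```sql
-- Stand-in tables; erp_loc intentionally has two rows for 'AW002'.
WITH crm_cust_info AS (
    SELECT 1 AS cst_id, 'AW001' AS cst_key
    UNION ALL SELECT 2, 'AW002'
),
erp_loc AS (
    SELECT 'AW001' AS cid, 'Germany' AS cntry
    UNION ALL SELECT 'AW002', 'France'
    UNION ALL SELECT 'AW002', 'Canada'
)
SELECT
    t.cst_id,
    COUNT(*) AS row_count
FROM (
    -- the joined query under test goes here as a subquery
    SELECT ci.cst_id, la.cntry
    FROM crm_cust_info AS ci
    LEFT JOIN erp_loc AS la ON ci.cst_key = la.cid
) AS t
GROUP BY t.cst_id
HAVING COUNT(*) > 1;   -- any row returned means the join multiplied a customer
```

Here the query returns customer 2 with a count of 2. An empty result, as in the video, means the joins kept one row per customer.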
All right. So that means everything is fine about the duplicates; we
don't have to worry about it. Now we have here an integration issue. So
let's go and execute it again. And now if you look at the data, we have
two sources for the gender information: one comes from the CRM and
another one comes from the ERP. So now the question is, what are we
going to do with this? Well, we have to do data integration. So let me
show you how I do it. First I go and have a new query, and then I'm
going to remove all the other stuff and leave only those two columns,
and use DISTINCT just to focus on the integration. Let's go and execute
it, and maybe as well do an ORDER BY, so let's do one and two. Let's go
and execute it again. So now here we have all the scenarios, and we can
see that sometimes there is a match: from the first table we have
Female, and from the other table we have Female as well. But sometimes
we have an issue, like those two tables are giving different
information, and the same thing over here, so this is as well an issue,
different information. Another scenario is where we have data from the
first table, like here we have Female, but in the other table we have
not available. Well, this is not a problem, so we can get it from the
first table. But we have as well the exact opposite scenario, where from
the first table the data is not available but it is available from the
second table. And now here you might wonder why I'm getting a null over
here. We did handle all the missing data in the silver layer, and we
replaced everything with not available. So why are we still getting a
null? This null doesn't come directly from the tables; it comes because
of joining tables. So that means there are customers in the CRM table
that are not available in the ERP table, and if there is no match, we
will get a null from SQL. So this null means there was no match, and
that's why we are getting this null. It is not coming from the content
of the tables, and this is of course an issue.
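
The exploration query described here might look something like this. The names and the 'n/a' literal for "not available" are assumptions, and the sample rows are made up to cover each scenario, including a CRM customer with no ERP match.

```sql
WITH crm_cust_info AS (
    SELECT 'AW001' AS cst_key, 'Female' AS cst_gndr
    UNION ALL SELECT 'AW002', 'Female'
    UNION ALL SELECT 'AW003', 'Male'
    UNION ALL SELECT 'AW004', 'n/a'
    UNION ALL SELECT 'AW005', 'n/a'    -- no matching ERP row
),
erp_cust AS (
    SELECT 'AW001' AS cid, 'Female' AS gen   -- both sources agree
    UNION ALL SELECT 'AW002', 'Male'         -- sources disagree
    UNION ALL SELECT 'AW003', 'n/a'          -- only CRM has a value
    UNION ALL SELECT 'AW004', 'Male'         -- only ERP has a value
)
SELECT DISTINCT
    ci.cst_gndr,
    ca.gen
FROM crm_cust_info AS ci
LEFT JOIN erp_cust AS ca ON ci.cst_key = ca.cid
ORDER BY 1, 2;   -- order by the first, then the second output column
```

The pair for customer AW005 shows 'n/a' next to NULL. Neither table stores a NULL gender; the NULL appears only because the left join found no ERP row for that customer.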
But now the big issue is what happens in those two scenarios where we
have the data but they are different. And here again we have to ask the
experts about it: what is the master here? Is it the CRM system or the
ERP? And let's say their answer is that the master data for the customer
information is the CRM. So that means the CRM information is more
accurate than the ERP information, and this is only about the customers,
of course. So for this scenario where we have Female and Male, the
correct information is the Female from the first source system. The same
goes over here: here we have Male and Female, and the correct one is the
Male, because this source system is the master. Okay. So now let's go
and build this business rule. We're going to start as usual with the
CASE WHEN. So the first very important rule is: if we have data in the
gender information from the CRM system, from the master, then go and use
it. So we're going to check the gender information from the CRM table:
customer gender is not equal to not available. So that means we have a
value, Male or Female. Let me just have here a comma like this. Then
what's going to happen? Go and use it. So we're going to use the value
from the master: CRM is the master for gender info. Now otherwise, that
means it is not available from the CRM table, so go and grab the
information from the second table. So we're going to say ca gender. But
now we have to be careful with this null over here; we have to convert
it to not available as well. So we're going to use the COALESCE: if this
is a null, then go and use not available, like this. So that's it. Let's
have an END, and let me just push this over here. So let's go and call
it new_gen for now. Let's go and execute it, and let's check the
different scenarios. For all those values over here we have data from
the CRM system, and this is as well represented in the new column. But
now for the second part we don't have data from the first system, so we
are trying to get it from the second system. For the first one, it is
not available, and then we try to get it from the second source system,
so now we are activating the ELSE. Well, it is null, and with that the
COALESCE is activated and we are replacing the null with not available.
For the second scenario as well, the first source system doesn't have
the gender information; that's why we are grabbing it from the second,
so with that we have Female. And then the third one, the same thing: we
don't have information, but we get it from the second source system, so
we have Male. And the last one: it is not available in both source
systems, and that's why we are getting not available.
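
Put together, the business rule could be sketched like this. The column names, the new_gen alias spelling and the 'n/a' literal are assumptions; only the shape of the rule (CRM first, then ERP, then a default for unmatched rows) comes from the explanation above.

```sql
WITH crm_cust_info AS (
    SELECT 'AW001' AS cst_key, 'Female' AS cst_gndr
    UNION ALL SELECT 'AW002', 'Female'
    UNION ALL SELECT 'AW003', 'n/a'
    UNION ALL SELECT 'AW004', 'n/a'
    UNION ALL SELECT 'AW005', 'n/a'    -- no matching ERP row
),
erp_cust AS (
    SELECT 'AW001' AS cid, 'Female' AS gen
    UNION ALL SELECT 'AW002', 'Male'   -- conflicts with the CRM value
    UNION ALL SELECT 'AW003', 'Female'
    UNION ALL SELECT 'AW004', 'n/a'
)
SELECT DISTINCT
    ci.cst_gndr,
    ca.gen,
    CASE
        WHEN ci.cst_gndr <> 'n/a' THEN ci.cst_gndr   -- CRM is the master for gender info
        ELSE COALESCE(ca.gen, 'n/a')                 -- fall back to ERP; no match becomes 'n/a'
    END AS new_gen
FROM crm_cust_info AS ci
LEFT JOIN erp_cust AS ca ON ci.cst_key = ca.cid
ORDER BY 1, 2;
```

With this sample data the conflicting pair (Female, Male) resolves to Female because the CRM wins, the pair ('n/a', Female) resolves to Female from the ERP, and both ('n/a', 'n/a') and ('n/a', NULL) end up as 'n/a'.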
So with that, as you can see, we have a perfect new column where we are
integrating two different source systems into one. And this is exactly
what we call data integration. This piece of information is way better
than the source CRM and as well the source ERP. It is richer and has
more information. And this is exactly why we try to get data from
different source systems, in order to get rich information in the data
warehouse. So with that we have a nice logic, and as you can see it's
way easier to separate it into a separate query in order first to build
the logic and then take it to the original query. So what I'm going to
do: I'm just going to copy everything from here and go back to our
query. I'm going to delete those two gender columns and put our new
logic over here. So a comma, and let's go and execute. So with that we
have our nice new column. Now with that we have a very nice object. We
don't have duplicates, and we have integrated the data together. So we
took three tables and we put them in one object. Now the next step is
that we're going to give nice, friendly names. The rule in the gold
layer is to use friendly names and not to follow the names that we get
from the source system, and we have to make sure that we are following
the rules of the naming conventions. So we are following snake case. So
let's go and do it step by step. For the first one, let's call it the
customer ID. And then for the next one I will get rid of using keys and
so on; I'm going to call it customer number, because those are customer
numbers. Then for the next one, we're going to call it first name,
without using any prefixes. And the next one last name, and we have here
marital status, so I will be using the exact name but without the
prefix. And here we're just going to call it gender. And this one we're
going to call create date, and this one birth date. And the last one is
going to be the country. So let's go and execute it. Now as you can see
the names are really friendly. So we have customer ID, customer number,
first name, last name, marital status, gender. So as you can see the
names are really nice and really easy to understand. Now the next step,
I'm going to think about the order of those columns. So the first two,
it makes sense to have them together. The first name, last name, then I
think the country is very important information, so I'm going to get it
from here and put it exactly after the last name; it's just nicer. So
let's go and execute it again. So the first name, last name, country.
It's always nice to group relevant columns together, right? So we have
here the status, the gender and so on. And then we have the create date
and the birth date. I think I'm going to switch the birth date with the
create date; it's more important than the create date, like this. And
here, don't forget the comma. So execute again. So it looks wonderful.
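
After the renaming and reordering, the select list might look roughly like the sketch below. The source column names and the exact friendly names (for example birthdate versus birth_date) are assumptions; the snake_case convention and the column order follow the description above.

```sql
WITH crm_cust_info AS (
    SELECT 11000 AS cst_id, 'AW00011000' AS cst_key,
           'Jon' AS cst_firstname, 'Yang' AS cst_lastname,
           'Married' AS cst_marital_status, 'n/a' AS cst_gndr,
           '2025-10-06' AS cst_create_date
),
erp_cust AS (
    SELECT 'AW00011000' AS cid, '1971-10-06' AS bdate, 'Male' AS gen
),
erp_loc AS (
    SELECT 'AW00011000' AS cid, 'Australia' AS cntry
)
SELECT
    ci.cst_id             AS customer_id,
    ci.cst_key            AS customer_number,
    ci.cst_firstname      AS first_name,
    ci.cst_lastname       AS last_name,
    la.cntry              AS country,          -- moved up, right after the names
    ci.cst_marital_status AS marital_status,
    CASE
        WHEN ci.cst_gndr <> 'n/a' THEN ci.cst_gndr   -- CRM is the master for gender info
        ELSE COALESCE(ca.gen, 'n/a')
    END                   AS gender,
    ca.bdate              AS birthdate,        -- placed before the create date
    ci.cst_create_date    AS create_date
FROM crm_cust_info AS ci
LEFT JOIN erp_cust AS ca ON ci.cst_key = ca.cid
LEFT JOIN erp_loc  AS la ON ci.cst_key = la.cid;
```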
Now comes a very important decision about this object: is it a fact
table or a dimension? Well, as we learned, dimensions hold descriptive
information about an object. And as you can see, we have here
descriptions of the customers. So all those columns are describing the
customer information. We don't have here transactions and events, and we
don't have measures and so on. So we cannot say this object is a fact;
it is clearly a dimension. So that's why we're going to call this object
the dimension customer. Now there is one thing: if you are creating a
new dimension, you always need a primary key for the dimension. Of
course we can go over here and depend on the primary key that we get
from the source system, but sometimes you can have dimensions where you
don't have a primary key that you can count on. So what we have to do is
generate a new primary key in the data warehouse. And those primary keys
we call surrogate keys. Surrogate keys are system-generated unique
identifiers that are assigned to each record to make the record unique.
It is not a business key. It has no meaning, and no one in the business
knows about it. We only use it in order to connect our data model. And
in this way we have more control over how to connect our data model, and
we don't have to depend always on the source system. And there are
different ways to generate surrogate keys, like defining it in the DDL
or maybe using the window function ROW_NUMBER. In this data warehouse
I'm going to go with a simple solution where we use the window function.
So now in order to generate a surrogate key for this dimension, what
we're going to do is very simple. So we're going to say ROW_NUMBER OVER,
and here we have to order by something. You can order by the create date
or the customer ID or the customer number, whatever you want, but in
this example I'm going to order by the customer ID. And we have to
follow the naming convention that all surrogate keys end with key as a
suffix. So now let's go and query this. And as you can see, at the start
we have a customer key, and this is a sequence. We don't have here of
course any duplicates. And now this surrogate key is generated in the
data warehouse, and we're going to use this key in order to connect the
data model.
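
As a small illustration of the surrogate key, here is a sketch using ROW_NUMBER over made-up rows. The column names and values are assumptions; the source IDs are deliberately not consecutive so that the generated sequence is easy to tell apart from them.

```sql
WITH crm_cust_info AS (
    SELECT 11000 AS cst_id, 'AW00011000' AS cst_key
    UNION ALL SELECT 11002, 'AW00011002'
    UNION ALL SELECT 11007, 'AW00011007'
)
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,     -- surrogate key, "_key" suffix
    ci.cst_id                              AS customer_id,      -- key from the source system
    ci.cst_key                             AS customer_number
FROM crm_cust_info AS ci
ORDER BY customer_key;
```

The three rows get customer_key 1, 2 and 3 regardless of the gaps in customer_id. One design point to keep in mind: because the numbers are derived from the ordering at query time, they are only stable as long as the underlying rows and their order do not change.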
So now with that our query is ready, and the last step is that we're
going to create the object. And as we decided, all the objects in the
gold layer are going to be virtual ones. So that means we're going to
create a view. So we're going to say CREATE VIEW gold dot dim, following
the naming convention, which stands for dimension, and we're going to
have customers, and then after that we have AS. So with that everything
is ready. Let's go and execute it. It was successful. Let's go to the
views now, and you can see our first object. So we have the dimension
customers in the gold layer.
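
Wrapping the query in a view could look like the sketch below. It assumes that a schema named gold already exists, and it uses an inline CTE with made-up rows in place of the real silver tables so that the statement does not depend on anything else. The view name gold.dim_customers follows the naming spoken in the video, while the column names are assumptions.

```sql
-- Assumes the schema "gold" already exists; the CTEs stand in for the silver tables.
CREATE VIEW gold.dim_customers AS
WITH crm_cust_info AS (
    SELECT 11000 AS cst_id, 'AW00011000' AS cst_key, 'n/a' AS cst_gndr
    UNION ALL SELECT 11001, 'AW00011001', 'Female'
),
erp_cust AS (
    SELECT 'AW00011000' AS cid, 'Male' AS gen
)
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,   -- surrogate key
    ci.cst_id                              AS customer_id,
    ci.cst_key                             AS customer_number,
    CASE
        WHEN ci.cst_gndr <> 'n/a' THEN ci.cst_gndr   -- CRM is the master for gender info
        ELSE COALESCE(ca.gen, 'n/a')
    END                                    AS gender
FROM crm_cust_info AS ci
LEFT JOIN erp_cust AS ca ON ci.cst_key = ca.cid;
```

Because the object is a view, nothing is stored; the logic runs every time the view is queried. A quick quality check afterwards, run as a separate statement, would be something like `SELECT DISTINCT gender FROM gold.dim_customers;`.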
Now, as you know me, the next step is that we're going to check the
quality of this new object. So let's go and have a new query: SELECT
star from our view dim customers. And now we have to make sure that
everything is in the right position, like this. And now we can do
different checks, like the uniqueness and so on. But I'm worried about
the gender information, so let's go and have a DISTINCT of all values.
So as you can see it is working perfectly. We have only Female, Male and
not available. So that's it. With that, we have our first new dimension.

Okay friends. So now let's go and build the second object. We have the
products. So as you can see, product information is available in both
source systems. As usual, we're going to start with the CRM information,
and then we're going to join it with the other table in order to get the
category information. So those are the columns that we want from this
table. Now we come here to a big decision about this object. This object
contains historical information and as well the current information. Now
of course it depends on the requirements whether you have to do analysis
on the historical information. But if you don't have such a requirement,
we can stay with only the current information of the products, so we
don't have to include all the history in the object. And anyway, as we
learned from the model over here, we are not using the primary key, we
are using the product key. So now what we have to do is filter out the
historical data and stay only with the current data. So we're going to
have here a WHERE condition. And now in order to select the current
data, we're going to target the end date. If the end date is null, that
means it is current data. Let's take this example over here. So you can
see here we have three records for the same product key, and for the
first two records we have a value in the end date, because it is
historical information. But for the last record over here we have it as
a null, and that's because this is the current information; it is open
and not closed yet. So in order to select only the current information,
it is very simple: we can say prd end date IS NULL. So if you go now and
execute it, you will get only the current products; you will not have
any history. And of course we can add a comment to it: filter out all
historical data. And this means of course we don't need the end date in
our selection, because it is always a null. So with that we have only
the current data.
current data. Now the next step is that we have to go and join it with the
we have to go and join it with the product categories from the ERP. And
product categories from the ERP. And we're going to use here the ID. So as
we're going to use here the ID. So as usual the master information is the CRM
usual the master information is the CRM and everything else going to be
and everything else going to be secondary. That's why I use the lift
secondary. That's why I use the lift join just to make sure I'm not losing
join just to make sure I'm not losing I'm not filtering any data because if
I'm not filtering any data because if there is no match then we lose data. So
there is no match then we lose data. So lift join silver ERP and the category.
lift join silver ERP and the category. So let's call it PC. And now what we're
So let's call it PC. And now what we're going to do we're going to go and join
going to do we're going to go and join it using the key. So en from the CRM we
it using the key. So en from the CRM we have the category ID equal to PC ID. And
have the category ID equal to PC ID. And now we have to go and pick columns from
now we have to go and pick columns from the second table. So it's going to be
the second table. So it's going to be the PC. We have the category very
the PC. We have the category very important PC. We have the
important PC. We have the subcategory and we can go and get the
subcategory and we can go and get the maintenance. So something like this.
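The two steps described above (keep only rows with a null end date, then left join the ERP categories) can be sketched as follows. This is a minimal, runnable illustration using Python's built-in sqlite3 module rather than the SQL Server instance used in the course. The table names, most column names, and the sample rows are assumptions made for the example, and the schema prefix is flattened into the table name (silver_crm_prd_info instead of a schema-qualified name).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical silver-layer tables; names and columns are assumptions.
CREATE TABLE silver_crm_prd_info (
    prd_id INTEGER, cat_id TEXT, prd_key TEXT, prd_nm TEXT,
    prd_start_dt TEXT, prd_end_dt TEXT
);
CREATE TABLE silver_erp_px_cat (id TEXT, cat TEXT, subcat TEXT, maintenance TEXT);

-- Three versions of the same product: two closed (historical), one open (current).
INSERT INTO silver_crm_prd_info VALUES
    (1, 'AC_HE', 'HL-U509', 'Sport Helmet', '2011-07-01', '2012-06-30'),
    (2, 'AC_HE', 'HL-U509', 'Sport Helmet', '2012-07-01', '2013-06-30'),
    (3, 'AC_HE', 'HL-U509', 'Sport Helmet', '2013-07-01', NULL),
    (4, 'CO_PE', 'PD-M100', 'Pedal',        '2013-07-01', NULL);

-- Only one category exists in the ERP table, so product 4 has no match.
INSERT INTO silver_erp_px_cat VALUES ('AC_HE', 'Accessories', 'Helmets', 'Yes');
""")

rows = conn.execute("""
SELECT pn.prd_id, pn.prd_key, pn.prd_nm, pn.cat_id,
       pc.cat, pc.subcat, pc.maintenance
FROM silver_crm_prd_info AS pn
LEFT JOIN silver_erp_px_cat AS pc
       ON pn.cat_id = pc.id
WHERE pn.prd_end_dt IS NULL   -- filter out all historical data
ORDER BY pn.prd_id
""").fetchall()

for row in rows:
    print(row)
```

With this sample data only the two open records survive the filter, and the product without a matching category is still returned, with nulls in the category columns. That is the reason for choosing a left join over an inner join here.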
Let's go and query. And with that, all those columns come from the first table
and those three come from the second. So with that we have collected all the
product information from the two source systems. Now the next step is to check
the quality of these results. And of course what is very important is to check
the uniqueness. So we're going to write the following query. I want to make
sure that the product key is unique, because we're going to use it later in
order to join the table with the sales. So FROM, and then we need a GROUP BY on
the product key, and we say HAVING a count higher than one. So let's go and
check. Perfect. We don't have any duplicates. The second table didn't cause any
duplicates in our join. This also means we don't have historical data: each
product is only one record. So I'm really happy about that. So let's go and
query again. Now, of course, the next question: do we have anything to
integrate together? Do we have the same information twice? Well, we don't have
that. The next step is to group the relevant information together. So I'm going
to say the product ID, then the product key, and the product name belong
together. And after that we can put all the category information together. So
we're going to have the category ID, the category itself, and the subcategory.
Let me just query and see the results. So we have the product ID, key, and
name, and then we have the category ID, name, and the subcategory. And then
maybe we put the maintenance after the subcategory as well, like this. And I
think the product cost, the line, and the start date could stay at the end. So
let me just check. So those four columns are about the category, and then we
have the cost, line, and the start date. I'm really happy with that. In the
next step we're going to give nice, friendly names to those columns. So let's
start with the first one. This is the product ID.
The next one is going to be the product number. We need the name "key" for the
surrogate key later. And then we have the product name. After that we have the
category ID and the category. And this is the subcategory. The next one is
going to stay as it is. I don't have to rename it. The next ones are going to
be the cost and the product line, and the last one is going to be the start
date. So let's go and execute it. Now we can see very nicely in the output all
those friendly names for the columns, and it looks way nicer than before. I
don't even have to describe the columns. The names describe them. So perfect.
Now the next big decision is: what do we have here? Do we have a fact or a
dimension? What do you think? Well, as you can see here again, we have a lot of
descriptions about the products. All this information is describing the
business object "products". We don't have transactions, events, or a lot of
different keys and IDs here. So we don't really have facts here. We have a
dimension. Each row describes exactly one object, one product. That's why this
is a dimension. Okay. So since this is a dimension, we have to create a primary
key for it. Well, actually the surrogate key. And as we did for the customers,
we're going to use the window function ROW_NUMBER in order to generate it. So
OVER, and then we have to sort the data. I will go with the start date, and the
product key as well, and we're going to give it the name product_key, like
this. So let's go and execute it. With that, we have now generated a primary
key for each product, and we're going to be using it in order to connect our
data model. All right.
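The uniqueness check and the surrogate key generation described above can be sketched like this. It is a minimal example using Python's built-in sqlite3 module, assuming a SQLite version with window function support (3.25 or later). The table current_products, its columns, and its rows are hypothetical stand-ins for the filtered and joined product query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for the current (non-historical) product rows.
CREATE TABLE current_products (product_number TEXT, product_name TEXT, start_date TEXT);
INSERT INTO current_products VALUES
    ('PD-M100', 'Pedal',        '2013-07-01'),
    ('HL-U509', 'Sport Helmet', '2013-07-01'),
    ('FR-R92B', 'Road Frame',   '2011-07-01');
""")

# Quality check: any product number that appears more than once is a duplicate.
duplicates = conn.execute("""
SELECT product_number, COUNT(*) AS occurrences
FROM current_products
GROUP BY product_number
HAVING COUNT(*) > 1
""").fetchall()

# Surrogate key: a running number over the rows, sorted by start date
# and then by the source product number.
dim_rows = conn.execute("""
SELECT ROW_NUMBER() OVER (ORDER BY start_date, product_number) AS product_key,
       product_number, product_name, start_date
FROM current_products
ORDER BY product_key
""").fetchall()

print("duplicates:", duplicates)
for row in dim_rows:
    print(row)
```

One design note: a key generated this way inside a view is recomputed on every query, so it is only stable as long as the underlying rows and their sort order do not change.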
Now, in the next step, we're going to build the view. So we're going to say
CREATE VIEW, then gold.dim_products, and then AS. So let's go and create our
object. And now, if you refresh the views, you will see our second object, the
second dimension. So here in the gold layer we have the dimension products. And
as usual, we're going to have a look at this view just to make sure that
everything is fine. So dim_products. Let's execute it. And looking at the data,
everything looks nice. So with that we now have two dimensions.

All right, friends. So with that we have covered a lot of stuff. We have
covered the customers and the products, and we are left with only one table,
where we have the transactions, the sales. And for the sales information we
only have data from the CRM. We don't have anything from the ERP. So let's go
and build it. Okay. So now I have all this information, and of course we have
only one table, so we don't have to do any integration and so on. And now we
have to answer the big question: do we have a dimension or a fact here? Well,
looking at the details, we can see transactions. We can see events. We have a
lot of dates. We also have a lot of measures and metrics, and we have a lot of
IDs, so it is connecting multiple dimensions. And this is exactly a perfect
setup for a fact.
So we're going to use this information as a fact. And of course, as we learned,
a fact connects multiple dimensions, so we have to present in this fact the
surrogate keys that come from the dimensions. Those two columns, the product
key and the customer ID, come from the source system, and as we learned, we
want to connect our data model using the surrogate keys. So what we're going to
do is replace those two columns with the surrogate keys that we have generated.
In order to do that, we now have to join the two dimensions to get the
surrogate keys, and we call this process a data lookup: we are joining the
tables only to get one piece of information. So let's go and do that. We will
go with a left join, of course, so we don't lose any transaction. First we're
going to join on the product key. Now of course in the silver layer we don't
have any surrogate keys. We have them in the gold layer. So that means for the
fact table we're going to be joining the silver layer together with the gold
layer. So gold, dot, and then the dimension products, and I'm going to just
call it pr. And we're going to join sd using the product key together with the
product number from the dimension. Now the only information that we need from
the dimension is the key, the surrogate key. So we're going to go over here and
say product key. And I'm going to remove this column from here, because we
don't need it. We don't need the original product key from the source system.
We need the surrogate key that we have generated on our own in this data
warehouse. The same thing is going to happen for the customer. So gold
dimension customers. Again, we are doing a lookup here in order to get the
information. So ON: from sd we join using this ID over here, equal to the
customer ID, because this is a customer ID. And we do the same thing: we need
the surrogate key, the customer key, and we're going to delete the ID because
we don't need it. Now we have the surrogate key. So now let's go and execute
it. And with that we have in our fact table the two keys from the dimensions.
This helps us connect the data model, to connect the facts with the dimensions.
So this is a very necessary step in building the fact table: you have to put
the surrogate keys from the dimensions in the facts.
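The data lookup described above can be sketched as two left joins that swap the source-system keys for the surrogate keys. This is a small runnable illustration with Python's built-in sqlite3 module. The table names, the sls_ column names, and the sample rows are assumptions for the example, and the gold and silver schemas are flattened into name prefixes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical gold dimensions: surrogate key next to the source-system key.
CREATE TABLE gold_dim_products  (product_key INTEGER, product_number TEXT);
CREATE TABLE gold_dim_customers (customer_key INTEGER, customer_id INTEGER);

-- Hypothetical silver sales table carrying only source-system keys.
CREATE TABLE silver_crm_sales_details (
    sls_ord_num TEXT, sls_prd_key TEXT, sls_cust_id INTEGER, sls_sales INTEGER
);

INSERT INTO gold_dim_products  VALUES (1, 'FR-R92B'), (2, 'HL-U509');
INSERT INTO gold_dim_customers VALUES (1, 11000), (2, 11001);
INSERT INTO silver_crm_sales_details VALUES
    ('SO1001', 'HL-U509', 11001, 70),
    ('SO1002', 'FR-R92B', 11000, 1400);
""")

fact_rows = conn.execute("""
SELECT sd.sls_ord_num,
       pr.product_key,    -- surrogate key replaces sd.sls_prd_key
       cu.customer_key,   -- surrogate key replaces sd.sls_cust_id
       sd.sls_sales
FROM silver_crm_sales_details AS sd
LEFT JOIN gold_dim_products  AS pr ON sd.sls_prd_key = pr.product_number
LEFT JOIN gold_dim_customers AS cu ON sd.sls_cust_id = cu.customer_id
ORDER BY sd.sls_ord_num
""").fetchall()

for row in fact_rows:
    print(row)
```

The source keys appear only in the join conditions. The select list exposes just the surrogate keys, which is what lets the fact connect to the dimensions later.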
So that was actually the hardest part of building the fact. Now in the next
step, all you have to do is give friendly names. So we're going to go over here
and say order number. The surrogate keys are already friendly. Then we're going
to go over here and say this is the order date. The next one is going to be the
shipping date, and then the next one the due date. For the sales I'm going to
say sales amount, then the quantity, and the final one is the price. So now
let's go and execute it and look at the results. As you can see, the columns
look very friendly. And for the order of the columns we use the following
scheme: first in the fact table we have all the surrogate keys from the
dimensions, second we have all the dates, and at the end of the fact you group
all the measures and the metrics. So that's it for the query for the fact. Now
we can go and build it. So we're going to say CREATE VIEW gold, in the gold
layer, and this time we're going to use the fact_ prefix and call it sales, and
then don't forget about the AS. So that's it. Let's go and create it. Perfect.
Now we can see the fact. So with that we have three objects in the gold layer:
two dimensions and one fact. And now of course in the next step we're going to
check the quality of the view. So let's have a simple SELECT from fact_sales.
Let's execute it. Checking the result, you can see it is exactly like the
result from the query, and everything looks nice. Okay.
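Putting the renaming, the column order, and the view creation together, the fact view could look roughly like the sketch below. It runs on Python's built-in sqlite3 module. The sls_ source column names and the sample row are assumptions, and the view is named gold_fact_sales here as a flattened stand-in for a view called fact_sales in the gold schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source and dimension tables for the sketch.
CREATE TABLE silver_crm_sales_details (
    sls_ord_num TEXT, sls_prd_key TEXT, sls_cust_id INTEGER,
    sls_order_dt TEXT, sls_ship_dt TEXT, sls_due_dt TEXT,
    sls_sales INTEGER, sls_quantity INTEGER, sls_price INTEGER
);
CREATE TABLE gold_dim_products  (product_key INTEGER, product_number TEXT);
CREATE TABLE gold_dim_customers (customer_key INTEGER, customer_id INTEGER);

INSERT INTO gold_dim_products  VALUES (1, 'HL-U509');
INSERT INTO gold_dim_customers VALUES (1, 11001);
INSERT INTO silver_crm_sales_details VALUES
    ('SO1001', 'HL-U509', 11001, '2013-01-05', '2013-01-12', '2013-01-17', 70, 2, 35);

-- Order number and surrogate keys first, then dates, then measures.
CREATE VIEW gold_fact_sales AS
SELECT sd.sls_ord_num  AS order_number,
       pr.product_key,
       cu.customer_key,
       sd.sls_order_dt AS order_date,
       sd.sls_ship_dt  AS shipping_date,
       sd.sls_due_dt   AS due_date,
       sd.sls_sales    AS sales_amount,
       sd.sls_quantity AS quantity,
       sd.sls_price    AS price
FROM silver_crm_sales_details AS sd
LEFT JOIN gold_dim_products  AS pr ON sd.sls_prd_key = pr.product_number
LEFT JOIN gold_dim_customers AS cu ON sd.sls_cust_id = cu.customer_id;
""")

# Quality check: a simple SELECT against the new view.
cursor = conn.execute("SELECT * FROM gold_fact_sales")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()

print(columns)
print(rows)
```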
So now one more trick that I usually do after building a fact is to try to
connect the whole data model in order to find any issues. So let's go and do
that. We will do just a simple left join with the dimensions. So gold dimension
customers, c, and we will use the keys, and then we're going to say WHERE the
customer key is null, so where there is no match. So let's go and execute it.
And as you can see in the results, we are not getting anything. That means
everything is matching perfectly. We can do the same thing with the products as
well. So LEFT JOIN gold, then products, p, ON the product key, and we connect
it with the fact product key, and then we check the product key from the
dimension, like this. So we are checking whether we can connect the fact
together with the dimension products. Let's go and check. And as you can see,
here too we are not getting anything, and this is all right.
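The integrity check described above can be sketched as follows: left join the fact to each dimension and look for rows where the dimension side is null. The example uses Python's built-in sqlite3 module, and all table names and rows are made up for illustration. It combines both dimension checks in one query with OR, which is a small variation on running them one after the other, and it adds a deliberately broken row to show what a failing check looks like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical gold-layer objects, modelled as plain tables for the sketch.
CREATE TABLE gold_dim_customers (customer_key INTEGER, customer_id INTEGER);
CREATE TABLE gold_dim_products  (product_key INTEGER, product_number TEXT);
CREATE TABLE gold_fact_sales    (order_number TEXT, product_key INTEGER, customer_key INTEGER);

INSERT INTO gold_dim_customers VALUES (1, 11000), (2, 11001);
INSERT INTO gold_dim_products  VALUES (1, 'FR-R92B'), (2, 'HL-U509');
INSERT INTO gold_fact_sales    VALUES ('SO1001', 2, 2), ('SO1002', 1, 1);
""")

# Fact rows whose keys find no partner in a dimension come back with NULLs
# on the dimension side of the left join.
orphan_query = """
SELECT f.order_number
FROM gold_fact_sales AS f
LEFT JOIN gold_dim_customers AS c ON c.customer_key = f.customer_key
LEFT JOIN gold_dim_products  AS p ON p.product_key  = f.product_key
WHERE c.customer_key IS NULL OR p.product_key IS NULL
"""

clean_result = conn.execute(orphan_query).fetchall()
print("before:", clean_result)

# Add a fact row pointing at a product key that does not exist in the dimension.
conn.execute("INSERT INTO gold_fact_sales VALUES ('SO1003', 99, 1)")
broken_result = conn.execute(orphan_query).fetchall()
print("after:", broken_result)
```

An empty result means every fact row connects to both dimensions. Any returned order numbers point at lookups that failed when the fact was built.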
So with that we now have SQL code that is tested and that creates the gold
layer. Now in the next step, as you know from our requirements, we have to make
clear documentation for the end users so they can use our data model. So let's
go and draw a data model of the star schema.

So let's go and draw our data model. Let's search for a table. And now I'm
going to take this one, where I can say what is the primary key and what is the
foreign key. I'm going to change the design a little bit, so it's going to be
rounded. And let's say I'm going to change it to this color. And maybe go to
the size and make it 16. Then I'm going to select all the columns and make them
16 as well, just to increase the size. And then go to Arrange, and we can
increase it to 39. So now let's zoom in a little bit. For the first table,
let's call it gold dimension customers and make it a little bit bigger, like
this. And now we're going to define the primary key here. It is the customer
key. And what else are we going to do? We're going to list all the columns in
the dimension. It is a little bit annoying, but the result is going to be
awesome. So what do we have? The customer ID. We have the customer number, and
then we have the first name. Now in case you want new rows, you can hold
Control and Enter and add the other columns. So now pause the video and go and
create the two dimensions, the customers and the products, and add all the
columns that you have built in the view.

[Music]

Welcome back. So now I have those two dimensions. The third one is going to be
the fact table. For the fact table I'm going to go with a different color, for
example the blue, and I'm going to put it in the middle. Something like this.
So we're going to say gold fact sales, and here we don't have a primary key, so
we're going to delete it. And I have to add all the columns of the fact. So
order number, product key, customer key. Okay. All right. Perfect. Now we can
add the foreign key information. The product key is a foreign key for the
products, so we're going to say FK1. And the customer key is going to be the
foreign key for the customers, so FK2. And of course you can go and increase
the spacing for that.
Okay. So now that we have the tables, the next step in data modeling is to
describe the relationships between these tables. This is of course very
important for reporting and analytics, in order to understand how to use the
data model. And we have different types of relationships. We have one to one,
one to many. And in the star schema data model, the relationship between the
dimension and the fact is one to many. That's because in the customers table we
have, for a specific customer, only one record describing the customer, but in
the fact table the customer might exist in multiple records, because customers
can order multiple times. So that's why on the fact side it is many and on the
dimension side it is one. Now in order to see all those relationships, we're
going to go to the menu on the left side, and as you can see we have entity
relations here, with different types of arrows. For example, we have zero to
many, one to many, one to one, and many different types of relations. So which
one are we going to take? We're going to pick this one. It says one, mandatory.
That means the customer must exist in the dimension table. To many, but it is
optional. So here we have three scenarios: the customer didn't order anything,
or the customer ordered only once, or the customer ordered many things. That's
why on the fact table it is optional. So we're going to take this one and place
it over here. We're going to connect this part to the customer dimension and
the many part to the fact. Well, actually we have to do it on the customers. So
with that we are describing the relationship between the dimension and the fact
as one to many: one is mandatory for the customer dimension, and many is
optional for the fact. And we have the same story for the products: the many
part goes to the fact and the one goes to the products. So it's going to look like
the products. So it's going to look like this. Each time you are connecting new
this. Each time you are connecting new dimension to the fact table, it is
dimension to the fact table, it is usually one to many relationship. So you
usually one to many relationship. So you can go and add anything you want to this
can go and add anything you want to this model like for example a text like
model like for example a text like explaining something. For example, if
explaining something. For example, if you have some complicated calculations
you have some complicated calculations and so on, you can go and write this
and so on, you can go and write this information over here. So for example,
information over here. So for example, we can say over here sales calculation,
we can say over here sales calculation, we can make it a little bit smaller. So
we can make it a little bit smaller. So let's go with 18. So we can go and write
let's go with 18. So we can go and write here the formula for that. So sales
here the formula for that. So sales equal quantity multiplied with the price
equal quantity multiplied with the price and make this little bit bigger. So it
and make this little bit bigger. So it is really nice info that we can add it
is really nice info that we can add it to the data model and even we can go and
to the data model and even we can go and link it to the column. So we can go and
link it to the column. So we can go and take this arrow for example put it like
take this arrow for example put it like this and link it to the column and with
this and link it to the column and with that you have as well nice explanation
that you have as well nice explanation about the business rule or the
about the business rule or the calculation. So you can go and add any
calculation. So you can go and add any descriptions that you want to the data
descriptions that you want to the data model. Just to make it clear for anyone
model. Just to make it clear for anyone that is using your data model. So with
that is using your data model. So with that you don't have only like three
that you don't have only like three tables in the database. You have as well
tables in the database. You have as well like some kind of documentations and
like some kind of documentations and explanation. In one click we can see how
explanation. In one click we can see how the data model is built and how you can
the data model is built and how you can connect the tables together. It is
connect the tables together. It is amazing really for all users of your
amazing really for all users of your data model. All right. So now with that
data model. All right. So now with that we have really nice data model. And now
we have really nice data model. And now in the next step we're going to go and
in the next step we're going to go and create quickly a data catalog.
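The one-to-many relationship and the sales formula described above can be sketched in a few lines. This is only a minimal illustration, assuming SQLite (through Python's built-in sqlite3 module) instead of SQL Server; the table names, column names and sample rows are all made up for the example and are not the ones from the project.

```python
import sqlite3

# Hypothetical miniature star schema; every name and value here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE dim_customers (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_products  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE fact_sales (
    order_number TEXT,
    -- "one is mandatory": every fact row must point to exactly one customer/product
    customer_key INTEGER NOT NULL REFERENCES dim_customers(customer_key),
    product_key  INTEGER NOT NULL REFERENCES dim_products(product_key),
    quantity     INTEGER,
    price        REAL,
    sales_amount REAL
);
INSERT INTO dim_customers VALUES (1, 'Anna'), (2, 'Ben');
INSERT INTO dim_products  VALUES (10, 'Helmet'), (11, 'Gloves');
INSERT INTO fact_sales VALUES
    ('SO1', 1, 10, 2, 25.0, 50.0),
    ('SO2', 1, 11, 1, 15.0, 15.0),
    ('SO3', 1, 10, 3, 25.0, 75.0);
""")

# "many is optional": a customer may have many fact rows or none at all.
sales_per_customer = conn.execute("""
    SELECT c.customer_name, COUNT(f.order_number)
    FROM dim_customers c
    LEFT JOIN fact_sales f ON f.customer_key = c.customer_key
    GROUP BY c.customer_key
    ORDER BY c.customer_key
""").fetchall()

# Business rule noted on the diagram: sales = quantity * price.
rule_violations = conn.execute(
    "SELECT COUNT(*) FROM fact_sales WHERE sales_amount <> quantity * price"
).fetchone()[0]

# A fact row that points to a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO fact_sales VALUES ('SO4', 99, 10, 1, 25.0, 25.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Here one customer has three sales and the other has none, which is exactly the "one mandatory, many optional" reading of the connector in the diagram.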
All right, great. So with that we have a data model, and we can say we
have something called a data product. We will be sharing this data
product with different types of users, and there is something that every
data product absolutely needs, and that is the data catalog. It is a
document that describes everything about your data model: the columns,
the tables, maybe the relationships between the tables as well. With
that, you make your data product clear for everyone, and it's going to
be way easier for them to derive more insights and reports from your
data product. And what is the most important point? It is time-saving.
Because if you don't do that, what's going to happen? Each consumer,
each user of your data product will keep asking you the same questions:
what do you mean with this column? What is this table? How do I connect
table A with table B? And you will keep repeating yourself and
explaining stuff. So instead of that, you prepare a data catalog and a
data model and you deliver everything together to the users, and with
that you are saving a lot of time and stress. I know it is annoying to
create a data catalog, but it is an investment and a best practice. So
now let's go and create one. Okay. So in order to do that, I have
created a new file called data catalog in the folder documents. And what
we're going to do here is very straightforward. We're going to make a
section for each table in the gold layer. So for example, we have here
the table dimension customers. What you have to do first is describe
this table. So we are saying it stores details about the customers, with
demographic and geographic data. So you give a short description for the
table, and after that you list all the columns inside this table, and
maybe the data type as well. But what is way more important is the
description for each column. So you give a very short description, like
for example here: the gender of the customer. And one of the best
practices for describing a column is to give examples, because you can
quickly understand the purpose of a column by just seeing an example,
right? So here we are saying we can find inside it male, female and not
available. With that, the consumer of your table can immediately
understand that it will not be an M or an F. It's going to be a full,
friendly value. Without having to go and query the content of the table,
they can quickly understand the purpose of that column. So with that we
have a full description for all the columns of our dimension. We do the
same thing for the products: again, a description for the table and a
description for each column. And the same thing for the facts. So that's
it. With that you have a data catalog for your data product at the gold
layer, and the business users or the data analysts have a better and
clearer understanding of the content of your gold layer. All right, my
friends. So that's all for the data catalog. In the next step we're
going to go back to Draw.io, where we're going to finalize the data
flow diagram. So let's go.
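The catalog structure described above (a short description per table, then each column with its data type, description and example values) could be kept as structured data and rendered to text. This is just a sketch of that idea, not the file from the video: the class names, the table name gold.dim_customers, the column name gender and its data type are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    name: str
    data_type: str
    description: str
    examples: list = field(default_factory=list)


@dataclass
class TableEntry:
    name: str
    description: str
    columns: list


def render(table: TableEntry) -> str:
    """Render one catalog section: table description first, then one line per column."""
    lines = [table.name, table.description, ""]
    for col in table.columns:
        line = f"- {col.name} ({col.data_type}): {col.description}"
        if col.examples:
            # Examples let a reader grasp the column without querying the table.
            line += " Examples: " + ", ".join(col.examples) + "."
        lines.append(line)
    return "\n".join(lines)


# Hypothetical entry; only the wording of the descriptions follows the transcript.
dim_customers = TableEntry(
    name="gold.dim_customers",
    description="Stores details about the customers with demographic and geographic data.",
    columns=[
        ColumnEntry("gender", "TEXT", "The gender of the customer.",
                    ["Male", "Female", "Not available"]),
    ],
)

catalog_text = render(dim_customers)
print(catalog_text)
```

Keeping the entries as data makes it easy to add the products and the facts the same way and to regenerate the document whenever a column changes.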
Okay. So now we're going to extend our data flow diagram, but this time
for the gold layer. So let's copy the whole thing from the silver layer
and put it over here, side by side. And of course we're going to change
the coloring to gold. And now we're going to rename stuff. So this is
the gold layer. But of course we cannot leave those tables like this. We
have a completely new data model. So what do we have over here? We have
the fact sales, we have dimension customers, and we have dimension
products. So what I'm going to do is remove all that stuff. We have only
three tables, and let's put those three tables somewhere here in the
center. So now what you have to do is start connecting things. I'm going
to go with this arrow over here, direct connection, and start
connecting. So the sales details goes to the fact table. Maybe put the
fact table over here. And then we have the dimension customers. This
comes from the CRM customer info. And we have two tables from the ERP:
it comes from this table as well, and from the location table from the
ERP. The same thing goes for the products: it comes from the product
info and from the categories from the ERP. Now, as you can see here, we
have crossing arrows. So what we can do is select everything and choose
line jumps with a gap, and this makes the arrows look a little bit
better. So now, for example, if someone asks you where the data for the
dimension products comes from, you can open this diagram and tell them:
okay, this comes from the silver layer. We have two tables, the product
info from the CRM and the categories from the ERP. And those silver
tables come from the bronze layer, and you can see the product info
comes from the CRM and the categories come from the ERP. So it is very
simple. We have just created a full data lineage for our data warehouse,
from the sources through the different layers of our data warehouse. And
data lineage is really amazing documentation that can help not only your
users but the developers as well.
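The question "where does the data for the dimension products come from?" can also be answered from a small lineage map instead of a drawing. The sketch below is only an illustration of that idea: the node names are hypothetical labels for the product tables mentioned above, not the real object names in the project.

```python
# Each key lists the objects it is built from (its direct upstream parents).
# All names are made-up labels for illustration.
LINEAGE = {
    "gold.dim_products": ["silver.crm_product_info", "silver.erp_categories"],
    "silver.crm_product_info": ["bronze.crm_product_info"],
    "silver.erp_categories": ["bronze.erp_categories"],
    "bronze.crm_product_info": ["source.CRM"],
    "bronze.erp_categories": ["source.ERP"],
}


def upstream(node, lineage=LINEAGE):
    """Return everything that feeds `node`, directly or indirectly, nearest layer first."""
    seen = []
    queue = list(lineage.get(node, []))
    while queue:
        current = queue.pop(0)  # breadth-first: silver before bronze before sources
        if current not in seen:
            seen.append(current)
            queue.extend(lineage.get(current, []))
    return seen


parents = upstream("gold.dim_products")
sources = [name for name in parents if name.startswith("source.")]
print(parents)
print(sources)
```

Walking the map layer by layer gives the same answer you would read off the diagram: silver tables first, then bronze, then the CRM and ERP source systems.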
All right. So with that we have a very nice data flow diagram and a data
lineage. All right. So we have completed the data flow. It really feels
like progress, like achievement, as we are clicking through all those
tasks. And now we come to the last task in building the data warehouse,
where we're going to commit our work to the Git repo. Okay. So now let's
put our scripts in the project. We go to the scripts over here. We have
bronze and silver here, but we don't have gold. So let's create a new
file. We're going to have gold/ and then we're going to say
ddl_gold.sql. Now we paste our views, so we have here our three views.
And as usual, at the start we describe the purpose of the script. So we
are saying: create gold views. This script creates views for the gold
layer, and the gold layer represents the final dimension and fact
tables, the star schema. Each view performs transformations and combines
data from the silver layer to produce business-ready data sets, and
those views can be used for analytics and reporting. So that's it. Let's
go and commit it.
Okay. So with that, as you can see, we have the bronze and the silver,
so we have all our ETLs and scripts in the repository. And now for the
gold layer as well, we're going to add all those quality checks that we
used in order to validate the dimensions and facts. So we go to the
tests over here and create a new file. It's going to be quality checks
gold, and the file type is SQL. Now let's paste our quality checks. We
have the check for the fact, the checks for the two dimensions, and an
explanation of the script as well. So we are validating the integrity
and the accuracy of the gold layer. Here we are checking the uniqueness
of the surrogate keys and whether we are able to connect the data model.
So let's put that in our Git as well and commit the changes. And in case
we come up with new quality checks, we add them to our script here.
Those checks are really important if you are modifying the ETLs, or if
you want to make sure that these scripts run after each ETL run, and so
on. It is like a quality gate to make sure that everything is fine in
the gold layer.
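The two kinds of checks mentioned above, uniqueness of the surrogate keys and whether the fact can be connected to its dimensions, can be sketched like this. It is not the script from the repository: it assumes SQLite through Python's sqlite3 module, hypothetical table and column names, and sample data that is broken on purpose so that both checks return something.

```python
import sqlite3

# Hypothetical gold-layer objects with deliberately bad data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customers (customer_key INTEGER, customer_name TEXT);
CREATE TABLE fact_sales (order_number TEXT, customer_key INTEGER);
INSERT INTO dim_customers VALUES (1, 'Anna'), (2, 'Ben'), (2, 'Ben again');
INSERT INTO fact_sales VALUES ('SO1', 1), ('SO2', 2), ('SO3', 7);
""")

# Check 1: surrogate keys in the dimension must be unique.
# Expectation on a healthy model: no rows.
duplicate_keys = conn.execute("""
    SELECT customer_key, COUNT(*)
    FROM dim_customers
    GROUP BY customer_key
    HAVING COUNT(*) > 1
""").fetchall()

# Check 2: every fact row must find its dimension row.
# Expectation on a healthy model: no rows.
orphan_facts = conn.execute("""
    SELECT f.order_number
    FROM fact_sales f
    LEFT JOIN dim_customers c ON c.customer_key = f.customer_key
    WHERE c.customer_key IS NULL
""").fetchall()

print(duplicate_keys)  # the duplicated surrogate key and how often it occurs
print(orphan_facts)    # fact rows that cannot be connected to a customer
```

Used as a quality gate, both queries are expected to come back empty; any returned row points at the key or the fact record that needs investigating.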
Perfect. So now we have our code in our repository. Okay, friends. So
now what you have to do is finalize the Git repo. For example, all the
documentation that we have created during the project, we can upload it
to the docs. So you can see here the data architecture, the data flow,
the data integration, the data model and so on. That way, each time you
edit those pages you can commit your work and you have a version of
that. Another thing you can do is go to the README. For example, over
here I have added the project overview, some important links, and the
data architecture with a little description of the architecture, of
course. And of course, don't forget to add a few words about yourself
and your important profiles on the different social media. All right, my
friends. So with that we have committed our work and closed the last
epic, building the gold layer, and with that we have completed all the
phases of building a data warehouse. Everything is at 100%, and this
feels really nice. All right, my friends. So with that we have covered
the first type of SQL project, the data warehousing project. This is
usually a very complex project that you can get involved in at a
company, and it is a really amazing project if you are planning to be a
data engineer. But of course, if you are a data analyst, you might end
up building warehouses as well. So now we have everything prepared for
the second type of project in SQL. We will now deep dive into
exploratory data analysis. So let's
go. And now here we're going to cover the second type of project, where
we're going to use our basic SQL skills in order to do something called
data profiling, where we try to understand all the aspects of our data
sets using simple aggregations like the sum, average and count, and we
will be using techniques like subqueries as well. All right, my friends.
So the first step in any data project is that we need data sets. If you
have done the previous project, where we built the SQL data warehouse,
then you have everything, the data and the database, so you don't have
to worry about it. But if you skipped that, which I don't recommend, I
have still prepared the files and the database for you. So let's get the
data and create our database. All right. So now, if you go to the link
in the description, we go to the downloads. And of course, you can
subscribe to my newsletter. Then here we have the SQL course materials,
and here we have a link for data analytics projects. Let's go to the
link. Here you have some important links, like downloading SQL Server
and the Management Studio, where we're going to write our SQL, and there
is a link to the Git repository as well. And what is very important is
to download all the project files. So click on that and download all the
files. Now extract the file and put it somewhere safe on your PC.
Inside it you can find all the scripts and the data sets. Now, there are
three ways to create the database in SQL Server. The first one is by
executing scripts. If you go to the scripts over here, the first one is
a file called init database. Just go inside it and copy the whole thing,
and then let's go to SQL Server. Make a new query, make sure you switch
to the master database, and then paste the whole code. What we are doing
here is creating a new database, creating a schema, and then creating
three very important tables that we're going to use in our data
analysis. There is only one thing that you have to change in this
script, and that is the path of the files. Once you have done that, just
execute the whole script. And now, as you can see, everything is done
and there is data inserted. If you go to the left side, to the
databases, and refresh, you can find a new database called data
warehouse analytics. And if you go inside the tables, you will find our
three tables: customers, products and sales. So this is one way to
create the database.
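The steps of the first method (create the table, point the load at the file, insert the rows, then check that data arrived) can be sketched as follows. This is not the init database script from the course: it assumes SQLite and Python's csv module instead of SQL Server, and the table name, column names and sample rows are made up. The CSV content lives in a string so the example runs without any file path.

```python
import csv
import io
import sqlite3

# Stand-in for one of the project's CSV files; the content is hypothetical.
CSV_TEXT = (
    "product_key,product_name,category\n"
    "1,Helmet,Accessories\n"
    "2,Road Bike,Bikes\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_key INTEGER, product_name TEXT, category TEXT)"
)

# With a real file you would use open(path, newline="") instead of io.StringIO;
# the path is the one thing you would have to adapt.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = [(int(r["product_key"]), r["product_name"], r["category"]) for r in reader]
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
conn.commit()

# Quick sanity check after the load, like refreshing and looking at the table.
row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(row_count)
```

The same pattern would be repeated once per table, which mirrors having three tables to load in the project.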
The second method is to go to Databases over here, right-click on it
and say New Database. For example, let's call it data warehouse
analytics. I'm going to call it two, because I already have one. Then
click OK, and with that you have a new database. What we do now is
right-click on it, go to Tasks and then Import Flat File. We're going to
import the CSV files into our new database. So we go Next, and then you
have to locate your files. I have them somewhere over here: data sets,
CSV files, and we have to focus on the gold tables. So I select this one
and then Next. Now I'm just getting an overview of my data. So Next.
Now, just to make sure that you are not getting any errors, I'm going to
allow nulls, and that's all. So Next and Finish. Perfect, the data has
been inserted. Now let's go to the tables of our database. As you can
see, we have our new table here. You have to repeat this three times in
order to import all the data. Well, you can use this method if the first
method didn't work, but I really recommend using the script to create
the database. The third way is to restore the database itself. Now, how
are we going to do it? We go again to the data sets, and as you can see
we have a database backup here, a .bak file. What you have to do is copy
that, and then we go to the database location. It really depends on
where you have installed SQL Server. Currently I have it here: Program
Files, Microsoft SQL Server, then the Express folder, MSSQL, Backup, and
you have to place the file over here. So I have it here, data warehouse
analytics backup. Now all you have to do is right-click on Databases and
say Restore Database. Then we go to Device, click the three dots and say
Add. And now you can see our database, data warehouse analytics. We say
OK and then OK. Since I already have it, I will get an error, but once
you click OK the whole database is restored without running any scripts.
So those are the three ways to create the database for the projects. And
if you built the data warehouse project with me before, you don't have
to do it, because we have built that together. So pause the video and
get the data for the
projects. All right, my friends. So we're going to start with a secret,
a little trick that I usually use when analyzing any data set. So let's
start with a little coffee before we begin. Ah, this is really hot.
Okay. So the secret is: whenever I'm looking at any data set in any
project, I always see the data divided between dimensions and measures.
What truth? You take the blue pill, you take the red pill. All I'm
offering is the truth. Nothing more.
the truth. Nothing more. If you see your data like me as
If you see your data like me as dimensions and measures, you can
dimensions and measures, you can generate like endless amount of insights
generate like endless amount of insights from any projects from any data sets and
from any projects from any data sets and you will find me through the projects
you will find me through the projects that I'm always speaking about measures
that I'm always speaking about measures and dimensions. So I'm going to show you
and dimensions. So I'm going to show you how I usually do it. So now usually by
how I usually do it. So now usually by looking to any data sets in any
looking to any data sets in any projects. So you have like multiple
projects. So you have like multiple columns and rows here I see the data
columns and rows here I see the data always splitted into two categories
always splitted into two categories either a dimension or a measure. And now
either a dimension or a measure. And now of course the question is here is my
of course the question is here is my column a dimension or a measure? Well in
column a dimension or a measure? Well in order to assign it to one of those
order to assign it to one of those categories you have to ask the first
categories you have to ask the first question is it a numeric value? If it's
question is it a numeric value? If it's not so you have like string or date or
not so you have like string or date or any other data type then it is a
any other data type then it is a dimension and if it is yes in numeric
dimension and if it is yes in numeric then you have to ask the second question
then you have to ask the second question does it make sense to aggregate it. So
does it make sense to aggregate it. So if the answer for both questions is yes,
if the answer for both questions is yes, it is numeric and it makes sense to
it is numeric and it makes sense to aggregate it then it is a measure
aggregate it then it is a measure otherwise it is a dimension. Now let's
Now let's practice and have some examples. Looking at the values of the column
category, you can see all the values are characters. So it is not numeric;
that means this column is a dimension. So it is very simple. Let's take
another column. We have the sales amount. As you can see, the values are
numeric, and it also makes sense to aggregate those values: we can get the
total sales or the average sales and so on. So it fulfills both of the
conditions. It is numeric and it makes sense to aggregate it. That's why we
say sales is a measure. Now if you check the values of the product name, you
can see that all of them are characters and names. So it is not numeric. That
means the product is a dimension. Moving on to the next one, we have the
quantity. The values are numeric, and it also makes sense to aggregate it. We
can sum all those values to have the total quantity.
So quantity is a measure. Now if you look at the values of the birth date, you
can see this is date information. It is not numeric, so that means it is a
dimension, right? But if you calculate the age from the birth date, the age of
the customer is going to be numeric, and it makes sense to aggregate it, for
example finding the average age of customers. So if we derive a numeric value
from a dimension, then we can use it as a measure. So age is a measure. And
now we come to something really tricky. This is the ID. For example, if you
check the customer ID, you can see all those values are numeric. So the first
condition is fulfilled. Now the very important question: does it make sense to
aggregate the IDs? Well, those IDs are unique identifiers for a customer, and
if you find the average of them, it is not helpful, right? I cannot think of
one use case for aggregating the customer ID, like having the average of all
those IDs or summing the IDs. So it makes no sense to aggregate it. That's why
we consider the ID of a customer a dimension, not a measure. So as you can
see, it is very simple. If it is numeric and it makes sense to aggregate, then
it is a measure; otherwise it is a dimension.
And this is the foundation of any data analytics. If you see your data as
dimensions and measures, you can generate a lot of use cases and insights from
your data sets. Now I totally understand if you are still confused about
dimensions and measures, and you might be asking: why do I need measures and
dimensions? Well, if you are doing any type of data analysis or you are
exploring any data set, you will always end up grouping the data by something,
like grouping the data by countries, or by products or categories. So we need
dimensions to group our data. And on the other side, you will be asking
questions like how much, how many, what is the total of something. So you
always need to aggregate or calculate something, right? And for that you need
the measure.
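As a rough sketch of that idea in T-SQL: the dimension goes into GROUP BY and
the measure goes inside an aggregate function. The sample rows and the names
sales, country, and sales_amount are made up for illustration; they are not
taken from the course database.

```sql
-- Hypothetical sample rows standing in for a sales table.
SELECT
    country,                           -- dimension: what we group by
    SUM(sales_amount) AS total_sales   -- measure: what we aggregate
FROM (VALUES
    ('Germany', 100),
    ('Germany', 250),
    ('France',   80)
) AS sales (country, sales_amount)
GROUP BY country
ORDER BY country;
```

With these three rows the query returns one row per country: France with 80
and Germany with 350.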
So we need the measures in order to answer the questions how many and how
much, and we need the dimensions in order to group the data by something.
That's why in almost any type of data analysis you need dimensions and
measures, and this is going to be clearer as we progress in the projects.
All right. So now I'm going to walk you through the project road map, and I
have split it into six steps. We're going to do different types of
explorations, like the database, dimensions, measures, and dates, and we're
going to do some basic analyses, like the magnitude and the ranking. So let's
start with the first step in our project. We're going to do database
exploration. Let's say that you have joined a team and you got access to a
database. The first thing that I usually do is explore the structure of the
database, just to have a basic understanding of the database tables, the
views, the columns. Are we talking about 10 tables or hundreds of tables? So
it is just a few queries in order to say hello to the database.
So now let's go to SQL and explore the database of our project. How are we
going to do it? Either you go to the left side over here and start clicking
the objects of your database and explore the tables, views, columns and so on.
Or, a better way that I usually use, you explore the database with a query.
What we can do is select data from the system tables, because the database
stores metadata about our tables and objects. So we're going to target the
information schema. This is an internal schema in the database where we have
multiple tables and views to explore the metadata and the structure of our
database. For example, we can go with the tables. So let's go and query it.
And with that you have a list of tables, and you can see several pieces of
information, like the catalog, the schema and the table names, and you can see
over here the object type, whether it is a table or a view. If you have done
the data warehouse project with me, then you will find a lot of tables. But if
you are just doing the data analysis, you will see only those three tables:
customers, products and sales. So we can see that in our database there are
around 15 tables, or three tables. In the output you can see the database
name, the schema and a list of all tables, and of course don't forget that you
have to be using the database that we created. So with that we have a nice
quick list of all tables inside our database.
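A minimal sketch of that query, assuming SQL Server. The walkthrough uses
SELECT *; the explicit column list here is only for readability.

```sql
-- List every table and view in the current database.
-- TABLE_TYPE is 'BASE TABLE' for tables and 'VIEW' for views.
SELECT
    TABLE_CATALOG,   -- database name
    TABLE_SCHEMA,    -- schema name
    TABLE_NAME,
    TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES;
```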
Now the next step: we can go and drill down and check which columns we have
inside our database. For that we can target the same schema. So select star
from information schema, and it is very simple: we're going to go to the table
columns. Let's go and execute it. And now we see a lot of information over
here. We can see that in our database we have around 101 columns, so we can
see all the columns available in our database. What I usually do with that is
select the columns only for a specific table. So we can say where table name
equals, let's take for example, the dimension customers. Let's query the whole
thing, and with that we can see we have 10 columns inside this dimension, and
this is how the columns are ordered inside our table or view, and we can see
all the metadata about each column.
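A sketch of the column lookup, again assuming SQL Server. The literal
'dim_customers' is an assumed spelling of the customers dimension table;
replace it with the name that INFORMATION_SCHEMA.TABLES shows in your
database.

```sql
-- Columns of one table, in the order they are defined.
SELECT
    COLUMN_NAME,
    ORDINAL_POSITION,   -- position of the column inside the table or view
    DATA_TYPE,
    IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'dim_customers'   -- assumed table name
ORDER BY ORDINAL_POSITION;
```

Dropping the WHERE clause lists the columns of every table in the database.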
So now as you can see, we are exploring the structure of our database, and
this is really helpful to get an overview of the database and the project. Are
we talking about 20 tables or hundreds of tables? And we can quickly see the
naming of the columns and the tables. This is really important to get a
feeling for the project, and it sets the foundation for exploring the data
inside those tables. All right friends, so with that we have done the first
step. We have explored the database structure, and now we can start diving
into the actual data. The first thing that we can explore is the dimensions.
Okay. So what are we going to do with the dimension exploration? All we have
to do is identify the unique values of each dimension that we have inside our
database. This helps us understand which categories, which countries, and
which product types we have inside our database, and we have a very simple
formula for that. All you need is the SQL keyword DISTINCT together with any
dimension in your data set, like DISTINCT country or DISTINCT category. For
example, if you check any column that is a dimension, you can see a lot of
values and repeating stuff. But once you say DISTINCT column, you get a list
of all unique values, and with that you can quickly understand: I have three
different types, A, B and C. This is also going to help you understand the
granularity of your dimension: does the dimension have three values or 100
values? So it is very simple. Let's go and analyze our dimensions.
Okay. So now let's explore the dimension values inside our database. Let's
start with the first table, the customers, and if you check those columns, we
have to find an interesting dimension, like for example the country. What we
can do is explore all the countries our customers come from. So let's go and
do that. It is very simple: select distinct, and then we have our column, the
dimension country, from our table customers. Let's go and execute it. And with
that we can see in the result we have six countries. This is really nice for
understanding the geographical spread. So we have customers for our business
that come from six different countries: Germany, United States, France, Canada
and so on.
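A small self-contained sketch of that DISTINCT query. The inline rows and the
names customers, customer_id, and country are illustrative assumptions; in the
project you would select from the real customers table instead.

```sql
-- Hypothetical customers; several share the same country.
SELECT DISTINCT country
FROM (VALUES
    (1, 'Germany'),
    (2, 'United States'),
    (3, 'Germany'),
    (4, 'France'),
    (5, 'Canada'),
    (6, 'France')
) AS customers (customer_id, country);
```

Six input rows collapse to four distinct countries, which is the "list of
unique values" the exploration is after.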
So now with that we have the first little insights about our business. Now
let's jump to another table, the products. What we have to do is explore all
the categories inside our business, the major divisions. So we're going to say
select distinct category from our table products. Let's go and execute it. In
the output you can see we have four categories: accessories, bikes, clothing
and components. This gives us an overview of the product range. What are the
major divisions inside our business?
Now for the next one, I'm digging deeper into this information. Not only do I
want to see the categories, I would also like to see the subcategories. I'm
not starting a new query, because there is of course a relationship between
the category and the subcategory. Let's go now and execute it. Now you can see
in the output that our categories are split into more specific groups. For
example, under bikes over here we have mountain bikes, road bikes and so on.
So as you can see, the subcategory has more detail about the products than the
category. And now, in order to get the full picture, we're going to bring in
the product name. With that we get the big picture in one shot. So now you can
see the whole hierarchy of our products. And of course it is more interesting
if you sort the data by those three columns. So let me just execute it again.
Now if you explore our data, for example we have here the category
accessories, and we have a subcategory inside it called lights. And in this
subcategory we have three different products. And if you scroll to the end of
our table, you can see that we have around 295 products. So you can see the
granularity of the product name is different from that of the category and the
subcategory. And all three columns are related to each other.
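The hierarchy query described above could look roughly like this. The product
names in the sample rows are invented placeholders, and products, category,
subcategory, and product_name are assumed names.

```sql
-- Category > subcategory > product, sorted so the hierarchy is easy to read.
SELECT DISTINCT
    category,
    subcategory,
    product_name
FROM (VALUES
    ('Accessories', 'Lights',         'Headlight A'),
    ('Accessories', 'Lights',         'Headlight A'),   -- duplicate row, removed by DISTINCT
    ('Accessories', 'Lights',         'Taillight B'),
    ('Bikes',       'Mountain Bikes', 'Mountain Bike C'),
    ('Bikes',       'Road Bikes',     'Road Bike D')
) AS products (category, subcategory, product_name)
ORDER BY category, subcategory, product_name;
```

Selecting only category from the same rows would return two values, while the
three-column version returns four rows, which shows how the granularity grows
as you move down the hierarchy.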
So now as you can see, after exploring those dimensions we have a better
understanding of how the data is organized, and this can help us in the
analysis: if you aggregate by the category, you will get only four rows. If
you aggregate by the products, you will get hundreds of rows. So this is how
we explore the dimensions of our database. Okay. So now with that we have a
clear picture of the dimensions inside our data sets. In the next step we're
going to deep dive into one special type of dimension: the dates. So we're
going to explore the date columns. Okay. So what are we going to do with the
date exploration? We're going to explore the boundaries of the dates that we
have in the data sets. What are the earliest and the latest dates in my data?
We're going to understand the time span. Do we have in our business 2 years or
10 years?
And this is of course very important to understand in order to do different
types of time analysis later. Now the formula for that is very simple. All we
need is the MIN and MAX functions in order to get the earliest and the latest
dates. And of course we're going to apply them to date columns, date
dimensions. For example, we're going to have MIN order date, MAX create date,
MIN birth date, so any date that you have in your data set. If you look at any
date column inside your data, you will find multiple values. But what is
interesting is to understand what the earliest date is, like here for example
2018, and what the latest date is, for example 2028. And with that we can
understand: aha, we have a time span of 10 years, using the DATEDIFF function.
So now let's go and apply our new formula to our date columns.
All right. So now let's search for date information inside our database.
Usually you're going to find a lot in the facts. So let's go to the fact
sales. Here we have multiple dates: the order date, shipping date and due
date. Now let's go and explore the boundaries of the order date. We have the
following task: find the date of the first and last order. So how are we going
to do that? We're going to say select, and we are targeting the order date
from our table sales. Let's go and execute it. And now we can see we have a
lot of values inside our database. In order to find the first date, we're
going to use the function MIN to get the minimum order date, and we're going
to call it first order date. Let's go and execute it. Now we can see the date
of the first order. It is in December 2010. Now let's go and find the date of
the last order. This time we're going to have the MAX order date, and let's
call it last order date. Let's go and explore the other boundary, and with
that we can see that January 2014 is the date of the last order in our system.
So with that we have explored the boundaries of the order date, the first and
the last, and of course we can now understand very quickly that we have four
years of sales inside our business, but we can go and calculate it.
So now the task says: how many years of sales are available? In order to find
the years between those two dates, we have another SQL function. It's called
DATEDIFF, and it subtracts two dates. This function needs three arguments. For
the first one, you have to specify whether it is a year, month or day. Then we
start with the smallest date, so it's going to be the MIN order date. And the
last argument is going to be the latest or the highest date, so it's going to
be the MAX order date. And we can call it order range in years. Okay. So let's
go and execute it. And with that you can see in the output we have four years.
Of course, if you want to check the months, you can go over here and say month
and execute. So between those two dates we have 37 months. And of course now
we have to go and rename it.
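Put together, the order date exploration might look like the sketch below
(T-SQL). The sample dates are assumptions: only December 2010 and January 2014
come from the walkthrough, while the exact days, the middle row, and the names
sales and order_date are made up.

```sql
-- First and last order date, plus the span between them.
SELECT
    MIN(order_date) AS first_order_date,
    MAX(order_date) AS last_order_date,
    DATEDIFF(year,  MIN(order_date), MAX(order_date)) AS order_range_years,
    DATEDIFF(month, MIN(order_date), MAX(order_date)) AS order_range_months
FROM (VALUES
    (CAST('2010-12-29' AS date)),
    (CAST('2012-06-15' AS date)),
    (CAST('2014-01-28' AS date))
) AS sales (order_date);
-- order_range_years = 4, order_range_months = 37
```

DATEDIFF counts how many year or month boundaries are crossed between the two
dates, not the elapsed time. That is why December 2010 to January 2014 comes
out as 4 years even though it is only 37 months.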
So with that we have explored the dimension order date. But what is more
interesting is to check the customers, and here we have the birth date. What
we can do is find the youngest and the oldest customer. So let's go and do
that. We're going to say select MIN birth date, and with that we are getting
the oldest birth date, and then the MAX birth date, and with that we get the
youngest birth date, from our table customers. So let's go and explore that.
Now we can see the birth date of the oldest customer. I hope he or she is
still alive. It is more than 100 years ago, and the youngest customer is
around 40 years old. So we don't have really young customers inside our
business. And of course, if you don't want to see the birth dates and you want
to see the age, what you have to do is actually very simple. You're going to
use DATEDIFF as well, and we want the year, and then we're going to compare
the MIN birth date with the current date and time. For that we have a function
called GETDATE, and we're going to call the result oldest age. So if you go
ahead and execute this one, you can see the age of the oldest customer: it is
109. Of course you can do the same thing for the youngest, if you just replace
this with MAX, and here we have the youngest age. So let's go and execute it.
It is 39.
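A sketch of the birth date exploration in T-SQL. The birth dates and the names
customers and birthdate are illustrative assumptions, so the ages returned
will differ from the 109 and 39 mentioned above and will also depend on the
day you run it.

```sql
-- Oldest and youngest customer, by birth date and by age in years.
SELECT
    MIN(birthdate)                            AS oldest_birthdate,
    DATEDIFF(year, MIN(birthdate), GETDATE()) AS oldest_age,
    MAX(birthdate)                            AS youngest_birthdate,
    DATEDIFF(year, MAX(birthdate), GETDATE()) AS youngest_age
FROM (VALUES
    (CAST('1916-02-10' AS date)),
    (CAST('1970-05-21' AS date)),
    (CAST('1986-06-25' AS date))
) AS customers (birthdate);
```

Because DATEDIFF(year, ...) only compares the calendar years, the result can
be one higher than the true age for a customer whose birthday has not yet
occurred this year. For a quick exploration that is usually close enough.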
So my friends, this is how we explore the boundaries of a date. By finding
the first date, the last date, and the years between them, we now have a
better understanding of the time span of our business, and that's going to
help us later when making different types of complex analyses. So this is
how we explore the dates.

All right. So with that we now have a clear picture of the scope of our
project and the date range inside our data sets. Now in the next step,
we're going to go and explore the second type of data, the
measures.

All right. So now what exactly is exploring the measures? What we're going
to do is calculate the key metrics of our business: the big numbers, the
highest level of aggregation of our data. And the formula for that is very
simple. We're going to use the aggregate functions in SQL, like SUM, AVG,
and COUNT, on any measure inside our data sets. So for example, we're going
to find the total sales by summing the sales values, find the average
price, and find the sum of the quantity in order to have one big number for
all sold items. So it is always an aggregate function together with a
measure. For example, if you have a column with a lot of values and you sum
all those values, you will get, for example, 240. So this is a key metric.
This is the highest level of aggregation, and the value is not split at
all. So for example, we say this is the total revenue of our business. And
this is exactly what we mean by exploring the measures: we get those big
numbers. So now let's go and apply those aggregate functions to the
measures that we have inside our data set.

Okay. So now we're going to put the spotlight on the big numbers that
matter most to our business. Based on those three tables, I have collected
the following questions, so let's go and solve them one by one. The first
one is: find the total sales. So we're going to sum, using the SUM
function, the sales amount as total sales from our table fact sales. So
let's go and execute it. So this is the total amount of sales in our
business.
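As a minimal sketch, the total sales query described above might look like this. The identifiers `gold.fact_sales` and `sales_amount` are assumed spellings of the spoken names "fact sales" and "sales amount".

```sql
-- Assumed names: gold.fact_sales (table), sales_amount (column).
-- One aggregate over the whole table, with no GROUP BY, returns a single "big number".
SELECT SUM(sales_amount) AS total_sales
FROM gold.fact_sales;
```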
It is around 29 million. So this is the total revenue of the business. Now
we can go to the second one. It says: show how many items are sold. So this
time we need another column, but from the same table, the fact sales. The
question is how many items, so that means we want the quantity, and we're
going to stay with the same function. So we are summing all the values of
the quantity, and we can call it total quantity. Let's go and execute that.
So we can see our business sold around 60,000 items, and these 60,000 items
generated around 30 million. So let's keep going. The next question: find
the average selling price. That means we are targeting the same table, and
here we have the price information. So we're going to say the price. This
time the aggregate function is going to be the average, and we're going to
call it average price. So let's go and execute it. So the average price in
our business is 486. That means our business is selling fairly expensive
items.
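The two queries just described follow the same pattern, with only the column and the aggregate function changing. A sketch, again assuming the table is `gold.fact_sales` and the columns are spelled `quantity` and `price`:

```sql
-- Assumed names: gold.fact_sales (table), quantity and price (columns).

-- How many items were sold in total.
SELECT SUM(quantity) AS total_quantity
FROM gold.fact_sales;

-- Average selling price.
-- If price is stored as an integer, SQL Server's AVG also returns an integer.
SELECT AVG(price) AS avg_price
FROM gold.fact_sales;
```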
Now let's go to the next question. It says: find the total number of
orders. For that we're going to use the COUNT function, and we can count
the order numbers. So: order number, total orders. Let's go and execute it.
It says we have 60,000 orders. Now, whenever I'm working with the COUNT
function, what I usually do is count the same thing again but using
DISTINCT. So: DISTINCT order number. What I'm trying to do here is first
eliminate any duplicates in the order number and then count. I don't want
to count the same order twice inside our sales. So let's go and execute
that. Now, as you can see, we have only 27,000 orders out of 60,000. That
means the same order is repeated in our database. Let's actually have a
look. So SELECT star from our table, and let's have a look. Now, as you can
see from the first order over here, the same order is repeated three times,
and that's because this customer ordered three things in the same order. So
now, of course, what is the definition of an order? Usually the whole thing
is one order. That's why, in order to get an accurate number of orders, you
have to use DISTINCT to first eliminate all duplicates and then count how
many orders we have. So in this scenario, I'm going to say that in our
business we have around 27,000 orders. That's why using the COUNT function
is a little bit tricky. Always try to compare the numbers before and after
using DISTINCT.
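The before-and-after comparison can be done in a single query. This is a sketch under the assumption that the table is `gold.fact_sales` and the column is spelled `order_number`:

```sql
-- Assumed names: gold.fact_sales (table), order_number (column).
-- The fact table holds one row per order line, so a plain COUNT counts lines,
-- while COUNT(DISTINCT ...) counts each order only once.
SELECT
    COUNT(order_number)          AS total_order_lines,
    COUNT(DISTINCT order_number) AS total_orders
FROM gold.fact_sales;
```

If the two numbers differ, as they do in the video (about 60,000 versus about 27,000), the column repeats and the DISTINCT version is the one that answers "how many orders".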
So let's keep going to the next one. It says: find the total number of
products. It is very simple. We're going to say SELECT COUNT of the product
key as total products from the table gold products. So let's go and execute
it. As you can see, we have 295, and if you make it DISTINCT just to check,
you will get the same number. That means there are no duplicates. And of
course you can count the product name instead. The names of the products
are unique, so that's why we are getting the same number as well. So that's
it. Let's continue: find the total number of customers. So the same thing:
SELECT COUNT, and you can go with the customer key, for example, from gold
dimension customers, and I'm going to call it total customers. So let's go
and execute it. We can see that in our system we have 18,000 registered
customers. Now the next one says: find the total number of customers that
have placed an order. Having a customer inside our database doesn't mean
that this customer has already placed an order. Maybe we have customers
that just registered and didn't order anything. So what we're going to do
is take the same query, but instead of targeting the customers table, we're
going to target our fact sales. So let's go and execute it. Now, as you can
see, we are getting 60,000, which makes no sense, because one customer
might order multiple things. So we're going to say DISTINCT, and let's
query it again. Now it is more correct: we are getting around 18,000
customers. Now we can compare them one by one. As you can see, we are
getting the same numbers.
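The comparison between registered customers and customers who actually ordered can be sketched as two queries. The names `gold.dim_customers`, `gold.fact_sales`, and `customer_key` are assumed spellings of what is said in the video.

```sql
-- Assumed names: gold.dim_customers, gold.fact_sales (tables), customer_key (column).

-- All registered customers: one row per customer in the dimension.
SELECT COUNT(customer_key) AS total_customers
FROM gold.dim_customers;

-- Customers who placed at least one order: the fact table repeats a customer
-- for every order line, so DISTINCT is needed here.
SELECT COUNT(DISTINCT customer_key) AS customers_with_orders
FROM gold.fact_sales;
```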
So that means all our registered customers have already placed an order,
because the numbers match. So it is very simple: we are just using
aggregate functions, and with that we are getting those key values. But
what I usually do is collect all those measures in one query in order to
have an overview of all the key numbers of our business. So instead of
querying each one of them individually, I combine them in one go. So now
we're going to generate a report that shows all the key metrics of our
business. How I usually do it: I'm going to get the first query, for the
total sales, and put it over here. And now I'm going to build only two
columns. The first one is the name of the measure, and the second one is
the value of the measure. Let me show you what I mean. This one over here,
I will not call it total sales. I'm going to make it generic, so I'm going
to say measure value. And before it we're going to make another column from
a static string value, 'Total Sales', and we're going to call it measure
name, like this. So let's just execute this one over here. The measure is
total sales, so it is no longer the column name; it is now a value in the
output, and the measure value is around 30 million. Now I'm going to add
another measure as a second row. In order to do that, we're going to use
UNION ALL, then copy the whole thing over here and say total quantity, and
we're going to change the measure to quantity. So now let's select both of
them and query. And as you can see, we now have the two big numbers in one
query: the total sales and the total quantity.
So now what we can do is collect all those big numbers and measures and put
them in one query. With that we have the average price and the total number
of orders, products, and customers. You can also target different tables,
because for UNION ALL, SQL only requires that every SELECT returns the same
number of columns and that the data types of the columns match. So now
let's query this, and now in a single query we can see the big numbers, the
key metrics of our business. We can see the total sales, total quantity,
average price, and so on. This is a super report that you can generate for
any business, where you have in one go the full big picture of the
business. So this is what I generally do when I'm exploring a new database:
I put all those big numbers and measures in one query to have a better
understanding of the business.
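Put together, the key-metrics report described above could look like the sketch below. All table and column names (`gold.fact_sales`, `gold.dim_products`, `gold.dim_customers`, `sales_amount`, `quantity`, `price`, `order_number`, `product_key`, `customer_key`) are assumed spellings; the measure labels are illustrative.

```sql
-- Assumed table and column names throughout; adjust them to the real schema.
-- Every SELECT returns two columns (a label and a number), which is what
-- UNION ALL needs in order to stack the rows.
SELECT 'Total Sales' AS measure_name, SUM(sales_amount) AS measure_value
FROM gold.fact_sales
UNION ALL
SELECT 'Total Quantity', SUM(quantity)
FROM gold.fact_sales
UNION ALL
SELECT 'Average Price', AVG(price)
FROM gold.fact_sales
UNION ALL
SELECT 'Total Orders', COUNT(DISTINCT order_number)
FROM gold.fact_sales
UNION ALL
SELECT 'Total Products', COUNT(product_key)
FROM gold.dim_products
UNION ALL
SELECT 'Total Customers', COUNT(customer_key)
FROM gold.dim_customers;
```

The column names of the result come from the first SELECT, so only that one needs the `measure_name` and `measure_value` aliases.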
All right, my friends. So with that we now have a clear understanding of
the dimensions as well as the measures of our data sets. Now in the next
step we're going to start combining things together in order to generate
insights. And we're going to focus now on a very basic analysis. It is the
magnitude
analysis.

Okay. So now what exactly is a magnitude analysis? It's all about comparing
the measure values across different categories and dimensions. And this
helps us, of course, to understand the importance of the different
categories. Now the formula for that is going to be interesting, because
this time we will be mixing things together. First we aggregate a specific
measure, and then we say by a dimension. We need the dimension here in
order to split the measure. It sounds complicated, but it is very simple
and basic. So for example, we can say the total sales by country, the total
quantity by category, the average price by product, the total orders by
customer. If you follow this formula, you can generate an endless number of
insights by just combining any measure with any dimension, and each
combination is a new insight. So it's going to look like this: if you have
one measure that is, for example, 600, and you now put this measure
together with a dimension, this 600 is going to be split by the dimension
values. So A is going to have 200, B is going to have 300, and C 100. And
with that we can compare those categories, right? We can now see that
category B has the highest measure and C has the lowest. This helps us to
compare the values of the measure: what is the best category and what is
the worst category. So this is a very basic analysis. So let's go and apply
this formula to our data sets.
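The "measure by dimension" formula can be tried on a tiny inline data set before touching the real tables. This is an illustrative sketch only: the rows in the VALUES list and the names `category` and `amount` are made up so that the totals reproduce the 600 split into A, B, and C from the example above.

```sql
-- Hypothetical inline data; the VALUES table constructor works in SQL Server
-- and PostgreSQL. The six amounts add up to 600.
SELECT
    category,                      -- the dimension
    SUM(amount) AS total_amount    -- the aggregated measure
FROM (VALUES
    ('A', 120), ('A', 80),
    ('B', 100), ('B', 200),
    ('C', 60),  ('C', 40)
) AS t (category, amount)
GROUP BY category
ORDER BY total_amount DESC;
-- Result: B = 300, A = 200, C = 100
```

Removing `category` and the GROUP BY gives back the single big number, 600, which is the "exploring the measures" step from before.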
Okay. So now let's go and break all our measures down by dimensions. Here I
have prepared a few interesting examples, where first we're going to break
down the total number of customers (as we learned, we have 18,000) by
country. So the measure is the total customers and the dimension is going
to be the country. Let's write the query for that. So we're going to
SELECT. The first thing that we're going to add is the dimension, so it's
going to be the country. Then we need the measure: it's going to be the
COUNT of the customer key, which will give us the total customers. And we
need to select our table, so it's going to be the dimension customers. And
of course we have to group the data by the country, so GROUP BY country.
Let's go and execute it. With that you see again the list of countries (we
have our six countries) and then the total customers for each country, so
we can see the distribution of customers by country. But what we usually do
is sort the data by the measure, the total customers, like this, and we're
going to sort it descending. With that we will get the countries with the
most customers first. So let's go and execute it. Now we can see in the
results that the highest number of customers comes from the United States,
then Australia, the United Kingdom, and so on. There are also 337 customers
without country information; it is not available. So that's it, right? It
is very simple.
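The query built up in this walkthrough can be sketched like this, assuming the table is `gold.dim_customers` and the columns are spelled `country` and `customer_key`:

```sql
-- Assumed names: gold.dim_customers (table), country and customer_key (columns).
SELECT
    country,                               -- dimension
    COUNT(customer_key) AS total_customers -- measure
FROM gold.dim_customers
GROUP BY country
ORDER BY total_customers DESC;             -- countries with the most customers first
```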
So with that we have split the total number of customers by a dimension,
the country. Now of course we can split the data by a different dimension.
For the next one we are saying: find the total customers by gender. So here
it is the same thing. We have the same measure, the total customers, but we
are splitting the data by a different dimension. So just copy and paste,
and now instead of country we're just going to switch it to gender, here
and over here, and that's it. So let's go and execute. Now, as you can see,
the granularity of the gender is different from the countries. We have only
three values here, and we can see it is split almost evenly between male
customers and female customers. And of course this is going to help us
understand the demographics of our customers. As you can see, it was very
simple: we just switched the dimension. So you can split by the marital
status as well, and so on.
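Switching the dimension only touches the SELECT list and the GROUP BY. A sketch, assuming the column is spelled `gender` in `gold.dim_customers`:

```sql
-- Assumed names: gold.dim_customers (table), gender and customer_key (columns).
-- Same measure as before; only the dimension changed from country to gender.
SELECT
    gender,
    COUNT(customer_key) AS total_customers
FROM gold.dim_customers
GROUP BY gender
ORDER BY total_customers DESC;
```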
Now let's go and split the total products by category. Actually, the query
is going to be very simple as well. So SELECT, and here we're going to have
the same aggregate function, COUNT of the product key, as total products,
from our table gold dimension products. Then we're going to GROUP BY the
dimension, the category, and we're going to ORDER BY the same thing as
well, total products descending, from the highest to the lowest. So let's
go and execute it. With that we can see how many products we have in each
of those categories. We can see the biggest category is the components, and
after that the bikes. And it is interesting that we have seven products
with NULLs, which don't belong to any category. This is really nice.
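A sketch of the products-by-category query, assuming the table is `gold.dim_products` with columns spelled `category` and `product_key`:

```sql
-- Assumed names: gold.dim_products (table), category and product_key (columns).
-- GROUP BY puts all rows with a NULL category into one group of their own,
-- which is how the uncategorized products show up as a separate row.
SELECT
    category,
    COUNT(product_key) AS total_products
FROM gold.dim_products
GROUP BY category
ORDER BY total_products DESC;
```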
Let's go to the next one. What do we have over here? What is the average
cost in each category? This is a different style of question, but in the
end we're going to have the same thing. We have the average cost over here:
this is the measure, and the category is our dimension. It's like saying:
find the average cost by category. So we're going to copy the same query.
The dimension is the same, the category, but the measure is different. We
are not talking about the total products. We are going to say AVG, and here
we're going to have the column cost, and let's rename it average cost. And
that's it. For the ORDER BY as well, we have to use the new measure. So
let's go and execute it. Now we can see the most expensive category is the
bikes, which cost a lot compared to the accessories, of course. You can see
the accessories are only 13 and the bikes are 900. So this also gives us
insight into how expensive each category is, and as you can see, it is
always the same template.
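Here the dimension stays and the measure changes. A sketch, assuming the cost column in `gold.dim_products` is spelled `cost`:

```sql
-- Assumed names: gold.dim_products (table), category and cost (columns).
SELECT
    category,
    AVG(cost) AS avg_cost
FROM gold.dim_products
GROUP BY category
ORDER BY avg_cost DESC;   -- the ORDER BY has to follow the new measure
```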
can see it is always the same templates. We are splitting specific measure by a
We are splitting specific measure by a dimension. So let's keep going to the
dimension. So let's keep going to the next one. It says what is the total
next one. It says what is the total revenue generated for each category. So
revenue generated for each category. So again here the question is find the
again here the question is find the total revenue by category. So the total
total revenue by category. So the total revenue here is the measure and the
revenue here is the measure and the category again is the dimension. So now
category again is the dimension. So now the total revenue comes from the fact
the total revenue comes from the fact and the category comes this time from
and the category comes this time from the dimension. So that means we have to
the dimension. So that means we have to go and join tables right. So how we
go and join tables right. So how we going to do it? Let's go and start with
going to do it? Let's go and start with the select star from and I would like
always to start from the fact table. So fact sales f, and then we're going to go
and join it with the dimension. Usually I go with the left join in order not to
lose anything, because if you use an inner join you might lose a few orders and
a few sales from the fact table, and I don't want that. So left join with the
dimension, this one is going to be the products, and the key for that is going
to be very simple: it's going to be the product key, and the same thing for the
fact. So with that we join the fact table with the dimension. Now we have to go
and pick what we need. From the fact we need the sales, right, so sales amount,
and from the products we need the category, and we want to group up the data by
the category. So this part is done. What is missing is of course the
aggregation. We are aggregating the sales, so sum of sales, and we can call it
total revenue. So like this. And of course we can go and order the data by the
total revenue, by our measure, descending, from the highest to the lowest. So as
you can see it is exactly like the previous one, but here the data doesn't come
from only one table. Here it comes from two tables: the measure comes from the
fact and the dimension comes from the dimension products. And this is classic,
right? The dimension has all those descriptions and details about the products,
like the categories, and the fact table has all those measures and dates that we
use in order to calculate our measures. So that's it. Let's go and execute it.
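A minimal sketch of the query described here, written for SQL Server. The table names (fact_sales, dim_products), the column names, and the sample rows are assumptions made for illustration; the transcript only speaks them aloud and does not show the exact identifiers.

```sql
-- Hypothetical sample data standing in for the fact and dimension tables.
WITH dim_products (product_key, product_name, category) AS (
    SELECT * FROM (VALUES
        (1, 'Road Bike', 'Bikes'),
        (2, 'Helmet',    'Accessories'),
        (3, 'Jersey',    'Clothing')
    ) AS v (product_key, product_name, category)
),
fact_sales (order_number, product_key, sales_amount) AS (
    SELECT * FROM (VALUES
        ('SO1', 1, 1500),
        ('SO2', 1, 1200),
        ('SO3', 2, 35),
        ('SO4', 3, 50),
        ('SO5', 9, 20)   -- product_key 9 has no match in dim_products
    ) AS v (order_number, product_key, sales_amount)
)
SELECT
    p.category,
    SUM(f.sales_amount) AS total_revenue   -- measure from the fact table
FROM fact_sales AS f
LEFT JOIN dim_products AS p                -- LEFT JOIN keeps every fact row
    ON p.product_key = f.product_key
GROUP BY p.category                        -- dimension from dim_products
ORDER BY total_revenue DESC;               -- highest revenue first
```

With this sample data the unmatched sale is kept and reported under a NULL category. Switching to INNER JOIN would silently drop it, which is the loss the left join is meant to avoid.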
Now, as you can see in the output, the category bikes is bringing in most of the
revenue. Here it's in the millions, 28 million in sales, and the accessories and
the clothing are not really bringing in a lot of revenue. Both of them are below
1 million. So with that you can understand our business is making a lot of money
selling bikes, right? So my friends, as we are exploring the data we are
understanding more and more about our business. So let's keep going to the next
one. We have here the question: what is the total revenue generated by each
customer? So now we want to find out the top spenders, right? Select star, and
as well we start from the fact table, and this time we're going to left join it
with the customers, so the dimension customers, and we're going to use the
customer key for the join. And what we're going to do, we're going to go and get
maybe the customer key, and let's go and get as well the first name, maybe a few
details about the customer, and as well the last name. So those are the columns
that we want from the customers. And now what do we need? We need the
aggregation. So it's going to be the same thing: sum of sales amount as total
revenue. And we have to go and group up the data by all those three columns, so
we're going to go and copy paste. And at the end, as usual, we're going to order
by the measure, total revenue, descending. So that's it.
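The customer version of the same pattern could look like the sketch below. Again, fact_sales, dim_customers, the column names, and the sample rows are hypothetical stand-ins, not the course's actual schema.

```sql
-- Hypothetical sample data; names and values are assumptions.
WITH dim_customers (customer_key, first_name, last_name) AS (
    SELECT * FROM (VALUES
        (1, 'Ana',  'Lopez'),
        (2, 'Ben',  'Kim'),
        (3, 'Cara', 'Moss')
    ) AS v (customer_key, first_name, last_name)
),
fact_sales (order_number, customer_key, sales_amount) AS (
    SELECT * FROM (VALUES
        ('SO1', 1, 1500),
        ('SO2', 2, 35),
        ('SO3', 1, 700),
        ('SO4', 3, 50)
    ) AS v (order_number, customer_key, sales_amount)
)
SELECT
    c.customer_key,
    c.first_name,
    c.last_name,
    SUM(f.sales_amount) AS total_revenue
FROM fact_sales AS f
LEFT JOIN dim_customers AS c
    ON c.customer_key = f.customer_key
-- Every non-aggregated column in the SELECT list must appear in GROUP BY.
GROUP BY c.customer_key, c.first_name, c.last_name
ORDER BY total_revenue DESC;
```

Grouping by the customer key as well as the names keeps two different customers who share a name from being merged into one row.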
It is exactly like the previous one but with different dimensions. So let's go
and query it. And now we get a full list of all our customers, the 18,000, and
we can see the total revenue for each customer. So we can see Nicole and
Caitlyn, they are our top spenders and the most loyal customers that generated
sales and revenue for our business. This is really cool, right? Now let's go to
the next one. It says: what is the distribution of sold items across countries?
It is like finding the total quantity by country. So it is very simple. I'm
going to go and take the same query, because the country comes from the
dimension customers and the sold items, the quantity, comes from the sales. So
we are doing the same joins but with different dimensions and measures. What we
need from the customers is only the country, and the measure is going to be the
quantity. Here we're going to go and say total sold items, and we have to change
the group by to the country and sort the data by the new measure. That's it.
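As a sketch, swapping the dimension to the country and the measure to the quantity gives something like the following. The identifiers and sample rows are assumed for illustration only.

```sql
-- Hypothetical sample data; names and values are assumptions.
WITH dim_customers (customer_key, country) AS (
    SELECT * FROM (VALUES
        (1, 'Germany'),
        (2, 'France'),
        (3, 'Germany')
    ) AS v (customer_key, country)
),
fact_sales (order_number, customer_key, quantity) AS (
    SELECT * FROM (VALUES
        ('SO1', 1, 2),
        ('SO2', 2, 1),
        ('SO3', 3, 4),
        ('SO4', 1, 1)
    ) AS v (order_number, customer_key, quantity)
)
SELECT
    c.country,
    SUM(f.quantity) AS total_sold_items   -- new measure
FROM fact_sales AS f
LEFT JOIN dim_customers AS c
    ON c.customer_key = f.customer_key
GROUP BY c.country                        -- new dimension
ORDER BY total_sold_items DESC;
```

Only the selected column, the aggregated column, and the GROUP BY change; the join stays the same as in the revenue-by-customer query.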
And with that we are generating new reports by just changing the dimensions and
measures. So again, this is very interesting in order to understand which
country is generating good business for us. So my friends, as you might have
already noticed, if in the dimension we have a small number of unique values,
like in the countries we have here only seven values and in the gender we have
only three, we call those dimensions low cardinality dimensions, because we
have a low number of values inside them, and in the result we will get here, for
example, only seven rows. But if our dimension is high cardinality, like the
customers, where we have 18,000 unique customers, then our measure is going to
be split by those 18,000, and in the results we will get exactly the same
number of rows as customers. So the number of rows in the results really
depends on the cardinality of the dimension. So as you can see, we can generate
a lot of different reports by only following this formula: breaking down the
measure by a dimension. So we just generated eight different insights and
reports with only a few measures and dimensions. So now what you can do, you can
pause the video and try different dimensions and measures in order to have more
insights about our business. Okay. So as you can see, this is the basic analysis
that we can do in any data set or any domain, where we are aggregating a measure
by dimension.
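One way to check the cardinality of a dimension before grouping by it is to count its distinct values. This is an illustrative sketch that is not shown in the video; dim_customers, its columns, and the sample rows are assumptions.

```sql
-- Hypothetical sample data; names and values are assumptions.
WITH dim_customers (customer_key, country, gender) AS (
    SELECT * FROM (VALUES
        (1, 'Germany', 'Male'),
        (2, 'France',  'Female'),
        (3, 'Germany', 'Female'),
        (4, 'France',  'Male')
    ) AS v (customer_key, country, gender)
)
SELECT
    COUNT(DISTINCT country)      AS distinct_countries,   -- low cardinality
    COUNT(DISTINCT gender)       AS distinct_genders,     -- low cardinality
    COUNT(DISTINCT customer_key) AS distinct_customers    -- high cardinality
FROM dim_customers;
-- With the sample rows above this returns 2, 2 and 4.
```

A report grouped by one of these columns returns at most as many rows as that column has distinct values (plus one if NULLs are present).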
Now in the next and last step in our project we will be doing ranking
analysis. Okay. So what is ranking analysis? It is very basic. We're going to go
and order the values of our dimension based on a measure in order to identify
the top performers and as well the bottom performers. And the formula for that
is going to be the following. This time we're going to be ranking the dimension
by an aggregated measure. So for example, we're going to rank the countries by
the total sales, or we're going to find the top five products by the sold items,
the quantity, or the bottom three customers by total orders. So it's like the
magnitude analysis. We're going to have an ordered list of dimension values,
for example from the highest to the lowest, in order to quickly identify the top
performers. And of course we can go and filter the data by saying I would like
to have only the top two categories, and with that you are removing all other
dimension values that are not in the top two. In SQL we can use for that the
keyword top, or we can use the ranking window functions like rank, dense rank,
row number and so on. So let's go and apply our formula in order to rank our
data set. Okay. So now let's check our data. We're going to start with the first
question: which five products generate the highest revenue? So we are searching
for the best performing products in our business. Of course the first question
is: what are the dimension and the measure that we have in this question? Well,
the revenue, that means we need the sales from the fact, and the products, that
means we need the dimension products. Now, writing this query is going to be
very simple. We can use the group by as well, and I will not write it from
scratch. I'm just going to take this query over here, where we aggregated the
total sales by the category. Now what I have to do is just change the dimension.
So instead of the category we need the product name, and we are now aggregating
the data by the product name, because we need the top five products, right? The
revenue is the sales amount, and with that we have almost everything ready. So
let's go and execute it. And now we can see we have a list of all products in
our business, and as well we can see the total revenue. But the task says here
we need the top five, so we don't need all the products from our database. We
have to go and select only this subset. Now in order to do that in SQL Server,
it's very simple. We're going to go over here and say top five, and SQL is going
to go and return only the first five rows from the results. So let's go and
execute it. And as you can see now in the results, we have only five products
with the highest sales. And that's it. With that, we have solved the task and we
can see the top five products, and all of them are bikes.
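A minimal sketch of the top-five query, assuming hypothetical fact_sales and dim_products tables with the columns shown. TOP is SQL Server syntax; other databases typically use LIMIT or FETCH FIRST instead.

```sql
-- Hypothetical sample data; names and values are assumptions.
WITH dim_products (product_key, product_name) AS (
    SELECT * FROM (VALUES
        (1, 'Mountain Bike A'),
        (2, 'Mountain Bike B'),
        (3, 'Road Bike A'),
        (4, 'Road Bike B'),
        (5, 'Touring Bike A'),
        (6, 'Helmet'),
        (7, 'Socks')
    ) AS v (product_key, product_name)
),
fact_sales (product_key, sales_amount) AS (
    SELECT * FROM (VALUES
        (1, 3000), (2, 2500), (3, 2000), (4, 1800),
        (5, 1500), (6, 60),   (7, 30),   (1, 500)
    ) AS v (product_key, sales_amount)
)
SELECT TOP 5                               -- keep only the first five rows
    p.product_name,
    SUM(f.sales_amount) AS total_revenue
FROM fact_sales AS f
LEFT JOIN dim_products AS p
    ON p.product_key = f.product_key
GROUP BY p.product_name
ORDER BY total_revenue DESC;               -- TOP is applied after this sort
```

TOP only means "best" because of the ORDER BY: the rows are sorted first and then cut off after the fifth one, so without the ORDER BY the five rows returned would be arbitrary.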
Now let's go and check the other side. We want to find the five worst performing
products by the same measure, the sales. And this is very simple. What we're
going to do, we're going to go and take the same query over here, and now we're
going to go and sort the data from the lowest to the highest. So instead of
descending, we're going to remove it, and with that SQL is going to use
ascending. So let's go and execute it. And with that, as you can see, we are
getting the five worst performing products by just sorting the data differently.
So it is very simple, right? And with that we can see our five best sellers and
the five worst sellers. And now what we can do, we can go and just change the
dimension and generate different reports. Like instead of the product name,
let's go and check the subcategories: what are the best subcategories in our
data? So I just change the dimension, let's go and query. With that we can see
the best subcategories we have in our business, and the same thing if you want
to go and check the worst performing subcategories. So generating reports is
very simple. And now my friends, in SQL there are two ways to create a ranking.
We have a simple one, where we are using the group by clause together with the
keyword top. But if you are generating reports where things are more complex and
you need more flexibility, you should use the window functions. So let me show
you how I can
solve this task using the window function. So now I'm going to go and take
almost the same query. Let's put it over here. I'm going to get rid of the top
five. And let's see, we are still speaking about the product name, as well with
a group by. But now what we're going to do, we're going to go and generate a
rank. So we can go and use, for example, the row number. In SQL there are
different types of window functions for ranking; one of them is the row number,
or the rank. And then we're going to say over. Now we're going to go and sort
the data, like we have done in the previous one. We have to sort the data by the
total revenue, and the total revenue is the sum of sales, descending, and we're
going to call this rank products. So let's go and execute it. Now as you can
see, we have created a new column where we have a rank. So each product has one
rank, up to the last product, 130. So now what we are interested in is to go and
select the top five, right? Now in order to do that we need a second step.
That's why we're going to go and use a subquery. So we're going to say select
star from, and then we're going to put the whole thing in a subquery, something
like that. And all you have to do is use the new flag that we have created in
order to filter the data. So we're going to say where the rank products is
smaller than or equal to five. And with that we should get only the top five
products. So let's go and execute it. And as you can see, we are getting the
same results.
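The two-step window-function approach can be sketched as follows. The table names, column names, the alias ranked, and the sample rows are assumptions; rank_products follows the name spoken in the video.

```sql
-- Hypothetical sample data; names and values are assumptions.
WITH dim_products (product_key, product_name) AS (
    SELECT * FROM (VALUES
        (1, 'Mountain Bike A'),
        (2, 'Mountain Bike B'),
        (3, 'Road Bike A'),
        (4, 'Road Bike B'),
        (5, 'Touring Bike A'),
        (6, 'Helmet')
    ) AS v (product_key, product_name)
),
fact_sales (product_key, sales_amount) AS (
    SELECT * FROM (VALUES
        (1, 3000), (2, 2500), (3, 2000),
        (4, 1800), (5, 1500), (6, 60)
    ) AS v (product_key, sales_amount)
)
SELECT *
FROM (
    -- Step 1: aggregate per product and number the rows by revenue.
    SELECT
        p.product_name,
        SUM(f.sales_amount) AS total_revenue,
        ROW_NUMBER() OVER (ORDER BY SUM(f.sales_amount) DESC) AS rank_products
    FROM fact_sales AS f
    LEFT JOIN dim_products AS p
        ON p.product_key = f.product_key
    GROUP BY p.product_name
) AS ranked                    -- SQL Server requires an alias for the subquery
-- Step 2: filter on the rank; a window function cannot be used in WHERE directly.
WHERE rank_products <= 5
ORDER BY rank_products;
```

ROW_NUMBER gives every row a distinct number even when revenues tie. Replacing it with RANK or DENSE_RANK gives tied rows the same rank (RANK leaves gaps after a tie, DENSE_RANK does not), so the filter may then return more than five rows.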
Now, of course, with the window function it is more complicated than the first
one. But with the window function we get more flexibility in selecting more
columns or adding more types of aggregations and details to the query. And as
well, we can go and use different types of ranking functions that handle the
ties differently. So if the task is very simple like this, I'm going to go with
the simple group by. But if you are generating complex reports, I'm going to go
with the window function. So now what you can do, you can go and rank the data
by different dimensions and measures. For example, find the top 10 customers who
have generated the highest revenue. And as well, you can go and find the three
customers with the fewest orders placed. So again, we can go and reuse the
previous queries that we have generated. This query generates the customers and
their total sales, and all you have to do is say top 10 and then rerun the
query. And with that, we are getting the top 10 customers. And about the lowest
three customers, all we have to do is go and replace the measure. We are
counting the unique number of orders, so we're going to say total orders, and as
well go change the order by from descending to ascending. And we need the top
three. So let's go and execute it. So we can see the three customers that did
order only once, and they are the three customers with the fewest orders.
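A sketch of the fewest-orders query, assuming a hypothetical order_number column in fact_sales that repeats once per order line. All identifiers and sample rows are illustrative.

```sql
-- Hypothetical sample data; names and values are assumptions.
WITH dim_customers (customer_key, first_name, last_name) AS (
    SELECT * FROM (VALUES
        (1, 'Ana',  'Lopez'),
        (2, 'Ben',  'Kim'),
        (3, 'Cara', 'Moss'),
        (4, 'Dev',  'Rao')
    ) AS v (customer_key, first_name, last_name)
),
fact_sales (order_number, customer_key, sales_amount) AS (
    SELECT * FROM (VALUES
        ('SO1', 1, 1500),
        ('SO1', 1, 35),    -- second line of the same order
        ('SO2', 1, 700),
        ('SO3', 2, 50),
        ('SO4', 3, 20),
        ('SO5', 4, 90),
        ('SO6', 4, 15),
        ('SO7', 4, 60)
    ) AS v (order_number, customer_key, sales_amount)
)
SELECT TOP 3
    c.customer_key,
    c.first_name,
    c.last_name,
    COUNT(DISTINCT f.order_number) AS total_orders   -- unique orders, not order lines
FROM fact_sales AS f
LEFT JOIN dim_customers AS c
    ON c.customer_key = f.customer_key
GROUP BY c.customer_key, c.first_name, c.last_name
ORDER BY total_orders;                               -- ascending is the default
```

When many customers share the lowest count, which three come back is arbitrary. SQL Server's TOP 3 WITH TIES would return all rows tied with the third one instead.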
So as you can see, by just switching the dimensions and measures we are
generating completely new, important insights. And as you can see, as we are
exploring the data, we are understanding what the best products are and who the
top customers are, which is usually very important for reporting. All right my
friends, so with that we have covered the last step in our project, how to rank
our data, and with that we have covered all the steps of the project road map.
We have done a lot of exploration of the database, dimensions, and measures. We
have combined the dimensions and measures in order to do magnitude and ranking
analysis. Okay, my friends. So that's all about the EDA project. And now in the
next one, we will do the last type of project, the advanced data analytics. So
let's go. And now the type that we're going to cover is advanced analytics
projects using SQL, where we're going to write complex SQL queries to answer
real business questions. So we're going to use the advanced window functions,
CTEs, and subqueries, and we're going to go and script two big queries in order
to generate two reports. So with this type of project, you will learn how to
solve real business questions using advanced techniques. All right. So for this
project as well, we have a road map where we're going to progress through
different types of steps and analyses. We're going to do many things like change
over time, cumulative analysis, performance, data segmentation, and at the end
reporting, all using SQL. So let's start with the first step in the road map.
We're going to analyze the change over time. So let's go. Okay. So now what is
change over time? It is a technique in order to
analyze how a measure evolves over time. And this is very important in order to
track the trends and as well to identify the seasonality of your data. And the
formula for that is very simple. We're going to go and aggregate a measure, but
this time based on a date dimension. For example, the total sales by year, the
average cost by month. So if you combine any aggregated measure together with a
date column or dimension, then all you are doing is analyzing the change over
time. So for example, we're going to go and break our measure down this time by
the years. And with that we can immediately track how our business is doing
over time, over the years. So for example, we can see here the best year was
2024, and then we have a really hard decline in our business in 2025, and then
it's slightly going up in 2026. So with that we can quickly analyze the trends
of our business. So now let's go and check the trends and the changes over time
in our business. Okay. So now let's analyze the trends and changes over time in
our data. In order to do this kind of analysis, usually we target the fact
table, because there we usually have our measures and as well dates. So we have
the order date, shipping date and due date. Now what we can do, we can go and
analyze the sales performance over time. So as we learned, all we need is a
metric and a date.
Let's go, for example, and select the order date and as well one of those
measures, the sales amount, from our fact table. So let's go and query it. And
we can go and order the data by the order date, ascending. So let's go and
execute. And as you can see, we have nulls in our data. What can we do? We can
go and filter those rows out. We don't need them. So we're going to say where
order date is not null. So let's go and execute it again. All right. So now we
don't have those orders. Now, as you can see, we have sales over time, right? We
have a date and we have a measure. So this looks really good. But now what we're
going to do, we're going to go and aggregate the sales amount. So let's go and
say sum, and we're going to call it total sales. And then we group up the data
by the order date. So let's go and execute it.
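The day-level query built up in this passage could be sketched like this. The fact_sales table, its columns, and the sample rows are assumptions for illustration.

```sql
-- Hypothetical sample data; names and values are assumptions.
WITH fact_sales (order_date, sales_amount) AS (
    SELECT CAST(order_date AS date), sales_amount
    FROM (VALUES
        ('2013-01-05', 100),
        ('2013-01-05', 50),
        ('2013-02-10', 75),
        (NULL,         20)    -- order without a date
    ) AS v (order_date, sales_amount)
)
SELECT
    order_date,
    SUM(sales_amount) AS total_sales
FROM fact_sales
WHERE order_date IS NOT NULL      -- drop rows that cannot be placed on the timeline
GROUP BY order_date               -- one row per day
ORDER BY order_date;              -- oldest day first
```

The WHERE filter runs before the grouping, so the undated order never forms a NULL group in the result.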
And with that, as you can see, for each day we have the total sales. So now the
granularity of our data is the day, and we can say of course now we are
analyzing the sales over time. But usually we don't aggregate the data on the
day level; we want to have higher aggregations. For example, let's go to the
years. Now in order to change the date dimension here from a day to a year, we
have to use date functions, and there are a lot of date functions in order to
extract a date part. In order just to get the year we have a quick function
called year, and it is going to convert our date to a year. So let's call it
order year, and of course we have to go and group up the data by the year and as
well sort it by the year. So let's go and execute. Now we are at the year level
and we have only five years. So that means we have changed the aggregation from
the day to the year, and now it is very easy to analyze the performance of our
business over the years. So the first year was the lowest, and you can see 2013
is the best year in our business, and then it declined massively in 2014. And of
course we can go and add more measures to our data, not only the total sales.
For example, let's go and calculate the total number of customers. So we can say
count distinct customer key as total customers. So let's go and execute it. And
with that we can check: are we gaining customers over time, are there any trends
that we can see? And we can go and keep extending, like we can go and add the
total quantity. So sum of quantity as total quantity. So let's go and execute.
total quantity. So let's go and execute and with that we have really nice
and with that we have really nice picture in order to understand is the
picture in order to understand is the revenue increasing or decreasing over
revenue increasing or decreasing over the time what is the best year the worst
the time what is the best year the worst year are we gaining customers over time
year are we gaining customers over time if there any like trends that we can
if there any like trends that we can spot now by looking to the result you
spot now by looking to the result you can see this gives us highlevel
can see this gives us highlevel long-term view of your data and of
long-term view of your data and of course it helps for strategic decisions
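The yearly query with all three measures could look roughly like this. Table name, column names and sample rows are assumed for illustration:

```sql
-- Hypothetical sample rows; the real fact table has many more columns.
WITH fact_sales (order_date, customer_key, sales_amount, quantity) AS (
    SELECT * FROM (VALUES
        (CAST('2012-03-01' AS date), 1, 100, 1),
        (CAST('2012-07-15' AS date), 1,  40, 2),
        (CAST('2013-01-05' AS date), 2, 250, 3),
        (CAST('2013-12-20' AS date), 3, 300, 1)
    ) AS v (order_date, customer_key, sales_amount, quantity)
)
SELECT
    YEAR(order_date)             AS order_year,      -- date dimension at year level
    SUM(sales_amount)            AS total_sales,
    COUNT(DISTINCT customer_key) AS total_customers,  -- each customer counted once per year
    SUM(quantity)                AS total_quantity
FROM fact_sales
WHERE order_date IS NOT NULL
GROUP BY YEAR(order_date)
ORDER BY YEAR(order_date);
```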
Now what we can do is drill down to the months. So we can aggregate the data by
the month regardless of the year, to give us an idea of how each month performs
overall. All we have to do is switch the function from YEAR to MONTH, like
this, and of course do the same in the GROUP BY and the ORDER BY. Let's go and
execute. In the output we get all the months, and guess which month is the best
for sales: December, of course, because you have Christmas and so on. And the
worst month, as you can see, is February. So with that we are understanding the
seasonality and the trend patterns of our business. And since we are not
including the year in our analysis, we are aggregating all the data from all
years.
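As a sketch, switching from YEAR to MONTH gives one row per calendar month, pooled across all years. Names and sample rows are again assumptions:

```sql
-- Hypothetical sample rows spanning two years.
WITH fact_sales (order_date, sales_amount) AS (
    SELECT * FROM (VALUES
        (CAST('2012-02-10' AS date),  60),
        (CAST('2012-12-20' AS date), 200),
        (CAST('2013-02-03' AS date),  40),
        (CAST('2013-12-24' AS date), 300)
    ) AS v (order_date, sales_amount)
)
SELECT
    MONTH(order_date) AS order_month,   -- 1..12, the year is ignored
    SUM(sales_amount) AS total_sales    -- February 2012 and February 2013 land in the same row
FROM fact_sales
WHERE order_date IS NOT NULL
GROUP BY MONTH(order_date)
ORDER BY MONTH(order_date);
```

Adding `YEAR(order_date)` to the SELECT, GROUP BY and ORDER BY lists turns this into the per-year, per-month breakdown.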
Now what we can do is make it more specific for each year by adding the year
information to our query, so we have both a year and a month. Let me just
change this one to a month. And of course we have to add it to the GROUP BY and
the ORDER BY. So let's go and execute. With that we are aggregating the data
per month of a specific year, so now we have all the months of all years. And
if you want to focus on only one year, you can filter the data by the order
year, and then you can see how the data is evolving over time. Now of course in
SQL we can format the date differently. Instead of using the year and the month
in separate columns, we can use the DATETRUNC function. So instead of this,
we're going to say DATETRUNC, and if you want the granularity of your date at
the month level, we say month and then the date. With that you get both the
year and the month in one value, and let's call it order date, like this. So
let's go and execute. Now in the output we get exactly the same result as
before, but instead of having two columns for the year and the month, we have
everything in one. And because we said month, it's going to remove all the
days. So as you can see, it always starts with one, the first day of the month,
and with that you get one row for each month of each year. And if you want to
change that quickly to a year, you just change the date part to year and you
get the granularity of the year.
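A sketch of the DATETRUNC variant. `DATETRUNC` is available in SQL Server 2022 and later; the table, columns and sample rows are assumed:

```sql
-- Hypothetical sample rows.
WITH fact_sales (order_date, sales_amount) AS (
    SELECT * FROM (VALUES
        (CAST('2013-01-05' AS date), 100),
        (CAST('2013-01-28' AS date),  50),
        (CAST('2013-02-10' AS date),  70),
        (CAST('2014-01-15' AS date),  90)
    ) AS v (order_date, sales_amount)
)
SELECT
    DATETRUNC(month, order_date) AS order_date,  -- every day collapses to the 1st of its month
    SUM(sales_amount)            AS total_sales
FROM fact_sales
WHERE order_date IS NOT NULL
GROUP BY DATETRUNC(month, order_date)
ORDER BY DATETRUNC(month, order_date);
```

Replacing `month` with `year` in all three places moves the granularity up to the year.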
Now if you don't like this format and you would like your own specific format,
you can use the FORMAT function. For FORMAT, the first argument is the date,
and then you give the format that you want. For example, it starts with the
year, and let's say I would like the abbreviation of the month name, so
something like this. And of course adjust the GROUP BY and the ORDER BY. So
let's go and execute it. And with that we get our format: the year, a dash,
then the abbreviation of the month. But you have to be careful which function
you use, because with FORMAT the output is a string, and as you can see, you
cannot sort it correctly. The data here is sorted by the year but not by the
month. But if you use DATETRUNC, you can see the data is sorted correctly, and
if we switch it to month it is as well. Everything is sorted correctly because
the output here is a date, not a string, and SQL sorts dates correctly. And if
you use YEAR and MONTH, the output is an integer, and sorting an integer is not
a problem. So of course you can pick the one that you like. So that's it. Let's
go and execute it. And now you can keep analyzing by picking another date in
our data set and another measure. So as you can see, it is very simple. Okay.
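To make the sorting caveat concrete, here is a sketch with FORMAT. The format string `'yyyy-MMM'` is an assumption about what "year, dash, month abbreviation" looks like, and ordering by `MIN(order_date)` is one possible workaround that is not shown in the video:

```sql
-- Hypothetical sample rows.
WITH fact_sales (order_date, sales_amount) AS (
    SELECT * FROM (VALUES
        (CAST('2013-01-05' AS date), 100),
        (CAST('2013-02-10' AS date),  70),
        (CAST('2013-04-02' AS date),  30)
    ) AS v (order_date, sales_amount)
)
SELECT
    FORMAT(order_date, 'yyyy-MMM') AS order_date,  -- a string such as '2013-Jan'
    SUM(sales_amount)              AS total_sales
FROM fact_sales
WHERE order_date IS NOT NULL
GROUP BY FORMAT(order_date, 'yyyy-MMM')
-- Ordering by the formatted string would sort alphabetically (Apr before Feb
-- before Jan), so this sketch orders by the earliest real date in each group.
ORDER BY MIN(order_date);
```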
So that's all about how to analyze trends and change over time. Now in the next
step we're going to do some advanced aggregations by doing cumulative
analysis. Okay. So what is cumulative analysis? It is aggregating the data
progressively over time, and this is a very important technique for
understanding how our business is progressing over time, whether it is growing
or declining. It is a very interesting analysis. The formula is very similar to
change over time, but instead of a simple aggregation on the measure, we
aggregate our measure cumulatively, so we are adding values on top of each
other. And the data is again split by the date dimension, because we want to
track the progress over time. For example, we can find the running total of
sales or the moving average of sales by month. So now let's take again our
simple example where our sales are split by year. This is the classic change
over time. To make it cumulative, we take the measure and add to it. For
example, for 2024 we have 300. Now for 2025, we add the 300 together with the
100 to make it cumulative, so for 2025 we have 400. And the same thing for
2026: we add the 400 together with the 200, and with that we get 600. So as you
can see, we keep adding the values in order to generate something called a
cumulative value. For this type of analysis, we use the aggregate window
functions in SQL to find the cumulative values. So now let's go and apply our
formula to find out whether our business is growing or declining. So let's go.
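The 300 / 100 / 200 example can be reproduced with a small self-contained query. The CTE name and column names are made up for illustration; the numbers are the ones from the example:

```sql
WITH yearly_sales (order_year, total_sales) AS (
    SELECT * FROM (VALUES (2024, 300), (2025, 100), (2026, 200))
        AS v (order_year, total_sales)
)
SELECT
    order_year,
    total_sales,
    -- aggregate window function: sums the current row and all earlier years
    SUM(total_sales) OVER (ORDER BY order_year) AS running_total_sales
FROM yearly_sales
ORDER BY order_year;
-- running_total_sales: 2024 -> 300, 2025 -> 400, 2026 -> 600
```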
Okay, so now we have to analyze the following: we're going to calculate the
total sales for each month, and also the running total of sales over time, in
order to analyze the trends. So let's see how we're going to do that. Let's
start with the easy part, where we calculate the total sales for each month. So
we are calculating the change over time, and we have already done that. All we
need is a date and a measure. Our date is going to be the order date, and the
measure is going to be the sales amount from our fact table. So let's query
this. Now we want to find the total sales for each month. That means we're
going to change the granularity of the order date from a day to a month, and I
usually like using DATETRUNC for this kind of task. The granularity is going to
be the month, so this is the order date. And now for the sales we're going to
use the aggregate function SUM, sales as total sales. And of course we have to
group the data by the date. So let's go and execute it. As you can see, we now
have the total sales for each month. And don't forget to get rid of the nulls,
so we can say WHERE order date IS NOT NULL. Now it looks better; we don't have
nulls. And of course we can order the data by our date. Now our measure is just
aggregated for each month individually, right? But we don't want that. We want
a running total, so we'd like to have a cumulative metric. In order to do that,
we have to use a window function. So let's go and do that. We will use a
subquery for that.
That's just to keep it simple. So what do we need? We need the order date and,
let's say, the total sales, and here we need our window function. Then we put
the rest in a subquery. And of course we can get rid of the ORDER BY there,
because the window function defines its own ordering anyway. So now let's start
writing our window function. We have the SUM of total sales, since we want to
sum up those new values, and we build the window like this: OVER. We don't have
to partition anything, so we can immediately say ORDER BY our new order date
that we have calculated, and we want it ascending. So actually that's it, as
running total sales. So let's try that out. Now if you look at the result, you
can see that all those values are cumulative, and it works like this. The first
running total is equal to the total sales, because there is nothing before it.
For the next row, it adds this value to the previous one, and with that we get
the running total value. Moving on to the third row, it adds all three values
together, and of course this gives us the running total for this month, and so
on. So as SQL moves through the window, it always adds the current value to all
previous values. And this is because of the default frame of the window: the
frame is between unbounded preceding and the current row. That means, for
example, if we are at this row over here, the current total sales for this
month is this one, and unbounded preceding is all the values before this month.
So we are getting all the previous values together with the current value, and
with that we get the effect of the running total sales.
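Put together, the monthly running total might look like the sketch below. Table name, column names and sample rows are assumptions, and the outer ORDER BY is added here only so the output order is deterministic:

```sql
-- Hypothetical sample rows.
WITH fact_sales (order_date, sales_amount) AS (
    SELECT * FROM (VALUES
        (CAST('2010-12-29' AS date),  40),
        (CAST('2011-01-05' AS date), 100),
        (CAST('2011-01-20' AS date),  50),
        (CAST('2011-02-10' AS date),  70),
        (CAST(NULL AS date),          20)
    ) AS v (order_date, sales_amount)
)
SELECT
    order_date,
    total_sales,
    -- No frame is written, so the default applies: from unbounded preceding
    -- up to the current row, which is what produces the running total.
    SUM(total_sales) OVER (ORDER BY order_date ASC) AS running_total_sales
FROM (
    -- Subquery: plain monthly totals (the change-over-time step).
    SELECT
        DATETRUNC(month, order_date) AS order_date,
        SUM(sales_amount)            AS total_sales
    FROM fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY DATETRUNC(month, order_date)
) AS monthly_sales
ORDER BY order_date;
```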
And now of course, as you can see, it is going through all the years, right?
Now we can limit the running total to only one year, so for each new year it
has to reset and start from scratch. That means we are partitioning the data:
for each year we would like to have a partition. The first year is 2010, and it
is one row. And for 2011, we get the whole partition over here. So, in order to
partition our window, it's very simple: we say PARTITION BY the year of the
order date. That's it. Let's go and execute it. Now, let's check the first
partition, for 2010. You can see the running total is the same as the first
month, and since we have only one month, that's it for this year. Now, as we go
to the next year, as you can see, it resets. So you can see the running total
sales for 2011 starts exactly at the January value. It is not adding the
current value to the previous one, because the previous one is outside of the
window. So as you can see, we are getting a running total for the whole year,
and once we hit a new year it resets. So it is working, and this is how you can
create cumulative values in SQL.
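A sketch of the per-year reset. The transcript only says "partition by the order date"; using `YEAR(order_date)` as the partition key is an assumption here, chosen because it produces the reset at each new year that is described. Names and sample rows are also assumed:

```sql
-- Hypothetical sample rows covering the end of 2010 and the start of 2011.
WITH fact_sales (order_date, sales_amount) AS (
    SELECT * FROM (VALUES
        (CAST('2010-12-29' AS date),  40),
        (CAST('2011-01-05' AS date), 100),
        (CAST('2011-02-10' AS date),  70),
        (CAST('2011-03-03' AS date),  30)
    ) AS v (order_date, sales_amount)
)
SELECT
    order_date,
    total_sales,
    SUM(total_sales) OVER (
        PARTITION BY YEAR(order_date)   -- one window per year, so the sum restarts
        ORDER BY order_date
    ) AS running_total_sales
FROM (
    SELECT
        DATETRUNC(month, order_date) AS order_date,
        SUM(sales_amount)            AS total_sales
    FROM fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY DATETRUNC(month, order_date)
) AS monthly_sales
ORDER BY order_date;
```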
And of course, if you would like to change the granularity of our data, it is
very simple. All you have to do is go over here and, instead of month, make it
a year. And of course don't forget to change the GROUP BY as well. So let's go
ahead and execute. With that we are creating cumulative values for each year.
But of course it makes no sense to partition by the year now, so let's remove
it and execute again. And with that you are creating the running total sales,
the cumulative metric over the years. So as you can see, it is very simple. Now
we can add another measure and another aggregation. For example, instead of
finding the running total, we can find the moving average. So let's get the
moving average of the price. First we have to calculate the average of the
price as average price. Then we make another window function over here, where
we say AVG of the average price, and we call it moving average. That's it. So
let's go and execute it. And with that you are getting the moving average price
of our sales.
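A sketch at year granularity with both window functions. With the default frame, the AVG window averages all years up to the current one; the last column shows a fixed three-row window instead, which is an addition for comparison and not part of the video. Names and sample rows are assumed:

```sql
-- Hypothetical sample rows; prices are decimals so AVG does not truncate.
WITH fact_sales (order_date, sales_amount, price) AS (
    SELECT * FROM (VALUES
        (CAST('2011-05-01' AS date), 100, 10.0),
        (CAST('2012-06-01' AS date), 200, 20.0),
        (CAST('2013-07-01' AS date), 300, 30.0),
        (CAST('2014-08-01' AS date),  50, 40.0)
    ) AS v (order_date, sales_amount, price)
)
SELECT
    order_date,
    total_sales,
    SUM(total_sales) OVER (ORDER BY order_date) AS running_total_sales,
    AVG(avg_price)   OVER (ORDER BY order_date) AS moving_average_price,
    -- Assumption for comparison: average over the current and two previous rows only.
    AVG(avg_price)   OVER (ORDER BY order_date
                           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_average_3
FROM (
    SELECT
        DATETRUNC(year, order_date) AS order_date,
        SUM(sales_amount)           AS total_sales,
        AVG(price)                  AS avg_price
    FROM fact_sales
    WHERE order_date IS NOT NULL
    GROUP BY DATETRUNC(year, order_date)
) AS yearly_sales
ORDER BY order_date;
```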
All right. So now you might still be asking what is really different between a
normal aggregation and a cumulative aggregation. Well, we usually use normal
aggregations to check the performance of each individual row. If I want to see
how each year is performing, I do a normal aggregation. But if you want to see
a progression and understand how your business is growing, you have to use
cumulative aggregations, because there you can easily see the progress of your
business over the years. So there is a difference between using a cumulative
value and a normal aggregation. All right. So with that you are done with
cumulative analysis, and you have learned all the different types of
aggregations. Now, as the next step in our roadmap, we're going to do
performance
analysis. Okay. So what is performance analysis? It is the process of
comparing the current value with a target value in order to judge the
performance of a specific category, and this helps us measure success. The
formula for that is very simple: we find the difference between the current
measure and the target measure by subtracting them. For example, we can compare
the current sales with the average sales, or the current year's sales with the
previous year's sales, or the current sales with the lowest sales, or maybe the
highest sales. So as you can see, we are always comparing the current measure
with a target, with something else. For example, we have here again a measure
that is split by three categories. Those values are the current values. Now say
you have a target, for example the average. As you can see, for each row we
have 200. Once we have those two things in one row, we can simply subtract
them. For A, the current value is exactly equal to the average: both of them
are 200, and the difference between them is zero. So this category is
performing at the average. For the next one we have 300 and the target is 200,
so the difference between them is 100. That means this category is performing
very well; it is a good performer. For the last one we get minus 100, which
means it is below the average, so it is not performing very well. For this type
of analysis we usually use window functions, like the aggregate window
functions SUM, AVG, MAX and MIN, or the value window functions like LEAD and
LAG. So now let's go back to SQL and apply this formula in order to measure the
performance of our business.
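The three-category example can be written as a small self-contained query. The names are made up, and the value 100 for the third category is inferred from the stated difference of minus 100 against an average of 200:

```sql
WITH category_sales (category, current_sales) AS (
    SELECT * FROM (VALUES ('A', 200), ('B', 300), ('C', 100))
        AS v (category, current_sales)
)
SELECT
    category,
    current_sales,
    AVG(current_sales) OVER ()                 AS avg_sales,  -- the target, repeated on every row
    current_sales - AVG(current_sales) OVER () AS diff_avg    -- current minus target
FROM category_sales
ORDER BY category;
-- diff_avg: A -> 0, B -> 100, C -> -100
```

Swapping the target for `LAG(current_sales) OVER (ORDER BY ...)` would compare each row with the previous one instead of with the average.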
measure the performance of our business. So let's go. All right my friends. So
So let's go. All right my friends. So now we have the following task. analyze
now we have the following task. analyze the yearly performance of products by
the yearly performance of products by comparing their sales to both the
comparing their sales to both the average sales performance of the
average sales performance of the products and the previous year sales.
Okay, this sounds a little bit complicated and serious. Let's have some
coffee before we start. Okay, so what do we have over here? It is talking
about the yearly performance of products. So that means we need the order date
as a dimension and as well the product, and the measure that is used over here
is the sales. So let's do it step by step. We need things from our fact table,
so fact sales, and we need the product. I'm going to go and get it from the
dimension product in order to have a nice name. So we have to join the data
by the product key, and I'm going to go and change the alias to P. So product
key. Okay. So with that we have our two tables. Now let's go and select our
columns. We need the order date, we need the product name, and we need our
measure. So it's going to be the sales amount.
All right. So now let's go and query that information. Now we have to
analyze the yearly performance. That means we don't need the day. The
granularity is the year. So that's why let's go and convert it using the year
function, and we're going to call it order year. And of course we then have to
go and aggregate the sales, and I'm going to call it current sales. And of
course we have to group the data by the year and as well by the product name.
So that's it. Let's go and execute it. And of course I'm going to go and get
rid of all those nulls. So where order date is not null. All right.
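The query built up in this step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exact code from the video: the SQL is run through Python's built-in sqlite3 module so it is self-contained, the table and column names (fact_sales, dim_products, order_date, sales_amount, product_name) are guessed from how they are spoken, the rows are made up, and SQLite's strftime stands in for the YEAR function used in the course.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_products (product_key INTEGER, product_name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, order_date TEXT, sales_amount INTEGER);
INSERT INTO dim_products VALUES (1, 'Helmet'), (2, 'Gloves');
INSERT INTO fact_sales VALUES
    (1, '2012-03-01', 100), (1, '2012-07-15', 50),
    (1, '2013-02-10', 300), (2, '2013-05-05', 40),
    (2, NULL, 999);  -- row without an order date, filtered out below
""")

# Join fact and dimension, reduce the date to a year, and total the sales
# per year and product.
yearly_sales_sql = """
SELECT
    CAST(strftime('%Y', f.order_date) AS INTEGER) AS order_year,  -- YEAR(f.order_date) in SQL Server
    p.product_name,
    SUM(f.sales_amount) AS current_sales
FROM fact_sales AS f
JOIN dim_products AS p
    ON f.product_key = p.product_key
WHERE f.order_date IS NOT NULL
GROUP BY CAST(strftime('%Y', f.order_date) AS INTEGER), p.product_name
ORDER BY p.product_name, order_year
"""
yearly_sales = conn.execute(yearly_sales_sql).fetchall()
for row in yearly_sales:
    print(row)
```

Each output row is one product in one year, which is the granularity the rest of the analysis builds on.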
So with that we have solved the first part. We have the yearly performance
of the products. Now in the task we have to compare this value, the current
sales, to the average sales performance of the products. So that means we need
the average, and as well the previous year's sales. So that means we have to
compare each value to the previous year for the same product, of course. So
things are getting a little bit more complicated, and with that we need the
help of the window functions. Let's do it one by one.
Let's focus on the average sales. So now what we're going to do: based on
those values, based on these results, we will do new calculations and
aggregations. And in order to do that, either we use a subquery or a CTE. I'm
going to go with a CTE because it looks nicer. So with yearly product sales,
this is the new name that we are giving to these results. And now what we're
going to do, we're going to build queries on top of these results. So first of
all I will just select everything from this table, yearly product sales, just
to test. So it is working. Now I'm selecting data from our CTE. So now the next
step, I'm going to go and list all the columns that I want in my results: the
order year, the product name, the current sales. This is just nicer in order to
have control over which columns you want to present in the end results.
Now the next step, I'm going to go and order the data first by the product
name and then by the order year. And with that we can have a better
understanding of the results. So we can see this product has three years of
sales, and those are the current sales for each year. So now we have to go and
calculate the average of those three sales. In order to do that we're going to
use the average of current sales, over. We have to decide now how to partition
the data. Since we are focusing on the products, we have to partition the
results by the product name. So we're going to say partition by product name,
and we don't have to sort the data because we are using the average, so it
doesn't matter how the data is sorted. So let's call it average sales. So
let's go ahead and execute it.
And now if you look at the results for this product, the average of all
those three sales values is 13,000. So now as you can see, for each row we have
the current sales side by side with the average sales, and the same thing for
the next product as well. So now since we have both pieces of information on
the same row, the current sales and the average, we can calculate the change,
the difference between the current value and the average value. So all we have
to do is to go and subtract, right? So we're going to say the current sales
minus the average sales, and we're going to call it the difference in average.
So let's go and execute it. And now, as you can see, we are getting the
comparison.
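The average-per-product step and the subtraction described above could look something like this. It is a sketch only: the CTE name yearly_product_sales follows the spoken name, the column names are assumptions, and the CTE is filled with made-up totals through a VALUES list so the example runs on its own in Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The CTE holds hypothetical yearly totals instead of the real aggregation.
avg_sql = """
WITH yearly_product_sales(order_year, product_name, current_sales) AS (
    VALUES (2012, 'Helmet', 100), (2013, 'Helmet', 300), (2014, 'Helmet', 200),
           (2013, 'Gloves', 40),  (2014, 'Gloves', 60)
)
SELECT
    order_year,
    product_name,
    current_sales,
    -- One average per product, repeated on every row of that product.
    AVG(current_sales) OVER (PARTITION BY product_name) AS avg_sales,
    current_sales - AVG(current_sales) OVER (PARTITION BY product_name) AS diff_avg
FROM yearly_product_sales
ORDER BY product_name, order_year
"""
avg_rows = conn.execute(avg_sql).fetchall()
for row in avg_rows:
    print(row)
```

No ORDER BY is needed inside the window because an average does not depend on row order. One dialect difference to keep in mind: SQLite's AVG always returns a floating-point value, whereas SQL Server returns an integer average for an integer column.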
We have the differences between the current and the average. And of course
what I like to do is to make a flag, like an indicator, of whether we are
above the average, below the average, or at the average. In order to do that
we're going to go and use the case when statement. So if the difference is
higher than zero, then we are above the average, right? Above average. Oh,
let's have an abbreviation for that. And if we are below zero, that means we
are below the average, right? So below zero, then below average. And if it is
exactly zero, else, then it is average. So that's it. Let's end it, and I'm
going to call it average change. So let's go and execute it.
Now if you focus again on one of the products, you can see the current sales
of this product in 2012 is below the average. It is really low. And for the
next year, 2013, it is above the average. It was a really nice year for this
product. And the last year, 2014, it was again below the average. So with that
we have a really nice flag in order to see quickly whether we are above or
below the average. And it is interesting to see whether we have zeros. So yeah,
sometimes it is exactly like the average, and here we have like a zero. It's
not below or above. So with that we are comparing the sales performance of
each product with the average. And as you can see, it is really simple using
the window functions. So let's go and check our task again. We have compared
the current sales to the average sales performance. Now we have to compare it
as well with the previous year's sales. So let's go back to our example over
here.
This time we have to compare the current sales not with the average but with
the previous year. We don't have to write another CTE or query. We can
continue with the same results. So now all you have to do is to access the
previous year. And in order to do that, we have an amazing window function
called lag. So let's do it step by step. We're going to go and create a new
column with lag. I want to access the previous value of what? The current
sales, right? So current sales, and over. We still have to partition the data
by the product name because we focus on the products. So partition by product
name. But now in order to access the previous value, we have to sort the data,
and we're going to sort it by the years. We need the previous year. So we're
going to say order by order year, and we're going to sort it ascending, from
the lowest to the highest. So we're going to leave it like this. And with that
this window function is going to give us the previous year's sales of the
products. So I'm just going to call it previous year sales, like this.
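The lag step described above might be written like this. Again this is a sketch with assumed names (yearly_product_sales, order_year, product_name, current_sales, and the alias py_sales) and made-up numbers, run through Python's sqlite3 module so it can be executed as is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

lag_sql = """
WITH yearly_product_sales(order_year, product_name, current_sales) AS (
    VALUES (2012, 'Helmet', 100), (2013, 'Helmet', 300), (2014, 'Helmet', 200),
           (2013, 'Gloves', 40),  (2014, 'Gloves', 60)
)
SELECT
    order_year,
    product_name,
    current_sales,
    -- Previous row within the same product, with rows sorted by year ascending.
    LAG(current_sales) OVER (
        PARTITION BY product_name
        ORDER BY order_year
    ) AS py_sales
FROM yearly_product_sales
ORDER BY product_name, order_year
"""
lag_rows = conn.execute(lag_sql).fetchall()
for row in lag_rows:
    print(row)
```

Unlike the average, lag depends on row order, so the ORDER BY inside the window is required. The first year of each product has no earlier row, so its previous-year value comes back as NULL (None in Python).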
And I think here we have something wrong. Okay. So let's go and execute it,
and let's go and focus on one of those products. So now for the first year of
this product, the previous year is null, right? We don't have any data from the
previous year. But for 2013, we have a previous year of 2012. So that's why
now we are getting the previous value of the sales based on the years. And the
same thing for the last year over here. You can see we are getting the previous
sales. So it is working. And for the next window, same thing: for the first
year we will get null, and for the others we will get the previous sales from
the previous year. So with that we now have the previous sales, and if you
check this over here, we have in the same row now the current sales of the
current year and as well the sales of the previous year.
Now what we have to do is the same thing. We have to go and subtract those two
values in order to compare them, right? So we will get the current sales minus
the whole thing, the whole window function, and we're going to call it
difference of the previous year. And with that we are calculating the
difference between them. So for this year, for this product, as you can see,
the difference here is really big between the current sales and the previous
year. Now of course what we can do, we can go and make a flag or an indicator
as well. I'm going to go and copy the whole thing from the previous average
flag, but we have to go and put in the right function, this one, and the same
over here. And now it is not above or below the average. I'm going to say it is
increasing or decreasing, right? So increase or decrease. And we're going to
call it previous year change, and instead of average we can say no change. So
let's go and execute it. And I have an extra comma here. Let's go and execute
it.
So again, let's go and focus on one of those products. For the first year of
this product, there is no change because there is no previous year. For the
next year of this product, we have an increase, right? Because the current
sales are way higher than the previous year. And now going to the last year of
this product, we have a decrease because the current sales are less than the
previous year. So my friends, we call this type of analysis year-over-year
analysis. And if you want to calculate the month-over-month analysis, it's very
simple. All you have to do is to go and change the function from year to
month, and with that you are extracting the month part. And the difference
between analyzing the months and the years is of course the scope.
Year-over-year is good for long-term trend analysis, whereas on the other
hand month-over-month is for short-term trend analysis. You are just focusing
on the seasonality of your data. So this is how we analyze the performance of
our business by comparing the current measure with a target measure. And you
can go and use different dimensions and measures. So instead of the sales you
can check the quantity, instead of products you can check the customers, and
you can go and compare the current information not only with the average or
the previous year. You can compare it with the lowest sales and the highest
sales, and it can open the door for many different insights. But we are always
using the same method, using the window functions: we compare the current
value with another value in our data set. So this is how we do performance
comparison. All right. So with that you have learned how to analyze the
performance of our business. Now in the next step we're going to do
part-to-whole analysis.
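As a recap of the performance analysis just completed, the two flags (against the average and against the previous year) can be sketched together in one query. The labels and aliases here (Above Avg, Below Avg, Avg, Increase, Decrease, No Change, avg_change, py_change) are assumptions based on what is spoken, the data is made up, and the SQL is run through Python's sqlite3 module for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

flags_sql = """
WITH yearly_product_sales(order_year, product_name, current_sales) AS (
    VALUES (2012, 'Helmet', 100), (2013, 'Helmet', 300), (2014, 'Helmet', 200),
           (2013, 'Gloves', 40),  (2014, 'Gloves', 60)
)
SELECT
    order_year,
    product_name,
    -- Flag against the product's average across all its years.
    CASE
        WHEN current_sales - AVG(current_sales) OVER (PARTITION BY product_name) > 0 THEN 'Above Avg'
        WHEN current_sales - AVG(current_sales) OVER (PARTITION BY product_name) < 0 THEN 'Below Avg'
        ELSE 'Avg'
    END AS avg_change,
    -- Flag against the previous year; a NULL difference falls through to ELSE.
    CASE
        WHEN current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) > 0 THEN 'Increase'
        WHEN current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) < 0 THEN 'Decrease'
        ELSE 'No Change'
    END AS py_change
FROM yearly_product_sales
ORDER BY product_name, order_year
"""
flag_rows = conn.execute(flags_sql).fetchall()
for row in flag_rows:
    print(row)
```

Because any comparison with NULL is not true, the first year of each product lands in the ELSE branch and is labeled as no change, which matches the behavior described for the first year.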
So let's go. Okay. So now what exactly is part-to-whole analysis? Well, we use
it in order to find out the proportion of a part relative to the whole. Here
we're going to analyze how an individual category is contributing to the
overall total, in order to understand which category has the most impact on the
overall business. So now for the formula, it is very simple. You have to go
and pick one of your measures, divide it by the total of the measure, and then
multiply it by 100 in order to find the percentage by a specific dimension.
Like for example, if you take the sales, you divide the sales by the total
sales, multiplied by 100, by the category. Or you take the quantity divided
by the total quantity and then find the percentage by country.
So for example, again we have our measure split by categories. But now
instead of having this number, what we're going to do, we're going to
calculate the percentage. So for the first one, we're going to take the 200
divided by 600 and multiply it by 100, so we're going to get the percentage
33. Once we do that for all the categories, it's going to be very easy to see
that the category P is contributing to the overall number by 50%, which makes
it of course a top performer. So you can visualize it in your head as a pie
chart, and you can see how each part is contributing to the whole pie. And with
that it can help us to understand the importance of each category to our
business. So now let's go and apply this formula to our measures in order to
understand the importance of our categories. So let's go.
Okay. So now let's do part-to-whole analysis. All we need is one dimension and
one measure. So for example we have the following task. It is very simple:
which categories contribute the most to the overall sales? So now let's go and
do it step by step. First we're going to go and collect the information. We
need the category, we need the sales amount, and that information comes as
usual from the fact sales and from our dimension, the product, right? So we
quickly have to go and connect them using the product key. Okay. So that's all
we need for our query. So let's go and select. So we have here the
categories and the sales amount. So now the first thing, we have to calculate
the total sales for each category. So let's go and do that.
It is very simple. So sum, total sales, and we are grouping the data by the
category. So this is basics, right? Now we have the total sales for each of
those categories. Now in order to calculate the percentage we need two
measures: the total sales for each category, and we have it here already, and
as well, side by side, we need the total sales across all categories. So the
big number, without any dimension. But now as you look at the result, you can
see the granularity here is the category. Now we need the total sales again at
a different granularity. And in order to mix those things together we use the
window functions.
So now how are we going to do it? Either you go over here and start writing
your window function, and of course you can do it together with the group by,
or you can do it as a second step in your query using either a CTE or a
subquery. I'm going to go with the CTE just to make it clear. So category
sales, like this. So now let's start again selecting the same information. So
category, total sales, from our CTE, category sales. So let's go and execute
it. So now we have the same results, and now we're going to go and build our
window function like this. We're going to say the sum. We want to aggregate
all those values, right, to get the total sales over the whole data set. So
we're going to say sum total sales. And now in order to get the big number
we're going to say over, and inside it we will not define anything because we
don't want to partition the data. We don't want to introduce any dimension. We
just want the big number.
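The CTE plus the empty-window sum described above can be sketched as follows. The table, column, and CTE names (fact_sales, dim_products, category, sales_amount, category_sales, overall_sales) are assumptions, the three categories carry made-up amounts, and Python's sqlite3 module is used only to make the SQL runnable.

```python
import sqlite3

# Hypothetical sample data; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_products (product_key INTEGER, category TEXT);
CREATE TABLE fact_sales (product_key INTEGER, sales_amount INTEGER);
INSERT INTO dim_products VALUES (1, 'Bikes'), (2, 'Accessories'), (3, 'Clothing');
INSERT INTO fact_sales VALUES (1, 200), (1, 100), (2, 200), (3, 100);
""")

overall_sql = """
WITH category_sales AS (
    SELECT p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales AS f
    JOIN dim_products AS p
        ON p.product_key = f.product_key
    GROUP BY p.category
)
SELECT
    category,
    total_sales,
    -- Empty OVER (): one window covering every row, so every row gets the grand total.
    SUM(total_sales) OVER () AS overall_sales
FROM category_sales
ORDER BY category
"""
overall_rows = conn.execute(overall_sql).fetchall()
for row in overall_rows:
    print(row)
```

The grouped total and the grand total now sit on the same row even though they have different granularities, which is what makes the percentage calculation possible.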
And with that we will get the overall sales. So let's go and execute it. Now
as you can see, this is the total sales by the category, so the total sales
split by the categories. And this is the overall sales of all orders, of
everything, the highest number. Now since we have them side by side, what we
can do, we can very easily calculate the part-to-whole, or the percentage. So
let's start doing that. We need the total sales, and we want to go and divide
it by the overall sales. So we're going to take our window function and put it
over here. So let's go and multiply it now by 100. I'm going to go and call
it percentage of total. So let's go and execute it.
Now as you can see we are getting zeros, and that's because the total sales is
not a float, so the division is an integer division. So what we have to do is
to go and cast it to something like a decimal. So float, like this. So let's
go and re-execute it. And now, as you can see, we are getting the percentages,
but we have a lot of digits after the decimal point. So we're going to go and
round the numbers now. So let's go to the start, round, and then go to the end,
comma, and let's have like two decimals. So let's go and execute it again. Now
it looks perfect. Now what we can do, we can go and add a percentage sign, and
with that we are converting the whole thing to a string. So we're going to do
concatenation. So concat at the start, and go to the end, and let's add the
percentage character.
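The zero result, the cast, the rounding, and the added percent sign can be reproduced in a small sketch. The names (category_sales, total_sales, pct_int, percentage_of_total) and the numbers are assumptions, and the dialect is SQLite run from Python: CAST to REAL stands in for the float cast, and the || operator stands in for the CONCAT function mentioned in the video.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

pct_sql = """
WITH category_sales(category, total_sales) AS (
    VALUES ('Bikes', 300), ('Accessories', 200), ('Clothing', 100)
)
SELECT
    category,
    total_sales,
    -- Integer divided by integer truncates to 0 before the multiplication.
    total_sales / SUM(total_sales) OVER () * 100 AS pct_int,
    -- Cast first, then divide, round to two decimals, and append the sign.
    ROUND(CAST(total_sales AS REAL) / SUM(total_sales) OVER () * 100, 2) || '%'
        AS percentage_of_total
FROM category_sales
ORDER BY total_sales DESC  -- sorted so the output order is deterministic
"""
pct_rows = conn.execute(pct_sql).fetchall()
for row in pct_rows:
    print(row)
```

The pct_int column shows the pitfall: every share is below 1, so the integer division yields 0 for each row. Casting one operand to a floating-point type before dividing avoids that.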
And as well we can go and order the data by the total sales descending. So
let's go and execute it. So now by looking at the result you can see the
category bikes is dominating. It is overwhelmingly the top-performing
category. It is making 69% of the total sales of our business. So this means,
my friends, most of the business revenue comes from the bikes. And as you can
see, the accessories and clothing are really minor contributors to our
business, which is not really good, and this is actually a dangerous thing. If
you have one category dominating your whole business, you are over-relying on
only one category, and if this category fails then the whole business is going
to fail. So by looking at this, either the business has to decide to remove all
the products in those two categories, or to focus on bringing more revenue
from the products that are inside those two categories.
that are inside those two categories. So as you can see guys those insights are
as you can see guys those insights are really amazing for the business and
really amazing for the business and helps the managers and the decision
helps the managers and the decision makers to understand what is going on
makers to understand what is going on quickly and make very critical
quickly and make very critical decisions. And now you can see as well
decisions. And now you can see as well from the results perfectly why the part
from the results perfectly why the part to whole analyszis is very important
to whole analyszis is very important because by just looking to those numbers
because by just looking to those numbers it's going to be really hard to
it's going to be really hard to understand the importance of the
understand the importance of the categories. But seeing the data as a
categories. But seeing the data as a percentage how each category is
percentage how each category is contributing to the whole sales of the
contributing to the whole sales of the business makes it easier to understand
business makes it easier to understand which category is underperforming or top
which category is underperforming or top performing. And now you have a very
performing. And now you have a very simple formula where you can go and
simple formula where you can go and change the metrics. For example, instead
change the metrics. For example, instead of total sales, you can go and change
of total sales, you can go and change the aggregations to total number of
the aggregations to total number of orders or the total number of customers.
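A rough sketch of that swappable formula, not the exact query from the video. It runs the part-to-whole calculation through Python's built-in sqlite3 module so that it is self-contained. The `sales` table, its columns, its four rows, and the helper `part_to_whole` are all made up for illustration, and SQLite's `||` operator stands in for the CONCAT used in the course.

```python
import sqlite3

# Hypothetical toy data; the real course tables are larger and named differently.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (order_number TEXT, customer_key INTEGER,
                        category TEXT, sales_amount INTEGER);
    INSERT INTO sales VALUES
        ('SO1', 1, 'Bikes',       400),
        ('SO2', 2, 'Bikes',       290),
        ('SO3', 1, 'Accessories', 200),
        ('SO3', 1, 'Clothing',    110);
""")

def part_to_whole(metric_sql):
    """Return (category, metric, share of total) for any aggregate expression."""
    query = f"""
        WITH category_metric AS (
            SELECT category, {metric_sql} AS metric
            FROM sales
            GROUP BY category
        )
        SELECT category,
               metric,
               -- part / whole * 100, then append the percent sign
               ROUND(metric * 100.0 / SUM(metric) OVER (), 2) || '%' AS pct_of_total
        FROM category_metric
        ORDER BY metric DESC, category
    """
    return conn.execute(query).fetchall()

# Same formula, two different measures.
by_sales = part_to_whole("SUM(sales_amount)")
by_orders = part_to_whole("COUNT(DISTINCT order_number)")
print(by_sales)
print(by_orders)
```

Only the aggregate expression changes between the two calls; the division by the windowed total and the ordering stay the same.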
So you can go and take any type of
measure and bring it into this analysis,
and you're going to generate a completely
new view for the decision makers in
order to develop a new strategy for the
business. It was very interesting. Now
in the next step, we're going to do my
favorite topic, where we're going to
start doing data segmentation using
SQL. So let's go.
Okay. So now what is data segmentation?
What we're going to do here is we're
going to go and group up the data based
on a specific range. So that means we're
going to go and create new categories
and then go and aggregate the data based
on the new category. And the formula for
that is going to be very interesting.
This time we're going
to have a measure by a measure, not by a
dimension. So you have to go and pick
two different measures and convert one
of those measures to a range or to a
group and then aggregate the data by
this converted measure. So for example, we're
going to go and calculate the total
number of products by the sales range or
the total number of customers by the age
group. So as you can see we have two
measures and we are trying to combine
them together in order to create new
insights. Let's have the following
example. So here for example we have
two measures, and now the first step
is that we're going to take one of those
measures and convert it to a dimension,
convert it to a category. For example,
we're going to say if the values are
equal to or below 100, they will be
converted to a category called low. And
between 100 and 200, it's going to be
assigned to a new category called
medium. And everything above 200, it's
going to be large. So, as you can see
what we are doing, we are taking one
measure and based on the range of this
measure, we are building new
categories, a new dimension. And now the
final step is the easiest one. We're
going to go and aggregate another
measure based on the new category. So
we're going to have seven for low, six
for medium, and 15 for large. So with
that, as you can see, we are creating
new categories or segments based on a
measure. And then we are aggregating
another measure based on these new
segments. And in SQL, in order to create
those new categories and segments, we
use the amazing CASE WHEN statement
because it's going to help us to define
the rules, and based on the range, it's
going to go and create the new categories
and labels. So now let's go and apply
this formula on our data set in order to
segment our data. So let's go.
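The two steps can be sketched on toy data as follows. This is a minimal illustration, not the visual from the video: the `facts` table, the column names `measure_a` and `measure_b`, and the five rows are invented, and the rows were chosen so that the totals come out as the seven, six, and 15 mentioned above. The SQL runs through Python's sqlite3 module only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table with two measures per row.
    CREATE TABLE facts (measure_a INTEGER, measure_b INTEGER);
    INSERT INTO facts VALUES (50, 3), (100, 4), (150, 6), (250, 5), (300, 10);
""")

query = """
    WITH labeled AS (
        SELECT
            -- Step 1: turn measure_a into a dimension by range
            CASE
                WHEN measure_a <= 100 THEN 'Low'
                WHEN measure_a <= 200 THEN 'Medium'
                ELSE 'Large'
            END AS segment,
            measure_a,
            measure_b
        FROM facts
    )
    -- Step 2: aggregate the other measure by the new dimension
    SELECT segment, SUM(measure_b) AS total_b
    FROM labeled
    GROUP BY segment
    ORDER BY MIN(measure_a)
"""
segments = conn.execute(query).fetchall()
print(segments)
```

Because CASE checks its branches from top to bottom, the second branch only needs the upper bound: any row that reaches it is already above 100.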
Okay. So now let's go and segment our data, and
all we need is two measures. So now
we have the following task, and it says:
segment products into cost ranges and
count how many products fall into each
segment. So now by looking at this task
we have two measures. First the cost,
and the second one is the total
number of products. And of course we
have to go and segment one of those two
measures. And in this task we are
segmenting the cost. So we have to
focus now on taking this measure and
converting it to a dimension. So now all
that information is available in the
products table. So now let's go and
select a few columns. We're going to get
the product key, and let's get the
product name and the cost. That's all
we need. So let's execute it. Now
as you can see, this is our measure, the
cost. Now we have to go and convert
this measure to a dimension. And in order
to do that, we use the CASE WHEN
statement. We always use the CASE WHEN
statement in order to create new
categories. So let's go and do that.
CASE WHEN. Let's start with the first
range. Let's say it is below 100. So all
the costs that are below 100, we're
going to label with a new value. It's
going to be "below 100". So now let's go
to the next range. We are saying: when
the cost is between 100 and 500. So all
costs in this range will get
the label 100 to 500. So this is very
simple. Let's go and get another range.
For example, between 500 and 1,000. Then
it's going to get the label between 500
and 1,000. And now it depends on how many
categories and segments you want to
create. Each row of this CASE WHEN, each
condition, will create a new
value for your dimension. So I'm going
to stop with that. I'm going to say at
the end ELSE. So if the cost is not
fulfilling any of those, it's going to
be above 1,000. Right? So that's it.
Let's give it a name. It's going to be
cost range. So now let's go and execute
it. Now let's go and check the result.
For example, the cost here is zero. It
is below 100, which is correct. This
value is above 1,000. This is between
500 and 1,000. And this is between 100
and 500. So everything looks correct.
Nice. So with that we are done with the
first step, where we have converted one
measure into a dimension. So with that
we now have our segments. In the next step
we're going to go and
aggregate the data based on this new
dimension. So either you do it in one go
or, what I usually do, I put everything in
one CTE or a subquery, and I'm going to
call it product
segments, AS. Based on these results I'm
going to go and aggregate the data. So
this is my temporary result, and now
we're going to go and just aggregate the
data like this. So let's first get our
dimension, cost range, and then we need
our measure. So it's going to be COUNT
product
key AS total products FROM our CTE. It
was product segments. And then GROUP
BY our new dimension. That's it. It's
very simple. Let's go and execute it
now. Now you can see in the output we
have our segmented measure, and we can
see the total number in each
segment and range. And of course we can
go and order the data by our aggregation,
the total products. Let's go and execute
it, maybe descending. So now as you can
see, we have a lot of products that are
not costing a lot. They are below 100.
After that comes between 100 and 500, and the
lowest number of products is in the
range that is above 1,000. So we don't
have a lot of products that cost
a lot, and that's because maybe we have a
lot of accessories in the business.
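Put together, the product task might look like the sketch below. It is an approximation under stated assumptions: the table name `products`, the ten sample rows, the snake_case column names, and the exact label strings are hypothetical stand-ins for the course data, and the query is executed with Python's sqlite3 module so it can be run as is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample rows, not the course data set.
    CREATE TABLE products (product_key INTEGER, product_name TEXT, cost INTEGER);
    INSERT INTO products VALUES
        (1, 'Bottle', 0), (2, 'Cap', 12), (3, 'Socks', 45), (4, 'Gloves', 80),
        (5, 'Helmet', 250), (6, 'Wheel', 400), (7, 'Frame', 500),
        (8, 'City Bike', 800), (9, 'Touring Bike', 900),
        (10, 'Road Bike', 1500);
""")

query = """
    WITH product_segments AS (
        -- Step 1: convert the cost measure into a cost_range dimension
        SELECT
            product_key,
            product_name,
            cost,
            CASE
                WHEN cost < 100 THEN 'Below 100'
                WHEN cost BETWEEN 100 AND 500 THEN '100-500'
                WHEN cost BETWEEN 500 AND 1000 THEN '500-1000'
                ELSE 'Above 1000'
            END AS cost_range
        FROM products
    )
    -- Step 2: count products per segment
    SELECT cost_range, COUNT(product_key) AS total_products
    FROM product_segments
    GROUP BY cost_range
    ORDER BY total_products DESC
"""
cost_ranges = conn.execute(query).fetchall()
print(cost_ranges)
```

BETWEEN is inclusive on both ends, so a cost of exactly 500 matches two branches; CASE returns the first match, which puts it in the 100-500 segment. A missing (NULL) cost would satisfy none of the conditions and land in the ELSE branch.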
So my friends, this is very powerful. If
the dimensions in your data set are not
enough to create insights, you can take
one of your measures, convert it to a
dimension using CASE WHEN, and then
aggregate your other measures based on
this new dimension. So we are deriving
new information, and as I told you, by
just following this concept of measures and
dimensions you can generate an endless
amount of reports, even if your business
or your data set is small.
Okay my friends, so now let's go and segment
something else. This time it's going
to be a little bit more complicated. So
we have the following task, and it says:
group customers into three segments
based on their spending behavior. So we
have the VIP customers. They are the
customers with at least 12 months of
history and spending more than 5,000.
As the second category we have the
regular customers. They have at least
12 months of history as well, but they spend
5,000 or less. And as the last
category we have the new customers.
Their lifespan is less than 12 months.
And we have to find the total number of
customers in each group. So now here we
have a lot of measures and stuff. The
first one is the total number of
customers. This is going to be the final
aggregation that we're going to do. But
what is interesting is that we're going to
build the segments, and this time they are
based on different columns. So first it
is based on a measure, the total number
of months for each customer, and as well
the total spending, the total
sales. So we have the sales, we have the
total number of months, and as well the
total number of customers. So now we're
going to do it step by step. Don't you
worry about it. So now what I usually
do is start collecting all the data that
I need. So what do we need? We need the
customer key in order to do the
aggregation for the total number of
customers. We need as well the sales
amount, right, for the spending. And now
in order to calculate the number of
months, we need a date. For that, we
have to calculate the lifespan of a
customer, and usually we derive it from
the order date. I'm going to show you
how we're going to do it. So we need the
order date. And of course, we have to
select our table. So let's start with
the fact table, so fact sales, and we're
going to join it with the
customers, so our customers dimension,
and the key for that is the customer
key, as well for the customers. And here
we have to specify which column comes
from which table. So the first one from
the customers, the sales from the fact,
and the order date from the fact as
well. So now let's go and execute. Now
we can see we have our customers, the
sales, and the order dates. So now the
sales is going to help us to
specify the range of spending. But now,
what is interesting, we have to calculate
the lifespan. So in order to get the
lifespan we have to find out the first
order and the last order of each
customer, so how many months are between
the first order and the last order. So
in order to do that we need the MIN
function for the order date. So this is
the first order, and the MAX in order to
get the last order. Right.
And since we are using MIN and MAX, we
have to go and group up the data. And we
need to do that anyway in order to get
the total spending. So for the sales
amount, we're going to have the SUM in
order to have the total
spending. And we don't need the order
date on its own. And the dimension by which we're going
to group up the data is the customer
key. So let's go and execute it. So now
in the results we have a list of all our
customers and as well the total spending
for each customer, and we have the first
order date and the last order date. Now
in order to calculate how many months are
between the first order and the last
order, we can go and use the function
DATEDIFF in order to get a new measure.
So let's go and do that: DATEDIFF. And
now since we need the number of months,
we're going to use month, and then
the second argument is going to be the
first order, so MIN order date, and the
third one is going to be the latest, so
MAX order date, and we're going to call
this lifespan. So let's go and query and
let's have a look at our results. You
can see for this customer, 712, between
the first order and the last order we
have 11 months, and for this customer
over here we have zero because the first
order and the last order are in the same
month, and maybe there is only one order.
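A minimal sketch of this intermediate result, under a few assumptions. The table `fact_sales`, its snake_case columns, and the four sample orders are hypothetical, and the join to the customers dimension is left out because the customer key is already in the fact rows here. SQLite has no DATEDIFF, so the month difference is spelled out with `strftime` as (year difference) * 12 + (month difference), which counts the month boundaries crossed in the same way as DATEDIFF with the month date part in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical orders: customer 1 spans Jan to Dec, customer 2 stays in one month.
    CREATE TABLE fact_sales (customer_key INTEGER, order_date TEXT, sales_amount INTEGER);
    INSERT INTO fact_sales VALUES
        (1, '2013-01-15', 100), (1, '2013-12-03', 250),
        (2, '2014-03-05', 40),  (2, '2014-03-28', 60);
""")

query = """
    SELECT
        customer_key,
        SUM(sales_amount) AS total_spending,
        MIN(order_date)   AS first_order,
        MAX(order_date)   AS last_order,
        -- SQLite stand-in for DATEDIFF(month, MIN(order_date), MAX(order_date))
        (CAST(strftime('%Y', MAX(order_date)) AS INTEGER)
           - CAST(strftime('%Y', MIN(order_date)) AS INTEGER)) * 12
        + (CAST(strftime('%m', MAX(order_date)) AS INTEGER)
           - CAST(strftime('%m', MIN(order_date)) AS INTEGER)) AS lifespan
    FROM fact_sales
    GROUP BY customer_key
    ORDER BY customer_key
"""
lifespans = conn.execute(query).fetchall()
for row in lifespans:
    print(row)
```

The first customer gets a lifespan of 11 and the second a lifespan of 0, mirroring the two cases described above.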
So with that we have the lifespan, and as
you can see guys, we have derived a new
measure from the dimension order date in
order to later derive from this new
measure a new dimension, the segments. So
we are converting a dimension to a
measure and then a measure to a new
dimension, and this is usually what we do
in analysis and in SQL. So now, do we
have all the information for the logic?
We have the lifespan, so we have the
total number of months, we have the
total spending, and I think we are ready
to start building our segments. So now
what we're going to do is
create the segments based on the
results that we have prepared. So this
result is the intermediate result before
the final one. Now either you
put it in a CTE or a subquery. Well, I
usually go and use the CTE. It is nicer.
So WITH customer
spending, and I'm going to put the whole
thing in a CTE, and we can start writing a
new query from scratch based on the
intermediate results. So let's go and select
the customer key again. I'm going to get
the total spending and the lifespan. We
don't actually need the first and the
last order, and we're going to get all
that information from our new CTE. So
let's go and execute. And now let's
start building the segments. As
usual, we're going to go and use the
CASE WHEN statement. It is just an amazing
statement for deriving and building
new columns. So now what do we have for
the first category? They are the
customers with over 12 months and spending
more than 5,000. So now we're going to
say: if the lifespan is higher than 12
and the total spending is higher than
5,000, then we have our VIP customers. So
this is the first label. Let's go to the
second one. If the lifespan is as well, I
think, more than 12. So let's go and
check. Well, it is at least 12. I have
a mistake here. So it's going to be larger
or equal. So now it is more correct. So
the customers that have at least 12
months but spend 5,000 or
less. That means the lifespan condition
stays the same, but the total
spending will be less than or equal to 5,000,
and they are the regular customers. So
they will get this label. Now if it is
not fulfilling those two conditions, what
does this mean? It means this is a new
customer, right? So they will get this
label. Let's go and add an END and
let's call it customer segments. So
let's go and execute it. Now let's have
a look at this customer, 712. The
total spending is less than 5,000, so
this customer is not a VIP, and as well
the lifespan is less than 12. So that
means for us it is a new customer. Now
the next one, we have a VIP. So this
customer has a history of at least 12
months, we have here 16 months, and as
well the total spending is more than 5,000.
That's why this customer is a VIP.
But now let's go and search for a regular
customer,
2349. This customer spent less than
5,000, so we are fulfilling this
condition over here, and as well this
customer has at least 12 months of
history. That's why we have a regular. So
now as you can see, we have derived a new
dimension from two measures, the lifespan
and the total spending. Now of course,
what is the last step going to be? We
have to go and find the total number of
customers for each of those categories.
So now what we're going to do is
remove all of that, and
we're going to start with our new
dimension, and then comes the aggregation,
COUNT customer key, AS total
customers, and then we have to group up
the data by our new dimension. But it's
going to be really annoying if I
take this CASE WHEN here and put it in the
GROUP BY, because this means each time I'm
changing the logic I have to take care
of it twice: once in the SELECT
statement and a second time in the
GROUP BY. So actually, instead of that,
I changed my mind.
I'm still going to have the
aggregation in a second step. So we
need the customer key, we have the
definition of our customer segments, and
now I'm going to go and use a subquery
where I put the aggregation as a second
step. So my friends, that means this is
again a second intermediate result. You
can of course put it in a second CTE.
can of course put it in a second city. So that means this is the first
So that means this is the first intermediate results where we have
intermediate results where we have created the lifespan and the total
created the lifespan and the total spending and the second intermediate
spending and the second intermediate result is creating the customer segments
result is creating the customer segments and the third step and the last one is
and the third step and the last one is by doing the final aggregation. So we're
by doing the final aggregation. So we're going to do it like this. Select our
going to do it like this. Select our dimension customer segments. Then we're
dimension customer segments. Then we're going to go and count the customer key
going to go and count the customer key from our sub query. So this is our
from our sub query. So this is our subquery and don't forget to group by
subquery and don't forget to group by our dimension customer segments. I think
our dimension customer segments. I think I have it wrong. All right. So this is
I have it wrong. All right. So this is the subquery and this is the final step
the subquery and this is the final step where we are aggregating everything. I'm
where we are aggregating everything. I'm going to go and order the data by the
going to go and order the data by the total customers like this. So now let's
total customers like this. So now let's go and execute the whole thing. Well
go and execute the whole thing. Well descending not ascending. Okay. Okay. So
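
The three steps described above could look roughly like the sketch below. It assumes SQL Server syntax (the transcript uses DATEDIFF and GETDATE elsewhere). The inline sample rows and the names fact_sample, customer_spending, and segmented are hypothetical stand-ins, not the course's actual tables. The 12-month and 5,000 thresholds are the ones mentioned in this transcript.

```sql
-- Hypothetical sample rows standing in for the sales fact table.
WITH fact_sample AS (
    SELECT *
    FROM (VALUES
        (1, CAST('2012-01-10' AS date), 3000),
        (1, CAST('2013-06-15' AS date), 2500),
        (2, CAST('2012-03-01' AS date), 1000),
        (2, CAST('2013-03-20' AS date),  500),
        (3, CAST('2013-05-05' AS date),  200),
        (4, CAST('2013-07-01' AS date),  800),
        (4, CAST('2013-09-01' AS date),  100)
    ) AS v(customer_key, order_date, sales_amount)
),
-- Step 1: lifespan (in months) and total spending per customer.
customer_spending AS (
    SELECT
        customer_key,
        SUM(sales_amount) AS total_spending,
        DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
    FROM fact_sample
    GROUP BY customer_key
)
-- Step 3: count customers per segment, largest segment first.
SELECT
    customer_segment,
    COUNT(customer_key) AS total_customers
FROM (
    -- Step 2: assign each customer to a segment.
    SELECT
        customer_key,
        CASE
            WHEN lifespan >= 12 AND total_spending > 5000 THEN 'VIP'
            WHEN lifespan >= 12 AND total_spending <= 5000 THEN 'Regular'
            ELSE 'New'
        END AS customer_segment
    FROM customer_spending
) AS segmented
GROUP BY customer_segment
ORDER BY total_customers DESC;
```

With these sample rows, the New segment comes first with two customers, and VIP and Regular have one customer each.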
So now we can see from our results that the highest number of our customers
belongs to the category New. So we have 14,000 customers that are new in
our business. And then, as the second category, we have the regular
customers, so we have around 2,000 customers. And in VIP, we have a lot of
VIP customers: we have 1,655 VIP customers in our business. So with that,
my friends, we have done data segmentation. It is amazing. We have
segmented our customers based on their spending behavior, and as you can
see, all that information is totally derived from our data. This helps us
to have a deep understanding of the behavior of our customers, and of
course it can help as well with making smart decisions. All right, my
friends, so with that we have covered the five different types of data
analytics that we can do using SQL. Now, what I usually do as the last step
in my projects is that I try to collect all the different types of
explorations and analyses that I have done on my datasets, so that I can
put everything in one, for example, view or table and then offer it to
other users. That is going to help the other users or stakeholders to make
a quick analysis for decision-making. So now what we're going to do: we're
going to have some kind of requirements where we bring a lot of different
analyses into one big script in order to have insights about one object,
like for example the customers. So I'm going to show you the requirements
of this report, and we're going to analyze them and start writing the
script. So let's go.
Okay, friends. So now let's create a customer report, and here are the
requirements for the report. So now we have a general statement. It says
this report should consolidate key customer metrics and behaviors. So it
says first we have to gather all the details about the customers, like
names, age, and transaction details. Then we have to segment the customers
into the categories VIP, Regular, and New, and as well by age groups. We
have to provide as well aggregations like the total orders, total sales,
quantity, products, and so on. And we have to generate important KPIs like
the recency, the average order value, and the average monthly spend. So we
have a lot of things, and we're going to do it step by step.
All right. Now I'm going to take you step by step through the process of
building a complex query that I usually use in order to build a report. Now
the first thing that I usually do is I start selecting the data from the
database, and I usually start with the fact table. So this is my starting
point, and then usually I join it with the dimensions, and here I use a
LEFT JOIN. After that I think about how to filter the data, because usually
we don't need all the data that is available in the database. And of
course, in the result I will not be selecting all the columns; I'm going to
be selecting only the relevant columns that I need for my report. So since
we have a complex query, we will be dividing the process into multiple
steps, and I usually call this step the base data. This is going to be the
foundation, the scope for the next steps. And since we have multiple steps,
I'm going to put this in a CTE so we have it as an intermediate result.
What we're going to do in this step as well is a few transformations, like
maybe calculating and deriving new columns, maybe formatting the date, so
some basic transformations. So now let's go and build this result for our
report. So the first step is retrieving the core columns from the tables.
So let's go and do it together. So we need of course our fact table, the
facts, and we need our dimension, the gold customers, and as usual we're
going to go and connect them. All right. Okay. So this is the basics, and
now what we're going to do is retrieve all the columns that we need for our
report. So let's start picking stuff. So the order number, let's get the
product key, the order date, sales amount, quantity, and I think that's all
from the facts. Let's go and get some information from the customers. So
let's get the customer key, the customer number, the first name and as well
the last name. And what else? We can go and get the birth date, because we
have to create the age groups. So birth date. Let's go and query. So I
think those are all the columns that we need in order to do the next steps.
And now, before we go and proceed with the aggregations, what we're going
to do is think about filtering the data. As I recall, we have some orders
where the order date is null, so I'm going to go and remove those. So:
order date IS NOT NULL.
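
As a rough sketch of this base step, assuming SQL Server syntax: start from the fact rows, LEFT JOIN the customer dimension, pick only the needed columns, and drop orders without a date. The two inline CTEs named fact_sales and dim_customers are hypothetical stand-ins with made-up rows, and the column names are assumed from the spoken description, not taken from the course's schema.

```sql
-- Hypothetical stand-in for the sales fact table.
WITH fact_sales AS (
    SELECT *
    FROM (VALUES
        ('SO1001', 10, 1, CAST('2013-01-15' AS date), 250, 1),
        ('SO1002', 11, 2, CAST('2013-02-03' AS date),  40, 2),
        ('SO1003', 12, 1, CAST(NULL AS date),          90, 1)
    ) AS v(order_number, product_key, customer_key,
           order_date, sales_amount, quantity)
),
-- Hypothetical stand-in for the customer dimension.
dim_customers AS (
    SELECT *
    FROM (VALUES
        (1, 'C0001', 'Ana', 'Lopez',  CAST('1971-10-06' AS date)),
        (2, 'C0002', 'Ben', 'Okafor', CAST('1976-05-10' AS date))
    ) AS v(customer_key, customer_number, first_name, last_name, birthdate)
)
SELECT
    f.order_number,
    f.product_key,
    f.order_date,
    f.sales_amount,
    f.quantity,
    c.customer_key,
    c.customer_number,
    c.first_name,
    c.last_name,
    c.birthdate
FROM fact_sales AS f
LEFT JOIN dim_customers AS c
    ON c.customer_key = f.customer_key
WHERE f.order_date IS NOT NULL;  -- defines the scope: only dated orders
```

With these sample rows, order SO1003 is filtered out because its order date is null, so two rows come back.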
So that means in the first query, the base query, not only am I selecting
the columns that I need for the report, I'm also defining the scope of the
dataset by filtering the data. So you can as well make the scope here only
one year or something. Now what else we can do is to think about all those
columns and whether we can do any type of transformation in order to
prepare them for the aggregations. Like for example, I'm going to go and
say, you know what, instead of first and last name, I'm going to put them
together in one, so it's going to be the customer name. It's better than
having two columns. So let's go and do it. We're going to say CONCAT, and
then we start with the first name, and we have a separator between them,
you can have like a minus or a white space, like this, and after that the
last name. So let's call it customer name, and we can go and get rid of
those two columns. So let's go and execute. And with that, you have
everything in one column. Now, another thing that we can prepare: we don't
need the birth date. We actually need the age groups for our report, so
that means we have to go and calculate the age. So let's go and transform
it. So DATEDIFF, we want it in years, the birth date, and the current date
from the system, and we're going to call it age. So let's execute again.
Perfect. So with that we have all the data that we need for our report.
Let's go and put everything in one CTE. So I'm going to call it base query,
WITH base_query AS, and put everything in this CTE. And I'm going to go and
put this comment over here inside the CTE. Perfect. And now we're going to
go and write a query from scratch based on our intermediate result. So FROM
base_query. Let's execute.
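
The two transformations and the CTE wrapper might look like the sketch below, assuming SQL Server syntax. The CTE name base_query is inferred from the garbled caption, and joined_sample with its rows is a hypothetical stand-in for the joined fact and customer data.

```sql
-- Hypothetical stand-in for the joined fact/customer rows.
WITH joined_sample AS (
    SELECT *
    FROM (VALUES
        ('SO1001', 'Ana', 'Lopez',  CAST('1971-10-06' AS date)),
        ('SO1002', 'Ben', 'Okafor', CAST('1976-05-10' AS date))
    ) AS v(order_number, first_name, last_name, birthdate)
),
base_query AS (
    /* 1) Base query: retrieve core columns and prepare them */
    SELECT
        order_number,
        -- Merge the two name columns, separated by a space.
        CONCAT(first_name, ' ', last_name) AS customer_name,
        -- Age in years relative to the current system date.
        DATEDIFF(year, birthdate, GETDATE()) AS age
    FROM joined_sample
)
SELECT *
FROM base_query;
```

One thing to keep in mind: DATEDIFF with the year datepart counts the year boundaries crossed, so the result can be one year higher than the real age for a customer whose birthday has not yet occurred this year.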
All right. So now, by looking at our report, with that we have the
important columns, right? So now in the next step we're going to do
aggregations on top of this intermediate result. So here we're going to do
all the aggregations that are needed for the report, and we're going to put
everything again in a CTE as an intermediate result, which makes everything
modular and easy to read. So now let's go and do the necessary aggregations
on the result that we have previously prepared. That's why this is very
important as a second step in our report: always tend to make a separate
CTE only for aggregations. So let's go and do that. I'm going to go and
select again all the customer information, like the customer key, number,
age. So I'm just going to copy and paste and put it over here, and we just
need the column names. So the key, number, name, and age.
Now after that, we're going to start doing aggregations. So what we want to
aggregate first is, for example, the total number of orders. So we're going
to go and COUNT DISTINCT order number as total orders. So this is one
aggregation. We can go and sum up all those sales amounts as total sales,
and the quantities as well, so SUM quantity as total quantity. And as well
we can go and count how many products our customer ordered, so the product
key as total products. So what I'm doing now is just looking at our
intermediate result and trying to figure out what we can aggregate. For
example, it makes no sense to aggregate the ages, right? So from the order
number we have total orders, then total products, sales amount, quantity,
and from the right side we cannot aggregate anything, and that's because
those are the details of the customers. But from the fact table we can do a
lot of aggregations. So now what can we do with the order date over here?
We can for example find the last order date of our customer, which is
really nice information. So we can say MAX order date as last order. And of
course we can go and calculate the lifespan, and we're going to need it, as
you remember, in order to categorize our customers. So I will just copy and
paste it from the previous query: it is the DATEDIFF in months between the
first order of the customer and the last order of the customer, and we call
this lifespan. Okay. So we derived two measures or aggregations from the
order date. Now I think we have done everything possible, and what is
missing of course is a GROUP BY, because we are doing aggregations and we
are grouping by the customer details. So it's going to be customer key,
customer number, name, and age. So I think we have everything for our
aggregations.
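
A sketch of this aggregation step, assuming SQL Server syntax. Here base_query is filled with hypothetical inline rows so the example runs on its own. The alias last_order_date and the DISTINCT inside the product count are assumptions, since the transcript only says to count the product key.

```sql
-- Hypothetical rows standing in for the prepared base query.
WITH base_query AS (
    SELECT *
    FROM (VALUES
        ('SO1001', 10, CAST('2012-01-15' AS date), 250, 1, 1, 'C0001', 'Ana Lopez',  52),
        ('SO1001', 11, CAST('2012-01-15' AS date),  40, 2, 1, 'C0001', 'Ana Lopez',  52),
        ('SO1002', 10, CAST('2013-03-02' AS date), 250, 1, 1, 'C0001', 'Ana Lopez',  52),
        ('SO1003', 12, CAST('2013-05-20' AS date),  90, 3, 2, 'C0002', 'Ben Okafor', 47)
    ) AS v(order_number, product_key, order_date, sales_amount, quantity,
           customer_key, customer_number, customer_name, age)
),
customer_aggregation AS (
    /* 2) Aggregations: summarize key metrics at the customer level */
    SELECT
        customer_key,
        customer_number,
        customer_name,
        age,
        COUNT(DISTINCT order_number) AS total_orders,
        SUM(sales_amount)            AS total_sales,
        SUM(quantity)                AS total_quantity,
        COUNT(DISTINCT product_key)  AS total_products,  -- DISTINCT is assumed
        MAX(order_date)              AS last_order_date,
        DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
    FROM base_query
    GROUP BY customer_key, customer_number, customer_name, age
)
SELECT *
FROM customer_aggregation;
```

With these sample rows, customer 1 ends up with 2 orders, total sales of 540, and a lifespan of 14 months, while customer 2 has a single order and a lifespan of 0.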
Let's go and execute it. We get a list of all customers, and we have a few
details about the customers, and now we have a lot of measures: the total
orders, total sales, total quantity, products, the last order, and the
lifespan. And with that we have covered this part over here, where we have
provided aggregations on the customer level. So we have the details and we
have the aggregations. All right. So with that we now have all the
preparations that are required to build the final result. So it really
depends on the scenario. If it's possible, we can take all the data from
one CTE, or if it's needed, we can get it from multiple CTEs. But in our
scenario, we're going to take it from the second CTE, the aggregations, and
we're going to prepare the final result. So here we're going to bring
everything together, and we might introduce final transformations that are
needed for the report. So let's go and write the query for the final
result. Now we can go and start segmenting our customers and as well
creating the KPIs. So let's go to the third step. I'm going to go and put
this in a CTE. So let's call it customer aggregation.
And now, based on this result, we will write the final query. So I always
like to put a comment about the steps. So the first CTE is the base query,
where we just joined the data and prepared it. Then the second query is for
the aggregations, and the final one is for the final result. So let's go
and start writing our final query. We will start with SELECT, and I'm going
to go and list again all the customer information. So I'm going to go and
get the same things again: we have the customer key, customer number, name,
age, and so on. And now after that we need to create the age categories.
And after that I'm going to go and get all those measures as well from our
query, but of course without the calculations, I just need the names of
them. So with that we have everything from our previous CTE, the customer
aggregation. Okay. So let's just test it. Now everything is working.
So now what do we have to do? We have to create a few categories: the age
category and as well the segments of the customers, right? For segmenting
the customers we have already done the query, so I will just copy and paste
it from the previous analysis. It looks like this: if the lifespan is at
least 12 months and the sales are above 5,000, then VIP; if it is at least
12 months but the sales are less than or equal to 5,000, then Regular;
otherwise it is a New customer.
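
The segment rule just described can be sketched as a CASE expression, assuming SQL Server syntax. The inline rows for customer_aggregation are hypothetical and exist only to exercise each branch.

```sql
-- Hypothetical aggregated rows, one per customer.
WITH customer_aggregation AS (
    SELECT *
    FROM (VALUES
        (1, 14, 6200),
        (2, 18, 1900),
        (3,  3, 7000)
    ) AS v(customer_key, lifespan, total_sales)
)
SELECT
    customer_key,
    lifespan,
    total_sales,
    CASE
        WHEN lifespan >= 12 AND total_sales > 5000 THEN 'VIP'
        WHEN lifespan >= 12 AND total_sales <= 5000 THEN 'Regular'
        ELSE 'New'
    END AS customer_segment
FROM customer_aggregation;
```

With these rows, customer 1 is VIP, customer 2 is Regular, and customer 3 is New: its spending is high, but its lifespan is under 12 months.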
So this is our first segment. But the second segment, about the ages, we're
going to go and build now. And again, how are we going to do it? CASE WHEN.
So if the age is, for example, less than 20, then the customer is under 20.
Let's make another range where we say if the customer age is between 20
and, let's say, 29, then we have the second range. And we can keep
repeating the same thing for the next one. It really depends on how many
categories you want to build. So 30 and 39, I belong to this group. Now the
next one, let's have the 40s as well, right? So 40 to 49, same thing over
here. And now ELSE, let's say 50 and above, right? So let's go and END it
AS age group. I just want to sort it a little bit, like this. Okay, now it
looks nice. So with that, again, we have turned a measure into a dimension.
Let's go and execute it now.
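
A sketch of the age grouping described above, assuming SQL Server syntax. The label strings and the inline sample ages are illustrative assumptions.

```sql
-- Hypothetical ages, chosen to hit every range.
WITH customer_aggregation AS (
    SELECT *
    FROM (VALUES (1, 19), (2, 29), (3, 35), (4, 49), (5, 67))
        AS v(customer_key, age)
)
SELECT
    customer_key,
    age,
    CASE
        WHEN age < 20 THEN 'Under 20'
        WHEN age BETWEEN 20 AND 29 THEN '20-29'
        WHEN age BETWEEN 30 AND 39 THEN '30-39'
        WHEN age BETWEEN 40 AND 49 THEN '40-49'
        ELSE '50 and above'
    END AS age_group
FROM customer_aggregation;
```

BETWEEN is inclusive on both ends, so 29 lands in '20-29' and 49 in '40-49'. A NULL age would fall through to the ELSE branch, which may or may not be what you want.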
So now, by checking the results, we have the details of the customers, and
now we have a new category. So as you can see, it is working. 54, it is
above 50. This one is in the range between 40 and 49. We have here 67,
above 50. I believe we don't have any customer that is below 20, right? Or
even between 20 and 30. Okay. So with that we have created our two
categories, and by looking at the report, you see we can now segment the
customers into categories: VIP, Regular, New, and the age group. And with
that we have covered all those three requirements, and we come now to the
last requirement: we have to calculate the following KPIs.
Now the first one is an easy one. It is the recency: how many months since
the last order. We have calculated over here the last order for the
customer, it is this one. And now, in order to find the recency, it is very
simple. So all we have to do is to take this over here, I will just put it
maybe after the segmentation, and all you have to do is to use DATEDIFF as
usual: month, the last order date, and GETDATE. So as you can see, we are
using this setup in many analyses, right? We always find the difference
between a date from our dataset and the current date and time, and with
that we get the recency. So let's go and execute it. Now you can see how
many months have passed since the last order of the customer, and of course
you can go and test it using the last order date. And this is really
important in order to understand whether the customer is still active or
inactive.
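
A minimal sketch of the recency KPI, assuming SQL Server syntax. The column name last_order_date and the sample rows are hypothetical.

```sql
-- Hypothetical last order dates per customer.
WITH customer_aggregation AS (
    SELECT *
    FROM (VALUES
        (1, CAST('2013-03-02' AS date)),
        (2, CAST('2014-01-28' AS date))
    ) AS v(customer_key, last_order_date)
)
SELECT
    customer_key,
    last_order_date,
    -- Months between the last order and the current system date.
    DATEDIFF(month, last_order_date, GETDATE()) AS recency
FROM customer_aggregation;
```

Because the result depends on GETDATE(), the numbers change depending on when the query runs.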
Okay, so this is for the first, easy KPI. Now let's go to the second one.
It says calculate the average order value. So how are we going to do this?
Let's go back over here. Now, in order to compute the average order value,
we have to divide the total sales by the total orders. So how much revenue
did the customer generate, and we divide it by the total number of orders,
and with that we find the average. So it is very simple. Let's go and write
that. We're going to go to the end of our table, where we're going to put
our KPIs, and I'm going to say here: compute average order value, so as a
shortcut AOV. So we say total sales divided by total orders, and let's call
it average order value. So let's go and execute it. And if you go to the
last column over here, you can see the average order value of our
customers. But now, if you are dividing numbers, you have to be careful
that you are not dividing by zero, otherwise you will get an error. So
imagine that a customer has zero orders, didn't order anything, you might
get an error. In our scenario, we don't have that, because we are starting
from the order table, the fact table. But still, I like to make sure this
never happens, and for that I usually go and use a CASE WHEN statement. A
very simple one: if the total orders is equal to zero, then make it zero;
otherwise, do the calculation that we talked about. So like this, and at
the end we add an END. So that's it. And with that, I make sure we will
never divide by zero. So that's it.
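
The guarded division for the average order value could be sketched like this, assuming SQL Server syntax. The sample rows are hypothetical and include a zero-order customer purely to exercise the guard.

```sql
-- Hypothetical aggregated rows; customer 3 has no orders.
WITH customer_aggregation AS (
    SELECT *
    FROM (VALUES
        (1, 540, 2),
        (2,  90, 1),
        (3,   0, 0)
    ) AS v(customer_key, total_sales, total_orders)
)
SELECT
    customer_key,
    total_sales,
    total_orders,
    -- Compute average order value (AOV), avoiding division by zero.
    CASE
        WHEN total_orders = 0 THEN 0
        ELSE total_sales / total_orders
    END AS avg_order_value
FROM customer_aggregation;
```

With these rows the values are 270, 90, and 0. Note that when both columns are integers, SQL Server performs integer division and truncates any fractional part.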
zero. So that's it. It was simple, right? Let's go to the last KBI the
right? Let's go to the last KBI the average monthly spend. So how we will
average monthly spend. So how we will calculate that
calculate that compute average monthly spend. So now
compute average monthly spend. So now since we are speaking about the spending
since we are speaking about the spending that means we need the total sales.
that means we need the total sales. Right? So how much sales did the
Right? So how much sales did the customer generate totally and then we
customer generate totally and then we divide it by the number of months and
divide it by the number of months and with that we will get the average
with that we will get the average monthly spend. Right? So that means we
monthly spend. Right? So that means we can divide the total sales by the
can divide the total sales by the lifespan as we calculated it is the
lifespan as we calculated it is the period where the customer has been
period where the customer has been active from the starts until the end.
active from the starts until the end. Okay. So now let's do it step by step.
Okay. So now let's do it step by step. First we have to be careful that we are
First we have to be careful that we are not dividing by zero and I believe in
not dividing by zero and I believe in the lifespan we have zeros. So what
the lifespan we have zeros. So what we're going to say as usual case when
we're going to say as usual case when lifespan is equal to zero then this time
lifespan is equal to zero then this time we will not make it zero the customer
we will not make it zero the customer exist only for one month. So what we can
exist only for one month. So what we can do we can get the total sales of the
do we can get the total sales of the customer and we don't have to divide it
customer and we don't have to divide it by the month in order to find the
by the month in order to find the average because the average is equal to
average because the average is equal to the current total
the current total sales. So with that we make sure we are
sales. So with that we make sure we are not dividing by zero otherwise we're
not dividing by zero otherwise we're going to have our calculation. So total
going to have our calculation. So total sales divided by life span. So the total
sales divided by life span. So the total sale divided by the months and with that
sale divided by the months and with that we will get the average monthly spend.
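
The same idea as a small SQL Server sketch. Again, the CTE name, column
names and sample rows are assumed for illustration; lifespan is taken to be
a whole number of months that was calculated in an earlier step.

```sql
-- Hypothetical per-customer totals; lifespan is in months.
WITH customer_aggregation AS (
    SELECT *
    FROM (VALUES
        (1, 1200, 6),   -- active across six months
        (2,  300, 0)    -- first and last order in the same month
    ) AS v (customer_key, total_sales, lifespan)
)
SELECT
    customer_key,
    total_sales,
    lifespan,
    -- A lifespan of 0 means one month of activity, so the monthly
    -- average is simply the total sales; otherwise divide as usual.
    CASE
        WHEN lifespan = 0 THEN total_sales
        ELSE total_sales / lifespan
    END AS avg_monthly_spend
FROM customer_aggregation;
```

With these sample rows, customer 1 gets 200 and customer 2 gets 300.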
So END and AS, and we're going to call it average monthly spend. Perfect.
So let's go and try that out. Let's go to the right side. And with that we
have our third KPI, the average monthly spend. And with that, guys, we now
have a full report about the customers and we have covered all the
requirements. All right. So with that we have the final results and we
have fulfilled the requirements. So what we're going to do: we're going to
take the whole query and put it in the database as a view. And once we
have the view, the report, in the database, we can share it with the
others. Now the other data analysts in the team can go and maybe create a
dashboard in order to visualize the data using a BI tool like Tableau or
Power BI. In this scenario, the user can go and connect your view, the
already prepared data, to the dashboard. And with that the user can
quickly generate insights without doing a lot of steps in order to prepare
the data for the visualizations. Of course the data analyst can go and
connect the dimensions and facts, but having this one solid view is going
to be way easier to consume. And of course the data analyst can as well
write a query on top of your view in order to generate quick insights. So
as you can see, using only SQL you are covering a lot of complex steps in
order to make the data ready for reporting and analysis, and this is what
usually happens in real projects. We're going to go and put the query in
the database so that the others can use it. So what we're going to do is
very simple: CREATE VIEW, and we're going to put it in the gold layer, and
we're going to call it report customers, and then AS, like this, and let's
go and execute it. It is successful. Now if you go to our database and
check the views, you will find a new view called gold report customers.
Now all you have to do is to go and have a simple SELECT.
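
Roughly, wrapping a report query in a view could look like the sketch
below (SQL Server). The name gold.report_customers is my reading of the
spoken "gold report customers", and the placeholder rows and column names
are assumptions; in the project the body would be the full report query.
GO is the batch separator used by tools such as SSMS and sqlcmd.

```sql
-- One-time setup if the schema is missing (CREATE SCHEMA needs its own batch):
-- CREATE SCHEMA gold;
-- GO

CREATE VIEW gold.report_customers AS
WITH customer_aggregation AS (
    -- Placeholder rows standing in for the real joins and aggregations.
    SELECT *
    FROM (VALUES
        (1, 1200, 3, 6),
        (2,  300, 1, 0)
    ) AS v (customer_key, total_sales, total_orders, lifespan)
)
SELECT
    customer_key,
    total_sales,
    total_orders,
    lifespan,
    CASE WHEN total_orders = 0 THEN 0
         ELSE total_sales / total_orders
    END AS avg_order_value,
    CASE WHEN lifespan = 0 THEN total_sales
         ELSE total_sales / lifespan
    END AS avg_monthly_spend
FROM customer_aggregation;
GO

-- Consumers only need a simple SELECT against the view.
SELECT * FROM gold.report_customers;
```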
So gold report customers, and you will get an amazing report about the
customers. This kind of reporting is very important because you are giving
a full picture, a 360° view of all your customers. So you have details,
categories, measures, everything in one go, and it's going to make life
easier for any user of this view to quickly understand the data and maybe
generate insights based on this one view that can, of course, help your
customers. So I just want to show you now what this means. If a user is
using your report, either in SQL or maybe by connecting it to Power BI or
Tableau, they can immediately generate insights. So for example, they go
and say COUNT customer number AS total customers, and then they take any
dimension, for example the age group. So something like this, and then
GROUP BY the age group. Let's just put it here first. And then they go and
add any other measure, for example the total sales, or any other measure
that you have in this view, and then execute, and quickly they can do
analysis on top of your view without having to go to the fact and
dimensions. So this is like one extra prepared layer on top of the data
model that you have built. And if you don't want to group it by the ages,
you can go and take the customer segments and it will work. So they can
quickly analyze the new derived information that you have prepared in your
report.
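
A small sketch of that kind of query on top of the report (SQL Server). To
keep it runnable on its own, a CTE named report_customers with made-up rows
stands in for the view; the column names customer_number, age_group,
customer_segment and total_sales are assumed from what is said in the
video.

```sql
-- Stand-in for the report view, with hypothetical sample rows.
WITH report_customers AS (
    SELECT *
    FROM (VALUES
        ('C001', '20-29', 'New',      500),
        ('C002', '20-29', 'Regular', 1500),
        ('C003', '40-49', 'VIP',     7000)
    ) AS v (customer_number, age_group, customer_segment, total_sales)
)
SELECT
    age_group,                                   -- any dimension of the report
    COUNT(customer_number) AS total_customers,
    SUM(total_sales)       AS total_sales        -- any other measure
FROM report_customers
GROUP BY age_group;
```

With these sample rows, the group 20-29 has 2 customers and 2000 in sales,
and 40-49 has 1 customer and 7000. Swapping age_group for customer_segment
in both the SELECT and the GROUP BY gives the breakdown by segment.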
So guys, this is an amazing report about the customers. And now what
you're going to do: you're going to go and prepare the second report,
where you have to build complete insights about the products of the
business. It is very similar to the customers. So we want to generate a
report for the products. You have to provide details like the product
name, category, subcategory and the cost. You have to segment the products
by revenue, so you can have categories like high, medium and low. And then
you have to provide the basic aggregations at the level of the products
and then calculate a few KPIs. So as you can see, it is very similar to
the customers. And now what you have to do: pause the video and follow the
same steps as for the customers, where we join the tables, create
aggregations and put everything in CTEs, and at the end, once you are
done, create the view where you have the report about the products. So I'm
going to go now and do it offline, and I will see you soon. Okay my
friends, I hope you are done with the report. I'm going to show you
quickly how I've done it. So I've just created a new view called report
products, and then we start with the base query, where we have joined the
fact table with the dimension products and collected all the columns that
we need for the report, and we put everything in the first CTE. So this is
the first step, and from my side there was no need for any transformations
over here. So we go now to the second step, and here we have to put all
the different types of aggregations in one go. So we calculate the
lifespan, the last sales order, total orders, total customers, sales,
quantity, and as well I have created the average selling price of the
products.
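
A rough sketch of such an aggregation step in SQL Server. The CTE name
base_query, the column names and the three sample rows are assumptions.
Taking the average selling price as the average of per-row sales amount
divided by quantity is also my assumption; the video may compute it
differently.

```sql
-- Hypothetical base query output: one row per order line.
WITH base_query AS (
    SELECT *
    FROM (VALUES
        ('SO1', CAST('2013-01-15' AS DATE), 10, 1, 'Helmet', 70, 2),
        ('SO2', CAST('2013-04-02' AS DATE), 11, 1, 'Helmet', 35, 1),
        ('SO3', CAST('2013-04-20' AS DATE), 10, 2, 'Gloves', 25, 1)
    ) AS v (order_number, order_date, customer_key,
            product_key, product_name, sales_amount, quantity)
)
SELECT
    product_key,
    product_name,
    -- Months between the first and the last sale of the product.
    DATEDIFF(MONTH, MIN(order_date), MAX(order_date)) AS lifespan,
    MAX(order_date)              AS last_sale_date,
    COUNT(DISTINCT order_number) AS total_orders,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(sales_amount)            AS total_sales,
    SUM(quantity)                AS total_quantity,
    -- NULLIF avoids dividing by a zero quantity.
    AVG(CAST(sales_amount AS FLOAT) / NULLIF(quantity, 0)) AS avg_selling_price
FROM base_query
GROUP BY product_key, product_name;
```

With these sample rows the helmet gets a lifespan of 3 months and the
gloves a lifespan of 0.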
It is very simple: we are dividing the sales amount by the quantity. So
these are the basic aggregations about the products, and finally we have
the final query. So we start with selecting the basic information about
the products. So we have the key, name, category, and then we have here
the recency, and we have our new segments. This one is very easy for the
products. So we are saying: if the total sales is higher than 50,000, then
this is a high performer, and if it's between 10K and 50K, then this is a
mid-range; otherwise it is a low performer. So the segmentation of the
products is very simple, and after that we have all our measures that we
aggregated in the CTE. And now we come to the two KPIs. It is very similar
to the customers. So the first one, the average order revenue: it is
simply dividing the sales by the total orders, and you have to take care
of the zeros, of course. And for the average monthly revenue, we divide
the total sales by the lifespan of the product, and of course if the
lifespan is zero, so it is only one month, then it is the total sales, and
with that you generate the average monthly revenue. So as you can see, it
is very similar to the customers, but still the focus here is the
products. Now of course we put this query in a view. So we have the report
products side by side with the report customers, and now we have a really
amazing report about the products where we have everything. So we have a
lot of details about the products. We have as well a dimension in order to
segment our products, and we have a lot of measures that are really
important about each product. So we have the total number of orders, the
sales, how many customers did order the product, the average price, the
average revenue and the monthly average revenue. And this gives you really
deep insights about each product of your business. And of course, this is
very helpful in order to compare the products, right? And now, of course,
this is a core analysis that you're going to need a lot in your business.
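
The segment and the two KPIs of the product report described above could
be sketched like this in SQL Server. The CTE name, column names, segment
labels and sample rows are assumptions, and so is the treatment of the
boundaries (exactly 10,000 counts as mid-range, exactly 50,000 is not yet
a high performer).

```sql
-- Hypothetical per-product totals standing in for the aggregation CTE.
WITH product_aggregation AS (
    SELECT *
    FROM (VALUES
        (1, 'Road Bike', 80000, 40, 10),
        (2, 'Helmet',    20000, 50,  0),
        (3, 'Gloves',        0,  0,  0)
    ) AS v (product_key, product_name, total_sales, total_orders, lifespan)
)
SELECT
    product_key,
    product_name,
    -- Segment by revenue; the first matching branch wins.
    CASE
        WHEN total_sales > 50000  THEN 'High-Performer'
        WHEN total_sales >= 10000 THEN 'Mid-Range'
        ELSE 'Low-Performer'
    END AS product_segment,
    -- Average order revenue, guarded against zero orders.
    CASE WHEN total_orders = 0 THEN 0
         ELSE total_sales / total_orders
    END AS avg_order_revenue,
    -- Average monthly revenue; lifespan 0 means a single month.
    CASE WHEN lifespan = 0 THEN total_sales
         ELSE total_sales / lifespan
    END AS avg_monthly_revenue
FROM product_aggregation;
```

With these sample rows, the road bike is a high performer (2000 per order,
8000 per month), the helmet is mid-range (400 per order, 20000 per month)
and the gloves are a low performer with 0 for both KPIs.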
That's why we offer it as a view. So, I think we have now two amazing
reports about our data. All right, my friends. So, now don't forget to put
all your work in the Git repository in order to share it with others as a
successful project. So as usual we have the data sets, the documentation
and as well the scripts that you have done through this project, and here
I'm putting everything together. So we have all the activities of the
exploration, as well as the advanced analysis that we have done. So we
have the change over time, the cumulative analysis, performance, data
segmentation, part-to-whole analysis and as well our two new reports. So I
recommend, if you haven't done that yet, go and create a repository now
and put all your work there to make sure that everyone can access and see
your work. And my friends, don't forget to add nice comments to your code,
and the formatting and styling of your code should be perfect.
So if you haven't done that yet, go and do it now. All right my friends,
so with that we have done the last step in our roadmap. We have created
two solid reports for our users. And with that, we have completed all the
steps of our advanced analytics project. And with this project and the
previous projects, you can now see the full picture of how to do data
analytics on any data set using SQL. So we started with the first step,
where we explored the database, and ended up having very solid reports
where we have consolidated everything in one view, and with that we now
have a really great understanding of the business and of our data. And now
what you can do: you can go and grab any data set on the internet and go
through all these phases again, and I promise you, at the end you will
have a full picture and understanding of the business. And this is exactly
what I do in each project if I want to understand any type of data set.
All right my friends. So with that we have covered the last type of SQL
projects, the advanced data analytics. And with that we now have three
solid projects using SQL, and they are very similar to real-world projects
in the industry, especially if you want to be a data engineer or a data
analyst. And my friends, we have covered the last chapter in our course.
So this is the advanced level in SQL. And those are all the chapters that
I have designed to take you from the basics to intermediate and then to
the advanced topics. My friend, you made it. Congrats. You should be
really proud of yourself. And now with that, I can say that I have shared
everything that I know about SQL, and you can now solve any complex task
using SQL like I do in my real projects. And I hope that you have enjoyed
the journey. And if you did and you want me to create more free courses
like this, make sure to support the channel by subscribing, liking, and
commenting. This of course is going to make the channel grow and reach
others, and as well it motivates me to make more content like this. So
nothing left to say. Thank you so much for watching and I will see you in
the next course.