This master class provides a comprehensive overview of Data Engineering, covering its fundamental concepts, lifecycle, architecture, tools, and cloud implementations, aiming to equip learners with the knowledge to build and manage data systems effectively.
In this three-hour Data Engineering Master Class, you will learn about what Data Engineering is,
the Data Engineering life cycle, data generation and storage, database management systems,
data modeling, SQL versus NoSQL, data processing systems like OLTP versus OLAP, ETL pipelines
(Extract, Transform, Load), data architecture, and I will give you a complete guide on how to build
the architecture from scratch. We'll cover data warehousing, dimensional modeling, slowly changing
dimensions, data marts, data lakes, data lake versus data warehouse, big data landscape, data
engineering on cloud, top AWS services you should learn for data engineering, and we will understand
real-world case study architectures on AWS, GCP data services, and Azure data services. We will
also explore the modern data stack, important tools for data engineering that you should learn,
understanding Python and SQL for data engineering, understanding data warehouse tools like Snowflake,
BigQuery, understanding Apache Spark with Databricks, understanding Apache Airflow
and Apache Kafka for data engineering, and many more things. So sit tight, get your notebooks,
pen and paper, and start taking notes so that you can remember this for a longer period of
time. And before you move forward, make sure to hit the like button and subscribe to the channel
if you are new here. Let's get started with the Fundamentals of Data Engineering Master Class.
The Fundamentals of Data Engineering

Okay, we'll start by understanding what
Data Engineering is because if we want to understand different fundamental concepts,
we need to have our basics clear. Now, if you have been following me on this channel for the past
few years, then you might already know what Data Engineering is because we keep talking about this.
But if you're seeing me for the first time or if you're just getting started with Data Engineering,
it is important for you to understand what Data Engineering is. So let's start with that.
Okay, now, most of this happens on the internet, because that is where Data Engineering mainly takes place. All of these are businesses operating online.
Okay, the businesses are, let's say, Amazon. Okay, what is the business of Amazon? Amazon
is an e-commerce company. What do they do? They give you the ability to purchase products online,
okay, from your home. Now, this is the business of Amazon. What is the business of, let's say,
Netflix? Okay, the business of Netflix is to give you exclusive content. You buy the premium,
and they give you the exclusive content. On top of that, they also give recommendations and all
of the other things. Okay, this is the business of Netflix. What is the business of, let's say,
Zomato? Okay, this is a food delivery app in India. From your home, you can order food,
okay, and the order will get delivered to you within, like, half an hour to an hour. Okay,
there are multiple companies doing businesses on the internet. Now, all of these companies, okay,
have certain goals and visions for the business, right? They want to understand the customer.
Why do they want to understand the customer? So that they can provide better services. Okay,
they want to increase their profit. Okay, this is one of the goals: increase my profit, understand my customer. They also want to detect some of the bottlenecks they might have in the business and improve the business process. And like this, a company might have multiple goals.
Now, if they want to achieve all of these goals, they need to understand how these things are
happening, and one of the best ways companies can do that is by understanding the data. Now,
most of the time, all of these decisions are taken based on assumptions, right? A business person,
let's say, who is working in the shipping department of Amazon, okay, is actually working on
the ground and has knowledge about this particular segment—the shipping, okay? Now, he already has
some business knowledge to take direct decisions on this particular segment of the business, okay,
because he's an expert. He's been working in this particular field for, like, 15 to 20 years,
so he understands what might be the problem. But a lot of times, even as humans, we might miss out
on some of the information that we don't know. And the best way to understand all of this information
is by understanding what the data says. You can assume certain things, and you can be right for
some time, but if you want to be right most of the time, the best way is to be sure about it.
The only way you can be sure about all of these things is by understanding what the data says,
okay? And this is where the entire picture of Data Engineering, Data Science, Machine Learning, AI,
all of these come into the picture. So let's start by understanding all of these things one by one.
Okay, I just painted you the picture. The reason we are doing Data Engineering and Data Science
in the first place is that companies want to understand, okay? They want to improve their business, they want to provide better services to customers, they want
to, like, remove the challenges they might have in the business by using the data because data gives
you the direct answer. It gives you the factual understanding rather than you just assuming
things, okay? So this is the understanding of why we need the data-driven system.
Now, how do all of these things happen? At the end, we want to have the final outcome, and we already understood what it can be: improving business revenue, recommendations, and so on. So these are my business goals, okay? Every
single thing you do in your data ecosystem, or in general in the engineering ecosystem online,
is for this only, okay? Anything you do should create value for the business. Even if you use,
like, the most advanced algorithm, if it doesn't impact the final outcome of the business,
it is completely useless, okay? It should help the business in some way; it should help the business
to save costs, it should help the business to improve the process, it should help the business
to understand the customer—whatever it can be. If it can provide the final value, then it is useful;
otherwise, it is completely useless, okay? So this is very important—everything that
you do should create value for the business. If this is clear, let's start by understanding
the entire pipeline of Data Engineering and the entire pipeline of the overall internet system,
okay? Before we just understand the Data Engineering life cycle, we need to understand
how different things or different fields come together to make the complete system,
okay? So we have the company, this is my company over here, okay? And the company can be Amazon or whatever; we'll take one example, okay?
Now, at the front end, we usually have the application, okay? This is my application, okay?
This might be my mobile—there's a button, and this is my application. And the user interacts with the
application, okay? I'm the user—I have Instagram installed on my phone, I have Facebook, I have
whatever, okay? I might be using LinkedIn—I have the application. And whenever I interact with this
application, data gets generated, okay? Whenever I click on any application, whenever I, like,
like something, when I comment on something, every single thing that I do, even if I go to Amazon,
if I click on a certain product, every single thing, okay, every single thing generates data,
okay? Now, all of this data will get stored, okay, inside the DBMS, okay? These are called database
management systems. Now, there are different types of database management systems we will understand
in this video, but just try to understand every single thing that we do gets stored inside a
DBMS, a database management system, okay? Now, these systems are usually designed for
storing this kind of data, right? You can store this data easily. There is something called CRUD
operation, okay, which is called Create, Read, Update, and Delete. We'll understand
that in the further video, but the databases used here are typically relational databases, specially designed for exactly this kind of workload. Now, once you store all of these things,
alright, we have the data available. Now, data might be coming from multiple places, but let's
understand—from the application, our data gets stored inside the DBMS, and from there, our entire
Data Engineering pipeline starts, okay? The Data Engineering happens, Data Science happens, Machine
Learning or Data Analytics might happen over here, and then there might be a final dashboard,
okay? There might be some dashboard or some charts available here, okay? Businesses use this,
or there might be a machine learning model, okay? So this is like a robot, okay? I'm bad at drawing,
but this is one of the robots or machine learning models that might help in understanding all of
these different things, okay? Just trying to, like, just trying to paint a simple picture of the
entire ecosystem—there are many different things that go here, okay? The application development,
there might be DevOps who might be deploying the application, but in general, from application to
DBMS, whenever we have any data available, okay? This is where internet companies come into the
picture because you can store all of the data inside the DBMS, database management system,
okay? And then you can utilize all of this data for this kind of workload, okay?
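To make the CRUD idea concrete, here's a minimal sketch of an application writing interaction data into a DBMS. SQLite (via Python's built-in sqlite3 module) stands in for any relational database here, and the `user_events` table and its values are purely hypothetical:

```python
import sqlite3

# Minimal sketch: an application writing interaction data into a DBMS.
# SQLite stands in for any relational database (PostgreSQL, MySQL, ...).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_events (id INTEGER PRIMARY KEY, "
    "user_id INTEGER, action TEXT)"
)

# Create: every click, like, or comment becomes a row
conn.execute("INSERT INTO user_events (user_id, action) VALUES (?, ?)", (1, "click"))
conn.execute("INSERT INTO user_events (user_id, action) VALUES (?, ?)", (1, "like"))
conn.execute("INSERT INTO user_events (user_id, action) VALUES (?, ?)", (2, "comment"))

# Read: how many events did user 1 generate?
count = conn.execute(
    "SELECT COUNT(*) FROM user_events WHERE user_id = ?", (1,)
).fetchone()[0]
print(count)  # 2

# Update and Delete complete the CRUD picture
conn.execute("UPDATE user_events SET action = 'share' WHERE id = 2")
conn.execute("DELETE FROM user_events WHERE id = 3")
```

The same four operations work identically against Postgres or MySQL; only the connection line changes.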
Once you have the data generated, this is where the Data Engineering starts, because without data,
you don't have Data Engineering, Data Science, Machine Learning—because they work fundamentally
on the data. If you have the data, then you can do something about it; if you don't have the data,
then you can't do anything about it, okay? So the fundamental concept of a data-driven system
is having a data generation in place, and this is what the data generation looks like, okay?
You have the application, the data is getting generated, there might be other things such
as sensor data, okay? Say a truck is moving from location A to location B, and in between, from B, it might go on to location C, okay? The truck goes from here to here to here. Now, we need to capture all of this data, and all of this data gets captured by the sensors, right? Just as we generate data when we interact with the application, the truck's sensors generate data as it moves. Just like this, we have the stock market data,
we have data coming from numerous places, okay? So we understand how the data is getting generated,
sent to the system, and all of the other things. So this is the fundamental concept
of Data Engineering, which is where Data Engineering sits in the first place, okay?
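The kinds of generation just described, app clicks and truck sensors, can be simulated in a few lines. This is only an illustration; the field names, the truck ID, and the coordinate ranges are made-up examples, not any real system's schema:

```python
import json
import random
from datetime import datetime, timezone

# Sketch of data generation: two hypothetical sources, application
# click events and truck GPS sensor readings, emitted as JSON records.

def app_click_event(user_id: int, product_id: str) -> dict:
    """Simulate one click event from the application front end."""
    return {
        "source": "app",
        "event": "click",
        "user_id": user_id,
        "product_id": product_id,
        "ts": datetime.now(timezone.utc).isoformat(),
    }

def truck_gps_event(truck_id: str) -> dict:
    """Simulate one GPS reading from a truck's onboard sensor."""
    return {
        "source": "sensor",
        "truck_id": truck_id,
        "lat": round(random.uniform(18.9, 19.3), 5),   # made-up range
        "lon": round(random.uniform(72.7, 73.1), 5),   # made-up range
        "ts": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    for e in [app_click_event(1, "B07XYZ"), truck_gps_event("TRK-42")]:
        print(json.dumps(e))
```

Every record carries a timestamp and a source tag, which is what lets the downstream pipeline know where the data came from and when.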
As we move forward, we will understand all of the different parts of Data Engineering individually,
but just try to understand where Data Engineering really fits into the entire cycle, okay? It is
between the application development and the database. So whenever your data gets generated,
okay, it is over here—this is my application side, and this is my Data Science, Machine Learning,
dashboarding side. Data Engineering sits in between. It is kind of like a plumber,
okay? I'm connecting one thing to the second thing by transforming data and some of the other things
that we will understand, and then I pass the data to the next end, okay? I get the data from one
source, and I pass my data to the next source. How do Data Engineers do that? What are the different
features, functionality, and frameworks they use? We will talk about all of these things one by one
in this video, so don't worry about it, okay? I hope you understood the basics until now.
Okay, so now that we understood where Data Engineering sits,
what is the role of Data Engineers in this place? First, we have the software engineers, okay? The general role of software engineering is to develop the app; it can be a web application or a mobile app: writing code, developing, or deploying some of
the things, okay? Then we might have the DBA. This thing can also be done by the software
engineers in smaller companies, but if you're working in a big company, a DBA is a Database
Administrator who designs and manages the database, right? They build the different tables, the different columns, and all of the other things. These are built by DBAs. Usually, Data Engineers
can also do that, or the software engineers can also do that—depends on the company's
size and your job profile—but let's understand. We might have a DBA who will build a database,
okay? So this person will be building the database. Now, who do we have? We have Data Engineers, okay?
Data Engineers. The roles of Data Engineers: there are many different things, but the core one is to write the ETL pipeline (Extract, Transform, Load). We have a dedicated section on this, but ETL is basically: we extract data from one end, we transform that data, and we load that data, okay? Then it can also be building a database or a data warehouse. Data Engineers can do that too, okay? They can build relational databases or do dimensional modeling, which we'll understand later. Then there is working with big data: processing all of this data using Spark, Hadoop, Kafka, or other frameworks to handle batch data or real-time data. There is also data integration: again, data is coming from the API,
data is coming from the sensors, data is coming from the RDBMS, so we want to integrate all the
data, so Data Engineers have the responsibility. There are other responsibilities such as quality
check of the data and governance, how to organize all of this data properly, so these
are the core use cases of the Data Engineers. Now, after that, we have Data Science people,
okay? Data Science or Data Analysts, okay? Usually, the difference between Data Science
and Data Analysts is basically that Data Analysts usually answer questions about what has happened
in the past, okay? How can we, like, what was the revenue of this particular product last year
compared to the last five years, right? They are trying to find the pattern from the past and find
some of the answers. The role of Data Science is to predict what can happen in the future,
right? We did a product sale for this particular product X amount for the last one year—what will
be the product sale for this particular product for the next six months? This is what Data Science
answers, right? They try to predict what will happen in the future based on past patterns,
and we have the Machine Learning Engineers who can basically automate all of the other things. So on
Amazon, we have the recommendation system, right? All of this recommendation system is
done by Machine Learning Engineers. They deploy the machine learning models onto the production
system so that a system can learn by itself and generate the right output for the user. So you
can predict what is happening inside your system, or you can predict how the users are behaving and
recommend them the right information. Like on Instagram, you go to Reels, you see the right
reels as per your interest, okay? They don't only recommend things they know you'll like; they also recommend some random things just to understand whether you like them or not. So they are just trying
to train the machine learning algorithm based on your usage on the application, okay?
Now, the difference between DS and ML is quite thin, okay? You might see a Data Science person
might do the ML work, or an ML person might do the Data Science work, but in larger organizations,
they might have individual work to do, okay? They have core responsibilities,
but in smaller organizations, they might have to do all of these things by themselves,
so do not get caught up in the title like, "Oh, what does a Data Science person do? What does
a Machine Learning Engineer do?" Just try to understand their core responsibility from the
top level. In the actual organization, when you go to work, okay, when you start working, you might
have to do everything by yourself because the role is just a name, okay? But this is the core
distinction between all of these roles. There are other roles such as DevOps, DataOps—these
are just fancy names, but on a fundamental level, you might be doing similar work, okay?
So we understood what Data Engineering is, okay? The role of Data Engineering is to take
data from one source, okay? It can be any data from, like, RDBMS, API, do some transformation,
and pass this data to Data Science or Machine Learning guys so that they can build dashboards
or they can, you know, build machine learning models. Now, all of these things that we do,
okay, there is a proper approach to it, okay? You can't directly get the data from one source and
directly push it to the Data Science person—there has to be a step-by-step approach that is designed
properly so that the entire pipeline that you generate has some purpose to serve,
okay? And this is what we will understand, okay? So this is what we call a Data Engineering life
cycle. This is taken from the book Fundamentals of Data Engineering. I have recommended this book
to so many people, and it is one of the best books if you want to understand the fundamentals of Data
Engineering. A lot of the material that I have learned about the fundamentals is from that book,
and some of the material I also added in this video, so you will get the understanding, okay?
So the first step here is data generation, okay? Now, this thing we already talked about,
right? Data generation—data is getting generated from multiple places. We already know data comes
from what? APIs, okay? RDBMS, it comes from sensors, it comes from analytics like Google
Analytics or all of the other things, okay? So data is coming from multiple places. Now,
all of this data that is coming from these different places, we need to aggregate this
data together and ingest it into the system, okay? Now, this is the next step over here: data ingestion, okay? We are getting the
data generated from one place, then we need to ingest this data to one particular system. The
ingestion can be setting up the connection with the API, setting the connection with the RDBMS,
building a system that can read the data from sensors, and then automatically ingest this
data into our Data Engineering system, okay? We will understand what this entire Data Engineering
system feels like when we actually look at the project example, but these are the fundamentals,
okay? We have the data generation, and that data is getting ingested into some kind of system,
okay? And we just build a programmatic connection between this. So whenever any data gets added to
the RDBMS, okay, it should automatically get ingested into our system. There are
multiple approaches to do that, but these are the fundamentals. Once the data is getting generated,
we ingest this into our system. Then the data that got ingested will get stored, okay? There's some
kind of storage layer we have, so every data that is coming from multiple places, we have
to store all of this data at some location, okay? It should get stored at some location at least,
so this is where the storage happens, okay? We are storing this data at some location. Now,
between this ingestion and the serving, okay? Serving is basically we are serving our data
to machine learning, analytics, and reporting, okay? The thing that we understood over here,
okay? After the Data Engineering happens, we have Data Science, Machine Learning persons who are
building a dashboard or who are building a machine learning model. The same thing here is that this
is the part, okay? The Machine Learning or the analytics—we have reporting, dashboarding—all of
these things happen over here. This is where the data is ingested, and this is where the data is
getting stored, okay? Between that is the core of Data Engineering that is called a transformation.
Transformation is basically the set of business logic, alright, that we apply to convert our raw data. This is usually what we call raw data because it is coming straight from the source system, okay? And what we serve at the other end is the transformed data. Everything that happens between the raw data and the served data is called a transformation. Transformation is a set of business logic, and it
can be anything, okay? So consider this example. Let me just explain this part. Now, we have data
coming from the API, okay? I have data coming from the API, and I have data coming from the RDBMS,
okay? Now, in both of the data, I have a date column, okay? I have a date column, I have a
date column, and the format of the date in the API is YYYY-MM-DD, okay? It is like 2024-06-01, the first of June, 2024. Now, in the RDBMS, okay, the date format is like MM-DD-YYYY, something like 06-01-2024,
alright? Now we have a date coming. Now, what we need to do is we need to join this system because
at the end, we need to find the analysis. There might be some ID column here, okay, and there
might be one more ID column available over here. We need to join these two data together. Now, when
do we join it? Okay, when we join data coming from the API, this might be, let's say, product date,
okay? This is a product date, okay? And this is an order date—it can be anything like this,
okay? Now, when we join this information, we need to transform this data into one particular logic,
alright, that can be formatted as this particular format or this particular format—it can be
anything. This is the decision that business people or you can take, like I want to transform
this data based on this format only, so any information that is coming from any other sources,
okay, it should be transformed into the YYYY-MM-DD format for the date, okay? So if we are getting
this data after the transformation block, okay, so we will have our transformation block here. Both of these data sources will go inside it, okay? Transformation
can be done by Python, PySpark, Scala, whatever it is, okay? We will understand all of these things,
okay? How do we do the transformation? And at the end of this, I will get this data into YYYY-MM-DD,
okay? The date values will be converted into one single thing. This is what we call a
transformation, okay? This is one example, but transformation can be anything, okay? It can be
removing duplicate values, it can be removing the null values, okay? It can be aggregating the data,
it can be merging two data sets, it can be generating a new column based on the two different
columns, concatenating—it can be anything, okay? It can be filtering—whatever it is, transformation
is basically a set of business logic that you have to write inside the code or inside the SQL
query or use any tool to do that to generate a suitable outcome so that the Data Science person
or the Machine Learning person can build a model or build a dashboard to find the relevant answer,
okay? So as a Data Engineer, my role is to organize the data into the proper structure so
that we can easily visualize this or we can easily understand what is going on inside the data,
so that is my job. I want to make the data into the proper structure, and that usually
happens in the transformation layer, okay? Now that we understood what is going on, we are
getting data generated from one source—it can be many sources, APIs, sensors, whatever, okay? All
of this data is getting ingested into one system. Ingestion basically means making a connection in
such a way that any time a new data is getting generated, we automatically fetch this data, okay,
and store it inside our storage system, okay? This is what we understood. Now, once we have
this data available, we need to make sure the data that is coming from all of these different systems
passes through a certain transformation logic so that our data gets structured. Once that is done,
we serve this data to a user. A user can be a Machine Learning Engineer, a Data Analyst,
or some dashboard expert—it can be anything, okay? They are using this data so that they can
understand, build machine learning models. This is the entire Data Engineering life cycle that we are
talking about, okay? There are some undercurrents that we will understand in further videos,
so don't worry about it, but I hope you understand the complete Data Engineering life cycle from a
fundamental point of view because this is really important, right? You can use any tools, right,
to do all of these things, but if you understand the fundamental side of it,
then it doesn't matter which tool you use—you already know what needs to be done, so you can
pick the shittiest tool in the market, okay, and still make this entire pipeline work, okay?
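The life cycle just described, generation, ingestion, storage, transformation, serving, can be sketched in plain Python without any tool at all. This is a toy illustration under assumed data: the two sources and the date formats mirror the earlier API-versus-RDBMS example, and none of the names come from a real system:

```python
from datetime import datetime

# --- Ingestion: toy records from two hypothetical sources ---
api_data = [{"id": 1, "product_date": "2024-06-01"}]    # API dates: YYYY-MM-DD
rdbms_data = [{"id": 1, "order_date": "06-01-2024"}]    # RDBMS dates: MM-DD-YYYY

def normalize_date(value: str, fmt: str) -> str:
    """Transformation rule: convert any incoming date to YYYY-MM-DD."""
    return datetime.strptime(value, fmt).strftime("%Y-%m-%d")

def transform_and_join(api_rows, rdbms_rows):
    """Apply the business logic, then join the two sources on id."""
    orders = {r["id"]: normalize_date(r["order_date"], "%m-%d-%Y") for r in rdbms_rows}
    return [
        {
            "id": r["id"],
            "product_date": normalize_date(r["product_date"], "%Y-%m-%d"),
            "order_date": orders.get(r["id"]),
        }
        for r in api_rows
    ]

# --- Serving: in a real pipeline this would land in a warehouse ---
served = transform_and_join(api_data, rdbms_data)
print(served)  # [{'id': 1, 'product_date': '2024-06-01', 'order_date': '2024-06-01'}]
```

A real pipeline would swap the toy lists for API calls and database reads, and the final list for a warehouse load, but the shape of the work is exactly this.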
That is the power you have as a Data Engineer because once you understand the fundamentals,
you don't really need to know anything else. You can learn tools within 30 minutes,
okay? It doesn't take time to learn any new modern tool—it's very simple. Even to learn Spark and how
to write the Spark code, it's very easy, okay? You just need to understand some of the functions and
execute. There are some angles to Spark, such as the internal and the understanding of executors,
drivers, and all of these other things that you need to understand to become a better engineer,
but to do this entire job is not that difficult, okay? You just need to understand how to make connections between systems and execute the entire thing, okay?
Now that you understood, we can go forward and start talking about the individual components,
right? How can I do the generation? How can I do the ingestion? What can I use for the
transformation? How can I do the serving? What is used for storage, okay? Machine Learning,
Analytics, Reporting—every single thing that we will talk about, and we will also talk
about this part further down the video, okay? Now this is understood, let's talk about the
data generation and data storage one more time. Alright, so we got the basics until now—data is
generated from multiple places. Data is coming from transactional systems. Transactional systems,
okay, these are called RDBMS, okay? There are multiple types of transactional systems that we
will talk about, so don't worry about it. Data is coming from IoT devices, so we have the IoT
devices, okay? It is coming from there. It is also coming from web and social media, okay?
We understand data is coming from logs and machine data, okay? This is also important because, again,
we are running the technical machines, so they are also generating logs, and if you want to improve
the utilization of this technical machine, we can also use this log data to understand what is going
on and save costs over there also, okay? Then we might have some API data—API or third-party data,
okay? Third-party data. Sorry for the bad handwriting, but this is where the data
is getting generated, okay? Now, once we have the data available, we have to store this data, okay?
The storing of the data: basically, we store it in a relational database, okay? This is the same transactional system we talked about, so from the application to the RDBMS, data is generated,
okay? This is where the data generation—you can also put the RDBMS into data generation because
it is connected to the application, okay? And you can also put it on the storage layer because
data is getting stored inside the RDBMS, so you can also keep it generation and storage—it doesn't
matter, okay? Because from the Data Engineering point of view, we usually consider RDBMS as a
data generation source, okay? From the application point of view, we usually consider it as a storage
layer also, okay? It sounds tricky, but it's simple: you can consider the RDBMS as both data generation and storage. We also have NoSQL databases, okay, which we
will understand. For data storage, we have data warehouses, okay? This is what we are talking
about, okay? The thing that we understood about storage, okay, generation, and ingestion is this
part—this is the data generation, okay? And the storage that we talked about over here is this
part, okay? We can store our data in the RDBMS, NoSQL, data warehouse, or object storage—object
storage can be like S3, Google Cloud Storage, Azure Blob Storage, all of these other things
that we will also understand, okay? You can also call these things a data lake, data lake, okay? So
these are the storage systems. We understood the generation, how the data is getting generated,
and where our data will get stored. So, okay, this is what we understand. Now let's understand about
the DBMS, okay? The thing that we were talking about—transactional systems and RDBMS systems,
okay, that are used for data generation and data storage—in reality, we use the DBMS,
Database Management System, okay? These are the systems specially designed for
storing your data in a structured way so that you can easily query your data.
Now, understand this, okay? You can also store your data in MS Excel or Google Sheets. If you
already know, right? You can have columns here and rows and column formats, so you can store
your data. But if you want to store, let's say, millions or billions of records, and if you want to find a specific record, MS Excel will not be able to handle that, okay? Because finding a specific record among, let's say, a thousand rows, or a hundred thousand (one lakh) rows, will be very difficult there. DBMS systems,
okay, are specially designed for this kind of workload, okay? You can store your data,
and you can easily retrieve, update your data as per your requirement. There are different types of
DBMS systems available. We have PostgreSQL—this is open source. We also have MySQL—this is open
source. We have Microsoft SQL Server, we have Oracle, okay? These are enterprise-level,
okay? If you want to get started, Postgres and MySQL are the easiest to get started. Now,
to work with all of these systems, we have a language, okay? We have a language called SQL—this
stands for Structured Query Language, okay? Now, this is the language that we use to communicate
with the database. You might already know about this because you've been following me,
or you have heard about it somewhere, but if you're new to Data Engineering or just in
general to the data space, SQL is the language that we use to communicate with the database.
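For a quick taste before we list them, here is a sketch of the basic SQL statements run through Python's built-in sqlite3 module; the table and values are made up for illustration:

```python
import sqlite3

# A sketch of the four basic SQL statements (SELECT, INSERT, UPDATE,
# DELETE) using SQLite; the student table and rows are illustrative.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, "
    "age INTEGER, city TEXT)"
)

# INSERT: add rows
db.executemany(
    "INSERT INTO student (id, name, age, city) VALUES (?, ?, ?, ?)",
    [(1, "D", 26, "Mumbai"), (2, "Akash", 25, "Delhi")],
)

# SELECT: fetch a specific record
name = db.execute("SELECT name FROM student WHERE id = 2").fetchone()[0]
print(name)  # Akash

# UPDATE: change the age of student 1
db.execute("UPDATE student SET age = 27 WHERE id = 1")

# DELETE: remove student 2
db.execute("DELETE FROM student WHERE id = 2")
```

The statements themselves are standard SQL; only the surrounding Python plumbing would change with a different database.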
Now, what can we do with SQL? We can do multiple things. We can select the data, okay? We have a SELECT query to fetch the data. We can insert the data, we can update the data, and we can delete the data, okay? All of this data is getting stored
inside the table. It looks something like this, okay? The table will have a column name, okay,
and the actual data stored inside this—this is where all of the actual data is getting stored.
The data that we talk about, like it can be, let's say, this is our data, okay, student data,
okay? And there is a table, Student. What will Student have? Student will have ID,
okay? Student will have a name. It will have age, and it might have, let's say,
a city where the student lives. So ID can be one. The name can be, let's say, D, okay? Age can be
26, and the city can be Mumbai. Just like this, there might be some other person who might be,
let's say, Akash. Age can be 25, and is living in Delhi. Okay, like this, we have data stored inside
our table, okay? So this is what is happening over here. We can select specific data, let's say
where the student ID is equal to two, by writing SQL queries. I can insert new data with ID 3,
I can delete data if I want, and I can update data, say, if I want to update the age or the
name. There are multiple SQL use cases. If you want to learn about SQL, I have a course so you
can learn in-depth, but this is the fundamental concept of SQL, okay? Now, this is what we
understood, right? This is
the SQL that is used for working with the DBMS systems; this is the language you use to work
with the system, alright? Now we have a concept of data modeling.
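As a quick aside, the SELECT, INSERT, UPDATE, and DELETE operations just described can be sketched in Python with the built-in sqlite3 module. This is only an illustrative sketch; the Student table and its rows are the hypothetical example from above:

```python
import sqlite3

# In-memory database; the Student table mirrors the example above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, city TEXT)"
)

# INSERT: add the two example rows.
cur.execute("INSERT INTO Student VALUES (1, 'D', 26, 'Mumbai')")
cur.execute("INSERT INTO Student VALUES (2, 'Akash', 25, 'Delhi')")

# SELECT: fetch a specific record by its ID.
cur.execute("SELECT name, city FROM Student WHERE id = 2")
print(cur.fetchone())  # ('Akash', 'Delhi')

# UPDATE: change a value in an existing row.
cur.execute("UPDATE Student SET age = 27 WHERE id = 1")

# DELETE: remove a row.
cur.execute("DELETE FROM Student WHERE id = 2")

cur.execute("SELECT COUNT(*) FROM Student")
print(cur.fetchone()[0])  # 1
```

Broadly the same queries work against PostgreSQL or MySQL; mainly the connection library and some dialect details change.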
Now, this is where we are slowly diving into the Data Engineering fundamentals concept one by one,
okay? We have cleared the foundation part of Data Engineering. Now we are diving into
the individual concepts that are important for you to understand the entire life cycle, okay?
Data modeling. Now see, whenever we are designing any application or whenever we are thinking to
build or store our data, we need to design a data model. Data modeling is basically a visual
representation of how our data looks, okay? So we will take one example, okay? Let's take the
example that we all understand, which is Amazon, okay? We are building the data model for Amazon.
Now, just use your general knowledge, okay, and common sense to think about what information
Amazon will store. Data modeling is basically charting out or building a visual representation
of how our data will get stored inside the RDBMS, okay? This is the entire goal of it, okay? So I
need to think about what kind of tables or what kind of data that I want to store for my system,
okay? I want to store in Amazon, right? I might be storing information such as about the orders,
okay? I'm storing about the orders. I might store about the users, okay? Users who are on
my website. Orders, then the product, I've been storing about the product, okay? What else? I
might store about the payments. What else? Shipping information,
okay? Shipping. I might store information about the sellers, okay? Sellers who are selling on
my platform. And like this, there might be hundreds of tables in the actual Amazon,
right? But this is the basic table. Like I say, I'm starting my e-commerce company,
and I'm designing a data model from scratch. Amazon doesn't exist, nothing exists, and I'm the
first person who is starting an e-commerce company on this entire planet. And I'm thinking,
okay, I'm going to design my data model; initially, it will have some kind of tables. These
are the pieces of information that I want to capture for my system. Okay, we are
talking from the application side right now. Okay, so we are slowly moving onto data engineering,
one by one. These are all concepts you really need to understand if you want to become a data
engineer. So, I'm going step by step to make you understand each and every single concept.
Okay, so we have the orders, users, products, payment, shipping, and sellers. Now, let's say
I'm satisfied with all of this information that I want to capture. What I will do, I will first
design a data model for this. Okay, it will look something like this. So, first of all, I have
the orders. I will create an order table. This is my order table. Okay, order. Now, the order will
have a lot of things. So, first of all, I have the order ID, order name, and order date. Okay,
let's be satisfied with this. Then we have the user. I have the user available. The user will
also have the user ID, name, age, address, and all of the other things just like a normal user has.
Okay, then we have the product. Now, we have the product information. In the product, we have the
product ID. This is the primary key or the unique key to understand which product it is. Then we
have the product name, product category, product description, product quantity, product weight,
product unit size—lots of things that we can store. Then we have the payment. Payment ID,
payment amount, and payment date can be there. So, we'll just keep these three things. Then we
will have shipping. Shipping ID and shipping date, okay, just keep these two. And the sellers. Okay,
this is sellers. We will have the sellers' ID, seller name, age, location, or whatever it is.
Okay, so we just kind of figured out the tables that we want for our database. Now we need to join
them. Alright, so all of these tables only make sense if they have a relationship with each other,
right? So how does the relationship happen? Okay, a user orders a product. So, the order will have
all of the information that is getting ordered on the platform. Okay, so on the order, we also have
a user ID. This is a foreign key; this will be joined over here. A user can order multiple or
single products, so we will have information about the user ID. A user ID has ordered a product.
Which product did they order? So we also need to add a product ID to the order table and join
it to the product table. So we understand that a user will order a product: in the order table,
we know which user ordered and which product that particular user ordered. Then this is done
for user and product. We can also add
payment information. If I add a payment ID to the order table, then the payment can also be
tracked down easily: in the order, what was the payment ID? If you want to understand how much
payment that particular user made, we can also do that. So this is what we can add here. Okay,
then for the sellers: which seller is selling which product? We can add a seller ID column
inside the product table, so we can understand which seller is selling which particular
product, and then we can make a connection between the seller table and the product table
through that seller ID. And then we might have the shipping information,
so shipping will have information about the order ID, which order is getting shipped. Okay, so we
can join this particular thing over here also. So all of these tables will be connected
together. Again, this is the worst way to draw this particular thing,
but I just want to show you the fundamental side of it. Because if I just show you the picture,
if you just search on Google for a data model picture, you will find a lot of data models.
So in reality, a data model really looks like this. There are some applications,
such as draw.io, or there are some specific applications for databases to make this
kind of diagram. And I teach all of these things in my SQL courses. So, if you want,
you can check the description if you want to know more about it. But this is the fundamental concept
of data modeling. I go in-depth in my courses, but I just want to give you a good overview.
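The relationships just described, with a primary key in each table and foreign keys linking them, can also be written down as table definitions. A minimal sketch using SQLite, with the hypothetical Amazon-style tables trimmed to just a few columns each:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each table has a primary key; products and orders carry foreign keys
# pointing at the tables they relate to.
cur.executescript("""
CREATE TABLE users    (user_id    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sellers  (seller_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT,
                       seller_id  INTEGER REFERENCES sellers(seller_id));
CREATE TABLE orders   (order_id   INTEGER PRIMARY KEY, order_date TEXT,
                       user_id    INTEGER REFERENCES users(user_id),
                       product_id INTEGER REFERENCES products(product_id));
""")

cur.execute("INSERT INTO users VALUES (1, 'D')")
cur.execute("INSERT INTO sellers VALUES (10, 'Acme')")
cur.execute("INSERT INTO products VALUES (100, 'Chair', 10)")
cur.execute("INSERT INTO orders VALUES (1000, '2024-01-15', 1, 100)")

# Join across the foreign keys: which user ordered which product
# from which seller?
cur.execute("""
SELECT u.name, p.name, s.name
FROM orders o
JOIN users    u ON o.user_id    = u.user_id
JOIN products p ON o.product_id = p.product_id
JOIN sellers  s ON p.seller_id  = s.seller_id
""")
print(cur.fetchone())  # ('D', 'Chair', 'Acme')
```

The join only works because the tables share those key columns, which is exactly the point of drawing the relationships in the data model first.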
Okay, now we understand the data modeling. This is what we usually call a SQL table because these are
relational databases; they have a specific schema defined. So, this is the data model. Now, in this
data model, every single piece of information has some kind of schema attached to it. The schema
is basically the data type. So, let's say the order ID will be an integer; the order name will
be a string; the order date will be a date value; the user ID will be an integer again. Just
like this, each and every
single column has some kind of schema or data type attached to it. This is called a SQL or
relational database table because it is properly structured; every schema is properly defined,
and you use SQL queries to work with it. After that, we have something called a NoSQL
database. In SQL, we store our data in the column and row format, but in the NoSQL database, we can
store our data in different types of formats. One of the formats is the key-value. If you know the
basics of Python or JSON, it's something like this: we have a key, ID, with a value attached
to it, one. Then we will have the key name, and the value will be, let's say, D. And the age
will be, let's say, 26. All of this information will be stored
in the key and value. So, if you want to find, let's say, a particular piece of information,
you can just search it by the name, age, or something like that. Then we have the column
family, where all of the data is actually stored by column. We have the document database, and
we have the graph database; graph data is used for representing relationships. We don't want
to deep dive into it; I just want to give you an overview that these kinds of databases also
exist for some kinds of workloads.
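A rough sketch of the key-value/document idea in plain Python. Real NoSQL stores (Redis, MongoDB, and so on) add persistence, distribution, and indexing on top, but the data shape is similar; the keys and records here are made up:

```python
# Each record is a document: keys with values attached, like the
# ID/name/age example above. A key-value store maps a lookup key
# to such a document.
store = {
    "user:1": {"id": 1, "name": "D", "age": 26},
    "user:2": {"id": 2, "name": "Akash", "age": 25},
}

# Lookup by key is direct; no table scan or fixed schema is needed.
print(store["user:1"]["name"])  # D

# Documents in the same store need not share a schema.
store["user:3"] = {"id": 3, "name": "Riya", "city": "Pune"}
print(len(store))  # 3
```

Notice the third document has a `city` field but no `age`; that schema flexibility is the main contrast with the relational tables above.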
After this, these are the usual comparisons that I want to talk about: SQL versus NoSQL. SQL is
relational, which basically means that the data model we talked about, all of these things, are
properly stored and have a relationship between them. As you can see, this table is connected
to that one; the order table is connected to shipping; the shipping table is connected to
the product ID; the user table is connected to the order. They have a relationship between each of
them with specific primary and foreign key IDs. So, this is called an SQL relational database.
Then we have the analytical, which is usually OLAP, or data warehousing. Data warehouses,
this is what we will talk about further down the video, but these are the SQL databases.
Then we have the NoSQL. In NoSQL, we have the graph, wide column, document, key-value. Well,
if you want to understand all of this, you can just Google it, and you will understand most of
it. We don't want to spend time on NoSQL because we will mainly be focusing on SQL. This is what
you will be working with mainly in the real world because most of the data is actually stored in
SQL databases, and you will be using data warehouses. So, let's talk about that one by one.
Okay, now in SQL, the two things that we talked about, relational and analytical, correspond
to two different data processing systems, and we want to talk about that. So, we have two data
storage and processing systems. One is called OLTP, and the second is called OLAP. OLTP means
online transactional processing, and OLAP means online analytical processing.
Okay, in SQL, we have the relational and the analytical. These are the two
things. Relational is usually called online transactional processing, and the analytical
is called online analytical processing. This is a relational database. This is a relational DB,
and this is the data warehouse. And you will be juggling between these two as a data engineer.
Now we are slowly deep diving into data engineering, so pay attention.
Okay, now OLTP system has some kind of use case, and OLAP system has some kind of use case. This
is not something where OLTP is better or OLAP is better; they both have their own places in
the entire system. Now, the use case of OLTP is usually for processing transactional data.
It is used for transactional data. What does transactional data mean? It means that when
you send money to one person from your account, it goes to the other account. That is considered
a transaction. When you purchase something on Amazon, when you buy something on Amazon,
that particular information of the product—that this user purchased this particular product
and made payment for this amount—that entire thing is called a single transaction that is
stored inside the OLTP system. These systems are mainly designed for this kind of workload. So,
when you want to do a fast insert of the data, an update, or a quick read of the data on an
individual level, these are the best systems. We talked about the CRUD operations: Create,
Read, Update, Delete. OLTP is very useful for that kind of workload. So, the use case of OLTP
is more on the
transaction level. Whenever you have a lot of transactions happening on an e-commerce website
or banking, the transaction doesn't only mean money transactions. It can be any transaction,
such as if you buy a product, if you return some product—all of these are the individual
row-level information that is getting stored. But if you want to understand what is happening,
let's say if I want to understand the last five years of data using the OLTP system or SQL,
I won't be able to do that. And I'll explain the reason behind it, but for that, we have an OLAP
system. The OLAP system, the name literally says that it is for online analytical processing. The
reason OLAP systems are good is that they are mainly used for analysis workloads. So, if you
want to analyze the last five years of data, you can easily do that using the OLAP system.
Let me just explain this individually so that you have a better understanding. So, the OLTP system
is mostly row-based. So, every piece of information that you store is stored inside
the row. Like, this is my ID, this is my name, this is my age, this is my payment that I made,
something like this. Now, all of this information is getting stored inside the individual row. Now,
this is the OLTP system used for transactions, so this is really good for row-level operations. If
you want to do something on the row level, if you want to update the date of birth, if you want to
update the age, delete a particular thing at the row level, this is very easy. But let's say if I
want to analyze the entire data—let's say this is the payment made for 10 rupees, 20, and 30,
and what I want to do is aggregate them, and like this, there are millions of rows available.
And if I want to analyze this entire data from start to end, what will I have to do,
if I were to write a query such as 'SELECT * FROM' or 'SELECT SUM(payment) FROM' this
particular user table, let's say if I run this query, the way this entire query gets executed,
it will first fetch all of these individual rows into the result set, one by one, and then from
that entire result set, it will pick just this single column and do the sum. Now, scanning
every row from start to end just to pick one single column is
a useless process for this operation. Understand this, right? Because we just want to get the sum
of payment, I just want to get the information about the payment only. Why am I scanning each
and every individual row? Because this entire database—OLTP databases—are stored on the row
level. Every single piece of information is stored in the row. So, even if I want to get the
information about the payment, I will have to scan all of the data from start to end and then just
select the one single column only. Now, as I said, this is only good for row-level transactions,
if I want to update or delete a specific row. On the other hand, OLAP systems, let me just draw
this, OLAP systems are column-based. So, all of the things are the same. Every single thing, such
as the ID, name, date of birth, age, whatever it is, and this is my payment. On the OLAP system or
the data warehouse, if I execute the same query, these are column-based. Most of the time, you will
find them as column-based. So, all of the single pieces of information that are getting stored
will be stored column by column. In the OLTP case, we are storing individual rows: we have one
single row with, let's say, ID one, the name, and age 25, and after this there will be one more
row attached, so everything gets stored at the row level. Over here, in the OLAP system,
everything is stored at the column level: the IDs get stored together as one, two, three; the
names get stored inside one single column; and the payment information gets stored inside its
own column, say 25 dollars, 26 dollars, something like this. So, every single thing that is
stored internally is at the column level. Just try to understand and visualize this. So, when
I run the same query on the OLAP system, instead of scanning the entire row and then fetching
this one column, it will directly go to the payment column and directly give me the sum. So,
the useless operation of
scanning the ID or the name is not needed. We can directly go to the payment level, and we can
fetch the result that we need. This is the difference between OLTP and OLAP.
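The row-versus-column difference can be illustrated in a few lines of Python. This is only a sketch of the two storage layouts, not how a real engine works, and the records are made up: summing `payment` over row storage touches every field of every row, while column storage reads just the one list.

```python
# Row-based layout (OLTP-style): each record is stored together.
rows = [
    {"id": 1, "name": "D",     "payment": 10},
    {"id": 2, "name": "Akash", "payment": 20},
    {"id": 3, "name": "Riya",  "payment": 30},
]

# SELECT SUM(payment): with row storage we must walk every row and
# pick the one field we care about out of each record.
total_row = sum(row["payment"] for row in rows)

# Column-based layout (OLAP-style): each column is stored together.
columns = {
    "id":      [1, 2, 3],
    "name":    ["D", "Akash", "Riya"],
    "payment": [10, 20, 30],
}

# The same aggregate reads only the payment column; the id and name
# data is never touched.
total_col = sum(columns["payment"])

print(total_row, total_col)  # 60 60
```

Both layouts hold identical data and give the same answer; the difference is how much of it an analytical query has to read.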
Now, understand this as a data engineer. As a data engineer, you will be taking data from OLTP
systems to OLAP systems. In between, we will be writing transformations. The data generation
and storage part that we understood earlier is my OLTP system. This is where the
data is getting generated. This is where I do the storage; this is where I do the transformation,
and this is where I do the analysis. This is where the data warehouse will come into the picture,
and the data analyst will write the query to understand the data, and then they will build
dashboards, ML models, AI models, whatever you want to call them. They will use this OLAP system,
data warehouse, or the storage layer that we will have. We will understand data storage again in
the future about object storage, so don't worry about it. But this is the fundamental of it. Now,
we are just trying to zoom into the individual component and understand what is going on.
So, data engineering is basically taking this data and moving it somewhere else. We should take the
data from OLTP systems, APIs, ingest it into the system, do some transformation, apply some logic,
and load it into the data warehouse. This is the core of data engineering. But how do we do
this? You understand everything, but how does this entire pipeline happen? We have something
called ETL: Extract, Transform, Load. You might already know this; everyone keeps talking about
it. It is the same thing that we talked about in the lifecycle; in one way, the data
engineering lifecycle is ETL only. We are extracting data, transforming data,
and this is the serving layer, which is the loading of data. That is just a conceptual
architecture of how things work. This is what really happens in the real world. We build the
ETL pipeline. We extract the data, we transform the data, and we load the data. Now we already
know about this, right? Where do we even extract all of this data? We extract our data from DBMS,
analytics, sensor APIs, and all of this data from multiple sources. Then this data comes,
and then we do the transformation. We understood transformation also, right? It is about removing
duplicates, handling null values. Structuring data means getting all of the information onto the same
scale. If one age is stored inside, let's say, the string value, and another source has the age
stored inside the integer value, we bring it to the integer level. If the date is in a different
format, we bring it to the same level. And then we load our data. The load can be on anything; it
can be on the data warehouse. Data warehouses are like Snowflake, BigQuery, Redshift, and a lot
of others. Or you can also store it in object storage services like S3, Google Cloud Storage
(GCS), or Azure Data Lake. This is the core concept
of ETL that we will also talk about one by one. Now, okay, so you understood the upper layer.
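A toy sketch of this extract-transform-load flow in Python. The source records, the field formats, and the list standing in for a warehouse table are all made up for illustration:

```python
# Extract: pretend these rows came from an OLTP database or an API.
extracted = [
    {"id": 1, "name": "D",     "age": "26", "signup": "2024-01-15"},
    {"id": 2, "name": "Akash", "age": 25,   "signup": "15/01/2024"},
    {"id": 2, "name": "Akash", "age": 25,   "signup": "15/01/2024"},  # duplicate
]

def transform(records):
    """Remove duplicates and bring every field to the same scale."""
    seen, clean = set(), []
    for record in records:
        if record["id"] in seen:
            continue  # drop duplicate records
        seen.add(record["id"])
        record = dict(record)
        # One source stores age as a string, another as an integer:
        # bring both to the integer level.
        record["age"] = int(record["age"])
        # Normalize dd/mm/yyyy dates to yyyy-mm-dd.
        if "/" in record["signup"]:
            d, m, y = record["signup"].split("/")
            record["signup"] = f"{y}-{m}-{d}"
        clean.append(record)
    return clean

# Load: here just a list standing in for a warehouse table.
warehouse = []
warehouse.extend(transform(extracted))
print(len(warehouse))  # 2
```

In a real pipeline the extract step reads from live systems, the transform runs in something like Spark or plain SQL, and the load writes to a warehouse or object storage; the shape of the work is the same.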
We did all of this work just to understand the top layer of the data engineering lifecycle.
Everything we did till now covered just the top layer. Now I want to look at the bottom layer
of the data engineering lifecycle: the undercurrents of security, data management, data
architecture, orchestration, and software engineering. These undercurrents are also important.
Security: just by the name, you understand that our data should be secure. That basically means
who is able to access our data and the system. We need to make sure the right
person with the right authorization can only access our data. We should not give access to
our data to every single person working in the company. This is the importance of security.
Data management: that basically means data governance. Data governance means we should
be able to easily find the data that we need. Think about this, right? I was working at a
furniture e-commerce company operating in Europe and the US. They had more than a thousand
tables in the system. Now, if I had to find particular data, where this data is stored,
I had to go through the documentation they created to understand, okay, this data can be found at
this particular location. This is what we call data governance: the ability to find data. Then
the definition: what each and every single column means. Think about it, if you have thousands of
tables, and if you access one of the tables from that pool, and that particular single table has,
let's say, hundreds of columns, and you want to understand what the sixth column means. It
could be something like the payment gateway ID or XYZ, something like that. I don't know what
this particular column means. This is the use of definitions, understanding what the data is, what
type of data is stored. This is very important. Data governance. Accountability: who owns this
data? Who is the user? Did you create this table? Which user created the table? So I can go to that
user and understand if I don't really understand the purpose of this table, I can go to the user.
If I am working in the shipping department, I am an engineer over there, and I created the
entire shipping table. Now, if any person from, let's say, the order department or the return
department wants to understand what is going on inside this table, they can directly reach out to
me. I am accountable for that particular data. That is what accountability means.
Then we have data modeling, which we already understood. Data integrity: making sure every
piece of data makes sense; every piece of data is proper. It basically means the data is correct;
it should not have any random information. DataOps: you might already know about DevOps.
DevOps is basically to automate the entire process of deployment of your application using the best
practices. DataOps is somewhat similar. You monitor data governance, observability, incident
reporting. That basically means everything that is happening inside your data system. Every single
thing that is happening in your data system, you should be able to monitor. You should be able to
report the incidents that are happening. All of these things should be automated,
and that is a fundamental concept of DataOps, data operations. So, all of the operations of the data,
right? When you deploy something, is it working fine? If it is working fine or not,
I should be able to get the error message. I should be able to observe how my data pipelines
are working. I should be able to monitor what is going on. All of this is a part of DataOps.
Data architecture: we have a detailed section after this about data architecture where you
analyze the information, analyze the trade-offs, and add value to the business by designing the
proper architecture for the system. We'll understand this.
Orchestration: this is used for coordination, for scheduling jobs, and managing tasks. In data
engineering, we have multiple data pipelines working. Data pipelines are basically the ETL
jobs. It is just a fancy name, but it's just extracting, transforming, and loading the
data to some location. This entire operation is called a data pipeline. Now, like this,
there might be hundreds of data pipelines deployed in the organization. I need to orchestrate all
of these things. Let's say once the first data pipeline completes, I should only run
the second data pipeline because the second data pipeline is dependent on the first data pipeline.
All of these things are called orchestration. We have a tool called Apache Airflow for this kind
of workload, and we will also understand orchestration as we go into the future.
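The "run pipeline two only after pipeline one completes" idea can be sketched as a small dependency graph in plain Python. In practice a tool like Apache Airflow handles this, plus scheduling, retries, and monitoring; the pipeline names here are hypothetical:

```python
# Each pipeline lists the pipelines it depends on (an acyclic graph).
dependencies = {
    "ingest_orders":   [],
    "ingest_users":    [],
    "transform_sales": ["ingest_orders", "ingest_users"],
    "load_warehouse":  ["transform_sales"],
}

def run_order(deps):
    """Return an execution order where every pipeline runs only after
    all of its upstream dependencies have finished."""
    done, order = set(), []
    while len(done) < len(deps):
        for task, upstream in deps.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)
                done.add(task)
    return order

order = run_order(dependencies)
print(order)  # ['ingest_orders', 'ingest_users', 'transform_sales', 'load_warehouse']
```

An orchestrator is essentially this dependency logic wrapped with schedules ("run daily at 2 a.m."), failure handling, and observability across hundreds of pipelines.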
Software engineering: software engineering is basically programming, software design,
testing, and debugging. You have to apply the best practices of software engineering when
you write the ETL, the transformation job using code. You should use a design pattern of software
engineering for scalability. You should also use testing and debugging approaches to test your data
pipelines. So, all of these are the fundamental concepts. When building a data pipeline,
you should remember security is important, data management is important, DataOps is important,
architecture is important, orchestration and software engineering. Just fundamental concepts,
good to know. You don't need to deep dive into it right now; as you move in your career,
you will understand them one by one. The next thing I want to talk about is data
architecture. If you want to become a good data engineer, you should understand data architecture,
and we will be referring to one of the newsletters that I wrote, "Data Architect 101 for
Data Engineers." So, let's jump into that. So, before we move forward, I just want to say
that I am re-recording this segment because I was recording this part yesterday, and my disk
got full. I ran out of space, my OBS stopped recording in between, and the entire file,
like a one and a half-hour file, got corrupted. So, I'm re-recording this part of the video just
to have one complete video. If you're still watching this video till here,
I'll urge you to at least like this video because it takes a lot of effort, and do
comment something so that it increases the reach of this video and it reaches more and more people.
Okay, let's start with the video. Now, till now, what we have done is we have understood the basics
of data engineering, right? We understood what data engineering is, where data engineering fits
in the entire pipeline, the data engineering lifecycle, different parts of ETL, OLAP
versus OLTP. So, we cleared the basic fundamentals required to understand core data
engineering. Now, I want to take you on a journey to understand how data engineering happens
in the real world: how the architecture is actually built from the ground up, how the thought
pattern is developed, how you understand the business side, how to choose the right
technology, and how to put all of these individual components together.
Okay, so let's start. Now, I want to make you understand data architecture first. Because
before we even understand the different parts of data engineering, it is really important that you
understand how to build the basic architecture as a data engineer. Because this is the core skill
set, and we'll be learning about that, right? So, I published this particular newsletter.
If you are interested, you can also subscribe to it. Just go to DataVidhya.substack.com
to get the high-quality data engineering blogs. Okay, so Data Architect 101 for Data Engineers.
Now, till now, we have understood that the goal of every data project is to solve a business
problem. From the start of the video, I've been saying this particular thing again and again,
that everything you do as a data engineer or as an engineer in general, right? You are doing all
of these things for the business. Now, it can be anything from reducing the current system
cost to building a full-fledged data system to help businesses make data-driven decisions. Now,
I want to take you on a journey to understand how to think about building data architecture from
the data engineering point of view. Because as you grow in your career, you should have
the basic understanding of how to design the architecture and how to build data systems. What
is data architecture? So, from the definition of the fundamentals of data engineering, data
architecture is a design of systems to support the evolving data needs of an enterprise. Evolving
data needs are achieved by flexible and reversible decisions reached through a careful evaluation
of trade-offs. We'll understand this technical architecture, but in simple terms, it is basically
like before you construct a building, right? You have to build a blueprint of the building. If
you're trying to build, let's say, a 12-floor building, you have to first build the blueprint.
Inside the blueprint, you have to add some of the things, such as the foundation, floor plans,
elevation, elevator, stairs, office, restroom—all of these things you have to first plan, and then
you can start building the entire construction. Data architecture has a similar concept. Instead
of foundation, floor plans, elevation, and elevators, you'll have to think about storage,
what are the different software that you have to use, how does the data actually flow, interfaces,
how do you write the transformation, the staging areas, data warehouses, reporting systems, and
many more. Just like you think about building an entire building, the construction, you also have
to think about the data when you are building data architecture. You also have to think about what
are the different components that we need in order to build the entire system. This is how we start.
Now, as per the technical definition that we just read, decisions should be flexible and
reversible, which means that for each and every component you put inside the architecture, in
case something goes wrong, you should be able to easily replace it with something else. Every
decision you take, if it goes in the wrong direction, should be easily reversible so that you
can make it right. This is what it
means. It is achieved by flexible and reversible decisions reached through a careful evaluation
of trade-offs. Trade-offs are basically, you have to understand, based on your requirement,
which technologies you can choose. We'll understand all of this step by step.
Now, building data architecture is divided into two different parts. One is business needs,
and the second is technological integration, basically the operational architecture and
the technical architecture. Let's try to understand both of these right now,
and then we'll deep dive into them individually. We focus on the business goals and requirements
inside the operational architecture. Again, we understood, right? Everything that we are
doing is for the business only. So, before you think about choosing the right technologies or
writing code and all of the other things, first, you need to define what the business even needs
in the first place. Because once you know that, then you can think about the technological side.
So, the first step in building data architecture, or even if you're building your own personal
project, is to understand the operational side or the business side. For example, in an e-commerce
platform, what is the impact of the XYZ category of the product? So, I want to find this particular
thing. This is my business goal. I want to find information about this particular product. Why
is there a delay in product shipping? So, I want to understand what is happening with the product
shipping. I want to understand why there is a delay in shipping. So, this is my business goal.
How do we manage data quality from third-party vendors? In e-commerce, we work with different third parties, such as FedEx or other shipping providers, and data might be coming from multiple places. How do we manage data quality while working with these vendors? These
are the different business goals that we have. So, while building technical architecture, we
need to think in this particular direction. These are different things that the business needs. So,
now I have to build my technical architecture to fulfill all of these different requirements.
In the technical architecture, we focus on the technical side: how to ingest, store, and transform data, and what happens when there is a sudden order spike. On the technical side, we mainly focus on storage, how we ingest data, how we transform data, and how the system behaves during a festival sale or a sudden traffic spike—so we also think about scalability. This is more of a system design concern. In short, one is the business side, where you focus on what the business needs, and the other is the technical side, where you think about which technologies you can use. Let's try to understand all of these things in a little more detail with examples.
The operational architecture ensures that your data practice aligns closely with the business
objectives. It is the "why" behind every piece of data you collect, process, and store—why are we even building all of this? It is to support the business in achieving its goals. Here are some insights to think
about when building the operational architecture or defining the business goals. First, start with
the end in mind. Always begin by understanding the business problem you are trying to solve.
This clarity will guide your decisions and ensure that your data architecture directly
contributes to the business outcome. This is very important—start with the end in mind. We need to
understand what the business goals are before you even think about building the architecture or the
technologies. Understand what the business needs, because once you define that, you can easily build
the technological side. Technology is very easy to build if you know what the business needs. If
you don't know what the business needs, you will be stuck in building the architecture
and will never be able to get out of it. Second, iterate and evolve. The business keeps changing—every six months a new product line comes up, priorities shift, product strategies change. So, when you design your architecture, it should be able to iterate and evolve quickly as the business changes. Third, focus on impact. Everything you do should generate value for the business. Every data
solution you architect should have a clear line of sight to its business impact. It can be improving
customer satisfaction, streamlining operations, or enhancing decision-making. The value of your
data initiative should be measurable and aligned with business priorities. This is operational
architecture and aligning with business goals. Now let's talk about the technical architecture,
the building block. This is where the actual execution happens. While operational architecture
is about "why," the technical architecture is the "how" of the equation. By focusing on specific
technologies and methodologies, you'll be able to meet your operational goals. So, what do we do? We
use technologies—technology is our "how" to meet the business goals, which is basically the "what"
we want to achieve. Very simple to understand. On the technology side, there are thousands of tools available in the market. This is the big data landscape (diagram shown), and there are so many tools that you can't even read their names until you zoom in. You don't need to understand each tool; you just need to know that different tools exist for different kinds of workloads. We have a proper framework to choose the different
technologies as per your business use case. Now, you can't choose any random technology and think,
"Okay, I'll use Snowflake, I'll use Apache Spark, I'll use these fancy tools just to
solve my business problem." The fancy tool itself doesn't matter. You can even use a simple Python
script as long as it solves and helps you reach your business goals. Technology is
not about choosing fancy tools or something everyone is using in the market. As a business,
you should be thinking about saving costs and reaching your business needs. Whatever technology
helps you, whether it is an enterprise-level technology or an open-source technology, as long
as it solves your business problem, you're good. Now, let's try to understand that one by one. How
do you build the technical architecture? Simplicity is key—the aim is to keep your
technical architecture as simple as possible while meeting your needs. This approach makes
your system more maintainable, scalable, and less prone to error. The simpler you keep things,
the easier it is to maintain, scale, and quickly identify errors. The more complex
the system, the harder it is to debug errors. Second is choosing the right tools for the job.
There is no one-size-fits-all solution in data architecture. The right storage,
processing, and analysis tools totally depend on your requirements and the specific use
case. If you have structured data, you can go with a data warehouse. If you have millions of rows,
you might not need Snowflake or another expensive database. You can work with
basic ad-hoc query interfaces like Amazon Athena, which will be good to go. All of these different
decisions should be made based on your business understanding. It's not about choosing fancy
tools; it's about solving your business problem. Third is building for scale and flexibility. Even
if you are not dealing with billions of rows right now, in the future your business will grow. If you
are projecting that growth, you should be planning the architecture to scale all the systems.
For example, currently, you're using Python to process millions of rows, but you know you'll
have billions of rows tomorrow. You should keep the system ready in the backend for that growth.
For instance, you can use distributed processing like Apache Spark and scale up the cluster as
needed. Start with a smaller cluster and then think about scaling up as you move forward. It's
not that everything is perfect when you start; you start small and evolve as you move forward.
Fourth is embedding automation. A lot of times, you might monitor different systems manually,
try to solve different errors manually, or build data pipelines manually. Instead,
you should generate scripts and automation to do these things. In case an error occurs, you should
get an email or a Slack notification, depending on your system integration. Instead of checking every
single day whether your data pipeline is working, you should have an alerting mechanism in place so
that you don't have to check manually. Finally, prioritize data
security and governance. In the digital age, data leakage is quite common, so you should properly
secure your database, encrypt your data, and keep your data secure within the network. These
are the different things you need to consider while building your technical architecture.
Now, let's bring all of these different things together to understand how this happens in the
real world. Let's take the example of the data architecture for an e-commerce platform—pretty
easy to understand. The first thing is that we need to understand the business needs. In
this case, let's define the business goals, because this is what we understood first. We
define the operational architecture, like what are the goals of the business. In this case,
the first goal is to improve customer experience: improve site navigation, personalize product
recommendations, and enhance customer service. Simple to understand. We want to improve the
overall site navigation, how customers interact with the application, and build a recommendation
engine and customer service integration. Next is operational efficiency: streamline
inventory management, order processing, and shipping to reduce costs and delivery
times. We need to improve our entire operational efficiency so we can reduce order processing time,
reduce shipping costs, and shorten delivery times. Then, marketing insights: we want to understand
how customers are behaving so we can improve product placement and increase sales.
Vendor management: we might be working with different vendors, so we also want to build a
strategy for better product availability, pricing strategies, and quality control.
And fifth, compliance and security: in an e-commerce platform, people will be
making payments, so there are compliance requirements we need to follow. For example,
we don't capture credit card information, or if we do, we should mask it so that it
doesn't get leaked. These are some of the compliance requirements we have to follow.
So, these are the business goals, right? We want to increase customer experience, operational
efficiency, marketing insight, vendor management, compliance, and security. Now, based on these
business goals, we can think about building the architecture—the actual technical architecture.
The first is our data ingestion layer. We are getting data from multiple sources,
and the purpose of the ingest is to collect data from various sources such as website interactions,
server logs, vendor systems, inventory management, and customer support. We can use technology like
Apache Kafka for real-time data streaming to handle data coming from different sources.
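Kafka itself needs a running broker, so as a stand-in for the producer/consumer flow described above, here is a minimal in-memory sketch. The queue plays the role of a Kafka topic, and the event shapes are invented for illustration.

```python
from collections import deque

topic = deque()  # in-memory stand-in for a Kafka topic

def produce(event):
    """Producer side: sources append events as they happen."""
    topic.append(event)

def consume_batch(max_events=100):
    """Consumer side: the ingestion layer drains events for downstream storage."""
    batch = []
    while topic and len(batch) < max_events:
        batch.append(topic.popleft())
    return batch

# Events arriving from different sources (website, vendor systems, ...).
produce({"source": "website", "event": "page_view"})
produce({"source": "vendor", "event": "inventory_update"})
batch = consume_batch()
# batch now holds both events, in arrival order, ready to be written to storage
```

With real Kafka, `produce` would be a `KafkaProducer.send(...)` call and `consume_batch` a consumer poll, but the decoupling idea is the same.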
After we capture our data, we need to store it in some object storage for a longer period of
time. The purpose is to store collected data in a structured manner for easy access and analysis.
Different components, like object storage (S3 bucket) for unstructured data, or data warehouses
like Snowflake or BigQuery for structured data, can be used depending on
your business requirements. How do you decide which one to
use—Snowflake or Redshift, for example? It depends. If you're already on AWS,
going with Redshift might be a good choice due to integration. But if Redshift is too expensive for
your business needs, you can go with Snowflake or even open-source solutions. You need to research,
understand your data size and frequency, and do a simple proof of concept (PoC) to
see how different technologies behave with your data. Whatever works best, you can choose that.
So, we might have to structure our data before we put it into the data warehouse—that's where
the data processing and transformation layer comes in. This is where we clean, validate,
and transform our raw data into a structured format. For this, we can use Apache Spark if
we're working with large datasets. If you have a smaller dataset, like a few million rows,
you can go with simple Python scripts. But if you have a large dataset and data coming from
multiple sources, you might want to go with Apache Spark, a highly used framework by top companies.
After the data is in the data warehouse, the data analysis and business layer comes
into play. This is where machine learning engineers and data analysts build dashboards
and machine learning models for predictions to help the business move forward. This is where
the final value comes in—when a person from the business team can look at a dashboard,
see issues in shipping, and make the right decisions to improve the overall business.
Business intelligence tools like Tableau and Power BI help us visualize data,
and machine learning platforms like TensorFlow and PyTorch help us build recommendation engines
and algorithms. There's also the side of data security and compliance, where we ensure that
we meet regulatory compliance, such as GDPR and CCPA. These are government regulations you need
to follow when storing data, like encrypting or masking personal information. We'll cover
data masking in more detail later in this video, so don't worry about it.
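Just to give a taste of the masking idea mentioned above, here is a tiny sketch; the helper name and the exact masking format are my own choices, not a standard.

```python
def mask_card(card_number):
    """Keep only the last 4 digits of a card number; replace the rest with '*'."""
    digits = card_number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - 4) + digits[-4:]

masked = mask_card("4111 1111 1111 1234")
# masked -> "************1234"
```

Real compliance work (PCI DSS, GDPR) goes far beyond this—encryption at rest, tokenization, access controls—but this is the basic shape of masking.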
Lastly, we have the data integration and API layer. We'll be working with multiple vendors and sending data between different systems, so we should build APIs for easy integration between those systems—this also needs to be part of the plan. If we cover all of these requirements, our final architecture might look like this (example architecture shown). This is not the only possible architecture, but it might look like this, and you can improve on top of it.
As you can see, we have data coming from on-premises systems, social media,
and stream data. This data is ingested into the system, stored on AWS S3 as a data lake.
We can use transformation layers such as AWS Glue and Lambda to process our data,
and then store it on Amazon Redshift. We can also use Amazon Athena as an ad-hoc query interface
and SageMaker as a machine learning platform. Visualization is done through tools like Tableau.
This architecture is built to fulfill our business needs. We define the business goals,
then define the tools to use, and then build the architecture. If you look at this architecture,
it looks similar to the data engineering lifecycle we discussed earlier. There's data collection,
ingestion, storage, transformation, serving, and end users. The data engineering lifecycle is the
fundamental block, and this real-world architecture applies those concepts.
You can plug and play—if you want to use Google Cloud Storage instead of
S3 as a data lake, you can. If you want to replace Amazon Redshift with Snowflake, you
can. If you prefer Databricks over AWS Glue, go for it. Use what best meets your business needs.
That's everything about building architecture. I hope you understood; if this is clear, we can move forward and discuss the other parts. Now that we've understood architecture and
how it's built, let's try to understand the individual components of the architecture,
their use cases, and how the entire execution happens while building this.
Let's start by understanding the data warehouse. This is what the architecture of a data warehouse
looks like (architecture shown). So, we have data coming from multiple places, as we discussed. Data
comes from APIs, RDBMS, websites—all these places generate data. This data goes to
the streaming engine and gets ingested, and then we write the ETL pipeline. After ETL,
our data gets stored inside the data warehouse. This is the ETL pipeline—what we are doing
is extracting data, transforming it, and then loading it onto the data warehouse.
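The extract-transform-load flow just described can be sketched in a few lines of Python. This is only an illustration: the table and field names are invented, and SQLite stands in for the data warehouse.

```python
import sqlite3

# Extract: raw records as they might arrive from a source system.
raw_orders = [
    {"order_id": 1, "amount": "120.50", "city": " new york "},
    {"order_id": 2, "amount": "80.00", "city": "miami"},
]

# Transform: fix types and normalize text before loading.
clean = [(r["order_id"], float(r["amount"]), r["city"].strip().title())
         for r in raw_orders]

# Load: write the structured rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, city TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)

rows = conn.execute("SELECT city, amount FROM orders ORDER BY order_id").fetchall()
# rows -> [('New York', 120.5), ('Miami', 80.0)]
```

In ELT, the same `clean` step would instead run as SQL inside the warehouse after the raw rows are loaded.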
There's one more concept called ELT, where instead of transforming the data first, we extract and load the data into a staging area, or directly into the data warehouse, and then do the transformation there using SQL queries. This is ELT—extract, load, transform. In ETL,
we extract, transform, and then load it as per our requirement. These are the two ways you can
build a data warehouse. In the real world, ETL is highly used because it's the most structured way
to organize your data. ELT is also used, and some newer companies are trying to replace ETL with
ELT, where you don't have to do the transformation first—you load your data into the warehouse as it
is and then transform it as needed. However, ELT is not as successful because real-world
data is often messy and requires some processing before storing it in the
data warehouse. ETL is what you'll be using most of the time, but it's good
to know that ELT also exists for some use cases. When we built the data model in our relational
database part, we understood that data models are normalized—this means we split the data across multiple tables to reduce duplication in each table. This allows us to have proper information
stored across different tables. Let me show you that again for clarity. This is what it looks like
(example shown). We have different tables that store different information. If you want to get
information about a user who purchased a product, you need to pull the user ID, connect it with the
order table to get order information, then connect with the product information, and if you want to
track payment information, you'll need to join the payment ID—joining
four different tables to get one outcome. However, relational databases are not designed
for analytical workloads. Even if you join all this data and try to run analysis queries by
aggregating user or order information, the OLTP database (Online Transaction Processing database)
will struggle because it's not designed for that kind of workload. It will pull all these
rows one by one and then pull one single column for your final analysis—not ideal.
This is where the data warehouse comes in, but you can't just store your data in a data warehouse
without following specific methods—that's where dimensional modeling comes in. Just like we have
a method to store data in relational databases (data modeling), we have a method to store data
in a data warehouse called dimensional modeling. In dimensional modeling, we have two things:
Dimensions and Facts. Dimensions and Facts are the two types of tables you'll create
to build your data warehouse. This is called a dimension table, and this is called a fact
table. In a given star schema, there is typically one fact table and multiple dimension tables.
The fact table stores information about quantitative data points that
can be measured in the business, such as sales amount, product quantity sold,
revenue, profit—all the quantitative values that get stored in the fact table.
It is the center of your dimensional modeling. On the other hand, there are multiple dimension
tables, each representing different business categories. For example, you might have a
product dimension, a date dimension, and an order dimension. Each dimension table
stores information about the categories or descriptive attributes, such as product name,
product category, user name, user city—all descriptive attributes related to the dimension.
If you want to understand how all this happens in detail,
I have a course available on data warehousing with Snowflake, where I go deep into this. For now, I'm
just providing a fundamental overview. Dimensional modeling is built using two concepts:
star schema and snowflake schema. These are the two methodologies or concepts used to build a
dimension model. Let me show you (example shown). This is what a star schema looks like—there's a
fact table in the center with different dimension tables attached to it. It looks like a star,
hence the name "star schema." The snowflake schema is a more normalized version, where there are
sub-dimension tables attached to the dimension table. It kind of resembles a relational data
model but still has a fact table in the middle, with different dimension tables attached to it.
In the star schema, you have the fact table in the center and dimension tables attached to it,
forming a star shape. The snowflake schema is similar, but with sub-dimension tables
added to the main dimension tables. The snowflake schema is different from Snowflake, the company that offers a cloud data warehouse as a service. Let's look at an example. Let's say we're working
with an e-commerce company. We'll have a fact table in the center, such as an order fact table,
which stores all transactional information. This will have a unique ID and quantitative
attributes like price, quantity, and weight—all measurable attributes in the business. Then,
you'll have different dimension tables, like order dimension, product dimension, and date dimension.
Each dimension table will store descriptive values like product name, product category, and other
relevant information. You can join these tables using a common key, such as product ID, to get the
final analysis. This makes analysis easier because if you want to get information about a product and
its quantity, you just need to join two tables. This join happens in the data warehouse, and the
OLAP database (Online Analytical Processing database) will handle this more efficiently.
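That star-schema join can be sketched as follows; SQLite stands in for the OLAP warehouse here, and the columns are deliberately simplified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE fact_orders (order_id INTEGER, product_id INTEGER, quantity INTEGER, price REAL);

INSERT INTO dim_product VALUES (1, 'Keyboard', 'Electronics'), (2, 'Mug', 'Kitchen');
INSERT INTO fact_orders VALUES (101, 1, 2, 25.0), (102, 1, 1, 25.0), (103, 2, 4, 8.0);
""")

# Analysis query: total quantity sold per product, with just one join.
result = conn.execute("""
    SELECT p.product_name, SUM(f.quantity) AS total_qty
    FROM fact_orders f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY total_qty DESC
""").fetchall()
# result -> [('Mug', 4), ('Keyboard', 3)]
```

Compare this single fact-to-dimension join with the four-table join we needed in the normalized relational model earlier—this is why analysis is easier on a star schema.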
If you want to understand this in more depth, I go hands-on with this
in my data warehouse course on Snowflake. I teach these concepts using real datasets.
Now that we've covered facts and dimensions, I want to talk about Slowly Changing Dimensions
(SCDs). We know that these facts, such as quantity, product weight,
and price, keep changing. Quantity changes, product prices change, and these changes
need to be reflected in the system. We understood that the data flows from sources, like APIs
or RDBMS systems, through ETL to the data warehouse, where it gets updated daily, hourly,
or however frequently it's scheduled. But these dimensions, like product name and user address,
don't change frequently—these are dimension values that don't change for long periods. However,
when they do change, how do we handle that? This is where the concept of Slowly
Changing Dimensions (SCDs) comes in. SCDs deal with handling dimension values that
change slowly over time. There are different strategies for handling SCDs, categorized into
different types like SCD1, SCD2, and SCD3, each with its own approach to handling these changes.
In SCD Type 1, the values are overwritten, and no history is maintained. For example, if we
overwrite data without keeping the previous value, we are using SCD1. If a customer's city changes
from New York to New Jersey, we simply overwrite the New York value with New Jersey. In this case,
there's no way to know what the previous value was—this approach can be used for some use cases.
In SCD Type 2, we maintain a complete history of changes. Every time there is a change,
we add a new row with all the details without deleting the previous value. There are multiple
ways to handle this, such as using a flag approach. For instance, if the city was New
York and then changes to New Jersey, we'll add a new row with an "is active" flag to indicate the
current value. If there are further changes, like moving to Miami, we'll add another row,
keeping the history intact. We can also use version numbers or date ranges to track changes.
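The flag approach for SCD Type 2 can be sketched like this; plain Python dictionaries stand in for warehouse rows, and the column names are illustrative.

```python
def scd2_update(history, customer_id, new_city):
    """SCD Type 2 with a flag: close the currently active row, append a new active one."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_active"]:
            row["is_active"] = False  # old value is kept, just marked inactive
    history.append({"customer_id": customer_id, "city": new_city, "is_active": True})

history = [{"customer_id": 7, "city": "New York", "is_active": True}]
scd2_update(history, 7, "New Jersey")
scd2_update(history, 7, "Miami")

active = [r["city"] for r in history if r["is_active"]]
# active -> ['Miami']; the New York and New Jersey rows remain as history
```

In a real warehouse you would usually also carry `start_date`/`end_date` columns alongside the flag, which is the date-range variant mentioned above.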
In SCD Type 3, we maintain partial history. For example, we might store the current and previous
city in separate columns. If the city changes from New York to New Jersey, we keep New York
in the "previous city" column and New Jersey in the "current city" column.
There are also more advanced types like SCD6, which is a combination of SCD1, SCD2, and SCD3,
capturing the current city, previous city, start date, end date, and active flag all together.
These are fundamental concepts, and if you want to do hands-on practice,
you can find tutorials or check my course on Snowflake, where I cover these concepts in
depth with real datasets. Lastly, there's the concept
of data marts. Let me take a sip of water; you can also drink some water.
Okay, so data marts. A data mart is basically a subset of a data warehouse. To understand this, picture a data warehouse with many different tables (diagram shown): a fact table, plus dimension tables such as the product dimension, order dimension, payment dimension, user dimension, and date dimension. These are the different tables available in the data warehouse.
Now, there are many different teams working in an organization: a team that handles shipping, a team that handles refunds, teams for payments, third-party vendors, accounting, IT—these are the different departments, and inside each department there are different teams. None of these teams really needs the entire dataset; every team wants to solve its own business use case. If you work in any large company, you will always see this structure: the company contains departments, and the departments contain teams working on their own problems so they can meet the company's goals. If a team solves its problem, the department's problem is solved, and if all the departments solve their problems, the company moves forward.
Now, in order to solve their own problems, these departments want to build their own reporting systems—analysis, data science, machine learning models—as per their team's or department's requirements. And to do that, they create a subset of the data warehouse as per the requirement.
For example, the shipping department might only need information about users, payments, and products—just three tables. So they create their own table from those three tables, choosing only the columns they need for reporting. If there are 300 columns available across those three tables, they might pick just 100 and build a reporting system for their own department. This is called a data mart—a subset of the data warehouse, built as per one department's requirements. I solve my department's problem, and that helps the company solve its problem. Pretty simple to understand. Now that the data mart concept is clear, let's move forward.
Now, the data lake. This term became popular with the rise of cheap object storage. Before we can store our data in a data warehouse, as we understood, we have to process it through ETL, and only then can we load it. Everything stored inside a data warehouse is stored in a structured format. That means every time you want to store new kinds of data, you have to change the table structure, and that is quite difficult. Think about a table that already has five columns and millions of rows. If tomorrow I decide to add one more column, all the existing values for that column will be null, and I will have to change the structure before I can start adding rows with the new data. So, changing the structure—changing the schema—is quite difficult in a data warehouse, because you have to take a lot of things into consideration.
have to take a lot of things into consideration. Now, data comes—what it says, right? Okay,
you don't worry about the ETL, okay, you don't worry about the ETL, you don't worry about writing
transformations and putting your structured data. What you can do—you can use a data lake,
like S3—you store all of your data into the data lake. Data lake is basically a storage location.
You can use S3 as a data lake storage, okay? It is a centralized repository where you dump all
of your data as it is, right? I will store all of my CSV data, I will store all of my Parquet data,
I will store all of my JSON data as it is onto the different folder structures in my data lake.
Now, as we saw with data marts, different teams from different departments want their own columns and their own reporting. With a data lake, each team queries the data it needs directly from the lake—reading straight from the S3 object storage as per its requirement. This is called schema on read. The concepts are getting quite heavy here, so pause the video and take a break if you need to, then come back when you're ready.
To summarize: a data lake is a centralized repository. You can use S3, Azure Blob Storage, Azure Data Lake, or Google Cloud Storage as a data lake—essentially object storage where you dump all of your data as-is, in raw form. On the other side, there are users or teams who read this data: "I want these columns from this file, those columns from that file." They read the data as per their requirements and build tables on top of it in Athena or any other ad hoc query interface, or they pull the data from the lake themselves and put it into a structured format. So here, we only process the data we need. Instead of processing everything up front in the ETL and data warehouse stage, we process only what we need and then store that data in the data warehouse for querying purposes.
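Schema on read means the structure is applied only at query time. A small sketch with the standard library: an in-memory CSV stands in for a raw object in the lake, and the column names are made up.

```python
import csv
import io

# A raw file dumped into the lake as-is (CSV standing in for an S3 object).
raw = io.StringIO(
    "order_id,city,amount,internal_notes\n"
    "1,New York,120.5,restock soon\n"
    "2,Miami,80.0,\n"
)

# Schema on read: pick only the columns you care about, at query time.
wanted = ["order_id", "amount"]
rows = [{k: r[k] for k in wanted} for r in csv.DictReader(raw)]
# rows -> [{'order_id': '1', 'amount': '120.5'}, {'order_id': '2', 'amount': '80.0'}]
```

Tools like Athena or Spark do exactly this at scale: the file in the lake stays untouched, and each reader imposes its own schema when it queries.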
Now, it's not that the data warehouse is bad because it requires a lot of processing, or that the data lake is simply better. Both systems have their own place in the architecture: data warehouses give you the structure you need for analysis, whereas data lakes give you the ability to access any data, anytime, as per your requirement.
Okay, let's understand the difference between a data lake and a data warehouse, okay? Inside the
data warehouse, data is structured, as you can see over here. Let me just zoom in. Okay, data is structured, okay? The users are business analysts, and it is used for batch
processing for BI reporting and all of the other things. The data is pre-defined, contains smaller
data, and it is usually relational, right—columns and rows. Over here, data is unstructured because
you can store JSON data, you can store Parquet, CSV, whatever you want. Alright, users are usually
data analysts and data scientists because instead of—think about this, right? Data scientists
want to build their own machine learning models. Now, in the data warehouse, alright,
once you have data added, you can only work with the limited data, right, because you defined that
as per the business goals, and changing the structure is quite difficult—you have to do
a lot of changes inside the pipeline also. So, for data scientists and data analysts, a data
lake is a gold mine because it is completely raw data, right, stored as a file storage, stored as
a file inside the object storage as it is. It is up to me which data I want to read, which columns
I want to read, as per my requirement. I can read using Python code, I can write Spark code,
I can build a table on top of it as per my requirement, okay? So these are the users. The use
case is for stream processing, machine learning, real-time data analysis—you can use that. Okay,
the data is raw, data is large, and it is undefined, okay? It is not properly relational,
so it is undefined, okay? This is the difference between a data lake and a data warehouse,
okay? This is what we have understood till now. Now, this is just the fundamental concept. The
actual hands-on part, if you want, I have some projects available freely on the YouTube channel,
okay? I will just comment down—I will give you the link to that, okay? So if you want to do that,
you can do it and understand the data warehouse and also the data lake. I also teach all of
these things hands-on in my courses, so if you are interested, just check the link in the description
about the combo pack, okay? So till now, we have understood a lot of different things, okay? We
started by understanding what data engineering is, where data engineering actually fits into
the entire pipeline, okay? We understood about the different roles such as software engineering,
DBA, DS, ML, and all of the other things. We understood about the important part, which is the
data engineering life cycle, okay? We understood about the ingestion, transformation, serving, how
all of these things happen. The storage part, we understood about why transformation is needed—like
how the transformation actually happens. Data generation, data storage, DBMS systems,
relational databases, data modeling, okay, how data modeling actually happens, NoSQL databases,
SQL versus NoSQL, data storage processing such as OLTP versus OLAP, the difference between row-based
transaction and column-based databases, why OLTP is needed, why OLAP is needed, why transformation
is needed because we go from OLTP to OLAP while doing the transformation, okay? We understood
about ETL processing, understood about the undercurrent such as security, data management,
data ops, architecture, software engineering. We delved deep into the data architecture part,
okay? We understood about operational architecture and technical architecture, about a lot of
things. We understood about the data warehouse, the important part, okay? ETL versus ELT,
understood about dimensional modeling, understood about the snowflake schema and the star schema,
understood about the difference between fact tables and dimension tables, such as how to build the dimension tables—fact tables store transactional values and dimension tables store categorical values—understood about
slowly changing dimensions, why we need them, different types of them, a lot of things. Data
marts—a subset of the data warehouse—why we need data marts. Understood about the data lake and
the difference between a data warehouse and a data lake, okay? Understood a lot of things about data
engineering, actually. I was not even expecting to go this deep before recording this video—I thought I'd just give an overview, but I went into a flow state and started recording and explaining everything because I really love teaching, right? So, we understood a lot of things. If you've reached
this section, do let me know by commenting, because it might be around 2 hours by now,
and if you're still watching, salute! Alright, so do let me know by commenting that you watched
this video till here and you are about to complete the entire thing, okay? And I just want to plug my
courses—if you're interested, right, if you love my teaching and the way I teach, then do check
out my data engineering courses. I create in-depth data engineering courses, okay? It's
not just about the course—it's about giving you the experience, okay? The understanding of proper
technology, how this works in the real world, right? It's not just about learning technologies;
it's about understanding where it is used, how to use it, following best practices—all of these
things I teach in my courses, so do check them out. You'll find the link in the description.
You'll also find the latest coupon code available with a discount, so go at least check that out.
And yeah, let's continue with our video. Okay, now we understood the fundamentals
and we also looked at this big data landscape. Let me just zoom in, right? Can you see the
tools' names? Can you see the different things available, right? These are the data warehouses,
okay? As you can see, Snowflake, AWS Redshift might be here, Microsoft, Firebolt, Oracle—there are some new companies here. This section is for data lakes. As you can see,
there might be S3, Databricks is used, Cloudera has their own stuff going on—these are storage
systems provided by the different NoSQL databases, like MongoDB. There might be Cassandra somewhere,
Couchbase, and all of the other things. Real-time databases, graph databases—you see,
I was telling you about this, right? For every single use case, like for visualization,
BI platforms, data science notebooks, MLOps, product analytics—all of
these different things, right? For every single technology, for everything that we want to do, there is a different toolset available—every single thing that we understood while talking about the architecture part. We understood that every single step needs a set of tools, and we have thousands of tools to pick from, okay?
Now, we will understand these individual tools, what they do, why they exist, right? What are the
use cases for them, which tools are the most demanded and used by the industry,
okay? So that we will understand, and how to work with them. Let's go one by one.
Now, let's talk about the cloud platforms, right? We understood about the cloud platform.
Cloud platforms are basically giant computers built in some data center owned by a company. It can be Amazon, okay, this is Amazon, this is Google, okay, and this is Microsoft.
Now, again, these are the three top cloud providers available in the market. There
are plenty of cloud providers—you have Cloudera, you have IBM Cloud, you have Oracle Cloud. Every
different cloud provider has its own features, but these are the three top cloud providers available.
What is cloud computing? It is basically these companies giving you the computer resources
and different services so that you can use them for your work. Before this cloud, what we used
to do—we used to build our own servers, okay? Own servers, that means you get your RAM, you
get your hard disk, okay? You get the processing power, processor, okay? You get the GPU if needed,
you get all of the wires, you get the ACs to cool down the servers, you get the networking adapters,
you get all of these different things, switches, you get the routers—every single thing you get,
you build it on your own. Okay, now you can do this—a lot of people still do it because they want
to save on cloud costs, but this also comes with a trade-off because you have to maintain them,
okay? You have to maintain this. What if the power goes down, right? What if my hard disk
fails and I lose all of my data? You also have to think about replication, you also have to think
about scalability, okay? How do I scale this entire thing? Because let's say, right now,
I'm just working with millions of data and the users are small. Tomorrow, my business grows,
so I will have to buy new hardware, okay, and upgrade my system. What if my hard disk fails?
What if my RAM fails, okay? What if the hardware fails? What if an earthquake comes and I lose all
of my data center resources? Anything can happen, right? You don't have control over nature. So,
this is the reason people usually go with the cloud providers because I don't want to set up
all of these things by myself if I can directly pay to the cloud providers, okay? And these
cloud providers usually charge pay-per-use, okay? Pay-per-use means that you only pay for what you use. That's pretty awesome, right? I will only pay for whatever resources I consume. So, if I use a simple virtual machine, which is like an online computer, and I run it for two hours for some workload, I am only going to pay for those two hours, okay? In an on-premise data center, I have to keep the machines running 24 hours a day, because that is how the entire server is set up—my website is hosted on it, there are other functions running, databases, and everything, so it has to stay on. But if I just want to run some workload quickly for two hours on the cloud, I can rent that,
and I can also pay for that use case. Cloud has multiple services available
for different use cases, okay? These different services are divided into
three different parts. We have PaaS, okay, we have SaaS, okay, and we have IaaS, okay?
This is Platform as a Service, this is Software as a Service, and this is Infrastructure as
a Service. What do these three things mean? Platform as a Service means they give you the
direct platform, so you don't have to worry about setting up different things. So, for example, on AWS, we have a service called AWS Lambda, okay? You can call it Platform as
a Service because they directly give you one kind of platform where you can just
focus on writing your code—they will take care of all of the infrastructure side,
such as running the server, all of the backend things, they will take care of the maintenance
and everything. You just focus on writing your code. This is called Platform as a Service.
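To show how little you actually write when the platform handles everything else, here is a minimal Lambda handler sketch. The event shape follows the documented S3 "object created" notification format, but the bucket and key values in the sample event are made up for illustration:

```python
# A minimal AWS Lambda handler sketch (Platform as a Service: you write only
# this function; AWS provisions and runs the servers for you).

def lambda_handler(event, context):
    """Invoked by AWS with an event, e.g. whenever a file lands in S3."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Here you would start your processing (e.g., kick off a Glue/EMR job).
        results.append(f"s3://{bucket}/{key}")
    return {"processed": results}

# Local smoke test with a fake S3 event (no AWS account needed):
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-zone"},
                "object": {"key": "orders/2024/01.json"}}}
    ]
}
print(lambda_handler(fake_event, context=None))
```

Everything outside this function—servers, scaling, retries, patching—is the "platform" part of Platform as a Service.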
Second is Software as a Service. You can think about Software as a Service as the Google Suite, alright? You have Google Sheets, Google Docs, Google Slides—the entire Google Suite. You can think about that as Software as a Service
because they are directly giving you access to the software as a service for your work,
so you can use that and grow your business. Then we have Infrastructure as a Service, okay?
That basically means cloud providers will give you the infrastructure. So, an example of this is an EC2 machine—this is basically a virtual machine online. There's also a service called EMR—Elastic MapReduce—to run your Spark jobs. These are the different pieces of infrastructure that they
give you so that you can run your workloads, okay? This is how cloud platforms are divided into three
services—they give you these services that you can use to grow your business, right? Now, these
services have different names as per the cloud providers, right? If I go to AWS, right, on AWS,
we have these many services and many more. These are just a few services, right? Don't worry about the names if you are seeing them for the first time. If you already know them, that's good, but if you are seeing these services or these logos for the first time, don't get overwhelmed, right? There's something called EC2,
which is like the virtual machine, we have Lambda, where you can just write and run your code on the
serverless machine, okay? Elastic Container Service, if you want to run a Docker image,
okay? There is Simple Email Service—it is used for email and notification purposes. Aurora is a relational database created by AWS, so if you want to store your relational data, it is available as a service. It's like AWS is giving you the service, so you only pay for the number of hours you use or the amount
of resources that you consume, so that you don't have to build all of these things by yourself.
Everything is built for you, pay for it, and grow your business. Elasticache, DynamoDB, right? EMR,
VPC, CloudFront, Elastic Load Balancing, Kinesis for real-time data, RDS for relational databases,
Redshift for data warehousing, right? Elasticsearch for search and log analytics, Simple Storage Service—object storage to build a data lake, right? Elastic Block Storage for block-level storage, Cognito, API Gateway, queue systems—everything you need to build your entire technical architecture,
right? We understood—we have the business goals, but once you define the business goals,
you think about, right now, how to build my technical architecture. So, you start thinking,
okay, which cloud computing platform should I go with? Now, most of the time, you might
have the answer—let's say you are a student right now, okay? You might have a question:
which cloud computing is the best and will give me a job? The answer is, pick any one of the three,
and there are high chances that you will get a job because most companies only work
with these cloud providers. If I were to rank them, okay, this is just my personal opinion,
it can be wrong. This was my opinion until about a year back.
I used to rank AWS as one, okay? Azure as two, and Google Cloud as three. Okay, now it is changing,
and I'm seeing the trend that Azure can be one because a lot of companies are using Azure due
to their new functionality and good services. The services that they provide are specific to
the enterprise level, so Azure is good if you want to target enterprise-level companies. They always
go with Azure, especially in India, because a lot of companies directly use the Microsoft app suite,
like Microsoft 365 at the enterprise level—because Microsoft Word, PowerPoint, and all of the other
things. So, they are likely to go with Azure because the integration is quite simple,
right? A lot of startups usually go with AWS because AWS gives you good credits, you can
easily start, and a lot of people know AWS, like the industry. If you want to find resources or
employees with AWS skills, it is quite easy to find, so a lot of startups pick up AWS. Like,
I'm building my data engineering startup, okay? I'm also using AWS for my infrastructure.
The third one, I still say, is Google Cloud. Again, there are some services Google provides
that are really good, but these are my takes, right? This is my personal take, it can be wrong,
but this is what I see in the industry. If you want to target top companies—and by top companies I mean the enterprise level, like banks; service-based companies such as Infosys and TCS can also be taken into the picture, as per your requirement—companies that have already gone public, you can just research their architecture, and you will find that a lot of them use Azure if they are enterprise level. A lot of startups, like Indian startups—if you see Zepto, if you see CRED, okay—all of these guys are on AWS because it's good for startups and gives them a good ecosystem. So, I say if you want to target startups, learn AWS and GCP. I always
suggest either learning Azure or AWS unless you want to target a specific company and they tell
you that they require skills in GCP, then go with GCP. Okay, I just answered your question.
If you are a student, then you can go with this. If you are someone who is looking to build the
architecture, again, the situation is the same: think about the services that solve your problem.
Okay, we will talk about the different services, but the idea is to think about what services these
cloud providers give us that can help us solve our business goals. We understood about operational
and technical architecture—now you start thinking from this point of view: if I were to choose AWS,
GCP, and Azure, and if I say, okay, Azure gives me these services, AWS gives me these services,
and as per my requirement, I can easily solve all of my business problems using Azure because
they have a good service pack together, so I'll go with Azure. Like, I can do a simple small
project on Azure and see if that works—if it works, I can move my entire production
workloads onto Azure. Okay, if that doesn't work, there is also the concept of hybrid cloud,
so you use some services from Azure, you use the best services from AWS, you use the best
services from Google Cloud, okay, and build your system. For example, in my personal opinion,
right? I really love Google BigQuery—this is a data warehouse provided by Google, okay?
And on Azure, I really love the Databricks integration, okay? On AWS, I really love the Glue service, which runs serverless Spark workloads, and I also love S3 as an object storage, okay? So, if I want, I can use S3 as my object storage,
I can use Databricks as my Spark workload, and I can use BigQuery as my data warehouse. So, you can
also do cross-cloud integration, but maintaining all of these things is quite difficult. Again,
there are some tools that can help you with that, but these are the different concepts
that you can explore. I just want to throw them at you right now so that you can keep that in mind,
okay? Let's move forward—let's talk about the services that we understood, okay?
Now, we understood, right, these are the services—so let's say if I go with AWS,
and if I build my entire architecture, if I want to build my ETL pipeline, okay,
how will I go with that? Let's say this is how it will happen, okay? Let me just remove this,
okay. Collect, process, store, and analyze, right? Data engineering lifecycle—the simple
architecture that we've been understanding. I can collect data from S3, Kinesis, DynamoDB, RDS, MSK,
whatever, right? This is object storage, this is the real-time data streaming platform, this is the
NoSQL database, this is the relational database. We understood data is coming from multiple places,
okay, where we can collect our data and easily ingest it. Then we can do the event processing,
okay? Let's say if you want to do something, let's say every time data gets uploaded onto Amazon S3,
I want to run the Lambda function. Okay, Lambda function is basically the compute service, so if
you want to run small code, you can do that—I can do this, and then I can do the actual data
processing using EMR, which is a Spark workload. I can run the machine learning, I can run AWS Glue,
again the Spark workload, and then I can use these services for analysis. So on AWS itself,
I can build my entire data system, right? Instead of going out and picking random tools, AWS gives
you a wide range of services that you can pick from that pool and build your entire data system,
okay? This is just an example, okay? Just to help you understand from this entire service tool pack,
right, that AWS gives you—we understood about services. Services can be platforms—they might
give you the platform, they might give you the software as a service,
they might give you the infrastructure as a service, right? These are the different
services that they provide, and using these services, I can build my entire platform,
okay? And it might look something like this, okay? Now just pay attention, okay? Don't get confused,
don't get scared about all of these things—now we're just trying to go a little bit advanced,
okay? And this is the architecture of one of the top startups in India, called Dream11, okay? Dream11 is a fantasy sports app. This is the architecture of Dream11 that
they have used to build on AWS. Now, if you see this architecture, you will understand it is not
completely AWS, okay? There are some things that they use from AWS, as you can see over here, okay,
and there are some things that they use that are open source, and this is how technical systems are
built. This is the final version of Dream11—they went through three different phases to build this
particular architecture. I have posted about it on LinkedIn—I will put the link in the description. If I forget, just remind me in the comments and I will add it. Okay, now let's try to understand and also let's
try to remember our data engineering lifecycle, okay? Even though this architecture looks quite
complicated, the fundamental concept, okay, the data engineering lifecycle is quite the same,
okay? First of all, what do we have in the data engineering lifecycle? First,
we have the generation source. Now here, as we understood, our data is coming from
multiple places, so we have third-party vendors, okay? As you can see—let me just zoom in. Okay,
our data is coming from third-party vendors, there is some RDBMS, like MySQL, and there is some NoSQL, like the Cassandra database, okay? And then there's the application—there are iOS and Android applications, and there's the desktop site, Dream11.com, as you can see over here. So, we understood, right, data comes from
multiple places. In this case, data is coming from third-party vendors, from the databases, and from the applications. I kept telling you, right, that data comes from multiple places—this is what it means. Now I want to ingest this data into my system,
and most of the time, for ingestion, for real-time streaming ingestion, or just ingestion,
people use Apache Kafka, okay? Apache Kafka is a real-time data streaming platform, a distributed
real-time data streaming platform, so you can work on large-scale data, okay? And you can easily put
Kafka in between to consume all of the data, okay? In Kafka, these are all of the producers,
okay? Let me just write it over here. All of these people, okay, are producers who are producing all
of this data, okay? Once the data gets into Kafka over here, okay, everything else that happens is
consumers—consumers who are consuming all of this data, okay? Simple to understand. Again, we are
not deep-diving into Kafka—I will be launching a course on Kafka, so you can keep an eye on that,
okay, in the future. But data is getting produced and data is getting consumed—here, consumption is
basically what I want to do with this data, okay? So, there is a batch pipeline going on over here,
as you can see, okay? This is a batch pipeline. First of all, we understood, right, once the data
is ingested, we need to store our data somewhere, right? There was a storage layer below. So, the
data gets stored inside Amazon S3 as a data lake. Now, the concept of the data lake is coming. Now,
the concept of Amazon S3, which is a service on AWS S3, is coming, right? I kept telling you—you
can use S3 as a data lake. I store my data onto the data lake, okay? What happens here after this?
This data goes through the ETL, okay? As you can see over here, this data is going through the ETL,
okay, and the ETL is happening using Apache Spark, okay? There is some Apache Spark workload
available, and then it stores our data onto Amazon Redshift. This is what I kept telling you—this
is a data warehouse, okay? This is my data warehouse service available on AWS, right? So this is my ingestion, this is my data lake—my storage—and this is my data warehouse. There's
one more thing that I told you, right? In a data warehouse, we put our data by transforming and
making it into the structured format. Now, there is one more pipeline that goes—it is called ad hoc
analysis, okay? And as you can see, it is using Amazon Athena, which is a query engine for ad hoc
analysis, and I told you, right? The Looker, okay, the reporting system or the data science people,
can use the raw data that is coming. I can use this raw data as it is, okay, from the system as
per my requirement, or I can also use structured data as per my requirement. So, I get access
to both of these things—I can get the proper structured data also, and I also get the raw
data as per my requirement, okay? This is there, again. Understand the data engineering lifecycle
that we understood—understand, now try to connect every single thing that we have done, right? We
understood data warehouses, we understood data lakes, we understood ETL, we understood ingestion,
storage—every single thing is put together into the real-time system of Dream11 case study, right?
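The ingestion part of that pipeline—applications producing events into Kafka, downstream jobs consuming them—can be sketched as a serialize/deserialize round trip. The topic name, event fields, and broker address below are made up for illustration; the actual network send (commented out) would use a Kafka client library such as kafka-python against a running broker.

```python
import json

def build_event(user_id, action):
    """Producer side: serialize one application event into the bytes
    that would be sent onto a Kafka topic."""
    return json.dumps({"user_id": user_id, "action": action}).encode("utf-8")

def consume_event(raw_bytes):
    """Consumer side: what a downstream batch or streaming job does—
    deserialize the message and decide what to do with it."""
    return json.loads(raw_bytes.decode("utf-8"))

# The round trip works without any broker:
msg = build_event(42, "match_joined")
print(consume_event(msg))  # {'user_id': 42, 'action': 'match_joined'}

# With a real broker, the producer side would roughly look like
# (requires the kafka-python package and a broker at an assumed address):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("user-events", value=build_event(42, "match_joined"))
#   producer.flush()
```

The key point of the pattern: producers and consumers never talk to each other directly—they only agree on the topic and the message format.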
We're just trying to understand the real-world architecture right now, and how they use the
fundamental concept in the real world, okay? Every single thing that we talked about, okay, it makes
sense here, right? We have the ETL system for ad hoc analysis, we have the structured data, we have
the ingestion system going on, okay? This is just a batch pipeline, okay? This is there. There's one
more thing—we have the real-time pipeline going on over here, okay? For the real-time pipeline,
what they are using is Apache Flink, okay? Apache Flink is used for the streaming engine,
so if they want to understand data on a real-time basis, they can use this and analyze it. So,
from the streaming engine, we go to Elasticsearch, there might be some notification service,
there might be some visualization available over here—not sure about that, but this is
the entire pipeline. And the fundamental concept that we use is the data engineering lifecycle,
okay? And all of the concepts that we use. So, every time I store my data onto Redshift,
this is the data warehouse. I might use dimensional modeling, okay? After the ETL, I use
Apache Spark to transform my data. I use S3 as my data lake storage, and I use Amazon Athena for the
ad hoc query. I use Looker for my visualization, I use Jupyter Notebook for my data science workload,
I use Kafka for my ingestion—these are all of my sources. I use Apache Flink for handling real-time
data streaming—this is the real architecture. This is everything we did in the last two hours
just to understand this particular thing. Once you have understood this, you get a good gist of data
engineering. Now you know, like, yeah, I am a data engineer because I understand this architecture
and what is going on. This is the fundamental part, right? Once you understand the fundamentals,
you can understand any architecture. Now, once you complete this entire video, you can understand any
architecture in the world because you will know, okay, there is some ingestion going on, there
is some transformation happening, there is some loading happening, there is some ETL happening.
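That tool-agnostic ingestion–transformation–loading shape can be sketched in a few lines. The sample records and the in-memory dict standing in for the warehouse are invented; in practice extract would read from Kafka or S3 and load would write to Redshift, BigQuery, or Snowflake, but the structure stays the same.

```python
# A tool-agnostic ETL sketch. Swap the extract source for Kafka/S3 and the
# load target for Redshift/BigQuery/Snowflake—the shape does not change.

def extract():
    """Ingestion: pull raw records from a source system (stand-in data)."""
    return [
        {"user": "a", "amount": "120.50", "country": "in"},
        {"user": "b", "amount": "80.00", "country": "us"},
        {"user": "a", "amount": "19.50", "country": "in"},
    ]

def transform(rows):
    """Transformation: fix types, standardize values, aggregate per user."""
    totals = {}
    for row in rows:
        key = (row["user"], row["country"].upper())
        totals[key] = totals.get(key, 0.0) + float(row["amount"])
    return totals

def load(totals, warehouse):
    """Loading: write the structured result into the 'warehouse' (a dict here)."""
    for (user, country), amount in totals.items():
        warehouse[f"{user}-{country}"] = round(amount, 2)
    return warehouse

warehouse = load(transform(extract()), {})
print(warehouse)  # {'a-IN': 140.0, 'b-US': 80.0}
```

Every architecture in this video—Dream11's included—is some elaboration of these three functions.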
I understand this—the tool is different, right? I can replace this tool with Snowflake, I can
replace this tool with Databricks, right? I can use something else over here—it doesn't matter,
okay? It will work the same. The features might be different, the performance might be different,
but fundamentally, it will give the same output. But for their use cases, for Dream11,
they might have tried multiple things and then finally came up with the final architecture that
is currently working for their system, right? Everything that you see on their application,
everything you see on your app as Dream11, there is this kind of system behind that,
making it possible, okay? It's not some magic. Alright, we understood AWS—now let's understand
GCP. Just like AWS, right, we understood that we have different services available on GCP also:
ingestion, okay? For ingestion, we have App Engine, Cloud Pub/Sub, Cloud Transfer Service,
BigQuery, Cloud Function. Now, I want to tell you this: most of the cloud providers have similar
services available. For example, this is a data warehouse available on GCP called BigQuery, which
is the same as Redshift on AWS. Okay, there’s Cloud Function available, which is the same as
AWS Lambda—fundamentally, they give you a similar platform to perform your workload. The name is
different, the feature is different, the cost is different, but fundamentally, it is the same,
right? Just like we have Cloud Storage—this is basically GCP's version of AWS S3, object storage. We have Cloud SQL—this is like RDS, the Relational Database Service that we talked about, right? BigQuery is the data warehouse that we understood. There's Dataprep, and Dataproc is the same as EMR, okay?—Elastic MapReduce on AWS. So, if you understand,
okay, services are the same—like most of the cloud providers have similar overlapping services. It's
always about choosing the best service for your use case. So, for ingestion, they have this many,
for storage, they have this many, for processing, they have this many, and for exploration. Again,
the concept of the data engineering lifecycle: I have to ingest something, I have to store
something, I have to process something, I have to serve something. Okay, this is the simple
architecture on the GCP. Same fundamental concept applies—I have data coming from multiple places,
I ingest this data, I store this data, there is a pipeline that is running right now. Again,
I store some data onto BigQuery, okay? And then there are some privacy and identity pieces running—this is like the end-user part, right? Customer platform—there
is customer data, and data destinations such as web apps, customer service, marketing messaging.
Same fundamental concept: data source, collect, process, store, and give it to something. This is
where the entire data engineering is happening, right? I get the data, I ingest it properly,
I store it, I process it, I give it back. Okay, now this is done on GCP. Again,
let's look at the Azure level also, okay? These are the developer services. We have compute,
okay? For the compute, we have virtual machine, cloud machine, batch storage, again, the same.
We have the web and mobile app. For data, we have the SQL database, Redis Cache, we have SQL Data
Warehouses—that is also available. For analytics, we have Data Lake Analytics, Data Lake Store,
Stream Analytics, Machine Learning, Data Factory, okay? IoT, we have media, we have identity access.
In my opinion, as a data engineer, Azure has very good services for data engineering workloads, okay? There are three services that I really like on Azure. One is Databricks,
okay? I really like Databricks because it is properly integrated with Azure, and Databricks
is basically the environment to run Apache Spark workloads, okay? Second is Data Factory,
okay? Data Factory, and third, I like Synapse Analytics, okay? Most of the
services—and there’s a new service available I haven't explored called Fabric. Microsoft Fabric
is basically the combination of these multiple services where you can do everything in one place,
okay? It is especially designed for data engineering workloads,
making your life much easier. I have a project on this on my YouTube channel available for free,
okay? I'll put the link in the description—if I forget, do let me know by commenting, I will put
that. If you want, you can explore that. I also teach about all of these things in my courses,
so we do have projects available on that—you can explore that by going to the website.
Okay, now, this is the architecture side of the same thing, just like AWS,
okay? I can replicate this entire architecture on GCP also and also on Azure. What I have to do is
basically just replace—let's say if I'm replacing this entire thing onto the GCP, what I will do,
instead of S3, I will use Cloud Storage, okay? Instead of Redshift, I will use BigQuery,
okay? I can put Dataproc here, okay? I can also put BigQuery here if I want. For the streaming engine,
I can put Pub/Sub and Dataflow, okay? Um, what else? For Kafka, I can put Pub/Sub,
but I'll say I'll go with Kafka—Kafka is best. Okay, Looker is good, this is good, everything
else seems fine. So, I can convert my AWS architecture to GCP. Performance might
be different, the costing might be different, the UI might be different, the integration might be
different, but I can do that. Okay, I can also do the same for Azure as well, simple. Okay, and
this is what the Azure architecture says, right? What do we have? We have the customer stream data,
we have the customer batch files, okay? Uh, we are ingesting this particular thing, and we are
just adding this onto ADLS, which is Azure Data Lake Storage Gen2. The data arrives there from
the external sources. Okay, now there's a Data Factory running, okay? Data is coming
from on-premise sources, and some stream data goes to the Data Factory. It gets entered into the raw
zone, okay? The raw data is getting entered. We use Databricks over here to process this
raw data and store it in the processed folder. After that, it goes to the analytical zone,
and it goes to the SQL pool, which is a SQL data warehouse. Now from here, customers can
use this to build the Power BI dashboard and get insights, okay? It can also be integrated
with the desktop application if needed. Same concept: collect, ingest, store, transform,
serve, and use it. Same thing is happening, and there are some supporting services underneath. As you can see,
we are using Azure Key Vault to securely store our keys, Log Analytics, Azure Purview,
and Azure DevOps to properly operationalize the entire integration and the scripts.
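The flow just described, raw zone to processed to serving layer, can be sketched in plain Python. This is a toy stand-in for the Azure pipeline above, not the real thing; the record shapes and cleaning rules are my own illustrative assumptions:

```python
def ingest_to_raw(records):
    """Land incoming records untouched, exactly as they arrive (the raw zone)."""
    return list(records)

def process(raw_records):
    """Clean the raw data (the Databricks step at scale): drop bad rows, normalise values."""
    return [
        {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
        for r in raw_records
        if r.get("customer", "").strip() and r.get("amount")
    ]

def to_analytical(processed):
    """Aggregate for the serving layer (the SQL pool / warehouse side)."""
    totals = {}
    for r in processed:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

stream = [
    {"customer": " alice ", "amount": "10.5"},
    {"customer": "BOB", "amount": "4.5"},
    {"customer": "", "amount": "99"},      # bad record, filtered out during processing
    {"customer": "alice", "amount": "2.0"},
]
raw = ingest_to_raw(stream)
analytical = to_analytical(process(raw))
print(analytical)  # {'Alice': 12.5, 'Bob': 4.5}
```

The point is the shape of the pipeline, not the code: each zone only ever reads from the zone before it, which is exactly what the ADLS raw/processed/analytical folders enforce at scale.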
These are the different services that can be used together. So, these are the three different
pipelines that we have used till now. Okay, I just showed you using AWS, I showed you using
GCP, and I showed you using Azure. Now, let's look at the modern data architecture, right? This is
also modern, just especially built on the cloud. This is the modern data architecture, right?
Modern data architecture is basically where new companies are coming into the market and saying,
"Okay, the tools that you guys are using are old now. They don't work with the new data workloads,
the new volume, and the approach is very old, okay? And I, as a new startup, I am a modern
data company. I will make your life easier." So instead of you doing the ETL, remove the ETL,
okay? I will say directly load the data into my product, okay? And I will directly transform it
for you as per your requirement, so that you can directly save time on the ETL and start querying
the data. This is what the modern company says. They all have different requirements, so they
directly give you the integration between your different sources, as you can see here, right? Uh,
I have different data coming from sources like Stripe, Google, PostgreSQL, Google Play. What
they say is that they have the integration with all of these different sources. These are the
applications, right? Fivetran, Airbyte, Stitch, okay? These are usually used for ingestion. You
can also use Python and SQL, which is also the modern way. Before this, we had Hadoop and all of
the other workloads. This company comes and says, "Okay, use our system because we have made all of
these things easier for you. Directly connect with these multiple sources, we'll pull the
data for you, and we'll directly load it onto the data warehouse so that you can do everything as
per your requirement." There is a popular tool called dbt (Data Build Tool) that is used for transforming data inside the warehouse.
People say that it is going to replace SQL. Not going to happen. Most of the time, a lot
of companies come and want to replace SQL, but still, SQL is the king of data, right? You should
always learn SQL. dbt is also gaining a lot of popularity. With dbt, you can divide your data into
multiple stages. This is the ELT idea that I told you about, right? We were doing ETL till now:
Extract, Transform, Load. Now we do EL first: we extract our data and directly load it into the
data warehouse, okay? It can be Snowflake, BigQuery, Redshift, doesn't matter. And we divide our
data into different zones, okay? This is modern data architecture, right? We create the landing
area, we create the staging area, we create the warehouse layer, and we create the mart layer. Same
fundamental concept. If you see, there is a data mart, there's a data warehouse, there's object
storage, and there's a landing area to store the raw data, right? I store my raw data, I store the
staging data after some transformation, I store my data in the warehouse, and this is my mart layer. All
of these things you can create inside the DBT, and directly you can store your data onto Snowflake,
Redshift, and all of the other things. Same thing. Then it can be consumed by the BI people,
machine learning people, they can build dashboards on different tools, uh, you can, uh,
do the analysis on different tools. There's also the concept of reverse ETL. Companies are using
that. Basically, that means I have transformed my data, I can put back this data onto the
source system and get more insights from the transformed data by ingesting that data back to
the system again. Uh, this is a totally different concept. Um, I'll cover it in some other videos,
but there's also a concept of reverse ETL that we also saw in the data engineering life cycle, okay?
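The ELT layering described above (land the raw data, stage it, then build marts inside the warehouse) can be sketched with Python's built-in sqlite3 standing in for Snowflake or BigQuery. The table names, layer names, and cleaning rules here are illustrative, not from any specific tool:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# EL: extract from the source and load it as-is into a landing table, no transform yet.
cur.execute("CREATE TABLE landing_orders (id INTEGER, amount TEXT, status TEXT)")
cur.executemany("INSERT INTO landing_orders VALUES (?, ?, ?)",
                [(1, "10.0", "paid"), (2, "5.5", "PAID"), (3, "bad", "refunded")])

# T inside the warehouse (the kind of thing a dbt model expresses):
# the staging layer cleans, casts, and filters...
cur.execute("""
    CREATE TABLE staging_orders AS
    SELECT id, CAST(amount AS REAL) AS amount, LOWER(status) AS status
    FROM landing_orders
    WHERE amount GLOB '[0-9]*'
""")
# ...and the mart layer aggregates for consumers such as BI and ML.
cur.execute("""
    CREATE TABLE mart_revenue AS
    SELECT status, SUM(amount) AS revenue
    FROM staging_orders
    GROUP BY status
""")
rows = cur.execute("SELECT status, revenue FROM mart_revenue ORDER BY status").fetchall()
print(rows)  # [('paid', 15.5)]
```

Notice that every transformation is just SQL run inside the "warehouse", which is exactly why dbt complements rather than replaces SQL.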
Modern data architecture—we understood about GCP, we understood about Azure, AWS, and the modern
data tools. A combination of these different tools and AWS and Azure can build the modern
data platform. Now, again, we talked about this. Here we have like thousands of tools available,
right? How do I decide which tool is best for me? First, I look at the business requirement. Does
this solve my business problem? If it does, then I should use that tool. And any choice should be
reversible, so, okay, I can easily remove it. Let's say this tool is costing me too much, okay? And
it's not really even solving my problem. I can remove it and go with another tool. If that tool is
also not working, I can go with Spark, because it is open source and it is going to work, right?
The managed tools are going to cost you; Spark is going to cost you only for the servers. So you
have to choose: this one is easy, this one might be
quite difficult to set up. So, as a company, if you are a startup, people usually go with using
these things because it saves time, okay? You have the money, but you want to save time, so you
can go with this. This will solve your problem, this will also solve your problem, okay? Whatever
solves your problem, whatever helps you reach your business goals, you can go with that, okay? Now,
uh, I just want to take a break, so I'll have some water, and I'll come back in 1 minute. Alright,
till now we have understood a lot of things. Now, this is kind of like the end of the video,
and again, I can't cover every single thing, but I want to leave you with some of the important
tools that you can learn about data engineering and some of the concepts, uh, at the end, okay?
So that is important for you in your career. So let's start with that, okay? Important tools for
data engineering. Now, first of all, if you want to become a data engineer, you have to learn a few
things. First of all is the programming language. You have three choices: Python, Scala, and
Java. Now, if you want to learn any programming language, I always suggest starting with Python,
okay? It's the easiest to learn, mostly used by industry because if you want to write the,
uh, ETL scripts, if you want to write the Kafka ingestion engine, and all of the other things,
Python has a lot of packages that make your life much easier, and even industries, uh, use Python
for all of these workloads, so you should always go with Python. If not Python, you can also go
with Java. Java also has good support because most of the open-source frameworks, like Apache
Spark, run on the JVM (Spark itself is written in Scala), okay? So you can go with Java also, but my suggestion
is to go with Python, okay? Now, what is important for you to learn in Python? You can learn about the basics of
Python, such as variables, operators, basic data structures like dictionaries, lists, all of the
other things. Important things to learn include how to work with date and time formats. There's
also a package in Python called Pandas, so you should learn how to do basic data transformation
with it, and how to work with different file formats, like CSV, JSON, and Avro, okay?
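As a small taste of the pandas and file-format work mentioned above, here is a sketch that reads CSV data, applies a basic transformation, and emits JSON. The column names and values are made up for illustration:

```python
import io
import json
import pandas as pd

# An in-memory buffer stands in for a CSV file on disk or in object storage.
csv_data = io.StringIO("name,signup_date,amount\nalice,2024-01-05,10\nbob,2024-02-10,20\n")
df = pd.read_csv(csv_data)

# Typical cleanup: parse dates and derive a new column.
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["amount_doubled"] = df["amount"] * 2

# Write the transformed data back out in a different format (JSON here).
records = json.loads(df.to_json(orient="records", date_format="iso"))
print(records[0]["name"])  # alice
```

The same read, transform, write pattern carries over whether the target format is JSON, Parquet, or Avro; only the reader and writer calls change.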
This is what you can learn in Python. Uh, I have already created a detailed roadmap for this,
so I'll also put the link to that particular video in the description. If I forget, do let
me know, and I will add it, okay? SQL—again, SQL is the backbone of your data career. You
cannot skip this. This is how you communicate with the databases. We understood everything,
so you have to learn SQL. This is non-negotiable, okay? You cannot skip SQL. You cannot skip Python.
This is the foundation, so you have to, have to learn this. After this, you can understand Linux
commands because you will be working with some of the, uh, cloud providers or Linux machines.
Something like 80 to 90% of the servers online run on Linux, so you should learn to
interact with them, because a server doesn't have a GUI, right? There's no graphical user
interface; you'll be accessing it using the terminal. You can learn commands like cd, clear,
cp, exit, find, and cat for viewing files, okay? These are the different
commands that you can learn. You can just search on YouTube, Basic Linux Commands, and you will
get a good tutorial, okay? Now, we have data warehouses. Again, you don't have to learn all of
the data warehouses, okay? You can learn—you have the AWS Redshift available, you have BigQuery, we
have Hive available, SAP Analytics, and Snowflake. My suggestion is to either learn Snowflake because
this is not dependent on the cloud platform, okay? This is cloud-independent, so you can learn this.
Also, this is highly demanded in the market, so you can easily learn this and add it to your
skill set, very highly in demand. There's one more that I love personally, which is BigQuery,
okay? Because I've worked with BigQuery for the last three to four years, and I've really enjoyed
this service, so this is one of my favorites because this is one of my favorite services on
GCP. So, my suggestion is to go with Snowflake because this is cloud-independent, okay? If you
are working with a specific cloud, you are anyway going to learn its warehouse: if you're learning
AWS, you'll learn about Redshift; if you're learning GCP, you'll learn about BigQuery, right? So, my
suggestion is to just learn Snowflake, because you will pick up the others by learning the
cloud. Hive is an open-source, uh, tool that not many people use. It is just used for the metastore
for Apache Spark or Apache Hadoop workloads, okay? As a metastore to store some of the information,
but not really recommended to learn it separately. You can just learn the basics, and in case you
have a requirement, then you can learn it on the go, right? It will take you like one
to two days if you have the basics clear, okay? Data processing. This is interesting, okay? For
different workloads, you can use Apache Spark for batch and streaming. You have to learn Spark. This
is very, very important, okay? You cannot skip Spark also because this is used by top companies
to process big data. You also have to learn Kafka because this is very important to process
real-time data, okay? There is also Apache Flink for real-time analytics: you can use Kafka for
streaming the data in and Flink for analytics on top of it. There's NiFi and Apache Beam also;
if you learn GCP, you will automatically learn Apache Beam. But my suggestion for now is to learn
Apache Spark and Apache Kafka only, not Flink right now. If you ever have to use Flink somewhere,
you will learn it on the go, okay? Just add Kafka and Spark to your skill set. Data orchestration.
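Before moving on to the orchestration tools, one note on the processing side above: Spark-style batch and Kafka/Flink-style streaming differ mainly in when data is processed. A toy pure-Python illustration of just that idea (no real Spark or Kafka involved):

```python
def batch_total(events):
    """Batch style: the whole dataset is already available, so process it in one job."""
    return sum(events)

def streaming_totals(events):
    """Streaming style: update a running result as each event arrives."""
    total = 0
    running = []
    for e in events:
        total += e
        running.append(total)  # a result is available after every single event
    return running

events = [3, 1, 4, 1, 5]
print(batch_total(events))       # 14
print(streaming_totals(events))  # [3, 4, 8, 9, 14]
```

Real engines add distribution, fault tolerance, and windowing on top; the processing model is the part worth internalising.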
Okay, we have many tools available. Out of these, you should use Apache Airflow,
one of the highly used tools in the market, okay? We have these modern data tools, okay? Uh,
these tools take roughly 30 minutes to 1 hour to learn, okay? If you have your fundamentals clear,
right? You can just watch one video and understand more like 80 to 90% of the tools, right? It is
very simple. I learned about Mage in just one hour, okay? It didn't take me more than that. So,
there's one project available on our channel also, so if you want, you can learn that. These modern
tools are created to make your life easier, okay? Apache Airflow, by contrast, is quite complicated
to learn, right? It will take some time to understand the gist of it; we have a course on that, and
I'll tell you about it shortly, but you can learn about Dagster, Mage, and Prefect within one hour.
I don't think that will take so much time, okay? And these are the modern data tools available,
all part of the modern data stack, okay? As you can see, for ingestion we have Airbyte and Fivetran
for ingesting data. For data storage, we have BigQuery, Snowflake, Databricks. For BI, we have
Looker, Data Studio. For data transformation, we have DBT. Data orchestration, right? If you
want to orchestrate your entire thing, we have Airflow. There are some data quality frameworks,
Great Expectations, and there are metadata platforms like OpenLineage and DataHub. Again,
you can just search about the tool name, and you will get what they do, okay? When
we talk about the modern data stack, it is really important to just understand why these tools exist
in the market, like what problem do they solve. So, in this case, Fivetran solves the problem of
data ingestion: it takes data from one source and pushes it to another. dbt gives
you modern data transformation, okay? Airflow is for
orchestration, so if you want to orchestrate and build a data pipeline, you can do that. Uh, this
is for data quality and governance, so these are some of the tools available. Just search online,
and you'll find plenty of resources. Alright, uh, now I want to cover these individual things,
right? Uh, what do you need to learn about Python? What do you need to learn about SQL?
What do you need to learn about data warehouses, Spark, Apache Airflow, and Kafka, okay? So I just
want to cover these individual things. Again, I already have the roadmap available, but I'll
just quickly go through this part. Let me just open this, right? Uh, this is available here. So
learning Python is one thing, and learning Python for data engineering is another thing, right? You
can learn Python for free online, but if you want to learn Python for data engineering, you have to
learn certain things. I'll just show you quickly because I have it on my website itself. So this
is my Python for Data Engineering course. I'll just go through the modules. You don't have to
take this course, but if you want, you can learn these things for free online also. I have created
courses just to give you a structured learning approach so that you don't get distracted,
okay? So you can learn the basics. All of these modules are open, so if you want, you can learn
them. You can start with strings, you can learn about numbers, you can learn about data types,
you can learn about data structures like lists, dictionaries, sets, tuples, okay? You can learn
about conditional statements like if-else, you can learn about loops (for loop, while loop),
then you can go to the intermediate level, such as understanding Python packages, how to import them,
list comprehensions, exception handling. We have to learn how to work with text files,
basics of Lambda functions, and object-oriented programming. There are some advanced concepts such
as NumPy, understanding the NumPy package, Pandas basics, how to use Pandas for transformation,
then working with date-time formats—very important if you want to work as a data engineer—how to work
with different file formats like JSON, CSV, Excel, Avro, okay? And these are the basics.
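Date-time handling, called out above as very important, usually means normalising mixed source formats into one canonical form. A small sketch using only the standard library (the formats listed are just examples):

```python
from datetime import datetime

raw = ["2024-03-01 14:30:00", "01/03/2024 14:30"]  # same instant, two source formats
formats = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M"]

def parse_any(value, fmts):
    """Try each known format until one matches; raise if none do."""
    for fmt in fmts:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {value}")

parsed = [parse_any(v, formats) for v in raw]
iso = [d.strftime("%Y-%m-%dT%H:%M:%S") for d in parsed]
print(iso)  # ['2024-03-01T14:30:00', '2024-03-01T14:30:00']
```

In a real pipeline you would also decide on a time zone policy up front, because mixing naive and zone-aware timestamps is one of the most common data bugs.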
In my course, I have included one project for Python, okay? This is like a Spotify data pipeline
project. Uh, I'll tell you about this part, uh, at the end, okay? If you're interested. Then we have
SQL. Inside SQL, what do you have to learn? You can pick one DBMS. We are going with PostgreSQL
because PostgreSQL is open-source, easy to learn, and easy to set up. Learn about the important
keywords of SQL such as SELECT, INSERT, UPDATE, and all of the other things. Learn about data
types and how to create tables, how to create a database, different types of queries available,
okay? Like DML, DDL—like Data Manipulation Language, Data Definition Language—you can learn
about that. Uh, you can learn about operators in SQL, okay? You can learn about ALTER query,
statements, joins like inner, left, right, outer, cross join, ORDER BY, GROUP BY, HAVING clause,
aggregation functions like MIN, MAX, and all the other things. Also, understand the advanced
topics like subqueries, Common Table Expressions, window functions, analytical functions like RANK,
DENSE_RANK, ROW_NUMBER, LEAD, LAG, set operations, working with date-time, case statements,
stored procedures. Learn about data modeling—we understood the basics of it. It is like ER
modeling and data modeling. So learn about that and just try to build your own data model. Like
you can pick one company name like e-commerce or Instagram or any company like Netflix, and you can
build a data model as a project, right? It looks something like this, a data model, as you can see
over here. This is like an Instagram data model, okay? This is like an e-commerce data model,
okay? After this, you can learn about data warehouses, okay? In data warehouses,
you can start with the basics: understand what a data warehouse is, understand OLTP vs OLAP—we
understood about this—understand the difference between data warehouses and data lakes, ETL
process, learn about Snowflake, like basics—just create an account on Snowflake. We have tutorials
on Snowflake also on the YouTube channel. Learn about dimensional modeling, so deep dive into
dimensional modeling, which is understanding what dimensional modeling is, understanding
fact tables, dimension tables, understanding star schema, snowflake schema, types of fact tables,
how to create fact tables, factless fact tables, surrogate keys, date dimension.
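The dimensional-modelling ideas above, fact and dimension tables joined in a star schema, can be sketched with sqlite3 standing in for a real warehouse. All table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes, keyed by surrogate keys.
cur.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT)")
# The fact table holds measures plus foreign keys into each dimension.
cur.execute("CREATE TABLE fact_sales (customer_key INTEGER, date_key INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(20240101, "2024-01-01"), (20240102, "2024-01-02")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 20240101, 100.0), (1, 20240102, 50.0), (2, 20240101, 70.0)])

# The classic star-schema query: join the fact to its dimensions, then aggregate.
rows = cur.execute("""
    SELECT c.name, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', 150.0), ('Bob', 70.0)]
```

Diagrammed, this is one fact table in the middle with the dimensions radiating outward, which is exactly where the "star" name comes from.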
So these are the things you can learn about dimensional modeling. You can learn about SCD,
Slowly Changing Dimensions. You can learn about ETL—these are the concepts that you can cover in
the Snowflake database, okay? Like staging, copy command, file formats, handling unstructured data,
how to work with them, virtual warehouses, caching, clustering, storage integration,
Snowpipe, time travel, how to undrop things, how to recover data from the past, types of tables,
zero-copy cloning, data sharing, materialized views—these are the concepts that you can learn in
Snowflake, right? I'm just trying to give you an overview of the things that you can learn. For me,
I have created this step-by-step roadmap. I'm still building this entire thing,
so you can go to this website, DataVidhya, and you will see that I'm trying to build a course—first
is Python, then the second one is SQL, third one is data warehouses, fourth one is Spark with
Databricks, fifth one is workflow orchestration. I'm currently working on the Kafka course, okay?
And then there will be a dedicated cloud computing course in the future, okay? So, after this,
we have Apache Spark. This is very important. In Apache Spark, understand what Apache Spark is, why
we need Apache Spark, understand the architecture, understand concepts such as DataFrame,
transformations, actions, lazy evaluation in Apache Spark, okay? Learn how to install
Apache Spark—very important. Then we have this, uh, deep dive into the structured API in Apache
Spark. We have two things: structured API and the lower-level API. So learn about the structured
API, basics of it, how to define user-defined functions, data types of Apache Spark,
data sources, partitioning, bucketing, how to work with external tables. Then we also have
the lower-level API, such as understanding the Resilient Distributed Dataset (RDD),
also learn about production applications, how to run Spark on the cluster and on Databricks,
okay? These are topics that you can also cover, like you can just screenshot this,
or you can also visit the DataVidhya website just to get an understanding of the modules, okay? You
can learn all of these things for free online, okay? You don't have to, uh, really go through
this because I'm just going through this because this makes this entire thing easier to explain,
right? Uh, for Airflow also, you can just go through this section, okay? What are the things
that you need to cover? There are a few concepts that are important, okay? And then you can build
the projects like this. So I just quickly showed you, like, what are the different topics that you
can cover from the website. So instead of writing each and every single thing onto this page, uh,
that will just increase the time of the video, and I'm also, uh, feeling pain inside my throat,
uh, because I've been recording this thing for the last 3 hours, okay? Uh, so I just quickly showed
you that particular thing. These are the two different topics that I also wanted to cover: data
security and data masking, okay? Data security is important—we talked about this at the initial
stage. Uh, in data security, we have to take care of three things: confidentiality, integrity,
and availability, right? Ensure your data is accessible only to authorized users, so you
don't give access to your data to every user—only the authorized users should be able to access it.
Integrity is basically maintaining the accuracy and completeness of your data, so your data should
be accurate and should be able to provide the final value, and availability means that your data
is available to authorized users whenever it's needed, okay? These are the three important things
in data security. These are the measures you should take: first, you should encrypt your data,
okay? Encryption should happen so that, uh, if it goes through the network, uh, other people should
not be able to understand what the data is. Access control—only give the data to specific users. Data
classification—classify your data, like if this data is confidential or not, and security—like
secure data on the network level. The one concept that I wanted to talk about is data masking, okay?
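In code, masking can look like this minimal sketch: hide everything except the last four digits of a sensitive value such as an SSN or card number. The helper name and defaults are my own, not from any specific library:

```python
def mask(value, visible=4, mask_char="X"):
    """Replace all but the trailing `visible` digits, preserving separators like '-'."""
    total_digits = sum(c.isdigit() for c in value)
    keep_from = total_digits - visible  # index of the first digit we leave readable
    digits_seen = 0
    out = []
    for c in value:
        if c.isdigit():
            out.append(c if digits_seen >= keep_from else mask_char)
            digits_seen += 1
        else:
            out.append(c)  # keep separators so the masked value stays recognisable
    return "".join(out)

print(mask("123-45-6789"))       # XXX-XX-6789
print(mask("4111111111111111"))  # XXXXXXXXXXXX1111
```

Production systems usually mask at the database or view layer (many warehouses offer dynamic data masking policies), but the idea is the same.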
Uh, that I talked about at the governance level. So usually what happens is that, uh, you have an
employee table, okay? What you have to do is—like there are some governance restrictions, some
regulations by governments that say you should not store sensitive information about the users,
right? Like credit card numbers, addresses, social security numbers, and all of the other things. So
when you do store it, make sure you mask it. Masking is basically a technique of hiding the real
value. Say this is the user's Social Security number, right? If I want to mask it, I replace most
of it with placeholder characters, like this: XXX-XX-, revealing only the last four digits,
okay? You can also do this
for credit cards. This is called masking, okay? Now, these are the different file formats you can
use for big data. These are common: JSON, CSV, Parquet, and ORC. Every file format has its own
use case. I don't want to go deep dive into this right now. I covered some of the things in my
courses, or you can just Google this, and you will understand most of the things that you want to
learn, okay? So, till now, we have covered a lot of different things. I might have missed some of
the topics, right? I cannot cover every single topic in a single video. This might end up being
around a 3-hour video; I'm not sure. Once I sit down and edit this video, I'll know the final length. I
might have missed some of the topics, so what you can do—you can comment down, okay, the topics that
you want to learn. Just the fundamental topics, right? Hands-on, we will have the projects for
that. Just the fundamental concepts that you want to learn. What I will do—I will club all of these
topics inside part two, and I will create a video like this—like a long, three-hour video that you
can watch, okay? Now, you understood all of these things. Now, if you like the way I teach and if
you really want to learn about data engineering, you can go to the website. I will put the link in
the description: DataVidhya/combo-pack. On this combo pack, you will get five courses because,
till now, only five courses have launched. I'm currently working on the Apache Kafka course,
as you can see. We have a 'Not Available' for that also. So, you will also get access to this kind of
notes if you, uh, enroll in the course because I created all of these kinds of notes by myself,
okay? So that you can revise at any time that you want, okay? So you will get access to all of these
notes. So this is my Zero-to-Hero Data Engineering Combo Pack. It comes with the five courses. Now,
in the future, when they launch the course, uh, you can enroll in that course separately,
and I will also create a new combo pack. So if you see the new combo pack while enrolling in this,
okay, at that time, you might see GCP also added into this, okay? In this course, you
will get around 14+ projects. I will teach you, okay, how to make one project the best project.
It is like a step-by-step approach you will get. So as you can see over here, uh, in your Python
for Data Engineering project, uh, course, you will build this particular project, okay? In Snowflake,
you will build a similar project, but instead of using Glue Crawler, Catalog, and Amazon Athena,
we will be using Snowflake over here, okay? Now, in the Spark course, we will replace this Lambda
part, okay, for Python, and we will replace it with Apache Spark. This way, you will understand
how to evolve one simple project and how to plug and play with different toolsets. This is what you
will learn in the entire combo pack, right? How to take one simple project and make it the best
production-level project as we go forward, okay? We start with the basics, we'll replace
some components, we will add Spark, then we will also add Apache Airflow, okay? Inside Airflow,
we will use the same project, and we will use Docker and Apache Airflow to orchestrate
this entire pipeline. We will also create a similar project just using Apache Airflow only,
okay? So as you can see, one simple project we will create in like five to six different ways,
right? So that you get an understanding that data engineering is not just about using tools;
it's about the fundamentals. The fundamentals that we understood, you will actually implement all of
these things over here like this. You will also get projects on Apache NiFi and real-time data
streaming, and there's one project available on Twitter data analysis, which is also available on
YouTube. There's a project available on GCP also, and one on Azure over here. The GCP
project is also available; let me just show you over here. Yeah, this is a crypto data
pipeline project available in the Apache Airflow course, okay? So you will also learn about Azure,
you will also learn about AWS, and you will learn about GCP just by doing these five courses. And
then, in the future, we will have in-depth courses on individual clouds also, so you will
get like 14 different projects over here, okay? Five courses—you can get the information about
all of these over here just by clicking this, okay? These are the reviews from our students,
okay? Previous students, and they have built their own projects till now, so if you want to check,
you can also go through this and understand that they have built some amazing, uh, projects. You
can just click over here, okay? And you will be redirected to the link of the project. I hope
this is working, okay? Or you can go here—this also. Uh, yeah, as you can see over here, uh,
this guy actually built the Airbnb project, uh, using Azure. So just like this, you can build your
own project and put it on your resume also. Uh, this course is for everyone, like cloud engineers,
web developers, data engineers, uh, technical consultants, so it doesn't matter who you are—you
can learn this. What you will get from this course—you will get the code template, okay? You
will get each and everything about the code that you can use. You'll get access to the interactive
Discord community, uh, you will get support if you are stuck with any doubts or any errors, uh,
you can ask it on the Discord channel. Someone will help you out, or I will help you out. Or
in the future, you will also get early access and a discount—like a huge discount to future courses,
right? So, let's say if I launch the Kafka course, you will get a huge
discount for that course also. These are some of the reviews from our students, so you can go
through that. These are some of the commonly asked questions, so you can also go through that. So,
if you're interested, you can go through this. If you're not, you can also learn by yourself,
okay? My voice is breaking, but the best part about this particular data engineering roadmap
is that every single course is in-depth, so you will learn about most of the things. Most of the
bootcamps available in the market just give you surface-level knowledge. So, if you go to any
website that offers data engineering courses, they teach you all of these things, but for each and
every module, they might have like two to three videos added to their module, and they are done.
I will cover all of these topics in detail, and with that, you also get access to notes like these. If I show you Obsidian, you can see this interactive graph environment. You can see the Apache Spark topics here, and which topics are connected to them. These topics are also linked across courses, so a concept like partitioning or transformations also applies to the data warehouse course and to Apache Kafka. As you can see, there is the data warehouse, there is SQL, and you can also start with the basic topics; cloud is there as well. You can interact with the graph and directly search for a specific topic such as partitioning, and I can see that partitioning appears here and also here, so I can easily
search and learn about the different things. One more thing: you will get the detailed notes. For example, these are the basics of Docker, which you can search for, or the Airflow basics UI. Let's say I want to write my first DAG: the notes give me every single instruction on how to write my first DAG, what the code is, and everything that I have to do. You will get every single thing here. This will make your life so much
easier that you don't get distracted by looking at different courses or different resources. You
just stick to one single path, and you can become a data engineer. So this is what I wanted to show
you about my courses. If you're interested, just check the link in the description. If you're not, that's totally up to you; you can use multiple resources. I also have free resources available,
so you can also check those on my YouTube channel. That's everything for this video. I have now been recording for almost three and a half hours; hopefully the recording gets saved so that I don't have to re-record the entire thing. If you're still watching at this point, let me know by writing a comment, because this is a long video. Also, like this video, because I put a lot of hard work into it, and share it with people so that everyone can take advantage of it and grow in their careers. Thank you for watching this video. I'll see you in the next one. Thank you so much.