This master class provides a comprehensive overview of Data Engineering, covering its fundamental concepts, lifecycle, architecture, tools, and cloud implementations, aiming to equip learners with the knowledge to build and manage data systems effectively.
In this three-hour Data Engineering Master Class, you will learn about what Data Engineering is,
the Data Engineering life cycle, data generation and storage, database management systems,
data modeling, SQL versus NoSQL, data processing systems like OLTP versus OLAP, ETL pipelines
(Extract, Transform, Load), data architecture, and I will give you a complete guide on how to build
the architecture from scratch. We'll cover data warehousing, dimensional modeling, slowly changing
dimensions, data marts, data lakes, data lake versus data warehouse, big data landscape, data
engineering on cloud, top AWS services you should learn for data engineering, and we will understand
real-world case study architectures on AWS, GCP data services, and Azure data services. We will
also explore the modern data stack, important tools for data engineering that you should learn,
understanding Python and SQL for data engineering, understanding data warehouse tools like Snowflake,
BigQuery, understanding Apache Spark with Databricks, understanding Apache Airflow
and Apache Kafka for data engineering, and many more things. So sit tight, get your notebooks,
pen and paper, and start taking notes so that you can remember this for a longer period of
time. And before you move forward, make sure to hit the like button and subscribe to the channel
if you are new here. Let's get started with the Fundamentals of Data Engineering Master Class.
The Fundamentals of Data Engineering

Okay, we'll start by understanding what
Data Engineering is because if we want to understand different fundamental concepts,
we need to have our basics clear. Now, if you have been following me on this channel for the past
few years, then you might already know what Data Engineering is because we keep talking about this.
But if you're seeing me for the first time or if you're just getting started with Data Engineering,
it is important for you to understand what Data Engineering is. So let's start with that.
Okay, now, most of this happens on the internet, because that is where Data Engineering mainly takes place. All of these are businesses operating online.
Okay, the businesses are, let's say, Amazon. Okay, what is the business of Amazon? Amazon
is an e-commerce company. What do they do? They give you the ability to purchase products online,
okay, from your home. Now, this is the business of Amazon. What is the business of, let's say,
Netflix? Okay, the business of Netflix is to give you exclusive content. You buy the premium,
and they give you the exclusive content. On top of that, they also give recommendations and all
of the other things. Okay, this is the business of Netflix. What is the business of, let's say,
Zomato? Okay, this is a food delivery app in India. From your home, you can order food,
okay, and the order will get delivered to you within, like, half an hour to an hour. Okay,
there are multiple companies doing businesses on the internet. Now, all of these companies, okay,
have certain goals and visions for the business, right? They want to understand the customer.
Why do they want to understand the customer? So that they can provide better services. Okay,
they want to increase their profit. Okay, this is one of the goals: increase my profit, understand my customer. They also want to detect some of the bottlenecks they might have in the business and improve the business process. And like this, a company might have multiple goals.
Now, if they want to achieve all of these goals, they need to understand how these things are
happening, and one of the best ways companies can do that is by understanding the data. Now,
most of the time, all of these decisions are taken based on assumptions, right? A business person,
let's say, who is working in the shipping department of Amazon, okay, is actually working on
the ground and has knowledge about this particular segment—the shipping, okay? Now, he already has
some business knowledge to take direct decisions on this particular segment of the business, okay,
because he's an expert. He's been working in this particular field for, like, 15 to 20 years,
so he understands what might be the problem. But a lot of times, even as humans, we might miss out
on some of the information that we don't know. And the best way to understand all of this information
is by understanding what the data says. You can assume certain things, and you can be right for
some time, but if you want to be right most of the time, the best way is to be sure about it.
The only way you can be sure about all of these things is by understanding what the data says,
okay? And this is where the entire picture of Data Engineering, Data Science, Machine Learning, AI,
all of these come into the picture. So let's start by understanding all of these things one by one.
Okay, I just painted you the picture. The reason we are doing Data Engineering and Data Science
in the first place is that companies want to understand, okay? They want to improve their business, they want to provide better services to customers, they want
to, like, remove the challenges they might have in the business by using the data because data gives
you the direct answer. It gives you the factual understanding rather than you just assuming
things, okay? So this is the understanding of why we need the data-driven system.
Now, how do all of these things happen? At the end, we want to have the final outcome, and we already understood what it can be: improving business revenue, recommendations, and so on. So these are my business goals, okay? Every
single thing you do in your data ecosystem, or in general in the engineering ecosystem online,
is for this only, okay? Anything you do should create value for the business. Even if you use,
like, the most advanced algorithm, if it doesn't impact the final outcome of the business,
it is completely useless, okay? It should help the business in some way; it should help the business
to save costs, it should help the business to improve the process, it should help the business
to understand the customer—whatever it can be. If it can provide the final value, then it is useful;
otherwise, it is completely useless, okay? So this is very important—everything that
you do should create value for the business. If this is clear, let's start by understanding
the entire pipeline of Data Engineering and the entire pipeline of the overall internet system,
okay? Before we just understand the Data Engineering life cycle, we need to understand
how different things or different fields come together to make the complete system,
okay? So we have the company, this is my company over here, okay? And the company can be Amazon or whatever; we'll take one example, okay?
Now, at the front end, we usually have the application, okay? This is my application, okay?
This might be my mobile—there's a button, and this is my application. And the user interacts with the
application, okay? I'm the user—I have Instagram installed on my phone, I have Facebook, I have
whatever, okay? I might be using LinkedIn—I have the application. And whenever I interact with this
application, data gets generated, okay? Whenever I click on any application, whenever I, like,
like something, when I comment on something, every single thing that I do, even if I go to Amazon,
if I click on a certain product, every single thing, okay, every single thing generates data,
okay? Now, all of this data will get stored, okay, inside the DBMS, okay? These are called database
management systems. Now, there are different types of database management systems we will understand
in this video, but just try to understand every single thing that we do gets stored inside a
DBMS, a database management system, okay? Now, these systems are usually designed for
storing this kind of data, right? You can store this data easily. There is something called CRUD
operation, okay, which is called Create, Read, Update, and Delete. We'll understand
that in the further video, but the databases used here are typically relational databases, specially designed for exactly this kind of workload. Now, once you store all of these things,
alright, we have the data available. Now, data might be coming from multiple places, but let's
understand—from the application, our data gets stored inside the DBMS, and from there, our entire
Data Engineering pipeline starts, okay? The Data Engineering happens, Data Science happens, Machine
Learning or Data Analytics might happen over here, and then there might be a final dashboard,
okay? There might be some dashboard or some charts available here, okay? Businesses use this,
or there might be a machine learning model, okay? So this is like a robot, okay? I'm bad at drawing,
but this is one of the robots or machine learning models that might help in understanding all of
these different things, okay? Just trying to, like, just trying to paint a simple picture of the
entire ecosystem—there are many different things that go here, okay? The application development,
there might be DevOps who might be deploying the application, but in general, from application to
DBMS, whenever we have any data available, okay? This is where internet companies come into the
picture because you can store all of the data inside the DBMS, database management system,
okay? And then you can utilize all of this data for this kind of workload, okay?
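To make the CRUD idea concrete, here's a minimal sketch of an application writing interaction data into a DBMS. SQLite (via Python's built-in sqlite3 module) stands in for any relational database here, and the `user_events` table and its values are purely hypothetical:

```python
import sqlite3

# Minimal sketch: an application writing interaction data into a DBMS.
# SQLite stands in for any relational database (PostgreSQL, MySQL, ...).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_events (id INTEGER PRIMARY KEY, "
    "user_id INTEGER, action TEXT)"
)

# Create: every click, like, or comment becomes a row
conn.execute("INSERT INTO user_events (user_id, action) VALUES (?, ?)", (1, "click"))
conn.execute("INSERT INTO user_events (user_id, action) VALUES (?, ?)", (1, "like"))
conn.execute("INSERT INTO user_events (user_id, action) VALUES (?, ?)", (2, "comment"))

# Read: how many events did user 1 generate?
count = conn.execute(
    "SELECT COUNT(*) FROM user_events WHERE user_id = ?", (1,)
).fetchone()[0]
print(count)  # 2

# Update and Delete complete the CRUD picture
conn.execute("UPDATE user_events SET action = 'share' WHERE id = 2")
conn.execute("DELETE FROM user_events WHERE id = 3")
```

The same four operations work identically against Postgres or MySQL; only the connection line changes.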
Once you have the data generated, this is where the Data Engineering starts, because without data,
you don't have Data Engineering, Data Science, Machine Learning—because they work fundamentally
on the data. If you have the data, then you can do something about it; if you don't have the data,
then you can't do anything about it, okay? So the fundamental concept of a data-driven system
is having a data generation in place, and this is what the data generation looks like, okay?
You have the application, the data is getting generated, there might be other things such
as sensor data, okay? Say a truck is moving from location A to location B, and in between, from B, it might go on to location C, okay? The truck goes from here to here to here. Now, we need to capture all of this data, and all of this data gets captured by the sensors, right? Just as we generate data when we interact with the application, the truck's sensors generate data as it moves. Just like this, we have the stock market data,
we have data coming from numerous places, okay? So we understand how the data is getting generated,
sent to the system, and all of the other things. So this is the fundamental concept
of Data Engineering, which is where Data Engineering sits in the first place, okay?
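The kinds of generation just described, app clicks and truck sensors, can be simulated in a few lines. This is only an illustration; the field names, the truck ID, and the coordinate ranges are made-up examples, not any real system's schema:

```python
import json
import random
from datetime import datetime, timezone

# Sketch of data generation: two hypothetical sources, application
# click events and truck GPS sensor readings, emitted as JSON records.

def app_click_event(user_id: int, product_id: str) -> dict:
    """Simulate one click event from the application front end."""
    return {
        "source": "app",
        "event": "click",
        "user_id": user_id,
        "product_id": product_id,
        "ts": datetime.now(timezone.utc).isoformat(),
    }

def truck_gps_event(truck_id: str) -> dict:
    """Simulate one GPS reading from a truck's onboard sensor."""
    return {
        "source": "sensor",
        "truck_id": truck_id,
        "lat": round(random.uniform(18.9, 19.3), 5),   # made-up range
        "lon": round(random.uniform(72.7, 73.1), 5),   # made-up range
        "ts": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    for e in [app_click_event(1, "B07XYZ"), truck_gps_event("TRK-42")]:
        print(json.dumps(e))
```

Every record carries a timestamp and a source tag, which is what lets the downstream pipeline know where the data came from and when.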
As we move forward, we will understand all of the different parts of Data Engineering individually,
but just try to understand where Data Engineering really fits into the entire cycle, okay? It is
between the application development and the database. So whenever your data gets generated,
okay, it is over here—this is my application side, and this is my Data Science, Machine Learning,
dashboarding side. Data Engineering sits in between. It is kind of like a plumber,
okay? I'm connecting one thing to the second thing by transforming data and some of the other things
that we will understand, and then I pass the data to the next end, okay? I get the data from one
source, and I pass my data to the next source. How do Data Engineers do that? What are the different
features, functionality, and frameworks they use? We will talk about all of these things one by one
in this video, so don't worry about it, okay? I hope you understood the basics until now.
Okay, so now that we understood where Data Engineering sits,
what is the role of Data Engineers in this place? First, we have the software engineers, okay? The general role of software engineering is to develop the app; it can be a web application or a mobile app: writing code, developing, or deploying some of
the things, okay? Then we might have the DBA. This thing can also be done by the software
engineers in smaller companies, but if you're working in a big company, a DBA is a Database
Administrator who designs and manages the database, right? They build the different tables, the different columns, and all of the other things. These are built by DBAs. Usually, Data Engineers
can also do that, or the software engineers can also do that—depends on the company's
size and your job profile—but let's understand. We might have a DBA who will build a database,
okay? So this person will be building the database. Now, who do we have? We have Data Engineers, okay?
Data Engineers. The roles of Data Engineers: there are many different things, but the core one is to write the ETL pipeline (Extract, Transform, Load). We have a dedicated section on this, but ETL is basically: we extract data from one end, we transform that data, and we load that data, okay? Then it can also be building a database or a data warehouse. Data Engineers can do that too, okay? They can build relational databases or do dimensional modeling, which we'll understand later. Then there is working with big data: processing all of this data using Spark, Hadoop, Kafka, or other frameworks to handle batch data or real-time data. There is also data integration: again, data is coming from the API,
data is coming from the sensors, data is coming from the RDBMS, so we want to integrate all the
data, so Data Engineers have the responsibility. There are other responsibilities such as quality
check of the data and governance, how to organize all of this data properly, so these
are the core use cases of the Data Engineers. Now, after that, we have Data Science people,
okay? Data Science or Data Analysts, okay? Usually, the difference between Data Science
and Data Analysts is basically that Data Analysts usually answer questions about what has happened
in the past, okay? How can we, like, what was the revenue of this particular product last year
compared to the last five years, right? They are trying to find the pattern from the past and find
some of the answers. The role of Data Science is to predict what can happen in the future,
right? We did a product sale for this particular product X amount for the last one year—what will
be the product sale for this particular product for the next six months? This is what Data Science
answers, right? They try to predict what will happen in the future based on past patterns,
and we have the Machine Learning Engineers who can basically automate all of the other things. So on
Amazon, we have the recommendation system, right? All of this recommendation system is
done by Machine Learning Engineers. They deploy the machine learning models onto the production
system so that a system can learn by itself and generate the right output for the user. So you
can predict what is happening inside your system, or you can predict how the users are behaving and
recommend them the right information. Like on Instagram, you go to Reels, you see the right
reels as per your interest, okay? They don't only recommend things they know you'll like; they also recommend some random things just to understand whether you like them or not. So they are just trying
to train the machine learning algorithm based on your usage on the application, okay?
Now, the difference between DS and ML is quite thin, okay? You might see a Data Science person
might do the ML work, or an ML person might do the Data Science work, but in larger organizations,
they might have individual work to do, okay? They have core responsibilities,
but in smaller organizations, they might have to do all of these things by themselves,
so do not get caught up in the title like, "Oh, what does a Data Science person do? What does
a Machine Learning Engineer do?" Just try to understand their core responsibility from the
top level. In the actual organization, when you go to work, okay, when you start working, you might
have to do everything by yourself because the role is just a name, okay? But this is the core
distinction between all of these roles. There are other roles such as DevOps, DataOps—these
are just fancy names, but on a fundamental level, you might be doing similar work, okay?
So we understood what Data Engineering is, okay? The role of Data Engineering is to take
data from one source, okay? It can be any data from, like, RDBMS, API, do some transformation,
and pass this data to Data Science or Machine Learning guys so that they can build dashboards
or they can, you know, build machine learning models. Now, all of these things that we do,
okay, there is a proper approach to it, okay? You can't directly get the data from one source and
directly push it to the Data Science person—there has to be a step-by-step approach that is designed
properly so that the entire pipeline that you generate has some purpose to serve,
okay? And this is what we will understand, okay? So this is what we call a Data Engineering life
cycle. This is taken from the book Fundamentals of Data Engineering. I have recommended this book
to so many people, and it is one of the best books if you want to understand the fundamentals of Data
Engineering. A lot of the material that I have learned about the fundamentals is from that book,
and some of the material I also added in this video, so you will get the understanding, okay?
So the first step here is data generation, okay? Now, this thing we already talked about,
right? Data generation—data is getting generated from multiple places. We already know data comes
from what? APIs, okay? RDBMS, it comes from sensors, it comes from analytics like Google
Analytics or all of the other things, okay? So data is coming from multiple places. Now,
all of this data that is coming from these different places, we need to aggregate this
data together and ingest it into the system, okay? Now, this is the next step over here: data ingestion, okay? We are getting the
data generated from one place, then we need to ingest this data to one particular system. The
ingestion can be setting up the connection with the API, setting the connection with the RDBMS,
building a system that can read the data from sensors, and then automatically ingest this
data into our Data Engineering system, okay? We will understand what this entire Data Engineering
system feels like when we actually look at the project example, but these are the fundamentals,
okay? We have the data generation, and that data is getting ingested into some kind of system,
okay? And we just build a programmatic connection between this. So whenever any data gets added to
the RDBMS, okay, it should automatically get ingested into our system. There are
multiple approaches to do that, but these are the fundamentals. Once the data is getting generated,
we ingest this into our system. Then the data that got ingested will get stored, okay? There's some
kind of storage layer we have, so every data that is coming from multiple places, we have
to store all of this data at some location, okay? It should get stored at some location at least,
so this is where the storage happens, okay? We are storing this data at some location. Now,
between this ingestion and the serving, okay? Serving is basically we are serving our data
to machine learning, analytics, and reporting, okay? The thing that we understood over here,
okay? After the Data Engineering happens, we have Data Science, Machine Learning persons who are
building a dashboard or who are building a machine learning model. The same thing here is that this
is the part, okay? The Machine Learning or the analytics—we have reporting, dashboarding—all of
these things happen over here. This is where the data is ingested, and this is where the data is
getting stored, okay? Between that is the core of Data Engineering that is called a transformation.
Transformation is basically the set of business logic, alright, that we apply to convert our raw data. This is usually what we call raw data because it is coming straight from the source system, okay? And what we serve at the other end is the transformed data. Everything that happens between the raw data and the served data is called a transformation. Transformation is a set of business logic, and it
can be anything, okay? So consider this example. Let me just explain this part. Now, we have data
coming from the API, okay? I have data coming from the API, and I have data coming from the RDBMS,
okay? Now, in both of the data, I have a date column, okay? I have a date column, I have a
date column, and the format of the date in the API is YYYY-MM-DD, okay? It is like 2024-06-01, the first of June, 2024. Now, in the RDBMS, okay, the date format is like MM-DD-YYYY, something like 06-01-2024,
alright? Now we have a date coming. Now, what we need to do is we need to join this system because
at the end, we need to find the analysis. There might be some ID column here, okay, and there
might be one more ID column available over here. We need to join these two data together. Now, when
do we join it? Okay, when we join data coming from the API, this might be, let's say, product date,
okay? This is a product date, okay? And this is an order date—it can be anything like this,
okay? Now, when we join this information, we need to transform this data into one particular logic,
alright, that can be formatted as this particular format or this particular format—it can be
anything. This is the decision that business people or you can take, like I want to transform
this data based on this format only, so any information that is coming from any other sources,
okay, it should be transformed into the YYYY-MM-DD format for the date, okay? So if we are getting
this data after the transformation block, okay, so we will have our transformation block here. Both of these data sources will go inside it, okay? Transformation
can be done by Python, PySpark, Scala, whatever it is, okay? We will understand all of these things,
okay? How do we do the transformation? And at the end of this, I will get this data into YYYY-MM-DD,
okay? The date values will be converted into one single thing. This is what we call a
transformation, okay? This is one example, but transformation can be anything, okay? It can be
removing duplicate values, it can be removing the null values, okay? It can be aggregating the data,
it can be merging two data sets, it can be generating a new column based on the two different
columns, concatenating—it can be anything, okay? It can be filtering—whatever it is, transformation
is basically a set of business logic that you have to write inside the code or inside the SQL
query or use any tool to do that to generate a suitable outcome so that the Data Science person
or the Machine Learning person can build a model or build a dashboard to find the relevant answer,
okay? So as a Data Engineer, my role is to organize the data into the proper structure so
that we can easily visualize this or we can easily understand what is going on inside the data,
so that is my job. I want to make the data into the proper structure, and that usually
happens in the transformation layer, okay? Now that we understood what is going on, we are
getting data generated from one source—it can be many sources, APIs, sensors, whatever, okay? All
of this data is getting ingested into one system. Ingestion basically means making a connection in
such a way that any time a new data is getting generated, we automatically fetch this data, okay,
and store it inside our storage system, okay? This is what we understood. Now, once we have
this data available, we need to make sure the data that is coming from all of these different systems
passes through a certain transformation logic so that our data gets structured. Once that is done,
we serve this data to a user. A user can be a Machine Learning Engineer, a Data Analyst,
or some dashboard expert—it can be anything, okay? They are using this data so that they can
understand, build machine learning models. This is the entire Data Engineering life cycle that we are
talking about, okay? There are some undercurrents that we will understand in further videos,
so don't worry about it, but I hope you understand the complete Data Engineering life cycle from a
fundamental point of view because this is really important, right? You can use any tools, right,
to do all of these things, but if you understand the fundamental side of it,
then it doesn't matter which tool you use—you already know what needs to be done, so you can
pick the shittiest tool in the market, okay, and still make this entire pipeline work, okay?
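The life cycle just described, generation, ingestion, storage, transformation, serving, can be sketched in plain Python without any tool at all. This is a toy illustration under assumed data: the two sources and the date formats mirror the earlier API-versus-RDBMS example, and none of the names come from a real system:

```python
from datetime import datetime

# --- Ingestion: toy records from two hypothetical sources ---
api_data = [{"id": 1, "product_date": "2024-06-01"}]    # API dates: YYYY-MM-DD
rdbms_data = [{"id": 1, "order_date": "06-01-2024"}]    # RDBMS dates: MM-DD-YYYY

def normalize_date(value: str, fmt: str) -> str:
    """Transformation rule: convert any incoming date to YYYY-MM-DD."""
    return datetime.strptime(value, fmt).strftime("%Y-%m-%d")

def transform_and_join(api_rows, rdbms_rows):
    """Apply the business logic, then join the two sources on id."""
    orders = {r["id"]: normalize_date(r["order_date"], "%m-%d-%Y") for r in rdbms_rows}
    return [
        {
            "id": r["id"],
            "product_date": normalize_date(r["product_date"], "%Y-%m-%d"),
            "order_date": orders.get(r["id"]),
        }
        for r in api_rows
    ]

# --- Serving: in a real pipeline this would land in a warehouse ---
served = transform_and_join(api_data, rdbms_data)
print(served)  # [{'id': 1, 'product_date': '2024-06-01', 'order_date': '2024-06-01'}]
```

A real pipeline would swap the toy lists for API calls and database reads, and the final list for a warehouse load, but the shape of the work is exactly this.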
That is the power you have as a Data Engineer because once you understand the fundamentals,
you don't really need to know anything else. You can learn tools within 30 minutes,
okay? It doesn't take time to learn any new modern tool—it's very simple. Even to learn Spark and how
to write the Spark code, it's very easy, okay? You just need to understand some of the functions and
execute. There are some angles to Spark, such as the internal and the understanding of executors,
drivers, and all of these other things that you need to understand to become a better engineer,
but to do this entire job is not that difficult, okay? You just need to understand how to make connections between systems and execute the entire thing, okay?
Now that you understood, we can go forward and start talking about the individual components,
right? How can I do the generation? How can I do the ingestion? What can I use for the
transformation? How can I do the serving? What is used for storage, okay? Machine Learning,
Analytics, Reporting—every single thing that we will talk about, and we will also talk
about this part further down the video, okay? Now this is understood, let's talk about the
data generation and data storage one more time. Alright, so we got the basics until now—data is
generated from multiple places. Data is coming from transactional systems. Transactional systems,
okay, these are called RDBMS, okay? There are multiple types of transactional systems that we
will talk about, so don't worry about it. Data is coming from IoT devices, so we have the IoT
devices, okay? It is coming from there. It is also coming from web and social media, okay?
We understand data is coming from logs and machine data, okay? This is also important because, again,
we are running the technical machines, so they are also generating logs, and if you want to improve
the utilization of this technical machine, we can also use this log data to understand what is going
on and save costs over there also, okay? Then we might have some API data—API or third-party data,
okay? Third-party data. Sorry for the bad handwriting, but this is where the data
is getting generated, okay? Now, once we have the data available, we have to store this data, okay?
The storing of the data: basically, we store it in a relational database, okay? This is the same transactional system we talked about, so from the application to the RDBMS, data is generated,
okay? This is where the data generation—you can also put the RDBMS into data generation because
it is connected to the application, okay? And you can also put it on the storage layer because
data is getting stored inside the RDBMS, so you can also keep it generation and storage—it doesn't
matter, okay? Because from the Data Engineering point of view, we usually consider RDBMS as a
data generation source, okay? From the application point of view, we usually consider it as a storage
layer also, okay? It sounds tricky, but it's simple: you can consider the RDBMS as both data generation and storage. We also have NoSQL databases, okay, which we
will understand. For data storage, we have data warehouses, okay? This is what we are talking
about, okay? The thing that we understood about storage, okay, generation, and ingestion is this
part—this is the data generation, okay? And the storage that we talked about over here is this
part, okay? We can store our data in the RDBMS, NoSQL, data warehouse, or object storage—object
storage can be like S3, Google Cloud Storage, Azure Blob Storage, all of these other things
that we will also understand, okay? You can also call these things a data lake, data lake, okay? So
these are the storage systems. We understood the generation, how the data is getting generated,
and where our data will get stored. So, okay, this is what we understand. Now let's understand about
the DBMS, okay? The thing that we were talking about—transactional systems and RDBMS systems,
okay, that are used for data generation and data storage—in reality, we use the DBMS,
Database Management System, okay? These are the systems specially designed for
storing your data in a structured way so that you can easily query your data.
Now, understand this, okay? You can also store your data in MS Excel or Google Sheets. If you
already know, right? You can have columns here and rows and column formats, so you can store
your data. But if you want to store, let's say, millions or billions of records, and if you want to find a specific record, MS Excel will not be able to handle that, okay? Because finding a specific record among, let's say, a thousand rows, or a hundred thousand (one lakh) rows, will be very difficult there. DBMS systems,
okay, are specially designed for this kind of workload, okay? You can store your data,
and you can easily retrieve, update your data as per your requirement. There are different types of
DBMS systems available. We have PostgreSQL—this is open source. We also have MySQL—this is open
source. We have Microsoft SQL Server, we have Oracle, okay? These are enterprise-level,
okay? If you want to get started, Postgres and MySQL are the easiest to get started. Now,
to work with all of these systems, we have a language, okay? We have a language called SQL—this
stands for Structured Query Language, okay? Now, this is the language that we use to communicate
with the database. You might already know about this because you've been following me,
or you have heard about it somewhere, but if you're new to Data Engineering or just in
general to the data space, SQL is the language that we use to communicate with the database.
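For a quick taste before we list them, here is a sketch of the basic SQL statements run through Python's built-in sqlite3 module; the table and values are made up for illustration:

```python
import sqlite3

# A sketch of the four basic SQL statements (SELECT, INSERT, UPDATE,
# DELETE) using SQLite; the student table and rows are illustrative.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, "
    "age INTEGER, city TEXT)"
)

# INSERT: add rows
db.executemany(
    "INSERT INTO student (id, name, age, city) VALUES (?, ?, ?, ?)",
    [(1, "D", 26, "Mumbai"), (2, "Akash", 25, "Delhi")],
)

# SELECT: fetch a specific record
name = db.execute("SELECT name FROM student WHERE id = 2").fetchone()[0]
print(name)  # Akash

# UPDATE: change the age of student 1
db.execute("UPDATE student SET age = 27 WHERE id = 1")

# DELETE: remove student 2
db.execute("DELETE FROM student WHERE id = 2")
```

The statements themselves are standard SQL; only the surrounding Python plumbing would change with a different database.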
Now, what can we do with SQL? We can do multiple things. We can select the data, okay? We have a SELECT query to fetch the data. We can insert the data, we can update the data, and we can delete the data, okay? All of this data is getting stored
inside the table. It looks something like this, okay? The table will have a column name, okay,
and the actual data stored inside this—this is where all of the actual data is getting stored.
The data that we talk about, like it can be, let's say, this is our data, okay, student data,
okay? And there is a table, Student. What will Student have? Student will have ID,
okay? Student will have a name. It will have age, and it might have, let's say,
a city where the student lives. So ID can be one. The name can be, let's say, D, okay? Age can be
26, and the city can be Mumbai. Just like this, there might be some other person who might be,
let's say, Akash. Age can be 25, and is living in Delhi. Okay, like this, we have data stored inside
our table, okay? So this is what is happening over here. We can select specific data, let's say
where the student ID is equal to two, by writing SQL queries. I can insert new data with ID 3,
I can delete data if I want, and I can update data, say, if I want to update the age or the
name. There are multiple SQL use cases. If you want to learn about SQL, I have a course so you
can learn in-depth, but this is the fundamental concept of SQL, okay? Now, this is what we
understood, right? This is
the SQL that is used for working with the DBMS systems; this is the language you use to work
with the system, alright? Now we have a concept of data modeling.
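As a quick aside, the SELECT, INSERT, UPDATE, and DELETE operations just described can be sketched in Python with the built-in sqlite3 module. This is only an illustrative sketch; the Student table and its rows are the hypothetical example from above:

```python
import sqlite3

# In-memory database; the Student table mirrors the example above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, city TEXT)"
)

# INSERT: add the two example rows.
cur.execute("INSERT INTO Student VALUES (1, 'D', 26, 'Mumbai')")
cur.execute("INSERT INTO Student VALUES (2, 'Akash', 25, 'Delhi')")

# SELECT: fetch a specific record by its ID.
cur.execute("SELECT name, city FROM Student WHERE id = 2")
print(cur.fetchone())  # ('Akash', 'Delhi')

# UPDATE: change a value in an existing row.
cur.execute("UPDATE Student SET age = 27 WHERE id = 1")

# DELETE: remove a row.
cur.execute("DELETE FROM Student WHERE id = 2")

cur.execute("SELECT COUNT(*) FROM Student")
print(cur.fetchone()[0])  # 1
```

Broadly the same queries work against PostgreSQL or MySQL; mainly the connection library and some dialect details change.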
Now, this is where we are slowly diving into the Data Engineering fundamentals concept one by one,
okay? We have cleared the foundation part of Data Engineering. Now we are diving into
the individual concepts that are important for you to understand the entire life cycle, okay?
Data modeling. Now see, whenever we are designing any application or whenever we are thinking to
build or store our data, we need to design a data model. Data modeling is basically a visual
representation of how our data looks, okay? So we will take one example, okay? Let's take the
example that we all understand, which is Amazon, okay? We are building the data model for Amazon.
Now, just use your general knowledge, okay, and common sense to think about what information
Amazon will store. Data modeling is basically charting out or building a visual representation
of how our data will get stored inside the RDBMS, okay? This is the entire goal of it, okay? So I
need to think about what kind of tables or what kind of data that I want to store for my system,
okay? I want to store in Amazon, right? I might be storing information such as about the orders,
okay? I'm storing about the orders. I might store about the users, okay? Users who are on
my website. Orders, then the product, I've been storing about the product, okay? What else? I
might store about the payments. What else? Shipping information,
okay? Shipping. I might store information about the sellers, okay? Sellers who are selling on
my platform. And like this, there might be hundreds of tables in the actual Amazon,
right? But this is the basic table. Like I say, I'm starting my e-commerce company,
and I'm designing a data model from scratch. Amazon doesn't exist, nothing exists, and I'm the
first person who is starting an e-commerce company on this entire planet. And I'm thinking,
okay, I'm going to design my data model; initially, it will have some kind of tables. These
are the pieces of information that I want to capture for my system. Okay, we are
talking from the application side right now. Okay, so we are slowly moving onto data engineering,
one by one. These are all concepts you really need to understand if you want to become a data
engineer. So, I'm going step by step to make you understand each and every single concept.
Okay, so we have the orders, users, products, payment, shipping, and sellers. Now, let's say
I'm satisfied with all of this information that I want to capture. What I will do, I will first
design a data model for this. Okay, it will look something like this. So, first of all, I have
the orders. I will create an order table. This is my order table. Okay, order. Now, the order will
have a lot of things. So, first of all, I have the order ID, order name, and order date. Okay,
let's be satisfied with this. Then we have the user. I have the user available. The user will
also have the user ID, name, age, address, and all of the other things just like a normal user has.
Okay, then we have the product. Now, we have the product information. In the product, we have the
product ID. This is the primary key or the unique key to understand which product it is. Then we
have the product name, product category, product description, product quantity, product weight,
product unit size—lots of things that we can store. Then we have the payment. Payment ID,
payment amount, and payment date can be there. So, we'll just keep these three things. Then we
will have shipping. Shipping ID and shipping date, okay, just keep these two. And the sellers. Okay,
this is sellers. We will have the sellers' ID, seller name, age, location, or whatever it is.
Okay, so we just kind of figured out the tables that we want for our database. Now we need to join
them. Alright, so all of these tables only make sense if they have a relationship with each other,
right? So how does the relationship happen? Okay, a user orders a product. So, the order will have
all of the information that is getting ordered on the platform. Okay, so on the order, we also have
a user ID. This is a foreign key; this will be joined over here. A user can order multiple or
single products, so we will have information about the user ID. A user ID has ordered a product.
Which product did they order? So we also need to add a product ID to the order table and join
it to the product table. So we understand that a user will order a product: in the order table,
we know which user ordered and which product that particular user ordered. Then this is done
for user and product. We can also add
payment information. If I add a payment ID to the order table, then the payment can also be
tracked down easily: in the order, what was the payment ID? If you want to understand how much
payment that particular user made, we can also do that. So this is what we can add here. Okay,
then for the sellers: which seller is selling which product? We can add a seller ID column
inside the product table, so we can understand which seller is selling which particular
product, and then we can make a connection between the seller table and the product table
through that seller ID. And then we might have the shipping information,
so shipping will have information about the order ID, which order is getting shipped. Okay, so we
can join this particular thing over here also. So all of these tables will be connected
together. Again, this is the worst way to draw this particular thing,
but I just want to show you the fundamental side of it. Because if I just show you the picture,
if you just search on Google for a data model picture, you will find a lot of data models.
So in reality, a data model really looks like this. There are some applications,
such as draw.io, or there are some specific applications for databases to make this
kind of diagram. And I teach all of these things in my SQL courses. So, if you want,
you can check the description if you want to know more about it. But this is the fundamental concept
of data modeling. I go in-depth in my courses, but I just want to give you a good overview.
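The relationships just described, with a primary key in each table and foreign keys linking them, can also be written down as table definitions. A minimal sketch using SQLite, with the hypothetical Amazon-style tables trimmed to just a few columns each:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each table has a primary key; products and orders carry foreign keys
# pointing at the tables they relate to.
cur.executescript("""
CREATE TABLE users    (user_id    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sellers  (seller_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT,
                       seller_id  INTEGER REFERENCES sellers(seller_id));
CREATE TABLE orders   (order_id   INTEGER PRIMARY KEY, order_date TEXT,
                       user_id    INTEGER REFERENCES users(user_id),
                       product_id INTEGER REFERENCES products(product_id));
""")

cur.execute("INSERT INTO users VALUES (1, 'D')")
cur.execute("INSERT INTO sellers VALUES (10, 'Acme')")
cur.execute("INSERT INTO products VALUES (100, 'Chair', 10)")
cur.execute("INSERT INTO orders VALUES (1000, '2024-01-15', 1, 100)")

# Join across the foreign keys: which user ordered which product
# from which seller?
cur.execute("""
SELECT u.name, p.name, s.name
FROM orders o
JOIN users    u ON o.user_id    = u.user_id
JOIN products p ON o.product_id = p.product_id
JOIN sellers  s ON p.seller_id  = s.seller_id
""")
print(cur.fetchone())  # ('D', 'Chair', 'Acme')
```

The join only works because the tables share those key columns, which is exactly the point of drawing the relationships in the data model first.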
Okay, now we understand the data modeling. This is what we usually call a SQL table because these are
relational databases; they have a specific schema defined. So, this is the data model. Now, in this
data model, every single piece of information has some kind of schema attached to it. The schema
is basically the data type. So, let's say the order ID will be an integer; the order name will
be a string; the order date will be a date value; the user ID will be an integer again. Just
like this, each and every
single column has some kind of schema or data type attached to it. This is called a SQL or
relational database table because it is properly structured; every schema is properly defined,
and you use SQL queries to work with it. After that, we have something called a NoSQL
database. In SQL, we store our data in the column and row format, but in the NoSQL database, we can
store our data in different types of formats. One of the formats is the key-value. If you know the
basics of Python or JSON, it's something like this: we have a key, ID, with a value attached
to it, one. Then we will have the key name, and the value will be, let's say, D. And the age
will be, let's say, 26. All of this information will be stored
in the key and value. So, if you want to find, let's say, a particular piece of information,
you can just search it by the name, age, or something like that. Then we have the column
family, where all of the data is actually stored by column. We have the document database, and
we have the graph database; graph data is used for representing relationships. We don't want
to deep dive into it; I just want to give you an overview that these kinds of databases also
exist for some kinds of workloads.
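A rough sketch of the key-value/document idea in plain Python. Real NoSQL stores (Redis, MongoDB, and so on) add persistence, distribution, and indexing on top, but the data shape is similar; the keys and records here are made up:

```python
# Each record is a document: keys with values attached, like the
# ID/name/age example above. A key-value store maps a lookup key
# to such a document.
store = {
    "user:1": {"id": 1, "name": "D", "age": 26},
    "user:2": {"id": 2, "name": "Akash", "age": 25},
}

# Lookup by key is direct; no table scan or fixed schema is needed.
print(store["user:1"]["name"])  # D

# Documents in the same store need not share a schema.
store["user:3"] = {"id": 3, "name": "Riya", "city": "Pune"}
print(len(store))  # 3
```

Notice the third document has a `city` field but no `age`; that schema flexibility is the main contrast with the relational tables above.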
After this, these are the usual comparisons that I want to talk about: SQL versus NoSQL. SQL is
relational, which basically means that the data model we talked about, all of these things, are
properly stored and have a relationship between them. As you can see, this table is connected
to that one; the order table is connected to shipping; the shipping table is connected to
the product ID; the user table is connected to the order. They have a relationship between each of
them with specific primary and foreign key IDs. So, this is called an SQL relational database.
Then we have the analytical, which is usually OLAP, or data warehousing. Data warehouses,
this is what we will talk about further down the video, but these are the SQL databases.
Then we have the NoSQL. In NoSQL, we have the graph, wide column, document, key-value. Well,
if you want to understand all of this, you can just Google it, and you will understand most of
it. We don't want to spend time on NoSQL because we will mainly be focusing on SQL. This is what
you will be working with mainly in the real world because most of the data is actually stored in
SQL databases, and you will be using data warehouses. So, let's talk about that one by one.
Okay, now in SQL, the two things that we talked about, relational and analytical, correspond
to two different data processing systems, and we want to talk about that. So, we have two data
storage and processing systems. One is called OLTP, and the second is called OLAP. OLTP means
online transactional processing, and OLAP means online analytical processing.
Okay, in SQL, we have the relational and the analytical. These are the two
things. Relational is usually called online transactional processing, and the analytical
is called online analytical processing. This is a relational database. This is a relational DB,
and this is the data warehouse. And you will be juggling between these two as a data engineer.
Now we are slowly deep diving into data engineering, so pay attention.
Okay, now OLTP system has some kind of use case, and OLAP system has some kind of use case. This
is not something where OLTP is better or OLAP is better; they both have their own places in
the entire system. Now, the use case of OLTP is usually for processing transactional data.
It is used for transactional data. What does transactional data mean? It means that when
you send money to one person from your account, it goes to the other account. That is considered
a transaction. When you purchase something on Amazon, when you buy something on Amazon,
that particular information of the product—that this user purchased this particular product
and made payment for this amount—that entire thing is called a single transaction that is
stored inside the OLTP system. These systems are mainly designed for this kind of workload. So,
when you want to do a fast insert of the data, an update, or a quick read of the data on an
individual level, these are the best systems. We talked about the CRUD operations: Create,
Read, Update, Delete. OLTP is very useful for that kind of workload. So, the use case of OLTP
is more on the
transaction level. Whenever you have a lot of transactions happening on an e-commerce website
or banking, the transaction doesn't only mean money transactions. It can be any transaction,
such as if you buy a product, if you return some product—all of these are the individual
row-level information that is getting stored. But if you want to understand what is happening,
let's say if I want to understand the last five years of data using the OLTP system or SQL,
I won't be able to do that. And I'll explain the reason behind it, but for that, we have an OLAP
system. The OLAP system, the name literally says that it is for online analytical processing. The
reason OLAP systems are good is that they are mainly used for analysis workloads. So, if you
want to analyze the last five years of data, you can easily do that using the OLAP system.
Let me just explain this individually so that you have a better understanding. So, the OLTP system
is mostly row-based. So, every piece of information that you store is stored inside
the row. Like, this is my ID, this is my name, this is my age, this is my payment that I made,
something like this. Now, all of this information is getting stored inside the individual row. Now,
this is the OLTP system used for transactions, so this is really good for row-level operations. If
you want to do something on the row level, if you want to update the date of birth, if you want to
update the age, delete a particular thing at the row level, this is very easy. But let's say if I
want to analyze the entire data—let's say this is the payment made for 10 rupees, 20, and 30,
and what I want to do is aggregate them, and like this, there are millions of rows available.
And if I want to analyze this entire data from start to end, what will I have to do,
if I were to write a query such as 'SELECT * FROM' or 'SELECT SUM(payment) FROM' this
particular user table, let's say if I run this query, the way this entire query gets executed,
it will first fetch all of these individual rows into the result set, one by one, and then from
that entire result set, it will pick just this single column and do the sum. Now, scanning
every row from start to end just to pick one single column is
a useless process for this operation. Understand this, right? Because we just want to get the sum
of payment, I just want to get the information about the payment only. Why am I scanning each
and every individual row? Because this entire database—OLTP databases—are stored on the row
level. Every single piece of information is stored in the row. So, even if I want to get the
information about the payment, I will have to scan all of the data from start to end and then just
select the one single column only. Now, as I said, this is only good for row-level transactions,
if I want to update or delete a specific row. On the other hand, OLAP systems, let me just draw
this, OLAP systems are column-based. So, all of the things are the same. Every single thing, such
as the ID, name, date of birth, age, whatever it is, and this is my payment. On the OLAP system or
the data warehouse, if I execute the same query, these are column-based. Most of the time, you will
find them as column-based. So, all of the single pieces of information that are getting stored
will be stored column by column. In the OLTP case, we are storing individual rows: we have one
single row with, let's say, ID one, the name, and age 25, and after this there will be one more
row attached, so everything gets stored at the row level. Over here, in the OLAP system,
everything is stored at the column level: the IDs get stored together as one, two, three; the
names get stored inside one single column; and the payment information gets stored inside its
own column, say 25 dollars, 26 dollars, something like this. So, every single thing that is
stored internally is at the column level. Just try to understand and visualize this. So, when
I run the same query on the OLAP system, instead of scanning the entire row and then fetching
this one column, it will directly go to the payment column and directly give me the sum. So,
the useless operation of
scanning the ID or the name is not needed. We can directly go to the payment level, and we can
fetch the result that we need. This is the difference between OLTP and OLAP.
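The row-versus-column difference can be illustrated in a few lines of Python. This is only a sketch of the two storage layouts, not how a real engine works, and the records are made up: summing `payment` over row storage touches every field of every row, while column storage reads just the one list.

```python
# Row-based layout (OLTP-style): each record is stored together.
rows = [
    {"id": 1, "name": "D",     "payment": 10},
    {"id": 2, "name": "Akash", "payment": 20},
    {"id": 3, "name": "Riya",  "payment": 30},
]

# SELECT SUM(payment): with row storage we must walk every row and
# pick the one field we care about out of each record.
total_row = sum(row["payment"] for row in rows)

# Column-based layout (OLAP-style): each column is stored together.
columns = {
    "id":      [1, 2, 3],
    "name":    ["D", "Akash", "Riya"],
    "payment": [10, 20, 30],
}

# The same aggregate reads only the payment column; the id and name
# data is never touched.
total_col = sum(columns["payment"])

print(total_row, total_col)  # 60 60
```

Both layouts hold identical data and give the same answer; the difference is how much of it an analytical query has to read.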
Now, understand this as a data engineer. As a data engineer, you will be taking data from OLTP
systems to OLAP systems. In between, we will be writing transformations. The data generation
and storage part that we understood earlier is my OLTP system. This is where the
data is getting generated. This is where I do the storage; this is where I do the transformation,
and this is where I do the analysis. This is where the data warehouse will come into the picture,
and the data analyst will write the query to understand the data, and then they will build
dashboards, ML models, AI models, whatever you want to call them. They will use this OLAP system,
data warehouse, or the storage layer that we will have. We will understand data storage again in
the future about object storage, so don't worry about it. But this is the fundamental of it. Now,
we are just trying to zoom into the individual component and understand what is going on.
So, data engineering is basically taking this data and moving it somewhere else. We should take the
data from OLTP systems, APIs, ingest it into the system, do some transformation, apply some logic,
and load it into the data warehouse. This is the core of data engineering. But how do we do
this? You understand everything, but how does this entire pipeline happen? We have something
called ETL: Extract, Transform, Load. You might already know this; everyone keeps talking about
it. It is the same thing that we talked about in the lifecycle; in one way, the data
engineering lifecycle is ETL only. We are extracting data, transforming data,
and this is the serving layer, which is the loading of data. That is just a conceptual
architecture of how things work. This is what really happens in the real world. We build the
ETL pipeline. We extract the data, we transform the data, and we load the data. Now we already
know about this, right? Where do we even extract all of this data? We extract our data from DBMS,
analytics, sensor APIs, and all of this data from multiple sources. Then this data comes,
and then we do the transformation. We understood transformation also, right? It is about removing
duplicates, handling null values. Structuring data means getting all of the information onto the same
scale. If one age is stored inside, let's say, the string value, and another source has the age
stored inside the integer value, we bring it to the integer level. If the date is in a different
format, we bring it to the same level. And then we load our data. The load can be on anything; it
can be on the data warehouse. Data warehouses are like Snowflake, BigQuery, Redshift, and a lot
of others. Or you can also store it in object storage services like S3, Google Cloud Storage
(GCS), or Azure Data Lake. This is the core concept
of ETL that we will also talk about one by one. Now, okay, so you understood the upper layer.
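A toy sketch of this extract-transform-load flow in Python. The source records, the field formats, and the list standing in for a warehouse table are all made up for illustration:

```python
# Extract: pretend these rows came from an OLTP database or an API.
extracted = [
    {"id": 1, "name": "D",     "age": "26", "signup": "2024-01-15"},
    {"id": 2, "name": "Akash", "age": 25,   "signup": "15/01/2024"},
    {"id": 2, "name": "Akash", "age": 25,   "signup": "15/01/2024"},  # duplicate
]

def transform(records):
    """Remove duplicates and bring every field to the same scale."""
    seen, clean = set(), []
    for record in records:
        if record["id"] in seen:
            continue  # drop duplicate records
        seen.add(record["id"])
        record = dict(record)
        # One source stores age as a string, another as an integer:
        # bring both to the integer level.
        record["age"] = int(record["age"])
        # Normalize dd/mm/yyyy dates to yyyy-mm-dd.
        if "/" in record["signup"]:
            d, m, y = record["signup"].split("/")
            record["signup"] = f"{y}-{m}-{d}"
        clean.append(record)
    return clean

# Load: here just a list standing in for a warehouse table.
warehouse = []
warehouse.extend(transform(extracted))
print(len(warehouse))  # 2
```

In a real pipeline the extract step reads from live systems, the transform runs in something like Spark or plain SQL, and the load writes to a warehouse or object storage; the shape of the work is the same.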
We did all of this work just to understand the top layer of the data engineering lifecycle.
Everything we did till now covered just the top layer. Now I want to look at the bottom layer
of the data engineering lifecycle: the undercurrents of security, data management, data
architecture, orchestration, and software engineering. These undercurrents are also important.
Security: just by the name, you understand that our data should be secure. That basically means
who is able to access our data and the system. We need to make sure the right
person with the right authorization can only access our data. We should not give access to
our data to every single person working in the company. This is the importance of security.
Data management: that basically means data governance. Data governance means we should
be able to easily find the data that we need. Think about this, right? I was working at a
furniture e-commerce company operating in Europe and the US. They had more than a thousand
tables in the system. Now, if I had to find particular data, where this data is stored,
I had to go through the documentation they created to understand, okay, this data can be found at
this particular location. This is what we call data governance: the ability to find data. Then
the definition: what each and every single column means. Think about it, if you have thousands of
tables, and if you access one of the tables from that pool, and that particular single table has,
let's say, hundreds of columns, and you want to understand what the sixth column means. It
could be something like the payment gateway ID or XYZ, something like that. I don't know what
this particular column means. This is the use of definitions, understanding what the data is, what
type of data is stored. This is very important. Data governance. Accountability: who owns this
data? Who is the user? Did you create this table? Which user created the table? So I can go to that
user and understand if I don't really understand the purpose of this table, I can go to the user.
If I am working in the shipping department, I am an engineer over there, and I created the
entire shipping table. Now, if any person from, let's say, the order department or the return
department wants to understand what is going on inside this table, they can directly reach out to
me. I am accountable for that particular data. That is what accountability means.
Then we have data modeling, which we already understood. Data integrity: making sure every
piece of data makes sense; every piece of data is proper. It basically means the data is correct;
it should not have any random information. DataOps: you might already know about DevOps.
DevOps is basically to automate the entire process of deployment of your application using the best
practices. DataOps is somewhat similar. You monitor data governance, observability, incident
reporting. That basically means everything that is happening inside your data system. Every single
thing that is happening in your data system, you should be able to monitor. You should be able to
report the incidents that are happening. All of these things should be automated,
and that is a fundamental concept of DataOps, data operations. So, all of the operations of the data,
right? When you deploy something, is it working fine? If it is working fine or not,
I should be able to get the error message. I should be able to observe how my data pipelines
are working. I should be able to monitor what is going on. All of this is a part of DataOps.
Data architecture: we have a detailed section after this about data architecture where you
analyze the information, analyze the trade-offs, and add value to the business by designing the
proper architecture for the system. We'll understand this.
Orchestration: this is used for coordination, for scheduling jobs, and managing tasks. In data
engineering, we have multiple data pipelines working. Data pipelines are basically the ETL
jobs. It is just a fancy name, but it's just extracting, transforming, and loading the
data to some location. This entire operation is called a data pipeline. Now, like this,
there might be hundreds of data pipelines deployed in the organization. I need to orchestrate all
of these things. Let's say once the first data pipeline completes, I should only run
the second data pipeline because the second data pipeline is dependent on the first data pipeline.
All of these things are called orchestration. We have a tool called Apache Airflow for this kind
of workload, and we will also understand orchestration as we go into the future.
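The "run pipeline two only after pipeline one completes" idea can be sketched as a small dependency graph in plain Python. In practice a tool like Apache Airflow handles this, plus scheduling, retries, and monitoring; the pipeline names here are hypothetical:

```python
# Each pipeline lists the pipelines it depends on (an acyclic graph).
dependencies = {
    "ingest_orders":   [],
    "ingest_users":    [],
    "transform_sales": ["ingest_orders", "ingest_users"],
    "load_warehouse":  ["transform_sales"],
}

def run_order(deps):
    """Return an execution order where every pipeline runs only after
    all of its upstream dependencies have finished."""
    done, order = set(), []
    while len(done) < len(deps):
        for task, upstream in deps.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)
                done.add(task)
    return order

order = run_order(dependencies)
print(order)  # ['ingest_orders', 'ingest_users', 'transform_sales', 'load_warehouse']
```

An orchestrator is essentially this dependency logic wrapped with schedules ("run daily at 2 a.m."), failure handling, and observability across hundreds of pipelines.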
Software engineering: software engineering is basically programming, software design,
testing, and debugging. You have to apply the best practices of software engineering when
you write the ETL, the transformation job using code. You should use a design pattern of software
engineering for scalability. You should also use testing and debugging approaches to test your data
pipelines. So, all of these are the fundamental concepts. When building a data pipeline,
you should remember security is important, data management is important, DataOps is important,
architecture is important, orchestration and software engineering. Just fundamental concepts,
good to know. You don't need to deep dive into it right now; as you move in your career,
you will understand them one by one. The next thing I want to talk about is data
architecture. If you want to become a good data engineer, you should understand data architecture,
and we will be referring to one of the newsletters that I wrote, "Data Architect 101 for
Data Engineers." So, let's jump into that. So, before we move forward, I just want to say
that I am re-recording this segment because I was recording this part yesterday, and my disk
got full. I ran out of space, my OBS stopped recording in between, and the entire file,
like a one and a half-hour file, got corrupted. So, I'm re-recording this part of the video just
to have one complete video. If you're still watching this video till here,
I'll urge you to at least like this video because it takes a lot of effort, and do
comment something so that it increases the reach of this video and it reaches more and more people.
Okay, let's start with the video. Now, till now, what we have done is we have understood the basics
of data engineering, right? We understood what data engineering is, where data engineering fits
in the entire pipeline, the data engineering lifecycle, different parts of ETL, OLAP
versus OLTP. So, we cleared the basic fundamentals required to understand core data
engineering. Now, I want to take you on a journey to understand how data engineering happens
in the real world: how the architecture is actually built from the ground up, how the thought
pattern is developed, how you understand the business side, how to choose the right
technology, and how to put all of these individual components together.
Okay, so let's start. Now, I want to make you understand data architecture first. Because
before we even understand the different parts of data engineering, it is really important that you
understand how to build the basic architecture as a data engineer. Because this is the core skill
set, and we'll be learning about that, right? So, I published this particular newsletter.
If you are interested, you can also subscribe to it. Just go to DataVidhya.substack.com
to get the high-quality data engineering blogs. Okay, so Data Architect 101 for Data Engineers.
Now, till now, we have understood that the goal of every data project is to solve a business
problem. From the start of the video, I've been saying this particular thing again and again,
that everything you do as a data engineer or as an engineer in general, right? You are doing all
of these things for the business. Now, it can be anything from reducing the current system
cost to building a full-fledged data system to help businesses make data-driven decisions. Now,
I want to take you on a journey to understand how to think about building data architecture from
the data engineering point of view. Because as you grow in your career, you should have
the basic understanding of how to design the architecture and how to build data systems. What
is data architecture? So, from the definition of the fundamentals of data engineering, data
architecture is a design of systems to support the evolving data needs of an enterprise. Evolving
data needs are achieved by flexible and reversible decisions reached through a careful evaluation
of trade-offs. We'll understand this technical architecture, but in simple terms, it is basically
like before you construct a building, right? You have to build a blueprint of the building. If
you're trying to build, let's say, a 12-floor building, you have to first build the blueprint.
Inside the blueprint, you have to add some of the things, such as the foundation, floor plans,
elevation, elevator, stairs, office, restroom—all of these things you have to first plan, and then
you can start building the entire construction. Data architecture has a similar concept. Instead
of foundation, floor plans, elevation, and elevators, you'll have to think about storage,
what are the different software that you have to use, how does the data actually flow, interfaces,
how do you write the transformation, the staging areas, data warehouses, reporting systems, and
many more. Just like you think about building an entire building, the construction, you also have
to think about the data when you are building data architecture. You also have to think about what
are the different components that we need in order to build the entire system. This is how we start.
Now, as per the technical definition that we just read, decisions should be flexible and
reversible, which means that for each and every component you put inside the architecture, in
case something goes wrong, you should be able to easily replace it with something else. Every
decision you take, if it goes in the wrong direction, should be easily reversible so that you
can make it right. This is what it
means. It is achieved by flexible and reversible decisions reached through a careful evaluation
of trade-offs. Trade-offs are basically, you have to understand, based on your requirement,
which technologies you can choose. We'll understand all of this step by step.
Now, building data architecture is divided into two different parts. One is business needs,
and the second is technological integration, basically the operational architecture and
the technical architecture. Let's try to understand both of these right now,
and then we'll deep dive into them individually. We focus on the business goals and requirements
inside the operational architecture. Again, we understood, right? Everything that we are
doing is for the business only. So, before you think about choosing the right technologies or
writing code and all of the other things, first, you need to define what the business even needs
in the first place. Because once you know that, then you can think about the technological side.
So, the first step in building data architecture, or even if you're building your own personal
project, is to understand the operational side or the business side. For example, in an e-commerce
platform, what is the impact of the XYZ category of the product? So, I want to find this particular
thing. This is my business goal. I want to find information about this particular product. Why
is there a delay in product shipping? So, I want to understand what is happening with the product
shipping. I want to understand why there is a delay in shipping. So, this is my business goal.
How do we manage data quality from third-party vendors? In e-commerce, we work with different third parties, such as FedEx or other shipping providers, and data might be coming from multiple places. How do we manage data quality while working with these vendors? These
are the different business goals that we have. So, while building technical architecture, we
need to think in this particular direction. These are different things that the business needs. So,
now I have to build my technical architecture to fulfill all of these different requirements.
In the technical architecture, we focus on the technical side: how to ingest, store, and transform data, and what happens when there is a sudden order spike. On the technical side, we mainly focus on storage, how we ingest data, how we transform data, and how the system behaves during a festival sale or a sudden traffic spike—so we also think about scalability. This is more of a system design concern. In short, one is the business side, where you focus on what the business needs, and the other is the technical side, where you think about which technologies you can use. Let's try to understand all of these things in a little more detail with examples.
The operational architecture ensures that your data practice aligns closely with the business
objectives. It is the "why" behind every piece of data you collect, process, and store—why are we even building all of this? It is to support the business in achieving its goals. Here are some insights to think
about when building the operational architecture or defining the business goals. First, start with
the end in mind. Always begin by understanding the business problem you are trying to solve.
This clarity will guide your decisions and ensure that your data architecture directly
contributes to the business outcome. This is very important—start with the end in mind. We need to
understand what the business goals are before you even think about building the architecture or the
technologies. Understand what the business needs, because once you define that, you can easily build
the technological side. Technology is very easy to build if you know what the business needs. If
you don't know what the business needs, you will be stuck in building the architecture
and will never be able to get out of it. Second, iterate and evolve. The business keeps changing—every six months a new product line comes up, priorities shift, product strategies change. So, when you design your architecture, it should be able to iterate and evolve quickly as the business changes. Third, focus on impact. Everything you do should generate value for the business. Every data
solution you architect should have a clear line of sight to its business impact. It can be improving
customer satisfaction, streamlining operations, or enhancing decision-making. The value of your
data initiative should be measurable and aligned with business priorities. This is operational
architecture and aligning with business goals. Now let's talk about the technical architecture,
the building block. This is where the actual execution happens. While operational architecture
is about "why," the technical architecture is the "how" of the equation. By focusing on specific
technologies and methodologies, you'll be able to meet your operational goals. So, what do we do? We
use technologies—technology is our "how" to meet the business goals, which is basically the "what"
we want to achieve. Very simple to understand. On the technology side, there are thousands of tools available in the market. This is the big data landscape (diagram shown), and there are so many tools that you can't even read their names until you zoom in. You don't need to understand each tool; you just need to know that different tools exist for different kinds of workloads. We have a proper framework to choose the different
technologies as per your business use case. Now, you can't choose any random technology and think,
"Okay, I'll use Snowflake, I'll use Apache Spark, I'll use these fancy tools just to
solve my business problem." The fancy tool itself doesn't matter. You can even use a simple Python
script as long as it solves and helps you reach your business goals. Technology is
not about choosing fancy tools or something everyone is using in the market. As a business,
you should be thinking about saving costs and reaching your business needs. Whatever technology
helps you, whether it is an enterprise-level technology or an open-source technology, as long
as it solves your business problem, you're good. Now, let's try to understand that one by one. How
do you build the technical architecture? Simplicity is key—the aim is to keep your
technical architecture as simple as possible while meeting your needs. This approach makes
your system more maintainable, scalable, and less prone to error. The simpler you keep things,
the easier it is to maintain, scale, and quickly identify errors. The more complex
the system, the harder it is to debug errors. Second is choosing the right tools for the job.
There is no one-size-fits-all solution in data architecture. The right storage,
processing, and analysis tools totally depend on your requirements and the specific use
case. If you have structured data, you can go with a data warehouse. If you have millions of rows,
you might not need Snowflake or another expensive database. You can work with
basic ad-hoc query interfaces like Amazon Athena, which will be good to go. All of these different
decisions should be made based on your business understanding. It's not about choosing fancy
tools; it's about solving your business problem. Third is building for scale and flexibility. Even
if you are not dealing with billions of rows right now, in the future your business will grow. If you
are projecting that growth, you should be planning the architecture to scale all the systems.
For example, currently, you're using Python to process millions of rows, but you know you'll
have billions of rows tomorrow. You should keep the system ready in the backend for that growth.
For instance, you can use distributed processing like Apache Spark and scale up the cluster as
needed. Start with a smaller cluster and then think about scaling up as you move forward. It's
not that everything is perfect when you start; you start small and evolve as you move forward.
Fourth is embedding automation. A lot of times, you might monitor different systems manually,
try to solve different errors manually, or build data pipelines manually. Instead,
you should generate scripts and automation to do these things. In case an error occurs, you should
get an email or a Slack notification, depending on your system integration. Instead of checking every
single day whether your data pipeline is working, you should have an alerting mechanism in place so
that you don't have to check manually. Finally, prioritize data
security and governance. In the digital age, data leakage is quite common, so you should properly
secure your database, encrypt your data, and keep your data secure within the network. These
are the different things you need to consider while building your technical architecture.
Now, let's bring all of these different things together to understand how this happens in the
real world. Let's take the example of the data architecture for an e-commerce platform—pretty
easy to understand. The first thing is that we need to understand the business needs. In
this case, let's define the business goals, because this is what we understood first. We
define the operational architecture, like what are the goals of the business. In this case,
the first goal is to improve customer experience: improve site navigation, personalize product
recommendations, and enhance customer service. Simple to understand. We want to improve the
overall site navigation, how customers interact with the application, and build a recommendation
engine and customer service integration. Next is operational efficiency: streamline
inventory management, order processing, and shipping to reduce costs and delivery
times. We need to improve our entire operational efficiency so we can reduce order processing time,
reduce shipping costs, and shorten delivery times. Then, marketing insights: we want to understand
how customers are behaving so we can improve product placement and increase sales.
Vendor management: we might be working with different vendors, so we also want to build a
strategy for better product availability, pricing strategies, and quality control.
And fifth, compliance and security: in an e-commerce platform, people will be
making payments, so there are compliance requirements we need to follow. For example,
we don't capture credit card information, or if we do, we should mask it so that it
doesn't get leaked. These are some of the compliance requirements we have to follow.
So, these are the business goals, right? We want to increase customer experience, operational
efficiency, marketing insight, vendor management, compliance, and security. Now, based on these
business goals, we can think about building the architecture—the actual technical architecture.
The first is our data ingestion layer. We are getting data from multiple sources,
and the purpose of the ingest is to collect data from various sources such as website interactions,
server logs, vendor systems, inventory management, and customer support. We can use technology like
Apache Kafka for real-time data streaming to handle data coming from different sources.
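Kafka itself needs a running broker, so as a stand-in for the producer/consumer flow described above, here is a minimal in-memory sketch. The queue plays the role of a Kafka topic, and the event shapes are invented for illustration.

```python
from collections import deque

topic = deque()  # in-memory stand-in for a Kafka topic

def produce(event):
    """Producer side: sources append events as they happen."""
    topic.append(event)

def consume_batch(max_events=100):
    """Consumer side: the ingestion layer drains events for downstream storage."""
    batch = []
    while topic and len(batch) < max_events:
        batch.append(topic.popleft())
    return batch

# Events arriving from different sources (website, vendor systems, ...).
produce({"source": "website", "event": "page_view"})
produce({"source": "vendor", "event": "inventory_update"})
batch = consume_batch()
# batch now holds both events, in arrival order, ready to be written to storage
```

With real Kafka, `produce` would be a `KafkaProducer.send(...)` call and `consume_batch` a consumer poll, but the decoupling idea is the same.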
After we capture our data, we need to store it in some object storage for a longer period of
time. The purpose is to store collected data in a structured manner for easy access and analysis.
Different components, like object storage (S3 bucket) for unstructured data, or data warehouses
like Snowflake or BigQuery for structured data, can be used depending on
your business requirements. How do you decide which one to
use—Snowflake or Redshift, for example? It depends. If you're already on AWS,
going with Redshift might be a good choice due to integration. But if Redshift is too expensive for
your business needs, you can go with Snowflake or even open-source solutions. You need to research,
understand your data size and frequency, and do a simple proof of concept (PoC) to
see how different technologies behave with your data. Whatever works best, you can choose that.
So, we might have to structure our data before we put it into the data warehouse—that's where
the data processing and transformation layer comes in. This is where we clean, validate,
and transform our raw data into a structured format. For this, we can use Apache Spark if
we're working with large datasets. If you have a smaller dataset, like a few million rows,
you can go with simple Python scripts. But if you have a large dataset and data coming from
multiple sources, you might want to go with Apache Spark, a highly used framework by top companies.
After the data is in the data warehouse, the data analysis and business layer comes
into play. This is where machine learning engineers and data analysts build dashboards
and machine learning models for predictions to help the business move forward. This is where
the final value comes in—when a person from the business team can look at a dashboard,
see issues in shipping, and make the right decisions to improve the overall business.
Business intelligence tools like Tableau and Power BI help us visualize data,
and machine learning platforms like TensorFlow and PyTorch help us build recommendation engines
and algorithms. There's also the side of data security and compliance, where we ensure that
we meet regulatory compliance, such as GDPR and CCPA. These are government regulations you need
to follow when storing data, like encrypting or masking personal information. We'll cover
data masking in more detail later in this video, so don't worry about it.
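Just to give a taste of the masking idea mentioned above, here is a tiny sketch; the helper name and the exact masking format are my own choices, not a standard.

```python
def mask_card(card_number):
    """Keep only the last 4 digits of a card number; replace the rest with '*'."""
    digits = card_number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - 4) + digits[-4:]

masked = mask_card("4111 1111 1111 1234")
# masked -> "************1234"
```

Real compliance work (PCI DSS, GDPR) goes far beyond this—encryption at rest, tokenization, access controls—but this is the basic shape of masking.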
Lastly, we have the data integration and API layer. We'll be working with multiple vendors and sending data between different systems, so we should build APIs for easy integration between those systems—this also needs to be part of the plan. If we cover all of these requirements, our final architecture might look like this (example architecture shown). This is not the only possible architecture, but it might look like this, and you can improve on top of it.
As you can see, we have data coming from on-premises systems, social media,
and stream data. This data is ingested into the system, stored on AWS S3 as a data lake.
We can use transformation layers such as AWS Glue and Lambda to process our data,
and then store it on Amazon Redshift. We can also use Amazon Athena as an ad-hoc query interface
and SageMaker as a machine learning platform. Visualization is done through tools like Tableau.
This architecture is built to fulfill our business needs. We define the business goals,
then define the tools to use, and then build the architecture. If you look at this architecture,
it looks similar to the data engineering lifecycle we discussed earlier. There's data collection,
ingestion, storage, transformation, serving, and end users. The data engineering lifecycle is the
fundamental block, and this real-world architecture applies those concepts.
You can plug and play—if you want to use Google Cloud Storage instead of
S3 as a data lake, you can. If you want to replace Amazon Redshift with Snowflake, you
can. If you prefer Databricks over AWS Glue, go for it. Use what best meets your business needs.
That's everything about building architecture. I hope you understood; if this is clear, we can move forward and discuss the other parts. Now that we've understood architecture and
how it's built, let's try to understand the individual components of the architecture,
their use cases, and how the entire execution happens while building this.
Let's start by understanding the data warehouse. This is what the architecture of a data warehouse
looks like (architecture shown). So, we have data coming from multiple places, as we discussed. Data
comes from APIs, RDBMS, websites—all these places generate data. This data goes to
the streaming engine and gets ingested, and then we write the ETL pipeline. After ETL,
our data gets stored inside the data warehouse. This is the ETL pipeline—what we are doing
is extracting data, transforming it, and then loading it onto the data warehouse.
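The extract-transform-load flow just described can be sketched in a few lines of Python. This is only an illustration: the table and field names are invented, and SQLite stands in for the data warehouse.

```python
import sqlite3

# Extract: raw records as they might arrive from a source system.
raw_orders = [
    {"order_id": 1, "amount": "120.50", "city": " new york "},
    {"order_id": 2, "amount": "80.00", "city": "miami"},
]

# Transform: fix types and normalize text before loading.
clean = [(r["order_id"], float(r["amount"]), r["city"].strip().title())
         for r in raw_orders]

# Load: write the structured rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, city TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)

rows = conn.execute("SELECT city, amount FROM orders ORDER BY order_id").fetchall()
# rows -> [('New York', 120.5), ('Miami', 80.0)]
```

In ELT, the same `clean` step would instead run as SQL inside the warehouse after the raw rows are loaded.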
There's one more concept called ELT, where instead of transforming the data first, we extract and load the data into a staging area, or directly into the data warehouse, and then do the transformation there using SQL queries. This is ELT—extract, load, transform. In ETL,
we extract, transform, and then load it as per our requirement. These are the two ways you can
build a data warehouse. In the real world, ETL is highly used because it's the most structured way
to organize your data. ELT is also used, and some newer companies are trying to replace ETL with
ELT, where you don't have to do the transformation first—you load your data into the warehouse as it
is and then transform it as needed. However, ELT is not as successful because real-world
data is often messy and requires some processing before storing it in the
data warehouse. ETL is what you'll be using most of the time, but it's good
to know that ELT also exists for some use cases. When we built the data model in our relational
database part, we understood that data models are normalized—this means we split the data across multiple tables to reduce duplication in each table. This allows us to have proper information
stored across different tables. Let me show you that again for clarity. This is what it looks like
(example shown). We have different tables that store different information. If you want to get
information about a user who purchased a product, you need to pull the user ID, connect it with the
order table to get order information, then connect with the product information, and if you want to
track payment information, you'll need to join the payment ID—joining
four different tables to get one outcome. However, relational databases are not designed
for analytical workloads. Even if you join all this data and try to run analysis queries by
aggregating user or order information, the OLTP database (Online Transaction Processing database)
will struggle because it's not designed for that kind of workload. It will pull all these
rows one by one and then pull one single column for your final analysis—not ideal.
This is where the data warehouse comes in, but you can't just store your data in a data warehouse
without following specific methods—that's where dimensional modeling comes in. Just like we have
a method to store data in relational databases (data modeling), we have a method to store data
in a data warehouse called dimensional modeling. In dimensional modeling, we have two things:
Dimensions and Facts. Dimensions and Facts are the two types of tables you'll create
to build your data warehouse. This is called a dimension table, and this is called a fact
table. In a given star schema, there is typically one fact table and multiple dimension tables.
The fact table stores information about quantitative data points that
can be measured in the business, such as sales amount, product quantity sold,
revenue, profit—all the quantitative values that get stored in the fact table.
It is the center of your dimensional modeling. On the other hand, there are multiple dimension
tables, each representing different business categories. For example, you might have a
product dimension, a date dimension, and an order dimension. Each dimension table
stores information about the categories or descriptive attributes, such as product name,
product category, user name, user city—all descriptive attributes related to the dimension.
If you want to understand how all this happens in detail,
I have a course available on data warehousing with Snowflake, where I go deep into this. For now, I'm
just providing a fundamental overview. Dimensional modeling is built using two concepts:
star schema and snowflake schema. These are the two methodologies or concepts used to build a
dimension model. Let me show you (example shown). This is what a star schema looks like—there's a
fact table in the center with different dimension tables attached to it. It looks like a star,
hence the name "star schema." The snowflake schema is a more normalized version, where there are
sub-dimension tables attached to the dimension table. It kind of resembles a relational data
model but still has a fact table in the middle, with different dimension tables attached to it.
In the star schema, you have the fact table in the center and dimension tables attached to it,
forming a star shape. The snowflake schema is similar, but with sub-dimension tables
added to the main dimension tables. The snowflake schema is different from Snowflake, the company that offers a cloud data warehouse as a service. Let's look at an example. Let's say we're working
with an e-commerce company. We'll have a fact table in the center, such as an order fact table,
which stores all transactional information. This will have a unique ID and quantitative
attributes like price, quantity, and weight—all measurable attributes in the business. Then,
you'll have different dimension tables, like order dimension, product dimension, and date dimension.
Each dimension table will store descriptive values like product name, product category, and other
relevant information. You can join these tables using a common key, such as product ID, to get the
final analysis. This makes analysis easier because if you want to get information about a product and
its quantity, you just need to join two tables. This join happens in the data warehouse, and the
OLAP database (Online Analytical Processing database) will handle this more efficiently.
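That star-schema join can be sketched as follows; SQLite stands in for the OLAP warehouse here, and the columns are deliberately simplified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE fact_orders (order_id INTEGER, product_id INTEGER, quantity INTEGER, price REAL);

INSERT INTO dim_product VALUES (1, 'Keyboard', 'Electronics'), (2, 'Mug', 'Kitchen');
INSERT INTO fact_orders VALUES (101, 1, 2, 25.0), (102, 1, 1, 25.0), (103, 2, 4, 8.0);
""")

# Analysis query: total quantity sold per product, with just one join.
result = conn.execute("""
    SELECT p.product_name, SUM(f.quantity) AS total_qty
    FROM fact_orders f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY total_qty DESC
""").fetchall()
# result -> [('Mug', 4), ('Keyboard', 3)]
```

Compare this single fact-to-dimension join with the four-table join we needed in the normalized relational model earlier—this is why analysis is easier on a star schema.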
If you want to understand this in more depth, I go hands-on with this
in my data warehouse course on Snowflake. I teach these concepts using real datasets.
Now that we've covered facts and dimensions, I want to talk about Slowly Changing Dimensions
(SCDs). We know that these facts, such as quantity, product weight,
and price, keep changing. Quantity changes, product prices change, and these changes
need to be reflected in the system. We understood that the data flows from sources, like APIs
or RDBMS systems, through ETL to the data warehouse, where it gets updated daily, hourly,
or however frequently it's scheduled. But these dimensions, like product name and user address,
don't change frequently—these are dimension values that don't change for long periods. However,
when they do change, how do we handle that? This is where the concept of Slowly
Changing Dimensions (SCDs) comes in. SCDs deal with handling dimension values that
change slowly over time. There are different strategies for handling SCDs, categorized into
different types like SCD1, SCD2, and SCD3, each with its own approach to handling these changes.
In SCD Type 1, the values are overwritten, and no history is maintained. For example, if we
overwrite data without keeping the previous value, we are using SCD1. If a customer's city changes
from New York to New Jersey, we simply overwrite the New York value with New Jersey. In this case,
there's no way to know what the previous value was—this approach can be used for some use cases.
In SCD Type 2, we maintain a complete history of changes. Every time there is a change,
we add a new row with all the details without deleting the previous value. There are multiple
ways to handle this, such as using a flag approach. For instance, if the city was New
York and then changes to New Jersey, we'll add a new row with an "is active" flag to indicate the
current value. If there are further changes, like moving to Miami, we'll add another row,
keeping the history intact. We can also use version numbers or date ranges to track changes.
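The flag approach for SCD Type 2 can be sketched like this; plain Python dictionaries stand in for warehouse rows, and the column names are illustrative.

```python
def scd2_update(history, customer_id, new_city):
    """SCD Type 2 with a flag: close the currently active row, append a new active one."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_active"]:
            row["is_active"] = False  # old value is kept, just marked inactive
    history.append({"customer_id": customer_id, "city": new_city, "is_active": True})

history = [{"customer_id": 7, "city": "New York", "is_active": True}]
scd2_update(history, 7, "New Jersey")
scd2_update(history, 7, "Miami")

active = [r["city"] for r in history if r["is_active"]]
# active -> ['Miami']; the New York and New Jersey rows remain as history
```

In a real warehouse you would usually also carry `start_date`/`end_date` columns alongside the flag, which is the date-range variant mentioned above.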
In SCD Type 3, we maintain partial history. For example, we might store the current and previous
city in separate columns. If the city changes from New York to New Jersey, we keep New York
in the "previous city" column and New Jersey in the "current city" column.
There are also more advanced types like SCD6, which is a combination of SCD1, SCD2, and SCD3,
capturing the current city, previous city, start date, end date, and active flag all together.
These are fundamental concepts, and if you want to do hands-on practice,
you can find tutorials or check my course on Snowflake, where I cover these concepts in
depth with real datasets. Lastly, there's the concept
of data marts. Let me take a sip of water; you can also drink some water.
Okay, so data marts. A data mart is basically a subset of a data warehouse. To understand this, picture a data warehouse with many different tables (diagram shown): a fact table, plus dimension tables such as the product dimension, order dimension, payment dimension, user dimension, and date dimension. These are the different tables available in the data warehouse.
Now, there are many different teams working in an organization: a team that handles shipping, a team that handles refunds, teams for payments, third-party vendors, accounting, IT—these are the different departments, and inside each department there are different teams. None of these teams really needs the entire dataset; every team wants to solve its own business use case. If you work in any large company, you will always see this structure: the company contains departments, and the departments contain teams working on their own problems so they can meet the company's goals. If a team solves its problem, the department's problem is solved, and if all the departments solve their problems, the company moves forward.
Now, in order to solve their own problems, these departments want to build their own reporting systems—analysis, data science, machine learning models—as per their team's or department's requirements. And to do that, they create a subset of the data warehouse as per the requirement.
For example, the shipping department might only need information about users, payments, and products—just three tables. So they create their own table from those three tables, choosing only the columns they need for reporting. If there are 300 columns available across those three tables, they might pick just 100 and build a reporting system for their own department. This is called a data mart—a subset of the data warehouse, built as per one department's requirements. I solve my department's problem, and that helps the company solve its problem. Pretty simple to understand. Now that the data mart concept is clear, let's move forward.
Now, the data lake. This term became popular with the rise of cheap object storage. Before we can store our data in a data warehouse, as we understood, we have to process it through ETL, and only then can we load it. Everything stored inside a data warehouse is stored in a structured format. That means every time you want to store new kinds of data, you have to change the table structure, and that is quite difficult. Think about a table that already has five columns and millions of rows. If tomorrow I decide to add one more column, all the existing values for that column will be null, and I will have to change the structure before I can start adding rows with the new data. So, changing the structure—changing the schema—is quite difficult in a data warehouse, because you have to take a lot of things into consideration.
have to take a lot of things into consideration. Now, data comes—what it says, right? Okay,
you don't worry about the ETL, okay, you don't worry about the ETL, you don't worry about writing
transformations and putting your structured data. What you can do—you can use a data lake,
like S3—you store all of your data into the data lake. Data lake is basically a storage location.
You can use S3 as a data lake storage, okay? It is a centralized repository where you dump all
of your data as it is, right? I will store all of my CSV data, I will store all of my Parquet data,
I will store all of my JSON data as it is onto the different folder structures in my data lake.
Now, as we saw with data marts, different teams from different departments want their own columns and their own reporting. With a data lake, each team queries the data it needs directly from the lake—reading straight from the S3 object storage as per its requirement. This is called schema on read. The concepts are getting quite heavy here, so pause the video and take a break if you need to, then come back when you're ready.
To summarize: a data lake is a centralized repository. You can use S3, Azure Blob Storage, Azure Data Lake, or Google Cloud Storage as a data lake—essentially object storage where you dump all of your data as-is, in raw form. On the other side, there are users or teams who read this data: "I want these columns from this file, those columns from that file." They read the data as per their requirements and build tables on top of it in Athena or any other ad hoc query interface, or they pull the data from the lake themselves and put it into a structured format. So here, we only process the data we need. Instead of processing everything up front in the ETL and data warehouse stage, we process only what we need and then store that data in the data warehouse for querying purposes.
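Schema on read means the structure is applied only at query time. A small sketch with the standard library: an in-memory CSV stands in for a raw object in the lake, and the column names are made up.

```python
import csv
import io

# A raw file dumped into the lake as-is (CSV standing in for an S3 object).
raw = io.StringIO(
    "order_id,city,amount,internal_notes\n"
    "1,New York,120.5,restock soon\n"
    "2,Miami,80.0,\n"
)

# Schema on read: pick only the columns you care about, at query time.
wanted = ["order_id", "amount"]
rows = [{k: r[k] for k in wanted} for r in csv.DictReader(raw)]
# rows -> [{'order_id': '1', 'amount': '120.5'}, {'order_id': '2', 'amount': '80.0'}]
```

Tools like Athena or Spark do exactly this at scale: the file in the lake stays untouched, and each reader imposes its own schema when it queries.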
Now, it's not that the data warehouse is bad because it requires a lot of processing, or that the data lake is simply better. Both systems have their own place in the architecture: data warehouses give you the structure you need for analysis, whereas data lakes give you the ability to access any data, anytime, as per your requirement.
Okay, let's understand the difference between a data lake and a data warehouse, okay? Inside the
data warehouse, data is structured, as you can see over here. Let me just zoom in. Okay, data is structured, okay? The users are business analysts, and it is used for batch
processing for BI reporting and all of the other things. The data is pre-defined, contains smaller
data, and it is usually relational, right—columns and rows. Over here, data is unstructured because
you can store JSON data, you can store Parquet, CSV, whatever you want. Alright, users are usually
data analysts and data scientists because instead of—think about this, right? Data scientists
want to build their own machine learning models. Now, in the data warehouse, alright,
once you have data added, you can only work with the limited data, right, because you defined that
as per the business goals, and changing the structure is quite difficult—you have to do
a lot of changes inside the pipeline also. So, for data scientists and data analysts, a data
lake is a gold mine because it is completely raw data, right, stored as a file storage, stored as
a file inside the object storage as it is. It is up to me which data I want to read, which columns
I want to read, as per my requirement. I can read using Python code, I can write Spark code,
I can build a table on top of it as per my requirement, okay? So these are the users. The use
case is for stream processing, machine learning, real-time data analysis—you can use that. Okay,
the data is raw, data is large, and it is undefined, okay? It is not properly relational,
so it is undefined, okay? This is the difference between a data lake and a data warehouse,
okay? This is what we have understood till now. Now, this is just the fundamental concept. The
actual hands-on part, if you want, I have some projects available freely on the YouTube channel,
okay? I will just comment down—I will give you the link to that, okay? So if you want to do that,
you can do it and understand the data warehouse and also the data lake. I also teach all of
these things hands-on in my courses, so if you are interested, just check the link in the description
about the combo pack, okay? So till now, we have understood a lot of different things, okay? We
started by understanding what data engineering is, where data engineering actually fits into
the entire pipeline, okay? We understood about the different roles such as software engineering,
DBA, DS, ML, and all of the other things. We understood about the important part, which is the
data engineering life cycle, okay? We understood about the ingestion, transformation, serving, how
all of these things happen. The storage part, we understood about why transformation is needed—like
how the transformation actually happens. Data generation, data storage, DBMS systems,
relational databases, data modeling, okay, how data modeling actually happens, NoSQL databases,
SQL versus NoSQL, data storage processing such as OLTP versus OLAP, the difference between row-based
transaction and column-based databases, why OLTP is needed, why OLAP is needed, why transformation
is needed because we go from OLTP to OLAP while doing the transformation, okay? We understood
about ETL processing, understood about the undercurrent such as security, data management,
data ops, architecture, software engineering. We delved deep into the data architecture part,
okay? We understood about operational architecture and technical architecture, about a lot of
things. We understood about the data warehouse, the important part, okay? ETL versus ELT,
understood about dimensional modeling, understood about the snowflake schema and the star schema,
understood about the difference between fact tables and dimension tables, such as how to build the dimension tables—fact tables store transactional values and dimension tables store categorical values—understood about
slowly changing dimensions, why we need them, different types of them, a lot of things. Data
marts—a subset of the data warehouse—why we need data marts. Understood about the data lake and
the difference between a data warehouse and a data lake, okay? Understood a lot of things about data
engineering, actually. I was not even expecting to go this deep before recording this video—I thought I'd just give an overview, but I went into a flow state and started recording and explaining everything because I really love teaching, right? So, we understood a lot of things. If you've reached
this section, do let me know by commenting, because it might be around 2 hours by now,
and if you're still watching, salute! Alright, so do let me know by commenting that you watched
this video till here and you are about to complete the entire thing, okay? And I just want to plug my
courses—if you're interested, right, if you love my teaching and the way I teach, then do check
out my data engineering courses. I create in-depth data engineering courses, okay? It's
not just about the course—it's about giving you the experience, okay? The understanding of proper
technology, how this works in the real world, right? It's not just about learning technologies;
it's about understanding where it is used, how to use it, following best practices—all of these
things I teach in my courses, so do check them out. You'll find the link in the description.
You'll also find the latest coupon code available with a discount, so go at least check that out.
And yeah, let's continue with our video. Okay, now we understood the fundamentals
and we also looked at this big data landscape. Let me just zoom in, right? Can you see the
tools' names? Can you see the different things available, right? These are the data warehouses,
okay? As you can see, Snowflake, AWS Redshift might be here, Microsoft, Firebolt, Oracle—there are some new companies here. This section is for data lakes. As you can see,
there might be S3, Databricks is used, Cloudera has their own stuff going on—these are storage
systems provided by the different NoSQL databases, like MongoDB. There might be Cassandra somewhere,
Couchbase, and all of the other things. Real-time databases, graph databases—you see,
I was telling you about this, right? For every single use case, like for visualization,
BI platforms, data science notebooks, MLOps, product analytics—all of
these different things, right? For every single technology, for everything that we want to do, there is a different toolset available—every single thing that we understood while talking about the architecture part. We understood that every single step needs a set of tools, and we have thousands of tools to pick from, okay?
Now, we will understand these individual tools, what they do, why they exist, right? What are the
use cases for them, which tools are the most demanded and used by the industry,
okay? So that we will understand, and how to work with them. Let's go one by one.
Now, let's talk about the cloud platforms, right? We understood about the cloud platform.
Cloud platforms are basically giant computers built in some data center owned by a company. It can be Amazon, okay, this is Amazon, this is Google, okay, and this is Microsoft.
Now, again, these are the three top cloud providers available in the market. There
are plenty of cloud providers—you have Cloudera, you have IBM Cloud, you have Oracle Cloud. Every
different cloud provider has its own features, but these are the three top cloud providers available.
What is cloud computing? It is basically these companies giving you the computer resources
and different services so that you can use them for your work. Before this cloud, what we used
to do—we used to build our own servers, okay? Own servers, that means you get your RAM, you
get your hard disk, okay? You get the processing power, processor, okay? You get the GPU if needed,
you get all of the wires, you get the ACs to cool down the servers, you get the networking adapters,
you get all of these different things, switches, you get the routers—every single thing you get,
you build it on your own. Okay, now you can do this—a lot of people still do it because they want
to save on cloud costs, but this also comes with a trade-off because you have to maintain them,
okay? You have to maintain this. What if the power goes down, right? What if my hard disk
fails and I lose all of my data? You also have to think about replication, you also have to think
about scalability, okay? How do I scale this entire thing? Because let's say, right now,
I'm just working with millions of data and the users are small. Tomorrow, my business grows,
so I will have to buy new hardware, okay, and upgrade my system. What if my hard disk fails?
What if my RAM fails, okay? What if the hardware fails? What if an earthquake comes and I lose all
of my data center resources? Anything can happen, right? You don't have control over nature. So,
this is the reason people usually go with the cloud providers because I don't want to set up
all of these things by myself if I can directly pay to the cloud providers, okay? And these
cloud providers usually charge pay-per-use, okay? Pay-per-use means that you only pay for what you use. That's pretty awesome, right? I will only pay for whatever resources I consume. So, if I use a simple virtual machine, which is like an online computer, and I run it for two hours for some workload, I am only going to pay for those two hours, okay? In an on-premise data center, I have to keep the machines running 24 hours a day, because that is how the entire server is set up—my website is hosted on it, there are other functions running, databases, and everything, so it has to stay on. But if I just want to run some workload quickly for two hours on the cloud, I can rent that,
and I can also pay for that use case. Cloud has multiple services available
for different use cases, okay? These different services are divided into
three different parts. We have PaaS, okay, we have SaaS, okay, and we have IaaS, okay?
This is Platform as a Service, this is Software as a Service, and this is Infrastructure as
a Service. What do these three things mean? Platform as a Service means they give you the
direct platform, so you don't have to worry about setting up different things. So, for example, on AWS, we have a service called AWS Lambda, okay? You can call it Platform as
a Service because they directly give you one kind of platform where you can just
focus on writing your code—they will take care of all of the infrastructure side,
such as running the server, all of the backend things, they will take care of the maintenance
and everything. You just focus on writing your code. This is called Platform as a Service.
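To show how little you actually write when the platform handles everything else, here is a minimal Lambda handler sketch. The event shape follows the documented S3 "object created" notification format, but the bucket and key values in the sample event are made up for illustration:

```python
# A minimal AWS Lambda handler sketch (Platform as a Service: you write only
# this function; AWS provisions and runs the servers for you).

def lambda_handler(event, context):
    """Invoked by AWS with an event, e.g. whenever a file lands in S3."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Here you would start your processing (e.g., kick off a Glue/EMR job).
        results.append(f"s3://{bucket}/{key}")
    return {"processed": results}

# Local smoke test with a fake S3 event (no AWS account needed):
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-zone"},
                "object": {"key": "orders/2024/01.json"}}}
    ]
}
print(lambda_handler(fake_event, context=None))
```

Everything outside this function—servers, scaling, retries, patching—is the "platform" part of Platform as a Service.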
Second is Software as a Service. You can think about Software as a Service as the Google Suite, alright? You have Google Sheets, Google Docs, Google Slides—the entire Google Suite. You can think about that as Software as a Service
because they are directly giving you access to the software as a service for your work,
so you can use that and grow your business. Then we have Infrastructure as a Service, okay?
That basically means cloud providers will give you the infrastructure. So, an example of this is an EC2 machine—this is basically a virtual machine online. There's also a service called EMR—Elastic MapReduce—to run your Spark jobs. These are the different pieces of infrastructure that they
give you so that you can run your workloads, okay? This is how cloud platforms are divided into three
services—they give you these services that you can use to grow your business, right? Now, these
services have different names as per the cloud providers, right? If I go to AWS, right, on AWS,
we have these many services and many more. These are just a few services, right? Don't worry about the names if you are seeing them for the first time. If you already know them, that's good, but if you are seeing these services or these logos for the first time, don't get overwhelmed, right? There's something called EC2,
which is like the virtual machine, we have Lambda, where you can just write and run your code on the
serverless machine, okay? Elastic Container Service, if you want to run a Docker image,
okay? There is Simple Email Service—it is used for email and notification purposes. Aurora is a relational database created by AWS, so if you want to store your relational data, it is available as a service. It's like AWS is giving you the service, so you only pay for the number of hours you use or the amount
of resources that you consume, so that you don't have to build all of these things by yourself.
Everything is built for you, pay for it, and grow your business. Elasticache, DynamoDB, right? EMR,
VPC, CloudFront, Elastic Load Balancing, Kinesis for real-time data, RDS for relational databases,
Redshift for data warehousing, right? Elasticsearch for search and log analytics, Simple Storage Service—object storage to build a data lake, right? Elastic Block Storage for block-level storage, Cognito, API Gateway, queue systems—everything you need to build your entire technical architecture,
right? We understood—we have the business goals, but once you define the business goals,
you think about, right now, how to build my technical architecture. So, you start thinking,
okay, which cloud computing platform should I go with? Now, most of the time, you might
have the answer—let's say you are a student right now, okay? You might have a question:
which cloud computing is the best and will give me a job? The answer is, pick any one of the three,
and there are high chances that you will get a job because most companies only work
with these cloud providers. If I were to rank them, okay, this is just my personal opinion,
it can be wrong. This was my opinion until about a year back.
I used to rank AWS as one, okay? Azure as two, and Google Cloud as three. Okay, now it is changing,
and I'm seeing the trend that Azure can be one because a lot of companies are using Azure due
to their new functionality and good services. The services that they provide are specific to
the enterprise level, so Azure is good if you want to target enterprise-level companies. They always
go with Azure, especially in India, because a lot of companies directly use the Microsoft app suite,
like Microsoft 365 at the enterprise level—because Microsoft Word, PowerPoint, and all of the other
things. So, they are likely to go with Azure because the integration is quite simple,
right? A lot of startups usually go with AWS because AWS gives you good credits, you can
easily start, and a lot of people know AWS, like the industry. If you want to find resources or
employees with AWS skills, it is quite easy to find, so a lot of startups pick up AWS. Like,
I'm building my data engineering startup, okay? I'm also using AWS for my infrastructure.
The third one, I still say, is Google Cloud. Again, there are some services Google provides
that are really good, but these are my takes, right? This is my personal take, it can be wrong,
but this is what I see in the industry. If you want to target top companies—and by top companies I mean the enterprise level, like banks; service-based companies such as Infosys and TCS can also be taken into the picture, as per your requirement—companies that have already gone public, you can just research their architecture, and you will find that a lot of them use Azure if they are enterprise level. A lot of startups, like Indian startups—if you see Zepto, if you see CRED, okay—all of these guys are on AWS because it's good for startups and gives them a good ecosystem. So, I say if you want to target startups, learn AWS and GCP. I always
suggest either learning Azure or AWS unless you want to target a specific company and they tell
you that they require skills in GCP, then go with GCP. Okay, I just answered your question.
If you are a student, then you can go with this. If you are someone who is looking to build the
architecture, again, the situation is the same: think about the services that solve your problem.
Okay, we will talk about the different services, but the idea is to think about what services these
cloud providers give us that can help us solve our business goals. We understood about operational
and technical architecture—now you start thinking from this point of view: if I were to choose AWS,
GCP, and Azure, and if I say, okay, Azure gives me these services, AWS gives me these services,
and as per my requirement, I can easily solve all of my business problems using Azure because
they have a good service pack together, so I'll go with Azure. Like, I can do a simple small
project on Azure and see if that works—if it works, I can move my entire production
workloads onto Azure. Okay, if that doesn't work, there is also the concept of hybrid cloud,
so you use some services from Azure, you use the best services from AWS, you use the best
services from Google Cloud, okay, and build your system. For example, in my personal opinion,
right? I really love Google BigQuery—this is a data warehouse provided by Google, okay?
And on Azure, I really love the Databricks integration, okay? On AWS, I really love the Glue service, which runs serverless Spark workloads, and I also love S3 as an object storage, okay? So, if I want, I can use S3 as my object storage,
I can use Databricks as my Spark workload, and I can use BigQuery as my data warehouse. So, you can
also do cross-cloud integration, but maintaining all of these things is quite difficult. Again,
there are some tools that can help you with that, but these are the different concepts
that you can explore. I just want to throw them at you right now so that you can keep that in mind,
okay? Let's move forward—let's talk about the services that we understood, okay?
Now, we understood, right, these are the services—so let's say if I go with AWS,
and if I build my entire architecture, if I want to build my ETL pipeline, okay,
how will I go with that? Let's say this is how it will happen, okay? Let me just remove this,
okay. Collect, process, store, and analyze, right? Data engineering lifecycle—the simple
architecture that we've been understanding. I can collect data from S3, Kinesis, DynamoDB, RDS, MSK,
whatever, right? This is object storage, this is the real-time data streaming platform, this is the
NoSQL database, this is the relational database. We understood data is coming from multiple places,
okay, where we can collect our data and easily ingest it. Then we can do the event processing,
okay? Let's say if you want to do something, let's say every time data gets uploaded onto Amazon S3,
I want to run the Lambda function. Okay, Lambda function is basically the compute service, so if
you want to run small code, you can do that—I can do this, and then I can do the actual data
processing using EMR, which is a Spark workload. I can run the machine learning, I can run AWS Glue,
again the Spark workload, and then I can use these services for analysis. So on AWS itself,
I can build my entire data system, right? Instead of going out and picking random tools, AWS gives
you a wide range of services that you can pick from that pool and build your entire data system,
okay? This is just an example, okay? Just to help you understand from this entire service tool pack,
right, that AWS gives you—we understood about services. Services can be platforms—they might
give you the platform, they might give you the software as a service,
they might give you the infrastructure as a service, right? These are the different
services that they provide, and using these services, I can build my entire platform,
okay? And it might look something like this, okay? Now just pay attention, okay? Don't get confused,
don't get scared about all of these things—now we're just trying to go a little bit advanced,
okay? And this is the architecture of one of the top startups in India, called Dream11, okay? Dream11 is a fantasy sports app. This is the architecture of Dream11 that
they have used to build on AWS. Now, if you see this architecture, you will understand it is not
completely AWS, okay? There are some things that they use from AWS, as you can see over here, okay,
and there are some things that they use that are open source, and this is how technical systems are
built. This is the final version of Dream11—they went through three different phases to build this
particular architecture. I have posted about it on LinkedIn—I will put the link in the description. If I forget, just remind me in the comments and I will add it. Okay, now let's try to understand and also let's
try to remember our data engineering lifecycle, okay? Even though this architecture looks quite
complicated, the fundamental concept, okay, the data engineering lifecycle is quite the same,
okay? First of all, what do we have in the data engineering lifecycle? First,
we have the generation source. Now here, as we understood, our data is coming from
multiple places, so we have third-party vendors, okay? As you can see—let me just zoom in. Okay,
our data is coming from third-party vendors, there is some RDBMS, like MySQL, and there is some NoSQL, like the Cassandra database, okay? And then there's the application—there are iOS and Android applications, and there's the desktop site, Dream11.com, as you can see over here. So, we understood, right, data comes from
multiple places. In this case, data is coming from third-party vendors, from the databases, and from the applications. I kept telling you, right, that data comes from multiple places—this is what it means. Now I want to ingest this data into my system,
and most of the time, for ingestion, for real-time streaming ingestion, or just ingestion,
people use Apache Kafka, okay? Apache Kafka is a real-time data streaming platform, a distributed
real-time data streaming platform, so you can work on large-scale data, okay? And you can easily put
Kafka in between to consume all of the data, okay? In Kafka, these are all of the producers,
okay? Let me just write it over here. All of these people, okay, are producers who are producing all
of this data, okay? Once the data gets into Kafka over here, okay, everything else that happens is
consumers—consumers who are consuming all of this data, okay? Simple to understand. Again, we are
not deep-diving into Kafka—I will be launching a course on Kafka, so you can keep an eye on that,
okay, in the future. But data is getting produced and data is getting consumed—here, consumption is
basically what I want to do with this data, okay? So, there is a batch pipeline going on over here,
as you can see, okay? This is a batch pipeline. First of all, we understood, right, once the data
is ingested, we need to store our data somewhere, right? There was a storage layer below. So, the
data gets stored inside Amazon S3 as a data lake. Now, the concept of the data lake is coming. Now,
the concept of Amazon S3, which is a service on AWS S3, is coming, right? I kept telling you—you
can use S3 as a data lake. I store my data onto the data lake, okay? What happens here after this?
This data goes through the ETL, okay? As you can see over here, this data is going through the ETL,
okay, and the ETL is happening using Apache Spark, okay? There is some Apache Spark workload
available, and then it stores our data onto Amazon Redshift. This is what I kept telling you—this
is a data warehouse, okay? This is my data warehouse service available on AWS, right? So this is my ingestion, this is my data lake—my storage—and this is my data warehouse. There's
one more thing that I told you, right? In a data warehouse, we put our data by transforming and
making it into the structured format. Now, there is one more pipeline that goes—it is called ad hoc
analysis, okay? And as you can see, it is using Amazon Athena, which is a query engine for ad hoc
analysis, and I told you, right? The Looker, okay, the reporting system or the data science people,
can use the raw data that is coming. I can use this raw data as it is, okay, from the system as
per my requirement, or I can also use structured data as per my requirement. So, I get access
to both of these things—I can get the proper structured data also, and I also get the raw
data as per my requirement, okay? This is there, again. Understand the data engineering lifecycle
that we understood—understand, now try to connect every single thing that we have done, right? We
understood data warehouses, we understood data lakes, we understood ETL, we understood ingestion,
storage—every single thing is put together into the real-time system of Dream11 case study, right?
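The ingestion part of that pipeline—applications producing events into Kafka, downstream jobs consuming them—can be sketched as a serialize/deserialize round trip. The topic name, event fields, and broker address below are made up for illustration; the actual network send (commented out) would use a Kafka client library such as kafka-python against a running broker.

```python
import json

def build_event(user_id, action):
    """Producer side: serialize one application event into the bytes
    that would be sent onto a Kafka topic."""
    return json.dumps({"user_id": user_id, "action": action}).encode("utf-8")

def consume_event(raw_bytes):
    """Consumer side: what a downstream batch or streaming job does—
    deserialize the message and decide what to do with it."""
    return json.loads(raw_bytes.decode("utf-8"))

# The round trip works without any broker:
msg = build_event(42, "match_joined")
print(consume_event(msg))  # {'user_id': 42, 'action': 'match_joined'}

# With a real broker, the producer side would roughly look like
# (requires the kafka-python package and a broker at an assumed address):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("user-events", value=build_event(42, "match_joined"))
#   producer.flush()
```

The key point of the pattern: producers and consumers never talk to each other directly—they only agree on the topic and the message format.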
We're just trying to understand the real-world architecture right now, and how they use the
fundamental concept in the real world, okay? Every single thing that we talked about, okay, it makes
sense here, right? We have the ETL system for ad hoc analysis, we have the structured data, we have
the ingestion system going on, okay? This is just a batch pipeline, okay? This is there. There's one
more thing—we have the real-time pipeline going on over here, okay? For the real-time pipeline,
what they are using is Apache Flink, okay? Apache Flink is used for the streaming engine,
so if they want to understand data on a real-time basis, they can use this and analyze it. So,
from the streaming engine, we go to Elasticsearch, there might be some notification service,
there might be some visualization available over here—not sure about that, but this is
the entire pipeline. And the fundamental concept that we use is the data engineering lifecycle,
okay? And all of the concepts that we use. So, every time I store my data onto Redshift,
this is the data warehouse. I might use dimensional modeling, okay? After the ETL, I use
Apache Spark to transform my data. I use S3 as my data lake storage, and I use Amazon Athena for the
ad hoc query. I use Looker for my visualization, I use Jupyter Notebook for my data science workload,
I use Kafka for my ingestion—these are all of my sources. I use Apache Flink for handling real-time
data streaming—this is the real architecture. This is everything we did in the last two hours
just to understand this particular thing. Once you have understood this, you get a good gist of data
engineering. Now you know, like, yeah, I am a data engineer because I understand this architecture
and what is going on. This is the fundamental part, right? Once you understand the fundamentals,
you can understand any architecture. Now, once you complete this entire video, you can understand any
architecture in the world because you will know, okay, there is some ingestion going on, there
is some transformation happening, there is some loading happening, there is some ETL happening.
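That tool-agnostic ingestion–transformation–loading shape can be sketched in a few lines. The sample records and the in-memory dict standing in for the warehouse are invented; in practice extract would read from Kafka or S3 and load would write to Redshift, BigQuery, or Snowflake, but the structure stays the same.

```python
# A tool-agnostic ETL sketch. Swap the extract source for Kafka/S3 and the
# load target for Redshift/BigQuery/Snowflake—the shape does not change.

def extract():
    """Ingestion: pull raw records from a source system (stand-in data)."""
    return [
        {"user": "a", "amount": "120.50", "country": "in"},
        {"user": "b", "amount": "80.00", "country": "us"},
        {"user": "a", "amount": "19.50", "country": "in"},
    ]

def transform(rows):
    """Transformation: fix types, standardize values, aggregate per user."""
    totals = {}
    for row in rows:
        key = (row["user"], row["country"].upper())
        totals[key] = totals.get(key, 0.0) + float(row["amount"])
    return totals

def load(totals, warehouse):
    """Loading: write the structured result into the 'warehouse' (a dict here)."""
    for (user, country), amount in totals.items():
        warehouse[f"{user}-{country}"] = round(amount, 2)
    return warehouse

warehouse = load(transform(extract()), {})
print(warehouse)  # {'a-IN': 140.0, 'b-US': 80.0}
```

Every architecture in this video—Dream11's included—is some elaboration of these three functions.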
I understand this—the tool is different, right? I can replace this tool with Snowflake, I can
replace this tool with Databricks, right? I can use something else over here—it doesn't matter,
okay? It will work the same. The features might be different, the performance might be different,
but fundamentally, it will give the same output. But for their use cases, for Dream11,
they might have tried multiple things and then finally came up with the final architecture that
is currently working for their system, right? Everything that you see on their application,
everything you see on your app as Dream11, there is this kind of system behind that,
making it possible, okay? It's not some magic. Alright, we understood AWS—now let's understand
GCP. Just like AWS, right, we understood that we have different services available on GCP also:
ingestion, okay? For ingestion, we have App Engine, Cloud Pub/Sub, Cloud Transfer Service,
BigQuery, Cloud Function. Now, I want to tell you this: most of the cloud providers have similar
services available. For example, this is a data warehouse available on GCP called BigQuery, which
is the same as Redshift on AWS. Okay, there’s Cloud Function available, which is the same as
AWS Lambda—fundamentally, they give you a similar platform to perform your workload. The name is
different, the feature is different, the cost is different, but fundamentally, it is the same,
right? Just like we have Cloud Storage—this is basically GCP's version of AWS S3, object storage. We have Cloud SQL—this is like RDS, the Relational Database Service that we talked about, right? BigQuery is the data warehouse that we understood. There's Dataprep, and Dataproc is the same as EMR, okay?—Elastic MapReduce on AWS. So, if you understand,
okay, services are the same—like most of the cloud providers have similar overlapping services. It's
always about choosing the best service for your use case. So, for ingestion, they have this many,
for storage, they have this many, for processing, they have this many, and for exploration. Again,
the concept of the data engineering lifecycle: I have to ingest something, I have to store
something, I have to process something, I have to serve something. Okay, this is the simple
architecture on the GCP. Same fundamental concept applies—I have data coming from multiple places,
I ingest this data, I store this data, there is a pipeline that is running right now. Again,
I store some data onto BigQuery, okay? And then there are some privacy and identity pieces running—this is like the end-user part, right? Customer platform—there
is customer data, and data destinations such as web apps, customer service, marketing messaging.
Same fundamental concept: data source, collect, process, store, and give it to something. This is
where the entire data engineering is happening, right? I get the data, I ingest it properly,
I store it, I process it, I give it back. Okay, now this is done on GCP. Again,
let's look at the Azure level also, okay? These are the developer services. We have compute,
okay? For the compute, we have virtual machine, cloud machine, batch storage, again, the same.
We have the web and mobile app. For data, we have the SQL database, Redis Cache, we have SQL Data
Warehouses—that is also available. For analytics, we have Data Lake Analytics, Data Lake Store,
Stream Analytics, Machine Learning, Data Factory, okay? IoT, we have media, we have identity access.
In my opinion, as a data engineer, Azure has very good services for data engineering workloads, okay? There are three services that I really like on Azure. One is Databricks,
okay? I really like Databricks because it is properly integrated with Azure, and Databricks
is basically the environment to run Apache Spark workloads, okay? Second is Data Factory,
okay? Data Factory, and third, I like Synapse Analytics, okay? Most of the
services—and there’s a new service available I haven't explored called Fabric. Microsoft Fabric
is basically the combination of these multiple services where you can do everything in one place,
okay? It is especially designed for data engineering workloads,
making your life much easier. I have a project on this on my YouTube channel available for free,
okay? I'll put the link in the description—if I forget, do let me know by commenting, I will put
that. If you want, you can explore that. I also teach about all of these things in my courses,
so we do have projects available on that—you can explore that by going to the website.
Okay, now, this is the architecture side of the same thing, just like AWS,
okay? I can replicate this entire architecture on GCP also and also on Azure. What I have to do is
basically just replace—let's say if I'm replacing this entire thing onto the GCP, what I will do,
instead of S3, I will use Cloud Storage, okay? Instead of Redshift, I will use BigQuery,
okay? I can put Dataproc here, okay? I can also put BigQuery here if I want. For the streaming engine,
I can put Pub/Sub and Dataflow, okay? Um, what else? For Kafka, I can put Pub/Sub,
but I'll say I'll go with Kafka—Kafka is best. Okay, Looker is good, this is good, everything
else seems fine. So, I can convert my AWS architecture to GCP. Performance might
be different, the costing might be different, the UI might be different, the integration might be
different, but I can do that. Okay, I can also do the same for Azure as well, simple. Okay, and
this is what the Azure architecture says, right? What do we have? We have the customer stream data,
we have the customer batch files, okay? Uh, we are ingesting this particular thing, and we are
just adding this onto ADLS, which is Azure Data Lake Storage Gen2. The data arrives there from
the external sources. Okay, now there's a Data Factory running, okay? Data is coming
from on-premise sources, and some stream data goes to the Data Factory. It gets entered into the raw
zone, okay? The raw data is getting entered. We use Databricks over here to process this
raw data and store it in the processed folder. After that, it goes to the analytical zone,
and it goes to the SQL pool, which is a SQL data warehouse. Now from here, customers can
use this to build the Power BI dashboard and get insights, okay? It can also be integrated
with the desktop application if needed. Same concept: collect, ingest, store, transform,
serve, and use it. Same thing is happening, and there are some supporting services underneath. As you can see,
we are using Azure Key Vault to securely store our keys, Log Analytics, Azure Purview,
and Azure DevOps to properly operationalize the entire integration and the scripts.
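The flow just described, raw zone to processed to serving layer, can be sketched in plain Python. This is a toy stand-in for the Azure pipeline above, not the real thing; the record shapes and cleaning rules are my own illustrative assumptions:

```python
def ingest_to_raw(records):
    """Land incoming records untouched, exactly as they arrive (the raw zone)."""
    return list(records)

def process(raw_records):
    """Clean the raw data (the Databricks step at scale): drop bad rows, normalise values."""
    return [
        {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
        for r in raw_records
        if r.get("customer", "").strip() and r.get("amount")
    ]

def to_analytical(processed):
    """Aggregate for the serving layer (the SQL pool / warehouse side)."""
    totals = {}
    for r in processed:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

stream = [
    {"customer": " alice ", "amount": "10.5"},
    {"customer": "BOB", "amount": "4.5"},
    {"customer": "", "amount": "99"},      # bad record, filtered out during processing
    {"customer": "alice", "amount": "2.0"},
]
raw = ingest_to_raw(stream)
analytical = to_analytical(process(raw))
print(analytical)  # {'Alice': 12.5, 'Bob': 4.5}
```

The point is the shape of the pipeline, not the code: each zone only ever reads from the zone before it, which is exactly what the ADLS raw/processed/analytical folders enforce at scale.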
These are the different services that can be used together. So, these are the three different
pipelines that we have used till now. Okay, I just showed you using AWS, I showed you using
GCP, and I showed you using Azure. Now, let's look at the modern data architecture, right? This is
also modern, just especially built on the cloud. This is the modern data architecture, right?
Modern data architecture is basically where new companies are coming into the market and saying,
"Okay, the tools that you guys are using are old now. They don't work with the new data workloads,
the new volume, and the approach is very old, okay? And I, as a new startup, I am a modern
data company. I will make your life easier." So instead of you doing the ETL, remove the ETL,
okay? I will say directly load the data into my product, okay? And I will directly transform it
for you as per your requirement, so that you can directly save time on the ETL and start querying
the data. This is what the modern company says. They all have different requirements, so they
directly give you the integration between your different sources, as you can see here, right? Uh,
I have different data coming from sources like Stripe, Google, PostgreSQL, Google Play. What
they say is that they have the integration with all of these different sources. These are the
applications, right? Fivetran, Airbyte, Stitch, okay? These are usually used for ingestion. You
can also use Python and SQL, which is also the modern way. Before this, we had Hadoop and all of
the other workloads. This company comes and says, "Okay, use our system because we have made all of
these things easier for you. Directly connect with these multiple sources, we'll pull the
data for you, and we'll directly load it onto the data warehouse so that you can do everything as
per your requirement." There is a popular tool called dbt (Data Build Tool) that is used for transforming data inside the warehouse.
People say that it is going to replace SQL. Not going to happen. Most of the time, a lot
of companies come and want to replace SQL, but still, SQL is the king of data, right? You should
always learn SQL. dbt is also gaining a lot of popularity. With dbt, you can divide your data into
multiple stages. This is the ELT idea that I told you about, right? We were doing ETL till now:
Extract, Transform, Load. Now we do EL first: we extract our data and directly load it into the
data warehouse, okay? It can be Snowflake, BigQuery, Redshift, doesn't matter. And we divide our
data into different zones, okay? This is modern data architecture, right? We create the landing
area, we create the staging area, we create the warehouse layer, and we create the mart layer. Same
fundamental concept. If you see, there is a data mart, there's a data warehouse, there's object
storage, and there's a landing area to store the raw data, right? I store my raw data, I store the
staging data after some transformation, I store my data in the warehouse, and this is my mart layer. All
of these things you can create inside the DBT, and directly you can store your data onto Snowflake,
Redshift, and all of the other things. Same thing. Then it can be consumed by the BI people,
machine learning people, they can build dashboards on different tools, uh, you can, uh,
do the analysis on different tools. There's also the concept of reverse ETL. Companies are using
that. Basically, that means I have transformed my data, I can put back this data onto the
source system and get more insights from the transformed data by ingesting that data back to
the system again. Uh, this is a totally different concept. Um, I'll cover it in some other videos,
but there's also a concept of reverse ETL that we also saw in the data engineering life cycle, okay?
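The ELT layering described above (land the raw data, stage it, then build marts inside the warehouse) can be sketched with Python's built-in sqlite3 standing in for Snowflake or BigQuery. The table names, layer names, and cleaning rules here are illustrative, not from any specific tool:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# EL: extract from the source and load it as-is into a landing table, no transform yet.
cur.execute("CREATE TABLE landing_orders (id INTEGER, amount TEXT, status TEXT)")
cur.executemany("INSERT INTO landing_orders VALUES (?, ?, ?)",
                [(1, "10.0", "paid"), (2, "5.5", "PAID"), (3, "bad", "refunded")])

# T inside the warehouse (the kind of thing a dbt model expresses):
# the staging layer cleans, casts, and filters...
cur.execute("""
    CREATE TABLE staging_orders AS
    SELECT id, CAST(amount AS REAL) AS amount, LOWER(status) AS status
    FROM landing_orders
    WHERE amount GLOB '[0-9]*'
""")
# ...and the mart layer aggregates for consumers such as BI and ML.
cur.execute("""
    CREATE TABLE mart_revenue AS
    SELECT status, SUM(amount) AS revenue
    FROM staging_orders
    GROUP BY status
""")
rows = cur.execute("SELECT status, revenue FROM mart_revenue ORDER BY status").fetchall()
print(rows)  # [('paid', 15.5)]
```

Notice that every transformation is just SQL run inside the "warehouse", which is exactly why dbt complements rather than replaces SQL.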
Modern data architecture—we understood about GCP, we understood about Azure, AWS, and the modern
data tools. A combination of these different tools and AWS and Azure can build the modern
data platform. Now, again, we talked about this. Here we have like thousands of tools available,
right? How do I decide which tool is best for me? First, I look at the business requirement. Does
this solve my business problem? If it does, then I should use that tool. And any choice should be
reversible, so, okay, I can easily remove it. Let's say this tool is costing me too much, okay? And
it's not really even solving my problem. I can remove it and go with another tool. If that tool is
also not working, I can go with Spark, because it is open source and it is going to work, right?
The managed tools are going to cost you; Spark is going to cost you only for the servers. So you
have to choose: this one is easy, this one might be
quite difficult to set up. So, as a company, if you are a startup, people usually go with using
these things because it saves time, okay? You have the money, but you want to save time, so you
can go with this. This will solve your problem, this will also solve your problem, okay? Whatever
solves your problem, whatever helps you reach your business goals, you can go with that, okay? Now,
uh, I just want to take a break, so I'll have some water, and I'll come back in 1 minute. Alright,
till now we have understood a lot of things. Now, this is kind of like the end of the video,
and again, I can't cover every single thing, but I want to leave you with some of the important
tools that you can learn about data engineering and some of the concepts, uh, at the end, okay?
So that is important for you in your career. So let's start with that, okay? Important tools for
data engineering. Now, first of all, if you want to become a data engineer, you have to learn a few
things. First of all is the programming language. You have three choices: Python, Scala, and
Java. Now, if you want to learn any programming language, I always suggest starting with Python,
okay? It's the easiest to learn, mostly used by industry because if you want to write the,
uh, ETL scripts, if you want to write the Kafka ingestion engine, and all of the other things,
Python has a lot of packages that make your life much easier, and even industries, uh, use Python
for all of these workloads, so you should always go with Python. If not Python, you can also go
with Java. Java also has good support because most of the open-source frameworks, like Apache
Spark, run on the JVM (Spark itself is written in Scala), okay? So you can go with Java also, but my suggestion
is to go with Python, okay? Now, what is important for you to learn in Python? You can learn about the basics of
Python, such as variables, operators, basic data structures like dictionaries, lists, all of the
other things. Important things to learn include how to work with date and time formats. There's
also a package in Python called Pandas, so you should learn how to do basic data transformation
with it, and how to work with different file formats, like CSV, JSON, and Avro, okay?
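As a small taste of the pandas and file-format work mentioned above, here is a sketch that reads CSV data, applies a basic transformation, and emits JSON. The column names and values are made up for illustration:

```python
import io
import json
import pandas as pd

# An in-memory buffer stands in for a CSV file on disk or in object storage.
csv_data = io.StringIO("name,signup_date,amount\nalice,2024-01-05,10\nbob,2024-02-10,20\n")
df = pd.read_csv(csv_data)

# Typical cleanup: parse dates and derive a new column.
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["amount_doubled"] = df["amount"] * 2

# Write the transformed data back out in a different format (JSON here).
records = json.loads(df.to_json(orient="records", date_format="iso"))
print(records[0]["name"])  # alice
```

The same read, transform, write pattern carries over whether the target format is JSON, Parquet, or Avro; only the reader and writer calls change.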
This is what you can learn in Python. Uh, I have already created a detailed roadmap for this,
so I'll also put the link to that particular video in the description. If I forget, do let
me know, and I will add it, okay? SQL—again, SQL is the backbone of your data career. You
cannot skip this. This is how you communicate with the databases. We understood everything,
so you have to learn SQL. This is non-negotiable, okay? You cannot skip SQL. You cannot skip Python.
This is the foundation, so you have to, have to learn this. After this, you can understand Linux
commands because you will be working with some of the, uh, cloud providers or Linux machines.
Something like 80 to 90% of the servers online run on Linux, so you should learn to
interact with them, because a server doesn't have a GUI, right? There's no graphical user
interface; you'll be accessing it using the terminal. You can learn commands like cd, clear,
cp, exit, find, and cat for viewing files, okay? These are the different
commands that you can learn. You can just search on YouTube, Basic Linux Commands, and you will
get a good tutorial, okay? Now, we have data warehouses. Again, you don't have to learn all of
the data warehouses, okay? You can learn—you have the AWS Redshift available, you have BigQuery, we
have Hive available, SAP Analytics, and Snowflake. My suggestion is to either learn Snowflake because
this is not dependent on the cloud platform, okay? This is cloud-independent, so you can learn this.
Also, this is highly demanded in the market, so you can easily learn this and add it to your
skill set, very highly in demand. There's one more that I love personally, which is BigQuery,
okay? Because I've worked with BigQuery for the last three to four years, and I've really enjoyed
this service, so this is one of my favorites because this is one of my favorite services on
GCP. So, my suggestion is to go with Snowflake because this is cloud-independent, okay? If you
are working with a specific cloud, you are anyway going to learn its warehouse: if you're learning
AWS, you'll learn about Redshift; if you're learning GCP, you'll learn about BigQuery, right? So, my
suggestion is to just learn Snowflake, because you will pick up the others by learning the
cloud. Hive is an open-source, uh, tool that not many people use. It is just used for the metastore
for Apache Spark or Apache Hadoop workloads, okay? As a metastore to store some of the information,
but not really recommended to learn it separately. You can just learn the basics, and in case you
have a requirement, then you can learn it on the go, right? It will take you like one
to two days if you have the basics clear, okay? Data processing. This is interesting, okay? For
different workloads, you can use Apache Spark for batch and streaming. You have to learn Spark. This
is very, very important, okay? You cannot skip Spark also because this is used by top companies
to process big data. You also have to learn Kafka because this is very important to process
real-time data, okay? There is also Apache Flink for real-time analytics: you can use Kafka for
streaming the data in and Flink for analytics on top of it. There's NiFi and Apache Beam also;
if you learn GCP, you will automatically learn Apache Beam. But my suggestion for now is to learn
Apache Spark and Apache Kafka only, not Flink right now. If you ever have to use Flink somewhere,
you will learn it on the go, okay? Just add Kafka and Spark to your skill set. Data orchestration.
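Before moving on to the orchestration tools, one note on the processing side above: Spark-style batch and Kafka/Flink-style streaming differ mainly in when data is processed. A toy pure-Python illustration of just that idea (no real Spark or Kafka involved):

```python
def batch_total(events):
    """Batch style: the whole dataset is already available, so process it in one job."""
    return sum(events)

def streaming_totals(events):
    """Streaming style: update a running result as each event arrives."""
    total = 0
    running = []
    for e in events:
        total += e
        running.append(total)  # a result is available after every single event
    return running

events = [3, 1, 4, 1, 5]
print(batch_total(events))       # 14
print(streaming_totals(events))  # [3, 4, 8, 9, 14]
```

Real engines add distribution, fault tolerance, and windowing on top; the processing model is the part worth internalising.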
Okay, we have many tools available. Out of these, you should use Apache Airflow,
one of the highly used tools in the market, okay? We have these modern data tools, okay? Uh,
these tools take roughly 30 minutes to 1 hour to learn, okay? If you have your fundamentals clear,
right? You can just watch one video and understand more like 80 to 90% of the tools, right? It is
very simple. I learned about Mage in just one hour, okay? It didn't take me more than that. So,
there's one project available on our channel also, so if you want, you can learn that. These modern
tools are created to make your life easier, okay? Apache Airflow, by contrast, is quite complicated
to learn, right? It will take some time to understand the gist of it; we have a course on that, and
I'll tell you about it shortly, but you can learn about Dagster, Mage, and Prefect within one hour.
I don't think that will take so much time, okay? And these are the modern data tools available,
all part of the modern data stack, okay? As you can see, for ingestion we have Airbyte and Fivetran
for ingesting data. For data storage, we have BigQuery, Snowflake, Databricks. For BI, we have
Looker, Data Studio. For data transformation, we have DBT. Data orchestration, right? If you
want to orchestrate your entire thing, we have Airflow. There are some data quality frameworks,
Great Expectations, and there are metadata platforms like OpenLineage and DataHub. Again,
you can just search about the tool name, and you will get what they do, okay? When
we talk about the modern data stack, it is really important to just understand why these tools exist
in the market, like what problem do they solve. So, in this case, Fivetran solves the problem of
data ingestion: it takes data from one source and pushes it to another. dbt gives
you modern data transformation, okay? Airflow is for
orchestration, so if you want to orchestrate and build a data pipeline, you can do that. Uh, this
is for data quality and governance, so these are some of the tools available. Just search online,
and you'll find plenty of resources. Alright, uh, now I want to cover these individual things,
right? Uh, what do you need to learn about Python? What do you need to learn about SQL?
What do you need to learn about data warehouses, Spark, Apache Airflow, and Kafka, okay? So I just
want to cover these individual things. Again, I already have the roadmap available, but I'll
just quickly go through this part. Let me just open this, right? Uh, this is available here. So
learning Python is one thing, and learning Python for data engineering is another thing, right? You
can learn Python for free online, but if you want to learn Python for data engineering, you have to
learn certain things. I'll just show you quickly because I have it on my website itself. So this
is my Python for Data Engineering course. I'll just go through the modules. You don't have to
take this course, but if you want, you can learn these things for free online also. I have created
courses just to give you a structured learning approach so that you don't get distracted,
okay? So you can learn the basics. All of these modules are open, so if you want, you can learn
them. You can start with strings, you can learn about numbers, you can learn about data types,
you can learn about data structures like lists, dictionaries, sets, tuples, okay? You can learn
about conditional statements like if-else, you can learn about loops (for loop, while loop),
then you can go to the intermediate level, such as understanding Python packages, how to import them,
list comprehensions, exception handling. We have to learn how to work with text files,
basics of Lambda functions, and object-oriented programming. There are some advanced concepts such
as NumPy, understanding the NumPy package, Pandas basics, how to use Pandas for transformation,
then working with date-time formats—very important if you want to work as a data engineer—how to work
with different file formats like JSON, CSV, Excel, Avro, okay? And these are the basics.
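Date-time handling, called out above as very important, usually means normalising mixed source formats into one canonical form. A small sketch using only the standard library (the formats listed are just examples):

```python
from datetime import datetime

raw = ["2024-03-01 14:30:00", "01/03/2024 14:30"]  # same instant, two source formats
formats = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M"]

def parse_any(value, fmts):
    """Try each known format until one matches; raise if none do."""
    for fmt in fmts:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {value}")

parsed = [parse_any(v, formats) for v in raw]
iso = [d.strftime("%Y-%m-%dT%H:%M:%S") for d in parsed]
print(iso)  # ['2024-03-01T14:30:00', '2024-03-01T14:30:00']
```

In a real pipeline you would also decide on a time zone policy up front, because mixing naive and zone-aware timestamps is one of the most common data bugs.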
In my course, I have included one project for Python, okay? This is like a Spotify data pipeline
project. Uh, I'll tell you about this part, uh, at the end, okay? If you're interested. Then we have
SQL. Inside SQL, what do you have to learn? You can pick one DBMS. We are going with PostgreSQL
because PostgreSQL is open-source, easy to learn, and easy to set up. Learn about the important
keywords of SQL such as SELECT, INSERT, UPDATE, and all of the other things. Learn about data
types and how to create tables, how to create a database, different types of queries available,
okay? Like DML, DDL—like Data Manipulation Language, Data Definition Language—you can learn
about that. Uh, you can learn about operators in SQL, okay? You can learn about ALTER query,
statements, joins like inner, left, right, outer, cross join, ORDER BY, GROUP BY, HAVING clause,
aggregation functions like MIN, MAX, and all the other things. Also, understand the advanced
topics like subqueries, Common Table Expressions, window functions, analytical functions like RANK,
DENSE_RANK, ROW_NUMBER, LEAD, LAG, set operations, working with date-time, case statements,
stored procedures. Learn about data modeling—we understood the basics of it. It is like ER
modeling and data modeling. So learn about that and just try to build your own data model. Like
you can pick one company name like e-commerce or Instagram or any company like Netflix, and you can
build a data model as a project, right? It looks something like this, a data model, as you can see
over here. This is like an Instagram data model, okay? This is like an e-commerce data model,
okay? After this, you can learn about data warehouses, okay? In data warehouses,
you can start with the basics: understand what a data warehouse is, understand OLTP vs OLAP—we
understood about this—understand the difference between data warehouses and data lakes, ETL
process, learn about Snowflake, like basics—just create an account on Snowflake. We have tutorials
on Snowflake also on the YouTube channel. Learn about dimensional modeling, so deep dive into
dimensional modeling, which is understanding what dimensional modeling is, understanding
fact tables, dimension tables, understanding star schema, snowflake schema, types of fact tables,
how to create fact tables, factless fact tables, surrogate keys, date dimension.
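The dimensional-modelling ideas above, fact and dimension tables joined in a star schema, can be sketched with sqlite3 standing in for a real warehouse. All table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes, keyed by surrogate keys.
cur.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT)")
# The fact table holds measures plus foreign keys into each dimension.
cur.execute("CREATE TABLE fact_sales (customer_key INTEGER, date_key INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(20240101, "2024-01-01"), (20240102, "2024-01-02")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 20240101, 100.0), (1, 20240102, 50.0), (2, 20240101, 70.0)])

# The classic star-schema query: join the fact to its dimensions, then aggregate.
rows = cur.execute("""
    SELECT c.name, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', 150.0), ('Bob', 70.0)]
```

Diagrammed, this is one fact table in the middle with the dimensions radiating outward, which is exactly where the "star" name comes from.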
So these are the things you can learn about dimensional modeling. You can learn about SCD,
Slowly Changing Dimensions. You can learn about ETL—these are the concepts that you can cover in
the Snowflake database, okay? Like staging, copy command, file formats, handling unstructured data,
how to work with them, virtual warehouses, caching, clustering, storage integration,
Snowpipe, time travel, how to undrop things, how to recover data from the past, types of tables,
zero-copy cloning, data sharing, materialized views—these are the concepts that you can learn in
Snowflake, right? I'm just trying to give you an overview of the things that you can learn. For me,
I have created this step-by-step roadmap. I'm still building this entire thing,
so you can go to this website, DataVidhya, and you will see that I'm trying to build a course—first
is Python, then the second one is SQL, third one is data warehouses, fourth one is Spark with
Databricks, fifth one is workflow orchestration. I'm currently working on the Kafka course, okay?
And then there will be a dedicated cloud computing course in the future, okay? So, after this,
we have Apache Spark. This is very important. In Apache Spark, understand what Apache Spark is, why
we need Apache Spark, understand the architecture, understand concepts such as DataFrame,
transformations, actions, lazy evaluation in Apache Spark, okay? Learn how to install
Apache Spark—very important. Then we have this, uh, deep dive into the structured API in Apache
Spark. We have two things: structured API and the lower-level API. So learn about the structured
API, basics of it, how to define user-defined functions, data types of Apache Spark,
data sources, partitioning, bucketing, how to work with external tables. Then we also have
the lower-level API, such as understanding the Resilient Distributed Dataset (RDD),
also learn about production applications, how to run Spark on the cluster and on Databricks,
okay? These are topics that you can also cover, like you can just screenshot this,
or you can also visit the DataVidhya website just to get an understanding of the modules, okay? You
can learn all of these things for free online, okay? You don't have to, uh, really go through
this because I'm just going through this because this makes this entire thing easier to explain,
right? Uh, for Airflow also, you can just go through this section, okay? What are the things
that you need to cover? There are a few concepts that are important, okay? And then you can build
the projects like this. So I just quickly showed you, like, what are the different topics that you
can cover from the website. So instead of writing each and every single thing onto this page, uh,
that will just increase the time of the video, and I'm also, uh, feeling pain inside my throat,
uh, because I've been recording this thing for the last 3 hours, okay? Uh, so I just quickly showed
you that particular thing. These are the two different topics that I also wanted to cover: data
security and data masking, okay? Data security is important—we talked about this at the initial
stage. Uh, in data security, we have to take care of three things: confidentiality, integrity,
and availability, right? Ensure your data is accessible only to authorized users, so you
don't give access to your data to every user—only the authorized users should be able to access it.
Integrity is basically maintaining the accuracy and completeness of your data, so your data should
be accurate and should be able to provide the final value, and availability means that your data
is available to authorized users whenever it's needed, okay? These are the three important things
in data security. These are the measures you should take: first, you should encrypt your data,
okay? Encryption should happen so that, uh, if it goes through the network, uh, other people should
not be able to understand what the data is. Access control—only give the data to specific users. Data
classification—classify your data, like if this data is confidential or not, and security—like
secure data on the network level. The one concept that I wanted to talk about is data masking, okay?
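In code, masking can look like this minimal sketch: hide everything except the last four digits of a sensitive value such as an SSN or card number. The helper name and defaults are my own, not from any specific library:

```python
def mask(value, visible=4, mask_char="X"):
    """Replace all but the trailing `visible` digits, preserving separators like '-'."""
    total_digits = sum(c.isdigit() for c in value)
    keep_from = total_digits - visible  # index of the first digit we leave readable
    digits_seen = 0
    out = []
    for c in value:
        if c.isdigit():
            out.append(c if digits_seen >= keep_from else mask_char)
            digits_seen += 1
        else:
            out.append(c)  # keep separators so the masked value stays recognisable
    return "".join(out)

print(mask("123-45-6789"))       # XXX-XX-6789
print(mask("4111111111111111"))  # XXXXXXXXXXXX1111
```

Production systems usually mask at the database or view layer (many warehouses offer dynamic data masking policies), but the idea is the same.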
Uh, that I talked about at the governance level. So usually what happens is that, uh, you have an
employee table, okay? What you have to do is—like there are some governance restrictions, some
regulations by governments that say you should not store sensitive information about the users,
right? Like credit card numbers, addresses, social security numbers, and all of the other things. So
when you do store it, make sure you mask it. Masking is basically a technique of hiding the real
value. Say this is the user's Social Security number, right? If I want to mask it, I replace most
of it with placeholder characters, like this: XXX-XX-, revealing only the last four digits,
okay? You can also do this
for credit cards. This is called masking, okay? Now, these are the different file formats you can
use for big data. These are common: JSON, CSV, Parquet, and ORC. Every file format has its own
use case. I don't want to go deep dive into this right now. I covered some of the things in my
courses, or you can just Google this, and you will understand most of the things that you want to
learn, okay? So, till now, we have covered a lot of different things. I might have missed some of
the topics, right? I cannot cover every single topic in a single video. This might end up being
around a 3-hour video; I'm not sure. Once I sit down and edit this video, I'll know the final length. I
might have missed some of the topics, so what you can do—you can comment down, okay, the topics that
you want to learn. Just the fundamental topics, right? Hands-on, we will have the projects for
that. Just the fundamental concepts that you want to learn. What I will do—I will club all of these
topics inside part two, and I will create a video like this—like a long, three-hour video that you
can watch, okay? Now, you understood all of these things. Now, if you like the way I teach and if
you really want to learn about data engineering, you can go to the website. I will put the link in
the description: DataVidhya/combo-pack. On this combo pack, you will get five courses because,
till now, only five courses have launched. I'm currently working on the Apache Kafka course,
as you can see. We have a 'Not Available' for that also. So, you will also get access to this kind of
notes if you, uh, enroll in the course because I created all of these kinds of notes by myself,
okay? So that you can revise at any time that you want, okay? So you will get access to all of these
notes. So this is my Zero-to-Hero Data Engineering Combo Pack. It comes with the five courses. Now,
in the future, when they launch the course, uh, you can enroll in that course separately,
and I will also create a new combo pack. So if you see the new combo pack while enrolling in this,
okay, at that time, you might see GCP also added into this, okay? In this course, you
will get around 14+ projects. I will teach you, okay, how to make one project the best project.
It is like a step-by-step approach you will get. So as you can see over here, uh, in your Python
for Data Engineering project, uh, course, you will build this particular project, okay? In Snowflake,
you will build a similar project, but instead of using Glue Crawler, Catalog, and Amazon Athena,
we will be using Snowflake over here, okay? Now, in the Spark course, we will replace this Lambda
part, okay, for Python, and we will replace it with Apache Spark. This way, you will understand
how to evolve one simple project and how to plug and play with different toolsets. This is what you
will learn in the entire combo pack, right? How to take one simple project and make it the best
production-level project as we go forward, okay? We start with the basics, we'll replace
some components, we will add Spark, then we will also add Apache Airflow, okay? Inside Airflow,
we will use the same project, and we will use Docker and Apache Airflow to orchestrate
this entire pipeline. We will also create a similar project just using Apache Airflow only,
okay? So as you can see, one simple project we will create in like five to six different ways,
right? So that you get an understanding that data engineering is not just about using tools;
it's about the fundamentals. The fundamentals that we understood, you will actually implement all of
these things over here like this. You will also get projects on Apache NiFi and real-time data
streaming, and there's one project available on Twitter data analysis, which is also available on
YouTube. There's a project available on GCP also, and one on Azure over here. The GCP
project is also available; let me just show you over here. Yeah, this is a crypto data
pipeline project available in the Apache Airflow course, okay? So you will also learn about Azure,
you will also learn about AWS, and you will learn about GCP just by doing these five courses. And
then, in the future, we will have in-depth courses on individual clouds also, so you will
get like 14 different projects over here, okay? Five courses—you can get the information about
all of these over here just by clicking this, okay? These are the reviews from our students,
okay? Previous students, and they have built their own projects till now, so if you want to check,
you can also go through this and understand that they have built some amazing, uh, projects. You
can just click over here, okay? And you will be redirected to the link of the project. I hope
this is working, okay? Or you can go here—this also. Uh, yeah, as you can see over here, uh,
this guy actually built the Airbnb project, uh, using Azure. So just like this, you can build your
own project and put it on your resume also. Uh, this course is for everyone, like cloud engineers,
web developers, data engineers, uh, technical consultants, so it doesn't matter who you are—you
can learn this. What you will get from this course—you will get the code template, okay? You
will get each and everything about the code that you can use. You'll get access to the interactive
Discord community, uh, you will get support if you are stuck with any doubts or any errors, uh,
you can ask it on the Discord channel. Someone will help you out, or I will help you out. Or
in the future, you will also get early access and a discount—like a huge discount to future courses,
right? So, let's say if I launch the Kafka course, you will get a huge
discount for that course also. These are some of the reviews from our students, so you can go
through that. These are some of the commonly asked questions, so you can also go through that. So,
if you're interested, you can go through this. If you're not, you can also learn by yourself,
okay? My voice is breaking, but the best part about this particular data engineering roadmap
is that every single course is in-depth, so you will learn about most of the things. Most of the
bootcamps available in the market just give you surface-level knowledge. So, if you go to any
website that offers data engineering courses, they teach you all of these things, but for each and
every module, they might have like two to three videos added to their module, and they are done.
I will cover all of these topics in detail, and with that, you also get access to notes like these. If I show you Obsidian, you can see this interactive graph environment. You can see the Apache Spark topics here, and which topics are connected to them. These topics are also linked across courses, so a concept like partitioning or transformations also applies to the data warehouse course and to Apache Kafka. As you can see, there is the data warehouse, there is SQL, and you can also start with the basic topics; cloud is there as well. You can interact with the graph and directly search for a specific topic such as partitioning, and I can see that partitioning appears here and also here, so I can easily
search and learn about the different things. One more thing: you will get the detailed notes. For example, these are the basics of Docker, which you can search for, or the Airflow basics UI. Let's say I want to write my first DAG: the notes give me every single instruction on how to write my first DAG, what the code is, and everything that I have to do. You will get every single thing here. This will make your life so much
easier that you don't get distracted by looking at different courses or different resources. You
just stick to one single path, and you can become a data engineer. So this is what I wanted to show
you about my courses. If you're interested, just check the link in the description. If you're not, that's totally up to you; you can use multiple resources. I also have free resources available,
so you can also check those on my YouTube channel. That's everything for this video. I have now been recording for almost three and a half hours; hopefully the recording gets saved so that I don't have to re-record the entire thing. If you're still watching at this point, let me know by writing a comment, because this is a long video. Also, like this video, because I put a lot of hard work into it, and share it with people so that everyone can take advantage of it and grow in their careers. Thank you for watching this video. I'll see you in the next one. Thank you so much.