YouTube Transcript:
Database Normalization for Beginners | How to Normalize Data w/ Power Query (full tutorial!)

Skip watching entire videos - get the full transcript, search for keywords, and copy with one click.

AutoDub

Understand YouTube Foreign Videos

Immersive YouTube Voice Translation

Break language barriers, embrace global quality content

Solve Foreign Video Barriers Instantly

Video Transcript

Video Summary

Summary

Core Theme

This content demonstrates how to transform a single, merged data table into a normalized star schema (and further into a snowflake schema) using Power Query in Excel. This process enhances data integrity, reduces redundancy, and improves performance for data analysis.

Mind Map

Click to expand

Click to explore the full interactive mind map • Zoom, pan, and navigate

let me show you how to transform a

single merged table into a star schema

using power Query in [Music]

Excel before we get our hands dirty

let's talk about what we're doing here

which is something called database

normalization basically normalization is

about organizing your data in a way that

preserves integrity and eliminates

redundancy in other words it helps make

sure you don't screw up the data or

store a bunch of stuff you don't need

now there are multiple forms of

normalization which we'll talk about

later but here's a simple example this

Excel workbook contains transactional

records for a global retailer we have

information about each transaction a

unique ID the order number line item

order and delivery dates and quantities

sold which is great but we also have

tons of columns containing things like

customer demographics store locations

and product details now Excel users have

a habit of merging data like this using

lookup or index match functions and it's

tempting to do so because there's

something very intuitive and comforting

about having all the data in one

convenient place you can see it right

there in front of you and explore it

using familiar tools like worksheet

formulas or pivot tables but here's the

problem when you mash data together like

this you create a ton of redundancy and

build solutions that just don't scale

for example all of these columns here

which actually represent about 3/4 of

the data set are an absolute waste of space

space

why because all these details are

dependent on primary keys in other words

if we know the customer ID we also know

the gender name and address if we know

the product ID we can figure out the

product name category retail price and

so on so instead of storing those same

values and attributes over and over

again we can create separate lookup or

Dimension tables specifically designed

to store and D duplicate that type of

information in this case one dimension

table would contain a unique list of

customer customer IDs along with any

columns containing details about

customers another would contain a unique

list of store IDs with details about

each store location and a third would

contain unique product IDs with

information about each product by doing

this we can remove those fields from our

transaction table keep only the key

columns and create relationships between

them without writing a single formula

better yet we can do all of this within

excel's data model where we can load and

compress huge data sets without having

to worry about worksheet row limitations

all right enough talk let's fire up

Excel and see if we can use power query

to turn this table into a proper

relational model all righty so I've got

a brand new workbook here and my first

step will be to connect to our Excel

workbook containing the merged table so

let's head to the data tab get data from

file from Excel workbook going to select

that transactions workbook and click

import excel's going to create that

connection fire up the preview Pane and

here I can see the transactions tab

within the workbook with all of my data

previewed right here I'm going to click

transform data to fire up the query

editor and here we can see all of that

data that we previewed earlier the

transactional records the customer

details store locations product

information and so on note that this is

already in what's called first normal

form since the records in each field are

Atomic that basically just means that

each cell in the table contains one

single dat point there are no lists or

repeated groups or things like that so

our Focus here will be on getting rid of

some of these redundant fields that we

talked about by splitting out separate

Dimension tables now to do that what I'm

going to do is actually duplicate this

transaction table three times so I'm

going to be creating three separate

Dimension tables one for customers one

for stores and one for products and if

we start with our first duplicate here

why don't we kick things off with our

customer Dimension table so first step

is to scroll through and just isolate

all of the customer specific columns

including the key or the customer ID so

let's select customer ID we're going to

want gender name city state ZIP country

continent and date of birth so I'm going

to hold shift click through the customer

do or date of birth column and I'm going

to right click and remove all of the

other columns from this table now we

have a table containing just customer

detail and the key here to reduce that

redundancy that we're seeing is to

remove duplicates so that we end up with

a unique customer ID for each record

that's going to serve as the primary key

of this table so let's go to remove rows

remove duplicates and it's as simple as

that can also sort ascending for

readability and if we wanted to double

check and confirm that these customer

IDs are in fact unique what we could do

is head to view column profile and we

want to profile not just based on the

first, rows but the entire data set and

check it out we've got a row count of

11,887 all of which are distinct and

unique that means we have a valid

primary key for this table and each row

each customer ID represents one distinct

customer so we can go ahead and turn off

that column profile name this table

customers and we are good to go now

we're going to follow that same process

for stores and products so I'll go to

our next duplicated query here this time

I'm going to find all these store

related columns store ID country State

square meters and open date so it looks

like we want these five columns we'll

remove all of the others head to home remove

remove

duplicates and again this is optional

but we can sort ascending and now we can

very clearly see that this retailer

operates an online store and then a

whole bunch of stores looks like 66 or

so stores across different countries

like the US UK Netherlands Italy Germany

France Etc so we've just duplicated our

store information and created our stores

Dimension table let's go ahead and name

this one stores that looks good and

we're going to do the same thing for

products so let's scroll over find all

of our product related info starting

with product ID we've got a name brand

color cost retail price and then some

subcategory and category level

information too which we'll talk about

in a little bit more detail later on so

let's remove those other columns now we

have a product table jump back to my ID remove

remove

duplicates sort

ascending and we are in great shape so

let's name this table products and now

that we've created these Dimension

specific tables here's the beauty of

that we can go back to to our original

transaction table and we can actually

get rid of all of those columns with the

exception of the keys so we'll keep

customer ID but we're going to get rid

of all these details gender name City

and so on so everything through date of

birth I'm going to right click remove

those columns all the store attributes

right click remove and all of our

product details right click and

remove so now this table just gives us

the transactional level information

right the order number the line item the

order and delivery dates the quantity

plus our three keys customer store and

product ID which will allow us to create

relationships to those three dimension

tables that we just created so that's

everything we need to do for now what

I'm going to do is head to home close

and load to this is important I'm only

going to create a connection I don't

want to dump all of these rows and data

points into a worksheet I want to add

this to the data model instead and

create a proper relational model let's

press okay we'll see the queries and

connections paying fire up and we'll

start to see those connections loading

data into our

model all right perfect we've got all of

our data loaded from here we can head to

power pivot manage our data model here

you can see those connections the data

has been compressed here in the model

we're going to head to diagram view

where we can view each table as a

distinct object and what I'm going to do

is pull my transaction table which is my

data or fact table I'm going to put it

right here in the middle I'm going to

kind of surround it by my Dimension

tables customer stores and products and

now instead of writing all sorts of

complex lookup or index match functions

all I need to do is select the primary

keys from each Dimension table and map

them to the matching foreign keys in my

fact table so customer ID relates to

customer ID store ID relates to store ID

ID and product ID you guessed it relates

to product ID and what we've just

created is known as a star schema which

is a very common database structure and

often a best practice for many types of

data analytics we've got that Central

fact table transactions surrounded by

Dimension tables connected via one to

many relationships and with this star

schema we can access the exact same

information that we could using a merged

table we can even Explore it using power

pivot which is essentially just a

regular pivot table that sits on top of

a data model instead of a single table

so let's go ahead and see what that

looks like going to add a pivot table

right here in the worksheet cell A1 and

here I've got my familiar field list and

I can grab data from any of my related

tables in the model so let's take a look

at total quantity sold we can break that

down by product

category like so on rows we can sort

this descending to see which categories

are sold most often we could even add a

slicer to understand which stores in

which countries sell different types of

products start to get a sense of which

products are most popular in different

parts of the world and we could also

Define new calculations and measures

using data analysis expressions or Dax

we could add visuals using pivot charts

or even use Cube functions to pull

values from our model directly into

worksheet cells and as a bonus we've

removed over a million redundant data

points and reduced our workbook size by

More than 70% so this is all great but

it's important to note that technically

we haven't fully normalized the data

there are still some dependencies that

we could address by splitting out

additional tables in our model for

example let's look at our transactions

table so we know that this table is in

first normal form but in order to

further normalize it to what's called

second normal form it would require to

eliminate any partial dependencies in

other words columns that only depend on

part of the primary key which in this

case is our transaction ID if you look

closely you'll notice that some Fields

like order date delivery date customer

ID and store ID only depend on the order

number so the same information is

repeated for each line item in the table

and that makes sense since each order

takes place on one specific date and you

wouldn't see different customers

purchasing individual line items within

one order on the other hand some Fields

like quantity and product ID depend on

both the order number and the line item

or the full transaction ID since orders

can contain multiple individual products

what that means is that to achieve

second normal form we need to break this

into two Separate Tables one at the

order level and one at the order line

item level so let's make that happen

right click duplicate this trans

transaction table again let's drag it up

to keep them together and why don't we

start with our order level table here

and just like when we split out our

Dimension tables the key is to isolate

just the relevant columns here so for

order level detail what I'm going to

grab is the order number column and I'm

going to control click the order date

delivery date assuming all line items

ship at the same time which is the case

with this data set I'm going to select

customer ID and store ID let's right

click remove the other columns and now

just like Dimension tables we're going

to remove duplicates from this order

number field and this will become the

primary key in this table and what we

can do here is rename this table name

let's call this one

orders that's okay let's rename and this

duplicated version this will become line

items so for this one we want the full

granularity the full level of detail so

let's go ahead and keep the ID the order

number and the line item CU we want all

of this information here let's keep

those three we're going to keep the

quantity field and the last one we need

is the product ID field right click

remove the others and we actually don't

need to remove duplicates because we

have the deepest level of granularity

and we know that these transaction IDs

are already unique so let's double click

let's name this one

order line

items and there you have it we have

officially removed the partial

dependencies from that original

transactions table and we've achieved

second normal form now we could keep

going down this path with some of our

existing Dimension tables as well for

example let's look at products now this

table is actually already in second

normal form because all of our non-key

columns do depend on the full primary

key or product ID so there are no

partial dependencies like we just saw in

our transaction table that said if we

scroll through you'll start to see that

we do still have some redundancies here

and we could continue to normalize this

table from second to third normal form

which would involve getting rid of any

transitive dependencies as well I know

that's a mouthful but it's basically

when columns are dependent on fields

other than the primary key and we do

have some transitive dependencies here

our product category and our product

subcategory don't depend entirely on the

product ID but rather their own key

columns like product category ID and

product subcategory ID so this is

another case where splitting our tables

will help us eliminate some of that

redundancy what we can do is duplicate

our product table going duplicate it

twice because we're going to end up with

one dimension table that's at the

product level one that's at the

subcategory level and one that's at the

category level so let's start with our

second version here this will be our

subcategory table and let's find all the

fields that are subcategory related got

the product subcategory ID and the

subcategory name and here's the catch

because I want this table to also

connect us or relate to the category

level detail I also want the category ID

field here as well so I'm going to right

click remove the

others and I can remove any duplicates

from our subcategory ID that's going to

turn this into the primary key of this

subcategory table and we can rename it

subcategories similar approach for

categories let's go to our next

duplicated version this time we just

need the category ID and the category

name remove everything

else get rid of those

duplicates and now we have a nice clean

category level Dimension table showing

the eight product categories that this

retailer sells let's rename that one categories

Paste YouTube URL

Enter any YouTube video link to get the full transcript

Most transcripts ready in under 5 seconds

Get Our Chrome Extension

Get transcripts instantly without leaving YouTube. Install our Chrome extension for one-click access to any video's transcript directly on the watch page.

Add to Chrome — Free

Works with YouTube, Coursera, Udemy and more educational platforms

Get Instant Transcripts: Just Edit the Domain in Your Address Bar!

YouTube

←

→

↻

https://www.youtube.com/watch?v=UF8uR6Z6KLc

YoutubeToText

←

→

↻

https://youtubetotext.net/watch?v=UF8uR6Z6KLc

YouTube TranscriptPreparing your results…

YouTube Transcript:Database Normalization for Beginners | How to Normalize Data w/ Power Query (full tutorial!)