This content demonstrates how to transform a single, merged data table into a normalized star schema (and further into a snowflake schema) using Power Query in Excel. This process enhances data integrity, reduces redundancy, and improves performance for data analysis.
Mind Map
Click to expand
Click to explore the full interactive mind map • Zoom, pan, and navigate
let me show you how to transform a
single merged table into a star schema
using power Query in [Music]
Excel before we get our hands dirty
let's talk about what we're doing here
which is something called database
normalization basically normalization is
about organizing your data in a way that
preserves integrity and eliminates
redundancy in other words it helps make
sure you don't screw up the data or
store a bunch of stuff you don't need
now there are multiple forms of
normalization which we'll talk about
later but here's a simple example this
Excel workbook contains transactional
records for a global retailer we have
information about each transaction a
unique ID the order number line item
order and delivery dates and quantities
sold which is great but we also have
tons of columns containing things like
customer demographics store locations
and product details now Excel users have
a habit of merging data like this using
lookup or index match functions and it's
tempting to do so because there's
something very intuitive and comforting
about having all the data in one
convenient place you can see it right
there in front of you and explore it
using familiar tools like worksheet
formulas or pivot tables but here's the
problem when you mash data together like
this you create a ton of redundancy and
build solutions that just don't scale
for example all of these columns here
which actually represent about 3/4 of
the data set are an absolute waste of space
space
why because all these details are
dependent on primary keys in other words
if we know the customer ID we also know
the gender name and address if we know
the product ID we can figure out the
product name category retail price and
so on so instead of storing those same
values and attributes over and over
again we can create separate lookup or
Dimension tables specifically designed
to store and D duplicate that type of
information in this case one dimension
table would contain a unique list of
customer customer IDs along with any
columns containing details about
customers another would contain a unique
list of store IDs with details about
each store location and a third would
contain unique product IDs with
information about each product by doing
this we can remove those fields from our
transaction table keep only the key
columns and create relationships between
them without writing a single formula
better yet we can do all of this within
excel's data model where we can load and
compress huge data sets without having
to worry about worksheet row limitations
all right enough talk let's fire up
Excel and see if we can use power query
to turn this table into a proper
relational model all righty so I've got
a brand new workbook here and my first
step will be to connect to our Excel
workbook containing the merged table so
let's head to the data tab get data from
file from Excel workbook going to select
that transactions workbook and click
import excel's going to create that
connection fire up the preview Pane and
here I can see the transactions tab
within the workbook with all of my data
previewed right here I'm going to click
transform data to fire up the query
editor and here we can see all of that
data that we previewed earlier the
transactional records the customer
details store locations product
information and so on note that this is
already in what's called first normal
form since the records in each field are
Atomic that basically just means that
each cell in the table contains one
single dat point there are no lists or
repeated groups or things like that so
our Focus here will be on getting rid of
some of these redundant fields that we
talked about by splitting out separate
Dimension tables now to do that what I'm
going to do is actually duplicate this
transaction table three times so I'm
going to be creating three separate
Dimension tables one for customers one
for stores and one for products and if
we start with our first duplicate here
why don't we kick things off with our
customer Dimension table so first step
is to scroll through and just isolate
all of the customer specific columns
including the key or the customer ID so
let's select customer ID we're going to
want gender name city state ZIP country
continent and date of birth so I'm going
to hold shift click through the customer
do or date of birth column and I'm going
to right click and remove all of the
other columns from this table now we
have a table containing just customer
detail and the key here to reduce that
redundancy that we're seeing is to
remove duplicates so that we end up with
a unique customer ID for each record
that's going to serve as the primary key
of this table so let's go to remove rows
remove duplicates and it's as simple as
that can also sort ascending for
readability and if we wanted to double
check and confirm that these customer
IDs are in fact unique what we could do
is head to view column profile and we
want to profile not just based on the
first, rows but the entire data set and
check it out we've got a row count of
11,887 all of which are distinct and
unique that means we have a valid
primary key for this table and each row
each customer ID represents one distinct
customer so we can go ahead and turn off
that column profile name this table
customers and we are good to go now
we're going to follow that same process
for stores and products so I'll go to
our next duplicated query here this time
I'm going to find all these store
related columns store ID country State
square meters and open date so it looks
like we want these five columns we'll
remove all of the others head to home remove
remove
duplicates and again this is optional
but we can sort ascending and now we can
very clearly see that this retailer
operates an online store and then a
whole bunch of stores looks like 66 or
so stores across different countries
like the US UK Netherlands Italy Germany
France Etc so we've just duplicated our
store information and created our stores
Dimension table let's go ahead and name
this one stores that looks good and
we're going to do the same thing for
products so let's scroll over find all
of our product related info starting
with product ID we've got a name brand
color cost retail price and then some
subcategory and category level
information too which we'll talk about
in a little bit more detail later on so
let's remove those other columns now we
have a product table jump back to my ID remove
remove
duplicates sort
ascending and we are in great shape so
let's name this table products and now
that we've created these Dimension
specific tables here's the beauty of
that we can go back to to our original
transaction table and we can actually
get rid of all of those columns with the
exception of the keys so we'll keep
customer ID but we're going to get rid
of all these details gender name City
and so on so everything through date of
birth I'm going to right click remove
those columns all the store attributes
right click remove and all of our
product details right click and
remove so now this table just gives us
the transactional level information
right the order number the line item the
order and delivery dates the quantity
plus our three keys customer store and
product ID which will allow us to create
relationships to those three dimension
tables that we just created so that's
everything we need to do for now what
I'm going to do is head to home close
and load to this is important I'm only
going to create a connection I don't
want to dump all of these rows and data
points into a worksheet I want to add
this to the data model instead and
create a proper relational model let's
press okay we'll see the queries and
connections paying fire up and we'll
start to see those connections loading
data into our
model all right perfect we've got all of
our data loaded from here we can head to
power pivot manage our data model here
you can see those connections the data
has been compressed here in the model
we're going to head to diagram view
where we can view each table as a
distinct object and what I'm going to do
is pull my transaction table which is my
data or fact table I'm going to put it
right here in the middle I'm going to
kind of surround it by my Dimension
tables customer stores and products and
now instead of writing all sorts of
complex lookup or index match functions
all I need to do is select the primary
keys from each Dimension table and map
them to the matching foreign keys in my
fact table so customer ID relates to
customer ID store ID relates to store ID
ID and product ID you guessed it relates
to product ID and what we've just
created is known as a star schema which
is a very common database structure and
often a best practice for many types of
data analytics we've got that Central
fact table transactions surrounded by
Dimension tables connected via one to
many relationships and with this star
schema we can access the exact same
information that we could using a merged
table we can even Explore it using power
pivot which is essentially just a
regular pivot table that sits on top of
a data model instead of a single table
so let's go ahead and see what that
looks like going to add a pivot table
right here in the worksheet cell A1 and
here I've got my familiar field list and
I can grab data from any of my related
tables in the model so let's take a look
at total quantity sold we can break that
down by product
category like so on rows we can sort
this descending to see which categories
are sold most often we could even add a
slicer to understand which stores in
which countries sell different types of
products start to get a sense of which
products are most popular in different
parts of the world and we could also
Define new calculations and measures
using data analysis expressions or Dax
we could add visuals using pivot charts
or even use Cube functions to pull
values from our model directly into
worksheet cells and as a bonus we've
removed over a million redundant data
points and reduced our workbook size by
More than 70% so this is all great but
it's important to note that technically
we haven't fully normalized the data
there are still some dependencies that
we could address by splitting out
additional tables in our model for
example let's look at our transactions
table so we know that this table is in
first normal form but in order to
further normalize it to what's called
second normal form it would require to
eliminate any partial dependencies in
other words columns that only depend on
part of the primary key which in this
case is our transaction ID if you look
closely you'll notice that some Fields
like order date delivery date customer
ID and store ID only depend on the order
number so the same information is
repeated for each line item in the table
and that makes sense since each order
takes place on one specific date and you
wouldn't see different customers
purchasing individual line items within
one order on the other hand some Fields
like quantity and product ID depend on
both the order number and the line item
or the full transaction ID since orders
can contain multiple individual products
what that means is that to achieve
second normal form we need to break this
into two Separate Tables one at the
order level and one at the order line
item level so let's make that happen
right click duplicate this trans
transaction table again let's drag it up
to keep them together and why don't we
start with our order level table here
and just like when we split out our
Dimension tables the key is to isolate
just the relevant columns here so for
order level detail what I'm going to
grab is the order number column and I'm
going to control click the order date
delivery date assuming all line items
ship at the same time which is the case
with this data set I'm going to select
customer ID and store ID let's right
click remove the other columns and now
just like Dimension tables we're going
to remove duplicates from this order
number field and this will become the
primary key in this table and what we
can do here is rename this table name
let's call this one
orders that's okay let's rename and this
duplicated version this will become line
items so for this one we want the full
granularity the full level of detail so
let's go ahead and keep the ID the order
number and the line item CU we want all
of this information here let's keep
those three we're going to keep the
quantity field and the last one we need
is the product ID field right click
remove the others and we actually don't
need to remove duplicates because we
have the deepest level of granularity
and we know that these transaction IDs
are already unique so let's double click
let's name this one
order line
items and there you have it we have
officially removed the partial
dependencies from that original
transactions table and we've achieved
second normal form now we could keep
going down this path with some of our
existing Dimension tables as well for
example let's look at products now this
table is actually already in second
normal form because all of our non-key
columns do depend on the full primary
key or product ID so there are no
partial dependencies like we just saw in
our transaction table that said if we
scroll through you'll start to see that
we do still have some redundancies here
and we could continue to normalize this
table from second to third normal form
which would involve getting rid of any
transitive dependencies as well I know
that's a mouthful but it's basically
when columns are dependent on fields
other than the primary key and we do
have some transitive dependencies here
our product category and our product
subcategory don't depend entirely on the
product ID but rather their own key
columns like product category ID and
product subcategory ID so this is
another case where splitting our tables
will help us eliminate some of that
redundancy what we can do is duplicate
our product table going duplicate it
twice because we're going to end up with
one dimension table that's at the
product level one that's at the
subcategory level and one that's at the
category level so let's start with our
second version here this will be our
subcategory table and let's find all the
fields that are subcategory related got
the product subcategory ID and the
subcategory name and here's the catch
because I want this table to also
connect us or relate to the category
level detail I also want the category ID
field here as well so I'm going to right
click remove the
others and I can remove any duplicates
from our subcategory ID that's going to
turn this into the primary key of this
subcategory table and we can rename it
it
subcategories similar approach for
categories let's go to our next
duplicated version this time we just
need the category ID and the category
name remove everything
else get rid of those
duplicates and now we have a nice clean
category level Dimension table showing
the eight product categories that this
retailer sells let's rename that one categories
categories
and now that we have these Dimension
tables split out we can go back to
products and get rid of those redundant
Fields so now all we really need is the
subcategory ID which will allow us to
connect to subcategories and then from
subcategories we can connect to
categories so I can select everything
after that subcategory name category ID
and category remove those and we've
eliminated that redundancy so let's go
ahead and close and load this to data
model press okay and we should see those
new queries here with the data loading
in all right looks like it's all loaded
and now we can head to our data model we
can go back into our diagram View and we
should start to see some of these new
tables here so instead of one
transaction table now we've got orders
and order line items we've got products
here and then hanging out over here on
the right don't miss them we've got
subcategories and categories and you can
see our relationships have gone away
since we modified our model so we
basically just need to recreate or
reconfigure our data model based on the
new table relationships so we know that
the fields that were related at the
order level were customer ID like so and store
store
ID well products connect to the order
line item level so that's where we find
our product ID we can also connect our
distinct order number using a on to many
relationship to the order number in our
line item table then we can make a
similar chain of relationships
connecting our product subcategory ID to
our subcategories table and our category
ID to our category level table so as you
can see here our data model has become
quite a bit more complex we no longer
have that nice clean star schema we've
got snowflake schemas which are
basically just chains of Dimension and
subdimension tables and we could go even
further by normalizing our customer and
store table as well so at this point
you're probably wondering how much
normalization is enough do I need to
eliminate every single redundant data
point is there some Universal standard
or best practice the short answer is no
the most important thing to keep in mind
about normalization is that it's all
about tradeoffs more normalization means
better Integrity less redundancy and
smaller individual tables but it also
means more complex data models and
therefore more complex queries
especially for multi-table analysis
that's why star schemas even though they
aren't usually fully normalized are such
a popular choice for things like bi
reporting or exploratory analysis now I
hope that helps if you'd like to learn
more check out the description for links
to our self-paced courses learning paths
and guided projects and as always make
sure to like And subscribe for more data
content just like this I'll see you in
the next one [Music]
Click on any text or timestamp to jump to that moment in the video
Share:
Most transcripts ready in under 5 seconds
One-Click Copy125+ LanguagesSearch ContentJump to Timestamps
Paste YouTube URL
Enter any YouTube video link to get the full transcript
Transcript Extraction Form
Most transcripts ready in under 5 seconds
Get Our Chrome Extension
Get transcripts instantly without leaving YouTube. Install our Chrome extension for one-click access to any video's transcript directly on the watch page.