0:01 let me show you how to transform a
0:04 single merged table into a star schema
0:06 using Power Query in
0:13 Excel. Before we get our hands dirty
0:15 let's talk about what we're doing here
0:17 which is something called database
0:20 normalization basically normalization is
0:22 about organizing your data in a way that
0:24 preserves integrity and eliminates
0:26 redundancy in other words it helps make
0:28 sure you don't screw up the data or
0:30 store a bunch of stuff you don't need
0:31 now there are multiple forms of
0:33 normalization which we'll talk about
0:35 later but here's a simple example this
0:37 Excel workbook contains transactional
0:40 records for a global retailer we have
0:42 information about each transaction a
0:45 unique ID the order number line item
0:46 order and delivery dates and quantities
0:48 sold which is great but we also have
0:50 tons of columns containing things like
0:53 customer demographics store locations
0:55 and product details now Excel users have
0:58 a habit of merging data like this using
1:00 lookup or index match functions and it's
1:02 tempting to do so because there's
1:04 something very intuitive and comforting
1:06 about having all the data in one
1:08 convenient place you can see it right
1:10 there in front of you and explore it
1:12 using familiar tools like worksheet
1:15 formulas or pivot tables but here's the
1:17 problem when you mash data together like
1:20 this you create a ton of redundancy and
1:22 build solutions that just don't scale
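To see the problem concretely, here's a tiny sketch in plain Python (made-up rows, not the video's actual dataset) of the update anomaly that merged tables invite: because the customer's city is copied onto every transaction row, one change of address means touching many rows.

```python
# Illustrative only: a tiny made-up "merged table" like the one in the video.
merged = [
    {"transaction_id": 1, "customer_id": "C1", "customer_city": "Berlin", "quantity": 2},
    {"transaction_id": 2, "customer_id": "C1", "customer_city": "Berlin", "quantity": 1},
    {"transaction_id": 3, "customer_id": "C2", "customer_city": "Paris",  "quantity": 5},
]

# The city is repeated on every transaction row, so a single change of
# address forces an update to many rows -- miss one and the data
# contradicts itself (a classic "update anomaly").
for row in merged:
    if row["customer_id"] == "C1":
        row["customer_city"] = "Munich"

cities = {row["customer_city"] for row in merged if row["customer_id"] == "C1"}
print(cities)  # {'Munich'} -- only because every copy was updated
```

With a separate customer table, the city would live in exactly one row and this whole class of error disappears.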
1:24 for example all of these columns here
1:26 which actually represent about 3/4 of
1:29 the data set are an absolute waste of
1:30 space
1:32 why because all these details are
1:35 dependent on primary keys in other words
1:37 if we know the customer ID we also know
1:39 the gender name and address if we know
1:41 the product ID we can figure out the
1:43 product name category retail price and
1:46 so on so instead of storing those same
1:48 values and attributes over and over
1:50 again we can create separate lookup or
1:53 Dimension tables specifically designed
1:57 to store and de-duplicate that type of
1:57 information in this case one dimension
1:59 table would contain a unique list of
2:01 customer IDs along with any
2:03 columns containing details about
2:05 customers another would contain a unique
2:07 list of store IDs with details about
2:10 each store location and a third would
2:12 contain unique product IDs with
2:14 information about each product by doing
2:16 this we can remove those fields from our
2:19 transaction table keep only the key
2:21 columns and create relationships between
2:24 them without writing a single formula
2:26 better yet we can do all of this within
2:28 excel's data model where we can load and
2:30 compress huge data sets without having
2:33 to worry about worksheet row limitations
2:34 all right enough talk let's fire up
2:37 Excel and see if we can use power query
2:39 to turn this table into a proper
2:41 relational model all righty so I've got
2:43 a brand new workbook here and my first
2:45 step will be to connect to our Excel
2:48 workbook containing the merged table so
2:50 let's head to the data tab get data from
2:53 file from Excel workbook going to select
2:55 that transactions workbook and click
2:57 import excel's going to create that
2:59 connection fire up the preview Pane and
3:01 here I can see the transactions tab
3:03 within the workbook with all of my data
3:05 previewed right here I'm going to click
3:07 transform data to fire up the query
3:10 editor and here we can see all of that
3:12 data that we previewed earlier the
3:14 transactional records the customer
3:17 details store locations product
3:19 information and so on note that this is
3:21 already in what's called first normal
3:24 form since the records in each field are
3:26 Atomic that basically just means that
3:29 each cell in the table contains one
3:31 single data point there are no lists or
3:34 repeated groups or things like that so
3:36 our Focus here will be on getting rid of
3:38 some of these redundant fields that we
3:40 talked about by splitting out separate
3:43 Dimension tables now to do that what I'm
3:44 going to do is actually duplicate this
3:47 transaction table three times so I'm
3:49 going to be creating three separate
3:51 Dimension tables one for customers one
3:54 for stores and one for products and if
3:56 we start with our first duplicate here
3:57 why don't we kick things off with our
4:00 customer Dimension table so first step
4:03 is to scroll through and just isolate
4:05 all of the customer specific columns
4:08 including the key or the customer ID so
4:10 let's select customer ID we're going to
4:14 want gender name city state ZIP country
4:16 continent and date of birth so I'm going
4:18 to hold shift and click through to the
4:21 date of birth column and I'm going
4:22 to right click and remove all of the
4:25 other columns from this table now we
4:27 have a table containing just customer
4:30 detail and the key here to reduce that
4:32 redundancy that we're seeing is to
4:34 remove duplicates so that we end up with
4:37 a unique customer ID for each record
4:39 that's going to serve as the primary key
4:42 of this table so let's go to remove rows
4:45 remove duplicates and it's as simple as
4:47 that can also sort ascending for
4:49 readability and if we wanted to double
4:51 check and confirm that these customer
4:54 IDs are in fact unique what we could do
4:57 is head to view column profile and we
4:59 want to profile not just based on the
5:02 first 1,000 rows but the entire data set and
5:05 check it out we've got a row count of
5:08 11,887 all of which are distinct and
5:11 unique that means we have a valid
5:13 primary key for this table and each row
5:17 each customer ID represents one distinct
5:20 customer so we can go ahead and turn off
5:23 that column profile name this table
5:26 customers and we are good to go now
5:28 we're going to follow that same process
5:30 for stores and products so I'll go to
5:33 our next duplicated query here this time
5:35 I'm going to find all these store
5:39 related columns store ID country State
5:42 square meters and open date so it looks
5:44 like we want these five columns we'll
5:47 remove all of the others head to Home,
5:48 Remove
5:51 duplicates and again this is optional
5:53 but we can sort ascending and now we can
5:56 very clearly see that this retailer
5:58 operates an online store and then a
6:01 whole bunch of stores looks like 66 or
6:03 so stores across different countries
6:07 like the US UK Netherlands Italy Germany
6:10 France Etc so we've just duplicated our
6:12 store information and created our stores
6:14 Dimension table let's go ahead and name
6:17 this one stores that looks good and
6:19 we're going to do the same thing for
6:21 products so let's scroll over find all
6:24 of our product related info starting
6:27 with product ID we've got a name brand
6:30 color cost retail price and then some
6:32 subcategory and category level
6:34 information too which we'll talk about
6:37 in a little bit more detail later on so
6:39 let's remove those other columns now we
6:43 have a product table jump back to my ID
6:44 remove
6:47 duplicates sort
6:50 ascending and we are in great shape so
6:54 let's name this table products and now
6:56 that we've created these Dimension
6:58 specific tables here's the beauty of
7:00 that we can go back to to our original
7:02 transaction table and we can actually
7:05 get rid of all of those columns with the
7:07 exception of the keys so we'll keep
7:09 customer ID but we're going to get rid
7:13 of all these details gender name City
7:15 and so on so everything through date of
7:17 birth I'm going to right click remove
7:20 those columns all the store attributes
7:23 right click remove and all of our
7:27 product details right click and
7:30 remove so now this table just gives us
7:32 the transactional level information
7:34 right the order number the line item the
7:37 order and delivery dates the quantity
7:39 plus our three keys customer store and
7:42 product ID which will allow us to create
7:44 relationships to those three dimension
7:46 tables that we just created so that's
7:48 everything we need to do for now what
7:50 I'm going to do is head to Home, Close
7:53 and Load To. This is important: I'm only
7:55 going to create a connection I don't
7:57 want to dump all of these rows and data
8:00 points into a worksheet I want to add
8:02 this to the data model instead and
8:04 create a proper relational model let's
8:06 press okay we'll see the queries and
8:08 connections pane fire up and we'll
8:10 start to see those connections loading
8:12 data into our
8:14 model all right perfect we've got all of
8:17 our data loaded from here we can head to
8:20 power pivot manage our data model here
8:22 you can see those connections the data
8:24 has been compressed here in the model
8:26 we're going to head to diagram view
8:28 where we can view each table as a
8:29 distinct object and what I'm going to do
8:32 is pull my transaction table which is my
8:34 data or fact table I'm going to put it
8:36 right here in the middle I'm going to
8:38 kind of surround it by my Dimension
8:41 tables customer stores and products and
8:43 now instead of writing all sorts of
8:46 complex lookup or index match functions
8:48 all I need to do is select the primary
8:51 keys from each Dimension table and map
8:53 them to the matching foreign keys in my
8:56 fact table so customer ID relates to
8:59 customer ID store ID relates to store ID
9:03 and product ID you guessed it relates
9:05 to product ID and what we've just
9:08 created is known as a star schema which
9:10 is a very common database structure and
9:12 often a best practice for many types of
9:14 data analytics we've got that Central
9:17 fact table transactions surrounded by
9:20 Dimension tables connected via one to
9:22 many relationships and with this star
9:24 schema we can access the exact same
9:27 information that we could using a merged
9:29 table we can even Explore it using power
9:31 pivot which is essentially just a
9:33 regular pivot table that sits on top of
9:36 a data model instead of a single table
9:37 so let's go ahead and see what that
9:40 looks like going to add a pivot table
9:42 right here in the worksheet cell A1 and
9:44 here I've got my familiar field list and
9:47 I can grab data from any of my related
9:49 tables in the model so let's take a look
9:53 at total quantity sold we can break that
9:55 down by product
9:58 category like so on rows we can sort
10:00 this descending to see which categories
10:04 are sold most often we could even add a
10:07 slicer to understand which stores in
10:10 which countries sell different types of
10:12 products start to get a sense of which
10:13 products are most popular in different
10:15 parts of the world and we could also
10:18 Define new calculations and measures
10:21 using data analysis expressions or DAX
10:23 we could add visuals using pivot charts
10:25 or even use Cube functions to pull
10:27 values from our model directly into
10:30 worksheet cells and as a bonus we've
10:32 removed over a million redundant data
10:35 points and reduced our workbook size by
10:39 more than 70% so this is all great but
10:40 it's important to note that technically
10:43 we haven't fully normalized the data
10:45 there are still some dependencies that
10:46 we could address by splitting out
10:48 additional tables in our model for
10:50 example let's look at our transactions
10:53 table so we know that this table is in
10:55 first normal form but in order to
10:57 further normalize it to what's called
10:59 second normal form we would need to
11:02 eliminate any partial dependencies in
11:04 other words columns that only depend on
11:06 part of the primary key which in this
11:09 case is our transaction ID if you look
11:11 closely you'll notice that some Fields
11:13 like order date delivery date customer
11:17 ID and store ID only depend on the order
11:19 number so the same information is
11:22 repeated for each line item in the table
11:24 and that makes sense since each order
11:27 takes place on one specific date and you
11:28 wouldn't see different customers
11:31 purchasing individual line items within
11:33 one order on the other hand some Fields
11:36 like quantity and product ID depend on
11:39 both the order number and the line item
11:42 or the full transaction ID since orders
11:45 can contain multiple individual products
11:47 what that means is that to achieve
11:49 second normal form we need to break this
11:52 into two Separate Tables one at the
11:54 order level and one at the order line
11:57 item level so let's make that happen
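Here's the same second-normal-form split sketched in plain Python (made-up rows; the actual split below uses Power Query's UI): order-scoped columns move to an orders table keyed by order number, while the full-granularity columns stay at the line-item level.

```python
# Transactions in 1NF: the key is (order_number, line_item), but order_date
# and customer_id depend on order_number alone -- a partial dependency.
transactions = [
    {"order_number": 100, "line_item": 1, "order_date": "2024-01-05",
     "customer_id": "C1", "product_id": "P1", "quantity": 2},
    {"order_number": 100, "line_item": 2, "order_date": "2024-01-05",
     "customer_id": "C1", "product_id": "P2", "quantity": 1},
    {"order_number": 101, "line_item": 1, "order_date": "2024-01-07",
     "customer_id": "C2", "product_id": "P1", "quantity": 4},
]

# Order-level table: order-scoped columns only, de-duplicated on order_number.
orders = {}
for t in transactions:
    orders.setdefault(t["order_number"], {
        "order_number": t["order_number"],
        "order_date": t["order_date"],
        "customer_id": t["customer_id"],
    })
orders = list(orders.values())

# Line-item table: full granularity, keyed by (order_number, line_item).
line_items = [{k: t[k] for k in ("order_number", "line_item", "product_id", "quantity")}
              for t in transactions]

print(len(orders), len(line_items))  # 2 3
```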
11:59 right click duplicate this
12:01 transaction table again let's drag it up
12:03 to keep them together and why don't we
12:05 start with our order level table here
12:07 and just like when we split out our
12:10 Dimension tables the key is to isolate
12:12 just the relevant columns here so for
12:14 order level detail what I'm going to
12:16 grab is the order number column and I'm
12:19 going to control click the order date
12:22 delivery date assuming all line items
12:24 ship at the same time which is the case
12:26 with this data set I'm going to select
12:30 customer ID and store ID let's right
12:32 click remove the other columns and now
12:34 just like Dimension tables we're going
12:36 to remove duplicates from this order
12:38 number field and this will become the
12:41 primary key in this table and what we
12:43 can do here is rename this table name
12:45 let's call this one
12:49 orders that's okay let's rename this
12:51 duplicated version this will become line
12:53 items so for this one we want the full
12:56 granularity the full level of detail so
12:58 let's go ahead and keep the ID the order
13:00 number and the line item because we want all
13:02 of this information here let's keep
13:04 those three we're going to keep the
13:06 quantity field and the last one we need
13:10 is the product ID field right click
13:12 remove the others and we actually don't
13:14 need to remove duplicates because we
13:16 have the deepest level of granularity
13:18 and we know that these transaction IDs
13:21 are already unique so let's double click
13:23 let's name this one
13:24 order line
13:27 items and there you have it we have
13:29 officially removed the partial
13:31 dependencies from that original
13:33 transactions table and we've achieved
13:36 second normal form now we could keep
13:38 going down this path with some of our
13:40 existing Dimension tables as well for
13:43 example let's look at products now this
13:45 table is actually already in second
13:48 normal form because all of our non-key
13:50 columns do depend on the full primary
13:53 key or product ID so there are no
13:55 partial dependencies like we just saw in
13:58 our transaction table that said if we
14:00 scroll through you'll start to see that
14:02 we do still have some redundancies here
14:04 and we could continue to normalize this
14:08 table from second to third normal form
14:10 which would involve getting rid of any
14:13 transitive dependencies as well I know
14:15 that's a mouthful but it's basically
14:17 when columns are dependent on fields
14:20 other than the primary key and we do
14:22 have some transitive dependencies here
14:24 our product category and our product
14:27 subcategory don't depend entirely on the
14:30 product ID but rather their own key
14:32 columns like product category ID and
14:35 product subcategory ID so this is
14:37 another case where splitting our tables
14:40 will help us eliminate some of that
14:42 redundancy what we can do is duplicate
14:45 our product table we're going to duplicate it
14:47 twice because we're going to end up with
14:48 one dimension table that's at the
14:50 product level one that's at the
14:52 subcategory level and one that's at the
14:55 category level so let's start with our
14:57 second version here this will be our
15:00 subcategory table and let's find all the
15:02 fields that are subcategory related got
15:05 the product subcategory ID and the
15:07 subcategory name and here's the catch
15:09 because I want this table to also
15:11 connect us or relate to the category
15:15 level detail I also want the category ID
15:17 field here as well so I'm going to right
15:19 click remove the
15:22 others and I can remove any duplicates
15:25 from our subcategory ID that's going to
15:27 turn this into the primary key of this
15:31 subcategory table and we can rename
15:32 it
15:35 subcategories similar approach for
15:36 categories let's go to our next
15:39 duplicated version this time we just
15:41 need the category ID and the category
15:44 name remove everything
15:47 else get rid of those
15:50 duplicates and now we have a nice clean
15:52 category level Dimension table showing
15:54 the eight product categories that this
15:57 retailer sells let's rename that one
15:59 categories
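The snowflake split being performed here can be sketched in plain Python (illustrative data, not the workbook's products table): subcategories keep the category foreign key so the chain product → subcategory → category still resolves, and the product table keeps only its own attributes plus the subcategory key.

```python
# Products dimension with transitive dependencies: category name depends on
# category_id, not on the product_id primary key (made-up data).
products = [
    {"product_id": "P1", "product_name": "Desk Lamp",
     "subcategory_id": 10, "subcategory": "Lighting",   "category_id": 1, "category": "Home"},
    {"product_id": "P2", "product_name": "Floor Lamp",
     "subcategory_id": 10, "subcategory": "Lighting",   "category_id": 1, "category": "Home"},
    {"product_id": "P3", "product_name": "Headphones",
     "subcategory_id": 20, "subcategory": "Audio Gear", "category_id": 2, "category": "Audio"},
]

# Subcategory table keeps category_id so subcategory -> category still works.
subcategories = {p["subcategory_id"]: {"subcategory_id": p["subcategory_id"],
                                       "subcategory": p["subcategory"],
                                       "category_id": p["category_id"]}
                 for p in products}
categories = {p["category_id"]: {"category_id": p["category_id"],
                                 "category": p["category"]}
              for p in products}

# Products now keep only their own attributes plus the subcategory foreign key.
slim_products = [{k: p[k] for k in ("product_id", "product_name", "subcategory_id")}
                 for p in products]

print(len(slim_products), len(subcategories), len(categories))  # 3 2 2
```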
16:00 and now that we have these Dimension
16:02 tables split out we can go back to
16:04 products and get rid of those redundant
16:08 Fields so now all we really need is the
16:10 subcategory ID which will allow us to
16:13 connect to subcategories and then from
16:15 subcategories we can connect to
16:17 categories so I can select everything
16:20 after that subcategory name category ID
16:24 and category remove those and we've
16:26 eliminated that redundancy so let's go
16:29 ahead and close and load this to data
16:32 model press okay and we should see those
16:34 new queries here with the data loading
16:37 in all right looks like it's all loaded
16:39 and now we can head to our data model we
16:42 can go back into our diagram View and we
16:44 should start to see some of these new
16:46 tables here so instead of one
16:48 transaction table now we've got orders
16:51 and order line items we've got products
16:54 here and then hanging out over here on
16:55 the right don't miss them we've got
16:58 subcategories and categories and you can
17:00 see our relationships have gone away
17:02 since we modified our model so we
17:04 basically just need to recreate or
17:06 reconfigure our data model based on the
17:09 new table relationships so we know that
17:11 the fields that were related at the
17:15 order level were customer ID like so and
17:16 store
17:19 ID well products connect to the order
17:21 line item level so that's where we find
17:24 our product ID we can also connect our
17:27 distinct order number using a one-to-many
17:29 relationship to the order number in our
17:31 line item table then we can make a
17:33 similar chain of relationships
17:36 connecting our product subcategory ID to
17:39 our subcategories table and our category
17:42 ID to our category level table so as you
17:44 can see here our data model has become
17:47 quite a bit more complex we no longer
17:49 have that nice clean star schema we've
17:51 got snowflake schemas which are
17:53 basically just chains of Dimension and
17:55 subdimension tables and we could go even
17:58 further by normalizing our customer and
18:00 store tables as well so at this point
18:02 you're probably wondering how much
18:05 normalization is enough do I need to
18:07 eliminate every single redundant data
18:09 point is there some Universal standard
18:12 or best practice the short answer is no
18:14 the most important thing to keep in mind
18:16 about normalization is that it's all
18:19 about tradeoffs more normalization means
18:22 better Integrity less redundancy and
18:25 smaller individual tables but it also
18:27 means more complex data models and
18:30 therefore more complex queries
18:32 especially for multi-table analysis
18:34 that's why star schemas even though they
18:37 aren't usually fully normalized are such
18:39 a popular choice for things like BI
18:41 reporting or exploratory analysis now I
18:43 hope that helps if you'd like to learn
18:45 more check out the description for links
18:48 to our self-paced courses learning paths
18:50 and guided projects and as always make
18:52 sure to like And subscribe for more data
18:54 content just like this I'll see you in
18:57 the next one