0:04 Enterprise Computing HSC course unit one
0:07 data science so in this data science
0:10 unit we are establishing that data is
0:13 the foundation of all systems we need
0:15 data in our systems because data is what
0:18 supports our decision making processes
0:21 as humans and also indirectly using AI
0:24 to interpret data and then us still
0:26 reviewing what the AI interprets to
0:28 understand what actions we should take
0:30 within our Enterprise and what we
0:33 should do moving forward so this whole
0:36 unit is targeted at understanding data
0:38 so the first subsection is that of
0:40 collecting storing and analyzing data
0:42 three separate processes that are
0:45 aligned with data in how we get it how
0:48 we handle it and how we understand it
0:49 firstly it's understanding the
0:51 difference between quantitative and
0:54 qualitative data quantitative being the
0:56 numerical data we can measure and count and
0:59 qualitative being descriptive data about qualities we
1:01 need to understand that
1:02 distinction in this day and age it's
1:05 very easy to collect data and store data
1:07 but then that brings us to our next
1:09 point of Big Data the fact that we can
1:12 get data very easily but this data then
1:15 accumulates and takes up file space so
1:17 we need specialized systems and make use
1:20 of online storage for storing Big Data
1:23 because it is hard to store but having
1:25 that data available is valuable because we can
1:28 make lots of analyses and interpretations
1:30 from that data so we need to build our
1:32 systems around the notion of Big
1:35 Data one fact that you might know about
1:36 data is that we need to store data as
1:39 specific data types the way we store
1:41 data impacts on its function what
1:43 software can be used with it the compression
1:46 that can be applied to it and the tools
1:48 we can use to gather it and interpret it
1:50 so that is the data types and the basic
1:53 data types are text and number which can
1:55 be in the form of integers and floating
1:58 points okay but then we also have
2:00 booleans where we can select
2:02 specifically if it's an on or off or yes
2:04 or no type of response within it and
2:06 then obviously the file extensions that
2:08 can be applied to data as well all of
2:11 those impact on a data item's data type and
2:13 determine how a system will use that
2:15 data
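As a minimal sketch of how those basic types look in practice, here they are in Python (my choice of language for illustration; the values are invented):

    # The basic data types mentioned above: text (string), number
    # (integer and floating point) and boolean.
    product_name = "Widget"   # text
    units_sold = 1250         # number: integer
    unit_price = 4.99         # number: floating point
    in_stock = True           # boolean: an on/off, yes/no value

    # The type determines what a system can do with the data:
    revenue = units_sold * unit_price   # arithmetic works on numbers
    label = product_name.upper()        # string operations work on text
    if in_stock:                        # booleans drive yes/no decisions
        print(f"{label}: revenue ${revenue:.2f}")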
2:17 from here we then have measurements applied to data so a
2:19 variety of different ways of scaling
2:21 data and grasping how large it could be
2:23 and how it could be used and
2:24 there's some interesting terms there
2:26 that I won't go into at the moment
2:29 because I'm still learning them myself
2:31 from here we've got data sampling
2:33 when we are getting data from the
2:34 environment and putting it into the
2:37 system and one such area we know with
2:39 data sampling is that of gathering audio
2:41 data that's actually called sampling but
2:43 we're going beyond that here too and
2:45 then the notion of active and passive
2:47 sampling when we are intentionally
2:48 getting specific types of data or
2:50 whether the system is doing it itself
2:52 automatically and gathering data for us
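As a rough sketch of that audio-style sampling in Python (the 8000 Hz rate and 440 Hz tone are illustrative values only):

    # Audio-style sampling: measure a continuous signal at a fixed rate.
    import math

    sample_rate = 8000   # samples per second (Hz)
    duration = 0.01      # capture 10 milliseconds of signal
    freq = 440.0         # a 440 Hz tone to sample

    samples = [
        math.sin(2 * math.pi * freq * (n / sample_rate))
        for n in range(int(sample_rate * duration))
    ]
    print(f"captured {len(samples)} samples")   # 80 samples for 10 ms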
2:55 we then have aspects related to data
2:59 relevance okay how relevant is the data
3:01 for the enterprise's operations
3:04 accuracy the correctness of data that
3:06 we are getting which we're using in our
3:09 system the validity that data
3:11 follows the appropriate rules for it
3:14 to be accepted by the system and then the
3:17 reliability of data in satisfying what
3:20 we are using it for so they do overlap
3:22 they are all features of each other but
3:24 all slightly different definitions there
3:26 ultimately that we are getting data for
3:27 our system that is correct and
3:30 meaningful for our operations
3:34 we then have informatics supporting our
3:35 understanding of data now this can be
3:37 done in a variety of ways but as
3:39 data starts to get displayed
3:42 in the system we can start making those
3:43 interpretations which brings us then to
3:46 our next point of presenting data and
3:47 I've got this one in yellow because this
3:49 might be ways you are presenting it
3:51 within your assessment task through
3:54 graphs and infographics okay which begin
3:56 illustrating data and making it easier
3:59 to comprehend those spreadsheet-style
4:02 dashboards where we have data and it's
4:04 represented visually but then if we
4:06 start changing the data within this
4:08 spreadsheet then the actual graphs that
4:10 are on display and the pivot tables that are
4:13 on display change live in response
4:16 to the values we are changing we can use
4:18 data then to generate reports as an
4:20 output of data that could be presented
4:22 and have our own interpretations written
4:24 on it and we can also establish things
4:27 such as network diagrams and maps which
4:30 can show obviously the makeup of
4:32 different segments of a network or a
4:34 geographical location and how data might
4:36 differ as it's dispersed
4:39 across a specific network or landscape
4:41 so all those features can be used to
4:42 present
4:45 data then we can talk about structured
4:47 and unstructured data sets and this can
4:49 be affiliated with big data but
4:49 essentially as data is accumulated is it
4:52 in a structured format or is it just
4:54 gathering numbers that we need to
4:56 structure later and what do we do there
4:58 okay so we need to differentiate between
5:01 those two forms of data sets Okay from
5:03 here then we also need to gather sources
5:05 of feedback based on our data or based
5:08 on our system okay where data is coming
5:11 in and we are getting it but then what
5:12 response are we making in relation to
5:14 that data because that's what it's all
5:16 about we get the data and we make a
5:18 response so it's ensuring that we know
5:20 what our sources of feedback are we have
5:22 criteria to make sure that our feedback
5:24 is effective and appropriate for
5:26 whatever our system is
5:29 doing now we then come to errors in data
5:32 and errors can be detrimental to
5:34 operations so they must be identified so
5:37 that they can be addressed errors can
5:38 come at the initial point of collection
5:40 from our data sources which is why it's
5:42 so important to cross reference our data
5:45 sources if we are putting incorrect data
5:47 into our system it will ultimately be
5:49 incorrect information and once processed
5:52 create incorrect values that our system
5:54 will process and we need to make sure we
5:56 identify that so we then don't use those
5:58 incorrect values as a part of our
6:00 decision-making processes as said this
6:03 stems to raw versus processed data okay
6:05 when data goes in raw we haven't checked
6:08 it there could be errors there and if
6:10 it's then processed okay it will lead to
6:12 incorrect operations taking place so we
6:14 need things such as validations and
6:17 verifications in place to check data
6:19 when it is entered into a system that it
6:21 does go in correctly and if someone
6:24 accidentally does make a typo okay it will
6:25 identify that this is the wrong format or
6:28 doesn't follow the range limits we put
6:30 on it okay that there are rules in place to
6:31 ensure that data when it goes into the
6:33 system through validation and through
6:35 verification is
6:37 entered in a correct format but as said
6:39 if the data source is incorrect the data we're
6:40 going to get from it is still going to go
6:41 in incorrectly so we need to do our own
6:44 research on our end to cross-check data
6:46 and make sure it's correct
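As a minimal sketch of validation in Python (the field names and the 1-999 range rule are invented business rules):

    # Format and range checks applied before data enters the system.
    import re

    def validate_order(quantity_text: str, email: str) -> list[str]:
        errors = []
        # Format check: quantity must be a whole number.
        if not quantity_text.isdigit():
            errors.append("quantity must be a whole number")
        # Range check: an assumed limit of 1-999 units per order.
        elif not 1 <= int(quantity_text) <= 999:
            errors.append("quantity must be between 1 and 999")
        # Format check: a very rough email pattern.
        if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
            errors.append("email is not in a valid format")
        return errors

    print(validate_order("12", "student@example.com"))   # [] -> accepted
    print(validate_order("-3", "not-an-email"))          # two errors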
6:48 the other area of error related to data is that of
6:51 bias that we are selecting data sources
6:55 that skew data in a specific way that we
6:58 want it to be this can be intentional or
6:59 it could be unintentional that we're
7:02 just not doing a wide enough level of
7:04 research and Gathering data from a wide
7:07 enough array of different sources for
7:09 our system so we've got to factor in that
7:10 bias can lead to errors as well and
7:12 we've got to try to counteract that by
7:15 getting data from a variety of
7:17 diverse locations that
7:20 fully represent the scope of data we're
7:21 trying to represent within our
7:22 enterprise
7:25 system the next one then is blockchain
7:27 blockchain being that we can track the
7:29 movement of data and this is obviously
7:31 heavily affiliated with cryptocurrency
7:32 and that might be the best way to
7:34 understand that we can actually see how
7:37 a cryptocurrency such as Bitcoin has
7:39 moved through different ownerships and
7:40 we can actually track it from its
7:42 Inception so we can actually track data
7:44 that's what blockchain is all about so
7:46 areas where blockchain can be used
7:48 such as for online voting and tracking
7:52 who's doing specific voting online
7:53 identities and what those identities are
7:56 doing the movement of specific items
7:59 these could be digital items or
8:01 physical items but knowing who has
8:03 ownership of them and thus supporting
8:06 recordkeeping we can put a name to these
8:08 things okay and I should specify with
8:09 online voting too they're probably not
8:11 tracking the name of the person they're
8:12 probably just tracking that they voted
8:13 you're not allowed to track who they
8:15 actually voted for and all that because it
8:17 is meant to have anonymity to it and
8:18 all of that
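As a toy sketch of the idea behind blockchain in Python (the transfers are invented), each block stores a hash of the previous block, so ownership history can be tracked and tampering detected:

    # A minimal hash chain: the core structure behind blockchain.
    import hashlib, json

    def block_hash(block: dict) -> str:
        data = json.dumps(block, sort_keys=True).encode()
        return hashlib.sha256(data).hexdigest()

    chain = []
    prev = "0" * 64   # the genesis block has no predecessor
    for transfer in ["mint -> alice", "alice -> bob", "bob -> carol"]:
        block = {"transfer": transfer, "prev_hash": prev}
        chain.append(block)
        prev = block_hash(block)

    # Verify: every block must link to the hash of the block before it.
    valid = all(chain[i + 1]["prev_hash"] == block_hash(chain[i])
                for i in range(len(chain) - 1))
    print("chain valid:", valid)   # False if any earlier record is altered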
8:21 the next area is then privacy and
8:23 security of data and specific tools that
8:25 we've got to be aware of where we might
8:27 have to put security procedures in place
8:30 one such one which is obvious is
8:32 AutoFill it's great that our personal
8:34 information and our financial
8:36 information can be remembered by our
8:39 browser and be integrated and inserted
8:41 into text boxes automatically when we
8:43 do an online purchase but then
8:45 there's a security Factor related to
8:49 that so convenience can be at the cost
8:51 of security we've got to weigh that up with
8:53 our system we also have that of private
8:55 and public connections it's great when we
8:57 go somewhere such as a public library
9:00 and access a public network connection
9:02 but is it secure whereas if I use my own
9:05 hotspotting or if I just do my work from
9:07 home I have a better private connection
9:10 there the use of checkboxes too can
9:12 also be a factor in relation to security
9:14 when we're switching things on and off
9:16 and how it's being used and then also
9:18 terms of agreements for the things that
9:20 we sign up for are we actually reading
9:22 them and that's a big issue because we
9:24 sign up for a lot of things these days
9:26 specifically with online platforms but
9:27 do we fully understand what we're
9:30 signing up for it is stated in their terms
9:32 of agreement they do say such as through
9:34 social media platforms how they're going
9:36 to use our data but we didn't even read
9:38 it because you know sometimes we sign up when
9:40 we're young and we don't even care but
9:42 those terms of agreement could say that
9:43 they're going to use the pictures we're
9:45 uploading to a platform as a part of
9:47 their own business okay or it could also
9:49 limit what we can do as a part of their
9:51 licensing agreement how we use their
9:53 specific data and platforms so all this
9:54 is important and it's all written within
9:56 terms of agreement it's just so long to
9:58 read and that's also an issue there in
10:01 relation to privacy and security
10:03 and then we also have the impact of data
10:06 scale the amount of data that is
10:09 available we are very data-rich these
10:12 days it's very easy to get data as said
10:15 with big data so we've got to factor in
10:17 the volume of raw data we're putting
10:18 into our systems how much we're putting
10:21 in where's it going to be stored whether
10:23 locally or in online platforms which is
10:25 more so the case so that it can be
10:27 networked as part of a large enterprise
10:30 system how data might not necessarily be
10:32 downloaded from these online
10:33 platforms but it's more likely to be
10:36 streamed live to keep the data off the
10:38 local storage and keep it on the online
10:40 storage for
10:42 efficiency the way machine learning
10:45 interacts with data so machine learning
10:47 is obviously when the AI is learning by
10:50 itself so based on it accumulating data
10:52 it changes its responses and interprets
10:54 data in different ways so that
10:57 accumulation helps it learn data can
11:00 also impact on human behavior us as
11:02 humans responding to data what do we see
11:04 how do we change our actions in response
11:07 to data and then the ethical
11:09 implications of data what do we do in
11:12 response to data and also where are we
11:15 getting data from is it always ethical
11:19 how we get the data all right and where
11:21 can we read data from and who owns that
11:24 data so there's many aspects to data and
11:27 specifically the collection and who is
11:29 viewing it that relate to the ethics of
11:31 it okay not all data can be public
11:33 because it relates to
11:35 private individuals in some cases so
11:37 there's many ethical implications in
11:41 relation to the impact of data okay the
11:42 final two things I'll talk about in this
11:46 section are firstly data storage how data
11:47 will be stored and I've already said it
11:49 a few times that we have data that could
11:51 be stored on the local storage of a
11:52 system on its hard drives and solid
11:55 state drives we can also have local network
11:57 storage where we have our own servers
11:58 but also these days as well we have
12:00 cloud storage and then a variety
12:03 of ways that can be used public clouds
12:05 private clouds hybrid clouds and then
12:07 that often is the foundation for the
12:09 enterprise system and the sharing of
12:11 data across a Global Network for that
12:15 system so that is then the data storage
12:16 but then we also have this thing called
12:20 a data warehouse because we have so much
12:23 data okay sometimes we take data from
12:25 specific time periods so it might be
12:27 last year's data related to last year's
12:30 customers okay and then we save that
12:33 away to a data warehouse once put in
12:35 that data warehouse it might then go
12:38 with all our previous years okay worth
12:41 of data in that warehouse and we store
12:44 it there to analyze that data using
12:47 technologies such as OLAP okay which are
12:49 used for data mining and in that data
12:51 warehouse we then can look for Trends
12:54 and patterns in historical data that can
12:56 support us in planning for future
12:59 operations so a very supportive tool
13:02 okay not just for the storage of data but the
13:05 analysis of data okay and hopefully
13:07 assisting us with predicting successful
13:10 plans for the future
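As a small sketch of the kind of roll-up a data warehouse supports, here is a trend summary in Python (the historical sales figures are invented):

    # Summarize historical records by year to look for trends.
    from collections import defaultdict

    history = [
        {"year": 2022, "region": "north", "sales": 120},
        {"year": 2022, "region": "south", "sales": 95},
        {"year": 2023, "region": "north", "sales": 150},
        {"year": 2023, "region": "south", "sales": 110},
    ]

    totals = defaultdict(int)
    for record in history:
        totals[record["year"]] += record["sales"]

    for year in sorted(totals):
        print(year, totals[year])   # 2022 -> 215, 2023 -> 260: rising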
13:12 the next section then is that of data quality data
13:14 quality means that obviously data is
13:17 correct and reliable but also that data is
13:19 meaningful for the operations of an
13:22 Enterprise so firstly is the ethical use
13:24 of data as we already said with ethical
13:27 implications we've got this data now we
13:29 need to control who can view this
13:31 data and that might be linked to
13:33 permissions and who the data is relevant
13:36 to as a part of their operations within
13:39 the Enterprise and also the sharing of
13:41 data and data transparency and the fact
13:44 that we have people's personal data
13:45 we've got to keep it secure from cyber
13:48 security threats and things like that as well so
13:50 we've got to keep an ethical lens on
13:53 when accessing data realizing data is
13:55 valuable and we've got to keep it
13:57 private this links us to our social
14:00 legal and ethical issues such as bias
14:01 which I spoke about before where we can
14:04 skew data in different directions and we
14:06 should try to get data from a variety of
14:08 sources the accuracy of data and how
14:11 correct it is the use of metadata the
14:14 data behind data okay which is
14:17 fundamental to databases and websites
14:19 and the fact that that also needs to be
14:20 kept private because that has uh links
14:22 to private
14:25 information copyright of specific data
14:27 and systems and the acknowledgement of
14:30 sources of data that are used within our
14:33 systems that we are referencing systems
14:35 companies people who produce data when
14:37 being used with our systems and then
14:39 stemming from that IP intellectual
14:42 property and then ICIP Indigenous
14:45 Cultural and Intellectual Property okay that
14:47 we know the laws that are around these
14:48 things and we've got to respect those
14:51 laws when we are using systems and data
14:54 that are under IP or
14:57 ICIP the establishment of permissions
14:59 rights and privacy rules around data
15:01 which we've mentioned before once again
15:03 to limit who can view data within
15:05 systems while we can all work for the
15:07 same Enterprise we shouldn't all have
15:09 access to all data of the Enterprise
15:11 that's why permissions and rights are
15:13 important to establish and then our
15:15 security tools for protecting our system
15:17 and our Network okay our login
15:19 procedures our use of Biometrics
15:21 encrypting data in transmission and in storage
15:24 setting up a firewall for our Network a
15:26 whole variety of tools built to protect
15:29 our network from cyber security threats
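As a small sketch of one of those tools in Python (the password and parameters are invented), login procedures typically store a salted hash rather than the password itself:

    # Store a salted hash of a password so a stolen database
    # doesn't reveal the original credentials.
    import hashlib, hmac, os

    def hash_password(password, salt=None):
        salt = salt or os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        return salt, digest

    salt, stored = hash_password("correct horse battery staple")

    # At login, re-hash the attempt with the same salt and compare safely.
    _, attempt = hash_password("correct horse battery staple", salt)
    print(hmac.compare_digest(stored, attempt))   # True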
15:32 specifically on the legal aspects of
15:34 data too we need to know existing
15:36 legislation in place such as the
15:38 Privacy Act 1988 and those principles
15:41 that surround it okay and then also if
15:43 we're unsure about things and we need
15:45 guidance who are the responsible
15:46 authorities we know the government but
15:49 then who within the government groups
15:51 such as the OAIC who we contact in the
15:53 instance of a data breach things like
15:55 that okay that we need to know
15:57 specifically who to go to in instances
15:59 where there are concerns about data then
16:00 we also need to know about data
16:03 sovereignty of indigenous peoples and
16:05 how we support them and how data is used
16:07 in the context of their cultures and
16:09 their community and we still respect
16:11 their traditions and beliefs in how we
16:13 use that data to support
16:15 them okay and then we've got curated and
16:18 communicated data on social behavior
16:20 okay understanding things such as data
16:23 literacy how to actually specifically
16:25 understand data timelines of data and
16:28 how it is used okay signals and data
16:30 swamps and then educating users in this
16:32 area once again an area that I need to
16:34 look into more to get my own understanding
16:36 about it so that final point is relevant
16:38 to me too but there's some key terms
16:40 that are also very new to this course in
16:42 relation to data and social
16:45 behavior the final section is processing
16:47 and presenting data so data has been
16:49 processed turned into information and we're
16:52 putting it into a format that we can show
16:54 stakeholders clients or peers so that it
16:57 is ultimately comprehensible to them so
17:00 kind of the output of data that has been
17:02 digested in a way for people to
17:03 understand and here you're going to see
17:05 a lot more yellow boxes because they could
17:06 correlate to things that we could have
17:08 embedded into our assessment task so
17:10 first one is that of flat file databases
17:14 setting up a simple one-table database
17:16 that shows a variety of records usually
17:19 related to one specific area that is
17:21 done using a database package such as
17:24 Microsoft Access
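As a minimal sketch of the flat file idea in Python using a CSV file (the table and fields are invented):

    # A flat-file database: a single table stored as a single CSV file.
    import csv

    rows = [
        {"id": 1, "name": "Ava", "year": 12},
        {"id": 2, "name": "Ben", "year": 11},
    ]

    with open("students.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name", "year"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the records back: every row follows the same flat structure.
    with open("students.csv", newline="") as f:
        for record in csv.DictReader(f):
            print(record["id"], record["name"], record["year"])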
17:26 we also then have spreadsheet summaries for the
17:28 correlation of information so this could
17:30 be as a user collating information
17:32 within the spreadsheet's rows and
17:34 cells and all that but it could also be
17:37 that I've got a form on the front end
17:38 for the collection of data that I've
17:40 sent it out as a Google form and I've
17:42 shared it with a whole bunch of people
17:43 and then when they enter in their
17:46 responses it updates in the spreadsheet
17:49 okay and then from that spreadsheet I
17:51 can then develop things such as
17:54 graphs and tables that summarize data
17:55 and make it more comprehensible which
17:57 then brings us to our next point of
18:00 filtering grouping and sorting data we
18:03 can use tools within the spreadsheet to
18:06 categorize our data and add filters
18:08 so we can look at specific data sets and
18:11 summarize data and focus on specific
18:13 groups we can link sheets with other
18:15 sheets and we can also make use of a
18:17 thing called conditional formatting
18:19 where specific values that meet certain
18:21 rules will be highlighted okay it could
18:23 be highlighted in red if a certain value
18:25 is negative or highlighted green if a
18:26 certain value represents that a certain
18:29 area is doing well this helps us
18:32 with data comparisons
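As a small sketch in Python of the filtering, sorting and summarizing those spreadsheet tools perform (the marks are invented):

    # Filter, sort and summarize a set of records, spreadsheet-style.
    records = [
        {"student": "Ava", "subject": "English Advanced", "mark": 88},
        {"student": "Ben", "subject": "English Standard", "mark": 72},
        {"student": "Cai", "subject": "English Advanced", "mark": 91},
    ]

    # Filter: keep only one group, like applying a column filter.
    advanced = [r for r in records if r["subject"] == "English Advanced"]

    # Sort: highest mark first.
    advanced.sort(key=lambda r: r["mark"], reverse=True)

    # Summarize: an average for the filtered group.
    average = sum(r["mark"] for r in advanced) / len(advanced)
    print([r["student"] for r in advanced], average)   # ['Cai', 'Ava'] 89.5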
18:34 and then as said before we can have forms acting as the
18:37 front end for our spreadsheet
18:39 collecting data from a variety of
18:41 clients and users okay for us to
18:43 accumulate data within our spreadsheet
18:45 but then we could also have reports for
18:47 our summary where we put this all into a
18:51 formatted view to be printed off or
18:53 sent out digitally that summarizes all
18:54 the information for our
18:58 stakeholders a very modern tool used these
18:59 days and this is also in conjunction with
19:02 spreadsheets is that of dashboards so
19:05 dashboards are like a very graphical
19:07 setup for a spreadsheet and in many
19:09 cases we actually get rid of the grid of
19:11 the spreadsheet so that big tabular
19:13 format kind of disappears and it's all
19:16 kind of text boxes and visualizations on
19:19 screen uh that are used to represent the
19:21 actual data so there will be a few
19:23 numbers on screen but it's more the
19:25 visualizations visualizations in the
19:28 forms of graphs but these graphs might
19:30 change based on us entering different
19:32 data and data sets but that could be us
19:34 manually entering it we could also be
19:36 using things such as pivot tables and
19:39 slicers so tables that will shrink and
19:42 enlarge based on what slicers are active
19:44 so it could be that I have a specific
19:46 category of information you could think
19:49 of it as subjects at school and when I
19:51 click English Advanced only English
19:53 Advanced students will appear in the
19:55 table and the marks allocated to them
19:57 but then if I also click English
19:59 Standard English Standard students and
20:00 English Advanced students will appear in
20:02 the table with their marks all together
20:05 side by side so the table will adjust
20:08 depending on which slicers I
20:10 have switched on and off and then that
20:12 could also be linked to a graph that is
20:13 also adjusting accordingly and
20:16 representing metrics in a visualized
20:19 format visualization being key and
20:21 obviously visualization is now being
20:23 introduced here and that correlates with
20:26 our next unit of data visualization in
20:29 the Enterprise Computing year 12 course
20:31 we then have the design of a relational
20:33 database so these are the databases that
20:36 are larger than flat file databases and
20:38 have multiple tables that we often refer
20:40 to as entities we create each of these
20:42 entities using a data dictionary that
20:45 allows us to establish metadata for each
20:47 of the entities what is the actual name
20:49 of the categories in these
20:51 entities which we refer to as fields okay
20:53 what data types are they made up of giving
20:55 descriptions about them how long
20:57 will they be how much allocation of
20:59 memory will we give for each one we
21:01 provide examples of data and describe
21:02 the data they are all categories
21:05 included in a data dictionary as said we
21:07 use multiple entities to make a
21:09 relational database but we connect them
21:11 through relationships through primary
21:14 and foreign keys each entity
21:15 needs to have a primary key which is its
21:18 main key usually an ID field that is a
21:20 specific number format and then we can
21:23 carry that over as a foreign key okay
21:25 the exact same number to another entity
21:28 to establish that relationship
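As a minimal sketch of that structure in Python with SQLite (the entities, fields and values are invented), the foreign key in one entity repeats the primary key of another:

    # Two entities linked by a primary key / foreign key relationship.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,   -- primary key of this entity
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER,               -- foreign key back to customer
            total       REAL,
            FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
        );
    """)
    con.execute("INSERT INTO customer VALUES (1, 'Ava')")
    con.execute("INSERT INTO orders VALUES (10, 1, 49.99)")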
21:30 once we have these relational databases
21:32 set up we can search them and sort them
21:35 and one very fundamental way of doing
21:38 that is using SQL structured query
21:39 language where we have a series of
21:42 keywords used for selecting different
21:44 fields and extracting them from specific
21:46 tables and then applying a condition
21:48 using the WHERE keyword making use of
21:50 operators to say if data is greater than
21:53 less than equal to or combining criteria
21:55 together using AND and OR a whole
21:57 variety of tools for searching and
21:59 sorting within a relational database
22:01 but also there's things such as QBE
22:03 within modern database management
22:05 systems that can do all this for us
22:07 using interfaces but we're still going
22:08 to need to know SQL because we're going to be
22:11 doing this in the HSC and we can't use
22:12 software in the HSC we've got to do it
22:14 with our minds writing out the specific
22:17 code
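As a standalone sketch of those keywords in action, again in Python with SQLite (the table and values are invented):

    # The SQL keywords described above, run against a small table.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE result (student TEXT, subject TEXT, mark INTEGER)")
    con.executemany("INSERT INTO result VALUES (?, ?, ?)", [
        ("Ava", "English Advanced", 88),
        ("Ben", "English Standard", 72),
        ("Cai", "English Advanced", 91),
    ])

    query = """
        SELECT student, mark
        FROM result
        WHERE subject = 'English Advanced' AND mark > 85
        ORDER BY mark DESC
    """
    for row in con.execute(query):
        print(row)   # ('Cai', 91) then ('Ava', 88)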
22:19 and then we mentioned them before forms and reports in relation
22:21 to filtering grouping and sorting data well we
22:23 can set them up using database
22:25 Management Systems in a relational
22:26 database for collecting data and
22:29 displaying data at both the front end of
22:30 collection and at the back end of
22:33 displaying information the final thing
22:35 about this unit is that of machine
22:37 learning and statistical modeling and
22:39 obviously very modern these days and
22:41 obviously a new part of the course in
22:43 that we now have systems with neural
22:47 networks that can learn by themselves so
22:49 they accumulate all this data they
22:52 interpret all this data and then they
22:54 give us feedback and present the
22:56 visualization itself present the
22:58 statistics to us in a formatted view
23:01 summarizing it for us making our lives
23:02 a lot easier because
23:04 one of the whole
23:05 themes of this unit that you've seen
23:07 with data science is how much data we
23:10 are collecting now okay terabytes of
23:13 data exabytes of data now okay data
23:15 amounts that we can't comprehend and
23:17 these larger Enterprise systems are
23:19 handling them daily the amount of data
23:21 think about how much data Google gets in
23:24 a day so if we can have machine learning
23:26 supporting us in this processing and
23:28 then giving us its output in a
23:30 statistical format in a model that we
23:32 can understand because it's a good
23:34 summary of that data that is of great
23:37 benefit to us as humans
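As a toy sketch of statistical modeling in Python (the data points are invented), fitting a least-squares line to historical values and extrapolating is the simplest version of that kind of prediction:

    # Fit a straight line y = slope * x + intercept to past data,
    # then use it to predict the next value.
    xs = [1, 2, 3, 4, 5]        # e.g. month number
    ys = [10, 14, 15, 19, 22]   # e.g. sales in that month

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x

    print(f"model: y = {slope:.2f}x + {intercept:.2f}")
    print("predicted month 6:", slope * 6 + intercept)   # extends the trend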
23:39 so I hope this video has given you an understanding of
23:41 this first unit of data science a lot of
23:44 new technical terms in this unit and
23:46 essentially the purpose of the unit
23:48 understanding the foundations of data in
23:51 how we collect it how it is made how we
23:53 store it how we analyze it and
23:56 essentially how it is of data quality
23:58 how it is of quality to us how it is
24:00 meaningful to us in our operations so we
24:02 need to understand and be able to
24:04 comprehend it as said this unit kind of
24:06 then stems into the second unit of data
24:08 visualizations where we start turning
24:10 data into a format that is
24:12 comprehensible and usable and thus
24:15 meaningful to present to people who
24:17 aren't as educated in Computing and in
24:19 data so they can understand it and use
24:21 it for their purposes but we'll get into
24:23 that when we do our next mind map on
24:25 data visualizations but hopefully at
24:27 this point you understand what data
24:28 science is all about for the Enterprise