0:03 hi I'm Oliver, a PhD in machine learning
0:06 and currently a solutions consultant here at
0:08 Ataccama today I'm going to be showing
0:10 you how Ataccama and its suite of data
0:12 quality tools can be used to let data
0:13 scientists and machine learning
0:15 engineers create better machine learning
0:22 models more quickly so to do this let's look at a
0:24 typical machine learning project in the
0:27 healthcare industry with a data scientist who is already
0:28 familiar with common machine learning
0:30 tools such as Databricks
0:31 which we'll use for the sake of this
0:35 demo so here we have a big data set of
0:37 test results from patients of various
0:40 hospitals and what's ultimately being
0:42 asked of the data scientist is for them
0:45 to use this data to determine patients
0:47 at a heightened risk for a heart attack to
0:50 allow for more optimal and streamlined
0:52 treatment as an example and so if we
0:55 look at this data as a casual observer
0:58 we already see a lot of common data
1:01 issues where we have variations in
1:04 the sex column we have lots of nulls we
1:06 have big variety in some information
1:09 where perhaps the units changed for
1:13 example and that's a big problem because
1:14 as anybody familiar with machine
1:16 learning knows one of the main
1:18 principles of getting good performance
1:21 out of a model is that garbage in
1:23 results in garbage
1:25 out and what it means for our data
1:27 scientist who has actually been asked to
1:29 come up with this predictive model is
1:31 that 50 to 80% of their time on this
1:33 project isn't actually going toward making a
1:35 predictive model it's going to be doing
1:38 what we've just done now where we're
1:41 just being given a table of information
1:43 and in a silo determining ourselves what
1:45 these data issues are and performing
1:49 one-off very manual cleanup all of which has
1:51 to be repeated inevitably the next time
1:53 that this data gets used in another
1:56 project so how does that play out in
1:58 Ataccama well let's use our native
2:00 connections to the data and the processing
2:03 in Databricks and let's look at that
2:06 exact same data set but you'll
2:08 immediately notice a difference and
2:10 that's that we've made data quality a
2:13 team sport so the data scientist who's
2:15 been asked to come up with this model
2:17 doesn't just get a dump of that data
2:19 they'll be told oh here are some machine
2:21 learning predictions to say well maybe
2:24 this data contains PII and oh these are
2:26 the exact issues that are wrong with
2:30 this data set and because it's a sport
2:32 where everyone is contributing and we
2:34 aren't making data scientists work in
2:36 silos we can even see non-obvious issues
2:39 as well so let's look at this seemingly
2:42 perfectly fine record where we see oh
2:44 here's an unreliable test machine what
2:46 does that mean well we can open up the
2:49 details we can go to the data quality
2:50 rule that's being applied we can look at
2:52 the description that tells us what's going
2:54 on and see that oh between this little
2:57 period of time there was some
2:58 incorrectly calibrated research
3:00 equipment
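A rule like that might look something like the following sketch in pandas. The column names (`machine_id`, `test_date`) and the calibration window are invented for illustration; Ataccama expresses such rules in its own rule definitions, so this only mirrors the logic being described:

```python
import pandas as pd

# Hypothetical "unreliable test machine" rule: flag any result recorded
# by a given machine during the window in which it was known to be
# incorrectly calibrated. Machine ID and dates are assumptions.
CALIBRATION_ISSUE = {
    "machine_id": "LAB-07",
    "start": pd.Timestamp("2023-03-01"),
    "end": pd.Timestamp("2023-03-14"),
}

def flag_unreliable_machine(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask marking rows affected by the calibration issue."""
    in_window = df["test_date"].between(
        CALIBRATION_ISSUE["start"], CALIBRATION_ISSUE["end"]
    )
    return (df["machine_id"] == CALIBRATION_ISSUE["machine_id"]) & in_window

df = pd.DataFrame({
    "machine_id": ["LAB-07", "LAB-07", "LAB-02"],
    "test_date": pd.to_datetime(["2023-03-05", "2023-04-01", "2023-03-05"]),
    "cholesterol": [210.0, 195.0, 180.0],
})
df["unreliable_test_machine"] = flag_unreliable_machine(df)
```

Only the first row is flagged: it is the one result from the affected machine inside the affected window.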
3:03 so now instead of having to spend 50 to
3:05 80% of this project just understanding
3:08 and cleaning the data within a silo all
3:10 that happens is the data scientists
3:13 go into the Ataccama data catalog they
3:16 get the data set and they're given the
3:18 additional issue tables that let them
3:20 quickly filter and impute based on all
3:23 the issues they're now already aware of
3:25 and the data engineers get a source of
3:27 issues that they can remediate against
3:28 all of which is of course doable within
3:30 Ataccama ONE
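As a rough sketch of what that workflow looks like on the data science side, assume a hypothetical issue table with `record_id` and `issue_type` columns (real Ataccama exports will differ in shape and naming): the data scientist can filter out unrepairable records and impute the rest instead of hunting for problems manually.

```python
import pandas as pd

# Hypothetical data set and accompanying issue table.
data = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "cholesterol": [210.0, None, 612.0, 190.0],
})
issues = pd.DataFrame({
    "record_id": [2, 3],
    "issue_type": ["null_value", "unreliable_test_machine"],
})

# Filter: drop records flagged as coming from the unreliable machine.
bad_ids = issues.loc[
    issues["issue_type"] == "unreliable_test_machine", "record_id"
]
filtered = data[~data["record_id"].isin(bad_ids)].copy()

# Impute: fill the remaining nulls with the median of trustworthy values.
filtered["cholesterol"] = filtered["cholesterol"].fillna(
    filtered["cholesterol"].median()
)
```

The point is that the filtering and imputation are driven by issues someone else already catalogued, not rediscovered from scratch in a silo.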
3:32 and that means that when the data set is
3:34 used again in another project because we
3:37 want to reuse our data to solve multiple
3:40 problems with it it's fit for
3:44 everybody so let's revisit this concept
3:46 again of garbage in and garbage out and
3:48 look at how a machine learning model
3:50 performs against an evaluation data set
3:55 of known good clean data after it's been
3:56 through the Ataccama process where
3:58 Ataccama is telling us we have great
4:00 data quality this is what we want to be
4:02 testing our machine learning model
4:05 against we don't want to test it against
4:08 garbage so let's go back to Databricks
4:10 and look at the performance of our model
4:13 now that we've quickly dealt with all this
4:14 bad
4:17 data what we see is that we get a better
4:20 performance baseline on a bunch of
4:22 reasonable standard statistics
4:25 where we're saying the precision is
4:27 better the accuracy is better the F1
4:29 is better every conceivable statistic
4:31 that we want to use to build our model
4:35 just by using clean data is better we're
4:37 starting to build our model with a
4:40 better baseline and we get more time to
4:42 actually work on the model to make it
4:46 better on top of that but what's really
4:49 good is that even after we've finished
4:51 developing our model Ataccama can still
4:54 be incredibly helpful in the business
4:56 context we need to think about how we
4:58 continuously evaluate that model in a
5:00 production environment to make sure it's
5:02 still working and guard it against data
5:05 drift so how will that play out without
5:08 Ataccama following standard DevOps practices
5:09 used
5:13 today so let's say we finished our model
5:15 and the MLOps team came up with a way to
5:20 evaluate it against known clean data
5:23 sets over time and what we'll see is oh
5:25 look the model is working it's working
5:27 it's working it's working it's working
5:30 oh no it's suddenly not working
5:33 but using standard MLOps tools we're
5:36 going after the fact we're just going
5:38 off the outcome all we know
5:39 is that the output isn't correct we don't
5:41 know what has gone wrong and we don't
5:45 know what is incorrect so let's go back
5:47 to Ataccama
5:50 again and here we're using Ataccama's
5:52 data observability module that allows us
5:55 to monitor data as it changes
5:57 proactively against many many things
6:00 from just the pure data quality to
6:02 machine learning powered anomaly detection
6:04 on both the data and the metadata to
6:07 schema changes to the business terms
6:09 changing to the structure changing and
6:11 even how fresh the data is relative to
6:13 itself and how often its usage is
6:15 changing and we're immediately starting
6:19 to see a couple of issues so one the
6:21 metadata anomaly detection has noticed a
6:24 change and two the schema of the data
6:26 itself has changed so let's open that
6:30 data set and oh there we go we have a
6:33 bunch of issues in the gender column
6:35 let's open it and we can see oh we're
6:36 violating this gender synonym rule that
6:39 we talked about earlier we could even
6:41 look at it see the description and see
6:43 that it should be male or
6:45 female and just like that we've
6:48 immediately found what the issue is the
6:51 data has changed there's been data drift
6:53 and the machine learning model is now
6:56 seeing new data and classifying it
6:58 incorrectly and in just seconds we've
6:59 got a root cause
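The gender synonym check that surfaced this root cause amounts to a domain-validity rule: compare a categorical column against its allowed value set and report the offenders. A minimal plain-Python sketch, with the incoming values invented for illustration (Ataccama's actual rule definitions look different):

```python
# Allowed domain for the gender column, per the rule description above.
ALLOWED_GENDER = {"male", "female"}

def find_domain_violations(values: list[str], allowed: set[str]) -> dict[str, int]:
    """Count values outside the allowed domain (case-insensitive)."""
    counts: dict[str, int] = {}
    for v in values:
        if v.strip().lower() not in allowed:
            counts[v] = counts.get(v, 0) + 1
    return counts

# Upstream started sending coded values, so the model sees unfamiliar
# categories -- exactly the data drift described above.
incoming = ["male", "Female", "M", "F", "male", "F"]
find_domain_violations(incoming, ALLOWED_GENDER)  # -> {"M": 1, "F": 2}
```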
7:01 Ataccama and the data catalog have
7:03 given us a data owner to take
7:05 responsibility for fixing it and we now
7:07 have a path to fix this problem very
7:11 very quickly so to sum up if you're
7:13 building machine learning models don't
7:16 do it in silos Ataccama is a great tool for
7:18 breaking down data silos and making data
7:21 quality and governance a team sport where
7:23 everybody contributes you can make
7:25 better machine learning models you can
7:27 make them faster and you can make them