Atakama's data quality tools empower data scientists and ML engineers to build better machine learning models faster by transforming data quality from a siloed, manual task into a collaborative, team-wide effort, and by enabling proactive monitoring for data drift.
Hi, I'm Oliver, a PhD in machine learning and currently a Solutions Consultant here at Atakama. Today I'm going to show you how Atakama and its suite of data quality tools can help data scientists and machine learning engineers create better machine learning models, quickly. To do this, let's look at a typical machine learning project in the healthcare industry, built with tools most data scientists are already familiar with, such as Databricks, which we'll use for the sake of this demo.
Here we have a big data set of test results from patients at various hospitals, and what's ultimately being asked of the data scientist is to use this data to identify patients at a heightened risk of a heart attack, to allow for more optimal and streamlined treatment, for example. If we look at this data as a casual observer, we already see a lot of common data issues: there are variations in the sex column, lots of nulls, and big variety in some fields where, for example, the units may have changed. That's a big problem, because as anybody familiar with machine learning knows, one of the main principles of getting good performance out of a model is that garbage in results in garbage out.
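To make that concrete, here's a minimal profiling sketch of the kind of manual inspection being described, as it might be run in a Databricks notebook. The table and column names (sex, cholesterol) are hypothetical stand-ins for this demo data set:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("hospital_raw.patient_test_results")  # hypothetical table name

# Inconsistent categorical values, e.g. "M", "male", "Male" in the sex column
df.groupBy("sex").count().orderBy(F.desc("count")).show()

# Null counts per column
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()

# Suspicious spread in a numeric column, e.g. a units change (mg/dL vs mmol/L)
df.select(F.min("cholesterol"), F.max("cholesterol"), F.stddev("cholesterol")).show()
```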
And what it means for our data scientist, who has actually been asked to come up with this predictive model, is that 50 to 80% of their time on this project isn't going into making a predictive model. It's going into doing what we've just done: being handed a table of information, determining in a silo what the data issues are, and performing one-off, very manual cleanup, all of which inevitably has to be repeated the next time this data gets used in another project.
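For illustration, that one-off cleanup might look like the sketch below, continuing from the profiling sketch above (the had_heart_attack label column is hypothetical). The point is that these hand-written fixes live in one notebook and are lost to the next project:

```python
from pyspark.sql import functions as F

# Hand-written fixes, trapped in this notebook and invisible to the next project
clean = (
    df
    # Normalize the sex column by hand; unmatched values fall through to null
    .withColumn("sex", F.upper(F.trim(F.col("sex"))))
    .withColumn(
        "sex",
        F.when(F.col("sex").isin("M", "MALE"), "MALE")
         .when(F.col("sex").isin("F", "FEMALE"), "FEMALE"),
    )
    # Drop rows missing the label we need (hypothetical column name)
    .dropna(subset=["had_heart_attack"])
)
```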
So how does that play out in Atakama? Let's use our native connections, keeping the data and the processing in Databricks, and look at that exact same data set. You'll immediately notice a difference: we've made data quality a team sport. The data scientist who's been asked to come up with this model doesn't just get a dump of the data. They're shown machine learning predictions suggesting that this data may contain PII, along with the exact issues that are wrong with this data set. And because everyone is contributing and we aren't making data scientists work in silos, we can even see non-obvious issues.

For example, let's look at this seemingly perfectly fine record, which carries the flag "unreliable test machine". What does that mean? We can open up the details, go to the data quality rule that's being applied, and read a description that tells us what's going on: during a certain period of time, there was some incorrectly calibrated research equipment.
So now, instead of having to spend 50 to 80% of the project just understanding and cleaning the data within a silo, all that happens is that the data scientist goes into the Atakama data catalog, gets the data set, and is given additional issue tables that let them quickly filter and impute based on all the issues they're now already aware of. The data engineers, in turn, get a source of issues they can remediate against, all of which is of course doable within Atakama, in one tool.
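As a rough sketch of what "filter and impute based on the issue tables" could look like back on the Databricks side (the table names and the issue-table schema here are assumptions for illustration, not Atakama's actual export format):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical names: the data set plus an exported issue table with one row
# per (record_id, rule_name) violation -- not Atakama's actual schema.
data = spark.table("catalog.patient_test_results")
issues = spark.table("catalog.patient_test_results_issues")

# Filter: drop every record flagged by the "unreliable test machine" rule
bad_ids = issues.filter(F.col("rule_name") == "unreliable_test_machine").select("record_id")
trusted = data.join(bad_ids, on="record_id", how="left_anti")

# Impute: fill a known-problematic numeric column with its median
median_chol = trusted.approxQuantile("cholesterol", [0.5], 0.01)[0]
trusted = trusted.fillna({"cholesterol": median_chol})
```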
And that means that when the data set is used again in another project, because we want to reuse our data to solve multiple problems with it, it's fit for use for everybody.
So let's revisit this concept of garbage in and garbage out, and look at how a machine learning model performs against an evaluation data set of known-good, clean data after it's been through the Atakama process, where Atakama is telling us we have great data quality. This is what we want to be testing our machine learning model against; we don't want to test it against garbage. Let's go back to Databricks and look at the performance of our model now that we've quickly dealt with all this bad data.
What we see is that we get a better performance baseline on a set of reasonable, standard evaluation statistics: the precision is better, the accuracy is better, the F1 score is better. Every conceivable statistic we'd want to use to assess our model is better, just by using clean data. We're starting to build our model from a better baseline, and we get more time to actually work on the model and improve it on top of that.
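As a sketch of the kind of side-by-side comparison being shown here, using standard scikit-learn metrics (the model and evaluation-set variables in the commented usage are placeholders):

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

def report(name, model, X_eval, y_eval):
    """Print the standard metrics mentioned above for one trained model."""
    preds = model.predict(X_eval)
    print(f"{name}: accuracy={accuracy_score(y_eval, preds):.3f}, "
          f"precision={precision_score(y_eval, preds):.3f}, "
          f"F1={f1_score(y_eval, preds):.3f}")

# Hypothetical usage: the same model class trained on raw vs. cleaned data,
# scored against the same known-good evaluation set.
# report("raw data", model_raw, X_eval, y_eval)
# report("cleaned data", model_clean, X_eval, y_eval)
```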
But what's really good is that even after we've finished and developed our model, Atakama can still be incredibly helpful. In a business context, we need to think about how we continuously evaluate that model in a production environment to make sure it's still working, and how we guard it against data drift. So how would that play out without Atakama, following standard DevOps practices used today? Let's say we've finished our model, and the MLOps team came up with a way to evaluate it against known clean data sets over time. What we'll see is: the model is working, it's working, it's working, and then, oh no, it's suddenly not working.
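A minimal sketch of that outcome-only monitoring loop, with a hypothetical threshold and function name; note that the check only fires after the model breaks:

```python
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.85  # hypothetical acceptance bar

def daily_evaluation(model, X_eval, y_eval):
    """Outcome-only check: fires after the model breaks, with no diagnosis."""
    accuracy = accuracy_score(y_eval, model.predict(X_eval))
    if accuracy < ACCURACY_THRESHOLD:
        # We know the output is wrong -- but not what changed in the data.
        raise RuntimeError(f"Model accuracy dropped to {accuracy:.3f}")
```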
But using standard MLOps tools, we're working after the fact, going purely off the outcome. All we know is that the output isn't correct; we don't know what has gone wrong, and we don't know which data is incorrect. So let's go back to Atakama again.

Here we're using Atakama's data observability module, which lets us proactively monitor data as it changes against many different signals: pure data quality, machine-learning-powered anomaly detection on both the data and the metadata, schema changes, business terms changing, structure changing, and even how fresh the data is relative to itself and how often it's used. And we're immediately starting to see a couple of issues: one, the metadata anomaly detection has noticed a change, and two, the schema of the data itself has changed. So let's open that data set, and there we go: we have a bunch of issues in the gender column. Let's open it, and we can see that we're violating the gender synonym rule we talked about earlier. We can even look at its description and see that the value should be male or female.

And just like that, we've immediately found what the issue is: the data has changed, there's been data drift, and the machine learning model is now seeing new data and classifying it incorrectly. In just seconds, we've got a root cause.
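For intuition, here's a generic illustration (not how Atakama's observability module works internally) of detecting this kind of categorical drift programmatically, using a chi-square test on the gender column's value counts:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def categorical_drift(baseline: pd.Series, current: pd.Series, alpha: float = 0.01) -> bool:
    """Chi-square test on value counts: True means the distribution shifted."""
    values = pd.concat([baseline, current], ignore_index=True)
    batch = pd.Series(["baseline"] * len(baseline) + ["current"] * len(current))
    _, p_value, _, _ = chi2_contingency(pd.crosstab(values, batch))
    return p_value < alpha

baseline = pd.Series(["MALE", "FEMALE"] * 500)
current = pd.Series(["M", "F", "MALE", "FEMALE"] * 250)  # new synonym values appear
print(categorical_drift(baseline, current))  # True: the gender column has drifted
```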
Atakama and the data catalog have given us a data owner to take responsibility for fixing it, and we now have a path to fix this problem very, very quickly. So, to sum up: if you're building machine learning models, don't do it in silos. Atakama is a great tool for breaking down data silos and making data quality and governance a team sport where everybody contributes. You can make better machine learning models, you can make them faster, and you can make them…