0:03 hi I'm Oliver, a PhD in machine learning
0:06 and currently a solutions consultant here at
0:08 Ataccama today I'm going to be showing
0:10 you how Ataccama and its suite of data
0:12 quality tools can be used to let data
0:13 scientists and machine learning
0:15 engineers create better machine learning
0:22 models more quickly so to do this let's look at a
0:24 typical machine learning project in the
0:27 healthcare industry with a data scientist who is already
0:28 familiar with common machine learning
0:30 tools such as Databricks
0:31 which we'll use for the sake of this
0:35 demo so here we have a big data set of
0:37 test results from patients of various
0:40 hospitals and what's ultimately being
0:42 asked of the data scientist is for them
0:45 to use this data to determine patients
0:47 at a heightened risk for a heart attack to
0:50 allow for more optimal and streamlined
0:52 treatment as an example and so if we
0:55 look at this data as a casual observer
0:58 we already see a lot of common data
1:01 issues where we have variations in
1:04 the sex column we have lots of nulls we
1:06 have big variety in some information
1:09 where perhaps the units changed for
1:13 example and that's a big problem because
1:14 as anybody familiar with machine
1:16 learning knows one of the main
1:18 principles of getting good performance
1:21 out of a model is that garbage in
1:23 results in garbage
1:25 out and what it means for our data
1:27 scientist who has actually been asked to
1:29 come up with this predictive model is
1:31 that 50 to 80% of their time on this
1:33 project isn't actually going toward making a
1:35 predictive model it's going to be doing
1:38 what we've just done now where we're
1:41 just being given a table of information
1:43 and in a silo determining ourselves what
1:45 these data issues are and performing
1:49 one-off very manual cleanup all of which has
1:51 to be repeated inevitably the next time
1:53 that this data gets used in another
1:56 project so how does that play out in
1:58 Ataccama well let's use our native
2:00 connections to the data and the processing
2:03 in Databricks and let's look at that
2:06 exact same data set but you'll
2:08 immediately notice a difference and
2:10 that's that we've made data quality a
2:13 team sport so the data scientist who's
2:15 been asked to come up with this model
2:17 doesn't just get a dump of that data
2:19 they'll be told oh here are some machine
2:21 learning predictions to say well maybe
2:24 this data contains PII and oh these are
2:26 the exact issues that are wrong with
2:30 this data set and because it's a sport
2:32 where everyone is contributing and we
2:34 aren't making data scientists work in
2:36 silos we can even see non-obvious issues
2:39 as well so let's look at this seemingly
2:42 perfectly fine record where we see oh
2:44 here's an unreliable test machine what
2:46 does that mean well we can open up the
2:49 details we can go to the data quality
2:50 rule that's being applied we can look at
2:52 the description that tells us what's going
2:54 on and see that oh between this little
2:57 period of time there was some
2:58 incorrectly calibrated research
3:00 equipment
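A rule like that might look something like the following sketch in pandas. The column names (`machine_id`, `test_date`) and the calibration window are invented for illustration; Ataccama expresses such rules in its own rule definitions, so this only mirrors the logic being described:

```python
import pandas as pd

# Hypothetical "unreliable test machine" rule: flag any result recorded
# by a given machine during the window in which it was known to be
# incorrectly calibrated. Machine ID and dates are assumptions.
CALIBRATION_ISSUE = {
    "machine_id": "LAB-07",
    "start": pd.Timestamp("2023-03-01"),
    "end": pd.Timestamp("2023-03-14"),
}

def flag_unreliable_machine(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask marking rows affected by the calibration issue."""
    in_window = df["test_date"].between(
        CALIBRATION_ISSUE["start"], CALIBRATION_ISSUE["end"]
    )
    return (df["machine_id"] == CALIBRATION_ISSUE["machine_id"]) & in_window

df = pd.DataFrame({
    "machine_id": ["LAB-07", "LAB-07", "LAB-02"],
    "test_date": pd.to_datetime(["2023-03-05", "2023-04-01", "2023-03-05"]),
    "cholesterol": [210.0, 195.0, 180.0],
})
df["unreliable_test_machine"] = flag_unreliable_machine(df)
```

Only the first row is flagged: it is the one result from the affected machine inside the affected window.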
3:03 so now instead of having to spend 50 to
3:05 80% of this project just understanding
3:08 and cleaning the data within a silo all
3:10 that happens is the data scientists
3:13 go into the Ataccama data catalog they
3:16 get the data set and they're given the
3:18 additional issue tables that let them
3:20 quickly filter and impute based on all
3:23 the issues they're now already aware of
3:25 and the data engineers get a source of
3:27 issues that they can remediate against
3:28 all of which is of course doable within
3:30 Ataccama ONE
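As a rough sketch of what that workflow looks like on the data science side, assume a hypothetical issue table with `record_id` and `issue_type` columns (real Ataccama exports will differ in shape and naming): the data scientist can filter out unrepairable records and impute the rest instead of hunting for problems manually.

```python
import pandas as pd

# Hypothetical data set and accompanying issue table.
data = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "cholesterol": [210.0, None, 612.0, 190.0],
})
issues = pd.DataFrame({
    "record_id": [2, 3],
    "issue_type": ["null_value", "unreliable_test_machine"],
})

# Filter: drop records flagged as coming from the unreliable machine.
bad_ids = issues.loc[
    issues["issue_type"] == "unreliable_test_machine", "record_id"
]
filtered = data[~data["record_id"].isin(bad_ids)].copy()

# Impute: fill the remaining nulls with the median of trustworthy values.
filtered["cholesterol"] = filtered["cholesterol"].fillna(
    filtered["cholesterol"].median()
)
```

The point is that the filtering and imputation are driven by issues someone else already catalogued, not rediscovered from scratch in a silo.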
3:32 and that means that when the data set is
3:34 used again in another project because we
3:37 want to reuse our data to solve multiple
3:40 problems with it it's fit for
3:44 everybody so let's revisit this concept
3:46 again of garbage in and garbage out and
3:48 look at how a machine learning model
3:50 performs against an evaluation data set
3:55 of known good clean data after it's been
3:56 through the Ataccama process where
3:58 Ataccama is telling us we have great
4:00 data quality this is what we want to be
4:02 testing our machine learning model
4:05 against we don't want to test it against
4:08 garbage so let's go back to Databricks
4:10 and look at the performance of our model
4:13 now that we've quickly dealt with all this
4:14 bad
4:17 data what we see is that we get a better
4:20 performance baseline on a bunch of
4:22 reasonable standard statistics
4:25 where we're saying the precision is
4:27 better the accuracy is better the F1
4:29 is better every conceivable statistic
4:31 that we want to use to build our model
4:35 just by using clean data is better we're
4:37 starting to build our model with a
4:40 better baseline and we get more time to
4:42 actually work on the model to make it
4:46 better on top of that but what's really
4:49 good is that even after we've finished
4:51 developing our model Ataccama can still
4:54 be incredibly helpful in the business
4:56 context we need to think about how we
4:58 continuously evaluate that model in a
5:00 production environment to make sure it's
5:02 still working and guard it against data
5:05 drift so how will that play out without
5:08 Ataccama following standard DevOps practices
5:09 used
5:13 today so let's say we finished our model
5:15 and the MLOps team came up with a way to
5:20 evaluate it against known clean data
5:23 sets over time and what we'll see is oh
5:25 look the model is working it's working
5:27 it's working it's working it's working
5:30 oh no it's suddenly not working
5:33 but using standard MLOps tools we're
5:36 going after the fact we're just going
5:38 off the outcome all we know
5:39 is that the output isn't correct we don't
5:41 know what has gone wrong and we don't
5:45 know what is incorrect so let's go back
5:47 to Ataccama
5:50 again and here we're using Ataccama's
5:52 data observability module that allows us
5:55 to monitor data as it changes
5:57 proactively against many many things
6:00 from just the pure data quality to
6:02 machine learning powered anomaly detection
6:04 on both the data and the metadata to
6:07 schema changes to the business terms
6:09 changing to the structure changing and
6:11 even how fresh the data is relative to
6:13 itself and how often its usage is
6:15 changing and we're immediately starting
6:19 to see a couple of issues so one the
6:21 metadata anomaly detection has noticed a
6:24 change and two the schema of the data
6:26 itself has changed so let's open that
6:30 data set and oh there we go we have a
6:33 bunch of issues in the gender column
6:35 let's open it and we can see oh we're
6:36 violating this gender synonym rule that
6:39 we talked about earlier we could even
6:41 look at it see the description and see
6:43 that it should be male or
6:45 female and just like that we've
6:48 immediately found what the issue is the
6:51 data has changed there's been data drift
6:53 and the machine learning model is now
6:56 seeing new data and classifying it
6:58 incorrectly and in just seconds we've
6:59 got a root cause
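The gender synonym check that surfaced this root cause amounts to a domain-validity rule: compare a categorical column against its allowed value set and report the offenders. A minimal plain-Python sketch, with the incoming values invented for illustration (Ataccama's actual rule definitions look different):

```python
# Allowed domain for the gender column, per the rule description above.
ALLOWED_GENDER = {"male", "female"}

def find_domain_violations(values: list[str], allowed: set[str]) -> dict[str, int]:
    """Count values outside the allowed domain (case-insensitive)."""
    counts: dict[str, int] = {}
    for v in values:
        if v.strip().lower() not in allowed:
            counts[v] = counts.get(v, 0) + 1
    return counts

# Upstream started sending coded values, so the model sees unfamiliar
# categories -- exactly the data drift described above.
incoming = ["male", "Female", "M", "F", "male", "F"]
find_domain_violations(incoming, ALLOWED_GENDER)  # -> {"M": 1, "F": 2}
```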
7:01 Ataccama and the data catalog have
7:03 given us a data owner to take
7:05 responsibility for fixing it and we now
7:07 have a path to fix this problem very
7:11 very quickly so to sum up if you're
7:13 building machine learning models don't
7:16 do it in silos Ataccama is a great tool for
7:18 breaking down data silos and making data
7:21 quality and governance a team sport where
7:23 everybody contributes you can make
7:25 better machine learning models you can
7:27 make them faster and you can make them