YouTube transcript:
Make a Difference With SAS Data Maker

Summary

Core Theme

This presentation introduces SAS Data Maker, a tool designed to generate synthetic data, highlighting its potential to overcome data access, testing, and AI model development challenges by providing a flexible, privacy-preserving, and efficient alternative to real-world data.

Hello, everyone.

Welcome to the Spotlight Stage.

Thank you guys for taking time for this next presentation.

I'm Mark Demers, the Spotlight Stage host.

I'm glad you're here with us.

Show of hands, who's using synthetic data

or wants to use synthetic data?

Well, then this presentation is for you.

So I'm not going to waste time.

I'm turning it over to Brett Wujek

and to Sundaresh from SAS.

They're going to talk to you about SAS Data Maker.

I'm going to come around with this hand out

and I'm going to scan your badges

and try to not be annoying.

All right, thank you, Mark.

Yeah, and welcome, everyone, to our overview

of how synthetic data, especially generated

in a really easy and convenient manner with SAS Data Maker,

can make a difference for you and your organization

in your AI development efforts.

I was really excited to see those hands go up.

At least, you know, that was a good handful of people there.

Some of the SAS people mixed in, so maybe that didn't count.

But, you know, it's a whole new world with AI these days.

We're all living it.

It's evolving really fast.

And let me pause for a second and just

say when I use the term AI, I am yielding

to kind of the mainstream use of the AI term.

I'm including all sorts of analytics

under the full umbrella there.

And when we talk about AI, we know it all starts with data.

You know, having good and sufficient data.

And so what's the problem?

We live in a data-rich world, right?

We have an abundance of data.

We're flooded with data.

The fact is there's still a lot of challenges

with accessing and using data sufficiently.

And I'll get to those in a second.

When we talk with our customers about the concept

of synthetic data and the potential

that it has to bring value to their efforts,

they're really intrigued.

And hopefully, when you were introduced to this this morning,

possibly for the first time in our presentations

on main stage there, it started you thinking about, all right,

should I be taking advantage of this?

How could I use this?

Because there really is a lot of value to it.

And when we talk about synthetic data with our customers,

we hear kind of three main positions on it.

The first is really about the potential

it has for just opening up access to data

and sharing data across the enterprise.

Obviously, there's a lot of privacy issues and protections

on data and regulations to comply

with that kind of keep a lot of that data locked away

from people that really could make use of it in their efforts.

And Harry talked a lot about privacy this morning

and all the issues around that.

And that's very important.

So that's one aspect of it.

Just having some representation of real data

in a synthetic form that is allowed

to be used in all of your AI efforts is very valuable.

The second position we hear a lot

is about the potential to use synthetic data to test

applications and solutions that these organizations have

developed, ensure that they are robust,

be able to create new scenarios and potentially rare events

that they just don't have real data for,

to ensure that their products, their processes,

the decisions made from all of their AI efforts

are robust and behave as expected,

and do so in a way without harming

real people in the process, and do it

in a cost-effective manner.

So some of that downstream use of synthetic data

is really intriguing to our customers.

And then the third area we hear a lot

is really in just the AI development phase itself,

that middle phase of the AI lifecycle,

to kind of unlock some opportunities

to explore different approaches to their products

and their processes and the models

that they develop to innovate, to be

more productive in that phase.

And so a lot of compelling reasons

to kind of turn to synthetic data,

which is why we're really excited to be providing

this new offering, SAS Data Maker, for this.

Now, Sundaresh, when you talk to customers about synthetic data,

you're in a customer advisory role.

You're with customers all the time.

When you talk to them, what's kind of their first reaction

to all of this?

Brett, while customers relate very easily

to concepts of privacy protection and testing

scenarios, they are unsure about how synthetic data

helps model development.

For example, the other day, we talked to a customer.

After listening to us, she still made a comment along the lines

of, but this is still made-up data.

So that notion of fakeness leads to customers

being a little unsure about how to use synthetic data

in model development directly.

But really, enlightened organizations

view things a little differently.

Let's consider a scenario, an example, to see just how.

Imagine that I'm a data scientist at a bank,

helping it make better decisions.

Some of these decisions involve helping

determine whom to approve or decline for a loan.

A loan decisioning model poses challenges.

Applicants come from a variety of risk profiles.

And further, macroeconomic changes,

and what we term as portfolio shifts, they affect outcomes.

So what this means is that if I continue

to use historical, original data alone,

very soon my data ceases to be relevant for today's scenario.

It ceases to be relevant for future scenarios and dynamics.

And it's not as accurate as before.

In short, I'll have to wait until I get enough usable data

to obtain a relevant model.

Now, note also that model data might contain bias.

This bias could be due to multiple reasons,

historical reasons, systemic exclusions, participation

rates.

We won't get into the reasons why

bias occurs in original data.

But it's important to be aware of the same

and account for the same, because bias in data

affects machine learning outcomes,

where we measure bias through performance bias and prediction

bias.
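[Editor's sketch] The talk names performance bias and prediction bias without defining them. One common reading, assumed here, is that performance bias is the spread in model accuracy across groups and prediction bias is the spread in average predicted probability across groups. The function names, the 0.5 threshold, and the toy data below are illustrative assumptions, not SAS Viya output.

```python
from collections import defaultdict


def group_rates(groups, y_true, y_prob, threshold=0.5):
    """Return per-group accuracy and mean predicted probability."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "prob_sum": 0.0})
    for g, y, p in zip(groups, y_true, y_prob):
        s = stats[g]
        s["n"] += 1
        # A prediction is correct when the thresholded score matches the label.
        s["correct"] += int((p >= threshold) == bool(y))
        s["prob_sum"] += p
    return {
        g: {"accuracy": s["correct"] / s["n"], "mean_pred": s["prob_sum"] / s["n"]}
        for g, s in stats.items()
    }


def bias_gaps(groups, y_true, y_prob, threshold=0.5):
    """Largest between-group gap in accuracy and in mean prediction."""
    rates = group_rates(groups, y_true, y_prob, threshold)
    accs = [r["accuracy"] for r in rates.values()]
    preds = [r["mean_pred"] for r in rates.values()]
    return {
        "performance_gap": max(accs) - min(accs),
        "prediction_gap": max(preds) - min(preds),
    }


# Toy, made-up data: group label, true default flag, predicted default probability.
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.8, 0.1, 0.3, 0.2]

gaps = bias_gaps(groups, y_true, y_prob)
print(gaps)
```

A gap near zero on both measures suggests the model treats the groups similarly on these two axes. It does not rule out bias carried by other columns.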

The net result is I make unsure predictions.

And unsure predictions lead to suboptimal loss prevention

outcomes.

That's not a very nice situation for my bank to be in, right?

No, absolutely not.

And to be honest, this is precisely the type of situation

that synthetic data can really step in and help.

And you talked about, in this case, some insufficiency,

some deficiency in your model for some reason.

And one of the first places to look

is the data that's feeding your modeling process.

And there's a few different approaches to take on this.

One is just to kind of look at it purely from a data volume

perspective, more data.

That can potentially, it won't always help.

But it can potentially fill in some of the gaps

and discover some of the nonlinearities

in the behaviors of the parameters

and the relationships across the parameters.

So sometimes just having more data can help.

The other two positions are really

around a more focused approach on the data

that you're providing into the process,

being able to generate data on a conditional basis

and focus maybe on a certain segment that

is underrepresented and enrich that part of the data more.

Or from an outcome perspective, if there are rare events,

it's very likely that it's difficult for the model

to capture some of the behavior and relationships of inputs

to the model to the output in the case of rare events.

So having more rare events and balancing the outcomes

in the sense of the data that you're feeding the AI

is very important.

So we're going to start here with the simple case of, all

right, let's just try generating more data

and feed that into the process.

And so you saw a good demonstration from Harry

on main stage this morning.

This will be kind of a quick run through as well.

We're starting with some loan data.

Again, this was a simple single table.

Harry showed that we do support multi-table situations, which

are very common, and uncovering the referential integrity

across that and maintaining, supporting sequential data,

time series-based data, getting a good view of the columns

and all of the profile of these columns

and the metadata around that, understanding

the semantic types so we know how

to generate data for those columns

in a very representative manner.

So just kind of doing a quick check here,

and you can adjust those as needed.

And then as far as the training goes,

a handful of different algorithms.

In this case, we've got a pre-production version

with a few algorithms in it.

We're continuing to expand that out,

setting different options here, including

the differential privacy level that you

want baked into the training and generation process,

and then setting which metrics you

want to be able to use to evaluate that.

So I've trained a number of different models here.

So it's nice that I can kind of work through an experiment

and historically try different settings there.

And once I kind of hone in on one that I feel is good,

I can look at the different metrics for those

and assess these in terms of things like distributions.

Does the distribution of the synthetic data

look like it follows the distribution of the real data?

We're looking for a lot of purple here,

which is the overlap between those,

and just kind of doing some sanity check across those

from a visual perspective.
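[Editor's sketch] The "purple" overlap described here can also be put into a single number. A minimal sketch, assuming a simple histogram-intersection score for one numeric column: 1.0 means the binned distributions coincide, 0.0 means they share no bins. This is not necessarily the metric Data Maker reports; the function names, bin count, and sample values are hypothetical.

```python
def histogram(values, lo, hi, bins):
    """Return the proportion of values falling in each of `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def overlap(real, synthetic, bins=10):
    """Histogram intersection of two samples over their shared range."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    if hi == lo:
        return 1.0  # every value is identical in both samples
    p = histogram(real, lo, hi, bins)
    q = histogram(synthetic, lo, hi, bins)
    return sum(min(a, b) for a, b in zip(p, q))


# Made-up loan amounts (in thousands) for illustration only.
real_amounts = [5, 10, 10, 15, 20, 20, 25, 40]
synthetic_amounts = [6, 9, 12, 14, 19, 22, 27, 38]

score = overlap(real_amounts, synthetic_amounts, bins=5)
print(f"overlap score: {score:.2f}")
```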

And then, of course, relationships among the columns.

We want to make sure we're preserving the correlation,

or in this case, the mutual information

across those columns to ensure that's properly represented.
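[Editor's sketch] Mutual information between two categorical columns can be computed directly from counts. The sketch below compares the value on a real sample and a synthetic sample; a small difference suggests the relationship was preserved. The column values are invented for illustration, and how Data Maker computes its own metric is not stated in the talk.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written in terms of counts
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi


# Made-up columns: employment status and default flag.
real_status = ["emp", "emp", "emp", "unemp", "unemp", "unemp"]
real_default = [0, 0, 1, 1, 1, 0]
synth_status = ["emp", "emp", "emp", "unemp", "unemp", "unemp"]
synth_default = [0, 0, 0, 1, 1, 0]

mi_real = mutual_information(real_status, real_default)
mi_synth = mutual_information(synth_status, synth_default)
print(f"real: {mi_real:.3f} nats, synthetic: {mi_synth:.3f} nats")
```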

And then when I'm satisfied with my generator,

I can now use it over and over again

to generate more data simply by specifying the destination,

where I want the generated data to go, and the output format.

And then I can specify the magnitude or volume of data

that I want generated.

So I've gone ahead and done that for 2x, 3x, 4x, 5x, including

10x.

And I can kind of, again, do a sanity check

on the generated data, make sure that it

looks OK from a sample perspective

in the distributions, pass that back to my data scientist

to now start using in the development process

and explore and see the impact of it.

Great.

So here's what I did with the data you gave me.

For me, a holdout data set, which

I extracted from original data in the beginning,

that's the gold standard against which I evaluate all results.

I use SAS Viya to train a variety of models

against the many data sets that you gave me.

SAS Viya really helps me here due

to its distributed processing, ability

to provide multiple analytical methods,

and guided, templated modeling approach.

I take care not to give any preferential treatment

to one data set over the other.

Therefore, irrespective of whether it's

synthetic data or original data, the same modeling experiments,

the same model parameters, I let the data dictate the results.

And here are the results.

Let's first consider results based

on their own test data sets.

What you would notice is that synthetic data of many volumes

consistently gives me better separation power

compared to original data.

For example, for synthetic data sets,

I get a KS statistic, that's a measure of discriminatory power,

in the region of 0.82 and above, compared to 0.72

for the original data set.
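[Editor's sketch] In credit scoring, the KS statistic is usually the largest gap between the cumulative score distributions of defaulters and non-defaulters: 0 means the model cannot separate them, 1 means perfect separation. A minimal sketch with made-up scores follows; the 0.82 and 0.72 figures quoted in the talk come from the speakers' own experiment, not from this code.

```python
def ks_statistic(scores, labels):
    """Max gap between the cumulative score distributions of class 1 and class 0."""
    pos = sum(1 for label in labels if label == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")

    pairs = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Consume every row tied at this score before comparing the two CDFs.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        best = max(best, abs(cum_pos / pos - cum_neg / neg))
        i = j
    return best


# Made-up model scores and default flags (1 = defaulted).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

ks = ks_statistic(scores, labels)
print(f"KS = {ks:.2f}")
```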

This is true even for synthetic data of the exact same volume

as the original data set.

This is a good result. But note, Brett,

this is not an axiomatic finding.

It's an empirical finding.

The fact that we got good results in this experiment

does not ensure that we will always

continue to get the same type of results.

It requires a whole lot more experimentation.

But the good news is, one, I get more robust results

because I can test my models on a larger volume of data.

Number two, I get smoother distributions

because that's how synthetic data algorithms work.

This helps me in downstream activities

like calibrating data sets or creating scorecards.

And three, I am facilitated in getting approval for my models

because no one can doubt that low data volumes affect

the model results.

Synthetic data offers me a way to establish the consistency

and robustness of results.

However, let's not get carried away.

Remember, I told you about the holdout data set.

That's the gold standard.

When I look at results on the holdout data set,

I find that the consolidated model, which

is an append of synthetic and original data sets,

that proves to be the best performing model for me.

Why?

Because synthetic data papers over the rough edges

that you find in original data and therefore

helps me get better modeling results

and helps me make better decisions.

OK, this is how more data helps me.

But what can I do about my bias problem?

Perhaps you give me a balanced data set or something?

Right, right.

So again, you'll talk about data volume,

looking for improvement in accuracy.

From a balancing perspective, there's

a few different approaches that we can use Data Maker for.

We can initially subset our real data down to the data

that we're really focused on and want to generate more of,

feed that into Data Maker to train a generative model that

can then generate data that is specifically

for that segment, that focus segment there.

That's one approach.

The other approach is to just feed all of the data in

and let it generate a model across the whole spectrum

of the space, and then subset that by filtering afterwards.

Now, in some cases, that's really your only opportunity,

because you may have very minimal amount of data

of the segment that you're wanting to focus on.

So generating it to capture the full relationships

across the full space and then subsetting afterwards

is often a very good approach.

And we're working on baking some of that conditional generation

capability into Data Maker very soon.

So I've done that and passed a more focused set

back to Sundaresh to see how that impacts the results.

Great.

So you're giving me two data sets now.

The gender-balanced data set gives me

an equal proportion of males to females.

The gender- and default-balanced data set

not only gives me equal male to female ratio,

but also defaulters to non-defaulters

in the share of 50-50.
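[Editor's sketch] One way to arrive at a data set balanced on both gender and default is the generate-then-filter route: over-generate a pool, then draw the same number of rows from every gender/default combination. In the sketch below the pool is a hard-coded stand-in for generator output; the function name, column names, and counts are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_cells(rows, keys, per_cell, seed=0):
    """Sample `per_cell` rows from every observed combination of `keys`."""
    cells = defaultdict(list)
    for row in rows:
        cells[tuple(row[k] for k in keys)].append(row)

    rng = random.Random(seed)
    balanced = []
    for cell, members in cells.items():
        if len(members) < per_cell:
            raise ValueError(f"cell {cell} has only {len(members)} rows")
        balanced.extend(rng.sample(members, per_cell))
    return balanced


# Stand-in for a large synthetic pool (assumed, not produced by Data Maker).
pool = [
    {"gender": g, "default": d}
    for g, d, n in [("M", 0, 600), ("M", 1, 60), ("F", 0, 300), ("F", 1, 40)]
    for _ in range(n)
]

balanced = balance_by_cells(pool, ["gender", "default"], per_cell=40)
cell_counts = Counter((r["gender"], r["default"]) for r in balanced)
print(cell_counts)
```

Equal counts in all four cells give both an equal male-to-female ratio and a 50-50 split of defaulters to non-defaulters. Because the pool can be regenerated at any size, `per_cell` is not capped by the rarest cell in the original data.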

This is good, because if you had given me

undersampled data sets, if you'd used undersampling,

a conventional technique, I would have thrown these data

sets out, because the amount of filters

you have to apply on data to get data of such proportions

takes a long time and leads to very little data coming out

of the process.
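[Editor's sketch] The data loss from undersampling is easy to quantify: to make every gender/default combination equally frequent without generating anything, each combination is cut down to the size of the rarest one. The cell counts below are invented to illustrate the arithmetic and do not come from the speakers' loan data.

```python
from collections import Counter


def undersampled_size(rows, keys):
    """Rows left after cutting every combination of `keys` to the rarest one."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return len(counts) * min(counts.values())


# Made-up original data: 1,000 rows with a rare (female, default) cell.
original = [
    {"gender": g, "default": d}
    for g, d, n in [("M", 0, 700), ("M", 1, 70), ("F", 0, 210), ("F", 1, 20)]
    for _ in range(n)
]

kept = undersampled_size(original, ["gender", "default"])
fraction = kept / len(original)
print(f"kept {kept} of {len(original)} rows ({fraction:.0%})")
```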

But these two data sets help me, because the gender-balanced data

set helps me assess for realistic scenarios.

That's how my data is in real life.

The gender- and default-balanced data set

helps me get better separation power.

So I run the same experiments as before.

And what I found was that we did not

experience a significant reduction in bias.

But to be fair, I didn't expect that.

Bias is not indicated in just one variable alone.

It's just signaled by one variable like gender.

But what happens is that we might have addressed bias

to a certain extent by just balancing the data set,

taking care of the bias that's due to poor participation

rates in the past.

But there is systemic bias, inherent bias,

which is part of other data columns in the data set

as well.

So just balancing the data is not enough.

But it's a good first step, because it

helps me quantify the bias and helps me carry out

activities like bias mitigation and algorithmic treatment

down the line to take care of inherent bias in the data.

You can pass by the trustworthy AI booth

to learn more about bias mitigation.

But what I want to convey is that balancing the data

is just the first step before bias mitigation.

Yeah, absolutely.

I think that's a fair point.

And I think it's important for you to understand here.

We're not here to say that synthetic data is

the silver bullet, the magic wand, the unicorns

and rainbows that Brian threw up on the screen this morning.

But it can help.

And it can really allow for this sort of exploration.

You saw what it did for my data scientist

here to try a lot of different models

and see the impact of enriching data,

whether it be by adding more to fill in the space

or to balance the data out more effectively.

And so hopefully, this just starts

you thinking about trying to use synthetic data as opposed

to trying to capture more real data in your process

and the value that it can bring.

We're excited to kick off a private preview of Data Maker

coming up.

And so we encourage you to stop by the booth

if you're interested in that or if you just

want to talk through more about synthetic data and Data Maker

and the value it can bring to you and your organization.

Thanks.

Thanks, Sundaresh and Brett.

That was awesome.

Speaking of data, so use your apps.

Give us feedback on the presentation.

And we're going to take a very short break.

Please don't go anywhere if you don't have to,

because the next presentation starts at 11:45.
