YouTube Transcript: Make a Difference With SAS Data Maker
Summary
Core Theme
This presentation introduces SAS Data Maker, a tool designed to generate synthetic data, highlighting its potential to overcome data access, testing, and AI model development challenges by providing a flexible, privacy-preserving, and efficient alternative to real-world data.
Transcript
Hello, everyone.
Welcome to the Spotlight Stage.
Thank you guys for taking time for this next presentation.
I'm Mark Demers, the Spotlight Stage host.
I'm glad you're here with us.
Show of hands, who's using synthetic data
or wants to use synthetic data?
Well, then this presentation is for you.
So I'm not going to waste time.
I'm turning it over to Brett Wujek
and to Sundaresh from SAS.
They're going to talk to you about SAS Data Maker.
I'm going to come around with this handout,
and I'm going to scan your badges
and try not to be annoying.
All right, thank you, Mark.
Yeah, and welcome, everyone, to our overview
of how synthetic data, especially generated
in a really easy and convenient manner with SAS Data Maker,
can make a difference for you and your organization
in your AI development efforts.
I was really excited to see those hands go up.
At least, you know, that was a good handful of people there.
Some of the SAS people mixed in, so maybe that didn't count.
But, you know, it's a whole new world with AI these days.
We're all living it.
It's evolving really fast.
And let me pause for a second and just
say when I use the term AI, I am yielding
to kind of the mainstream use of the AI term.
I'm including all sorts of analytics
under the full umbrella there.
And when we talk about AI, we know it all starts with data.
You know, having good and sufficient data.
And so what's the problem?
We live in a data-rich world, right?
We have an abundance of data.
We're flooded with data.
The fact is there's still a lot of challenges
with accessing and using data sufficiently.
And I'll get to those in a second.
When we talk with our customers about the concept
of synthetic data and the potential
that it has to bring value to their efforts,
they're really intrigued.
And hopefully, when you were introduced to this this morning,
possibly for the first time in our presentations
on main stage there, it started you thinking about, all right,
should I be taking advantage of this?
How could I use this?
Because there really is a lot of value to it.
And when we talk about synthetic data with our customers,
we hear kind of three main positions on it.
The first is really about the potential
it has for just opening up access to data
and sharing data across the enterprise.
Obviously, there's a lot of privacy issues and protections
on data and regulations to comply
with that kind of keep a lot of that data locked away
from people that really could make use of it in their efforts.
And Harry talked a lot about privacy this morning
and all the issues around that.
And that's very important.
So that's one aspect of it.
Just having some representation of real data
in a synthetic form that is allowed
to be used in all of your AI efforts is very valuable.
The second position we hear a lot
is about the potential to use synthetic data to test
applications and solutions that these organizations have
developed, ensure that they are robust,
be able to create new scenarios and potentially rare events
that they just don't have real data for,
to ensure that their products, their processes,
the decisions made from all of their AI efforts
are robust and behave as expected,
and do so in a way without harming
real people in the process, and do it
in a cost-effective manner.
So some of that downstream use of synthetic data
is really intriguing to our customers.
And then the third area we hear a lot
is really in just the AI development phase itself,
that middle phase of the AI lifecycle,
to kind of unlock some opportunities
to explore different approaches to their products
and their processes and the models
that they develop to innovate, to be
more productive in that phase.
And so a lot of compelling reasons
to kind of turn to synthetic data,
which is why we're really excited to be providing
this new offering, SAS Data Maker, for this.
Now, Sundaresh, when you talk to customers about synthetic data,
you're in a customer advisory role.
You're with customers all the time.
When you talk to them, what's kind of their first reaction
to all of this?
Well, while customers relate very easily
to concepts of privacy protection and testing
scenarios, they are unsure about how synthetic data
helps model development.
For example, the other day, we talked to a customer.
After listening to us, she still made a comment along the lines
of, "But this is still made-up data."
So that notion of fakeness leads to customers
being a little unsure about how to use synthetic data
in model development directly.
But really, enlightened organizations
view things a little differently.
Let's consider a scenario, an example, to see just how.
Imagine that I'm a data scientist at a bank,
helping it make better decisions.
Some of these decisions involve helping
determine whom to approve or decline for a loan.
A loan decisioning model poses challenges.
Applicants come from a variety of risk profiles.
And further, macroeconomic changes,
and what we term portfolio shifts, affect outcomes.
So what this means is that if I continue
to use historical, original data alone,
very soon my data ceases to be relevant for today's scenario.
It ceases to be relevant for future scenarios and dynamics.
And it's not as accurate as before.
In short, I'll have to wait until I get enough usable data
to obtain a relevant model.
Now, note also that model data might contain bias,
which we measure through performance bias and prediction
bias.
The net result is I make unsure predictions.
And unsure predictions lead to suboptimal loss prevention
outcomes.
That's not a very nice situation for my bank to be in, right?
No, absolutely not.
And to be honest, this is precisely the type of situation
that synthetic data can really step in and help.
And you talked about, in this case, some insufficiency,
some deficiency in your model for some reason.
And one of the first places to look
is the data that's feeding your modeling process.
And there's a few different approaches to take on this.
One is just to kind of look at it purely from a data volume
perspective: more data.
That can potentially help, though it won't always.
But it can potentially fill in some of the gaps
and discover some of the nonlinearities
in the behaviors of the parameters
and the relationships across the parameters.
So sometimes just having more data can help.
The other two positions are really
around a more focused approach on the data
that you're providing into the process,
being able to generate data on a conditional basis
and focus maybe on a certain segment that
is underrepresented and enrich that part of the data more.
Or from an outcome perspective, if there are rare events,
it's very likely that it's difficult for the model
to capture some of the behavior and relationships of inputs
to the model to the output in the case of rare events.
So having more rare events and balancing the outcomes
in the sense of the data that you're feeding the AI
is very important.
So we're going to start here with the simple case of, all
right, let's just try generating more data
and feed that into the process.
And so you saw a good demonstration from Harry
on main stage this morning.
This will be kind of a quick run through as well.
We're starting with some loan data.
Again, this was a simple single table.
Harry showed that we do support multi-table situations, which
are very common: uncovering the referential integrity
across those tables and maintaining it, supporting sequential,
time series-based data, getting a good view of the columns,
their profiles, and the metadata around them,
and understanding the semantic types so we know how
to generate data for those columns
in a very representative manner.
So just kind of doing a quick check here,
and you can adjust those as needed.
And then as far as the training goes,
a handful of different algorithms.
In this case, we've got a pre-production version
with a few algorithms in it.
We're continuing to expand that out,
setting different options here, including
the differential privacy level that you
want baked into the training and generation process,
and then setting which metrics you
want to be able to use to evaluate that.
So I've trained a number of different models here.
So it's nice that I can kind of work through an experiment
and historically try different settings there.
And once I kind of hone in on one that I feel is good,
I can look at the different metrics for those
and assess these in terms of things like distributions.
Does the distribution of the synthetic data
look like it follows the distribution of the real data?
We're looking for a lot of purple here,
which is the overlap between those,
and just kind of doing some sanity check across those
from a visual perspective.
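To make that visual check concrete, here is a minimal Python sketch of an overlap plot of the kind described, done outside the Data Maker interface; the DataFrames `real` and `synthetic` and the column name `loan_amount` are assumptions for illustration.

```python
# A minimal sketch of the distribution overlap check, assuming two pandas
# DataFrames, `real` and `synthetic`, that share a numeric column (the
# column name "loan_amount" is illustrative).
import matplotlib.pyplot as plt
import pandas as pd


def plot_overlap(real: pd.DataFrame, synthetic: pd.DataFrame, column: str) -> None:
    """Overlay real (blue) and synthetic (red) histograms; overlap shows purple."""
    fig, ax = plt.subplots()
    ax.hist(real[column], bins=50, density=True, alpha=0.5, color="blue", label="real")
    ax.hist(synthetic[column], bins=50, density=True, alpha=0.5, color="red", label="synthetic")
    ax.set_xlabel(column)
    ax.set_ylabel("density")
    ax.legend()
    plt.show()


# plot_overlap(real, synthetic, "loan_amount")
```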
And then, of course, relationships among the columns.
We want to make sure we're preserving the correlation,
or in this case, the mutual information
across those columns to ensure that's properly represented.
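And for the relationship check, a rough sketch of a pairwise mutual-information comparison, assuming complete (non-missing) data and the same `real` and `synthetic` DataFrames as above:

```python
# A rough sketch of the pairwise mutual-information comparison, assuming
# no missing values; numeric columns are binned before scoring.
from itertools import combinations

import pandas as pd
from sklearn.metrics import mutual_info_score


def mi_matrix(df: pd.DataFrame, bins: int = 10) -> dict:
    """Mutual information for every column pair, binning numeric columns first."""
    binned = df.apply(
        lambda s: pd.cut(s, bins=bins, labels=False)
        if pd.api.types.is_numeric_dtype(s)
        else s
    )
    return {
        (a, b): mutual_info_score(binned[a], binned[b])
        for a, b in combinations(binned.columns, 2)
    }


# Similar values per pair suggest the relationships were preserved:
# real_mi, synth_mi = mi_matrix(real), mi_matrix(synthetic)
# for pair in real_mi:
#     print(pair, round(real_mi[pair], 3), round(synth_mi[pair], 3))
```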
And then when I'm satisfied with my generator,
I can now use it over and over again
to generate more data simply by specifying the destination,
where I want the generated data to go, and the output format.
And then I can specify the magnitude or volume of data
that I want generated.
So I've gone ahead and done that for 2x, 3x, 4x, 5x,
and even 10x.
And I can kind of, again, do a sanity check
on the generated data, make sure that it
looks OK from a sample perspective
in the distributions, pass that back to my data scientist
to now start using in the development process
and explore and see the impact of it.
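In script form, that generate-at-several-volumes step might look like the following sketch; the generator's `sample` method is a hypothetical stand-in, not the actual SAS Data Maker API, and Parquet is just one possible output format.

```python
# A hypothetical sketch of the generate-at-several-volumes step. The
# `generator.sample(...)` call is an illustrative stand-in, NOT the
# actual SAS Data Maker API; Parquet is just one possible output format.
import pandas as pd


def generate_volumes(generator, real: pd.DataFrame,
                     multipliers=(2, 3, 4, 5, 10)) -> dict:
    """Generate synthetic data at several multiples of the original row count."""
    outputs = {}
    for m in multipliers:
        synthetic = generator.sample(num_rows=m * len(real))  # hypothetical method
        synthetic.to_parquet(f"synthetic_{m}x.parquet")       # chosen destination/format
        outputs[m] = synthetic
    return outputs
```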
Great.
So here's what I did with the data you gave me.
First, a holdout data set, which
I extracted from the original data at the beginning:
that's the gold standard against which I evaluate all results.
I use SAS Viya to train a variety of models
against the many data sets that you gave me.
SAS Viya really helps me here due
to its distributed processing, its range
of analytical methods,
and its guided, templated modeling approach.
I take care not to give any preferential treatment
to one data set over the other.
Therefore, irrespective of whether it's
synthetic data or original data, I run the same modeling
experiments with the same model parameters and let the data
dictate the results.
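A minimal sketch of that even-handed protocol, using a fixed scikit-learn gradient-boosting model as a stand-in for the SAS Viya pipelines; the data set names, the target column, and the assumption of already-encoded numeric features are illustrative.

```python
# A minimal sketch of the even-handed protocol: one fixed model
# configuration, fit to each training set, scored on the single shared
# holdout. GradientBoostingClassifier stands in for the SAS Viya
# pipelines; features are assumed already numeric/encoded.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score


def evaluate(train_sets: dict, X_holdout, y_holdout, target: str = "default") -> dict:
    """Fit the same model per training set; score all on the same holdout."""
    results = {}
    for name, df in train_sets.items():
        model = GradientBoostingClassifier(random_state=0)  # identical settings everywhere
        model.fit(df.drop(columns=[target]), df[target])
        results[name] = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
    return results


# evaluate({"original": real, "synthetic_2x": synth_2x}, X_holdout, y_holdout)
```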
And here are the results.
Let's first consider results based
on their own test data sets.
What you would notice is that synthetic data of many volumes
consistently gives me better separation power
compared to original data.
For example, for synthetic data sets,
I get a KS statistic, that's a measure of discriminatory power,
in the region of 0.82 and above, compared to 0.72
for the original data set.
This is true even for synthetic data of the exact same volume
as the original data set.
This is a good result. But note, Brett,
this is not an axiomatic finding.
It's an empirical finding.
The fact that we got good results in this experiment
does not ensure that we will always
continue to get the same type of results.
It requires a whole lot more experimentation.
But the good news is, one, I get more robust results
because I can test my models on a larger volume of data.
Number two, I get smoother distributions
because that's how synthetic data algorithms work.
This helps me in downstream activities
like calibrating data sets or creating scorecards.
And three, it becomes easier to get approval for my models,
because no one can argue that low data volumes affected
the model results.
Synthetic data offers me a way to establish the consistency
and robustness of results.
However, let's not get carried away.
Remember, I told you about the holdout data set.
That's the gold standard.
When I look at results on the holdout data set,
I find that the consolidated model, which
is trained on an append of synthetic and original data sets,
proves to be the best-performing model for me.
Why?
Because synthetic data papers over the rough edges
that you find in original data and therefore
helps me get better modeling results
and helps me make better decisions.
OK, this is how more data helps me.
But what can I do about my bias problem?
Perhaps you give me a balanced data set or something?
Right, right.
So again, you talked about data volume,
looking for improvement in accuracy.
From a balancing perspective, there's
a few different approaches that we can use Data Maker for.
We can initially subset our real data down to the data
that we're really focused on and want to generate more of,
feed that into Data Maker to train a generative model that
can then generate data that is specifically
for that segment, that focus segment there.
That's one approach.
The other approach is to just feed all of the data in
and let it generate a model across the whole spectrum
of the space, and then subset that by filtering afterwards.
Now, in some cases, that's really your only option,
because you may have a very minimal amount of data
for the segment that you're wanting to focus on.
So generating it to capture the full relationships
across the full space and then subsetting afterwards
is often a very good approach.
And we're working on baking some of that conditional generation
capability into Data Maker very soon.
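The two approaches can be sketched as follows; `fit_generator` is a hypothetical stand-in for training a generative model (not the Data Maker API), and the `segment` column is an assumption:

```python
# A sketch of the two enrichment approaches. `fit_generator` is a
# hypothetical stand-in for training a generative model (NOT the Data
# Maker API), and the "segment" column is an assumption.
import pandas as pd


def subset_then_train(real: pd.DataFrame, fit_generator) -> pd.DataFrame:
    """Approach 1: train the generator only on the segment of interest."""
    focus = real[real["segment"] == "underrepresented"]
    return fit_generator(focus).sample(10 * len(focus))


def train_then_filter(real: pd.DataFrame, fit_generator) -> pd.DataFrame:
    """Approach 2: train on all the data, over-generate, then filter down."""
    pool = fit_generator(real).sample(10 * len(real))
    return pool[pool["segment"] == "underrepresented"]
```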
So I've done that and passed a more focused set
back to Sundaresh to see how that impacts the results.
Great.
So you're giving me two data sets now.
The gender balance data set gives me
an equal proportion of males to females.
The gender and default balance data set
not only gives me an equal male-to-female ratio,
but also defaulters to non-defaulters
in a 50-50 split.
This is good, because if you had given me
undersampled data sets, if you'd used undersampling,
a conventional technique, I would have thrown these data
sets out: the amount of filtering
you have to apply to get data in such proportions
takes a long time and leaves very little data coming out
of the process.
But these two data sets help me, because the gender balance
data set helps me assess realistic scenarios.
That's how my data is in real life.
The gender and default balance data set
helps me get better separation power.
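One way those balanced sets could be assembled from a large synthetic pool, rather than by undersampling scarce real data; a sketch assuming a `pool` DataFrame with `gender` and `default` columns and enough rows per cell:

```python
# A sketch of assembling the 50-50 balanced sets from a large synthetic
# pool instead of undersampling scarce real data. Assumes a `pool`
# DataFrame with "gender" and "default" columns and enough rows per cell.
import pandas as pd


def balance(pool: pd.DataFrame, columns: list, n_per_cell: int,
            seed: int = 0) -> pd.DataFrame:
    """Draw an equal number of rows for every combination of the given columns."""
    return (
        pool.groupby(columns, group_keys=False)
        .apply(lambda g: g.sample(n=n_per_cell, random_state=seed))
        .reset_index(drop=True)
    )


# gender_balanced = balance(pool, ["gender"], n_per_cell=50_000)
# gender_default_balanced = balance(pool, ["gender", "default"], n_per_cell=25_000)
```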
So I run the same experiments as before.
And what I found was that we did not
experience a significant reduction in bias.
But to be fair, I didn't expect that.
Bias is not contained in just one variable alone;
it's merely signaled by a variable like gender.
But what happens is that we might have addressed bias
to a certain extent by just balancing the data set,
taking care of the bias that's due to poor participation
rates in the past.
But there is systemic bias, inherent bias,
which is part of other data columns in the data set
as well.
So just balancing the data is not enough.
But it's a good first step, because it
helps me quantify the bias and helps me carry out
activities like bias mitigation and algorithmic treatment
down the line to take care of inherent bias in the data.
You can pass by the trustworthy AI booth
to learn more about bias mitigation.
But what I want to convey is that balancing the data
is just the first step before bias mitigation.
Yeah, absolutely.
I think that's a fair point.
And I think it's important for you to understand here.
We're not here to say that synthetic data is
the silver bullet, the magic wand, the unicorns
and rainbows that Brian threw up on the screen this morning.
But it can help.
And it can really allow for this sort of exploration.
You saw what it did for my data scientists
here to try a lot of different models
and see the impact of enriching data,
whether it be by adding more to fill in the space
or to balance the data out more effectively.
And so hopefully, this just starts
you thinking about trying to use synthetic data as opposed
to trying to capture more real data in your process
and the value that it can bring.
We're excited to kick off a private preview of Data Maker
coming up.
So we encourage you to stop by the booth
if you're interested in that or if you just
want to talk through more about synthetic data and Data Maker
and the value it can bring to you and your organization.
Thanks.
Thanks, Sundaresh and Brett.
That was awesome.
Speaking of data: use your apps.
Give us feedback on the presentation.
And we're going to take a very short break.
Please don't go anywhere if you don't have to,
because the next presentation starts at 11:45.