YouTube transcript:
10 Experiments to Run Before You Die with Liz Obermaier | Sigsum 2025

Summary

Core Theme

The presentation introduces a comprehensive framework of ten distinct experiment types, categorized into basic, hard questions, and jet fuel, aimed at empowering data scientists to run more effective and insightful experiments. The core message emphasizes the value of structured experimentation for making data-driven decisions and understanding product impact.

Thank you everyone again for being the

experimentation session. We have our

last but not least session from Liz

Obermaier. Liz is a, uh, sorry, Liz

is a data scientist at Statsig, and before

Statsig she used to work at Meta. At

Statsig she built a lot of important

customer-facing features directly, for

example surrogate metrics, if you are

familiar with this concept, as well as

Fieller intervals, as one of the very first

platforms to have that feature. Um, so

today she's going to talk about 10 types

of experiments to run before you die. Here

you go [Music]

[Music]

>> hi everyone and thank you so much for

coming. I really appreciate you all

turning out and I hope you're as excited

as I am to talk about experiments today.

Um I know it's a bit of a clickbaity

title. Uh but uh hopefully the talk

itself lives up to this hype. Um I think

there's a little bit of an elephant in

the room. Obviously I'm going to try to

convince you today that you should be

running more experiments and I work for

Statsig, a company that's trying to sell

you experiments. Um, but hopefully my

content today will be convincing enough

that you can look past my impure motives

and uh hopefully run some more experiments.


>> She do. I wanted to start with this

quote by George Box that all models are

wrong but some are useful. It's a very

popular adage in kind of analytics and

experimentation community. You're all

probably familiar with it. I've found that

it's very controversial but I still like

it. Um, but I think that folks don't

like it because they're like, "Oh, all

models are wrong. What What do you mean

by that? They're all wrong." Um, but

what we mean by that is every model is

going to have uncertainty in its results

and is built on assumptions. And

there's this inherent trade-off between the

assumptions you're willing to make and

how certain you can be about your

results based on those assumptions. And

that's a very core tension to model

building, right? And it is going to be

something we'll talk about today with

all of these different 10 types of

experiments. And when I talk about

models, I also mean causal inference

>> sorry, spoilers.

Um, which is a good experiment.

Um, but I've obviously butchered this

quote. I've made it way too complex and

have been very pedantic about it. Uh,

but unfortunately as data scientists, I

feel like that's within our fatal flaw.

We want to be really precise, really

talk through all our assumptions when

business folks are maybe like just tell

me what just tell me what you want me to

do. Um, but for a data science audience,

for a data audience like this, I wanted

to walk through all of the nuance to

this kind of a quote, right?

With that in mind, when I'm talking

about experiments, I'm usually talking

about randomized controlled trials. And why

is that? Has anyone here run other types

of causal inference studies like

behavioral studies? A lot of

econometrics is built off behavioral

studies. A lot of different areas of

academia rely on this. Uh but the

painful thing is it's really hard and

you're working with a lot of assumptions

and you're working with a lot of

confounders. However, with

experimentation and in particular

randomized controlled trials, you have the

benefit of randomization doing a lot of

heavy lifting for you. You get to be a

little lazy, right? You're like, well,

in expectation if the sample size is big

enough, it'll all just even out, right?

Any confounder that you think about, or

even those that you don't think about at

all, will even out when you

have enough of a sample size. And what

that actually does practically is free

up brain power for you know tackling the

corner cases, the edge cases,

communicating to stakeholders, doing

everything else other than trying to

convince people that your assumptions

are valid and that we should run it this

way rather than this other tweaked way.

Uh because they're very, you know, well

structured assumptions. I know that what

I basically said is well I don't like

making assumptions like the parallel

trends assumption for a diff-in-diff, uh,

behavioral study, but I love assuming SUTVA. SUTVA

is my favorite assumption. Um, but I

think that it's about a little bit of

the framing and the standard practices

of experimentation that make them
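
The point above about randomization doing the heavy lifting can be checked with a small simulation. This is a minimal sketch, not anything shown in the talk: the covariate, the 50/50 coin-flip assignment, and the function names `covariate_imbalance` and `average_imbalance` are all assumptions made for illustration. It measures how far apart the two arms end up on a covariate nobody adjusted for, at two sample sizes.

```python
import random
import statistics


def covariate_imbalance(n_units: int, seed: int) -> float:
    """Absolute difference in mean covariate between two randomly assigned arms."""
    rng = random.Random(seed)
    # Hypothetical pre-existing trait (for example prior activity) that we never adjust for.
    covariate = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
    test, control = [], []
    for value in covariate:
        # Coin-flip assignment: the only thing randomization asks of us.
        (test if rng.random() < 0.5 else control).append(value)
    return abs(statistics.fmean(test) - statistics.fmean(control))


def average_imbalance(n_units: int, n_repeats: int = 200) -> float:
    """Average imbalance over many independent randomizations."""
    results = [covariate_imbalance(n_units, seed) for seed in range(n_repeats)]
    return statistics.fmean(results)


if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, round(average_imbalance(n), 4))
```

Under these assumptions the average gap between arms shrinks as the sample grows, which is the "in expectation it evens out" argument in code form.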

The rest of this talk really is going to

be 10 different experiments that you can

run. Uh, it's not that original. The

structure isn't mind-blowing by any means. Uh,

so we really will be going through 10

different experiment types. Uh there

will be the basic experiments that I am

thinking probably everyone here has run

or heard of or interacted with before.

Um, there will be the hard questions,

where we have some, like, atypical

situations where you might need to tweak

how you're setting up your experiment to

handle that. And then I'll be looking at

what I'm calling jet fuel, which is

things that help you speed up your

experimentation practices and make

decisions faster.

With our basic experiments, like I said,

they're basic, but they're fundamental.

Uh we're actually going to spend most of

our time here in the list because these are

kind of the things that most people

are going to potentially be working

with. Our first example that we'll be

talking about, our first type of

experiment is the standard A/B growth

test. And this is actually what most

experiments are in practice uh in the

industry, right? Because a lot of times

we have folks who are doing marketing

and want to get adoption for whatever

their product is, whatever their company

is. Um so you might realize that I am

definitely not a marketer because my

choice for the treatments here were

probably pretty terrible. I have the

control that just tells you, 'Sign up today,

please.' I have, uh, another test arm

that says, "Hey, if you sign up today,

you can get a small sliver of Bitcoin."

And there's a third version where I say

offer expires today. We'll match your

first deposit. Um so these are all

potentially different calls to action that

might happen on an email or a landing

page or something like that. And what we

can do here is really measure the

outcomes of those super effectively. But

I actually want people to think beyond

just the basic metrics of like, okay,

did they actually sign up? Did they

check it out? Were they a visitor to the

site? And think about impact much further

downstream than just did they click the

button? And did they convert on this one

instance? I'd love to ask the room. Does

anyone have an idea of what may be a

good additional metric to use in this situation?


>> Yeah.

>> Second transactions,

>> second time transactions was what he

said, right? You care about the lifetime

of that user. Did getting that email

make them a repeat, retained customer?

Was that onboarding experience something

that impacted them down the line and for

the long term? Because often times uh

when we're making these kinds of

experiments, it's the entry point to a

product and that can have really big

implications beyond just did they sign

up after seeing this.
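
To make the "look beyond the sign-up click" point concrete, here is a hedged sketch of comparing one test arm to control on both a sign-up metric and a downstream second-transaction metric. The two-proportion z-test is a standard choice but is an assumption here, not something the talk prescribes, and every count below is made up for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (lift, z, two-sided p-value) for arm B versus arm A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


if __name__ == "__main__":
    # Hypothetical counts: (control conversions, control n, test conversions, test n).
    metrics = {
        "signed_up": (400, 10_000, 520, 10_000),
        "second_transaction": (150, 10_000, 140, 10_000),
    }
    for name, counts in metrics.items():
        lift, z, p = two_proportion_z_test(*counts)
        print(f"{name}: lift={lift:+.4f} z={z:.2f} p={p:.4f}")
```

Running the same test on both metrics is the point: an arm can win on sign-ups while doing nothing, or worse, on the metric that reflects repeat customers.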

>> Anyone else have any additional ideas? Yeah,

definitely.

>> No cancellation.

>> Churn or cancellation. Yes. Did we

have clickbait in our call to action and

people clicked on it, so that we achieved

that goal, but they actually churned

immediately after because they realized

maybe it wasn't as good of a deal as

they thought. We got them to do that

initial click but not follow through on

what we wanted them to do. Those are some great

ideas. I'm sure everyone has tons of

ideas. Like, we're rarely running an experiment

where we only care about three metrics,

right? Um but if we talk about every

experiment type in this detail, we'll be

here for the rest of the afternoon. Um,

the next one I want to talk about is the

standard A/B product test. The way that

this is going to differ from the growth

test is that people are already on your

product when this is happening. It could

be something like a notification test.

Uh who here likes notifications on their phone?


No volunteers? No volunteers. "Sometimes,"

that was actually probably the most

honest answer, right? Sometimes they're

really helpful, sometimes they're super

annoying. And so this is going to be

another common type of test that people

will do to kind of understand how users

interact with their product and what's

sorry. Um, I know that none of that was

probably a surprise to any of the folks

in this room, but I do want to emphasize

that, uh, this is going to be a

commonplace type of experiment, but they

can get really complex really quickly.

And there are a lot of ways to think

outside the box even with just the basic

types of experiments. For example, if we

think about things like UI changes, the

classic example, it's like okay, it's

just a button test, right? Button, blue

button, contrived example. But there's a lot

of things that you can do that are behind

the scenes and that have a big impact

rather than just like, oh, the UI looks

a little different now. Um, if anyone

here's a user of Statsig, you've been a

part of one of my experiments, which is

uh doing different strategies for

caching and serving and querying the

different uh metrics explorer queries

you might be using. Um, so again, it

doesn't just have to be UI. There was no

human involved in that at all. It was

just different query techniques. Um,

next, it's the population that you're

working on. So for the experiment I was

talking about, my population wasn't users. I

kind of just assumed that faster query

equals better, which is a pretty safe

assumption. Um but what that meant is

that I could use a different unit type

being each individual query being

randomly assigned, rather than needing to use

uh users which is a really typical unit

of randomization. Um, so this could

really be anything. You could have sessions

as the unit identifier. And obviously as

you're choosing your unit of

randomization, you're going to want to

look through your assumption list of

whenever you're running an experiment of

like, hey, are these going to, you know,

conform to SUTVA? Is this reasonable? Like

what do I need to do in this situation?
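
Choosing a unit of randomization, as described above, usually comes down to which ID you hash. The sketch below shows one common pattern, deterministic hash-based bucketing; it is an assumption for illustration, not a description of how Statsig or any specific platform assigns units, and `assign_variant` and the experiment names are hypothetical.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str,
                   variants=("control", "test")) -> str:
    """Deterministically map a unit (user, session, query, ...) to a variant."""
    # Salting with the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # Same function, different unit types: only the ID passed in changes.
    print(assign_variant("user-123", "new_onboarding"))
    print(assign_variant("session-9f2c", "new_onboarding"))
    print(assign_variant("query-000417", "query_cache_strategy"))
```

Because the mapping depends only on the ID and the experiment name, a user-level ID gives each user a stable experience, while a query-level ID lets every query be assigned independently, which is only reasonable when units do not interfere with one another.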

Um, the next piece of advice for these

kinds of experiments, just don't mess

things up. Easy as that, right?

Um, I think that it can be really

sneaky to uh get bugs or regressions in

something that you're testing. And

honestly, when you're running an

experiment, that's probably the biggest

value of having an experiment run is that

you catch those mistakes or things that

are really bad for your business even

though you thought it was a great idea.

And either you can tweak the

implementation of it or you can say we'd

better scrap this idea altogether

because it's just not. Um and then

measurement is not one-size-fits-all. Um,

different products are going to require

uh you know different measurement and

this is really where domain expertise

comes into play. You know, uh, it

isn't just like a black box like any

data scientist can make good decisions

about anything. There is a level of in-depth

knowledge that really helps when you're

I have the third type of test up here

and I'd love if anyone has a guess for

what type of test this is.

I'll give you a second reading it, but

I'd love to hear if anyone has a guess

Any guesses? No volunteers? You can

just shout it out.

>> [partly inaudible] performance [inaudible]

>> That's a good

guess. Um, this is actually a really

messed-up A/A test. Um, obviously I

cherrypicked this example, you know, uh,

I'll I'll own up to that, but I think

that it's really important to be running

A/A tests as well, to understand, like,

is your randomization working when there

is no real change and also kind of just

to get a sense of, like, an A/A test. Like,

they can really sneak up on you if you

like have a real test that's running

where you think it's an A/B test, but it

actually didn't really change anything

meaningfully and it's actually an A/A

test. Those can be really sneaky and you

might be shipping things that don't make

a difference because with a 95%

confidence interval which is pretty

industry standard. Yeah, there's that 5%

chance that you know you're getting

those false positives, right? Um, so I think

that the A/A test can be a really

powerful tool, (a) for making sure that

your randomization and telemetry is all

working correctly and (b) just to kind of

familiarize yourself with the concept to

be like wait am I am I getting tricked?

Am I getting tricked by something that's
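
The 5% figure mentioned above can be seen directly by simulating many A/A tests. This is a rough sketch under stated assumptions: both arms share one made-up conversion rate, significance is judged with a two-proportion z-test at alpha 0.05, and `aa_false_positive_rate` is a hypothetical helper, not a platform feature.

```python
import math
import random


def p_value_two_sided(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(n_tests: int = 400, n_per_arm: int = 1000,
                           base_rate: float = 0.1, alpha: float = 0.05,
                           seed: int = 0) -> float:
    """Share of A/A tests (identical arms) that look 'significant' anyway."""
    rng = random.Random(seed)

    def conversions() -> int:
        # Both arms draw from the same rate: there is no real change.
        return sum(rng.random() < base_rate for _ in range(n_per_arm))

    hits = 0
    for _ in range(n_tests):
        p = p_value_two_sided(conversions(), n_per_arm, conversions(), n_per_arm)
        hits += p < alpha
    return hits / n_tests


if __name__ == "__main__":
    print(f"false positive rate: {aa_false_positive_rate():.3f}")
```

If a real pipeline produced a rate far from alpha on repeated A/A tests, that would point at a randomization or telemetry problem rather than at chance.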

The fourth type of experiment that I

wanted to talk about was holdouts.

Uh, there are different ways to do

holdouts or back tests, but it's basically

kind of this big umbrella of experiment

types where we're talking about withholding

products that you've shipped from a

certain set of the population. Um this

beautiful chart is actually from Etsy.

Uh they have a really great blog called

Code as Craft, and I've really, really

enjoyed their articles. Um, but this

is how Etsy runs their holdouts, right?

They're shipping things across, you

know, a quarter and then they have a

comparison period between two untainted

samples that have not gotten any of the

shipped experiences during the quarter

and they compare them into a sack. This

is one of the ways that you can run a

hold out, right? you're saying like,

hey, how do I account for the winner's curse and

make sure my shipped experiments are

actually doing good? But I think there's

also a really interesting methodology um

that we like at Statsig, which is, uh,

comparing what you're shipping to that

hold out during that Q1 because it can

help you kind of understand what is the

total impact of my experimentation

program over time. You can also kind of

look at this daily time series type view

and you can see kind of intuitively the

impact of different rollouts as they

happen. Right? Any roll out before

you've shipped anything starts as an A/A

test. So that looks really reasonable

that at the start there's zero

difference between the test and control

group. Sorry, I realize I haven't

explained this visualization at all as I

got here. Um but basically this

visualization is comparing your test to

your control group over the different

days uh that are happening here. And so

when this is a holdout, uh, you're

getting that sense of like okay as

certain uh launches are happening what

is the impact on the total population

and what is their aggregated impact at

That ends our section of the basic

experiments. Um, and so we're going to

be able to move on to the hard

questions. And I think that when we go

on to this next section, what I really

want to emphasize is these may not be

your hard questions, but they are

existing ones. And there's a lot of

literature out there about these kinds

of challenges that you might be facing.

So just kind of you know thinking about

like the experimentation community and

what kind of solutions there might be uh

for different challenges that folks

face. Uh one of them that I wanted to

talk about was interference.

Um, I've talked about SUTVA a lot. My

favorite assumption as I told you

earlier. Um we're basically assuming

that every unit that we're experimenting

on has a stable treatment. There's no

interference between those units. It's

nice and clean. everyone's independent

of each other. I love assuming things

are independent variables, right? Um so

what about when SUTVA is violated? Like if

we're putting up a billboard in a giant

city, you know, shameless plug for

Statsig too, obviously. Um, but

basically, when we're treating whole populations

and we can't control for these kind of

network effects, um there are a few ways

that we have to deal with this kind of

violation of SUTVA and be a little creative here.

uh one of them if it could be an

experiment maybe it is there's some

fuzzy definitions here but basically

using a synthetic control can be

really useful for this type of situation, uh,

because you don't know,

like, what if the billboard wasn't there

in that particular city? So you have a

synthetic control modeling what that would look

like, from units that aren't getting billboards,

and then you kind of compare that test result

that you're observing to the synthetic

control, right? Um, an example of

how this works would be let's say you

know I'm changing something about the

vibe of Seattle and I'm like well

Seattle's kind of like if you mix

Boston and SF. I would basically be like,

okay, that's what the vibe of Seattle is, and

I'm going to do something that changes the vibe

of Seattle. Um, and so what I would do if

I'm making that experiment is I would

say, okay, well, based on how SF

and Boston are doing, that's my synthetic

control, because I can use that model.

It's going to spit out what

the vibe of Seattle would be if I had no

treatment, and that works as long as I

make sure those cities are untreated, both

of them. So it lets you be a little bit

creative in how you're measuring things

and it works really well when you have network

effects or you don't have a lot of sample

size. You're kind of bumped by some of

Sorry. Well, we'll grab pushies at the

but I know there's a lot to discuss

here. Obviously this is kind of like an

ambush of different, uh, experimental

techniques, but we can definitely chat

after and I think there will be

different questions at the end too. Um,

another way to handle this same issue is

switchbacks. Um, this is really frequently

used in like the ride share problem

where it's like you have marketplaces. Uh,

or in gaming, right, where you try to match

people to play against each other. Um,

not only do you have the issue of, you

know, spillovers between units, but you

also have, you know, people who are on

different teams or people who are buying

the, you know, ride share, uh, you know,

rides and people who are driving are in

the marketplace, right? Um, so these can

make, uh, you know, an already tricky

problem even trickier, right? Um but the

uh, example of switchbacks is really

interesting because instead of

experimenting on, uh, units of, you know, drivers

or riders or different people playing a

video game, you're experimenting on

units of time which might also be broken

up into like different geographies or

maybe it's different servers, if you use the

video games example. And what you're

doing is you're kind of swapping in your

treatment and your control in these

units of time. Uh this obviously comes

with some new assumptions, right? You

probably don't want it to be something

super visible to the customer. So this

is something more like, oh, Uber's

pricing could be experimented on this way, or

the different matching algorithm in the

video game could be experimented on this

way, but not like the UI changed. that

would be super jarring if it's like well

I saw it 10 minutes ago, but it's not

there anymore. Um there are also some

other constraints of this kind of

methodology that you have this kind of

burn-in, uh, in the diagram that I'm

showing, where we actually don't attribute those to

either the test or the control. It also

works best with things that are relatively

short-term too. That's actually why the

matchmaking and the Uber examples work

really well, because when you're on the Uber

app, right, you're usually able to

make a quick decision of like, hey, I want

to get this ride here. Here's the price

option. Cool. Let me book it. When

you're getting matched to play a video

game, right, you you kind of enter the

like gameplay screen and there's

matching that happens and you play the

game. They're all relatively short-term.

So you can cut it with this time period

and use that as your randomization and

kind of stick it to later. Whereas like

for a billboard, there's no way that

we're taking it down, putting it up, taking

it down, putting it up. You know, there are just

these practical constraints, right?
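
The switchback idea described above, randomizing units of time per region and ignoring a burn-in stretch, can be sketched as follows. Everything specific here is an assumption for illustration: 30-minute windows, a 5-minute burn-in at the start of every window, and the helper names `switchback_schedule` and `assign_event`.

```python
import random


def switchback_schedule(n_windows: int, regions, seed: int = 0) -> dict:
    """Randomly assign each (region, time window) pair to control or test."""
    rng = random.Random(seed)
    return {
        (region, window): rng.choice(("control", "test"))
        for region in regions
        for window in range(n_windows)
    }


def assign_event(timestamp_min: float, region: str, schedule: dict,
                 window_min: int = 30, burn_in_min: int = 5):
    """Return the variant an event belongs to, or None if it falls in burn-in."""
    window = int(timestamp_min // window_min)
    offset = timestamp_min - window * window_min
    if offset < burn_in_min:
        # Events right after a possible switch may still reflect the previous
        # variant, so they are attributed to neither arm.
        return None
    return schedule[(region, window)]


if __name__ == "__main__":
    schedule = switchback_schedule(n_windows=4, regions=["seattle", "boston"])
    for minute in (2, 10, 31, 65, 95):
        print(minute, assign_event(minute, "seattle", schedule))
```

The analysis unit becomes the (region, window) pair rather than the rider or player, which is why this only suits short-lived interactions such as pricing or matchmaking.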

Um moving on to a different area not uh

dealing with those kind of, uh, violations

of SUTVA in different forms. Uh, this one is

about elasticity.

Um, I think that they're really cool

because they help you turn

experimentation from a tool to make one

decision into a tool to make any

decisions, right? The idea is that let's

say, you know, that I'm improving the

performance and latency of the Metrics Explorer

queries that I was talking about

earlier. I know that's I've made that a

screen and you know can see hey load and

MP issues that trouble so much. If I

decrease load time, will people

actually, you know, stay there longer or

like have a better experience because of

that? And how much is really worth it,

right? Should I have my engineers spend

the whole quarter on decreasing

latency, or worry about other things

like building features and, you know, all

of the myriad things they could be

working on, right? Um, so I think that just

kind of, like, elasticity is really powerful to

understand like yes we want to make the

right decision on an individual case but

also how do we keep making the right

decisions on what to prioritize? Um, and so

this can help answer the questions if we

have a finite amount of load time

decrease and we see how other

metrics are affected. Um, another way to be able to

achieve the same kind of learning is using a

regression test, in which I on purpose make the

product worse. This is very

controversial. I know several companies

that are just fundamentally like, you know, we

don't we don't do regression testing

actually. Um, and it's very

controversial because you are making an

experience worse for people and you're

potentially causing your customers to

churn. like it it could really hurt your

metrics overall, right? Um but kind of

that balance between learning and uh

making the right thing. If you're

familiar with, like, ML work, you

probably know explore exploit, how much

are you wanting to explore here and how

fruitful is it going to be to user

population if you do this kind of

experimentation, right? But it's

interesting because it does help you,

right? You get to make those

prioritization decisions really clearly

based on data instead of vibes.

>> Okay. Update. Next section that I'm

talking about is jet fuel: things that help you

experiment really really quickly. Um I

think that's important in terms of

addictums and what's nice we actually

borrow a lot from like medical

literature in past um because when they

were running experiments they were you

know people see right so they have a lot

of techniques to be able to make

decisions quickly and either really

quickly stop something that's hurting

people or really quickly, uh, ship something

that's helping people Okay,

one example of this is SPRT or

sequential probability ratio tests, which

are basically a different way of

thinking through the test. Instead of a

p-value, we have this probability ratio,

like this likelihood ratio. One thing is that

it's kind of eating your vegetables, like, you

really have to do a power analysis for

every single metric that you're looking at, and

ask, like, okay, what is the evidence

that convinces me there's no effect?

What is the evidence that convinces me

there is an effect? Which we should all

be doing, power analysis, for like

your classic frequentist experiment. But

as Dylan was talking about earlier, you

know, we don't always necessarily, but with

SPRT, it's fundamentally part of

the process. So it does make everyone

eat their vegetables. And then also it

helps make decisions sooner in most

cases where you're able to kind of

quantify like do I have evidence that

this is you know not at all different or

do I have evidence that there isn't kind
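
The two evidence thresholds described above map onto Wald's classic SPRT. The sketch below is a textbook version for a single conversion rate, assuming two simple hypotheses (`p0` for "no effect", `p1` for "effect") chosen up front along with the error rates; it is an illustration, not the sequential method of any particular platform.

```python
import math


def sprt_bernoulli(observations, p0: float, p1: float,
                   alpha: float = 0.05, beta: float = 0.2):
    """Wald's SPRT for H0: rate = p0 versus H1: rate = p1 (with p1 > p0).

    Returns (decision, number of observations used).
    """
    # Thresholds come straight from the tolerated error rates, which is why
    # the power-analysis thinking has to happen before the test starts.
    upper = math.log((1 - beta) / alpha)   # enough evidence for an effect
    lower = math.log(beta / (1 - alpha))   # enough evidence for no effect
    llr = 0.0
    for n, converted in enumerate(observations, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "effect", n
        if llr <= lower:
            return "no effect", n
    return "continue", len(observations)


if __name__ == "__main__":
    stream = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]  # made-up conversion outcomes
    print(sprt_bernoulli(stream, p0=0.1, p1=0.3))
```

The running log-likelihood ratio is checked after every observation, so the test can stop as soon as either threshold is crossed instead of waiting for a fixed sample size.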

Another thing that's great for this is bandits. Again,

not exactly a randomized trial, but you

do know the bias that you're introducing.

The problem that it solves here is: if we

see this cumulative success rate over

time and our probability of a variant

being best over time, it's such a bummer that

we're still making people use the circle

variant that's in the red and under

there. Like, I don't know, you can tell pretty

early on that maybe that's not the way to go.

Um, so the great thing that

bandits do is say, hey, let's take that probability of being the

best variant over time and let's use

that as the assignment probability.

I think that was actually totally wrong

let's see specific but like we need

different types of

Yeah. Uh, but I think, again, like the sequential

probability ratio test, these help you, uh,

make decisions early but also not make

decisions so you're worried right you're

just doing differential allocation

you're not necessarily like taking

something away when there still could be
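
The differential allocation described above is what Thompson sampling does: drawing one sample per variant from its posterior and serving the variant with the largest draw assigns each variant with probability equal to its current chance of being best. This is a minimal sketch with Beta(1, 1) priors and made-up true rates; the talk does not say which bandit algorithm was on the slide, so treat the choice as an assumption.

```python
import random


def thompson_allocate(true_rates, n_rounds: int = 2000, seed: int = 0):
    """Simulate Thompson sampling and return how often each variant was served."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # One posterior draw per variant; Beta(1 + s, 1 + f) is the posterior
        # for a conversion rate under a uniform prior.
        draws = [rng.betavariate(1 + s, 1 + f)
                 for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        pulls[arm] += 1
        # Simulated user outcome; in production this would be real feedback.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical conversion rates for three variants.
    print(thompson_allocate([0.05, 0.08, 0.20]))
```

Weak variants keep a small share of traffic rather than being cut off outright, which matches the point that nothing is taken away while it could still turn out to be good.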

and the last type of experiment that I

wanted to talk about was interleaving

experiments, which are very,

very domain-specific, right? Um, interleaving

experiments are for sort of a ranking-

type situation, like if you're searching

for a product on Amazon, well, what are the

results that they show you and in what

order right

so you're able to have these kind of two

outputs of what those should be and

intersperse them they probably should

then uh be able to understand user

preference and make these decisions faster.
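
One well-known way to intersperse two rankings as described above is team-draft interleaving. The talk does not name a specific scheme, so this is a hedged sketch of that one technique; the item IDs and the helper names `team_draft_interleave` and `credit_clicks` are made up for illustration.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, seed: int = 0):
    """Merge two rankings; return (interleaved list, {item: 'A' or 'B'})."""
    rng = random.Random(seed)
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The ranker with fewer picks goes next; ties are broken by coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(("A", "B"))
        item = next((d for d in rankings[team] if d not in team_of), None)
        if item is None:
            # This ranker has nothing new left, so the other one contributes.
            team = "B" if team == "A" else "A"
            item = next(d for d in rankings[team] if d not in team_of)
        interleaved.append(item)
        team_of[item] = team
        picks[team] += 1
    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per ranker; the ranker with more credit is preferred."""
    credit = {"A": 0, "B": 0}
    for item in clicked_items:
        credit[team_of[item]] += 1
    return credit


if __name__ == "__main__":
    shown, team_of = team_draft_interleave(["x", "y", "z"], ["y", "w", "x"])
    print(shown, credit_clicks(["y"], team_of))
```

Every user sees results from both rankers in one list, so each session yields a direct preference signal, which is where the speed-up over a standard two-arm test comes from.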

Um and basically um what you're also

doing at the same time is your

understanding as a byproduct of this and

again very domain specific helps you

understand your specific topic much

better. Um but it is worth talking about

this because um this is one of those

cases where I'm like okay you actually

get to learn really well. Uh but also uh

you know you have the standard

experiment side um and you have a little

product of safe learn and this

experiment. Um okay I had you sitting

here for a long time. So I want to get

you all standing up through this last

little piece of participation and my 10

Yeah, still on. Um, and if you have done

four or fewer of these experiments,

please sit down.

>> Okay.

>> If you've done five or fewer, please sit down.

>> If you've done six or fewer, sit down. I think you

know where this is going, right? Um,

sound interest. Yes. I'm not giving in

the back the chance to participate.

How everyone's standing in the back has

please sit down

or senior please sit down. You have done

all okay but I think nine is all right.

Has anyone done all 10? Raise your hand

if you've done all eight.

Okay. But you can I click on four. I

don't know. I don't want to make you can

I pick more you've done.

>> Yeah. What's your favorite

switch? Switch back to game. Yeah. So,

hopefully this gives you some

inspiration. Um, this is not exhaustive by

any means. They're just ones I really

like. Um and so, uh if you'd like to, uh

I'd love to also figure out if you have

an extra one thing I missed. Um but I

think we're close to time. So, if anyone

has any questions, um, I'd love to

address them, but also know I'll I'll be

your I'm happy people later, too. But we question

your favorite.

>> Oh, well, my favorite: I really like holdouts. I

know they're basic, but there's just so

many ways that you can do them. Um

because you can do like I showed you the

way that Etsy does them, versus comparing

your whole experimentation program to

a control that's being held out, versus

doing a back test to basically like

double confirm what you do. They're just

so versatile and like it's like uh

double-checking your homework, too. So you

>> Talking about synthetic controls, how do

you compare them with pre-post testings?

because but synthetic synthetic controls

has a hard questions. So that would be

with pre-post. How do you compare those two?


>> So our synthetic control methodology

that I'm used to using is not actually a

pre-post methodology. Um it's basically

using other units that are not being

treated to uh kind of calculate

counterfactual. Um, so basically instead

of using the pre-period, you can use

different units that you can kind of

model into the unit that you're

experimenting on, right? Um, and that way

what you can do is still account for

those kind of, like, time seasonality

things, right? Like I don't want to

compare um Black Friday sales to two

weeks ago sales in like a pre-post

situation. So I think that then

constructing your synthetic control not

from pre-experiment data but instead

from different units that aren't being

treated can be really powerful. Yeah.

>> The the diagram.

>> Yeah. Yeah. Oh. Oh, sorry. I

misunderstood what you were saying. Okay.

Okay. But I think that is cuing from

that pre-experiment period, right? If

you can, you know, construct your model,

you know, just, uh, constructing your

model based on some of the data, uh,

confirming your model based on some of

the data, right? Um, you can figure out

what your MSE is for that model, right?

And then you can kind of understand how

the uncertainty added to your analysis

by the fact that your, uh, control group

is a model can be taken into account, right?
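
The answer above, together with the earlier Seattle example, can be sketched as a tiny synthetic control: model the treated unit as a mix of two untreated units, fit the mix on part of the pre-experiment data, check the error on the rest, and then read off the gap after treatment. All numbers are invented, and the single mixing weight is a deliberate simplification (real synthetic controls usually weight many donor units).

```python
def fit_mix_weight(target, donor_a, donor_b) -> float:
    """Least-squares w in [0, 1] such that target ~ w*donor_a + (1-w)*donor_b."""
    num = sum((t - b) * (a - b) for t, a, b in zip(target, donor_a, donor_b))
    den = sum((a - b) ** 2 for a, b in zip(donor_a, donor_b))
    if den == 0:
        return 0.5  # donors are identical, so any mix fits equally well
    return min(1.0, max(0.0, num / den))


def synthetic(donor_a, donor_b, w: float):
    """Counterfactual series built from the two untreated donors."""
    return [w * a + (1 - w) * b for a, b in zip(donor_a, donor_b)]


def mse(actual, predicted) -> float:
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)


def demo():
    # Hypothetical daily metric for two untreated cities.
    sf = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]
    boston = [20.0, 19.0, 21.0, 20.0, 22.0, 21.0, 23.0, 22.0]
    # Made-up treated city: a 30/70 mix, with a lift of 2.0 on the last two days.
    seattle = [0.3 * a + 0.7 * b for a, b in zip(sf, boston)]
    seattle[6:] = [v + 2.0 for v in seattle[6:]]

    w = fit_mix_weight(seattle[:4], sf[:4], boston[:4])           # fit on days 0-3
    val_mse = mse(seattle[4:6], synthetic(sf[4:6], boston[4:6], w))  # confirm on days 4-5
    counterfactual = synthetic(sf[6:], boston[6:], w)             # treated days 6-7
    effect = sum(s - c for s, c in zip(seattle[6:], counterfactual)) / 2
    return w, val_mse, effect


if __name__ == "__main__":
    print(demo())
```

The validation error on the held-back pre-period days is the quantity to carry into the uncertainty of the final estimate, since the control here is a model rather than an observed group.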

We can talk later too like more in

conversation. Yeah. Um I think we are at

time but again I would love to talk to


https://www.youtube.com/watch?v=UF8uR6Z6KLc
