YouTube transcript:
10 Experiments to Run Before You Die with Liz Obermaier | San Francisco Experimentation Meetup

Summary

Core Theme

This content explores a diverse range of experimental methodologies beyond standard A/B testing, emphasizing the importance of choosing the right experiment type to fit specific contexts, accelerate decision-making, and gain deeper product insights.

We have Statsig's very own data

scientist, Liz Obermaier, with a hundred

Hi everyone. I'm so excited to talk to

you all today about 10 experiments to

run before you die. I think that when we

think about experimentation, sometimes

we just think about what we do currently

or what we've done in the past, but we

actually have a lot of tools at our

disposal to be able to do uh different

things in different circumstances to be

able to speed things up and use a wide

variety of techniques to fit the

circumstance that we have. Um, obviously

I work at Statsig, an experimentation

company. So I want you to run more

experiments, but hopefully this talk is

compelling enough that you're willing to

look past my impure motives and keep

experimenting, try new types of

experimentation and be able to just be

really data driven in what you're doing

and be able to test everything. Um,

before we begin, I wanted to share this

quote attributed to George Box. All

models are wrong, but some are useful.

This is a very controversial quote. I

found uh oftentimes when you bring this

up, people are like, "Oh, so you're

wrong all the time, and you want me to

think that it's going to be useful?" Um,

but I think when we dig deeper into

that, what we mean by this, by something

being wrong, is we're making assumptions

and we have a certain amount of

uncertainty. It's not that it's

inherently wrong. There's just a limit

on what we know based on what we've

observed, and we've had to make some

assumptions in order to get there. And

when we're talking about models, that

means any of our causal inference

methodologies, including

experimentation. So when we're running

experiments, inherently we are making

assumptions and we're getting results

that aren't wrong per se, but they're

uncertain. And so we need to be willing

to make these assumptions and we need to

be willing to make do with this

uncertainty. And we're going to have a

tradeoff whenever we're going to be more

certain. It's because we've made more

assumptions to get to that answer. The

fewer assumptions we're willing to make,

the less we're going to have precise

answers because those assumptions help

us kind of narrow down and have more precision.

So, with that in mind, with me waxing

poetic about, you know, models being

wrong, but them still being useful, um,

I hope that that kind of sets the stage

for what we really care about. And

that's being able to use these to make

decisions with and to better understand

our product and use those goals to

understand what we're willing to assume

and what levels of uncertainty we're

willing to accept.

When I talk about broader causal

inference as a landscape where

experimentation exists, I think the

natural question is why experiments in

particular instead of just other causal

models. I think that when we talk about

other causal inference, it can get

really tough, right? When we have diff-in-diff

or propensity score matching,

um those are inherently assumptions in

and of themselves and it tends to sit

further on the uh side of making

assumptions so that we can be more

certain. But obviously in a lot of

academia and a lot of econometrics

and you know economic research we need

to make those assumptions because we

don't have the power of randomization.

That's what makes experimentation so

special is because we have this power of

randomization where we can assign

different treatments to different users

and then measure things instead of being

stuck with observational studies. And

randomization is so powerful because we

get to use the expectation that

differences among our population will

net out that we have this treatment

which is a really strong instrumental

variable that explains our results. And

so that's why experimentation is so nice

and in particular we'll be focusing on experimentation.
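
A minimal sketch of how random assignment is often made repeatable, assuming units are bucketed by hashing an ID together with an experiment name. The helper name `assign_variant` and the SHA-256 scheme are illustrative assumptions and are not described in the talk.

```python
import hashlib

def assign_variant(unit_id, experiment_name, variants=("control", "test")):
    """Deterministically map a unit (user, device, query, ...) to a variant.

    Hypothetical helper: hashing the experiment name together with the unit ID
    gives a stable split that looks random and differs from one experiment
    to the next.
    """
    key = f"{experiment_name}:{unit_id}".encode("utf-8")
    # Interpret the digest as a large integer and reduce it to a variant index.
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]
```

Because the split depends only on the ID and the experiment name, the same unit always sees the same treatment, while differences among units are expected to balance out across the groups.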

I have three different kinds of

experiments that I'm going to talk about

tonight. It really is just that. 10

experiments I'd like you to try running

before you die. Um, and I know it's a

little morbid. It's a little clickbaity.

Um, but we're going to start with our

basic experiments. And while these are

going to be the starting point for

everyone, it's not that you graduate to

more advanced experiments, you're

probably still running them a lot of the

time and they will fit your needs most

of the time. But there are certain

situations where you might deal with

things that don't normally happen and

special situations where you need to be

able to introduce more assumptions and

measure things in a different way. And

so those are going to be our

hard questions that we talk about. Um,

and then finally, we'll have some jet

fuel. Things that help you go fast,

things that help you make decisions more

quickly, um, instead of a normal

A/B test where you might have a fixed

time that you're running things for.

When we're talking about our basic

experiments, like I said, these are

going to be your bread and butter and

probably everyone is running these

already. I think when we start talking

about A/B testing, it's interesting to

talk about what most A/B testing

experiments are, which is growth

experiments. Um, we might have a variety

of people who aren't in marketing here,

but often marketing is a space where

measuring the difference between two

treatments is really important. Now,

obviously, I'm not a marketer based on

my treatments that I suggested where you

can sign up today. I'll give you, you

know, one tiny sliver of a bitcoin, or,

you know, I'll have an offer that

expires. So, I'm kind of using that as

a carrot to try to get you to click

on it. Um and we can measure the

difference between these kinds of

treatments, right? And what we're

getting is kind of like a sign up rate

and we're able to measure like how users

are moving through these new kind of uh

experiences that will actually get us to

have an audience for our product as a whole.

Now, we not only want to measure that

we're getting people into our

experiment, but there's also the other

side effects that we want to measure. If

we're looking at just these metrics,

we're getting a bit of an incomplete

picture. I'd love to hear from this

audience. What other metrics would you

want to measure in a growth experiment

other than just click-through rate, sign

up rate, um, and these kind of visitor

statuses that we have that show if

someone is interested in clicking

through here? Does anyone have any ideas?

Session time, that's a great one. Yes.

>> Customer lifetime value. That's another

great thing to measure. Yes.

Unsubscribe rate. Exactly. Uh some of

these like unsubscribe rate helps you

understand is it just a clickbaity thing

that I got people to click but then they

don't really stick around. Customer

lifetime value that may take a long time

to measure in an experiment but it's

really vital you know data about your

platform as a whole. So that's something

that you might want to measure as well.

I think you guys had great examples and

kind of the whole story of this is it's

beyond just doing the basic part of

making it further in a signup funnel

clicking in subscribing once we really

do want to understand the full lifetime

of the product which is where we might

use our standard A/B product tests and

what we would be doing in this

circumstance is you know building a new

feature using something with our

audience as a whole maybe adding

notifications and kind of continually

trying to understand what keeps

customers there, give them new things to

use and build that user base uh other

than just growing it. Um you don't want

to be pouring more water into a leaky

bucket the same way you don't want to

have great growth tactics with nothing

to help users retain.

I think that these are really

commonplace and they're probably what the bulk

of people are doing on a daily basis.

But that doesn't mean that they're just

simple and you don't need to think about

all the different options you might have

there. Um, I think one of the, you know,

major things to think about is it's not

just UI changes, right? It might be

algorithm changes, it might be

performance changes that you still want

to measure other impacts of because as

you make changes, inherently there will

be bugs sometimes that you might not

catch. There may be unexpected

regressions and it isn't just when any

user is interacting with the UI that

they might have a negative or positive

experience with the product. Uh when we

think about your population too, it

might not just be users or devices which

we typically think of. For example, if

you're a Statsig user, you've definitely

been in one of my experiments, which is

on the query level for some of our

products where we're trying to speed

things up and assuming that faster

queries are better. I think we can all

agree faster is usually better uh with

the same results accuracy. Um and so

that's another unit of experimentation

that you can think of beyond just a

user, a device, a geo. Um, I think also

avoiding messing things up. A lot of

times when you're building products, you

have a vision for how they go. And if

you have a successful experiment, it

it's just confirming what you already

thought, right? You thought this would

be something helpful to users and that

happened. You might have some metrics to

prove it now and say, "Hey, we should

ship this." But it can be even more

powerful when something unexpected

happens when there's a regression maybe

in a certain population and you get that

feedback loop to be able to iterate.

And then again the reminder that

measurement is not one-size-fits-all, like

everyone was contributing in the

audience to our previous analysis. It

wasn't just the surface level metrics

that we wanted to measure of success of

one thing. It was the ongoing success

and health of all of our users on the

product and not just one short-sighted thing.

I'd love for anyone to tell me what they

think this experiment is. Does anyone

have any ideas for what we might be

doing here? Yeah.

Oh, an A/A test. You know, you're right.

It's really surprising results. This

is obviously a super cherry-picked A/A

test that I kind of grabbed for shock

value, right? But I think that it's

really helpful to be able to understand

like, hey, am I reliably assigning

random users to their control and test?

Are my metrics going to be sensitive and

helpful in this situation? Am I going to

get a lot of false positives because I'm

seeing these kinds of crazy results? Um,

it can be a really helpful barometer for

if everything is set up right and if

your code is working as you intended.

Like I said earlier, you need to be

able to plan for unintended consequences

and an A/A test is a way to understand if

there are unintended consequences of your

experimental setup itself.
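
As a rough illustration of the A/A idea, the sketch below simulates many A/A tests on a conversion metric and counts how often a standard two-proportion z-test calls a "significant" difference. The function names, sample sizes, and the 10% base rate are assumptions chosen for the example; with a healthy setup the false positive rate should sit near the chosen alpha.

```python
import math
import random

def two_sided_p_value(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test p-value using the normal approximation."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

def aa_false_positive_rate(n_tests=1000, users_per_arm=500, base_rate=0.1,
                           alpha=0.05, seed=0):
    """Simulate A/A tests where both arms share the same true conversion rate."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        a = sum(rng.random() < base_rate for _ in range(users_per_arm))
        b = sum(rng.random() < base_rate for _ in range(users_per_arm))
        if two_sided_p_value(a, users_per_arm, b, users_per_arm) < alpha:
            false_positives += 1
    return false_positives / n_tests
```

A rate far above alpha on real A/A data would point to a problem with assignment, logging, or the metric itself rather than with any feature.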

One of my favorite experiment types is

holdouts and backtests. And I think

these are becoming more and more

powerful as people continue to ship code

faster with things like AI agents

speeding up how quickly they can produce

code and ship things. It also impacts

how much you need to have guardrails.

One example of a holdback being

really powerful is you can basically

ship that feature you're really excited

about and ship really rapidly to most of

your users. But with a holdback, you

can also understand over the long term

what the health of that decision was.

And if things aren't going as you

expected after a short time, you can

change that decision and be able to um

you know reverse course in the case that

you may have had a false positive or you

may have uh not paid attention to a

novelty effect or maybe the entire

experiment that you ran previously was a

novelty effect. This just helps you kind

of check your work and make sure that

you're really building things that are

long-term valuable to people.

Another really cool uh outcome of

holdouts is that you can look at the

evolution over time of a pooled holdout

where they really do start as just an A/A

test if you haven't shipped anything

and then as time goes on you can

continue to observe how your changes

kind of stack onto each other and

understand the cumulative impact of a

team, or the whole company really,

and it helps you understand the true

impact of your experimentation

program and what you've

been shipping.

When it comes to our hard questions that

we're trying to address, uh, a lot of

the times it's going to be in specific

circumstances. And this kind of is going

to be something where, uh, you're going

to look at the circumstances that you're

in. You might say, "Hey, I don't think I

can run an experiment actually." But

there might be some assumptions you can

make, some tweaks you can make with

assignment that'll allow you to run an

experiment or allow you to use a

different methodology.

One example of that is in the case where

there's interference. I think the

really common example that everyone

hears is what if we do a marketing

campaign out of home? What if we have

billboards up? What if I am doing radio

ads? How can I measure the impact of

that and how can I understand if it's

truly moving the needle and what the ROI

is? And this can be really tough to

measure in an experimental context

because that violates SUTVA, right? We're

not able to treat individual units

individually and we're going to lack

sample size. If we say, oh, just the

geographies that we put things up in are

the units. We're going to end up with a

lot of variance.

One really powerful tool for this is

using a synthetic control. Um, Meta has

a really famous package called geo

testing. That's what we tend to use, but

this really is just a

description of a set of tools that you

can use to be able to synthesize

the counterfactual for what you're

trying to measure in real time with your

treatment. Right? So, a way that you can

do this is let's say I say that the vibe

of Seattle is like if I mixed Portland,

SF, and Boston all together. And I

wanted to run an experiment that I

thought would change the vibe of

Seattle. What I could do is model

Seattle's vibe after those three cities

that I just named. not do anything there

but change my treatment in Seattle and

be able to understand the impact by

comparing my counterfactual that I

synthesized with a model and my

observations of what actually happened

to Seattle. And what that allows us to

do is not have to worry

about temporal effects. Right? If

I'm comparing this week to last week,

that might be a perfectly fine thing

to do at some points in time. But if I'm

comparing Black Friday to the week

before, in a lot of retail environments,

that'd be crazy and it would look like

all of my out-of-home campaigns were

incredible because I had so many more

sales than the previous week, right? So,

it's all about understanding your

context, what problems you're trying to

solve for, and what's the best approach

to use to kind of get around your

inability to uh measure a true

counterfactual and not have the sample size.
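
The Seattle example can be sketched as follows. This is a simplified illustration, assuming exactly three donor cities and a coarse grid search over weights that are non-negative and sum to one; production synthetic control tooling uses proper constrained optimization. All function names here are hypothetical.

```python
def fit_donor_weights(target_pre, donors_pre, step=0.05):
    """Grid-search weights for exactly three donor series.

    Weights are non-negative and sum to 1. The chosen triple minimizes the
    squared error against the target over the pre-treatment period.
    """
    steps = round(1 / step)
    best_weights, best_error = None, float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            weights = (i / steps, j / steps, (steps - i - j) / steps)
            error = sum(
                (target - sum(w * donor[t] for w, donor in zip(weights, donors_pre))) ** 2
                for t, target in enumerate(target_pre)
            )
            if error < best_error:
                best_weights, best_error = weights, error
    return best_weights

def synthetic_series(weights, donors):
    """Weighted blend of the donor series: the synthesized counterfactual."""
    n_periods = len(donors[0])
    return [sum(w * donor[t] for w, donor in zip(weights, donors))
            for t in range(n_periods)]

def estimated_effect(target_post, donors_post, weights):
    """Average gap between what was observed and the synthetic counterfactual."""
    counterfactual = synthetic_series(weights, donors_post)
    gaps = [obs - cf for obs, cf in zip(target_post, counterfactual)]
    return sum(gaps) / len(gaps)
```

Because the donor cities are observed over the same post-treatment weeks as the treated city, a shared shock such as Black Friday shows up in the counterfactual too, instead of being mistaken for a campaign effect.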

Another way that you can deal with this

same issue is using switchbacks. Um,

this is really made famous by Uber and

Lyft, which tend to use this to test

different algorithm changes because they

have a marketplace, right? Any treatment

that you provide to the rider is going

to impact the drivers which in turn

impacts other riders as well as vice

versa. When you treat the drivers, it

impacts their riders which in turn

impacts other drivers. So you have this

kind of inextricable uh network effect

where you might want to use switchbacks.

The reason this works so well, I guess

in Uber and Lyft's case, is because

when you go to that app, it's a very uh

temporal thing that you might be doing,

right? You're calling an Uber, you

decide to take it or you decide to

decline it, right? It's all within a

specified time frame. So you can have

this setup of treatment periods and

burnouts where you can really have this

time centric view of flipping back and

forth between your test and control. It

doesn't work as well when we talk about

a billboard because that's a lot harder

to flip back and forth and people aren't

going to be exposed in a really direct

way and measurable way in that example.

So when we have these different contexts

with kind of the same issue, we

have these different approaches at our

fingertips and it kind of really depends

on the context what we might choose to

do in a situation.
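
A minimal switchback sketch, assuming the whole marketplace flips between arms by time window and that the first few observations of each window are dropped as a burn-in. The names `switchback_schedule` and `switchback_effect`, and the per-window random choice, are assumptions for illustration and not how any particular company implements it.

```python
import random
from statistics import mean

def switchback_schedule(n_windows, seed=0):
    """Randomly assign each time window to control or test."""
    rng = random.Random(seed)
    return [rng.choice(["control", "test"]) for _ in range(n_windows)]

def switchback_effect(windows, schedule, burn_in=1):
    """Difference in mean window-level metric between test and control.

    windows: one list of observations per time window, in time order.
    burn_in: number of leading observations dropped from every window so that
             carryover from the previous arm is less likely to leak in.
    Assumes both arms appear in the schedule at least once.
    """
    by_arm = {"control": [], "test": []}
    for observations, arm in zip(windows, schedule):
        kept = observations[burn_in:]
        if kept:
            # The window, not the individual ride, is the unit of analysis.
            by_arm[arm].append(mean(kept))
    return mean(by_arm["test"]) - mean(by_arm["control"])
```

Treating each window as the unit is what sidesteps the network effect: everyone in the marketplace shares the same arm at any given moment.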

Another thing that's really exciting to

deal with different problems is that

same problem of ROI that we were talking

about with say a marketing campaign. We

can also bring that same concept of ROI

to investments that you make in your

product. It takes real engineering

time to drive down the latency of

something, right? It takes real

effort to build new features. So when

you have something that you can model

with an elasticity test, it can be

really powerful to help you understand

how much bang for my buck do I get when

I'm making these big investments. Uh in

this example, if we talk about

decreasing load time, which I think we

can all agree is helpful, we might look

at our user metrics to see how helpful

is it really and be able to understand

how much we want to invest here and how

helpful it is. There is also kind of a

twist to elasticity tests, which is that you

can do a regression experiment. These

are pretty controversial actually. I

know there are some companies that just

don't do regression tests. I know

Snapchat just doesn't believe in them.

Why would we make a user's experience

worse on purpose? But the trade-off

there is that you get a lot of

information from making a user's

experience worse in this case. When we

think about load time, maybe it's not so

bad, right? It's a little annoying, but

I'm not going to stop using something

because of uh it taking 10 seconds to

load instead of 9.87, right? Um so you

can kind of model this elasticity of

behavior, hopefully without churning

your users. Um, and so that's the

trade-off when you're trying to think

about running a regression test. Is it

worth what you learn to potentially have

negative business impacts while you run it?
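
One simple way to summarize an elasticity or regression test is a least-squares slope across arms. The sketch below assumes each arm adds a known amount of latency and reports one aggregate metric value. The name `metric_slope_per_ms` is hypothetical, and a straight line is only one of several reasonable models for this relationship.

```python
def metric_slope_per_ms(added_latency_ms, metric_values):
    """Least-squares slope: change in the metric per extra millisecond of latency.

    added_latency_ms: latency added in each arm (0 for control).
    metric_values:    the metric observed in each arm, in the same order.
    """
    n = len(added_latency_ms)
    mean_x = sum(added_latency_ms) / n
    mean_y = sum(metric_values) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(added_latency_ms, metric_values))
    variance = sum((x - mean_x) ** 2 for x in added_latency_ms)
    return covariance / variance
```

Multiplying the slope by the latency an engineering project is expected to save gives a rough estimate of the return on that investment.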

I'd also like to finish by talking about

some of our options for jet fuel to be

able to make decisions faster. Right?

I'm sure we'd all love to make our

experimental decisions in an instant,

but it's not as simple as that, right?

I think one really great technique for

this is using SPRT because what you can

do is define your likelihoods ahead of

time uh of what you find believable to

accept the null hypothesis or accept the

alternative hypothesis instead of your

traditional frequentist testing where

you're looking at confidence intervals,

but you don't have that true negative

case of like, hey, there's actually no

change here. SPRT helps you quantify a

uh likelihood where if you observe uh

certain things, you're pretty convinced

that there is no difference, right? It's

a way to formalize that kind of

understanding that a frequentist

analysis doesn't offer. Um,

and when you're formalizing that ahead

of time, it can let you make those

decisions earlier because when you're

trying to do no harm, it's less of an

interpretation uh after the fact and

more of a like ground truth that you can

measure based on assumptions you're

willing to make beforehand and based on

thresholds you're willing to set.
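
To make the thresholds concrete, here is a sketch of Wald's sequential probability ratio test for a single stream of conversions. It assumes a null rate `p0` and an alternative rate `p1` fixed before the experiment starts, and uses Wald's approximate boundaries. A production A/B version would compare two arms; this one-stream form is only meant to show the mechanics.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for a Bernoulli rate.

    observations: iterable of 0/1 outcomes in arrival order.
    Returns (decision, n_used) where decision is "accept_h0", "accept_h1",
    or "continue" when neither boundary has been crossed yet.
    """
    upper = math.log((1 - beta) / alpha)   # cross this: evidence favors p1
    lower = math.log(beta / (1 - alpha))   # cross this: evidence favors p0
    log_likelihood_ratio = 0.0
    n_used = 0
    for outcome in observations:
        n_used += 1
        if outcome:
            log_likelihood_ratio += math.log(p1 / p0)
        else:
            log_likelihood_ratio += math.log((1 - p1) / (1 - p0))
        if log_likelihood_ratio >= upper:
            return "accept_h1", n_used
        if log_likelihood_ratio <= lower:
            return "accept_h0", n_used
    return "continue", n_used
```

The lower boundary is what provides the "there's actually no change here" outcome: the test can stop early in favor of the null instead of only failing to reject it.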

Another really exciting technique that I

uh really like is using a bandit

approach. There are a lot of options

when it comes to using a multi-armed

bandit, but at the end of the day, the

point is you can learn really quickly

when something doesn't work or when

something starts to look a lot better

than the others. And it's really painful

to be serving people a subpar experience

when you already kind of know what you

think works. Um, so what bandits do

that's really powerful is if we're

tracking our probability that a certain

variant is the best over time, isn't it

great to be able to set our rate of

assigning people to that group to the

same value? This is just one example of this,

Thompson sampling. There are

many different approaches, but at its

core, it's understanding that you really

don't want to send people to a variant

that you already know isn't the best.
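
A small Thompson sampling sketch for conversion-style (0/1) rewards, assuming a uniform Beta(1, 1) prior on each variant. Drawing one sample per variant from its posterior and serving the largest draw means each variant is chosen with the probability that it is currently the best. The function names and the simulated true rates are assumptions for the example.

```python
import random

def thompson_choose(successes, failures, rng):
    """Sample each arm's Beta posterior and return the index of the largest draw."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_thompson_sampling(true_rates, n_rounds=2000, seed=0):
    """Simulate a bandit against hypothetical true conversion rates.

    Returns how many times each arm was served.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        arm = thompson_choose(successes, failures, rng)
        pulls[arm] += 1
        # Simulated user response; in production this would be an observed outcome.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls
```

As evidence accumulates, traffic drifts toward the better variant, so fewer users are served an experience that already looks subpar.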

Um, I want to leave you with my 10th

experiment, my 10th and final

experiment: interleaving experiments. I

really see this being exciting for AI

use cases because when we typically

think of interleaving experiments, we're

going to think about search, we're going

to think about ranking cases and we

think about interleaving those results. But

if we think about the n equals one case

of this, it's just asking people to

choose between two options. And I think

that's a really powerful use case that

we've already seen different LLM

providers take to get

feedback on their models, right? It's

the N equals one case of interleaving

experiments to be able to randomize the

order that you show things and

understand uh you know how users click

on things, what their downstream

behavior is uh for those different arms

of the experiment. Maybe more

important for search and ranking. A

byproduct of that is getting a search

cost for users, maybe not so deep at the

N equals one case, but that gets to be

really important for when you think

about search engines who want to know

are people actually going to scroll? Are

people actually going to move to the

second page of search results? Um and

and understand what that actually looks

like in a user base.
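
For the search and ranking case, one common interleaving scheme is team-draft interleaving, sketched below. The talk does not name a specific algorithm, so this choice and the function names are assumptions. Two rankers take turns contributing their best unused result, a coin flip breaks ties on who picks next, and clicks are credited to the ranker that contributed the clicked item.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings into one list and remember which ranker supplied each item."""
    interleaved = []
    team_of = {}
    picks = {"a": 0, "b": 0}
    rankings = {"a": ranking_a, "b": ranking_b}
    while True:
        remaining = {
            team: [item for item in ranking if item not in team_of]
            for team, ranking in rankings.items()
        }
        if not remaining["a"] and not remaining["b"]:
            break
        if picks["a"] != picks["b"]:
            # The ranker with fewer picks so far goes next.
            team = "a" if picks["a"] < picks["b"] else "b"
        else:
            team = rng.choice(["a", "b"])
        if not remaining[team]:
            # That ranker has nothing new to offer, so the other one picks.
            team = "b" if team == "a" else "a"
        item = remaining[team][0]
        team_of[item] = team
        picks[team] += 1
        interleaved.append(item)
    return interleaved, team_of

def credit_clicks(clicked_items, team_of):
    """Count clicks per ranker based on who contributed each clicked item."""
    credit = {"a": 0, "b": 0}
    for item in clicked_items:
        if item in team_of:
            credit[team_of[item]] += 1
    return credit
```

With one result from each ranker this reduces to the N equals one case: two options shown in random order, with the user's choice recorded as a preference.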

So, with that in mind, we have 10

different types of experiments that we

talked about. I'd love to get an

understanding of how many of these

experiments people actually are running

or have run. I'd love to get the

audience standing up. If everyone could

stand up, that'd be great.

I do actually want you to stand up.

Yeah, I know. I know. Um, and then if

you've done four or fewer of these

experiments, please sit down.

If you've done six or fewer, please sit down.

Okay. If you've done seven or fewer, sit down.

Eight or fewer,

nine, 10. Have you guys all done all 10?

Nice. Okay. I'd love to ask the people

who are still standing, what's your

favorite experiment from this list?

>> I'd say the sequential testing.

>> Sequential testing.

>> SPRT. Yeah.

>> Yeah.

>> Why do you like it so much?

>> It's a lot of people think that you

really need to wait till the end to make

certain decisions. Um, and they're

really not paying attention to how much

harm a certain experience is causing

just because they want to stay

statistically rigorous and they're not

really aware that that's also

statistically rigorous.

>> Yeah, I think it's really powerful

because it forces you to really do the

work ahead of time of setting up a power

analysis, understanding your metrics,

defining what you care about, which

hopefully we're all doing for all of our

experiments. But when we think

about it practically, you know, that

level of care and rigor isn't always

applied to a standard A/B test. But SPRT

definitely forces you to do that because

it's part of the methodology to do it.

>> Totally. Early stopping is very

important when warranted.

>> Yeah.

>> Yep. Thank you.

>> Would you mind passing it to the woman

to your right who also had done all of

them? What's your favorite?

>> I don't think I have a favorite. I think

I like that when you are testing a

product, you want to intermix the

different experiments so that you don't

get stuck in a conclusion that

you you go down a rabbit hole. So I

don't have a favorite.

>> It is the classic data scientist answer

I feel like of it depends but it is so

accurate because it does depend on

context, right?

>> Yeah. Yeah. Um,

okay. I I think that's it for me. Um,

thank you so much for your time. I

https://www.youtube.com/watch?v=UF8uR6Z6KLc
