The presentation introduces a comprehensive framework of ten distinct experiment types, categorized into basic, hard questions, and jet fuel, aimed at empowering data scientists to run more effective and insightful experiments. The core message emphasizes the value of structured experimentation for making data-driven decisions and understanding product impact.
Thank you everyone again for being at the
experimentation session. We have our
last but not least session from Liz
Obermaier. Liz is a data scientist at
Statsig, and before Statsig she worked
at Meta. At Statsig she built a lot of
important customer-facing features
directly, for example surrogate metrics,
if you are familiar with this concept,
as well as Fieller intervals, with
Statsig being one of the very first
platforms to have that feature. So
today she's going to talk about 10 types
of experiments to run before you die.
Here you go.
>> hi everyone and thank you so much for
coming. I really appreciate you all
turning out and I hope you're as excited
as I am to talk about experiments today.
Um I know it's a bit of a clickbaity
title. Uh but uh hopefully the talk
itself lives up to this hype. Um I think
there's a little bit of an elephant in
the room. Obviously I'm going to try to
convince you today that you should be
running more experiments and I work for
Statsig, a company that's trying to sell
you experiments. Um, but hopefully my
content today will be convincing enough
that you can look past my impure motives
and hopefully run some more experiments.
>> I wanted to start with this
quote by George Box, that all models are
wrong but some are useful. It's a very
popular adage in the analytics and
experimentation community. You're all
probably familiar with it. I've found that
it's very controversial, but I still like
it. I think that folks don't
like it because they're like, "Oh, all
models are wrong. What do you mean
by that? They're all wrong." But
what we mean by that is every model is
going to have uncertainty in its results
and is built on assumptions. And
there's this inherent tradeoff between the
assumptions you're willing to make and
how certain you can be about your
results based on those assumptions. And
that's a very core tension in model
building, right? And it is going to be
something we'll talk about today with
all 10 of these different types of
experiments. And when I talk about
models, I also mean causal inference
(sorry, spoilers),
which includes a good experiment.
Um, but I've obviously butchered this
quote. I've made it way too complex and
have been very pedantic about it. Uh,
but unfortunately, as data scientists, I
feel like that's our fatal flaw.
We want to be really precise and really
talk through all our assumptions, when
business folks are maybe like, just
tell me what you want me to
do. Um, but for a data science audience,
for a data audience like this, I wanted
to walk through all of the nuance in
this kind of a quote, right?
With that in mind, when I'm talking
about experiments, I'm usually talking
about randomized controlled trials. And why
is that? Has anyone here run other types
of causal inference studies, like
observational studies? A lot of
econometrics is built off observational
studies. A lot of different areas of
academia rely on this. But the
painful thing is it's really hard, and
you're working with a lot of assumptions
and you're working with a lot of
confounders. However, with
experimentation, and in particular
randomized controlled trials, you have the
benefit of randomization doing a lot of
heavy lifting for you. You get to be a
little lazy, right? You're like, well,
in expectation, if the sample size is big
enough, it'll all just balance out, right?
Any confounder that you think about, or
even those that you don't think about at
all, will be balanced out when you
have enough of a sample size. And what
that actually does practically is free
up brain power for tackling the
corner cases, the edge cases,
communicating to stakeholders, doing
everything else other than trying to
convince people that your assumptions
are valid and that we should run it this
way rather than this other tweaked way.
That's because they're very well-structured
assumptions. I know that what
I basically said is, well, I don't like
making assumptions like the parallel
trends assumption for a diff-in-diff
observational study, but I love making SUTVA.
SUTVA is my favorite assumption. But I
think a lot of it is about
the framing and the standard practices
of experimentation.
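The point about randomization balancing out confounders can be checked with a small simulation. This is only a sketch: the confounder (a made-up "prior activity" score), the function name `balance_gap`, and the sample sizes are assumptions for illustration, not anything shown in the talk.

```python
import random
import statistics


def balance_gap(n_units, seed=0):
    """Randomly split units 50/50 and return the treatment-minus-control
    difference in the mean of a confounder we never adjust for."""
    rng = random.Random(seed)
    # Hypothetical confounder, e.g. a prior activity score per unit.
    confounder = [rng.gauss(50, 10) for _ in range(n_units)]
    in_treatment = [rng.random() < 0.5 for _ in range(n_units)]
    treated = [c for c, t in zip(confounder, in_treatment) if t]
    control = [c for c, t in zip(confounder, in_treatment) if not t]
    return statistics.mean(treated) - statistics.mean(control)


# With few units the groups can differ noticeably by chance;
# with many units the gap shrinks toward zero.
small_gap = abs(balance_gap(50))
large_gap = abs(balance_gap(200_000))
print(f"gap with 50 units: {small_gap:.3f}")
print(f"gap with 200,000 units: {large_gap:.3f}")
```

The same logic applies to confounders nobody thought to measure, which is the "lazy" benefit described above.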
The rest of this talk really is going to
be 10 different experiments that you can
run. It's not that original; the
structure isn't mind-blowing by any means.
So we really will be going through 10
different experiment types. There
will be the basic experiments that I am
thinking probably everyone here has run
or heard of or interacted with before.
There will be the hard questions,
where we have some atypical
situations where you might need to change
how you're setting up your experiment to
handle that. And then I'll be looking at
what I'm calling jet fuel, which is
things that help you speed up your
experimentation practices and make
decisions faster.
With our basic experiments, like I said,
they're basic, but they're fundamental.
We're actually going to spend most of
our time here in the list, because these
are the things that most people
are going to potentially be working
with. Our first example,
our first type of
experiment, is the standard A/B growth
test. And this is actually what most
experiments are in practice in the
industry, right? Because a lot of times
we have folks who are doing marketing
and want to get adoption for whatever
their product is, whatever their company
is. So you might realize that I am
definitely not a marketer, because my
choices for the treatments here were
probably pretty terrible. I have the
control, which is just "Sign up today,
please." I have another test arm
that says, "Hey, if you sign up today,
you can get a small sliver of Bitcoin."
And there's a third version where I say,
"Offer expires today. We'll match your
first deposit." So these are all
potential different calls to action that
might happen on an email or a landing
page or something like that. And what we
can do here is really measure the
outcomes of those super effectively. But
I actually want people to think beyond
just the basic metrics of like, okay,
did they actually sign up? Did they
check it out? Were they a visitor to the
site? And think about impact much further
down the line than just, did they click the
button, and did they convert in this one
instance? I'd love to ask the room: does
anyone have an idea of what may be a
good additional metric to use in this situation?
>> Yeah.
>> Second transactions.
>> Second-time transactions was what he
said, right? You care about the lifetime
of that user. Did getting that email
make them a repeat, retained customer?
Was that onboarding experience something
that impacted them down the line and for
the long term? Because oftentimes,
when we're making these kinds of
experiments, it's the entry point to a
product, and that can have really big
implications beyond just, did they sign
up after seeing this?
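To make the "look past the first click" point concrete, here is a minimal sketch that runs the same two-proportion z-test on a signup metric and on a downstream second-transaction metric. All counts are invented for illustration, and the arm name `bitcoin_offer` is just a label for the example treatment described above.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (rate difference B minus A, two-sided p-value) using a
    pooled two-proportion z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_b - p_a, p_value


# Made-up counts per arm: exposed users, signups, second transactions.
control = {"n": 10_000, "signups": 500, "second_txn": 200}
bitcoin_offer = {"n": 10_000, "signups": 650, "second_txn": 195}

signup_diff, signup_p = two_proportion_z(
    control["signups"], control["n"],
    bitcoin_offer["signups"], bitcoin_offer["n"])
repeat_diff, repeat_p = two_proportion_z(
    control["second_txn"], control["n"],
    bitcoin_offer["second_txn"], bitcoin_offer["n"])

print(f"signup lift: {signup_diff:+.4f} (p = {signup_p:.4g})")
print(f"second-transaction lift: {repeat_diff:+.4f} (p = {repeat_p:.4g})")
```

In this invented data the offer clearly moves signups but shows no detectable change in second transactions, which is the kind of gap a downstream metric is meant to expose.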
>> Anyone else have any additional ideas? Yeah,
definitely.
>> No cancellation.
>> Churn or cancellation. Yes. Did we
have clickbait in our call to action, and
people clicked on it, so we achieved
that goal, but they actually churned
immediately after because they realized
maybe it wasn't as good of a deal as
they thought? We got them to do that
initial click but not follow through on
what we wanted them to do. Those are some great
ideas. I'm sure everyone has tons of
ideas; it's rare to be running an experiment
where we only care about three metrics,
right? But if we talk about every
experiment type in this detail, we'll be
here for the rest of the afternoon.
The next one I want to talk about is the
standard A/B product test. The way that
this is going to differ from the growth
test is that people are already on your
product when this is happening. It could
be something like a notification test.
Who here likes notifications on their
phone?
No volunteers. No volunteers.
That was actually probably the most
honest answer, right? Sometimes they're
really helpful, sometimes they're super
annoying. And so this is going to be
another common type of test that people
will do to understand how users
interact with their product.
I know that none of that was
probably a surprise to any of the folks
in this room, but I do want to emphasize
that this is going to be a
commonplace type of experiment, but they
can get really complex really quickly.
And there are a lot of ways to think
outside the box even with just the basic
types of experiments. For example, if we
think about things like UI changes, the
classic example, it's like, okay, it's
just a button test, right? Red button, blue
button, contrived example. But there's a lot
of things that you can do that are behind
the scenes and that have a big impact,
rather than just, oh, the UI looks
a little different now. If anyone
here's a user of Statsig, you've been a
part of one of my experiments, which is
doing different strategies for
caching and serving and querying the
different Metrics Explorer queries
you might be using. So again, it
doesn't just have to be UI. There was no
human involved in that at all. It was
just different query techniques.
Next, it's the population that you're
working on. So for the experiment I was
talking about, my population wasn't users.
I kind of just assumed that faster query
equals better, which is a pretty safe
assumption. But what that meant is
that I could use a different unit type,
with each individual query being
randomly assigned, rather than needing to use
users, which is a really typical unit
of randomization. So this could
really be anything. You could have sessions
as the unit identifier. And obviously, as
you're choosing your unit of
randomization, you're going to want to
look through your assumption list, as
whenever you're running an experiment,
like, hey, are these going to
conform to SUTVA? Is this reasonable?
What do I need to do in this situation?
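One common way to randomize on an arbitrary unit (user, session, or individual query) is to hash the unit's identifier together with an experiment name. The sketch below assumes that approach; the function `assign`, the experiment name, and the id formats are hypothetical and not taken from any particular platform.

```python
import hashlib


def assign(unit_id, experiment, arms=("control", "test")):
    """Deterministically map a unit id to an arm with a salted hash.

    The unit can be whatever you randomize on: a user id, a session id,
    or (as in the caching example) an individual query id.
    """
    key = f"{experiment}:{unit_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(arms)
    return arms[bucket]


# The same unit always lands in the same arm for a given experiment.
print(assign("query-8f3a", "cache_strategy_test"))
print(assign("session-1042", "cache_strategy_test"))
```

Changing the unit type changes what "independent" means, so the SUTVA check mentioned above has to be redone for whichever id is hashed.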
The next piece of advice for these
kinds of experiments: just don't mess
things up. Easy as that, right?
I think that it can be really
sneaky to get bugs or regressions in
something that you're testing. And
honestly, when you're running an
experiment, that's probably the biggest
value of having an experiment run: that
you catch those mistakes, or things that
are really bad for your business even
though you thought it was a great idea.
And either you can tweak the
implementation of it, or you can be like, we
better scrap this idea altogether
because it's just not working. And then
measurement is not one-size-fits-all.
Different products are going to require
different measurement, and
this is really where domain expertise
comes into play. It
isn't just a black box where any
data scientist can make good decisions
about anything; there is a level of in-depth
knowledge that really helps.
I have the third type of test up here,
and I'd love if anyone has a guess for
what type of test this is.
I'll give you a second to read it, but
I'd love to hear if anyone has a guess.
Any guesses? No volunteers? You can
just shout it out.
Something to do with
performance? That's a good
guess. This is actually a really
messed up A/A test. Obviously I
cherry-picked this example,
I'll own up to that, but I think
that it's really important to be running
A/A tests as well, to understand:
is your randomization working when there
is no real change? And also just
to get a sense of how A/A tests
can really sneak up on you, if you
have a real test running
that you think is an A/B test, but it
actually didn't really change anything
meaningfully, and it's actually an A/A
test. Those can be really sneaky, and you
might be shipping things that don't make
a difference, because with a 95%
confidence interval, which is pretty
industry standard, there's that 5%
chance that you're getting
those false positives, right? So I think
that the A/A test can be a really
powerful tool, (a) for making sure that
your randomization and telemetry are all
working correctly, and (b) just to
familiarize yourself with the concept, to
be like, wait, am I getting tricked?
Am I getting tricked by something that's
not really there?
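The 5% false positive rate mentioned above can be reproduced by simulating many A/A tests, where both arms share the same true conversion rate. This is a sketch under assumed numbers (2,000 simulated tests, 500 units per arm, a 10% base rate); none of these figures come from the talk.

```python
import math
import random


def aa_false_positive_rate(n_tests=2000, n_per_arm=500, rate=0.1, seed=1):
    """Simulate A/A tests (identical arms) and return the share that a
    two-sided z-test at the 5% level flags as significant."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_tests):
        a = sum(rng.random() < rate for _ in range(n_per_arm))
        b = sum(rng.random() < rate for _ in range(n_per_arm))
        pooled = (a + b) / (2 * n_per_arm)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
        if se == 0:
            continue  # no conversions in either arm, nothing to test
        z = ((b - a) / n_per_arm) / se
        if abs(z) > 1.96:
            flagged += 1
    return flagged / n_tests


fpr = aa_false_positive_rate()
print(f"share of A/A tests flagged as significant: {fpr:.3f}")
```

If a real platform's A/A tests are flagged much more often than about one in twenty, that points to a problem in randomization or telemetry rather than in the product.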
The fourth type of experiment that I
wanted to talk about was holdouts.
There are different ways to do
holdouts or back tests, but it's basically
this big umbrella of experiment
types where we're talking about withholding
products that you've shipped from a
certain set of the population. This
beautiful chart is actually from Etsy.
They have a really great blog called
Code as Craft, and I've really, really
enjoyed the articles on it. But this
is how Etsy runs their holdouts, right?
They're shipping things across
a quarter, and then they have a
comparison period between two untainted
samples that have not gotten any of the
shipped experiences during the quarter,
and they compare them. This
is one of the ways that you can run a
holdout, right? You're saying,
hey, how do I account for the winner's curse and
make sure my shipped experiments are
actually doing good? But I think there's
also a really interesting methodology
that we like at Statsig, which is
comparing what you're shipping to that
holdout during that Q1, because it can
help you understand what is the
total impact of my experimentation
program over time. You can also
look at this daily time series type of view,
and you can see intuitively the
impact of different rollouts as they
happen. Right? Any holdout, before
you've shipped anything, starts as an A/A
test. So it looks really reasonable
that at the start there's zero
difference between the test and control
group. Sorry, I realize I haven't
explained this visualization at all as I
got here. But basically this
visualization is comparing your test to
your control group over the different
days that are happening here. And so
when this is a holdout, you're
getting that sense of, okay, as
certain launches are happening, what
is the impact on the total population,
and what is their aggregated impact?
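The winner's curse mentioned above can be illustrated with a simulation: if you only ship experiments whose observed lift clears a significance bar, the average observed lift among the shipped ones overstates the true lift. This is a sketch with assumed numbers (a true lift of 0.5 units, measurement noise with standard deviation 1.0), not figures from the talk.

```python
import random
import statistics


def winners_curse(n_experiments=5000, true_lift=0.5, noise_sd=1.0, seed=2):
    """Every experiment has the same true lift, but we only 'ship' the ones
    whose noisy estimate looks significantly positive (z > 1.96).
    Returns the mean observed lift among the shipped experiments."""
    rng = random.Random(seed)
    shipped = []
    for _ in range(n_experiments):
        observed = rng.gauss(true_lift, noise_sd)
        if observed / noise_sd > 1.96:
            shipped.append(observed)
    return statistics.mean(shipped)


avg_shipped = winners_curse()
print("true lift per experiment: 0.5")
print(f"average observed lift among shipped experiments: {avg_shipped:.2f}")
```

A holdout re-measures the shipped changes on units that were not part of the selection step, which is why it gives a less inflated view of the total impact.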
That ends our section of the basic
experiments. And so we're going to
move on to the hard
questions. And I think that when we go
on to this next section, what I really
want to emphasize is these may not be
your hard questions, but they are
existing ones. And there's a lot of
literature out there about these kinds
of challenges that you might be facing.
So it's worth thinking about
the experimentation community and
what kind of solutions there might be
for different challenges that folks
face. One of them that I wanted to
talk about was interference.
I've talked about SUTVA a lot. My
favorite assumption, as I told you
earlier. We're basically assuming
that every unit that we're experimenting
on has a stable treatment. There's no
interference between those units. It's
nice and clean; everyone's independent
of each other. I love assuming things
are independent variables, right? So
what about when SUTVA is violated? Like if
we're putting up a billboard in a giant
city, shameless plug for
Statsig too, obviously. But
basically, when we're treating whole populations
and we can't control for these kinds of
network effects, there are a few ways
that we have to deal with this kind of
violation of SUTVA and be a little creative here.
One of them, if it could be an
experiment, maybe it is, there's some
fuzzy definitions here, but basically
using a synthetic control can be
really helpful for this type of situation,
because you don't know,
like, what if the billboard wasn't there
in that particular city? So you have a
synthetic control modeling what it would look
like, from units that aren't having
billboards, and
then compare that test result
that you're observing to the synthetic
control that you built, right? An example of
how this works would be, let's say
I'm changing something about the
vibe of Seattle, and I'm like, well,
Seattle's kind of like if you mix
SF and Boston. I would basically be like,
okay, that's what the vibe of Seattle is, and
I do something that changes the vibe
of Seattle. And so what I would do if
I'm making that experiment is I would
say, okay, well, based on how SF
and Boston are doing, that's my synthetic
control, because I can use that model and
it's going to spit out what
the vibe of Seattle would be if I had no
treatment. And that works as long as I
make sure those cities are untreated.
So it lets you be a little bit
creative in how you're measuring things,
and it works really well when you have network
effects or you don't have a lot of sample
size.
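The Seattle example can be sketched as a tiny synthetic control: fit weights on untreated donor cities so their blend tracks Seattle before the change, then use that blend as the counterfactual afterward. Everything here is an assumption for illustration: the metric values are invented, the helper `fit_two_donor_weights` is hypothetical, and plain least squares with two donors stands in for whatever weighting scheme a real analysis would use.

```python
def fit_two_donor_weights(target, donor_a, donor_b):
    """Least-squares weights (w_a, w_b) so that w_a*donor_a + w_b*donor_b
    tracks the target series, solved via the 2x2 normal equations."""
    saa = sum(a * a for a in donor_a)
    sbb = sum(b * b for b in donor_b)
    sab = sum(a * b for a, b in zip(donor_a, donor_b))
    sat = sum(a * t for a, t in zip(donor_a, target))
    sbt = sum(b * t for b, t in zip(donor_b, target))
    det = saa * sbb - sab * sab
    w_a = (sat * sbb - sbt * sab) / det
    w_b = (sbt * saa - sat * sab) / det
    return w_a, w_b


# Invented pre-period metric values; no city is treated yet.
sf_pre = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
boston_pre = [20.0, 19.0, 22.0, 21.0, 23.0, 22.0]
seattle_pre = [0.6 * s + 0.4 * b for s, b in zip(sf_pre, boston_pre)]

w_sf, w_boston = fit_two_donor_weights(seattle_pre, sf_pre, boston_pre)

# Post-period: only Seattle is treated; SF and Boston stay untreated.
sf_post = [13.0, 15.0, 14.0]
boston_post = [22.0, 24.0, 23.0]
seattle_post = [18.6, 20.6, 19.6]

synthetic_seattle = [w_sf * s + w_boston * b
                     for s, b in zip(sf_post, boston_post)]
effect = sum(obs - syn for obs, syn in zip(seattle_post, synthetic_seattle))
effect /= len(seattle_post)
print(f"weights: SF={w_sf:.2f}, Boston={w_boston:.2f}")
print(f"estimated effect on Seattle: {effect:.2f}")
```

The estimate is only as good as the requirement stated above: the donor cities must not receive the treatment, directly or through spillover.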
Sorry. We'll grab questions at the end,
but I know there's a lot to discuss
here. Obviously this is a whole
bunch of different experimental
techniques, but we can definitely chat
after, and I think there will be
time for questions at the end too.
Another way to handle this same problem is
switchbacks. This is really frequently
used in the ride share problem,
where you have two-sided markets,
or in gaming, right, where you try to match
people to play against each other.
Not only do you have the issue of
spillovers between units, but you
also have people who are on
different teams, or people who are buying
ride share
rides and people who are driving, both in
the marketplace, right? So these can
make an already tricky
problem even trickier, right? But the
example of switchbacks is really
interesting, because instead of
experimenting on units of drivers
or riders or different people playing a
video game, you're experimenting on
units of time, which might also be broken
up into different geographies, or
maybe it's different servers in the
video games example. And what you're
doing is you're swapping in your
treatment and your control in these
units of time. This obviously comes
with some new assumptions, right? You
probably don't want it to be something
super visible to the customer. So this
is something more like, oh, Uber's
pricing could be experimented on this way, or
the different matching algorithm in the
video game could be experimented on this
way, but not like the UI changed. That
would be super jarring if it's like, well,
I saw that 10 minutes ago but it's not
there anymore. There are also some
other constraints of this kind of
methodology: you have these
burn-in periods in the diagram that I'm
showing, where we actually don't attribute those to
either the test or the control, and it
works best with things that are relatively
short-term too. That's actually why the
matchmaking and the Uber examples work
really well, because when you're on the Uber
app, right, you're usually
making a quick decision of like, hey, I want
to get this ride here. Here's the price
option. Cool. Let me book it. When
you're getting matched to play a video
game, right, you enter the
gameplay screen and there's
matching that happens and you play the
game. They're all relatively short-term.
So you can cut it with this time period
and use that as your randomization.
Whereas
for a billboard, there's no way that
we're taking it down, putting it up, taking
it down, putting it up; there are just
these practical limits, right?
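A switchback design randomizes time windows (optionally per region or server) instead of users, and drops a burn-in stretch at the start of each window. The sketch below builds such a schedule; the window length, burn-in length, region names, and the function `switchback_schedule` are all assumptions chosen for illustration.

```python
import random


def switchback_schedule(n_windows, regions, window_minutes=30,
                        burn_in_minutes=5, seed=3):
    """Randomly assign each (region, time window) to control or test.

    Data between window_start and analysis_start is the burn-in period and
    is excluded from analysis, since it may still reflect the previous arm.
    """
    rng = random.Random(seed)
    schedule = []
    for region in regions:
        for w in range(n_windows):
            start = w * window_minutes
            schedule.append({
                "region": region,
                "window_start": start,
                "analysis_start": start + burn_in_minutes,
                "window_end": start + window_minutes,
                "arm": rng.choice(["control", "test"]),
            })
    return schedule


schedule = switchback_schedule(n_windows=4, regions=["seattle", "boston"])
for row in schedule:
    print(row)
```

The unit of analysis is then the region-window, so the effective sample size is the number of windows, not the number of riders or players inside them.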
Moving on to a different area, not
dealing with those kinds of violations
of SUTVA in different forms. The next one is
about elasticity.
I think that these are really cool
because they help you turn
experimentation from a tool to make one
decision into a tool to make many
decisions, right? The idea is that, let's
say that I'm improving the
performance and latency of the Metrics
Explorer queries that I was talking about
earlier. You can see load time,
but is it really an issue that
troubles people so much? If I
decrease load time, will people
actually stay there longer or
have a better experience because of
that? And how much is it really worth,
right? Should I have my engineers spend
all quarter on decreasing
latency, or should they worry about other things
like building features and all
of the myriad things they could be
working on, right? So I think that this
kind of elasticity is really powerful to
understand: yes, we want to make the
right decision on an individual case, but
also, how do we keep making the right
decisions on what to prioritize? And so
this can help answer the question: if we
have a finite amount of load time
decrease, how are other
metrics affected? Another way to be able to
achieve the same kind of thing is using a
regression test, in which I on purpose make
the product worse. This is very
controversial. I know several companies
that are just fundamentally like, you know,
we don't do regression testing,
actually. And it's very
controversial because you are making an
experience worse for people and you're
potentially causing your customers to
churn. It could really hurt your
metrics overall, right? But it's
that balance between learning and
doing the right thing. If you're
familiar with how ML works, you
probably know explore-exploit: how much
are you wanting to explore here, and how
fruitful is it going to be for your user
population if you do this kind of
experimentation, right? But it's
interesting because it does help you,
right? You get to make those
prioritization decisions really clearly
based on data instead of vibes.
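An elasticity estimate of this kind can be sketched as a dose-response fit: run arms with different amounts of added latency and fit a line through the outcome metric. The arm values, the metric (sessions per user), and the helper `ols_slope` are invented for illustration, and a straight line is an assumption that real data may not support.

```python
def ols_slope(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical degradation arms: latency added on purpose, per arm.
added_latency_ms = [0, 100, 200, 400]
sessions_per_user = [5.00, 4.90, 4.80, 4.60]

slope, intercept = ols_slope(added_latency_ms, sessions_per_user)

# If the relationship also holds for speedups (an assumption), the slope
# gives a rough value for a planned 150 ms improvement.
projected_gain = -slope * 150
print(f"change in sessions per user per added ms: {slope:.4f}")
print(f"projected gain from a 150 ms speedup: {projected_gain:.2f}")
```

A number like this is what lets latency work be weighed against feature work on the same scale.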
>> Okay. Update. The next section that I'm
talking about is jet fuel: things that help you
experiment really, really quickly. I
think that's important, and
what's nice is we actually
borrow a lot from medical
literature from the past, because when they
were running experiments there were, you
know, people at stake, right? So they have a lot
of techniques to be able to make
decisions quickly, and either really
quickly stop something that's hurting
people or really quickly ship something
that's helping people.
One example of this is SPRT, or
sequential probability ratio tests, which
are basically a different way of
thinking through the test: instead of a p-value
we have this probability ratio,
this likelihood ratio. One benefit is that
it's kind of your vegetables, in that you
really have to do a power analysis for
every single metric that you're looking at and
ask up front, okay, what is the evidence
that convinces me there's no effect?
What is the evidence that convinces me
there is an effect? We should all
be doing power analysis for
your classic frequentist experiment, but
as Dylan was talking about earlier,
that doesn't always necessarily happen; with
SPRT, it's fundamentally part of
the process. So it does make everyone
eat their vegetables. And then also it
helps you make decisions sooner in most
cases, where you're able to
quantify: do I have evidence that
this is not at all different, or
do I have evidence that there is a difference?
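For a single conversion metric, the sequential probability ratio test can be sketched with Wald's classic formulation: accumulate the log likelihood ratio of "rate is p1" versus "rate is p0" and stop when it crosses either boundary. The rates, error levels, and function name `sprt_bernoulli` below are assumptions for illustration; production implementations typically handle two-arm comparisons and more general metrics.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for a Bernoulli rate: H0 says rate = p0, H1 says rate = p1.

    Returns (decision, number of observations used). Choosing p0, p1, alpha
    and beta up front is the 'power analysis' step the test forces on you.
    """
    upper = math.log((1 - beta) / alpha)   # cross this: evidence for H1
    lower = math.log(beta / (1 - alpha))   # cross this: evidence for H0
    llr = 0.0
    for n, converted in enumerate(observations, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(observations)


print(sprt_bernoulli([1, 1, 1, 0, 1], p0=0.1, p1=0.3))
print(sprt_bernoulli([0] * 10, p0=0.1, p1=0.3))
```

The two boundaries correspond to the two questions above: enough evidence that there is an effect, or enough evidence that there is not.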
Another thing I think is great for this is
bandits. Again,
not exactly a randomized trial, but you
do know the bias that you're introducing.
The problem that it solves here is, if we
see this cumulative success rate over
time and our probability of a variant
being tested over time, it's such a bummer that
we're still making people use the circle
variant that's in the red and under
there. Like, I don't know, you can tell pretty
early on that maybe that's not the way to go.
So the great thing that
they do is say, hey, let's take that
probability of being the
best variant over time and let's use
that as the assignment probability.
I think that was actually not quite right;
the specifics differ across
different types of bandits.
But I think again, like sequential
probability ratio tests, these help you
make decisions early, but also not make
decisions too early if you're worried, right? You're
just doing differential allocation.
You're not necessarily taking
something away when there still could be
an effect.
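One common bandit that allocates traffic according to the probability of a variant being best is Thompson sampling with Beta posteriors. The sketch below uses it as a stand-in, since the talk does not say which bandit the slide showed; the true conversion rates, round count, and function name are invented.

```python
import random


def thompson_sampling(true_rates, n_rounds=5000, seed=4):
    """Beta-Bernoulli Thompson sampling.

    Each round, draw a plausible conversion rate for every variant from its
    Beta posterior and show the variant with the highest draw. Returns the
    number of times each variant was shown.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                 for i in range(k)]
        arm = draws.index(max(draws))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Hypothetical variants: the second one truly converts better.
pulls = thompson_sampling([0.05, 0.15])
print(f"times each variant was shown: {pulls}")
```

The weaker variant keeps receiving a little traffic, which matches the point above about differential allocation rather than taking a variant away outright.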
And the last type of experiment that I
wanted to talk about was interleaving
experiments, which are very,
very domain specific, right? Interleaving
experiments are for a ranking
type of situation, like if you're searching
for a product on Amazon: well, what are the
results that they show you, and in what
order, right?
So you're able to have these two
outputs of what those should be and
intersperse them, and
then be able to understand user
preference and make these decisions
faster.
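One widely used way to intersperse two rankings is team-draft interleaving: the two rankers take turns picking their best not-yet-shown result, and clicks are credited to whichever ranker contributed the clicked item. The talk does not name a specific method, so this is a sketch of that one technique with made-up item ids and hypothetical function names.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, seed=5):
    """Merge two rankings; returns (interleaved list, item -> team map)."""
    rng = random.Random(seed)
    interleaved, team = [], {}
    count_a = count_b = 0
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The team with fewer picks goes next; ties are broken by coin flip.
        a_turn = count_a < count_b or (count_a == count_b
                                       and rng.random() < 0.5)
        source, label = (ranking_a, "A") if a_turn else (ranking_b, "B")
        pick = next((item for item in source if item not in team), None)
        if pick is None:
            # This ranker has nothing new left, so the other one picks.
            source, label = (ranking_b, "B") if a_turn else (ranking_a, "A")
            pick = next(item for item in source if item not in team)
        team[pick] = label
        interleaved.append(pick)
        if label == "A":
            count_a += 1
        else:
            count_b += 1
    return interleaved, team


def credit_clicks(team, clicked_items):
    """Count clicks per ranker; the ranker with more clicks wins the query."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins


interleaved, team = team_draft_interleave(["x", "y", "z"], ["y", "w", "x"])
print(interleaved, team)
print(credit_clicks(team, clicked_items=[interleaved[0]]))
```

Because every user sees both rankers' results in one list, each query yields a direct preference signal, which is where the speed advantage comes from.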
And basically, what you're also
doing at the same time is building
understanding as a byproduct of this, and
again, it's very domain specific, but it helps you
understand your specific topic much
better. It is worth talking about
this because this is one of those
cases where I'm like, okay, you actually
get to learn really well. You
have the standard
experiment side, and you get a little
byproduct of learning from this
experiment. Okay, I've had you sitting
here for a long time. So I want to get
you all standing up for this last
little piece of participation with my 10
experiments.
Yeah, still on. And if you have done
four or fewer of these experiments,
please sit down.
>> Okay.
>> If you've done five or fewer, please sit down.
>> If you've done six or fewer, sit down. I think you
know where this is going, right?
Sorry, I'm not giving the people
in the back the chance to participate.
Everyone standing in the back,
please sit down too.
Seven or fewer, please sit down. You haven't done
all 10, okay, but I think nine is all right.
Has anyone done all 10? Raise your hand
if you've done all 10.
Okay. Can I pick on you? I
don't know, I don't want to make you, but can
I ask which ones you've done?
>> Yeah. What's your favorite?
>> Switchbacks? Switchbacks in gaming. Yeah. So,
hopefully this gives you some
inspiration. This is not exhaustive by
any means. They're just ones I really
like. And so, if you'd like to,
I'd love to also find out if you have
an extra one that I missed. But I
think we're close to time. So, if anyone
has any questions, I'd love to
address them, but also know I'll be
here and I'm happy to talk to people later, too.
>> What's your favorite?
>> Oh, well, my favorite: I really like holdouts. I
know they're basic, but there's just so
many ways that you can do them.
Like I showed you, there's the
way that Etsy does them, versus comparing
your whole experimentation program to
a control group being held out, versus
doing a back test to basically
double-confirm what you shipped. They're just
so versatile, and it's like
double-checking your homework too.
>> Talking about synthetic controls, how do
you compare them with pre-post testing?
You listed synthetic controls
under hard questions, and that could also be done
with pre-post. How do you compare those
two?
>> So the synthetic control methodology
that I'm used to using is not actually a
pre-post methodology. It's basically
using other units that are not being
treated to calculate the
counterfactual. So basically, instead
of using the pre-period, you can use
different units that you can
model into the unit that you're
experimenting on, right? And that way
what you can do is still account for
those kinds of time and seasonality
things, right? Like, I don't want to
compare Black Friday sales to
two-weeks-ago sales in a pre-post
situation. So I think that
constructing your synthetic control not
from pre-experiment data but instead
from different units that aren't being
treated can be really powerful. Yeah.
>> The diagram.
>> Yeah. Yeah. Oh, sorry. I
misunderstood what you were saying. Okay.
Okay. But I think that is cueing from
that pre-experiment period, right? If
you construct your model
based on some of the data and
confirm your model based on some of
the data, right, you can figure out
what your MSE is for that model, right?
And then the
uncertainty added to your analysis
by the fact that your control group
is a model can be taken into account, right?
We can talk later too, more in
conversation. Yeah. I think we are at
time, but again, I would love to talk to
people afterward.