This content explores a diverse range of experimental methodologies beyond standard A/B testing, emphasizing the importance of choosing the right experiment type to fit specific contexts, accelerate decision-making, and gain deeper product insights.
We have Statsig's very own data
scientist, Liz Obermyer, with a hundred
Hi everyone. I'm so excited to talk to
you all today about 10 experiments to
run before you die. I think that when we
think about experimentation, sometimes
we just think about what we do currently
or what we've done in the past, but we
actually have a lot of tools at our
disposal to do different things in
different circumstances, to speed
things up, and to use a wide variety
of techniques to fit the circumstance
that we have. Obviously,
I work at Statsig, an experimentation
company. So I want you to run more
experiments, but hopefully this talk is
compelling enough that you're willing to
look past my impure motives and keep
experimenting, try new types of
experimentation and be able to just be
really data driven in what you're doing
and be able to test everything.
Before we begin, I wanted to share this
quote attributed to George Box: "All
models are wrong, but some are useful."
This is a very controversial quote. I've
found that oftentimes when you bring this
up, people are like, "Oh, so you're
wrong all the time, and you want me to
think that it's going to be useful?"
But I think when we dig deeper into
that, what we mean by this, by something
being wrong, is we're making assumptions
and we have a certain amount of
uncertainty. It's not that it's
inherently wrong. There's just a limit
on what we know based on what we've
observed, and we've had to make some
assumptions in order to get there. And
when we're talking about models, that
means any of our causal inference
methodologies, including
experimentation. So when we're running
experiments, inherently we are making
assumptions and we're getting results
that aren't wrong per se, but they're
uncertain. And so we need to be willing
to make these assumptions and we need to
be willing to make do with this
uncertainty. And there's a
trade-off: whenever we're more
certain, it's because we've made more
assumptions to get to that answer. The
fewer assumptions we're willing to make,
the less precise our answers will be,
because those assumptions help
us narrow down and have more precision.
So, with that in mind, with me waxing
poetic about models being wrong but
still being useful, I hope that sets
the stage for what we really care
about: being able to use these models
to make decisions and to better
understand our product, and using those
goals to decide what we're willing to assume
and what levels of uncertainty we're
willing to accept.
When I talk about broader causal
inference as a landscape where
experimentation exists, I think the
natural question is why experiments in
particular instead of just other causal
models. When we talk about other
causal inference methods, it can get
really tough, right? When we use
diff-in-diff or propensity score
matching, those carry inherent
assumptions of their own, and they tend
to sit further on the side of making
assumptions so that we can be more
certain. But obviously in a lot of
academia, econometrics, and economic
research, we need to make those
assumptions because we don't have the
power of randomization.
What makes experimentation so
special is that we have this power of
randomization, where we can assign
different treatments to different users
and then measure things instead of being
stuck with observational studies. And
randomization is so powerful because we
get to use the expectation that
differences among our population will
net out, so that our treatment acts as
a really strong instrumental
variable that explains our results. And
so that's why experimentation is so nice,
and why we'll be focusing on
experimentation in particular.
I have three different kinds of
experiments that I'm going to talk about
tonight. It really is just that: 10
experiments I'd like you to try running
before you die. I know it's a
little morbid. It's a little clickbaity.
But we're going to start with our
basic experiments. And while these are
going to be the starting point for
everyone, it's not that you graduate to
more advanced experiments, you're
probably still running them a lot of the
time and they will fit your needs most
of the time. But there are certain
situations where you might deal with
things that don't normally happen and
special situations where you need to be
able to introduce more assumptions and
measure things in a different way.
Those are going to be our
hard questions that we talk about.
And then finally, we'll have some jet
fuel: things that help you go fast,
things that help you make decisions more
quickly than a normal
A/B test, where you might have a fixed
time that you're running things for.
When we're talking about our basic
experiments, like I said, these are
going to be your bread and butter and
probably everyone is running these
already. When we start talking
about A/B testing, it's interesting to
talk about what most A/B testing
experiments are, which is growth
experiments. We might have a variety
of people here who aren't in marketing,
but marketing is often a space where
measuring the difference between two
treatments is really important. Now,
obviously, I'm not a marketer, based on
the treatments I suggested: sign up
today and I'll give you
one tiny sliver of a bitcoin, or
an offer that
expires. So I'm using that as a
carrot to try to get you to click
on it. And we can measure the
difference between these kinds of
treatments, right? What we're
getting is something like a sign-up rate,
and we're able to measure how users
are moving through these new
experiences that will actually get us
an audience for our product as a whole.
Now, we not only want to measure that
we're getting people into our
experiment; there are also other
side effects that we want to measure. If
we're looking at just these metrics,
we're getting a bit of an incomplete
picture. I'd love to hear from this
audience: what other metrics would you
want to measure in a growth experiment
other than click-through rate, sign-up
rate, and the kinds of visitor
statuses that show whether
someone is interested in clicking
through here? Does anyone have any ideas?
Session time, that's a great one. Yes.
>> Customer lifetime value. That's another
great thing to measure. Yes.
Unsubscribe rate. Exactly. Something
like unsubscribe rate helps you
understand: is it just a clickbaity thing
where I got people to click, but then they
don't really stick around? Customer
lifetime value may take a long time
to measure in an experiment, but it's
really vital data about your
platform as a whole. So that's something
that you might want to measure as well.
I think you all had great examples, and
the whole story here is that it goes
beyond just the basic part of
making it further in a sign-up funnel,
clicking in, subscribing once. We really
do want to understand the full lifetime
of the product, which is where we might
use our standard A/B product tests. What
we would be doing in this
circumstance is building a new
feature, trying something with our
audience as a whole, maybe adding
notifications, and continually
trying to understand what keeps
customers there, giving them new things to
use, and building that user base rather
than just growing it. You don't want
to be pouring more water into a leaky
bucket, the same way you don't want to
have great growth tactics with nothing
to help retain users.
I think that these are really
commonplace, and they're probably what the
bulk of people are doing on a daily basis.
But that doesn't mean they're
simple and that you don't need to think
about all the different options you might
have there. One of the
major things to think about is that it's not
just UI changes, right? It might be
algorithm changes, it might be
performance changes, where you still want
to measure other impacts, because as
you make changes, inherently there will
be bugs sometimes that you might not
catch. There may be unexpected
regressions, and it isn't only when a
user is interacting with the UI that
they might have a negative or positive
experience with the product. When we
think about your population too, it
might not just be users or devices, which
we typically think of. For example, if
you're a Statsig user, you've definitely
been in one of my experiments, which run
on the query level for some of our
products, where we're trying to speed
things up on the assumption that faster
queries are better. I think we can all
agree faster is usually better, given
the same accuracy of results. So
that's another unit of experimentation
that you can think of beyond just a
user, a device, a geo. I think there's
also avoiding messing things up. A lot of
times when you're building products, you
have a vision for how they go. And if
you have a successful experiment,
it's just confirming what you already
thought, right? You thought this would
be something helpful to users, and that
happened. You might have some metrics to
prove it now and say, "Hey, we should
ship this." But it can be even more
powerful when something unexpected
happens, when there's a regression, maybe
in a certain population, and you get that
feedback loop to be able to iterate.
And then again, the reminder that
measurement is not one-size-fits-all. Like
everyone in the audience was
contributing to our previous analysis, it
wasn't just the surface-level metrics
of success of one thing that we wanted
to measure. It was the ongoing success
and health of all of our users on the
product, and not just one short-sighted thing.
I'd love for anyone to tell me what they
think this experiment is. Does anyone
have any ideas for what we might be
doing here? Yeah.
Oh, an A/A test. You know, you're right.
These are really surprising results. This
is obviously a super cherry-picked A/A
test that I grabbed for shock
value, right? But I think that it's
really helpful to be able to understand:
hey, am I reliably assigning
random users to control and test?
Are my metrics going to be sensitive and
helpful in this situation? Am I going to
get a lot of false positives, because I'm
seeing these kinds of crazy results?
It can be a really helpful barometer for
whether everything is set up right and
whether your code is working as you
intended. Like I said earlier, you need to be
able to plan for unintended consequences,
and an A/A test is a way to understand if
there are unintended consequences of your
experimental setup itself.
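
As a rough illustration of the A/A idea, here is a minimal simulation sketch. Everything in it is an assumption made for the example (standard-normal metrics, a plain two-sample z-test, the function names); it is not how any particular platform computes results. It runs many simulated A/A tests where both groups come from the same distribution and reports how often the test flags a difference anyway.

```python
import random
from statistics import NormalDist, mean, variance


def two_sided_p_value(control, test):
    """Two-sample z-test p-value for a difference in means."""
    standard_error = (variance(control) / len(control)
                      + variance(test) / len(test)) ** 0.5
    z = (mean(test) - mean(control)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(n_tests=500, users_per_group=200,
                           alpha=0.05, seed=7):
    """Run many simulated A/A tests; return the share flagged significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both groups are drawn from the same distribution: no true effect.
        control = [rng.gauss(0, 1) for _ in range(users_per_group)]
        test = [rng.gauss(0, 1) for _ in range(users_per_group)]
        if two_sided_p_value(control, test) < alpha:
            false_positives += 1
    return false_positives / n_tests


rate = aa_false_positive_rate()
print(f"share of A/A tests flagged significant: {rate:.3f}")
```

If assignment and the test are behaving, the printed share should sit near the chosen alpha; a share far above it would suggest a problem with assignment, the metric, or the analysis.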
One of my favorite experiment types is
holdouts and backtests. I think
these are becoming more and more
powerful as people continue to ship code
faster, with things like AI agents
speeding up how quickly they can produce
code and ship things. That also impacts
how much you need guardrails.
One example of a holdback being
really powerful is that you can
ship that feature you're really excited
about, and ship really rapidly, to most of
your users. But with a holdback, you
can also understand over the long term
what the health of that decision was.
And if things aren't going as you
expected after a short time, you can
change that decision and
reverse course, in case
you had a false positive, or you
didn't pay attention to a
novelty effect, or maybe the entire
result of the experiment you ran
previously was a
novelty effect. This just helps you
check your work and make sure that
you're really building things that are
valuable to people in the long term.
Another really cool outcome of
holdouts is that you can look at the
evolution over time of a pooled holdout.
It really does start as just an A/A
test if you haven't shipped anything,
and then as time goes on you can
continue to observe how your changes
stack onto each other and
understand the cumulative impact of a
team, or really the whole company.
It helps you understand the true
impact of your
experimentation program and what you've
been shipping.
When it comes to the hard questions that
we're trying to address, a lot of
the time they come up in specific
circumstances. You're going
to look at the circumstances that you're
in, and you might say, "Hey, I don't think I
can run an experiment, actually." But
there might be some assumptions you can
make, some tweaks you can make with
assignment that'll allow you to run an
experiment or allow you to use a
different methodology.
One example of that is the case where
there's interference. The
example that everyone
hears is: what if we do a marketing
campaign out of home? What if we have
billboards up? What if I'm doing radio
ads? How can I measure the impact of
that, and how can I understand if it's
truly moving the needle and what the ROI
is? This can be really tough to
measure in an experimental context
because it violates SUTVA, right? We're
not able to treat individual units
individually, and we're going to lack
sample size. If we say, oh, the
geographies that we put things up in are
the units, we're going to end up with a
lot of variance.
One really powerful tool for this is
using a synthetic control. Meta has
a really famous package called geo
testing. That's what we tend to use, but
this really is just a
description of a set of tools that you
can use to be able to synthesize
the counterfactual for what you're
trying to measure in real time with your
treatment. Right? So, a way that you can
do this is let's say I say that the vibe
of Seattle is like if I mixed Portland,
SF, and Boston all together. And I
wanted to run an experiment that I
thought would change the vibe of
Seattle. What I could do is model
Seattle's vibe after those three cities
that I just named, not do anything there,
but apply my treatment in Seattle and
be able to understand the impact by
comparing the counterfactual that I
synthesized with a model against my
observations of what actually happened
to Seattle. What that allows us to
do is not worry as much
about temporal effects. Right? If
I'm comparing this week to last week,
that might be a a perfectly fine thing
to do at some points in time. But if I'm
comparing Black Friday to the week
before, in a lot of retail environments,
that'd be crazy and it would look like
all of my out-of-home campaigns were
incredible because I had so many more
sales than the previous week, right? So,
it's all about understanding your
context, what problems you're trying to
solve, and what's the best approach
to get around your
inability to measure a true
counterfactual and your lack of sample size.
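
The Seattle example can be sketched in a few lines. This is a simplified illustration under stated assumptions: the "vibe" scores are made-up numbers, the helper names are hypothetical, and the weights come from plain unconstrained least squares on the pre-treatment period. Classic synthetic control usually constrains the weights to be non-negative and sum to one, which this sketch swaps out for simplicity.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_donor_weights(donors_pre, target_pre):
    """Least-squares weights so weighted donors track the target pre-treatment."""
    k = len(donors_pre)
    gram = [[sum(a * b for a, b in zip(donors_pre[i], donors_pre[j]))
             for j in range(k)] for i in range(k)]
    moment = [sum(a * y for a, y in zip(donors_pre[i], target_pre))
              for i in range(k)]
    return solve_linear_system(gram, moment)


def synthetic_series(donors, weights):
    """Weighted combination of donor series, one value per time step."""
    return [sum(w * series[t] for w, series in zip(weights, donors))
            for t in range(len(donors[0]))]


# Hypothetical weekly scores before the treatment (illustrative only).
portland_pre = [10, 12, 11, 13, 12, 14, 13, 15]
sf_pre = [20, 19, 21, 22, 20, 23, 21, 24]
boston_pre = [15, 15, 17, 14, 16, 18, 15, 17]
donors_pre = [portland_pre, sf_pre, boston_pre]
seattle_pre = synthetic_series(donors_pre, [0.5, 0.3, 0.2])

weights = fit_donor_weights(donors_pre, seattle_pre)

# After treatment: donors are untouched, Seattle gets a simulated lift of 3.
donors_post = [[16, 15, 17], [25, 23, 26], [18, 19, 17]]
seattle_post = [v + 3.0 for v in synthetic_series(donors_post, [0.5, 0.3, 0.2])]
counterfactual = synthetic_series(donors_post, weights)
estimated_lift = (sum(o - c for o, c in zip(seattle_post, counterfactual))
                  / len(seattle_post))
print(weights, estimated_lift)
```

Because the synthetic Seattle is built from the same weeks as the observed Seattle, a seasonal spike like Black Friday shows up in both and largely cancels out of the comparison.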
Another way that you can deal with this
same issue is using switchbacks.
This was really made famous by Uber and
Lyft, which tend to use it to test
different algorithm changes because they
have a marketplace, right? Any treatment
that you provide to the rider is going
to impact the drivers, which in turn
impacts other riders, and vice
versa: when you treat the drivers, it
impacts their riders, which in turn
impacts other drivers. So you have this
inextricable network effect
where you might want to use switchbacks.
The reason this works so well in
Uber and Lyft's case is that
when you go to that app, it's a very
temporal thing that you're doing,
right? You're calling an Uber, you
decide to take it or you decide to
decline it, right? It's all within a
specified time frame. So you can have
this setup of treatment periods and
burn-out periods, where you really have
this time-centric view of flipping back and
forth between your test and control. It
doesn't work as well when we talk about
a billboard because that's a lot harder
to flip back and forth and people aren't
going to be exposed in a really direct
way and measurable way in that example.
So when we have these different contexts
with the same underlying issue, we
have these different approaches at our
fingertips, and it really depends
on the context what we might choose to
do in a situation.
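
To make the switchback setup concrete, here is a minimal sketch, assuming fixed-length time windows that are each randomly assigned to test or control, plus a burn period at the start of each window whose events are excluded from analysis. The function names, window length, and burn length are all hypothetical choices for the example.

```python
import random
from datetime import datetime, timedelta


def build_schedule(start, n_windows, window_minutes, seed=0):
    """Randomly assign each consecutive time window to an arm."""
    rng = random.Random(seed)
    schedule = []
    for i in range(n_windows):
        window_start = start + timedelta(minutes=i * window_minutes)
        schedule.append((window_start, rng.choice(["control", "test"])))
    return schedule


def arm_for_event(event_time, schedule, window_minutes, burn_minutes):
    """Return the arm for an event, or None if it should be excluded."""
    for window_start, arm in schedule:
        window_end = window_start + timedelta(minutes=window_minutes)
        if window_start <= event_time < window_end:
            if event_time < window_start + timedelta(minutes=burn_minutes):
                return None  # still carrying over from the previous window
            return arm
    return None  # outside the experiment entirely


start = datetime(2024, 1, 1, 8, 0)
schedule = build_schedule(start, n_windows=6, window_minutes=30)
for window_start, arm in schedule:
    print(window_start.strftime("%H:%M"), arm)
```

The whole marketplace sees one arm at a time, so riders and drivers cannot contaminate each other across arms within a window; the burn period drops the minutes where the previous arm's effects may still linger.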
Another thing that's really exciting for
dealing with different problems is that
same problem of ROI that we were talking
about with, say, a marketing campaign. We
can bring that same concept of ROI
to investments that you make in your
product. It takes real engineering
time to drive down the latency of
something, right? It takes real
effort to build new features. So when
you have something that you can model
with an elasticity test, it can be
really powerful in helping you understand
how much bang for your buck you get when
you're making these big investments. In
this example, if we talk about
decreasing load time, which I think we
can all agree is helpful, we might look
at our user metrics to see how helpful
it really is, and be able to understand
how much we want to invest here.
There is also a
twist on elasticity tests, which is that you
can do a regression experiment. These
are pretty controversial, actually. I
know there are some companies that just
don't do regression tests. I know
Snapchat just doesn't believe in them.
Why would we make a user's experience
worse on purpose? But the trade-off
there is that you get a lot of
information from making a user's
experience worse in this case. When we
think about load time, maybe it's not so
bad, right? It's a little annoying, but
I'm not going to stop using something
because it takes 10 seconds to
load instead of 9.87, right? So you
can model this elasticity of
behavior, hopefully without churning
your users. And so that's the
trade-off when you're trying to think
about running a regression test: is it
worth what you learn to potentially have
negative business impacts while you run it?
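
One simple way to summarize an elasticity or regression test is to fit a line through the metric at each level of injected slowdown. The sketch below assumes hypothetical arms with 0, 100, 200, and 300 ms of added latency and made-up conversion rates; the numbers and names are purely illustrative, and a straight line is itself an assumption about how users respond.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical experiment arms: added latency and observed conversion rate.
added_latency_ms = [0, 100, 200, 300]
conversion_rate = [0.100, 0.098, 0.096, 0.094]

slope, intercept = ols_slope_intercept(added_latency_ms, conversion_rate)
print(f"conversion change per 100 ms of added latency: {slope * 100:+.4f}")
```

Read in reverse, the slope is an estimate of what shaving latency might be worth, which is the return-on-investment question the test is meant to answer.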
I'd also like to finish by talking about
some of our options for jet fuel, to be
able to make decisions faster. Right?
I'm sure we'd all love to make our
experimental decisions in an instant,
but it's not as simple as that, right?
I think one really great technique for
this is using SPRT, the sequential
probability ratio test, because what you
can do is define your likelihoods ahead of
time: what you'd find believable enough to
accept the null hypothesis or accept the
alternative hypothesis. In
traditional frequentist testing,
you're looking at confidence intervals,
but you don't have that true negative
case of, hey, there's actually no
change here. SPRT helps you quantify a
likelihood where, if you observe
certain things, you're pretty convinced
that there is no difference, right? It's
a way to formalize that kind of
understanding, which a
frequentist analysis doesn't offer.
And when you're formalizing that ahead
of time, it can let you make those
decisions earlier, because when you're
trying to do no harm, it's less of an
interpretation after the fact and
more of a ground truth that you can
measure, based on assumptions you're
willing to make beforehand and
thresholds you're willing to set.
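
For a sense of the mechanics, here is a minimal sketch of Wald's SPRT on a single stream of conversions. It is a simplification: a real A/B setup compares two groups, and the hypothesized rates `p0` and `p1` and the error rates `alpha` and `beta` below are assumptions chosen for the example. The thresholds are fixed before any data arrives, and the test stops as soon as the running log-likelihood ratio crosses either one.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for H0: rate = p0 versus H1: rate = p1.

    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr = 0.0
    for n, converted in enumerate(observations, start=1):
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(observations)


# A long run of non-conversions is enough to conclude "no lift" early.
print(sprt_bernoulli([0] * 30, p0=0.1, p1=0.2))
# Two observations are not enough to decide either way.
print(sprt_bernoulli([1, 0], p0=0.1, p1=0.2))
```

The "accept_h0" outcome is the formal "there's actually no change here" conclusion described above, reached without waiting for a fixed sample size.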
Another really exciting technique that I
really like is using a bandit
approach. There are a lot of options
when it comes to using a multi-armed
bandit, but at the end of the day, the
point is you can learn really quickly
when something doesn't work or when
something starts to look a lot better
than the others. And it's really painful
to be serving people a subpar experience
when you already kind of know what
works. So what bandits do
that's really powerful is this: if we're
tracking the probability that a certain
variant is the best over time, isn't it
great to be able to set our rate of
assigning people to that group to the
same value? This is just one example,
Thompson sampling. There are
many different approaches, but at its
core, it's understanding that you really
don't want to send people to a variant
that you already know isn't the best.
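
A small sketch of that idea, assuming binary conversions and a Beta(1, 1) prior on each variant (both assumptions for the example, as are the function names): estimate each variant's probability of being best by sampling from the posteriors, then use those probabilities as the assignment rates.

```python
import random


def probability_best(arms, draws=5000, seed=11):
    """Monte Carlo estimate of P(variant is best) from Beta posteriors.

    arms is a list of (successes, failures) tuples, one per variant.
    """
    rng = random.Random(seed)
    wins = [0] * len(arms)
    for _ in range(draws):
        samples = [rng.betavariate(s + 1, f + 1) for s, f in arms]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


def assign_variant(allocation, rng):
    """Pick a variant index with probability equal to its allocation."""
    return rng.choices(range(len(allocation)), weights=allocation, k=1)[0]


# Hypothetical running totals: variant 0 is clearly ahead of variant 1.
allocation = probability_best([(90, 10), (10, 90)])
print(allocation)
print(assign_variant(allocation, random.Random(0)))
```

As evidence accumulates, traffic to a clearly losing variant shrinks toward zero on its own, which is the "don't keep serving the subpar experience" behavior described above.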
I want to leave you with my 10th
and final
experiment: interleaving experiments. I
really see this being exciting for AI
use cases, because when we typically
think of interleaving experiments, we
think about search, we
think about ranking cases, and we
think about interleaving those results. But
if we think about the N equals one case
of this, it's just asking people to
choose between two options. And I think
that's a really powerful use case that
we've already seen different LLM
providers take to get
feedback on their models, right? It's
the N equals one case of interleaving
experiments: you randomize the
order that you show things in and
understand how users click
on things and what their downstream
behavior is for those different arms
of the experiment. Maybe more
important for search and ranking, a
byproduct of that is getting a search
cost for users. It's maybe not so deep in
the N equals one case, but it gets to be
really important when you think
about search engines that want to know:
are people actually going to scroll? Are
people actually going to move to the
second page of search results? And
they want to understand what that
actually looks like in a user base.
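
One common way to build an interleaved list is team-draft interleaving; the talk does not name a specific method, so treat this as one illustrative option with hypothetical function names. Two rankers take turns picking their best remaining result, a coin flip decides who picks first whenever they are tied, and each click is credited to the ranker that contributed the clicked item.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings; return the merged list and each item's team."""
    rankings = {"A": ranking_a, "B": ranking_b}
    all_items = set(ranking_a) | set(ranking_b)
    interleaved, team_of = [], {}
    count = {"A": 0, "B": 0}
    while len(interleaved) < len(all_items):
        if count["A"] != count["B"]:
            team = "A" if count["A"] < count["B"] else "B"
        else:
            team = rng.choice(["A", "B"])  # coin flip when tied
        candidates = [d for d in rankings[team] if d not in team_of]
        if not candidates:
            # This ranker has nothing left to offer; the other one picks.
            team = "B" if team == "A" else "A"
            candidates = [d for d in rankings[team] if d not in team_of]
        pick = candidates[0]
        interleaved.append(pick)
        team_of[pick] = team
        count[team] += 1
    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per ranker based on who contributed each item."""
    credit = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            credit[team_of[item]] += 1
    return credit


merged, team_of = team_draft_interleave(
    ["a", "b", "c"], ["c", "a", "d"], random.Random(3))
print(merged, credit_clicks([merged[0]], team_of))
```

With a single slot per ranker, this reduces to the N equals one case: show two options in random order and record which one the user chooses.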
So, with that in mind, we have 10
different types of experiments that we
talked about. I'd love to get an
understanding of how many of these
experiments people actually are running
or have run. I'd love to get the
audience standing up. If everyone could
stand up, that'd be great.
I do actually want you to stand up.
Yeah, I know. I know. And then if
you've done four or fewer of these
experiments, please sit down.
If you've done six or fewer, please sit down.
Okay. If you've done seven or fewer, sit down.
Eight or fewer,
nine, 10. Have you guys all done all 10?
Nice. Okay. I'd love to ask the people
who are still standing, what's your
favorite experiment from this list?
>> I'd say the sequential testing.
>> Sequential testing.
>> SPRT. Yeah.
>> Yeah.
>> Why do you like it so much?
>> A lot of people think that you
really need to wait till the end to make
certain decisions. They're
really not paying attention to how much
harm a certain experience is doing,
just because they want to stay
statistically rigorous, and they're not
really aware that this is also
statistically rigorous.
>> Yeah, I think it's really powerful
because it forces you to really do the
work ahead of time of setting up a power
analysis, understanding your metrics,
defining what you care about, which
hopefully we're all doing for all of our
experiments. But when we think
about it practically, that
level of care and rigor isn't always
applied to a standard A/B test. But SPRT
definitely forces you to do that because
it's part of the methodology to do it.
>> Totally. Early stopping is very
important when warranted.
>> Yeah.
>> Yep. Thank you.
>> Would you mind passing it to the woman
to your right who also had done all of
them? What's your favorite?
>> I don't think I have a favorite. I
like that when you are testing a
product, you want to intermix the
different experiments so that you don't
get stuck in a conclusion and
go down a rabbit hole. So I
don't have a favorite.
>> It's the classic data scientist answer,
I feel like, of "it depends," but it is so
accurate because it does depend on
context, right?
>> Yeah. Yeah.
Okay. I think that's it for me.
Thank you so much for your time. I