0:02 We have Statsig's very own data
0:05 scientist, Liz Obermyer, with a hundred
0:14 Hi everyone. I'm so excited to talk to
0:16 you all today about 10 experiments to
0:19 run before you die. I think that when we
0:21 think about experimentation, sometimes
0:24 we just think about what we do currently
0:26 or what we've done in the past, but we
0:27 actually have a lot of tools at our
0:31 disposal to be able to do uh different
0:33 things in different circumstances to be
0:36 able to speed things up and use a wide
0:38 variety of techniques to fit the
0:41 circumstance that we have. Um, obviously
0:43 I work at Statsig, an experimentation
0:46 company. So I want you to run more
0:49 experiments, but hopefully this talk is
0:50 compelling enough that you're willing to
0:53 look past my impure motives and keep
0:55 experimenting, try new types of
0:58 experimentation and be able to just be
1:01 really data driven in what you're doing
1:04 and be able to test everything. Um,
1:06 before we begin, I wanted to share this
1:09 quote attributed to George Box. All
1:11 models are wrong, but some are useful.
1:14 This is a very controversial quote. I
1:16 found uh oftentimes when you bring this
1:18 up, people are like, "Oh, so you're
1:20 wrong all the time, and you want me to
1:22 think that it's going to be useful?" Um,
1:24 but I think when we dig deeper into
1:27 that, what we mean by this, by something
1:30 being wrong, is we're making assumptions
1:31 and we have a certain amount of
1:33 uncertainty. It's not that it's
1:36 inherently wrong. There's just a limit
1:37 on what we know based on what we've
1:39 observed, and we've had to make some
1:42 assumptions in order to get there. And
1:44 when we're talking about models, that
1:46 means any of our causal inference
1:48 methodologies, including
1:51 experimentation. So when we're running
1:53 experiments, inherently we are making
1:56 assumptions and we're getting results
1:58 that aren't wrong per se, but they're
2:01 uncertain. And so we need to be willing
2:03 to make these assumptions and we need to
2:05 be willing to make do with this
2:06 uncertainty. And we're going to have a
2:09 tradeoff whenever we're going to be more
2:11 certain. It's because we've made more
2:14 assumptions to get to that answer. The
2:16 fewer assumptions we're willing to make,
2:18 the less we're going to have precise
2:21 answers because those assumptions help
2:23 us kind of narrow down and have more
2:25 precision.
2:27 So, with that in mind, with me waxing
2:30 poetic about, you know, models being
2:32 wrong, but them still being useful, um,
2:34 I hope that that kind of sets the stage
2:36 for what we really care about. And
2:38 that's being able to use these to make
2:41 decisions with and to better understand
2:44 our product and use those goals to
2:46 understand what we're willing to assume
2:48 and what levels of uncertainty we're
2:51 willing to accept.
2:53 When I talk about broader causal
2:56 inference as a landscape where
2:58 experimentation exists, I think the
3:00 natural question is why experiments in
3:03 particular instead of just other causal
3:05 models. I think that when we talk about
3:08 other causal inference, it it can get
3:10 really tough, right? When we have
3:13 diff-in-diff or propensity score matching,
3:15 um those are inherently assumptions in
3:17 and of themselves and it tends to sit
3:20 further on the uh side of making
3:22 assumptions so that we can be more
3:24 certain. But obviously in a lot of
3:27 academia and a lot of econ econometrics
3:30 and you know economic research we need
3:32 to make those assumptions because we
3:34 don't have the power of randomization.
3:37 That's what makes experimentation so
3:39 special is because we have this power of
3:42 randomization where we can assign
3:44 different treatments to different users
3:47 and then measure things instead of being
3:50 stuck with observational studies. And
3:52 randomization is so powerful because we
3:55 get to use the expectation that
3:58 differences among our population will
4:02 net out that we have this treatment
4:03 which is a really strong instrumental
4:07 variable that explains our results. And
4:09 so that's why experimentation is so nice
4:11 and in particular we'll be focusing on
4:14 experimentation.
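
As a rough illustration of the randomization described above, the sketch below assigns units to variants with a salted hash, so the same unit always lands in the same group while the split across units is close to even. The function name `assign_variant`, the experiment name `signup_offer`, and the user ID format are hypothetical, and this is not a description of how Statsig or any particular platform assigns users.

```python
import hashlib

def assign_variant(unit_id: str, experiment: str, variants=("control", "test")) -> str:
    """Deterministically map a unit (user, device, query, ...) to a variant."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]

# Assign 10,000 hypothetical users and count how many land in each group.
counts = {"control": 0, "test": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "signup_offer")] += 1
print(counts)
```

Because the assignment depends only on the unit ID and the experiment name, it does not depend on anything about the user, which is what lets pre-existing differences net out in expectation.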
4:15 I have three different kinds of
4:17 experiments that I'm going to talk about
4:19 tonight. It really is just that. 10
4:21 experiments I'd like you to try running
4:25 before you die. Um, and I know it's a
4:27 little morbid. It's a little clickbaity.
4:30 Um, but we're going to start with our
4:33 basic experiments. And while these are
4:34 going to be the starting point for
4:36 everyone, it's not that you graduate to
4:38 more advanced experiments, you're
4:40 probably still running them a lot of the
4:42 time and they will fit your needs most
4:44 of the time. But there are certain
4:47 situations where you might deal with
4:49 things that don't normally happen and
4:51 special situations where you need to be
4:54 able to introduce more assumptions and
4:56 measure things in a different way. And
4:57 so those are going to be our
5:01 hard questions that we talk about. Um,
5:02 and then finally, we'll have some jet
5:05 fuel. Things that help you go fast,
5:06 things that help you make decisions more
5:10 quickly, um, instead of a a normal, uh,
5:12 AB test where you might have a fixed
5:15 time that you're running things for.
5:16 When we're talking about our basic
5:18 experiments, like I said, these are
5:19 going to be your bread and butter and
5:22 probably everyone is running these
5:25 already. I think when we start talking
5:27 about AB testing, it's interesting to
5:30 talk about what most AB testing
5:32 experiments are, which is growth
5:34 experiments. Um, we might have a variety
5:36 of people who aren't in marketing here,
5:39 but often marketing is a space where
5:41 measuring the difference between two
5:43 treatments is really important. Now,
5:45 obviously, I'm not a marketer based on
5:47 my treatments that I suggested where you
5:50 can sign up today. I'll give you, you
5:54 know, one tiny sliver of a bitcoin or uh
5:56 you know, I'll I or have an offer that
6:00 expires. So, I'm kind of using that as a
6:03 a a carrot to try to get you to uh click
6:05 on it. Um and we can measure the
6:07 difference between these kinds of
6:09 treatments, right? And what we're
6:11 getting is kind of like a sign up rate
6:14 and we're able to measure like how users
6:16 are moving through these new kind of uh
6:19 experiences that will actually get us to
6:21 have an audience for our product as a
6:22 whole.
6:25 Now, we not only want to measure that
6:27 we're getting people into our
6:29 experiment, but there's also the other
6:31 side effects that we want to measure. If
6:33 we're looking at just these metrics,
6:34 we're getting a bit of an incomplete
6:36 picture. I'd love to hear from this
6:38 audience. What other metrics would you
6:42 want to measure in a growth experiment
6:44 other than just click-through rate, sign
6:46 up rate, um, and these kind of visitor
6:49 statuses that we have that show if
6:51 someone is interested in clicking
6:53 through here? Does anyone have any ideas
7:02 Session time, that's a great one. Yes.
7:04 >> Customer lifetime value. That's another
7:07 great thing to measure. Yes.
7:10 Unsubscribe rate. Exactly. Uh some of
7:12 these like unsubscribe rate helps you
7:14 understand is it just a clickbaity thing
7:17 that I got people to click but then they
7:19 don't really stick around. Customer
7:22 lifetime value that may take a long time
7:24 to measure in an experiment but it's
7:27 really vital you know data about your
7:29 platform as a whole. So that's something
7:30 that you might want to measure as well.
7:33 I think you guys had great examples and
7:36 kind of the whole story of this is it's
7:39 beyond just doing the basic part of
7:41 making it further in a signup funnel
7:44 clicking in subscribing once we really
7:46 do want to understand the full lifetime
7:48 of the product which is where we might
7:51 use our standard AB product tests and
7:53 what we would be doing in this
7:55 circumstance is you know building a new
7:57 feature using something with our
7:58 audience as a whole maybe adding
8:02 notifications and kind of continually
8:04 trying to understand what keeps
8:06 customers there, give them new things to
8:10 use and build that user base uh other
8:13 than just growing it. Um you don't want
8:15 to be pouring more water into a leaky
8:16 bucket the same way you don't want to
8:19 have great growth tactics with nothing
8:22 to help users retain.
8:24 I think that these are really common
8:26 place and they're probably what the bulk
8:29 of people are doing on a daily basis.
8:30 But that doesn't mean that they're just
8:32 simple and you don't need to think about
8:34 all the different options you might have
8:38 there. Um, I think one of the, you know,
8:40 major things to think about is it's not
8:42 just UI changes, right? It might be
8:45 algorithm changes, it might be
8:47 performance changes that you still want
8:49 to measure other impacts of because as
8:52 you make changes, inherently there will
8:54 be bugs sometimes that you might not
8:55 catch. there may be unexpected
8:58 regressions and it isn't just when any
9:00 user is interacting with the UI that
9:02 they might have a negative or positive
9:04 experience with the product. Uh when we
9:06 think about your population too, it
9:08 might not just be users or devices which
9:11 we typically think of. For example, if
9:13 you're a Statsig user, you've definitely
9:15 been in one of my experiments, which is
9:19 on the query level for some of our
9:21 products where we're trying to speed
9:23 things up and assuming that faster
9:24 queries are better. I think we can all
9:27 agree faster is usually better uh with
9:31 the same results accuracy. Um and so
9:33 that's another unit of experimentation
9:36 that you can think of beyond just a
9:40 user, a device, a geo. Um, I think also
9:43 avoiding messing things up. A lot of
9:45 times when you're building products, you
9:48 have a vision for how they go. And if
9:50 you have a successful experiment, it
9:52 it's just confirming what you already
9:53 thought, right? You thought this would
9:56 be something helpful to users and that
9:58 happened. You might have some metrics to
10:00 prove it now and say, "Hey, we should
10:02 ship this." But it can be even more
10:04 powerful when something unexpected
10:06 happens when there's a regression maybe
10:08 in a certain population and you get that
10:12 feedback loop to be able to iterate.
10:14 And then again the reminder that
10:16 measurement is not one-size-fits-all like
10:17 everyone was contributing in the
10:20 audience to our previous analysis. It it
10:23 wasn't just the surface level metrics
10:25 that we wanted to measure of success of
10:27 one thing. It was the ongoing success
10:29 and health of all of our users on the
10:32 product and not just one short-sighted
10:35 thing.
10:38 I'd love for anyone to tell me what they
10:40 think this experiment is. Does anyone
10:42 have any ideas for what we might be
10:44 doing here? YEAH.
10:49 OH, an AA test. You know, you're right.
10:51 It It's really surprising results. This
10:53 is obviously a super cherrypicked AA
10:56 test that I kind of grabbed for shock
10:58 value, right? But I think that it's
11:00 really helpful to be able to understand
11:03 like, hey, am I reliably assigning
11:07 random users to their control and test?
11:10 Are my metrics going to be sensitive and
11:12 helpful in this situation? Am I going to
11:14 get a lot of false positives because I'm
11:16 seeing these kinds of crazy results? Um,
11:18 it can be a really helpful barometer for
11:21 if everything is set up right and if
11:24 your uh code is working as you intended
11:26 to. Like I said earlier, you need to be
11:29 able to plan for unintended consequences
11:31 and an AA test is a way to understand if
11:33 there's unintended consequences of your
11:37 experimental setup itself.
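
One way to make the A/A idea concrete is to simulate it: run many tests where both groups get the same experience and check how often the analysis still calls a significant difference. The sketch below assumes a two-sided two-proportion z-test and a made-up 10% conversion rate; the function names and parameters are hypothetical. If the setup is sound, the false positive rate should land near the chosen alpha.

```python
import random
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa_false_positive_rate(runs=500, n=500, base_rate=0.1, alpha=0.05, seed=7):
    """Run many A/A tests where both groups share the same true rate."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        a = sum(rng.random() < base_rate for _ in range(n))
        b = sum(rng.random() < base_rate for _ in range(n))
        if two_proportion_p_value(a, n, b, n) < alpha:
            false_positives += 1
    return false_positives / runs

fp_rate = simulate_aa_false_positive_rate()
print(f"Share of A/A tests flagged as significant: {fp_rate:.3f}")
```

A rate far above alpha on real A/A data would point at a problem with assignment, logging, or the metric definition rather than with the product.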
11:39 One of my favorite experiment types is
11:42 hold outs and back tests. And I think
11:44 these are becoming more and more
11:48 powerful as people continue to ship code
11:50 faster with things like AI agents
11:52 speeding up how quickly they can produce
11:55 code and ship things. It it also impacts
11:58 how much you need to have guard rails.
12:02 Uh one example of a a holdback being
12:04 really powerful is you can basically
12:06 ship that feature you're really excited
12:08 about and ship really rapidly to most of
12:11 your users. But with a holdback, you
12:14 can also understand over the long term
12:16 what the health of that decision was.
12:18 And if things aren't going as you
12:20 expected to after a short time, you can
12:24 change that decision and be able to um
12:26 you know reverse course in the case that
12:28 you may have had a false positive or you
12:31 may have uh not paid attention to a
12:33 novelty effect or maybe the entire
12:34 experiment that you run previously was a
12:36 novelty effect. This just helps you kind
12:38 of check your work and make sure that
12:40 you're really building things that are
12:43 long-term valuable to people.
12:46 Another really cool uh outcome of
12:48 holdouts is that you can look at the
12:51 evolution over time of a pooled holdout
12:54 where they really do start as just an AA
12:56 test if you haven't shipped anything
12:58 and then as time goes on you can
13:01 continue to observe how your changes
13:03 kind of stack onto each other and
13:05 understand the cumulative impact of a
13:09 team uh or or the whole company really
13:11 and it helps you understand the true
13:12 impact of your
13:14 experimentation program and what you've
13:17 been shipping.
13:19 When it comes to our hard questions that
13:21 we're trying to address, uh, a lot of
13:23 the times it's going to be in specific
13:27 circumstances. And this kind of is going
13:30 to be something where, uh, you're going
13:32 to look at the circumstances that you're
13:33 in. You might say, "Hey, I don't think I
13:35 can run an experiment actually." But
13:37 there might be some assumptions you can
13:38 make, some tweaks you can make with
13:41 assignment that'll allow you to run an
13:43 experiment or allow you to use a
13:45 different methodology.
13:47 One example of that is in the case where
13:49 there's interference. I think the
13:51 really, you know, example that everyone
13:53 hears is what if we do a marketing
13:55 campaign out of home? What if we have
13:58 billboards up? What if I am doing radio
14:00 ads? How can I measure the impact of
14:03 that and how can I understand if it's
14:05 truly moving the needle and what the ROI
14:07 is? And this can be really tough to
14:10 measure in an experimental context
14:13 because that violates SUTVA, right? We're
14:15 not able to treat individual units
14:18 individually and we're going to lack
14:21 sample size. If we say, oh, just the
14:23 geographies that we put things up in are
14:25 the units. We're going to end up with a
14:27 lot of variance.
14:29 One really powerful tool for this is
14:32 using a synthetic control. Um, Meta has
14:33 a really famous package called geo
14:35 testing. That's what we tend to use, but
14:38 this really is just a
14:41 description of a set of tools that you
14:45 can use to be able to synthesize
14:48 the counterfactual for what you're
14:50 trying to measure in real time with your
14:52 treatment. Right? So, a way that you can
14:56 do this is let's say I say that the vibe
15:00 of Seattle is like if I mixed Portland,
15:03 SF, and Boston all together. And I
15:04 wanted to run an experiment that I
15:07 thought would change the vibe of
15:10 Seattle. What I could do is model
15:13 Seattle's vibe after those three cities
15:17 that I just named. not do anything there
15:20 but change my treatment in Seattle and
15:23 be able to understand the impact by
15:25 comparing my counterfactual that I
15:28 synthesized with a model and my
15:30 observations of what actually happened
15:33 to Seattle. And what that allows us to
15:37 do is kind of uh be able to not worry
15:40 about um uh temporal effects. Right? If
15:42 I'm comparing this week to last week,
15:44 that might be a a perfectly fine thing
15:47 to do at some points in time. But if I'm
15:49 comparing Black Friday to the week
15:51 before, in a lot of retail environments,
15:53 that'd be crazy and it would look like
15:55 all of my out-of-home campaigns were
15:58 incredible because I had so much more
16:00 sales than the previous week, right? So,
16:02 it's all about understanding your
16:04 context, what problems you're trying to
16:06 solve for, and what's the best approach
16:08 to use to kind of get around your
16:11 inability to uh measure a true
16:12 counterfactual and not have the sample
16:15 size.
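
The Seattle example can be sketched in a few lines. The code below fits non-negative weights that sum to one over three donor cities so that their weighted mix reproduces Seattle's pre-treatment metric, then uses those weights to synthesize the counterfactual after treatment. All numbers are made up for illustration, the grid search is a simple stand-in for the optimizers real synthetic control packages use, and the function names are hypothetical. The sketch is written for exactly three donors.

```python
from itertools import product

def fit_donor_weights(treated_pre, donors_pre, steps=20):
    """Grid-search weights (>= 0, summing to 1) over exactly three donors."""
    best_weights, best_error = None, float("inf")
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        weights = (i / steps, j / steps, (steps - i - j) / steps)
        # Squared error between the treated unit and the weighted donor mix.
        error = sum(
            (y - sum(w * donor[t] for w, donor in zip(weights, donors_pre))) ** 2
            for t, y in enumerate(treated_pre)
        )
        if error < best_error:
            best_weights, best_error = weights, error
    return best_weights

def synthesize(weights, donors):
    """Weighted mix of donor series, one value per time period."""
    return [sum(w * d[t] for w, d in zip(weights, donors)) for t in range(len(donors[0]))]

# Made-up weekly metric values before the campaign (pre) and during it (post).
portland_pre, sf_pre, boston_pre = [10, 12, 11, 13], [20, 22, 21, 25], [15, 15, 16, 18]
seattle_pre = [13.75, 15.25, 14.75, 17.25]
portland_post, sf_post, boston_post = [12, 13], [24, 26], [17, 18]
seattle_post = [18.25, 19.5]

weights = fit_donor_weights(seattle_pre, [portland_pre, sf_pre, boston_pre])
counterfactual = synthesize(weights, [portland_post, sf_post, boston_post])
estimated_lift = [obs - cf for obs, cf in zip(seattle_post, counterfactual)]
print(weights, counterfactual, estimated_lift)
```

Because the donor cities are observed over the same weeks as Seattle, a seasonal spike such as Black Friday shows up in the counterfactual too, instead of being credited to the campaign.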
16:17 Another way that you can deal with this
16:20 same issue is using switchbacks. Um,
16:22 this is really made famous by Uber and
16:25 Lyft, which tend to use this to test
16:27 different algorithm changes because they
16:29 have a marketplace, right? Any treatment
16:31 that you provide to the rider is going
16:33 to impact the drivers which in turn
16:35 impacts other riders as well as vice
16:37 versa. When you treat the drivers, it
16:39 impacts their riders which in turn
16:42 impacts other drivers. So you have this
16:45 kind of inextricable uh network effect
16:48 where you might want to use switchbacks.
16:50 Um,
16:53 the reason this works so well or I guess
16:57 in Uber and Lyft's case is because uh
16:59 when you go to that app, it's a very uh
17:01 temporal thing that you might be doing,
17:03 right? You're calling an Uber, you
17:05 decide to take it or you decide to
17:07 decline it, right? It's all within a
17:09 specified time frame. So you can have
17:12 this setup of treatment periods and
17:15 burnouts where you can really have this
17:17 time centric view of flipping back and
17:20 forth between your test and control. It
17:22 doesn't work as well when we talk about
17:24 a billboard because that's a lot harder
17:26 to flip back and forth and people aren't
17:29 going to be exposed in a really direct
17:32 way and measurable way in that example.
17:34 So when we have these different contexts
17:38 with kind of the same uh uh issue, we
17:39 have these different approaches at our
17:42 fingertips and it kind of really depends
17:44 on the context what we might choose to
17:47 do in a situation.
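
A switchback setup can be sketched as a schedule of time windows, where the whole marketplace is in one arm per window and the first few minutes of each window are excluded from analysis. The window length, the 10-minute exclusion at the start of each window, and all function names below are assumptions chosen for illustration, not values from the talk.

```python
import random
from datetime import datetime, timedelta

def build_switchback_schedule(start, n_windows, window_minutes=60, burn_in_minutes=10, seed=42):
    """Randomly assign an arm to each consecutive time window."""
    rng = random.Random(seed)
    schedule = []
    for k in range(n_windows):
        window_start = start + timedelta(minutes=k * window_minutes)
        schedule.append({
            "start": window_start,
            # Events before this point are dropped to limit carryover from the prior window.
            "analysis_start": window_start + timedelta(minutes=burn_in_minutes),
            "end": window_start + timedelta(minutes=window_minutes),
            "arm": rng.choice(["control", "test"]),
        })
    return schedule

def arm_for_event(schedule, ts):
    """Return the arm an event counts toward, or None if it should be excluded."""
    for window in schedule:
        if window["start"] <= ts < window["end"]:
            return window["arm"] if ts >= window["analysis_start"] else None
    return None

schedule = build_switchback_schedule(datetime(2024, 1, 1), n_windows=24)
print(arm_for_event(schedule, datetime(2024, 1, 1, 0, 30)))
```

The unit of randomization here is the time window rather than the rider or driver, which is why this fits short, time-bound interactions better than something like a billboard.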
17:49 Another thing that's really exciting to
17:51 deal with different problems is that
17:53 same problem of ROI that we were talking
17:56 about with say a marketing campaign. We
17:59 can also bring that same concept of ROI
18:01 to investments that you make in your
18:04 product. Uh it it takes real engineering
18:06 time to drive down the latency of
18:08 something, right? It it takes real
18:11 effort to build new features. So when
18:13 you have something that you can model
18:15 with an elasticity test, it can be
18:17 really powerful to help you understand
18:19 how much bang for my buck do I get when
18:22 I'm making these big investments. Uh in
18:24 this example, if we talk about
18:26 decreasing load time, which I think we
18:28 can all agree is helpful, we might look
18:31 at our user metrics to see how helpful
18:33 is it really and be able to understand
18:36 how much we want to invest here and how
18:38 helpful it is. There is also kind of a
18:41 twist to elasticity tests is that you
18:43 can do a regression experiment. These
18:45 are pretty controversial actually. I
18:46 know there are some companies that just
18:49 don't do regression tests. I know
18:50 Snapchat just doesn't believe in them.
18:52 Why wouldn't we make a user's experience
18:55 worse on purpose? But the trade-off
18:57 there is that you get a lot of
18:59 information from making a user's
19:02 experience worse in this case. When we
19:04 think about load time, maybe it's not so
19:06 bad, right? It's a little annoying, but
19:08 I'm not going to stop using something
19:10 because of uh it taking 10 seconds to
19:14 load instead of 9.87, right? Um so you
19:16 can kind of model this elasticity of
19:19 behavior uh hopefully without churning
19:21 your users. Um, and so that's the
19:22 trade-off when you're trying to think
19:25 about running a regression test. Is it
19:28 worth what you learn to potentially have
19:30 negative business impacts while you run
19:36 I'd also like to finish by talking about
19:38 some of our options for jet fuel to be
19:40 able to make decisions faster. Right?
19:42 I'm sure we'd all love to make our
19:44 experimental decisions in an instant,
19:47 but it's not as simple as that, right?
19:49 Um,
19:50 I think one really great technique for
19:54 this is using SPRT because what you can
19:57 do is define your likelihoods ahead of
19:59 time uh of what you find believable to
20:02 accept the null hypothesis or accept the
20:04 alternative hypothesis instead of your
20:06 traditional frequentist testing where
20:08 you're looking at confidence intervals,
20:10 but you don't have that true negative
20:13 case of like, hey, there's actually no
20:17 change here. SPRT helps you quantify a
20:20 uh likelihood where if you observe uh
20:22 certain things, you're pretty convinced
20:24 that there is no difference, right? It's
20:26 a way to formalize that kind of
20:28 understanding that a
20:31 frequentist analysis doesn't offer. Um,
20:33 and when you're formalizing that ahead
20:34 of time, it can let you make those
20:36 decisions earlier because when you're
20:38 trying to do no harm, it's less of an
20:41 interpretation uh after the fact and
20:43 more of a like ground truth that you can
20:45 measure based on assumptions you're
20:47 willing to make beforehand and based on
20:50 thresholds you're willing to set.
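
To show what setting the thresholds ahead of time looks like, here is a minimal sketch of Wald's SPRT for a single stream of conversions. It assumes a simplified one-sample case with two pre-specified conversion rates, `p0` (no change) and `p1` (the change you care about), and error rates `alpha` and `beta`. Comparing two live groups needs a two-sample variant, and the function name and parameter values are hypothetical.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.2):
    """Sequentially test H0: rate = p0 against H1: rate = p1."""
    # Decision thresholds are fixed before any data is seen.
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr = 0.0
    for n, converted in enumerate(observations, start=1):
        # Add this observation's log-likelihood ratio.
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(observations)

print(sprt_bernoulli([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], p0=0.1, p1=0.3))
```

The lower threshold is what gives the explicit "no difference" outcome: enough evidence in favor of `p0` stops the test with a decision rather than leaving an inconclusive interval.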
20:53 Another really exciting technique that I
20:55 uh really like is using a bandit
20:57 approach. There are a lot of options
20:59 when it comes to using a multi-armed
21:01 bandit, but at the end of the day, the
21:04 point is you can learn really quickly
21:07 when something doesn't work or when
21:08 something starts to look a lot better
21:12 than the others. And it's really painful
21:15 to be serving people a subpar experience
21:18 when you already kind of know what you
21:21 think works. Um, so what bandits do
21:23 that's really powerful is if we're
21:25 tracking our probability that a certain
21:28 variant is the best over time, isn't it
21:32 great to be able to set our rate of
21:34 assigning people to that group to the
21:37 same value? This is just one example,
21:39 Thompson sampling of this. There are
21:41 many different approaches, but at its
21:44 core, it's understanding that you really
21:47 don't want to send people to a variant
21:50 that you already know isn't the best.
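
Thompson sampling can be sketched with a Beta posterior per variant. Drawing one sample from each posterior and serving the variant with the highest draw means each variant is chosen with a probability equal to its current chance of being the best, which is the matching of assignment rate to probability described above. The conversion rates below are made up, and the uniform Beta(1, 1) prior and function name are assumptions.

```python
import random

def thompson_sampling(true_rates, rounds=5000, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling bandit; return pulls per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes, failures, pulls = [0] * k, [0] * k, [0] * k
    for _ in range(rounds):
        # One draw from each arm's posterior, starting from a Beta(1, 1) prior.
        draws = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in range(k)]
        arm = max(range(k), key=draws.__getitem__)
        pulls[arm] += 1
        # Simulated user response for the chosen arm.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls

pulls = thompson_sampling([0.05, 0.10, 0.20])
print(pulls)
```

Over time most traffic shifts to the strongest variant, while the weaker ones still receive occasional traffic in case the early evidence was misleading.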
21:52 Um, I want to leave you with my 10th
21:54 experiment, my 10th and final
21:57 experiment, uh, interleaving experiments. I
22:00 really see this being uh exciting for AI
22:02 use cases because when we typically
22:04 think of interleaving experiments, we're
22:06 going to think about search, we're going
22:08 to think about ranking cases and we
22:10 think about interleaving those results. But
22:13 if we think about the n equals one case
22:16 of this, it's just asking people to
22:18 choose between two options. And I think
22:20 that's a really powerful use case that
22:24 we've already seen uh different uh LLM
22:26 uh you know providers take to get
22:28 feedback on their models, right? It's
22:30 the N equals one case of interleaving
22:33 experiments to be able to randomize the
22:35 order that you show things and
22:38 understand uh you know how users click
22:40 on things, what their downstream
22:42 behavior is uh for those different arms
22:45 of the experiment. um maybe more
22:47 important for search and ranking. A
22:49 byproduct of that is getting a search
22:52 cost for users, maybe not so deep at the
22:55 N equals one case, but that gets to be
22:56 really important for when you think
22:58 about search engines who want to know
23:01 are people actually going to scroll? Are
23:02 people actually going to move to the
23:05 second page of search results? Um and
23:07 and understand what that actually looks
23:10 like in a user base.
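
For the search and ranking case, one common way to interleave two rankings is team-draft interleaving, sketched below. The talk does not say which interleaving method is meant, so treat this as one possible instance. The document IDs and function names are hypothetical.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Merge two rankings; remember which ranker contributed each result."""
    rng = random.Random(seed)
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        candidates = [doc for doc in rankings[side] if doc not in team]
        if not candidates:
            # This ranker has nothing new left, so the other one picks.
            side = "B" if side == "A" else "A"
            candidates = [doc for doc in rankings[side] if doc not in team]
        doc = candidates[0]
        interleaved.append(doc)
        team[doc] = side
        picks[side] += 1
    return interleaved, team

def score_clicks(team, clicked_docs):
    """Credit each click to the ranker that contributed the clicked result."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins

interleaved, team = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d4", "d1"], seed=1)
print(interleaved, score_clicks(team, [interleaved[0]]))
```

With one result per ranker, this reduces to the side-by-side choice described above: the order is randomized and the user's pick is credited to one of the two options.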
23:12 So, with that in mind, we have 10
23:13 different types of experiments that we
23:15 talked about. I'd love to get an
23:18 understanding of how many of these
23:20 experiments people actually are running
23:22 or have run. I'd love to get the
23:24 audience standing up. If everyone could
23:27 stand up, that'd be great.
23:29 I do actually want you to stand up.
23:32 Yeah, I know. I know. Um, and then if
23:36 you've done four or fewer of these
23:40 experiments, please sit down.
23:43 If you've done six or fewer, please sit
23:46 down.
23:49 Okay. If you've done seven or fewer, sit
23:51 down.
23:53 Eight or fewer,
23:58 nine, 10. Have you guys all done all 10?
24:00 Nice. Okay. I'd love to ask the people
24:02 who are still standing, what's your
24:04 favorite experiment from this list? If
24:18 I say the sequential testing.
24:19 >> Sequential testing.
24:19 >> SPRT. Yeah.
24:21 >> Yeah.
24:22 >> Why do you like it so much?
24:25 >> It's a lot of people think that you
24:27 really need to wait till the end to make
24:29 certain decisions. Um, and they're
24:30 really not paying attention to how much
24:33 harm a certain experience is is making
24:34 just because they want to stay
24:37 statistically rigorous and they're not
24:38 really aware that that's also
24:39 statistically rigorous.
24:41 >> Yeah, I think it's really powerful
24:44 because it forces you to really do the
24:46 work ahead of time of setting up a power
24:48 analysis, understanding your metrics,
24:50 defining what you care about, which
24:51 hopefully we're all doing for all of our
24:54 experiments. But when we think
24:56 about it practically, you know, that
24:58 level of care and rigor isn't always
25:02 applied to a standard AB test. But SPRT
25:03 definitely forces you to do that because
25:06 it's part of the methodology to do it.
25:07 >> Totally. Early stopping is very
25:09 important when warranted.
25:09 >> Yeah.
25:10 >> Yep. Thank you.
25:13 >> Would you mind passing it to the woman
25:15 to your right who also had done all of
25:17 them? What's your favorite?
25:19 >> I don't think I have a favorite. I think
25:21 I like that when you are testing a
25:23 product, you want to intermix the
25:26 different experiments so that you don't
25:31 get stuck in a conclusion that
25:34 you you go down a rabbit hole. So I
25:36 don't have a favorite.
25:38 >> It is the classic data scientist answer
25:41 I feel like of it depends but it is so
25:43 accurate because it does depend on
25:45 context, right?
25:48 >> Yeah. Yeah. Um,
25:51 okay. I I think that's it for me. Um,
25:52 thank you so much for your time. I