0:01 all right everyone I'm on the line with
0:04 Kelley Rivoire Kelley is an engineering
0:07 manager at Stripe working on machine
0:09 learning infrastructure Kelley welcome to
0:11 This Week in Machine Learning & AI
0:13 thanks for having me
0:16 I'm really excited to chat same here same
0:20 here so we got in touch with you kind of
0:23 occasioned by a talk you're giving at
0:26 Strata which is actually happening as we
0:29 speak I'm not physically in SF for it at this
0:33 time but your talk which is going to be
0:35 later today is on scaling model training
0:38 from flexible training APIs to resource
0:41 management with Kubernetes and of course
0:42 machine learning infrastructure and AI
0:45 platforms is a very popular topic
0:47 here on the podcast and so I'm looking
0:49 forward to digging into the way Stripe
0:53 is platforming its machine learning
0:56 processes and operations but before we
0:58 do that I'd love to hear a little bit
0:59 about your background and how you got
1:03 started working in this space yes great
1:05 maybe I'll say a little bit about what I
1:07 do now and then kind of work backwards
1:11 from that awesome so right now I'm an
1:13 engineering manager at Stripe and I work
1:16 with our data infrastructure group which
1:18 is seven teams kind of at the lowest
1:20 level things like our production
1:22 databases or things like Elasticsearch
1:24 clusters and then kind of working up
1:26 through like batching streaming
1:30 platforms core like ETL data pipelines
1:32 and libraries and also machine learning
1:35 infrastructure I've been at Stripe for
1:39 very close to six years now from when
1:41 the company was about 50 people and have
1:42 basically worked on a bunch of different
1:46 things in sort of like risk data and
1:50 machine learning and both as an engineer
1:52 and engineering manager and also
1:54 initially more on kind of like the
1:56 application side and then over time
1:58 moving over to the the infrastructure
2:04 side by training I am like a kind of
2:06 research scientist person so I studied
2:08 physics and electrical engineering in
2:10 school and did my PhD at Stanford
2:13 working on nanophotonics and then
2:16 a short postdoc at HP Labs nanophotonics
2:21 yeah I think you said recently optics
2:24 which is not too far away so maybe that
2:27 gives me a little bit of an idea okay and
2:29 then yeah I was at HP Labs for a year so
2:31 working on sort of similar things and
2:35 also some 3d imaging and I guess I like
2:36 to call what I did although I don't know
2:38 that anyone else calls it that sort of
2:41 like full stack science where like you
2:44 have an idea and some theory or modeling
2:46 or simulation and then you use that to
2:48 design a device and then you actually go
2:49 in the cleanroom and like make the
2:50 device and then you actually go in the
2:52 optics lab and like you know shoot a
2:53 bunch of lasers at your device and
2:55 measure it and then you sort of like
2:57 process the data and compare it to your
2:59 theory and simulation and I was like I
3:02 found like kind of the two ends the most
3:04 like sort of the magical moment where
3:07 like you know the data that you
3:09 collected like matches what you thought
3:12 was gonna happen from your modeling and
3:13 I kind of decided that I wanted to do
3:15 more of that and a little less than like
3:17 fabrication or material science and I
3:18 was kind of sitting in Silicon Valley
3:20 and started looking around and like
3:22 Stripe was super exciting in terms of
3:25 its mission like having interesting data
3:27 and just like having amazing people
3:29 awesome awesome Stripe sounds really
3:31 interesting but shooting lasers at stuff
3:41 also sounds really really cool nice nice
3:45 and so maybe tell us a little bit about
3:50 Stripe's kind of machine learning journey
3:54 from an infrastructure perspective you
3:57 know how did it it sounds like you're
4:00 doing a bunch of interesting things both
4:03 from a training perspective from a data
4:06 management perspective inference but how
4:09 did it evolve yeah I think one thing
4:11 that's interesting about machine
4:13 learning at Stripe like I think a lot
4:14 of places you talk to machine learning
4:16 kind of like started out as being for
4:20 some some kind of like offline analytics
4:22 more like you know internal business
4:24 questions like maybe like you're trying
4:26 to calculate long-term value of your
4:27 users
4:28 and we do stuff like that now but we
4:30 actually started like our kind of core
4:33 uses have always been very much on kind
4:36 of the production side like our kind of
4:38 most business critical and first
4:39 machine learning use cases
4:42 were things like scoring transactions
4:44 in the charge flow to evaluate whether
4:47 they're fraudulent or not or doing
4:49 kind of like internal risk management of
4:53 like you know making sure our users are
4:55 you know selling things that we can
4:56 support from our Terms of Service or
4:59 that they're kind of like you know good
5:02 users that we want to support and so
5:04 we started out from having kind of a lot
5:05 of these more like production
5:07 requirements and it needs to be this
5:08 fast and it needs to be this reliable
5:09 and I think our machine learning
5:11 platform kind of like evolved from that
5:15 side where you know initially we had
5:16 kind of like one machine learning team
5:18 and then even just having a couple of
5:20 applications we started seeing like oh
5:22 here are some commonalities like
5:23 everyone needs to be able to score
5:27 models or you know even like having some
5:28 notion of shared features could be
5:30 really valuable across just a couple of
5:32 applications and then as we split our
5:34 machine learning team one piece of that
5:37 became machine learning infrastructure
5:40 which we've developed since then and you
5:41 know it's really important for that team
5:43 to work both with the teams doing the
5:45 business applications which now include
5:47 a bunch of other things in our user
5:49 facing products like Radar and Billing
5:52 as well as internally and also you know
5:53 it's important for the machine learning
5:55 infrastructure to build on the rest of
5:57 your data infrastructure and really the
5:59 rest of all of your infrastructure and
6:00 we've worked really closely with like
6:02 our orchestration team on you know as
6:05 you said in chatting about my talk like
6:08 getting training to run on Kubernetes
6:11 yeah that's maybe an interesting place
6:15 to start the you kind of alluded to the
6:18 the interfaces between machine learning
6:21 infrastructure as a team and you know
6:22 data infrastructure you know just
6:29 infrastructure how do they
6:30 connect you know maybe even
6:35 organizationally and how do they tend to
6:37 work with one another for
6:41 example you know in you know training on
6:43 Kubernetes you know where is the line
6:46 between what the ml infrastructure team
6:48 is doing and you know what it's
6:51 requiring of some you know broader
6:54 technology infrastructure group yeah I
6:55 think the Kubernetes case is really
6:57 interesting and it's one that's been
7:01 super successful for us so I guess maybe
7:03 like a year or two ago we'd initially
7:06 focused on the kind of scoring like
7:07 real-time inference part of models
7:08 because that's the hardest and we'd sort
7:10 of left people on their own it's like
7:11 well you figure out how to train a model
7:13 and then you know if you manage to do
7:15 that we'll help you score it and we
7:17 realized that that wasn't like great
7:20 right so we started thinking you know
7:21 what can we do and at first we built
7:23 some CLI tools to kind of like wrap the
7:25 Python people were doing but then we
7:26 wanted to kind of do more so eventually
7:28 we built an API and then a big hassle
7:30 had been the resource management and we
7:31 just kind of wanted to like abstract
7:34 that all away and as it happened at that
7:36 time our orchestration team had gotten
7:38 like really interested in Kubernetes and
7:40 I think they wrote a blog post like
7:42 maybe year and a half ago they had kind
7:44 of just moved our first application into
7:46 Kubernetes which was some of our cron
7:47 jobs that we use in our financial
7:49 infrastructure and so we ended up
7:51 collaborating this was kind of like a
7:53 great next step of a second application
7:58 they could work on and you know we had
8:00 some details we had to work out we're
8:01 having to figure out like how do we
8:03 package up all of our Python code into
8:06 you know some Docker image we can deploy
8:07 and it was really useful to be able to
8:10 work with them on that but I think we
8:12 have found really good interfaces in
8:14 working with them where you know we
8:15 wrote a client for the Kubernetes API
8:18 but it's like anytime we need help or
8:19 any time there's management of the
8:21 Kubernetes cluster they take care of
8:23 all of that so it's kind of given us
8:24 this flexibility where we can define
8:26 different instance and resource types
8:28 and swap them out really easily if we
8:30 need CPUs or GPUs or we need to like
8:33 expand the cluster but we as a machine
8:34 learning infrastructure kind of like
8:36 don't have to deal with managing
8:37 Kubernetes or updating it we have this
8:39 amazing team of people who are like
8:40 totally focused on that for Stripe
8:47 mm-hmm awesome awesome and then actually
8:49 let's maybe stay on
8:52 this you know this topic for a moment so
8:56 your talk at Strata was focused on this
9:00 area what was kind of the flow of your
9:02 talk what were the main points
9:04 that you're planning to go through
9:08 with the audience there yeah great
9:10 question so we kind of think about
9:12 this in two pieces and you know maybe
9:14 that's cuz that's how we actually did it
9:18 so one piece was the resource management
9:20 that I talked about was you know
9:21 getting things to run on Kubernetes that
9:22 was actually kind of like the second
9:26 piece for us the first piece was
9:28 figuring out sort of like how should the
9:30 user interact with things and like where
9:32 should we give them flexibility and
9:35 where should we constrain things and so
9:36 we ended up building what we call
9:38 internally Railyard which is like a
9:41 model training API and
9:42 there's sort of two pieces there's like
9:45 what you put in the API request and then
9:46 there's what we call the workflow and
9:48 the API request is a little bit more
9:50 constrained like you have to say your
9:52 metadata for who's training so we can
9:54 track it you have to tell us like where
9:56 your data is like how you're doing
9:59 things like hold out just kind of basic
10:00 things that you'll always need to put
10:02 in then we have this workflow piece
10:05 people can write like kind of like
10:07 whatever Python they want as long as
10:08 they define a train method in it that
10:11 will hand us back like the fitted model
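The workflow contract described here can be sketched as follows; this is an illustrative reconstruction, not Stripe's actual Railyard interface, and all names (TrainResult, FraudScoreWorkflow) are hypothetical:

```python
# Hypothetical sketch of a Railyard-style workflow contract; names are
# illustrative, not Stripe's actual API.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TrainResult:
    fitted_model: Any             # the trained estimator handed back to the service
    evaluation: Dict[str, float]  # metrics the platform can track


class FraudScoreWorkflow:
    """Arbitrary Python, as long as it defines a train() method."""

    def train(self, features: List[Dict[str, float]], labels: List[int],
              custom_params: Dict[str, Any]) -> TrainResult:
        # Toy stand-in for model fitting: predict the majority class.
        # A real workflow might fit a scikit-learn pipeline here instead.
        majority = int(sum(labels) * 2 >= len(labels))
        accuracy = sum(1 for y in labels if y == majority) / len(labels)
        return TrainResult(fitted_model=majority,
                           evaluation={"accuracy": accuracy})
```

The service would call train() with the features and labels resolved from the API request, then persist the returned model and metrics.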
10:14 and we definitely have found that like
10:15 initially we were very focused on binary
10:18 classifiers for things like fraud but
10:20 people have done things like word
10:21 embeddings people doing
10:24 time series forecasting using
10:27 things like scikit-learn they've also
10:29 used fastText and Prophet and so this
10:31 has worked pretty well in terms of like
10:33 providing enough flexibility that people
10:34 can do things that we actually didn't
10:36 anticipate originally but it's
10:39 constrained enough that we can run it
10:42 and sort of track what's going on and
10:45 you know give them what they need and be
10:46 able to automate the things we need to
10:49 automate okay and so you're the
10:53 interface you're describing is this kind
10:56 of Python and this train method are you
11:02 well actually here's maybe a question
11:04 about the users do you think
11:07 of your users as more kind of the data
11:10 science type of user or machine learning
11:13 engineer type of user or is there a mix
11:16 of those two you know types of
11:18 backgrounds yeah it's a mix which has
11:20 been really interesting and I think
11:22 coming back to what I said earlier like
11:24 because we initially focused on these
11:26 kind of critical production use
11:28 cases we started out where the team's
11:30 users were really pretty much all
11:31 machine learning engineers and very
11:33 highly skilled machine learning
11:35 engineers like people who are excellent
11:37 programmers and you know they know stats
11:38 in ml and they're kind of like the
11:41 unicorns to hire and over time we've
11:44 been able to broaden that and I think
11:47 having things like you know this tooling
11:49 has made that possible like in our user
11:52 survey right after we first shipped even
11:54 just the kind of like API workflow piece
11:55 and we were actually just like running
11:57 it on some box as a sidecar process we
11:59 hadn't even done Kubernetes yet but a
12:01 lot of the feedback we got was like oh
12:03 this new person started on my team and I
12:04 just like pointed them to the directory
12:06 where the workflows are and I like
12:07 didn't have to think about how to split
12:09 all these things out because like you
12:11 know you just kind of pointed me in the
12:12 right direction and I could point them
12:14 in the right direction so I think that
12:16 having these kind of like common
12:18 ways of doing things has been a way to
12:20 broaden our user set and as our data
12:22 science team which is more internally
12:24 focused has grown they've been able to
12:25 kind of like start picking up
12:29 increasingly large pieces of what we
12:31 built for the ML engineers as well and
12:33 we've been like excited to see that and
12:38 work with them and so the the interface
12:42 then is kind of Python code and our is
12:46 the platform container izing that code
12:49 or is the user expected to do it or is
12:51 it integrated into some kind of workflow
12:53 like they check it in and then it
12:55 becomes available you know to the
12:59 platform via check-in or CI/CD type of
13:03 process yeah so we still have the
13:06 experimental flow where people can like
13:08 kind of try things out but when you're
13:09 ready to productionize your workflow
13:11 basically what you do is you get your
13:12 code review
13:15 you merge it and we ended up
13:17 using Google's subpar library because it
13:19 works really well with Bazel which we
13:22 use for a lot of our build tooling
13:27 what are those two yeah so subpar
13:30 is a Google library that helps us like
13:32 package Python code into like a
13:34 self-contained executable both the
13:36 source code and any dependencies like if
13:37 you're running PyTorch and you need
13:39 some CUDA stuff okay
13:41 and it works kind of out of the box with
13:43 Bazel which is the open source version
13:46 of Google's build system which we have
13:48 started to use at Stripe a few years
13:51 ago and have extended since it's really
13:53 nice for like speed reproducibility and
13:57 working with multiple languages so this
13:59 is where our ML infra team kind of worked
14:00 with our orchestration team to figure
14:03 out the details here to be able to kind
14:05 of like package up all this Python code
14:07 and have it so that basically almost
14:08 like a service deploy you can kind of
14:10 like have it turn into a Docker image
14:13 that you can deploy to like Amazon's ECR
14:16 and then Kubernetes will kind of like
14:17 know how to pull that down and be able
14:21 to run it so the ml engineer the data
14:22 scientist doesn't really have to think
14:23 about any of that it just kind of works
14:26 as part of the you know you get your
14:27 PR merged and you deploy something if
14:30 you need to change the workflow okay but
14:32 earlier on in the process when you're
14:37 experimenting the currency is you know
14:40 some Python code
14:47 what kind of
14:50 tooling have you built up around
14:52 experiment management and automatically
14:56 tracking various experiment parameters
15:00 or hyper parameters hyper parameter
15:01 optimization and that kind of thing are
15:03 you doing all that or is that all on the
15:07 the user to do yeah that's a really good
15:10 question so one of the things that we
15:12 added in our API for training as we
15:14 found it was really useful to have this
15:17 like custom params field especially
15:19 because eventually
15:20 you know we have some shared
15:22 services to support this like sort of a
15:24 retraining service that can automate
15:26 your training requests
15:29 and so one of the things that people
15:31 from the beginning used the custom
15:32 params for was hyperparameter
15:34 optimization we are kind of
15:35 working toward building that out as a
15:39 first-class thing like we now have like
15:40 evaluation workflows that can be
15:42 integrated with all of this as well and
15:44 that's kind of like the first step you
15:45 need for hyperparameter optimization if
15:47 you want to do it as a service is like
15:48 what are you optimizing if you don't
15:51 know what you're looking at so that's
15:53 something we hope to do like over the
15:54 next you know three to six months is to
15:56 make that like a little bit more of
15:59 first-class support and you mentioned
16:02 this this directory of workflows
16:06 elaborate on that a little bit yeah so
16:08 one of the nice things is you know when
16:10 you're writing your workflow if you put
16:14 it in the right place then our
16:16 Scala service Railyard will know where
16:18 to find it but one of the side benefits
16:20 has also just been that there is one
16:22 place where people's workflows are and
16:24 so that that's been kind of like a nice
16:26 place for people to get started and see
16:27 like you know what models are other
16:30 people using or like what pre-processing
16:31 or kind of what other things are they
16:35 doing or what types of parameters
16:37 like estimator parameters are they
16:40 looking at changing to just kind of you
16:41 know have that be like a little bit more
16:44 available to our users or internal users
16:49 mm-hmm and the workflow elements of this
16:53 is it is a graph based is it something
16:57 like Airflow how's that implemented yeah
16:59 so in this case the workflow I mean
17:02 it's just like Python code that you know
17:05 you give it Railyard
17:08 our API passes to it like what are your
17:10 features or what are your labels and
17:12 then your Python code returns like
17:16 here is the fitted pipeline or model and
17:18 like usually something like the
17:21 evaluation data set that we can pass
17:25 back so people have
17:27 kind of built like
17:30 interesting things on top of having a
17:32 training API so some of our users built
17:34 out actually the folks working on Radar
17:36 a fraud product built out like an auto
17:38 retraining service that we've since kind
17:40 of taken over and generalized
17:42 and where they schedule like nightly
17:44 retraining of all the tens and hundreds
17:47 of models and you know that's integrated
17:49 to be able to even like if the
17:51 evaluation looks better like potentially
17:54 automatically deploy them we do also
17:56 have people who have put like training
17:59 models via our service into like
18:01 Airflow DAGs if they have you know some
18:03 some slightly more complicated set of
18:06 things that they want to run so you
18:08 definitely seen that as well okay and
18:10 you've mentioned Radar a couple of times
18:12 is that a product at Stripe or an
18:16 internal project yeah it's a user
18:20 facing fraud product it runs on all of
18:22 our machine learning infrastructure and
18:25 you know every charge that goes through
18:27 Stripe within usually 100 milliseconds
18:29 or so we've kind of like done a bunch of
18:31 real-time feature generation and
18:33 evaluated like kind of all of the models
18:37 that are appropriate and in addition to
18:38 sort of the machine learning piece
18:40 there's also a product piece for it
18:43 where users can get more visibility into
18:45 what our ml has done they can kind of
18:48 like write their own rules and like set
18:49 block thresholds on them and
18:51 there's sort of like a manual review
18:54 functionality so they're kind of some
18:55 more product pieces that are
18:56 complementary to the underlying machine
19:04 learning okay interesting and so just
19:06 trying to complete the picture here
19:09 you've got these workflows which are
19:12 essentially Python they expose a train
19:21 entry point and you
19:23 mention this directory of workflows is
19:25 that like a directory like on a server
19:27 somewhere with just like .py files or
19:30 is that are they do you require that
19:32 they be versioned
19:34 and are you kind of managing those
19:37 versions yeah so that that's just
19:39 actually in our codebase so
19:40 that's like yeah the workflows live
19:45 together in code as part of
19:47 kind of our training API it's like when
19:49 you send that here's my training request
19:52 which has you know here's my data here's
19:54 my metadata this is the workflow I want
19:57 you to run we give you back a job ID
19:59 which then you can check the status of
20:01 you can check the result the result will
20:03 have things in it like what was the git
20:06 SHA and so that's like something that
20:08 we can track as well got it
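The request/job-ID flow just described could look roughly like this; everything here (field names, statuses, the placeholder SHA) is a hypothetical sketch of such a service, not Stripe's actual API:

```python
# Hypothetical sketch of a Railyard-style training service flow:
# submit a request, get a job ID, poll status, read the result.
import uuid
from dataclasses import dataclass, field


@dataclass
class TrainingRequest:
    workflow: str                 # which workflow in the shared directory to run
    data_path: str                # e.g. Parquet in S3
    metadata: dict = field(default_factory=dict)


class TrainingService:
    """In-process stand-in; a real service would launch jobs on Kubernetes."""

    def __init__(self):
        self._jobs = {}

    def submit(self, request: TrainingRequest) -> str:
        job_id = str(uuid.uuid4())
        # Simulated: record the job as finished, with the git SHA of the
        # merged workflow code so the run is traceable.
        self._jobs[job_id] = {"status": "succeeded",
                              "result": {"git_sha": "abc123"}}
        return job_id

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]["status"]

    def result(self, job_id: str) -> dict:
        return self._jobs[job_id]["result"]
```

A caller would submit a request, hold on to the returned job ID, and poll status() until the run finishes.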
20:12 so you're submitting the job with a
20:20 little bit about which workflow you're running
20:23 through like in the case where you're
20:25 running on Kubernetes you've merged your
20:28 code to master and then we kind of
20:30 package up all this code and deploy the
20:32 Docker image and then from there you can
20:34 kind of make requests to our service
20:37 which will run the job on Kubernetes so
20:39 at that point your code it's you know
20:41 whatever is on master for the workflow
20:43 plus whatever you've put in the request
20:44 got it
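The merge-then-package flow described here — Bazel plus Google's subpar producing a self-contained Python executable that gets baked into a Docker image — might look something like this BUILD fragment; the target names are hypothetical and the load path can vary by subpar version, so treat this as a sketch rather than Stripe's actual build config:

```python
# Hypothetical Bazel BUILD fragment (Starlark, a Python dialect).
# subpar's par_binary packages the workflow source plus its dependencies
# into one self-contained .par executable, which CI can then bake into a
# Docker image and push to ECR for Kubernetes to pull.
load("@subpar//:subpar.bzl", "par_binary")

par_binary(
    name = "train_workflow",
    srcs = ["train_workflow.py"],
    main = "train_workflow.py",
    deps = [
        # third-party deps, e.g. PyTorch and its CUDA libraries, go here
    ],
)
```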
20:52 okay and so that's the the kind of the
20:54 shape of the training infrastructure
20:56 you've mentioned a couple of times that
20:58 you it sounds like there's some degree
21:02 to which actually I'm not sure maybe I'm
21:05 inferring a lot here but let's talk
21:08 about the where the the data comes from
21:12 for training and what kind of you know
21:14 platform support you're offering folks
21:15 yeah that's a really interesting
21:19 question kind of within the framework of
21:23 like what do you need for a
21:25 Railyard API request we support two different
21:29 types of data sources one is more for
21:32 experimentation which is like you can
21:34 kind of tell us how to make the SQL
21:37 to query the data warehouse and that's
21:39 kind of nice for experimentation but not
21:42 so nice for production what pretty much
21:44 everyone uses for production is the
21:46 other data source we support
21:49 which is Parquet from S3 so it's like
21:51 you tell us you know where to find that
21:54 and what your feature names are and
21:57 usually that's generated by our features
21:59 framework that we call Semblance which
22:04 is basically like a DSL that helps you
22:06 know gives you a lot of ways to write
22:09 complex features like having things
22:10 like counters be able to do things like
22:12 joins do a lot of transformations and
22:15 then you know the other infrastructure
22:18 team figures out like how to run that
22:21 code in batch if you are doing training
22:25 or like there's a way to run it in real
22:26 time basically and kind of like a
22:29 consumer setup but you only have to
22:32 write your feature code like once
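The write-once idea — one feature definition that both a batch backend and a consumer-style streaming backend can execute — can be sketched conceptually like this; this is not Semblance, just an illustration of the shared-definition principle, and all names here are hypothetical:

```python
# Conceptual sketch of write-once feature code: a feature defined as a fold
# over events, runnable both in batch and incrementally in a consumer.
from typing import Callable, Iterable, Tuple

Feature = Tuple[Callable[[], int], Callable[[int, dict], int]]


def counter(predicate: Callable[[dict], bool]) -> Feature:
    """A feature: count of events matching a predicate."""
    def init() -> int:
        return 0

    def step(state: int, event: dict) -> int:
        return state + 1 if predicate(event) else state

    return init, step


def run_batch(feature: Feature, events: Iterable[dict]) -> int:
    """Batch backend: fold the feature over a materialized event set."""
    init, step = feature
    state = init()
    for event in events:
        state = step(state, event)
    return state


class StreamingRunner:
    """Consumer-style backend: the same feature code, one event at a time."""

    def __init__(self, feature: Feature):
        self._init, self._step = feature
        self.state = self._init()

    def consume(self, event: dict) -> None:
        self.state = self._step(self.state, event)


# The feature is written once and handed to either backend unchanged.
declines = counter(lambda e: e["type"] == "charge.declined")
```

Keeping the feature definition free of any notion of where it runs is what lets the infrastructure swap backends underneath it.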
22:41 okay and so is it the user
22:43 that's only writing their feature code once
22:45 are you going after kind of sharing
22:48 features across the user base to what
22:52 extent or are you seeing shared features
22:58 yeah yes the user writes their code once
23:00 and like also I think having a framework
23:02 similar to the training workflows where
23:04 people can see what other people have
23:07 done has been really powerful so we do
23:10 have people who are like definitely kind
23:12 of sharing features across applications
23:14 and there's a little bit of a
23:15 trade-off like it's like a huge amount
23:16 of leverage if you don't have to rewrite
23:19 some complicated business logic you do
23:21 have to manage a little bit of making
23:23 sure that you know everything is
23:25 versioned and that you're paying
23:26 attention to like not deprecating
23:28 something someone else is using and that
23:30 you're not like just like changing a
23:33 definition in place but that you are kind of
23:34 like creating a new version every time
23:36 you are changing something right so
23:38 there's a little bit more management
23:39 there and hopefully over time we can
23:41 improve our tooling around that but I
23:43 think it's you know even even since
23:44 before we had a features framework like
23:46 being able to kind of share some of that
23:48 stuff has been like hugely valuable for
23:56 us mmm and are you so what is the
23:59 features framework is that
24:04 is that a set of APIs or is that kind
24:08 of a runtime like what exactly is it
24:12 yeah there's kind of two pieces one of which
24:14 is basically sort of what you said like
24:17 you know whatever like the API like what
24:19 are what are the things we you know let
24:21 users Express and one thing we tried to
24:23 do there is actually constrain it a
24:25 little bit so like you have to use
24:27 events for everything and we don't
24:29 really let you Express notions of time
24:31 so you kind of can't mess up that time
24:33 machine of like what was the state of
24:35 the features at some time in the past
24:37 where you want to be training your model
24:38 we kind of like take care of that for
24:41 you so that's kind of one piece and then
24:43 you know we kind of compile that into
24:46 like an AST and then we use that to
24:48 essentially write like a compiler to be
24:50 able to run it on different backends and
24:52 then we can kind of like you know write
24:54 tests and try and check at the framework
24:56 level that that things are gonna be as
24:58 close as possible to the same across
25:00 those different backends so back-end
25:03 could be something for training where
25:05 you're going to materialize like what
25:06 was the value of the features at each
25:08 point in time in the past that you want
25:10 as inputs to training your model or
25:12 another back-end could be like I
25:14 mentioned we have kind of this consumer
25:16 based back-end that we use like for
25:18 example for Radar to be able to like
25:20 evaluate these features like as a charge
25:25 is happening and so to what extent you
25:28 find that that limitation of
25:30 everything being event-based
25:33 gets in the way of what folks want to do
25:39 yeah that's definitely a little bit of a
25:40 paradigm shift for people because
25:42 they're like oh I just want to use this
25:46 thing from the database right but we
25:47 found that actually it's worked out
25:50 pretty well and that especially when you
25:52 have users who are ml engineers like
25:54 they do really understand the value of
25:56 like why you want to have things event
25:58 based and like the sort of gotchas that
26:01 that helps prevent because I think
26:03 everyone has their story about how you
26:04 were just looking something up in the
26:06 database but then you know the value
26:08 changed and you didn't realize it so
26:10 it's kind of like you're leaking future
26:12 information into your training data and
26:13 then your model
26:14 is not gonna do as well as you thought
26:19 it did so like I think moving to a more
26:21 event based world and I mean I think in
26:22 general Stripe has also kind of been
26:25 doing more streaming work and more
26:29 having like good support also as at the
26:31 infrastructure level with Kafka has been
26:33 really helpful with that and so does
26:38 that mean that the models that they're
26:42 building need to be aware of kind of
26:45 this streaming paradigm during training
26:48 where do they get a static data set to
26:51 train yeah so basically you can kind of
26:52 use our features framework to just
26:55 generate like Parquet in S3 that has
26:57 materialized like all the information
26:59 you want of what was the value of each
27:01 of the features that you want at all the
27:03 points in time that you want and then
27:05 you know your input to the training API
27:09 is like please use this Parquet from S3 we
27:10 could make it a little more seamless
27:13 than that but it works pretty well and
27:15 and Parquet you use just like a serialized
27:18 file format yeah it's pretty efficient
27:21 you know I think it's used in a lot of
27:23 kind of big data uses you can also do
27:25 things like predicate pushdown and we
27:26 have like a way in the training API to
27:29 kind of specify some filters there to
27:32 just kind of like save some effort
27:35 using predicate pushdown yeah so if you
27:37 know you only need certain columns or
27:38 something like you know you can you can
27:40 load it a little bit more efficiently
27:41 and not have to carry around a lot of
27:46 extra data got it okay the other
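The column pruning and filtering just mentioned can be illustrated with a toy stdlib-only sketch; with real Parquet you would hand the column list and filters to the reader (for example pyarrow's columns and filters arguments) so unneeded data is skipped during the file scan rather than after loading:

```python
# Toy illustration of predicate pushdown plus column pruning over row data.
# A real Parquet reader applies these inside the scan, so unneeded columns
# and rows never get materialized in memory.
from typing import Callable, Dict, List


def read_with_pushdown(rows: List[Dict], columns: List[str],
                       predicate: Callable[[Dict], bool]) -> List[Dict]:
    """Keep only matching rows, and only the requested columns of each."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


rows = [
    {"amount": 100, "country": "US", "outcome": 1},
    {"amount": 250, "country": "DE", "outcome": 0},
    {"amount": 75, "country": "US", "outcome": 0},
]
us_only = read_with_pushdown(rows, ["amount", "outcome"],
                             lambda r: r["country"] == "US")
```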
27:48 interesting thing that you talked about
27:52 in the context of this event-based
27:56 framework is the whole you know time
27:58 machine is the way you said it kind of
28:00 alluding to the point in time
28:05 correctness of you know feature snapshots
28:09 can you elaborate a little bit on did
28:10 you did you start there or did you
28:14 evolve to that that seems to be in my
28:17 conversations kind of I don't know maybe
28:19 like one of the
28:21 cutting edges or bleeding edges that
28:23 people are trying to deal with as they
28:25 scale up these these data management
28:28 systems for features yeah for this
28:32 particular project in this version we
28:34 started there Stripe previously had
28:36 kind of looked at something a little bit
28:38 related a couple years before and in a
28:39 lot of ways we kind of learned from that
28:41 so we ended up with something that was
28:43 more powerful and sort of solved
28:45 some of these issues at the platform
28:49 level we did you know at that point we
28:50 had been running machine learning
28:51 applications in production for a few
28:53 years so I think everyone has their
28:56 horror stories right like all the
28:59 things that can go wrong especially kind
29:00 of at the correctness level and like
29:02 everyone has their story about like
29:03 reimplementing features in different
29:05 languages which we did for a while
29:07 too and kind of like all the things that
29:10 can go wrong there so yeah I think we
29:12 really tried to learn from both like
29:14 what are all the things we'd seen go
29:16 well or go wrong in individual
29:18 applications and then also from kind of
29:21 like our previous attempts at some of
29:23 this type of thing like what was
29:24 good and you know what could still be better
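The "time machine" problem discussed above — leaking future information into training data by looking values up in a live database instead of reconstructing what was true at label time — can be illustrated with a toy point-in-time feature:

```python
# Toy sketch of point-in-time correctness: when materializing a feature for
# a training example, count only events that happened strictly before the
# example's label time, so no future information leaks in.
from bisect import bisect_left
from typing import List


def feature_as_of(event_times: List[int], t: int) -> int:
    """Count of events strictly before time t."""
    return bisect_left(sorted(event_times), t)


# A naive "look it up in the database today" feature would return the
# current total (4) for every training example, regardless of label time;
# the as-of version returns what was actually true at each point in time.
charge_times = [1, 3, 5, 9]
```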
29:29 mm-hmm and out of curiosity what do you
29:31 use for data warehouse and are there
29:34 multiple or is there just one and
29:36 we've used a combination of Redshift and
29:41 Presto over the past couple of years you
29:43 know they have a little bit of sort of
29:45 like different abilities and strengths
29:47 and those are those are things that
29:49 people like to use to experiment with
29:51 machine learning although like you know
29:52 we generally don't use them in our
29:54 production flows because we kind of
29:57 prefer the event-based model so is
30:03 the event based model is it kind of
30:06 parallel or orthogonal to redshift or
30:08 presto or is it a front-end to
30:13 either these two systems yeah I guess we
30:14 have we actually have a front-end that
30:18 we've built for redshift and presto you
30:20 know separately from from machine
30:21 learning that's really nice and lets
30:24 people like you know to the extent they
30:26 have permissions to do so like explore
30:29 tables or put annotations on tables and
30:32 we haven't integrated our
30:34 in general I would say we could do some
30:37 work on our UIs for ML stuff we
30:38 definitely focus more on the backend and
30:40 infra and API side although we do have
30:42 some things like our auto retraining
30:44 service has a UI where you can see like
30:47 what's the status of my job like was it
30:50 you know did it finish did it produce a
30:51 model that was better than the previous
30:55 model mm-hmm I think I'm just trying to
30:57 wrap my head around the the event based
31:01 model here you know as an example of a
31:04 question that's coming to mind in an
31:06 event-based world are you regenerating
31:10 the features you know every time and if
31:12 you've got you know some complex feature
31:15 that involves a lot of transformation or
31:17 you have to backfill a ton of data like
31:18 what does that even mean in an
31:21 event-based world where i think of like
31:23 you have events and they go away
31:25 yes so the store for all that that
31:30 isn't redshift or presto well you know
31:32 we're publishing something to Kafka and
31:35 then we're archiving it to s3 that then
31:37 that persists like you know as long as
31:40 we want it to in some cases basically
31:44 forever and so that is available we do
31:46 end up doing a decent amount of
31:49 back filling of kind of like you know
31:51 you define the transform features you
31:54 want but then you know you need to
31:55 run that back over all the data you'll
31:56 need for your training so that's
31:58 something that we've actually done a lot
32:00 of from the beginning partly because of
32:02 our applications like when you're
32:04 looking at fraud you know the way you
32:06 find out if you were right or not is
32:09 that like in some time period usually
32:11 within 90 days but sometimes longer than
32:14 that the cardholder decides whether
32:15 they're going to dispute something as
32:19 fraudulent or not and that's compared to
32:21 like you know if you're doing ads or
32:22 trying to get clicks like you kind of
32:25 get the result right away right and we
32:28 you know so I think we've always like
32:30 been interested in kind of like being
32:32 able to backfill so that is you know you
32:34 can log things forward but then it's
32:34 like you'll probably have to wait a
32:36 little bit of time before you have
32:37 enough of a dataset that you can train
32:44 on it ok cool so we talked about the
32:46 data side of things we talked about
32:48 training and experiments
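The backfill pattern described above (replaying events archived from Kafka to s3 through a feature transform, instead of waiting for data to accumulate going forward) might look roughly like this sketch, where the event shape and the running-count transform are invented for illustration:

```python
def backfill(archived_events, transform):
    """Replay archived events in time order through a stateful feature
    transform, producing the historical feature values you would have
    logged had the feature existed all along."""
    state = {}
    history = []
    for event in sorted(archived_events, key=lambda e: e["ts"]):
        value = transform(state, event)
        history.append((event["ts"], event["card"], value))
    return history

def charge_count(state, event):
    """Hypothetical transform: running count of charges per card."""
    state[event["card"]] = state.get(event["card"], 0) + 1
    return state[event["card"]]

events = [
    {"ts": 1, "card": "a"},
    {"ts": 3, "card": "a"},
    {"ts": 2, "card": "b"},
]
history = backfill(events, charge_count)
# → [(1, "a", 1), (2, "b", 1), (3, "a", 2)]
```

Because the archive persists (in some cases forever), a newly defined feature can be evaluated over years of history instead of only from its launch date onward.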
32:52 how about inference yes that's a really
32:54 great question and that's that's kind of
32:56 like the first thing that we built
33:00 infrastructure support for at first a
33:02 decent number of years ago like I think
33:03 even before things like tensorflow were
33:09 really popular and so we have like our
33:11 own Scala service that we use to do our
33:16 production real-time inference and you
33:17 know we started out especially because
33:19 we have like mostly transactional data
33:20 we don't have a lot of things like
33:22 images at least in our most critical
33:25 applications at this point and a lot of
33:26 our early models and even still today
33:28 like most of our production models are
33:29 kind of like tree based models like
33:31 initially things like random forests and
33:31 now things more like
33:35 xgboost and so you know we've kind of
33:38 like we have the serialization for that
33:40 built in to our training workflows and
33:42 we've optimized that to run pretty
33:44 efficiently in our Scala inference
33:46 service and then we've built some kind
33:49 of nice layers on top of that for things
33:51 like model composition kind of what we
33:53 call meta models where you know you can
33:55 kind of like take your machine learning
33:58 model and kind of like almost like
33:59 within the model sort of compose
34:02 something like add a threshold to it or
34:06 like for radar we trained you know some
34:07 array of like in some cases user
34:09 specific models along with like maybe
34:11 more of some global models and so you
34:13 can kind of incorporate in the framework
34:16 of a model doing that dispatch where
34:17 you're kind of like if it matches these
34:19 conditions score with these models
34:21 otherwise score with this model and like
34:25 here's how you combine it and then the
34:26 way that interfaces with your
34:28 application is that each application has
34:31 what we call a tag and basically the tag
34:34 points to the model identifier which is
34:36 kind of like immutable and then whenever
34:37 you have a new model or you're ready to
34:39 ship you just like update what that
34:42 tag points to and then you know put it in
34:44 production you're saying like score the
34:50 model for this tag okay and that is
34:52 pretty similar to like you know if you
34:54 read about Michelangelo and things like
34:56 that sometimes it's like we all came up
34:58 with it
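The tag scheme described here (immutable model identifiers behind a mutable tag, plus a meta-model that dispatches between user-specific and global models) could be sketched like this, where the model names, scores, and matching rule are all made up:

```python
# Immutable model artifacts, keyed by identifier. Real models would be
# deserialized trees; lambdas stand in for scoring functions here.
models = {
    "fraud_v1": lambda tx: 0.2,
    "fraud_v2": lambda tx: 0.9,
    "merchant_123_v1": lambda tx: 0.7,
}

# Mutable tags: applications always score "by tag", so shipping a new
# model is just repointing the tag at a new immutable identifier.
tags = {"fraud": "fraud_v1"}

def score_by_tag(tag, tx):
    """A meta-model: route to a merchant-specific model when one exists,
    otherwise fall back to the global model the tag currently points to."""
    specific = models.get("merchant_%s_v1" % tx["merchant"])
    if specific is not None:
        return specific(tx)
    return models[tags[tag]](tx)

score_old = score_by_tag("fraud", {"merchant": 999})  # global fraud_v1
tags["fraud"] = "fraud_v2"                            # "ship" the new model
score_new = score_by_tag("fraud", {"merchant": 999})  # now scores with fraud_v2
score_specific = score_by_tag("fraud", {"merchant": 123})  # merchant model wins
```

The indirection is the point: callers never reference a model version, so rollout (or rollback) is a single tag update with no application change.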
35:02 it also sounds a little bit like sorry
35:03 say that again
35:04 yeah I think that like a lot of people
35:06 who kind of come up with some of these
35:07 that these ways of doing things that
35:11 just kind of make sense mm-hmm it also
35:14 sounds a little bit like some of what
35:16 Seldon is trying to capture in a
35:20 kubernetes environment which I guess
35:23 brings me to is the inference running in
35:27 kubernetes or is that a separate
35:29 separate infrastructure it's not right
35:31 now but I think that's mostly like a
35:34 matter of time and prioritization like
35:35 the first thing we moved to kubernetes
35:38 was the training piece because the
35:39 workflow management piece was so
35:40 powerful or sorry the resource
35:42 management piece was so powerful like
35:44 being able to swap out CPU GPU high
35:50 memory we've moved some of our like the
35:51 sort of real-time feature evaluation to
35:54 kubernetes which has been really great
35:56 and made it like a lot less toil to kind
35:58 of deploy new feature versions at some
36:00 point we will probably also move the
36:02 inference service to kubernetes we just
36:03 kind of haven't gotten there yet because
36:09 it is still some work to do that and is
36:13 the the inferences is happening on AWS
36:15 as well and are you using kind of
36:18 standard CPU instances or are you doing
36:23 anything fancy there yeah so we run on
36:26 cloud for pretty much everything and
36:31 definitely use a lot of AWS for the
36:32 real-time inference of the most
36:34 sensitive like production use cases
36:38 we're definitely mostly using CPU and
36:42 we've done a lot of optimization work so
36:43 that has worked pretty well for us I
36:45 think we do have some folks who've kind
36:47 of experimented a little bit with like
36:52 hourly or batch scoring using some other
36:53 things I think that's something that
36:54 we're definitely thinking about as we
36:58 have more people productionizing kind
36:59 of like more complex types of models
37:01 where you know we might want something
37:03 different you mentioned a lot of
37:06 optimization that you've done is that on
37:08 a model by
37:13 model basis or are there platform
37:17 things that you've done that help
37:19 optimize across the various models that
37:22 you're deploying yeah it
37:24 definitely a lot of things at the
37:25 platform level like I think the first
37:28 models that we ever scored in our
37:30 inference service were serialized with
37:32 yaml and they were like really huge
37:35 and they caused a lot of garbage when we
37:38 tried to load them and so like we did
37:39 some work there for kind of tree based
37:44 models to be able to load things from
37:45 disk to memory really quickly and like
37:48 not producing much garbage so that's
37:49 the kind of thing that we
37:50 did especially kind of like in the
37:54 earlier days okay and are you what are
37:57 you using for querying the models so you
38:01 doing rest or grpc or something
38:04 altogether different yeah we use rest
38:08 right now I think grpc is like
38:09 something that we're interested in but
38:14 we haven't done yet okay and are you is
38:20 all of the inference done via
38:23 kind of rest and like a kind of
38:25 micro service style or do you also do
38:30 more I guess embedded types of inference
38:33 for like where you need super low
38:35 latency requirements does rest kind of
38:37 meet the need across the application
38:41 portfolio yeah even for our most critical
38:43 applications we feel things have
38:44 worked pretty well one other thing our
38:46 orchestration team has done that's
38:48 worked really well for us is migrating a
38:51 lot of things to envoy so we've seen
38:53 some things where like we didn't
38:54 understand why there was some delay like
38:56 in what we measured for how long things
38:58 took versus like what it took to the
39:00 user that just kind of went away as we
39:05 moved to envoy and what is envoy envoy is
39:07 like a service-to-service networking mesh
39:10 so it was developed by lyft and it's
39:12 kind of like an open source
39:15 library and so it handles a lot of
39:16 things like service
39:21 to service communication okay cool
39:22 and so the
39:27 the inference environment
39:32 does it is it doing absent of kubernetes
39:33 all the things that you'd expect
39:34 kubernetes to do in terms of like
39:39 auto-scaling and you know load balancing
39:42 across the different service instances
39:48 or is that stuff all done statically we
39:52 take care of the routing ourselves and
39:54 we also at this point have kind of like
39:56 sharded our inference service so not all
39:59 models are stored on every host so that
40:00 you know we don't need hosts with like
40:04 infinite memory and so that we take care
40:08 of ourselves the scaling is not fully
40:10 automated at this point we do have
40:11 kind of like quality of service that we
40:13 have like multiple kind of clusters of
40:15 machines and we tier a little bit by
40:17 like you know how sensitive your
40:19 application is and what you need from it
40:21 so that we can be a little bit more
40:23 relaxed with people who are developing
40:25 and want to test and not have that like
40:27 potentially have any impact on more
40:29 critical applications but we haven't
40:31 done like totally automated scaling it's
40:32 something we kind of still look at a
40:36 little bit ourselves awesome awesome so
40:39 if you were kind of just starting down
40:42 this journey without having done all the
40:44 things that you've done at
40:46 stripe where do you think you would start
40:50 if you just you know you're at an
40:52 organization that's kind of increasingly
40:55 invested in or investing in machine
40:58 learning and you know needs to try to
41:03 you know gain some efficiencies yeah I
41:04 mean I think if you're just starting out
41:06 like it's good to think about like what
41:10 are your requirements right and you know
41:11 if you're just trying to iterate quickly
41:13 it's like do the simplest thing possible
41:16 right so you know if you can do things
41:18 in batch like great do things in batch I
41:21 think there are a lot of both
41:23 open-source libraries as well as managed
41:26 solutions like on all the different
41:28 cloud providers so I think you know I
41:30 don't know you know if you're only one
41:32 person then I think that those could
41:34 make a lot of sense also for people
41:35 starting out because one of the
41:37 interesting things with machine learning
41:39 applications is that it takes a little
41:42 bit of work like usually there's sort of
41:43 this threshold of like your modeling has
41:45 to be good enough for this to be like a
41:47 useful thing for you to do like for
41:49 fraud detection that's like if we can't
41:51 catch any fraud with our models then
41:52 like you know we probably shouldn't have
41:55 like a fraud detection product so I
41:56 think it is useful to kind of have like
41:59 a quick iteration cycle to find out like
42:01 is this a viable thing that you even
42:02 want to pursue and if you have an
42:03 infrastructure team they can kind of
42:06 like help lower the bar for that but I
42:07 think there are other ways to do that
42:09 especially as you know there's been like
42:11 this Cambrian explosion in the ecosystem
42:13 of different open-source platforms as
42:15 well as different managed solutions yeah
42:17 how do you how do you think an
42:20 organization knows when they should have
42:24 an infrastructure team ml in particular
42:26 yeah I think that's a really interesting
42:31 question I guess in our case I think you
42:33 know the person who originally founded
42:36 the machine learning infrastructure team
42:38 had worked in this area before at
42:40 Twitter and kind of had a sense of like
42:42 this is gonna be a thing that we're
42:44 really gonna want to invest in given how
42:45 important it is for a business and also
42:47 that if you don't kind of like dedicate
42:49 some folks to it it's easy for them to
42:51 kind of get sucked up in other things
42:52 like if you just have data
42:55 infrastructure that's undifferentiated
42:56 so I think it's a really interesting
42:59 question there probably is this business
43:01 piece right of like what are your ml
43:03 applications like how critical are they
43:06 to your business and like how difficult
43:08 are your infrastructure requirements for
43:10 them as well I think a lot of companies
43:12 develop their ml infrastructure like
43:14 starting out with things like making the
43:15 notebook experience really great because
43:17 they want to support like a lot of data
43:18 scientists who are doing a lot of
43:20 analysis and so that's like a little bit
43:22 of a different arc from the one
43:23 that we've been on and I think that's
43:25 like actually a pretty business
43:28 dependent thing okay awesome
43:30 awesome well Kelly thanks so much for
43:32 taking the time to chat with me about
43:36 this really interesting story and I've
43:38 enjoyed learning about it cool and
43:39 thanks so much for chatting really