0:07 So, the topic of my talk is "pandas should go extinct", and I obviously feel obligated to clarify: not the fluffy things that live at Adelaide Zoo, but the Python dataframes library. I'm going to jump straight into it with this incredibly vibes-based diagram you can see behind me, which is essentially the adoption curve you typically follow with a dataframe library.
0:28 You start on the far left of this diagram with a relatively small amount of data, and you use Excel. Eventually you get to the point, somewhere in the single-digit gigabytes, where you graduate to pandas, which is the typical path. You use pandas for a while, you finish up somewhere in the double-digit gigabytes, and then you just don't have enough RAM left, your computer only uses a single core with pandas, and you hit this cliff. On the far right-hand side you can see all these lovely Spark derivatives and distributed querying systems, with Dask at the top just to make things a bit more interesting, and you're forced to jump onto these systems to keep querying your data, doing analysis, that kind of stuff.
1:08 Now, my controversial take is that you can avoid these systems, probably because your data will never get to that size, by using the tools at the top of this slide: DuckDB, which is an in-memory analytics database, kind of SQLite for analytics if you will, and Polars, which is a Rust-based dataframe library that feels very similar to pandas but makes some important changes. And I'll say this cliff happens around the 100 GB mark, maybe a bit bigger than that.
1:35 So the question you might be asking is: why do I actually care about datasets of this size? Fundamentally, the answer comes down to who actually has big data. This table is taken from a paper published by Amazon themselves. They operate Redshift, which is a distributed analytics database, kind of like Snowflake or Spark. What the table shows is fleet statistics for how many rows are in each table on Redshift, and you'll see that 95% of tables contain fewer than 10^8 rows. If we assume, say, 1 kilobyte per row, then 95% of tables contain less than 100 gigabytes of data. But you can make the argument: "oh, I have heaps and heaps of tables and I query them all the time." You don't.
2:19 This is from the same fleet statistics, but on query runtime: 87% of queries complete in less than a second. Let's make some more assumptions: you have 10 nodes in your cluster, they do nothing but pull data out of S3, and they can each do that at 8 gigabytes a second. That's a total of 80 gigabytes a second, so 87% of queries use less than 80 gigabytes of data. Almost no one queries or stores big data. Well, almost no one: I work for a company that does, but we'll ignore them for now.
2:47 So I've made the case that no one really has this data, but you're probably thinking: what do these libraries look like? How do they work? Shut up and show me the code. The motivating example we're going to use is the One Billion Row Challenge. Some of you might be familiar with it: it's a challenge that ran about a year ago to find the fastest JVM implementation of a program that could compute the min, max, and mean for a set of weather station data. Basically, you take the blue stuff at the top and convert it to the red stuff at the bottom. The fastest JVM implementation ran in about one and a half seconds, which is pretty respectable.
3:17 We're going to rerun this challenge with our own experimental setup: a 16 GB CSV on an AWS box I rented with 32 cores and 128 GB of RAM. We'll do 30 repetitions for each trial, with two warm-up runs to control for cache effects and all that kind of stuff. Notably, my examples don't care about output serialization, which the original challenge did; the reason is that it's just not super relevant to the kinds of things we do with these tools.
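As a rough idea of the harness, here's a minimal sketch of how such a trial loop might look; the command, the repetition counts, and the median summary are my assumptions, not the talk's actual rig:

```python
import statistics
import subprocess
import time

def bench(cmd: list[str], reps: int = 30, warmups: int = 2) -> float:
    """Run warm-ups first, then time `reps` repetitions and report the median."""
    for _ in range(warmups):
        subprocess.run(cmd, check=True)
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# e.g. bench(["python", "solution_pandas.py"])
```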
3:40 The first example is probably something many of you will be familiar with: basic vanilla pandas. You read a CSV, you group it, you do an aggregation, and then we round it, which was part of the original spec of the challenge. Fine. The one thing I'll particularly call out here is that pandas executes each of these lines sequentially: it reads the CSV, then it does the group-by, then it does the agg. The reason I'm stressing that point will become clear in a second.
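For reference, a minimal sketch of that vanilla pandas solution; the file name and column names here are assumptions on my part rather than the exact ones from the slide:

```python
import pandas as pd

# Eagerly read the whole CSV into memory: this single line is where
# most of the time and RAM goes.
df = pd.read_csv("measurements.txt", sep=";", header=None,
                 names=["station", "temperature"])

# Group by station, compute min/mean/max, and round per the challenge spec.
result = (
    df.groupby("station")["temperature"]
      .agg(["min", "mean", "max"])
      .round(1)
)
```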
4:02 This is the next one: Polars, the Rust-based library. It should look very, very familiar to anyone who's used pandas, with the single exception that group_by has an underscore, as God intended. You read the CSV in, you do the group_by, you do the agg, you do some sorting, whatever. The last line is the one I'm really interested in, which is collect. By default, Polars is lazily evaluated, and that makes a big difference in terms of performance gains, because 40 or 50 years of database research and optimization can now be leveraged. As opposed to dumbly going "I'm going to read it, I'm going to group it, I'm going to aggregate it", it goes "okay, here's a whole set of operations; let's decompose them into a DAG of operations and apply them in an optimal manner". That's what collect does. So you get things like predicate pushdown and other real benefits of a true, full-fat SQL database, right here in Python. And the last part of that, new_streaming=True, just enables streaming mode. It's like saying: load the data chunk-wise, do the processing, and spit out the result, rather than storing everything in memory the way pandas does. Cool.
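A hedged sketch of the Polars version; the file and column names are assumptions, and I've used the long-standing streaming=True spelling, since the new_streaming flag from the slide was an experimental option that newer Polars releases have since replaced:

```python
import polars as pl

# scan_csv is lazy: nothing is read here, we only build a query plan.
result = (
    pl.scan_csv("measurements.txt", separator=";", has_header=False)
      .rename({"column_1": "station", "column_2": "temperature"})
      .group_by("station")
      .agg(
          pl.col("temperature").min().alias("min"),
          pl.col("temperature").mean().alias("mean"),
          pl.col("temperature").max().alias("max"),
      )
      .sort("station")
      # collect() hands the whole plan to the optimizer; streaming mode
      # processes the file chunk-wise instead of holding it all in RAM.
      .collect(streaming=True)
)
```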
5:06 And the last example is DuckDB, which is a bit more wacky, but essentially, as I said, in-memory SQL. This should look very familiar to anyone who's done any kind of rudimentary SQL: you're just creating a table, selecting the things you want, and doing a group-by.
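A sketch of the DuckDB equivalent, again with assumed file and column names:

```python
import duckdb

# DuckDB parallelizes both the CSV scan and the aggregation across all cores.
result = duckdb.sql("""
    SELECT
        station,
        MIN(temperature)           AS min_temp,
        ROUND(AVG(temperature), 1) AS mean_temp,
        MAX(temperature)           AS max_temp
    FROM read_csv('measurements.txt', delim=';', header=false,
                  columns={'station': 'VARCHAR', 'temperature': 'DOUBLE'})
    GROUP BY station
    ORDER BY station
""").df()
```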
5:20 Now we get to the interesting part, which is the results. This table has a few columns: the library on the left-hand side; median duration across the 30 trials; median max CPU, where 100% is a single CPU core; median max USS, which you can think of as the maximum amount of RAM the process used, i.e. what the operating system would get back if you killed it; and swap in the right-hand column, which isn't super relevant for this trial, but you'll see later that it is. So: pandas did it in four and a half minutes. Would anyone like to guess how fast Polars did it? Anyone? No? Hm, you're not far off.
5:52 Five seconds. That's within an order of magnitude of the best hand-tooled JVM solution running native code. It used all 32 cores of the machine that you actually paid for, as opposed to the one core with pandas. It used 18 GB of memory, which is reasonable, and again, it didn't dip into swap because the machine has so much memory it doesn't need to. And the last one is DuckDB: it did it in 5 seconds, used the full 32 cores, and did it in 2 GB of memory. Not bad.
6:16 The question you're asking now is: this is wonderful, this is really good production performance, but how does it actually change my life day-to-day, other than when the jobs run? The real way to think about it is: does it speed up your dev loop and make your life better? So the next set of trials is the same test again, but on my local laptop, which has 16 GB of RAM and eight cores, and here are the results. Pandas did abysmally: it swapped heaps of stuff out to disk, it took ages, and it used a single core again. Polars did respectably again: it used all eight cores and very slightly dipped into swap. And DuckDB was the real winner here, with 500 MB of RAM usage, all eight cores, and a pretty respectable runtime. So I guess the message is: these tools improve your dev loop as well as your production performance, and they're much more efficient with resources.
7:06 So what do these tools really give us? Great performance on a flawed benchmark. All benchmarks are flawed, and this one is particularly flawed because I basically chose it to demonstrate some of the benefits of using these tools as opposed to pandas. One of the main ones is painless multi-threading: I didn't do anything with a thread pool, I didn't create anything, I didn't worry about fork-join; I just got all of the performance out of my machine. Another is streaming, which both of these tools use under the hood: DuckDB does a lot of that predicate-pushdown streaming work, and Polars essentially filters the data through memory chunk by chunk. The other thing is spill-to-disk, which isn't obvious in this example: pandas swapped out to disk, which is fine, because that's what the operating system does, but the operating system has no idea what your job is doing. These tools have the ability to spill to disk natively, so for large computations they can say "I don't think you're going to need this for another ten lines; I'm going to spill it out to disk and keep the stuff that's on the hot path in memory". That is a big boon. And the last one, really the killer, is lazy evaluation: you can create a DAG of the tasks you're evaluating, and it will execute them in the optimal order.
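If you want to see that optimization happening, Polars will print its optimized plan for you; a small sketch, with a made-up filter so the predicate pushdown is visible:

```python
import polars as pl

lf = (
    pl.scan_csv("measurements.txt", separator=";", has_header=False)
      .rename({"column_1": "station", "column_2": "temperature"})
      .filter(pl.col("station") == "Adelaide")  # hypothetical predicate
      .group_by("station")
      .agg(pl.col("temperature").mean())
)

# explain() shows the optimized plan: the filter is pushed down into the
# CSV scan, so non-matching rows are never materialized at all.
print(lf.explain())
```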
8:10 You're interested, but shiny tools have hurt you before. We've all been there. Anyone remember Redux? No? I'm not the youngest here, then. Cool.
8:21 So the way you actually adopt these tools is via Apache Arrow. This is the de facto in-memory data layout format that the community is settling on. It's been supported in pandas since its 2.0 release in April 2023, and support is getting there; it's really getting there. The key detail is that both of these tools, Polars and DuckDB, support zero-copy serialization between representations: I can go from pandas to DuckDB to Polars and back again without copying anything in memory, provided it's all PyArrow-backed. If the pandas dataframe is not PyArrow-backed, you get a fresh copy, but as we've already seen, these tools have much lower memory footprints, so holding a duplicate copy in memory is probably fine if you could afford the pandas overhead in the first place.
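A minimal sketch of that round trip, assuming a PyArrow-backed pandas frame; the data and names are purely illustrative:

```python
import duckdb
import pandas as pd
import polars as pl

# Arrow-backed dtypes are the precondition for the zero-copy path.
pdf = pd.DataFrame({"station": ["A", "B"], "temp": [1.5, 2.5]})
pdf = pdf.convert_dtypes(dtype_backend="pyarrow")

# pandas -> Polars: from_pandas can reuse the underlying Arrow buffers.
plf = pl.from_pandas(pdf)

# pandas/Polars -> DuckDB: duckdb.sql can query in-scope frames by name.
rel = duckdb.sql("SELECT station, AVG(temp) AS avg_temp FROM plf GROUP BY station")

# DuckDB -> back to Polars or pandas.
back_to_polars = rel.pl()
back_to_pandas = rel.df()
```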
9:04 So let's look at another practical example: the New York taxi dataset. It's a dataset of every taxi trip taken in New York from 2009 to the present, where the data is in monthly files, so you can process through them. Corporate wants to find out whether the pandemic made cash less common, so we're looking at the files between 2019 and 2022, which is about 3 GB of Parquet on disk. We're using DuckDB with pandas, but you can actually mix and match these tools as much as you want: you could use Polars and pandas, or any combination thereof.
9:34 There are four functions I'm going to call out here. Don't try to read all this code; there's just too much of it going on. But there are essentially four functions: one for reading and one for computing in each of the two libraries, and we're going to mix and match them to show how you can use them together. The key point on this slide is that reading the data with pandas can be done naively: you just loop over the files and merge them all together in memory, passing dtype_backend="pyarrow". At the bottom of the slide there's a bunch of boring data-sciencey stuff where we have to clamp the data to the right range, and all that nasty stuff you have to do when you actually use real data.
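A sketch of that naive pandas read; the directory layout, glob pattern, and file names are my assumptions:

```python
from pathlib import Path
import pandas as pd

def read_pandas(data_dir: str = "data") -> pd.DataFrame:
    # Eagerly read every monthly Parquet file with Arrow-backed dtypes,
    # then concatenate: every row is in memory at once.
    frames = [
        pd.read_parquet(path, dtype_backend="pyarrow")
        for path in sorted(Path(data_dir).glob("yellow_tripdata_*.parquet"))
    ]
    return pd.concat(frames, ignore_index=True)
```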
10:05 The pandas compute also looks really familiar: we're doing a group-by, we're doing an agg, we're doing some horrible unstacking in the middle, resetting some indices, and basically summing up the column levels for the payment-type enum, cash versus card versus whatever. Again, don't sweat the details too much; these are here for completeness.
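In that spirit, a hedged sketch of what such a pandas compute step might look like; the column names and the shape of the pivot are assumptions:

```python
import pandas as pd

def compute_pandas(df: pd.DataFrame) -> pd.DataFrame:
    # Derive a month column, count trips per month and payment type,
    # then unstack the payment enum into columns for comparison.
    month = pd.to_datetime(df["tpep_pickup_datetime"]).dt.to_period("M")
    return (
        df.groupby([month, "payment_type"])
          .size()
          .unstack("payment_type", fill_value=0)
          .reset_index()
    )
```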
10:21 The DuckDB read looks very familiar; again, it's just a SELECT statement in SQL. The thing I really want to call out, though, is this line, which is really sick, because DuckDB is just merging these files for you: it uses a glob pattern to find all the Parquet files in the directory and put them together. The last part I want to call out is that the boring data-sciencey miscellaneous stuff I was talking about before actually happens at the point where you read the data in. So all that predicate pushdown, filtering out the data you don't need, is happening for you. Pandas had to read everything in and then be told "chuck away the bits I don't want", whereas this says "just don't read in the crap I don't want in the first place", which will help you save a lot of memory.
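A sketch of that glob-based read with the range filter applied at scan time; the path, columns, and predicate are assumptions:

```python
import duckdb

def read_duckdb(data_dir: str = "data") -> duckdb.DuckDBPyRelation:
    # The glob lets DuckDB merge every monthly file itself, in parallel,
    # and the WHERE clause is pushed down into the Parquet scan, so
    # out-of-range rows are never read into memory at all.
    return duckdb.sql(f"""
        SELECT tpep_pickup_datetime, payment_type, total_amount
        FROM read_parquet('{data_dir}/yellow_tripdata_*.parquet')
        WHERE tpep_pickup_datetime >= '2019-01-01'
          AND tpep_pickup_datetime <  '2023-01-01'
    """)
```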
11:02 The DuckDB computation slide really is included for completeness; I wouldn't try to grok all of it. It's a bunch of common table expressions put together with some joins and smooshed together. I wouldn't stress about the details; you can look at it later. The last part, again, I'm not going to enumerate fully, but it's essentially all the combinations of these things: reading and computing with each framework together, and then pure solutions for each. Again, don't stress, it's just there for completeness.
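For the curious, the CTE-plus-join shape the slide describes might look roughly like this; payment_type = 2 meaning cash follows the TLC data dictionary, and everything else here is assumed:

```python
import duckdb
import pandas as pd

def compute_duckdb(rel: duckdb.DuckDBPyRelation) -> pd.DataFrame:
    # One CTE per step, joined back together: monthly totals vs cash trips.
    return duckdb.sql("""
        WITH monthly AS (
            SELECT date_trunc('month', tpep_pickup_datetime) AS month,
                   COUNT(*) AS trips
            FROM rel GROUP BY 1
        ),
        cash AS (
            SELECT date_trunc('month', tpep_pickup_datetime) AS month,
                   COUNT(*) AS cash_trips
            FROM rel WHERE payment_type = 2 GROUP BY 1
        )
        SELECT month, cash_trips / trips::DOUBLE AS cash_share
        FROM monthly JOIN cash USING (month)
        ORDER BY month
    """).df()
```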
11:25 The part we're all really here for is the performance on my actual machine, my laptop at home. Pure pandas is obviously the worst; we all knew that. It used about one core, 14 GB of memory, and swapped out lots of stuff. DuckDB doing the reading with pandas doing the thinking gets your runtime down, simply because DuckDB can read the files in parallel: it takes those glob patterns and goes "I've got eight cores, I'm going to use them and read as much as I can". You use all eight cores, but you still swap, probably because the pandas group-by expands and then contracts in memory. In the example where pandas does the reading and DuckDB does the computing, you still get a benefit, and you avoid that expand-contract pattern in memory because the computation isn't being done by pandas. And finally, the pure DuckDB solution: eight cores, 200 MB of RAM versus roughly 15 GB for all the others. I mean, I'm impressed.
12:10 Cool. Did cash usage fall over the pandemic? Yes. I don't know if it's actually scientific, and I don't really care; here's a chart just to prove that I did the actual research. For some reason, in March 2020, people started using cash more, which I don't understand or believe, but that is the chart. Overall, it did go down. Okay.
12:29 So: you're technologists, and you should be skeptical; if you're not, become skeptical. Why shouldn't you just listen to me? The answer is, I'm just a dude with a laptop. Do your own research: clone my code, it's open source, it's on GitHub; run your own benchmarks and figure out how your own workloads behave. Don't just take by rote the word of a dude on a stage in a sweater. The second reason is if you're super deeply integrated into the pandas ecosystem. I did my honours degree in energy research, and there was this weird package for energy modelling that used pandas as its base data structure, which is really cooked in hindsight, but the people doing that kind of modelling are not getting away from pandas, and honestly they probably shouldn't. Another example is geospatial: don't bother, it's not really worth it. The last point is "let pandas cook": let them improve, let them get better, they are the kings. And that's true; they have made progress on integrating PyArrow and getting better, but that progress has honestly been abysmally slow.
13:24 The other part I will say is that there are benefits to switching off pandas that go beyond performance. There's a lot around ergonomics and API usability: the mental load of using pandas is significantly higher than just writing SQL, in my opinion. The other one is a transferable skill set. One of my major criticisms of data science as a field when I started out was that it basically just feels like you're learning an API; there's no novelty there, it's just "how do I glue together these pandas things?". What you actually get by switching to a tool that lets you use SQL, for example, is transferable skills across different domains.
13:54 I've suggested two tools, so which one is better? This really, really depends on your preference: SQL versus code. I personally prefer Polars because I like composable functions, I like unit testing, that kind of stuff. But DuckDB goes well beyond performance: you get local-first analytics. You could do local-first analytics in an embedded Android app. Say you're making a health app: you just embed the DuckDB binary in your code, and you can build cool analytics dashboards for your users without sending anything to the server. That's nuts. The other one is compiling it to Wasm and doing SQL queries for visualization from the front end: your back end doesn't return JSON, it returns signed S3 URLs, and you just query them. That is nuts. It's so cool to me as more of an app-dev kind of person, as opposed to a data person.
14:34 So that brings us to the takeaways of this talk. First: you can't avoid complex distributed querying systems, but you can defer them, potentially indefinitely, by just not using pandas in the first place. Second: we really are experiencing a renaissance in Python tooling at the moment, particularly in the dataframe space, so actually get out there and try some of these tools; your assumptions from five years ago are no longer valid. And finally, a bit of shameless self-promotion: the code for this is on GitHub, I have a LinkedIn, and I have a blog that I occasionally post to. If you want to follow me on Twitter, don't, because I don't care what you…
15:05 That's it. I believe I have a slide as well for feedback and what's up next.
15:11 [Applause]