Latency2025 - Pandas Should Go Extinct presented by Eddie Atkinson
So, yeah, the topic of my talk is "pandas should go extinct", and I obviously feel obligated to clarify: not the fluffy things that live at Adelaide Zoo, but the Python dataframe library. I'm going to jump straight into it with this incredibly vibes-based diagram you can see behind me, which is essentially the adoption curve you typically follow with dataframe tooling. You start on the far left of the diagram with a relatively small amount of data, and you use Excel. Eventually you get to the point, somewhere in the single-digit gigabytes, where you graduate to pandas, which is the typical path. You use pandas for a while, you finish up somewhere in the double-digit gigabytes, and you just don't have enough RAM left, and your computer only uses a single core with pandas. Then you hit this cliff. On the far right-hand side of the cliff you can see all these lovely Spark derivatives and distributed querying systems, with Dask at the top just to make things a bit more interesting. And you're forced to jump onto these systems in order to keep querying your data, doing analysis, that kind of stuff.

Now, my controversial take is that you can avoid these systems, because your data will probably never get to that size, by using the tools I've got at the top of this slide: DuckDB, which is an in-memory analytics database, kind of SQLite for analytics if you will, and Polars, which is a Rust-based dataframe library that feels very similar to pandas but makes some important changes. And I'll say this gap happens around the 100 GB size, maybe a bit bigger than that.
So the question you might be asking is: why do I actually care about datasets of this size? Fundamentally, the answer to that question is: who actually has big data? This table is taken from a paper published by Amazon themselves. They operate Redshift, which is basically a distributed analytics database, kind of like Snowflake or Spark. What the table shows is fleet statistics for how many rows are in each table on Redshift, and you'll see that 95% of the tables contain fewer than 10^8 rows. So if we assume, say, 1 kilobyte per row, 95% of tables contain less than 100 gigs of data. But you could make the argument: "oh, I have heaps and heaps of tables and I query them all the time." You don't. This is from the same fleet statistics, but on query runtime: 87% of queries complete in less than a second. Let's make some more assumptions. You have 10 nodes in your cluster, they do nothing but suck data out of S3, and each can do it at 8 gigs a second. That's a total of 80 gigabytes a second, so 87% of queries use less than 80 gigabytes of data. Almost no one queries or stores big data. Well, yeah, almost no one. I work for a company that does, but we'll ignore them for now.
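(Just to make that back-of-envelope arithmetic concrete, here it is as a tiny Python sketch; the 1 KB/row and 8 GB/s/node figures are the slide's assumptions, not measured values.)

```python
# 95th-percentile table size, assuming ~1 KB per row
rows = 10**8
table_bytes = rows * 1_000
print(table_bytes / 10**9)  # 100.0 -> roughly 100 GB per table

# Data a sub-second query could touch: 10 nodes at 8 GB/s each
nodes, gb_per_node_per_sec = 10, 8
print(nodes * gb_per_node_per_sec)  # 80 -> at most ~80 GB in one second
```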
So, I've made the case that no one really has this data, but you're probably thinking: what do these libraries look like? How do they work? Shut up and show me the code. The motivating example we're going to use is the One Billion Row Challenge, which some of you might be familiar with. It's a challenge that ran about a year ago to find the fastest JVM implementation of a program that could compute the min, max, and mean for a set of weather station data. Basically, you take the blue stuff at the top and convert it to the red stuff at the bottom. The fastest JVM implementation ran in about one and a half seconds, which is pretty respectable. We're going to rerun this challenge with our own experimental setup: a 16 gig CSV, on an AWS box I rented with 32 cores and 128 gigs of RAM. We're doing 30 repetitions for each trial, with two warm-up runs to account for caching effects and all that kind of stuff. Notably, my examples don't care about output serialization, which the original challenge did; the reason is that it's just not super relevant to the kinds of things we do with these tools.
So, the first example here is probably something many of you will be familiar with: basic vanilla pandas. You read a CSV, you group it, you do an aggregation, and then we round it, which was the original spec of the challenge. Fine. The one thing I'll call out here in particular is that pandas executes each of these lines sequentially and eagerly: it reads the CSV, it does the group-by, it does the agg. The reason I'm stressing that point will become clear in a second.
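(For reference, a minimal sketch of roughly what that pandas version looks like; the file name, separator, and column names are my assumptions based on the challenge's "station;temperature" format, not the talk's exact code.)

```python
import pandas as pd

# Each input line is "station;temperature" (1BRC format, assumed here).
df = pd.read_csv("measurements.txt", sep=";", header=None,
                 names=["station", "temperature"])

# Every step runs eagerly, one after another, on a single core.
result = (
    df.groupby("station")["temperature"]
      .agg(["min", "mean", "max"])
      .round(1)  # the original challenge rounds to one decimal place
)
print(result)
```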
This is the next one: Polars, the Rust-based library. It should look very, very familiar to anyone who's used pandas, with a single exception: group_by has an underscore, as God intended. You read the CSV in, you do the group_by, you do the agg, you do some sorting, whatever. The last line is the one I'm really interested in here, which is collect. By default, Polars is lazily evaluated, and that makes a big deal in terms of performance gains, because 40 or 50 years of database research and optimization can now be leveraged. As opposed to dumbly going "I'm going to read it, I'm going to group it, I'm going to aggregate it", it goes "okay, here's a whole set of operations; let's decompose them into a DAG of operations and apply them in an optimal manner." That's what collect triggers. So you get things like predicate pushdown and other real benefits of a true full-fat SQL database, right here in Python. And the last part of that line, new_streaming=True, is just enabling streaming mode. It's saying: load the data chunk-wise, do the processing, and spit out the result. Don't store everything in memory the way pandas does. Cool.
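(Again, a minimal sketch under the same assumptions about the input file; note the streaming flag has changed names across Polars versions, so the talk's new_streaming=True appears here as the older streaming=True.)

```python
import polars as pl

# scan_csv builds a lazy query plan instead of reading anything yet.
lazy = (
    pl.scan_csv("measurements.txt", separator=";", has_header=False,
                new_columns=["station", "temperature"])
      .group_by("station")
      .agg(
          pl.col("temperature").min().alias("min"),
          pl.col("temperature").mean().alias("mean"),
          pl.col("temperature").max().alias("max"),
      )
      .sort("station")
)

# collect() hands the whole plan to the optimizer; the streaming
# engine processes the file chunk-wise rather than all in memory.
result = lazy.collect(streaming=True)
print(result)
```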
And the last example is DuckDB, which is a bit more wacky, but essentially, as I said, in-memory SQL. This should look very familiar to anyone who's done any kind of rudimentary SQL: you're just creating a table, selecting the things you want, and doing a group by.
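(A sketch of what that might look like from Python; again, the file name and schema are assumptions based on the challenge format rather than the talk's exact SQL.)

```python
import duckdb

# DuckDB runs in-process: no server to stand up, just a library call.
result = duckdb.sql("""
    SELECT
        station,
        MIN(temperature)           AS min_temp,
        ROUND(AVG(temperature), 1) AS mean_temp,
        MAX(temperature)           AS max_temp
    FROM read_csv('measurements.txt', delim=';', header=false,
                  columns={'station': 'VARCHAR', 'temperature': 'DOUBLE'})
    GROUP BY station
    ORDER BY station
""").df()
print(result)
```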
Now we get to the interesting part, which is the results. This table has a few columns: the library on the left-hand side, the median duration across the 30 trials, median max CPU (where 100% is a single CPU core), and median max USS, which you can think of as the amount of RAM the process held at peak, what the operating system would get back if you killed it. Swap is in the right-hand column, which is not super relevant for this trial, but you'll see later that it is. So: pandas did it in four and a half minutes. Would anyone like to guess how fast Polars did it? Anyone? No? Hm, you're not far off: 5 seconds. That's within an order of magnitude of the best hand-tooled JVM solution running native code. It used all 32 cores of the machine that you actually paid for, as opposed to the one core with pandas. It used 18 gigs of memory, which is reasonable, and it didn't dip into swap, because the machine has so much memory it doesn't need to. And the last one is DuckDB: it did it in 5 seconds, used the full 32 cores, and did it in 2 gigs of memory. Not bad.
Now the question you're asking is: this is wonderful, this is really good empirical production performance, great, but how does this actually change my life day-to-day, other than when the jobs run? The real way to think about this is: does it speed up your dev loop, does it make your life better, that kind of thing. So the next set of trials is the same test again, but on my laptop, which has 16 gigs of RAM and eight cores, and here are the results. Pandas did abysmally: it swapped heaps of stuff out to disk, it took ages, and it used a single core again. Polars did respectably again: it used all eight cores and dipped very slightly into swap. And DuckDB was the real winner here, with 500 megs of RAM usage, all eight cores, and a pretty respectable runtime. So I guess the message is that these tools improve your dev loop as well as your production performance, and they're much more efficient with resources.
So what do these tools really give us? Great performance on a flawed benchmark. All benchmarks are flawed; this one is particularly flawed because I basically chose it to demonstrate some of the benefits of these tools over pandas. One of the main ones is painless multi-threading: I didn't do anything with a thread pool, I didn't create threads, I didn't worry about fork-join; I just got all of the performance out of my machine. Another is streaming, which both of these tools use under the hood: DuckDB does a lot of that predicate pushdown and streaming work, and so does Polars, which essentially filters things through memory chunk by chunk. The other thing is spill to disk, which is not obvious in this example. Pandas swapped out to disk, which is fine, because that's what the operating system does, but the operating system has no idea what your job is doing. These tools have the ability to spill to disk natively, so for large computations they can say: "I don't think you're going to need this for a while; I'm going to spill it out to disk and keep the stuff that's on the hot path in memory." That's a big boon. And the last one, really the killer, is lazy evaluation: you create a DAG of the tasks you're evaluating, and it executes them in the optimal order.
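(You can actually watch that optimization happen. A small sketch, with an assumed filter and the same assumed input file: LazyFrame.explain() in Polars prints the optimized plan, and you should see the filter pushed down into the CSV scan rather than applied afterwards.)

```python
import polars as pl

query = (
    pl.scan_csv("measurements.txt", separator=";", has_header=False,
                new_columns=["station", "temperature"])
      .filter(pl.col("temperature") > 0)  # hypothetical filter
      .group_by("station")
      .agg(pl.col("temperature").mean())
)

# The optimized plan folds the filter into the scan itself (predicate
# pushdown), so filtered-out rows are never materialized at all.
print(query.explain())
```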
So: you're interested, but shiny tools have hurt you before. We've all been there. Anyone remember Redux? No? I'm not the youngest here, then. Cool. So the way you actually adopt these tools is via Apache Arrow. This is the de facto in-memory data layout format that the community is settling towards. It's been supported in pandas since its 2.0 release in April 2023, and support is, yeah, getting there. It's really getting there. But the key detail is that both of these tools, Polars and DuckDB, support zero-copy conversion between representations. I can go from pandas to DuckDB to Polars and back again without copying anything in memory, as long as it's all PyArrow-backed. If the pandas dataframe is not PyArrow-backed, you get a fresh copy, but as we've already seen, these tools have much lower memory footprints, so holding a duplicate copy in memory is probably fine if you could afford the pandas overhead in the first place.
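(A minimal sketch of that round trip, assuming a PyArrow-backed frame; whether each hop is genuinely zero-copy depends on the dtypes involved, so treat this as illustrating the workflow rather than a guarantee.)

```python
import duckdb
import pandas as pd
import polars as pl

# An Arrow-backed pandas frame shares buffers instead of copying.
pdf = pd.DataFrame({"station": ["a", "b"], "temperature": [1.5, 2.5]})
pdf = pdf.convert_dtypes(dtype_backend="pyarrow")

pldf = pl.from_pandas(pdf)               # pandas -> Polars
rel = duckdb.sql("SELECT * FROM pldf")   # DuckDB queries the frame in place
back = rel.pl()                          # DuckDB -> Polars again
```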
So, let's look at another practical example: the New York taxi dataset. It's a dataset of every taxi trip taken in New York from 2009 to the present, where the data comes in monthly files you can process through. Corporate wants to find out whether the pandemic made cash less common, so we're looking at the files between 2019 and 2022, which is about 3 gigs of Parquet on disk. We're using DuckDB with pandas, but you can actually mix and match these tools as much as you want; you could use Polars and pandas, or any combination thereof. There are four functions I'm going to call out here. Don't try to read all this code, there's just too much going on, but there are essentially four functions: one for reading and one for computing, for each of the two libraries. We're going to mix and match them and show how you can just use them together.
The key point to take away from this slide is how we read the data with pandas: you can naively just do a loop, merge everything together in memory, and pass dtype_backend="pyarrow". At the bottom of the slide there's a bunch of boring data-sciencey stuff where we have to clamp the data to the right range, all the nasty stuff you have to do when you work with real data.
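(A sketch of that naive pandas read, with assumed paths; the point is the loop-and-concatenate pattern plus the Arrow-backed dtypes.)

```python
import glob
import pandas as pd

# Read every monthly Parquet file and concatenate the lot in memory.
files = sorted(glob.glob("taxi_data/*.parquet"))
df = pd.concat(
    (pd.read_parquet(f, dtype_backend="pyarrow") for f in files),
    ignore_index=True,
)
```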
The pandas compute also looks really familiar: we're doing a group-by, we're doing an agg, we're doing some horrible unstacking in the middle, resetting some indices, and basically summing across the column levels for the payment enum, cash versus card versus whatever. Again, don't sweat the details too much; these are here for completeness.
DuckDB looks very familiar too; again, this is just a SELECT statement in SQL. The thing I really want to call out, though, is this line, which is really sick: DuckDB is merging these files for you. It's using a glob pattern to find all the Parquet files in the directory and just put them together. The other part I want to call out is that the boring data-sciencey miscellaneous stuff I was talking about before is actually happening at the point where you read the data in. All that predicate pushdown, filtering out data you don't need, is happening for you. Pandas had to read everything in, and then you had to say "chuck away the bits I don't want", whereas this just doesn't read in the crap you don't want in the first place. That'll save you a lot of memory.
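(A sketch of the idea; the directory, column names, and date filter are my assumptions, not the talk's exact query. The WHERE clause is pushed down into the Parquet scan, and the glob lets DuckDB read the monthly files in parallel.)

```python
import duckdb

# One glob pattern stands in for the whole directory of monthly files;
# the predicate is applied during the scan, not after loading.
trips = duckdb.sql("""
    SELECT tpep_pickup_datetime, payment_type
    FROM read_parquet('taxi_data/*.parquet')
    WHERE tpep_pickup_datetime >= DATE '2019-01-01'
      AND tpep_pickup_datetime <  DATE '2023-01-01'
""").df()
```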
The DuckDB computation: this slide really is included for completeness, and I wouldn't try to grok all of it. It's a bunch of common table expressions put together with some joins and smooshed together; I wouldn't stress about the details, you can look at it later. The last part here, again, I'm not going to enumerate fully, but it's essentially all the combinations of these things: reading with one framework and computing with the other, and then pure solutions for each of them. Again, I wouldn't stress; it's just there for completeness.
The part we're all really here for is the performance on my actual machine, sitting at home on my laptop. Pure pandas is obviously the worst; we all knew that. It used about a core, it used 14 gigs of memory, and it swapped out lots of stuff. DuckDB doing the reading with pandas doing the thinking gets your runtime down, simply because DuckDB can read the files in parallel: it takes those glob patterns and goes "I've got eight cores, I'm going to use them and read as much as I can". You use all eight cores, but you still swap, probably because the pandas group-by expands and then contracts in memory. In the example where pandas does the reading, you still get a benefit, but you don't get that expand-contract pattern in memory, because the group-by isn't being done by pandas. And finally, the pure DuckDB solution: eight cores, 200 megs of RAM, versus something like 15 gigs for all the others. I mean, I'm impressed.
Cool. So, did cash usage fall over the pandemic? Yes. I don't know if it's actually scientific, and I don't really care, but here's a chart just to prove that I did the actual research. For some reason, in March 2020, people started using cash more, which I don't understand or believe, but that is the chart. It did go down. Okay.
So, you're technologists; you should be skeptical, and if you're not, become skeptical. Why shouldn't you actually just listen to me? The answer is: I'm just a dude with a laptop. Do your own research. Clone my code; it's open source, it's on GitHub. Do your own benchmarks and figure out how your own workloads behave. Don't just take it as rote from a dude on a stage in a sweater. The second reason is if you're super deeply integrated into the pandas ecosystem. I did my honours degree in energy research, and there was this weird package that did energy modelling using pandas as its base data structure, which is really cooked in hindsight, but the people doing that kind of modelling are not getting away from pandas, and honestly they probably shouldn't. Another example is geospatial: don't bother, it's not really worth it.
The last point is: let pandas cook. Let them improve, let them get better; they are the kings. And that's true, they have made progress on integrating PyArrow and getting better, but that progress has honestly been abysmally slow. The other thing I'll say is that there are benefits to switching off pandas that go beyond performance. There's a lot of stuff around ergonomics and API usability: the mental load of using pandas is significantly higher than just writing SQL, in my opinion. And the other one is a transferable skill set. One of my major criticisms of data science as a field when I started out was that it basically feels like you're just learning an API. There's no novelty there; it's just "how do I glue together these pandas things?". What you actually get by switching to a tool that lets you use SQL, for example, is transferable skills across different domains.
I've suggested two tools; which one is better? This really, really depends on your preference: SQL versus code. I personally prefer Polars because I like composable functions, I like unit testing, that kind of stuff. But DuckDB goes well beyond performance. You get local-first analytics: you could do local-first analytics in an embedded Android app. Say you're making a health app; you embed this binary in your code and you can build cool analytics dashboards for your users without sending anything to a server. That's nuts. The other one is compiling it to WASM and doing SQL queries for visualization from the front end: your back end doesn't return JSON, it returns signed S3 URLs, and you just query them. That is nuts. It is so cool to me as more of an app-dev kind of person, as opposed to a data person.
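(To make the signed-URL idea concrete: the browser story uses DuckDB-WASM, but the same pattern works from Python via DuckDB's httpfs extension, as in this sketch; the URL is a placeholder.)

```python
import duckdb

# httpfs lets DuckDB scan remote files over HTTP(S), including
# presigned S3 URLs, without downloading them wholesale first.
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")

df = duckdb.sql("""
    SELECT payment_type, COUNT(*) AS trips
    FROM read_parquet('https://example.com/some-signed-url.parquet')
    GROUP BY payment_type
""").df()
```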
So that brings us to the takeaways of this talk, which are basically: you can't avoid complex distributed querying systems forever, but you can defer them, potentially indefinitely, by just not using pandas in the first place. The second is that we really are experiencing a renaissance in Python tooling at the moment, particularly in the dataframe space, so actually get out there and try some of these tools, because your assumptions from five years ago are no longer valid. And finally, a bit of shameless self-promotion: the code for this is on GitHub, I have a LinkedIn, I have a blog that I occasionally post to, and if you want to follow me on Twitter, don't, because I don't care what you think. That's it. I believe I have a slide as well for feedback and what's up next.
[Applause]