0:07 So, the topic of my talk is "pandas should go extinct", and I obviously feel obligated to clarify: not the fluffy things that live at Adelaide Zoo, but the Python dataframes library. I'm going to jump straight into it with this incredibly vibes-based diagram you can see behind me, which is essentially the adoption curve you typically follow with a dataframe library.
0:28 You start on the far left of this diagram with a relatively small amount of data, and you use Excel. Eventually you get to the point, somewhere in the single-digit gigabytes, where you graduate to pandas, which is the typical path. You use pandas for a while, you finish up somewhere in the double-digit gigabytes, and then you just don't have enough RAM left, your computer only uses a single core with pandas, and you hit this cliff. On the far right-hand side you can see all these lovely Spark derivatives and distributed querying systems, with Dask at the top just to make things a bit more interesting, and you're forced to jump onto these systems to keep querying your data, doing analysis, that kind of stuff.
1:08 Now, my controversial take is that you can avoid these systems, probably because your data will never get to that size, by using the tools at the top of this slide: DuckDB, which is an in-memory analytics database, kind of SQLite for analytics if you will, and Polars, which is a Rust-based dataframe library that feels very similar to pandas but makes some important changes. And I'll say this cliff happens around the 100 GB mark, maybe a bit bigger than that.
1:35 So the question you might be asking is: why do I actually care about datasets of this size? Fundamentally, the answer comes down to who actually has big data. This table is taken from a paper published by Amazon themselves. They operate Redshift, which is a distributed analytics database, kind of like Snowflake or Spark. What the table shows is fleet statistics for how many rows are in each table on Redshift, and you'll see that 95% of tables contain fewer than 10^8 rows. If we assume, say, 1 kilobyte per row, then 95% of tables contain less than 100 gigabytes of data. But you can make the argument: "oh, I have heaps and heaps of tables and I query them all the time." You don't.
2:19 This is from the same fleet statistics, but on query runtime: 87% of queries complete in less than a second. Let's make some more assumptions: you have 10 nodes in your cluster, they do nothing but pull data out of S3, and they can each do that at 8 gigabytes a second. That's a total of 80 gigabytes a second, so 87% of queries use less than 80 gigabytes of data. Almost no one queries or stores big data. Well, almost no one: I work for a company that does, but we'll ignore them for now.
2:47 So I've made the case that no one really has this data, but you're probably thinking: what do these libraries look like? How do they work? Shut up and show me the code. The motivating example we're going to use is the One Billion Row Challenge. Some of you might be familiar with it: it's a challenge that ran about a year ago to find the fastest JVM implementation of a program that could compute the min, max, and mean for a set of weather station data. Basically, you take the blue stuff at the top and convert it to the red stuff at the bottom. The fastest JVM implementation ran in about one and a half seconds, which is pretty respectable.
3:17 We're going to rerun this challenge with our own experimental setup: a 16 GB CSV on an AWS box I rented with 32 cores and 128 GB of RAM. We'll do 30 repetitions for each trial, with two warm-up runs to control for cache effects and all that kind of stuff. Notably, my examples don't care about output serialization, which the original challenge did; the reason is that it's just not super relevant to the kinds of things we do with these tools.
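As a rough idea of the harness, here's a minimal sketch of how such a trial loop might look; the command, the repetition counts, and the median summary are my assumptions, not the talk's actual rig:

```python
import statistics
import subprocess
import time

def bench(cmd: list[str], reps: int = 30, warmups: int = 2) -> float:
    """Run warm-ups first, then time `reps` repetitions and report the median."""
    for _ in range(warmups):
        subprocess.run(cmd, check=True)
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# e.g. bench(["python", "solution_pandas.py"])
```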
3:40 The first example is probably something many of you will be familiar with: basic vanilla pandas. You read a CSV, you group it, you do an aggregation, and then we round it, which was part of the original spec of the challenge. Fine. The one thing I'll particularly call out here is that pandas executes each of these lines sequentially: it reads the CSV, then it does the group-by, then it does the agg. The reason I'm stressing that point will become clear in a second.
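For reference, a minimal sketch of that vanilla pandas solution; the file name and column names here are assumptions on my part rather than the exact ones from the slide:

```python
import pandas as pd

# Eagerly read the whole CSV into memory: this single line is where
# most of the time and RAM goes.
df = pd.read_csv("measurements.txt", sep=";", header=None,
                 names=["station", "temperature"])

# Group by station, compute min/mean/max, and round per the challenge spec.
result = (
    df.groupby("station")["temperature"]
      .agg(["min", "mean", "max"])
      .round(1)
)
```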
4:02 This is the next one: Polars, the Rust-based library. It should look very, very familiar to anyone who's used pandas, with the single exception that group_by has an underscore, as God intended. You read the CSV in, you do the group_by, you do the agg, you do some sorting, whatever. The last line is the one I'm really interested in, which is collect. By default, Polars is lazily evaluated, and that makes a big difference in terms of performance gains, because 40 or 50 years of database research and optimization can now be leveraged. As opposed to dumbly going "I'm going to read it, I'm going to group it, I'm going to aggregate it", it goes "okay, here's a whole set of operations; let's decompose them into a DAG of operations and apply them in an optimal manner". That's what collect does. So you get things like predicate pushdown and other real benefits of a true, full-fat SQL database, right here in Python. And the last part of that, new_streaming=True, just enables streaming mode. It's like saying: load the data chunk-wise, do the processing, and spit out the result, rather than storing everything in memory the way pandas does. Cool.
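A hedged sketch of the Polars version; the file and column names are assumptions, and I've used the long-standing streaming=True spelling, since the new_streaming flag from the slide was an experimental option that newer Polars releases have since replaced:

```python
import polars as pl

# scan_csv is lazy: nothing is read here, we only build a query plan.
result = (
    pl.scan_csv("measurements.txt", separator=";", has_header=False)
      .rename({"column_1": "station", "column_2": "temperature"})
      .group_by("station")
      .agg(
          pl.col("temperature").min().alias("min"),
          pl.col("temperature").mean().alias("mean"),
          pl.col("temperature").max().alias("max"),
      )
      .sort("station")
      # collect() hands the whole plan to the optimizer; streaming mode
      # processes the file chunk-wise instead of holding it all in RAM.
      .collect(streaming=True)
)
```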
5:06 And the last example is DuckDB, which is a bit more wacky, but essentially, as I said, in-memory SQL. This should look very familiar to anyone who's done any kind of rudimentary SQL: you're just creating a table, selecting the things you want, and doing a group-by.
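A sketch of the DuckDB equivalent, again with assumed file and column names:

```python
import duckdb

# DuckDB parallelizes both the CSV scan and the aggregation across all cores.
result = duckdb.sql("""
    SELECT
        station,
        MIN(temperature)           AS min_temp,
        ROUND(AVG(temperature), 1) AS mean_temp,
        MAX(temperature)           AS max_temp
    FROM read_csv('measurements.txt', delim=';', header=false,
                  columns={'station': 'VARCHAR', 'temperature': 'DOUBLE'})
    GROUP BY station
    ORDER BY station
""").df()
```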
5:20 Now we get to the interesting part, which is the results. This table has a few columns: the library on the left-hand side; median duration across the 30 trials; median max CPU, where 100% is a single CPU core; median max USS, which you can think of as the maximum amount of RAM the process used, i.e. what the operating system would get back if you killed it; and swap in the right-hand column, which isn't super relevant for this trial, but you'll see later that it is. So: pandas did it in four and a half minutes. Would anyone like to guess how fast Polars did it? Anyone? No? Hm, you're not far off.
5:52 Five seconds. That's within an order of magnitude of the best hand-tooled JVM solution running native code. It used all 32 cores of the machine that you actually paid for, as opposed to the one core with pandas. It used 18 GB of memory, which is reasonable, and again, it didn't dip into swap because the machine has so much memory it doesn't need to. And the last one is DuckDB: it did it in 5 seconds, used the full 32 cores, and did it in 2 GB of memory. Not bad.
6:16 The question you're asking now is: this is wonderful, this is really good production performance, but how does it actually change my life day-to-day, other than when the jobs run? The real way to think about it is: does it speed up your dev loop and make your life better? So the next set of trials is the same test again, but on my local laptop, which has 16 GB of RAM and eight cores, and here are the results. Pandas did abysmally: it swapped heaps of stuff out to disk, it took ages, and it used a single core again. Polars did respectably again: it used all eight cores and very slightly dipped into swap. And DuckDB was the real winner here, with 500 MB of RAM usage, all eight cores, and a pretty respectable runtime. So I guess the message is: these tools improve your dev loop as well as your production performance, and they're much more efficient with resources.
7:06 So what do these tools really give us? Great performance on a flawed benchmark. All benchmarks are flawed, and this one is particularly flawed because I basically chose it to demonstrate some of the benefits of using these tools as opposed to pandas. One of the main ones is painless multi-threading: I didn't do anything with a thread pool, I didn't create anything, I didn't worry about fork-join; I just got all of the performance out of my machine. Another is streaming, which both of these tools use under the hood: DuckDB does a lot of that predicate-pushdown streaming work, and Polars essentially filters the data through memory chunk by chunk. The other thing is spill-to-disk, which isn't obvious in this example: pandas swapped out to disk, which is fine, because that's what the operating system does, but the operating system has no idea what your job is doing. These tools have the ability to spill to disk natively, so for large computations they can say "I don't think you're going to need this for another ten lines; I'm going to spill it out to disk and keep the stuff that's on the hot path in memory". That is a big boon. And the last one, really the killer, is lazy evaluation: you can create a DAG of the tasks you're evaluating, and it will execute them in the optimal order.
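If you want to see that optimization happening, Polars will print its optimized plan for you; a small sketch, with a made-up filter so the predicate pushdown is visible:

```python
import polars as pl

lf = (
    pl.scan_csv("measurements.txt", separator=";", has_header=False)
      .rename({"column_1": "station", "column_2": "temperature"})
      .filter(pl.col("station") == "Adelaide")  # hypothetical predicate
      .group_by("station")
      .agg(pl.col("temperature").mean())
)

# explain() shows the optimized plan: the filter is pushed down into the
# CSV scan, so non-matching rows are never materialized at all.
print(lf.explain())
```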
8:10 You're interested, but shiny tools have hurt you before. We've all been there. Anyone remember Redux? No? I'm not the youngest here, then. Cool.
8:21 So the way you actually adopt these tools is via Apache Arrow. This is the de facto in-memory data layout format that the community is settling on. It's been supported in pandas since its 2.0 release in April 2023, and support is getting there; it's really getting there. The key detail is that both of these tools, Polars and DuckDB, support zero-copy serialization between representations: I can go from pandas to DuckDB to Polars and back again without copying anything in memory, provided it's all PyArrow-backed. If the pandas dataframe is not PyArrow-backed, you get a fresh copy, but as we've already seen, these tools have much lower memory footprints, so holding a duplicate copy in memory is probably fine if you could afford the pandas overhead in the first place.
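A minimal sketch of that round trip, assuming a PyArrow-backed pandas frame; the data and names are purely illustrative:

```python
import duckdb
import pandas as pd
import polars as pl

# Arrow-backed dtypes are the precondition for the zero-copy path.
pdf = pd.DataFrame({"station": ["A", "B"], "temp": [1.5, 2.5]})
pdf = pdf.convert_dtypes(dtype_backend="pyarrow")

# pandas -> Polars: from_pandas can reuse the underlying Arrow buffers.
plf = pl.from_pandas(pdf)

# pandas/Polars -> DuckDB: duckdb.sql can query in-scope frames by name.
rel = duckdb.sql("SELECT station, AVG(temp) AS avg_temp FROM plf GROUP BY station")

# DuckDB -> back to Polars or pandas.
back_to_polars = rel.pl()
back_to_pandas = rel.df()
```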
9:04 So let's look at another practical example: the New York taxi dataset. It's a dataset of every taxi trip taken in New York from 2009 to the present, where the data is in monthly files, so you can process through them. Corporate wants to find out whether the pandemic made cash less common, so we're looking at the files between 2019 and 2022, which is about 3 GB of Parquet on disk. We're using DuckDB with pandas, but you can actually mix and match these tools as much as you want: you could use Polars and pandas, or any combination thereof.
9:34 There are four functions I'm going to call out here. Don't try to read all this code; there's just too much of it going on. But there are essentially four functions: one for reading and one for computing in each of the two libraries, and we're going to mix and match them to show how you can use them together. The key point on this slide is that reading the data with pandas can be done naively: you just loop over the files and merge them all together in memory, passing dtype_backend="pyarrow". At the bottom of the slide there's a bunch of boring data-sciencey stuff where we have to clamp the data to the right range, and all that nasty stuff you have to do when you actually use real data.
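A sketch of that naive pandas read; the directory layout, glob pattern, and file names are my assumptions:

```python
from pathlib import Path
import pandas as pd

def read_pandas(data_dir: str = "data") -> pd.DataFrame:
    # Eagerly read every monthly Parquet file with Arrow-backed dtypes,
    # then concatenate: every row is in memory at once.
    frames = [
        pd.read_parquet(path, dtype_backend="pyarrow")
        for path in sorted(Path(data_dir).glob("yellow_tripdata_*.parquet"))
    ]
    return pd.concat(frames, ignore_index=True)
```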
10:05 The pandas compute also looks really familiar: we're doing a group-by, we're doing an agg, we're doing some horrible unstacking in the middle, resetting some indices, and basically summing up the column levels for the payment-type enum, cash versus card versus whatever. Again, don't sweat the details too much; these are here for completeness.
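In that spirit, a hedged sketch of what such a pandas compute step might look like; the column names and the shape of the pivot are assumptions:

```python
import pandas as pd

def compute_pandas(df: pd.DataFrame) -> pd.DataFrame:
    # Derive a month column, count trips per month and payment type,
    # then unstack the payment enum into columns for comparison.
    month = pd.to_datetime(df["tpep_pickup_datetime"]).dt.to_period("M")
    return (
        df.groupby([month, "payment_type"])
          .size()
          .unstack("payment_type", fill_value=0)
          .reset_index()
    )
```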
10:21 The DuckDB read looks very familiar; again, it's just a SELECT statement in SQL. The thing I really want to call out, though, is this line, which is really sick, because DuckDB is just merging these files for you: it uses a glob pattern to find all the Parquet files in the directory and put them together. The last part I want to call out is that the boring data-sciencey miscellaneous stuff I was talking about before actually happens at the point where you read the data in. So all that predicate pushdown, filtering out the data you don't need, is happening for you. Pandas had to read everything in and then be told "chuck away the bits I don't want", whereas this says "just don't read in the crap I don't want in the first place", which will help you save a lot of memory.
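A sketch of that glob-based read with the range filter applied at scan time; the path, columns, and predicate are assumptions:

```python
import duckdb

def read_duckdb(data_dir: str = "data") -> duckdb.DuckDBPyRelation:
    # The glob lets DuckDB merge every monthly file itself, in parallel,
    # and the WHERE clause is pushed down into the Parquet scan, so
    # out-of-range rows are never read into memory at all.
    return duckdb.sql(f"""
        SELECT tpep_pickup_datetime, payment_type, total_amount
        FROM read_parquet('{data_dir}/yellow_tripdata_*.parquet')
        WHERE tpep_pickup_datetime >= '2019-01-01'
          AND tpep_pickup_datetime <  '2023-01-01'
    """)
```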
11:02 The DuckDB computation slide really is included for completeness; I wouldn't try to grok all of it. It's a bunch of common table expressions put together with some joins and smooshed together. I wouldn't stress about the details; you can look at it later. The last part, again, I'm not going to enumerate fully, but it's essentially all the combinations of these things: reading and computing with each framework together, and then pure solutions for each. Again, don't stress, it's just there for completeness.
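For the curious, the CTE-plus-join shape the slide describes might look roughly like this; payment_type = 2 meaning cash follows the TLC data dictionary, and everything else here is assumed:

```python
import duckdb
import pandas as pd

def compute_duckdb(rel: duckdb.DuckDBPyRelation) -> pd.DataFrame:
    # One CTE per step, joined back together: monthly totals vs cash trips.
    return duckdb.sql("""
        WITH monthly AS (
            SELECT date_trunc('month', tpep_pickup_datetime) AS month,
                   COUNT(*) AS trips
            FROM rel GROUP BY 1
        ),
        cash AS (
            SELECT date_trunc('month', tpep_pickup_datetime) AS month,
                   COUNT(*) AS cash_trips
            FROM rel WHERE payment_type = 2 GROUP BY 1
        )
        SELECT month, cash_trips / trips::DOUBLE AS cash_share
        FROM monthly JOIN cash USING (month)
        ORDER BY month
    """).df()
```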
11:25 The part we're all really here for is the performance on my actual machine, my laptop at home. Pure pandas is obviously the worst; we all knew that. It used about one core, 14 GB of memory, and swapped out lots of stuff. DuckDB doing the reading with pandas doing the thinking gets your runtime down, simply because DuckDB can read the files in parallel: it takes those glob patterns and goes "I've got eight cores, I'm going to use them and read as much as I can". You use all eight cores, but you still swap, probably because the pandas group-by expands and then contracts in memory. In the example where pandas does the reading and DuckDB does the computing, you still get a benefit, and you avoid that expand-contract pattern in memory because the computation isn't being done by pandas. And finally, the pure DuckDB solution: eight cores, 200 MB of RAM versus roughly 15 GB for all the others. I mean, I'm impressed.
12:10 Cool. Did cash usage fall over the pandemic? Yes. I don't know if it's actually scientific, and I don't really care; here's a chart just to prove that I did the actual research. For some reason, in March 2020, people started using cash more, which I don't understand or believe, but that is the chart. Overall, it did go down. Okay.
12:29 So: you're technologists, and you should be skeptical; if you're not, become skeptical. Why shouldn't you just listen to me? The answer is, I'm just a dude with a laptop. Do your own research: clone my code, it's open source, it's on GitHub; run your own benchmarks and figure out how your own workloads behave. Don't just take by rote the word of a dude on a stage in a sweater. The second reason is if you're super deeply integrated into the pandas ecosystem. I did my honours degree in energy research, and there was this weird package for energy modelling that used pandas as its base data structure, which is really cooked in hindsight, but the people doing that kind of modelling are not getting away from pandas, and honestly they probably shouldn't. Another example is geospatial: don't bother, it's not really worth it. The last point is "let pandas cook": let them improve, let them get better, they are the kings. And that's true; they have made progress on integrating PyArrow and getting better, but that progress has honestly been abysmally slow.
13:24 The other part I will say is that there are benefits to switching off pandas that go beyond performance. There's a lot around ergonomics and API usability: the mental load of using pandas is significantly higher than just writing SQL, in my opinion. The other one is a transferable skill set. One of my major criticisms of data science as a field when I started out was that it basically just feels like you're learning an API; there's no novelty there, it's just "how do I glue together these pandas things?". What you actually get by switching to a tool that lets you use SQL, for example, is transferable skills across different domains.
13:54 I've suggested two tools, so which one is better? This really, really depends on your preference: SQL versus code. I personally prefer Polars because I like composable functions, I like unit testing, that kind of stuff. But DuckDB goes well beyond performance: you get local-first analytics. You could do local-first analytics in an embedded Android app. Say you're making a health app: you just embed the DuckDB binary in your code, and you can build cool analytics dashboards for your users without sending anything to the server. That's nuts. The other one is compiling it to Wasm and doing SQL queries for visualization from the front end: your back end doesn't return JSON, it returns signed S3 URLs, and you just query them. That is nuts. It's so cool to me as more of an app-dev kind of person, as opposed to a data person.
14:34 So that brings us to the takeaways of this talk. First: you can't avoid complex distributed querying systems, but you can defer them, potentially indefinitely, by just not using pandas in the first place. Second: we really are experiencing a renaissance in Python tooling at the moment, particularly in the dataframe space, so actually get out there and try some of these tools; your assumptions from five years ago are no longer valid. And finally, a bit of shameless self-promotion: the code for this is on GitHub, I have a LinkedIn, and I have a blog that I occasionally post to. If you want to follow me on Twitter, don't, because I don't care what you…
15:05 That's it. I believe I have a slide as well for feedback and what's up next.
15:11 [Applause]