PromCon 2024 - The Weirdest PromQL You'll Ever See: PromQL for Reporting, Analytics | Prometheus Monitoring
Summary
Core Theme
This presentation demonstrates how to leverage Prometheus and PromQL for non-observability use cases, specifically for cost attribution and business intelligence reporting, by transforming raw time-series data into actionable financial insights.
Yeah, hi everybody. This is The Weirdest PromQL You'll Ever See: PromQL for Reporting, Analytics and Business Intelligence. Now, this is my first time at PromCon, and it's actually my first conference talk, so you'll have to forgive the somewhat baity title I put together when I was pitching this. We've seen some weird PromQL today, right? And I was told about this legendary talk from five years ago that's gone down in history, so maybe this is only the second weirdest PromQL you'll ever see, who knows. And yeah, here it is; if you want to go now, you can go away happy.
So yeah, my name is Sam Jewell. I'm a senior software engineer at Grafana Labs, and I've been there three years now. When I'm not working, I'm working hard, or maybe not so hard, looking after my two boys; here we're dressed up as daffodils, dressed up in yellow flowers. When I am working, most recently I've worked on cost attribution. This is helping customers split up their Grafana Labs bill, typically by label. For large customers this is pretty important: it means they can break down that huge bill and see who is sending what. And it's also important for teams, who can then see their own contribution to that enormous company spend and start to take ownership of it. So it's pretty valuable, you know, in dollars.

We dogfood this at Grafana, and in the last month or two we've cut metrics that were worth $25,000 every month, and we're just getting started; we plan to save even more over the coming months. And we're using Prometheus and Grafana to do this, which is interesting, because it's not an observability use case: we're storing the data in Prometheus and visualizing it with Grafana. That's why I'm here talking today, to share how we're using Prometheus, and to say that you too can use Prometheus for non-observability use cases.
Such things as: you could track DORA metrics, you could track site activity or transactions, you could track alerts or incidents, or, like us, you could track usage of metered cloud services and save money. And we heard yesterday from Swiss Re; they described one of their functions as being a data shop, with other teams consuming some of their data.
So why use Prometheus, and why not SQL, some SQL data warehouse? Well, you can avoid adding an additional data pipeline or database if you happen to have your data in Prometheus already. You can avoid exporting that data, which means you can avoid things like delays or stale data in some cases. And we can lean on the ecosystem as well: it makes it very easy to build dashboards or alert on that data. So there are opportunities here. My goals for the talk are to equip you to do a bit more with Prometheus, hopefully teach you something new about PromQL, and have some fun doing it if we can.
Specifically, I've been looking a lot at usage data: things like the count of Prometheus time series, or bytes of telemetry data ingested, bytes of logs and traces. At Grafana we also track our consumption of cloud compute resources, how much CPU and memory we're consuming across all our Kubernetes machines, to attribute costs and to manage those costs.

When we look at that usage data: Prometheus is powerful, and there's a reason we love it. It's super high frequency in the time dimension, so we can see spikes, zoom in on them, see exactly when they occurred, correlate, figure out what caused the spike, and address it. And the Prometheus label model also allows us as high a cardinality as we want, or higher sometimes, so we can break down that usage in the space dimension as well and track it down to an instance or a team. So we can address our usage spikes in real time, or close to it. But my goal is something pretty different to that: it's to grok those costs and view the data over a much longer time period, to aggregate over time.

Some examples of times when you might want to do that. At the end of the month your cloud provider sends you the bill for your cloud compute, and you might want to compare that to your own data and audit the bill they've sent you.
And then, you know, challenge the bill if it's wrong, or break down the bill even further than the provider allows you to with their own data. We might also want to grok these costs by day, and see daily or weekly how these things are trending and where the changes are coming from. This one, I think, is quite an interesting view: it's a monthly view, and it's cumulative as well. It allows you to compare across months, where the peaks are, so was this month more than last month, and which of the teams, which of the instances, are responsible. You can also see some gradients, some rates: where it's picking up and where it's slowing down. So that's quite a rich view of the data.
And then over even longer time horizons: if you buy some credits up front at the beginning of the year and you're burning through them as the year goes on, you want to know, (a), that you're not going to fail to spend them all and lose them once the 12 months run out, and equally that you're not going to burn through them all in six months and have to buy another contract within six months.

All right, so let's start writing some PromQL and try to build some of these visualizations. We'll start with the daily one. Now, the easiest way is to start from a counter metric.
Here I've got traces_size_total, which is measured in bytes, and you can see that if I take the sum of the rate, in August we're seeing around 85,000 bytes coming in every second. The reason we've started with a counter is that this is the easiest case: you can just use increase, pass it a time period of one day, and it'll tell you how much that bytes counter increased over the day. So on the 17th of August you can see 7.4 billion bytes were sent, or ingested, over that last day. But this is continuous data, and it's actually more data than we want to see if we just want to grok this quickly. So we'll change the min step, change to bars, and there we are: we've got a daily view. That's super nice; we can really quickly and easily understand it and share it around the business. And if you want to do that in Prometheus itself, you can change the resolution input, which takes seconds, so just convert your day into seconds there.
on the 16th of August if you hover you
can see that bar is the time the exact
time for that data point is midnight in
the morning and what it's done is it's
calculated the increase in that count um
over the past 24 hours so all of
those traces um were actually being
recorded in the previous 24 hours
meaning the 15th of the month not the
16th so we'd like to fix that fix the X
labels so we're just going to add one
more thing at this point which is to
offset offset by one day and you can see
the bars move over and they're now
labeled correctly they're labeled us for
when when that usage happened all right
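Presumably the fix looks something like the query below; this is an assumption about the exact form used in the talk. A negative offset shifts the evaluation window forward, so the point stamped at midnight covers the day that follows it rather than the day before (negative offsets need a reasonably recent Prometheus):

# The bar stamped 2024-08-16 00:00 now counts the increase during the 16th:
sum(increase(traces_size_total[1d] offset -1d))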
All right, let's move on to something a bit more tricky. At Grafana Labs we often find that we're working with gauge metrics, not counter metrics, and specifically with rate metrics, because we use recording rules to aggregate and relabel lots of this data; there are too many labels and it's too high cardinality otherwise. So we end up with metrics that are more like this: bytes received per second. In this case we want to get to the same daily data, but because it's a rate, we're going to have to integrate the area under the curve.

Now, interestingly, there's an open pull request for this; the author also works at Grafana Labs, and he proposed it around July and August. I'm interested to see where this goes, and I really want to see it land. But in the meantime there are some tricks to getting this to work properly.
If we look at the docs, we can track down sum_over_time, which takes a range vector and returns the sum of all values in the specified interval. Let's have a look at how that pans out in practice. Take this example from 16:00 to 16:01: we pass one minute into sum_over_time, and it does what we just said, sums all the values in the specified interval. In this case the values are 5, 5, 10 and 10, so we get 30. OK, that's what it's giving us for the bytes received.

The trouble is that bytes received per second is exactly that: per second. So what it's telling us is that five bytes were received in that one second, five bytes in that second, ten in that second, and ten in that second. We've summed up the bytes received over four seconds, not over a minute. We'll fix that by multiplying by the time period between points; we'll multiply by 15. This isn't in the docs; I've opened a pull request, so we'll see what happens there. But to get a true integration, accounting for all of that time, we have to multiply by the time period, the scrape interval.
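Concretely, with samples every 15 seconds, the sum of the per-second values has to be scaled by the gap between samples to become a true integral. Assuming the recorded rate metric is called bytes_received_per_second (a name taken loosely from the narration), the query is something like:

# sum_over_time adds up the per-second samples in the window; multiplying
# by the 15s between samples converts that into bytes over the window:
sum_over_time(bytes_received_per_second[1m]) * 15
# e.g. (5 + 5 + 10 + 10) * 15 = 450 bytes received in that minute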
Now, this isn't completely robust yet. If a scrape is missed or a recording rule fails, you might lose a data point, or two; and the scrape interval, the frequency of your data, might change. We can fix that by adding a subquery: when you execute a subquery, you can guarantee that it evaluates at specific intervals in time, and then we'll cover the whole time range again. Great. So we can run that, see the curve again, and present it as days. Awesome, we got there.
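As a sketch, the subquery form pins the evaluation step explicitly, so the multiplier stays correct even if a scrape is missed (the 15s step here is an assumption matching the example's scrape interval):

# [1d:15s] re-evaluates the inner expression every 15 seconds, so every
# day contributes exactly 5760 points, each representing 15s of usage:
sum_over_time(bytes_received_per_second[1d:15s]) * 15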
So we'll do another one: let's have a look at the monthly view now. You'd hope that you could just pull out where you had inserted one day and stick in one month. That doesn't work; we get a parsing error, bad number or duration. Presumably this is because months are different lengths, right? What would one month be translated into? So Prometheus doesn't accept month time durations. In some cases perhaps you could get away with 30 days, but it's nice to have data for calendar months, because if you look at it two weeks later, a month later, even six months later, it won't have changed; it won't have moved around.
So how are we going to do that? Well, there are some functions, some of which you've seen already today: month, day_of_month, days_in_month, and there are others, like day_of_week for example; there are loads in the docs. I was just going to show what these actually do. month will give you the value of the month index, so here you can see it takes 7 during July and 8 during August. You can then apply a condition if you want: you could say the month has to equal 8, and you'll see no data for July and for September. So this might have some potential for us. I thought I'd also quickly show day_of_month, which climbs from a value of 1 on the 1st to 31 on the 31st. Again, you could apply a condition and use that to lift data just at the start of the month, or just at the end of the month, for example.
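For reference, the date functions behave roughly like this (all evaluated in UTC):

month()           # 7 during July, 8 during August, and so on
month() == 8      # returns a value only while the month is August; no data otherwise
day_of_month()    # climbs from 1 on the 1st up to 28..31 at month end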
So I'm interested in trying to isolate the August data to begin with, and I've got this kind of synthetic metric. The idea is to take the intersection of this synthetic metric with my actual usage metric, and use the synthetic metric as a filter, basically. Intersection would be the 'and' set operator, but 'and' doesn't give you the full matching control we need here; in particular, it doesn't support the group_right or group_left modifiers. People have talked about joins earlier today as well. So what we're going to try instead is multiplying these together. And if we're going to multiply them, we want to scale this: we want it to have a value of one, really. We don't want it to have a value of eight, and we don't want to divide by eight and hardcode some value there, because it'll be wrong for the other months.
The way we're going to do this is with absent. absent always has a value of one, which is nice in this regard. When we put month() == 8 into absent, you can see that the metric we had before is absent in July and in September, and that's why we get the value one there. Then we can reverse the condition: if we ask where month() != 8 is absent, it's absent in August. Fantastic.
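So the August-only flag described here is something like:

# month() != 8 returns nothing during August, so absent() fires exactly
# there, with a constant value of 1: a filter we can multiply by.
absent(month() != 8)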
OK, so now let's try to multiply this by our usage metric. First of all, there's the usage metric on its own; fine. But if we try to just multiply, we get no data. If we check the docs, we see that for a binary operator between two instant vectors, the operator is applied to each entry in the left-hand side and its matching element in the right-hand side. And what is a matching element? For one-to-one vector matches, two entries match if they have the exact same set of labels. So presumably we must not have the exact same set of labels.
Let's have a look: our real metric on the left has labels like cluster_id or id, and on the right we have no labels at all. That's why it's not working. To begin with, let's simplify what we've got on the left: I'll do a sum by (id), to reduce the number of labels I'm working with. That takes me down to one label, but I don't want to throw them all away; I want to keep this one, because as I said earlier, we want that data in the cardinality direction, the space direction. We want to know which teams, which pods, which instances. Once we've done that, we'll multiply, and this time we'll use ignoring. And there it is: we've managed to filter a usage metric to a specific month. It's working. There's one more thing I want to do before we move on: here the id label is not preserved, and you can see the legend just shows the query string. By adding group_left to the join logic we can preserve the id and get it to come out in the result as well.
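Put together, under the assumed metric name, the join presumably looks like this:

# Keep only August data. ignoring(id) lets the label-less right-hand side
# match, and group_left preserves the id label in the result:
sum by (id) (bytes_received_per_second)
  * ignoring(id) group_left()
absent(month() != 8)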
OK, so this is still a rate, still bytes per second; we need to integrate under the curve again to get this cumulative usage and calculate the bytes received. We'll do what we did before, sum_over_time, this time using 31 days as the period, and I'm multiplying by 60 here, because this time 60 seconds is the scrape interval, the period between data points. But this doesn't work yet: it wants a range vector. To give it a range vector, we'll do the same thing we did before and turn this into a subquery, and we'll make sure that the subquery evaluation interval, at one minute, matches the 60, so that we get the right numbers. And there we go: we've got an accumulating usage during the month of August. We can see what the total was at the end of August, but also how it was changing during the month. And this works mid-month, right? That's nice: you load this up on the 16th of August and you can see where you are.
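Assembled, with the one-minute subquery step matching the factor of 60, the cumulative query is presumably along these lines:

# Integrate the August-only rate over a 31-day lookback, evaluated every
# minute; * 60 turns per-second samples into bytes per one-minute step:
sum_over_time(
  (
    sum by (id) (bytes_received_per_second)
      * ignoring(id) group_left()
    absent(month() != 8)
  )[31d:1m]
) * 60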
But what's going on in September? Let's zoom out and have a look. If you look at September, you'll see the same usage pattern in reverse; it's been flipped upside down. We were accumulating usage during August, and we're de-accumulating that same usage in September. Why is that happening? It's because when we sum over time, the time window we're using is 31 days long, so even on the 30th of September we look back 31 days, and that last day in August is still being counted in the usage calculation. We're not interested in that September pattern at all: when we get to the end of August, we basically pay the bill and start from zero again. So let's throw that away. This is where it gets a bit more fun, or hacky, but let's push forwards. We'll just apply the exact same filter again, and cut the data off at the end of August. And there we have it: an accumulating usage for August.
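The cutoff is the same month filter applied once more, outside the integral, so the running total vanishes as soon as September starts; a sketch:

# Multiplying by the August flag again discards the mirrored September tail:
(
  sum_over_time(
    (
      sum by (id) (bytes_received_per_second)
        * ignoring(id) group_left()
      absent(month() != 8)
    )[31d:1m]
  ) * 60
)
  * ignoring(id) group_left()
absent(month() != 8)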
Now we can start adding some more months. To do that we'll use the 'or' operator, which is the union. We've got one query at the top and one query at the bottom, and in the bottom one I've swapped out the number eight for seven, because I want month seven now. When we union those two together, the labels are identical, so it effectively sticks them together in time. And there we are: we've got July.
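Schematically, the union looks like this (shown on the unintegrated rate for brevity; each full monthly expression gets the same treatment):

# Each branch isolates one month; `or` unions them along the time axis:
  (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 8))
or
  (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 7))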
Now, I did promise weird PromQL. I've got a kind of code-golf version for all months: if you want 12 months, you could do something like modulo-divide by three and compare it to 0, 1 or 2. I'm not very proud of this, although it is what's running in the code back home; I'm going to change it when I get back to my desk. Why three rather than two? Because February is too short, as are the other short months. I'm slightly happier with this next version; I think it's more predictable and possible to reason about: just copy the query twelve times over. I came from a background in Ruby, and there's a great conference speaker called Sandy Metz, who wrote a book called 99 Bottles of OOP that I absolutely love. She talks about getting to "shameless green": shameless green is the version that passes the tests, and it's shameless about it. Anyway, it works: there you can see June has appeared, and September has appeared.
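The "shameless green" version is then just that pattern written out in full, one branch per month; a sketch of the first few branches:

# One branch per month, copied twelve times over:
   (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 1))
or (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 2))
or (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 3))
# ...and so on, through absent(month() != 12)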
That's fantastic. And just to put the icing on the cake: this has been scoped to a single ID so far. If we unscope from that single ID and look across IDs instead, now we can see that rich view of the data we saw before. So yeah, this is super interesting and potentially really valuable. Oh, and I can see I've got a little demo, I think, so you can see that this isn't just a picture.
Right, I was just going to zoom out a bit and interact with it a little, so you can see it's live. We can show and hide, focus on a single team, and share that URL with our colleagues. And I showed this one at the start as well; unfortunately I haven't really got time to cover it now, so I thought I'd say this could be your homework task, if you want to go away and have a go, or you can reach out to me afterwards if you want.
Anyway, that's it. I hope I've equipped you to do a little bit more with Prometheus, maybe taught you something new about PromQL, and that we've had a bit of fun doing it. Thank you everybody for listening. That wasn't too weird after all, or was it?
Q: Hi, thank you for the talk. Just a quick question: have you considered how you're going to deal with…

A: I guess I have some kind of an answer. The beauty of using that month function is that it knows when the month begins and when the month ends; that's why I'm using it instead of using 30 days. So I'd hope that this would help me get there, and I wouldn't have to think about it.
Q: Can you explain again why you had to use ignoring, instead of an on with parentheses, for the join part to work?

A: It might be that I could have used on with empty brackets.

Q: Yeah, that's the form I'm used to seeing, so I'm wondering if there's a specific reason why it had to be different.

A: No. Typically you've got a set of labels, say three labels, and if you want to join on one of them, you can either say on that one, or ignoring the other two; both do exactly the same thing.
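For reference, the two forms should be equivalent here, since the right-hand side carries no labels at all; which one you reach for usually just depends on how many labels you would otherwise have to list:

# Match by ignoring the one extra label on the left...
sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 8)
# ...or by matching on the empty label set explicitly:
sum by (id) (bytes_received_per_second) * on() group_left() absent(month() != 8)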
Q: Thanks for the talk, very interesting. I have a comment and a question. The comment is that it looks like this ignoring-absent code can be replaced by scalar. Did you consider the scalar function?

A: By what?

Q: By the scalar function.

A: Scalar, yeah.

Q: That's the first part, just a comment. And the question is: did you consider time zones? Because, as far as I know, the month function returns its result in the UTC time zone. I'd recommend adding time zone support to this query.

A: That's a really good question. Through my work on this (and someone asked about leap years too), I'm often hitting time zone issues, and they manifest themselves most when daylight saving occurs, actually. So I'll get January, February, March, and then March again, and then April, May, June, or something, or I'll get a gap. I've discovered and reported quite a few bugs, actually, at least one, probably two or three, in Grafana around time zones and the way they're handled.
But basically, we build dashboards for our customers that show their bills, in a billing dashboard in Grafana, and we're also building Grafana app plugins that show how their bills break down by team. The bills are calculated in the UTC time zone, so as long as we always work in UTC, we can show them their actual bill and label it with the right month. It's not really relevant to them to see their bill in their local time zone, because it won't match up with what we're billing them. So mostly, yeah, we just work in UTC for this.

UTC for the win. A quick announcement: the first two lightning talk speakers, Kieran and Christian, please come up to the front so we can get started right after this.
Q: Hello, thank you, the talk is really amazing. Just a very, maybe dumb, question: if it's only for dashboarding, wouldn't it work to have a variable using the days_in_month function, and use the value returned from that expression as the range?

A: Though, surely you're looking at multiple months at once on some versions of the dashboard?

Q: Oh yeah, I mean, for example, creating a variable using the days_in_month function, providing maybe an up time series or something like that just to get the month it's going to be evaluated in, and then using the return from this function as the range selector. Does that work, or not really?

A: Yeah, I think those kinds of solutions can work. We've done some pretty gnarly things with dashboard variables as well. But yeah, if you had a dashboard that's viewing one specific month, for example, then for sure.

Q: OK, thank you.
Q: I have one question, if I may, on your left. Hello, amazing talk. I'm just curious: you do incredible analytics queries using PromQL, and typically, you know, data analysts prefer maybe SQL and other languages. So I'm curious what you have been missing in PromQL that you would do more easily in SQL. What could we add?

A: We had a hackathon project... So, before I joined Grafana I'd never worked in observability; I didn't know what a time series database was, or what logs were, basically. I'd been using a tool called Looker, which got acquired by Google, which is a BI tool, and I loved it as a dashboarding tool. I was like, oh, I love this dashboarding; little did I know I'd later go and work on dashboarding in Grafana. Anyway, what there was in the SQL world: you're writing these subqueries within subqueries within subqueries, and often, if you've got a subquery, you want to give it a name and make it a source of truth for your business.

Q: So, a recording rule?

A: Exactly, exactly, but there were different versions of this: you could have materialized views, or you could have views that were not materialized. Recording rules are like materialized views, but we don't have the unmaterialized kind. I think in VictoriaMetrics' MetricsQL they have variables; maybe that serves a similar purpose. So yeah, there's some room there.

OK then, thank you again for the weirdest talk ever.