PromCon 2024 - The Weirdest PromQL You'll Ever See: PromQL for Reporting, Analytics | Prometheus Monitoring
Summary
Core Theme
This presentation demonstrates how to leverage Prometheus and PromQL for non-observability use cases, specifically for cost attribution and business intelligence reporting, by transforming raw time-series data into actionable financial insights.
Yeah, hi everybody. This is The Weirdest PromQL You'll Ever See: PromQL for Reporting, Analytics and Business Intelligence. Now, this is my first time at PromCon, and it's actually my first conference talk, so you'll have to forgive the somewhat baity title I put together when I was pitching this. We've seen some weird PromQL today, right? And I was told about this legendary talk from five years ago that's gone down in history, so maybe this is only the second weirdest PromQL you'll ever see, who knows. And yeah, here it is; if you want to go now, you can go away happy.
So yeah, my name is Sam Jewell. I'm a senior software engineer at Grafana Labs, and I've been there three years now. When I'm not working, I'm working hard, or maybe not so hard, looking after my two boys; here we're dressed up as daffodils, dressed up in yellow flowers. When I am working, most recently I've worked on cost attribution. This is helping customers split up their Grafana Labs bill, typically by label. For large customers this is pretty important: it means they can break down that huge bill and see who is sending what. And it's also important for teams, who can then see their own contribution to that enormous company spend and start to take ownership of it. So it's pretty valuable, you know, in dollars.

We dogfood this at Grafana, and in the last month or two we've cut metrics that were worth $25,000 every month, and we're just getting started; we plan to save even more over the coming months. And we're using Prometheus and Grafana to do this, which is interesting, because it's not an observability use case: we're storing the data in Prometheus and visualizing it with Grafana. That's why I'm here talking today, to share how we're using Prometheus, and to say that you too can use Prometheus for non-observability use cases.
Such things as: you could track DORA metrics, you could track site activity or transactions, you could track alerts or incidents, or, like us, you could track usage of metered cloud services and save money. And we heard yesterday from Swiss Re; they described one of their functions as being a data shop, with other teams consuming some of their data.
So why use Prometheus, and why not SQL, some SQL data warehouse? Well, you can avoid adding an additional data pipeline or database if you happen to have your data in Prometheus already. You can avoid exporting that data, which means you can avoid things like delays or stale data in some cases. And we can lean on the ecosystem as well: it makes it very easy to build dashboards or alert on that data. So there are opportunities here. My goals for the talk are to equip you to do a bit more with Prometheus, hopefully teach you something new about PromQL, and have some fun doing it if we can.
Specifically, I've been looking a lot at usage data: things like the count of Prometheus time series, or bytes of telemetry data ingested, bytes of logs and traces. At Grafana we also track our consumption of cloud compute resources, how much CPU and memory we're consuming across all our Kubernetes machines, to attribute costs and to manage those costs.

When we look at that usage data: Prometheus is powerful, and there's a reason we love it. It's super high frequency in the time dimension, so we can see spikes, zoom in on them, see exactly when they occurred, correlate, figure out what caused the spike, and address it. And the Prometheus label model also allows us as high a cardinality as we want, or higher sometimes, so we can break down that usage in the space dimension as well and track it down to an instance or a team. So we can address our usage spikes in real time, or close to it. But my goal is something pretty different to that: it's to grok those costs and view the data over a much longer time period, to aggregate over time.

Some examples of times when you might want to do that. At the end of the month your cloud provider sends you the bill for your cloud compute, and you might want to compare that to your own data and audit the bill they've sent you.
And then, you know, challenge the bill if it's wrong, or break down the bill even further than the provider allows you to with their own data. We might also want to grok these costs by day, and see daily or weekly how these things are trending and where the changes are coming from. This one, I think, is quite an interesting view: it's a monthly view, and it's cumulative as well. It allows you to compare across months, where the peaks are, so was this month more than last month, and which of the teams, which of the instances, are responsible. You can also see some gradients, some rates: where it's picking up and where it's slowing down. So that's quite a rich view of the data.
And then over even longer time horizons: if you buy some credits up front at the beginning of the year and you're burning through them as the year goes on, you want to know, (a), that you're not going to fail to spend them all and lose them once the 12 months run out, and equally that you're not going to burn through them all in six months and have to buy another contract within six months.

All right, so let's start writing some PromQL and try to build some of these visualizations. We'll start with the daily one. Now, the easiest way is to start from a counter metric.
Here I've got traces_size_total, which is measured in bytes, and you can see that if I take the sum of the rate, in August we're seeing around 85,000 bytes coming in every second. The reason we've started with a counter is that this is the easiest case: you can just use increase, pass it a time period of one day, and it'll tell you how much that bytes counter increased over the day. So on the 17th of August you can see 7.4 billion bytes were sent, or ingested, over that last day. But this is continuous data, and it's actually more data than we want to see if we just want to grok this quickly. So we'll change the min step, change to bars, and there we are: we've got a daily view. That's super nice; we can really quickly and easily understand it and share it around the business. And if you want to do that in Prometheus itself, you can change the resolution input, which takes seconds, so just convert your day into seconds there.
on the 16th of August if you hover you
can see that bar is the time the exact
time for that data point is midnight in
the morning and what it's done is it's
calculated the increase in that count um
over the past 24 hours so all of
those traces um were actually being
recorded in the previous 24 hours
meaning the 15th of the month not the
16th so we'd like to fix that fix the X
labels so we're just going to add one
more thing at this point which is to
offset offset by one day and you can see
the bars move over and they're now
labeled correctly they're labeled us for
when when that usage happened all right
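Presumably the fix looks something like the query below; this is an assumption about the exact form used in the talk. A negative offset shifts the evaluation window forward, so the point stamped at midnight covers the day that follows it rather than the day before (negative offsets need a reasonably recent Prometheus):

# The bar stamped 2024-08-16 00:00 now counts the increase during the 16th:
sum(increase(traces_size_total[1d] offset -1d))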
All right, let's move on to something a bit more tricky. At Grafana Labs we often find that we're working with gauge metrics, not counter metrics, and specifically with rate metrics, because we use recording rules to aggregate and relabel lots of this data; there are too many labels and it's too high cardinality otherwise. So we end up with metrics that are more like this: bytes received per second. In this case we want to get to the same daily data, but because it's a rate, we're going to have to integrate the area under the curve.

Now, interestingly, there's an open pull request for this; the author also works at Grafana Labs, and he proposed it around July and August. I'm interested to see where this goes, and I really want to see it land. But in the meantime there are some tricks to getting this to work properly.
If we look at the docs, we can track down sum_over_time, which takes a range vector and returns the sum of all values in the specified interval. Let's have a look at how that pans out in practice. Take this example from 16:00 to 16:01: we pass one minute into sum_over_time, and it does what we just said, sums all the values in the specified interval. In this case the values are 5, 5, 10 and 10, so we get 30. OK, that's what it's giving us for the bytes received.

The trouble is that bytes received per second is exactly that: per second. So what it's telling us is that five bytes were received in that one second, five bytes in that second, ten in that second, and ten in that second. We've summed up the bytes received over four seconds, not over a minute. We'll fix that by multiplying by the time period between points; we'll multiply by 15. This isn't in the docs; I've opened a pull request, so we'll see what happens there. But to get a true integration, accounting for all of that time, we have to multiply by the time period, the scrape interval.
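Concretely, with samples every 15 seconds, the sum of the per-second values has to be scaled by the gap between samples to become a true integral. Assuming the recorded rate metric is called bytes_received_per_second (a name taken loosely from the narration), the query is something like:

# sum_over_time adds up the per-second samples in the window; multiplying
# by the 15s between samples converts that into bytes over the window:
sum_over_time(bytes_received_per_second[1m]) * 15
# e.g. (5 + 5 + 10 + 10) * 15 = 450 bytes received in that minute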
Now, this isn't completely robust yet. If a scrape is missed or a recording rule fails, you might lose a data point, or two; and the scrape interval, the frequency of your data, might change. We can fix that by adding a subquery: when you execute a subquery, you can guarantee that it evaluates at specific intervals in time, and then we'll cover the whole time range again. Great. So we can run that, see the curve again, and present it as days. Awesome, we got there.
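As a sketch, the subquery form pins the evaluation step explicitly, so the multiplier stays correct even if a scrape is missed (the 15s step here is an assumption matching the example's scrape interval):

# [1d:15s] re-evaluates the inner expression every 15 seconds, so every
# day contributes exactly 5760 points, each representing 15s of usage:
sum_over_time(bytes_received_per_second[1d:15s]) * 15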
So we'll do another one: let's have a look at the monthly view now. You'd hope that you could just pull out where you had inserted one day and stick in one month. That doesn't work; we get a parsing error, bad number or duration. Presumably this is because months are different lengths, right? What would one month be translated into? So Prometheus doesn't accept month time durations. In some cases perhaps you could get away with 30 days, but it's nice to have data for calendar months, because if you look at it two weeks later, a month later, even six months later, it won't have changed; it won't have moved around.
So how are we going to do that? Well, there are some functions, some of which you've seen already today: month, day_of_month, days_in_month, and there are others, like day_of_week for example; there are loads in the docs. I was just going to show what these actually do. month will give you the value of the month index, so here you can see it takes 7 during July and 8 during August. You can then apply a condition if you want: you could say the month has to equal 8, and you'll see no data for July and for September. So this might have some potential for us. I thought I'd also quickly show day_of_month, which climbs from a value of 1 on the 1st to 31 on the 31st. Again, you could apply a condition and use that to lift data just at the start of the month, or just at the end of the month, for example.
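For reference, the date functions behave roughly like this (all evaluated in UTC):

month()           # 7 during July, 8 during August, and so on
month() == 8      # returns a value only while the month is August; no data otherwise
day_of_month()    # climbs from 1 on the 1st up to 28..31 at month end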
So I'm interested in trying to isolate the August data to begin with, and I've got this kind of synthetic metric. The idea is to take the intersection of this synthetic metric with my actual usage metric, and use the synthetic metric as a filter, basically. Intersection would be the 'and' set operator, but 'and' doesn't give you the full matching control we need here; in particular, it doesn't support the group_right or group_left modifiers. People have talked about joins earlier today as well. So what we're going to try instead is multiplying these together. And if we're going to multiply them, we want to scale this: we want it to have a value of one, really. We don't want it to have a value of eight, and we don't want to divide by eight and hardcode some value there, because it'll be wrong for the other months.
The way we're going to do this is with absent. absent always has a value of one, which is nice in this regard. When we put month() == 8 into absent, you can see that the metric we had before is absent in July and in September, and that's why we get the value one there. Then we can reverse the condition: if we ask where month() != 8 is absent, it's absent in August. Fantastic.
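So the August-only flag described here is something like:

# month() != 8 returns nothing during August, so absent() fires exactly
# there, with a constant value of 1: a filter we can multiply by.
absent(month() != 8)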
OK, so now let's try to multiply this by our usage metric. First of all, there's the usage metric on its own; fine. But if we try to just multiply, we get no data. If we check the docs, we see that for a binary operator between two instant vectors, the operator is applied to each entry in the left-hand side and its matching element in the right-hand side. And what is a matching element? For one-to-one vector matches, two entries match if they have the exact same set of labels. So presumably we must not have the exact same set of labels.
Let's have a look: our real metric on the left has labels like cluster_id or id, and on the right we have no labels at all. That's why it's not working. To begin with, let's simplify what we've got on the left: I'll do a sum by (id), to reduce the number of labels I'm working with. That takes me down to one label, but I don't want to throw them all away; I want to keep this one, because as I said earlier, we want that data in the cardinality direction, the space direction. We want to know which teams, which pods, which instances. Once we've done that, we'll multiply, and this time we'll use ignoring. And there it is: we've managed to filter a usage metric to a specific month. It's working. There's one more thing I want to do before we move on: here the id label is not preserved, and you can see the legend just shows the query string. By adding group_left to the join logic we can preserve the id and get it to come out in the result as well.
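Put together, under the assumed metric name, the join presumably looks like this:

# Keep only August data. ignoring(id) lets the label-less right-hand side
# match, and group_left preserves the id label in the result:
sum by (id) (bytes_received_per_second)
  * ignoring(id) group_left()
absent(month() != 8)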
OK, so this is still a rate, still bytes per second; we need to integrate under the curve again to get this cumulative usage and calculate the bytes received. We'll do what we did before, sum_over_time, this time using 31 days as the period, and I'm multiplying by 60 here, because this time 60 seconds is the scrape interval, the period between data points. But this doesn't work yet: it wants a range vector. To give it a range vector, we'll do the same thing we did before and turn this into a subquery, and we'll make sure that the subquery evaluation interval, at one minute, matches the 60, so that we get the right numbers. And there we go: we've got an accumulating usage during the month of August. We can see what the total was at the end of August, but also how it was changing during the month. And this works mid-month, right? That's nice: you load this up on the 16th of August and you can see where you are.
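Assembled, with the one-minute subquery step matching the factor of 60, the cumulative query is presumably along these lines:

# Integrate the August-only rate over a 31-day lookback, evaluated every
# minute; * 60 turns per-second samples into bytes per one-minute step:
sum_over_time(
  (
    sum by (id) (bytes_received_per_second)
      * ignoring(id) group_left()
    absent(month() != 8)
  )[31d:1m]
) * 60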
But what's going on in September? Let's zoom out and have a look. If you look at September, you'll see the same usage pattern in reverse; it's been flipped upside down. We were accumulating usage during August, and we're de-accumulating that same usage in September. Why is that happening? It's because when we sum over time, the time window we're using is 31 days long, so even on the 30th of September we look back 31 days, and that last day in August is still being counted in the usage calculation. We're not interested in that September pattern at all: when we get to the end of August, we basically pay the bill and start from zero again. So let's throw that away. This is where it gets a bit more fun, or hacky, but let's push forwards. We'll just apply the exact same filter again, and cut the data off at the end of August. And there we have it: an accumulating usage for August.
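The cutoff is the same month filter applied once more, outside the integral, so the running total vanishes as soon as September starts; a sketch:

# Multiplying by the August flag again discards the mirrored September tail:
(
  sum_over_time(
    (
      sum by (id) (bytes_received_per_second)
        * ignoring(id) group_left()
      absent(month() != 8)
    )[31d:1m]
  ) * 60
)
  * ignoring(id) group_left()
absent(month() != 8)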
Now we can start adding some more months. To do that we'll use the 'or' operator, which is the union. We've got one query at the top and one query at the bottom, and in the bottom one I've swapped out the number eight for seven, because I want month seven now. When we union those two together, the labels are identical, so it effectively sticks them together in time. And there we are: we've got July.
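Schematically, the union looks like this (shown on the unintegrated rate for brevity; each full monthly expression gets the same treatment):

# Each branch isolates one month; `or` unions them along the time axis:
  (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 8))
or
  (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 7))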
Now, I did promise weird PromQL. I've got a kind of code-golf version for all months: if you want 12 months, you could do something like modulo-divide by three and compare it to 0, 1 or 2. I'm not very proud of this, although it is what's running in the code back home; I'm going to change it when I get back to my desk. Why three rather than two? Because February is too short, as are the other short months. I'm slightly happier with this next version; I think it's more predictable and possible to reason about: just copy the query twelve times over. I came from a background in Ruby, and there's a great conference speaker called Sandy Metz, who wrote a book called 99 Bottles of OOP that I absolutely love. She talks about getting to "shameless green": shameless green is the version that passes the tests, and it's shameless about it. Anyway, it works: there you can see June has appeared, and September has appeared.
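The "shameless green" version is then just that pattern written out in full, one branch per month; a sketch of the first few branches:

# One branch per month, copied twelve times over:
   (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 1))
or (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 2))
or (sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 3))
# ...and so on, through absent(month() != 12)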
That's fantastic. And just to put the icing on the cake: this has been scoped to a single ID so far. If we unscope from that single ID and look across IDs instead, now we can see that rich view of the data we saw before. So yeah, this is super interesting and potentially really valuable. Oh, and I can see I've got a little demo, I think, so you can see that this isn't just a picture.
Right, I was just going to zoom out a bit and interact with it a little, so you can see it's live. We can show and hide, focus on a single team, and share that URL with our colleagues. And I showed this one at the start as well; unfortunately I haven't really got time to cover it now, so I thought I'd say this could be your homework task, if you want to go away and have a go, or you can reach out to me afterwards if you want.
Anyway, that's it. I hope I've equipped you to do a little bit more with Prometheus, maybe taught you something new about PromQL, and that we've had a bit of fun doing it. Thank you everybody for listening. That wasn't too weird after all, or was it?
Q: Hi, thank you for the talk. Just a quick question: have you considered how you're going to deal with…

A: I guess I have some kind of an answer. The beauty of using that month function is that it knows when the month begins and when the month ends; that's why I'm using it instead of using 30 days. So I'd hope that this would help me get there, and I wouldn't have to think about it.
Q: Can you explain again why you had to use ignoring, instead of an on with parentheses, for the join part to work?

A: It might be that I could have used on with empty brackets.

Q: Yeah, that's the form I'm used to seeing, so I'm wondering if there's a specific reason why it had to be different.

A: No. Typically you've got a set of labels, say three labels, and if you want to join on one of them, you can either say on that one, or ignoring the other two; both do exactly the same thing.
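For reference, the two forms should be equivalent here, since the right-hand side carries no labels at all; which one you reach for usually just depends on how many labels you would otherwise have to list:

# Match by ignoring the one extra label on the left...
sum by (id) (bytes_received_per_second) * ignoring(id) group_left() absent(month() != 8)
# ...or by matching on the empty label set explicitly:
sum by (id) (bytes_received_per_second) * on() group_left() absent(month() != 8)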
Q: Thanks for the talk, very interesting. I have a comment and a question. The comment is that it looks like this ignoring-absent code can be replaced by scalar. Did you consider the scalar function?

A: By what?

Q: By the scalar function.

A: Scalar, yeah.

Q: That's the first part, just a comment. And the question is: did you consider time zones? Because, as far as I know, the month function returns its result in the UTC time zone. I'd recommend adding time zone support to this query.

A: That's a really good question. Through my work on this (and someone asked about leap years too), I'm often hitting time zone issues, and they manifest themselves most when daylight saving occurs, actually. So I'll get January, February, March, and then March again, and then April, May, June, or something, or I'll get a gap. I've discovered and reported quite a few bugs, actually, at least one, probably two or three, in Grafana around time zones and the way they're handled.
But basically, we build dashboards for our customers that show their bills, in a billing dashboard in Grafana, and we're also building Grafana app plugins that show how their bills break down by team. The bills are calculated in the UTC time zone, so as long as we always work in UTC, we can show them their actual bill and label it with the right month. It's not really relevant to them to see their bill in their local time zone, because it won't match up with what we're billing them. So mostly, yeah, we just work in UTC for this.

UTC for the win. A quick announcement: the first two lightning talk speakers, Kieran and Christian, please come up to the front so we can get started right after this.
Q: Hello, thank you, the talk is really amazing. Just a very, maybe dumb, question: if it's only for dashboarding, wouldn't it work to have a variable using the days_in_month function, and use the value returned from that expression as the range?

A: Though, surely you're looking at multiple months at once on some versions of the dashboard?

Q: Oh yeah, I mean, for example, creating a variable using the days_in_month function, providing maybe an up time series or something like that just to get the month it's going to be evaluated in, and then using the return from this function as the range selector. Does that work, or not really?

A: Yeah, I think those kinds of solutions can work. We've done some pretty gnarly things with dashboard variables as well. But yeah, if you had a dashboard that's viewing one specific month, for example, then for sure.

Q: OK, thank you.
Q: I have one question, if I may, on your left. Hello, amazing talk. I'm just curious: you do incredible analytics queries using PromQL, and typically, you know, data analysts prefer maybe SQL and other languages. So I'm curious what you have been missing in PromQL that you would do more easily in SQL. What could we add?

A: We had a hackathon project... So, before I joined Grafana I'd never worked in observability; I didn't know what a time series database was, or what logs were, basically. I'd been using a tool called Looker, which got acquired by Google, which is a BI tool, and I loved it as a dashboarding tool. I was like, oh, I love this dashboarding; little did I know I'd later go and work on dashboarding in Grafana. Anyway, what there was in the SQL world: you're writing these subqueries within subqueries within subqueries, and often, if you've got a subquery, you want to give it a name and make it a source of truth for your business.

Q: So, a recording rule?

A: Exactly, exactly, but there were different versions of this: you could have materialized views, or you could have views that were not materialized. Recording rules are like materialized views, but we don't have the unmaterialized kind. I think in VictoriaMetrics' MetricsQL they have variables; maybe that serves a similar purpose. So yeah, there's some room there.

OK then, thank you again for the weirdest talk ever.