YouTube Transcript: Building Observable Systems with eBPF and Linux (with Mohammed Aboullaite)
Summary
Core Theme
The discussion highlights the evolution of system monitoring from rudimentary, ad-hoc methods to sophisticated, unified observability strategies, emphasizing the critical role of eBPF and continuous profiling in managing the complexity of modern distributed systems.
Video Transcript
The worst system monitoring setup I've
ever witnessed was in the early 2000s
during the dotcom boom. There was this
company I was working with and they
needed exactly three servers. And they
signed a support contract worth the
equivalent of $2 million in today's
money. It was crazy back then. It's
absolutely ridiculous today. And I said
to them at the time, and I was only half
joking, I'll make you a counter offer. I
will give you the best support you will
ever witness. For half that price, for a
mere million dollars, I will camp out
next to the server rack for the whole
year. And I will never leave their side.
I'll be constantly watching them. And
they didn't accept my offer, which is a
shame because they went bust a few years
later, so I would have been able to
leave the room early.
Now, while we all contemplate whether
that would make an excellent season of
Squid Game, we must also contemplate
whether the state-of-the-art in system
observability has improved since those
days. And I hope it has because I'm
certain that the problems got harder.
Our expectations for scale and uptime
have gone up massively since then.
Meaning a lot of the systems we build
these days are distributed by default
which in turn means we need techniques
for building out different components.
We start introducing things like microservices to manage the complexity
which in turn opens up building systems
with many different languages and
different databases.
How do you stay on top of all this? How
do you make sure it's performing well?
And how do you debug things when they go
wrong? I'll tell you how you don't do
it. You don't do it in an ad hoc way.
It's no good having a different
monitoring technique for every piece in
the system. System observability needs a
unified strategy. You've got to shoot
for something that's going to work
everywhere on every server for every
component written in every language. And
I think that means you have to tackle
the problem from the kernel level upwards.
And that's where I need an expert.
Joining me to discuss the latest in
monitoring, profiling, and observability
strategies from the kernel all the way
to the dashboard is Mohammed Aboullaite.
He's a back-end engineer at Spotify and
he's going to take us through how you
can peek into the Linux kernel
programmatically with eBPF, how you
don't have to, because several projects have already done it, and how you go from there to a
complete monitoring picture of your
system. We've got a lot to pack in in
this one. We managed to cover everything
from packet filters to cultural changes.
All in service of getting a clear view
of what happens to your software when it
hits production. I'm your host Chris
Jenkins. This is Developer Voices and
today's voice is Mohammed Aboullaite. [Music]
And joining me today is Mohammed Aboullaite. How are you doing, Mohammed?
>> Very good. Very good. And uh good to see
you again. It's been uh quite so long.
>> It's been a whole week.
>> Exactly.
>> We were in Miami. We were supposed to
record this under the glamorous Miami
sun and logistics got in the way. So now
you're in a particularly glorious office
room there with the gray shining back at
you. And we'll do the best we can.
>> Yeah. And thanks for the flexibility. I got the calendar invite wrong, obviously, because I accepted it when I was in Stockholm and then the time zone shifted. So sorry for that, and thanks for the flexibility.
>> Oh, no problem.
>> I'm sure there's a link here between
calendar problems and
um overloading of disparate systems and having to reschedule the long-running
processes. I'm going to make that link
because we're going to talk about
profiling and performance and what to do
when your machine gets overloaded.
>> There we go.
>> So, um, you for context, you work at
Spotify and you worked at some other
interesting places.
>> You have done profiling in what we might
call the very wild, right?
>> Yeah. Yeah. Correct.
>> And I thought my first question is: is the state of profiling today such that there is one universal good answer that works on every operating system and every application, and we should start talking about that immediately, whatever it is? Or is there no one-size-fits-all solution, and we have to talk about the different approaches?
>> As with anything in software engineering, it depends, right?
>> And I think when we talk about production we generally talk about Linux as the dominant operating system in that sense. So a lot of the solutions that I worked with and worked on are primarily Linux solutions, so my experience is primarily around Linux. I won't be covering Windows, because I have no experience whatsoever deploying applications on Windows servers or using Windows servers. I used it for a brief period of time, just getting access to it, and that's pretty much it. But my experience has been primarily around Linux. I just wanted to get that out of the way and clarify it for you and the audience as well. So universally, I would say I don't know, and that's an acceptable answer, even if a lot of tech folks don't want to say "I don't know," especially these days. But for Linux we're getting close to it, because of eBPF and how it's built into the kernel, and we're probably going to dive into that. So whenever you have a Linux kernel from a specific version onwards, which should be widely supported by now, eBPF can be there, and then there are a lot of tools built on top of eBPF for profiling.
>> Okay we are definitely going to dive
into that. Yeah,
>> I want to ask you one more contextual
question before we start on that though,
>> which is um I was thinking about this. I
feel very out of date on what the state
of profiling is. It's a good reason to
have you on the show.
>> I I remember days of, you know, you
gather logs from different machines and
at least try and put them in one place
and look at those. You'd find things
that seem to be a bit weird and you'd
probably end up recompiling
a suspected app with a --profile flag that was specific to that language or that compiler, and you'd slog it away from there.
>> Yeah.
>> And are we still in that state, or has the state of the art moved on from that to something better than remedial profiling?
>> I think we are, in a sense
that um we as human beings like the
comfort zone. So that has been used for
quite a long period of time. We have a
lot of tools that are using it and then
it's just yeah a lot of people are still
using it and didn't get out of that uh
bubble. But on the other side, we now have a lot of tools that enable us to do that in a much more modern way, in a much more continuous way. And I believe the discussion we're going to have is more around what we now call continuous profiling: how we can get this continuous feed of data, similar to what we have with metrics. So how we can get continuous feedback about not only the health of our system but the code that runs in our system. How we can continuously verify how memory is used in our applications and how the CPU is behaving, not only from a holistic application point of view but also going down to the code and the lines of code: which method is using that much CPU, which code is basically eating that much of my memory, what's inside my heap, where my CPU spends a lot of its time. All of those are answers that profiling in general tries to provide, but the shift that is happening recently is that we are moving to continuous collection of that data. It comes with a lot of challenges, don't get me wrong, but it also comes with a lot of benefits, in that we are continuously getting that feedback and continuously analyzing it. Instead of getting a dump, analyzing it, and then trying to figure it out hours, minutes, even days later, we are now seeing it when it happens, in real time, which is a big shift from where we were at, in the state that you mentioned.
>> Okay. I was going to leave this till later in the podcast, but I have to bring it up now. I see a lot of problems with the idea of continually profiling all your applications.
>> Yeah. And the first is just the sheer volume of data being gathered.
>> Correct.
>> If that's solved, maybe we should talk about what eBPF is and then address how it solves it, because that seems like a showstopper to me.
>> So it's going to be a large amount of data that is generated. And there is no right or wrong answer here; you just have to experiment with it. You have to find the best use case for you, the best thresholds for you, and how you're going to benefit from the continuous profiling data. A simple rule of thumb that a lot of people like is: the more recent the data, the more frequently we keep it, and for more historical data we make that window longer. An example: we want to keep the frequency for the last five minutes higher, and for the last minute even higher. So we're getting, for example, a 100-millisecond resolution for the last minute, we can get a second for the last 5 minutes, we can get a minute for the last hour, and then we can expand on that. Having that kind of snapshotting enables us to lower the amount of data that we get before we push it to the server.
>> So you're saying... go ahead.
>> Sorry. Are you saying then that, like, I'm thinking of a web server, I might be able to go and see profiling data for every single function call that served a single web request?
>> Yeah.
>> Within the last 5 minutes. But if I come
a day later, I'm just going to get like
how long it took speaking to the
database, how long it took to serve the
whole request.
>> I mean, you can get the whole data. We capture the data based on thresholds, that's what's widely used, and then there is a time span within which you get a snapshot of what's happening in the profiling data. So let's assume it's 100 milliseconds. Then you can get the data for each 100 milliseconds. But if you check it over an hour, it's a lot of data. Over a day it's even more, and over a week it's going to be a lot of data that you need to store. And the problem with storage is that it comes with a cost. So you can have the granularity to go back a week at 100 milliseconds, you can do that, but then it comes with the cost of needing to save that data somewhere.
>> So what I was talking about is: it is a problem, but there are some ways to go around it, and one of those ways is the fidelity of the data, how frequently you want to keep the data. So we can basically minimize it by keeping the fidelity higher closer to the time that you are trying to look at, and then condensing and minimizing it as time passes. So you have less data, obviously a less fine-grained view, but you gain in terms of storage and how much you store,
>> Right, so does that mean, if I go back an hour, I'll find I've just got the average time it took to call this particular function?
>> Exactly, so you get something like that. Instead of getting the 100-millisecond resolution, you can get, for that hour, one sample each second or maybe every 10 seconds. So we optimize for that.
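To make the retention scheme just described concrete, here is a rough Python sketch of age-based downsampling of profiling samples; the tier boundaries, the Sample shape, and the function names are illustrative assumptions rather than any particular tool's API.

from dataclasses import dataclass
from time import time

@dataclass(frozen=True)
class Sample:
    timestamp: float   # seconds since the epoch
    stack: tuple       # call stack captured for this sample
    cpu_ms: float      # CPU time attributed to that stack

# Hypothetical retention tiers: (max age in seconds, bucket width in seconds).
# Recent data keeps fine resolution; older data is condensed into wider buckets.
TIERS = [
    (60, 0.1),        # last minute: 100 ms buckets
    (5 * 60, 1.0),    # last 5 minutes: 1 s buckets
    (60 * 60, 60.0),  # last hour: 1 min buckets
]

def bucket_width(age_s):
    """Bucket width for a sample of the given age, or None to archive/drop it."""
    for max_age, width in TIERS:
        if age_s <= max_age:
            return width
    return None

def downsample(samples, now=None):
    """Aggregate samples into (bucket start, stack) -> total CPU ms."""
    now = now or time()
    aggregated = {}
    for s in samples:
        width = bucket_width(now - s.timestamp)
        if width is None:
            continue  # older than the last tier: archival is handled elsewhere
        bucket = s.timestamp - (s.timestamp % width)
        key = (bucket, s.stack)
        aggregated[key] = aggregated.get(key, 0.0) + s.cpu_ms
    return aggregated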
>> Okay. Okay. Then I think we need to dive into what this mechanism is, so we can start to see what kind of data we can gather.
>> Yeah.
>> So, eBPF. I looked up the acronym for this: the extended Berkeley Packet Filter. I thought that sounds like a firewall. A packet filter is a firewall, isn't it?
>> It is. And, for the record, I have given a few talks about eBPF, and I always made a joke that eBPF and BPF have nothing to do with each other. They are similar in name, but the functionalities are very different.
>> BPF was meant to be a way to filter network packets.
>> Okay.
>> And the idea behind eBPF was exactly that: we want to modernize BPF and then extend it. That's where the name comes from, an extended BPF. However, it evolved way beyond BPF. There is no version 2.0 of BPF, for example; it became way more modern, way more structured, and it went even beyond what the original intentions were. So it started as a way to optimize networking, but then it got used elsewhere: oh, we can use the same principles for monitoring, and we can use the same principles for security. The idea that you can basically load programs into the kernel that were written in userland unlocked a lot of potential, and most importantly, in a secure way.
>> So, eBPF is basically a framework, a toolkit, that enables you to write a program in userland which then gets compiled and verified and loaded into the kernel as if it had been written in the kernel from the get-go. So the kernel can now have a set of modules, or micro-modules, and those modules can be written not only by the kernel developers but by everyone. That's where the extensibility comes from: we are extending the kernel, making it more pluggable and more modular, so we can attach bits onto it to extend its functionality. Of course this is an oversimplification of what the framework does, but at its core it's basically that: we are writing programs that can be loaded into the kernel. And of course it comes with a set of limitations. You can write it in C or Rust, because that's what the kernel supports; you can write it with Go and Python, but obviously that ends up as C compiled to be loaded into the kernel. And the program that you write needs to follow a certain specification, because there is a step that verifies that the code you write is actually safe to run, since it's loaded into the kernel. But putting all of that aside, the idea that the listeners and the viewers need to keep in mind is that eBPF is simply a way to extend the kernel, allowing us to write programs and load them into the kernel, where, from the kernel's point of view, they behave as if they had been developed from the get-go to run in the kernel.
>> Right. And this idea just unlocked a lot of potential. I mean, you can imagine running everything in the kernel.
>> Okay, so my first simple question is: are they dynamically loaded? I don't have to recompile the kernel for this?
>> you don't have to recompile the kernel.
>> Good. Okay, because as you were describing it, you were giving me flashbacks to compiling Linux kernels, and I don't need to go there ever again.
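For a concrete feel of that workflow, here is a rough sketch using the BCC Python bindings (one of several front ends; it assumes the bcc package and kernel headers are installed and that it runs as root). The small C program is compiled, run through the verifier, and attached to a kprobe at load time, with no kernel rebuild:

from bcc import BPF

# The in-kernel part is plain C; BCC compiles it and the kernel verifies it at load time.
PROGRAM = r"""
int trace_clone(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

b = BPF(text=PROGRAM)                                  # compile + verify + load
b.attach_kprobe(event=b.get_syscall_fnname("clone"),   # pick an attachment point
                fn_name="trace_clone")

print("Tracing clone() syscalls... Ctrl-C to stop")
b.trace_print()                                        # stream the kernel trace output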
>> Okay, so the next two questions are how
flexible is that? I mean, we're talking
about profiling, but could you write any
arbitrary kernel code? And if so, what's
the security model?
>> So you can't write any arbitrary kernel code. I mean, you can, if it passes the verification step. When you try to load the program into the kernel, there are two important steps that happen. You write the program and then you compile it, and if the compiler is happy, it is turned into bytecode, which the kernel can later JIT-compile. Now, an eBPF program needs to be attached to an attachment point, but we'll get to that later. When you try to load it to that attachment point, or load it into the kernel, there is an important piece of software that is called the verifier. The verifier does what its name says: it verifies that the code you are trying to load is actually safe to run. It verifies that you do not access, or try to access, arbitrary memory, and that you do not try to expose bits of memory that you don't have access to. It verifies, first of all, that you have permission to run the code. And then it verifies that the paths of your code end in a stable state. You can't have, for example, a while(true) in a kernel module; that would basically stop the kernel from working. Right? So it verifies that all the execution paths in your code end in a stable state: there is an end statement, and it is reachable. So the verifier does a lot of heavy lifting to ensure that the code you write is actually safe to run and to be loaded into the kernel. Of course, just to make everyone aware, we are talking about the kernel here. Aside from the fact that you should not load third-party or arbitrary eBPF code into your kernel, the verifier helps you with that, but it's still your responsibility to make sure that the code you try to load is actually safe to run.
>> Right. So just because it passes the
verifier, it doesn't mean you can just
blindly trust the code you've been asked
to insert.
>> Yeah, especially if it's coming from a third party, because this is the kernel. The verifier is still continuously evolving, but as we know in software engineering, there are bugs, and someone might have discovered a bug before the Linux community has shipped a patch. So someone could basically use it as malware, or use it to break your kernel, or use it to collect data. There are a lot of security concerns that go with it, and the best approach is to not load code that you have not verified into your environment.
>> Yes. You've got to trust and verify, I guess.
>> Exactly.
>> Yeah. Yeah. Okay. So, you mentioned bytecode. This is compiling to some kind of kernel virtual machine?
>> Yes.
>> Which presumably limits the footprint of
the code which is why the verifier
stands a chance of working.
>> Exactly. Yeah.
>> All right. Yeah, that makes sense to me.
>> Okay. So we are talking specifically then about the kinds of eBPF program that allow you to instrument the running kernel, and hence your programs.
>> Yeah.
>> How's that put together? What's it actually doing? I sort of imagine myself down in the cellars of kernel space, looking up towards where the application's running, wondering how I'm going to find it and instrument it.
>> So, you know, our high-level applications need to call the kernel for everything.
>> Yeah.
>> For accessing memory, for calling the CPU, spinning up threads, accessing the disk, all of that. So whenever you call the kernel, the kernel has visibility over that. It knows what you need. It knows what bits of code are getting executed. It has visibility over everything. So the eBPF folks, especially the ones interested in profiling, said: we have the visibility to do this, so why not simply leverage the information that we already have and enhance it with some additional information? Because when a program gets executed, the kernel can see, for a certain program, how long it used the CPU, how much memory it is using, all of that. We know that from the age of containers and even before that. So we know we can instrument that bit.
>> Yeah.
>> So they basically added metadata: okay, this code is using that memory, and then dumped it into profiling information, because the kernel has access to everything. So we are just mapping: oh, this is the function that uses this amount of CPU. And then how do we collect that information? We dump it into a store somewhere, either locally, which some continuous profilers do, or we send it to a back end that does the analysis afterwards. So there are two strategies happening there: either in-cluster, or off-cluster in a dedicated environment that does the post-processing.
>> Right, so you might have a separate analysis team running a whole cluster of things, gathering from the main network.
>> Yeah, and that's how, for example, Datadog does it, and probably some of the cloud environments do it as well: we take the stuff from your environment, collect metrics from your environment, and then send it to dedicated servers that do the analysis and performance work. So you have some sort of agent, in that case an eBPF profiler, that does that. Or, one of the magic things about eBPF is that you can share data between the kernel and a userland program. What this means is that you collect the data and you save it in some sort of a database, which is not really a database, it's just maps, eBPF maps.
>> Yeah. Okay.
>> So you save it there, and because it's managed by the kernel, the kernel verifies that only that program has access to collect that data. So you're collecting the data from there, saving it in a place, and then your other program is running to collect that data and analyze it.
>> You probably want to compress that data for network traffic if you are sending it somewhere. So all that post-processing happens just afterwards: because we don't want to block the kernel much, we want to do the in-kernel operation as fast as possible, dump the data, and then all the compression, optimization, and cleaning up of the data happens in another userland program before it's sent to another cluster or server to do the post-analysis and all of that.
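To make that kernel/userland split concrete, here is another rough BCC-based sketch (again an assumption about tooling, not how any particular profiler is implemented): the in-kernel program counts vfs_read() calls per process into an eBPF map, and a plain userland loop periodically drains the map, where compression and shipping to a back end would happen.

from time import sleep
from bcc import BPF

PROGRAM = r"""
BPF_HASH(read_counts, u32, u64);          // the "database that is not a database"

int on_vfs_read(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *count;
    count = read_counts.lookup_or_try_init(&pid, &zero);
    if (count) {
        (*count)++;                       // keep the in-kernel work minimal
    }
    return 0;
}
"""

b = BPF(text=PROGRAM)
b.attach_kprobe(event="vfs_read", fn_name="on_vfs_read")

while True:
    sleep(5)
    # Userland side: drain the shared map; compression/shipping would happen here.
    snapshot = {k.value: v.value for k, v in b["read_counts"].items()}
    b["read_counts"].clear()
    print("vfs_read() calls per PID over the last 5s:", snapshot)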
>> Right. So whilst the mechanism is completely different, mentally it's the same as someone dumping their web data somewhere for someone else to process.
>> Exactly.
>> Yeah. Yeah. Okay. That makes perfect sense.
>> Yeah. The one thing I'm not getting here is: I'm at the kernel level. Let's say I write a Python program that has a badly written for loop and allocates way too much memory.
>> Yeah.
>> At the kernel level, I see that I'm mallocing this chunk of memory over and over. How do I stitch these things together, from the kernel-level call to the line of Python that's badly written?
>> You mean how to know which...
>> Yeah. So, like, which line of...
>> Uh-huh. Go ahead.
>> When the kernel is madly allocating memory, it doesn't know that that's because there's a for loop on line 27.
>> Mhm.
>> But I want to know that as a programmer.
>> Yeah.
>> So how do I connect the dots from user space to kernel space?
>> I honestly don't know how that bit is actually managed, to be honest. The tools do it somehow; maybe if we dive into some of those tools we would know the answer. I would imagine it's basically that we know when this bit is using that much data, and then it's instrumented and enhanced with other bits. But I just don't want to throw out anything to the audience that I'm not aware of or don't know.
>> Okay. But someone has crossed that up and down that tower of Babel, to the point where I can see my Python program and the impact of it.
>> Correct. All the continuous profiling tools based on eBPF do that. It's one of the building blocks of having a profiler: knowing which bit of code is using that much CPU and that much memory and all of that. So the eBPF-powered continuous profiling tools do that as well; they managed to crack which bit of code is using that much data. I can imagine that when you call the eBPF program you can enhance it with the context, so that context may be enriched as well to get this data. There are different bits that can be used. I can't say for sure how it's done, but for an eBPF program you have the context as well: when we want to call an eBPF function, we have the context of what we are calling and why, and all of those bits can basically be glued together in order to get that information.
>> Right. Yeah.
>> I'm not a kernel developer, so that bit is a little bit nuanced to me.
>> It's useful to know, like, you've used this a lot. It's useful to know where the boundaries of your knowledge are, what you had to know and what you've just learned, because it's interesting.
>> So, this makes me wonder: when I'm writing a program knowing that it's going to be instrumented, do I change my program? Can I? Should I? Must I?
>> No. For most of the cases, no, you don't have to. And that's one of the benefits of using an eBPF-based continuous profiling approach: it doesn't lead you to rewrite your program. With some profilers you basically need to add bits that say where you want to profile and why, in order to gather that information. With a lot of continuous profiling you don't have to add annotations; especially with eBPF you have all the information there for you. So you have the lines of code and their hierarchy as well, which one called which, and which one ended up using or blocking that much memory, that much IO, and that much CPU. You can basically instrument your code if needed. Don't get me wrong, in some use cases you might want to instrument the code because you are not getting the bigger picture. But as those profiling tools get richer and more widely used, they cover so many of the use cases that we rarely now come to a use case where we have to instrument our code. While they offer the option, 90% of the use cases are covered. You might stumble into the 10% where you need specific data that is not covered, and you might need to instrument your code in order to send this data to the profiler to be analyzed. But 90, 95% of the use cases are basically covered by those continuous profiling tools. And I have to mention, it's not only eBPF. eBPF is shiny, and it is going toward what you mentioned in the beginning, standardizing the way we collect the data, making it the universal way of not only profiling but also monitoring our applications and securing them. But there are tools that use agents to collect this data, such as, for example, Pyroscope: for each language they have a dedicated way of gathering the data. So it's not only eBPF; even if eBPF is now booming within the Linux and kernel community, there are other ways. You can install small agents with a smaller footprint into your production environment, with lower overhead, to collect this data. But back to your question: for most of the cases, no; you might run into it, but yeah.
>> Okay. I do like the idea, if I'm understanding this correctly, of one tool that will work regardless of language or tool.
>> Yeah, that's the power of eBPF. In the same way that containers managed to add a layer of abstraction, so we don't care about what language you are running, you just provide us with this container abstraction format and we will deploy it, and we built orchestration tools that do a lot of that, and we took it even into the era of AI because of that abstract way of seeing things, eBPF added another layer of abstraction. As soon as you have a Linux kernel within a specific version or later, you can basically write an eBPF program that does the magic for you.
>> That sounds like it'd be full of lots of different ideas, but I'm going to try and stick to profiling and not drag us down a rabbit hole. Very tempting though it is. So,
I guess the next question I have to ask
is what's the overhead of this? Because
I have been in situations where like
profiling everything adds like 30% to
your CPU.
>> Yes. And it had been the issue for so long, until the beginning of the 2000s, when Google decided to publish a paper on how they do large-scale profiling on their end. I forget what they called it; they didn't call it continuous profiling, but it was something like large-scale data center profiling, whatever the name was. But Google...
>> I'll find it and I'll put it in the show notes by the time this is published.
>> Google was the first to publish a paper with a working initial version of a profiler, using Go I think, or something like that. And it set out the building blocks of building a profiler, in the sense that you collect the data, there is a profiler that analyzes the data, and then you need a way to store this data and a UI to see it.
>> Yes. So they shared that in the 2000s, I think, and then the industry just followed the path of Google, with variations on it, but it set the foundations of how profilers are built. Even if you, or the audience, check now, for some of the tools, either open source or commercial, the architecture, if it's available, is very similar to that original paper. So we all benefited from that original paper that was published by Google,
>> and at that stage they must have been
fairly confident that the overhead was
small enough that they could run that on
the entire Google infrastructure.
>> Yes, correct. And you can imagine that the paper was published in 200-something, but it had been running for a few years before that in Google's data centers in order for it to be published. So it was definitely running years before that. And one of the advantages is the small overhead. They basically unlocked that bit of how to collect continuous profiling information with minimal overhead across all those data centers, and then if Google does it at that scale, what's stopping others from doing it? A lot of companies followed afterwards: Meta, back then known as Facebook, did it, Amazon, and all of that, and then a lot of tools started to pop up, both open source and commercial.
>> Yeah. Because when you're working at that kind of scale, you can't afford to wait
for a problem to come along and then
persuade a team to profile a specific
application. Right.
>> Exactly. I mean, at that scale you have a lot of feature teams or product teams, and then one team that is dedicated to the infrastructure, or it's going to be a big team, but still: whenever you need something profiled, you go back to that team, ask for the dump, and then go back again. It's going to take forever, and that team will be flooded with requests. So having that as self-service, providing this information continuously to the teams that want it, has a lot of benefits, both in terms of getting the information fast, but also in that the teams can control what information they want to get. So, back to your original
question of the overhead, there is
obviously an overhead. Um it depends on
the language, it depends on the runtime,
it depends on the tool that you're going
to use. So for eBPF programs, for the ones that are open source at least, they claim an overhead between 1 and 2%. Things like Pyroscope, Parca, and Pixie claim that the overhead they have is between 1 and 2%. Generally it's between 1 and 5% if you want to run it in production, and that's because of a lot of factors: how frequently you want to collect the data, how much data you are collecting, and how, where, and when you do your processing and compression of the data. Do you postpone it, do you send it to another server that handles it? That is obviously going to cost CPU, and there's a lot of transmission IO going on there. If you want to do some pre-processing in your cluster, on your server, before sending it to another server, that will also take from your CPU. Pixie, for example, stores some data in your cluster, in memory I think, for a period of time, so that also takes from your server. But it's still within a respectable and acceptable threshold. So yeah, general rule of thumb: between 1 and 5%. With eBPF programs they claim it's less than 2%, so you can get it around 2%, which is a huge win compared to what we had previously with those heavy tools.
>> Yeah. Yeah. I'm trying to do the kind of back-of-the-envelope calculations in my head. Anything less than 10% I'd be pleased with. Anything around 1% I'd be tempted to leave continually running.
>> Yeah.
>> Right.
>> The problem is not leaving it continuously running. Leaving it continuously running is a feature; you can leave it continuously running. The problem is what you're going to do with this amount of data. The problem is not getting the data, because I think that bit is solved somehow, since the overhead is not that much, so we can keep it running forever. You can get as much information as you want, but once you get that information, what are you going to do with it? And then, back to the discussion that we started this episode with: okay, I have this huge amount of data. Should I keep it all? Should I pay for it? Does it make sense to me? And then there's also the question that profiling needs to be actionable. There's no sense in having all that information if it's overwhelming and I don't know what to get out of the data, or how to use it to gather any meaningful conclusions. So how much data you keep is one question you need to answer, because it comes with a cost, and then how meaningful the data is for you, for gathering actionable insights, is the other.
>> Yeah.
>> These are two important questions that anyone willing to use, or already using, continuous profiling will have to ask, or has to answer, basically.
>> Yes, that's a common problem with being able to see inside the black box, right? The first problem is seeing inside, and then that creates a nice new problem: how do I deal with this floodgate?
>> Right.
>> Yep. So tell me, do you want to tell me how we deal with managing the sheer volume, or should we go to the tooling that lets you make sense of it?
>> I think we touched upon how we can manage the sheer volume. Some of it is basically keeping the fidelity: the fidelity goes down as the time span becomes larger, meaning that we don't care as much about data that is three months old, and we don't want as much granularity and fidelity for two-month-old data. We can have a medium resolution for one month, and then as time goes by we try to shrink and limit the amount of data that we keep. So we don't keep all the information on our server, but try to shrink it and condense it to lower the cost. And then, what do you do with the three-months-plus data? Do you throw it away? Do you keep it in an archive? Same question for the month-plus data: probably you don't keep it in your primary database or primary data store; you have it in a backup or secondary data store that is way cheaper. So those are some techniques to lower the cost, because in continuous profiling, because of that continuous feed, we're generally, in 95% of cases, interested in the recent information: either what's happening now, or what actually happened in the past week or the past month. Once it goes beyond that, it becomes kind of meaningless, or less useful, compared to what I have now, because continuous profiling is enabling me to compare how my code is performing now versus yesterday or last week, what I did wrong in that time span that brought the performance down, or up. Those are the sorts of questions that I'm trying to understand.
>> Yeah. I would think I'm looking at recent data for "oh no, something suddenly got slower and we're panicking," and then the older stuff is for "is it me, or are we getting slower? Are we slower than we were six months ago?" Right.
>> Correct. Yeah.
>> Do you also get... because the third question I would ask is: we've got Rust and Python in our company, and are the Rust programs faster?
>> That's a good question.
If you check the studies, or what are called the benchmarks, they suggest that it's significantly faster. However, the benchmarks are basically simple code: you send a request or you execute a bit of code millions of times and then you compare. But then assume, for example, and I'm a Java developer myself, so Java is slower, Rust is up here, but then you add features to them both, right? You won't run in production code that basically prints hello world, does one simple thing, or executes one request; you add more and more features to it. And as you add features, both of them become slower and slower. Rust may still be a little faster as we're adding more features to it, but you can see that the distance between the two languages, or different languages, is basically going to shrink as we add more features, because that's the way software grows: we are adding code that might make things slower. So it's not only that the runtime might be faster; we are adding more features, which will make it a little bit slower, and then we might or might not see that big of a difference in our production environment. That's probably one of the things that continuous profiling might help with, or might not, depending on the complexity of the code. So if we follow the benchmarks, yes, definitely Rust is faster compared to, probably, Go or Java, but as we start to add more and more stuff, the complexity of the application eats into that, and our programs become, bit by bit, slower and slower, and that's where probably some of the enterprise languages benefit in the long run, because they are optimized for that bit. So, just to say: even if the benchmarks say one language is slower than another, it also depends on the use case, what you are running, how you're writing your code, and all of that. So it's like...
>> It's not as simple as that.
>> If your program mostly waits for user input, then the thing you optimize is sitting behind the keyboard. Right?
>> Exactly. Yeah. Exactly.
>> But it'll be interesting to see that kind of data and say, okay, in the real world, in our company, this is how it's actually playing out versus the benchmarks. I think that'll be fascinating.
>> Okay. So let's get into the kind of reports that you can get from one of these tools. The tooling is a slightly separate thing from eBPF; there are analysis tools for this data.
>> So they are packaged as one. When you use a tool, you use it as one, but the architecture is different: there are multiple components to it, and some of them are based on eBPF. So eBPF is one of the ways to gather the data; another way is using an agent to get this data; and the third option is to instrument your code, as you mentioned, and then send this data off. And then you would have a profiler, which is basically the heartbeat, the backbone, the core of the profiling information, and then we need a way to analyze the data, and a way to see the data and visualize it. That architecture is what the Google paper, back again, described, and what all the solutions have in common. They have bits of difference here and there, they optimize things differently here and there, but they all share the same components, basically.
>> Is there much to choose between them? Are we arguing over which one has prettier graphs, or...?
>> It's not only that. How much it costs is definitely important. How it's optimized for the data. Is it open source or commercial? If it's open source, how much is the cost of running it, and is there support for it, especially if you're going for the enterprise world? If it's commercial, how much am I paying, and that cost will basically scale with how much data you want out of it. So it's not how nice the graphs are, but most importantly how much data, and how granular the data, I would get. I think that's the biggest factor. And they all share similar graphs. So we'd have flame graphs; whenever we talk about profiling, flame graphs come to mind. They all offer flame graphs to see which bits of code are using what amount of CPU. Can I see the memory as well? You can compare between now and a period of time, or two different periods of time. They offer a way to filter, obviously, so we can pick CPU, memory, IO, and all sorts of stuff.
>> This is making me think of the Google Chrome browser console, where I can get flame graphs for my running JavaScript. It's all very nice, but it will just tell me what's happening in the browser, and I want that for the entire system, right?
>> Yeah. So, continuous profiling is basically that for your application, or your applications.
>> Yeah. Okay. So, what kind of questions do you ask of a system like that? I mean, where do you begin? It's a bit needle-in-a-haystack, isn't it? Are you looking for a report that says this thing is slow, or are you waiting for someone to say the web servers are a bit slow today and then you dig in? How do you start to know what to look at?
>> So, I think if a system is slow, you would have traces. Everyone probably knows the three pillars of observability: we have traces, metrics, and logs.
>> What's the difference between traces and logs?
>> So logs are contextual data about your code. Basically you are saying: code executed here, there's an error here. We instrument the application to log stuff. Traces are basically trying to correlate events, to correlate the journey of a request between multiple services and components; they're widely used in distributed applications. While logs are within a single application, where you can see the events that you added in your code and follow the path of a single event or request in your application and how it went, which helps you understand how the request behaved in your application, traces try to find the correlation for a request and draw the journey of that request across multiple components. It's very useful in distributed systems and microservices architectures.
>> Yes. So what you're saying is, if my user says "I tried to create an account and it was really slow,"
>> Yeah.
>> I need to somehow trace that request through the user microservice to the account registration microservice and know that those two calls on two different machines are one semantic thing.
>> Exactly. So yeah, for the user it's one transactional, atomic operation,
>> Yeah,
>> but for us it could be the gateway, it could be the account service, it could be another service, it could be the database. So the traces help to understand where that bottleneck comes from.
>> Yes.
>> And then once you identify that bottleneck, you want to understand why this application's performance is low. So then you go and check the metrics of that application and the logs of that application. The metrics might tell you that the CPU is running high, or there's a lot of wait time, or we are just sitting idle waiting for an IO operation, or there is memory being consumed very heavily in that application, and the logs might say similar things. But then you don't know in which part of the code the issue actually is, and that's where the profiler comes in. So if you're interested in CPU, in why my application is getting slower or spinning a lot of CPU, you can go and check the bits of code that are doing that. It might be a loop operation that performs poorly, so it consumes a lot of CPU time and slows everything down. And for memory, there may be a memory leak happening somewhere that you weren't aware of. I mean, you can see it in the metrics, but you don't know what objects or what methods are causing that leak, so profiling can help you with that. So generally that's the journey: an error is happening, you have an error, and then you try to boil it down to where exactly this error is coming from. And then, once the maturity goes up, you start to import it as part of your health metrics, in the sense that you can include it as part of your post-deployment routine. You deploy something and then you can compare it side by side: this bit that I added, did it have a significant impact on memory, a significant impact on CPU? And then you can take it a little bit further and add some alerts. You can have thresholds, like: if the CPU is more than 10%, send me an alert; if the CPU is more than 5%, or the memory more than 5%, send me an alert. It could be a low-urgency alert, just "I've noticed this, you need to be aware." And then you can throw a little bit of AI into it and have it be more dynamic, so it analyzes things for you. Some solutions offer that as well, and then it notices: oh, post-deployment there is a shift in the patterns between the previous deployment and this deployment, you are noticing this kind of pattern.
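As a rough sketch of the post-deployment comparison and threshold alerts described here (the profile format, the 5% threshold, and the function names are made-up illustrations, not any product's API), the check could look something like this:

def cpu_share(profile):
    """Turn per-function CPU milliseconds into fractions of total CPU time."""
    total = sum(profile.values()) or 1.0
    return {fn: ms / total for fn, ms in profile.items()}

def post_deploy_alerts(before, after, threshold=0.05):
    """List functions whose CPU share grew by more than `threshold` after a deploy."""
    b, a = cpu_share(before), cpu_share(after)
    alerts = []
    for fn, share in a.items():
        delta = share - b.get(fn, 0.0)
        if delta > threshold:
            alerts.append(f"low-urgency: {fn} CPU share up {delta:.1%} since last deploy")
    return alerts

# Illustrative profiles: function name -> CPU milliseconds sampled in a window.
before = {"handle_request": 400.0, "serialize_json": 100.0}
after = {"handle_request": 400.0, "serialize_json": 300.0}
print(post_deploy_alerts(before, after))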
>> Yeah. And when you've spent a whole sprint or two doing performance optimization, you want to be able to say, "Look, I made it this much faster." Right?
>> Exactly.
>> You want quantifiable credit for your end-of-year review.
>> Who doesn't?
>> Yeah.
>> Okay. You've got to explain the technical details of this particular thing for me. Someone makes a request to the web server. It goes to a user registration thing.
>> Yeah.
>> How do I stitch that together? That seems to me like I would have to make some kind of code changes to be able to connect those two calls together.
>> You mean for the tracing?
>> Yeah. Like, how does the tool that's constructing this trace for me know that this request over there resulted in that API call over there?
>> We modify the context. So there is a request that you make, and generally, for gRPC we extend the context, or for HTTP we add it in the headers. So there is a request ID and, I forget, a session ID or something else, and then we propagate them. The request ID is the one from point A to point B, and the session ID lives for the lifetime of the whole request, the whole session basically. Combining those two, we're able to correlate which services the session, or the request in that case, went through. Or maybe it's span and request, I forget the exact names, but that's basically how it works. Using these two fields we are able to reconstruct the journey through the application, knowing it went from point A to point B to point C to point D, and also measuring the time it spent in each and every hop.
>> Right. But is that something... if I've got a web server sitting in between a database and a load balancer, I can imagine from what you said that the web server would say: the load balancer is giving me this session ID, I now need to pass it on to this SQL call, or something.
>> Yeah.
>> So I would have to make a code change to pass the session IDs around.
>> Generally we use libraries to do that for us. The ecosystem has evolved so that these things are now embedded in most applications. For Spring, for example, there is Spring Sleuth, I think that was the name, which basically offers you those bits. I'm a Java developer, so that's the example I'm giving, but I imagine there are equivalent things in the cloud native ecosystem as well. There is Envoy, which basically offers you a way to visualize those bits. So there are a lot of tools that offer to do that for you; you just need to embed them in your application. So the entry point is aware: oh, that's the entry point, there is no span ID or session ID or request ID, so I need to be the generator of that session, or that transaction, or that request, and then it just gets propagated over and over again until it comes back, and then it's stamped in to save the data and save the whole history, and that's it basically.
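For a feel of what those libraries automate, here is a rough, framework-free sketch of passing trace context along in HTTP headers. The header names, helper names, and URL are made up for illustration; real tooling such as OpenTelemetry typically uses the W3C traceparent header and does this propagation for you.

import uuid
import urllib.request

def ensure_trace_context(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]   # a fresh span for this hop
    return {"X-Trace-Id": trace_id, "X-Span-Id": span_id}

def call_downstream(url, trace_headers):
    """Forward the trace headers so the downstream service joins the same journey."""
    req = urllib.request.Request(url, headers=trace_headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Inside a request handler (framework-specific details omitted):
# ctx = ensure_trace_context(dict(request.headers))
# call_downstream("http://account-service.internal/register", ctx)  # hypothetical URL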
>> Right. Yeah. You're giving me a lot of flashbacks during this episode, but I can sort of see how, like, Java would be in the background attaching a thread-local variable which would then get passed to my ORM. Yeah. Okay.
>> Correct. Yeah.
>> So mostly you want to arrange for that to happen by magic, but presumably in some frameworks or languages you do have to step in.
>> Exactly. I mean, it's not happening by magic, but if you use some of those frameworks you won't even think about it; it just happens. But I think the idea is just that they stitch in context information that enables them to collect that data, then gather it back and stitch things together.
>> Yes. Okay.
>> I think OpenTelemetry has support for it as well. It's become the de facto standard when it comes to monitoring and all of that.
>> Right. This implies that... I mean, it's called continuous profiling as a technique, but it implies it also has to be ubiquitous profiling. You'll only get those benefits if all the machines are always profiling.
>> Yes, correct.
>> That's a big ask for a team, maybe. All the machines, always profiling.
>> It obviously comes with a cost, and we are living in an era where you pay per node or instance. So the more instances you have, the more likely it is that you have to pay a big load of money, and for some that might be beyond their capacity. So you can reserve it for, or limit it to, your critical services, the ones that you know are consuming a lot of resources, where you want to optimize those resources or at least keep those services healthier and more performant, and that you know bring more value.
so you want probably to invest in these and then when it proves value or when
and then when it proves value or when the um you're you're having a great year
the um you're you're having a great year and then you can offer that luxury of
and then you can offer that luxury of generalizing it to other services. Um so
generalizing it to other services. Um so I believe the there might be trade-offs
I believe the there might be trade-offs as you mentioned uh because the ideal
as you mentioned uh because the ideal scenario which would be to have it
scenario which would be to have it everywhere. Few companies have that
everywhere. Few companies have that luxury. Uh so you might want to
luxury. Uh so you might want to basically uh limit it to the critical
basically uh limit it to the critical services. Uh start there and then see if
services. Uh start there and then see if it brings value. Um make the team also
it brings value. Um make the team also aware of it because it comes also with a
aware of it because it comes also with a mind shift, a mental shift. uh from
mind shift, a mental shift. uh from traditional profiling or the pillars of
traditional profiling or the pillars of observability, it needs to be uh proven
observability, it needs to be uh proven um uh bringing value. So allowing that
um uh bringing value. So allowing that time of adaptation and uh culture change
time of adaptation and uh culture change and all of that is also important.
and all of that is also important. >> Oh yeah, because you probably need to
>> Oh yeah, because you probably need to get everyone in the organization to
get everyone in the organization to start buying into it.
start buying into it. >> Exactly. Um
>> Exactly. Um >> we as I mean we as human beings don't
>> we as I mean we as human beings don't like changes. Um and it's it it like
like changes. Um and it's it it like changes that someone else has told us we
changes that someone else has told us we have to make.
have to make. >> Exactly. Uh and even if we science prove
>> Exactly. Uh and even if we science prove that it brings value value, we would
that it brings value value, we would find a way to say it's not. Uh so
find a way to say it's not. Uh so allowing the time to uh shift that and
allowing the time to uh shift that and prove value and allowing the time for
prove value and allowing the time for the adaptation is also needs to be um
the adaptation is also needs to be um taken care of. Um because
taken care of. Um because it's the data is there, the graphs are
it's the data is there, the graphs are nice, you can do whatever you want with
nice, you can do whatever you want with it. But then if there is no developer uh
it. But then if there is no developer uh that can take those insight takes takes
that can take those insight takes takes the data and turn it into an action to
the data and turn it into an action to optimize things to improve things. It's
optimize things to improve things. It's basically uh for for uh for the company
basically uh for for uh for the company is just like a cost that they might want
is just like a cost that they might want to get rid of because it's not has it
to get rid of because it's not has it has not proven value. So it's there's a
has not proven value. So it's there's a little bit of adaptation in there and
little bit of adaptation in there and making sure that it brings value to the
making sure that it brings value to the developer and then it's part of their uh
developer and then it's part of their uh process and continuous improvement and
process and continuous improvement and all of that.
all of that. >> Yeah. So is does this mean that you are
>> Yeah. So is does this mean that you are very much you're trying to when you've
very much you're trying to when you've done it you're trying to show like look
done it you're trying to show like look how easy it is to profile your
how easy it is to profile your application rather than I found a
application rather than I found a problem with your application and here's
problem with your application and here's the data you need to go and fix it
the data you need to go and fix it >> um maybe both maybe oh uh it's easier to
>> um maybe both maybe oh uh it's easier to set up um it could basically can be
set up um it could basically can be onboarded part of your morning routine
onboarded part of your morning routine you just came in and then open uh in a
you just came in and then open uh in a publicly accessible URL. Uh check check
publicly accessible URL. Uh check check the stuff and then if you find a pattern
the stuff and then if you find a pattern that is weird in there, go and look at
that is weird in there, go and look at look into it and approve your
look into it and approve your performance of the application. Or
performance of the application. Or another way to get buying is when there
another way to get buying is when there is an issue there is no great way of
is an issue there is no great way of proving value than fixing that issue and
proving value than fixing that issue and using those tools. So it's going to be a
using those tools. So it's going to be a mix of uh both strategies. uh you want
mix of uh both strategies. uh you want to um get buy in that the tools is easy
to um get buy in that the tools is easy to use but also you want to prove value
to use but also you want to prove value and if you manage to prove it when
and if you manage to prove it when everyone hands on on something and
everyone hands on on something and everyone is focusing on something then
everyone is focusing on something then that would be a great one as well.
that would be a great one as well. >> Yeah. Yeah. Yeah. Um
>> Yeah. Yeah. Yeah. Um that makes me think of one more
that makes me think of one more practical question. Can I this is a
practical question. Can I this is a kernel level plugin. So can I roll it
kernel level plugin. So can I roll it out to many servers and dynamically
out to many servers and dynamically switch it on and off?
>> You mean the eBPF stuff?
>> Yeah. Like, can I just put it on all my clustered servers, maybe not gathering anything, and say "okay, we'll just flick the lights on for that one and take a look at it today"? Is it trivial to switch them on and off?
>> It depends. I mean, "trivial" is really about how easy your program is to use. It can be a piece of configuration where you set which services you don't want included, and that configuration can be centralized: you go to one place, configure it, the configuration is dispatched everywhere, and the eBPF program reads it; if it's on, it gathers the data, and if it's off, it just ignores it. It's also a matter of how well your program is written. That's talking about eBPF programs in general, but with continuous profiling, yes, I think you can do something like that if you want: just keep it running there, and if not, just turn it off, collecting no data and adding no overhead for you.
data and not adding any overhead for you >> I mean I'm going to make that more
>> I mean I'm going to make that more specific because I get the general idea
specific because I get the general idea but maybe if I couch In Javish terms,
but maybe if I couch In Javish terms, I've got a machine that I set up earlier
I've got a machine that I set up earlier and I want to either switch profiling on
and I want to either switch profiling on or off. Is it as simple as I'm going to
or off. Is it as simple as I'm going to call an MBAM N bean on that running JVM
call an MBAM N bean on that running JVM or am I redeploying a whole Kubernetes
or am I redeploying a whole Kubernetes pod with the new settings?
pod with the new settings? >> No, you can do it in in runtime
>> No, you can do it in in runtime >> dynamically at runtime. Yeah, you can do
>> dynamically at runtime. Yeah, you can do dynamic and that's that's what I
dynamic and that's that's what I mentioned the configur also depends on
mentioned the configur also depends on the configuration that you have in your
the configuration that you have in your u in your program basically because it's
u in your program basically because it's a program you can do anything with it.
a program you can do anything with it. >> Yeah. As long as you planned ahead.
>> Yeah. As long as you planned ahead. >> Exactly.
>> Exactly. >> Okay.
>> Okay. >> But if it's not as you mentioned you
>> But if it's not as you mentioned you would need basically to redeploy it.
would need basically to redeploy it. >> But then you don't need to recompile the
>> But then you don't need to recompile the kernel. It just like as any program you
kernel. It just like as any program you just need to deploy it and then it would
just need to deploy it and then it would take care of itself.
take care of itself. >> Okay. So I can set myself up with this
>> Okay. So I can set myself up with this kind of profiling. Yeah.
kind of profiling. Yeah. >> Okay. So, I've got a way of doing it
>> Okay. So, I've got a way of doing it that seems practical, has a low enough
that seems practical, has a low enough overhead that I might set up my whole
overhead that I might set up my whole cluster to run this.
cluster to run this. >> Yep.
>> Yep. >> I can see the data management problem,
>> I can see the data management problem, but you've given me some ideas about
but you've given me some ideas about mitigating that reporting. Getting back
mitigating that reporting. Getting back to the developer, I see the picture. I
to the developer, I see the picture. I think you need to start giving me some
think you need to start giving me some specific recommendations.
specific recommendations. Which tools would you pick for this
Which tools would you pick for this strategy?
strategy? I I like Periscope uh because it offers
I I like Periscope uh because it offers um best of both worlds. It has support
um best of both worlds. It has support for EBPF. Uh but if your orc for a way
for EBPF. Uh but if your orc for a way of another doesn't want to embold in an
of another doesn't want to embold in an ebpf journey yet um it has an agent um
ebpf journey yet um it has an agent um an agent alternative that you can use
an agent alternative that you can use for specific languages and runtimes. Oh,
for specific languages and runtimes. Oh, because the presumably the EBPF needs
because the presumably the EBPF needs root access, but the agent I can just
root access, but the agent I can just run in user space. So if I've got a
run in user space. So if I've got a security team, I don't need to have that
security team, I don't need to have that argument. Yes.
argument. Yes. >> Yeah. Exactly. Uh so it's it has a bit
>> Yeah. Exactly. Uh so it's it has a bit as you mentioned bit of both worlds. It
as you mentioned bit of both worlds. It has this enterprise uh flavor into it
has this enterprise uh flavor into it that make it more enterprisey and it's
that make it more enterprisey and it's it's pleasing to that enterprise word.
it's pleasing to that enterprise word. uh it tries to find shortcuts and fi and
uh it tries to find shortcuts and fi and meet you where you are you are rather
meet you where you are you are rather than trying to
than trying to move you to another uh runtime. Uh so
move you to another uh runtime. Uh so that's bit I like about it and then um
that's bit I like about it and then um periscope is from graphana and then a
periscope is from graphana and then a lot of companies has graphana as well.
lot of companies has graphana as well. So that the tooling would be based on
So that the tooling would be based on graphana which is um many are familiar
graphana which is um many are familiar with uh already um from the cloud side
with uh already um from the cloud side uh data dog um have their own and then
uh data dog um have their own and then we um when we were at Miami uh one of
we um when we were at Miami uh one of this one of the speakers basically were
this one of the speakers basically were using uh data dog and they were very
using uh data dog and they were very happy with it. So uh I think from if you
happy with it. So uh I think from if you want something that comes with ease of
want something that comes with ease of use uh you don't you don't want to
use uh you don't you don't want to bother yourself about deploying it and
bother yourself about deploying it and um managing it yourself uh data take
um managing it yourself uh data take care of that. Um, another two other open
care of that. Um, another two other open source tools. Uh, Parka is one. Uh, it's
source tools. Uh, Parka is one. Uh, it's really good. And then my favorite, uh,
really good. And then my favorite, uh, though it's still, uh, not early days,
though it's still, uh, not early days, but getting there. It's Pixie. Uh, it's
but getting there. It's Pixie. Uh, it's a CNCF project, uh, by, um, I forgot the
a CNCF project, uh, by, um, I forgot the name again.
name again. >> Uh, Newf
>> Uh, Newf by Newick. Um, so it's it's more than a
by Newick. Um, so it's it's more than a continuous profiling tool. It tries to
continuous profiling tool. It tries to be a monitoring tool. Uh it it it try to
be a monitoring tool. Uh it it it try to combines profiling together with the
combines profiling together with the metrics part. Um
metrics part. Um uh but it's it's it's cool. Uh it's open
uh but it's it's it's cool. Uh it's open source. Uh but it also offers a way to
source. Uh but it also offers a way to inboard it as part of new. I it has this
inboard it as part of new. I it has this new flavor as well. Um so it's it's my
new flavor as well. Um so it's it's my favorite tool. So, so far but again it's
favorite tool. So, so far but again it's still
still a little bit early days for it. Not that
a little bit early days for it. Not that widely adoption uh yet but it's it's
widely adoption uh yet but it's it's getting there since it's part of CNCF
getting there since it's part of CNCF projects. It has a larger community so
projects. It has a larger community so it's improving bit by bit.
it's improving bit by bit. >> Okay. I kind of want to ask you which
>> Okay. I kind of want to ask you which one you use at work but I suspect then
one you use at work but I suspect then the Spotify legal team will dive on this
the Spotify legal team will dive on this podcast. So I'll leave that question
podcast. So I'll leave that question entirely.
Uh, I think that gives me a complete picture. Maybe I need to go and play
picture. Maybe I need to go and play with one of these.
with one of these. Where would you start if it's just out
Where would you start if it's just out of pure curiosity? Pixie.
of pure curiosity? Pixie. >> I would start with Pixie. Yes. Um, yeah,
>> I would start with Pixie. Yes. Um, yeah, they have a a cloud version to So, you
they have a a cloud version to So, you would need to install the Pixie agent on
would need to install the Pixie agent on your cluster
your cluster >> and then they have a cloud last time I
>> and then they have a cloud last time I played with it and they have a cloud
played with it and they have a cloud version that basically get the data and
version that basically get the data and send it uh to it and you can visualize
send it uh to it and you can visualize it. Um so yeah, Pixie would be uh
it. Um so yeah, Pixie would be uh probably the first and then Periscope
probably the first and then Periscope maybe the second um because it's yeah uh
maybe the second um because it's yeah uh it's it's well integrated, easy to use
it's it's well integrated, easy to use uh and it's part of the Graphan
uh and it's part of the Graphan ecosystem so it's yeah nicer.
ecosystem so it's yeah nicer. >> Cool. I'm going to go and check those
>> Cool. I'm going to go and check those out. Muhammad, thank you very much for
out. Muhammad, thank you very much for joining me and I hope when you get to
joining me and I hope when you get to the end of this recording you don't
the end of this recording you don't think the um elapse time was too long or
think the um elapse time was too long or too short.
too short. It was very enjoyable.
It was very enjoyable. >> Great.
>> Great. >> Thank you for having me.
>> Thank you for having me. >> Thank you.
>> Thank you. >> Thank you, Muhammad. As always, the show
>> Thank you, Muhammad. As always, the show notes the place to head if you want
notes the place to head if you want links to anything we discussed. And
links to anything we discussed. And before you head there, please do take a
before you head there, please do take a moment to like and rate this episode and
moment to like and rate this episode and maybe share it around because it all
maybe share it around because it all helps other people find us. The
helps other people find us. The algorithm decides that if you liked it,
algorithm decides that if you liked it, other people will like it and off we go.
other people will like it and off we go. And that helps share the knowledge,
And that helps share the knowledge, which is the whole point of this
which is the whole point of this podcast. Please do make sure you're
podcast. Please do make sure you're subscribed so that you can find us in
subscribed so that you can find us in time for the next episode. And until
time for the next episode. And until then, I've been your host, Chris
then, I've been your host, Chris Jenkins. This has been Developer Voices
Jenkins. This has been Developer Voices with Muhammad Aboule. Thanks for
with Muhammad Aboule. Thanks for listening.
listening. [Music]