YouTube Transcript: Building Observable Systems with eBPF and Linux (with Mohammed Aboullaite)
Summary
Core Theme
The discussion highlights the evolution of system monitoring from rudimentary, ad-hoc methods to sophisticated, unified observability strategies, emphasizing the critical role of eBPF and continuous profiling in managing the complexity of modern distributed systems.
Video Transcript
The worst system monitoring setup I've
ever witnessed was in the early 2000s
during the dotcom boom. There was this
company I was working with and they
needed exactly three servers. And they
signed a support contract worth the
equivalent of $2 million in today's
money. It was crazy back then. It's
absolutely ridiculous today. And I said
to them at the time, and I was only half
joking, I'll make you a counter offer. I
will give you the best support you will
ever witness. For half that price, for a
mere million dollars, I will camp out
next to the server rack for the whole
year. And I will never leave their side.
I'll be constantly watching them. And
they didn't accept my offer, which is a
shame because they went bust a few years
later, so I would have been able to
leave the room early.
Now, while we all contemplate whether
that would make an excellent season of
Squid Game, we must also contemplate
whether the state-of-the-art in system
observability has improved since those
days. And I hope it has because I'm
certain that the problems got harder.
Our expectations for scale and uptime
have gone up massively since then.
Meaning a lot of the systems we build
these days are distributed by default
which in turn means we need techniques
for building out different components.
We start introducing things like microservices to manage the complexity
which in turn opens up building systems
with many different languages and
different databases.
How do you stay on top of all this? How
do you make sure it's performing well?
And how do you debug things when they go
wrong? I'll tell you how you don't do
it. You don't do it in an ad hoc way.
It's no good having a different
monitoring technique for every piece in
the system. System observability needs a
unified strategy. You've got to shoot
for something that's going to work
everywhere on every server for every
component written in every language. And
I think that means you have to tackle
the problem from the kernel level upwards.
And that's where I need an expert.
Joining me to discuss the latest in
monitoring, profiling, and observability
strategies from the kernel all the way
to the dashboard is Mohammed Aboullaite.
He's a back-end engineer at Spotify and
he's going to take us through how you
can peek into the Linux kernel
programmatically with eBPF, how you
don't have to, because several projects have already done it, and how you go from there to a
complete monitoring picture of your
system. We've got a lot to pack in in
this one. We managed to cover everything
from packet filters to cultural changes.
All in service of getting a clear view
of what happens to your software when it
hits production. I'm your host Chris
Jenkins. This is Developer Voices and
today's voice is Mohammed Aboullaite. [Music]
And joining me today is Mohammed Aboullaite. How are you doing, Mohammed?
>> Very good. Very good. And uh good to see
you again. It's been uh quite so long.
>> It's been a whole week.
>> Exactly.
>> We were in Miami. We were supposed to
record this under the glamorous Miami
sun and logistics got in the way. So now
you're in a particularly glorious office
room there with the gray shining back at
you. And we'll do the best we can.
>> Yeah. And thanks for the flexibility. I got the calendar invite wrong, obviously, because I accepted it when I was in Stockholm and then the time zone shifted. So sorry for that, and thanks for the flexibility.
>> Oh, no problem.
>> I'm sure there's a link here between
calendar problems and
um overloading of disparate systems and having to reschedule the long-running
processes. I'm going to make that link
because we're going to talk about
profiling and performance and what to do
when your machine gets overloaded.
>> There we go.
>> So, um, you for context, you work at
Spotify and you worked at some other
interesting places.
>> You have done profiling in what we might
call the very wild, right?
>> Yeah. Yeah. Correct.
>> And I thought my first question is: is the state of profiling today such that there is one universal good answer that works on every operating system and every application, and we should start talking about that immediately, whatever it is? Or is there no one-size-fits-all solution, and we have to talk about the different approaches?
>> As with anything in software engineering, it depends, right?
>> And I think when we talk about production we generally talk about Linux as the dominant operating system in that sense. So a lot of the solutions that I worked with and worked on are primarily Linux solutions, so my experience is primarily around Linux. I won't be covering Windows, because I have no experience whatsoever deploying applications on Windows servers or using Windows servers. I used it for a brief period of time, just getting access to it, and that's pretty much it. But my experience has been primarily around Linux. I just wanted to get that out of the way and clarify it for you and the audience as well. So universally, I would say I don't know, and that's an acceptable answer, even if a lot of tech folks don't want to say "I don't know," especially these days. But for Linux we're getting close to it, because of eBPF and how it's built into the kernel, and we're probably going to dive into that. So whenever you have a Linux kernel from a specific version onwards, which should be widely supported by now, eBPF can be there, and then there are a lot of tools built on top of eBPF for profiling.
>> Okay we are definitely going to dive
into that. Yeah,
>> I want to ask you one more contextual
question before we start on that though,
>> which is um I was thinking about this. I
feel very out of date on what the state
of profiling is. It's a good reason to
have you on the show.
>> I I remember days of, you know, you
gather logs from different machines and
at least try and put them in one place
and look at those. You'd find things
that seem to be a bit weird and you'd
probably end up recompiling
a suspected app with a --profile flag that was specific to that language or that compiler, and you'd slog it away from there.
>> Yeah.
>> And are we still in that state, or has the state of the art moved on from that to something better than remedial profiling?
>> I think we are, in a sense
that um we as human beings like the
comfort zone. So that has been used for
quite a long period of time. We have a
lot of tools that are using it and then
it's just yeah a lot of people are still
using it and didn't get out of that uh
bubble. But on the other side, we now have a lot of tools that enable us to do that in a much more modern way, in a much more continuous way. And I believe the discussion we're going to have is more around what we now call continuous profiling: how we can get this continuous feed of data, similar to what we have with metrics. So how we can get continuous feedback about not only the health of our system but the code that runs in our system. How we can continuously verify how memory is used in our applications and how the CPU is behaving, not only from a holistic application point of view but also going down to the code and the lines of code: which method is using that much CPU, which code is basically eating that much of my memory, what's inside my heap, where my CPU spends a lot of its time. All of those are answers that profiling in general tries to provide, but the shift that is happening recently is that we are moving to continuous collection of that data. It comes with a lot of challenges, don't get me wrong, but it also comes with a lot of benefits, in that we are continuously getting that feedback and continuously analyzing it. Instead of getting a dump, analyzing it, and then trying to figure it out hours, minutes, even days later, we are now seeing it when it happens, in real time, which is a big shift from where we were at, in the state that you mentioned.
>> Okay. I was going to leave this till later in the podcast, but I have to bring it up now. I see a lot of problems with the idea of continually profiling all your applications.
>> Yeah. And the first is just the sheer volume of data being gathered.
>> Correct.
>> If that's solved, maybe we should talk about what eBPF is and then address how it solves it, because that seems like a showstopper to me.
>> So it's going to be a large amount of data that is generated. And there is no right or wrong answer here; you just have to experiment with it. You have to find the best use case for you, the best thresholds for you, and how you're going to benefit from the continuous profiling data. A simple rule of thumb that a lot of people like is: the more recent the data, the more frequently we keep it, and for more historical data we make that window longer. An example: we want to keep the frequency for the last five minutes higher, and for the last minute even higher. So we're getting, for example, a 100-millisecond resolution for the last minute, we can get a second for the last 5 minutes, we can get a minute for the last hour, and then we can expand on that. Having that kind of snapshotting enables us to lower the amount of data that we get before we push it to the server.
>> So you're saying... go ahead.
>> Sorry. Are you saying then that, like, I'm thinking of a web server, I might be able to go and see profiling data for every single function call that served a single web request?
>> Yeah.
>> Within the last 5 minutes. But if I come
a day later, I'm just going to get like
how long it took speaking to the
database, how long it took to serve the
whole request.
>> I mean, you can get the whole data. We capture the data based on thresholds, that's what's widely used, and then there is a time span within which you get a snapshot of what's happening in the profiling data. So let's assume it's 100 milliseconds. Then you can get the data for each 100 milliseconds. But if you check it over an hour, it's a lot of data. Over a day it's even more, and over a week it's going to be a lot of data that you need to store. And the problem with storage is that it comes with a cost. So you can have the granularity to go back a week at 100 milliseconds, you can do that, but then it comes with the cost of needing to save that data somewhere.
>> So what I was talking about is: it is a problem, but there are some ways to go around it, and one of those ways is the fidelity of the data, how frequently you want to keep the data. So we can basically minimize it by keeping the fidelity higher closer to the time that you are trying to look at, and then condensing and minimizing it as time passes. So you have less data, obviously a less fine-grained view, but you gain in terms of storage and how much you store,
>> Right, so does that mean, if I go back an hour, I'll find I've just got the average time it took to call this particular function?
>> Exactly, so you get something like that. Instead of getting the 100-millisecond resolution, you can get, for that hour, one sample each second or maybe every 10 seconds. So we optimize for that.
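To make the retention scheme just described concrete, here is a rough Python sketch of age-based downsampling of profiling samples; the tier boundaries, the Sample shape, and the function names are illustrative assumptions rather than any particular tool's API.

from dataclasses import dataclass
from time import time

@dataclass(frozen=True)
class Sample:
    timestamp: float   # seconds since the epoch
    stack: tuple       # call stack captured for this sample
    cpu_ms: float      # CPU time attributed to that stack

# Hypothetical retention tiers: (max age in seconds, bucket width in seconds).
# Recent data keeps fine resolution; older data is condensed into wider buckets.
TIERS = [
    (60, 0.1),        # last minute: 100 ms buckets
    (5 * 60, 1.0),    # last 5 minutes: 1 s buckets
    (60 * 60, 60.0),  # last hour: 1 min buckets
]

def bucket_width(age_s):
    """Bucket width for a sample of the given age, or None to archive/drop it."""
    for max_age, width in TIERS:
        if age_s <= max_age:
            return width
    return None

def downsample(samples, now=None):
    """Aggregate samples into (bucket start, stack) -> total CPU ms."""
    now = now or time()
    aggregated = {}
    for s in samples:
        width = bucket_width(now - s.timestamp)
        if width is None:
            continue  # older than the last tier: archival is handled elsewhere
        bucket = s.timestamp - (s.timestamp % width)
        key = (bucket, s.stack)
        aggregated[key] = aggregated.get(key, 0.0) + s.cpu_ms
    return aggregated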
>> Okay. Okay. Then I think we need to dive into what this mechanism is, so we can start to see what kind of data we can gather.
>> Yeah.
>> So, eBPF. I looked up the acronym for this: the extended Berkeley Packet Filter. I thought that sounds like a firewall. A packet filter is a firewall, isn't it?
>> It is. And, for the record, I have given a few talks about eBPF, and I always made a joke that eBPF and BPF have nothing to do with each other. They are similar in name, but the functionalities are very different.
>> BPF was meant to be a way to filter network packets.
>> Okay.
>> And the idea behind eBPF was exactly that: we want to modernize BPF and then extend it. That's where the name comes from, an extended BPF. However, it evolved way beyond BPF. There is no version 2.0 of BPF, for example; it became way more modern, way more structured, and it went even beyond what the original intentions were. So it started as a way to optimize networking, but then it got used elsewhere: oh, we can use the same principles for monitoring, and we can use the same principles for security. The idea that you can basically load programs into the kernel that were written in userland unlocked a lot of potential, and most importantly, in a secure way.
>> So, eBPF is basically a framework, a toolkit, that enables you to write a program in userland which then gets compiled and verified and loaded into the kernel as if it had been written in the kernel from the get-go. So the kernel can now have a set of modules, or micro-modules, and those modules can be written not only by the kernel developers but by everyone. That's where the extensibility comes from: we are extending the kernel, making it more pluggable and more modular, so we can attach bits onto it to extend its functionality. Of course this is an oversimplification of what the framework does, but at its core it's basically that: we are writing programs that can be loaded into the kernel. And of course it comes with a set of limitations. You can write it in C or Rust, because that's what the kernel supports; you can write it with Go and Python, but obviously that ends up as C compiled to be loaded into the kernel. And the program that you write needs to follow a certain specification, because there is a step that verifies that the code you write is actually safe to run, since it's loaded into the kernel. But putting all of that aside, the idea that the listeners and the viewers need to keep in mind is that eBPF is simply a way to extend the kernel, allowing us to write programs and load them into the kernel, where, from the kernel's point of view, they behave as if they had been developed from the get-go to run in the kernel.
>> Right. And this idea just unlocked a lot of potential. I mean, you can imagine running everything in the kernel.
>> Okay, so my first simple question is: are they dynamically loaded? I don't have to recompile the kernel for this?
>> you don't have to recompile the kernel.
>> Good. Okay, because as you were describing it, you were giving me flashbacks to compiling Linux kernels, and I don't need to go there ever again.
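For a concrete feel of that workflow, here is a rough sketch using the BCC Python bindings (one of several front ends; it assumes the bcc package and kernel headers are installed and that it runs as root). The small C program is compiled, run through the verifier, and attached to a kprobe at load time, with no kernel rebuild:

from bcc import BPF

# The in-kernel part is plain C; BCC compiles it and the kernel verifies it at load time.
PROGRAM = r"""
int trace_clone(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

b = BPF(text=PROGRAM)                                  # compile + verify + load
b.attach_kprobe(event=b.get_syscall_fnname("clone"),   # pick an attachment point
                fn_name="trace_clone")

print("Tracing clone() syscalls... Ctrl-C to stop")
b.trace_print()                                        # stream the kernel trace output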
>> Okay, so the next two questions are how
flexible is that? I mean, we're talking
about profiling, but could you write any
arbitrary kernel code? And if so, what's
the security model?
>> So you can't write any arbitrary kernel code. I mean, you can, if it passes the verification step. When you try to load the program into the kernel, there are two important steps that happen. You write the program and then you compile it, and if the compiler is happy, it is turned into bytecode, which the kernel can later JIT-compile. Now, an eBPF program needs to be attached to an attachment point, but we'll get to that later. When you try to load it to that attachment point, or load it into the kernel, there is an important piece of software that is called the verifier. The verifier does what its name says: it verifies that the code you are trying to load is actually safe to run. It verifies that you do not access, or try to access, arbitrary memory, and that you do not try to expose bits of memory that you don't have access to. It verifies, first of all, that you have permission to run the code. And then it verifies that the paths of your code end in a stable state. You can't have, for example, a while(true) in a kernel module; that would basically stop the kernel from working. Right? So it verifies that all the execution paths in your code end in a stable state: there is an end statement, and it is reachable. So the verifier does a lot of heavy lifting to ensure that the code you write is actually safe to run and to be loaded into the kernel. Of course, just to make everyone aware, we are talking about the kernel here. Aside from the fact that you should not load third-party or arbitrary eBPF code into your kernel, the verifier helps you with that, but it's still your responsibility to make sure that the code you try to load is actually safe to run.
>> Right. So just because it passes the
verifier, it doesn't mean you can just
blindly trust the code you've been asked
to insert.
>> Yeah, especially if it's coming from a third party, because this is the kernel. The verifier is still continuously evolving, but as we know in software engineering, there are bugs, and someone might have discovered a bug before the Linux community has shipped a patch. So someone could basically use it as malware, or use it to break your kernel, or use it to collect data. There are a lot of security concerns that go with it, and the best approach is to not load code that you have not verified into your environment.
>> Yes. You've got to trust and verify, I guess.
>> Exactly.
>> Yeah. Yeah. Okay. So, you mentioned bytecode. This is compiling to some kind of kernel virtual machine?
>> Yes.
>> Which presumably limits the footprint of
the code which is why the verifier
stands a chance of working.
>> Exactly. Yeah.
>> All right. Yeah, that makes sense to me.
>> Okay. So we are talking specifically then about the kinds of eBPF program that allow you to instrument the running kernel, and hence your programs.
>> Yeah.
>> How's that put together? What's it actually doing? I sort of imagine myself down in the cellars of kernel space, looking up towards where the application's running, wondering how I'm going to find it and instrument it.
>> So, you know, our high-level applications need to call the kernel for everything.
>> Yeah.
>> For accessing memory, for calling the CPU, spinning up threads, accessing the disk, all of that. So whenever you call the kernel, the kernel has visibility over that. It knows what you need. It knows what bits of code are getting executed. It has visibility over everything. So the eBPF folks, especially the ones interested in profiling, said: we have the visibility to do this, so why not simply leverage the information that we already have and enhance it with some additional information? Because when a program gets executed, the kernel can see, for a certain program, how long it used the CPU, how much memory it is using, all of that. We know that from the age of containers and even before that. So we know we can instrument that bit.
>> Yeah.
>> So they basically added metadata: okay, this code is using that memory, and then dumped it into profiling information, because the kernel has access to everything. So we are just mapping: oh, this is the function that uses this amount of CPU. And then how do we collect that information? We dump it into a store somewhere, either locally, which some continuous profilers do, or we send it to a back end that does the analysis afterwards. So there are two strategies happening there: either in-cluster, or off-cluster in a dedicated environment that does the post-processing.
>> Right, so you might have a separate analysis team running a whole cluster of things, gathering from the main network.
>> Yeah, and that's how, for example, Datadog does it, and probably some of the cloud environments do it as well: we take the stuff from your environment, collect metrics from your environment, and then send it to dedicated servers that do the analysis and performance work. So you have some sort of agent, in that case an eBPF profiler, that does that. Or, one of the magic things about eBPF is that you can share data between the kernel and a userland program. What this means is that you collect the data and you save it in some sort of a database, which is not really a database, it's just maps, eBPF maps.
>> Yeah. Okay.
>> So you save it there, and because it's managed by the kernel, the kernel verifies that only that program has access to collect that data. So you're collecting the data from there, saving it in a place, and then your other program is running to collect that data and analyze it.
>> You probably want to compress that data for network traffic if you are sending it somewhere. So all that post-processing happens just afterwards: because we don't want to block the kernel much, we want to do the in-kernel operation as fast as possible, dump the data, and then all the compression, optimization, and cleaning up of the data happens in another userland program before it's sent to another cluster or server to do the post-analysis and all of that.
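To make that kernel/userland split concrete, here is another rough BCC-based sketch (again an assumption about tooling, not how any particular profiler is implemented): the in-kernel program counts vfs_read() calls per process into an eBPF map, and a plain userland loop periodically drains the map, where compression and shipping to a back end would happen.

from time import sleep
from bcc import BPF

PROGRAM = r"""
BPF_HASH(read_counts, u32, u64);          // the "database that is not a database"

int on_vfs_read(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *count;
    count = read_counts.lookup_or_try_init(&pid, &zero);
    if (count) {
        (*count)++;                       // keep the in-kernel work minimal
    }
    return 0;
}
"""

b = BPF(text=PROGRAM)
b.attach_kprobe(event="vfs_read", fn_name="on_vfs_read")

while True:
    sleep(5)
    # Userland side: drain the shared map; compression/shipping would happen here.
    snapshot = {k.value: v.value for k, v in b["read_counts"].items()}
    b["read_counts"].clear()
    print("vfs_read() calls per PID over the last 5s:", snapshot)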
>> Right. So whilst the mechanism is completely different, mentally it's the same as someone dumping their web data somewhere for someone else to process.
>> Exactly.
>> Yeah. Yeah. Okay. That makes perfect sense.
>> Yeah. The one thing I'm not getting here is: I'm at the kernel level. Let's say I write a Python program that has a badly written for loop and allocates way too much memory.
>> Yeah.
>> At the kernel level, I see that I'm mallocing this chunk of memory over and over. How do I stitch these things together, from the kernel-level call to the line of Python that's badly written?
>> You mean how to know which...
>> Yeah. So, like, which line of...
>> Uh-huh. Go ahead.
>> When the kernel is madly allocating memory, it doesn't know that that's because there's a for loop on line 27.
>> Mhm.
>> But I want to know that as a programmer.
>> Yeah.
>> So how do I connect the dots from user space to kernel space?
>> I honestly don't know how that bit is actually managed, to be honest. The tools do it somehow; maybe if we dive into some of those tools we would know the answer. I would imagine it's basically that we know when this bit is using that much data, and then it's instrumented and enhanced with other bits. But I just don't want to throw out anything to the audience that I'm not aware of or don't know.
>> Okay. But someone has crossed that up and down that tower of Babel, to the point where I can see my Python program and the impact of it.
>> Correct. All the continuous profiling tools based on eBPF do that. It's one of the building blocks of having a profiler: knowing which bit of code is using that much CPU and that much memory and all of that. So the eBPF-powered continuous profiling tools do that as well; they managed to crack which bit of code is using that much data. I can imagine that when you call the eBPF program you can enhance it with the context, so that context may be enriched as well to get this data. There are different bits that can be used. I can't say for sure how it's done, but for an eBPF program you have the context as well: when we want to call an eBPF function, we have the context of what we are calling and why, and all of those bits can basically be glued together in order to get that information.
>> Right. Yeah.
>> I'm not a kernel developer, so that bit is a little bit nuanced to me.
>> It's useful to know, like, you've used this a lot. It's useful to know where the boundaries of your knowledge are, what you had to know and what you've just learned, because it's interesting.
>> So, this makes me wonder: when I'm writing a program knowing that it's going to be instrumented, do I change my program? Can I? Should I? Must I?
>> No. For most of the cases, no, you don't have to. And that's one of the benefits of using an eBPF-based continuous profiling approach: it doesn't lead you to rewrite your program. With some profilers you basically need to add bits that say where you want to profile and why, in order to gather that information. With a lot of continuous profiling you don't have to add annotations; especially with eBPF you have all the information there for you. So you have the lines of code and their hierarchy as well, which one called which, and which one ended up using or blocking that much memory, that much IO, and that much CPU. You can basically instrument your code if needed. Don't get me wrong, in some use cases you might want to instrument the code because you are not getting the bigger picture. But as those profiling tools get richer and more widely used, they cover so many of the use cases that we rarely now come to a use case where we have to instrument our code. While they offer the option, 90% of the use cases are covered. You might stumble into the 10% where you need specific data that is not covered, and you might need to instrument your code in order to send this data to the profiler to be analyzed. But 90, 95% of the use cases are basically covered by those continuous profiling tools. And I have to mention, it's not only eBPF. eBPF is shiny, and it is going toward what you mentioned in the beginning, standardizing the way we collect the data, making it the universal way of not only profiling but also monitoring our applications and securing them. But there are tools that use agents to collect this data, such as, for example, Pyroscope: for each language they have a dedicated way of gathering the data. So it's not only eBPF; even if eBPF is now booming within the Linux and kernel community, there are other ways. You can install small agents with a smaller footprint into your production environment, with lower overhead, to collect this data. But back to your question: for most of the cases, no; you might run into it, but yeah.
>> Okay. I do like the idea, if I'm understanding this correctly, of one tool that will work regardless of language or tool.
>> Yeah, that's the power of eBPF. In the same way that containers managed to add a layer of abstraction, so we don't care about what language you are running, you just provide us with this container abstraction format and we will deploy it, and we built orchestration tools that do a lot of that, and we took it even into the era of AI because of that abstract way of seeing things, eBPF added another layer of abstraction. As soon as you have a Linux kernel within a specific version or later, you can basically write an eBPF program that does the magic for you.
>> That sounds like it'd be full of lots of different ideas, but I'm going to try and stick to profiling and not drag us down a rabbit hole. Very tempting though it is. So,
I guess the next question I have to ask
is what's the overhead of this? Because
I have been in situations where like
profiling everything adds like 30% to
your CPU.
>> Yes. And it had been the issue for so long, until the beginning of the 2000s, when Google decided to publish a paper on how they do large-scale profiling on their end. I forget what they called it; they didn't call it continuous profiling, but it was something like large-scale data center profiling, whatever the name was. But Google...
>> I'll find it and I'll put it in the show notes by the time this is published.
>> Google was the first to publish a paper with a working initial version of a profiler, using Go I think, or something like that. And it set out the building blocks of building a profiler, in the sense that you collect the data, there is a profiler that analyzes the data, and then you need a way to store this data and a UI to see it.
>> Yes. So they shared that in the 2000s, I think, and then the industry just followed the path of Google, with variations on it, but it set the foundations of how profilers are built. Even if you, or the audience, check now, for some of the tools, either open source or commercial, the architecture, if it's available, is very similar to that original paper. So we all benefited from that original paper that was published by Google,
>> and at that stage they must have been
fairly confident that the overhead was
small enough that they could run that on
the entire Google infrastructure.
>> Yes, correct. And you can imagine that the paper was published in 200-something, but it had been running for a few years before that in Google's data centers in order for it to be published. So it was definitely running years before that. And one of the advantages is the small overhead. They basically unlocked that bit of how to collect continuous profiling information with minimal overhead across all those data centers, and then if Google does it at that scale, what's stopping others from doing it? A lot of companies followed afterwards: Meta, back then known as Facebook, did it, Amazon, and all of that, and then a lot of tools started to pop up, both open source and commercial.
>> Yeah. Because when you're working at that kind of scale, you can't afford to wait
for a problem to come along and then
persuade a team to profile a specific
application. Right.
>> Exactly. I mean, at that scale you have a lot of feature teams or product teams, and then one team that is dedicated to the infrastructure, or it's going to be a big team, but still: whenever you need something profiled, you go back to that team, ask for the dump, and then go back again. It's going to take forever, and that team will be flooded with requests. So having that as self-service, providing this information continuously to the teams that want it, has a lot of benefits, both in terms of getting the information fast, but also in that the teams can control what information they want to get. So, back to your original
question of the overhead, there is
obviously an overhead. Um it depends on
the language, it depends on the runtime,
it depends on the tool that you're going
to use. So for eBPF programs, for the ones that are open source at least, they claim an overhead between 1 and 2%. Things like Pyroscope, Parca, and Pixie claim that the overhead they have is between 1 and 2%. Generally it's between 1 and 5% if you want to run it in production, and that's because of a lot of factors: how frequently you want to collect the data, how much data you are collecting, and how, where, and when you do your processing and compression of the data. Do you postpone it, do you send it to another server that handles it? That is obviously going to cost CPU, and there's a lot of transmission IO going on there. If you want to do some pre-processing in your cluster, on your server, before sending it to another server, that will also take from your CPU. Pixie, for example, stores some data in your cluster, in memory I think, for a period of time, so that also takes from your server. But it's still within a respectable and acceptable threshold. So yeah, general rule of thumb: between 1 and 5%. With eBPF programs they claim it's less than 2%, so you can get it around 2%, which is a huge win compared to what we had previously with those heavy tools.
>> Yeah. Yeah. I'm trying to do the kind of back-of-the-envelope calculations in my head. Anything less than 10% I'd be pleased with. Anything around 1% I'd be tempted to leave continually running.
>> Yeah.
>> Right.
>> The problem is not leaving it continuously running. Leaving it continuously running is a feature; you can leave it continuously running. The problem is what you're going to do with this amount of data. The problem is not getting the data, because I think that bit is solved somehow, since the overhead is not that much, so we can keep it running forever. You can get as much information as you want, but once you get that information, what are you going to do with it? And then, back to the discussion that we started this episode with: okay, I have this huge amount of data. Should I keep it all? Should I pay for it? Does it make sense to me? And then there's also the question that profiling needs to be actionable. There's no sense in having all that information if it's overwhelming and I don't know what to get out of the data, or how to use it to gather any meaningful conclusions. So how much data you keep is one question you need to answer, because it comes with a cost, and then how meaningful the data is for you, for gathering actionable insights, is the other.
>> Yeah.
>> These are two important questions that anyone willing to use, or already using, continuous profiling will have to ask, or has to answer, basically.
>> Yes, that's a common problem with being able to see inside the black box, right? The first problem is seeing inside, and then that creates a nice new problem: how do I deal with this floodgate?
>> Right.
>> Yep. So tell me, do you want to tell me how we deal with managing the sheer volume, or should we go to the tooling that lets you make sense of it?
>> I think we touched upon how we can manage the sheer volume. Some of it is basically keeping the fidelity: the fidelity goes down as the time span becomes larger, meaning that we don't care as much about data that is three months old, and we don't want as much granularity and fidelity for two-month-old data. We can have a medium resolution for one month, and then as time goes by we try to shrink and limit the amount of data that we keep. So we don't keep all the information on our server, but try to shrink it and condense it to lower the cost. And then, what do you do with the three-months-plus data? Do you throw it away? Do you keep it in an archive? Same question for the month-plus data: probably you don't keep it in your primary database or primary data store; you have it in a backup or secondary data store that is way cheaper. So those are some techniques to lower the cost, because in continuous profiling, because of that continuous feed, we're generally, in 95% of cases, interested in the recent information: either what's happening now, or what actually happened in the past week or the past month. Once it goes beyond that, it becomes kind of meaningless, or less useful, compared to what I have now, because continuous profiling is enabling me to compare how my code is performing now versus yesterday or last week, what I did wrong in that time span that brought the performance down, or up. Those are the sorts of questions that I'm trying to understand.
>> Yeah. I would think I'm looking at recent data for "oh no, something suddenly got slower and we're panicking," and then the older stuff is for "is it me, or are we getting slower? Are we slower than we were six months ago?" Right.
>> Correct. Yeah.
>> Do you also get... because the third question I would ask is: we've got Rust and Python in our company, and are the Rust programs faster?
>> That's a good question.
If you check the studies, or what are called the benchmarks, they suggest that it's significantly faster. However, the benchmarks are basically simple code: you send a request or you execute a bit of code millions of times and then you compare. But then assume, for example, and I'm a Java developer myself, so Java is slower, Rust is up here, but then you add features to them both, right? You won't run in production code that basically prints hello world, does one simple thing, or executes one request; you add more and more features to it. And as you add features, both of them become slower and slower. Rust may still be a little faster as we're adding more features to it, but you can see that the distance between the two languages, or different languages, is basically going to shrink as we add more features, because that's the way software grows: we are adding code that might make things slower. So it's not only that the runtime might be faster; we are adding more features, which will make it a little bit slower, and then we might or might not see that big of a difference in our production environment. That's probably one of the things that continuous profiling might help with, or might not, depending on the complexity of the code. So if we follow the benchmarks, yes, definitely Rust is faster compared to, probably, Go or Java, but as we start to add more and more stuff, the complexity of the application eats into that, and our programs become, bit by bit, slower and slower, and that's where probably some of the enterprise languages benefit in the long run, because they are optimized for that bit. So, just to say: even if the benchmarks say one language is slower than another, it also depends on the use case, what you are running, how you're writing your code, and all of that. So it's like...
>> It's not as simple as that.
>> If your program mostly waits for user input, then the thing you optimize is sitting behind the keyboard. Right?
>> Exactly. Yeah. Exactly.
>> But it'll be interesting to see that kind of data and say, okay, in the real world, in our company, this is how it's actually playing out versus the benchmarks. I think that'll be fascinating.
>> Okay. So let's get into the kind of reports that you can get from one of these tools. The tooling is a slightly separate thing from eBPF; there are analysis tools for this data.
>> So they are packaged as one. When you use a tool, you use it as one, but the architecture is different: there are multiple components to it, and some of them are based on eBPF. So eBPF is one of the ways to gather the data; another way is using an agent to get this data; and the third option is to instrument your code, as you mentioned, and then send this data off. And then you would have a profiler, which is basically the heartbeat, the backbone, the core of the profiling information, and then we need a way to analyze the data, and a way to see the data and visualize it. That architecture is what the Google paper, back again, described, and what all the solutions have in common. They have bits of difference here and there, they optimize things differently here and there, but they all share the same components, basically.
>> Is there much to choose between them? Are we arguing over which one has prettier graphs, or...?
>> It's not only that. How much it costs is definitely important. How it's optimized for the data. Is it open source or commercial? If it's open source, how much is the cost of running it, and is there support for it, especially if you're going for the enterprise world? If it's commercial, how much am I paying, and that cost will basically scale with how much data you want out of it. So it's not how nice the graphs are, but most importantly how much data, and how granular the data, I would get. I think that's the biggest factor. And they all share similar graphs. So we'd have flame graphs; whenever we talk about profiling, flame graphs come to mind. They all offer flame graphs to see which bits of code are using what amount of CPU. Can I see the memory as well? You can compare between now and a period of time, or two different periods of time. They offer a way to filter, obviously, so we can pick CPU, memory, IO, and all sorts of stuff.
>> This is making me think of the Google Chrome browser console, where I can get flame graphs for my running JavaScript. It's all very nice, but it will just tell me what's happening in the browser, and I want that for the entire system, right?
>> Yeah. So, continuous profiling is basically that for your application, or your applications.
>> Yeah. Okay. So, what kind of questions do you ask of a system like that? I mean, where do you begin? It's a bit needle-in-a-haystack, isn't it? Are you looking for a report that says this thing is slow, or are you waiting for someone to say the web servers are a bit slow today and then you dig in? How do you start to know what to look at?
>> So, I think if a system is slow, you would have traces. Everyone probably knows the three pillars of observability: we have traces, metrics, and logs.
>> What's the difference between traces and logs?
>> So logs are contextual data about your code. Basically you are saying: code executed here, there's an error here. We instrument the application to log stuff. Traces are basically trying to correlate events, to correlate the journey of a request between multiple services and components; they're widely used in distributed applications. While logs are within a single application, where you can see the events that you added in your code and follow the path of a single event or request in your application and how it went, which helps you understand how the request behaved in your application, traces try to find the correlation for a request and draw the journey of that request across multiple components. It's very useful in distributed systems and microservices architectures.
>> Yes. So what you're saying is, if my user says "I tried to create an account and it was really slow,"
>> Yeah.
>> I need to somehow trace that request through the user microservice to the account registration microservice and know that those two calls on two different machines are one semantic thing.
>> Exactly. So yeah, for the user it's one transactional, atomic operation,
>> Yeah,
>> but for us it could be the gateway, it could be the account service, it could be another service, it could be the database. So the traces help to understand where that bottleneck comes from.
>> Yes.
>> And then once you identify that bottleneck, you want to understand why this application's performance is low. So then you go and check the metrics of that application and the logs of that application. The metrics might tell you that the CPU is running high, or there's a lot of wait time, or we are just sitting idle waiting for an IO operation, or there is memory being consumed very heavily in that application, and the logs might say similar things. But then you don't know in which part of the code the issue actually is, and that's where the profiler comes in. So if you're interested in CPU, in why my application is getting slower or spinning a lot of CPU, you can go and check the bits of code that are doing that. It might be a loop operation that performs poorly, so it consumes a lot of CPU time and slows everything down. And for memory, there may be a memory leak happening somewhere that you weren't aware of. I mean, you can see it in the metrics, but you don't know what objects or what methods are causing that leak, so profiling can help you with that. So generally that's the journey: an error is happening, you have an error, and then you try to boil it down to where exactly this error is coming from. And then, once the maturity goes up, you start to import it as part of your health metrics, in the sense that you can include it as part of your post-deployment routine. You deploy something and then you can compare it side by side: this bit that I added, did it have a significant impact on memory, a significant impact on CPU? And then you can take it a little bit further and add some alerts. You can have thresholds, like: if the CPU is more than 10%, send me an alert; if the CPU is more than 5%, or the memory more than 5%, send me an alert. It could be a low-urgency alert, just "I've noticed this, you need to be aware." And then you can throw a little bit of AI into it and have it be more dynamic, so it analyzes things for you. Some solutions offer that as well, and then it notices: oh, post-deployment there is a shift in the patterns between the previous deployment and this deployment, you are noticing this kind of pattern.
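As a rough sketch of the post-deployment comparison and threshold alerts described here (the profile format, the 5% threshold, and the function names are made-up illustrations, not any product's API), the check could look something like this:

def cpu_share(profile):
    """Turn per-function CPU milliseconds into fractions of total CPU time."""
    total = sum(profile.values()) or 1.0
    return {fn: ms / total for fn, ms in profile.items()}

def post_deploy_alerts(before, after, threshold=0.05):
    """List functions whose CPU share grew by more than `threshold` after a deploy."""
    b, a = cpu_share(before), cpu_share(after)
    alerts = []
    for fn, share in a.items():
        delta = share - b.get(fn, 0.0)
        if delta > threshold:
            alerts.append(f"low-urgency: {fn} CPU share up {delta:.1%} since last deploy")
    return alerts

# Illustrative profiles: function name -> CPU milliseconds sampled in a window.
before = {"handle_request": 400.0, "serialize_json": 100.0}
after = {"handle_request": 400.0, "serialize_json": 300.0}
print(post_deploy_alerts(before, after))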
>> Yeah. And when you've spent a whole sprint or two doing performance optimization, you want to be able to say, "Look, I made it this much faster." Right?
>> Exactly.
>> You want quantifiable credit for your end-of-year review.
>> Who doesn't?
>> Yeah.
>> Okay. You've got to explain the technical details of this particular thing for me. Someone makes a request to the web server. It goes to a user registration thing.
>> Yeah.
>> How do I stitch that together? That seems to me like I would have to make some kind of code changes to be able to connect those two calls together.
>> You mean for the tracing?
>> Yeah. Like, how does the tool that's constructing this trace for me know that this request over there resulted in that API call over there?
>> We modify the context. So there is a request that you make, and generally, for gRPC we extend the context, or for HTTP we add it in the headers. So there is a request ID and, I forget, a session ID or something else, and then we propagate them. The request ID is the one from point A to point B, and the session ID lives for the lifetime of the whole request, the whole session basically. Combining those two, we're able to correlate which services the session, or the request in that case, went through. Or maybe it's span and request, I forget the exact names, but that's basically how it works. Using these two fields we are able to reconstruct the journey through the application, knowing it went from point A to point B to point C to point D, and also measuring the time it spent in each and every hop.
>> Right. But is that something... if I've got a web server sitting in between a database and a load balancer, I can imagine from what you said that the web server would say: the load balancer is giving me this session ID, I now need to pass it on to this SQL call, or something.
>> Yeah.
>> So I would have to make a code change to pass the session IDs around.
>> Generally we use libraries to do that for us. The ecosystem has evolved so that these things are now embedded in most applications. For Spring, for example, there is Spring Sleuth, I think that was the name, which basically offers you those bits. I'm a Java developer, so that's the example I'm giving, but I imagine there are equivalent things in the cloud native ecosystem as well. There is Envoy, which basically offers you a way to visualize those bits. So there are a lot of tools that offer to do that for you; you just need to embed them in your application. So the entry point is aware: oh, that's the entry point, there is no span ID or session ID or request ID, so I need to be the generator of that session, or that transaction, or that request, and then it just gets propagated over and over again until it comes back, and then it's stamped in to save the data and save the whole history, and that's it basically.
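For a feel of what those libraries automate, here is a rough, framework-free sketch of passing trace context along in HTTP headers. The header names, helper names, and URL are made up for illustration; real tooling such as OpenTelemetry typically uses the W3C traceparent header and does this propagation for you.

import uuid
import urllib.request

def ensure_trace_context(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]   # a fresh span for this hop
    return {"X-Trace-Id": trace_id, "X-Span-Id": span_id}

def call_downstream(url, trace_headers):
    """Forward the trace headers so the downstream service joins the same journey."""
    req = urllib.request.Request(url, headers=trace_headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Inside a request handler (framework-specific details omitted):
# ctx = ensure_trace_context(dict(request.headers))
# call_downstream("http://account-service.internal/register", ctx)  # hypothetical URL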
>> Right. Yeah. You're giving me a lot of flashbacks during this episode, but I can sort of see how, like, Java would be in the background attaching a thread-local variable which would then get passed to my ORM. Yeah. Okay.
>> Correct. Yeah.
>> So mostly you want to arrange for that to happen by magic, but presumably in some frameworks or languages you do have to step in.
>> Exactly. I mean, it's not happening by magic, but if you use some of those frameworks you won't even think about it; it just happens. But I think the idea is just that they stitch in context information that enables them to collect that data, then gather it back and stitch things together.
>> Yes. Okay.
>> I think OpenTelemetry has support for it as well. It's become the de facto standard when it comes to monitoring and all of that.
>> Right. This implies that... I mean, it's called continuous profiling as a technique, but it implies it also has to be ubiquitous profiling. You'll only get those benefits if all the machines are always profiling.
>> Yes, correct.
>> That's a big ask for a team, maybe. All the machines, always profiling.
>> It obviously comes with a cost, and we are living in an era where you pay per node or instance. So the more instances you have, the more likely it is that you have to pay a big load of money, and for some that might be beyond their capacity. So you can reserve it for, or limit it to, your critical services, the ones that you know are consuming a lot of resources, where you want to optimize those resources or at least keep those services healthier and more performant, and that you know bring more value.
so you want probably to invest in these and then when it proves value or when
and then when it proves value or when the um you're you're having a great year
the um you're you're having a great year and then you can offer that luxury of
and then you can offer that luxury of generalizing it to other services. Um so
generalizing it to other services. Um so I believe the there might be trade-offs
I believe the there might be trade-offs as you mentioned uh because the ideal
as you mentioned uh because the ideal scenario which would be to have it
scenario which would be to have it everywhere. Few companies have that
everywhere. Few companies have that luxury. Uh so you might want to
luxury. Uh so you might want to basically uh limit it to the critical
basically uh limit it to the critical services. Uh start there and then see if
services. Uh start there and then see if it brings value. Um make the team also
it brings value. Um make the team also aware of it because it comes also with a
aware of it because it comes also with a mind shift, a mental shift. uh from
mind shift, a mental shift. uh from traditional profiling or the pillars of
traditional profiling or the pillars of observability, it needs to be uh proven
observability, it needs to be uh proven um uh bringing value. So allowing that
um uh bringing value. So allowing that time of adaptation and uh culture change
time of adaptation and uh culture change and all of that is also important.
and all of that is also important. >> Oh yeah, because you probably need to
>> Oh yeah, because you probably need to get everyone in the organization to
get everyone in the organization to start buying into it.
start buying into it. >> Exactly. Um
>> Exactly. Um >> we as I mean we as human beings don't
>> we as I mean we as human beings don't like changes. Um and it's it it like
like changes. Um and it's it it like changes that someone else has told us we
changes that someone else has told us we have to make.
have to make. >> Exactly. Uh and even if we science prove
>> Exactly. Uh and even if we science prove that it brings value value, we would
that it brings value value, we would find a way to say it's not. Uh so
find a way to say it's not. Uh so allowing the time to uh shift that and
allowing the time to uh shift that and prove value and allowing the time for
prove value and allowing the time for the adaptation is also needs to be um
the adaptation is also needs to be um taken care of. Um because
taken care of. Um because it's the data is there, the graphs are
it's the data is there, the graphs are nice, you can do whatever you want with
nice, you can do whatever you want with it. But then if there is no developer uh
it. But then if there is no developer uh that can take those insight takes takes
that can take those insight takes takes the data and turn it into an action to
the data and turn it into an action to optimize things to improve things. It's
optimize things to improve things. It's basically uh for for uh for the company
basically uh for for uh for the company is just like a cost that they might want
is just like a cost that they might want to get rid of because it's not has it
to get rid of because it's not has it has not proven value. So it's there's a
has not proven value. So it's there's a little bit of adaptation in there and
little bit of adaptation in there and making sure that it brings value to the
making sure that it brings value to the developer and then it's part of their uh
developer and then it's part of their uh process and continuous improvement and
process and continuous improvement and all of that.
all of that. >> Yeah. So is does this mean that you are
>> Yeah. So is does this mean that you are very much you're trying to when you've
very much you're trying to when you've done it you're trying to show like look
done it you're trying to show like look how easy it is to profile your
how easy it is to profile your application rather than I found a
application rather than I found a problem with your application and here's
problem with your application and here's the data you need to go and fix it
the data you need to go and fix it >> um maybe both maybe oh uh it's easier to
>> um maybe both maybe oh uh it's easier to set up um it could basically can be
set up um it could basically can be onboarded part of your morning routine
onboarded part of your morning routine you just came in and then open uh in a
you just came in and then open uh in a publicly accessible URL. Uh check check
publicly accessible URL. Uh check check the stuff and then if you find a pattern
the stuff and then if you find a pattern that is weird in there, go and look at
that is weird in there, go and look at look into it and approve your
look into it and approve your performance of the application. Or
performance of the application. Or another way to get buying is when there
another way to get buying is when there is an issue there is no great way of
is an issue there is no great way of proving value than fixing that issue and
proving value than fixing that issue and using those tools. So it's going to be a
using those tools. So it's going to be a mix of uh both strategies. uh you want
mix of uh both strategies. uh you want to um get buy in that the tools is easy
to um get buy in that the tools is easy to use but also you want to prove value
to use but also you want to prove value and if you manage to prove it when
and if you manage to prove it when everyone hands on on something and
everyone hands on on something and everyone is focusing on something then
everyone is focusing on something then that would be a great one as well.
that would be a great one as well. >> Yeah. Yeah. Yeah. Um
>> Yeah. Yeah. Yeah. Um that makes me think of one more
that makes me think of one more practical question. Can I this is a
practical question. Can I this is a kernel level plugin. So can I roll it
kernel level plugin. So can I roll it out to many servers and dynamically
out to many servers and dynamically switch it on and off?
>> You mean the eBPF stuff?
>> Yeah. Like, can I just put it on all my clustered servers, maybe not gathering anything, and say "okay, we'll just flick the lights on for that one and take a look at it today"? Is it trivial to switch them on and off?
>> It depends. I mean, "trivial" is really about how easy your program is to use. It can be a piece of configuration where you set which services you don't want included, and that configuration can be centralized: you go to one place, configure it, the configuration is dispatched everywhere, and the eBPF program reads it; if it's on, it gathers the data, and if it's off, it just ignores it. It's also a matter of how well your program is written. That's talking about eBPF programs in general, but with continuous profiling, yes, I think you can do something like that if you want: just keep it running there, and if not, just turn it off, collecting no data and adding no overhead for you.
data and not adding any overhead for you >> I mean I'm going to make that more
>> I mean I'm going to make that more specific because I get the general idea
specific because I get the general idea but maybe if I couch In Javish terms,
but maybe if I couch In Javish terms, I've got a machine that I set up earlier
I've got a machine that I set up earlier and I want to either switch profiling on
and I want to either switch profiling on or off. Is it as simple as I'm going to
or off. Is it as simple as I'm going to call an MBAM N bean on that running JVM
call an MBAM N bean on that running JVM or am I redeploying a whole Kubernetes
or am I redeploying a whole Kubernetes pod with the new settings?
pod with the new settings? >> No, you can do it in in runtime
>> No, you can do it in in runtime >> dynamically at runtime. Yeah, you can do
>> dynamically at runtime. Yeah, you can do dynamic and that's that's what I
dynamic and that's that's what I mentioned the configur also depends on
mentioned the configur also depends on the configuration that you have in your
the configuration that you have in your u in your program basically because it's
u in your program basically because it's a program you can do anything with it.
a program you can do anything with it. >> Yeah. As long as you planned ahead.
>> Yeah. As long as you planned ahead. >> Exactly.
>> Exactly. >> Okay.
>> Okay. >> But if it's not as you mentioned you
>> But if it's not as you mentioned you would need basically to redeploy it.
would need basically to redeploy it. >> But then you don't need to recompile the
>> But then you don't need to recompile the kernel. It just like as any program you
kernel. It just like as any program you just need to deploy it and then it would
just need to deploy it and then it would take care of itself.
take care of itself. >> Okay. So I can set myself up with this
>> Okay. So I can set myself up with this kind of profiling. Yeah.
kind of profiling. Yeah. >> Okay. So, I've got a way of doing it
>> Okay. So, I've got a way of doing it that seems practical, has a low enough
that seems practical, has a low enough overhead that I might set up my whole
overhead that I might set up my whole cluster to run this.
cluster to run this. >> Yep.
>> Yep. >> I can see the data management problem,
>> I can see the data management problem, but you've given me some ideas about
but you've given me some ideas about mitigating that reporting. Getting back
mitigating that reporting. Getting back to the developer, I see the picture. I
to the developer, I see the picture. I think you need to start giving me some
think you need to start giving me some specific recommendations.
specific recommendations. Which tools would you pick for this
Which tools would you pick for this strategy?
strategy? I I like Periscope uh because it offers
I I like Periscope uh because it offers um best of both worlds. It has support
um best of both worlds. It has support for EBPF. Uh but if your orc for a way
for EBPF. Uh but if your orc for a way of another doesn't want to embold in an
of another doesn't want to embold in an ebpf journey yet um it has an agent um
ebpf journey yet um it has an agent um an agent alternative that you can use
an agent alternative that you can use for specific languages and runtimes. Oh,
for specific languages and runtimes. Oh, because the presumably the EBPF needs
because the presumably the EBPF needs root access, but the agent I can just
root access, but the agent I can just run in user space. So if I've got a
run in user space. So if I've got a security team, I don't need to have that
security team, I don't need to have that argument. Yes.
argument. Yes. >> Yeah. Exactly. Uh so it's it has a bit
>> Yeah. Exactly. Uh so it's it has a bit as you mentioned bit of both worlds. It
as you mentioned bit of both worlds. It has this enterprise uh flavor into it
has this enterprise uh flavor into it that make it more enterprisey and it's
that make it more enterprisey and it's it's pleasing to that enterprise word.
it's pleasing to that enterprise word. uh it tries to find shortcuts and fi and
uh it tries to find shortcuts and fi and meet you where you are you are rather
meet you where you are you are rather than trying to
than trying to move you to another uh runtime. Uh so
move you to another uh runtime. Uh so that's bit I like about it and then um
that's bit I like about it and then um periscope is from graphana and then a
periscope is from graphana and then a lot of companies has graphana as well.
lot of companies has graphana as well. So that the tooling would be based on
So that the tooling would be based on graphana which is um many are familiar
graphana which is um many are familiar with uh already um from the cloud side
with uh already um from the cloud side uh data dog um have their own and then
uh data dog um have their own and then we um when we were at Miami uh one of
we um when we were at Miami uh one of this one of the speakers basically were
this one of the speakers basically were using uh data dog and they were very
using uh data dog and they were very happy with it. So uh I think from if you
happy with it. So uh I think from if you want something that comes with ease of
want something that comes with ease of use uh you don't you don't want to
use uh you don't you don't want to bother yourself about deploying it and
bother yourself about deploying it and um managing it yourself uh data take
um managing it yourself uh data take care of that. Um, another two other open
care of that. Um, another two other open source tools. Uh, Parka is one. Uh, it's
source tools. Uh, Parka is one. Uh, it's really good. And then my favorite, uh,
really good. And then my favorite, uh, though it's still, uh, not early days,
though it's still, uh, not early days, but getting there. It's Pixie. Uh, it's
but getting there. It's Pixie. Uh, it's a CNCF project, uh, by, um, I forgot the
a CNCF project, uh, by, um, I forgot the name again.
name again. >> Uh, Newf
>> Uh, Newf by Newick. Um, so it's it's more than a
by Newick. Um, so it's it's more than a continuous profiling tool. It tries to
continuous profiling tool. It tries to be a monitoring tool. Uh it it it try to
be a monitoring tool. Uh it it it try to combines profiling together with the
combines profiling together with the metrics part. Um
metrics part. Um uh but it's it's it's cool. Uh it's open
uh but it's it's it's cool. Uh it's open source. Uh but it also offers a way to
source. Uh but it also offers a way to inboard it as part of new. I it has this
inboard it as part of new. I it has this new flavor as well. Um so it's it's my
new flavor as well. Um so it's it's my favorite tool. So, so far but again it's
favorite tool. So, so far but again it's still
still a little bit early days for it. Not that
a little bit early days for it. Not that widely adoption uh yet but it's it's
widely adoption uh yet but it's it's getting there since it's part of CNCF
getting there since it's part of CNCF projects. It has a larger community so
projects. It has a larger community so it's improving bit by bit.
it's improving bit by bit. >> Okay. I kind of want to ask you which
>> Okay. I kind of want to ask you which one you use at work but I suspect then
one you use at work but I suspect then the Spotify legal team will dive on this
the Spotify legal team will dive on this podcast. So I'll leave that question
podcast. So I'll leave that question entirely.
Uh, I think that gives me a complete picture. Maybe I need to go and play
picture. Maybe I need to go and play with one of these.
with one of these. Where would you start if it's just out
Where would you start if it's just out of pure curiosity? Pixie.
of pure curiosity? Pixie. >> I would start with Pixie. Yes. Um, yeah,
>> I would start with Pixie. Yes. Um, yeah, they have a a cloud version to So, you
they have a a cloud version to So, you would need to install the Pixie agent on
would need to install the Pixie agent on your cluster
your cluster >> and then they have a cloud last time I
>> and then they have a cloud last time I played with it and they have a cloud
played with it and they have a cloud version that basically get the data and
version that basically get the data and send it uh to it and you can visualize
send it uh to it and you can visualize it. Um so yeah, Pixie would be uh
it. Um so yeah, Pixie would be uh probably the first and then Periscope
probably the first and then Periscope maybe the second um because it's yeah uh
maybe the second um because it's yeah uh it's it's well integrated, easy to use
it's it's well integrated, easy to use uh and it's part of the Graphan
uh and it's part of the Graphan ecosystem so it's yeah nicer.
ecosystem so it's yeah nicer. >> Cool. I'm going to go and check those
>> Cool. I'm going to go and check those out. Muhammad, thank you very much for
out. Muhammad, thank you very much for joining me and I hope when you get to
joining me and I hope when you get to the end of this recording you don't
the end of this recording you don't think the um elapse time was too long or
think the um elapse time was too long or too short.
too short. It was very enjoyable.
It was very enjoyable. >> Great.
>> Great. >> Thank you for having me.
>> Thank you for having me. >> Thank you.
>> Thank you. >> Thank you, Muhammad. As always, the show
>> Thank you, Muhammad. As always, the show notes the place to head if you want
notes the place to head if you want links to anything we discussed. And
links to anything we discussed. And before you head there, please do take a
before you head there, please do take a moment to like and rate this episode and
moment to like and rate this episode and maybe share it around because it all
maybe share it around because it all helps other people find us. The
helps other people find us. The algorithm decides that if you liked it,
algorithm decides that if you liked it, other people will like it and off we go.
other people will like it and off we go. And that helps share the knowledge,
And that helps share the knowledge, which is the whole point of this
which is the whole point of this podcast. Please do make sure you're
podcast. Please do make sure you're subscribed so that you can find us in
subscribed so that you can find us in time for the next episode. And until
time for the next episode. And until then, I've been your host, Chris
then, I've been your host, Chris Jenkins. This has been Developer Voices
Jenkins. This has been Developer Voices with Muhammad Aboule. Thanks for
with Muhammad Aboule. Thanks for listening.
listening. [Music]