YouTube Transcript:
Proxmox VE Server Health Check: Hardware & Software Monitoring

Video Transcript

In this video, I'm going to do what I like to call my

Proxmox Checkup, a manual check

in the Proxmox web interface and a little

bit of terminal use to see what's going on

with the system and see if there's any possible hardware or

software issues to take a note

of.

While automated alerts are great for getting alerted

immediately when something starts

having an issue, I think it's still a good

idea to manually look over a system like this

every once in a while to make sure your alerting system

didn't miss anything and that there are

no other problems or issues going on with the

system that you should take action

on before they get worse.

I'm going to be doing most things in this video in the

Proxmox web interface, but you

can do almost all of this in the

terminal as well if you want to script it.

So let's dive into the interface now.

This is one of my personal Proxmox

servers, so it's kind of running okay, but there's

some oddities on the system as well to hopefully

demonstrate what a real life system would

look like.

The first thing I'm going to start with is this bottom

section here with all the different

Proxmox tasks from the last day or two.

The first thing I'm going to notice is anything that's red

or yellow that has a warning or

error, and I'm going to notice I have

some backup jobs that had some problems.

This Proxmox VE system is supposed to back up to a PBS

server, but my PBS server spends

some time shut down, so all of these are explained by me

and I know what's going on here.

But it should have been running for this one, so I'm going

to take a look at what's going

on here.

Looking at the error, the next thing I

notice is that my mail to root

has errors here.

So in addition to my backups not working,

I also have some mailing issues and likely

want to make sure that my

notifications email is also set up correctly.

So I'm going to take a note of both of those things here,

and other than that, it looks

like some of these backup jobs ran correctly.

There's some background updates of the package database

that ran fine and some shells.

I'm also looking for if there's anything abnormal that I

don't remember doing just to make sure

nothing weird is going on with this box.

The next place I'm going to take a look at

on my server is the left hand column here,

which shows all my different servers

and the VMs and the storage on them.

I'm going to mostly be dealing with a single node in

this video, but if you have multiple

nodes in a cluster, I'd immediately check to see if

everything's online with the little check

marks.

Typically, a Proxmox cluster wants to have

all of its nodes running all of the time,

so if something's off, I'd try to

get it online if it wasn't intentional.

The next thing I'm going to just take a look at is all the

VMs to see which ones are running

and which ones aren't.

Proxmox can inadvertently kill VMs or sometimes you can

accidentally shut them down and realize

they're not running when they should have been running.

I'd immediately take a note and make sure

that nothing's running that shouldn't be

running and nothing's not running that should be running.

Sometimes it might be nice to use the little note tags for

always running VMs so you can

know which VMs that should be running aren't actually

running because sometimes Proxmox

kills a VM and just doesn't tell or notify you very well.

I'm now going to move down and look at storage.

The first thing I'm going to look for is

anything with this little question mark which

means it can't access it.

It looks like my two ways of mounting PBS

storage as part of a migration are having

some issues and it's loading.

I think this is something to do with my

PBS server being extremely slow, but it helps

confirm the idea that there's backup issues here.

There's definitely a problem on this system

and I need to take a note of why these two

chunks of storage can't be accessed here.

For my three working storages on this system, I'm going to

take a look at the utilization

on them to make sure they don't fill

all the way up and cause any issues.

Taking a look at the little bar graph they

have right here, I can see that HDD file and

local are nearly empty but NVMe SSD is almost full.

So clicking on it for more detail lets me know that I have,

it looks like, about 50 gigabytes

free out of 250, which is about

80% full, which for me is nearing the point where

I should really start to think

about adding storage on this system.
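
If you want to script this utilization check, a minimal Python sketch might look like the following. The 80% threshold is just the comfort level mentioned above, and the path "/" is a placeholder; note that this only looks at a mounted filesystem and is not how the Proxmox interface itself computes usage.

```python
import shutil

# Assumed threshold, taken from the "about 80% full" comfort level above.
WARN_PERCENT = 80.0


def percent_used(total_bytes: int, free_bytes: int) -> float:
    """Return how full a storage is, as a percentage of its total size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def check_path(path: str) -> str:
    """Report usage for whichever filesystem holds the given path."""
    usage = shutil.disk_usage(path)
    used = percent_used(usage.total, usage.free)
    state = "WARN" if used >= WARN_PERCENT else "ok"
    return f"{path}: {used:.1f}% used ({state})"


if __name__ == "__main__":
    # The numbers from the video: roughly 50 GB free out of 250 GB.
    print(percent_used(250, 50))
    # "/" is only a placeholder; point this at a storage mount instead.
    print(check_path("/"))
```

Running this from cron and mailing the output would be one way to catch a filling disk between manual checkups.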

The other thing I like to take a look at is the graph.

Typically I look at either a month or a year to get a

better long term view of what's going

on and I can see it's kind of been creeping up a little bit

and then I deleted some files

which freed up some space.

So it looks like since my usage is slowly going up, I'm

going to want to add some more

storage here, but it might not be super critical.

Probably within the next week I should really add more

storage in this pool because otherwise

it's getting too close for comfort and SSDs don't like

being completely full either.

Moving on to the rest of the system, I can look at data

center, but data center is almost

all configuration and in this video I'm going to be mostly

checking on the system and not

reviewing the existing configuration of the system.

The one nice thing in data center is the summary tab which

can say if there's any node issues

or VMs that are running or aren't running, but that's

pretty much the same info we can

see on the left.

In order to get a better feel for resource utilization on

each node, I'm going to click

on the node itself and go on to summary and this is going

to get me a bit of information

immediately right now of what's going on in the system as

well as information over time.

For the immediate information, I

can see CPU usage is pretty low.

Load averages are fairly low.

A quick bit of additional information on load averages in

Linux is that a load average of one

is roughly equivalent to fully

utilizing one core or thread on a CPU.

In this case, I have 12 CPUs, so a load

average of 12 would be effectively fully used.

Load average also adds in things like disk

IO and waiting for that sort of thing, so

if this goes up abnormally, your system's waiting on

something and is under heavy load.

On a system like this, anything nearing double digits or

over double digits would make me

think the system's under pretty heavy load

and I'd either want to reduce the load or

get a bigger system.
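
To put that rule of thumb into a script, here is a rough Python sketch that compares the load average to the number of CPU threads. The function names and the 0.8 "heavy" ratio are my own assumptions (10 out of 12 threads is roughly 0.83), so adjust them to taste.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Load average divided by the number of CPU threads."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


def describe_load(load_average: float, cpu_count: int, heavy_ratio: float = 0.8) -> str:
    """Classify load: 1.0 per thread is treated as fully used."""
    ratio = load_per_cpu(load_average, cpu_count)
    if ratio >= 1.0:
        return "saturated"
    if ratio >= heavy_ratio:  # assumed cutoff, not a Proxmox setting
        return "heavy"
    return "ok"


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        one, five, fifteen = os.getloadavg()
        cpus = os.cpu_count() or 1
        print(f"15-minute load {fifteen:.2f} on {cpus} threads: "
              f"{describe_load(fifteen, cpus)}")
```

Keep in mind that load average also counts tasks waiting on disk IO, so a high value does not always mean the CPU itself is busy.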

IO delay can also be a great way to see how much your

system is waiting on disk IO.

Right now, it's less than 1%, which means

not very much, but if this shoots up, this

can be a good indication that you need a faster storage

solution for your VMs to work better.
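
In the terminal, a similar number can be approximated from the iowait counter in /proc/stat. The sketch below is a minimal example under the assumption that iowait is a reasonable stand-in for the IO delay figure in the web interface; it takes two samples and reports the share of CPU time spent waiting on IO between them.

```python
import os
import time


def parse_cpu_line(stat_text: str) -> list:
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before: list, after: list) -> float:
    """Percentage of CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Only the first eight are summed, since guest time is already part of user/nice.
    total = sum(after[:8]) - sum(before[:8])
    if total <= 0:
        return 0.0
    return 100.0 * (after[4] - before[4]) / total


def read_stat() -> str:
    with open("/proc/stat") as handle:
        return handle.read()


if __name__ == "__main__":
    if os.path.exists("/proc/stat"):
        first = parse_cpu_line(read_stat())
        time.sleep(1)
        second = parse_cpu_line(read_stat())
        print(f"iowait over the last second: {iowait_percent(first, second):.2f}%")
```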

I can also see RAM usage on this system.

RAM usage in Proxmox is unfortunately not

very well shown if you're using ZFS because

ZFS likes to use a lot of RAM for its cache

and it just shows up as used here.

So if this seems abnormally high,

check what ZFS is doing in the terminal.
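
One way to check what ZFS is doing is to read the ARC statistics that OpenZFS on Linux exposes under /proc/spl/kstat/zfs/arcstats. This is a small sketch, assuming the usual "name type data" layout of that file; the sample text in it is made up for illustration.

```python
import os

ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"

# Made-up sample in the layout I'm assuming the real file uses.
SAMPLE = """\
13 1 0x01 123 33456 1234567890 9876543210
name                            type data
hits                            4    123456
size                            4    8589934592
c_max                           4    17179869184
"""


def parse_arcstats(text: str) -> dict:
    """Collect 'name type data' rows into a name -> integer mapping."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Header rows are skipped because their last column is not a number.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_size_gib(text: str) -> float:
    """Current ARC size in GiB, taken from the 'size' row."""
    return parse_arcstats(text)["size"] / 2**30


if __name__ == "__main__":
    if os.path.exists(ARCSTATS_PATH):
        with open(ARCSTATS_PATH) as handle:
            print(f"ZFS ARC is using {arc_size_gib(handle.read()):.1f} GiB of RAM")
    else:
        print(f"sample ARC size: {arc_size_gib(SAMPLE):.1f} GiB")
```

Subtracting this figure from the used RAM shown in the summary gives a rough idea of how much memory the VMs and the host are really using.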

But other than that, 80 some percent seems

perfectly fine and I also noticed I have a

bit of swap on this system, which means I'm not going to be

at much risk of having an out-of-memory

error killing any of my VMs.

But I want to take a look at the graphs to get a feel for

what's going on in this system,

not just right now, but also over time.

And I'm actually going to set it to the year or month maximum view.

And maximum is going to show me the highest usage during

these periods of time instead

of the average, so I can kind of get a better feel for what

the peaks are doing on the system

instead of my average load because I want to make sure my

system still has some resources

left during the peaks instead of just the average load.
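
A tiny Python sketch shows why the maximum view matters. The CPU samples below are made up for illustration: one short spike barely moves the average, but it is exactly what the maximum view would show.

```python
from statistics import mean


def summarize(samples: list) -> dict:
    """Return both views of the same series of usage samples."""
    return {"average": mean(samples), "maximum": max(samples)}


if __name__ == "__main__":
    # Hypothetical CPU usage samples in percent, with one short peak.
    cpu_percent = [5, 6, 4, 95, 5, 5]
    summary = summarize(cpu_percent)
    print(f"average {summary['average']}%, maximum {summary['maximum']}%")
```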

So taking a look at my system, I can

see I did a RAM upgrade about here.

And after that RAM upgrade, I can see my

IO delay dropped because I was using a lot

less swap and my CPU peaks also went down.

So I can see that RAM upgrade helped quite a bit.

And now that I've done the upgrade, CPU

usage is peaking in the 20 to 30 percent range,

so that's perfectly fine with fairly low IO delay.

Same with server load.

After the RAM upgrade, it looks good.

But before the RAM upgrade, it was getting

over 10 and even much higher at some periods

of time.

And since I have about 12 threads on the CPU, 10 is kind of

my guide for when load is a little bit

higher than it should be on this system.

Looking at network traffic, this server is also being used

as a NAS and I have a gigabit

network it's using and it's maxing out that network.

This is telling me that upgrading to a faster-than-gigabit

network would likely speed some

things up.

But other than that, everything seems to be fine as I

expect the NAS to fully utilize

my network connection.

Speaking about network, let's actually take

a look at the network on this system next.

On a basic setup like mine, I can just see

it's working as I configured on this system.

But the one thing you might want to take a

note of is if you're using bonded network

connections where a pair of connections are working

together effectively as one, make

sure that one of them hasn't gone offline

because that could be a potential network

issue and means if the other one goes

offline, the system drops off the network completely.
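
If you do use bonding, the Linux bonding driver reports the state of each member link in /proc/net/bonding/ followed by the bond name. Here is a minimal sketch that parses that text and lists members that are not up; the bond name bond0, the interface names, and the sample text are assumptions for illustration.

```python
import os

BOND_PATH = "/proc/net/bonding/bond0"  # assumed bond name

# Made-up sample in the layout the bonding driver typically prints.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eno1
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eno1
MII Status: up
Speed: 1000 Mbps

Slave Interface: eno2
MII Status: down
Speed: Unknown
"""


def slave_states(bonding_text: str) -> dict:
    """Map each member interface to the first MII status listed after it."""
    states = {}
    current = None
    for line in bonding_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            states[current] = value
            current = None  # ignore any later status lines for this member
    return states


def degraded_slaves(bonding_text: str) -> list:
    """Names of member interfaces whose link is not reported as up."""
    return [name for name, state in slave_states(bonding_text).items() if state != "up"]


if __name__ == "__main__":
    text = SAMPLE
    if os.path.exists(BOND_PATH):
        with open(BOND_PATH) as handle:
            text = handle.read()
    print("links down:", degraded_slaves(text) or "none")
```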

Moving on beyond that, let's go to the system log now.

This is the internal system log

that shows everything that's going on.

And reading logs can really be an art

and there's a lot of data to try to get to.

And there's a lot of things going on here.

So what I'm going to notice right now

is it's seeing this old PBS and PBS.

So this means that my storages are offline and I know I

need to work on my backups here

and it's just telling me in a different location.

Status update: pvestatd is getting some interesting data.

And I think pvestatd is the daemon that is reading the

amount of storage space here and

it's reporting the same problem.

So if I want to calm these logs down, I'm going to have to

get that PBS storage back online

or remove it from my

configuration so it's no longer used.

Speaking about backups, I will actually go back to data

center for backups for one thing

right now.

And I would double check your backup configuration when

you're doing a check of your system just

to make sure you didn't add any VMs that you want to back up

that you forgot to add to your

backup configuration and that there aren't any other problems.

So I take a look at job detail to make

sure nothing seems odd and that you're not forgetting

about any disks that you potentially

wanted to back up on this storage here.

And I think it's also a good idea to take

a look at the guests without a backup

job and make sure that, for all these systems without a backup

job, you're okay with not having

any Proxmox backups of them.

So in case the node fails, you

know those VMs are going down with it.

Let's jump back to the node itself and

take a look at the updates and repositories.

I also like to check to make

sure that I have a Proxmox repository here,

either the paid subscription one or, if you don't have a

subscription, the no-subscription one, just to make sure

Proxmox is being updated in

addition to the Debian OS under the hood.

I'm going to refresh the updates, which means effectively

downloading what updates are available.

This isn't going to touch your

system under the hood at all.

And then I'm going to see what all is available.

Looking at this list right here, there seems to be quite a bit,

which is telling me this system

probably needs an update relatively soon.

And I should probably upgrade

this system by clicking this button,

then give it a reboot in case there's any kernel updates

because it looks like the Proxmox

kernel has a new version.

I'm not using the firewall on this system, but if you are,

checking out the firewall log

is probably a good idea. Setting up

the firewall log can definitely be a complex

task to make sure you have the right amount of logging

going on on your system, but making

sure that nothing's accessing data, or trying to

access data, that it shouldn't be

might be a good idea to glance at right now.

The next thing I'm going to take

a look at is disks on this system.

So clicking on the disks tab, I can see all the physical

disks with all of their partitions.

The first thing I'm going to notice is there's quite a few

ZFS disks, which tells me I'm

going to have to take a closer look at ZFS data.

And then I'm going to glance over to the right under S.M.A.R.T.

data and make sure everything

says PASSED here.

Because if a drive says it's having issues

internally, you probably should replace it

because normally it's telling the truth that it doesn't

want to be there too much longer.

The other thing to take a note

of is wear out for your SSDs.

So I can see that I have a few

SSDs here and all in the low single digits.

So I have nothing to worry about.

But if this number is creeping up, it might mean you want

to get another SSD to have

on hand to replace one of these

drives when they do end up hitting their maximum

rated wear out.

While SSDs can typically go way beyond the

wear out figure, some will go read only when

hitting the maximum.

And it's typically a good idea to have your drives running

within their wear out specifications.
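
The same two checks can be scripted from the JSON output of smartctl (for example `smartctl --json -a /dev/sdX`, run as root). The sketch below assumes the key names I understand smartmontools 7.x to use, smart_status.passed and, for NVMe drives, nvme_smart_health_information_log.percentage_used; the sample reports and the 80% wear limit are made up for illustration.

```python
import json


def smart_summary(report: dict) -> dict:
    """Pull the health verdict and NVMe wear figure out of a smartctl JSON report."""
    status = report.get("smart_status", {})
    nvme_log = report.get("nvme_smart_health_information_log", {})
    return {
        "device": report.get("device", {}).get("name", "unknown"),
        "passed": status.get("passed"),                 # None if not reported
        "wear_percent": nvme_log.get("percentage_used"),  # None for non-NVMe
    }


def needs_attention(summary: dict, wear_limit: int = 80) -> bool:
    """Flag a drive that failed its self-assessment or is close to rated wear."""
    if summary["passed"] is False:
        return True
    wear = summary["wear_percent"]
    return wear is not None and wear >= wear_limit


if __name__ == "__main__":
    # Hypothetical, heavily trimmed reports.
    failing = json.loads('{"device": {"name": "/dev/sda"}, "smart_status": {"passed": false}}')
    healthy = json.loads(
        '{"device": {"name": "/dev/nvme0"}, "smart_status": {"passed": true},'
        ' "nvme_smart_health_information_log": {"percentage_used": 3}}'
    )
    for report in (failing, healthy):
        summary = smart_summary(report)
        print(summary["device"], "needs attention:", needs_attention(summary))
```

SATA SSDs report wear through vendor-specific attributes instead, so this sketch deliberately leaves their wear figure as unknown rather than guessing.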

Now that I've noticed I have ZFS, I'm going

to go over ZFS on the system and see all of

my interestingly named ZFS pools.

I'm going to immediately notice that everything's online.

One thing you might want to notice and

deal with is fragmentation on a ZFS pool.

This is free space fragmentation, which is a little bit

different than what you'd think

of as NTFS fragmentation.

And also ZFS doesn't have a built-in defragmenter.

So unfortunately you can't really do anything about it.

But as long as pool performance is fine, I wouldn't worry

about the fragmentation number.

Online means everything's working correctly.

But if there is a problem, you can take a

look at detail and see which one of these

drives has an issue or it'll tell you if there's data

errors or what's going on with the pool

that might need fixing or replacing.
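
For a scripted version of this pool check, `zpool list -H -o name,health,fragmentation,capacity` prints one tab-separated line per pool. The following is a minimal sketch that parses that output and flags pools that are not ONLINE or are getting full; the pool names in the sample and the 80% capacity limit are assumptions.

```python
import shutil
import subprocess

FIELDS = "name,health,fragmentation,capacity"


def _percent(value: str):
    """Turn '12%' into 12; zpool prints '-' when a value is unavailable."""
    return None if value == "-" else int(value.rstrip("%"))


def parse_zpool_list(output: str) -> list:
    """Parse the tab-separated output of zpool list -H -o FIELDS."""
    pools = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip blank or unexpected lines
        name, health, frag, cap = parts
        pools.append({"name": name, "health": health,
                      "fragmentation": _percent(frag), "capacity": _percent(cap)})
    return pools


def pools_needing_attention(pools: list, capacity_limit: int = 80) -> list:
    """Names of pools that are not ONLINE or are at or above the capacity limit."""
    flagged = []
    for pool in pools:
        too_full = pool["capacity"] is not None and pool["capacity"] >= capacity_limit
        if pool["health"] != "ONLINE" or too_full:
            flagged.append(pool["name"])
    return flagged


if __name__ == "__main__":
    if shutil.which("zpool"):
        result = subprocess.run(["zpool", "list", "-H", "-o", FIELDS],
                                capture_output=True, text=True)
        print(pools_needing_attention(parse_zpool_list(result.stdout)))
```

For the details of a flagged pool, `zpool status` on that pool shows which drive has the issue, which is the same information as the detail view in the web interface.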

If you're doing any actual disk replacements or anything

like that, that's likely going

to have to be done in the terminal on this system.

I'm going to skip over Ceph for the system

because I just have a single node right now

that's not using Ceph on here.

But I'm going to skip over to task history on here.

And this is one nice way in the web UI to see all the

history here and not just the

short term history that's shown in the bottom task.

But I like to look at the bottom.

It's convenient to just see what's

happened recently on a node like this.

Diving into the terminal on this system, I

can see a lot of the same information if I

want using standard Linux utilities

like htop and iostat right here.

But I think the Proxmox interface does

quite a good job of showing me usage over time.

The one thing I can't easily see in the

Proxmox web interface is the access log.

If your Proxmox system is available to many different users

or especially the public internet,

keep an eye on this log right here and it's

going to tell you who's accessing what and

what they're looking at.

I'm just scrolling through this quickly

in less right now and I can see everything's

coming from this 192.168.1.205, which is one

of my desktops, but if I go all the way down

here I can see .209, and that's this computer I'm

using right now for the recording.

And since my system's not available to the outside

internet, nothing raises a bell but

if your system's available to the outside internet, you

probably want to put in something

like fail2ban to make sure people aren't trying to

access data and just take a close

look if anyone's accessing parts of data that they

shouldn't be on this system just to make

sure you're aware of that.
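
Instead of scrolling through the log in less, you could count requests per client address with a few lines of Python. This sketch assumes the access log lives at /var/log/pveproxy/access.log and that each line starts with the client address, possibly with an IPv4-mapped "::ffff:" prefix; the sample lines are made up.

```python
import os
from collections import Counter

ACCESS_LOG = "/var/log/pveproxy/access.log"  # assumed default location
MAPPED_PREFIX = "::ffff:"

# Made-up sample lines in the layout I'm assuming the log uses.
SAMPLE = """\
::ffff:192.168.1.205 - root@pam [01/01/2025:10:00:00 +0000] "GET /api2/json/nodes HTTP/1.1" 200 512
::ffff:192.168.1.205 - root@pam [01/01/2025:10:00:05 +0000] "GET /api2/json/cluster/tasks HTTP/1.1" 200 2048
::ffff:192.168.1.209 - root@pam [01/01/2025:10:01:00 +0000] "GET /api2/json/version HTTP/1.1" 200 128
"""


def clients_by_requests(log_text: str) -> Counter:
    """Count log lines per client address (the first field of each line)."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        client = fields[0]
        if client.startswith(MAPPED_PREFIX):
            client = client[len(MAPPED_PREFIX):]
        counts[client] += 1
    return counts


if __name__ == "__main__":
    text = SAMPLE
    if os.path.exists(ACCESS_LOG):
        with open(ACCESS_LOG) as handle:
            text = handle.read()
    for client, hits in clients_by_requests(text).most_common(10):
        print(f"{hits:6d}  {client}")
```

Any address in that list that you don't recognize is worth a closer look in the full log.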

Hopefully this little video is a useful guide of things to

check up on your Proxmox system

every once in a while to make

sure everything's working correctly.

Let me know if you think I skipped anything in the comments

below and thanks for watching

this video.
