Proxmox VE Server Health Check: Hardware & Software Monitoring
In this video, I'm going to do what I like to call my
Proxmox Checkup, a manual check
in the Proxmox web interface and a little
bit of terminal use to see what's going on
with the system and see if there's any possible hardware or
software issues to take a note
of.
While automated alerts are great for getting alerted
immediately when something starts
having an issue, I think it's still a good
idea to manually look over a system like this
every once in a while to make sure your alerting system
didn't miss anything and that there are
no other problems going on with the
system that you should take action
on before they get worse.
I'm going to be doing most things in this video in the
Proxmox web interface, but you
can do almost all of this in the
terminal as well if you want to script it.
So let's dive into the interface now.
This is one of my personal Proxmox
servers, so it's kind of running okay, but there's
some oddities on the system as well to hopefully
demonstrate what a real life system would
look like.
The first thing I'm going to start with is this bottom
section here with all the different
Proxmox tasks from the last day or two.
The first thing I'm going to notice is anything that's red
or yellow that has a warning or
error, and I'm going to notice I have
some backup jobs that had some problems.
This Proxmox VE system is supposed to back up to a PBS
server, but my PBS server spends
some time shut down, so most of these are expected
and I know what's going on here.
But it should have been running for this one, so I'm going
to take a look at what's going
on here.
Looking at the error, the next thing I
notice is that my mail to root
has errors here.
So in addition to my backups not working,
I also have some mailing issues and likely
want to make sure that my
notifications email is also set up correctly.
So I'm going to take a note of both of those things here,
and other than that, it looks
like some of these backup jobs ran correctly.
There's some background updates of the package database
that ran fine and some shells.
I'm also looking for anything abnormal that I
don't remember doing, just to make sure
nothing weird is going on with this box.
The next place I'm going to take a look at
on my server is the left hand column here,
which shows all my different servers
and the VMs and the storage on them.
I'm going to mostly be dealing with a single node in
this video, but if you have multiple
nodes in a cluster, I'd immediately check to see if
everything's online with the little check
marks.
Typically, a Proxmox cluster wants to have
all of its nodes running all of the time,
so if something's off, I'd try to
get it online if it wasn't intentional.
The next thing I'm going to take a look at is all the
VMs to see which ones are running
and which ones aren't.
Proxmox can inadvertently kill VMs or sometimes you can
accidentally shut them down and realize
they're not running when they should have been running.
I'd immediately take a note and make sure
that nothing's running that shouldn't be
running and nothing's not running that should be running.
Sometimes it might be nice to use the little note tags for
always running VMs so you can
know which VMs that should be running aren't actually
running because sometimes Proxmox
kills a VM and just doesn't tell or notify you very well.
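If you'd rather script the "should be running but isn't" check, here is a minimal Python sketch. It assumes the column layout that `qm list` prints (VMID, NAME, STATUS, ...); the sample text and the `must_run` set of VMIDs are made up for illustration.

```python
def stopped_guests(qm_list_output, must_run):
    """Return the VMIDs from must_run that are not reported as running."""
    running = set()
    for line in qm_list_output.splitlines():
        fields = line.split()
        # Skip the header row and blank lines; data rows start with a numeric VMID.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        if fields[2] == "running":
            running.add(int(fields[0]))
    return sorted(set(must_run) - running)


# Hypothetical sample in the layout `qm list` uses; not real output.
SAMPLE = """\
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       100 nas                  running    8192              32.00 1234
       101 testbox              stopped    2048              16.00 0
"""

print(stopped_guests(SAMPLE, must_run={100, 101, 102}))
```

On a real node you could feed it the text from `subprocess.run(["qm", "list"], capture_output=True, text=True).stdout` instead of the sample. A VMID that is missing from the list entirely is also reported, since it isn't running either.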
I'm now going to move down and look at storage.
The first thing I'm going to look for is
anything with this little question mark which
means it can't access it.
It looks like my two ways of mounting PBS
storage as part of a migration are having
some issues and it's loading.
I think this is something to do with my
PBS server being extremely slow, but it
reinforces the idea that there are backup issues here.
There's definitely a problem on this system
and I need to take a note of why these two
chunks of storage can't be accessed here.
For my three working storages on this system, I'm going to
take a look at the utilization
on them to make sure they don't fill
all the way up and cause any issues.
Taking a look at the little bar graph they
have right here, I can see that HDD file and
local are nearly empty but the NVMe SSD is almost full.
So clicking on it for more detail lets me know that I have,
it looks like, about 50 gigabytes
free out of 250, which is about
80% full, which for me is nearing the point
where I should really start to think
about adding storage on this system.
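The arithmetic behind that 80% figure, as a small Python sketch. The 80% threshold is just my own rule of thumb from above, and `needs_attention` is a hypothetical helper name; `shutil.disk_usage` reports whatever the filesystem at that path reports, which for a ZFS dataset may not match the pool view in the web interface.

```python
import shutil


def percent_used(total, free):
    """Percentage of capacity in use, given total and free in the same unit."""
    return 100.0 * (total - free) / total


def needs_attention(path, threshold=80.0):
    """True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.free) >= threshold


# The numbers from the video: about 50 GB free out of 250 GB.
print(percent_used(250, 50))  # 80.0
```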
The other thing I like to take a look at is the graph.
Typically I look at either a month or a year to get a
better long term view of what's going
on and I can see it's kind of been creeping up a little bit
and then I deleted some files
which freed up some space.
So it looks like since my usage is slowly going up, I'm
going to want to add some more
storage here, but it might not be super critical.
Probably within the next week I should really add more
storage in this pool because otherwise
it's getting too close for comfort and SSDs don't like
being completely full either.
Moving on to the rest of the system, I can look at data
center, but data center is almost
all configuration and in this video I'm going to be mostly
checking on the system and not
reviewing the existing configuration of the system.
The one nice thing in data center is the summary tab which
can say if there's any node issues
or VMs that are running or aren't running, but that's
pretty much the same info we can
see on the left.
In order to get a better feel for resource utilization on
each node, I'm going to click
on the node itself and go on to summary and this is going
to get me a bit of information
immediately right now of what's going on in the system as
well as information over time.
For the immediate information, I
can see CPU usage is pretty low.
Load averages are fairly low.
A quick bit of additional information on load averages in
Linux is that a load average of roughly one
is equivalent to fully
utilizing one thread on a CPU.
In this case, I have 12 CPU threads, so a load
average of 12 would be effectively fully used.
Load average also adds in things like disk
IO and waiting for that sort of thing, so
if this goes up abnormally, your system's waiting on
something and is under heavy load.
On a system like this, anything nearing double digits or
over double digits would make me
think the system's under pretty heavy load
and I'd either want to reduce the load or
get a bigger system.
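That rule of thumb can be written down as a short Python sketch. The `busy_ratio` of 0.8 is an assumption on my part, chosen so that a load of 10 on 12 threads counts as heavy; the function names are made up for illustration.

```python
import os


def describe_load(load, threads, busy_ratio=0.8):
    """Classify a load average relative to the number of CPU threads."""
    ratio = load / threads
    if ratio >= 1.0:
        return "saturated"   # roughly every thread is busy or waiting
    if ratio >= busy_ratio:
        return "heavy"       # getting close to fully used
    return "ok"


def current_load_status():
    """Classify the 15-minute load average of the machine this runs on."""
    _, _, fifteen_min = os.getloadavg()   # Unix only
    return describe_load(fifteen_min, os.cpu_count() or 1)


print(describe_load(10, 12))
```

Calling `current_load_status()` on the node itself gives the same kind of answer for the live system.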
IO delay can also be a great way to see how much your
system is waiting on disk IO.
Right now, it's less than 1%, which means
not very much, but if this shoots up, this
can be a good indication that you need a faster storage
solution for your VMs to work better.
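To see a similar number outside the web interface, here is a hedged Python sketch that works from the aggregate `cpu` line of `/proc/stat`. It assumes the IO delay figure corresponds to the iowait share of CPU time between two samples, which is my reading rather than something stated in the video; the sample lines are invented.

```python
def cpu_times(proc_stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal (guest ...).
    # Guest time is already counted in user, so only the first eight are summed.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


# Hypothetical samples taken a few seconds apart.
first = cpu_times("cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0")
second = cpu_times("cpu  200 0 100 1600 100 0 0 0 0 0")
print(iowait_percent(first, second))  # 5.0
```

On a live system you would read `/proc/stat` twice with a short sleep in between and pass both results to `iowait_percent`.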
I can also see RAM usage on this system.
RAM usage in Proxmox is unfortunately not
very well shown if you're using ZFS because
ZFS likes to use a lot of RAM for its cache
and it just shows up as used here.
So if this seems abnormally high,
check what ZFS is doing in the terminal.
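One way to check that from a script is to read the ARC size that OpenZFS on Linux exposes in `/proc/spl/kstat/zfs/arcstats`. This is a minimal Python sketch assuming the usual three-column `name type data` layout of that file; the sample values are invented.

```python
def arc_size_bytes(arcstats_text):
    """Return the current ARC size in bytes from arcstats-formatted text."""
    for line in arcstats_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == "size":
            return int(fields[2])
    raise ValueError("no 'size' row found")


# Hypothetical excerpt in the arcstats layout; not real values.
SAMPLE = """\
name                            type data
hits                            4    1000
size                            4    8589934592
c_max                           4    17179869184
"""

print(arc_size_bytes(SAMPLE) / 2**30, "GiB held by the ARC")
```

Subtracting that figure from the used RAM shown on the summary page gives a rough idea of how much memory the VMs and the host are really using.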
But other than that, 80 some percent seems
perfectly fine and I also noticed I have a
bit of swap on this system, which makes it less likely that
an out-of-memory
error kills any of my VMs.
But I want to take a look at the graphs to get a feel for
what's going on in this system,
not just right now, but also over time.
And I'm actually going to set it to the month or year maximum.
And maximum is going to show me the highest usage during
these periods of time instead
of the average, so I can kind of get a better feel for what
the peaks are doing on the system
instead of my average load because I want to make sure my
system still has some resources
left during the peaks instead of just the average load.
So taking a look at my system, I can
see I did a RAM upgrade about here.
And after that RAM upgrade, I can see my
IO delay shot down because I was using a lot
less swap and my CPU peaks also went down.
So I can see that RAM upgrade helped quite a bit.
And now that I've done the upgrade, CPU
usage is peaking in the 20 to 30 percent range,
so that's perfectly fine with fairly low IO delay.
Same with server load.
After the RAM upgrade, it looks good.
But before the RAM upgrade, it was getting
over 10 and even much higher at some periods
of time.
And since I have 12 threads on the CPU, 10 is kind of
my guide for when load is a little bit
higher than it should be on this system.
Looking at network traffic, this server is also being used
as a NAS and I have a gigabit
network it's using and it's maxing out that network.
This is telling me that upgrading to a faster-than-gigabit
network would likely speed some
things up.
But other than that, everything seems to be fine as I
expect the NAS to fully utilize
my network connection.
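For a sense of what "maxing out" gigabit means in bytes, here is a small Python sketch. The interface counter path under `/sys/class/net` is standard on Linux, but the helper names and the default of 1000 Mbit/s are assumptions for illustration.

```python
def link_utilization(bytes_per_second, link_mbit=1000):
    """Fraction of the link's raw capacity used by the given throughput."""
    link_bytes_per_second = link_mbit * 1_000_000 / 8   # 1 Gbit/s is 125 MB/s
    return bytes_per_second / link_bytes_per_second


def read_rx_bytes(interface):
    """Total bytes received on an interface since boot (Linux sysfs counter)."""
    with open(f"/sys/class/net/{interface}/statistics/rx_bytes") as counter:
        return int(counter.read())


print(link_utilization(125_000_000))  # 1.0, the raw ceiling of gigabit
```

Reading the counter twice and dividing the difference by the seconds between the reads gives the bytes per second to pass in. Real file transfers top out a bit below the raw ceiling because of protocol overhead.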
Speaking about network, let's actually take
a look at the network on this system next.
On a basic setup like mine, I can just see
it's working as I configured on this system.
But the one thing you might want to take a
note of is if you're using bonded network
connections where a pair of connections are working
together effectively as one, make
sure that one of them hasn't gone offline,
because that could be a potential network
issue and means that if the other one goes
offline too, the system is completely unreachable.
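A bond member that has quietly dropped out can be spotted from the Linux bonding driver's status file, for example `/proc/net/bonding/bond0`. This Python sketch assumes the `Slave Interface:` and `MII Status:` lines that file normally contains; the bond name, interface names, and sample text are hypothetical.

```python
def bond_member_status(bonding_text):
    """Map each bond member interface to its MII status."""
    status = {}
    current = None
    for line in bonding_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The first MII Status line after a member belongs to that member.
            status[current] = value
            current = None
    return status


# Hypothetical excerpt; the top-level MII Status describes the bond itself.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
MII Status: up

Slave Interface: eno1
MII Status: up
Speed: 1000 Mbps

Slave Interface: eno2
MII Status: down
Speed: Unknown
"""

members = bond_member_status(SAMPLE)
print([name for name, state in members.items() if state != "up"])
```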
Moving on beyond that, let's go to the system log now.
This is the internal system log
that shows everything that's going on.
Reading logs can really be an art,
and there's a lot of data to get through,
with a lot of things going on here.
So what I'm going to notice right now
is it's seeing this old PBS and PBS.
So this means that my storages are offline and I know I
need to work on my backups here
and it's just telling me in a different location.
Status update: pvestatd is getting some interesting data.
And I think pvestatd is the daemon that reads the
amount of storage space here, so
that's also telling me the same thing.
So if I want to calm these logs down, I'm going to have to
get that PBS storage back online
or remove it from my
configuration so it's no longer used.
Speaking about backups, I will actually go back to data
center for backups for one thing
right now.
And I would double check your backup configuration when
you're doing a check of your system just
to make sure you didn't add any VMs that you want to back up
but forgot to add to your
backup configuration, and that there aren't any other problems.
So I take a look at job detail to make
sure nothing seems odd and that you're not forgetting
about any disks that you potentially
wanted to back up on this storage here.
And I think it's also a good idea to take
a look at showing guests without a backup
job and make sure that, for all these systems without a backup
job, you're okay with not having
any Proxmox backups of them.
So in case the node fails, you
know those VMs are going down with it.
Let's jump back to the node itself and
take a look at the updates and repositories.
I also like to check the repositories to make
sure that I have a Proxmox repository here,
either the paid enterprise repository or, if you don't have a
subscription, the no-subscription one, just to make sure
Proxmox is being updated in
addition to the Debian OS under the hood.
I'm going to refresh the updates, which means effectively
downloading what updates are available.
This isn't going to touch your
system under the hood at all.
And then I'm going to see what all is available.
Looking at this list right here, there seems to be quite a bit,
which is telling me this system
probably needs an update relatively soon.
And I should probably upgrade
this system by clicking this button,
then give it a reboot in case there are any kernel updates,
because it looks like the Proxmox
kernel has a new version.
I'm not using the firewall on this system, but if you are,
checking out the firewall log
is probably a good place to look. Setting up
the firewall log can definitely be a complex
task to make sure you have the right amount of logging
going on on your system, but making
sure that nothing's accessing data, or trying to
access data, that it shouldn't be
might be a good idea to glance at right now.
The next thing I'm going to take
a look at is disks on this system.
So clicking on the disks tab, I can see all the physical
disks with all of their partitions.
The first thing I'm going to notice is there's quite a few
ZFS disks, which tells me I'm
going to have to take a closer look at ZFS data.
And then I'm going to glance over to the right under S.M.A.R.T.
data and make sure everything
says PASSED here.
Because if a drive says it's having issues
internally, you probably should replace it
because normally it's telling the truth that it doesn't
want to be there too much longer.
The other thing to take a note
of is wearout for your SSDs.
I can see that I have a few
SSDs here, all in the low single digits.
So I have nothing to worry about.
But if this number is creeping up, it might mean you want
to get another SSD to have
on hand to replace one of these
drives when they do end up hitting their maximum
rated wearout.
While SSDs can typically go way beyond the
wearout figure, some will go read-only when
hitting the maximum.
And it's typically a good idea to keep your drives running
within their wearout specifications.
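The same two checks can be scripted against the JSON output of `smartctl` (for example `smartctl --json -a /dev/nvme0`). This is a hedged Python sketch: it reads the `smart_status.passed` flag and, for NVMe drives only, the `percentage_used` value from the health log. The device name and the trimmed sample report are hypothetical, and SATA SSDs report wear through vendor-specific attributes that this sketch does not cover.

```python
import json


def smart_summary(smartctl_json_text):
    """Pull the overall health flag and NVMe wear figure from smartctl JSON."""
    report = json.loads(smartctl_json_text)
    passed = report.get("smart_status", {}).get("passed")
    nvme_log = report.get("nvme_smart_health_information_log", {})
    return {
        "passed": passed,                                     # True, False, or None
        "nvme_percentage_used": nvme_log.get("percentage_used"),
    }


# Hypothetical, heavily trimmed report; real output has many more keys.
SAMPLE = """
{
  "smart_status": {"passed": true},
  "nvme_smart_health_information_log": {"percentage_used": 3}
}
"""

print(smart_summary(SAMPLE))
```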
Now that I've noticed I have ZFS, I'm going
to go over to ZFS on the system and see all of
my interestingly named ZFS pools.
I'm going to immediately notice that everything's online.
One thing you might notice
is fragmentation on a ZFS pool.
This is free space fragmentation, which is a little bit
different than what you'd think
of as NTFS fragmentation.
Also, ZFS doesn't have a built-in defragmenter,
so unfortunately you can't really do anything about it.
But as long as pool performance is fine, I wouldn't worry
about the fragmentation number.
Online means everything's working correctly.
But if there is a problem, you can take a
look at detail and see which one of these
drives has an issue or it'll tell you if there's data
errors or what's going on with the pool
that might need fixing or replacing.
If you're doing any actual disk replacements or anything
like that, that's likely going
to have to be done in the terminal on this system.
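For a terminal version of that pool overview, here is a Python sketch that parses the tab-separated output of `zpool list -H -o name,health,fragmentation,capacity`. The pool names and the 80% capacity threshold are assumptions for illustration, and the sample lines are invented.

```python
def _percent(value):
    """Turn '45%' into 45; return None for '-' or anything non-numeric."""
    value = value.rstrip("%")
    return int(value) if value.isdigit() else None


def parse_zpool_list(text):
    """Parse `zpool list -H -o name,health,fragmentation,capacity` output."""
    pools = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, health, frag, cap = line.split("\t")
        pools.append({"name": name, "health": health,
                      "fragmentation": _percent(frag), "capacity": _percent(cap)})
    return pools


def pools_needing_attention(pools, max_capacity=80):
    """Names of pools that are not ONLINE or are at or above max_capacity."""
    return [p["name"] for p in pools
            if p["health"] != "ONLINE"
            or (p["capacity"] is not None and p["capacity"] >= max_capacity)]


# Hypothetical output; -H gives tab-separated rows with no header.
SAMPLE = "tank\tONLINE\t12%\t45%\nfast\tONLINE\t30%\t80%\nold\tDEGRADED\t-\t10%\n"

print(pools_needing_attention(parse_zpool_list(SAMPLE)))
```

Fragmentation is parsed but deliberately not used as an alert condition, in line with not worrying about that number while pool performance is fine.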
I'm going to skip over Ceph for the system
because I just have a single node right now
that's not using Ceph on here.
But I'm going to skip over to task history on here.
And this is one nice way in the web UI to see all the
history here and not just the
short term history that's shown in the bottom task panel.
But I like to look at the bottom.
It's convenient to just see what's
happened recently on a node like this.
Diving into the terminal on this system, I
can see a lot of the same information if I
want using standard Linux utilities
like htop and iostat right here.
But I think the Proxmox interface does
quite a good job of showing me usage over time.
The one thing I can't easily see in the
Proxmox web interface is the access log.
If your Proxmox system is available to many different users
or especially the public internet,
keep an eye on this log right here and it's
going to tell you who's accessing what and
what they're looking at.
I'm just scrolling through this quickly
in less right now and I can see everything's
coming from this 192.168.1.205, which is one
of my desktops, but if I go all the way down
here I can see .209, and that's this computer I'm
using right now for the recording.
And since my system's not available to the outside
internet, nothing raises any alarm bells, but
if your system's available to the outside internet, you
probably want to put in something
like fail2ban to make sure people aren't trying to
access data and just take a close
look if anyone's accessing parts of data that they
shouldn't be on this system just to make
sure you're aware of that.
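Instead of scrolling, the same question of who is connecting can be answered with a short Python sketch that counts requests per client address. It assumes the log is the pveproxy access log (typically `/var/log/pveproxy/access.log`) with the client address as the first field, and that `192.168.1.0/24` is the trusted network; the sample lines and the outside address are made up.

```python
import ipaddress
from collections import Counter


def requests_per_client(log_text):
    """Count log lines per client address (first whitespace-separated field)."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        client = fields[0]
        if client.startswith("::ffff:"):        # IPv4-mapped IPv6 form
            client = client[len("::ffff:"):]
        counts[client] += 1
    return counts


def unexpected_clients(counts, trusted="192.168.1.0/24"):
    """Clients that are not inside the trusted network."""
    network = ipaddress.ip_network(trusted)
    outside = []
    for client in counts:
        try:
            if ipaddress.ip_address(client) in network:
                continue
        except ValueError:
            pass                                 # unparseable: report it too
        outside.append(client)
    return sorted(outside)


# Hypothetical log lines in a common-log-style layout.
SAMPLE = """\
::ffff:192.168.1.205 - root@pam [01/Jan/2025:10:00:00 +0000] "GET /api2/json/nodes HTTP/1.1" 200 512
::ffff:192.168.1.205 - root@pam [01/Jan/2025:10:00:05 +0000] "GET /api2/json/cluster/tasks HTTP/1.1" 200 900
::ffff:192.168.1.209 - root@pam [01/Jan/2025:10:01:00 +0000] "GET /api2/json/version HTTP/1.1" 200 120
::ffff:203.0.113.7 - - [01/Jan/2025:10:02:00 +0000] "GET / HTTP/1.1" 401 0
"""

counts = requests_per_client(SAMPLE)
print(counts.most_common())
print(unexpected_clients(counts))
```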
Hopefully this little video is a useful guide of things to
check up on your Proxmox system
every once in a while to make
sure everything's working correctly.
Let me know if you think I skipped anything in the comments
below and thanks for watching
this video.
https://www.youtube.com/watch?v=UF8uR6Z6KLc