this post was submitted on 29 Aug 2023
64 points (94.4% liked)

Selfhosted


A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

  1. Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.

  2. No spam posting.

  3. Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it's not obvious why your post topic revolves around selfhosting, please include details to make it clear.

  4. Don't duplicate the full text of your blog or github here. Just post the link for folks to click.

  5. Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).

  6. No trolling.

Any issues on the community? Report it using the report flag.

Questions? DM the mods!


edit: you are right, it's the I/O WAIT that is destroying my performance:
%Cpu(s): 0,3 us, 0,5 sy, 0,0 ni, 50,1 id, 49,0 wa, 0,0 hi, 0,1 si, 0,0 st
I could clearly see it using nmon > d > l > - as suggested by @SayCyberOnceMore. Not quite sure what to do about it, since it's simply my sdb1 drive, a Samsung 1TB 2.5" HDD. I have now ordered a 2TB SSD, and maybe I will reinstall from scratch on that new drive as sda1. I realize that's just treating the symptom and not the root cause, so I should probably also look for that root cause. But that's for another Lemmy thread!
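For anyone landing here later: a minimal way to confirm the iowait share without nmon or top is to read the cumulative CPU counters in /proc/stat (this is the percentage since boot, not an instantaneous reading):

```shell
# First line of /proc/stat holds cumulative CPU jiffies:
# user nice system idle iowait irq softirq steal ...
read -r _ user nice system idle iowait _ < /proc/stat
total=$((user + nice + system + idle + iowait))
echo "iowait since boot: $((100 * iowait / total))% of CPU time"
```

A value anywhere near the 49% shown in the top output above means the disk has been the bottleneck for a long time, not just during a burst.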

I really don't understand what is causing this. I run a few very small containers, and everything is fine - but when I start something bigger like Photoprism, Immich, or even MariaDB or PostgreSQL, then something causes the CPU load to rise indefinitely.

Notably, the top command doesn't show anything special: nothing eats RAM, nothing uses 100% CPU. And yet, the load rises fast. If I leave it be, my ssh session loses connection. Hopping onto the host itself shows a load of over 50, or even over 70. I don't grok how a system can even get that high at all.

My server is an older Intel i7 with 16GB RAM running Ubuntu 22.04 LTS.

How can I troubleshoot this, when 'top' doesn't show any culprit and it does not seem to be caused by any one specific container?

(This makes me wonder how people can run anything at all off a Raspberry Pi. My machine isn't "beefy", but a Pi would be so much less.)

21 comments
[–] [email protected] 49 points 1 year ago* (last edited 1 year ago) (3 children)

It sounds like it could be an IO wait issue; system load will climb a ton without showing much CPU usage.

Make sure you're not running out of RAM and going into swap space, though it doesn't sound like it.

iotop might show something useful. And in htop you can add the 'PERCENT_IO_DELAY' column, which can be useful.
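For a quick look without installing anything, the raw counters that iotop aggregates live in /proc/&lt;pid&gt;/io (shown here for the shell's own process):

```shell
# Bytes this process has actually caused to hit the block layer;
# iotop reads the same counters for every process on the system
grep -E '^(read_bytes|write_bytes)' /proc/self/io
```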

[–] PriorProject 19 points 1 year ago* (last edited 1 year ago)

My money is also on IO. Outside of CPU and RAM, it's the most likely resource to get saturated (especially if using rotational magnetic disks rather than an SSD, magnetic disks are going to be the performance limiter by a lot for many workloads), and also the one that OP said nothing about, suggesting it's a blind spot for them.

In addition to the excellent command-line approaches suggested above, I recommend installing netdata on the box as it will show you a very comprehensive set of performance metrics without having to learn to collect each one on the CLI. A downside is that it will use RAM proportional to the data retention period, which if you're swapping hard will be an issue. But even a few hours of data can be very useful and with 16gb of ram I feel like any swapping is likely to be a gross misconfiguration rather than true memory demand... and once that's sorted dedicating a gig or two to observability will be a good investment.
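If you go the netdata route, a minimal compose sketch looks something like this (the image name is the official one; the mounts shown are the commonly documented ones, and anything beyond that is a tuning assumption):

```yaml
# docker-compose sketch for netdata; port and mounts per the
# project's own docs, everything else left at defaults
services:
  netdata:
    image: netdata/netdata
    ports:
      - "19999:19999"
    cap_add:
      - SYS_PTRACE
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
```

The dashboard on port 19999 then shows CPU, RAM, disk, and per-device iowait history out of the box.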

[–] [email protected] 8 points 1 year ago (2 children)

And I know OP mentioned not using much RAM, but almost every time I see a server load that high, it's usually because the server is swapping heavily, causing the iowait.

[–] [email protected] 4 points 1 year ago* (last edited 1 year ago)

Yeah I figured I would mention it since OP does describe symptoms like that.

[–] [email protected] 2 points 1 year ago

Does top show unpaged memory too? I've had an application with a memory leak before that would fill up unpaged memory, and it would look like nothing was using RAM when I checked the task manager, even though usage was 99%.

[–] [email protected] 3 points 1 year ago (1 children)

Yep. IO.

OP, this might be overkill for you, but it might be worth standing up a grafana/prometheus stack. You'd be able to see this stuff a lot faster and potentially narrow in on a root cause.

[–] PlutoniumAcid 1 points 1 year ago

That is definitely an interesting idea! Much, much better than the stupid dashdot container I am running now :-D

[–] [email protected] 17 points 1 year ago

Run top and paste the top portion of the screen output.

I would suspect it is IO wait. You can get into disk contention if you have multiple containers fighting for the disk. You will notice the IO queue building up, which shows you are waiting on IO transactions.

%Cpu(s): 67.4 us, 13.0 sy, 0.0 ni, 19.4 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st

See the field labeled wa: that is wait time, basically the time spent waiting for IO to complete.

If that is high, you can increase the cache used by Linux, BUT if the system crashes you are at risk of losing unwritten data.
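The cache knobs in question are the kernel's writeback thresholds; here's a sketch of inspecting them (the raised values in the comment are illustrative, not recommendations):

```shell
# Percent of RAM that may hold dirty (not-yet-written) pages before
# background writeback kicks in, and before writers are blocked
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
# To trade crash-safety for smoother write bursts you could raise them:
#   sudo sysctl -w vm.dirty_background_ratio=10 vm.dirty_ratio=40
```

Larger values let the kernel absorb bursts in RAM, but everything still dirty at crash time is lost.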

[–] BlackXanthus 12 points 1 year ago* (last edited 1 year ago) (2 children)

The last time I saw this was on a slow-failing HDD.

A quick fsck might get you a few answers; you can find more info in the Linux manual. It could just be one or two bad blocks that you can recover, fixing the problem (though, ofc, it's time to back up your data).

The other, slightly unusual time I've seen it is with mixed RAM. 16GB made of 2x6GB and then 2x4GB did some really odd things to the system. If it's not the disk, and your box will boot with one stick of RAM, try it to see if it fixes the issue. It could be that your RAM speeds are off (or you're like me and just put in two sticks you had lying around, and it basically worked until it didn't).

An outlier that I've not seen on modern machines is io/wait for a CD-ROM to spin up, even if you're not accessing the CD-ROM, normally caused by bad cabling. Based on the age of your machine this is unlikely, but it might be worth unplugging devices to see if one is bad and not reporting properly.

This is, of course, assuming dmesg is empty.

Final thought: see if you're running SELinux. If you are, turn it off and try again. Those policies are complex, and something installed in a non-standard place could be causing SELinux to slow IO as it fills your logs with warnings.
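Checking that takes seconds (and note Ubuntu ships AppArmor rather than SELinux by default, so on this box it's probably already ruled out):

```shell
# getenforce reports Enforcing / Permissive / Disabled when SELinux
# is installed; on stock Ubuntu the tool usually isn't even present
if command -v getenforce >/dev/null 2>&1; then
  getenforce
else
  echo "getenforce not found - SELinux likely not installed"
fi
```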

Hope that helps,

[–] [email protected] 2 points 1 year ago

To add on to this: if you're using some random RAM stick picked out of the gutter, it might be worth running memtest86+. Bad RAM sectors can cause some weird, unpredictable issues.

[–] PlutoniumAcid 1 points 1 year ago

> Do not run fsck on a mounted device

So how do I run this on /dev/sda? I can't very well unmount the OS drive...
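For reference, the usual answer is that you don't fsck a mounted root at all: you schedule the check for the next boot (or boot from a live USB and check the disk unmounted). A sketch:

```shell
# Identify the root device; fsck-ing it while mounted risks corruption
rootdev=$(awk '$2 == "/" {print $1; exit}' /proc/mounts)
echo "root filesystem is on $rootdev"
# Then either:
#   sudo touch /forcefsck        # sysvinit-style flag file, or
#   add fsck.mode=force to the kernel command line (systemd),
# and reboot: fsck runs before the filesystem is mounted read-write.
```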

[–] [email protected] 11 points 1 year ago (1 children)

"Load" is not "CPU usage." It's "system usage" and includes disk and network activity, including swapping if you're low on memory.

vmstat can tell you what your disk IO looks like. iotop can help with narrowing it down to a process.
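If vmstat isn't installed, the counters it reads are available directly (values are cumulative since boot; the field names below are the actual /proc/vmstat ones):

```shell
# pgpgin/pgpgout: KiB paged in/out from block devices;
# pswpin/pswpout: swap traffic - nonzero and growing means swapping
grep -E '^(pgpgin|pgpgout|pswpin|pswpout) ' /proc/vmstat
```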

[–] [email protected] 6 points 1 year ago (1 children)

It's a bit more complicated than that. System load is a count of how many processes are in an R state (either "R"unning or "R"eady), plus, on Linux, processes in a D state (uninterruptible sleep, typically waiting on disk). A process that blocks on an ordinary system call, say waiting on the network, drops into an S state and stops counting toward load.

But disk I/O does affect it, which makes it a bit tricky. You mentioned swapping. Swapping's partner in crime, memory-mapped files, also contribute. In both of those cases, a process tries to access memory (without making a system call) that the kernel needs to do work to resolve, so the process stays in an R state.

I can't think of a common situation where network activity could contribute to load, though. If your swap device is mounted over NFS maybe?

Anyway, load mostly tracks CPU usage, but disk activity and memory pressure can drive it up too. If you're seeing a high load with low CPU utilization, that's very often processes stuck waiting on disk, whether from swapping under memory pressure or from plain I/O saturation.
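You can see this distinction directly: /proc/loadavg holds the three load averages, and each process's state letter sits in /proc/&lt;pid&gt;/stat right after the parenthesised command name:

```shell
# 1-, 5- and 15-minute load averages, plus runnable/total tasks
cat /proc/loadavg
# Count processes currently in uninterruptible (D) sleep
n=0
for s in /proc/[0-9]*/stat; do
  # strip everything up to the ") " that closes the command name,
  # leaving the single-letter state as the first field
  state=$(sed 's/^.*) //' "$s" 2>/dev/null | cut -d' ' -f1)
  [ "$state" = "D" ] && n=$((n + 1))
done
echo "$n processes in D state"
```

A load far above the CPU count with many D-state processes is the classic I/O-bound signature.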

[–] [email protected] 3 points 1 year ago (1 children)

Using network storage to store your swapfile is one of the… um, more interesting ideas I’ve heard today

[–] [email protected] 1 points 1 year ago

Adding that to the list of things to try when one's network only knows 100Gb

[–] [email protected] 7 points 1 year ago

Many people aren't running containers on an RPi... while feasible, it was notoriously poor until the 8GB Pi 4, and it's still easily bounded by SD card I/O. Are there docker stats, so you can see the disk + net I/O of each container?
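To that last question: yes, docker stats reports per-container block and network I/O alongside CPU and memory. A one-shot sketch (guarded so it degrades gracefully where Docker isn't available):

```shell
# --no-stream takes a single sample instead of updating continuously;
# BlockIO and NetIO are the per-container disk and network totals
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  docker stats --no-stream \
    --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.BlockIO}}\t{{.NetIO}}'
else
  echo "docker not available here"
fi
```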

[–] [email protected] 6 points 1 year ago

Install nmon - it's a CLI tool to show system load. Run it and press 'd' to show disk usage, then 'l' to show a long-term graph, then '-' to speed it up. If your storage is the issue you'll see it here, and potentially which drive(s) are affected.

[–] [email protected] 5 points 1 year ago (1 children)

Immich and Photoprism do AI detection and sorting, right? Until they have scanned through all your current photos, you are going to have a lot of system load. And my 4GB Pi usually just runs at 1GB of memory and low load... no Immich, but it runs OpenMediaVault for DAAP and SAMBA, plus Déjà Dup, Syncthing, Home Assistant, CUPS, etc.

[–] [email protected] 1 points 1 year ago

I ran Immich on a server that's substantially faster than a Raspberry Pi, and after about a day and a half it kept me from getting in with ssh. Even locally, I had to wait a long time to get a login prompt.

[–] [email protected] 3 points 1 year ago

Check the wa level in top; if it is high, the system is waiting for hardware to process stuff. If it is, check with atop whether any disks are red.

In such cases I almost always see some piece of hardware failing: network card or switch, hard disk/NFS, memory. Hope this helps.

[–] [email protected] 3 points 1 year ago

I'd try each application one by one. Maybe write a script to monitor the load and stop the program if it goes past your desired threshold, and notify you.
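A minimal sketch of that watchdog idea (the container name and threshold are placeholders; you'd run this from cron every minute or so):

```shell
# Hypothetical load watchdog: CONTAINER and THRESHOLD are illustrative
THRESHOLD=20
CONTAINER=immich_server
load=$(cut -d' ' -f1 /proc/loadavg)
# compare as integers by dropping the fractional part
if [ "${load%.*}" -ge "$THRESHOLD" ]; then
  echo "load $load >= $THRESHOLD, stopping $CONTAINER"
  # docker stop "$CONTAINER"   # and notify, e.g. mail or a push service
fi
```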

It could also be a setting in some app like Photoprism or Immich... I think one of them uses TensorFlow to classify images. That would increase the load if that's running in the background.

Maybe try them with an empty directory so there is no data to process and see if you encounter the error. Then add some data and see how the load is.