Turns out... it's Ceph storage.
Despite having 7 OSDs on bare-metal NVMe... despite having DEDICATED 10G network connectivity... it's having significant performance issues.
Any spike in IO (large file transfers, backups, even copying files to a different server) would cause huge IO delays, causing things to break or drop offline.
There are no errors shown. The configuration is pretty standard. I have no idea why it is having so many issues.
I have cleared off a new NVMe and will move this server to it tomorrow, which will hopefully end all of the issues from this week... assuming I have any users left here. (I wouldn't blame you for leaving; it has been a really bad week for LemmyOnline.)
IF my assumptions are incorrect, then f-it, I will just run Lemmy on a bare-metal server I have on standby.
Update
Server migrated to local storage. The move was nearly unnoticeable, unless you did something in the 3-minute window it took to clone/restore/etc.