This post was submitted on 27 Jul 2023

Lemmy Online

News and posts related to LemmyOnline.com

This is a continuation of the FIRST POST.

As you have likely noticed, there are still issues.

To summarize the first post: a catastrophic software/hardware failure meant needing to restore from backups.

I decided to take the opportunity to rebuild newer and better. As such, I gave Proxmox a try, with a Ceph storage backend.
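For reference, once Proxmox is managing Ceph (via pveceph), VM disks consume an RBD pool through an entry in /etc/pve/storage.cfg along these lines. This is just a sketch; the storage ID and pool name are made-up examples, not my actual config:

```
# /etc/pve/storage.cfg (sketch; "ceph-vms" and "vm-pool" are hypothetical names)
rbd: ceph-vms
        pool vm-pool
        content images,rootdir
        krbd 0
```

With pveceph managing the cluster, the monitor hosts and keyring should be picked up automatically.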

After getting a simple k8s environment back up and running on the cluster and restoring the backups, Lemmy Online was mostly back in business using the existing manifests.
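For context, "existing manifests" just means the declarative Kubernetes YAML that gets re-applied after a restore. A stripped-down sketch of what one such manifest might look like (the names, image tag, and replica count are assumptions, not the actual LemmyOnline config):

```yaml
# Hypothetical sketch of a Lemmy backend Deployment; not the actual
# LemmyOnline manifest. Names and image tag are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lemmy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lemmy
  template:
    metadata:
      labels:
        app: lemmy
    spec:
      containers:
        - name: lemmy
          image: dessalines/lemmy:0.18.2  # hypothetical version tag
          ports:
            - containerPort: 8536         # Lemmy backend's default port
```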

Well, the problem is: whenever heavy backend IO occurs (during backups, big operations, installing large software), the longhorn.io storage used in the k8s environment effectively "dies".

And, as I have seen today, this is not an infrequent issue; I have had to bounce the VM multiple times today to restore operations.

I am currently working on building out a new VM specifically for LemmyOnline, to separate it from the temporary k8s environment. Once this is up and running, things should return to stable and normal.

1 comment
[email protected] · 1 point · 1 year ago

Understandably, running Longhorn on top of Ceph is not the best decision.

However, since the backups were all performed in Longhorn, they need to be restored to Longhorn. As this is a temporary solution, it is taped together using replicas=1, which tells Longhorn to keep only a single copy of each piece of data. Ideally, this should mean Longhorn functions as a glorified method of handling local storage, but there are still other issues.
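For anyone curious, the replicas=1 duct tape is just a StorageClass parameter in Longhorn. A minimal sketch, assuming a made-up class name (driver.longhorn.io is Longhorn's CSI provisioner, and numberOfReplicas is the relevant parameter):

```yaml
# Minimal single-replica Longhorn StorageClass sketch.
# "longhorn-single" is a hypothetical name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"     # keep only one copy of each volume
  staleReplicaTimeout: "30" # minutes before a failed replica is cleaned up
```

With one replica there is no redundancy at the Longhorn layer, but it avoids stacking replication write amplification on top of Ceph's own replication.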

Another issue: recall that I had a 4-node Kubernetes cluster before the failure. Everything is currently condensed into a single VM, and it appears that might just be too much for a single server. There are only ~200 pods running, but I am still seeing lots of resource-contention errors, despite having enough RAM/CPU.
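Worth noting (a general observation, not something I have measured on this box): a stock kubelet caps a node at 110 pods, so ~200 pods on one VM already requires a raised maxPods, and per-pod overhead such as inotify watchers, conntrack entries, and CNI addresses can contend long before RAM/CPU run out. A minimal kubelet config sketch, with an illustrative value:

```yaml
# KubeletConfiguration sketch; 250 is an illustrative value,
# chosen only to clear the ~200-pod count. The default is 110.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250
```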

So, still working on this to get things back to stable and normal.