Lemmy Online

News and posts related to LemmyOnline.com

Downtime Explanation - Updated 7/24 9pm CST (lemmyonline.com)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]

My apologies for the past day or so of downtime.

I was at a work conference all of last week. On the last morning, around 4am, before I headed back to my timezone, "something" inside my Kubernetes cluster took a dump.

While I can remotely reboot nodes, and even access them, the scope of what went wrong was far beyond what I could fix remotely from my phone.

After returning home yesterday evening, I started plugging away a bit and quickly realized something was seriously wrong with the cluster. From previous experience, I knew it would be quicker to just tear it down, rebuild it, and restore from backups, so I started that process.

However, since I had not seen my wife in a week, I felt spending some time with her was slightly more important at the time. I was able to finish getting everything restored today.

Due to the issues above, I will be rebuilding some areas of my infrastructure to be slightly more redundant.

Whereas before I had bare-metal machines running Ubuntu, going forward I will be leveraging Proxmox for compute clustering and HA, along with Ceph for storage HA.

That being said, sometime soon I will have Ansible playbooks set up to get everything pushed out and running.

Again, my apologies for the downtime. It was completely unexpected and came out of the blue. I honestly still have no idea what happened.

My best guess is a disk failure... although after rebooting the machine, it came back to life?

Regardless, I will work to improve this moving forward. Also, I don't plan on being out of town soon, so that will help too.

There may be some slight downtime later on as I work on and move things around. If that is the case, it will be short. For now, the goal is just restoring my other services and getting back up and running.

Update 2023-07-23 CST

There are still a few kinks being worked out. I have noticed things are still occasionally disconnecting.

Still working on ironing out the issues. Please bear with me.

(This issue appears to be due to a single Realtek NIC in the cluster... Realtek = bad)

Update 9:30pm CST

Well, it has been a "fun" evening. I have been finding issues left and right.

A piece of bad fiber cable.
The aforementioned server with a Realtek NIC, which was bringing down the entire cluster.
STP/RSTP issues, likely caused by the above two issues.

Still working and improving...

Update 2023-07-24

Update 9am CST

Working out a few minor kinks still. Finish line is in sight.

Update 5pm CST

Happened to find an SFP+ module which was in the process of dying. Swapped it out with a new one, and... magically, many of the spotty network issues went away.

Have new fiber ordered, will install later this week.

Update 9pm CST

Broken/intermittent SFP+ module replaced.
Server with crappy Realtek NIC removed. Re-added server with 10G SFP+ connectivity.
Clustered servers moved to dedicated switch.
New fiber stuff ordered to replace longer-distance (50ft) 10G copper runs.

I am aware of current performance issues. These will start going away as I expand out the cluster. Still focusing on rebuilding everything to a working state.
