I was doing two deployments at the same time. On the first one, I got to the point where I had to clear the cache. I was typing out the command to remove the temp folder, looked down at the instructions for the other deployment I had in front of me, typed the folder for the prod deployment instead, and hit enter, deleting all of the currently installed code. It was a clustered machine, and the other node removed its files within milliseconds.

When I realized what I had done, I jumped up from my desk and said out loud "I'm fired!!" over and over. Once I calmed down, I had to get back on the call and ask everyone to check their apps. Sure enough, they were all failing. I told them what I had done, and we immediately checked the clustered machine; the files were gone there too.

It took about 8 hours for the backup team to restore everything. They kept having to go find tapes to put in the machine, and it took way longer than anyone expected. Once the files were restored, we determined that we were back to the previous day, so everyone's work from that night was gone and we had to start the night's deployments over.

I got grilled about it, and from that point on I had to write a script to clear the cache. No more manually removing files. The other good thing that came out of it was no more doing two deployments at the same time. I told them exactly what happened, and that when you push people like this, mistakes get made.
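I don't know what the actual script looked like, but a minimal sketch of that kind of guard-railed cache clear might be something like this. The allow-listed paths are made up; the point is that a fat-fingered path gets refused instead of deleted:

```python
#!/usr/bin/env python3
"""Hypothetical cache-clear script with a hard-coded allow-list, so a typo
can never point it at the prod install. Paths here are illustrative."""
import shutil
import sys
from pathlib import Path

# Only these directories may ever be cleared; anything else is refused.
ALLOWED_CACHE_DIRS = {
    Path("/opt/app/deploy/tmp"),
    Path("/opt/app/deploy/cache"),
}

def clear_cache(target: str) -> None:
    path = Path(target).resolve()
    if path not in ALLOWED_CACHE_DIRS:
        sys.exit(f"Refusing to touch {path}: not an approved cache directory")
    # Remove the contents only, keeping the directory itself in place.
    for entry in path.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    print(f"Cleared {path}")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: clear_cache.py <cache-dir>")
    clear_cache(sys.argv[1])
```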
Well, first off, in a properly managed environment/team there's never a single point of failure... *ahem*... that being said...
The worst I ever did was lose a whole bunch of irreplaceable data because of... things. I can't go into detail on that one. I did have a backup plan for this kind of thing, but it was never implemented because my teammates thought it was a waste of time to cover for such a minuscule chance of a screw-up. I guess they didn't know me too well back then :)
"properly managed" is carrying a whole lotta weight in that first sentence.
I was purely talking in hypotheticals, I've never seen such a thing with my own eyes :)
A colleague upgraded glibc by copying it in via scp. Then we couldn't ssh in anymore. :) Not sure how important that server was. I think it was reinstalled soon-ish.
A little different:
I was the live FOH sound tech during a concert and hit the wrong button on a playback device (the song ran on backing tracks). I thought I was cueing up the next track for later in the concert, but I was on the live side. The director did a great job of pivoting, but boy was I red-faced.
Plugged a server in after it had been repaired; the person whose responsibility it was insisted it would be fine. They hadn't released the FSMO roles from it, its clock was an hour out, and it changed the time EVERYWHERE and broke ALL THE THINGS. Not technically my fault, but I should have pushed harder for them to demote it before I turned it back on.
Flushed the entire AD, not realizing I had somehow gotten back into prod.
Forgot to turn the commercial power back on after testing the battery backups... oopsie.
Two exhibitors, both alike in ~~dignity~~ naming. One needed a critical sw update on their Doremi to fix an issue. The other was running The Force Awakens to a packed auditorium.
Found out the hard way to triple-check your work when adding a new line to the proxy policy. Or, more accurately, adding 2 lines when you only planned one, where that second one defaulted to 'deny all' and ended up dropping all outbound web traffic for the company...
That made for a REAL tense meeting the next day after it got deployed and people started asking WTF happened...
Was troubleshooting a failed drive in a RAID array on a small-business DC/file server/print/everything-else box. The replacement drive still showed failed, so I moved it to another bay, thinking it was the slot and not the drive. Accidentally hit yes when asked to initialize the array. Blew the whole thing away. It was an OLD server the customer was working on replacing, so I told them it had finally given up the ghost and I was taking it back to the office to keep working on it. I had been on the job for about 4 months and thought for SURE I was fired. Turns out we were already working on moving them to the cloud, so it ended up not being a big deal.
Not software, but I once powered off an entire network node by accident. The power distribution was 48V DC, and the breaker panel in the rectifier had a retainer bar above the toggles to hold in the breakers. The toggles didn't resist being turned off particularly well, and after I unscrewed one side of the bar, the whole thing pivoted down, cleanly shutting off every single breaker in the row.
Accidentally announced a /12 of IPv6 on a bad copy-paste of a /127.
Started appending a verification line after interface configs to make sure I never missed a trailing character again.
Took 3 months for anyone to notice (circa 2015).
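Something like that verification line could be done with Python's ipaddress module; the minimum-length thresholds below are just illustrative, not what the commenter actually used:

```python
#!/usr/bin/env python3
"""Hypothetical sanity check for prefixes before they go into an interface
config: flag anything shorter than expected so a /127 pasted as a /12 gets
caught. Thresholds are made up for illustration."""
import ipaddress
import sys

# Anything shorter than these lengths is almost certainly a typo here.
MIN_PREFIXLEN = {4: 24, 6: 64}

def check_prefix(prefix: str) -> None:
    net = ipaddress.ip_network(prefix, strict=False)
    floor = MIN_PREFIXLEN[net.version]
    if net.prefixlen < floor:
        sys.exit(f"Suspicious prefix {net}: /{net.prefixlen} is shorter than /{floor}")
    print(f"{net} looks sane")

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        check_prefix(arg)
```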
Installed a flatpak app (can't remember which one, but it wasn't obscure or shady) and somehow it broke the file system on one of my main machines :) (at least I think that's what happened, because the machine started lagging, every app refused to launch, and after a reboot I got an fsck error or something like that)
Skipped the test environment and patched ERP prod directly because, you know, what could go wrong?
The vendor was......unsympathetic.
Set off cascading event bus loops that ran out of control. Friends don’t let friends allow events to spawn more events.
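One hedge against that, sketched here in Python with made-up names and a made-up limit, is a hop counter on each event so a chain of follow-up events can't run away:

```python
"""Toy sketch of a hop-count guard on an in-process event bus, so an event
chain can't fan out forever. Names and the limit are illustrative."""
from collections import defaultdict
from dataclasses import dataclass

MAX_HOPS = 5  # hard ceiling on how deep an event chain may go

@dataclass
class Event:
    name: str
    payload: dict
    hops: int = 0  # incremented each time a handler emits a follow-up event

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, name, handler):
        self.handlers[name].append(handler)

    def publish(self, event: Event):
        if event.hops > MAX_HOPS:
            raise RuntimeError(f"Event chain too deep at {event.name!r}; dropping")
        for handler in self.handlers[event.name]:
            # Handlers may return follow-up events; each one inherits hops + 1.
            for follow_up in handler(event) or []:
                follow_up.hops = event.hops + 1
                self.publish(follow_up)

if __name__ == "__main__":
    bus = EventBus()
    # A handler that re-emits the same event forever; the hop guard stops it.
    bus.subscribe("ping", lambda e: [Event("ping", e.payload)])
    try:
        bus.publish(Event("ping", {}))
    except RuntimeError as err:
        print(err)
```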
I seriously never had a major gaffe.
My buddy Donny, however, repartitioned and overwrote the wrong hard drive... destroying video that took in the neighborhood of 9,000 hours to render.
This was in ~~1996~~ 1997, so you can only imagine how devastating that was when our rendering farm was 10 machines with Pentium IIIs.
Seems trivial now when we have so much computing power at our fingertips, but 10 computers as a dedicated rendering farm was considered insane at that time.
@[email protected] In 1995 I worked at a company with several active web sites. Early days of the web, very important to the company. I was hired to take care of the hardware and software running the existing web sites and help in developing new ones.
One day I walked into my office, which had the production web server in it, carrying a Diet Coke (I was young and inexperienced). I opened the Diet Coke and it spewed an epic fountain right onto the production server. It was as if that server had a gravitational pull that drew all liquid towards it. I panicked and started unplugging every cable in sight, thinking this was better than risking a hardware-destroying short.
Needless to say, the web sites were down for a while. I believe I managed to save the hardware from myself, though.
Forgive me, but that's a figure of speech I've never heard before. What does it mean?
By breaking production, I'm referring to a situation where someone, most likely in a technical job, broke a system responsible for running some kind of service. Most of the responses here, which have been great to read, are about messing up things like software, databases, servers, and other hardware.
Stuff happens and we all make mistakes. It's what you take away from the experience that matters.