this post was submitted on 21 Oct 2024

Online service reliability is crucial in the digital age. Even robust systems can face unexpected outages, affecting various platforms. Let's explore the insights!

https://robotalp.com/blog/historys-major-downtimes-lessons-from-the-biggest-outages/

[–] somebodysomewhere 5 points 1 week ago

The reason for that is isolation and redundancy, though. Most incidents and outages are the result of a change, and in the cases you mentioned they are mitigated by the fact that not all instances receive updates at the same time. Presumably, the error is noticed in one place and traffic is then served by the healthy instances.
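To make the idea concrete, here is a rough sketch of a staggered rollout in Python. Everything in it (`Instance`, `staged_rollout`, `route`, the `is_healthy` callback) is a made-up name for illustration, not any real provider's tooling. It only shows the principle: update one instance at a time, stop at the first failed health check, and keep serving traffic from whatever is still healthy.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: str
    healthy: bool = True


def staged_rollout(instances, new_version, is_healthy):
    """Update instances one at a time; halt at the first failed health check.

    `is_healthy` is a hypothetical callback standing in for real monitoring.
    Returns True if every instance was updated, False if the rollout stopped.
    """
    for inst in instances:
        inst.version = new_version
        inst.healthy = is_healthy(inst)
        if not inst.healthy:
            # Stop here: the remaining instances stay on the old version.
            return False
    return True


def route(instances):
    """Return the names of instances that can still serve traffic."""
    return [inst.name for inst in instances if inst.healthy]
```

If the new version breaks the first instance, the rollout halts and `route` returns only the untouched instances, so the blast radius is one instance rather than the whole fleet.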

By all accounts, these are practices that significant service providers follow. In fact, AWS typically rolls out updates to us-east-1 before other regions, using it as a canary to catch issues early.

With federated services, this is less of a conscious decision and tends to happen only because instance maintainers update on different schedules.

Blue-green deployments and failover are common mitigation strategies, and mature organizations actively employ them. By contrast, in the fediverse and other distributed systems such as CDNs, these patterns are inherent in the decentralized design.
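For anyone unfamiliar with blue-green deployments, a minimal Python sketch of the switch looks something like this. `BlueGreenRouter` and the `smoke_test` callback are hypothetical names I made up for the example; in practice the "router" would be a load balancer or a DNS record.

```python
class BlueGreenRouter:
    """Toy model of blue-green deployment: two environments, one live."""

    def __init__(self, blue_version):
        self.environments = {"blue": blue_version, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        # The environment that is not currently serving traffic.
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version, smoke_test):
        """Install `version` on the idle side; switch only if it passes."""
        target = self.idle
        self.environments[target] = version
        if smoke_test(version):
            self.live = target
            return True
        # Failed check: live traffic never touched the new version.
        return False

    def rollback(self):
        """Point traffic back at the other environment (the previous release)."""
        self.live = self.idle
```

The point of the design is that the previous release stays installed on the other side, so a rollback is just flipping the pointer back.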