Sysadmins for sysadmins


Somebody has to do it

Photo: camilo jimenez, Unsplash

founded 2 years ago
1
4
Incantations (josvisser.substack.com)
submitted 4 days ago by [email protected] to c/[email protected]
 
 

Incantations

Metadata

Highlights

The problem with incantations is that you don’t understand the exact circumstances in which they work. Change the circumstances, and your incantations might work, might not work anymore, might do something else, or, maybe worse, might do lots of damage. It is not safe to rely on incantations; you need to move to understanding.

2
 
 

How much are your 9's worth?

Metadata

Highlights

All nines are not created equal. Most of the time I hear an extraordinarily high availability claim (anything above 99.9%) I immediately start thinking about how that number is calculated and wondering how realistic it is.

Human beings are funny, though. It turns out we respond pretty well to simplicity and order.

Having a single number to measure service health is a great way for humans to look at a table of historical availability and understand if service availability is getting better or worse. It’s also the best way to create accountability and measure behavior over time…

… as long as your measurement is reasonably accurate and not a vanity metric.
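To put the "nines" in these highlights into perspective, here is a minimal sketch that converts an availability target into a downtime budget. The 30-day window and the function name `allowed_downtime_minutes` are assumptions made for illustration; they do not come from the article.

```python
def allowed_downtime_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return how many minutes of full downtime a target allows in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_percent / 100)


# Compare a few common targets over an assumed 30-day window.
for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.1f} minutes of downtime per 30 days")
```

Each extra nine shrinks the budget by a factor of ten, which is why claims above 99.9% deserve a closer look at how they were measured.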

Cheat #1 - Measure the narrowest path possible.

This is the easiest way to cheat a 9’s metric. Many nines numbers I have seen are various versions of this cheat code. How can we create a narrow measurement path?

Cheat #2 - Lump everything into a single bucket.

Not all requests are created equal.

Cheat #3 - Don’t measure latency.

This is an availability metric we’re talking about here, why would we care about how long things take, as long as they are successful?!
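A rough sketch of what ignoring latency hides, under stated assumptions: the `Request` record, the sample data, and the 1000 ms threshold are all hypothetical and only illustrate the idea that a very slow success may not count as "available" from the customer's point of view.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # did the request return a successful response?
    latency_ms: float   # how long the request took


def availability(requests, max_latency_ms=None):
    """Fraction of good requests; optionally treat slow successes as failures."""
    good = sum(
        1
        for r in requests
        if r.ok and (max_latency_ms is None or r.latency_ms <= max_latency_ms)
    )
    return good / len(requests)


# Hypothetical sample: two fast successes, one very slow success, one failure.
sample = [Request(True, 120), Request(True, 95), Request(True, 8000), Request(False, 30)]

ignoring_latency = availability(sample)
with_latency = availability(sample, max_latency_ms=1000)
print(f"ignoring latency: {ignoring_latency:.0%}, with 1000 ms limit: {with_latency:.0%}")
```

The same traffic looks noticeably healthier when slow requests are counted as successes, which is exactly what makes this a convenient cheat.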

Cheat #4 - Measure total volume, not minutes.

Let’s get a little controversial.

In order to cheat the metric we want to choose the calculation that looks the best, since even though we might have been having a bad time for 3 hours (1 out of every 10 requests was failing), not every customer was impacted so it wouldn’t be “fair” to count that time against us.
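The 3-hour example above can be worked through numerically. This is only a sketch: the 30-day window, the flat 1000 requests per minute, and the rule that any minute with failures counts as a "bad" minute are assumptions chosen for illustration, not figures from the article.

```python
def request_based_availability(total_requests: int, failed_requests: int) -> float:
    """Availability measured by volume: share of requests that succeeded."""
    return 1 - failed_requests / total_requests


def time_based_availability(total_minutes: int, bad_minutes: int) -> float:
    """Availability measured by time: share of minutes that were healthy."""
    return 1 - bad_minutes / total_minutes


WINDOW_MINUTES = 30 * 24 * 60    # assumed 30-day measurement window
INCIDENT_MINUTES = 3 * 60        # the 3 bad hours from the example
REQUESTS_PER_MINUTE = 1000       # assumed flat traffic
FAILURE_RATIO = 0.10             # 1 out of every 10 requests failing

total_requests = WINDOW_MINUTES * REQUESTS_PER_MINUTE
failed_requests = round(INCIDENT_MINUTES * REQUESTS_PER_MINUTE * FAILURE_RATIO)

by_volume = request_based_availability(total_requests, failed_requests)
by_minutes = time_based_availability(WINDOW_MINUTES, INCIDENT_MINUTES)

print(f"by request volume: {by_volume:.3%}")
print(f"by bad minutes:    {by_minutes:.3%}")
```

Under these assumptions the volume-based number keeps three nines while the minute-based number drops to two, even though both describe the same incident.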

Building more specific models of customer paths is manual work. It takes engineering time and customization to build a model of customer behavior. Sometimes we just don’t have people with the time or specialization to do this, or it will cost too much to maintain in the future.

We don’t have data on all of the customer scenarios. In this case we just can’t measure enough to be sure what our availability is.

Sometimes we really don’t care (and neither do our customers). Some of the pages we build for our websites are… not very useful. Sometimes spending the time to measure (or fix) these scenarios just isn’t worth the effort. It’s important to focus on important scenarios for your customers and not waste engineering effort on things that aren’t very important (this is a very good way to create an ineffective availability effort at a company).

Mental shortcuts matter. No matter how much we try to educate people, it’s hard to change the perceptions of executives, engineers, etc. Sometimes it is better to pick the abstraction that helps people understand than to pick the most accurate one.

Data volume and data quality are important to measurement. If we don’t have a good idea of which errors are “okay” and which are not, or we just don’t have that much traffic, some of these measurements become almost useless (what is the SLO of a website with 3 requests? does it matter?).

What is your way of cheating nines? ;)

3
6
Composite SLO (blog.alexewerlof.com)
submitted 1 month ago by [email protected] to c/[email protected]
 
 

How to calculate SLO

4
 
 

cross-posted from: https://feddit.it/post/7752642

A week of downtime, and all the servers were recovered only because the customer had a proper disaster recovery protocol and kept backups somewhere else. Google had deleted the backups it held, too.

Google Cloud's CEO says "it won't happen anymore". It's insane that an "instant delete everything" is possible at all.

5
 
 

(again)

6
 
 
7
5
Do you run tableau (lemmy.horwood.cloud)
submitted 2 months ago by [email protected] to c/[email protected]
 
 

We run a bit of software called Tableau. I had to restart it overnight, and the server hit a load average of 113 on a 16-core box.

Please tell me that's mad for any software.

8
 
 

Good overview on how it works and why being compliant does not mean being secure.

9
 
 

Great article

10
 
 

What the title says - pros/cons

11
 
 

Interesting take - RIP Redis: How Garantia Data pulled off the biggest heist in open source history https://lnkd.in/ezme7dbw #redis #opensource

12
13
 
 

Old geezers ramble about IT

14
 
 

What about yours? What do you predict?

15
16
17
 
 

Distributed rate limiting

18
19
 
 

Everybody upgraded? Any horror stories?

20
5
SSH over HTTPS (trofi.github.io)
submitted 5 months ago by [email protected] to c/[email protected]
 
 

Sometimes I see this on "very secured" servers as well, so check the web configs, especially if you take over server management from somebody else ;)

21
 
 

Could have saved me tons of time (if I knew about it earlier)

22
 
 

Network CI/CD and automation is still quite rare, so it is nice to get any interesting article in that area.

23
24
 
 

A story

25
 
 

Nice resource
