(A few) Ops Lessons We All Learn The Hard Way -- a Twitter 🧵:
1. Email is the worst monitoring and alerting mechanism except for all the others.

2. Absence of a signal is itself a signal.
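
One way to act on lesson 2 is a dead man's switch: track when each source last checked in and alert on silence instead of waiting for an error. A minimal sketch, assuming hypothetical host names and an arbitrary five-minute threshold (neither is from the thread):

```python
import time


def stale_sources(last_seen, max_silence, now=None):
    """Return the names of sources that have been silent for too long."""
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_seen.items() if now - ts > max_silence)


# Hypothetical heartbeat timestamps, in seconds since the epoch.
last_seen = {"web01": 1000.0, "db01": 400.0, "cron-backup": 990.0}

# db01 last reported 610 seconds before 'now', well past the 300 second limit.
print(stale_sources(last_seen, max_silence=300, now=1010.0))  # ['db01']
```

The check itself has to run somewhere that is still alive, of course, which is its own absence-of-signal problem.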

3. The severity of an incident is measured by the number of rules broken in resolving it.
4. The mobile hotspot you're paying for so you can leave your house while you're oncall only works at home and in the office.

5. The only other person who knows how this works is also on vacation.
6. If a post-mortem follow-up task is not picked up within a week, it's unlikely to be completed at all.

7. That janky script you put together during the outage -- the one that uses expect(1) and 'ssh -t -t' -- now is the foundation of the entire team's toolchest.
8. NTP being off may not be a root cause, but it sure didn't help.

9. UTC or GTFO.
10. Your infrastructure uses a lot more self-signed certificates than you think. A lot more. In places that make you weep.
11. Self-signed certificates beget long lived certs, which beget lack of certificate validity monitoring, which begets curl -k, which begets a lack of certificate deployment automation, which begets self-signed certificates.
12. For any N applications, at most N/2+1 use the same certificate bundle.
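
The "lack of certificate validity monitoring" link in lesson 11's chain is the cheapest one to break. A rough sketch using only Python's standard `ssl` module; the function names and the example dates are made up for illustration, and `fetch_not_after` assumes the chain validates against the local trust bundle (so it will raise on exactly the self-signed certs lesson 10 is about, rather than doing the equivalent of `curl -k`):

```python
import socket
import ssl


def days_left(not_after, now):
    """Days until a certificate's notAfter time; 'now' is in epoch seconds."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host, port=443, timeout=5):
    """Return the notAfter string of a server's certificate (chain is verified)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with made-up dates, in the format getpeercert() returns.
now = ssl.cert_time_to_seconds("Dec  2 00:00:00 2029 GMT")
print(days_left("Jan  1 00:00:00 2030 GMT", now))  # 30.0
```

Feed `fetch_not_after(host)` into `days_left(..., time.time())` from cron, alert below some threshold, and you have at least started unwinding the cycle.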

13. The system you're troubleshooting doesn't use the same certificate bundle as the tool you're troubleshooting it with.
14. An API without a reference implementation and command-line client is called a gray box.

15. Restricted shells are not as restricted as you think.

16. Very few operations are truly idempotent.
17. "Asserting state" beats "monitoring for compliance" any day.
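
Lessons 16 and 17 in miniature: instead of "append this line" (not idempotent) or "alert if the line is missing" (monitoring for compliance), assert that the line is there and report whether anything had to change. This is a sketch only; `ensure_line`, the file name, and the line content are hypothetical, and it makes no attempt at locking or atomic writes, so per lesson 16 it is only idempotent as long as nobody else touches the file at the same time.

```python
import tempfile
from pathlib import Path


def ensure_line(path, line):
    """Make sure 'line' is present in the file; return True if a change was made."""
    p = Path(path)
    existing = p.read_text().splitlines() if p.exists() else []
    if line in existing:
        return False  # desired state already holds, nothing to do
    existing.append(line)
    p.write_text("\n".join(existing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hosts.allow"  # hypothetical file, for illustration
    first = ensure_line(target, "sshd: 192.0.2.0/24")
    second = ensure_line(target, "sshd: 192.0.2.0/24")
    content = target.read_text()

print(first, second)  # True False
print(repr(content))  # 'sshd: 192.0.2.0/24\n'
```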

18. One in a Million is next Tuesday.
docs.microsoft.com/en-us/archive/…
19. People give talks at conferences not to convince others that their work is awesome and totally worth the time and effort they put in, but themselves.
20. It's ok to use shell for complex stuff; it oftentimes is easier, faster, and still less of a mess than juggling libraries and dependencies.
21. There's nothing wrong with Perl.

22. Ok, we all at times keep adding $, {, }, and @ in random places trying to make things work, but still.
23. Serverless isn't.

24. Y38K is already here, it's just not evenly distributed.
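
For anybody who hasn't done the arithmetic behind lesson 24: a signed 32-bit count of seconds since the epoch runs out in January 2038. The sketch below mimics what a 32-bit `time_t` typically does one second later; `as_int32` is a helper made up for this illustration, not anything from a real library.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def as_int32(seconds):
    """Reinterpret the low 32 bits of 'seconds' as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", seconds & 0xFFFFFFFF))[0]


# The last second a signed 32-bit counter can represent.
last_good = EPOCH + timedelta(seconds=INT32_MAX)
print(last_good.isoformat())  # 2038-01-19T03:14:07+00:00

# One tick later, the counter wraps around to the most negative value.
wrapped = as_int32(INT32_MAX + 1)
print(wrapped)  # -2147483648
print((EPOCH + timedelta(seconds=wrapped)).isoformat())  # 1901-12-13T20:45:52+00:00
```

Anything that computes expiry dates 15+ years out (long-lived certs, say) hits this well before 2038, which is why it's "already here".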

25. If you determine "human error" as the root cause, then you're doing it wrong.
26. Your network team has a way into the network that your security team doesn't know about.

27. And don't even as much as mention the serial console and IPMI networks, but boy are you glad you have 'em.
28. Blocking TCP port 53 traffic leads to very strange failures. Don't.
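
Why lesson 28 bites: when a response doesn't fit into the UDP payload, the server sets the TC (truncated) bit and the client is expected to retry over TCP, where each message is prefixed with a two-byte length (RFC 1035, section 4.2.2). Block TCP/53 and only the large answers fail. A small sketch of the wire format; the function names and the query ID are arbitrary choices for illustration, and nothing here touches the network.

```python
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query (class IN, recursion desired) for 'name'."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message):
    """Prefix a DNS message with its length, as required on TCP connections."""
    return struct.pack("!H", len(message)) + message


def is_truncated(message):
    """True if the TC bit is set, i.e. the client should retry over TCP."""
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & 0x0200)


query = build_query("example.com")
print(len(query))                   # 29
print(frame_for_tcp(query)[:2])     # b'\x00\x1d'
print(is_truncated(query))          # False
```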

29. Somewhere in your infrastructure a service you didn't know uses DNS for endpoint discovery in a very surprising way.

30. Do 👏 Not 👏 Monkey 👏 Around 👏 With 👏 /etc/hosts.
31. If you break it, you own it - for now; if you fix it, you own it - forever.

32. Turning it off and on again is actually quite a reasonable way to fix many things.

33. A README.md in git is no substitute for a manual page that's shipped with your tool.
34. A search for a document you know exists will only turn up links to documents referencing but not actually linking to the one you're looking for.
35. The document you're looking for was marked as obsolete and not migrated to the new content management solution.

36. Sure, your current content management system sucks, but it's still better than the one you're moving to.
37. Nobody knows how git works; everybody simply rm -fr && git checkout's periodically.
38. There are very few network restrictions creative and determined use of ssh(1) port forwarding can't overcome.

39. This is both incredibly useful and concerning.
40. It is tempting to jump right into implementing a solution when the right thing may well be to not do the thing that requires the solution in the first place.

41. Turning things off permanently is surprisingly difficult.
42. "Ancient" is a very relative term when it comes to software and protocols.

43. "Obsolete" doesn't mean it's not in use and relied on.
44. The sets of systems online before and after a data center power outage only intersect. Some of the old systems coming online will immediately cause a different outage.
45. Some of your most critical services are kept alive by a handful of people whose job description does not mention those services at all.
46. After the initial "down for everybody or just me ermahgehrd Slack is down" drop, productivity increases linearly throughout the duration of the outage.
47. You're bound by the CAP theorem much more often than you may think. Halting Problem's a bitch, too.

48. Eventual consistency doesn't help when the system you're debugging hasn't converged yet.
49. The source you're looking at is not the code running in production.

50. strace(1)/ktrace(1) doesn't lie.

51. Unless somebody's been playing LD_PRELOAD games.
52. Schrödinger's Backup -- "The condition of any backup is unknown until a restore is attempted." -- is overly optimistic.

53. There's an xkcd for the precise situation you find yourself in. (There's also one for at least half of these.)
54. At some point in your career you will implement half of Kerberos. Poorly.
55. Any sufficiently successful product launch is indistinguishable from a DDoS; any sufficiently advanced user indistinguishable from an attacker.

56. Debugging any sufficiently complex open source product is indistinguishable from reverse engineering a black box.
57. "We've always done it this way." is not a good reason by itself, but there's bound to be one for why.

58. That reason may or may not be valid any longer, however.
59. A junior engineer asking "why" and pointing out the docs don't reflect reality is at least as valuable as the senior engineer working blindly off tribal knowledge.
60. Your herculean efforts to upgrade the OS across your entire fleet completed just in time for the EOL announcement of the version you upgraded to.

61. This phenomenon was first described in Dante's Inferno as the Ninth Circle of Hell, Ring Four, aka RedHat Canto XXXIV.
62. Containers create at least as many problems as they solve.

63. The most ninja move by the expert you hired for that third-party black box product you rely on is to say "Let me ping the support team".
64. Somewhere, somebody ran into this exact problem, but they never bothered to post a solution.

65. That completely automated solution you set up requires at least three manual steps you didn't document.
66. CAPEX budget always increases, OPEX budget always decreases.

67. CAPEX costs can be reasonably estimated, OPEX costs can only be ballparked.
68. Doubling your time estimate in the hopes of beating expectations won't work because your manager takes your estimate, has a hearty laugh, and then resets it back to what they already promised upchain.
69. Your quarterly planning means bubkes when the next re-org rolls around.

70. Most of your actual work is not covered by your OKRs.
71. Recursively applying the Pareto Principle is a surprisingly accurate way to gauge your low hanging fruit, determine your high impact objectives, and ballpark your required effort.

72. Although, to be honest, it only works in about 80% of cases.
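
The arithmetic behind lesson 71, taking the usual "20% of the effort yields 80% of the result" reading of the Pareto Principle at face value and applying it to itself. The `pareto` helper is just for illustration.

```python
def pareto(levels):
    """Return (effort, impact) fractions after applying 80/20 'levels' times."""
    return [(0.2**n, 0.8**n) for n in range(1, levels + 1)]


for effort, impact in pareto(3):
    print(f"{effort:.1%} of the effort -> {impact:.1%} of the impact")

# 20.0% of the effort -> 80.0% of the impact
# 4.0% of the effort -> 64.0% of the impact
# 0.8% of the effort -> 51.2% of the impact
```

So under that assumption, under 1% of the work gets you about half the benefit. In about 80% of cases.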
73. Management will always happily spend $$$ on outside consultants to tell them what you've been saying for years.

74. Management will much rather invest in inventing a new, square wheel than fixing an old round one.
75. In any organization practicing continuous integration, half of all commits are to fake out CI tests.
76. Good software development practices do not always translate well to ops and friends.

77. Mandatory code reviews do not automatically improve code quality nor reduce the frequency of incidents.
78. Every new paradigm tends to mostly add layers of abstractions; cutting through them and identifying what basic principles continue to apply is half the battle.
79. Real change can only be implemented above layer 7.
80. "Prod" is just another name for "staging".

81. Your source of truth lies.

82. Also: it's incomplete.
83. pcap or it didn't happen.
84. grep(1) > Splunk

(there, I said it)
85. Multithreading is rarely worth the added complexity.

86. Parallelism is not Concurrency.

87. Simplicity is King.
...and finally...

88. Nobody knows what exactly it is you do.
This thread as a single HTML page:
netmeister.org/blog/ops-lesso…

Peace out, nerds!

Keep Current with Jan Schaumann (@jschauma@mstdn.social)