Discover and read the best of Twitter Threads about #SREcon

Most recent (24)

Closing off #SREcon EMEA 2019 is @HeidyKhlaaf on formal verification!
@HeidyKhlaaf What is formal verification? Establishing whether a system satisfies its requirements/properties using maths.

We create a formal mathematical model, write the formal specification, then check the model against the spec. We get prioritized findings out of it. #SREcon
Proofs can be really expensive to write, and they're not very applicable to our practical domains.

What can we do instead? Use the "smart device" analysis framework from nuclear safety.

Safety-critical systems have the potential to cause serious injury/harm. #SREcon
Read 22 tweets
Andrey Falko of Lyft begins the closing keynotes at #SREcon Europe by talking about applying the theory of fault tree analysis to Kafka.
This is a story of applying theory to practice, and getting real actionable insights that advanced his career and improved his service. #SREcon
Why do people use Kafka? They use it as a data bus to move data without losing it. It avoids the problem of making and shoring up point-to-point connections between producers and consumers. #SREcon
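[ed: a minimal sketch of the arithmetic behind fault tree analysis, not taken from the talk. It assumes independent basic events, and the event names and probabilities below are made up purely for illustration. An OR gate fires if any input event happens; an AND gate fires only if all of them do.]

```python
from math import prod

def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    return 1 - prod(1 - p for p in probabilities)

def and_gate(probabilities):
    """Probability that all independent input events occur together."""
    return prod(probabilities)

# Hypothetical basic events for a "message is lost" top event.
broker_down = 0.01               # assumed chance that one broker is unavailable
producer_misconfigured = 0.001   # assumed chance of a bad producer setting

# Losing data via brokers requires all three replicas to be down at once.
all_replicas_lost = and_gate([broker_down] * 3)

# The top event happens if either branch of the tree happens.
data_loss = or_gate([all_replicas_lost, producer_misconfigured])
```

With these invented numbers the single misconfiguration branch dominates the triple-replica branch, which is the kind of prioritization a fault tree is meant to surface.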
Read 23 tweets
.@ahidalgosre and @ahl91 are talking about how to SRE when everything's on fire...

ahidalgo was oncall and didn't get to finish watching his game, because there was a recurring, long-running production outage :( :( #SREcon
@ahidalgosre @ahl91 This is a case study on how they moved from 85% reliability to a trusted and documented 99.9%, as an o11y-focused SRE team. #SREcon
Nothing here is particularly new. Alarm fatigue is a well-studied issue. They needed to reduce alerting noise, do SLOs and SLIs, improve o11y, improve automation, and conduct retrospectives.

They used to use ELK to understand their service. #SREcon
Read 28 tweets
Naoman Abbas of Pinterest on the requirements for good o11y tooling: reliability, ease of use, and automation. #SREcon
1B metrics per minute processed by Pinterest's homegrown tooling [ed: is that timeseries, or datapoints?]

Went from Graphite to OpenTSDB to sharded OpenTSDB to Goku/Gorilla for metrics storage. Tried Storm/Spark Streaming, but now on job stream (homegrown). #SREcon
To reduce data storage costs, they reduce the cardinality of their data, smashing together data from hosts into taskless data... [ed: insert my marble smashing/grinding graphic here...] #SREcon
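[ed: a rough sketch of what dropping a high-cardinality tag can look like. This is not Pinterest's implementation; the `drop_tags` helper, the tag names, and the datapoints are all hypothetical. Series that become identical once the `host` tag is removed are summed into one.]

```python
from collections import defaultdict

def drop_tags(datapoints, tags_to_drop):
    """Sum datapoints whose tag sets are identical once the given tags are removed."""
    merged = defaultdict(float)
    for metric, tags, value in datapoints:
        # Keep only the remaining tags, in a stable order, as part of the series key.
        kept = tuple(sorted((k, v) for k, v in tags.items() if k not in tags_to_drop))
        merged[(metric, kept)] += value
    return dict(merged)

# Hypothetical per-host datapoints: (metric name, tags, value).
datapoints = [
    ("requests", {"service": "feed", "host": "h1"}, 10.0),
    ("requests", {"service": "feed", "host": "h2"}, 15.0),
    ("requests", {"service": "search", "host": "h3"}, 7.0),
]

# Three per-host series collapse into two per-service series.
rolled_up = drop_tags(datapoints, {"host"})
```

The trade-off is that per-host questions can no longer be answered from the rolled-up data.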
Read 16 tweets
Why automating everything adds to your toil, Colin Thorne/Cam McAllister from IBM

#SREcon
Things we've found over the years, when stuff breaks: "just automate it".

We're both software engineers by trade. There's an assumption that automating things means that you don't need to touch things again.

Some years into the project, we found there was more toil.

#SREcon
Definitions: toil. Shout out to @googlesre book.

Gets in the way of making progress.

Repetitive manual tasks, incidents, tickets.

Reduce toil. Project improvement work should add features or reduce future toil.

#SREcon
Read 30 tweets
Segueing nicely from my talk is @TheEvDev on tracing real-time distributed systems! #SREcon
@TheEvDev Bloomberg is fundamentally a data platform that gets realtime information into the hands of finance professionals. 5k+ engineers, datacenters across the world.

100B market transactions processed daily. #SREcon
So tracing can give you the waterfall view, and a correlation of all the items of work done across services to perform an end-user request. #SREcon
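[ed: a small sketch of how a waterfall view can be derived from spans, assuming each span records a trace ID, its own ID, and its parent's ID. The `Span` fields, service names, and timings are hypothetical and not Bloomberg's data model.]

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace
    service: str
    start_ms: int
    duration_ms: int

def waterfall(spans, trace_id):
    """Return indented text lines for one trace, parents before children."""
    children = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    lines = []

    def visit(parent_id, depth):
        # Siblings are listed in the order they started.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.service} +{span.start_ms}ms ({span.duration_ms}ms)")
            visit(span.span_id, depth + 1)

    visit(None, 0)
    return lines

# Hypothetical spans from two unrelated requests.
spans = [
    Span("t1", "a", None, "gateway", 0, 120),
    Span("t1", "c", "a", "pricing", 40, 60),
    Span("t1", "b", "a", "auth", 5, 20),
    Span("t2", "z", None, "gateway", 0, 10),
]
lines = waterfall(spans, "t1")
```

The shared trace ID is what correlates work done in different services back to one end-user request.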
Read 17 tweets
Finally in this block, we have @isitvegan on being a solo SRE! #SREcon
@isitvegan 2015 was the worst year of @isitvegan's professional career. Why? Earlier, he'd joined a fast-growing company that might be called... "LoudClock".

And the application kept getting more and more complex. They rearchitected the system in 2014, and tried to go DevOps. #SREcon
His team would come to him and complain. "I don't know if my feature is working." "I can't tell if it's released." A huge lack of visibility into production.

Outages were even worse. They were being misrouted, noisy, chaotic...

So they read the Google SRE book... #SREcon
Read 13 tweets
Next is @matthewhuxtable on starting an SRE journey in a smaller company, and growing a team from 0 to 1 to more than 1. #SREcon
@matthewhuxtable There's so much work that you have to do with very constrained resources -- and juggling oncall isn't easy when your team is less than 8 people.

We're influenced by so many different other fields in SRE -- e.g. safety engineering.

Here are some of the common challenges: #SREcon
Software engineering has huge leverage. Complexity catches us by surprise. We add new problems as we solve older ones.

Risk management vs velocity is a difficult messaging problem. We need to build adaptive capacity to cope with risk, rather than eliminating it. #SREcon
Read 11 tweets
In track 2 of #SREcon this afternoon, we have @gheenghis of Unity talking about how they built SRE the right way, for the right reasons, rather than having unrealistic expectations.
@Gheenghis Why does Unity3D need an SRE team? Don't they sell packaged software libraries?

Well, they offer ad monetization as a service, for which they need an SRE team to improve the reliability thereof. #SREcon
Big companies like to rename and shame Ops to SRE. Why? checkbox/buzzword driven adoption.

How difficult can it be if you have @srebook already written as a blueprint/manual?

But, the company already _has_ people doing SRE even if it's not called that officially. #SREcon
Read 8 tweets
Next is @molly_struve on building scalable monitoring systems! #SREcon
@molly_struve And remember that we need to not just think about ourselves as SREs, but everyone else on our teams.

So, how did they get into this mess? #SREcon
Well, they bought or built... all of the services. New Relic, Honeybadger, Pagerduty, background cron jobs, homegrown admin dashboards, Elastic, Slack, text messages, email, phone call...

And the alerts were inconsistent and noisy, including no-op alerts and false positives. #SREcon
Read 9 tweets
Now @heinrichhartman is telling us about doing latency SLOs right, and drilling down on the difficulty of getting granular data about request performance over longer time periods. #SREcon
@heinrichhartman and trying to drill down into _how_ you actually measure SLOs as per the practices from the SLO workshop from last year's #SREcon Europe

but you can't average percentiles in order to know whether your SLO was met.
Not every time period is equal in terms of query volume; you can't just operate on percentage of time windows in which the p99 was sufficiently good.

The count matters. If you had a brief huge latency spike _and_ a spike of queries... then you probably blew your SLA. #SREcon
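[ed: a toy illustration of the point about counts, using invented latencies and a hypothetical 100ms threshold. Nine quiet windows look fine and one busy window is bad, so a per-window view and an averaged p99 both look much healthier than the per-request reality.]

```python
def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q% of n
    return ordered[rank - 1]

THRESHOLD_MS = 100  # hypothetical latency target

# Nine quiet windows of 100 fast requests, then one window with a traffic
# spike in which half of 9000 requests are slow.
quiet = [50] * 100
busy = [50] * 4500 + [900] * 4500
windows = [quiet] * 9 + [busy]

per_window_p99 = [percentile(w, 99) for w in windows]

# Misleading: averaging percentiles treats every window as equally important.
average_of_p99s = sum(per_window_p99) / len(per_window_p99)

# Also misleading: the share of windows whose p99 was good ignores volume.
window_compliance = sum(p <= THRESHOLD_MS for p in per_window_p99) / len(windows)

# What users experienced: the p99 and good-request ratio over all requests.
all_requests = [ms for w in windows for ms in w]
true_p99 = percentile(all_requests, 99)
request_compliance = sum(ms <= THRESHOLD_MS for ms in all_requests) / len(all_requests)
```

Here 90% of windows are "good", yet only about 55% of requests met the threshold, because the one bad window carried most of the traffic.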
Read 11 tweets
The second opening keynote of #SREcon is Prof. Nancy Leveson of MIT, on taking systems approaches to safety and security.

She got her start in safety as a freshly minted PhD by being asked to work on systems analysis of software-controlled torpedoes.
It turns out that a system that always disables itself is "safe", but doesn't actually accomplish the goal!

Is the right long-term solution turning off enough safety devices until it "works as intended"? Hmmmm. #SREcon
An accident results in a loss, which could be harm to people, property damage, environmental harm, negative business impact, or loss of the mission.

Regardless of how it happens.

Hazard = set of conditions that, in a worst-case environment, would result in a loss. #SREcon
Read 31 tweets
.@aknin is exhorting us to be precise when we talk about reliability, rather than just saying "three nines". "Three nines of what?" #SREcon
@aknin We need to approach reliability with an engineering and optimization perspective - revenue, cost, or risk are tradeoffs we can make and measure. #SREcon
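[ed: a quick sketch of why "three nines of what?" matters. The same 99.9% target gives a very different budget depending on the unit it is measured in. The 30-day period and the request volume are assumptions for illustration, not figures from the talk.]

```python
def error_budget(slo, total):
    """Amount of `total` (minutes, requests, ...) allowed to be bad under `slo`."""
    return (1 - slo) * total

SLO = 0.999  # "three nines"

# Three nines of time: minutes of downtime allowed in a 30-day period.
minutes_in_30_days = 30 * 24 * 60
allowed_bad_minutes = error_budget(SLO, minutes_in_30_days)

# Three nines of requests: failed requests allowed at an assumed traffic volume.
requests_in_30_days = 10_000_000  # hypothetical
allowed_bad_requests = error_budget(SLO, requests_in_30_days)
```

Under these assumptions the time-based budget is roughly 43 minutes, while the request-based budget is about ten thousand failed requests, and a short outage at peak traffic can burn through the second much faster than the first.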
Where's the k8s and the code? Well, here are some graphs...

Be wary of false tradeoffs. People assume that in order to get more reliability, we have to work harder and be oncall more.

But we can change the shape of the curve, not just move along it. #SREcon
Read 21 tweets
The dreaded "exactly as many days as one is staying" business entry stamp from Irish authorities, despite still having most of the automatic 90 days in the fellow CTA country, the UK, which I just embarked from. Also, hi Ireland, hi #SREcon! --lizf
[Image: Sleepyish lizzes on a plane with messed up lipstick] [Image: Border stamp good for 8 days with the (B) label for business]
There's this weird racism where white folks I travel with get a full 90 day stamp, but I only get 24 hours, N days, or (much more rarely) 30 days, handwritten into my US passport.

It happens every time I enter Ireland and it's kind of silly.
But I count my blessings that I have a US passport, unlike other brown folks who don't and have to queue for visas at embassies, etc.
Read 3 tweets
First up today is @adrianco on innovation culture! #DevOpsDays
@adrianco .@adrianco talks to many companies about innovation. "So far, nobody has said they want to slow down their innovation rate..." #DevOpsDays
In the old world of IT, there was nothing that directly touched customers; your only clients were your employees, factories, marketing department, etc.

But in the new world, we interact *directly* with customers as well as our employees. Everything is just in time. #DevOpsDays
Read 34 tweets
I'm excited to announce several talks:
(1) I'll be talking about how to cut your observability bills with a bit of statistical magic at #SREcon EMEA in October! usenix.org/conference/sre…
(2) and about production excellence at @blamelesshq's August summit! eventbrite.com/e/blameless-su…
@blamelesshq (3) also, on Thursday this week, I'll be speaking on @honeycombio's journey with infrastructure-as-code and how @HashiCorp Terraform Enterprise has helped us safely refactor our AWS environment for cost efficiency & maintainability. meetup.com/South-Bay-Area…
(4) I'm also delighted to announce that in October I'll be speaking at Velocity Berlin, bringing the Production Excellence talk to European attendees, alongside a new @opentelemetry workshop I'm collaborating across company boundaries to create. conferences.oreilly.com/velocity/vl-eu…
Read 6 tweets
Final speaker of #SREcon: @deniseyu21 on why distributed systems are so hard.
@deniseyu21 She's an engineer on PCF, and also an avid artist and live-doodler of talks.

Four things for today: a brief history lesson on distributed systems, the CAP theorem, why network partitions are hard/omnipresent, and how we can mitigate these risks. #SREcon
Once upon a time, everyone used a singular database that IT maintained.

But eventually IT became a business enabler we had to invest more in. Business analysts wanted to ask more complex questions, ML/NLP came along, and our requirements increased in complexity... #SREcon
Read 18 tweets
Next up: Pragmatic Automation, by @mluebbe #SREcon
"Automate yourself out of a job"

What if you're asked to automate something you have no idea how to do?
#SREcon
Backstory: Google has cloud regions. The original four:

- us-central1
- us-east1
- europe-west1
- asia-east1

Then we decided to build more. How?

$ ./build_new_region.sh

#SREcon
Read 29 tweets
#srecon @randyshoup: Learning from Learnings: Anatomy of Three Incidents
#srecon @randyshoup: Outage 1: Google App Engine Outage. App Engine was down globally for 8 hours. The playbook failed and triggered a cascading failure.
#srecon @randyshoup: Resolutions: increased traffic routing capacity, but more importantly, created a program to reduce probability of the same problem happening again.
Read 28 tweets
Next up: Learning from Learnings: Anatomy of Three Incidents by @randyshoup #SREcon
Review three incidents from different companies with common themes, and then discuss what we can change to improve post incident response.
#SREcon
Read 33 tweets
Last day of #SRECon! Track One is kicking off with "Optimizing for Learning" by Logan McDonald (@_loganmcdonald) 🎉
Fun fact: all the art for @_loganmcdonald’s #SRECon talk was done by @emilywithcurls
#SRECon @_loganmcdonald
"expert intuition is achievable"
This helped Logan onboard to new systems
Read 16 tweets
Running excellent retrospectives: talking for humans
#SREcon
Goal of this tutorial:
- Learn how to run a retrospective.
- Create a safe space

Job running a retro:
- Facilitation
- Having a productive conversation
- Don't make bad jokes.

#SREcon
Facilitation: a.k.a. creating psychological safety, servant leadership.

Let's talk about language. English is blame-y. "you". Starting with "you" creates a line between participants.
#SREcon
Read 42 tweets
Next up is “An Introduction to GraphQL” with @icco from @Google #SRECon
What is GraphQL?
Now let’s make it more real
Read 11 tweets
Track 1 today @ #SREcon :

SRE Classroom - How to Design a Distributed System in 3 Hours

Ryan Thomas, JC van Winkel, Phillip Tischler, and Jennifer Mace, Google
Requirements:

Identify SLIs and SLOs
▪️Data freshness
▪️Availability
▪️Latency

Sample SLO: the 99th percentile of queries returns a valid result within 100ms

#SREcon #SREclassroom
One way to scale is via microservices.

#SREcon #SREclassroom
Read 24 tweets
