- Culture: mutual trust, willingness to learn and continuous improvement, constant flow of information, open-mindedness to changes and experimentation
- Automation: deployment pipelines (CI/CD), comprehensive test automation
- Lean: minimize WIP state, shorten and amplify feedback loops, look for opportunities to remove waste, fix errors as they are discovered
- Measurement: monitoring, system metrics, KPIs
- Sharing: sharing knowledge & practices, including successes & failures, learn from each other's experiences, proactively communicate, shadowing & pairing on tasks
The Utilization Saturation and Errors (USE) Method is a methodology for analyzing the performance of any system. It directs the construction of a checklist, which for server analysis can be used for quickly identifying resource bottlenecks or errors. It begins by posing questions, and then seeks answers, instead of beginning with given metrics (partial answers) and trying to work backwards.
The USE Method can be summarized as:
For every resource, check utilization, saturation, and errors.
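As a minimal sketch of that one-line summary, the checklist can be written as a loop over resources that asks the three questions in turn. The `ResourceSample` type, the helper names, and both thresholds are hypothetical choices for illustration, not part of the USE Method itself; the numbers would come from whatever monitoring is already in place.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource (hypothetical structure for this sketch)."""
    name: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: float   # amount of queued or waiting work, e.g. run-queue length
    errors: int         # count of error events


# Assumed thresholds, chosen only for illustration.
UTILIZATION_LIMIT = 0.70
SATURATION_LIMIT = 0.0


def use_check(sample: ResourceSample) -> list[str]:
    """Return the findings for one resource, checking errors first."""
    findings = []
    if sample.errors > 0:
        findings.append(f"errors: {sample.errors} error events")
    if sample.saturation > SATURATION_LIMIT:
        findings.append(f"saturation: {sample.saturation} units of queued work")
    if sample.utilization >= UTILIZATION_LIMIT:
        findings.append(f"utilization: {sample.utilization:.0%} busy")
    return findings


def use_checklist(samples: list[ResourceSample]) -> dict[str, list[str]]:
    """Apply the three questions to every resource and collect the findings."""
    return {sample.name: use_check(sample) for sample in samples}


if __name__ == "__main__":
    report = use_checklist([
        ResourceSample("cpu", utilization=0.95, saturation=3.0, errors=0),
        ResourceSample("disk", utilization=0.10, saturation=0.0, errors=0),
        ResourceSample("network", utilization=0.20, saturation=0.0, errors=4),
    ])
    for name, findings in report.items():
        print(name, findings or "nothing to report")
```

The point of the structure is that the list of resources drives the questions, rather than starting from whichever metrics happen to be available.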
On the same website:
tl;dr:
- Start using [your software] in production in a non-critical capacity (by sending a small percentage of traffic to it, on a less critical service, etc.)
- Try to have each incident only once
- Understand what is ok to break and what isn’t
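The first point, sending a small percentage of traffic to the new software, can be sketched as a deterministic split on a request identifier. This is only an illustration under assumptions: the function name, the use of a request ID as the key, and the bucket count are all hypothetical, and in practice the split usually lives in a load balancer or proxy rather than in application code.

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: float) -> bool:
    """Decide whether a request goes to the new software.

    Hashing the ID keeps the decision stable: the same request_id always
    lands on the same side for a given canary_percent.
    """
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets so percentages down to 0.01 work.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100


if __name__ == "__main__":
    ids = [f"request-{i}" for i in range(10_000)]
    share = sum(routes_to_canary(i, 5.0) for i in ids) / len(ids)
    print(f"share of traffic sent to the canary: {share:.1%}")
```

Raising `canary_percent` gradually only ever adds IDs to the canary side, so a request that was already routed there stays there as the rollout widens.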
For example, here is what is and isn't ok to break with Kubernetes:
- ok to break:
  - any stateless control plane component can crash or be cycled out or go down for 5 minutes at any time
  - Kubernetes networking can break as much as it wants because we decided not to use it to start
- not ok to break:
  - for us, if etcd goes down for 10 minutes, that’s ok
  - containers not starting or crashing on startup
  - containers not having access to the resources they need
  - pods being terminated unexpectedly by Kubernetes
With Envoy, the breakdown is pretty different:
Not very insightful, but I'm retaining some quotes:
But IT operations includes much more than the limited “ops” functions we typically fold into a DevOps team. I’m talking about things like ticket management, incident handling, user management and authorization, backups and recovery, network management, security operations, infrastructure procurement and cost optimization, compliance reporting, and much more. In today’s IT organization, where do these responsibilities fall?
You want DevOps teams to have a streamlined, low lead-time, lean pipeline to production. Devoting team capacity to this broader set of operational functions may slow down this pipeline. There are also efficiencies to be gained by sharing these practices across the work of all the DevOps teams.
All of this is to say that a portion of IT operations still exists independently of the DevOps teams, performing those “ops” functions that are not in “DevOps” while the DevOps teams focus on that subset of ops functions specifically related to deploying code and responding to code-related incidents.
"Ansible is simple, which is a major strength", it "works by connecting to a server using SSH, copies the Python code over, executes it and then removes itself".
"Ansible Tower is the Enterprise version, it turns the command line Ansible into a service, with a web interface, scheduler and notification system."
"you can’t have long-running tasks."
"StackStorm is designed as a highly-configurable if-this-then-that service. it can react to events and then run a simple command or a complex workflow."
"MongoDB can be scaled using well-documented patterns." "StackStorm extensibility system is a key strength." "If StackStorm were a programming language, it would be strongly typed."
"Salt was born as a distributed remote execution system used to execute commands and query data on remote nodes."
"Ultra high-performance for large deployments." (LinkedIn uses it)
We take a look at Etsy's blameless postmortems in terms of philosophy, process, and practical measures and guidance to avoid blame and better prepare for the next outage. Because failures are inevitable in complex socio-technical systems, it’s the failure handling and resolution that can be improved by learning from postmortems.
I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time …
In 1943 the psychologist Abraham Maslow proposed the concept of a 'hierarchy of needs' to describe human motivation. Most often portrayed as a pyramid, with the more fundamental needs occupying the largest space in the bottom layers, his theory states that only in the fulfilment of the lower-level needs can one hope to progress to…
Lately the job boards have been filled with ads that look something like this:
Seeking Senior DevOps Engineer
- Must be able to debug all databases created since 1980
- Be a core contributor to at least 10 open source projects
- Have experience with Go, Java, Python, Ruby, and C#
- Understand the kernel and be able to debug panics at 3AM
Configuration management is an essential ingredient in creating high performance IT. But how you implement it matters. In this talk Jez will present the princi…
Form a better understanding of DevOps and your delivery ecosystem through following the DevOps Checklist by Steve Pereira - design courtesy of Aaron Legaspi and Amit Jakhu of Myplanet.
If an IT management idea could ever be said to be “exciting”, then perhaps DevOps is a strong candidate. Its basic idea is to replace the traditional separation of “Development” and “IT” or “Operations” with a single function. This restructure has laudable goals; no more developers having to commission and wait for test and production…
500 IT Professionals were surveyed, now see for yourself how on-call is getting better and what pains still need solving...
CodeCraft is an online publication about Technology and Software Craftsmanship by @VaamoTech.
How to be Good at Operations