tl;dr:
- Start using [your software] in production in a non-critical capacity (e.g. by sending it a small percentage of traffic, or by running it on a less critical service)
- Try to have each incident only once
- Understand what is ok to break and what isn’t
For example, with Kubernetes:
- ok to break:
  - any stateless control plane component can crash, be cycled out, or go down for 5 minutes at any time
  - Kubernetes networking can break as much as it wants, because we decided not to use it to start
- not ok to break:
  - etcd being down for an extended period (for us, if etcd goes down for 10 minutes, that’s ok)
  - containers not starting or crashing on startup
  - containers not having access to the resources they need
  - pods being terminated unexpectedly by Kubernetes
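
One way to make a breakdown like this actionable is to write it down as data that an alerting rule or an incident review can check against. The following is only a minimal sketch: the component names, the `Tolerance` type, and `is_ok_to_break` are hypothetical, and treating an etcd outage longer than 10 minutes as not ok is an assumption read into the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tolerance:
    # Longest outage (in minutes) that is still considered ok.
    # None means "can break as much as it wants".
    max_downtime_minutes: Optional[float]


# Hypothetical component names; the limits mirror the Kubernetes list above.
TOLERANCES = {
    "stateless-control-plane": Tolerance(5),
    "kubernetes-networking": Tolerance(None),  # not in use yet
    "etcd": Tolerance(10),  # assumption: anything longer is not ok
    "container-startup": Tolerance(0),
    "container-resources": Tolerance(0),
    "unexpected-pod-termination": Tolerance(0),
}


def is_ok_to_break(component: str, downtime_minutes: float) -> bool:
    """Return True if an outage of this length is within the stated tolerance."""
    tolerance = TOLERANCES[component]
    if tolerance.max_downtime_minutes is None:
        return True
    return downtime_minutes <= tolerance.max_downtime_minutes


if __name__ == "__main__":
    print(is_ok_to_break("stateless-control-plane", 5))
    print(is_ok_to_break("container-startup", 1))
```

Keeping the tolerances in one table makes it easy to see, per component, where an outage becomes an incident, and a second table for another system can sit next to it for comparison.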
With Envoy, the breakdown is pretty different: