Chaos Engineering is the idea that something *will* go wrong, so let’s practice what happens when something goes wrong and make sure it all recovers.
If every component in your stack has a 99.99% uptime SLA, then you should expect almost an hour of downtime (roughly 53 minutes) for each component each year. But as you add more components, it becomes more and more likely that one component's downtime will coincide with another component's downtime. By their nature, cloud-native apps and k8s apps have more components than monoliths, which is why you see Chaos Engineering practiced there most often.
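To sanity-check that figure, here is a minimal Python sketch that converts an uptime SLA into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration, not part of any particular tool.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year (8,760 hours)


def allowed_downtime_minutes(uptime_sla: float) -> float:
    """Minutes per year a component can be down and still meet its SLA."""
    return (1.0 - uptime_sla) * HOURS_PER_YEAR * 60


# Compare a few common SLA levels.
for sla in (0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla:.3%} uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99% this comes out just under an hour, which is where the "almost an hour" above comes from.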
The math works like the birthday problem. If each of your components has a 1 hour window where it'll break each year, and you treat the year as 8,760 one-hour slots, you only need a little over 100 components for there to be a 50% chance that two of them break in the same hour in any given year. So don't wait for it to happen; test various combinations of failures.
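The overlap estimate can be reproduced with a short birthday-problem calculation. This is only a sketch under simplifying assumptions: every component has exactly one outage per year, that outage falls in one of 8,760 hour-long slots chosen uniformly at random, and components fail independently of each other. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 one-hour slots in a non-leap year


def overlap_probability(n_components: int, slots: int = HOURS_PER_YEAR) -> float:
    """Chance that at least two components have their outage in the same slot."""
    p_no_overlap = 1.0
    for i in range(n_components):
        # The (i+1)-th component must avoid the i slots already taken.
        p_no_overlap *= (slots - i) / slots
    return 1.0 - p_no_overlap


def components_needed(target: float, slots: int = HOURS_PER_YEAR) -> int:
    """Smallest component count whose overlap probability reaches `target`."""
    n = 1
    while overlap_probability(n, slots) < target:
        n += 1
    return n


print("components for a 50% chance of overlap:", components_needed(0.5))
print("overlap chance with 200 components:", round(overlap_probability(200), 3))
```

Real outages are not uniform or independent (a shared dependency can take several components down together), so treat this as a lower bound on how often failures coincide rather than a prediction.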
Chaos Engineering isn't limited to k8s clusters or cloud-native apps. You can do it against your own personal cloud in a VM cluster just fine.