First an example:
Classical testing for the Titanic is doing calculations on how to react to icebergs, running a simulated iceberg test in the dry dock, training all the staff, etc.
Chaos engineering says, “Let’s have a big tug boat intentionally tow icebergs into the path of the Titanic on every voyage it takes.”
“What if they fail to handle the iceberg and sink and kill everyone on board?”
“Well, then you’ll have learned some valuable lessons to build your next boat and train your next crew.”
Testing software and hardware often looks like this:
– Test Everything. Test all failures, test all successes, test all weirdness before pushing it live.
– Trust no one. Don’t trust developers (QA is done by a separate department). QA shouldn’t even trust QA.
– You can only trust devs to work on things that they know will happen.
– No matter how well staging matches production, staging will not match production.
– QA are only guaranteed to find additional problems they can think of.
– Devops is the last and hardest area for all of the above. QA don’t have the expertise to predict infrastructure failures, developers don’t expect those failures, and developers/devops can’t be trusted to build the redundancy features properly, per the “trust no one” mantra.
So what if devops use their expertise, think of all the redundancy they want, and then write malicious pieces of code that intentionally break production? Not just in staging or a production-like environment: the program runs all the time, trying to take down production and causing REAL outages.
Now those failures fall under “You can only trust devs to work on things that they know will happen.” Nowhere to hide. They also fall under “devops had better have prepared for this and know how to fix things quickly.”
That is chaos engineering. It is similar to a suite of regression tests, but it actually runs against production, at the risk of real outages, so that you can’t hide from it.
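The idea above can be sketched in a few lines. This is a toy illustration, not any real tool’s API: in something like Netflix’s Chaos Monkey, the “kill” step would call a cloud provider to actually terminate instances; here the instance list and function names are made up for the example.

```python
import random

def chaos_monkey(instances, kill_fraction=0.2, rng=None):
    """Randomly pick and 'terminate' a fraction of running instances.

    `instances` is a hypothetical list of instance IDs. A real chaos tool
    would call a cloud API to kill the victims; this toy version just
    splits the list into survivors and victims.
    """
    rng = rng or random.Random()
    n_victims = max(1, int(len(instances) * kill_fraction))
    victims = rng.sample(instances, n_victims)
    survivors = [i for i in instances if i not in victims]
    return survivors, victims
```

The point is that the killing is random and continuous: nobody gets to schedule around it, so every team has to assume their instance could be the next victim.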
Kubernetes and other devops tools are ways of deploying applications that can be ‘self-healing’: they automatically enforce redundancy and abstract away the solutions to the problems the chaos monkey code is creating. So if a server goes down, the orchestrator notices and spins up one to replace it, and it also handles things like shared resources, IP addresses, load balancing, etc.
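The “self-healing” part boils down to a reconciliation loop: compare the desired state to what is actually running and start replacements for anything missing. This is a hedged toy sketch of that idea, not the real Kubernetes API; the function and instance names are invented for illustration, and a real controller would do this continuously against the cluster API.

```python
def reconcile(desired_replicas, running):
    """One pass of a toy reconciliation loop.

    `running` is a set of names of instances currently alive. Returns the
    set after starting replacements for any missing replicas. A real
    Kubernetes controller would schedule new pods instead of adding names.
    """
    running = set(running)
    missing = desired_replicas - len(running)
    for i in range(max(0, missing)):
        # In a real cluster this step would schedule a new pod.
        running.add(f"replacement-{i}")
    return running
```

Run this loop forever and the chaos monkey’s kills become routine: a victim disappears, the next reconcile pass notices the count is short, and a replacement comes up without anyone paging a human.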
This makes them great tools for companies engaged in chaos engineering (or not, tbh), but you can run a chaos engineering tool on any stack. Most companies run these tools on staging, but really doing chaos engineering means doing it on production. Chaos engineering is like penetration testing, DDoS tests, or load tests: early on it can make your service terrible (as it breaks all the things that aren’t redundant), but in the long run it should make your company better (as you become more resilient to these problems).