What is chaos engineering? Is it always done on a Kubernetes cluster or a cloud-native app?

3 Answers

Anonymous 0 Comments

It’s the idea that your design should handle any failure. By introducing constant chaos, both during the engineering phase and in production, you find issues you might not have thought of. It’s not exclusive to k8s or the cloud, but both make it easier. Think of it as randomly removing a node, or introducing intentional or random errors into the system, to see how it behaves.
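
To make “introducing intentional or random errors” a bit more concrete, here is a minimal Python sketch, not any particular chaos tool. Every name in it (`with_chaos`, `call_with_retries`, `fetch_price`, `InjectedFault`) is made up for illustration, and a simple retry loop stands in for whatever recovery your system really has.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper instead of calling the real function."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts, *args, **kwargs):
    """Naive recovery strategy: retry up to `attempts` times, then give up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args, **kwargs)
        except InjectedFault as err:
            last_error = err
    raise last_error


def fetch_price(item):
    """Hypothetical stand-in for a real service call."""
    return {"apple": 3, "pear": 5}[item]


if __name__ == "__main__":
    # Fail about 30% of calls, then count how many requests still fail
    # after three attempts each.
    flaky = with_chaos(fetch_price, failure_rate=0.3, rng=random.Random(42))
    failures = 0
    for _ in range(1000):
        try:
            call_with_retries(flaky, 3, "apple")
        except InjectedFault:
            failures += 1
    print(f"{failures} of 1000 requests failed even with 3 attempts")
```

The point of the exercise is the count at the end: if that number surprises you, you have found an issue you had not thought of.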

Anonymous 0 Comments

Chaos Engineering is the idea that something *will* go wrong, so let’s practice what happens when something goes wrong and make sure it all recovers.

If every component in your stack has a 99.99% uptime SLA, then you should expect almost an hour of downtime (about 53 minutes) for each component each year. But as you add more components, you get into scenarios where that downtime is more and more likely to coincide with another component’s downtime. By their nature, cloud-native apps and k8s apps will have more components than monoliths, which is why you see Chaos Engineering happen there.

The way the math works, if each of your components has a random 1 hour window where it’ll break each year, you only need a little over 100 components for there to be a 50% chance of an overlap in any given year. So don’t wait for it to happen; test various combinations of failures.
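
If you want to check that arithmetic yourself, it is the birthday problem with hours instead of birthdays. The sketch below assumes a simplified model that is mine, not a property of any real SLA: each component fails in exactly one randomly chosen one-hour slot per year, slots are aligned to the hour, and failures are independent. The function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 one-hour slots; leap years ignored


def expected_downtime_minutes(uptime):
    """Yearly downtime, in minutes, allowed by an uptime fraction such as 0.9999."""
    return (1.0 - uptime) * HOURS_PER_YEAR * 60


def overlap_probability(components, slots=HOURS_PER_YEAR):
    """Chance that at least two components land in the same slot (birthday problem)."""
    p_no_overlap = 1.0
    for i in range(components):
        # The (i+1)-th component must avoid the i slots already taken.
        p_no_overlap *= (slots - i) / slots
    return 1.0 - p_no_overlap


def components_for_even_odds(slots=HOURS_PER_YEAR):
    """Smallest number of components with at least a 50% chance of an overlap."""
    n = 1
    while overlap_probability(n, slots) < 0.5:
        n += 1
    return n


if __name__ == "__main__":
    print(f"99.99% uptime allows {expected_downtime_minutes(0.9999):.1f} minutes down per year")
    print(f"components needed for a 50% overlap chance: {components_for_even_odds()}")
```

Under those assumptions the threshold comes out a little over 100, as claimed. Real outages are neither exactly one hour long nor independent, so treat this as a rough estimate rather than a prediction.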

It is not necessarily always done against k8s clusters or cloud native apps. You can do it against your own personal cloud in a VM cluster just fine.

Anonymous 0 Comments

First, an example:

Classical testing for the Titanic is doing calculations on how to react to icebergs, then running an iceberg simulated test on the dry dock, training all the staff, etc.

Chaos engineering says “Let’s have a big tug boat intentionally tow icebergs into the path of the Titanic on every voyage it takes.”

“What if they fail to handle the iceberg and sink and kill everyone on board?”

“Well, then you’ll have learned some valuable lessons to build your next boat and train your next crew.”

Testing software and hardware often looks like this:

– Test Everything. Test all failures, test all successes, test all weirdness before pushing it live.

– Trust no one. Don’t trust developers (QA is done by a separate department). QA shouldn’t even trust QA.

– You can only trust devs to work on things that they know will happen.

– No matter how well staging matches production, staging will not match production.

– QA are only guaranteed to find additional problems they can think of.

– DevOps is the last and hardest area to cover with the above. QA doesn’t have the expertise to predict infrastructure failures, developers don’t expect them, and developers/DevOps can’t be trusted to build the redundancy features properly because of the “trust no one” mantra.

So what if DevOps uses their expertise to think of all the redundancy they want, and then writes malicious pieces of code that intentionally break production? Not just staging or a production-like environment: the program runs all the time, trying to take down production and causing REAL outages.

Now those failures fall under “You can only trust devs to work on things that they know will happen.” Nowhere to hide. They also fall under “DevOps had better have prepared for this and know how to fix things quickly.”

That is chaos engineering. It is similar to a regression test suite, but it is actually run against production, at the risk of real outages, so that you can’t hide from it.

Kubernetes and other DevOps tools are ways of deploying applications that can be ‘self-healing’: they automatically enforce redundancy and abstract away the solutions to the problems the chaos monkey code is creating. So if a server goes down, the tool notices and spins up one to replace it, and it also handles things like shared resources, IP addresses, load balancing, etc.
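
A toy version of that “notice and replace” loop, next to the code that keeps breaking things, might look like the Python sketch below. It is only an illustration of the idea. The `Cluster` class and `kill_random_server` function are invented for this example and are not how Kubernetes or any real chaos tool is implemented.

```python
import random


class Cluster:
    """Toy stand-in for an orchestrator that keeps N servers running."""

    def __init__(self, desired_replicas):
        self.desired_replicas = desired_replicas
        self.running = set()
        self._next_id = 0

    def _start_one(self):
        self.running.add(f"server-{self._next_id}")
        self._next_id += 1

    def reconcile(self):
        """Start replacements until the running count matches the desired count.

        Returns how many servers had to be started.
        """
        started = 0
        while len(self.running) < self.desired_replicas:
            self._start_one()
            started += 1
        return started


def kill_random_server(cluster, rng):
    """The chaos side: remove one running server, chosen at random."""
    if not cluster.running:
        return None
    victim = rng.choice(sorted(cluster.running))
    cluster.running.remove(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random(7)
    cluster = Cluster(desired_replicas=3)
    cluster.reconcile()  # initial startup
    for round_number in range(1, 4):
        victim = kill_random_server(cluster, rng)
        replaced = cluster.reconcile()
        print(f"round {round_number}: killed {victim}, started {replaced} replacement(s)")
```

In a real system the interesting part is everything this toy skips: how long the replacement takes, what happens to in-flight requests, and whether anything depended on the server that died.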

This makes them great tools for companies engaged in chaos engineering (or not, tbh), but you can run a chaos engineering tool on any stack. Most companies run them on staging, but doing chaos engineering for real means doing it on production. In that sense it is like penetration testing, DDoS tests, or load tests. Early on this can make your service terrible (as it breaks everything that isn’t redundant), but in the long run it should make your company better (as you become more resilient to these problems).