active/active distributed systems


How does an active/active cluster work? What is the key element that allows two systems to coordinate transaction processing? Let's use a chat application as the example for the sake of argument. Can we truly have no single point of failure? Or is a group chat system just fundamentally not suited to active/active?

All my searches end up with vendor material selling their messaging or other systems, but I am looking to understand the core principles. How does one guarantee exactly-once processing if the underlying databases at two sites could lose connectivity? How does active/active handle recovery when communication between the sites is restored?



When you really break it down, the bottom line is that no amount of redundancy can completely mitigate all failures. It can help with a lot of failure modes, but in the end there will be situations that will degrade or take down the service, and it is up to the service designer to decide what the best way to fail is.

[CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem) breaks this down into a “pick two out of three” situation:

* Consistency, where every read returns the most recent write rather than out-of-date data.
* Availability, where every request gets a proper answer rather than an error or a timeout.
* Partition tolerance, where the system keeps working even when the network between nodes is down.
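To make the trade-off concrete, here is a minimal sketch of two sites holding a copy of the same chat data. Everything in it (the `Node` class, the `"CP"`/`"AP"` mode flag, the `partition` helper) is made up for illustration and is not how any particular product works. It only shows the choice a node faces once it can no longer reach its peer: refuse the write and stay consistent, or accept it and let the two sites diverge.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One site in a hypothetical two-site active/active pair."""
    name: str
    mode: str  # "CP" = prefer consistency, "AP" = prefer availability (illustrative flag)
    data: dict = field(default_factory=dict)
    peer_reachable: bool = True

    def write(self, key, value, peer):
        """Return True if the write was accepted, False if it was refused."""
        if self.peer_reachable:
            # Healthy link: apply the write on both sites before acknowledging.
            self.data[key] = value
            peer.data[key] = value
            return True
        if self.mode == "CP":
            # Partitioned and consistency-first: refuse, so the copies never disagree.
            return False
        # Partitioned and availability-first: accept locally; the copies now differ
        # and must be reconciled somehow once the link comes back.
        self.data[key] = value
        return True


def partition(a, b):
    """Simulate losing the network link between the two sites."""
    a.peer_reachable = False
    b.peer_reachable = False
```

With two `"CP"` nodes, a write after `partition()` is rejected and both copies stay identical. With two `"AP"` nodes, the same write succeeds but only one site has the message, which is exactly the situation the recovery question is asking about.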

Vendors will always, *always* build a quiet assumption into their systems around which one they compromise on. For example, a lot of vendors will sell you a pair of appliances that they expect to be kept right next to each other, with a pair of heartbeat cables between them. They require this because their system is not partition tolerant, so taking out those heartbeat cables will cause all hell to break loose.
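One common form of "all hell breaking loose" is what is usually called split brain: each appliance decides the other is dead and both act as primary. The sketch below is a toy model under my own assumptions (the `Appliance` class, the `max_missed` threshold of 3, and the promotion rule are all hypothetical, not any vendor's actual logic). It shows why a healthy standby cannot tell "the cable was pulled" apart from "the primary crashed".

```python
class Appliance:
    """Toy model of one box in a heartbeat-linked pair (illustrative only)."""

    def __init__(self, name, role):
        self.name = name
        self.role = role  # "primary" or "standby"
        self.missed = 0   # consecutive heartbeats not received

    def tick(self, heartbeat_received, max_missed=3):
        """Process one heartbeat interval; max_missed is an assumed threshold."""
        if heartbeat_received:
            self.missed = 0
            return
        self.missed += 1
        # The standby cannot distinguish a dead peer from a dead cable,
        # so after enough silence it promotes itself.
        if self.role == "standby" and self.missed >= max_missed:
            self.role = "primary"


def pull_cable(primary, standby, intervals=3):
    """Both boxes stay up, but neither hears the other for `intervals` ticks."""
    for _ in range(intervals):
        primary.tick(heartbeat_received=False)
        standby.tick(heartbeat_received=False)
    # Return how many boxes now believe they are the primary.
    return sum(1 for box in (primary, standby) if box.role == "primary")
```

Calling `pull_cable(Appliance("a", "primary"), Appliance("b", "standby"))` leaves two boxes that both think they are primary, each accepting writes the other never sees. That is the failure the "keep them next to each other" requirement is trying to make unlikely rather than impossible.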

No matter how shiny the sales pitch, you can’t get all three.