If a social media platform is running smoothly, but the engineers leave, why can’t a platform continue to run on autopilot?


I guess this is applicable to any social media platform or other similar systems. Is it because there are always bugs to address, so it’s never really running smoothly, or other reasons?


38 Answers

Anonymous 0 Comments

Trying to explain it in simpler words.

Big websites like Twitter or Google are a bit like big cities: very complex, constantly changing systems made up of many simpler systems. Think about all the roads, water and electricity facilities, but also museums, police, schools, trash collection and so on. For the city to “work”, meaning it is a place where people want to live and easily can, all of those systems need to be working at some level at all times. Streets need to allow for deliveries and for people to move around. Trash needs to be collected, electricity needs to work, etc.

Any of those systems can fail, sometimes for an unexpected reason. For example, lightning strikes a local power plant, or a change in policy causes all the garbage collectors to go on strike. Now, if trash is not collected for some time, the city eventually starts to stink and becomes unpleasant. If it gets worse and trash piles up on the streets, some roads may get blocked. If roads get blocked, people and deliveries cannot get around. The longer it takes to resolve, the worse it gets. So one failure can trigger another.

Some systems failing will have a bigger impact on the city as a whole, and some a smaller one. All museums being closed for a week would be mostly an inconvenience. But with no electricity there would probably be chaos, Armageddon, possibly many people dying. And again, you can imagine one failing system pulling others down. Plus, the longer they are down, the worse it gets.

So you want to be able to fix things quickly. In a city that would be the responsibility of the management of the specific city companies, probably together with the city government, and likely with a set of people whose only job is managing unexpected problems like that.

Now, coming back to computer systems. Each of the city systems corresponds to what is called a microservice in computing: a single program running on one or more servers. Microservices are also interrelated, and one failing often pulls another down. They also need some common infrastructure to work. In a city it's roads, sewers etc.; in computers it would be network and power. Each microservice is usually owned by a team that takes care of it, the same way the city companies have management. Each team will usually own more than one microservice. This means that even small companies will have tens of them, probably going into the hundreds, thousands and beyond depending on company size. Twitter likely has somewhere in the high hundreds of them.
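To make the "one failing often pulls another down" part concrete, here is a minimal Python sketch. The service names and the dependency map are made up for illustration and are not Twitter's real architecture. It also assumes the simplest possible rule: a service goes down as soon as anything it depends on is down. Real systems have retries and fallbacks that soften this.

```python
from collections import deque

# Hypothetical services: each one maps to the services it depends on.
DEPENDS_ON = {
    "timeline": ["tweets", "auth"],
    "tweets": ["storage"],
    "auth": ["storage"],
    "ads": ["timeline"],
    "storage": [],
    "status_page": [],
}


def cascade(depends_on, first_failure):
    """Return the set of services that end up down if first_failure goes down.

    Simplifying assumption: a service fails as soon as any one of its
    dependencies has failed (no retries, caches or fallbacks).
    """
    # Invert the map so we can ask "who depends on this service?"
    dependents = {}
    for service, deps in depends_on.items():
        dependents.setdefault(service, [])
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    down = {first_failure}
    queue = deque([first_failure])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


if __name__ == "__main__":
    for name in ("status_page", "timeline", "storage"):
        print(name, "->", sorted(cascade(DEPENDS_ON, name)))
```

In this toy map, losing `status_page` only takes down itself (the closed museum), while losing `storage` takes down everything except `status_page` (the power outage). Knowing which services are the "electricity" is exactly the kind of knowledge the owning teams carry around.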

Now, what has happened at Twitter in the last week is basically 90% of the city companies' management quitting all at once. There is almost no one left who knows that a pipe under the main square is about to burst, or that unless it is checked weekly, the electricity will start failing in parts of the city. There are also not many people left to coordinate fixes if something breaks. And even if some are around, chances are they have no knowledge of the specific thing that broke, and without that a fix will take days or weeks. By that point the city may be in flames, with people escaping in droves.
