Why do social media apps like Instagram have outages? What causes them?


Outages, as in you can’t send messages or images sometimes, or your feed won’t reload, etc.


Too many users at once. High activity hours.

Think of it like a concert that’s super packed and loud. With so many people it might get hard to see the performers on stage, and it is probably also going to be hard to even have a conversation with the person right next to you.

Social media apps are coded by software engineers, who are human. Humans make errors: an overlooked bug in the code, or maybe a bad configuration. This is probably the most common cause of outages, especially with Instagram, which runs on Facebook’s infrastructure. But for other sites, things like too much traffic (think Reddit’s hug of death) or downstream dependencies being down can also be the cause.

Lots of things can affect uptime in any system. No system is perfect, and the more complex a system, the less perfect it’s going to be. A system is only as strong as its weakest component.
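To put rough numbers on the "more complex, less perfect" point, here is a small sketch. It assumes every component has to be up for the whole system to work and that components fail independently. The availability figures are made up for illustration, not real Instagram numbers.

```python
import math

def chain_availability(parts):
    """Availability of a system that needs ALL of its parts up at once.

    Assumes the parts fail independently, so the chances multiply.
    """
    return math.prod(parts)

# Hypothetical components: three at 99.9% uptime and one at 99%.
parts = [0.999, 0.999, 0.999, 0.99]
overall = chain_availability(parts)

print(f"weakest part: {min(parts):.4f}")
print(f"whole chain:  {overall:.4f}")
```

The whole chain comes out around 98.7%, which is worse than even its weakest part, and every extra component you add to the chain pulls the number down a bit further.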

Instagram relies on connectivity with servers at the center of the system. Those servers can go down for many reasons, such as power outages, excessive load, hardware defects, or software bugs. The servers are connected to the rest of the world and to each other by very fast, very complex networks, which use hardware and software systems that can also fail.

People try to minimize outages by relying on redundancy. You don’t have only one server, you have several working together, and if one or a few of them go down the system continues to operate. Of course, this introduces more complexity, because now you need software and hardware to ensure that these systems coordinate correctly.
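Here is a minimal sketch of why redundancy helps. It assumes the servers fail independently and that the service stays up as long as at least one server is up. Real systems are messier: failures are often correlated (same data center, same bad deploy), and the coordination layer itself can break. The 99% figure is just an example.

```python
def redundant_availability(single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` identical servers is up.

    Assumes independent failures: the service is down only when
    every replica is down at the same time.
    """
    return 1 - (1 - single) ** replicas

# Hypothetical server that is up 99% of the time.
for n in (1, 2, 3):
    print(n, f"{redundant_availability(0.99, n):.6f}")
```

Under those assumptions, going from one server to two takes you from 99% to about 99.99%, and a third gets you to about 99.9999%. In practice the gain is smaller, because the independence assumption rarely holds.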

As you keep adding redundant systems, at some point you reach diminishing returns. As the saying goes, the first 99% of reliability costs 10% of the price, and the last 1% costs the other 90%.

Since it’s impossible to have a 100% perfect system, any company must decide what counts as an acceptable outage and how much it is willing to pay to avoid the unacceptable ones. So you may spend millions to reach 99% reliability, tens of millions to reach 99.9%, and it only gets steeper as you move to 99.99%, 99.999%, 99.9999% and so on. By then you are paying hundreds of millions for redundancy, and it is still not a matter of *if* an outage will occur, but *when*.
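To see what those percentages mean in practice, this sketch converts each one into the downtime it still allows per year. It assumes a 365-day year and says nothing about cost, since the dollar amounts above are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for pct in (99, 99.9, 99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.2f} minutes of downtime per year")
```

So 99% still allows about 3.65 days of downtime a year, 99.9% about 8.8 hours, 99.99% about 53 minutes, 99.999% about 5 minutes, and 99.9999% about half a minute. Each extra nine cuts the allowed downtime by a factor of ten, which is why each one is so much harder to reach.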