A site like Twitter is not fully self-contained. It uses many (probably thousands of) third-party libraries. These libraries are constantly being updated with new features, security fixes, stability improvements, and so on.
That means you need to update your app frequently, at the very least to use the new library versions. Not doing so won't break it right away, but sooner or later (hint: usually sooner) there will be a breaking change, such as an older version being deprecated or a field name being changed, that requires you not only to bump the library version your program uses but to make changes to your own code as well.
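To make that concrete, here's a minimal sketch of how a single upstream field rename breaks code that used to work. The response shape and the field names are invented for illustration:

```python
# Invented example: suppose version 1.x of an imaginary "birdlib" client
# returns posts shaped like this, and your code reads the "text" field.
old_post = {"id": 123, "text": "hello world"}
print(old_post["text"])  # worked fine for years

# Version 2.0 renames the field to "content". The exact same lookup now
# raises KeyError, so an engineer has to find and update every call site.
new_post = {"id": 123, "content": "hello world"}
print(new_post.get("text") or new_post.get("content"))  # the post-upgrade fix
```

Multiply that by thousands of libraries, each on its own release schedule, and there's always something like this waiting for a human to fix.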
Plus, anything running at Twitter's scale has a whole lot of infrastructure supporting it, usually in the cloud, and that requires specific types of engineers (DevOps, DevSecOps, etc.).
Who keeps it up to date with new hardware and software? The rest of the internet will keep moving forward. How long until the app no longer works on phones, or the website renders incorrectly in modern browsers?
What happens when some little thing goes wrong, as is often the case with computers, and nobody’s there to fix it?
In my experience with IT, it's rare to have a completely uneventful day:
– Hardware goes down
– Networks stop responding
– Software becomes obsolete
– Operating Systems need to be patched
There are certain things that you'll be able to keep working for a while. Then it gets to the point where other employees can find a workaround without having to get into the guts of the server room.
But eventually the workarounds create a drain on productivity, and then things just stop working altogether.
Sometimes things can be fixed just by doing a reboot, but that’s not always easy.
I work for a small company with fewer than 100 office workers, and doing a complete reboot can easily take 30 minutes.
Some things will automatically start working again; for others, you'll have to manually log into that part of the system and force them to start back up.
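As a rough sketch of what that manual step looks like (the service names, the simulated state, and the check/start helpers are all invented for the example), a person is essentially doing this by hand:

```python
import time

# Hypothetical services, in the dependency order they must come up:
# the database before the app server, the app server before the web frontend.
SERVICES = ["database", "app-server", "web-frontend"]

# Simulated state after a reboot: some services come back on their own,
# others stay down until someone forces them to start.
state = {"database": "running", "app-server": "stopped", "web-frontend": "stopped"}

def is_healthy(name: str) -> bool:
    # Placeholder check; a real one would ping a port or a status endpoint.
    return state[name] == "running"

def force_start(name: str) -> None:
    # Stand-in for "log into that part of the system and start it by hand".
    print(f"manually starting {name}...")
    state[name] = "running"

for service in SERVICES:
    if not is_healthy(service):
        force_start(service)
        time.sleep(0.1)  # real services can take minutes to come up
    print(f"{service}: {'OK' if is_healthy(service) else 'still down'}")
```

The hard part isn't the commands; it's knowing which services exist, what order they depend on each other, and what "healthy" looks like for each one. That knowledge lives in people's heads.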
Plus, a system is only as reliable as its least experienced user: people open emails with viruses, leave passwords unsecured, and forget passwords. With the average user running things on autopilot, things break very easily.
The site is running smoothly *because* all the staff are constantly doing things. And it’s not just the engineers. Moderators are removing bad content, lawyers are responding to requests from governments, project managers are making sure projects run on time, and accounting staff are paying all the bills.
It’s like saying “this hotel is running very smoothly. Why would it matter if 80% of the staff left?” It’s the constant, almost invisible effort of the humans that keeps it going. Sure, the building isn’t going to fall down. But there’s not going to be enough staff left to wash and change the sheets, make guest keys, change the air filters, start the giant coffee pots in the morning, receive deliveries of soap, or pay the electric bill.
There’s a whole class of people called Site Reliability Engineers (SREs) whose whole job is to keep large websites working. Here’s a very fascinating thread from an experienced SRE just listing all the ways a large tech company can collapse:
I've seen a lot of people asking "why does everyone think Twitter is doomed?"
As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks.
— Mosquito Capital (@MosquitoCapital) November 18, 2022
Every system, whether digital or physical, requires routine maintenance to keep all of its features functional. That's where engineers and technicians come in: they're the ones who check and maintain the system's various components.
In addition to maintenance, the system also needs to be updated regularly to maintain cross-compatibility with other systems.
So in the context of social media platforms, routine maintenance might cover things like the hardware that holds account information and media files, or the UI interactions on different platforms.
And updates could be things like OS compatibility (especially for mobile apps, which need to be optimised for multiple operating systems), the addition of new features, or bug fixes.
These things can't be fully automated, if they can be automated at all.
(I do engineering work in a different field so I’m not sure how accurate this info is with regards to digital infrastructure and systems but it should be similar enough)
A hard drive fills up. That can crash a server. And take down any services that rely on that server.
That’s just one example of a small failure that if left unchecked degrades the system. Enough small failures and you start to have reliability issues across the system. It starts as a few things slowing down or not functioning until cascading failures bring the whole thing down.
So when the platform was first being created, the developers had to make a bunch of tradeoffs to meet deadlines and solve immediate problems. The price they paid was code that would create problems down the road and require additional workarounds. A lot of what's still in the codebase is this legacy code. The engineers know about these problems and can anticipate when they're going to become real issues. Without the engineers, the platform can run okay for a little while, but the built-in problems will eventually compound and it will crash.
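As an illustration of what that legacy code can look like (everything here is invented for the example), it's often a hard-coded shortcut with a note nobody ever gets back to:

```python
# Invented example of the kind of shortcut that ships under deadline pressure.

MAX_UPLOAD_MB = 5  # TODO(2013): quick fix; the real fix is chunked uploads,
                   # which we never got around to implementing.

def can_upload(size_mb: float) -> bool:
    # Workaround: reject anything over the hard-coded cap instead of
    # streaming large files. Fine at launch scale; a ticking problem later.
    return size_mb <= MAX_UPLOAD_MB

print(can_upload(4.5))   # True: fine today...
print(can_upload(12.0))  # False: breaks the video feature someone ships later
```

The engineers who wrote that TODO know exactly when it will bite. Once they're gone, it bites on its own schedule.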
Lots of reasons, but I'll give you 3.
1. Day-to-day fires. Projects at Twitter's scale stress the limits of systems in different ways depending on lots of factors, and you need people around to adjust for those changes.
2. Security and privacy. Twitter is now a massive hacking target for bad actors around the world. With no engineers around, it becomes an even bigger target.
3. Tribal knowledge. Knowing how a system behaves and all of its idiosyncrasies, how its pieces work together, why decisions were made in the past, and what lessons were learned along the way: all of these things are more important to running a system than the bits.