How are status pages for major CDNs and major backbone providers designed to be up even though the provider is down?

In: Technology

9 Answers

Anonymous 0 Comments

The most obvious way to implement this is to host the status page with a competitor, using completely independent infrastructure. Another trick is to make the status page fail closed: if any part of it breaks, it simply shows the service as failed. So for example, with today's Fastly issue, where the front-end servers worked but the backend network had problems, the front ends can be configured to show a status page with everything failing if they cannot find the backend servers. This obviously does not work if there are errors very early in the pipeline, but at that point it is likely a problem with the client's network provider anyway, and the status page would be of no help.
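
A minimal sketch of that fail-closed idea, in Python. The component names and the `fetch_backend_status` callable are hypothetical placeholders, not anything Fastly is known to use; the point is only that an unreachable backend or a missing report is treated as an outage rather than as "all good".

```python
from typing import Callable, Dict, List

# Hypothetical component names, purely for illustration.
COMPONENTS: List[str] = ["cdn", "api", "dashboard"]


def render_status(fetch_backend_status: Callable[[], Dict[str, str]]) -> Dict[str, str]:
    """Return one status per component, assuming the worst on any failure."""
    try:
        reported = fetch_backend_status()
    except Exception:
        # Fail closed: the backend is unreachable, so report everything as down.
        return {name: "outage" for name in COMPONENTS}
    # A component the backend did not report on is also treated as down.
    return {name: reported.get(name, "outage") for name in COMPONENTS}
```

With this shape, the front end never has to decide whether silence means "healthy"; it always means "down".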

Anonymous 0 Comments

A park you visit could be closed, but the signs at the perimeter and leading to it will still be there. It’s much like this. Status pages are deliberately kept in heavily redundant places, sometimes in several places at once, and not at the same physical location as the site itself, so that they can always be reached.

Anonymous 0 Comments

The computational requirements for a status page are extremely low compared to a full website such as Reddit or Facebook. It’s extremely cacheable, lightweight, and read-only. There’s no user authentication or policy enforcement logic. It’s also, for obvious reasons, hosted on separate infrastructure from the main site.

Furthermore, even during an outage event, a status site is going to receive less traffic than the actual site, since only a more savvy subset of the users will bother checking the status page.

Anonymous 0 Comments

In Fastly’s case, they host their status page on AWS, whose infrastructure is not dependent on them.

Anonymous 0 Comments

I can answer this as it’s my job. I manage a large CIRT (Critical Incident Response Team) that provides critical incident services to over 45 companies. I believe it’s the largest CIRT in the western world at least.

I see from a lot of your comments that you are looking at internal hosting solutions. **Don’t.** Look at the established industry tools instead: consider something like [statuspage](https://www.atlassian.com/software/statuspage) (and also take a look at Opsgenie, it’s magnificent).

Anonymous 0 Comments

Ah, you noticed the Fastly outage, didn’t ya?
But to the point: they host their main infrastructure separately from the status page.
For example, you host the status page in an Azure (Microsoft) datacenter in the region eu-west1, while the main infrastructure is hosted at AWS in us-east1…

Simplified example but you get the gist.
It could still be with the same provider but different regions too.

Anonymous 0 Comments

The easiest solution is to just use something like https://statuspage.com, which also has a statuspage (https://metastatuspage.com/). Unfortunately there is no metametastatuspage.

Anonymous 0 Comments

Popular status page providers usually have fallbacks in case something fails. They use different DNS providers, CDNs, and hosting. Pages are often static and light, which helps a lot when dealing with heavy traffic.

Let’s say you want to create a service on top of Amazon Web Services. You can use Route 53 for DNS, with Cloudflare/Akamai/Google/self-hosted DNS as fallback. You can use CloudFront as the main CDN, with Fastly/Akamai/Cloudflare/etc. as fallback (or bypass the CDN). For hosting, maybe Amazon EC2 as the main provider and Google Cloud/Azure/Linode/DigitalOcean/OVH/etc. as fallback, with the option to quickly scale if needed. If there’s a need to store heavy files, you can use Amazon S3 but also Akamai/Backblaze B2/DigitalOcean Spaces/etc.

To make things even more robust, you shouldn’t rely on only one location either. If you’re hosting something in US-west, have fallbacks in US-east. If possible, use different continents.
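
As a rough illustration of the fallback idea, here is a small Python sketch that tries a list of origins in order and uses the first one that answers. The `fetch_with_fallback` helper and the origin names in the usage note are made up for the example; in a real deployment the failover would more likely happen at the DNS or CDN layer than in application code.

```python
from typing import Callable, List, Optional, Tuple

Fetcher = Callable[[], str]


def fetch_with_fallback(
    origins: List[Tuple[str, Fetcher]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each origin in order; return (origin_name, body) for the first success."""
    for name, fetch in origins:
        try:
            return name, fetch()
        except Exception:
            # This provider or region is down; move on to the next one.
            continue
    # Every origin failed.
    return None, None
```

You would pass something like `[("us-west", fetch_primary), ("us-east", fetch_secondary)]`, where each fetcher is whatever function retrieves the page from that location.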

Services sometimes stop working even with all this redundancy because there’s always something that can go wrong: a bottleneck somewhere, something that fails to redirect traffic, a problem that no one thought about, two of the services relying on the same upstream provider, etc.

If you are a website/service operator, want to control your own status page, and really want it to be online when you’re having issues, you should at least use different services for everything (maybe a different domain and different DNS servers, CDN, and hosting provider).
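
One way to sanity-check that separation is to list which provider each piece depends on and look for overlap. The sketch below is only an illustration: the `shared_providers` helper and the example inventories are hypothetical, and it only catches dependencies you have actually written down.

```python
from typing import Dict, Set


def shared_providers(main_site: Dict[str, str], status_page: Dict[str, str]) -> Set[str]:
    """Return providers that both the main site and the status page depend on."""
    return set(main_site.values()) & set(status_page.values())


# Hypothetical inventories: role -> provider.
main_site = {"dns": "Route 53", "cdn": "CloudFront", "hosting": "EC2", "storage": "S3"}
status_page = {"dns": "Cloudflare", "hosting": "Linode", "icons": "S3"}

overlap = shared_providers(main_site, status_page)
if overlap:
    print("Status page shares providers with the main site:", sorted(overlap))
```

Here the overlap is S3, because the status page's icons live in the same storage service as the main site.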

Anonymous 0 Comments

They aren’t always. There was a major S3 outage in us-east-1 a few years ago (February 2017), and AWS’ status page couldn’t update the status icon because all the icons were themselves stored in S3.