How do websites like YouTube store their data when thousands of TB are uploaded every hour of every day?

I don’t know about the technology behind it beyond my computer at home. Basically, I just imagine it as thousands of hard drives in a room connected to a server, being expanded all the time.

EDIT: Apparently I already answered my own question. I honestly imagined it as something else, like some super tech not available to the public. No wonder no one was asking – it’s very simple. This thought had been bothering me for a couple of days.

Well, that’s basically it – even high-capacity hard drives (around 20 TB each these days) don’t take up much physical space, so it’s easy to stack hundreds or thousands of them in a single server room. YouTube, for example, is estimated to hold somewhere around 10 exabytes of data – 1 exabyte is 10^18 bytes – and Google has far more capacity available in its data centers still, to the point where it rents parts of them out to external companies. The data itself sits on storage servers built from many hard drives working together, and a database (YouTube notably uses MySQL) keeps track of what is stored where, which is what makes it even somewhat possible to operate on that many files.
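To make the “database that knows where everything is” idea concrete, here’s a minimal sketch in Python. The video ID, server names, and paths are invented for illustration – this is not YouTube’s actual schema, which would be sharded across many database machines:

```python
# Hypothetical metadata index: video_id -> list of (server, path) locations.
# The database holds pointers like this, not the video bytes themselves.
video_index = {
    "dQw4w9WgXcQ": [
        ("storage-node-0042.dc1.example", "/vol7/chunks/dQw4w9WgXcQ"),
        ("storage-node-1187.dc3.example", "/vol2/chunks/dQw4w9WgXcQ"),
    ],
}

def locate(video_id):
    """Return the storage locations for a video, or an empty list if unknown."""
    return video_index.get(video_id, [])

print(locate("dQw4w9WgXcQ")[0])  # -> ('storage-node-0042.dc1.example', '/vol7/...')
```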

That’s pretty much it, for the most part.

While I don’t know exactly how YouTube does it – every company implements this differently – the short answer is lots of rooms (all over the world) full of disk storage, with capacity perpetually being added.

These days you can buy an 18 TB hard disk on Amazon for less than £300, so the idea of these massive companies processing and storing thousands of TB at any given moment isn’t as incomprehensible as it might seem.

For instance, suppose YouTube did store 1000TB of new content every hour, even after compression, and ignoring factors like the bulk-purchase discounts they would get when buying drives and other expenses such as servers and wider datacentre costs:

1000TB * 24 * 365 = 8,760,000TB (8.76 Exabytes) a year.

8,760,000TB / 18TB = ~486,667 drives.

486,667 * £300 = £146m a year.

YouTube generated ~£13.9 Billion in 2020, so the purchase of these disks would represent just over 1% of that as an expense.
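If you want to sanity-check those figures, here is the same back-of-the-envelope arithmetic as a small Python script. All the numbers are the rough assumptions from this comment, not real YouTube data:

```python
# Back-of-the-envelope: how many 18 TB drives would 1,000 TB/hour need, and what would they cost?
TB_PER_HOUR = 1000
DRIVE_TB = 18
DRIVE_PRICE_GBP = 300
YOUTUBE_REVENUE_2020_GBP = 13.9e9  # approximate figure quoted above

tb_per_year = TB_PER_HOUR * 24 * 365               # 8,760,000 TB ~= 8.76 exabytes
drives_per_year = tb_per_year / DRIVE_TB           # ~486,667 drives
cost_per_year = drives_per_year * DRIVE_PRICE_GBP  # ~£146 million

print(f"{tb_per_year:,} TB/year -> {drives_per_year:,.0f} drives -> £{cost_per_year / 1e6:.0f}m")
print(f"Share of 2020 revenue: {cost_per_year / YOUTUBE_REVENUE_2020_GBP:.1%}")
```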

The answer is lots of hard drives and lots of servers in large server farms. You can see a list of the data centers operated by Google, which owns YouTube, at https://en.wikipedia.org/wiki/Google_data_centers. The number of servers is in the millions.

There are specific protocols and standards for providing access to drives over a network – see https://en.wikipedia.org/wiki/Storage_area_network.

The disks are connected to computers that are not that different from a home computer; they just tend to use server versions of CPUs, more memory slots populated with error-correcting (ECC) memory, and more expansion slots.

The cases are designed so you can fit lots of drives in them. The drives are hot-swappable, so you can remove them while the server is running. The number of disks per machine can be quite high, with dedicated controller cards to manage them, and there are redundant power supplies.

You can look at https://www.supermicro.com/en/products/general-purpose-storage for storage servers.

To the best of my knowledge, Google does not use complex servers with lots of storage, but rather simple servers with a limited amount of storage each. That way you get more CPU capacity relative to the storage, so you can compress videos as required and handle other tasks, like serving searches.

I attended a presentation from Google at my college, now ~15 years ago, so their model may have changed a bit.

Essentially, the [Google File System](https://en.wikipedia.org/wiki/Google_File_System) is a massive distributed file system built on economical 1-rack-unit (about 1.75 inches tall) servers with a few drives each (think 4x multi-TB SSDs). They fill racks upon racks, data center upon data center, with these little 1U servers, and their own custom software joins them into one massive storage pool. Files stored there (your Gmail, YouTube videos, Google Docs, etc.) are all indexed, auto-duplicated, and so on, so that if any one server goes down there are at least two other copies saved somewhere else in that data center or perhaps another, and the system heals itself by putting another copy somewhere else.
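Here is a toy Python sketch of that replication and self-healing behaviour. It is not GFS itself – the server names, counts, and placement policy are invented – it just shows the “keep N copies, and re-copy when a server dies” idea:

```python
# Toy replicated chunk store: keep REPLICAS copies of every chunk,
# and re-replicate from survivors when a server fails.
import random

REPLICAS = 3
servers = {f"server-{i}": set() for i in range(10)}  # server name -> chunk IDs it holds

def place(chunk_id):
    """Store a new chunk on REPLICAS different servers."""
    for name in random.sample(sorted(servers), REPLICAS):
        servers[name].add(chunk_id)

def heal(dead_server):
    """Re-replicate chunks that lived on a failed server, using the surviving copies."""
    lost = servers.pop(dead_server)
    for chunk in lost:
        holders = {s for s, chunks in servers.items() if chunk in chunks}
        candidates = [s for s in servers if s not in holders]
        if holders and candidates:  # copy from a survivor onto a server that lacks it
            servers[random.choice(candidates)].add(chunk)

place("video-abc-chunk-0")
heal("server-3")  # if the chunk was on server-3, it gets a fresh copy elsewhere
```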

What’s also cool about it – if a video goes viral and millions of people are watching it, a handful of copies obviously can’t serve that many people, so the system automatically duplicates the file into the data centers where it makes sense (for example, if a video goes viral in the US, it might be duplicated onto a bunch of US data centers).
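As a rough illustration of that “more viewers means more copies” policy – the viewers-per-copy figure here is entirely invented, and real placement decisions are far more sophisticated:

```python
# Toy demand-driven replication policy.
VIEWERS_PER_COPY = 50_000  # assume one copy can comfortably serve this many concurrent viewers

def copies_needed(concurrent_viewers, minimum=3):
    """Scale the replica count with demand, never dropping below the safety minimum."""
    return max(minimum, -(-concurrent_viewers // VIEWERS_PER_COPY))  # ceiling division

print(copies_needed(1_200))      # 3  - a quiet video keeps its baseline copies
print(copies_needed(2_000_000))  # 40 - a viral video gets spread across many servers/regions
```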

When you go to watch that video on YouTube, your request goes to one of the indexing servers, which figures out where the closest copy exists and/or does load balancing so no server is overloaded, and then sends that copy to your browser.
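A toy version of that lookup and load-balancing step might look like this – pick the nearest replica that isn’t overloaded. The server names, latencies, and load figures are all made up:

```python
# Choose a replica by latency, skipping servers that are already too busy.
replicas = [
    {"server": "us-east-1", "latency_ms": 20,  "load": 0.95},
    {"server": "us-west-2", "latency_ms": 70,  "load": 0.40},
    {"server": "eu-west-1", "latency_ms": 120, "load": 0.10},
]

def pick_replica(replicas, max_load=0.8):
    """Prefer the lowest-latency replica whose load is under the threshold."""
    usable = [r for r in replicas if r["load"] < max_load] or replicas
    return min(usable, key=lambda r: r["latency_ms"])

print(pick_replica(replicas)["server"])  # -> "us-west-2" (us-east-1 is closer but overloaded)
```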

It was really impressive to me then, and I can’t imagine how it’s improved over the years – now Amazon, Facebook, Apple, Microsoft, Netflix, Hulu, etc. are all doing similar sorts of things…

On top of what others have said, there is also a storage technology called “de-duplication” that breaks files up into “blocks”, compares those blocks, and only stores a single copy of each duplicate block per data center (not including backups). Even though the content of each video is unique, there is a lot of duplicate information making up the underlying video file. So that mythical 1000TB of data might be less than 250TB once each duplicate block is stored only once.
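A tiny Python sketch of block-level de-duplication: split data into blocks, hash each block, and keep each unique block only once. Real systems use smarter variable-size chunking, but the principle is the same:

```python
# Toy block-level de-duplication store.
import hashlib

BLOCK_SIZE = 4096
block_store = {}  # block hash -> block bytes (each unique block stored exactly once)

def store(data):
    """Return the list of block hashes needed to reconstruct `data`."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # only stored if not already present
        recipe.append(digest)
    return recipe

recipe = store(b"\x00" * 20_480)  # five identical 4 KiB blocks of zeros
print(len(recipe), "blocks referenced,", len(block_store), "actually stored")  # 5 referenced, 1 stored
```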