How exactly do Amazon S3/Hadoop/etc work?

How do they work and what makes them special?

Note that I don’t work with these technologies often, so this is my understanding of them, summarized from the information available. (I’ve used Hadoop in the past, but that was a few years ago.)

Amazon S3 is considered “object” storage. An object is a file plus information about that file (metadata), stored in a bucket (a container) under a unique key. That key is used to retrieve the object from anywhere with Internet access. The data can also be replicated across multiple geographic regions so that retrieval stays fast no matter where in the world it is accessed from. S3 also scales (grows and shrinks) the amount of storage you are allocating, and paying for, based on what you actually put in it. In contrast, if you buy a hard drive, you can’t pay for just 10% of it, even if you only use 10%. With S3 (and similar services), you pay only for what you use. And if you suddenly need more space, you don’t have to run out and buy more hard drives: S3 automatically expands to meet your needs (within any preset limits you impose).
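
To make the bucket/key/metadata idea concrete, here’s a minimal sketch of storing and fetching an object with Python’s boto3 library. The bucket name and key are hypothetical, and this assumes you already have AWS credentials configured:

```python
import boto3

# Client for the S3 API (assumes AWS credentials are already configured)
s3 = boto3.client("s3")

# Store an object: the bucket + key uniquely identify it, and we can
# attach our own metadata alongside the file contents.
s3.put_object(
    Bucket="my-example-bucket",          # hypothetical bucket name
    Key="reports/2021/summary.txt",      # hypothetical key
    Body=b"hello, object storage",
    Metadata={"author": "anonymous"},
)

# Retrieve it later, from anywhere that can reach the S3 API.
response = s3.get_object(Bucket="my-example-bucket",
                         Key="reports/2021/summary.txt")
print(response["Body"].read())   # b'hello, object storage'
print(response["Metadata"])      # {'author': 'anonymous'}
```

Note that you never deal with disks or file systems here, just names and contents, which is what lets Amazon handle the scaling and replication for you behind the scenes.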

Hadoop is a distributed processing framework (really a group of different programs that work together) that splits operations on large amounts of data across many computers for performance. One of the main programming models Hadoop supports is called MapReduce: a “map” step first transforms or filters each piece of the data in parallel (for example, picking out the fields you care about), the results are grouped by key, and a “reduce” step then aggregates each group with operations like counting or averaging before sending the result on to the next task. It is mainly used for offline (not real-time) processing of what is currently considered very large datasets, jobs that would be exceptionally long-running if executed on a single computer.
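
As an illustration of the idea (not of Hadoop’s actual Java API), here’s the classic MapReduce example, word counting, simulated on one machine in Python. Real Hadoop runs the map and reduce steps on many computers and handles the grouping (“shuffle”) between them over the network:

```python
from itertools import groupby
from operator import itemgetter

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit a (word, 1) pair for every word in every document.
# In Hadoop, this step runs in parallel across many machines.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (the word). Hadoop does this
# across the network between the map and reduce phases.
mapped.sort(key=itemgetter(0))

# Reduce: aggregate each group, here by counting.
counts = {word: sum(n for _, n in pairs)
          for word, pairs in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}
```

The trick is that the map step only looks at one record at a time and the reduce step only looks at one key’s group at a time, so both can be spread across as many machines as you have.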