eli5: what does (de-)fragmentation even mean?


After reading another post about why defragmentation isn’t as necessary with modern devices, I started wondering what exactly fragmentation even is. How and why does it happen, and doesn’t it screw up your data?


My dad always told me it was like tidying your room and organising everything alphabetically, but for your C: drive 😂

Old mechanical hard drives with spinning platters and a moving read/write head would take longer when they had to travel across the physical magnetic disk to multiple places to pull together all the bits of data associated with a program. Defrag consolidates all the data into one physical chunk on the disk, so the head and disk don’t have to move around as much or travel as far to access it all. Modern solid state drives have no moving parts; instead, a controller maps an address to each piece of data, so there is no speed decrease from accessing data in multiple places on the drive. The controller addresses each bit of data electronically, and because those signals travel at close to the speed of light, it doesn’t matter whether the data sits at address #1 or #736347, etc… the speed is essentially the same. Defrag is not necessary for SSDs.

Let’s say you have a hard disk with 10 files on it, and file #4 is 8 blocks long. If you delete file #4, there is an 8 block “hole”. If you add a new 12 block file, there might be space for it at the end of the 10 files. At some point, however, there won’t be enough unused space at the end to fit the next file. Eventually, 8 blocks of some file will be put where file #4 used to be and the rest of that file will go someplace else. That file will be in two fragments: one where file #4 used to be and another one someplace else. If you do this enough and the disk has little free space, eventually you’ll only have a bunch of 1 block holes. That means a new 15 block file will be split into 15 fragments, and reading it can take roughly 15 times as long as if the file were all together. SSDs have essentially no seek time, which eliminates this “longer time to read” effect.
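If it helps to see that walkthrough as code, here is a toy Python model of it. Everything here is made up for illustration (a real filesystem tracks blocks very differently); it just shows how a first-fit allocator ends up splitting a file across a hole and the end of the disk:

```python
# Toy model of the walkthrough above: the "disk" is a list of blocks,
# each either free (None) or owned by a file. All sizes are made up.
DISK_SIZE = 84
disk = [None] * DISK_SIZE

def allocate(disk, name, size):
    """First-fit allocator: take the first free blocks found, splitting
    the file across holes if needed. Returns the block indices used."""
    free = [i for i, owner in enumerate(disk) if owner is None]
    if len(free) < size:
        raise OSError("disk full")
    chosen = free[:size]
    for i in chosen:
        disk[i] = name
    return chosen

def fragment_count(blocks):
    """Number of contiguous runs in a sorted list of block indices."""
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)

# Ten 8-block files fill blocks 0-79, leaving 4 free blocks at the end.
files = {n: allocate(disk, n, 8) for n in range(10)}

# Delete file #4: an 8-block "hole" opens at blocks 32-39.
for i in files.pop(4):
    disk[i] = None

# A new 12-block file fits in neither free region alone, so it lands
# partly in the hole and partly at the end of the disk: two fragments.
new_file = allocate(disk, "new", 12)
print(fragment_count(new_file))  # -> 2
```

Running it shows the new file in exactly two fragments, matching the story above.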

Defragmentation is the process of copying all the files into a pattern where every file only needs one fragment. This involves a lot of copying and work, particularly if your disk has little free space.

Today, hard drives are big and running them almost completely full is uncommon, so the problem doesn’t occur much. Modern operating systems work to minimize fragmentation, and frequent defragmentation actually wears down SSDs, so it’s not a common thing to do manually anymore.

Imagine you have a book (your hard drive). It contains a story, but the data is spread out: some words on page 1, a few more on page 2, a bunch on page 3, and so on. The computer does this to fill up any available space. Defrag essentially takes all the words and moves them back to the front of the book… this makes it faster to read and therefore more efficient.

Let’s say you have a library. And you are in the mood to read a book. So you just grab the book and read it, front to back. EZ.

But, let’s say, when you’re trying to put the book back, for whatever reason, there isn’t enough room for it to fit on the shelf. So, instead, you rip the book in half, put half of it on the shelf where it used to be, and the other half in the next available free spot. You also include a note with the first half explaining where to find the second.

The next time you’re in the mood to read that book, you’ll grab the first half, read it, but then you’ll have to stop and find the second half to continue reading.

This is fragmentation. And, if it happens a lot, it can really slow down the loading and reading of files.

*De*fragmentation is the process of sitting down, taking all the books off the shelves, putting them back together, and putting them back on the shelves, whole.

It’s not really a thing anymore because the ways in which we store and organize files are better, the ways in which we load and read files are better, and the configuration and speed of newer hardware (e.g. SSDs) make fragmentation largely irrelevant.

When we remove files from the disk, we leave holes where those files were. When the next file comes in, and it’s larger than that hole, we fill as much of it as we can into that hole, then leave a note at the end saying “the rest of it is over in X”, where we put the rest of the file. For larger files (or smaller holes), we can end up splitting the file into several different pieces (fragments – fragmentation). Since jumping between disk locations is costly (less so now than it was before), every time we read one of those split files, it takes longer than if the pieces were all neatly put together.

As a side-note, those little notes we leave for “rest of it is over in X” still take space, and given enough splitting, might actually have a noticeable effect on how much of your disk you can use.

De-fragmentation is re-organizing the disk so files are made contiguous again.

Data is arranged on a hard disk in concentric circles (tracks). To access a different one of these circles, the disk drive has to move the magnetic head. This takes some length of time, which adds up if many such jumps are required.

Initially all files are written to a disk in a sequential fashion. One after another. Some files later get deleted and the space they took up is marked as free. When a new file is written, it may need to occupy several of these smaller gaps and now consists of disjoint segments. They are being kept track of in a table and can be reassembled, but this takes time.

So imagine a traditional hard drive – you have a spinning platter and a head that moves around on it.

Here’s a simplified example.

When you write a file called “myfile” to disk, you create an entry in a directory table saying “myfile starts on sector 17.” Now maybe your file is 6 sectors long, but sector 21 is used by something else. So you have sectors 17, 18, and 19, and in sector 20 you say “the rest of this file starts at sector 44.”

So your drive head moves over to sector 44, and reads sector 44, 45, and 46 to get the complete file.

Your file is in two fragments.
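The pointer-chasing in that example can be sketched in a few lines of Python. This is a toy model (real filesystems keep the chain in an allocation table, not inside the sectors themselves), using the same sector numbers as above:

```python
# Toy model of the sector chain above: each sector maps to
# (payload, next_sector). next_sector is None at end of file; a
# non-adjacent next_sector models a head seek to another fragment.
sectors = {
    17: ("my", 18),
    18: ("fi", 19),
    19: ("le", 20),
    20: ("co", 44),   # "the rest of this file starts at sector 44"
    44: ("nt", 45),
    45: ("en", 46),
    46: ("ts", None),
}

def read_file(sectors, start):
    """Follow the chain from the starting sector, counting how many
    times the head has to jump to a non-adjacent sector (a seek)."""
    data, seeks, cur = "", 0, start
    while cur is not None:
        payload, nxt = sectors[cur]
        data += payload
        if nxt is not None and nxt != cur + 1:
            seeks += 1     # fragment boundary: the head must move
        cur = nxt
    return data, seeks

content, seeks = read_file(sectors, 17)
print(content, seeks)  # two fragments -> one extra seek mid-file
```

The file reads back fine; the cost of the second fragment shows up as the extra seek, not as lost data (which answers the original question: fragmentation doesn’t corrupt anything, it just slows reads down).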

Now if you have a lot of big files that are constantly being written and re-written to disk, and especially if the disk is mostly full, the odds are very good that your file will get split up into a LOT of fragments, and every time you have to jump to a new fragment, you are probably going to have to move the head somewhere and then wait for the right portion of the disk to spin under it. This can definitely increase the time to read a file.

Defragmenting does its best to undo this process – it moves data blocks around on the disk to consolidate some free space, moves a fragmented file into that free space, and then repeats this many, many times.
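As a rough sketch of what that consolidation step does, here is a toy Python version. It just compacts a block list so every file ends up in one contiguous run with all free space at the end; real defragmenters are far more careful and incremental than this:

```python
# Toy compaction pass: "disk" is a list of block owners (a file name,
# or None for a free block). Rewrite every file contiguously, in order
# of first appearance, pushing all free space to the end.
def defragment(disk):
    order = []                      # files in order of first appearance
    for owner in disk:
        if owner is not None and owner not in order:
            order.append(owner)
    sizes = {name: disk.count(name) for name in order}
    out, pos = [None] * len(disk), 0
    for name in order:
        out[pos:pos + sizes[name]] = [name] * sizes[name]
        pos += sizes[name]
    return out

fragmented = ["a", "a", None, "b", "a", None, "b", "b", None]
print(defragment(fragmented))
# -> ['a', 'a', 'a', 'b', 'b', 'b', None, None, None]
```

Note how much copying even this tiny example implies: every misplaced block has to be read and rewritten, which is why defragmenting a nearly-full disk takes so long.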

So, **why isn’t it as important any more?**

Firstly, because we have solid-state drives (SSDs), which don’t have physical moving parts to shuffle around. Reading data blocks happens at a roughly constant rate, no matter where they reside on the “disk” or how fragmented they are. They’re also ridiculously fast compared to spinning drives.

Secondly, because modern filesystems are much better at addressing the problem of fragmentation. For example, they don’t start writing a file at the first spot available on a disk, or even at the first spot big enough for the entire file. Instead, they look for the smallest space that will hold the entire file; and in some cases, will deliberately fragment a file to avoid leaving tiny gaps of free space scattered across the disk. They also may do things like calculating the shortest ‘travel time’ from the end of one fragment to the start of the next.
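The “smallest space that will hold the entire file” idea mentioned above is a classic best-fit allocation strategy. Here is a hedged Python sketch contrasting it with naive first-fit (the hole layout and sizes are invented for illustration; real filesystems use more sophisticated extent allocators):

```python
# Free holes in a block list (None = free) as (start, length) pairs.
def holes(disk):
    runs, start = [], None
    for i, owner in enumerate(disk + ["end"]):  # sentinel closes a run
        if owner is None and start is None:
            start = i
        elif owner is not None and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

def first_fit(disk, size):
    """First hole big enough, even if it leaves an awkward leftover."""
    return next((h for h in holes(disk) if h[1] >= size), None)

def best_fit(disk, size):
    """Smallest hole that still holds the whole file."""
    candidates = [h for h in holes(disk) if h[1] >= size]
    return min(candidates, key=lambda h: h[1], default=None)

disk = ["x", None, None, None, None, "x", None, None, "x"]
print(first_fit(disk, 2))  # (1, 4): leaves a 2-block gap behind
print(best_fit(disk, 2))   # (6, 2): exact fit, no tiny leftover gap
```

Best-fit avoids scattering small unusable gaps across the disk, which is exactly the scenario that produced the 15-fragment file in the earlier example.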

Finally, spinning disks now have a big chunk of cache on them which will pre-load more data; and operating systems cache read data in memory as well. Less and less of the time we spend waiting for our computer is tied to the hard drive.

So, back in the days of spinning hard drives, data was written in concentric rings on the surface of platters. There was an arm that had to physically move across the surface of the drive to wherever the data you wanted was stored. The drive could keep track of where everything was written and what parts were free to write on, but waiting on that arm to move was the slowest part of reading data.

As you added data to the drive, it would more or less just add it after the end of the existing data. But along the way, you probably would have deleted some files, and eventually it’s going to run out of fresh disk to write on, so it starts writing into the gaps formed by deleted files. Those gaps might be smaller than the file being written, though, so it would break new files into pieces that fit in the open gaps. The file gets stored, but now, partway through reading the file, the arm has to move to a different spot on the drive, often many times during the read, which slows read time considerably. When files are split up like that, they’re said to be “fragmented,” since the file has been broken into multiple fragments across the drive.

Defragmentation is the process of re-organizing the empty space and files on a drive, so that each file can occupy one contiguous block of the drive.

The reason it’s not necessary anymore in the days of solid state storage is twofold. One: hard drives had more or less unlimited re-writes. They fail because of the moving parts inside (the edge of a spinning platter moves at highway speeds, so eventually something crashes or gets stuck and the drive is toast), but the magnetic surface itself doesn’t really wear out. Solid state drives, however, can only re-write any given cell a finite number of times (often thousands to tens of thousands of times, and there are a lot of temporary files that come and go and generate write cycles), and you don’t want to burn up those write cycles just re-arranging which cells hold which data. Especially when you consider the second reason: there are no moving parts, no swing arm that has to travel to the data. Every bit of a solid state drive can be read at more or less the same speed, whether it’s located right next to the previous bit or not. So the drive can just use whatever space is free (and it keeps track of those write cycles, so it’ll give preference to spots that have been used less, to wear the drive more evenly).

Imagine you have tools and a wall of tool chests. By default, your automated tool sorting system places a tool in the first drawer that has enough empty space for it. However, this often means that tools that go together, like drills and drill bits, end up stored in two separate, far-apart places, because one drawer has room for the tiny bits but the system has to go further to find space for the big drill. So when you need to use the drill, you waste time going to two separate tool chests for the drill and the bits, when it would be more efficient to store them in the same place since they’re always used together.

Defragmentation goes through and sees which tools go together and manually rearranges the tool boxes to put them in the same place so that they’re faster to fetch.

Computer scientists have to get very clever sometimes. I bet there were hundreds of thousands of instances of fragmentation grinding machines to a halt before there was a good solution to it. First, what it is and how it happens. Let’s say you’ve got a woodworking shop. You’re working on a piece of furniture. Every night, you go downstairs and take the piece out of its drawer, work on it for a few hours, then put it back. Maybe you’ve got a couple of side projects you work on to take a break every once in a while. They’ve all got their spots in the project cabinet. One day, you assemble two parts of the chair or whatever, but the problem is, it no longer fits in the spot you had set aside for it. With woodworking, your only recourse is to find a completely new location for the project that will accommodate its size. Maybe a smaller project will come along and fill the empty space; otherwise, you’re just making inefficient use of your storage.

Computers have a fancy trick up their sleeve. They keep a lookup table that is always a constant size and always sits in the same place. The lookup table is a list of the locations and sizes of each of the pieces corresponding to your project. Moreover, the pieces can be assembled on the fly when you need to work on it, then disassembled and placed in their own drawers when you’re done. If a piece gets too big for the location allocated to it and would spill over into other files, the program splits it apart, finds a new location for the new piece, and adds a new entry in the lookup table so it can find it again later.

I’m going to switch metaphors for defragmentation. It also lets me explain why fragmentation is not an issue on solid state devices, the kind in most modern computers now. Let’s say your favorite movie theater is Marcus, your favorite local sports team plays at Allianz Field, and there’s a nice little dive bar 8 miles away where you like to go listen to the live musicians they host. Meanwhile, Bob Floozy (some guy in town) prefers spending most of his free time at the AMC theater, Target Field, or a small concert hall.

You both spend a fair bit of time on the road driving to your favorite places of leisure. This isn’t a perfect analogy because many people go to these places, so for the sake of this example, pretend you two are the only people who use the establishments listed. It would be so much more efficient if we moved your favorite establishments into your neighborhood and all of Bob’s favorite places into his. You could get to what you love in less time and spend less time waiting. This is defragmenting. Pieces of files are moved around so they’re all in one place. This matters because on a spinning disk hard drive, a piece of data’s location is relevant. If one piece of a file is stored on the left side of the disk and another piece on the right side, it takes time to spin the disk and move the read/write head to the correct location. Moving files together means less physical movement, less time wasted in travel, and quicker computing.

Solid state devices are not immune to fragmentation. In fact, there is no difference in how files end up stored when you’re finished working on them. The difference is in retrieving files. You ever wonder how a USB stick stores billions of bytes of information, accessible through only 4 pins (2 of which are for power), with no moving parts? There is a unique path to every single bit. All that needs to be done is switch the appropriate transistors on and off to connect that bit to the output at the right time (handled by a small controller onboard). This is very fast. SSDs are smaller than hard drives not because the storage itself shrank, but because there’s no moving mechanism or bulky casing. This is where my analogy comes back into play. Instead of driving to the theater, stadium, or bar, you can turn on the TV and flip to a movie, sports, or music channel at will without moving. It doesn’t matter that the venues are scattered all around town; at the click of a button, a camera in a stadium sends its data through some antennae and eventually to your screen. No need to defragment, because the distance between you and the source of entertainment doesn’t matter.