How does a filesystem on a FLASH storage device work?


I was thinking about how to design a filesystem from scratch.

Let’s say every cell in our FLASH chip has 100,000 write cycles before it dies, and let’s imagine a simple flat filesystem.

I imagine the filesystem splits the data of the file across multiple chunks and creates a linked list of these chunks, where every chunk holds a pointer to the next chunk. It would be a copy-on-write filesystem, which should in theory slow down the wear on the cells. But I’m not really sure how to achieve efficient wear leveling.

What I really don’t know is how the filesystem table should work. Changing, creating or deleting files would be very easy to do and wouldn’t do too much damage to the FLASH cells themselves, but I imagine you’d need to update the filesystem table with each of these operations.

My question is: how do you get around this? Do you just rely on the fact that when the cells holding the filesystem table fail, the whole filesystem fails? I guess you could somehow store multiple copies of the filesystem table and compare them against each other at every write operation? And if one of the copies failed, you’d just create a new one?
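A minimal sketch of that multiple-copies idea in C, assuming a hypothetical three-copy table where each copy carries a checksum (every name, size and the toy checksum here are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define FS_TABLE_COPIES 3
#define FS_TABLE_SIZE   4096

/* One copy of the filesystem table plus a checksum over its bytes. */
struct fs_table_copy {
    uint8_t  data[FS_TABLE_SIZE];
    uint32_t sum;
};

/* Stand-in for the copies as read back from flash. */
static struct fs_table_copy copies[FS_TABLE_COPIES];

/* Toy checksum standing in for a real CRC. */
static uint32_t checksum(const uint8_t *buf, size_t len)
{
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++)
        s = s * 31 + buf[i];
    return s;
}

/* Find the first copy whose checksum matches, then repair the others
 * from it (the "if one copy failed, create a new one" step). */
static struct fs_table_copy *load_fs_table(void)
{
    struct fs_table_copy *good = NULL;

    for (int i = 0; i < FS_TABLE_COPIES; i++) {
        if (checksum(copies[i].data, FS_TABLE_SIZE) == copies[i].sum) {
            good = &copies[i];
            break;
        }
    }
    if (!good)
        return NULL;  /* every copy is corrupt: the filesystem is lost */

    for (int i = 0; i < FS_TABLE_COPIES; i++)
        if (checksum(copies[i].data, FS_TABLE_SIZE) != copies[i].sum)
            copies[i] = *good;  /* on real flash: erase, then rewrite */

    return good;
}
```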

And so you would also need to store the addresses of bad sectors somewhere, so you don’t attempt to write to them. But you would also need multiple copies of that, wouldn’t you?


4 Answers

Anonymous 0 Comments

Why not google the existing solutions? There are half a dozen file systems designed to minimize wear on EEPROM or flash storage (JFFS2, YAFFS, UBIFS and F2FS, for a start).

Anonymous 0 Comments

Modern operating systems are pretty smart: you don’t necessarily need to get rid of the whole drive once sectors start to go bad, because your OS will isolate them from read/write usage. Replacing the drive is still prudent, though. Drives are also good at spreading reads and writes pretty evenly, so sectors going bad is a good indication your drive overall is failing for whatever reason.

The last thing you mentioned is actually a thing, it’s called [RAID storage (redundant array of inexpensive disks)](https://en.m.wikipedia.org/wiki/RAID), specifically RAID 1, where you store multiple copies of files. But that’s for the contents, not the inodes (the records that point to the actual contents and hold the metadata).

You are correct though about the linked-list-like implementation. A disk partition has a segment size, and when a file exceeds that, the tail end of each segment’s data is replaced with a pointer to the next chunk. The inode points to the first one, then each chunk points to the next one, or to null.
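As a rough sketch of that chain-of-chunks layout in C (sizes and names invented for illustration, with a plain array standing in for the disk):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_DATA 508           /* payload bytes per chunk (made up)      */
#define CHUNK_END  0xFFFFFFFFu   /* "null": marks the last chunk of a file */
#define NUM_CHUNKS 1024

/* One chunk: file data, with the tail end replaced by a pointer
 * (here an index) to the next chunk. */
struct chunk {
    uint8_t  data[CHUNK_DATA];
    uint32_t next;
};

static struct chunk disk[NUM_CHUNKS];  /* stand-in for the storage */

/* Walk the chain from the chunk the inode points to, copying the file
 * contents into buf (assumed big enough). Returns bytes copied. */
static size_t read_file(uint32_t first_chunk, uint8_t *buf)
{
    size_t off = 0;
    for (uint32_t c = first_chunk; c != CHUNK_END; c = disk[c].next) {
        memcpy(buf + off, disk[c].data, CHUNK_DATA);
        off += CHUNK_DATA;
    }
    return off;
}
```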

A corrupt filesystem table is a big problem though; it usually lives in the first few sectors of your disk allocation. There are ways to recover, but they usually involve parsing the entire disk and piecing it back together from whatever other information remains.

As for implementing it, you should look at how Linux does it, or just Google filesystems like NTFS, exFAT or ext4. exFAT and ext4 are publicly documented, and Linux has fully open-source implementations of all three (its NTFS driver is reverse-engineered, since NTFS itself is proprietary).

ELI5 isn’t a great place for asking how to implement it in specifics because, well… you can’t really do that in a short, concise manner. Filesystems are stupidly complex, as you seem to know from the premise lol

Anonymous 0 Comments

Modern filesystems (not just flash-specific ones, but also HDD-oriented ones: NTFS, ext4, etc.) do not use a linked list to connect the chunks of a file. They instead use a tree structure: there is a “superchunk” that contains a list of all the chunks. If a file is very big, it can have multiple superchunks, with a “super-super-chunk” storing a list of those.

The files are recorded in a file table, which stores information about each file as well as its topmost super…chunks. The file table is itself organized like a file: it has its own chunks and superchunks. The superchunks of the file table are recorded in the “root chunk”, which holds the data about the entire filesystem.
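A minimal sketch of that layout in C, with one level of superchunk and invented names and sizes (real filesystems add more levels for big files and read these structures from disk rather than from arrays):

```c
#include <stdint.h>

#define CHUNK_SIZE     4096
#define PTRS_PER_CHUNK (CHUNK_SIZE / 4)  /* 1024 chunk numbers fit in one */

/* A superchunk is just a list of the chunk numbers that hold the data. */
struct superchunk {
    uint32_t chunks[PTRS_PER_CHUNK];
};

/* A file table entry: file metadata plus its topmost superchunk. */
struct file_entry {
    char     name[32];
    uint64_t size;
    uint32_t superchunk;
};

/* Stand-in for reading a superchunk off the disk. */
static struct superchunk superchunks[64];
static struct superchunk *read_superchunk(uint32_t n)
{
    return &superchunks[n];
}

/* The payoff of the tree over a linked list: any byte offset in the
 * file is reachable in a couple of reads instead of walking a chain. */
static uint32_t chunk_for_offset(const struct file_entry *f, uint64_t off)
{
    return read_superchunk(f->superchunk)->chunks[off / CHUNK_SIZE];
}
```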

On a flash FS, the chunks do not get overwritten when modified. Instead, a new chunk is created further down, then a new superchunk, then a new file table chunk, and finally a new root chunk. Only when the filesystem reaches the end of the memory does it go back and start overwriting the blocks at the beginning.
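Here is a rough illustration of that ripple-upward behavior, with made-up block sizes and an in-memory array standing in for the flash; the pointer bookkeeping is simplified to “store the child’s block number at offset 0”:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE   512
#define FLASH_BLOCKS 4096

static uint8_t  flash[FLASH_BLOCKS][BLOCK_SIZE];  /* stand-in for the chip */
static uint32_t log_head;                         /* next block to write   */
static uint32_t root_block;                       /* where the root is now */

/* Append-only write: nothing is modified in place.  When the head hits
 * the end of memory it wraps around and reuses blocks at the beginning
 * (a real FS would first check those no longer hold live data). */
static uint32_t append_block(const void *data)
{
    uint32_t where = log_head;
    memcpy(flash[where], data, BLOCK_SIZE);
    log_head = (log_head + 1) % FLASH_BLOCKS;
    return where;
}

/* Changing one data chunk ripples up the tree: a new chunk, then a new
 * superchunk pointing at it, then a new file table chunk, then a new
 * root.  Every level is a fresh append, so wear spreads over the chip. */
static void cow_update(uint8_t chunk[BLOCK_SIZE],
                       uint8_t superchunk[BLOCK_SIZE],
                       uint8_t file_table[BLOCK_SIZE],
                       uint8_t root[BLOCK_SIZE])
{
    uint32_t c = append_block(chunk);
    memcpy(superchunk, &c, sizeof c);
    uint32_t s = append_block(superchunk);
    memcpy(file_table, &s, sizeof s);
    uint32_t f = append_block(file_table);
    memcpy(root, &f, sizeof f);
    root_block = append_block(root);
}
```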

That means the position of the root block is not fixed. While the computer is running, it can remember the root’s position in RAM, but on startup it has to search for the root. That might sound bad, but with binary search a 1-terabyte disk can be searched with just 40 or so reads. Of course, the blocks have to be marked with some version counter, and some “milestone” blocks have to be written along the way.
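A sketch of that startup search, under the simplifying assumptions that each block carries a monotonically increasing version counter and that erased, never-written blocks read as version 0:

```c
#include <stdint.h>

#define FLASH_BLOCKS 1024   /* tiny stand-in; 1 TB in 4 KB blocks would
                               be 2^28 blocks, so roughly 28 probes */

/* Stand-in for reading the version counter stamped on each block. */
static uint64_t versions[FLASH_BLOCKS];
static uint64_t block_version(uint32_t b) { return versions[b]; }

/* Because the log is written in order, versions increase from block 0
 * up to the write head, then drop to older (or zero) values.  Binary
 * search for that drop instead of scanning the whole disk. */
static uint32_t find_newest_block(void)
{
    uint32_t lo = 0, hi = FLASH_BLOCKS - 1;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (block_version(mid) > block_version(hi))
            lo = mid + 1;   /* the drop is to the right of mid */
        else
            hi = mid;       /* the drop is at mid or to its left */
    }
    /* lo is the oldest block; the one just before it is the newest,
     * which is where the most recent root chunk lives. */
    return (lo + FLASH_BLOCKS - 1) % FLASH_BLOCKS;
}
```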

Many types of flash also allow “partial writing”: you can write to the same place for “free”, as long as you only change bits in one direction (usually from 1 to 0). This makes it possible to create “log chunks” that record small changes without committing to rewriting chunks and superchunks. Only when a log chunk is full are the changes actually applied.
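A toy illustration of such a log, assuming flash that erases to all 1s and programs bits only from 1 to 0 (the record fields are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LOG_SLOTS 64

/* Erased flash reads as all 1s, and programming can only clear bits
 * (1 -> 0), so new records can be appended to an erased page without
 * paying for another erase. */
struct log_record {
    uint8_t  not_written;  /* 0xFF = slot still erased, 0x00 = in use */
    uint8_t  pad;
    uint16_t file;         /* hypothetical: which file entry changed  */
    uint32_t new_chunk;    /* hypothetical: its new first chunk       */
};

static struct log_record log_page[LOG_SLOTS];

/* "Format": simulate an erased page, every bit set to 1. */
static void log_erase(void) { memset(log_page, 0xFF, sizeof log_page); }

/* Append one small change.  Returns false when the page is full, which
 * is the moment to actually rewrite the chunks and superchunks, then
 * erase the page and start over. */
static bool log_append(uint16_t file, uint32_t new_chunk)
{
    for (int i = 0; i < LOG_SLOTS; i++) {
        if (log_page[i].not_written == 0xFF) {   /* first erased slot  */
            log_page[i].file = file;             /* 1 -> 0 writes only */
            log_page[i].new_chunk = new_chunk;
            log_page[i].not_written = 0x00;      /* commit marker last */
            return true;
        }
    }
    return false;
}
```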

*Edit:* as I was corrected, modern consumer SSDs have a wear-leveling feature that lets you use a normal HDD filesystem without caring about the special needs of flash. It works like this: internally, the SSD has a different filesystem of its own, containing just a single file. A mini-computer inside the SSD manages that FS, presenting the contents of that single file as the contents of the SSD.

Anonymous 0 Comments

File systems don’t need to do this; the SSD handles what you are asking about transparently.

There is another layer of abstraction handled at the SSD device level. It can be designed to cooperate with the host computer, but it doesn’t need to in order to function.

It’s called the “flash translation layer”.

The SSD maintains a mapping table that maps each logical address to a physical block, and other tables record the used/free status and the age of each block.

Whenever a logical block is modified, the SSD looks for a free physical block, writes the new data to it, then updates the mapping table so the logical address points to the physical block with the newer copy of the data. The original physical block is marked invalid, its write count is incremented, and it is ready to be reused for something else.

In this way, new data and modifications always go to the blocks with the lowest wear. Repeated overwrites of the same logical block, which would otherwise burn through its write cycles, never land on the same physical block; it’s always the blocks with the lowest write counts that get used.
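A toy model of that mechanism in C, with invented names and an in-memory array in place of the actual NAND; a real FTL also has to persist these tables and run garbage collection:

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS 1024
#define BLOCK_SIZE 512
#define UNMAPPED   0xFFFFFFFFu

/* Hypothetical FTL state: where each logical block currently lives,
 * plus per-physical-block bookkeeping. */
static uint32_t l2p[NUM_BLOCKS];          /* logical -> physical map */
static uint32_t erase_count[NUM_BLOCKS];  /* wear per physical block */
static uint8_t  in_use[NUM_BLOCKS];       /* 1 = holds live data     */
static uint8_t  cells[NUM_BLOCKS][BLOCK_SIZE];  /* stand-in chip     */

/* "Format": no logical block is mapped anywhere yet. */
static void ftl_init(void)
{
    memset(l2p, 0xFF, sizeof l2p);
}

/* Pick the free physical block with the lowest wear. */
static uint32_t pick_least_worn_free(void)
{
    uint32_t best = UNMAPPED;
    for (uint32_t p = 0; p < NUM_BLOCKS; p++)
        if (!in_use[p] &&
            (best == UNMAPPED || erase_count[p] < erase_count[best]))
            best = p;
    return best;
}

/* "Overwrite" a logical block: the new data goes to a fresh, lightly
 * worn physical block and the stale copy is released, so the same
 * logical address never hammers the same physical cells. */
static void ftl_write(uint32_t logical, const void *data)
{
    uint32_t p = pick_least_worn_free();   /* assume one exists */
    memcpy(cells[p], data, BLOCK_SIZE);
    uint32_t old = l2p[logical];
    l2p[logical] = p;
    in_use[p] = 1;
    if (old != UNMAPPED) {
        in_use[old] = 0;     /* marked invalid, free for reuse        */
        erase_count[old]++;  /* it gets erased before its next use    */
    }
}
```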

This transparent behavior is good enough on its own, but it can be made better if the host preemptively tells the SSD that certain logical blocks are already free at the file system level as soon as a deletion happens, instead of the SSD only finding out when those blocks are eventually overwritten.

This is done via “TRIM” commands. Whenever a file-system-level deletion happens, the host also notifies the SSD that the logical addresses holding the bulk of the deleted data are free. The SSD can preemptively erase those blocks, so the next time a write happens it can go straight to the prepared blocks. This improves SSD write performance and longevity: the extra known-free space lets the SSD shuffle around cold data that never gets modified, freeing those otherwise-stuck blocks for better wear leveling.
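Extending the FTL toy model above, a TRIM handler might simply drop the affected mappings so the stale physical blocks join the free pool right away:

```c
/* TRIM: the filesystem tells the FTL that these logical blocks no
 * longer hold live data, so their physical copies can be released and
 * pre-erased now instead of waiting for a future overwrite.  Uses the
 * l2p/in_use/erase_count tables from the sketch above. */
static void ftl_trim(uint32_t logical_start, uint32_t count)
{
    for (uint32_t l = logical_start; l < logical_start + count; l++) {
        uint32_t p = l2p[l];
        if (p != UNMAPPED) {
            l2p[l] = UNMAPPED;   /* reads of this logical block now miss */
            in_use[p] = 0;       /* block joins the free pool right away */
            erase_count[p]++;    /* counted when it is actually erased   */
        }
    }
}
```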