Storage on disk is a complex business: most files are broken up into smaller chunks (each piece having a link to the next one), and deleting files, which happens often enough, leaves gaps in the overall physical layout of these chunks.
When you copy a file to that disk, the operating system has to decide where to put the new chunks, and it tries to be somewhat smart about that (e.g. finding an available gap to fill so as not to waste space).
So when you have a lot of small files, it has to make that decision anew for each one, and there’s a noticeable overhead compared to a single big file processed in one go.
For similar reasons, hard disks have to be defragmented (moving the chunks around on the disk platter) to get rid of the gaps, which is usually done automatically in the background or on a schedule. SSDs are different: they have no moving parts, so scattered chunks cost them very little and they don’t need defragmentation. That doesn’t remove the need to find free space for writing, though.
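If you want to see the per-file overhead for yourself, here is a rough sketch in Python. It writes the same number of bytes once as many small files and once as a single big file, and times both. The function names (`write_files`, `compare`) and the sizes are made up for illustration, and the numbers you get will depend heavily on your drive, file system and OS caching, so treat it as a toy measurement rather than a benchmark.

```python
import os
import tempfile
import time


def write_files(directory, count, size_each):
    """Write `count` files of `size_each` bytes each; return elapsed seconds."""
    payload = b"x" * size_each
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"file_{i}.bin")
        # Every iteration creates a new file, so the file system has to
        # allocate space and update its bookkeeping each time.
        with open(path, "wb") as f:
            f.write(payload)
    return time.perf_counter() - start


def total_size(directory):
    """Sum the logical sizes of all files directly inside `directory`."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


def compare(total_bytes=1_000_000, small_count=1000):
    """Write the same amount of data as many small files and as one big file."""
    with tempfile.TemporaryDirectory() as many_dir, \
            tempfile.TemporaryDirectory() as one_dir:
        t_many = write_files(many_dir, small_count, total_bytes // small_count)
        t_one = write_files(one_dir, 1, total_bytes)
        return t_many, t_one, total_size(many_dir), total_size(one_dir)
```

Calling `compare()` returns the two timings plus the bytes written each way, so you can check that both runs really moved the same amount of data before comparing the times.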
Imagine a delivery driver who is delivering items to a house, but can only carry one item at a time.
It’s much faster to deliver a single 10 kg item than it is to deliver ten 1 kg items individually.
The driver has to go back to the truck, collect the next item, deliver it to the door, and so on.
Your computer likewise has to finish the transfer, verify that it copied, then find the next file and do the same again.
Over the network, there’s also a small loss of efficiency with lots of smaller files. Data travels in packets: an IP packet can be at most about 64 KB, and on a typical Ethernet link each one carries roughly 1,500 bytes. Each file typically needs its own packets plus its own request and acknowledgement, so sending eight 8 KB files adds more time and overhead than sending one 64 KB file.
It might be easier to compress a bunch of smaller files into a zip file, copy it, and then unzip it at the other end.
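The zip-then-copy idea can be sketched in a few lines of Python using only the standard library (`zipfile`, `shutil`, `pathlib`). The function name `copy_as_bundle` and the archive name `bundle.zip` are just placeholders I picked, and this assumes the destination is a folder you can write to directly (a mounted network share, for example).

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def copy_as_bundle(src_dir, dst_dir):
    """Zip everything under src_dir, copy the single archive, unzip in dst_dir."""
    src_dir = Path(src_dir)
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "bundle.zip"  # placeholder archive name
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in sorted(src_dir.rglob("*")):
                if path.is_file():
                    # Store paths relative to src_dir so the tree is preserved.
                    zf.write(path, path.relative_to(src_dir).as_posix())
        # One large transfer instead of many small ones.
        copied = Path(shutil.copy2(archive, dst_dir / archive.name))

    with zipfile.ZipFile(copied) as zf:
        zf.extractall(dst_dir)
        names = zf.namelist()
    copied.unlink()  # remove the archive once its contents are extracted
    return names
```

Whether this actually wins depends on the situation: zipping and unzipping cost time too, so it tends to pay off mostly when the copy itself is the slow part, such as over a network.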
Also notice that when you check a file’s properties, it shows “size” and then “size on disk”. There’s a minimum amount of space the file system hands out at a time, which is the cluster size. Imagine serving dinner where a plate is the cluster. If someone wants 50 full plates of rice, every plate gets filled completely and the lot is quick to move around.
But if 50 people each want 1 grain of rice, you still have to use a whole plate for each grain. So you use up 50 plates for 50 grains. Also, that’s 50 different people, so it’ll take some time to find them, hand the plates out, etc.
Very loose analogy. But I have seen a folder with 100 tiny 1 KB files take up more space on disk than a single 200 KB file because of cluster size.
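The arithmetic behind that is easy to check. Here is a small sketch, assuming a 4 KB (4096-byte) cluster size, which is a common default but not universal, and ignoring file system tricks such as storing very small files inside their metadata record. The function name `size_on_disk` is my own.

```python
def size_on_disk(file_size, cluster_size=4096):
    """Round a file size in bytes up to a whole number of clusters.

    Assumes a simple file system where every non-empty file occupies
    at least one full cluster (4096 bytes is an assumed default).
    """
    if file_size == 0:
        return 0
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


# 100 files of 1 KB each: every file still takes a full 4 KB cluster.
small_files_total = 100 * size_on_disk(1024)

# One 200 KB file fills exactly 50 clusters with nothing wasted.
big_file_total = size_on_disk(200 * 1024)

print(small_files_total, big_file_total)
```

Under those assumptions the 100 small files occupy 409,600 bytes on disk even though they only hold 102,400 bytes of data, while the single 200 KB file occupies 204,800 bytes.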