Files usually have an extension and a header. The extension is a dot followed by a few letters (usually three) at the end of the filename. It determines which program will open the file and gives some general information about the type of data. The operating system sometimes hides it.
The header is a section at the start of a file that gives information on how to read the file contents, like the resolution of an image, the framerate of a video, the sampling rate of an audio file and so on. Combining the two, you get all the metadata you need to read the binary information and interpret it in the correct way.
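As a rough illustration of "the header tells you how to read the rest", here is a minimal sketch that builds a tiny WAV file in memory and then reads the sampling rate back out of its header. The helper names `make_wav_bytes` and `read_wav_header` are made up for this example, and the parser assumes the simple layout that Python's `wave` module writes; real files can be laid out differently.

```python
import io
import struct
import wave


def make_wav_bytes(sample_rate: int) -> bytes:
    """Build a tiny silent mono WAV file in memory (hypothetical helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * 10)  # ten silent samples
    return buf.getvalue()


def read_wav_header(data: bytes) -> dict:
    """Read a few fields from a simple PCM WAV header.

    Assumption: the "fmt " chunk sits right after the 12-byte RIFF
    header, which is how the `wave` module writes files. Other files
    may place extra chunks first.
    """
    riff, _size, wave_tag = struct.unpack_from("<4sI4s", data, 0)
    if riff != b"RIFF" or wave_tag != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunk_id, _chunk_size, _audio_format, channels, sample_rate = struct.unpack_from(
        "<4sIHHI", data, 12
    )
    if chunk_id != b"fmt ":
        raise ValueError("unexpected chunk layout")
    return {"channels": channels, "sample_rate": sample_rate}


info = read_wav_header(make_wav_bytes(sample_rate=8000))
print(info)
```

A player does essentially this before it touches the audio samples: without the sampling rate and sample width from the header, the remaining bytes are just numbers with no timing attached.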
Every file has a format. That format is what tells the program that opens it what it is meant to be.
But you are absolutely correct, at the end of the day it’s just binary. A format is a suggestion. You can absolutely open a video file in, say, a hex editor.
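To make the "it's just binary" point concrete, here is a small sketch of what a hex editor shows you. The function name `hex_dump` is made up for this example; it works on any bytes, no matter what format they claim to be.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes the way a hex editor does: offset, hex values, printable text."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(width * 3)} {text_part}")
    return "\n".join(lines)


# The first eight bytes of any PNG file.
print(hex_dump(b"\x89PNG\r\n\x1a\n"))
```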
Some of those internet “hacker puzzles” rely on opening, say, an audio file as an image, or a video as text. But that usually means the audio or video is jumbled or broken if played in its “natural” player, because the contents of those files do not follow the structure that the format actually uses in legit files.
Some files have a header that “explains” what the format is. There is also a convention: the file extension is a strong hint about the file type. Your computer maintains a set of “file associations”, a correspondence between extensions and file types, to help with this. But mostly, files don’t explain themselves. If there is no header, and there is either no extension or no mapping for the extension, or if somebody has renamed the file with an extension that no longer matches its true format, you have to tell the computer what to use. That’s what happens when you get that “what program would you like to use to open this” message.
The simplest way is to just look at the file name extension, the part of the file name after the period. It is also common for files to come with what is known as a MIME type string; HTTP, email and similar protocols do this. This again tells the computer what type of file it is. But these are just hints, and at the end of the day they might be wrong.
There are tools that can identify a lot of different file types by looking at their content. It is very common for file types to have certain “magic numbers” in them, which are there to help identify the file type. If you open a file in a text editor, it often says the file type right at the beginning. So just a simple database of these magic numbers and their locations within the file goes a long way toward identifying the file type. And if a file type does not have an intentional magic number, there are often very common byte values in the file that can help identify it. At this point we have covered almost all the file types out there, but it is still possible to do even more advanced pattern matching and to try parsing the content.
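Here is a minimal sketch of the extension-based guess, using Python's standard `mimetypes` module. The wrapper name `guess_from_name` is made up for this example. Note that it only ever looks at the name, never at the bytes, which is exactly why the result is a hint and not a fact.

```python
import mimetypes
from typing import Optional


def guess_from_name(filename: str) -> Optional[str]:
    """Guess a MIME type from the file name alone (the contents are never read)."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


for name in ["picture.jpg", "notes.txt", "mystery"]:
    print(name, "->", guess_from_name(name))
```

A name without an extension gives `None`, and a renamed file gives a confident but wrong answer, since the lookup table has no way of knowing what is actually inside.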
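The "simple database of magic numbers and their locations" could look roughly like this. It is a sketch with only a handful of well-known signatures, and the names `SIGNATURES` and `sniff` are made up for this example; real tools carry thousands of entries and more elaborate rules.

```python
from typing import Optional

# Each entry: (offset within the file, expected bytes, description).
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"%PDF-", "PDF document"),
    (0, b"\xff\xd8\xff", "JPEG image"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"PK\x03\x04", "ZIP archive"),
    (257, b"ustar", "tar archive"),  # not every magic number sits at offset 0
]


def sniff(data: bytes) -> Optional[str]:
    """Return a description for the first matching signature, or None."""
    for offset, magic, description in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return description
    return None


print(sniff(b"%PDF-1.7\n"))
print(sniff(b"just some plain text"))
```

Plain text has no signature at all, so a lookup like this returns nothing for it. That is where the "more advanced pattern matching" comes in.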
Basically, the files are constructed following certain standards that include some sort of “header” that tells the computer what they are.
Broken down to a basic level, it is as if the two of us agreed on the following for exchanging files:
– Whenever a file starts with “00”, it’s a text file
– when it starts with “01”, it’s a video
– when it starts with “10”, it’s an image
– when it starts with “11”, something went wrong and that’s not a valid file
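The toy agreement above fits in a few lines. This is only a sketch of that made-up two-bit convention, with the file written as a string of `0` and `1` characters for readability; the names `TYPES` and `classify` are invented for this example.

```python
# The made-up agreement: the first two bits announce the file type.
TYPES = {"00": "text", "01": "video", "10": "image"}


def classify(bits: str) -> str:
    """Return the agreed file type for a string of '0'/'1' characters."""
    prefix = bits[:2]
    if prefix not in TYPES:
        # Covers "11" as well as input that is too short to carry a header.
        raise ValueError(f"not a valid file: header {prefix!r}")
    return TYPES[prefix]


print(classify("0101101"))  # header "01"
print(classify("1000111"))  # header "10"
```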
In practice, these headers are much longer of course, and pass much more info, such as the version of the format used, the encoding for text files (i.e. which bit string corresponds to which symbol), stuff like the resolution and framerate for videos, the file size and so on.
A common approach to passing the file type specifically is the extension: we agree that a filename is passed along with the file, and whatever comes after the last period in that name specifies the file type.
The most basic is going by the file name, as picture.jpg lets the computer know the binary data in the file is in jpeg format, so it should process that data as a jpeg and show you the results. But if you rename the jpeg to .txt and open it, you’ll just get a bunch of junk in your text viewing program as the binary image data is translated into text instead of a picture.
Apple computers also have forks, basically a tiny second file connected to the data file that has a bunch of information about the data.
After that, a computer can look into a file to see the pattern of the binary data in it and make a good guess as to what format it’s in. For example, a png file has a specific binary format that will have binary for the letters “PNG” near the beginning.
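For the PNG case, the check is short. Every PNG file begins with the same eight bytes, and the letters “PNG” are bytes two to four of that sequence. The function name `looks_like_png` is made up for this sketch.

```python
# The fixed eight-byte signature at the start of every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """True if the data begins with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


print(PNG_SIGNATURE[1:4])  # the three ASCII letters inside the signature
print(looks_like_png(PNG_SIGNATURE + b"rest of the file"))
print(looks_like_png(b"\xff\xd8\xff"))  # start of a JPEG instead
```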
Yes, the computer only understands machine code. But the difference between types of files isn’t at the machine level. It is at the OS level, which is much higher.
When you open a file, you generally open it with another program (the OS generally has an application associated with the extension of the file, but you can try opening any file with any program). That program is what will try to make sense of the given data.
Let me show you an example:
Imagine you have a .jpg file (an image). When you open it in an application that can display images, the application will see that the data is in .jpg format, decode the 0s and 1s using the rules for .jpg images, and then display the image on the screen.
But you can try opening it in a text editor. If you do that, you’ll probably see a bunch of random weird characters. What happened is that the text editor sees the same 1s and 0s, but it tries to decode them as text.
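A small sketch of that idea: the same bytes, read two different ways. The ten bytes below are the typical start of a JPEG (JFIF) file. Decoding with `latin-1` is an assumption standing in for "whatever the text editor happens to do", and the function name `interpretations` is made up for this example.

```python
def interpretations(data: bytes) -> dict:
    """Show the same bytes as raw numbers and as decoded text."""
    return {
        "as_numbers": list(data),
        # latin-1 maps every byte to some character, so decoding never
        # fails; it just produces junk where the bytes were not text.
        "as_text": data.decode("latin-1"),
    }


jpeg_start = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A, 0x46, 0x49, 0x46])
result = interpretations(jpeg_start)
print(result["as_numbers"])
print(repr(result["as_text"]))
```

Neither view is "wrong". An image viewer just applies a third set of rules to the same bytes, and those rules happen to be the ones the file was written for.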
The simplest way to tell is to give the file a name that tells you what it is. This is a file extension, like the `.jpg` in `somecoolphoto.jpg` or `.docx` in `FinalThesis.docx`. These extensions aren’t any kind of magic; they’re literally just names. You can still rename `FinalThesis.docx` to `FinalThesis.jpg` and open it up just fine (probably, if the program isn’t actively checking the name). You can even simply not have one at all. But some systems, particularly Windows, use the file extension almost exclusively to determine what type a file is. By default, Windows will even outright hide some of the more common filename extensions in the file explorer unless you explicitly tell it to show them to you.
When that fails (or isn’t used), many (but not all) file types contain headers. These are chunks of data at the very beginning of the file that tell the program reading the file what’s inside, like a tiny little included user manual. These headers usually begin with a short sequence of bytes that is unique to the kind of file it is. These are called “file signatures”, [and you can see a list of more of the well-known ones here.](https://en.wikipedia.org/wiki/List_of_file_signatures) For example, all PDF files start with the bytes `25 50 44 46 2D`. So if your program starts to read the file and the first five bytes are this sequence, it can be pretty confident that what it is reading is *probably* a PDF file. Unix-like systems such as Linux often use this kind of type identification. It’s not uncommon to see files that have no extension in their name at all and yet the computer can still tell what kind of file it is using the header information.
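A quick sketch of "extensions are just names": rename a file and check that nothing inside it changed. The file here is a throwaway with pretend contents written to a temporary directory, and the function name `rename_demo` is made up for this example.

```python
import tempfile
from pathlib import Path


def rename_demo() -> tuple:
    """Rename FinalThesis.docx to FinalThesis.jpg and compare the bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "FinalThesis.docx"
        original.write_bytes(b"PK\x03\x04 pretend document contents")
        before = original.read_bytes()

        target = original.with_suffix(".jpg")
        original.rename(target)  # only the name changes

        after = target.read_bytes()
        return target.name, before == after


name, same_bytes = rename_demo()
print(name, same_bytes)
```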
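As a sketch of how that works for a file with no extension at all: write a few bytes to a file named just `report`, then identify it from its first five bytes. The file is not a complete PDF, only enough to pass the signature check, which is also why the answer is "probably". The function name `is_probably_pdf` is made up for this example.

```python
import os
import tempfile

# The five signature bytes from the text; they spell "%PDF-" in ASCII.
PDF_SIGNATURE = bytes.fromhex("25 50 44 46 2D")


def is_probably_pdf(path: str) -> bool:
    """Check only the first five bytes of the file, ignoring its name."""
    with open(path, "rb") as f:
        return f.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report")  # no extension in the name
    with open(path, "wb") as f:
        f.write(b"%PDF-1.4\n% just enough bytes for the check\n")
    print(is_probably_pdf(path))
```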
If you’re specifically downloading files off of the Internet through a browser or sending attachments over email, there’s also MIME types. These are essentially the server outright telling the downloader what kind of file it is, e.g. “I’m going to send you a PNG image file, please treat it like one.”
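On the receiving side, reading that declaration is mostly a matter of parsing one header line. Here is a minimal sketch using Python's standard `email` package; the function name `declared_type` is made up for this example. It only reports what the sender claimed and does not check the bytes.

```python
from email.message import EmailMessage


def declared_type(content_type_header: str) -> str:
    """Extract the MIME type from a Content-Type header value."""
    msg = EmailMessage()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


print(declared_type("image/png"))
print(declared_type("text/html; charset=utf-8"))
```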
If all of these fail, the computer basically gives up and says “lmao I don’t know, it’s a binary file ¯\_(ツ)_/¯”.