Every so often, if I open a non-text based document in either Microsoft Word or Notepad, it will open a massive file with an endless wall of completely garbled, gibberish text, most of the characters being either rectangle boxes or characters that can’t normally be typed. What does each of these characters represent? What happens if I insert or delete these characters?
Usually files refuse to open in a program that doesn’t support their format. How do these text-processing programs manage to open virtually any file?
Nobody understands how to do ELI5 anymore.
# ELI5
So we all accept that everything in a computer is just 1s and 0s, right?
Long ago, people got together and decided, “Hey everyone, let’s all treat 1011 as the letter A.”
And so everyone agreed. For text, treat 1011 as A, and 1100 as B, and so on.
But then someone came along and wanted to use 1s and 0s to convey sound. “A” and “B” don’t mean anything for sound; instead, they want 1011 to mean “a sound at 440 Hz at volume 6” (for example).
And then someone else came along and wanted to use 1s and 0s to convey images. They want 1101 to mean “Set this pixel to navy blue intensity 8”.
So you’ve got a file. It’s a bunch of 1s and 0s.
Should we interpret those 1s and 0s as text (1011 means A)?
Should we interpret those 1s and 0s as sound?
Should we interpret those 1s and 0s as images?
The extension (ex: txt, doc, xls) gives you a *hint* about what’s inside, but it’s no guarantee.
So you open Notepad, and it will only interpret things as text. That’s its job, that’s what it knows how to do. And the extension (ex: jpg) is kind of meaningless, Notepad is happy to try to interpret the file as text.
And it finds “010000010110”, and it says “Hmm. [It says here](https://unicode-table.com/en/#0416) that, if treated as text, that string of 1s and 0s should be: Ж.”
That’s where you get those weird characters. It’s trying to interpret those 1s and 0s as text. There are some strings of 1s and 0s that don’t have ANY corresponding letter, so it does its best, and sometimes those show up as a square or rectangle.
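If you want to try that lookup yourself, here is a minimal Python sketch. The variable names are made up for illustration; the only things taken from the answer are the bit string and the Unicode table entry for Ж.

```python
# The bit string from the example, read as one binary number.
bits = "010000010110"
code_point = int(bits, 2)       # 1046, which is 0x416 in hexadecimal

# chr() looks that number up in the Unicode table.
letter = chr(code_point)

print(hex(code_point), letter)  # 0x416 Ж
```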
What happens if you insert/delete/change those characters?
Well, say you type a “K” where the “Ж” was.
You’ve just changed the 1s and 0s.
And if the file was meant to be interpreted as images, now you’ve told the thing that interprets those 1s and 0s to put a yellow pixel instead of a blue one.
It could be as harmless as that. Or it could tell your IV pump to send a lethal dose of medicine. It all depends on what’s interpreting those 1s and 0s, and how it’s doing the interpretation.
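Here is a rough Python sketch of the harmless version. The three-byte "pixel" layout is invented for the example, and Latin-1 is used only because it gives every byte value its own character; real image formats and real editors are more complicated than this.

```python
# Three bytes that an image viewer might read as one red/green/blue pixel.
# (A made-up raw layout, purely for illustration.)
pixel = bytes([0, 0, 128])              # dark blue

# A text editor shows those bytes as characters. Latin-1 maps every
# byte value 0-255 to exactly one character, so nothing is lost.
as_text = pixel.decode("latin-1")

# The user replaces the last "character" with a K and saves the file.
edited_text = as_text[:2] + "K"
edited_pixel = edited_text.encode("latin-1")

print(list(pixel))         # [0, 0, 128]
print(list(edited_pixel))  # [0, 0, 75]
```

The blue value went from 128 to 75 just because "K" happens to be number 75. The editor has no idea it changed a color.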
> Usually files would refuse to open with an incompatible format
Sure. You open the file as a jpg and the jpg interpreter says, “Whoa! I just found 001001 and I have no idea what that means! I better stop and alert the user that this file is incomprehensible!”
But Notepad doesn’t do that because just about every combination of 1s and 0s can be some kind of character.
If the file isn’t text, it’s not going to look like text.
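To make the difference concrete, here is a small Python sketch of a picky reader next to a reader that accepts anything. Both functions are hypothetical stand-ins, not how any real JPEG viewer or Notepad is written. The one real fact used is that JPEG files begin with the two bytes FF D8.

```python
def open_as_jpeg(data):
    """Hypothetical strict reader: refuses anything that does not start
    with the two-byte marker that begins every JPEG file."""
    if not data.startswith(b"\xff\xd8"):
        raise ValueError("this does not look like a JPEG")
    return "looks like a JPEG, decoding would continue from here"


def open_as_text(data):
    """Hypothetical lenient reader: Latin-1 has a character for every
    possible byte, so this never fails, whatever the file contains."""
    return data.decode("latin-1")


some_bytes = bytes([0x25, 0x50, 0x44, 0x46, 0x00, 0x9C])

print(repr(open_as_text(some_bytes)))   # always produces *something*

try:
    open_as_jpeg(some_bytes)
except ValueError as err:
    print("JPEG reader gave up:", err)
```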
“Text” in computers is simply a list of numbers to look up letters in a table so that they can be printed on the screen. The number “65” represents the uppercase letter “A”, for example. Each letter, digit, punctuation mark, space, they all have numbers assigned to them and the application simply looks up in a table what to print when it sees the number.
Of course, there are more numbers than there are letters. When you open something that isn’t a “text file”, that is, a file where all the numbers match letters in that table, then the program has to improvise. If the numbers aren’t meant to be text, though, you just have a seemingly random bunch of numbers that the program tries to treat as text (because it’s designed to treat files as text). You’ll see the numbers matching letters here and there, but also numbers that are assigned to weird symbols that are not often used. Some of the numbers don’t even have a symbol assigned to them – in which case the program just shoves some placeholder in there (usually a square or something).
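A short Python sketch of the table lookup and the placeholder behaviour. UTF-8 is picked here as the text encoding purely as an assumption for the example; a given editor may guess a different one.

```python
# Numbers that really are text: each one looks up a letter.
numbers = [72, 101, 108, 108, 111]
print("".join(chr(n) for n in numbers))   # Hello
print(chr(65))                            # A

# Numbers that were never meant to be text. The last two bytes are not
# valid in UTF-8, so the decoder swaps in the placeholder U+FFFD.
not_text = bytes([72, 105, 0xFF, 0xFE])
shown = not_text.decode("utf-8", errors="replace")
print(shown)   # "Hi" followed by two placeholder characters
```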
Files are in a certain format and the application opening them checks if the format is valid (as expected).
Random files will not be valid and the application refuses to open them.
Ex: Most image files are expected to have a special area (a header) defining the width/height, number of colors, etc.
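As one concrete case, a PNG file starts with a fixed 8-byte signature followed by an IHDR chunk that holds the width and height. The sketch below checks exactly that and nothing more; `read_png_size` is a made-up helper name, and the sample bytes are built by hand rather than read from a real file.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_size(data):
    """Return (width, height) from a PNG header, or raise ValueError."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # After the signature: 4-byte chunk length, then 4-byte chunk type.
    _length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR":
        raise ValueError("PNG header chunk is missing")
    # IHDR begins with width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Hand-built header for a 640x480 image (8-bit RGB, no interlacing).
header = PNG_SIGNATURE + struct.pack(
    ">I4sIIBBBBB", 13, b"IHDR", 640, 480, 8, 2, 0, 0, 0
)
print(read_png_size(header))   # (640, 480)

try:
    read_png_size(b"just some random bytes, not an image")
except ValueError as err:
    print("refused:", err)
```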
A text editor, such as Notepad, is not expecting a certain file format and can open any file. (There are some constraints about size, etc., but that is irrelevant here.)
The garbled text you see is because of the way Notepad tries to interpret the file as text.
All characters are encoded as a string of bits.
Memory was precious, so a long time ago characters were encoded using 7 bits (not even a full byte, to save a single bit per character). This is the standard ASCII character set.
Later 8 bits per character were used. (Extended ASCII character set)
Today multiples of 8 bits (even up to 4 bytes) can be used per character to represent all the special ones, such as Chinese and Japanese characters.
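You can check the "1 to 4 bytes per character" part with a few lines of Python. UTF-8 is used here as the example of a modern multi-byte encoding; the four sample characters are arbitrary picks.

```python
samples = {
    "A": "A",                      # plain ASCII letter
    "e-acute": "\u00e9",           # é
    "Chinese 'middle'": "\u4e2d",  # 中
    "emoji": "\U0001F600",         # 😀
}

sizes = {}
for name, ch in samples.items():
    encoded = ch.encode("utf-8")
    sizes[name] = len(encoded)
    print(name, ch, len(encoded), "byte(s):", encoded.hex())

# Standard ASCII only needs 7 bits, so every ASCII code is below 128.
print(ord("A") < 128)   # True
```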
For files that are not text, other encodings can be used.
The same string of bits might have a completely different meaning. (For a compiled application it might be an instruction for the CPU.)
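A quick Python sketch of the same four bytes being read in several ways. The choice of bytes and of little-endian byte order is arbitrary. Reading them as a CPU instruction is not shown, since that depends on the processor.

```python
import struct

data = b"ABCD"   # the bytes 0x41 0x42 0x43 0x44

as_text = data.decode("ascii")             # four letters
as_small_numbers = list(data)              # four separate 8-bit numbers
(as_integer,) = struct.unpack("<I", data)  # one 32-bit unsigned integer
(as_float,) = struct.unpack("<f", data)    # one 32-bit floating-point number

print(as_text)            # ABCD
print(as_small_numbers)   # [65, 66, 67, 68]
print(as_integer)         # 1145258561
print(as_float)
```

Nothing in the bytes themselves says which reading is the "right" one. That is decided by whatever program opens the file.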
As Notepad is a very simple editor and doesn’t know about any of this, it just treats the file’s bytes as text (in the simplest case, each byte as an individual character from the extended ASCII character set), resulting in what you see.
You can see the (extended) ASCII character set here: https://www.asciitable.com/