Every so often, if I open a non-text based document in either Microsoft Word or Notepad, it will open a massive file with an endless wall of completely garbled, gibberish text, most of the characters being either rectangle boxes or characters that can’t normally be typed. What does each of these characters represent? What happens if I insert or delete these characters?
Usually files would refuse to open with an incompatible format. How do these text-processing programs somehow manage to run virtually any file?
>What does each of these characters represent?
Each garbled character represents Notepad’s attempt to interpret the file’s contents as regular ASCII — regular characters, in other words. A file is nothing more than a stream of binary numbers, so Notepad does its best to divide the stream into chunks and translate those chunks into characters.
>Usually files would refuse to open with an incompatible format.
That’s only if programs are written specifically to do so. Notepad is very simple, and so it assumes that what you’ve given it is something it can work with.
Word isn’t as simple as Notepad, but it’s not programmed to fail, so it’ll do its best too.
Notepad is not a particularly sophisticated program. If you tell it to open a file it will treat it as if it were text regardless of whatever is actually in there.
Text files are laid out sequentially, with letters and numbers listed in the order they appear and special characters for spaces, returns, etc.
So Notepad interprets the binary info in the file into text characters based on the ASCII table.
Since there’s no text in most of a file, it comes out as a bunch of gobbledygook.
But sometimes you’ll see text appear because there’s text in the file, like for a prompt, a filename, or whatever.
All digital computer data these days is made up of ones and zeros. How your computer reads that data depends on what format it is expecting.
For instance, the way computers typically store the character ‘a’ is with the binary 01100001. When reading a .txt file, the computer reads 8 bits at a time, then consults the [ascii table](http://www.gcsecs.com/uploads/2/6/5/0/26505918/ascii-table_orig.png) to translate what that group of ones and zeros means. Every time it sees 01100001, it replaces that information with ‘a’ by the time it reaches your screen.
There are other ways to read 01100001 though. If the computer is expecting an 8 bit number to be there instead of a character, it will decode 01100001 to mean the number 97, because that’s what 01100001 is when converted from base 2 to base 10.
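To make that concrete, here is a small Python sketch (not anything Notepad itself runs) that reads the same eight bits both ways. The variable names are just for illustration.

```python
byte = 0b01100001  # the eight bits from the example above

# Reading 1: treat the bits as an 8-bit unsigned number.
as_number = byte

# Reading 2: treat the bits as one ASCII character.
as_text = bytes([byte]).decode("ascii")

print(as_number)  # 97
print(as_text)    # a
```

Nothing about the bits changed between the two readings; only the rule used to interpret them did.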
When you open a non-text file in Notepad, you’re feeding that non-text information into a binary-to-text decoder. It’s reading information that was never meant to represent text and telling you what those ones and zeros would be if converted to text.
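It may come across a 16 bit integer 1000000011111111 meant to express the number 33023, but it’s expecting characters represented by groups of 8 bits, so it sections off 10000000 and translates that to ‘€’, then takes the next 8 bits, 11111111, and translates that to ‘ÿ’. (That’s what those two bytes mean in the Windows-1252 “ANSI” code page; plain 7-bit ASCII has no characters at all for values above 127, which is one reason you see so many boxes.)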
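Here is a rough Python sketch of that split. It assumes the bytes are stored most-significant-first and that the editor decodes them as Windows-1252 (`cp1252`); a different encoding would give different characters.

```python
value = 33023

# Store the number as two bytes, most significant byte first.
raw = value.to_bytes(2, byteorder="big")
bits = " ".join(f"{b:08b}" for b in raw)

# Decode each byte as one character, the way a simple text viewer would.
as_text = raw.decode("cp1252")

# Strict 7-bit ASCII cannot decode these bytes at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(bits)      # 10000000 11111111
print(as_text)   # €ÿ
print(ascii_ok)  # False
```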
The text-processing software isn’t necessarily running the file, it’s just showing you what’s inside.
What’s inside depends on what kind of file it is. For example, a .exe file contains machine code instructions, a .zip file contains a bunch of separated chunks of data, and so on. Every file on a computer can be whittled down to binary, so a text processor is able to open them all.
Notepad/Word sees the binary the files are made of and tries to output it the only way it knows how, which is to translate the sequences to text. But since these sequences aren’t human-readable data in the first place, you get the gibberish text and characters. If you insert or delete any of these characters and save the file, that will most likely corrupt it and cause it to not be runnable by whatever software it’s meant for.
>What does each of these characters represent?
Not necessarily anything specific. The meaningful chunks in the underlying data may not align with single characters.
Notepad just assumes whatever the file is is going to make sense if decoded with an ASCII table.
Let’s say the byte has a value of 42 (in hexadecimal this would be represented as 2A). Notepad is going to interpret that as the character *. If what you’re looking at was a computer program, that might actually be the opcode for subtract.
>What happens if I insert or delete these characters?
You’ll likely break whatever it was you changed. It depends on what it actually did, and how your change will be interpreted. If you want to see what the data of a computer program actually means, open it with a disassembler.
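As a toy illustration of why an edit breaks things, here is a Python sketch using a made-up record format (one length byte, a name, then a 2-byte score). The format and the function names are hypothetical, not taken from any real program.

```python
import struct

def pack_record(name: str, score: int) -> bytes:
    """Hypothetical format: 1 length byte, the name, a 2-byte big-endian score."""
    data = name.encode("ascii")
    return struct.pack(">B", len(data)) + data + struct.pack(">H", score)

def unpack_record(blob: bytes):
    (length,) = struct.unpack_from(">B", blob, 0)
    name = blob[1:1 + length].decode("ascii", errors="replace")
    (score,) = struct.unpack_from(">H", blob, 1 + length)
    return name, score

original = pack_record("Link", 300)
print(unpack_record(original))  # ('Link', 300)

# Simulate deleting one "character" in a text editor: drop the byte for 'i'.
edited = original[:2] + original[3:]

try:
    result = unpack_record(edited)
except struct.error as exc:
    # The length byte still says 4, so the reader runs off the end of the data.
    result = None
    print("edited record is unreadable:", exc)
```

Deleting one byte shifted everything after it, so the length field no longer matches the data it describes. Real formats tend to fail in the same way, just less visibly.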
Notepad is basically a universal file viewer. It is programmed in such a way that you can literally drag any file onto it and it will display it… Or try to display it. If you try to open an image, for example, it will basically convert the data that makes up the image into a bunch of characters that are text-based. Technically the file is still valid and correct, it’s just being read in a way that we can’t decipher. (Like how we can see things that are on the visible spectrum, but not things that are infrared. Those things are still there, but our eyes don’t work that way).
Every file is just made up of bytes. Each byte is a number between 0 and 255.
Those numbers can be taken to mean *anything*. Different applications just have different conventions for how to read them. Some applications expect the file to be text, so they treat these numbers as letters. Notepad expects each number to represent one letter, with the number 97 representing an `a`, for example.
But other applications might read the same contents differently. If you open a save file for a game, the same 97 might mean something completely different. It might be how much life you have, or which level you have reached. Or it might mean nothing on its own, but combined with a dozen *other* bytes it tells us where in the world the assault rifle dropped.
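A quick Python sketch of that idea, using an invented four-byte “save file” (the layout of health, level, and position is purely hypothetical):

```python
import struct

save_bytes = bytes([97, 3, 0x00, 0x10])

# Reading 1: a text editor's view, one character per byte.
# latin-1 is used here only because it maps every byte value to a character.
as_text = save_bytes.decode("latin-1")

# Reading 2: the hypothetical game's view:
# 1 byte of health, 1 byte of level, 2 bytes (little-endian) of x position.
health, level, x_position = struct.unpack("<BBH", save_bytes)

print(repr(as_text))              # 'a\x03\x00\x10'
print(health, level, x_position)  # 97 3 4096
```

The text view shows an `a` followed by three unprintable characters, while the game sees three perfectly sensible numbers in the same bytes.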
Notepad is just really stupid, and insists that whatever file you open, it’s gotta be text! We’re going to show every byte as a letter or graphical symbol because *that’s what we do*.
Let’s pretend you’re a student who has some exams coming up, but you didn’t have time to study for them and will surely fail. So, you decide to… get some help from your fellow classmates.
Your History class collectively comes up with a system to communicate with each other during the multiple-choice exam: one cough is A, two coughs is B, three coughs is C, and four coughs is D.
Everyone takes the exam, and with everyone’s cooperation, the class collectively shares their answers during the exam and everyone passes with flying colors.
Next week, you meet up with your Biology class to discuss how you’re going to pass the Biology exam. “Last week, we came up with a system of coughs to communicate answers with each other”, they say. That’s great, you already know a coughing system, so you’re set!
So you enter the Biology exam and listen to everyone’s coughs. But sometimes someone coughs five or six times, sometimes someone clears their throat, and it doesn’t really make sense to you. You do your best with the coughing system you learned with your History class. Some of the answers don’t end up making sense, but you submit the exam anyway.
You get the exam back with a big fat F-.
What happened? Well, the Biology class’ system of coughs was clearly different than the History class’ system of coughs. Since you tried to interpret the Biology class’ coughs using the system of coughs that your History class decided upon, all your answers were gibberish and completely wrong.
Computers work the same way: all files are just a bunch of zeroes and ones. The different file types, like .txt, .doc, .xls, .jpg, .mp3, are all different systems to interpret those zeroes and ones as some other kind of data, like text, spreadsheets, images, or sound.
Notepad only knows how to interpret the system of zeroes and ones for .txt files and turn that into text. If you give it a .mp3 file, it will try to interpret the .mp3 file’s zeroes and ones as a .txt file, which is sort of like trying to listen to the Biology class talk about exam answers with their own cough system, but you only know the History class’ cough system. It’s going to be complete gibberish.
Computers have many ways of encoding data. That just means data can be written many different ways.
When you open a non-.txt format file in Notepad, it’s not guaranteed that the data in the file is written in letters and numbers as we use them. Instead, the data could be raw binary.
Those square symbols and gibberish are the word processor trying to decode the data into text using one of many text encoding schemes.
The word processor is trying to translate a big, random number into text, but since the data wasn’t meant to be text, it’s like trying to translate Gibberish into English. Occasionally there’ll be something recognizable, but it’s still largely gibberish.
Most programs that refuse to open incompatible formats have sanity checks and similar safeguards built in. They expect to find a specific set of data (1s and 0s) at a certain point in the file or in its metadata (the data about the data, such as when it was made, its name, and its file extension). If the program doesn’t find that, it either A) assumes the file is incompatible, B) assumes it is potentially damaged, or C) tries to read it anyway, and what the program then does with the data can cause the program itself to fail and crash.
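One common form of that sanity check is looking at the first few bytes of a file for a known signature. Here is a minimal Python sketch of the idea; the `sniff` helper and the short signature table are only for illustration, and real programs check far more than this.

```python
# A few well-known file signatures ("magic numbers") found at the start of a file.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP archive",
    b"%PDF-": "PDF document",
    b"MZ": "Windows executable",
}

def sniff(header: bytes) -> str:
    """Guess a file type from its leading bytes (hypothetical helper)."""
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return "unknown"

print(sniff(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))  # PNG image
print(sniff(b"Dear diary, today I opened an exe"))    # unknown
```

An image viewer that gets "unknown" here can simply refuse the file. Notepad never asks the question in the first place.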
Notepad has far fewer checks built in, and the one thing it does is almost impossible to “fail” (unless you specifically try to exploit bugs, etc.): it takes whatever data is in the file and displays it as text, usually decoding it as UTF-8 (or ANSI/ASCII). In ASCII and ANSI, each group of eight 1s and 0s represents one character; in UTF-8, a single character can take anywhere from one to four such groups.
Notepad gets something “readable” out of it, even if it’s garbled and not in the right format, because at its core every file is the same: it’s a collection of 0s and 1s. Notepad just takes this data, assumes it’s all text, and parses it accordingly.
You can see it happen pretty easily if you open almost any (if not all) modern executable (.exe file) in Notepad. The first 80 or so symbols are pure gibberish: they are a header plus a tiny stub program. Right after them, however, you get the line “This program cannot be run in DOS mode”. It’s clear text because it was meant to be displayed as clear text if you try to execute the file under DOS.
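If you want to pull out just the readable parts, you can scan the bytes for runs of printable ASCII. Here is a rough Python sketch of that; the `printable_runs` helper and the fake file contents are made up for the example and are not a real .exe.

```python
import re

def printable_runs(blob: bytes, min_len: int = 6):
    """Return runs of at least min_len printable ASCII bytes (hypothetical helper)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, blob)]

# A made-up stand-in for the start of an executable: junk, a message, more junk.
fake_exe = (
    b"MZ\x90\x00\x03\x00\x00\x00\xff\x00\x0e\x1f"
    + b"This program cannot be run in DOS mode."
    + b"\r\r\n$\x00\x00"
)

print(printable_runs(fake_exe))
# ['This program cannot be run in DOS mode.']
```

To try it on a real file, you would read the file in binary mode, for example with `open(path, "rb").read()`, and pass the result to the same function.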
Why don’t you see more “clear text”? Because most (or nearly all) of the actual text of the program won’t be located in the executable but in other files. Taking Chrome as an example, you can find a good chunk of clear text between the gibberish in the locales files for your language.