ASCII and Unicode are so-called text encodings. They are basically just a massive list of characters ordered in a particular way that digital systems can agree upon to store and communicate text.
Digital systems, like computers, are pretty good at storing and transmitting bits – it’s their native language. Going from bits to bytes, and bytes to numbers is pretty straightforward, but letters aren’t numbers – so why not just assign a number to each letter instead? As long as you remember that particular data was originally text, you can always recover what you put in there. This is called a representation.
What decides which letter gets what number is the text encoding. ASCII and Unicode assign so-called codepoints (indices) to letters, so programs that can interpret these encodings know which letters to draw when they need to put text on the screen.
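To make the "a number for each letter" idea concrete, here is a small sketch using Python's built-in `ord()` and `chr()`. The sample strings are just illustrative choices, not anything special.

```python
# ord() returns the code point (index) of a character; chr() goes the other way.
for ch in "Az\u00e9\u20ac":          # A, z, e-acute, euro sign
    print(repr(ch), ord(ch), hex(ord(ch)))

# Text -> numbers -> text again: nothing is lost as long as both sides
# agree on the same table.
code_points = [ord(ch) for ch in "Hi!"]
restored = "".join(chr(n) for n in code_points)
```

As long as you remember the numbers were text, `restored` is the same string you started with.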
UTF-8 is kind of a layer on top of Unicode that exists because Unicode is a very long list; UTF-8 represents Unicode codepoints, which in turn represent letters. The way UTF-8 does this is backwards compatible with ASCII and is generally storage efficient (for languages that use the Latin alphabet).
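A quick way to see both properties (ASCII compatibility and variable size) is to encode a few characters and count the bytes. This is only a sketch; the sample characters are arbitrary picks.

```python
# One character from each UTF-8 length class: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Plain ASCII text produces exactly the same bytes under both encodings.
ascii_same = "hello".encode("ascii") == "hello".encode("utf-8")

for ch, n in zip(samples, lengths):
    print(repr(ch), "takes", n, "byte(s) in UTF-8")
```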
Computers don’t know letters, they just know numbers. The original computers didn’t have display screens; all input/output was via punchcards or an electronic typewriter/printer. Even modern computer displays don’t ‘know’ letters, they just know that number 65 uses a font that displays a glyph for ‘A’.
The typewriter needed a way to know which letter to print. So sending a number ’10’ would (for example) print an A.
But what if another computer used 16, 17, 18, etc. for A, B, C, etc.? ASCII was designed as a standard for determining which number would map to which character. There were special control characters defined for typewriter carriage return, backspace, tab, etc.
There was a competing standard from IBM, EBCDIC, that defined a different mapping of numbers to characters.
https://en.wikipedia.org/wiki/EBCDIC#/media/File:Blue-punch-card-front-horiz_top-char-contrast-stretched.png
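You can still see the two mappings disagree today. The sketch below assumes Python's `cp037` codec as a stand-in for EBCDIC; it is one of several EBCDIC variants, picked here only because it ships with Python.

```python
text = "ABC"

ascii_bytes = list(text.encode("ascii"))    # ASCII numbers for A, B, C
ebcdic_bytes = list(text.encode("cp037"))   # one EBCDIC variant's numbers

print("ASCII :", ascii_bytes)
print("EBCDIC:", ebcdic_bytes)

# Reading ASCII bytes with the EBCDIC table gives the wrong letters,
# which is exactly why both sides have to agree on one standard.
misread = bytes(ascii_bytes).decode("cp037")
```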
Bytes are not binary voltage states. They are sets of 8 bits that can be interpreted in any way. Bytes can be stored and transmitted bit by bit. One of the first text encodings was [Baudot code](https://en.wikipedia.org/wiki/Baudot_code), invented in the 1870s. It was based on sets of 5 bits, most likely because 5 bits (32 values) are enough for the 26 letters of the Latin alphabet. Murray code added Carriage Return and Line Feed codes, demonstrating that you can assign arbitrary meaning to codes.
You can transmit and store Baudot and Murray encoded text as a sequence of sets of 5 bits. Similarly, you can transmit and store ASCII encoded text as a sequence of sets of 7 bits, but that would be inconvenient from a software point of view. If CPUs had instructions like “retrieve 7 bits from RAM” and “retrieve 7 bits from RAM at position N”, you could see ASCII stored as a sequence of sets of 7 bits.
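Here is a rough sketch of what storing ASCII as back-to-back 7-bit groups would look like in software. The helper names `pack7` and `unpack7` are made up for this illustration, and the bit-string approach is chosen for readability, not speed.

```python
def pack7(text: str) -> bytes:
    """Store ASCII text as consecutive 7-bit groups (hypothetical helper)."""
    bits = "".join(format(b, "07b") for b in text.encode("ascii"))
    bits += "0" * (-len(bits) % 8)          # pad the tail to a whole byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def unpack7(packed: bytes, count: int) -> str:
    """Read `count` 7-bit characters back out of the packed bytes."""
    bits = "".join(format(b, "08b") for b in packed)
    return "".join(chr(int(bits[i * 7:i * 7 + 7], 2)) for i in range(count))


message = "HELLO WORLD"
packed = pack7(message)
print(len(message), "characters fit in", len(packed), "bytes")
```

It works and saves one bit per character, but every read has to slice across byte boundaries, which is the inconvenience mentioned above.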
The reason CPUs don’t have a “retrieve 7 bits from RAM at position N” instruction is that multiplying N by 7 requires a full-blown multiplier that would take many transistors to implement and 1-2 cycles to perform, whereas multiplying N by 8 requires only shifting bits and inserting three zeros: binary 1001 times 8 is 1001000. It requires virtually no transistors and no cycles to perform.
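The shift trick is easy to check for yourself. The second half is a sketch of the extra arithmetic a 7-bit layout would need; `locate_7bit` is a hypothetical helper, not a real CPU instruction.

```python
n = 0b1001                     # 9
times_eight = n << 3           # shift left by three = append three zero bits
print(bin(times_eight))        # 0b1001000


def locate_7bit(index: int) -> tuple[int, int]:
    """Return (byte index, bit offset) where the index-th 7-bit character starts."""
    return divmod(index * 7, 8)


# With 8-bit characters, character N simply starts at byte N, bit 0.
# With 7-bit characters the start drifts through the byte:
for i in range(4):
    print(i, locate_7bit(i))
```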
ASCII is far older. It was actually an extension of five-bit teletype code (as in telegrams, ever hear of those? Literally hear; a lot of movies and TV shows use the hammering sound of teletypes to convey the sound of a newsroom – and it’s literal hammering; teletype machines had a cylindrical column with the letters on it; the cylinder would rotate, elevate, and then an electromagnetic hammer would hammer the drum into the paper.)
Anyway, 5-bit “Baudot” code had only 32 possible characters; as time went on, people wanted luxuries like upper AND lower case. This was the cause for ASCII, an eight bit code. Well, that’s actually a lie; ASCII is seven bits plus one bit of parity, but fortunately almost nobody paid any attention to that eighth bit, and that was very lucky.
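For the curious, this is roughly how that eighth bit would be used as parity. The sketch assumes even parity (the total number of 1 bits is made even); odd parity was also used, and the function names are invented for this example.

```python
def add_even_parity(code: int) -> int:
    """Set bit 7 so that the whole byte has an even number of 1 bits."""
    ones = bin(code & 0x7F).count("1")
    return (code & 0x7F) | ((ones % 2) << 7)


def strip_parity(byte: int) -> int:
    """Throw the eighth bit away, which is what most software ended up doing."""
    return byte & 0x7F


a = add_even_parity(ord("A"))   # 'A' is 1000001: two 1 bits, parity bit stays 0
c = add_even_parity(ord("C"))   # 'C' is 1000011: three 1 bits, parity bit set
```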
ASCII was an invention of ANSI (American National Standards Institute), and it worked very well… as long as you used the American character set (128 codes covering all 26 letters upper case, 26 letters lower case, and “$” as the only currency mark). No umlauts, accent marks, or rarer characters (like the “cents” mark) were supported.
So now things get weird. People in other countries wanted computers to properly reflect their own local languages, so vendors such as Microsoft adopted “wchars”: 16-bit wide characters that could encode a much larger character set (64K characters), but still not one that covered every significant human language. (Before that, 8-bit “code pages” had typically reused the upper half of the byte for one regional character set at a time.)
Thence came Unicode, a way to encode essentially _all_ human character sets. The only problem is that storing every code point at a fixed width takes four bytes per character.
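You can measure that cost directly. The sketch below uses Python's `utf-32-be` codec as the fixed-width form (big-endian, no byte-order mark); the sample string is arbitrary.

```python
text = "A\u00e9\u20ac"                 # A, e-acute, euro sign: three characters

fixed_width = text.encode("utf-32-be")  # always 4 bytes per character
variable = text.encode("utf-8")         # 1 + 2 + 3 bytes here

print(len(fixed_width), "bytes fixed-width vs", len(variable), "bytes in UTF-8")

# Even plain ASCII pays the full price at fixed width.
ascii_cost = len("hello".encode("utf-32-be"))
```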
UTF-8 solves this problem handily. Remember a couple of paragraphs up where we said that ASCII simply ignored the top-level bit, making ASCII essentially a seven-bit character code? Well, that eighth bit could be stolen to indicate “overflow into next byte”.
UTF-8 works like this:
* If the character is one of the 128 ASCII 7-bit codes, use it unchanged (eighth bit is a zero)
* Else, the number of bits the Unicode code point needs is checked:
    1. If the character needs 11 or fewer bits, the first three bits of the first byte are 110 and the first two bits of the second byte are 10; the rest of the bits are copied from the Unicode code point.
    2. If the character needs 12 to 16 bits, the first four bits of the first byte are 1110 and the first two bits of the second and third byte are 10; the rest of the bits are copied from the Unicode code point.
    3. If the character needs 17 to 21 bits, the first five bits of the first byte are 11110 and the first two bits of the second, third, and fourth byte are 10; the rest of the bits are copied from the Unicode code point.
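The rules above can be sketched as a few lines of bit twiddling. `utf8_encode` is a name made up for this example; in real code you would just call `str.encode("utf-8")`, which the sketch is compared against at the end. It also skips one check a real encoder does (rejecting the surrogate range U+D800 to U+DFFF).

```python
def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoding of one code point (illustrative only)."""
    if cp < 0x80:                      # fits in 7 bits: plain ASCII byte
        return bytes([cp])
    if cp < 0x800:                     # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                  # up to 21 bits: 11110xxx + three 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")
    # Note: surrogates (U+D800..U+DFFF) are not rejected in this sketch.


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    print(hex(cp), [format(b, "08b") for b in encoded])
    assert encoded == chr(cp).encode("utf-8")   # matches the built-in encoder
```

Printing the bytes in binary makes the 110 / 1110 / 11110 prefixes and the 10 continuation markers easy to spot.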
That’s basically it. It’s a hack that preserves compatibility with “byte == letter” software, autodetects invalid character codes, and (very importantly) preserves valid and stable sort ordering: sorting the raw bytes gives the same order as sorting by Unicode code point.
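Two of those claims are easy to poke at. This sketch compares sorting by code point with sorting by raw UTF-8 bytes, and then feeds the decoder a byte that can never start a character. The word list is an arbitrary sample.

```python
words = ["z", "\u00e9", "a", "\u20ac", "\U0001F600", "B"]

by_code_point = sorted(words)                                  # Python compares code points
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8")) # compare raw bytes instead
same_order = by_code_point == by_utf8_bytes

# A byte of the form 10xxxxxx is only ever a continuation byte,
# so a decoder can tell immediately that it is out of place.
try:
    b"\x80".decode("utf-8")
    stray_rejected = False
except UnicodeDecodeError:
    stray_rejected = True
```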
Note that it’s possible to create overlong codings and other such glitches; that’s a whole other topic and the Wikipedia article explains it well.
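As a small taste of the overlong problem: the slash character fits in one byte, but the bit patterns for two or three bytes could also be filled with the same value. A sketch using Python's strict decoder (the `is_valid_utf8` helper is invented for this example) shows that a conforming decoder refuses those forms.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


proper = b"/"                  # 00101111, the shortest (and only valid) form
overlong_2 = b"\xc0\xaf"       # 110_00000 10_101111: same value padded to 2 bytes
overlong_3 = b"\xe0\x80\xaf"   # same value padded to 3 bytes

for label, data in [("proper", proper), ("2-byte", overlong_2), ("3-byte", overlong_3)]:
    print(label, is_valid_utf8(data))
```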