What is the context behind the creation of ASCII and UTF-8?


Each time I dive into the subject I get lost, like WTF!?

Were they used to convert binary voltage states (bytes) into a display signal?

Were they invented as a guide for software to translate 1s and 0s (that are NOT bytes)?

Is it both, because of convenience?

What is it?


5 Answers

Anonymous 0 Comments

ASCII is far older than UTF-8. It was actually an extension of five-bit teletype code (as in telegrams, ever hear of those? Literally hear: a lot of movies and TV shows use the hammering sound of teletypes to convey the sound of a newsroom, and it's literal hammering. Teletype machines had a cylindrical column with the letters on it; the cylinder would rotate, elevate, and then an electromagnetic hammer would hammer the drum into the paper.)

Anyway, 5-bit “Baudot” code had only 32 possible code values; as time went on, people wanted luxuries like upper AND lower case. This was the motivation for ASCII, an eight-bit code. Well, that’s actually a lie; ASCII is a seven-bit code, and the eighth bit of the byte was often used for parity. Fortunately almost nobody paid any attention to that eighth bit, and that turned out to be very lucky.

ASCII was standardized in the 1960s by the American Standards Association, the body that later became ANSI (American National Standards Institute), and it worked very well… as long as you used the American character set (128 codes, including control codes, covering all 26 letters in upper case, 26 letters in lower case, digits, punctuation, and “$” as the only currency mark). No umlauts, accent marks, or rarer characters (like the cent sign “¢”) were supported.

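So now things get weird. People in other countries wanted computers to properly reflect their own local languages. The first fix was “code pages”: the spare eighth bit gave 128 extra slots, and each region filled them with its own characters, so the same byte meant different things on different machines. Then came “wchars”, 16-bit wide characters (which Microsoft adopted for Windows) that could encode a much larger character set (64K characters), but still not one that covered every significant human language.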

Thence came Unicode as we know it today: a way to encode essentially _all_ human character sets. The only problem is that its code points need up to 21 bits, so a simple fixed-width encoding requires four bytes per character.
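To put numbers on the four-bytes-per-character cost, here is a small sketch using Python's built-in codecs. The sample strings and the `byte_sizes` helper are just mine for illustration; `utf-32-be` is used so that no byte-order mark gets added to the count.

```python
def byte_sizes(text: str) -> tuple[int, int]:
    """Return (fixed-width size, UTF-8 size) in bytes for the given text."""
    fixed_width = len(text.encode("utf-32-be"))  # always 4 bytes per character
    variable_width = len(text.encode("utf-8"))   # 1 to 4 bytes per character
    return fixed_width, variable_width

for sample in ["hello", "héllo €"]:
    fixed, variable = byte_sizes(sample)
    print(f"{sample!r}: {fixed} bytes fixed-width, {variable} bytes as UTF-8")
```

For plain English text the fixed-width form is four times larger, which is why nobody was thrilled about it.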

UTF-8 solves this problem handily. Remember a couple of paragraphs up, where we said that ASCII simply ignored the top bit, making it essentially a seven-bit character code? Well, that eighth bit could be stolen and used to indicate “overflow into next byte”.

UTF-8 works like this:

* If the character is one of the 128 ASCII 7-bit codes, use it unchanged as a single byte (eighth bit is a zero)
* Else, the number of bits needed for the character's Unicode code point is checked:

1. If the character needs 8 to 11 bits, the first three bits of the first byte are 110 and the first two bits of the second byte are 10; the rest of the bits are copied from the code point.
2. If the character needs 12 to 16 bits, the first four bits of the first byte are 1110 and the first two bits of the second and third bytes are 10; the rest of the bits are copied from the code point.
3. If the character needs 17 to 21 bits, the first five bits of the first byte are 11110 and the first two bits of the second, third, and fourth bytes are 10; the rest of the bits are copied from the code point.
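If it helps to see those rules as code, here is a minimal hand-rolled encoder in Python. Treat it as a sketch: `utf8_encode` is a name I made up, and in real code you would just call `str.encode("utf-8")`. The surrogate check is an extra rule from the UTF-8 spec that the list above doesn't mention.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8, by hand."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point < 0x80:       # fits in 7 bits: plain ASCII, 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:    # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # fits in 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])

# Compare against Python's built-in encoder for 1-, 2-, 3- and 4-byte characters.
for ch in "Aé€😀":
    print(ch, utf8_encode(ord(ch)).hex(" "), ch.encode("utf-8").hex(" "))
```

The two hex columns should match on every line, since both follow the same bit layout.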

That’s basically it. It’s a hack that preserves compatibility with "byte == letter" software, autodetects invalid character codes, and (very importantly) preserves a valid and stable sort ordering: sorting the raw bytes gives the same order as sorting by Unicode code point.
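Those three properties are easy to poke at from a Python prompt. This is only a quick demonstration on a word list I picked arbitrarily, not a proof; `is_valid_utf8` is a hypothetical helper wrapped around the built-in decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 1. ASCII compatibility: pure ASCII text is the same bytes either way.
same_bytes = "apple".encode("ascii") == "apple".encode("utf-8")

# 2. Error detection: a continuation byte (10xxxxxx) with no lead byte,
#    or a lead byte with its continuation missing, is rejected.
lone_continuation_ok = is_valid_utf8(b"\x80")
truncated_ok = is_valid_utf8(b"\xc3")
complete_ok = is_valid_utf8(b"\xc3\xa9")   # "é"

# 3. Sort order: Python compares str values by code point, so sorting the
#    strings and sorting their UTF-8 bytes should give the same order.
words = ["zebra", "Zürich", "apple", "éclair", "日本", "😀"]
by_code_point = sorted(words)
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(same_bytes, lone_continuation_ok, truncated_ok, complete_ok)
print(by_code_point == by_bytes)
```

Note this is code point order, not dictionary order for any particular language; proper collation is a separate problem.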

Note that it’s possible to create overlong encodings (a character stored in more bytes than it needs) and other such glitches, which a strict decoder has to reject; that’s a whole other topic and the Wikipedia article explains it well.
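For a taste of the overlong problem, here is a small Python sketch. The bytes `C0 AF` are the classic two-byte overlong form of "/" (which should be the single byte `2F`). `naive_decode_two_bytes` is a deliberately sloppy decoder I wrote for illustration; `strict_decode_rejects` just wraps Python's built-in decoder.

```python
def naive_decode_two_bytes(data: bytes) -> int:
    """Strip the 110/10 marker bits without any range check (do NOT do this)."""
    return ((data[0] & 0x1F) << 6) | (data[1] & 0x3F)

def strict_decode_rejects(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True

overlong_slash = b"\xc0\xaf"

# The sloppy decoder happily produces "/" from the two-byte form...
print(chr(naive_decode_two_bytes(overlong_slash)))

# ...while a strict decoder refuses it, so "/" has exactly one valid encoding.
print(strict_decode_rejects(overlong_slash))
```

This matters for security: a filter that scans for the byte `2F` would miss the overlong form if something downstream decoded it sloppily.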
