ASCII vs Unicode: can you help me understand the real-world applications?


I’m taking the ITF+ certification course from CompTIA, and one lesson mentioned ASCII and Unicode. I get that they are different methods for encoding text, but can someone explain to me what is the “real world” application for knowing this? I have a hard time just learning definitions in a vacuum, and it helps me to know how I would use this knowledge.

For example, if I am looking at a webpage or a document, is it beneficial for me to know whether it uses ASCII or Unicode? And how would I know which one it is?

I’m sorry if this isn’t the right place to post this. I tried posting to r/computerscience but they seem to be closed to new members. I tried r/askcomputerscience but it was removed by their automod for being about “tech support.”


Here are some facts about bytes: Most computers work with 8-bit bytes. You can use 8 bits to represent 256 different values. These 256 different values often represent numbers 0-255 using the binary number system.
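If you want to poke at this yourself, here is a tiny Python sketch. The three sample values are just ones I picked for illustration.

```python
# A byte is 8 bits, so it can hold 2**8 distinct values.
values_per_byte = 2 ** 8

# A Python bytes object is a sequence of integers, each in the range 0-255.
data = bytes([0, 65, 255])

# Show each byte as its 8 binary digits.
bits = [format(b, "08b") for b in data]

print(values_per_byte)
print(bits)
```

Trying `bytes([256])` raises a `ValueError`, because 256 does not fit in one byte.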

What if you want to represent letters or punctuation? You have to decide what number means what letter. In the early days of computers, each computer maker decided for themselves, [possibly based on punch cards](https://en.wikipedia.org/wiki/BCD_%28character_encoding%29).

As time went on, more companies started making computers (and computer-adjacent devices like keyboards and printers). More people started buying them. And some people wanted to connect different computers to each other. So in the 1960s, ASCII was created: ASCII is a standard for what number means what letter. It defines meaning for the numbers 0-127. The numbers 32-126 stand for the 95 “printable characters” (counting the space) that appear on modern US keyboards:

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

(0-31 and 127 are “control codes.” A couple of them are still used, such as tab and newline, but most are obsolete.)
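You can see the ASCII table from inside most programming languages. Here is a rough Python sketch using the built-in `ord` and `chr` functions. The variable names are mine, nothing official.

```python
# ord() gives the number for a character, chr() goes the other way.
code_for_A = ord("A")
char_for_97 = chr(97)

# Build the printable ASCII range, codes 32 through 126 inclusive.
printable = "".join(chr(n) for n in range(32, 127))

# Two control codes that are still in everyday use.
tab_code = ord("\t")
newline_code = ord("\n")

print(code_for_A, char_for_97)   # 65 a
print(len(printable))            # 95 (the first one is the space)
print(tab_code, newline_code)    # 9 10
```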

What if you want your computer to work with non-American languages? Other countries like to use letters with funny squiggles (ñ), dots (ö), or slashes (é); letters that don’t exist in English (like Germany’s ß); completely different alphabets (like Greek or Cyrillic); or even writing systems with hundreds or thousands of symbols (Japanese, Korean, Chinese).

ASCII is an *American* code, so that’s not ASCII’s concern. Up to the early 1990s, it was still up to the individual computer makers to decide how to deal with non-US characters. So non-US computer systems used an alphabet soup of incompatible solutions, [frequently causing issues with unreadable text](https://en.wikipedia.org/wiki/Mojibake), especially as the Internet started to get popular.
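Here is a small Python sketch of how that kind of unreadable text happens: the bytes are written using one encoding and read back using another. The sample word and the choice of Windows-1252 as the "wrong" encoding are just my picks for the demo.

```python
original = "se\u00f1or"  # "señor", written with an escape so the file stays ASCII

# The writer saves the text as UTF-8 bytes.
raw = original.encode("utf-8")

# The reader wrongly assumes Windows-1252 and gets garbage for the ñ.
garbled = raw.decode("cp1252")

# If you know which wrong guess was made, you can often undo it.
repaired = garbled.encode("cp1252").decode("utf-8")

print(raw)       # b'se\xc3\xb1or'
print(garbled)   # seÃ±or
print(repaired)  # señor
```

The bytes never changed; only the reader's guess about what they mean did.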

US users (even technology professionals) could be blissfully unaware of all this, because most systems spoke ASCII just fine [1].

Unicode was a new standard for representing text *in all human languages* on computers (beginning in the late 1980s, but only really picking up steam in the mid-to-late 1990s). Unicode’s idea was to assign numbers to the letters of all the world’s languages. (Later, they added many non-letter symbols: currency symbols, safety symbols, emoji, music notation, ancient hieroglyphics, alchemy / astrology…plenty of strange characters have been added to Unicode.)

Unicode would assign 0-127 to the same letters as ASCII, so it would hopefully at least be compatible with all the ASCII-based software. Since there were so many new languages, some with hundreds or even thousands of characters, it was clear that lots of the new characters would need more than 1 byte. Originally they wanted to use 2 bytes per character, giving 65,536 possible numbers (256×256 or 2^16), but over time this was not enough; [according to Wikipedia](https://en.wikipedia.org/wiki/Unicode) Unicode now defines numbers for 149,813 characters.
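In Python, `ord` gives you the Unicode number (the "code point") for any character, not just ASCII ones. A quick sketch, with sample characters I chose myself:

```python
# Code points for a few characters, from plain ASCII up to an emoji.
samples = ["A", "\u00f1", "\u00df", "\u20ac", "\U0001F600"]  # A, ñ, ß, €, 😀
code_points = {ch: ord(ch) for ch in samples}

# ASCII characters keep their old numbers.
ascii_unchanged = code_points["A"] == 65

# The emoji's number does not fit in 2 bytes (max 65,535).
needs_more_than_16_bits = code_points["\U0001F600"] > 0xFFFF

for ch, n in code_points.items():
    print(ch, n, hex(n))
```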

So they separated the problem of *assigning numbers to characters* (solved by Unicode) from the problem of *turning those numbers into byte sequences* (solved by encodings such as UTF-8 or UTF-16). Basically, UTF-16 is “most characters are 16 bits, except the ones added after the original 65,536 slots started to run out, which take 32 bits,” while UTF-8 is “ASCII is 8 bits, same as it always was, and a non-ASCII character is a multi-byte sequence made up of bytes in the 128-255 range.” Originally Unicode envisioned 16-bit characters as the standard (strings and characters in the Java programming language assume UTF-16).
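To make the difference concrete, here is a Python sketch that encodes the same characters both ways and counts the bytes. I use `"utf-16-le"` rather than `"utf-16"` so Python does not prepend a byte order mark; the sample characters are my own choice.

```python
samples = ["A", "\u00f1", "\u20ac", "\U0001F600"]  # A, ñ, €, 😀

# Number of bytes each character takes in each encoding.
utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
utf16_sizes = [len(ch.encode("utf-16-le")) for ch in samples]

print(utf8_sizes)   # [1, 2, 3, 4]
print(utf16_sizes)  # [2, 2, 2, 4]

# Plain ASCII text is byte-for-byte identical in UTF-8.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")

# Every byte of a non-ASCII character in UTF-8 is in the 128-255 range.
euro_bytes = list("\u20ac".encode("utf-8"))
print(euro_bytes)   # [226, 130, 172]
```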

But today, the vast majority of modern websites, software, and programming languages assume, or at least tolerate, Unicode encoded as UTF-8.

Now we can get to your question:

> how I would use this knowledge

– If software is showing [garbled text](https://en.wikipedia.org/wiki/Mojibake), especially for a non-English language, try to look through the menus or other configuration for a way to change the encoding. Settings that might work: UTF-8, UTF-16, ISO 8859-1, Code Page 437, Windows 1252.
– If you see a weird character at the beginning of some text, it might be a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) (byte order mark), which is a way for the document to tell software what encoding it uses. In that case, the software you’re using doesn’t understand the BOM and is showing it as text instead.
– If you’re looking at a text file in a hex editor, you can carefully go through the Unicode and UTF-8 specs to figure out why the bytes that represent a non-ASCII character or emoji are what they are.
– Unicode contains a bunch of weird stuff: characters modifying other characters (accents, emoji skin color) and changing text direction (for e.g. Middle Eastern languages that write right to left). This can cause issues (e.g. putting every possible accent on the same character to make a corrupted-looking cursed character) or even security problems.
– If you write a program involving strings, 1 character = 1 byte is true for ASCII, but not true in general. It is very easy to get confused and write a buggy program, especially if you only tested your program with ASCII inputs and outputs.
– Older programming languages might have a built-in assumption that 1 character = 1 byte. Code that handles text in these languages is very tricky to get correct for potentially non-ASCII inputs.
– Newer programming languages might force you to say whether your data is bytes or characters, and tell the programming language when you want to convert between them. This makes it harder to write incorrect programs, but is inconvenient and annoying if you just want to assume ASCII for whatever reason. Telling the programming language to convert using UTF-8 will usually work, but “the right thing to do” is to give the user some way to input or configure what encoding to use.
– Websites and browsers send out information about the encoding as part of HTTP headers. This might occasionally be important, especially if parts of the website you’re dealing with have an unusual encoding.
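A few of the points above can be tried out in a short Python session. This is only a sketch: the sample strings and the `Content-Type` value are made up for illustration, and using the standard library's `email.message.Message` to pick apart a header is just one convenient way to do it.

```python
import codecs
import unicodedata
from email.message import Message

# 1 character is not always 1 byte.
text = "na\u00efve caf\u00e9"  # "naïve café"
char_count = len(text)
byte_count = len(text.encode("utf-8"))

# A UTF-8 BOM shows up as a stray character unless the decoder strips it.
with_bom = codecs.BOM_UTF8 + "hi".encode("utf-8")
bom_visible = with_bom.decode("utf-8")       # BOM kept as '\ufeff'
bom_stripped = with_bom.decode("utf-8-sig")  # BOM removed

# The same visible letter can be one code point or a letter plus a modifier.
composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent
looks_same_but_differs = composed != decomposed
normalized = unicodedata.normalize("NFC", decomposed)

# Pulling the encoding out of a (hypothetical) HTTP Content-Type header.
header = Message()
header["Content-Type"] = "text/html; charset=ISO-8859-1"
charset = header.get_content_charset()

print(char_count, byte_count)    # 10 12
print(repr(bom_visible))         # '\ufeffhi'
print(repr(bom_stripped))        # 'hi'
print(normalized == composed)    # True
print(charset)                   # iso-8859-1
```

Python is one of the "newer" languages in the sense above: `str` and `bytes` are separate types, and `encode` / `decode` are where you say which encoding you mean.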

[1] Except the mainframe world and EBCDIC, but that’s another story. Even today, there are some EBCDIC holdouts operating here and there — [at the Computer History Museum](https://www.youtube.com/watch?v=uFQ3sajIdaM), or even [in the wild](https://shkspr.mobi/blog/2021/10/ebcdic-is-incompatible-with-gdpr/) — but you’re quite unlikely to encounter them.