What’s the difference between UTF 8, 16 and 32


I’m learning the basics about computers and came across Unicode. Apparently it can be divided into 3 with UTF (Unicode Transformation Format), which would be UTF-8, UTF-16 and UTF-32. I understand that each one has a different size: UTF-8 – 1B; UTF-16 – 2B; UTF-32 – 4B. But beyond how much space each one takes, I don’t understand what the difference is between one and the other?

Also, apologies if I got any concept wrong :$ Feel free to correct me if I did


4 Answers

Anonymous

Unicode is a system that is designed to encode every character in any language. Every character is represented as a code point, which is just a number from 0 to 1,114,111, or U+0000 to U+10FFFF in hexadecimal. (Not every code point is a character; some correspond to accents that combine with other characters, etc.)

UTF-32 is the simplest encoding, which just stores the code point as a single 32-bit number. So the code point U+1F600 (a grinning face) is just stored as the hexadecimal 0x0001F600 (which is actually stored as 00 F6 01 00, on a typical “little endian” machine).

The advantage of UTF-32 is its simplicity, as every code point is the same size. But it wastes a lot of space (many of the bits are always zero, and many of the rest are almost always zero).
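If you want to see that with actual numbers, here is a quick Python sketch (just an illustration, using the built-in int.to_bytes and the utf-32-le codec; cp is a name made up for this example):

    # UTF-32 just stores the code point as a 32-bit integer.
    cp = 0x1F600                        # code point of the grinning face

    utf32_le = cp.to_bytes(4, "little") # same number, little-endian byte order
    print(utf32_le.hex())               # 00f60100

    # Python's own codec produces the same bytes:
    print("\U0001F600".encode("utf-32-le").hex())   # 00f60100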

UTF-16 represents every code point as either 16 or 32 bits. All the code points from U+0000 to U+D7FF and U+E000 to U+FFFF are encoded as 16 bits in the obvious way (i.e. U+E000 is just 0xE000). The higher code points (U+010000 to U+10FFFF) are encoded as a surrogate pair: two 16-bit units. First the code point has 0x10000 subtracted from it, and then the resulting 20-bit number is split in two. The highest 10 bits are attached to a number beginning with the bits 110110 (the high surrogate) and the lowest 10 bits are attached to a number beginning with the bits 110111 (the low surrogate). These two numbers together make up the whole encoding of the code point.

So, for instance, the code point U+1F600 would be encoded like:

– Subtract 0x10000 to get 0xF600
– Express 0xF600 as a 20-bit binary number (00001111011000000000)
– The high surrogate is 110110 followed by the first ten bits (1101100000111101=0xD83D)
– The low surrogate is 110111 followed by the next ten bits (1101111000000000=0xDE00)

So then the grinning face would be encoded as the two 16-bit numbers 0xD83D and 0xDE00. (On a little endian machine, this would be 3D D8 00 DE in memory).
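Here is a small Python sketch of that surrogate arithmetic, if you want to check it yourself (cp, high and low are just names made up for the example):

    cp = 0x1F600
    assert cp > 0xFFFF                 # only the higher code points need surrogates

    v = cp - 0x10000                   # the 20-bit value 0x0F600
    high = 0xD800 | (v >> 10)          # 0xD800 is 110110 followed by ten zero bits
    low  = 0xDC00 | (v & 0x3FF)        # 0xDC00 is 110111 followed by ten zero bits
    print(hex(high), hex(low))         # 0xd83d 0xde00

    # The utf-16-le codec gives the same pair, byte-swapped in memory:
    print("\U0001F600".encode("utf-16-le").hex())   # 3dd800de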

UTF-16 lacks the advantages of *both* UTF-32 and UTF-8. The only advantage it arguably has over UTF-8 is that for certain languages it can result in smaller encodings. But it is complicated, and a major source of bugs in supposedly “Unicode compliant” code. Unlike UTF-32, it is not a fixed-width encoding, but a lot of buggy code treats it as fixed-width and breaks when higher code points are used.

In UTF-8, each code point is stored in 1 to 4 bytes. Every byte in UTF-8 corresponds to one of the following patterns:

– 0xxxxxxx One byte code point
– 110xxxxx Initial byte of two-byte code point
– 1110xxxx Initial byte of three-byte code point
– 11110xxx Initial byte of four-byte code point
– 10xxxxxx Non-initial byte of any code point

Looking at the above, one-byte encodings can hold 7 bits of information, so they correspond to the first 2^7 code points (U+00 to U+7F, the ASCII characters). Two-byte encodings can hold 11 bits (5 in the initial byte and 6 in the following byte). Three-byte encodings can hold 16 bits, and four-byte encodings can hold 21.

In theory, there is room for more, but with just these options UTF-8 can already represent more code points than UTF-16 can.
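Those capacities give you a simple rule for how many bytes a code point needs. A rough Python sketch (utf8_len is just a made-up helper name for this example):

    def utf8_len(cp):
        if cp <= 0x7F:      # fits in 7 bits  -> 1 byte
            return 1
        if cp <= 0x7FF:     # fits in 11 bits -> 2 bytes
            return 2
        if cp <= 0xFFFF:    # fits in 16 bits -> 3 bytes
            return 3
        return 4            # up to 21 bits   -> 4 bytes

    # Matches what Python's encoder actually produces:
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        assert utf8_len(cp) == len(chr(cp).encode("utf-8"))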

The same grinning face emoji (U+1F600) would be encoded as:

– In binary, the code point is (11111011000000000). This is 17 bits, and so only fits into a 4 byte encoding
– Pad the code point to 21 bits (000011111011000000000)
– It’s a 4 byte encoding, so the initial byte is 11110xxx
– Take the first 3 bits and put them in the initial byte, to get 11110000=0xF0
– Then the next three bytes are 10xxxxxx. Put the next six bits in each one: (10011111=0x9F, 10011000=0x98, 10000000=0x80)
– Then the grinning face is encoded as 0xF0 0x9F 0x98 0x80. (Regardless of endianness this is F0 9F 98 80 in memory).
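The same steps in a small Python sketch, checked against Python’s own encoder (just an illustration, not how real encoders are written):

    cp = 0x1F600
    b1 = 0xF0 | (cp >> 18)           # 11110xxx with the top 3 of the 21 bits
    b2 = 0x80 | ((cp >> 12) & 0x3F)  # 10xxxxxx with the next 6 bits
    b3 = 0x80 | ((cp >> 6) & 0x3F)   # 10xxxxxx with the next 6 bits
    b4 = 0x80 | (cp & 0x3F)          # 10xxxxxx with the last 6 bits
    print(bytes([b1, b2, b3, b4]).hex())        # f09f9880
    print("\U0001F600".encode("utf-8").hex())   # f09f9880, same bytes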

UTF-8 has many advantages. It is backwards compatible with ASCII, and a lot of code and libraries written to work with ASCII text will just work with UTF-8. It is always smaller than UTF-32, and in most realistic situations it’s also smaller than UTF-16. It’s simpler and less prone to bugs than UTF-16. The encoding does not depend on whether the machine is big endian or little endian. And it is self-synchronizing: since the initial byte of a code point can always be told apart from a following byte, losing some bytes doesn’t corrupt the whole thing, because a program can just skip ahead to the next initial byte and start reading again.
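Here is a tiny Python sketch of what that skipping looks like in practice (the slice is just simulating a couple of lost bytes at the start):

    # Continuation bytes always look like 10xxxxxx, so after damage a decoder
    # can skip to the next byte that is NOT of that form and resume there.
    data = "h\u00e9llo \U0001F600".encode("utf-8")
    damaged = data[2:]                 # pretend the first 2 bytes were lost,
                                       # cutting right through a character

    i = 0
    while i < len(damaged) and (damaged[i] & 0xC0) == 0x80:
        i += 1                         # skip stray continuation bytes
    print(damaged[i:].decode("utf-8")) # "llo " plus the grinning face, intact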

If you are a developer, what you need to know is mostly:

– Use UTF-8 if you have any choice in the matter
