ASCII and UTF-8, -16, -32

ASCII and UTF-8, -16, -32

May 04, 2020
  • ASCII: 7 bits (1 byte)

  • Extended ASCII: 8 bits (1 byte)

  • UTF-8: min 8, max 32 bits (min 1, max 4 bytes)

  • UTF-16: min 16, max 32 bits (min 2, max 4 bytes)

  • UTF-32: 32 bits (4 bytes)

Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'". Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-8, UTF-16 and UTF-32.

  • UTF-8: "size optimized": best suited for Latin character based data (or ASCII), it takes only 1 byte per character but the size grows accordingly symbol variety

  • UTF-16: "balance": it takes minimum 2 bytes per character which is enough for existing set of the mainstream languages, but size is still variable and up to 4 bytes

  • UTF-32: "performance": allows using of simple algorithms as result of fixed size characters (4 bytes per character) but with memory disadvantage

Code length comparison between UTF-8 and UTF-16:

  • U+0000 - U+007F: UTF-8 is shorter (1 < 2)

  • U+0080 - U+07FF: Both are the same size (2 = 2)

  • U+0800 - U+FFFF: UTF-8 is longer (3 > 2)

  • U+10000 - U+10FFFF: Both are the same size (4 = 4)

An issue with the UTF-16 and UTF-32 encodings is that the order of the bytes will depend on the endianness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.

UTF-16 and UTF-32 use units bigger than 8 bits, and so are sensitive to endianness. A single unit can be stored as big endian (BE: most significant bits first) or little endian (LE: less significant bits first). BOM (Byte Order Marks) is a short byte sequence to indicate the encoding and the endian.

For more info, please reference here: https://kunststube.net/encoding/

Enjoy this post?

Buy Adam Adler, Ph.D. a coffee

More from Adam Adler, Ph.D.