UTF-8 is a character encoding that utilizes several core bit- and byte-level design decisions. It can be thought of in several ways; one way is
65 is a capital letter A and
128026 is a spiral shell.
Unicode is an international standard that is best known for its mapping between characters and non-negative integers called code points.
Code points are most often written with U+ followed by an upper-case hexadecimal representation of the integer, zero-padded to at least 4 digits; thus capital A is U+0041 (code point 0x41) and spiral shell is U+1F41A (code point 0x1F41A). Unicode supports code points between U+0000 and U+10FFFF, though most of those are currently unassigned (as of January 2024, fewer than 4% of code points had assigned meanings).
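The U+ notation can be reproduced with ordinary integer formatting. The following is a small sketch in Python; the helper name `code_point_label` is made up for this illustration and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+ notation."""
    # ord() returns the code point as an integer; the 04X format spec
    # zero-pads to at least four upper-case hexadecimal digits.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))          # U+0041
print(code_point_label(chr(128026)))  # U+1F41A
```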
The Unicode standard is more complicated than just a mapping between code points and glyphs. Each character is categorized in multiple ways, such as U+0041 being an upper-case letter where the lower-case pair is U+0061. Some code points encode metadirectives like "advance to the next tab stop" or "switch writing direction". Some code points adjust other characters, adding accent marks or changing the coloration of an emoji. Some glyphs can be created in multiple ways, leading to various normalization algorithms being part of Unicode. When talking about encoding languages one also runs into other translation and localization concepts like what thousands separator to use when displaying numbers and how to format dates. The Unicode standard covers all of this and more.
That said, the vast majority of times a programmer refers to Unicode, they mean its defined set of code points and sometimes the defined meaning of each character, as summarized in this file: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt.
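Many languages ship a copy of this character data. As one illustration, Python's standard `unicodedata` module exposes names, categories, and normalization; the sketch below looks up U+0041 and shows two ways of building the same accented letter. The variable names are chosen only for this example.

```python
import unicodedata

ch = "\u0041"
print(unicodedata.name(ch))        # LATIN CAPITAL LETTER A
print(unicodedata.category(ch))    # Lu (letter, upper-case)
print(f"U+{ord(ch.lower()):04X}")  # U+0061

# Two ways to build the same glyph: one precomposed code point, or a
# base letter followed by a combining accent mark.
precomposed = "\u00E9"  # e with acute accent as a single code point
combined = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
print(precomposed == combined)                                # False
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
```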
UTF-8 splits a code point into one to four bit groups, depending on the number of bits needed to represent the code point as a binary integer.
Bits needed | Groups used |
---|---|
0–7 | 1 |
8–11 | 2 |
12–16 | 3 |
17–21 | 4 |
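The table can be read as a lookup on the bit length of the code point. A minimal sketch, assuming Python's `int.bit_length()` as the measure of "bits needed"; the function name `groups_needed` is hypothetical.

```python
def groups_needed(code_point: int) -> int:
    """Return how many bit groups UTF-8 uses for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    bits = code_point.bit_length()  # 0 for code point 0
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    return 4  # 17 to 21 bits


print(groups_needed(0x41))     # 1
print(groups_needed(0x1F41A))  # 4
```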
All but the highest-order group get 6 bits each; the highest-order group gets the rest.
Bit groups are arranged in big-endian order.
U+1F41A uses 17 bits, so it needs 4 bit groups. We can find them by writing the code point in binary (1 1111 0100 0001 1010) and grouping 6 bits at a time from the low-order bit (11111 010000 011010), which leaves no bits for the fourth, highest-order group, then ordering high-to-low to get [0, 0b11111, 0b010000, 0b011010].
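The same grouping can be done with a mask and a shift. This is a sketch only; `bit_groups` is a name invented here, and the number of groups is passed in by the caller rather than computed.

```python
def bit_groups(code_point: int, count: int) -> list[int]:
    """Split a code point into `count` bit groups, highest-order first."""
    groups = []
    for _ in range(count - 1):
        groups.append(code_point & 0b111111)  # take the low 6 bits
        code_point >>= 6                      # and discard them
    groups.append(code_point)  # the highest-order group gets the rest
    groups.reverse()           # big-endian: high-order group first
    return groups


print([bin(g) for g in bit_groups(0x1F41A, 4)])
# ['0b0', '0b11111', '0b10000', '0b11010']
```

Note that `bin()` drops leading zeros, so 0b010000 prints as 0b10000.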
Each bit group is padded in its highest-order bits to make it a full byte.
Each group except the first uses the pad 0b10. The first uses a pad of 0b0 if it’s the only group, or 0b1…10 with n 1 bits if there are n > 1 groups.
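Putting the grouping and padding rules together gives a small encoder. The sketch below is meant to mirror the steps described here, not to replace a library codec: `utf8_encode` is a hypothetical name, and the function only range-checks its input, so it does not reject the code points that Unicode reserves as never being used.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 using the bit-group rules."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    bits = code_point.bit_length()
    if bits <= 7:
        # Only group: the pad is a single 0 bit, so the byte is the value.
        return bytes([code_point])
    n = 2 if bits <= 11 else 3 if bits <= 16 else 4
    out = []
    for _ in range(n - 1):
        # Every group except the first: pad 0b10 in front of 6 bits.
        out.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # First group: n 1 bits, then a 0, then the remaining bits.
    lead_pad = (0xFF << (8 - n)) & 0xFF
    out.append(lead_pad | code_point)
    return bytes(reversed(out))  # groups were collected low-to-high


print(utf8_encode(0x1F41A).hex(" "))                        # f0 9f 90 9a
print(utf8_encode(0x1F41A) == "\U0001F41A".encode("utf-8"))  # True
```

Comparing against Python's built-in `str.encode("utf-8")`, as in the last line, is a convenient way to check hand-computed encodings.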
This means that every byte of a valid encoding will have at least one 0 bit, and that the meaning of the byte can be found by splitting the byte into two parts: n 1s before the highest-order 0 and the binary number after the highest-order 0. If n is 0, this is the only byte of a 1-byte character. If n is 1, this is not the first byte of a character. If n is 2, 3, or 4, this is the first byte of a character with n bytes. If n is 5 or more, this is not part of a valid UTF-8 encoding. (It’s also not part of a valid UTF-8 encoding if there are too many contiguous n = 1 bytes for the preceding n ≠ 1 byte, or if the code point is larger than 0x10FFFF or is one of the few code points that Unicode has identified as never being used.)
Byte | Meaning |
---|---|
0xxxxxxx | only byte of character |
10xxxxxx | second, third, or fourth byte of a character |
110xxxxx | first byte of a two-byte character |
1110xxxx | first byte of a three-byte character |
11110xxx | first byte of a four-byte character |
11111xxx | invalid |
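The table translates directly into a decoder that counts leading 1 bits. The following is a simplified sketch under stated assumptions: the names `leading_ones`, `classify`, and `decode_first` are invented for this example, and the stricter checks a production decoder performs (such as rejecting code points Unicode has identified as never being used) are omitted.

```python
def leading_ones(byte: int) -> int:
    """Count the 1 bits before the highest-order 0 bit of a byte."""
    n = 0
    mask = 0b10000000
    while mask and byte & mask:
        n += 1
        mask >>= 1
    return n


def classify(byte: int) -> str:
    """Describe a byte's role, following the table's rows."""
    n = leading_ones(byte)
    if n == 0:
        return "only byte of character"
    if n == 1:
        return "second, third, or fourth byte of a character"
    if n <= 4:
        return f"first byte of a {n}-byte character"
    return "invalid"


def decode_first(data: bytes) -> tuple[int, int]:
    """Return (code point, byte count) for the first character."""
    if not data:
        raise ValueError("empty input")
    n = leading_ones(data[0])
    if n == 0:
        return data[0], 1
    if n == 1 or n > 4:
        raise ValueError("invalid first byte")
    if len(data) < n:
        raise ValueError("truncated character")
    value = data[0] & (0xFF >> (n + 1))  # bits after the highest-order 0
    for b in data[1:n]:
        if leading_ones(b) != 1:
            raise ValueError("expected a 10xxxxxx byte")
        value = (value << 6) | (b & 0b00111111)
    if value > 0x10FFFF:
        raise ValueError("code point out of range")
    return value, n


cp, length = decode_first(bytes([0xF0, 0x9F, 0x90, 0x9A, 0x41]))
print(f"U+{cp:04X}", length)  # U+1F41A 4
print(classify(0x9F))         # second, third, or fourth byte of a character
```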
Encode U+0041 (capital A) in UTF-8
Encode U+06CD (Arabic yeh with tail, ۍ) in UTF-8
Encode U+2331 (dimensional origin symbol, ⌱) in UTF-8
Encode U+12500 (cuneiform sign LAK-608, 𒔀) in UTF-8
Decode the first UTF-8 character in the byte sequence F8 93 EA 80 B2 5C 00