4.8. Character Encoding

4.8.1. Introduction to Character Encoding

For many years Americans have exchanged text using the ASCII character set; since essentially all U.S. systems support ASCII, this permits easy exchange of English text. Unfortunately, ASCII is completely inadequate in handling the characters of nearly all other languages. For many years different countries have adopted different techniques for exchanging text in different languages, making it difficult to exchange data in an increasingly interconnected world.

More recently, ISO has developed ISO 10646, the ``Universal Mulitple-Octet Coded Character Set (UCS). UCS is a coded character set which defines a single 31-bit value for each of all of the world's characters. The first 65536 characters of the UCS (which thus fit into 16 bits) are termed the ``Basic Multilingual Plane'' (BMP), and the BMP is intended to cover nearly all of today's spoken languages. The Unicode forum develops the Unicode standard, which concentrates on the UCS and adds some additional conventions to aid interoperability. Historically, Unicode and ISO 10646 were developed by competing groups, but thankfully they realized that they needed to work together and they now coordinate with each other.

If you're writing new software that handles internationalized characters, you should be using ISO 10646/Unicode as your basis for handling international characters. However, you may need to process older documents in various older (language-specific) character sets, in which case, you need to ensure that an untrusted user cannot control the setting of another document's character set (since this would significantly affect the document's interpretation).

4.8.2. Introduction to UTF-8

Most software is not designed to handle 16 bit or 32 bit characters, yet to create a universal character set more than 8 bits was required. Therefore, a special format called ``UTF-8'' was developed to encode these potentially international characters in a format more easily handled by existing programs and libraries. UTF-8 is defined, among other places, in IETF RFC 2279, so it's a well-defined standard that can be freely read and used. UTF-8 is a variable-width encoding; characters numbered 0 to 0x7f (127) encode to themselves as a single byte, while characters with larger values are encoded into 2 to 6 bytes of information (depending on their value). The encoding has been specially designed to have the following nice properties (this information is from the RFC and Linux utf-8 man page):

The classical US ASCII characters (0 to 0x7f) encode as themselves, so files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. This is fabulous for backwards compatibility with the many existing U.S. programs and data files.
All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd. This means that no ASCII byte can appear as part of another character. Many other encodings permit characters such as an embedded NIL, causing programs to fail.
It's easy to convert between UTF-8 and a 2-byte or 4-byte fixed-width representations of characters (these are called UCS-2 and UCS-4 respectively).
The lexicographic sorting order of UCS-4 strings is preserved, and the Boyer-Moore fast search algorithm can be used directly with UTF-8 data.
All possible 2^31 UCS codes can be encoded using UTF-8.
The first byte of a multibyte sequence which represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd and indicates how long this multibyte sequence is. All further bytes in a multibyte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization; if a byte is missing, it's easy to skip forward to the ``next'' character, and it's always easy to skip forward and back to the ``next'' or ``preceding'' character.

In short, the UTF-8 transformation format is becoming a dominant method for exchanging international text information because it can support all of the world's languages, yet it is backward compatible with U.S. ASCII files as well as having other nice properties. For many purposes I recommend its use, particularly when storing data in a ``text'' file.

4.8.3. UTF-8 Security Issues

The reason to mention UTF-8 is that some byte sequences are not legal UTF-8, and this might be an exploitable security hole. UTF-8 encoders are supposed to use the ``shortest possible'' encoding, but naive decoders may accept encodings that are longer than necessary. If an attacker intentionally creates an unusually long format, input checkers might not notice the problem. The RFC describes the problem this way:

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 (illegal because it's longer than necessary) and interpret it as a NUL character (00). Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.

A longer discussion about this is available at Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux at http://www.cl.cam.ac.uk/~mgk25/unicode.html.

4.8.4. UTF-8 Legal Values

Thus, when accepting UTF-8 input, you need to check if it's valid UTF-8. Here is a list of all legal UTF-8 sequences; any character sequence not matching this table is not a legal UTF-8 sequence. In the following table, the first three rows list the legal UTF-8 sequences in binary, hexadecimal, and octal. The last row lists the UCS code region that the sequence encodes to (in hexadecimal). In the binary column, the ``x'' indicates that either a 0 or 1 is legal in the sequence. In the other columns, a ``-'' indicates a range of legal values (inclusive). Of course, just because a sequence is a legal UTF-8 sequence doesn't mean that you should accept it (see the other issues discussed in this book).

Table 4-1. Legal UTF-8 Sequences

Binary	Hexadecimal	Octal	Decimal	UCS Code (Hex)
0xxxxxxx	00-7F	0-177	0-127	0000000-0000007F
110xxxxx 10xxxxxx	C0-DF 80-BF	300-337 200-277	192-223 128-191	00000080-000007FF
1110xxxx 10xxxxxx 10xxxxxx	E0-EF 80-BF 80-BF	340-357 200-277 200-277	224-239 128-191 128-191	00000800-0000FFFF
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	F0-F7 80-BF 80-BF 80-BF	360-367 200-277 200-277 200-277	240-247 128-191 128-191 128-191	00010000-001FFFFF
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx	F8-FB 80-BF 80-BF 80-BF 80-BF	370-373 200-277 200-277 200-277 200-277	248-251 128-191 128-191 128-191 128-191	00200000-03FFFFFF
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx	FC-FD 80-BF 80-BF 80-BF 80-BF 80-BF	374-375 200-277 200-277 200-277 200-277 200-277	252-253 128-191 128-191 128-191 128-191 128-191	04000000-7FFFFFFF

I should note that in some cases, you might want to cut slack (or use internally) the hexadecimal sequence C0 80. This is an overlong sequence that, if permitted, can represent ASCII NUL (NIL). Since C and C++ have trouble including a NIL character in an ordinary string, some people have taken to using this sequence when they want to represent NIL as part of the data stream; Java even enshrines the practice. Feel free to use C0 80 internally while processing data, but technically you really should translate this back to 00 before saving the data. Depending on your needs, you might decide to be ``sloppy'' and accept C0 80 as input in a UTF-8 data stream. If it doesn't harm security, it's probably a good practice to accept this sequence since accepting it aids interoperability.

4.8.5. UTF-8 Illegal Values

The UTF-8 character set is one case where it's possible to enumerate all illegal values (and prove that you've enumerated them all). You detect an illegal sequence by checking for two things: (1) is the initial sequence legal, and (2) if it is, is the first byte followed by the required number of valid continuation characters? Performing the first check is easy; the following is provably the complete list of all illegal UTF-8 initial sequences:

Table 4-2. Illegal UTF-8 initial sequences

UTF-8 Sequence	Reason for Illegality
10xxxxxx	illegal as initial byte of character (80..BF)
1100000x	illegal, overlong (C0 80..BF)
11100000 100xxxxx	illegal, overlong (E0 80..9F)
11110000 1000xxxx	illegal, overlong (F0 80..8F)
11111000 10000xxx	illegal, overlong (F8 80..87)
11111100 100000xx	illegal, overlong (FC 80..83)
1111111x	illegal; prohibited by spec

The second step is to check if the correct number of continuation characters are included in the string. If the first byte has the top 2 bits set, you count the number of ``one'' bits set after the top one, and then check that there are that many continuation bytes which begin with the bits ``10''. So, binary 11100001 requires two more continuation bytes.

Again, although C0 80 is technically an illegal UTF-8 sequence, for interoperability purposes you might want to accept it as a synonym for character 0 (NIL).

4.8.6. UTF-8 Related Issues

This section has discussed UTF-8, because it's the most popular multibyte encoding of UCS, simplifying a lot of international text handling issues. However, it's certainly not the only encoding; there are other encodings, such as UTF-16 and UTF-7, which have the same kinds of issues and must be validated for the same reasons.

Another issue is that some phrases can be expressed in more than one way in ISO 10646/Unicode. For example, some accented characters can be represented as a single character (with the accent) and also as a set of characters (e.g., the base character plus a separate composing accent). These two forms may appear identical. There's also a zero-width space that could be inserted, with the result that apparently-similar items are considered different. Beware of situations where such hidden text could interfere with the program. This is an issue that in general is hard to solve; most programs don't have such tight control over the clients that they know completely how a particular sequence will be displayed (since this depends on the client's font, display characteristics, locale, and so on).