In the last post I talked about the importance of selecting an appropriate character set early in a project. Today, we continue with a related and equally important decision: the encoding method.
Agree on the Encoding Method
The encoding method defines how character codes are represented in a computer. If the reader and the writer of a character code assume different encoding methods (e.g. writing UTF-8 and reading UTF-16), the resulting character is most likely wrong. Transformations between the different encodings are often possible (e.g. with libraries like ICU), but they cost performance.
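To make the mismatch concrete, here is a minimal sketch showing the same two bytes interpreted under two different encoding assumptions. The function names are mine, for illustration only; the byte values 0x48 0x69 are the UTF-8 (and ASCII) encoding of "Hi", but read as a UTF-16LE code unit they form 0x6948, a CJK ideograph:

```c
#include <stdint.h>

/* Read the first byte as UTF-8: ASCII-range bytes decode to themselves. */
uint8_t first_utf8_byte(const uint8_t *b) {
    return b[0];
}

/* Read the first two bytes as one little-endian UTF-16 code unit. */
uint16_t first_utf16le_unit(const uint8_t *b) {
    return (uint16_t)b[0] | ((uint16_t)b[1] << 8);
}
```

The bytes { 0x48, 0x69 } yield 'H' and 'i' under UTF-8, but the single code unit 0x6948 under UTF-16LE; a completely different character, just from a wrong assumption.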
I have seen systems where character codes were transformed back and forth several times just because a single component in the processing chain required a particular encoding method. This should be avoided. Therefore, projects should agree on the encoding methods they intend to support. Ideally, there is just one and thus no need for any transformations.
Modal vs. nonmodal encoding methods
Some encoding methods (e.g. ISO-2022) are modal, meaning that the interpretation of a given byte sequence depends on the current mode. Special control characters are used for switching between modes. Modes can represent different character sets and they can even use different character lengths (e.g. one-byte and two-byte).
Nonmodal encoding methods (e.g. UTF-8) don’t switch between modes; the byte sequence of each encoded character can be decoded individually without looking at previous control characters. Processing text encoded with a nonmodal method is therefore a bit simpler. All standard encoding methods for Unicode are nonmodal.
Fixed-length vs. variable-length encoding methods
Encoding methods are fixed-length or variable-length. Fixed-length encoding methods (e.g. ASCII, UTF-32) use the same number of bits for each character. This is convenient for text processing because the size of a string does not depend on its content and all characters can be addressed directly. It also helps when dynamic memory allocation is not allowed (as is often the case in safety-critical embedded systems) because we can easily define the buffer size for a given number of characters without knowing their (dynamic) content.
For variable-length encoding methods (e.g. UTF-8) we’d have to assume the worst case and expect every character to have the maximum length. The drawback of most fixed-length encoding methods is wasted storage space: some characters (e.g. Latin) occur much more often than others (e.g. characters only used in Chinese family names), and assigning the frequent ones shorter codes saves a lot of memory. For example, UTF-8 requires only one byte for the (very frequent) ASCII characters, up to three bytes for characters within the Basic Multilingual Plane (BMP) of Unicode, and four bytes only for the (rarely used) characters outside the BMP.
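The worst-case sizing argument above can be written down directly. This is a sketch with a name I chose myself, assuming UTF-8 (maximum four bytes per character) and a C-style NUL terminator:

```c
#include <stddef.h>

/* Maximum bytes per character in UTF-8 (characters outside the BMP). */
#define UTF8_MAX_BYTES_PER_CHAR 4

/* Worst-case buffer size in bytes for n_chars characters of unknown
   content, plus one byte for the terminating 0x00 of a C string. */
size_t utf8_worst_case_buffer(size_t n_chars) {
    return n_chars * UTF8_MAX_BYTES_PER_CHAR + 1;
}
```

With this rule, a statically allocated buffer for a 10-character field needs 41 bytes, regardless of which characters it will eventually hold.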
Character set affects the choice
There are many encoding methods around, but most of them are designed for a particular character set and region, so once you have selected your character set the choice is limited.
For example, when using Unicode, you can basically choose between UTF-8, UTF-16, and UTF-32. Nowadays, Unicode is becoming the dominant character set also for embedded systems, so let’s have a closer look at each of these three encoding methods. They are all capable of encoding the full range of 1,112,064 characters defined by Unicode and are well supported by many libraries.
Encoding methods for Unicode
First, a word about endianness. When using encoding methods with multi-byte words (e.g. UTF-16, UTF-32), endianness must be considered, especially when transferring text between devices. You should therefore either explicitly define which endianness you are using (UTF-16LE or UTF-32LE for little endian, UTF-16BE or UTF-32BE for big endian), or place a special Unicode character at the beginning of the text, the so-called Byte Order Mark (BOM, code point U+FEFF).
For example, if the UTF-16 text is big endian, the BOM is stored as <0xFE 0xFF>; if it is little endian, it is stored as <0xFF 0xFE>. This allows the reader of the text to determine its endianness.
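The BOM check described above takes only a few lines. A minimal sketch (the type and function names are my own) that inspects the first two bytes of a UTF-16 buffer:

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { ENDIAN_UNKNOWN, ENDIAN_BIG, ENDIAN_LITTLE } endianness_t;

/* Detect UTF-16 endianness from a leading BOM (U+FEFF).
   <0xFE 0xFF> means big endian, <0xFF 0xFE> little endian. */
endianness_t utf16_detect_endianness(const uint8_t *buf, size_t len) {
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) return ENDIAN_BIG;
        if (buf[0] == 0xFF && buf[1] == 0xFE) return ENDIAN_LITTLE;
    }
    return ENDIAN_UNKNOWN;
}
```

If no BOM is present, the reader must fall back on an out-of-band agreement (e.g. the label UTF-16LE in a protocol header).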
UTF-8 is a nonmodal, variable-length encoding method with each character being encoded in one to four 8-bit words. It has some interesting and useful properties. First, the byte 0x00 appears only in the encoding of the NUL character (U+0000) itself, never inside any other character. This is very useful when using string APIs for C-style strings, as they expect 0x00 to terminate a string. Second, it is ASCII-compatible in the sense that the codes 0–127 are identical to ASCII, so a UTF-8 system can correctly read ASCII text. Third, because it uses only 8-bit words, there are no issues with endianness and no BOM is required. However, in some cases a BOM (<0xEF 0xBB 0xBF> in UTF-8) is used to make clear that the text is UTF-8 and not plain ASCII. UTF-8 is the main encoding method used on the Web.
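The one-to-four-byte structure of UTF-8 follows fixed code-point boundaries. A minimal sketch (function name chosen by me) returning the encoded length of a given code point:

```c
#include <stdint.h>

/* Number of bytes UTF-8 needs for a Unicode code point, or -1 if the
   code point is outside the valid Unicode range (> U+10FFFF). */
int utf8_encoded_length(uint32_t cp) {
    if (cp <= 0x7F)     return 1;  /* ASCII range, identical to ASCII   */
    if (cp <= 0x7FF)    return 2;  /* e.g. most Latin-based diacritics  */
    if (cp <= 0xFFFF)   return 3;  /* rest of the BMP                   */
    if (cp <= 0x10FFFF) return 4;  /* supplementary planes (non-BMP)    */
    return -1;
}
```

For example, 'A' takes one byte, the euro sign U+20AC takes three, and an emoji such as U+1F600 takes four.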
UTF-16 is, like UTF-8, nonmodal and variable-length. Each character is encoded in either one or two 16-bit words. All characters of the Basic Multilingual Plane (BMP) are encoded in a single 16-bit word, so if the BMP contains all your required characters, you can treat UTF-16 like a fixed-length encoding method. However, when using UTF-16 I’d recommend implementing full support, as the need for non-BMP characters may suddenly arise late in the project. Characters outside the BMP are encoded in two 16-bit words, each taken from a special range (the high and low surrogate areas). That means that by looking at a single 16-bit word you can tell whether it is a complete character on its own, or the first or second word of a surrogate pair.
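This self-describing property of UTF-16 words comes from the fixed surrogate ranges: high surrogates occupy 0xD800–0xDBFF, low surrogates 0xDC00–0xDFFF. A minimal classification sketch (the names are my own):

```c
#include <stdint.h>

typedef enum {
    UTF16_SINGLE,          /* a complete BMP character            */
    UTF16_HIGH_SURROGATE,  /* first word of a surrogate pair      */
    UTF16_LOW_SURROGATE    /* second word of a surrogate pair     */
} utf16_unit_t;

/* Classify one 16-bit UTF-16 word without any surrounding context. */
utf16_unit_t utf16_classify(uint16_t unit) {
    if (unit >= 0xD800 && unit <= 0xDBFF) return UTF16_HIGH_SURROGATE;
    if (unit >= 0xDC00 && unit <= 0xDFFF) return UTF16_LOW_SURROGATE;
    return UTF16_SINGLE;
}
```

Because the surrogate ranges never occur as standalone BMP characters, this classification needs no mode or history, which is exactly what makes UTF-16 nonmodal.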
UTF-32 is a nonmodal, fixed-length encoding with each character being encoded in a 32-bit word. Its words directly contain the Unicode code point extended to 32-bit. For example, U+FEFF (the BOM) is represented as 0x0000FEFF (big endian) or 0xFFFE0000 (little endian).
Use UTF-8 or UTF-16 for Embedded Systems
There are many encoding methods available, but once you have defined your character set the choice is limited. When using Unicode, I’d recommend either UTF-8 or UTF-16, as the storage overhead of UTF-32 is often unacceptable for embedded systems. Also make sure to check the encoding methods supported by your font rendering engine; it may further restrict your options.
Before trying to implement text processing functions yourself (e.g. counting characters in a text or transforming between encodings), consider using existing libraries such as ICU.
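To show how even the simplest such function depends on the encoding, here is a sketch (my own function name, assuming valid UTF-8 input) that counts code points by skipping UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx:

```c
#include <stddef.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
   A byte starts a new character unless its top two bits are 10. */
size_t utf8_count_codepoints(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Note that this counts code points, not user-perceived characters (grapheme clusters); getting the latter right is one of the reasons to prefer a library like ICU.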
That’s it for part two. In the next part I will discuss fonts and font rendering techniques.
Now I would like to learn from your experience with encoding methods for embedded systems. Do you share my arguments?