Displaying text in the user’s language is taken for granted in the PC world. However, it is also becoming more common for embedded systems. Even small devices, which may have just had a segment display a couple of years ago, are now often capable of displaying high-quality characters on pixel displays.
Developing multi-language support for an embedded system can be surprisingly tricky, especially if non-Latin scripts such as Chinese characters are involved. Depending on the size of the system, we may not have the language infrastructure of PC or mobile phone operating systems available to us. Limited hardware resources may not allow for storing several big fonts, rendering outline (e.g. TrueType) fonts, or smoothing (anti-aliasing) characters.
Projects often face problems like:
- Hardware resources of the system (especially memory) are not sufficient for languages with a high number of characters (e.g. Chinese).
- Screen resolution is not sufficient for complex characters (e.g. Chinese).
- The UI design does not allocate enough screen space for texts with larger characters or longer words.
- The font (already purchased) does not contain all the required characters.
- The selected encoding is not capable of representing all the required characters.
- Software makes wrong assumptions about the encoding (e.g. the assumption of “one byte per character”).
- Characters look jagged because the system is not fast enough to do anti-aliasing.
- Software design and UI design make it nearly impossible to add right-to-left rendering (e.g. for Arabic or Hebrew).
- If the device provides text input, the chosen input method is not capable of entering text of writing systems such as Chinese or Japanese.
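The “one byte per character” assumption from the list above is easy to demonstrate. The following is only a minimal sketch in Python (chosen for brevity, not because it would run on the target); the function names `count_chars_and_bytes` and `truncate_utf8` are made up for this illustration, and UTF-8 is assumed as the encoding.

```python
def count_chars_and_bytes(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text so that its UTF-8 form fits into max_bytes.

    Cutting the byte string blindly can split a multi-byte character.
    Decoding with errors="ignore" drops such an incomplete tail.
    """
    data = text.encode("utf-8")[:max_bytes]
    return data.decode("utf-8", errors="ignore")


label = "中文 ok"
chars, size = count_chars_and_bytes(label)
print(chars, size)              # code points vs. bytes differ
print(truncate_utf8(label, 4))  # only the first character fits completely
```

Code that sizes buffers or truncates strings by counting bytes as characters will work for English and then break as soon as the first Chinese text is loaded.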
A common reason for these problems is a lack of language-specific knowledge in the development team. The translation to other languages is done by another team or even another organization, and the language knowledge of the translators is not available to the developers. Adhering to the principles of iterative, incremental development, the team starts with a single language (usually English) and plans to do the other languages later. However, some design decisions (e.g. the character set) are very hard to change later. It is important to understand the specific requirements of the target languages and to consider them in early iterations.
This post starts a mini-series about internationalization (i18n) for embedded systems with each post focusing on one design aspect. Today, we start with the selection of a character set.
Selecting a character set
As the name suggests, a character set defines a collection of characters. In a coded character set a unique code is assigned to each character. There are also non-coded character sets but they are not relevant for our purpose and will not be considered further. Without knowing the underlying character set a given character code is meaningless. Therefore, you should explicitly define which character set to use in your project in order to avoid confusion and misunderstandings.
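To illustrate why a character code is meaningless without its character set, here is a small sketch in Python. It interprets the same code, 0xE4, under three standard 8-bit character sets using the codecs that ship with Python; the variable names are just for this example.

```python
# The same one-byte code, interpreted under three different character sets.
code = bytes([0xE4])

interpretations = {
    "latin-1": code.decode("latin-1"),      # Western European (ISO 8859-1)
    "cp1251": code.decode("cp1251"),        # Cyrillic (Windows-1251)
    "iso8859_7": code.decode("iso8859_7"),  # Greek (ISO 8859-7)
}

for charset, char in interpretations.items():
    print(f"0xE4 in {charset}: {char} (U+{ord(char):04X})")
```

The code is identical in all three cases, yet it denotes a Latin, a Cyrillic and a Greek letter respectively.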
Usually, a character set is defined by a national or international standard. Widely known examples are ASCII (ANSI X3.4) and Unicode (ISO/IEC 10646). While Unicode is slowly replacing region-specific character sets, there are still many other relevant standards around for which support might be required. For example, the Chinese standard GB 2312 (containing 7445 characters) is still in use although it has officially been replaced by the Unicode-compatible GB 18030.
By choosing a big international character set such as Unicode, you can represent practically every character you might ever need. However, that does not mean that you can also display them. It is very unlikely that you’ll have a font for your embedded system containing all Unicode characters (more on fonts in a later post). Therefore, you have to agree with your client on a manageable subset. This can be hard, as clients may also lack detailed knowledge about the target languages. Region-specific character sets (such as the Chinese GB 2312) can be very helpful here, as they replace vague requests like “just make sure that the most common characters are supported” with a precisely defined list. They are also helpful for talking to a font vendor about the characters you need in the font (for Chinese, you don’t want to discuss individual characters). Automated acceptance tests should be set up together with the client to check for compliance with the agreed subset of characters.
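Such an acceptance check could look roughly like the following Python sketch, assuming the agreed subset is the repertoire of GB 2312. It relies on Python’s built-in `gb2312` codec to decide whether a character belongs to the set; the function name `find_unsupported` and the sample strings are hypothetical and only serve the illustration.

```python
def find_unsupported(strings, charset="gb2312"):
    """Return the sorted list of characters that are not in the charset.

    A character counts as supported if Python's codec for the charset
    can encode it. The default assumes GB 2312 as the agreed subset.
    """
    missing = set()
    for text in strings:
        for char in text:
            try:
                char.encode(charset)
            except UnicodeEncodeError:
                missing.add(char)
    return sorted(missing)


# Hypothetical UI texts; a real test would load the translation files.
ui_texts = ["设置", "字體", "5 €"]
print(find_unsupported(ui_texts))  # characters outside the agreed subset
```

Run against every translation delivery, a check like this reports stray characters (here a traditional Chinese character and the euro sign) before they show up as missing glyphs on the device.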
If you already have a GUI framework, the included font engine might restrict the character sets that you can use. Make sure to check the capabilities of the framework before deciding on a character set.
That’s it for the first part. In the next part, I will discuss character encodings. Have a look at it! You might want to click through my slides first.