Unicode Transformation Format (UTF) is a way of encoding the characters of the Unicode character set (which covers Western, Arabic, Chinese
and many other scripts) in a uniform way. The characters can be encoded with a variable number of bytes (UTF-8, 1 to 4 bytes), in 16-bit
units (UTF-16, 2 or 4 bytes) or in a fixed four bytes (UTF-32). Officially the encoding name is case insensitive and is spelled UTF, a dash,
then 8, 16 or 32, optionally followed by BE or LE for big endian or little endian (e.g. UTF-16LE or utf-8).
There are 3 types of encoding which must match for an XML file.
- Byte Order Mark
- XML encoding
- Character Encoding
Byte Order Mark
The Byte Order Mark (BOM) identifies the encoding used in a file. The BOM occupies the first bytes of a file: the first 2
bytes for UTF-16, the first 4 bytes for UTF-32 and the first 3 bytes for UTF-8.
The character used for the BOM is the Zero-Width No-Break Space character (U+FEFF), so it is not displayed in editors. Normally a BOM does not show up in the middle of a file, but if this happens for some reason the character will still not be displayed.
|Bytes (hex)||Encoding|
|00 00 FE FF||UTF-32, big-endian|
|FF FE 00 00||UTF-32, little-endian|
|FE FF||UTF-16, big-endian|
|FF FE||UTF-16, little-endian|
|EF BB BF||UTF-8|
In some situations the BOM is not required. The BOM for UTF-8 is optional, and if the XML encoding is set to UTF-16LE or UTF-16BE the BOM is omitted because the byte order is already given by the encoding name.
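The BOM table above can be turned into a small detection routine. The following is a minimal sketch, not a definitive implementation: the name `detect_bom` is a hypothetical helper, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]

def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(detect_bom(b"\xef\xbb\xbf<?xml"))   # ('UTF-8', 3)
print(detect_bom(b"\xfe\xff\x00<"))       # ('UTF-16BE', 2)
print(detect_bom(b"<?xml"))               # (None, 0)
```

Only the first four bytes of a file are needed, so in practice the data can come from `open(path, "rb").read(4)`.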
XML encoding
The encoding is specified in the XML declaration at the start of the file.
<?xml version="1.0" encoding="UTF-8"?>
The encoding must match the BOM. So if the BOM specifies a UTF-16 file, the encoding must also be set to UTF-16.
Browsers are known to give an error message when the BOM and the encoding do not match.
When the encoding is set to UTF-16LE or UTF-16BE the BOM is not set.
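The rule that the BOM and the declared encoding must agree can be checked mechanically. The sketch below is one possible approach under stated assumptions: `bom_matches_declaration` and the `DECLARATION` pattern are hypothetical names, the regular expression is a simplification of the real XML grammar, and a declaration of UTF-16LE or UTF-16BE together with a BOM is treated as a mismatch, following the rule described here.

```python
import codecs
import re

# (BOM bytes, Python codec used to read the declaration, encoding family).
# UTF-32 comes first because FF FE 00 00 starts with the UTF-16LE BOM FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be", "UTF-32"),
    (codecs.BOM_UTF32_LE, "utf-32-le", "UTF-32"),
    (codecs.BOM_UTF8, "utf-8", "UTF-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be", "UTF-16"),
    (codecs.BOM_UTF16_LE, "utf-16-le", "UTF-16"),
]

# Simplified pattern for the encoding attribute of the XML declaration.
DECLARATION = re.compile(r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def bom_matches_declaration(data: bytes) -> bool:
    """True when there is no BOM, or the BOM agrees with the declared encoding."""
    for bom, codec, family in BOMS:
        if data.startswith(bom):
            # Decode only the start of the file with the codec implied by the BOM.
            head = data[len(bom):len(bom) + 400].decode(codec, errors="replace")
            match = DECLARATION.match(head)
            if match is None:
                return True  # no encoding declared, nothing to contradict
            return match.group(1).upper() == family
    return True  # no BOM present

doc = '<?xml version="1.0" encoding="UTF-16"?><a/>'
print(bom_matches_declaration(codecs.BOM_UTF16_LE + doc.encode("utf-16-le")))  # True
```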
Character Encoding
Unicode uses 3 different encodings and 2 different byte orders. The 3 encodings are UTF-8, UTF-16 and UTF-32.
Since UTF-16 and UTF-32 characters are made up of 2 or 4 bytes, the byte order must also be specified. The encoding can start with
the least significant byte or the most significant byte; these are called the little endian and big endian byte orders.
The table below shows the bytes (in hex) of some characters in the different encodings.
|Character||UTF-8||UTF-16LE||UTF-16BE||UTF-32LE||UTF-32BE|
|z||7A||7A 00||00 7A||7A 00 00 00||00 00 00 7A|
|é||C3 A9||E9 00||00 E9||E9 00 00 00||00 00 00 E9|
|水 (Chinese water)||E6 B0 B4||34 6C||6C 34||34 6C 00 00||00 00 6C 34|
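The values in the table can be reproduced with Python's built-in codecs. This is a small sketch: `hex_bytes` is a hypothetical helper name, and the separator argument of `bytes.hex` requires Python 3.8 or later.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

def hex_bytes(char: str, encoding: str) -> str:
    """Encode one character and format the bytes as upper-case hex pairs."""
    return char.encode(encoding).hex(" ").upper()

# "\u00e9" is the letter e with acute accent, "\u6c34" is the Chinese character for water.
for char in ["z", "\u00e9", "\u6c34"]:
    print(char, [hex_bytes(char, enc) for enc in ENCODINGS])
```

The explicit `-le` and `-be` codec names are used because the plain `utf-16` and `utf-32` codecs prepend a BOM when encoding.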
Be aware that most Windows editors (e.g. Notepad, UltraEdit) will convert a file to UTF-16 or ASCII when opening it. If you want to see the exact encoding, use a hex editor.
Characters in UTF-8 are represented by 1 to 4 bytes (4-byte characters are not supported by all applications).
The first 128 characters are the same as the 128 characters of the US-ASCII character set, so the letter z is represented as 7A.
Characters above 127 use 2, 3 or 4 bytes. E.g. the é character has code 233 (E9) in Unicode and ISO 8859-1, which is bigger than 127, so it is
represented by the 2 byte code C3 A9.
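How the value 233 becomes the two bytes C3 A9 can be shown by doing the bit arithmetic by hand. The function below is an illustrative sketch (`utf8_encode` is a hypothetical name); it does not reject surrogates or values above 0x10FFFF, which a complete encoder would have to do.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 by hand (no validation)."""
    if code_point < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(0x7A).hex(" "))    # 7a
print(utf8_encode(233).hex(" "))     # c3 a9
print(utf8_encode(0x6C34).hex(" "))  # e6 b0 b4
```

For 233 (binary 11101001) the top two bits go into the first byte (110 00011 = C3) and the low six bits into the second (10 101001 = A9).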
The exact layout of UTF-8 is displayed in the table below:
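|Byte value (decimal)||Usage|
|0 - 31||Single-byte character, ASCII control characters|
|32 - 127||Single-byte character, ASCII printable characters|
|128 - 191||Continuation byte (second, third or fourth byte of a multi-byte character)|
|194 - 223||First byte of a 2-byte character|
|224 - 239||First byte of a 3-byte character|
|240 - 244||First byte of a 4-byte character|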
The byte ranges in UTF-8 do not overlap: from its value alone a byte can be recognised as a single-byte character, a first byte or a
continuation byte. This limits the number of unique characters but is still sufficient to encode all Unicode characters in UTF-8.
The bytes 0xFF and 0xFE are not used in UTF-8 to prevent interference with UTF-16.
In fact the bytes 0xC0, 0xC1 and the range 0xF5 through 0xFF are never valid in UTF-8.
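Because the byte ranges do not overlap, the role of a byte can be read from its value alone. The sketch below assumes the standard UTF-8 ranges (continuation bytes 0x80-0xBF, first bytes 0xC2-0xF4); `byte_role` is a hypothetical helper name.

```python
def byte_role(value: int) -> str:
    """Classify one byte value (0-255) by the role it can play in UTF-8."""
    if value <= 0x7F:
        return "single byte (ASCII)"
    if value <= 0xBF:
        return "continuation byte"
    if value in (0xC0, 0xC1) or value >= 0xF5:
        return "invalid"
    if value <= 0xDF:
        return "first byte of a 2-byte character"
    if value <= 0xEF:
        return "first byte of a 3-byte character"
    return "first byte of a 4-byte character"

# The three bytes of the Chinese water character, E6 B0 B4.
for value in b"\xe6\xb0\xb4":
    print(f"{value:02X}", byte_role(value))
```

This classifies single bytes only; it does not check that a first byte is followed by the right number of continuation bytes.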
Characters in UTF-16 are represented by 2 bytes (16 bits), except for characters above U+FFFF, which use a pair of 16-bit units (4 bytes). The first 256 characters are the same as the ISO 8859-1 character set, so a letter z is represented as 00 7A in UTF-16BE. The first 128 characters of ISO 8859-1 (and UTF-16) match the US-ASCII character set as well.
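The 16-bit units of a UTF-16 character can be inspected directly. This is a short sketch with a hypothetical helper, `utf16_units`; note that code points above U+FFFF are encoded as two 16-bit units (a surrogate pair), so they occupy 4 bytes.

```python
def utf16_units(char: str) -> list:
    """Return the 16-bit code units UTF-16 uses for a single character."""
    raw = char.encode("utf-16-be")  # big endian, no BOM
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

print([hex(u) for u in utf16_units("z")])           # ['0x7a']
print([hex(u) for u in utf16_units("\u00e9")])      # ['0xe9']
print([hex(u) for u in utf16_units("\U0001F600")])  # ['0xd83d', '0xde00']
```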
ASCII (American Standard Code for Information Interchange) is the character set used for modern English and forms the basis of the character sets for other Western European languages.
The preferred encoding name is US-ASCII.
ASCII is basically a 7 bit character-set made up of 128 characters. The first 32 characters (0-31) are control characters. These characters were used to operate early printers and include the carriage return (13) and the line feed (10).
Characters 32 through 126 are printable characters (127 is the delete character, DEL, and not printable). The lowercase letters can be calculated by adding 32 to the uppercase character value, i.e. setting the sixth bit in the byte to 1.
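The relation between upper- and lowercase letters can be checked with a little bit arithmetic. In ASCII, A is 65 and a is 97, so the two cases differ by 32 (0x20), the sixth bit counted from the least significant bit. The helper names below are hypothetical and the sketch handles the letters A-Z and a-z only.

```python
def to_lower_ascii(char: str) -> str:
    """Lowercase an ASCII letter by setting the bit with value 32 (0x20)."""
    if "A" <= char <= "Z":
        return chr(ord(char) | 0x20)
    return char

def to_upper_ascii(char: str) -> str:
    """Uppercase an ASCII letter by clearing the bit with value 32 (0x20)."""
    if "a" <= char <= "z":
        return chr(ord(char) & ~0x20)
    return char

print(ord("A"), ord("a"))   # 65 97
print(to_lower_ascii("A"))  # a
print(to_upper_ascii("z"))  # Z
```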
Extended ASCII is an extension of US-ASCII. Extended ASCII uses the 8th bit and describes the characters 128-255. These characters include the language specific characters. Since 128 extra characters are not enough to cover all language specific characters, many Extended ASCII variants are available. These variants are defined by the code-page.
The code-page defines the upper part of the Extended ASCII character-set (characters 128-255). The code-page for the United States is code-page 437, which includes
special characters for the American market. Greek uses code-page 737 to be able to display the Greek characters.
The lower part of the Extended ASCII character-set (characters 0-127) is the same in all code-pages.
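The effect of the code-page can be seen by decoding the same bytes twice. The sketch below assumes Python's built-in `cp437` and `cp737` codecs; the byte values chosen are only an example.

```python
# 0x41 lies in the shared lower half, 0x82 in the code-page specific upper half.
raw = bytes([0x41, 0x82])

for code_page in ["cp437", "cp737"]:
    print(code_page, raw.decode(code_page))
# cp437 prints the letter A followed by e with acute accent (U+00E9),
# cp737 prints the letter A followed by the Greek capital gamma (U+0393).

# The lower half (0-127) decodes identically in both code-pages.
lower_half = bytes(range(128))
print(lower_half.decode("cp437") == lower_half.decode("cp737"))  # True
```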