Encoding

 

Table of contents
1 UTF Encoding
1.1 Byte Order Mark
1.2 XML Encoding
1.3 Character encoding
1.4 UTF-8
1.5 UTF-16
2 ASCII
2.1 Extended ASCII
2.2 Codepage

1 - UTF Encoding

Unicode Transformation Format (UTF) is a family of encodings for Unicode, a character set that describes a large number of characters (Western, Arabic, Chinese, etc.) in a uniform way. Characters can be encoded with a variable number of bytes (UTF-8, 1 to 4 bytes), with 2-byte units (UTF-16, 2 or 4 bytes per character) or with a fixed 4 bytes (UTF-32). Officially the encoding name is written case-insensitively as UTF, a dash and 8, 16 or 32, optionally followed by BE or LE for big endian or little endian (e.g. UTF-16LE or utf-8).
Three encoding indications must agree in an XML file:
  • Byte Order Mark
  • XML encoding
  • Character Encoding
The file encoding tells which encoding is used in the file; it is indicated by the BOM (Byte Order Mark). The XML encoding tells which encoding is used in the XML document and must match the BOM. For some encodings the BOM is not required. The character encoding, i.e. the bytes actually used for the characters, must match the encoding indicated by the BOM or by the XML encoding.

1.1 - Byte Order Mark

The Byte Order Mark (BOM) specifies the type of encoding used in a file. The BOM occupies the first bytes of the file: the first 2 bytes for UTF-16, the first 3 bytes for UTF-8 and the first 4 bytes for UTF-32.
The character used for the BOM is U+FEFF, the Zero-Width No-Break Space, so it is not displayed in editors. Normally a BOM does not show up in the middle of a file, but if it does for some reason, the character is still not displayed.
Bytes         Encoding form
00 00 FE FF   UTF-32, big-endian
FF FE 00 00   UTF-32, little-endian
FE FF         UTF-16, big-endian
FF FE         UTF-16, little-endian
EF BB BF      UTF-8
In some situations the BOM is not required. The BOM for UTF-8 is optional, and if the XML encoding is set to UTF-16LE or UTF-16BE the BOM is not set either.
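
The BOM table above can be turned into a small detection routine. The following is a minimal sketch in Python, not a complete encoding detector; the names `BOMS` and `detect_bom` are chosen for illustration only. It relies on the BOM constants of the standard `codecs` module and checks the 4-byte signatures first, because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).

```python
import codecs

# 4-byte signatures come first: FF FE 00 00 (UTF-32LE) starts with
# FF FE (UTF-16LE), so the order of this list matters.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),   # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE
]


def detect_bom(data: bytes):
    """Return the encoding form signalled by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file without a BOM yields `None`, in which case the encoding has to be determined in another way, for example from the XML declaration.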

1.2 - XML Encoding

In the XML declaration tag the encoding must be specified.
<?xml version="1.0" encoding="UTF-8"?>
The encoding must match the BOM. So if the BOM specifies a UTF-16 file, the encoding must also be set to UTF-16. Browsers are known to give an error message when the BOM and the encoding do not match.

When the encoding is set to UTF-16LE or UTF-16BE the BOM is not set.

Sample file                  Description
UTF-8 with BOM.xml           UTF-8 with BOM
UTF-8 without BOM.xml        UTF-8 without BOM
UTF-16 as LE encoded.xml     UTF-16 with LE BOM
UTF-16 as BE encoded.xml     UTF-16 with BE BOM
UTF-16LE.xml                 UTF-16LE
UTF-16BE.xml                 UTF-16BE
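
The rule that the declared encoding must match the BOM can be checked mechanically. The sketch below is an illustration under stated assumptions, not a real XML parser: the names `BOMS`, `DECL` and `check_xml_encoding` are hypothetical, only UTF-8 and UTF-16 BOMs are handled, and a file that has a BOM but no encoding declaration is treated as a mismatch because this section requires the declaration.

```python
import re

# (BOM bytes, codec used to read the declaration, name expected in it)
BOMS = [
    (b"\xef\xbb\xbf", "utf-8", "UTF-8"),
    (b"\xfe\xff", "utf-16-be", "UTF-16"),
    (b"\xff\xfe", "utf-16-le", "UTF-16"),
]

# Picks the encoding name out of an XML declaration at the start of the text.
DECL = re.compile(r'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def check_xml_encoding(data: bytes) -> bool:
    """Return True when the declared encoding agrees with the BOM, if any."""
    for bom, codec, expected in BOMS:
        if data.startswith(bom):
            text = data[len(bom):].decode(codec, errors="replace")
            match = DECL.match(text)
            # Assumption of this sketch: no declaration counts as a mismatch.
            return match is not None and match.group(1).upper() == expected
    # No BOM: there is nothing the declaration could contradict.
    return True
```

With this sketch, a file that starts with a UTF-16 BOM but declares `encoding="UTF-8"` is reported as inconsistent, which is the situation in which browsers are said to show an error.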

1.3 - Character encoding

Unicode uses 3 different encoding forms and 2 different byte orders. The 3 encoding forms are UTF-8, UTF-16 and UTF-32. Since UTF-16 and UTF-32 use units of 2 or 4 bytes, the byte order must also be specified. A unit can start with its least significant byte or with its most significant byte; these are called little endian and big endian byte order.
The table below shows the code of characters in the different encodings.
Character            UTF-8      UTF-16LE   UTF-16BE   UTF-32LE      UTF-32BE
z                    7A         7A 00      00 7A      7A 00 00 00   00 00 00 7A
é                    C3 A9      E9 00      00 E9      E9 00 00 00   00 00 00 E9
水 (Chinese water)   E6 B0 B4   34 6C      6C 34      34 6C 00 00   00 00 6C 34
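
The byte values in this table can be reproduced with Python's built-in codecs. This is a small sketch for experimenting; the helper name `hex_bytes` is made up for this example. The codec names with an explicit `-le` or `-be` suffix do not write a BOM, so only the bytes of the character itself are shown.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def hex_bytes(char: str, encoding: str) -> str:
    """Encode one character and format its bytes as spaced uppercase hex."""
    return char.encode(encoding).hex(" ").upper()


# The escapes are z, é (U+00E9) and 水 (U+6C34).
for char in "z\u00e9\u6c34":
    print(char, [hex_bytes(char, enc) for enc in ENCODINGS])
```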

Be aware that most Windows editors (e.g. Notepad, UltraEdit) will translate any file to UTF-16 or ASCII when opening it. If you want to see the exact encoding, use a hex editor.

1.4 - UTF-8

Characters in UTF-8 are represented by 1, 2 or 3 bytes (actually up to 4 bytes are possible, but this is not supported by all applications). The first 128 characters are the same as the US-ASCII character set, so the letter z is represented as 7A. Characters with a code of 128 or higher use 2, 3 or 4 bytes. E.g. the é character has code 233 (E9) in Unicode and ISO 8859-1, which is above 127, so it is represented by the 2-byte code C3 A9.
The exact layout of UTF-8 is displayed in the table below:
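Byte range   Description
0 - 31       Single byte, ASCII control characters
32 - 127     Single byte, ASCII printable characters
128 - 191    Second, third or fourth byte of a multibyte character
192 - 223    First byte of a 2-byte character
224 - 239    First byte of a 3-byte character
240 - 244    First byte of a 4-byte character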
The byte ranges used for the first byte and for the following bytes of a character do not overlap. This limits the number of unique characters, but it is still sufficient to place all Unicode characters within UTF-8.
The bytes 0xFF and 0xFE are not used in UTF-8 to prevent interference with UTF-16.
In fact, the bytes 0xC0, 0xC1 and the range 0xF5 through 0xFF are never valid in UTF-8.
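
How a character code is spread over 1 to 4 bytes can be sketched by hand. The function below is an illustrative re-implementation (the name `utf8_encode` is chosen for this example); in practice Python's own `str.encode("utf-8")` does the same job. The first byte carries a prefix that tells how many bytes follow, and every following byte starts with the bits 10, which puts it in the range 128 - 191.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 by hand."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

For example, `utf8_encode(0xE9)` gives the bytes C3 A9 for é, and `utf8_encode(0x6C34)` gives E6 B0 B4 for 水.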

1.5 - UTF-16

Characters in UTF-16 are represented by 2 bytes (16 bits); characters outside the first 65,536 code points use a pair of 2-byte units (4 bytes). The first 256 characters are the same as the ISO 8859-1 character set, so the letter z is represented as 00 7A in UTF-16BE. The first 128 characters of ISO 8859-1 (and UTF-16) match the US-ASCII character set as well.
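
A code above FFFF does not fit in one 16-bit unit, so UTF-16 stores it as two units, a so-called surrogate pair. The following is a minimal sketch of that calculation; the function name `utf16_code_units` is an assumption made for this example, and the sketch does not reject invalid input such as lone surrogate values.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit code units that represent one code point in UTF-16."""
    if code_point < 0x10000:
        # Fits in a single 16-bit unit.
        return [code_point]
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 | (offset >> 10)       # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # lower 10 bits
    return [high, low]
```

Each unit is then written as 2 bytes in big endian or little endian order, so `utf16_code_units(0x7A)` is the single unit 007A, stored as 00 7A in UTF-16BE.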

2 - ASCII

ASCII (American Standard Code for Information Interchange) is the character set that covers modern English and forms the basis of the character sets used for other Western European languages. The preferred encoding name is US-ASCII.
ASCII is basically a 7-bit character set made up of 128 characters. The first 32 characters (0-31) are control characters. These characters were used to operate early printers and include the carriage return (13) and the line feed (10).
Characters 32 through 127 are printable characters (actually 127 is the delete character and not really printable). A lowercase letter can be calculated by adding 32 to the value of the uppercase letter, i.e. by setting the sixth bit of the byte to 1; subtracting 32, or clearing that bit, gives the uppercase letter.
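
In ASCII the uppercase letters A-Z are 65-90 and the lowercase letters a-z are 97-122, so the two cases differ only in the bit with value 32. A short sketch of this bit trick follows; the names `CASE_BIT`, `to_lower` and `to_upper` are made up for illustration, and the functions deliberately leave every value outside the letter ranges untouched.

```python
CASE_BIT = 0x20  # value 32: the sixth bit, counted from the lowest bit


def to_lower(byte: int) -> int:
    """Set the case bit for A-Z; leave every other value unchanged."""
    return byte | CASE_BIT if ord("A") <= byte <= ord("Z") else byte


def to_upper(byte: int) -> int:
    """Clear the case bit for a-z; leave every other value unchanged."""
    return byte & ~CASE_BIT if ord("a") <= byte <= ord("z") else byte
```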

2.1 - Extended ASCII

Extended ASCII is an extension of US-ASCII. It uses the 8th bit and describes the characters 128-255. These include the language-specific characters. Since 128 extra characters are not enough to cover all language-specific characters, many Extended ASCII variants exist. These variants are defined by the codepage.

2.2 - Codepage

The codepage defines the upper part of the Extended ASCII character set (characters 128-255). The codepage for the United States is codepage 437, which includes special characters for the American market. Greek uses codepage 737 to be able to display Greek characters.
The lower part of the Extended ASCII character set (characters 0-127) is the same in all codepages.
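
The effect of the codepage can be seen by decoding the same byte value under two codepages. The sketch below uses Python's built-in `cp437` and `cp737` codecs; the helper name `decode_in_codepages` is hypothetical. A byte below 128 gives the same character in both codepages, while a byte in the upper part gives a different character in each.

```python
def decode_in_codepages(byte_value: int, codepages=("cp437", "cp737")) -> dict:
    """Decode one byte value under each codepage and return the characters."""
    raw = bytes([byte_value])
    return {codepage: raw.decode(codepage) for codepage in codepages}


print(decode_in_codepages(0x41))  # lower part: identical in both codepages
print(decode_in_codepages(0x82))  # upper part: depends on the codepage
```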