Encoding
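1 - UTF Encoding
Unicode Transformation Format (UTF) is a way of describing a large set of characters (including all Western, Arabic, Chinese etc.) in a uniform way. The characters can be encoded using a variable-length multibyte encoding (UTF-8, 1 to 4 bytes per character), a double-byte encoding (UTF-16, 2 bytes for most characters and 4 bytes for the rest) or a four-byte encoding (UTF-32). Officially the encoding name is case insensitive and is spelled UTF, a dash, then 8, 16 or 32, optionally followed by BE or LE for big endian or little endian (e.g. UTF-16LE or utf-8).
For an XML file, three things must match: the Byte Order Mark, the encoding named in the XML declaration and the character encoding actually used for the bytes in the file.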
1.1 - Byte Order Mark
The Byte Order Mark (BOM) specifies the type of encoding used in a file. The BOM is stored in the first bytes of the file: the first 2 bytes for UTF-16 (FE FF for big endian, FF FE for little endian), the first 4 bytes for UTF-32 (00 00 FE FF or FF FE 00 00) and the first 3 bytes for UTF-8 (EF BB BF).
The character used for the BOM is the Zero-Width No-Break Space character (U+FEFF), so it is not displayed in editors. Normally a BOM does not show up in the middle of a file, but if this happens for some reason the character will still not be displayed.
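As an illustration, a BOM can be detected by comparing the first bytes of a file against the known signatures. The sketch below is only a minimal example: the function name detect_bom and the table BOMS are made up for this page, and the BOM constants come from Python's standard codecs module. Note that a UTF-16LE file whose first character is U+0000 would look like UTF-32LE to this check.

```python
import codecs

# Longer signatures first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so the order of this list matters.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))   # UTF-8
print(detect_bom(b"\xff\xfeh\x00i\x00"))  # UTF-16LE
print(detect_bom(b"hello"))               # None
```

In practice only the first 4 bytes of a file need to be read for this check.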
1.2 - XML Encoding
The encoding is specified in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
The encoding must match the BOM. So if the BOM specifies a UTF-16 file, the encoding must also be set to UTF-16. Browsers are known to give an error message when the BOM and the encoding do not match. When the encoding is set to UTF-16LE or UTF-16BE, the byte order is already given by the name and no BOM is written.
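The difference between UTF-16 and the endian-specific names can be seen with Python's built-in codecs. This is only a sketch; the sample document and the variable names are invented for the example. The generic "utf-16" codec writes a BOM in the byte order of the machine it runs on, while "utf-16-le" writes none.

```python
import codecs

xml = '<?xml version="1.0" encoding="UTF-16"?><root/>'

# The generic "utf-16" codec prepends a BOM in the machine's native byte order.
with_bom = xml.encode("utf-16")

# The endian-specific codec writes no BOM; the byte order is fixed by the name,
# so the declaration is changed to UTF-16LE to keep it consistent.
without_bom = xml.replace("UTF-16", "UTF-16LE").encode("utf-16-le")

has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)          # True
print(without_bom[:2])  # b'<\x00'
```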
1.3 - Character encoding
Unicode uses 3 different encoding forms and 2 different byte orders. The 3 encoding forms are UTF-8, UTF-16 and UTF-32. Since UTF-16 and UTF-32 characters are made up of 2-byte or 4-byte units, the byte order must also be specified. The encoding can start with the least significant byte or the most significant byte; these are called little endian and big endian byte orders.
The table below shows the code of characters in the different encodings.
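The byte codes of a character in each encoding can also be printed with a few lines of Python. This is a sketch only: the helper hex_bytes and the three sample characters are chosen for illustration and are not taken from the original table. It needs Python 3.8 or later for the separator argument of bytes.hex.

```python
ENCODINGS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def hex_bytes(ch: str, encoding: str) -> str:
    """Return the bytes of one character in the given encoding as hex text."""
    return ch.encode(encoding).hex(" ").upper()


# z (U+007A), e with acute accent (U+00E9) and the euro sign (U+20AC)
for ch in ["z", "\u00e9", "\u20ac"]:
    row = ", ".join(f"{enc}: {hex_bytes(ch, enc)}" for enc in ENCODINGS)
    print(f"U+{ord(ch):04X}  {row}")
```

The endian-specific codec names are used here so that no BOM is mixed into the output.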
Be aware that most Windows editors (e.g. Notepad, UltraEdit) will translate any file to UTF-16 or ASCII when opening the file. If you want to see the exact encoding, use a hex editor.
1.4 - UTF-8
Characters in UTF-8 are represented by 1, 2 or 3 bytes (actually up to 4 bytes, but this is not supported by all applications). The first 128 characters are the same as the 128 characters of the US-ASCII character set, so the letter z is represented as 7A. Characters from 128 upwards use 2, 3 or 4 bytes. E.g. the é character is ISO 8859-1 character 233 (U+00E9); this is bigger than 127, so it is represented by the 2-byte code C3 A9.
The exact layout of UTF-8 is displayed in the table below:
Code point range      Byte 1    Byte 2    Byte 3    Byte 4
U+0000 - U+007F       0xxxxxxx
U+0080 - U+07FF       110xxxxx  10xxxxxx
U+0800 - U+FFFF       1110xxxx  10xxxxxx  10xxxxxx
U+10000 - U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
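To make the UTF-8 byte layout concrete, the bit patterns can be written out as shifts and masks. The function utf8_encode below is a hypothetical hand-written sketch for illustration; real code should simply call str.encode("utf-8"). It does not reject surrogates (U+D800 to U+DFFF) or values above U+10FFFF, which a full encoder must do.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the UTF-8 bit layout."""
    if code_point < 0x80:      # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(ord("z")).hex(" ").upper())  # 7A
print(utf8_encode(0xE9).hex(" ").upper())      # C3 A9
```

For é (233, binary 11101001) the top two bits go into the first byte (110 00011 = C3) and the lower six bits into the second byte (10 101001 = A9).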
The byte values 0xFF and 0xFE are not used in UTF-8, which prevents confusion with the UTF-16 BOM. In fact the byte values 0xC0, 0xC1 and the range 0xF5 to 0xFF never occur in valid UTF-8.
1.5 - UTF-16
Characters in UTF-16 are represented by 2 bytes (16 bits), except for characters above U+FFFF, which are represented by a pair of 16-bit units (4 bytes). The first 256 characters are the same as the ISO 8859-1 character set, so the letter z is represented as 00 7A in UTF-16BE. The first 128 characters of ISO 8859-1 (and UTF-16) match the US-ASCII character set as well.
2 - ASCII
ASCII (American Standard Code for Information Interchange) is the character set used for modern English; it lacks the accented letters needed for other Western European languages. The preferred encoding name is US-ASCII.
ASCII is basically a 7-bit character set made up of 128 characters. The first 32 characters (0-31) are control characters. These characters were used to operate early printers and include the carriage return (13) and the line feed (10). Characters 32 to 127 are printable characters (actually 127 is the delete character and not really printable). A lowercase character can be calculated by adding 32 to the uppercase character value, that is by setting the sixth bit (value 32) in the byte to 1; clearing that bit gives the uppercase character again.
2.1 - Extended ASCII
Extended ASCII is an extension of US-ASCII. Extended ASCII uses the 8th bit and describes the characters 128-255. These characters include the language-specific characters. Since 128 extra characters are not enough to cover all language-specific characters, a lot of Extended ASCII variants are available. These variants are defined by the codepage.
2.2 - Codepage
The codepage defines the upper part of the Extended ASCII character set (characters 128-255). The codepage for American English is codepage 437, which includes special characters for the American market. Greek uses codepage 737 to be able to display the Greek characters.
The lower part of the Extended ASCII character set (characters 0-127) is the same in all codepages.
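The effect of the codepage can be shown by decoding the same bytes twice. The sketch below assumes Python's standard cp437 and cp737 codecs; the sample bytes and variable names are chosen only for illustration. The byte 7A from the lower half decodes to z in both codepages, while the byte 80 from the upper half gives a different character in each.

```python
raw = bytes([0x7A, 0x80])  # 'z' plus one byte from the upper half

decoded = {}
for codepage in ("cp437", "cp737"):
    decoded[codepage] = raw.decode(codepage)
    print(codepage, decoded[codepage])

# Check that the lower half (0-127) decodes identically in both codepages.
lower_half_same = all(
    bytes([b]).decode("cp437") == bytes([b]).decode("cp737")
    for b in range(128)
)
print(lower_half_same)
```

In codepage 437 the byte 80 is the letter Ç, in codepage 737 it is the Greek capital letter Α (alpha).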
© Christiaan Schaake | Last update January 16 2011