HTML Character Set

To display an HTML page correctly, the browser must know what character set to use.

The early World Wide Web used the ASCII character set. ASCII supports the digits 0-9, the uppercase and lowercase English alphabet, and some special characters.
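
The limits of ASCII can be seen in a minimal Python sketch. It assumes Python 3 and its built-in "ascii" codec; the sample strings are illustrative and not taken from any real page.

```python
# Illustrative sample strings (assumptions, not from the tutorial).
plain = "Hello, World! 0123456789"
accented = "caf\u00e9"  # "café"

# ASCII is a 7-bit code, so every encoded byte is below 128.
ascii_bytes = plain.encode("ascii")
all_below_128 = all(b < 128 for b in ascii_bytes)

# "é" is not part of ASCII, so the ASCII codec rejects it.
try:
    accented.encode("ascii")
    accented_fits_ascii = True
except UnicodeEncodeError:
    accented_fits_ascii = False

print(all_below_128)        # True
print(accented_fits_ascii)  # False
```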

Because many countries use characters that are not part of ASCII, browsers have commonly used ISO-8859-1 as their default character set. The current HTML standard asks authors to use UTF-8 instead.

If a page uses a character set other than ISO-8859-1, it should be specified in the <meta> tag.
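
As a rough sketch of why the declaration matters, the Python snippet below builds a small page whose <meta> tag names the same character set used to encode its bytes, then decodes those bytes both correctly and incorrectly. The helper build_page, the sample text, and the choice of ISO-8859-2 are assumptions made for illustration; the short <meta charset="..."> form is the HTML5 spelling, while HTML 4 pages use <meta http-equiv="Content-Type" content="text/html; charset=...">.

```python
from html import escape


def build_page(body_text: str, charset: str) -> bytes:
    """Hypothetical helper: return an HTML page encoded in `charset`,
    with a <meta> tag declaring that same character set."""
    html_text = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f'<meta charset="{charset}">\n'
        "<title>Example</title>\n"
        "</head>\n<body>\n"
        f"<p>{escape(body_text)}</p>\n"
        "</body>\n</html>\n"
    )
    return html_text.encode(charset)


sample = "\u0141\u00f3d\u017a"  # "Łódź", a Polish city name
page = build_page(sample, "ISO-8859-2")

# Decoding with the declared character set recovers the text.
declared = page.decode("ISO-8859-2")
# Decoding with a different character set garbles the Polish letters.
misread = page.decode("ISO-8859-1")

print(sample in declared)  # True
print(sample in misread)   # False
```

A browser that trusts the <meta> declaration does the equivalent of the first decode; one that falls back to a different default does the equivalent of the second.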

ISO character set

The ISO character sets are standard character sets defined by the International Organization for Standardization (ISO) for different alphabets and languages.

The character sets below are used in different parts of the world:

Character set  Description              Scope
ISO-8859-1     Latin alphabet part 1    North America, Western Europe, Latin America, the Caribbean, Canada, Africa
ISO-8859-2     Latin alphabet part 2    Eastern Europe
ISO-8859-3     Latin alphabet part 3    SE Europe, Esperanto, miscellaneous
ISO-8859-4     Latin alphabet part 4    Scandinavia and the Baltics (and others not covered by ISO-8859-1)
ISO-8859-5     Latin / Cyrillic part 5  Languages that use the Cyrillic alphabet, such as Bulgarian, Belarusian, Russian, and Macedonian
ISO-8859-6     Latin / Arabic part 6    Languages that use the Arabic alphabet
ISO-8859-7     Latin / Greek part 7     Modern Greek, as well as mathematical symbols derived from Greek
ISO-8859-8     Latin / Hebrew part 8    Languages that use the Hebrew alphabet
ISO-8859-9     Latin 5 part 9           Turkish
ISO-8859-10    Latin 6                  Lappish (Sami), Nordic, and Eskimo languages
ISO-8859-15    Latin 9 (aka Latin 0)    Similar to ISO-8859-1, but the euro sign and several other characters replace some less frequently used symbols
ISO-2022-JP    Latin / Japanese part 1  Japanese
ISO-2022-JP-2  Latin / Japanese part 2  Japanese
ISO-2022-KR    Latin / Korean part 1    Korean
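
Because each ISO-8859 set assigns its own meaning to the byte values above 127, the same byte can stand for different characters depending on which set is assumed. The Python sketch below illustrates this with two byte values chosen for the example; it relies only on Python 3's built-in codecs.

```python
# Byte 0xA4 is the generic currency sign in ISO-8859-1,
# but ISO-8859-15 reassigns it to the euro sign.
byte_a4 = b"\xa4"
latin1_char = byte_a4.decode("iso-8859-1")
latin9_char = byte_a4.decode("iso-8859-15")

# Byte 0xE4 is read as a different letter in the Latin,
# Cyrillic, and Greek parts of the standard.
byte_e4 = b"\xe4"
readings = {
    name: byte_e4.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}

print(latin1_char, latin9_char)
for name, char in readings.items():
    print(name, char)
```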

Unicode standard

Because the character sets listed above are limited in size and are not compatible in multilingual environments, the Unicode Consortium developed the Unicode Standard.

The Unicode Standard covers nearly all the characters, punctuation marks, and symbols in the world.

Unicode enables the processing, storage, and interchange of text data regardless of platform, program, or language.

The Unicode Consortium

The Unicode Consortium developed the Unicode Standard. Its goal is to replace the existing character sets with its standard Unicode Transformation Format (UTF).

The Unicode Standard has been a success and is implemented in XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML. It is also supported in many operating systems and all modern browsers.

The Unicode Consortium cooperates with leading standards development organizations such as ISO, W3C, and ECMA.

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16:

UTF-8: A character in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode Standard. UTF-8 is backward compatible with ASCII. UTF-8 is the preferred encoding for web pages and e-mail.
UTF-16: The 16-bit Unicode Transformation Format is a variable-length character encoding that can encode the entire Unicode repertoire. UTF-16 is used mainly in operating systems and environments such as Microsoft Windows 2000 / XP / 2003 / Vista / CE, Java, and the .NET bytecode environments.
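
The variable lengths of both encodings can be checked with a short Python sketch. The four sample characters are chosen for illustration, and the "utf-16-be" codec is used so that no byte order mark is added to the byte counts.

```python
# Illustrative characters: Latin letter, accented letter,
# euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to (UTF-8 byte count, UTF-16 byte count).
lengths = {
    ch: (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
    for ch in samples
}

for ch, (utf8_len, utf16_len) in lengths.items():
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# Backward compatibility: ASCII text has the same bytes in UTF-8.
same_bytes = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_bytes)  # True
```

In this sample, UTF-8 uses 1, 2, 3, and 4 bytes respectively, while UTF-16 uses 2 bytes for the first three characters and 4 bytes (a surrogate pair) for the last.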

Tip: The first 256 characters of Unicode correspond to the 256 characters of ISO-8859-1.
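
This correspondence can be checked with a few lines of Python, assuming Python 3's built-in "iso-8859-1" codec (which also maps the control-character byte ranges one to one):

```python
# Decode every possible byte value as ISO-8859-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# Each byte value should equal the Unicode code point it decodes to.
code_points_match = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

print(code_points_match)  # True
```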

Tip: All HTML 4 processors support UTF-8, and all XHTML and XML processors support both UTF-8 and UTF-16.