Understanding Unicode - HTML

Unicode is a standard developed by The Unicode Consortium for processing the world’s alphabets in a consistent way. The Unicode Consortium is made up of universities, research institutes, companies, and other interested parties dedicated to creating an international standard for the way languages are represented to computers. Unicode consists of a vast number of tables, each of which assigns a numerical reference to every character it contains. For example, the English letter A is represented by the hexadecimal number 0041, which is the decimal number 65. Nearly every character in nearly every written language in the world is represented in Unicode. The table “Alphabets Represented in Unicode” shows the starting and ending hexadecimal numerical references for each encoding.

Alphabets Represented in Unicode

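To see the relationship between a character and its numerical reference for yourself, here is a minimal Python sketch. The helper name describe is made up for this illustration; ord and chr are Python built-ins.

```python
def describe(ch: str) -> str:
    """Return the code point of a single character in hex and decimal."""
    cp = ord(ch)  # ord() gives the Unicode code point as an integer
    return f"U+{cp:04X} (decimal {cp})"


print(describe("A"))  # U+0041 (decimal 65)
print(chr(0x0041))    # A
```

The same number is simply written two ways: 0041 in hexadecimal and 65 in decimal.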

Unicode tables, called code pages here (the Unicode Standard itself calls them blocks), each serve a specific set of languages. Each row in the table above represents a code page. Each code page, in turn, consists of several rows of numerical reference values, one for each character of the alphabet that the code page defines. You could write the following in your HTML using one of the Unicode encodings, such as UTF-8, and a modern browser would render it as “Hello World”:

&#72;&#101;&#108;&#108;&#111; &#87;&#111;&#114;&#108;&#100;
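As a rough illustration of how numeric character references relate to code points, the following Python sketch turns a string into decimal references and back again. The function name to_decimal_references is hypothetical; html.unescape is part of Python’s standard library.

```python
import html


def to_decimal_references(text: str) -> str:
    """Write every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


references = to_decimal_references("Hello World")
print(references)
print(html.unescape(references))  # Hello World
```

This sketch converts the space as well (it becomes &#32;), which is harmless but not required. In practice you would normally convert only the characters you cannot type directly.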

To get a handle on how this works, let’s examine a code page most of us are familiar with, Basic Latin.

Basic Latin (U+0000 - U+007F)
All nations in the Americas, most European nations, most African nations, as well as Australia and New Zealand use the Latin alphabet. In Unicode, the Latin alphabet is broken down into different parts. The most basic is called Basic Latin. Only a few languages can be written entirely with Basic Latin. You generally need to incorporate additional Latin code pages because Basic Latin consists only of the characters between 0 and 7F (hexadecimal), which is 0 to 127 in decimal. When you are using UTF-8 as your encoding, all of these Latin code pages are automatically available, because UTF-8 can represent every Unicode character. In fact, Unicode, and therefore UTF-8, covers most of the world’s written languages.
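If you want to check whether a piece of text stays inside Basic Latin, a sketch along these lines can help. The function names are invented for this example, and the sample strings are arbitrary.

```python
def is_basic_latin(text: str) -> bool:
    """True if every character lies in U+0000 to U+007F."""
    return all(ord(ch) <= 0x7F for ch in text)


def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))


print(is_basic_latin("Hello World"))  # True
print(is_basic_latin("café"))         # False, because of the é
print(utf8_length("A"), utf8_length("é"))  # 1 2
```

Basic Latin characters take one byte each in UTF-8, which is why plain English text looks the same in UTF-8 as it does in ASCII.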

ISO-8859-1
If you are working on Web sites for Western audiences, you will most likely use ISO-8859-1, which, although not officially a subset of UTF-8, does map onto the Basic Latin and Latin-1 Supplement Unicode sets.

This character encoding contains all of the numerical references of the US-ASCII encoding (ISO/IEC 646), which predates Unicode and is not itself a part of the Unicode standard, although Basic Latin reuses the same values. The numeric references above 127, which are often loosely called extended ASCII, are not defined by ASCII at all. As a result, certain numeric references are different in Macintosh-based code pages and Windows-based code pages, which is why you’ll sometimes encounter problems when transferring files between the two platforms. Unicode and ISO-defined encodings are defined by different international bodies.

The Unicode standard, as previously mentioned, is defined by The Unicode Consortium, whereas ISO encodings are defined by the International Organization for Standardization (ISO). In addition, the code pages used by Macintosh and Windows in earlier days, although based on ASCII, had a few small variations that were incompatible with each other. For example, the decimal code point (another word for numerical reference) for the “registered” mark (®) is different in the Macintosh and Windows encodings (168 and 174, respectively); it is 174 in Unicode.
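You can reproduce the “registered” mark numbers with a short Python sketch. It assumes that Python’s mac_roman and cp1252 codecs are reasonable stand-ins for the older Macintosh and Windows code pages mentioned above; byte_value is a hypothetical helper name.

```python
def byte_value(ch: str, encoding: str) -> int:
    """Return the single byte value that `encoding` uses for `ch`."""
    encoded = ch.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{ch!r} is not a single byte in {encoding}")
    return encoded[0]


for encoding in ("mac_roman", "cp1252", "iso-8859-1"):
    print(encoding, byte_value("®", encoding))

print("Unicode code point:", ord("®"))  # 174
```

The Macintosh value comes out as 168, while the Windows, ISO-8859-1, and Unicode values all agree on 174.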

Luckily, the most familiar encoding to Western HTML developers, ISO-8859-1, maps directly onto the first 256 code points of Unicode and can be used safely because most modern browsers now support Unicode. Although ISO-8859-1 is not part of the Unicode standard, the two bodies governing the standards have worked together to keep them aligned and avoid driving everyone crazy.
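The relationship between ISO-8859-1 and Unicode can be checked with a small sketch like the one below. The function names are made up for this example. It also shows why ISO-8859-1 is not literally a subset of UTF-8: the numbers match, but the bytes on disk differ above 127.

```python
def latin1_matches_code_points() -> bool:
    """Check that ISO-8859-1 byte N always decodes to code point N."""
    decoded = bytes(range(256)).decode("iso-8859-1")
    return all(ord(ch) == n for n, ch in enumerate(decoded))


def encoded_bytes(ch: str, encoding: str) -> list:
    """Return the byte values used to store `ch` in `encoding`."""
    return list(ch.encode(encoding))


print(latin1_matches_code_points())      # True
print(encoded_bytes("é", "iso-8859-1"))  # [233]
print(encoded_bytes("é", "utf-8"))       # [195, 169]
```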

The table “ISO-8859-1 HTML Entities” shows the entities you are likely to encounter as an HTML developer. If your encoding is UTF-8, you can use the decimal references, but for compatibility with older browsers you should use HTML entities, because many older browsers don’t support Unicode.

ISO-8859-1 HTML Entities

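If you need to produce entities rather than look them up by hand, something like the following sketch may help. It relies on the codepoint2name table in Python’s standard html.entities module; the function to_entities is a hypothetical helper, and it deliberately leaves plain ASCII (including &, < and >) untouched, so it is not a replacement for proper HTML escaping.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace non-ASCII characters with named entities where one exists,
    falling back to a decimal reference otherwise."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            parts.append(ch)                        # plain ASCII stays as typed
        elif cp in codepoint2name:
            parts.append(f"&{codepoint2name[cp]};")  # e.g. &reg;
        else:
            parts.append(f"&#{cp};")                # no named entity available
    return "".join(parts)


print(to_entities("Café ®"))  # Caf&eacute; &reg;
print(to_entities("ł"))       # &#322;
```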

Latin-1 Supplement (U+0080 - U+00FF)
The Latin-1 Supplement also contains values from ISO-8859-1. The characters in this Unicode block are used for the following languages:

  • Danish
  • Dutch
  • Faroese
  • Finnish
  • Flemish
  • German
  • Icelandic
  • Irish
  • Italian
  • Norwegian
  • Portuguese
  • Spanish
  • Swedish

Besides the accented letters these languages need, it extends Basic Latin with a miscellaneous set of punctuation and mathematical signs.
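To get a feel for what this code page holds, you can ask Python’s standard unicodedata module for the official character names. This is only a sketch: the sample characters are arbitrary, non_letters is a made-up helper, and the range U+00C0 to U+00FF is used because that is where this block’s letters sit.

```python
import unicodedata


def non_letters(start: int, end: int) -> list:
    """List the characters in a code point range that are not letters."""
    return [
        chr(cp)
        for cp in range(start, end + 1)
        if not unicodedata.category(chr(cp)).startswith("L")
    ]


for ch in "ñøß×÷":
    print(f"U+{ord(ch):04X}", ch, unicodedata.name(ch))

# Everything from U+00C0 to U+00FF is a letter except two mathematical signs.
print(non_letters(0x00C0, 0x00FF))  # ['×', '÷']
```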

Latin Extended-A (U+0100 - U+017F)
Once you roam past Latin-1 Supplement in Unicode, you begin to veer away from ISO-8859-1, as well. There are specific ISO encodings for different groups of Latin-alphabet languages. The characters in this block are used for languages such as the following:

  • Afrikaans
  • Basque
  • Breton
  • Catalan
  • Croatian
  • Czech
  • Esperanto
  • Estonian
  • French
  • Frisian
  • Greenlandic
  • Hungarian
  • Latin
  • Latvian
  • Lithuanian
  • Maltese
  • Polish
  • Provencal
  • Rhaeto-Romanic
  • Romanian
  • Romany
  • Sami
  • Slovak
  • Slovenian
  • Sorbian
  • Turkish
  • Welsh
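A quick way to see where ISO-8859-1 runs out is to try encoding a word that needs Latin Extended-A characters. In this sketch the Polish place name and the choice of ISO-8859-2 as the alternative ISO encoding are assumptions made for the example, and fits is a hypothetical helper.

```python
def fits(text: str, encoding: str) -> bool:
    """True if every character of `text` can be stored in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


word = "Łódź"  # Ł (U+0141) and ź (U+017A) are Latin Extended-A characters
for encoding in ("iso-8859-1", "iso-8859-2", "utf-8"):
    print(encoding, fits(word, encoding))
```

The word does not fit in ISO-8859-1, but it does fit in ISO-8859-2 and in UTF-8.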

Latin Extended-B and Latin Extended Additional
The characters in these blocks are used to write additional languages and to extend Latin encodings. They include seldom-used characters such as the click letters found in some African languages, for example the dental click, which looks like this: ǀ (U+01C0). By the time you march into this territory, you should definitely be using UTF-8.

Encoding your pages in UTF-8 may at first seem like a good idea if you are doing a lot of localization, but some problems exist with taking that course. Mixing encodings can cause problems, so if your site is only going to be viewed by users reading Western languages, it’s fine to use ISO-8859-1, because its characters are a subset of those available in UTF-8. The other problem is that many sites in countries such as China don’t use Unicode at all. For example, many Chinese sites use Big 5, which is a Chinese encoding that is separate from Unicode.
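To make the mixed-encoding problem concrete, here is a small sketch of what happens when bytes written in one encoding are read as another. The helper name misread and the sample text are invented for this illustration; the big5 codec ships with Python.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode `text` one way and decode the bytes another way."""
    return text.encode(written_as).decode(read_as)


# A UTF-8 page that is read as ISO-8859-1 shows garbage for accented letters.
print(misread("café", "utf-8", "iso-8859-1"))  # cafÃ©

# The same Chinese character is stored as different bytes in Big 5 and UTF-8.
print(list("中".encode("big5")))
print(list("中".encode("utf-8")))
```

Declaring the encoding you actually used, and sticking to one encoding per site, avoids this kind of corruption.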


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd