How XML Works - HTML

XML doesn’t actually do anything on its own. It’s just a way to mark up text-based data. More specifically, it’s a methodology for describing how a structured document should handle sequences of characters.

Getting started with XML parsers
Before you begin creating XML documents, it’s a good idea to find a parser, which is software that can read an XML document. There are two kinds of parsers: validating and nonvalidating. A validating parser reads an XML document and checks whether it follows the rules of a DTD (document type definition). A nonvalidating parser doesn’t care about validation, and only checks an XML document to be sure that the syntax is correct. A document whose syntax is correct is called a well-formed document. The obvious examples of widely distributed nonvalidating parsers are Internet Explorer and Netscape 7.0, or any of the new Mozilla-based browsers.
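As a minimal sketch of what a nonvalidating parser does, the following Python snippet uses the standard library module xml.etree.ElementTree, which checks syntax only and does not validate against a DTD. The function name is_well_formed and the two sample documents are illustrative assumptions, not part of any standard API.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML (no DTD validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents used only for illustration.
good = "<note><to>Tove</to><from>Jani</from></note>"
bad = "<note><to>Tove</from></note>"  # closing tag does not match opening tag

print(is_well_formed(good))
print(is_well_formed(bad))
```

A validating parser would go one step further and compare the element structure against a DTD; this sketch stops at the syntax check.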

To open an XML document in Internet Explorer 5 or later, or in Netscape/Mozilla, choose Open from the File menu. Both browsers will display the XML in a tree-based format.

Begin with a prolog
There are no pre-existing elements in XML. Most basic XML documents start with a prolog, which includes a declaration that identifies the document as an XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>

The declaration must come first, before anything else, and its characters must be the first the parser encounters (no white space before the opening <?). A prolog can also include a processing instruction. A processing instruction (PI) tells the parser to pass the data it contains to another application. For example, if a prolog has a processing instruction naming a style sheet, the following PI would tell the processor to pass the named file to software that can handle the style sheet processing:

<?xml-stylesheet type="text/xsl" href="note.xsl"?>

You’ll learn more about style sheets later in the chapter, in the section named Style Sheets for XML: XSL, but PIs are not limited to style sheet processing. They can pass all kinds of information to processors. The trick is whether the XML parser is actually capable of doing so. No rule exists to say that it must. Generally, when there is a lot of action with PIs, vendors create extensions to parsers or bundle them into larger XML processing components so that the processing is hidden. Microsoft’s XML parser, MSXML, for example, contains a processing component for style sheets.
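To make the prolog rules concrete, here is a hedged sketch using Python’s standard xml.dom.minidom module, which exposes processing instructions as nodes with a target and a data string. The helper names processing_instructions and parses_cleanly, and the note content, are assumptions made for this example. Note that minidom only reports the PI; it does not run the style sheet.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Sample document: declaration, style sheet PI, then a single element.
doc_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<?xml-stylesheet type="text/xsl" href="note.xsl"?>'
    "<note>Remember the milk</note>"
).encode("iso-8859-1")


def processing_instructions(xml_bytes: bytes) -> list:
    """Return (target, data) pairs for PIs found at the top level of the document."""
    dom = minidom.parseString(xml_bytes)
    return [
        (node.target, node.data)
        for node in dom.childNodes
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    ]


def parses_cleanly(xml_bytes: bytes) -> bool:
    """Return True if the parser accepts the document."""
    try:
        minidom.parseString(xml_bytes)
        return True
    except ExpatError:
        return False


print(processing_instructions(doc_bytes))
# A single space before the declaration is enough to make the parser reject it.
print(parses_cleanly(b" " + doc_bytes))
```

The XML declaration itself is not reported as a processing instruction, even though it uses the same <? ... ?> delimiters.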

Understanding encoding
Did you notice the encoding attribute in the prolog example (encoding="ISO-8859-1")? That actually isn’t an attribute; it just looks like one, but it’s an important part of the XML declaration. Encoding, in fact, is vital to truly understanding XML. XML requires all XML parsers to handle the encodings named UTF-8 and UTF-16. An encoding is a mapping between characters and the numbers your computer stores.

UTF-8 is a comprehensive encoding that covers most languages of the world. It is one of several encodings of Unicode, a character set that assigns a number to each character. UTF-16, which XML also supports, encodes the same characters but differs from UTF-8 in the number of bytes used to store each one.
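The difference in byte counts can be seen with a short Python sketch. The helper name byte_lengths and the three sample characters are arbitrary choices for illustration; "utf-16-be" is used so that no byte order mark is added to the count.

```python
def byte_lengths(ch: str) -> tuple:
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in ["A", "é", "中"]:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} byte(s) in UTF-16")
```

The character number stays the same in both cases; only the stored byte sequence changes.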

In Unicode-based encodings, for example, the capital letter A is represented by the hexadecimal number U+0041. The small letter a is represented by the hexadecimal number U+0061. Every letter and numeric character in every alphabet in the world (almost) has such a number assigned to it.

Note: The U+ in the preceding examples is not part of the hexadecimal number; it is a prefix added to show that the number belongs to Unicode. In a Web page or XML document, you would use the character references &#x0041; and &#x0061; instead of U+0041 and U+0061.
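As a quick check of these numbers, the sketch below asks Python for the code points of A and a, then lets a parser resolve numeric character references of the form &#x...; (hexadecimal) and &#...; (decimal). The element name letters is a made-up example.

```python
import xml.etree.ElementTree as ET

# Code points of the two letters, printed in the U+ notation.
code_points = [f"U+{ord(ch):04X}" for ch in "Aa"]
print(code_points)

# Hexadecimal and decimal references to the same two characters.
hex_refs = ET.fromstring("<letters>&#x0041;&#x0061;</letters>")
dec_refs = ET.fromstring("<letters>&#65;&#97;</letters>")
print(hex_refs.text, dec_refs.text)
```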

Each human language uses a subset of the vast Unicode character set. Western European languages, for example, are often stored in the ISO-8859-1 encoding, whose 256 characters are assigned the same numbers as the first 256 Unicode characters. (The stored bytes still differ from UTF-8 for characters numbered above 127, which UTF-8 writes as two bytes.) This is important to XML development because XML is concerned with how sequences of these mapped numerical references are structured within an XML document.

Your encodings need to be consistent to successfully parse XML. If you use Windows-specific encodings, for example, you’ll need to be absolutely sure that everything that handles the data also uses the same Windows encoding. This is because Windows uses a different set of tables for mapping characters to numbers than Unicode does. The Windows encoding for Latin-based languages, for example, is called Windows Code Page 1252 (sets of encodings are also called code pages).

This code page, also referred to as ANSI (from the American National Standards Institute), doesn’t line up with Unicode the way ISO-8859-1 does. Luckily, most characters happen to map out to the same numerical references in both encoding sets, but not all do. For example, the ™ character used for trademarks is stored as hexadecimal 99 in ANSI, but that value is a control character in ISO-8859-1, which has no trademark character at all. In Unicode, ™ is U+2122.
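The mismatch can be demonstrated with Python’s built-in codecs, where "cp1252" is Windows Code Page 1252 and "iso-8859-1" is ISO-8859-1. This is a small sketch; the helper name can_encode is an assumption for the example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


raw = b"\x99"  # one byte, interpreted two different ways
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(f"cp1252:     U+{ord(as_cp1252):04X}")  # the trademark sign
print(f"iso-8859-1: U+{ord(as_latin1):04X}")  # a control character
print(can_encode("™", "cp1252"), can_encode("™", "iso-8859-1"))

# Plain letters agree in both encodings.
print(b"A".decode("cp1252") == b"A".decode("iso-8859-1"))
```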

Things get even more difficult when you’re dealing with Chinese text, because two well-established encoding mechanisms are in use for Chinese languages. For example, in Taiwan, an encoding named Big 5 is used. Its mappings are quite different from those of UTF-8. Even though an XML-compliant parser must be able to parse UTF-8 documents, there’s nothing forcing developers to use UTF-8, and most Chinese-based Web sites don’t use it. This is a critical distinction to be aware of when working internationally. A sale element in a Big 5 document (assuming a Chinese translation, of course) is not a sale element in UTF-8, because the bytes that make up an element name depend on the encoding.
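The point about element names can be sketched with Python’s "big5" codec. The two-character name below is only a stand-in for a translated element name, chosen for illustration; the same name produces different bytes in each encoding, and reading Big 5 bytes as UTF-8 fails or produces replacement characters.

```python
name = "中文"  # hypothetical element name, for illustration only

big5_bytes = name.encode("big5")
utf8_bytes = name.encode("utf-8")
print(big5_bytes.hex(), utf8_bytes.hex())

# Reading the Big 5 bytes with the wrong encoding raises an error...
try:
    big5_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False

# ...or, in lenient mode, yields U+FFFD replacement characters.
garbled = big5_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the name.
recovered = big5_bytes.decode("big5")
print(misread_ok, recovered == name)
```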

This may seem like an awfully long explanation about something so arcane, but it is a virtual guarantee that at some point in your XML work you’ll encounter a square character or question mark in output generated from XML. It usually takes people hours or days to figure out the source of these character “anomalies.” You have the advantage of knowing they occur because of encoding problems. When a system doesn’t recognize a character, it generally emits a square character (a border with empty space), a solid black square, or a question mark.

This is invariably related to an encoding issue. Make sure encodings between output and input within your XML environment are consistent, and you should avoid these kinds of problems. You can’t force a billion people to change their encodings to UTF-8, but you can develop a system in your own environment to handle Big 5 encodings, which is a lot easier to do anyway.
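The following sketch shows both failure modes with Python’s standard library, assuming a hypothetical note document containing one accented character. When the declared encoding matches the stored bytes, the text survives; when the declaration is wrong, the parser rejects the document; and when output is forced into an encoding that lacks a character, a question mark appears.

```python
import xml.etree.ElementTree as ET

body = "<note>café</note>"

# Bytes stored as ISO-8859-1 and labeled correctly.
consistent = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")
# Same bytes, but the declaration claims UTF-8.
mislabeled = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("iso-8859-1")

text = ET.fromstring(consistent).text
print(text)

try:
    ET.fromstring(mislabeled)
    mislabeled_parsed = True
except ET.ParseError:
    mislabeled_parsed = False
print(mislabeled_parsed)

# Forcing characters into an encoding that lacks them yields question marks.
lossy = "café ™".encode("ascii", errors="replace")
print(lossy)
```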


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd
