Well-structured XML - HTML

XML does not have many syntax rules, but the ones it has are very strict, and parsers that don’t follow them to the letter are not considered genuine XML parsers.

Including a root element
The most basic rule is that a conforming, or well-formed, XML document (in other words, a legitimate one) must consist of exactly one root element. Therefore, a document as small as a single element, such as <greeting>Hello</greeting>, is a conforming XML document.


You don’t even have to include the XML declaration, but it’s good form to do so, and it’s a much better way to maintain the integrity of your documents if versions should change. If you do include a declaration and you declare your XML document as an XML 1.0 document, your syntax must adhere to that version. Declarations are not required because the creators of the XML specification knew that some existing SGML and HTML documents qualified as XML documents or could easily be made into XML, and didn’t want those documents to fail because they had no XML declaration.
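
The single-root rule and the optional declaration can be checked with a short script. This is a minimal sketch that assumes Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the sample documents are illustrative and do not come from the original text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One root element and no XML declaration: accepted.
single_root = "<note>Hello</note>"
# The same document with an XML 1.0 declaration: also accepted.
with_declaration = '<?xml version="1.0"?><note>Hello</note>'
# Two top-level elements: rejected, because there is no single root.
two_roots = "<note>Hello</note><note>Again</note>"

for sample in (single_root, with_declaration, two_roots):
    print(is_well_formed(sample), sample)
```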

Properly nesting XML documents
In HTML, you can get away with some improper nesting, such as that shown here:

<b><i>Most browsers will render this</b></i>

In XML, all elements must be properly nested within each other, like this:

<b><i>All XML parsers will parse this</i></b>

In addition, the root element must contain all the other elements. In other words, there must be one “master” element, within which all the other properly nested elements are contained. An XML parser would reject the following:

<html><b><i>All XML parsers will parse this</i></b>

This is because there is no closing tag for the html element. You can fix this by simply adding one, as in the following example:

<html><b><i>All XML parsers will parse this</i></b> </html>
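
To see how a strict parser treats these cases, the following sketch feeds nesting samples like the ones above to Python's standard xml.etree.ElementTree parser. The check helper and the dictionary labels are hypothetical names chosen for illustration, and the exact error messages depend on the underlying parser.

```python
import xml.etree.ElementTree as ET

samples = {
    "overlapping tags": "<b><i>Most browsers will render this</b></i>",
    "properly nested": "<b><i>All XML parsers will parse this</i></b>",
    "unclosed root": "<html><b><i>text</i></b>",
    "closed root": "<html><b><i>text</i></b> </html>",
}

def check(text):
    """Return None if the text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

results = {label: check(text) for label, text in samples.items()}
for label, error in results.items():
    status = "ok" if error is None else f"rejected ({error})"
    print(f"{label}: {status}")
```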

An element that has no content must still be closed. There are two ways to do this. You can use the kind of closing tag you’re used to seeing in HTML, as in the following:

<img src="my.gif"></img>

Or, you can simply include a closing slash within the element tag, as follows:

<img src="my.gif" />

Note the extra space between the end of the attribute value and the closing forward slash. Although the space isn’t necessary in XML, if you’re creating HTML documents with XML syntax (known as XHTML), you’ll need it, or older browsers won’t render empty elements such as line breaks correctly.
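
The sketch below suggests that all of these spellings of an empty element are equivalent to an XML parser, while an element that is never closed is rejected. It assumes Python's standard xml.etree.ElementTree module; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

forms = [
    '<img src="my.gif"></img>',  # explicit closing tag
    '<img src="my.gif"/>',       # closing slash, no space
    '<img src="my.gif" />',      # closing slash with the XHTML-style space
]

parsed = [ET.fromstring(form) for form in forms]
# Each form yields the same tag, the same attributes, no text, and no children.
summaries = [(el.tag, el.attrib, el.text, len(el)) for el in parsed]
print(summaries[0] == summaries[1] == summaries[2])

# An element that is opened but never closed is not well-formed.
try:
    ET.fromstring('<img src="my.gif">')
    unclosed_ok = True
except ET.ParseError:
    unclosed_ok = False
print(unclosed_ok)
```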

Maintaining case sensitivity in XML
XML elements and attributes are case sensitive. Therefore, <data type="bad"/> is different from <data TYPE="bad"/> and <DATA type="bad"/>.
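
A quick way to observe case sensitivity is to parse the three variants and compare names. This is a sketch using Python's standard xml.etree.ElementTree module; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

lower = ET.fromstring('<data type="bad"/>')
upper_attr = ET.fromstring('<data TYPE="bad"/>')
upper_tag = ET.fromstring('<DATA type="bad"/>')

print(lower.get("type"))           # the attribute exists under this exact name
print(upper_attr.get("type"))      # None, because "TYPE" is a different name
print(lower.tag == upper_tag.tag)  # False, "data" and "DATA" differ

# Start and end tags must also match in case.
try:
    ET.fromstring("<data></DATA>")
    mismatched_case_ok = True
except ET.ParseError:
    mismatched_case_ok = False
print(mismatched_case_ok)
```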

Using quotes in attribute values
In HTML, you can also get away with not including quotes around attribute values. For example, you can write the following and a browser will render the element correctly:

<td colspan=2>some data</td>

In XML, all attribute values must be quoted:

<td colspan="2">some data</td>
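
The quoting rule can be tried out directly. The sketch below assumes Python's standard xml.etree.ElementTree parser; the parses helper is a hypothetical name. It also shows that XML accepts single quotes as well as double quotes.

```python
import xml.etree.ElementTree as ET

def parses(text: str) -> bool:
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

unquoted = "<td colspan=2>some data</td>"
double_quoted = '<td colspan="2">some data</td>'
single_quoted = "<td colspan='2'>some data</td>"

print(parses(unquoted))       # unquoted values are rejected
print(parses(double_quoted))  # double quotes are accepted
print(parses(single_quoted))  # single quotes are accepted too
```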

Handling line breaks and white space in XML documents
Windows applications store line breaks as pairs of carriage return, line feed (CR LF) characters, which correspond to the character references &#x000D; and &#x000A; in XML (hexadecimal format). In UNIX applications, a line break is usually stored as a single LF character. Older Macintosh applications (before Mac OS X) use a single CR character to store a line break. This is an important distinction when working on large Web sites under source control, because source control software often has a translation option for handling cross-platform line-break differences when merging files between development and production environments.
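
As a rough illustration of these platform differences, the sketch below converts the three line-break conventions to a single LF form and then confirms that hexadecimal character references for CR and LF resolve to those code points. It assumes Python and its standard xml.etree.ElementTree module; normalize_newlines is a hypothetical helper, not part of any source control tool.

```python
import xml.etree.ElementTree as ET

def normalize_newlines(text: str) -> str:
    """Convert CR LF pairs and lone CR characters to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

windows = "line one\r\nline two"  # CR LF
unix = "line one\nline two"       # LF
old_mac = "line one\rline two"    # CR

all_equal = normalize_newlines(windows) == normalize_newlines(old_mac) == unix
print(all_equal)

# Character references for CR (hex 0D) and LF (hex 0A) inside an XML document.
element = ET.fromstring("<a>&#x000D;&#x000A;</a>")
code_points = [hex(ord(ch)) for ch in element.text]
print(code_points)
```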

Using predefined entities and entity or character references
Several entity references must be used to “escape” XML markup characters to prevent an XML parser from interpreting markup characters as XML when that is not your intent. These are called predefined entities.

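Using predefined entities and entity or character references

Character             Markup Equivalent    Unicode Value
< (less than)         &lt;                 &#60;
> (greater than)      &gt;                 &#62;
& (ampersand)         &amp;                &#38;
' (apostrophe)        &apos;               &#39;
" (quotation mark)    &quot;               &#34;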

Referring to the table, instead of writing markup like this,

<body>This is a left angle bracket: < </body>

you need to escape the < character in the preceding code, like this:

<body>This is a left angle bracket: &lt; </body>

The same steps are necessary for the other predefined entities listed. Instead of using the markup shown under the heading “Markup Equivalent,” you can also use the Unicode values shown in the next column.
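
When markup is generated by a program, the escaping is usually delegated to a library rather than done by hand. The following sketch assumes Python's standard xml.sax.saxutils.escape function, which replaces &, <, and > by default and accepts an optional dictionary for additional characters such as quotes; the sample strings are illustrative.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "This is a left angle bracket: < "
escaped = escape(raw)  # replaces &, < and > with predefined entity references
print(escaped)

# The parser resolves the entity reference back to the original character.
body = ET.fromstring(f"<body>{escaped}</body>")
print(body.text == raw)

# Quotes are only escaped when an extra mapping is supplied.
attr_safe = escape('say "hi" & \'bye\'', {'"': "&quot;", "'": "&apos;"})
print(attr_safe)
```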

Every character in any language you use (with a few rare exceptions involving comparatively obscure languages) can be represented by character references, which are Unicode values mapped to characters (a process described earlier). For example, if you wanted to write out the word “Foo,” you could write it like this:

&#70;&#111;&#111;
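
Character references like these can be generated and decoded programmatically. This is a small sketch in Python, where to_char_refs is a hypothetical helper and the standard xml.etree.ElementTree parser is used to confirm that the references decode back to the original word.

```python
import xml.etree.ElementTree as ET

def to_char_refs(text: str) -> str:
    """Encode every character as a decimal character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)

decimal_refs = to_char_refs("Foo")
# The same code points written as hexadecimal character references.
hex_refs = "".join(f"&#x{ord(ch):X};" for ch in "Foo")
print(decimal_refs)
print(hex_refs)

# Both spellings decode to the same text when parsed.
decoded_decimal = ET.fromstring(f"<word>{decimal_refs}</word>").text
decoded_hex = ET.fromstring(f"<word>{hex_refs}</word>").text
print(decoded_decimal, decoded_hex)
```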


Managing white space in XML
The default behavior of XML is to preserve white space. In HTML, the default behavior is to collapse white space. This means that within a p element in HTML, the following,

Hello
there

would look like this in a browser:

Hello there

However, in XML, the original line break is preserved. XML also includes a special attribute called xml:space that can be used within any element. You can use it to state explicitly that white space within that element should be preserved, as follows:

xml:space="preserve"

The values available for this attribute are default and preserve. Because default simply requests the application’s default white-space handling, you’ll rarely need to specify that value explicitly.
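
The difference between preserving and collapsing white space can be sketched as follows. The example assumes Python's standard xml.etree.ElementTree parser, which reports attributes with the xml: prefix under the standard XML namespace URI; the collapse helper is a hypothetical stand-in for what an HTML renderer does, not a browser API.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the reserved "xml" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

doc = ET.fromstring('<p xml:space="preserve">Hello\n   there</p>')

# The parser hands the text over with its line break and spaces intact.
print(repr(doc.text))
# The xml:space attribute is available under its namespace-qualified name.
space_mode = doc.get(f"{{{XML_NS}}}space")
print(space_mode)

def collapse(text: str) -> str:
    """Mimic HTML rendering by collapsing runs of white space to one space."""
    return " ".join(text.split())

print(collapse(doc.text))
```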

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd
