Document Type Definitions - HTML

As previously mentioned, an XML document that follows the syntax rules of XML is called a well-formed document. A well-formed document may or may not also be valid. A document is valid if it validates against a Document Type Definition (DTD). A DTD is a list of rules describing how an XML document should be structured. For example, should every contact element be required to contain a phone element?
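As an illustration, a contact fragment with a phone element might look like the following sketch. The child element names match the declarations used later in this section; the values are placeholders.

```xml
<contact>
  <name>Jane Doe</name>
  <address>123 Main Street, Anytown</address>
  <phone>555-0100</phone>
</contact>
```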

A contact element that contains name, address, and phone children is a well-formed document as it stands. However, you may wish to define rules that more clearly delineate the purpose of each element and its position within the structure of the document as a whole.

A DTD can exist either outside the XML document that validates against it or within that same document. If the DTD exists outside of the document, you must declare it within the XML document so that the XML parser knows you’re referring to an external DTD, like this:

<!DOCTYPE root SYSTEM "filename">

In the case of the contact XML, the DOCTYPE declaration would name contact as the root element and point to an external file such as contact.dtd. It appears after the XML declaration and before the root element.
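A minimal sketch of a contact document with an external DTD reference follows. It assumes the root element is named contact and that contact.dtd sits in the same directory as the XML file; the element values are placeholders.

```xml
<?xml version="1.0"?>
<!-- The DOCTYPE names the root element and the external DTD file -->
<!DOCTYPE contact SYSTEM "contact.dtd">
<contact>
  <name>Jane Doe</name>
  <address>123 Main Street, Anytown</address>
  <phone>555-0100</phone>
</contact>
```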

You need to create a separate DTD file named contact.dtd when you declare such an external DTD, and the XML document must adhere to that DTD. You can also declare the DOCTYPE and define its rules within the XML document itself, placing the declarations between square brackets inside the DOCTYPE declaration.
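A sketch of a contact document with an inline DTD might look like this. It assumes a contact root element whose name, address, and phone children each hold plain text; the values are placeholders.

```xml
<?xml version="1.0"?>
<!DOCTYPE contact [
  <!-- The DTD: everything between the square brackets -->
  <!ELEMENT contact (name, address, phone)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
]>
<contact>
  <name>Jane Doe</name>
  <address>123 Main Street, Anytown</address>
  <phone>555-0100</phone>
</contact>
```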

In an inline DTD, the declarations between the square brackets of the DOCTYPE declaration make up the DTD. To create an external DTD from an inline one, take the following steps:

  1. Create an inline DTD first.
  2. Cut the declarations (everything between the square brackets) out of the XML document and paste them into a new text file.
  3. Name it contact.dtd (or whatever your DTD’s name really is).

Of course, creating the DTD within the validating XML document is not a necessary first step. You can create the file separately from the beginning. But doing it within the XML document makes it easy to test if you’re using an XML-enabled browser such as IE5 and later or Mozilla (and Netscape 7.xxx). When you can load the file into a browser without any errors, you can then move the DTD declarations into a separate file and call it contact.dtd (or some other name), then refer to it in the XML document as previously shown, with the name of your root element in place of root:

<!DOCTYPE root SYSTEM "contact.dtd">
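After the split, the external file holds only the declarations, with no DOCTYPE wrapper and no square brackets. A sketch of what contact.dtd could contain, assuming the same contact structure used in this section's declarations:

```xml
<!-- contact.dtd: external DTD (assumed contents, for illustration) -->
<!ELEMENT contact (name, address, phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
```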

The structure of DTDs and XML documents is defined using the following core components of XML:

  • Elements
  • Attributes
  • Entities
  • PCDATA
  • CDATA

Each of these is described in the sections that follow.

Using elements in DTDs
Elements are the main data-containing components of XML. They are used to structure a document. You’ve seen them in HTML, and the core principles are the same in XML. An element can contain data, or it can be empty. An empty element often carries one or more attributes, but that isn’t a requirement. The HTML br and img elements are good examples of empty elements.

XML elements are declared with an element declaration using the following syntax:

<!ELEMENT name datatype>

The first part of the declaration (!ELEMENT) says that you are defining an element. The next part (name) is where you declare the name of your element. The next part (datatype) declares the type of data that an element can contain. An element can contain the following types of data when defined by DTDs:

  • EMPTY data, which means there is no data within the element.
  • PCDATA, or parsed character data.
  • One or more child elements: There is always a root element and, if the XML document defined by the DTD is to contain additional elements, the DTD must name those elements in the declaration of their parent element, starting with the root element’s declaration.

Using element declaration syntax for empty elements
Empty elements are declared by using the keyword EMPTY:

<!ELEMENT name EMPTY>

For example, to declare an empty br element, you would write the following:

<!ELEMENT br EMPTY>

In an XML document, this element would appear as <br/>.

Using element declaration syntax for elements with PCDATA
Elements that don’t contain any other elements and only contain character data are declared with the keyword #PCDATA inside parentheses, like this:

<!ELEMENT name (#PCDATA)>

A typical example of such an element follows:

<!ELEMENT note (#PCDATA)>

An XML parser might then encounter an actual note element that looks like this:

<note>This note is to warn you that not all DTDs are good DTDs. There are bad DTDs. DTD design is more an art than a science.</note>

You can see there are no elements within the note element, just text (character data).

Using element declaration syntax for elements with child elements
Elements can contain sequences of one or more children, and are defined with the names of the child elements inside parentheses:

<!ELEMENT name (child_name)>

If there is more than one child element, you separate the names with commas:

<!ELEMENT name (child_name, child_name2)>

An example, using the code you saw earlier for the contact document, might look like this:

<!ELEMENT contact (name, address, phone)>
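Each child named in a declaration needs a declaration of its own, and a child can in turn contain children. The following sketch shows this nesting; the first and last elements inside name are hypothetical and are introduced only for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE contact [
  <!-- contact holds three children, in this order -->
  <!ELEMENT contact (name, address, phone)>
  <!-- name is itself a container; first and last are hypothetical -->
  <!ELEMENT name (first, last)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT last (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
]>
<contact>
  <name>
    <first>Jane</first>
    <last>Doe</last>
  </name>
  <address>123 Main Street, Anytown</address>
  <phone>555-0100</phone>
</contact>
```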

Declaring the number of occurrences for elements
You can also declare how often an element can appear within another element by using an occurrence operator in your element declaration. The plus sign (+) indicates that an element must occur one or more times within an element. Therefore, if you create the following declaration, the phone element must appear at least once within the contact element:

<!ELEMENT contact (phone+)>

You can also declare that a group of elements must appear one or more times:

<!ELEMENT contact (name, address, phone)+>
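A sketch of a document that satisfies this group declaration by repeating the whole name, address, phone sequence twice. The child declarations and all values are assumptions added for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE contact [
  <!-- The entire group must appear one or more times -->
  <!ELEMENT contact (name, address, phone)+>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
]>
<contact>
  <!-- first occurrence of the group -->
  <name>Jane Doe</name>
  <address>123 Main Street, Anytown</address>
  <phone>555-0100</phone>
  <!-- second occurrence of the group -->
  <name>John Roe</name>
  <address>456 Oak Avenue, Anytown</address>
  <phone>555-0199</phone>
</contact>
```

A partial group, such as a trailing name without its address and phone, would not validate, because the operator applies to the sequence as a unit.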

To declare that an element can appear zero or more times (in other words, it’s an optional element), use an asterisk instead of a plus sign, as in the following:

<!ELEMENT contact (phone*)>

If you want to limit an element to zero or one occurrence, use a question mark (?) operator instead:

<!ELEMENT contact (phone?)>

For example, a contact element that contains two phone elements would not be valid when the declaration uses the ? operator for the phone element.
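A sketch of a document that fails validation for this reason, with placeholder phone numbers:

```xml
<?xml version="1.0"?>
<!DOCTYPE contact [
  <!ELEMENT contact (phone?)>
  <!ELEMENT phone (#PCDATA)>
]>
<!-- Not valid: phone? allows at most one phone element -->
<contact>
  <phone>555-0100</phone>
  <phone>555-0199</phone>
</contact>
```

The document is still well formed; it is only the second phone element that violates the DTD.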

You can also use a pipe operator (|) to indicate that one element or another element can be contained within an element:

<!ELEMENT contact (name,address,phone,(email | fax))>

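In the preceding declaration, the name, address, and phone elements must all appear in the order shown, followed by either an email element or a fax element. A contact element with name, address, phone, and email children in that order is therefore valid, as is one that ends with fax instead of email.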
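A sketch of a contact element that satisfies this declaration, using email as the final child (all values are placeholders):

```xml
<contact>
  <name>Jane Doe</name>
  <address>123 Main Street, Anytown</address>
  <phone>555-0100</phone>
  <!-- exactly one of email or fax closes the sequence -->
  <email>jane@example.com</email>
</contact>
```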

However, XML that breaks this pattern, for example by omitting both email and fax or by placing the elements in a different order, would not be valid against the same DTD.
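One possible invalid case, sketched with placeholder values, is a contact element that stops after phone:

```xml
<!-- Not valid against (name,address,phone,(email | fax)):
     the required email-or-fax choice is missing -->
<contact>
  <name>Jane Doe</name>
  <address>123 Main Street, Anytown</address>
  <phone>555-0100</phone>
</contact>
```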

As a test of what you’ve seen so far, look at the listing titled “A Nonvalidating XML Document” and see if you can determine why it won’t validate. What could you do to make it work?

A Nonvalidating XML Document
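A sketch of a document of this kind, with an inline DTD and placeholder values. Treat it as an illustration of the exercise rather than the exact listing.

```xml
<?xml version="1.0"?>
<!DOCTYPE contact [
  <!ELEMENT contact (name, address, phone, (email | fax))>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
]>
<contact>
  <name>Jane Doe</name>
  <address>123 Main Street, Anytown</address>
  <phone>555-0100</phone>
  <email>jane@example.com</email>
  <fax>555-0123</fax>
</contact>
```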

If you try to parse the listing using a validating parser, you’ll get an error. The reason is that the document contains both a fax and an email element, but the DTD calls for either an email or a fax element, not both. To fix the document, you need to remove either the fax or the email element.

Using attributes in DTDs
Attributes define the properties of an element. For example, in HTML, the img element has an src property, or attribute, that describes where an image can be found. When deciding whether something should be an element or attribute, ask yourself if the potential attribute is a property that helps describe the element in some way. Attributes shouldn’t contain data with line breaks unless you’re okay with each break being replaced with a single space, because an XML parser normalizes line breaks in attribute values to spaces.
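Attributes are declared in a DTD with an ATTLIST declaration, which names the element, the attribute, its type, and its default behavior. The following is a brief sketch; the img element and its src and alt attributes are borrowed from HTML purely for illustration, and the file name is a placeholder.

```xml
<?xml version="1.0"?>
<!DOCTYPE img [
  <!ELEMENT img EMPTY>
  <!-- src must be present; alt may be omitted; both hold character data -->
  <!ATTLIST img
    src CDATA #REQUIRED
    alt CDATA #IMPLIED>
]>
<img src="logo.png" alt="Company logo"/>
```

Leaving out src would make the document invalid, while leaving out alt would not.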

Using entities in DTDs
Entities are used to store frequently used or referenced character data. You’ve already seen some of XML’s predefined entities. You can also create your own, but you must declare them in your DTD. You can’t, for example, simply use &nbsp; in an XML document. You must first declare it by defining what it means so that the XML parser knows about it. When an XML parser encounters an entity reference, it expands the reference to the declared replacement text. For example, the parser recognizes &nbsp; as a nonbreaking space only if you have defined it as such.
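A sketch of how such declarations might look in an inline DTD. The nbsp entity is mapped to the nonbreaking space character by its character reference; the company entity and its replacement text are hypothetical examples of frequently used text.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!-- Define nbsp as the nonbreaking space character (U+00A0) -->
  <!ENTITY nbsp "&#160;">
  <!-- A general entity for frequently used text (hypothetical name) -->
  <!ENTITY company "Example Corp">
]>
<note>Contact&nbsp;us at &company;.</note>
```

Without the two ENTITY declarations, a parser would report the references to nbsp and company as errors.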

Using PCDATA and CDATA in DTDs
PCDATA is parsed character data, which means that the character data is parsed as XML: any start or end tags are recognized, and entity references are expanded. Elements contain PCDATA. CDATA is character data in which tags are not recognized. Attributes do not contain PCDATA; they contain CDATA. Note that entity references in attribute values are still expanded; only inside a CDATA section (<![CDATA[ ... ]]>) are both tags and entity references left unexpanded.
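A sketch contrasting the two: element content declared as #PCDATA, an attribute declared with the CDATA type, and, as a related construct, a CDATA section whose contents the parser passes through untouched. The note element and author attribute are used only for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- Element content is parsed character data -->
  <!ELEMENT note (#PCDATA)>
  <!-- The attribute value is of type CDATA -->
  <!ATTLIST note author CDATA #IMPLIED>
]>
<note author="Jane Doe">
  Parsed text: the reference &amp; is expanded to an ampersand.
  <![CDATA[Unparsed text: <b> and &amp; stay exactly as typed.]]>
</note>
```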


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd