25 August 2009

Validation as a Lifestyle Choice

Quick Overview of XML Validation


A well-formed XML document follows the syntax rules for an XML document. ("Document" strictly in the XML sense of that term!)

An XML document is valid "if it has an associated document type declaration and if the document complies with the constraints expressed in it."

A non-validating XML parser can tell if an XML document (still that strict, restricted XML sense of document) is well formed, but cannot validate. A validating XML parser can validate an XML document against a specific Document Type Definition (DTD).
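The well-formedness half of that distinction can be sketched with Python's standard library (a minimal illustration only; the stdlib parser is non-validating, so a real toolchain would need a validating parser for the DTD half):

```python
import xml.etree.ElementTree as ET

well_formed = "<topic id='t1'><title>Install</title></topic>"
not_well_formed = "<topic id='t1'><title>Install</topic>"  # <title> never closed

def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(well_formed))      # True
print(is_well_formed(not_well_formed))  # False
```

Note that the first document passes even though the parser has no idea whether `topic` and `title` are arranged as any DTD requires; that is exactly the gap validation fills.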

In practice, validation is somewhat more complicated, due to the various kinds of schemas. Schemas are XML documents (DTDs are not XML, and have their own syntax rules) that describe the constraints for a particular XML vocabulary. DITA has an associated set of schemas, equivalent[1] to the DTDs, but the DTDs continue to be considered canonical for DITA.

Validation is an effective way of catching structural mistakes in DITA content. Valid content might still not be correct in terms of a writing objective, but invalid content certainly isn't correct.

Do You Really Need to Bother?


While it is certainly possible to set up a content management system for DITA authoring to accept invalid XML content, or even XML content that isn't well-formed, this is not a good idea.

There are three main reasons for doing validation with your CMS whenever content is submitted to it. ("checked in", "released", etc. Terminology will depend on the specific CMS you're using; I mean the case where you tell the CMS to "store this as the next version number of this object".)

Firstly, at some point you must process your XML content into a delivery format. No matter what format that is, at that point it will become important that your XML content is valid for relatively inescapable technical reasons.[2] Invalid content means either no output at all or output that you can't safely ship.

Secondly, people are neither good XML validators nor good managers of large groups of references to unique identifiers. Computers, on the other hand, are excellent XML validators and don't forget things. Manual maintenance of a valid body of XML content by human beings is a hard, horrible, and unending job, which is precisely the kind of thing that makes DITA authoring projects fail. Without automated validation support for your XML content, you don't get improved productivity and quality, you get decreased productivity and a writing team looking for other work.

Thirdly, validation in the XML editor, a potential substitute for validation in the CMS, isn't enough to support production use. The editor doesn't know what is stored in the CMS, and can therefore only check for validity, not whether referenced content actually exists.

Validation Concerns in Practice


Division of labour and staged production means, in part, that there are a bunch of levels and kinds of validation you'll need to be concerned with for your DITA CMS.

The XML Editor


If you are authoring your content in DITA, you should be using an XML editor.[3] That editor should not only validate your content, it should provide visual and textual indication of any problems.

It's important that whatever editor you choose can be pointed at specific DTDs or schemas; built-in DITA DTDs are no good to you if you've specialized something, because validation against the stock DITA definitions stops being applicable to your content.

At the level of the topic, the most basic reason for validation is that it catches a lot of stupid mistakes—tags with no matching close tag, forgetting to specify the number of columns in a table via the cols attribute of <tgroup/>, use of a <tm/> element in a <navtitle/> where it is not permitted, mis-typing an entity name so that you've introduced an undefined entity, etc. At a less basic level, validation reinforces the use of standard patterns in the organization of the content. This is where the good error messages from the XML editor's validation function are so important; making XML authoring a frustrating exercise in getting your content past an inscrutable and apparently arbitrary guardian function will not lead to productivity increases.
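The table mistake mentioned above looks like this (an intentionally invalid fragment; the XML is perfectly well-formed, so only a validating editor will flag it):

```xml
<table>
  <tgroup> <!-- invalid: the required cols attribute is missing, e.g. cols="2" -->
    <tbody>
      <row><entry>cell one</entry><entry>cell two</entry></row>
    </tbody>
  </tgroup>
</table>
```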

Managing References


Any DITA authoring environment involves a lot of references:
  • maps assemble topics into deliverables through references
  • topics reference other topics, images, and external resources such as web sites
  • maps may contain secondary content-ordering structures, such as <reltable/>, used to provide alternative information traversals beyond the map's main table-of-contents ordering
  • topics may reference part of their content through the DITA conref mechanism, XML entities, or from the map via a per-map variable mechanism.
One consequence of managing references is that you need a specific tool for managing maps, preferably one which does not compel direct entry of unique identifiers. (The more reliably unique an identifier is, the less reliably a human being is going to type it.)

The primary consequence of all these references is that simple XML validation is necessary, but not sufficient.

Consider a topic referencing another topic with <xref/>, for example. The xref element will be valid with no attributes or contents, an empty href attribute, or an href attribute that references XML content that does not exist. It's most unlikely that you want anything other than the case where the referenced content actually exists.
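The existence check that plain validation can't do might be sketched like this (a sketch under assumptions: the attribute name follows DITA's href, and exists_in_cms is a stand-in for whatever lookup your real CMS provides):

```python
import xml.etree.ElementTree as ET

def find_broken_refs(topic_xml, exists_in_cms):
    """Return href values that are empty, missing, or not found in the CMS.

    exists_in_cms is a stand-in for a real CMS lookup function.
    """
    broken = []
    root = ET.fromstring(topic_xml)
    for elem in root.iter("xref"):
        href = elem.get("href")
        if not href or not exists_in_cms(href):
            broken.append(href)
    return broken

# A toy "CMS" holding two objects.
store = {"install.dita", "overview.dita"}
topic = """<topic id='t1'><body>
  <p><xref href='install.dita'>Install</xref></p>
  <p><xref href='missing.dita'>Gone</xref></p>
  <p><xref/></p>
</body></topic>"""

print(find_broken_refs(topic, store.__contains__))  # ['missing.dita', None]
```

Every one of those xref elements is valid XML; only the first one is any use to a reader.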

Similarly, consider the case where you have a properly defined XML parsed general entity being used to substitute XML content for a trademarked name. (This is one of the ways to get a single, central list of trademarked names where changes to the central list are automatically reflected in the content.)

Normalization turns the entity &OurProduct; into <tm trademark="OurProduct" tmtype="reg">Our Splendid Product</tm>. However, XML validation in the editor checks solely whether the referenced entity is properly defined; it does no normalization, so it cannot check that the normalized form is still valid. (Since the <tm/> cannot be put just anywhere in a DITA document and the entity can, this is a real issue.)
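The expansion step itself is easy to see with Python's expat-based parser, which expands internal general entities as it parses (a sketch; the entity and element names are taken from the example above):

```python
import xml.etree.ElementTree as ET

# The internal DTD subset defines the entity; even a non-validating
# parser expands internal general entities during parsing.
doc = """<!DOCTYPE p [
  <!ENTITY OurProduct '<tm trademark="OurProduct" tmtype="reg">Our Splendid Product</tm>'>
]>
<p>&OurProduct;</p>"""

root = ET.fromstring(doc)
tm = root.find("tm")
print(ET.tostring(root, encoding="unicode"))
# After expansion the <tm> element is real markup inside <p>; whether
# it is *valid* at that spot is a separate, DTD-level question that
# this parser never asks.
```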

Similar issues arise with the DITA conref facility, where the conref attribute is valid so long as it is syntactically correct in XML terms; there's no way for XML validation to test that the reference conforms to DITA syntax or that the object referenced by the conref attribute is available.

You want to make sure that your CMS has some way to detect and flag content that references other content that doesn't exist, rather than accepting what is otherwise a completely valid XML object.

Validation in the Content Management System


When a content object is submitted to the CMS, a sequence of events should take place:
  1. XML normalization (expands or resolves any entities)
  2. check the reference values of
    1. conrefs
    2. topics
    3. images
    for existence in the CMS.
  3. perform XML validation of the normalized version
If any of the existence checks for references fail, or if the normalized version of the XML is no longer valid, the CMS should not accept the object and should give an error message instead. This is an area in which you want[4] informative error messages.
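The sequence above can be sketched as a small pipeline (a sketch under assumptions: the CheckinError message text, the http-skipping rule, and the exists_in_cms lookup are all stand-ins for your CMS's real behaviour, and step 3 is stubbed because the standard library has no DTD validator):

```python
import xml.etree.ElementTree as ET

class CheckinError(Exception):
    """Raised when an object must be rejected, with an informative message."""

def checkin(xml_text, exists_in_cms):
    # 1. Normalization: parsing expands internal entities; re-serializing
    #    gives the normalized form to validate and store.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        raise CheckinError(f"not well-formed: {err}")

    # 2. Reference existence checks (conrefs, topic and image hrefs).
    for elem in root.iter():
        for attr in ("conref", "href"):
            target = elem.get(attr)
            if target is None or target.startswith("http"):
                continue  # external web references are checked elsewhere
            if not exists_in_cms(target):
                raise CheckinError(f"<{elem.tag}> {attr}='{target}' not found in CMS")

    # 3. XML validation of the normalized version would go here, via a
    #    validating parser pointed at your DTDs or schemas.
    return ET.tostring(root, encoding="unicode")

store = {"shared/warnings.dita"}
ok = "<topic id='t2'><p conref='shared/warnings.dita'/></topic>"
bad = "<topic id='t3'><p conref='shared/nope.dita'/></topic>"
print(checkin(ok, store.__contains__))   # accepted: normalized form returned
```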

Business Rules with Schematron


Schematron is an unusual schema language, in that it does not require you to specify all the rules under which the document is constructed. Instead, it provides a way to make and test assertions about the content of an XML document.

So if you want to make business rules for your DITA content, such as "ordered lists must have at least two list items" or "every topic must have a <navtitle/> with text content", secondary validation with an appropriate Schematron schema is the way to do it.
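Those two rules can be written directly as assertions (a sketch in ISO Schematron; the ol/li element names follow DITA, and the titlealts/navtitle path assumes DITA's usual topic structure):

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- Business rule: ordered lists must have at least two items. -->
    <rule context="ol">
      <assert test="count(li) &gt;= 2">An ordered list needs at least two list items.</assert>
    </rule>
    <!-- Business rule: every topic needs a navtitle with text content. -->
    <rule context="topic">
      <assert test="titlealts/navtitle[normalize-space()]">Every topic must have a navtitle with text content.</assert>
    </rule>
  </pattern>
</schema>
```

Each assert fires only when its test fails, and the message text is yours to write, which is what makes the error messages so much friendlier than a generic validator's.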

This approach is enormously simpler, less expensive, easier to change, less exposed to changes in the DITA specification, and faster to implement than undertaking DITA specialization to reflect your business rules. It is also faster and more reliable than having editors check tagging style visually. Because of the way Schematron is structured, it's also easy to add rules incrementally; you can start with a very short list (no empty topic title elements) and add the ones you discover you need later.


[1] Equivalent, but not equal; DTDs and schemas do not specify everything in exactly the same way. So, for instance, a DTD can specify 0 or 1, 1 or more, or 0 to many of something; a schema can specify specific numeric ranges using minOccurs and maxOccurs attributes. This is one of the reasons that there has to be a defined canonical specification when a DTD and schema implementation both exist.

[2] The intermediate XML stages of generating output certainly don't need to be valid, but the first stage, the one that resolves all the entities and supplies the class attributes, must. Otherwise the validating parser, conformant to the XML specification, will hit the first error and stop.

[3] XML documents are plain text documents, and you can edit them with any text editor, even Notepad. But it's very unlikely your sins are so great that you deserve to be editing XML in a text editor, and it's even more unlikely that your writing team will achieve the full expected productivity gains with a text editor. I highly recommend taking a look at oXygen for XML editing purposes.

[4] for values of want that closely correspond to intense need; you don't want a writing team collectively trying to figure out why the editor says it's valid but the CMS won't take it without at least a strong hint.
