23 August 2009

Using Semantic Tagging

Semantic Tagging Isn't Formatting Instructions


Years of having to worry about formatting can make it particularly difficult to view semantic tagging as a label for the kind of meaning the content of the element should have, rather than "it comes out bold".

Let's consider a (relatively simple) example of the DITA <fig/> (figure) element.

<fig>
<title>Example Placement Diagram</title>
<desc>The Example Placement Diagram shows a common arrangement of component parts.</desc>
<image href="diagram.png"/>
</fig>

This XML content can be rendered in a bunch of different ways.

In HTML output, we might have:

<div class="figure">
<img src="diagram.png" alt="The Example Placement Diagram shows a
common arrangement of component parts." />
<p class="caption">Example Placement Diagram</p>
</div>

In PDF output, we might have parts of the figure element appearing in different places, as part of the List of Figures at the front of the document:


and as part of the main flow of content within the body of the PDF document:


Note that while the text child of the <desc/> element winds up variously as the contents of an alt attribute, in front of the title of the figure in the main text, and after the title of the figure in the List of Figures, its semantic purpose—to be a more informative description than the title—is being respected in all cases.
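As a rough illustration of the same <desc/> landing in different places, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function names fig_to_html and fig_to_list_entry are made up for this example, and the "title: description" layout for the List of Figures entry is an assumption; a real DITA output pipeline would do this in its own stylesheets.

```python
import xml.etree.ElementTree as ET

# The <fig> element from the example above, with the <image/> element closed.
FIG_SOURCE = """<fig>
<title>Example Placement Diagram</title>
<desc>The Example Placement Diagram shows a common arrangement of component parts.</desc>
<image href="diagram.png"/>
</fig>"""


def fig_to_html(fig):
    """HTML rendering: <desc> becomes the alt text, <title> becomes the caption."""
    div = ET.Element("div", {"class": "figure"})
    ET.SubElement(div, "img", {
        "src": fig.find("image").get("href"),
        "alt": fig.findtext("desc", default="").strip(),
    })
    caption = ET.SubElement(div, "p", {"class": "caption"})
    caption.text = fig.findtext("title", default="").strip()
    return ET.tostring(div, encoding="unicode")


def fig_to_list_entry(fig):
    """List of Figures rendering: the description follows the title.
    The "title: description" layout is an assumption for illustration."""
    title = fig.findtext("title", default="").strip()
    desc = fig.findtext("desc", default="").strip()
    return f"{title}: {desc}"


fig = ET.fromstring(FIG_SOURCE)
print(fig_to_html(fig))
print(fig_to_list_entry(fig))
```

Neither function cares how the description would look; each one only relies on <desc/> being a description and <title/> being a title.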

It's important to remember that what you're doing is arranging the words to match the semantic purpose in both directions: out towards the expected audience, and in towards the definitions for the XML elements that provide you with the tag set.

This is especially important in the cases where the output processing doesn't use the entire XML tree of the topic, or where the output processing renders the delivered document in some order other than the XML document order of the DITA source.

Examples of not using the whole XML tree include assembling a quick "cheat sheet" style of instruction using only the <cmd/> elements from the <step/> elements of a task topic, or building an overview page by hierarchically assembling the topic titles and short descriptions from all the topics in a map.
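To make the "cheat sheet" case concrete, here is a small Python sketch that keeps only the <cmd/> text from each step and ignores everything else in the task. The sample task and the cheat_sheet function are invented for this example; it is a sketch of the idea, not how any particular DITA toolkit does it.

```python
import xml.etree.ElementTree as ET

# Hypothetical task topic, invented for illustration.
TASK_SOURCE = """<task id="replace-filter">
<title>Replacing the filter</title>
<taskbody>
<steps>
<step><cmd>Switch off the pump.</cmd><info>Wait for the pressure gauge to read zero.</info></step>
<step><cmd>Unscrew the filter housing.</cmd></step>
<step><cmd>Insert the new filter.</cmd><info>The arrow on the filter points toward the outlet.</info></step>
</steps>
</taskbody>
</task>"""


def cheat_sheet(task):
    """Return one numbered line per step, built from the <cmd> text only."""
    lines = []
    for number, cmd in enumerate(task.iterfind(".//step/cmd"), start=1):
        # itertext() also collects text inside inline markup within <cmd>;
        # split/join normalizes whitespace.
        text = " ".join("".join(cmd.itertext()).split())
        lines.append(f"{number}. {text}")
    return lines


for line in cheat_sheet(ET.fromstring(TASK_SOURCE)):
    print(line)
```

The output only makes sense if every <cmd/> reads as a complete instruction on its own. A command that leans on its <info/> to be understood will fall apart in this output type.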

Examples of rendering the delivered document in some order other than the XML document order of the DITA source include rendering the topics of a map into multiple sequences of topics to satisfy a list of scenarios the map is expected to meet in scenario-based authoring ("the content referenced by this map allows these personae to perform this list of scenarios", with one output per persona/scenario pair), or, more simply, having a house style that prefers rendering task topics so that the contents of the <context/> element render before the contents of the <prereq/> element.
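The simpler house-style case can be sketched in a few lines of Python. The HOUSE_ORDER list and the render_order function are hypothetical names for this example; the point is only that the source stays in DITA document order while the output order is decided at processing time.

```python
import xml.etree.ElementTree as ET

# Hypothetical task body; <prereq> comes before <context> in the source.
TASKBODY_SOURCE = """<taskbody>
<prereq>Have a replacement filter to hand.</prereq>
<context>The filter needs replacing every six months.</context>
<steps><step><cmd>Switch off the pump.</cmd></step></steps>
</taskbody>"""

# Assumed house style: context first, then prerequisites, then the steps.
HOUSE_ORDER = ["context", "prereq", "steps"]


def render_order(taskbody):
    """Return the children of <taskbody> in house order.

    Elements not named in HOUSE_ORDER sort after the named ones and keep
    their document order, because sorted() is stable.
    """
    def rank(child):
        if child.tag in HOUSE_ORDER:
            return HOUSE_ORDER.index(child.tag)
        return len(HOUSE_ORDER)

    return sorted(taskbody, key=rank)


for element in render_order(ET.fromstring(TASKBODY_SOURCE)):
    print(element.tag)
```

A <context/> written with the expectation that the reader has just read the <prereq/> will read oddly once the order is swapped, which is why each element needs to carry its own meaning.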

Output Types


DITA supports multi-channel publishing; a single source XML content representation can be processed into multiple types of output. The most common types of output are HTML and PDF, but other types are possible, and even within the broad categories of HTML and PDF, it's quite likely that you will have multiple different output types for each. So an "F1 Help" HTML output type may co-exist with a "User Guide" HTML output type, or a "white paper" PDF output type may co-exist with a "technical documentation" PDF output type.

This is a significant system advantage for a DITA content management system, but the advantage comes at the cost of a significant writing challenge. It is necessary to think of the job as getting the semantic representation correct, because you, as an individual writer, cannot be certain what type of output processing will be used on the topics or maps you produce. So while you know which output type (the HTML user guide, the white paper PDF, and so on) will be used to produce the shipped document from your current work assignment, you don't know what else will happen to that content, either in parallel or in the future. Perhaps the HTML user guides will be combined and processed into a PDF to be presented to a potential customer for your software by someone in your company's customer relations department; perhaps some of the concept topics from the white paper will wind up in the engineering documentation, introducing the subject before the task topics and reference topics with quantified values specific to a particular shipping product.

This mix of different and unknown uses for the content makes it imperative not to allow the semantic tagging to collapse into a sort of awkward attempt at a formatter. If that collapse happens, your content ends up customized to a particular output type's processing at a particular point in time, and sooner or later some unexpected but suitably generic output processing won't work on it.

Getting the delivered document is inherently a two-stage process with DITA, and keeping the formatting step separate from the semantic tagging step is vital for maintaining the writing advantages—arbitrary re-arrangement, no content edits required to change the type of delivered documents—of topic-based authoring in DITA.

Agreeing on Meaning


DITA is a general XML vocabulary for technical authoring, with little inherent structural constraint. This is in part because DITA also supports specialization, but specialization does not solve the problems of semantic tagging. Specialization is a way to reflect consistent and frequently recurring patterns in your content, so that you might for instance wish to specialize the <prereq/> element of a task into <tools/> and <precautions/>. This is very helpful if you want every task (or every task for a particular audience) to include content about the required tools and the necessary precautions, but it does not allow you to agree on the local value of the semantic meaning of an element. It might, at best, reduce the scope of the argument about what that local semantic meaning should be.

Consider the <info/> element, to which the specification assigns the semantics: "The information element (<info>) occurs inside a <step> element to provide additional information about the step."

You can put a lot of block level elements in an information element; paragraphs, lists, simple tables, full tables, figures, and objects are all permissible. There are also a large number of legal inline markup elements, such as filepath and menucascade. This means that you can sensibly use <info/> as a text container, like a specialized paragraph, or you can use it as a small section, which contains a sequence of block level elements such as paragraphs and lists. You can't sensibly use a single info block as both of those things at the same time, though; aside from output processing issues, you're putting the same kind of content at different levels in the XML tree, which is bad semantic tagging. Since you can put as many information elements as you like in a single task step, you also need to consider if you want to use one or many <info/> elements.
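If your team settles on "an <info/> is either a text container or a small section, never both", that rule can be checked mechanically. What follows is a minimal Python sketch of such a check; the mixes_levels function is a made-up name, and BLOCK_TAGS is a deliberately partial, assumed list of block level element names, not the full set the specification allows.

```python
import xml.etree.ElementTree as ET

# Partial, assumed list of block level elements that may appear in <info>.
BLOCK_TAGS = {"p", "ul", "ol", "simpletable", "table", "fig", "object", "note"}


def mixes_levels(info):
    """True when an <info> holds block level children alongside loose text
    or inline markup, that is, when it is used as a paragraph and as a
    small section at the same time."""
    has_block = any(child.tag in BLOCK_TAGS for child in info)
    has_inline = any(child.tag not in BLOCK_TAGS for child in info)
    # Loose text sits directly in <info>: before the first child (info.text)
    # or after any child (that child's tail).
    loose = [info.text] + [child.tail for child in info]
    has_text = any(t and t.strip() for t in loose)
    return has_block and (has_text or has_inline)


as_paragraph = ET.fromstring(
    "<info>Wait for <filepath>a.log</filepath> to appear.</info>")
as_section = ET.fromstring(
    "<info><p>Check the gauge.</p><ul><li>Zero is safe.</li></ul></info>")
as_both = ET.fromstring(
    "<info>Check the gauge.<ul><li>Zero is safe.</li></ul></info>")

for sample in (as_paragraph, as_section, as_both):
    print(mixes_levels(sample))
```

A check like this only enforces whichever decision the team has made; it can't make the decision for you.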

There is no single right answer for this; it depends on the kind of information you want a task to impart, your overall style decisions about how information is to be presented, and, probably, on how your output processing works. (Information elements that contain note elements with marginal icons of danger symbols, for example, may require you to go one-info-element-per-note to keep the icons from landing on top of each other.)

Someone on the writing team has to own the process of agreement on semantic meaning, and be able to make decisions, break ties, and otherwise ensure that there is a single definitive style guide for semantic tagging of content.
