Investigating the Infoset
by Leigh Dodds
August 02, 2000
What is the XML Infoset specification? What purpose does it serve? These are some
of the questions that have been discussed on XML-DEV this week. The XML-Deviant
was there to record the answers.
The Infoset
The latest draft of the XML Information Set ("Infoset") specification was published this week,
updating the previous draft of December 1999. The Infoset is one of those specifications
that is frequently mentioned but rarely discussed in detail. Paul Abrahams, no
doubt voicing the thoughts of many other developers, asked
what the Infoset is for:
What is the purpose of the XML Infoset? Is it mainly
intended to enlighten implementors about what the abstract
structure of an XML document is, or does it have some other
less obvious uses?
The resulting discussion provided a useful primer for developers
interested in learning more about the Infoset specification.
Jonathan Borden described the Infoset as an abstract model of the data in an XML document:
XML is a serialization of a logical document structure defined by the XML
Infoset.
Martin Gudgin echoed this view, saying that
the abstract model separates
applications from the syntax:
To me the Infoset defines what XML is in the abstract. XML 1.0 + namespaces
is just one possible serialization syntax. I expect there will be others in
time. Likewise SAX and DOM are two possible reflections of the Infoset.
Maybe other APIs will be developed over time. The Infoset, being abstract,
shields me from the details of the serialization syntax which to me is a big
win. If I find ( or write ) a parser that supports a binary form of XML but
still conforms to the Infoset I don't need to change any of my application
code but I can get all the benefits ( probably size and speed ) of the new
serialization syntax.
The Infoset then is a data model that describes the important properties
of a well-formed XML document. The model describes the results of
parsing an XML document, and it is this model that is manipulated by
XML APIs. This view puts the XML data model first, and the syntax
second.
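Gudgin's remark that SAX and DOM are two reflections of the same model can be tried out with a small sketch. The example below uses Python's standard-library xml.sax and xml.dom.minidom modules; the sample document and the helper names (NameCollector, element_names_via_sax, element_names_via_dom) are made up for illustration, and neither API claims to expose the Infoset draft exactly. It shows only that an event-based and a tree-based API report the same elements in the same order.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, used only for this illustration.
DOC = '<order id="42"><item>widget</item><note/></order>'


class NameCollector(xml.sax.ContentHandler):
    """Records element names in the order SAX reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def element_names_via_sax(text):
    """Element names as seen through the event-based (SAX) view."""
    handler = NameCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def element_names_via_dom(text):
    """Element names as seen through the tree-based (DOM) view."""
    names = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            names.append(node.tagName)
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(text).documentElement)
    return names


if __name__ == "__main__":
    print(element_names_via_sax(DOC))
    print(element_names_via_dom(DOC))
```

Application code written against either view sees the same elements, which is the sense in which the model, rather than the API or the syntax, comes first.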
Summarizing responses from
several contributors, Paul Abrahams asked a further series of questions:
... doesn't the XML spec itself define well-formedness satisfactorily?...
... Viewed as an elegant description of the information contained in an XML
document, the Infoset make sense. But unlike the other XML specs, its
normative effect is unclear. If I'm implementing an XML-related processor of
any variety, what does the Infoset require me to do that I would not have to do
if the Infoset never existed?
Michael Champion offered an explanation of how the Infoset refines
the definition of well-formedness given in the XML specification:
[The Infoset] answers questions that are irrelevant when XML is viewed as a syntax, but
quite important to users of the DOM, XPath, XSL, etc. that operate on some
representation of a more abstract parsed XML document. For example, the XML
spec says that "<empty></empty>" and "<empty/>" are both well formed XML
elements, but nothing about whether they are equivalent. Infoset says ...
that they are.
Champion also provided an example of the type of question that the Infoset answers for
application designers:
So, one fairly practical normative question it *does* answer would be: 'My
application would like to treat "<empty></empty>" as signifying "data will
the value NULL" and "<empty/>" as signifying "no data". Can I do this in a
environment where the XML will be processed by various tools that implement
the XML specs but that I do not control?' The answer, for better or worse,
is NO - an XML processor is under no obligation to preserve this
distinction. That answer comes from the Infoset ... not the XML spec, the
DOM, XSLT, etc.
The Infoset is therefore a normalized data model that irons out variations in syntax, providing
a foundation upon which XML applications and processors can be built.
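Champion's example is easy to check against a real parser. The sketch below uses Python's standard-library xml.etree.ElementTree, which is only one parser among many and does not claim to implement the Infoset draft. The summarize helper is a hypothetical name introduced here. It reduces a parsed element to the properties an application can observe, so the two spellings of the empty element can be compared.

```python
import xml.etree.ElementTree as ET


def summarize(text):
    """Reduce a parsed element to what an application can observe:
    its name, attributes, character data, and child element names."""
    elem = ET.fromstring(text)
    return (elem.tag, dict(elem.attrib), elem.text, [child.tag for child in elem])


with_end_tag = summarize("<empty></empty>")
empty_element_tag = summarize("<empty/>")

if __name__ == "__main__":
    # Both forms produce the same summary, so code working on the
    # parsed result cannot tell which spelling the author used.
    print(with_end_tag == empty_element_tag)
```

An application that wants to distinguish "NULL" from "no data" would therefore need an explicit marker, such as an attribute, rather than relying on which form of the tag was written.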
Syntax versus Model
In some ways, the Infoset poses a chicken-and-egg problem. If the data model is more important
than the syntax, then why (and how) was XML specified before its data model was defined?
Michael Champion admitted that the lack of
a data model made the DOM Level One specification harder to produce:
The lack of an Infoset certainly made it much harder to invent the Level 1
DOM; it simply was not clear (and was highly contentious) whether expanded
entity references remained in the XML document tree or not... and how mixed
content would be represented in the tree.
Jonathan Borden believed that specifying
the DOM was only possible because of prior work on SGML:
True, the DOM spec was written prior to the
Infoset spec, but I think that the only reason this was possible is because
of all the work on groves and property sets that had already been done for
SGML, so the people who devised the DOM already had a pretty good idea of
what the Infoset would look like.
Tim Bray disputed the relative
importance of the XML model over its syntax, claiming that standardized syntax is how interoperability
is really achieved:
XML took a lot of static in its early days because it
was "just syntax" - there are certainly a lot of people who want to think
only in terms of object models (groves, DOMs, whatever) and see the syntax
as disposable fluff. Me, I think syntax is crucial. Because describing
data structures in a straightforward, interoperable way is really hard to
get right and very often fails. At the end of the day, if you really want
to interoperate, you have to describe the bits on the wire. That's what
XML does.
Think of it another way... a promise like "my implementation of SQL
(or posix, or DOM, or XLib) will interoperate with yours" is really
hard to keep. A promise like "I'll ship you well-formed XML docs
containing only the following tags and attributes" is remarkably,
dramatically, repeatably more plausible in the real world.
This is a debate that recurs often when attempting to define markup languages: do
you begin with a model and then define a syntax, or build a model that describes
the syntax? It's not a debate that is likely to be resolved anytime soon, if at all. The
important point is that you cannot focus on one aspect--model or syntax--to the exclusion
of the other. The Infoset is therefore an important step in the further development of
the XML family of specifications.
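Bray's point is that a promise about syntax can be checked mechanically by the receiver. As a rough sketch of such a check, the following uses Python's standard-library xml.etree.ElementTree. The ALLOWED vocabulary and the violations function are hypothetical names invented for this example. A real exchange would more likely rely on a DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical agreed vocabulary: element name -> permitted attribute names.
ALLOWED = {
    "order": {"id"},
    "item": {"sku", "qty"},
}


def violations(text):
    """Return a list of ways the document breaks the agreed promise."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    problems = []
    for elem in root.iter():
        if elem.tag not in ALLOWED:
            problems.append(f"unexpected element <{elem.tag}>")
            continue
        for attr in elem.attrib:
            if attr not in ALLOWED[elem.tag]:
                problems.append(f"unexpected attribute {attr!r} on <{elem.tag}>")
    return problems


if __name__ == "__main__":
    print(violations('<order id="1"><item sku="a" qty="2"/></order>'))
    print(violations('<order id="1"><extra/></order>'))
```

Note that even this syntax-level check is written against the parsed result rather than the raw characters, which is one reason neither model nor syntax can be considered in isolation.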
The 80/20 Split
Another of those recurring debates concerning the details of a
specification is the "80/20 Split." It's impossible for a single
specification to address all possible requirements, and so compromises
have to be made. Disputes arise from differing opinions on where that split
should fall, and which compromises are tolerable. Inevitably, a similar
debate has revolved around the details of the Infoset data model.
Whilst praising the intent of the Infoset, Michael Kay asserted that
the specification makes too many
compromises:
Personally, I don't have any problems identifying the need for the Infoset:
I've seen so many people try to attach meaning to lexical distinctions that
should not carry meaning that I yearn for an authority I can point to when
telling them they're wrong.
But the problem with the Infoset as currently defined is that it has had to
make too many compromises. Creating a common abstraction with the constraint
that XML, XML Namespaces, the DOM, and XPath should all conform with it is,
I think, a requirement that has proved impossible to satisfy.
Some developers expressed concerns over the information that the Infoset
does not model -- i.e., the information on the wrong side of the 80/20 split.
Indeed, Simon St. Laurent advocated extending the model to cover all available information,
with the option of defining subsets as a later effort:
I'd suggest that the Infoset's designers build for a wider XML-using
audience than the particular one they have envisioned, and then describe a
subset and perhaps the processing that takes information from XML syntax to
parser output.
While support for this suggestion was forthcoming from several contributors, many
were happy with how the model had been defined. Joe English observed
that a subset is still useful for
many applications:
Having a canonical "subsetted" model like the Infoset is very
important to tool-builders, spec writers, and schema designers
though. Without it, it's all to easy to design an application
that relies on properties of the input document that most tools
consider accidental syntactic properties; then documents built
in conformance with that application can't be processed with
those tools. This has happened to me a couple of times when
dealing with SGML.
Sean McGrath saw this as unacceptable--syntactic
differences may be important for some applications:
But distinctions that are irrelevant for some applications are not
irrelevant for others. This is the nub of the problem. The Infoset
throws certain things away. In so doing, it creates problems
for certain types of XML processing applications.
Eric Bohlman highlighted one class of
applications that the Infoset doesn't support:
Of course, there are always going to be certain applications that really
have to work with the lexical details of the syntactic instance rather
than its Infoset; these are editor-type applications that need to preserve
aspects of the lexical (physical) structure of the original document.
Trying to defuse the arguments, Rick Jelliffe attempted to further clarify
the purpose of the Infoset specification. Jelliffe described the Infoset as defining
a policy that other W3C specifications
will follow:
The Infoset is aimed at XML specifications and software in general. It
is not its intent to state all the information that anyone could encode
in their document. I would say that in particular it is setting a policy
that W3C XML specs should not operate as if the formatting of the XML
markup was significant.
This is not a new issue: I remember it being discussed 3 years ago or
so. It is good for XML editors to regenerate edited documents with the
original formatting of the markup. That is why it is useful if SAX
reports rather than collapses whitespace, and why a DOM implementation
for an interactive editor should subclass the W3C DOM to provide this
information. That is their Infoset, but it is not the one that W3C
Working Groups should start from.
These latter points from Bohlman and Jelliffe are important because they highlight
that the Infoset fails to support only a small subset of XML applications, which suggests
a hit rate much higher than eighty percent.
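The editor-type applications that Bohlman and Jelliffe mention are the ones that notice what a parse discards. A minimal sketch, assuming Python's standard-library xml.dom.minidom as a stand-in for a generic DOM implementation, shows a round trip through the parsed tree. The ORIGINAL string and the round_trip helper are hypothetical names for this example.

```python
from xml.dom import minidom

# Hypothetical input with deliberately idiosyncratic markup formatting:
# single-quoted attribute, a line break inside the start-tag, spaces
# around "=", and an explicit end-tag on an empty element.
ORIGINAL = "<doc  a='1'\n     b = \"2\"><e></e></doc>"


def round_trip(text):
    """Parse the text into a DOM tree and serialize the tree again."""
    return minidom.parseString(text).documentElement.toxml()


if __name__ == "__main__":
    print(ORIGINAL)
    print(round_trip(ORIGINAL))
```

The serialized result carries the same elements and attribute values, but the author's quoting and spacing inside the tags are gone. That loss is harmless for most processing and is exactly why an interactive editor needs to keep extra information beyond what the standard tree provides.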
For the majority of XML developers, the Infoset will serve as a useful adjunct to the XML
specification, complementing the syntax with an interoperable data model upon which XML
processors can be layered.