
The Long, Long Arm of SGML
by Kendall Grant Clark
November 05, 2003
SGML's Influence, XML's Anxiety
Writing about XML invites a certain humanistic pretension, for several
reasons. First, all this talk of universality and frictionless information
exchange bears uncanny allusions, for those of us who inhabit certain parts
of the ideological landscape, to what Christians used to call (before they
learned about language and gender bias) the "brotherhood of man". Sure, in
the end, it's all just bits and bytes, ones and zeroes; but in the XML
world all that Matrixesque machine austerity comes packaged in a
particular set of high tones and lofty ideals.
Second, since XML gets used in the production and dissemination of
documents, artifacts which are still and will long be the primary bearers
-- second only to natural persons, one suspects -- of cultural
transmission, it tends to attract (or reward?) technical people with
humanities backgrounds and training. Sure, in the end, the data wonks are
likely to take over, and then we humanities-derived folks will be out on
our collective asses; but for now, at least, we tend to fight them to a
healthy draw.
Thus it is that, when reviewing the recent debate about Tim Bray's
UTF-8+names proposal, I couldn't help but think of it in terms of an
important theory of poetry, the one promulgated by Yale literary
critic and professional academic contrarian Harold Bloom, in his
book The Anxiety of Influence. Since this is an XML
column, I won't go into much detail about Bloom's theory other than
to give a synopsis: Bloom argues that poetry (and, by extension, all
fiction and, perhaps, all of the creative arts) is formed in the
struggle between generations of poets; that is, younger poets
struggle to overcome the sense of anxiety brought about by the
unbearable influence of the older, stronger poet. The primary means
of enacting this struggle is for the younger poet to creatively
misread, misconstrue, and mistake the work of the stronger, older
poet. In these acts of creative erring, the younger poet forms a
space within which some act of poetic originality, that is, of
poetic independence, may be achieved.
What's the connection, you may be asking, between a theory of poetry
and Tim Bray's proposal? The connection is the long, long arm of
SGML.
Some significant percentage of the
pain suffered by the XML development community over the past five years
is directly attributable to dealing with the legacy of SGML. It has,
in other words, turned out to be much harder, much more complex to
do "SGML on the Web" than many people thought it would be. A
considerable amount of the early traction seized by XML was due to
the confluence of two forces: first, the technical maturity of SGML;
second, the early to middle years of exuberance about the Web
itself.
In various ways then, XML has really been about trying to overcome
the legacy of SGML. Perhaps "overcome" isn't quite right; perhaps
"modify and contemporize" is better? At any rate, XML has been
driven in part by a sense that SGML had things right, but not
just right, and that work remains to be done to overcome
SGML's failings.
What About All These Funny Characters?
Tim Bray's recent proposal --
presented in IETF RFC form, no less -- for fixing the "funny
character" issue in XML is a case in point. In XML you have two
choices for creating memorable shortcuts for entering various
Unicode characters. First, you can use a numeric character reference
(NCR), which is an &#...; construct formed from a
Unicode code point. The problem with NCRs is that they aren't very
memorable; they're rather anti-mnemonic, in fact. Second, you can
use an internal parsed entity, which is basically a binding,
declared in a DTD, between a pair of arbitrary strings. So, for
example, one can declare a set of bindings between human-friendly
strings and NCRs; thus the producer of an XML document can use the
friendly form which gets turned into the NCR form. Anyone who's done
any real work with SGML knows about and has used such sets of
entities.
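To make the two options concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample documents and the "eacute" entity name are my own illustrative assumptions, not taken from any particular DTD or entity set.

```python
import xml.etree.ElementTree as ET

# Option 1: a numeric character reference. No DTD is needed,
# but the author has to remember the code point.
ncr_doc = '<doc>caf&#xE9;</doc>'

# Option 2: an internal parsed entity, declared in the internal DTD
# subset, that binds the friendly name "eacute" to the NCR.
entity_doc = (
    '<!DOCTYPE doc [\n'
    '  <!ENTITY eacute "&#xE9;">\n'
    ']>\n'
    '<doc>caf&eacute;</doc>'
)

ncr_text = ET.fromstring(ncr_doc).text
entity_text = ET.fromstring(entity_doc).text

# The friendly name without any declaration is not well-formed XML,
# so the parser rejects it.
undeclared_doc = '<doc>caf&eacute;</doc>'
try:
    ET.fromstring(undeclared_doc)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```

Both of the first two documents yield the same text content; the third is the case that sends authors back to a DTD.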
As Bray puts it, "...these techniques in XML were inherited directly
from SGML." But part of the struggle to overcome the legacy of SGML
has been to find ways to do without DTDs. "For a variety of
reasons," Bray says, "authors increasingly wish to avoid the use of
DTDs, but still want to retain the convenience and readability of
internal parsed entities." There is, in truth, a world of struggle
to overcome the legacy of SGML packed into this disarmingly simple
little sentence. Obviously the big move on this front was XML's
introduction of the idea of well-formedness as against validity. As
we all know by now, well-formed XML instances don't require
DTDs. Other XML technologies replace aspects of DTD functionality,
including W3C XML Schema and RELAX NG, XInclude, and the ongoing
work on xml:id.
Bray's proposal is simple enough, really. He suggests adding another
character set, one which is a very strict superset of UTF-8, which
he calls UTF-8+names. Basically, the UTF-8+names character set is
the UTF-8 character set plus a set of replacements, which are
sequences that begin with "&", have some other character string,
and end with ";". The character string enclosed by "&" and ";"
is something Bray calls the "replacement name", and it is a
representation of a Unicode character sequence which he calls the
"replacement value". Thus, when using the UTF-8+names character set
in an XML instance, one can use character sequences which look for
all the world just like ol' SGML entities -- &uuml; -- but which
are, in fact, simply containers of replacement names representing
replacement values.
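The idea of doing the substitution at the encoding layer, before an XML parser ever sees the text, can be sketched roughly as follows. This is not Bray's implementation: the three-entry table, the decode_utf8_names function, and the choice to leave unknown names (including XML's own &amp;) untouched are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical replacement table: replacement name -> replacement value.
# A real specification would fix this table; these entries are
# illustrative only.
REPLACEMENTS = {
    "uuml": "\u00fc",
    "eacute": "\u00e9",
    "nbsp": "\u00a0",
}

_REPLACEMENT = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_utf8_names(data: bytes) -> str:
    """Decode UTF-8 bytes, then swap known replacement names for values."""
    text = data.decode("utf-8")

    def substitute(match):
        # Assumption: names not in the table pass through unchanged,
        # so XML's predefined entities survive for the parser.
        return REPLACEMENTS.get(match.group(1), match.group(0))

    return _REPLACEMENT.sub(substitute, text)


decoded = decode_utf8_names(b"<p>M&uuml;nchen &amp; caf&eacute;</p>")

# The XML parser now sees ordinary characters and needs no DTD.
parsed_text = ET.fromstring(decoded).text
```

The XML parser itself is unchanged here; all of the work happens in the decoding step that runs before it.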
Bray's proposal met with fairly vigorous reaction. Seairth Jacobs'
reaction (seconded by Elliotte Rusty Harold), that Bray might want
to consider a format other than one which stuck so
carefully to SGML's legacy format, is an interesting one, and it
highlights the sense in which SGML is still the thing that XML
people are often reacting to and against. Why not, as Jacobs said, a
"@name;" or "#name;" form?
What does Bray's proposal show? That, as he puts it, "it is
technically feasible to provide named characters without touching
XML by using an alternative encoding of Unicode." That's a useful
showing, but it's not clear that it's a viable way to move past SGML
for this particular issue. Even more to the point, Bray adds that
"there is no realistic prospect of adding entity declaration to any
of the modern schema facilities or of somehow shoehorning it into
XML itself in a DTD-less way."
A Competing Proposal
There's another proposal floating around XML-DEV lately, but it's
not really a competitor, inasmuch as Bray's proposal was really just
a thought experiment. Richard Tobin's proposal, which I think to be
relatively sane and even a bit clever, uses XML namespaces to
declare entities in XML attributes, thus:
xmlent:eacute="&#xE9;"
"&eacute;" is thus replaced by "&#xE9;", the relevant NCR,
within the element on which this attribute exists. There's also an
XML entity file version, using the attribute
xmlentfile, the value of which is one or more URIs.
In addition to ridding ourselves of one of the last remaining needs
for DTDs, Tobin's proposal, owing to its element scoping, also means
that arbitrary XML fragments which include entities and, in Tobin's
proposal, the declaration of those entities, become easier to
include, arbitrarily, in other XML fragments or instances. That's
very handy.
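Element scoping of this kind can be sketched with a toy tree model. Nothing below comes from Tobin's actual text: the Element class, the expand function, and the assumption that descendants inherit declarations from their ancestors are all hypothetical, and the namespace URI behind the xmlent: prefix is omitted entirely.

```python
import re
from dataclasses import dataclass, field

PREFIX = "xmlent:"  # attribute prefix from Tobin's example; URI omitted
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


@dataclass
class Element:
    """Toy element: attributes plus a mixed list of text and children."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def expand(element, inherited=None):
    """Return the element's text with references resolved in scope."""
    # Declarations on this element extend (and may shadow) inherited ones.
    scope = dict(inherited or {})
    for name, value in element.attrs.items():
        if name.startswith(PREFIX):
            scope[name[len(PREFIX):]] = value

    parts = []
    for child in element.children:
        if isinstance(child, Element):
            parts.append(expand(child, scope))
        else:
            # References with no declaration in scope are left as-is.
            parts.append(
                _REF.sub(lambda m: scope.get(m.group(1), m.group(0)), child)
            )
    return "".join(parts)


# The fragment carries its own declaration, so it can be dropped into
# a host document that knows nothing about "eacute".
fragment = Element("p", {"xmlent:eacute": "\u00e9"}, ["caf&eacute; "])
host = Element("doc", {}, ["&eacute;t&eacute; ", fragment])

fragment_text = expand(fragment)
host_text = expand(host)
```

The reference inside the fragment resolves wherever the fragment lands, while the same name used outside the declaring element stays unresolved, which is the portability the scoping buys.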
No matter which proposal, whether Bray's or Tobin's or someone
else's, there seems to be a renewed energy among the members of
XML-DEV, and perhaps among the XML development community at large,
to resume the struggle to overcome the remaining vestiges of SGML's
legacy, including the DTD. I'm not sure that a world without DTDs
will be a better world, but it will be a new one. And it will have
been achieved by means of a struggle with our predecessors and
precursors.