Escaped Markup Considered Harmful
by Norman Walsh
August 20, 2003
XML is pretty simple. There's plenty of complexity to be found if you
go looking for it: if you want, for example, to validate or transform or
query it. But elements and attributes in well formed combinations
have become the basis for an absolutely astonishing array of projects.
Recently I've encountered a design pattern (or antipattern, in my
opinion) that threatens the very foundation of our enterprise. It's
harmful and it has to stop.
A Little Historical Background
It was not always obvious that the idea of well formed markup, and the
draconian approach to error detection that the XML 1.0 Recommendation
requires, was going to catch on. Well, it wasn't obvious to me, anyway; I
won't speak for anyone else.
The technical merits of well formedness are easily understood. Well
formedness allows a parser to discover, easily and unambiguously, the
logical structure of an XML file. Before XML, we had SGML, which has all
sorts of rules for markup minimization. They made writing a parser
hard. Very hard. And SGML had its cousin HTML. Most HTML documents
weren't really valid SGML documents; beyond all of the parsing issues that
full SGML support would have brought, the applications that consumed HTML
had completely ad hoc rules for recovering from markup errors.
All of this complexity for parser writers and ambiguity in
interpretation of broken markup had a benefit. It made hand authoring of
markup a lot easier. In SGML you could leave off the quotes around some
attribute values, omit start and end tags in some places, and rely on the
devious little SHORTREF mini-language, if you were so inclined. And in
HTML, you could throw just about any tag soup at the browser and it'd do
something. With a little random fiddling, you could probably get it to do
something that looked right, at least in some browsers.
XML came along and said, "Nope. Too hard, too costly, too
difficult. You're going to do your markup just like this, with almost no
minimization, and if you don't get it exactly right, applications aren't
allowed to recover from your errors. If it isn't well formed, it isn't
XML."
And for a moment, we held our collective breath.
The moments ticked by. The right vendors agreed to support XML, the
necessary folks in the user community looked at the possibility of a
future where powerful applications were easy to write and agreed that the
trade-off in markup ease was well made. XML passed the first, perhaps most
important hurdle. It was off and running.
A few years later there are growing pains. We can all point to this
specification or that one and claim that it would have been better if it'd
been done some other way. But I think few would argue that it hasn't been
a success story on the whole. As I said, now we've got an absolutely
astonishing array of powerful, open, flexible, adaptable tools at our
disposal.
And we have them because XML must be well formed.
Thus it came as a surprise to me when I discovered that the RSS folks
were supporting a form of escaped markup. Webloggers often publish a list
of their recent entries in RSS and online news sites often publish
headlines with it. Like most XML technologies, it has enough flexibility
to suit a much wider variety of purposes than I could conveniently
summarize here.
Surprise became astonishment when I discovered that the folks working
on the successor to RSS weren't going to explicitly outlaw this ugly
hack. When I discovered that this hack was leaking into another XML
vocabulary, FOAF, I became outright concerned.
What is Escaped Markup?
Escaped markup is just what it sounds like: markup that has been
escaped so that it isn't markup anymore. If you write XML documents that
have less-than signs or ampersands in content, you're already familiar
with escaped markup.
In RSS, it often looks like this:
<description><![CDATA[
Some description of an article about
<a href="http://www.w3.org/TR/REC-xml">XML</a> that
contains a link and a <br> element.]]>
</description>
It is important to realize that this is precisely the same as:
<description>
Some description of an article about
&lt;a href="http://www.w3.org/TR/REC-xml"&gt;XML&lt;/a&gt; that
contains a link and a &lt;br&gt; element.
</description>
The notion that CDATA sections confer some special, literalist
semantics on the escaped markup is incorrect. While it is technically
possible for an application to distinguish which form of escaping was
used, it would be wrong to establish meaning based on the form. CDATA
escaping is generally indistinguishable from other forms of escaping.
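The equivalence is easy to check with any conforming parser. Here is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample strings are shortened, illustrative versions of the two forms above, not taken from a real feed.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same description. One uses a CDATA section,
# the other uses entity references. The sample text is illustrative.
cdata_form = (
    '<description><![CDATA[A link to '
    '<a href="http://www.w3.org/TR/REC-xml">XML</a>'
    ' and a <br> element.]]></description>'
)
entity_form = (
    '<description>A link to '
    '&lt;a href="http://www.w3.org/TR/REC-xml"&gt;XML&lt;/a&gt;'
    ' and a &lt;br&gt; element.</description>'
)

cdata_elem = ET.fromstring(cdata_form)
entity_elem = ET.fromstring(entity_form)

# Both parse to the same character data, and neither has child elements.
same_text = cdata_elem.text == entity_elem.text
child_counts = (len(cdata_elem), len(entity_elem))
print(same_text, child_counts)
```

Once parsed, nothing in the resulting tree records which form of escaping the author used.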
Now there's nothing wrong with escaped markup, as long as it means what
it says. Namely:
Some description of an article about
<a href="http://www.w3.org/TR/REC-xml">XML</a> that
contains a link and a <br> element.
But, perversely, most RSS applications render that markup like this:
Some description of an article about
XML that
contains a link and a
element.
A convention has developed that says the contents of at least some and
perhaps all elements in RSS are "unescaped" and then rendered. This opens
a horrible back door in the whole XML markup picture.
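To make the back door concrete, here is a rough sketch of what such a consumer ends up doing, assuming it reparses the text with an HTML parser. The TagCollector class is a hypothetical helper written for this illustration, not part of any RSS tool.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

feed_item = (
    '<description>An article about '
    '&lt;a href="http://www.w3.org/TR/REC-xml"&gt;XML&lt;/a&gt;'
    ' with a &lt;br&gt; element.</description>'
)


class TagCollector(HTMLParser):
    """Hypothetical helper: records the start tags found in a string."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


description = ET.fromstring(feed_item)

# First pass: the XML parser reports character data only, no elements.
xml_children = list(description)

# Second pass: the consumer "unescapes" by reparsing that text as HTML,
# and markup the XML parser never saw appears.
collector = TagCollector()
collector.feed(description.text)
collector.close()
hidden_tags = collector.tags
print(xml_children, hidden_tags)
```

The structure that matters to the rendering application is invisible to every XML tool that sits between the author and the reader.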
Escaped Markup Doesn't Work
There appear to be two arguments in favor of escaped markup:
First, aggregators are using XML, in the form of RSS, to
combine data sources together. Aggregators are tools or companies that
build RSS feeds for a wide variety of sources. You might, for example,
subscribe to a feed that shows the top ten news stories from selected
major news outlets.
The aggregators might argue that they're just using XML as a transport
protocol and have no control over the actual content. The content that
they're aggregating may or may not be well formed so they have to do
something with the markup. There's a further argument that they don't
have any interest in the actual content, that it's just shuffled off to
some other application for rendering, and that it's better and more
efficient to store the content as opaque text nodes.
I don't think these arguments come close to justifying the solution
that's been adopted:
Escaping markup, particularly with CDATA sections,
just doesn't work. There are other things that might be wrong that would
make the documents not well formed. There are Unicode characters that are
forbidden, there are encoding issues for the characters that are allowed,
and there are sequences of characters that must be avoided (e.g.,
"]]>"). Not to mention the fact that CDATA sections don't
nest.
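These failure modes are simple to reproduce. The sketch below assumes a naive wrap_in_cdata helper (hypothetical, standing in for whatever an aggregator might do) and checks well formedness with Python's standard XML parser.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(content):
    """Naive, hypothetical wrapper: assumes content never contains ']]>'."""
    return "<description><![CDATA[" + content + "]]></description>"


def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Ordinary tag soup survives the wrapping.
plain = wrap_in_cdata("a <br> element")
# Content that already contains a CDATA section ends the outer one early.
nested = wrap_in_cdata("<p><![CDATA[x < y]]></p>")
# A form feed (U+000C) is not a legal XML 1.0 character, CDATA or not.
control = wrap_in_cdata("page one\x0cpage two")

results = [is_well_formed(doc) for doc in (plain, nested, control)]
print(results)
```

Only the first document should be accepted; the other two are rejected even though every byte of the troublesome content sits inside a CDATA section.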
There are better ways of escaping content. First of all, if the
content you encounter is well formed XML, no escaping is necessary. If it
isn't well formed XML, then it must be HTML. No application is allowed to
accept a document that purports to be XML but is not well formed. There
are well understood ways to turn HTML into XHTML (or well formed XML). I'd
even prefer stripping all the markup entirely to this escaped markup
"solution".
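As a rough illustration of that order of preference, the sketch below embeds a fragment as real child elements when it parses as XML and otherwise falls back to stripping the markup. The build_description function and TextOnly class are hypothetical, and the sketch skips the HTML-to-XHTML conversion step and ignores namespaces entirely.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keeps character data and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def build_description(fragment):
    """Embed well formed content as elements; otherwise strip the markup."""
    description = ET.Element("description")
    try:
        wrapper = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    except ET.ParseError:
        # Not well formed: keep only the text rather than escaping tags.
        stripper = TextOnly()
        stripper.feed(fragment)
        stripper.close()
        description.text = "".join(stripper.parts)
    else:
        # Well formed: no escaping needed, the markup stays markup.
        description.text = wrapper.text
        description.extend(list(wrapper))
    return description


good = build_description('About <a href="http://www.w3.org/TR/REC-xml">XML</a>.')
bad = build_description("About <b>XML<br> today")
print(ET.tostring(good, encoding="unicode"))
print(ET.tostring(bad, encoding="unicode"))
```

Either way, the result is a description element whose content means what it says to any XML application that reads it.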
The argument about opacity doesn't fly either. The fact that some
applications don't care about the content of the aggregated feed is a poor
excuse for putting it inside a black box that can't be opened by any
rational XML application.
If it's really important to escape the markup, if it's impractical
to convert it to well formed XML, or the penalty of parsing the nested
markup is too expensive, use base64 encoding.
That would have two distinct advantages: first, it would actually work,
which is always a nice feature, since it would handle arbitrary
characters; second, it would very clearly not be a format designed for
human authoring.
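A minimal sketch of the base64 approach follows, using only Python's standard library. The encoding attribute is an assumption made for this illustration; no RSS specification defines it.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary content: tag soup, a "]]>" sequence, and a control character
# that could never appear literally in an XML 1.0 document.
payload = "tag <b>soup<br> with ]]> and a form feed \x0c"

item = ET.Element("description")
item.set("encoding", "base64")  # hypothetical attribute, for illustration
item.text = base64.b64encode(payload.encode("utf-8")).decode("ascii")
document = ET.tostring(item, encoding="unicode")

# The consumer parses ordinary, well formed XML and decodes explicitly.
parsed = ET.fromstring(document)
recovered = base64.b64decode(parsed.text).decode("utf-8")
print(recovered == payload)
```

The payload comes back byte for byte, and nobody reading the feed could mistake the element content for something meant to be typed by hand.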
I think the most dangerous part of this whole escaped markup kludge is
that it encourages naive authors and programmers to adopt this style in
other applications.
The second argument: escaped markup allows authors to put HTML and other content into
elements where the schema or DTD says that only text is allowed.
I'm sorry: an obvious, compelling, and irrefutable argument
against allowing escaped markup is that it allows authors to put
HTML and other content into elements where the schema or DTD says that
only text is allowed.
Escaped Markup Is Harmful
The idea of escaping markup goes against the fundamental grain of
XML. If this hack spreads to other vocabularies, we'll very quickly find
ourselves mired in the same bugward-compatible tag soup from which we have
struggled so hard to escape.
And evidence suggests that it's already spreading. Not long ago, the
question of escaped markup turned up in the context of FOAF. The FOAF
specification condones no such nonsense, but one of the blogging tools
that produces FOAF reacted to a user's insertion of HTML markup into the
"bio" element by escaping it. The tool vendor in question was quickly
persuaded to fix this bug.
Escaped Markup Must Stop
There is clear evidence that the escaped markup design will spread if
it isn't checked. If it spreads far enough before it's caught, it will
become legacy. Some vendors will be forced to continue to support this
abomination by simple economics. And it won't be their fault; it'll be
ours for not killing the virus before it could spread.