Intuition and Binary XML
by Leigh Dodds
April 18, 2001
XML-DEV has been revisiting a well-loved debate this week: binary-encoded
alternatives to XML, last encountered in January.
The Need For Evidence
Lurk long enough on any mailing list, and you'll always find a few
ideas that refuse to go away. XML-DEV has more than a few of its own;
the concept of a binary XML is one which ranks up there with
"Namespaces: Good or Bad?" and "Why the W3C/ISO/OASIS/IETF (delete
as appropriate) Process is Just Plain Wrong." All of them are
good subjects if you're feeling lonely and fancy reading some
email.
It's easy to be dismissive of these debates, but they're often a
sign that there's some fundamental problem or common misunderstanding
that needs to be addressed. Alaric Snell
characterized the reaction of many developers when faced with XML
for the first time.
To many programmers, XML *looks* inefficient and
awkward. That was my first thought when presented with the idea of
using it for data interchange; luckily I was enamored enough of the
good work being done on various interesting schemas that suggested
this data format (although technically lacking in many respects) may
actually achieve "ubiquitous" status.
This reaction is likely to be a kind of gut feeling. After all, XML
is a plain text format containing lots of whitespace, so it
must be inefficient, right? Unfortunately gut reactions rarely
lead to good results when it comes to optimization. As Tim Bray noted,
empirical evidence and real-world test profiles are what's
needed.
...an argument that unpacking a binary format (particularly
on a machine whose binaries are different and you have to
bit-swizzle) is significantly faster than XML parsing a la expat or
MSXML, needs to be supported by actual empirical data rather than by
assertion. And suppose, as a thought experiment, that this were
true; if you were to speed up the XML parsing/generating part of an
XML-using application, how much would that speed up the whole
application? You'd need to know what proportion of its time it
spends parsing/generating XML. In some apps, this proportion is
going to be very small.
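Bray's thought experiment is the calculation usually known as Amdahl's law. The following is a minimal sketch of it; the function name and the figures (parsing at 5% of run time, a tenfold parsing speedup) are hypothetical illustrations, not measurements from the discussion.

```python
def overall_speedup(parse_fraction: float, parse_speedup: float) -> float:
    """Whole-application speedup when only the parsing share gets faster.

    parse_fraction: share of total run time spent parsing/generating XML (0..1).
    parse_speedup:  how many times faster that share becomes.
    """
    if not 0.0 <= parse_fraction <= 1.0:
        raise ValueError("parse_fraction must be between 0 and 1")
    if parse_speedup <= 0:
        raise ValueError("parse_speedup must be positive")
    # The untouched share keeps its cost; the parsing share shrinks.
    return 1.0 / ((1.0 - parse_fraction) + parse_fraction / parse_speedup)


if __name__ == "__main__":
    # Hypothetical figures: parsing is 5% of run time and becomes 10x faster.
    print(overall_speedup(0.05, 10.0))
    # Same speedup, but in an application that spends half its time parsing.
    print(overall_speedup(0.5, 10.0))
```

With the first set of assumed figures the whole application gets under 5% faster, which is the point of the thought experiment: the answer depends far more on the measured proportion than on how fast the replacement parser is.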
Bray recounted painful attempts to optimize without accurate profiling
information, made in the full flush of the enthusiasm one encounters
when presented with an optimization problem (something that Sean
McGrath later termed the "rush of code to the hand").
In my experience, assertions about what will make software run
faster, when not backed up by empirical profiling data, are not worth
wasting time on. I have seen untold amounts of time wasted by overeager
junior programmers who just knew, "without needing empirical evidence", that
putting a hash-table in, or some such, would make their app go faster, when
some profiling work would have shown that their performance was dominated by
I/O buffer management.
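One way to gather the kind of profiling data Bray asks for is sketched below, using Python's standard cProfile and pstats modules. The toy application (build_document, parse_document, other_work, run_app) is entirely hypothetical and stands in for a real program; only the profiling calls are the point.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET


def build_document(n_items: int) -> str:
    """Generate a made-up XML document to parse."""
    items = "".join(f'<item id="{i}">value {i}</item>' for i in range(n_items))
    return f"<items>{items}</items>"


def parse_document(text: str) -> int:
    """Parse the document and return the number of child elements."""
    return len(ET.fromstring(text))


def other_work(n: int) -> int:
    """Stand-in for everything else the application does."""
    return sum(i * i for i in range(n))


def run_app() -> None:
    doc = build_document(2000)
    parse_document(doc)
    other_work(200_000)


def profile_report() -> str:
    """Run the toy application under the profiler and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    run_app()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report())
```

The report lists cumulative time per function, so the share spent in parse_document can be read off directly instead of being asserted.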
Several members of XML-DEV were forthcoming with anecdotal evidence
and experience with different XML encodings. Oleg Paraschenko reported
that his Pyx parser project (Pyx is a line-oriented subset of XML) was
actually slower than a full parser. Henry Thompson has more recently
learned the hard way that binary is not necessarily faster.
I just wasted a weekend getting my schema validator to dump
the internal form of the 'compiled' schema-for-schemas, on the
_assumption_ that reloading that would be faster than
parsing/compiling the schema-document-for-schemas every time I
needed it. Wrong. Takes more than twice as long to reload the
binary image than to parse/compile the XML.
There are _lots_ of people out there working hard to make
parsing/writing XML blindingly fast. With respect, you're unlikely
to beat them.
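Thompson's assumption is cheap to test before spending a weekend on it. The sketch below times re-parsing a document against reloading a dumped binary image of the same data. The schema-like document, the function names, and the use of pickle as the binary form are all assumptions made for illustration; this is not Thompson's validator, and the result will vary by data and platform.

```python
import pickle
import timeit
import xml.etree.ElementTree as ET


def make_schema_xml(n_types: int) -> str:
    """Build a made-up, schema-like document with n_types declarations."""
    decls = "".join(
        f'<type name="t{i}" base="string" maxLength="{i}"/>' for i in range(n_types)
    )
    return f"<schema>{decls}</schema>"


def compile_from_xml(text: str) -> dict:
    """Parse the XML and build the in-memory form the application uses."""
    return {
        el.get("name"): (el.get("base"), int(el.get("maxLength")))
        for el in ET.fromstring(text)
    }


xml_text = make_schema_xml(500)
compiled = compile_from_xml(xml_text)
blob = pickle.dumps(compiled)  # the dumped "binary image" (an assumption)

# Time both routes to the same in-memory structure.
parse_seconds = timeit.timeit(lambda: compile_from_xml(xml_text), number=50)
reload_seconds = timeit.timeit(lambda: pickle.loads(blob), number=50)

print(f"parse/compile XML: {parse_seconds:.4f}s for 50 runs")
print(f"reload binary:     {reload_seconds:.4f}s for 50 runs")
```

Whichever route wins on a given machine, the numbers replace the assumption, which is all the quoted experience argues for.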
Yet because there are few empirical results, the debate
cannot be put to rest and the hand-waving continues. Even those big
projects that have adopted a binary encoded XML format have not
produced a convincing case. One commonly cited example is wbXML, used
in WAP devices that are deemed to have little processing power and
limited bandwidth, yet even this case is arguable, as Sean McGrath has
pointed out.
I do a lot of work with WAP and experience with it has
turned me off binary XML encodings fairly comprehensively. I don't
think WAP demonstrates the advantage of a binary encoding. I think
it demonstrates quite the opposite.
My tests repeatedly show that the difference between
response times of the *same* system serving compact HTML (iMode) to
an iMode client browser versus WML to a WML browser is
negligible.
For my money, iMode got it right. A stripped down HTML with
plain text -- pure as the driven snow -- flowing from client to
server.
This most recent discussion also highlighted another example which
has the potential to become extremely widely used: MPEG-7. The
MPEG-7 effort is "daring to describe" multimedia data using XML and
provides a binary alternative for encoding this XML data. But, as
Claude Seyrat notes, even here a degree of choice is being allowed.
When designing MPEG-7, the following policies have been
adopted:
- to stay as close as possible to the XML spirit by the adoption of
a textual version designed with XML Schema,
- to define a binary format that uses XML Schema definition to
generate an efficient encoding scheme,
- to allow one to decide whether he wants to use binary or textual
format.
Since the beginning, MPEG-7 has been XML driven. The
MPEG-7 community is very reluctant to follow another development
path. However in MPEG-7 everybody recognizes the need for a binary
format.
Binary encodings may be suitable for applications where the format
and data are known in advance and suitable optimizations can be made.
However, deriving a generally useful binary encoding is much harder,
as Ramin Firoozye pointed out.
Binarizing of the form in WML does actually make the
content smaller -- but that's because they've already pre-defined
the element tokens, well-known attributes, and common
substrings. Binarizing streaming XML of an unknown variety actually
slows down the application because of the overhead for building an
on-the-fly dictionary (and in worst-case scenarios -- requiring
multiple passes over the source). Binarizing through
object-streaming actually makes the file size larger due to overhead
for storing internal tree information.
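The difference Firoozye describes can be sketched with a toy encoder. Everything below is a simplified assumption for illustration: the token values, the three-element vocabulary, and the byte layout are invented and are not the actual WBXML format, and attributes and tail text are ignored. The first encoder relies on a table agreed in advance; the second has to make an extra pass to discover the vocabulary and then ship the table with the data.

```python
import xml.etree.ElementTree as ET

END = 0x01   # hypothetical "end of element" marker
TEXT = 0x02  # hypothetical "inline string follows" marker

# Hypothetical table agreed in advance by both sides (WML-like names).
KNOWN_TOKENS = {"deck": 0x10, "card": 0x11, "p": 0x12}


def encode(elem: ET.Element, tokens: dict, out: bytearray) -> None:
    """Append a tokenized form of elem and its children to out."""
    out.append(tokens[elem.tag])
    text = (elem.text or "").strip()
    if text:
        data = text.encode("utf-8")
        out.append(TEXT)
        out.append(len(data))  # one length byte: strings under 256 bytes only
        out.extend(data)
    for child in elem:
        encode(child, tokens, out)
    out.append(END)


def encode_with_known_table(xml_text: str) -> bytes:
    out = bytearray()
    encode(ET.fromstring(xml_text), KNOWN_TOKENS, out)
    return bytes(out)


def encode_with_discovered_table(xml_text: str) -> bytes:
    root = ET.fromstring(xml_text)
    # Extra pass over the tree just to find out which names occur.
    names = sorted({e.tag for e in root.iter()})
    tokens = {name: 0x10 + i for i, name in enumerate(names)}
    out = bytearray()
    # The receiver does not know the table, so it travels with the data.
    out.append(len(names))
    for name in names:
        data = name.encode("utf-8")
        out.append(len(data))
        out.extend(data)
    encode(root, tokens, out)
    return bytes(out)


if __name__ == "__main__":
    doc = "<deck><card><p>Hello</p><p>World</p></card></deck>"
    print(len(doc.encode("utf-8")))
    print(len(encode_with_known_table(doc)))
    print(len(encode_with_discovered_table(doc)))
```

On this tiny document the pre-agreed table gives the smallest output, while the discovered table costs an additional traversal and a header. Those are the two overheads the quote attributes to binarizing XML of an unknown variety.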
Len Bullard succinctly
summed up the challenge that proponents of alternative binary XML
formats should meet (with hard evidence) for the debate to move
forward.
The question is not is a binary useful for any given XML
application language, but is a standard XML binary useful for all of
them. WML has one because it needs one and it is good for WML.
Generalizing that leads to false conclusions because the form and
fit is not the same for the function.
Why Binary Isn't Enough
Other members of XML-DEV sidestepped the binary versus text
processing speed issue entirely, homing in on other aspects of XML
that are significant advantages in their own right and would be lost
with a binary format.
David Brownell highlighted XML's openness.
Binary formats are bad because they tend towards being
proprietary, and that's the last thing that should happen to the
world's next "intellectual commons".
Auditability was a significant advantage in Clark Evans's book.
XML is going to succeed where other file formats have
failed because it is auditable -- I, a mere human, can pull up the
code and read it with my own eyes and without an intermediate reader
which could be at fault.
...Binary XML is dead on arrival. Getting away from binary
formats is the _entire_ reason for XML. Being able to audit your
inputs and outputs.
The issue of XML as an easily readable format may be too easily
dismissed; after all, who wants to sift through tangles of
angle brackets? Yet the point is not that XML should be readable to
the everyday user, but that it is readable to a developer and
can therefore be deciphered, reverse-engineered, tested, and audited
much more easily than a binary alternative.
Eric Bohlman believed that the discussion was pitched at too low a
level; saving CPU cycles is not the issue. In environments where data
is being exchanged between multiple organizations, other factors
become important. Not least among them are maintenance and
documentation, as well as the social implications of agreeing on a
format in the first place.
And let's not forget the *social* aspects (the ultimate
non-geeky stuff) of data interchange. When several unrelated
organizations, or even departments within an organization, need to
exchange data, there's an enormous advantage to using a data format
that was created by a third party rather than by one of the players,
namely that there's no rivalry over *which* player gets to create
the format. Again, if one party could simply impose a format by
fiat, everything would be cool, but in real life, if you don't get
full "buy in" from all the players, you're going to see a lot of
friction (usually in the form of "creative incompetence" where
everybody's implementations differ in slight but important details)
that will dissipate a lot of energy as heat. Yes, this falls into
the realm of what hardcore geeks would call "touchy-feely" stuff,
but the fact is that psychological/verbal/non-quantitative/
stereotypically-female/"touchy-feely" considerations
play important roles in any real-life human endeavor involving more
than one person, and the fact that one might be more comfortable
with bits and chips than with human interactions doesn't change that
reality.
Characteristically philosophical, Walter Perry cut to the heart of
the issue: once on the Internet you no longer know how, or by whom (or
even what), your data will be processed, so you
cannot make any assumptions about how it will be used. Perry has
long argued that facilitating this kind of usage is the key advantage
of XML.
The savings to be realized through the use of a binary
format are premised upon parsing the XML text only once and
thereafter passing around or storing the binary encoded output. Such
a mechanism demands that every user of that data expect, or accept,
the identical output of that parse -- effectively, a canonical
rendering. It is only such unanimity which would permit every user
to accept the product of a parse performed by any of them. In the
rapidly growing internetworked universe, it is precisely that
unanimity which we cannot reasonably expect...I argue that the
reasonable understanding of XML acknowledges that every use of an
XML document begins with a fresh parse of that document in the
context of that use. That parse is not required to instantiate XML
as XML -- the document itself is already that instance -- but to
instantiate the particular objects which that specific use of the
XML document expects and requires...You may choose to drive that
instantiation off of something other than XML syntax, but it is not
then XML processing, and what you lose, most significantly, in doing
that is the ability for the same text to be understood and usefully
processed at the same time as something very different, but
simultaneously the valid basis for a transaction between, utterly
dissimilar users.
This is an important point, as it grounds much of the effort behind
XML. XML is about freeing data so that it can reach its full
potential by packaging it up in an appropriate way; it's fundamentally
not about standardizing complicated software architectures. This is
not to discount any benefits that may come from looking at innovative
ways of processing XML data. As Rick Jelliffe observed, innovation
can be applied without recourse to a binary format.
It is completely possible to make inefficient binary
formats...or ones with performance penalties. It is completely
possible to provide indexes in XML documents...It is possible to
provide multipart documents with an XML document and a binary index
for searching. It is possible to provide non-XML text formats that
have nice performance characteristics...my STAX short-tagging
compression which can give well over 50% reduction in file size (in
suitable cases) for just a paragraph of extra lines of
non-processor-taxing code inside an XML parser. And there are more
efficient parsers possible (especially for trusted data) if they
assume WF documents...
...And there is also the other cat in the bag: sparse, lazy DOMs
(i.e. DOMs constructed lazily as required from a fragment server) may
require far less processing than retrieving full documents whether those
documents are sent as XML or non-XML.
...the use-case is not merely readability, however excellent that
constantly shows itself to be. A lot of the supposed benefits of a binary
format may be nothing to do with the binary-nature itself, and just as
doable in vanilla XML or in a text format.
In short, the consensus is that a binary XML will at best equal the
advantages of XML as it is today. Greater rewards will be found from
pursuing the application, and not the re-engineering, of XML.