Table of Contents
Compression Techniques
XML's Future Could Depend on Efficiency
Wrapping Up - What's Inside?
Last week in XML-Deviant we explored the design of SVG and discovered that concern over
file size was behind several contentious design decisions.
This week we focus on a discussion that sought a more generic solution
to XML verbosity.
Compression Techniques
Like any textual markup language, XML is verbose. An XML document carries a lot of
"redundant" data, including white space and repeated element and attribute names.
XML documents are therefore prime candidates for compression. Simon St. Laurent
asked if anyone was working on a standard compression format:
I'm starting to get concerned about the volume of complaints I'm getting
from readers and folks in Web development forums who are starting to argue
that XML's verbosity is a problem, especially for things like transmitting
vector graphics information. There are a lot of wasted bits in XML
documents -- and of course in HTML and other text documents as well.
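As a rough illustration of how much of a markup-heavy document is repetition, the following sketch compresses a small vector-graphics-style document with a general-purpose compressor. The sample document is invented for this example and is not taken from the thread; real documents will compress by different amounts.

```python
import zlib

# Hypothetical sample: 200 SVG-like rectangles. The element and
# attribute names repeat on every line, which is the "redundant"
# data a compressor can exploit.
rows = [
    f'<rect x="{i}" y="{2 * i}" width="10" height="10" fill="black"/>'
    for i in range(200)
]
document = ("<svg>\n" + "\n".join(rows) + "\n</svg>").encode("utf-8")

compressed = zlib.compress(document, 9)
ratio = len(compressed) / len(document)
print(f"{len(document)} bytes -> {len(compressed)} bytes ({ratio:.0%} of original)")

# Decompression restores the document byte for byte.
assert zlib.decompress(compressed) == document
```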
Judging from the response, others have had similar thoughts and
encountered similar complaints. Two potentially useful products were
mentioned: XMill and XMLZip, both general-purpose XML compression tools.
Mark Baker suggested that, because HTTP already supports compression
through its Accept-Encoding and Content-Encoding headers, there is no
need to wait for a standard:
I could see a generic XML-specific compression mechanism being
developed; one that understands what "<" and ">" mean. But
you don't have to wait for that to compress your XML today.
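Baker's point can be sketched in a few lines. The function below is a simplified illustration, not a complete implementation of HTTP content negotiation: the name negotiate_body is hypothetical, quality values other than "q=0" are ignored, and only gzip is offered.

```python
import gzip


def negotiate_body(xml_bytes: bytes, accept_encoding: str):
    """Return (headers, body), gzipping only if the client listed gzip."""
    offered = set()
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        # Simplification: treat "q=0" as a refusal, ignore other weights.
        if name and params.replace(" ", "") not in ("q=0", "q=0.0"):
            offered.add(name.strip().lower())

    headers = {"Content-Type": "application/xml", "Vary": "Accept-Encoding"}
    if "gzip" in offered:
        body = gzip.compress(xml_bytes)
        headers["Content-Encoding"] = "gzip"
    else:
        body = xml_bytes
    headers["Content-Length"] = str(len(body))
    return headers, body


# Usage: a client that sent "Accept-Encoding: gzip, deflate".
doc = b"<orders>" + b"<order status='open'/>" * 100 + b"</orders>"
headers, body = negotiate_body(doc, "gzip, deflate")
assert headers["Content-Encoding"] == "gzip"
assert gzip.decompress(body) == doc
```

A client that does not list gzip receives the document unchanged, so nothing breaks for software that knows nothing about compression.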
A binary encoding for XML is another means by which file sizes, and
ultimately bandwidth, could be reduced. Ingo Macherius pointed out
the work of the WAP Forum:
The WAP community has developed an architecture for binary XML encoding,
which includes efficient compression.
Obviously, an efficient format is essential within the limited bandwidth available to mobile devices
(although this restriction will no doubt be alleviated at some point in
the future).
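To make the general idea of a tokenized binary encoding concrete, here is a minimal sketch. It is not the WAP Forum's format: the byte layout, the START and END markers, and the function names are all invented for illustration. It replaces each element and attribute name with a one-byte index into a name table, and it ignores text that follows a child element, comments, and namespaces.

```python
import xml.etree.ElementTree as ET

START, END = 0x01, 0x02  # hypothetical one-byte markers


def _put_str(out: bytearray, s: str) -> None:
    data = s.encode("utf-8")
    out += len(data).to_bytes(2, "big") + data  # 2-byte length prefix


def encode(xml_text: str) -> bytes:
    names: list[str] = []  # each name is stored once, then referenced

    def token(name: str) -> int:
        if name not in names:
            names.append(name)
        return names.index(name)

    body = bytearray()

    def walk(elem: ET.Element) -> None:
        body.append(START)
        body.append(token(elem.tag))
        body.append(len(elem.attrib))
        for key, value in elem.attrib.items():
            body.append(token(key))
            _put_str(body, value)
        _put_str(body, elem.text or "")
        for child in elem:
            walk(child)
        body.append(END)

    walk(ET.fromstring(xml_text))
    header = bytearray([len(names)])
    for name in names:
        _put_str(header, name)
    return bytes(header + body)


def decode(data: bytes) -> ET.Element:
    pos = 0

    def get_str() -> str:
        nonlocal pos
        n = int.from_bytes(data[pos:pos + 2], "big")
        s = data[pos + 2:pos + 2 + n].decode("utf-8")
        pos += 2 + n
        return s

    count = data[0]
    pos = 1
    names = [get_str() for _ in range(count)]

    def read_elem() -> ET.Element:
        nonlocal pos
        assert data[pos] == START
        elem = ET.Element(names[data[pos + 1]])
        attr_count = data[pos + 2]
        pos += 3
        for _ in range(attr_count):
            key = names[data[pos]]
            pos += 1
            elem.set(key, get_str())
        elem.text = get_str() or None
        while data[pos] != END:
            elem.append(read_elem())
        pos += 1
        return elem

    return read_elem()
```

Because every name is written only once, the saving grows with the number of repeated elements; for a document with only a handful of elements the name table can outweigh the gain.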
The topic of a binary encoding for XML has cropped up before on XML-DEV.
Discussion last year explored the requirements and issues behind the idea.
The interested reader may wish to look at two threads in particular:
"Is there
anyone working on a binary version of XML?", and
"Binary-encoding of XML for communication."
XML's Future Could Depend on Efficiency
In response to these suggestions, St. Laurent clarified that his aim was
the seamless integration of compression with current transmission
and processing mechanisms, rather than any specific technology:
While these various tools for compressing XML are interesting, and use a
wide variety of promising strategies, none of them are currently set up to
be built into a compress-before-transmission/decompress-on-receipt
framework that's invisible to the user.
The WAP approach is probably the closest to what I'm thinking about, but
the WAP forum has control over the entire transmission cycle. Building
support for this binary encoding into WAP devices is easy.
Making compression/decompression work across existing Internet frameworks
is a lot harder...
In a thought-provoking response, John C. Schneider outlined a body of work carried out by MITRE, a
not-for-profit organization that works for the US government. The work, based around a format called
Message Text Format (MTF), parallels many of the W3C efforts to date.
Projects included the development of validating and non-validating parsers, schemas,
validation tools, and an object model. Schneider indicated that one product
was a compression mechanism that could be tweaked for XML:
One of the concepts we devised fits the description you give below and,
with sufficient tweaking, could form the basis of an efficient XML encoding
scheme. The algorithm does not rely on character redundancy and, as such,
works equally well for small information objects that tend to get larger
using algorithms like zip. In addition, its design permits it to be
read/written directly from an appropriately modified DOM implementation
instead of incurring the cost of a separate compression/decompression step.
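Schneider's remark that small information objects "tend to get larger" under zip-style algorithms is easy to check. The short message below is an invented example; the fixed header and checksum that these formats add outweigh any saving on so little input.

```python
import gzip
import zlib

# Hypothetical small message of the kind exchanged between systems.
small = b'<msg id="7">ok</msg>'

gz = gzip.compress(small)
zl = zlib.compress(small)

print("original:", len(small), "bytes")
print("gzip:    ", len(gz), "bytes")
print("zlib:    ", len(zl), "bytes")

# Both outputs are longer than the input they "compress".
assert len(gz) > len(small)
assert len(zl) > len(small)
```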
Schneider saw a more efficient binary encoding of XML as being "inevitable," and
hoped that it would become ubiquitous. Ideally, parsers would be capable of reading both
text and binary encodings, and the exact encoding would be transparent to the user.
Citing previous experience, Schneider stressed
the importance of an efficient encoding:
For XML's long term viability, I believe it is strategically important to
design a more efficient encoding. I'd hate to see XML unseated by a more
efficient format a few years down the road, reducing the importance of the
great XML work that's been done and introducing new interoperability
barriers. While this scenario might seem far fetched today, it occurred
within my customer's community several years ago (even though their original
format was about 10 times smaller than XML).
Commenting on the WAP initiative, Schneider believed that the activity may not
result in a generally applicable standard:
...their current path appears less likely to result in a general
purpose XML encoding for all XML users than if the work was done in an
environment like the W3C or IETF... If my projections about the eventual development of a general
purpose, efficient XML encoding are true, this change in focus may be strategically
important to the long term viability of WAP.
Wrapping Up - What's Inside?
The incorporation of compression into a general "XML infrastructure" is
related to a much wider problem: packaging of XML
documents. For a given document, there is a range of information
useful to an XML processor that is not directly related to the
data it contains. Identifying a compression mechanism is one
example; style sheets and schemas are others. Ideally this
additional information would be available from a separate
packaging mechanism.
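One way to picture such a packaging mechanism is a MIME multipart message that carries the document, its style sheet, and a label naming the compression applied. The sketch below is only an illustration of the idea, not a proposed standard: the X-Compression header, the Content-ID values, and the function names are all hypothetical.

```python
import gzip
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart


def build_package(document: bytes, stylesheet: bytes) -> bytes:
    package = MIMEMultipart("related")

    # The document travels compressed; the label says how to undo it.
    doc_part = MIMEApplication(gzip.compress(document), "xml")
    doc_part["Content-ID"] = "<document>"
    doc_part["X-Compression"] = "gzip"  # hypothetical header

    style_part = MIMEApplication(stylesheet, "xml")
    style_part["Content-ID"] = "<stylesheet>"

    package.attach(doc_part)
    package.attach(style_part)
    return package.as_bytes()


def unpack(raw: bytes) -> dict[str, bytes]:
    parts = {}
    for part in message_from_bytes(raw).walk():
        if part.is_multipart():
            continue  # skip the container itself
        payload = part.get_payload(decode=True)
        if part["X-Compression"] == "gzip":
            payload = gzip.decompress(payload)
        parts[part["Content-ID"].strip("<>")] = payload
    return parts
```

Because the compression method is named in the package rather than fixed by it, a receiver could support several methods side by side.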
Don Park believed that packaging should be the primary focus:
I doubt we will be able to agree on a standard compression
format. Rather, I would like to work on [making the] XML packaging
standard proceed faster to encompass arbitrary encoding
of XML documents and fragments. XML's relationship with
MIME should also be strengthened.
With a generic packaging framework, it should be possible to support multiple compression standards.
In this sort of environment, alternate standards could compete and flourish. Thus we avoid the need to dictate a
specific solution at this early stage. Simon St. Laurent observed that packaging is an area neglected by the W3C:
I definitely agree on the need for XML packaging. I've been disappointed
with the slow progress (is there any?) on packaging at the W3C, and look
forward to seeing more activity.
Looking back over the XML-DEV archives shows that packaging, too, is a recurring topic. Last year saw several threads
relating to the issue, including "Packaging and Hub Documents"
and "Packaging and Related-Resource Discovery."
It would appear that no real progress has been made on this front, although Simon St. Laurent's XML Processing Description Language (see also "Profiling and Packaging XML")
has been a step in the right direction.
Currently involved with activities to improve MIME support for XML content, St. Laurent commented that he believed
more fundamental changes may be required:
I don't think anyone expected that XML might require a rethinking of the
infrastructures we use to carry it, but I'm headed more and more that
direction. It may still be too early in the game, though -- after all, XML
is still a tiny portion of the overall traffic on the Internet.
However tiny XML-based traffic is today, if the current rate of
adoption continues, XML transmission will be ubiquitous before long.
Now may be the best time to consider some wider architectural problems: perhaps it's
time to take a break from producing the unceasing flow of new standards. Considering how
these standards fit together will reinforce our efforts toward the holy grail
of interoperability. Experiences from organizations like MITRE, as well
as feedback from developers "on the factory floor," will be vital.