
Transporting Binary Data in SOAP
by Rich Salz
August 28, 2002
You know the old saying, that a picture is worth a thousand words?
There's an awful lot of binary data out there, and XML is not going to
replace it all or even a significant percentage. After all, what's the
benefit to xmlifying things like MPEGs or program executables?
Binary Data in XML
XML doesn't handle embedded binary data very well. Naive developers
first try to embed the data directly into their document, reasoning that
since Unicode uses all possible byte values, they'll be able to do this.
They realize their mistake as soon as their embedded content has a byte
with a special value like 0x3C (less than) or perhaps 0x26 (ampersand).
The clever naïf might try to fix this by wrapping their content in a
CDATA construct, but that only makes the problem less likely, rather than
removing it. Suppose the content is a SAX library -- it's quite possible
that the CDATA terminator string, "]]>", will show up.
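The CDATA trap is easy to reproduce. What follows is a minimal sketch, not anyone's production code: the helper names wrap_in_cdata and parses are made up for illustration, and the parser is the one in Python's standard library.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text):
    """Naively wrap text in a CDATA section inside a <data> element."""
    return "<data><![CDATA[" + text + "]]></data>"


def parses(document):
    """Return True if the document is well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# CDATA copes with the characters that break direct embedding.
safe = wrap_in_cdata("if (a < b && c) { run(); }")

# Content that mentions the terminator ends the section early; the
# leftover text is then parsed as markup and the document is rejected.
unsafe = wrap_in_cdata('if (s.endsWith("]]>")) { return a < b; }')

print(parses(safe), parses(unsafe))
```

The second document fails because the section closes at the first "]]>", leaving a stray "<" in ordinary content.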
Having lost their innocence to the cruel master of experience, the
developer bites the bullet and encodes their data as Base64 and lets XML
treat it as a string. The problem with this is two-fold. First, it's not
really a string, it's something else. Second, Base64 output is one-third
larger than the data it encodes.
Actually, the combination of those two factors will probably make the
overhead penalty worse. If the developer is using a third-party XML or
SOAP toolkit, it's most likely that the toolkit will return the embedded
data as a string, which means the developer will then have to decode it
themselves. Holding the Base64 string and the decoded bytes at the same
time would result in a (temporary) overhead of 1 1/3 times the original
size -- more than 100%. Unfortunately, while it can be prohibitively expensive (in
terms of message size and memory use), Base64 strings have been the only
approach that works and is portable.
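The arithmetic is simple enough to check. This sketch assumes the toolkit hands back the Base64 text and the application decodes it while still holding that text; the function name base64_cost is hypothetical.

```python
import base64
import os


def base64_cost(payload):
    """Return (encoded size ratio, peak memory ratio while decoding)."""
    encoded = base64.b64encode(payload)   # what travels inside the XML
    decoded = base64.b64decode(encoded)   # what the receiver actually wants
    assert decoded == payload
    size_ratio = len(encoded) / len(payload)
    # While decoding, both the Base64 text and the raw bytes are in memory.
    peak_ratio = (len(encoded) + len(decoded)) / len(payload)
    return size_ratio, peak_ratio


size_ratio, peak_ratio = base64_cost(os.urandom(300_000))
print(f"encoded size: {size_ratio:.2f}x, peak while decoding: {peak_ratio:.2f}x")
```

For a payload whose length is a multiple of three, the encoded form is exactly 4/3 of the original, and the peak is 7/3: the original size plus an overhead of 1 1/3.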
XML in XML
XML is also not good at embedding XML documents inside each other.
There are a number of reasons for this. First, there can be only one
prolog, so you have to force everything to be in the same encoding. While
almost everything is in UTF-8 right now, it's probably only a matter of
time before applications start using encodings optimized for their locale.
Second, the embedded document can't have a DTD, requiring the developer to
do entity expansion, fill in defaults, and so on -- onerous, but not
impossible.
What is impossible is knowing which attributes are XML IDs.
This means that the developer's outer document can't point into parts
of the embedded document. More importantly, it's impossible to enforce
the ID uniqueness constraint on the resulting compound document. For
example, a medical document might define an "id" attribute to be a patient
identifier, and not an XML ID at all. A SOAP-based web service that sent
three medical documents about the same person would probably fool many
generic parsers into marking the compound document as invalid.
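To make the medical example concrete, here is a small sketch of what a generic tool sees. The element names (Body, record, note), the patient value, and the helper duplicate_ids are all invented for illustration; the check simply assumes, as many tools do, that any attribute named "id" is an XML ID.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(root, attr="id"):
    """Return attribute values that appear more than once under root."""
    values = [el.get(attr) for el in root.iter() if el.get(attr) is not None]
    return sorted(value for value, n in Counter(values).items() if n > 1)


# Three hypothetical medical documents about the same patient, each of
# which is perfectly fine on its own.
record = '<record id="patient-42"><note>{}</note></record>'
notes = ("admission", "lab results", "discharge")

compound = ET.fromstring(
    "<Body>" + "".join(record.format(n) for n in notes) + "</Body>"
)

print(duplicate_ids(compound))
```

Each record is unobjectionable in isolation; only the act of embedding them in one compound document manufactures the "duplicate ID".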
In current practice developers usually cross their fingers and make the
following simplifying assumptions:
- Everything is in UTF-8, and nothing else from the prolog matters
- There are no entities or defaults
- Every XML ID attribute is named "id" (or perhaps "ID"), and
they're not that common, so we'll ignore the potential for conflicts
While this method is an ugly hack, it usually works in the real world.
And when it doesn't, our poor developer encodes their embedded XML as a
Base64 string and pays the price described above. It's worse, of course,
because now you have to rescan to create the embedded document and figure
out how to properly associate the two documents with each other. Have you
ever used a DOM implementation to "merge" two XML documents? It's neither
pretty nor easy.
By this point, it should be clear that it's not good to try to embed
arbitrary binary or XML content into another XML document. This is
particularly bad news for SOAP and web services, since SOAP messages are
XML documents with a thin layer -- a SOAP bubble, perhaps? -- around
them.
SwA, DIME, BEEP
The right approach is to pull the embedded content out of the XML
container, and replace it with a link. Fortunately, SOAP defines the
href attribute that makes such linking fairly easy. For example,
a stock service could easily refer to the latest SEC filing and set of
indictments:
<SOAP-ENV:Envelope>
<SOAP-ENV:Body>
<tns:Ticker>WCOM</tns:Ticker>
<tns:Price>0.32</tns:Price>
<tns:Filing href="http://edgar.sec.us.gov/10k.cgi?s=wcom"/>
<tns:Indictments href="http://alcatraz.doj.us.gov/search/wcom"/>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
(Don't waste your time trying the href values; I just made them
up.)
Usually it's necessary to bundle the data with the message. When this
is done, we typically call the SOAP message the payload and the
data that used to be embedded the attachments. There are three
common formats for doing this. In no particular order, they are:
- SOAP Messages with Attachments (SwA), which uses multi-part MIME
- DIME, a binary packaging format created by Microsoft
- BEEP, a very powerful facility by protocol expert Marshall Rose
We'll look at each of these in turn, starting with SwA for the rest of
this column, and DIME and BEEP in subsequent months. While "direct
handling of binary data" was explicitly declared to be out of
scope for the W3C SOAP working group, this should change once SOAP 1.2
enters the standardization track. Using one of the existing mechanisms
seems the most reasonable way to move forward.
SOAP Messages with Attachments
SOAP Messages with Attachments is a W3C Note, just like SOAP 1.1. It was
published in December of 2000, seven months after the SOAP Note. The name
turns out to have been unfortunate, having usurped the obvious generic term.
SwA is very simple: the first part of the multipart MIME message is
the XML SOAP document; the subsequent parts contain the attached data.
The bulk of the document addresses URI resolution, particularly relative
URIs. If we ignore them and always use absolute URIs (the current
recommendation), the specification becomes even simpler. In the example
below, we'll use email-like Message-IDs as our identifiers, as they have
the convenient properties of being globally unique and absolute. We'll
just attach a prefix to a single Message-ID to distinguish the parts.
The first bit is to properly declare the MIME content type; as is
common with MIME multipart, the hardest part will probably be determining
the message boundary:
Content-Type: multipart/related; type="text/xml";
    boundary="xXxXxXx";
    start="<start-AA11234455.22@www.datapower.com>"

Here is the movie you requested.
Thank you for patronizing the MPAA on-line store.

--xXxXxXx
Content-Type: text/xml; charset="UTF-8"
Content-ID: <start-AA11234455.22@www.datapower.com>

<SOAP-ENV:Envelope>
<SOAP-ENV:Body>
<tns:RunningTime>120</tns:RunningTime>
<tns:Rating>PG</tns:Rating>
<tns:Movie href="cid:part1-AA11234455.22@www.datapower.com"/>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

--xXxXxXx
Content-Type: application/mpeg
Content-Transfer-Encoding: 8bit
Content-ID: <part1-AA11234455.22@www.datapower.com>

.....
--xXxXxXx--
There are a couple of things to notice. First, if you follow the
techniques I used here with stylized use of Message-IDs and
Content-ID headers, it should be drop-dead easy to generate and
parse SwA messages with the help of a MIME toolkit. Second, note that
HTTP forms can be sent using MIME multipart, and if a "file upload" is
involved, then they have to be. This means that all web servers
probably already have the necessary MIME machinery built in. Any client
with a modern mailreader (one capable of sending attachments, if not doing
the whole GUI thing) should be in the same situation.
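As a rough illustration of how little machinery is involved, the sketch below builds a message shaped like the example above by hand, so the attachment stays as raw bytes, and then takes it apart with the email package from Python's standard library. The function names, the namespace URI bound to tns, and the fake MPEG bytes are assumptions made for the sake of a runnable example; the SOAP-ENV URI is the SOAP 1.1 envelope namespace. The sketch also marks the attachment as binary rather than 8bit, since raw bytes carry no line structure.

```python
from email import message_from_bytes
import xml.etree.ElementTree as ET

BOUNDARY = "xXxXxXx"
MESSAGE_ID = "AA11234455.22@www.datapower.com"
TNS = "urn:example:movies"  # hypothetical namespace for the tns prefix

ENVELOPE = f"""<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:tns="{TNS}">
  <SOAP-ENV:Body>
    <tns:RunningTime>120</tns:RunningTime>
    <tns:Rating>PG</tns:Rating>
    <tns:Movie href="cid:part1-{MESSAGE_ID}"/>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def build_swa(envelope_xml, attachment):
    """Assemble a two-part multipart/related message as raw bytes."""
    head = (
        "MIME-Version: 1.0\r\n"
        'Content-Type: multipart/related; type="text/xml";\r\n'
        f' boundary="{BOUNDARY}";\r\n'
        f' start="<start-{MESSAGE_ID}>"\r\n'
        "\r\n"
        f"--{BOUNDARY}\r\n"
        'Content-Type: text/xml; charset="UTF-8"\r\n'
        f"Content-ID: <start-{MESSAGE_ID}>\r\n"
        "\r\n"
        f"{envelope_xml}\r\n"
        f"--{BOUNDARY}\r\n"
        "Content-Type: application/mpeg\r\n"
        "Content-Transfer-Encoding: binary\r\n"
        f"Content-ID: <part1-{MESSAGE_ID}>\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{BOUNDARY}--\r\n".encode("ascii")
    return head + attachment + tail  # the attachment is not re-encoded


def parse_swa(raw):
    """Return (start Content-ID, {Content-ID: body bytes}) for a message."""
    msg = message_from_bytes(raw)
    parts = {}
    for part in msg.get_payload():
        cid = part["Content-ID"].strip().strip("<>")
        parts[cid] = part.get_payload(decode=True)
    start = msg.get_param("start").strip("<>")
    return start, parts


def resolve_movie(raw):
    """Follow the tns:Movie href from the SOAP payload to its attachment."""
    start, parts = parse_swa(raw)
    envelope = ET.fromstring(parts[start])
    href = envelope.find(f".//{{{TNS}}}Movie").get("href")
    return parts[href[len("cid:"):]]  # cid:foo names the part <foo>


raw = build_swa(ENVELOPE, b"\x00\x00\x01\xba\xff\x80 fake MPEG bytes")
print(resolve_movie(raw))
```

Note what the sketch does not do: it never checks that the boundary string is absent from the attachment, which is exactly the "hardest part" mentioned earlier, and it holds the whole attachment in memory.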
So this is what's good about SwA: it's simple, and if the code isn't
already on the platform, it's not onerous to get it. In spite of this, it
doesn't seem to have taken off. There are a couple of technical reasons
for this. The first is a minor one: MIME can be heavyweight, and might
not be appropriate for small or embedded devices. While this is true for
full-fledged MIME toolkits, a custom library for SwA-style MIME use need
not be big. (For a sense of historical perspective, the same complaints
used to be raised about ASN.1/DER libraries -- a binary format used by PKI
-- and there seem to be no problems getting the necessary bits of those
onto devices like smartcards.)
The second drawback to SwA is that it can't handle data streaming.
While the ability to send the data in chunks wasn't part of our original
problem statement, once you start using it in the real world for things
like multi-media data, it's clear you don't want to require the sender or
receiver to have to buffer the entire attachment before processing it.
In order to address the streaming and the implementation footprint
issues, Microsoft developed the DIME protocol, which is progressing
through the IETF. MS clearly sees DIME as more useful than SwA; although
MS was one of the original SwA authors, SwA is supported in only one MS
toolkit, while DIME is part of MS's Global XML Architecture.
In next month's column, we'll examine DIME.