Distributed XML
by Edd Dumbill
September 06, 2000
The role played by XML in the next-generation Web
Introduction
It's not cool to be different, at least not where Internet
computing is concerned. Despite widespread agreement about certain
subsections of Internet technology -- SMTP for email, for instance --
many services and sources of data remain desperately unconnected. The
same is true of desktop computing. Although office suites
provide some degree of integration, exchanging data between
applications from different vendors is very frustrating. Add the Web
and email into the mix and the problem gets worse.
The more I use and rely on computers, the more I realize I ought to
stand up for my rights. I'm abused and tormented by a patchwork of
programs that hardly work together, that trap my data in places I
don't want, and make me adopt unnatural working styles. The busier I
get, the more I get buried in information overload, the more I realize
this is happening, and the more I want it fixed.
XML offers hope for escape from the current situation of
fragmentation and disarray. In this talk I will focus on two
technologies that look as though they'll have a big impact in this
area: SOAP and RDF. I'll also talk about the shift in architecture,
from centralized to decentralized, that we'll need to embrace as the
world of Internet computing continues to grow.
The Dream
The dream that drives the integrated vision of the future is of a
universal homogeneous view of information. No special cases or
peculiar formats but a universally accessible "data bus"
over the realm of the Internet, and by extension, all your private
data sources. Let's look at the components of such a system.
Fundamentally, they are a universal addressing scheme and a universal
data format. In many ways these represent the essential components of
a universal computer.
The addressing scheme, Uniform Resource Identifiers (URIs), has
been in operation over the web for a long time now. The data format,
XML, has been around for nearly three years, and it's clearly
providing many benefits in reducing the translation overheads of
communication within and between organizations.
We're now close to the conditions in which computing can be
performed over the whole span of the Internet. However, every computer
requires instructions and a language in which to program. These are
the problems we need to work on now in order to realize the greater
promise.
One of the luxuries of being able to address a conference full of
XML developers is the chance to urge you to check out some new ideas.
I hope that some of this dream will catch in your minds.
The Universal Computer
I've already mentioned that URIs are pretty much set in place as an
addressing scheme, and it looks increasingly like HTTP is becoming the
transport for information and instructions in our "universal
computer". Yet this still leaves a couple of problems
unsolved:
How do we encode the data? Plain old XML won't do by
itself -- how do we say something is an integer, something
represents a "Person", etc.?
How do we encode instructions? If we want to cause another
computer somewhere else on the web to perform a function, how do we
express that?
These questions lead me to two technologies that go some way toward
answering them: RDF and SOAP.
Characterizing RDF
RDF, the Resource Description Framework, is a technology invented
at the W3C. It was one of the earliest XML applications, and
definitely the first to use XML namespaces in earnest. For various
reasons, its rise has not been the up-and-up achieved by XML itself
and, more recently, XSLT, but rather a slow, steady expansion. Its
immediate user community has not been e-commerce, which has also been
an important factor.
What is RDF for, and what are its qualities? Well, it does what it
says on the package: RDF is a language for describing things. Well,
one might object, you can describe things in XML anyway, so why do you
need more? The answer is that XML is too flexible, and you need
conventions. As an example, there are many ways to indicate the color
of something in XML. Try to describe a "red car":
<car color="red" />
<car><color>red</color></car>
<car color="#cc" /><color id="cc" shade="red" />
I just came up with these three on the spur of the moment; there
are lots of other ways you could write that fact down. What RDF does
is to invent a standard way of interpreting XML-encoded descriptions
of things, or "resources", which turns out to be very
useful.
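To see why conventions matter, here is a minimal Python sketch (mine, not part of any RDF toolkit) that pulls the color out of each of the three encodings above. The wrapping `doc` element and the function names are assumptions made for illustration. Each encoding needs its own extraction code, yet all three boil down to the same single statement.

```python
import xml.etree.ElementTree as ET

# The three ad hoc encodings of "red car" from the text. The third has two
# top-level elements, so it is wrapped in a hypothetical <doc> root to make
# it well-formed for the parser.
AS_ATTRIBUTE = '<car color="red" />'
AS_CHILD = '<car><color>red</color></car>'
AS_REFERENCE = '<doc><car color="#cc" /><color id="cc" shade="red" /></doc>'


def color_from_attribute(xml_text):
    """The color is an attribute on the car element."""
    return ET.fromstring(xml_text).get("color")


def color_from_child(xml_text):
    """The color is the text of a child element."""
    return ET.fromstring(xml_text).findtext("color")


def color_from_reference(xml_text):
    """The car points at a separate color element by id."""
    root = ET.fromstring(xml_text)
    ref = root.find("car").get("color").lstrip("#")
    for color in root.iter("color"):
        if color.get("id") == ref:
            return color.get("shade")
    return None


# Whatever the syntax, each document asserts the same thing, so the set
# collapses to a single (thing, property, value) statement.
statements = {
    ("car", "color", color_from_attribute(AS_ATTRIBUTE)),
    ("car", "color", color_from_child(AS_CHILD)),
    ("car", "color", color_from_reference(AS_REFERENCE)),
}
```

A shared convention for interpreting the markup removes the need for a separate extractor per document style.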
Further, RDF employs URIs as a naming scheme. This means that
there's one naming convention, which has the property of being able to
generate globally unique names for your resources. This is another
important feature RDF needs to be able to model the real world.
(Incidentally, it also shines the spotlight on the fact that naming
anything is very hard indeed!)
One consequence of using URIs is that it enables RDF to be used in a
decentralized fashion: unique names need no context to qualify them,
so anyone anywhere can write descriptions that involve anything
anywhere.
Another key feature of RDF is that it's
openly extensible. Unlike plain old XML, there's no sense of
constraining what the document can describe by a DTD or schema. This
means that if you get a description of something from someone, and
want to add your own observations to it, you can do so without having
to agree to a change in the schema. For certain classes of
application -- particularly annotation and metadata applications --
this is a great advantage. It also means that systems using RDF run
less risk of getting stuck in a legacy file format situation: the
ability to cope in a forward-compatible manner usually comes
hand-in-hand with using RDF.
Essential RDF
Let's take a look at some of the essentials of RDF. At its most
basic, RDF is a way of modeling things. A lot of XML technologies tend to
start at the syntax and go from there (following the lead of XML
itself). To get a good understanding of RDF, it's better to start with
the model.
Everything in RDF can be represented by a graph with nodes and
arcs. Each node is a resource, and each arc represents a property.
Both properties and resources are named with URIs. What does this
mean? It means that the whole Web and beyond (in short, anything which
you can name) is within the scope of an RDF description. In effect,
RDF graphs boil down into a "soup" of logical assertions.
To make this more concrete, let's have a look at an RDF description
of part of an email inbox.
[Figure: Graph of Email Inbox]
The email resource itself is named by the mid:339.C2@foo.com URI (derived
from the Message-Id header field), and it has various properties that
are either literal values or resources themselves. In particular,
we've decided to name the author of the email by their email address
and give them a "real name" property.
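As a rough illustration of the model, the email description can be written down as a set of (resource, property, value) statements, one per arc in the graph. This is a sketch in Python, not RDF syntax. Only the mid: URI comes from the example; the author's address, the property URIs, and the literal values are hypothetical placeholders, and a real RDF model would also distinguish literal values from resources, which plain strings here do not.

```python
# Each statement is a (subject, property, value) triple: one arc of the graph.
# Only the mid: URI comes from the text; every other URI and value below is
# a made-up placeholder.
EMAIL = "mid:339.C2@foo.com"
AUTHOR = "mailto:author@example.org"   # hypothetical address
EX = "http://example.org/terms#"       # hypothetical vocabulary

graph = {
    (EMAIL, EX + "author", AUTHOR),             # value is another resource
    (EMAIL, EX + "subject", "Lunch?"),          # value is a literal
    (AUTHOR, EX + "realName", "A. N. Author"),  # property of the author node
}


def properties_of(graph, resource):
    """Return the arcs leaving one node as a {property: value} dict."""
    return {prop: value for subj, prop, value in graph if subj == resource}
```

Calling `properties_of(graph, AUTHOR)` returns the one "real name" arc, while `properties_of(graph, EMAIL)` returns the arcs for the message itself.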
RDF, as written in XML, is a syntactical formulation of these
graphs.
So far, this is fairly straightforward. If it's so neat, what can
it do?
Of course, RDF is just a way of representing data, and you need
query engines to get answers out of it. This is an
area
that's growing at the moment, and there are several great open source
projects available. Some of them are based on Prolog, which is great
if you're inclined that way, but others are C- and Java-based, and so
closer to mainstream programming languages.
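To give a feel for what the simplest kind of query engine does, here is a minimal Python sketch that matches a pattern against a set of statements, with `None` standing for "anything". It is not the API of any of the projects mentioned; the function names, the vocabulary URI, and the sample data other than the mid: URI are assumptions.

```python
EX = "http://example.org/terms#"   # hypothetical vocabulary

# A tiny store of (subject, property, value) statements. Everything except
# the mid: URI is a made-up placeholder.
graph = {
    ("mid:339.C2@foo.com", EX + "author", "mailto:author@example.org"),
    ("mailto:author@example.org", EX + "realName", "A. N. Author"),
}


def match(graph, subject=None, prop=None, value=None):
    """Yield every triple that fits the pattern; None matches anything."""
    for s, p, v in graph:
        if subject in (None, s) and prop in (None, p) and value in (None, v):
            yield (s, p, v)


def real_name_of_author(graph, message):
    """Two-step query: message -> author resource -> author's real name."""
    for _, _, author in match(graph, subject=message, prop=EX + "author"):
        for _, _, name in match(graph, subject=author, prop=EX + "realName"):
            return name
    return None
```

The second function chains two patterns through a shared URI, which is the basic move that fuller query languages generalize.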
The real power is being able to join together graphs from multiple
data sources. The use of URIs for names enables them to act as
connectors over a potentially infinite data source. An RDF processor
could then chase down bits of information by following these
links. Imagine layering an RDF graph over Amazon.com for instance. You
could construct the "is-similar-author-to" property over all
book authors by using the detail from the "customers who bought
this book also bought..." information they present.
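The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how any bookseller exposes its data: the `urn:example:` identifiers, the vocabulary URI, and the property names are all hypothetical, and each book is assumed to have a single author.

```python
EX = "http://example.org/terms#"   # hypothetical vocabulary

# Source 1: who wrote which book (hypothetical identifiers throughout).
catalog = {
    ("urn:example:book:1", EX + "author", "urn:example:author:a"),
    ("urn:example:book:2", EX + "author", "urn:example:author:b"),
}

# Source 2: "customers who bought this book also bought..." links.
also_bought = {
    ("urn:example:book:1", EX + "alsoBought", "urn:example:book:2"),
}

# Both sources name books with the same URIs, so joining them is set union.
merged = catalog | also_bought


def similar_authors(graph):
    """Derive is-similar-author-to arcs by following also-bought links."""
    # Assumes one author per book, which keeps the sketch short.
    author_of = {s: v for s, p, v in graph if p == EX + "author"}
    derived = set()
    for s, p, v in graph:
        if p == EX + "alsoBought" and s in author_of and v in author_of:
            a, b = author_of[s], author_of[v]
            if a != b:
                derived.add((a, EX + "isSimilarAuthorTo", b))
    return derived
```

Neither source alone can produce the derived property; it only appears once the two graphs are joined on their shared names.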
If we take a step back from this connected chain of information for
a second, it might sound familiar: a standard way of writing
information, connected together via universally unique names. In many
ways, RDF does the same thing for computers that HTML does for
humans. The HTML-web has enabled humans to chase down various bits of
information through links and query engines: widespread RDF on the web
will enable computers to do that. What matters most of all is the
linking through universal names.
RDF Processing Models
What kinds of computation can be done with RDF? How will this web
of information actually work? This is where we definitely walk into
the world of prototypes and experimentation. The basic method of
processing generally involves aggregation as a first step. Here, RDF sources are
mined for their descriptions, and these are bundled into a local store of
some sort. From there, queries can be performed on the
data.
A slightly more sophisticated
architecture may involve some kind of dynamic description generation
or querying. Let's use Amazon.com as our example again. If I wanted to
run a query on Harry Potter books, in order to see which
books are in a similar genre, I do not want to import Amazon.com's
entire catalog. Furthermore, Amazon.com doesn't want me to import
their entire catalog. Instead, they may just give me access to a
virtual graph, which I can query without having to
construct it.
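One way to picture a virtual graph is as an object that answers patterns on demand rather than holding every statement. The Python sketch below is only that, a picture: the `VirtualGraph` class, the `describe_book` function, the identifiers, and the catalog contents are all hypothetical, and the dictionary merely stands in for a remote catalog that the client never downloads.

```python
EX = "http://example.org/terms#"   # hypothetical vocabulary

# Stand-in for the remote catalog. The client never iterates over this;
# it only asks about one subject at a time.
_REMOTE_CATALOG = {
    "urn:example:book:1": {"genre": "fantasy"},
    "urn:example:book:2": {"genre": "fantasy"},
    "urn:example:book:3": {"genre": "cookery"},
}


def describe_book(subject):
    """Generate the statements about one subject when asked for them."""
    for prop, value in _REMOTE_CATALOG.get(subject, {}).items():
        yield (subject, EX + prop, value)


class VirtualGraph:
    """Answers triple patterns on demand instead of holding all the triples."""

    def __init__(self, describe):
        # describe(subject) -> iterable of (subject, property, value).
        # In practice this might be a network call; here it is any callable.
        self._describe = describe
        self.lookups = 0

    def match(self, subject, prop=None):
        """Return the triples about one subject, optionally for one property."""
        self.lookups += 1
        return [t for t in self._describe(subject) if prop in (None, t[1])]
```

A client would create `VirtualGraph(describe_book)` and call `match` for just the books it cares about, so the cost grows with the number of questions asked rather than with the size of the catalog.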
In general, when processing RDF, the logic
tends to be performed in one place, while the source data can be
widely distributed.
RDF Infrastructure Requirements
As with most technologies, RDF requires other things to be in
place to support its widespread use. These include:
Vocabularies: there is little use in being able to talk to each other
unless we understand
what we mean when we use a particular phrase. If I can retrieve the
fact that the "car is red", I don't really have any useful
information unless I know precisely what "car",
"red", and "is" mean.
Query languages: once
we have constructed our database of information, we need query
languages and standard APIs to use the data. This is an area in which
active development is being pursued in RDF, but we are still some way
from having something as mature as SQL.
Data stores: developers constructing systems using RDF shouldn't have
to worry about how they will store their data; storage needs to be a
"drop-in" component. Like querying and APIs, this is an
area under active development, and projects like R.V. Guha's rdfdb and
Dave Beckett's Redland show promise.
Characterization: this is a very interesting issue
concerning query and inference-based systems that use RDF. Where are
the bounds of what I know about? How do I find out what other people
know about, and how can I express those bounds? Issues like this
become important when attempting to link up multiple sources of
information.