Building the Semantic Web
by Edd Dumbill
March 07, 2001
This article is adapted from the closing keynote I delivered at
Knowledge Technologies 2001.
Introduction
The people working under the broad umbrella of the
Semantic Web come from many diverse communities, from the Web-focused
to experienced researchers in the fields of artificial intelligence
and knowledge representation. Ultimately the skills of all those
involved will be required, and it's definitely beyond the scope of any
one group to provide the expertise necessary to build the ultimate
Semantic Web.
For me, the key thing about the Semantic Web is the word
"Web". It's our essential starting point, and the Web at large is the
ecology in which the primordial Semantic Web must grow. I spend most
of my time working with the Web, as a developer and a writer, and I am
also involved with the community of developers and publishers that
use the Web.
So, as I approach the Semantic Web (or "SW" from here on), I'm
always asking the question "how do we get this started?" There are
many interesting and exciting possibilities in the realms of logic and
proofs, but getting them running on the Web must be preceded by
getting more basic machine processible content out there. The evolving
form of the SW has to crawl before it can run.
In this article I introduce the SW vision and explore the practical
steps that we need to be taking to build it.
What is the Semantic Web?
The essential aim of the SW vision is to make Web information
practically processible by a computer. Underlying this is the goal of
making the Web more effective for its users. This increase in
effectiveness is constituted by the automation or enabling
of things that are currently difficult to do: locating content,
collating and cross-relating content, drawing conclusions from
information found in two or more separate sources.
In the software world we can often get so enthusiastic about the
systems that we're creating that we stray from a focus on the user's
requirements. One of the great things about the Web is that it's
unforgiving when we ignore the user. Create a site that's hard to use
and nobody will come. Create a technology for page markup that's
difficult to grasp and nobody will use it. In fact, you might see the
creation and implementation of the SW as a near impossible task: it's
still difficult to get people to use as little metadata as the
<title> tag in their web pages.
Clearly, to get off the starting blocks, the SW has to offer enough
in reward to make it worth people's time to learn new skills and to
more carefully deploy their content on the Web.
So, that's the vision. A Web that machines can understand to make
our lives easier. If you accept that the end purpose of the SW is to
make your life easier, then the use cases spring from your
frustrations. Some of the common problems we want to solve on the Web
revolve around interoperability of data. Synchronize your Palm
Pilot's schedule with a web page, have some kind of universal view
over your email, documents, and web browsing history. These problems
are currently unsolved because of the fragmentation of our data due to
custom and proprietary data formats. Providing an integration of these
is an obvious use case.
As well as meeting some obvious use cases, there's a degree of
serendipity in the SW work. There's a feeling that says, "if only we
got all these sources of information tied together, then exciting
things would happen!" Building the SW is a research and development
project, not a manufacturing process. There'll be some dead ends, and
there'll be some discoveries of exciting and unforeseen
proportions.
Speaking personally, I have a fundamental excitement at being able
to recover and integrate my data from disparate sources and
proprietary formats. This springs from constraints on my time, the
difficulty of finding information, and the redundancy of having my
data scattered across multiple devices. In what follows I give an
explanation of each layer in Tim Berners-Lee's vision of the SW: each
layer gives progressively more value; each is exciting in its own
right. My current aims for the SW result purely from the
implementation of some of the lower layers.
Overview of the Semantic Web
The World Wide Web Consortium has recently started a specific
Activity to address SW development. Under the leadership of Eric
Miller, its remit is threefold: to develop and address issues with RDF
and RDF Schema; to coordinate with other W3C groups using RDF; and to
undertake and encourage "advanced development" of SW software.
This last aim is the thing I find most exciting. "Advanced
development" entails the W3C working with developers in an open
fashion to encourage SW-related projects and to give them a
focus. Early projects that might cluster around this mandate include some
work inside the W3C, such as RDF wrappers for CVS repositories, and
potentially some existing community-based projects could have a home
there. Essentially, "advanced development" is a recognition of what
has happened to the RDF world in the last year. While it essentially
languished for a while at the W3C in terms of formal activity, a
community has grown up, with some very encouraging results.
The W3C has put forward a very clear architecture for the SW, described
by Berners-Lee at XML 2000 in Washington last year. This
architecture is cleanly layered, starting with the foundation of
URIs and Unicode. On top of that sits syntactic interoperability in
the form of XML, which in turn underlies what I like to think of as
the data interoperability layer, RDF and RDF schemas. Those layers sum
up most of the SW that's presently available in implementation
form. And without looking further up the SW stack, an extraordinary
amount of utility can and has been obtained from just those
layers.
You'll notice that digital signatures run right up the side of the
stack, emphasizing their widespread utility. At each stage they allow
content from a layer to be labeled with an assured provenance. Digital
signatures are critical to both the SW and the growing use of XML in
other message exchanges. From the basic act of signing some RDF
assertion ("I said this!") to signing proofs, they add a level of
assurance to the Web that hasn't existed thus far.
On top of RDF lie ontologies, which allow the further description
of objects and their interrelations, past the basic class-property
descriptions enabled by RDF Schema. The W3C in conjunction with DARPA
and the European Union is pursuing the development of languages in
this area right now. Ontologies provide the ability to say "my world
is like this" and are the foundation that will enable programs to
reason about different worlds and environments and make connections
between them.
The logic layer will provide an interoperable language for
describing the sets of deductions one can make from a collection of
data -- how, given the world we've now neatly described, we can make
connections and derive new facts about it. The proof language will
provide a way of describing the steps taken to reach a conclusion from
the facts. These proofs can then be passed around and verified,
providing short cuts to new facts in the system without having each
node conduct the deductions themselves.
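To make the split between the logic and proof layers concrete, here is a
minimal sketch in Python. It is not any proposed W3C language: the
ex: names, the single transitivity rule, and the proof record format
are all assumptions made up for illustration. One function searches for
new facts and records each step; the other only checks a recorded proof,
which is the cheaper job a receiving node would do.

```python
# Hypothetical facts as (subject, predicate, object); the ex: names are made up.
FACTS = {
    ("ex:wheel", "ex:partOf", "ex:car"),
    ("ex:car", "ex:partOf", "ex:fleet"),
}

def apply_rule(a, b):
    """One illustrative rule: (x partOf y) and (y partOf z) give (x partOf z)."""
    if a[1] == b[1] == "ex:partOf" and a[2] == b[0]:
        return (a[0], "ex:partOf", b[2])
    return None

def deduce(facts):
    """Forward-chain until nothing new appears, recording each step taken."""
    known = set(facts)
    proof = []
    changed = True
    while changed:
        changed = False
        for a in sorted(known):
            for b in sorted(known):
                c = apply_rule(a, b)
                if c is not None and c not in known:
                    known.add(c)
                    proof.append({"premises": (a, b), "conclusion": c})
                    changed = True
    return known, proof

def verify(facts, proof):
    """Check a received proof step by step, without searching for deductions."""
    known = set(facts)
    for step in proof:
        a, b = step["premises"]
        if a not in known or b not in known:
            return False  # a premise is neither a fact nor an earlier conclusion
        if apply_rule(a, b) != step["conclusion"]:
            return False  # the rule does not yield the claimed conclusion
        known.add(step["conclusion"])
    return True

derived, proof = deduce(FACTS)
```

A node that receives `proof` along with the original facts can call
`verify(FACTS, proof)` and accept the new fact without repeating the
search that `deduce` performed.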
The SW vision is that once all these layers are in place, we will
have a system in which we can place trust that the data we are seeing,
the deductions we are making, and the claims we are receiving have
some value. That's the goal: to make a user's life easier by the
aggregation and creation of new, trusted information over the Web.
Goals for Building the SW
Now that we've seen the plan, let's look at how it's going to be
built. Obviously, the technology needs to be invented. But technology
without adoption is dead. What do SW advocates need to do to reach the
critical points along the road to adoption?
Eric Miller, SW Activity Lead, certainly has his job cut out. While
there are encouraging signs of a groundswell in support for RDF, it
mostly has a bad name and reputation at the moment. Take this along
with the confusion that XML namespaces, an underlying layer, generates
(and never mind that many US programs can't even work with European
Latin character sets, much less Unicode) and there are some steep
slopes to climb.
So one of the first aims of SW advocates must be to promote
understanding of what they're doing, at both low and high levels. RDF
is more than an obscure or verbose way to write what you could do
easily in XML. There are reasons for using it. Naming everything with
URIs is in fact very powerful, but the confusion about the use of the
http: prefix for unretrievable resources needs to be
cleared up.
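A small sketch may help show why URI naming matters. It uses plain
Python tuples rather than any RDF library, and every URI in it is a
hypothetical example.org name. Nothing is ever fetched from these
addresses: the http: strings serve purely as globally unique names,
which is the point that tends to cause confusion.

```python
# Hypothetical vocabulary and resource URIs, for illustration only.
EX = "http://example.org/terms#"
BOOK = "http://example.org/books/1"
ALICE = "http://example.org/people/alice"

# Two independent sources, each publishing (subject, predicate, object) statements.
publisher_site = {
    (BOOK, EX + "title", "A Sample Book"),
    (BOOK, EX + "author", ALICE),
}
review_site = {
    (BOOK, EX + "rating", "4"),
    (ALICE, EX + "name", "Alice"),
}

# Both sources name the book with the same URI, so merging is just set union.
merged = publisher_site | review_site

def describe(graph, subject):
    """Collect every property of one subject as predicate -> sorted objects."""
    result = {}
    for s, p, o in graph:
        if s == subject:
            result.setdefault(p, []).append(o)
    return {p: sorted(objs) for p, objs in result.items()}

book = describe(merged, BOOK)
```

After the merge, `book` holds the title and author from one source and
the rating from the other, with no mapping step between the two. In
this sketch, at least, that is something element names local to each
source's own XML format would not give you for free.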
But it would be a mistake to focus on getting all developers (much
less users) to understand fundamentally every layer of this stack. The
fact is that most developers use prepared modules to do their
construction work; only a few are extreme enough to bake their own
bricks. An aid and impetus to getting understanding is to get
implementation. It's very reasonable for people to ask, "what does
this do for me?" about a new technology. Implementations can speak
louder than a thousand specifications.
Implementations fall into two categories: (1) deployment of SW
technologies in a vocabulary or framework and (2) software tools. The
growth in basic RDF tools over the last year has been very
pleasing. These tools are starting to reach the level of maturity at
which I would consider basing an application on one or two of
them. Likewise the deployment of RDF in vocabularies like PRISM and
RSS is encouraging and has reaped particular benefits that straight
XML serializations often miss.
We should be
careful not to restrict SW technologies to just those explicit
layers in Berners-Lee's idealized
diagram. There's obviously a difference between what is on the Web,
and what is in the diagram (HTML is not mentioned, for instance).
The beauty of XML is that it's in the
perfect place to act as a bridge. HTML (or more properly XHTML) can be
semantically decorated by means of things like the class
attribute, and XSLT can be used to extract RDF. Likewise, there are
other semantic applications, such as Topic Maps, that are pure XML
applications. Are these to be excluded from the SW? No, XML provides
a bridge.
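As a rough illustration of that bridge, the sketch below pulls
RDF-style statements out of an XHTML page by reading class
attributes. It uses Python's standard library in place of the XSLT
mentioned above, and the mapping from class names to Dublin Core
properties is an assumed convention invented for this example, not an
established one. The page URI is also hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Building the Semantic Web</title></head>
  <body>
    <h1 class="title">Building the Semantic Web</h1>
    <p>by <span class="creator">Edd Dumbill</span>,
       <span class="date">2001-03-07</span></p>
    <p class="intro">Not every class carries metadata.</p>
  </body>
</html>"""

# Assumed convention: these class names stand for Dublin Core properties.
DC = "http://purl.org/dc/elements/1.1/"
CLASS_TO_PROPERTY = {
    "title": DC + "title",
    "creator": DC + "creator",
    "date": DC + "date",
}

def extract_triples(xhtml_text, page_uri):
    """Return (subject, predicate, object) tuples for recognized class names."""
    root = ET.fromstring(xhtml_text)
    triples = []
    for element in root.iter():  # document order
        for class_name in element.get("class", "").split():
            prop = CLASS_TO_PROPERTY.get(class_name)
            if prop is not None:
                text = "".join(element.itertext()).strip()
                triples.append((page_uri, prop, text))
    return triples

triples = extract_triples(PAGE, "http://example.org/article")
```

Class names the mapping does not recognize, such as `intro` here, are
simply skipped, so ordinary presentational markup and semantic
decoration can live in the same page.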
Picture RDF as providing an interoperable data bus for the SW.
Some data sources may need a converter to connect, but it doesn't stop
them connecting. And once they're patched in, there's a lot of
potential in the resulting integration.
So the W3C has to promote understanding and implementation among
the community. What about money? Surely you can't reach critical mass
without there being money in it?
Yes, there has to be commercial value somewhere down the line;
business is after all about providing services to users. But we ought
to be wary of the effect of premature
intense commercial interest. On the one
hand, look at the W3C's greatest successes: the Web itself was built
while nobody in particular was looking.
XML 1.0 developed similarly: "fast, low and
under the radar," as Tim Bray likes to say. On the other hand, the effect of large-scale
corporate interest on XML Schema has been significant, making the end
result late and obviously overcomplicated, a product of
design by committee.
Recipes for Success
What does getting the SW right entail? There's a lot we can learn
from the existing Web itself, which has been outrageously
successful. As the SW is to be built on top of the Web, many of its
characteristics are there as a base and should be continued. The Web
provides the ecology in which the SW must thrive, and which it must not destroy. So what
are these characteristics?
- Simple protocols, concepts and syntax: the easier the
component parts of the SW are to learn, the quicker they will spread
in adoption. Of course there is a tension here, but on the Web
widespread adoption is something that can be set against
complexity. There is ultimately more
power in a simple technology universally adopted than a more
powerful one with patchy or little adoption.
- Low barrier to access: the SW should be something which
normal users have easy access to, in the same way that it's very
easy to read the Web, and relatively easy to set up and publish a
web page. We run into tool-dependencies here, but that's not a
blocker, as most non-HTML savvy folk use an authoring tool to
publish. The point is that SW technology must become
commoditized.
- Tangible utility: this may seem obvious, but the Web
actually does something people want. There's a danger with the SW,
as with any technology, that its developers get carried away with
ideas that end up being clever but hardly useful. The use cases for
the SW must begin at home and describe practical problems.
So is this a private W3C party? Judging by the way the new SW
Activity is set up, the W3C has recognized it's not and wishes open
community involvement in the effort. The importance of this community
should not be underestimated. Over the last year there have been at
least two community-driven efforts already building the SW that've
caught my attention. Their use cases in each instance described
practical problems that the developers had to solve to help in their
work.
RDDL,
covered recently in XML.com, allows developers to place a
machine-readable description document at the end of a namespace URI to
allow processors to discover resources related to a namespace. RSS 1.0
is a web content metadata distribution format. Its extensibility
allows it to be used in many situations far beyond its original use
cases. Both these projects fill in a little bit of the picture for the
SW and represent chunks of what is to come. In the context of success
for the SW, they're notable because they solved direct needs and
extensibly allow reuse and expansion into areas that the designers
didn't foresee -- a direct reflection of the development of the Web
itself.
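The extensibility described here can be sketched with a small, made-up
feed in the general shape of RSS 1.0. The channel, the story, and the
author name are invented for illustration. The namespace URIs are the
published ones for RDF, RSS 1.0, and Dublin Core. The reader function
is an assumption about how a consumer might be written, using only
Python's standard library.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up feed; the dc:creator element comes from an extension vocabulary.
FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="http://example.org/news.rss">
    <title>Example News</title>
    <link>http://example.org/</link>
    <description>A made-up channel.</description>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.org/story/1"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://example.org/story/1">
    <title>First story</title>
    <link>http://example.org/story/1</link>
    <dc:creator>A. Writer</dc:creator>
  </item>
</rdf:RDF>"""

def read_items(feed_text):
    """Read core RSS fields plus one extension field from each item."""
    root = ET.fromstring(feed_text)
    about = "{%s}about" % NS["rdf"]
    items = []
    for item in root.findall("rss:item", NS):
        items.append({
            "uri": item.get(about),
            "title": item.findtext("rss:title", namespaces=NS),
            "link": item.findtext("rss:link", namespaces=NS),
            # None when the feed does not use the Dublin Core extension.
            "creator": item.findtext("dc:creator", namespaces=NS),
        })
    return items

items = read_items(FEED)
```

A consumer that knows only the core vocabulary can ignore the
dc:creator element and still read titles and links, while one that
understands Dublin Core gets the extra field. Neither needs a new
version of the format.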
Hitting That Fabled 80/20 Point
To conclude, it's important that the builders of the SW keep their
feet on the ground. The next generation of the Web will be built
cooperatively and in a distributed manner. Rather than pondering grand
unification theories, we should concentrate on doing small things well
and solving achievable and well-defined problems. Good and open
implementation in addition to good design is key. Furthermore, the
longer development can stay "fast, low and under the radar," the
better.
The SW represents an enormous opportunity not just to solve our
problems with information management, but also to solve them in an
interoperable environment, so we can all share solutions and enjoy the
network effect. But always the goal should be to make the Web more
effective for the user, and it is by this measure that it will be judged.