Table of Contents
The Technologies
XHTML and XSLT
RDF, RDF Schemas and Ontologies
XML Protocol Technology
Getting Started with Development
RDF Application Frameworks
Applications
The Future
Introduction
The Semantic Web lies at the heart of Tim Berners-Lee's vision for
the future of the World Wide Web. Along with others at the W3C,
Berners-Lee is working on the infrastructure for this next stage of
the Web's life. But the question "What is the Semantic Web?" is being
asked with increasing frequency. While mainstream media is content
with a high-level view, XML developers want to know more, and they
want to discover the substance behind the vision.
Accusations of fuzziness about the Semantic Web (SW) have been
levelled at Berners-Lee, and it is certainly true that he has yet to
deliver the long-awaited "Semantic Web Whitepaper." However, there are
interesting suggestions in his Semantic Web Roadmap text which give
details about the direction he wants to go. Furthermore, SW activity
at the W3C and MIT/LCS has been increasing in intensity, and
community involvement with RDF, a key SW technology, has grown
markedly over recent months.
In his Roadmap document, Berners-Lee contrasts the Semantic Web
with the existing, merely human-readable Web: "the Semantic Web
approach instead develops languages for expressing information in a
machine processable form." This is perhaps the best way of summing up
the Semantic Web -- technologies for enabling machines to make more
sense of the Web, with the result of making the Web more useful for
humans.
Given that goal, it's unsurprising that the scope of the Semantic
Web vision is somewhat broad and ill-defined. There are many ways to
solve the problem and many technologies that can be employed. Some XML
developers have a "well-formed" prejudice against, as they cheerily
call it, the "Pedantic Web" because of the strong links with RDF (not
everyone's favorite technology) and the definite view taken on
URIs. But to perceive the SW only in this light would be a
mistake. Technical peeves aside, the value of the Semantic Web is to
solve real problems in communication. First and foremost this means
radically improving our ability to find, sort, and classify
information: an activity that takes up a large part of our time.
The development of the Semantic Web is well underway. It is
proceeding on at least two fronts: from the infrastructural,
all-embracing position espoused by the W3C/MIT and other academically
focused organizations, and in a more directed, application-specific
fashion by those using web technologies for electronic business.
One of the fundamental contributions towards the Semantic Web to
date has been the development of XML itself. Liberating data from
opaque, inextensible formats as it does, XML provides an interoperable
syntactical foundation upon which solutions to the larger issues of
representing relationships and meaning can be built. It's an important
center of agreement among individual developers and corporations. The
face of the Web is changing, offering once again new possibilities for
communication and interaction -- not because all of the underlying
concepts are new per se, but because they can be combined on
the Web and exposed to the opportunity and unpredictability of
large-scale decentralization.
For the developer, however, the grand vision is irrelevant unless
it can be put to work. The point of this article is to draw together
the technological threads of the Semantic Web and introduce some
tools available now that can be used as a basis for experimentation
and development.
The Technologies
This section addresses some of the most important technologies for
constructing the Semantic Web. The list is by no means exhaustive: as
I observe in the section addressing RDF, many syntaxes can be a source
of structured information for a machine, as long as there is some
translation to a common data model. I have, however, included the
technologies that are key at this stage of Semantic Web development.
XHTML and XSLT
Perhaps surprisingly, a powerful tool for the construction of the
Semantic Web is HTML itself or, more properly, XHTML. Most people are
acquainted with the "meta" tags which can be used to embed metadata
about the document as a whole (for more on metadata, see An
Introduction to Dublin Core). Yet there are more powerful,
granular techniques available too. Although largely unused by web
authors, XHTML offers several facilities for introducing semantic
hints into markup to allow machines to infer more about the web page
content than just the text. These tools include the "class" attribute,
used most often with CSS stylesheets. A strict application of these
can allow data to be extracted by a machine from a document intended
for human consumption. For instance, consider the example:
<p>
For more information, contact:
<span class="contact" id="edumbill">
<span class="name">Edd Dumbill</span>,
<span class="role">Managing Editor</span>,
<span class="organization">XML.com</span>
</span>
</p>
A program could easily construct from such an XHTML snippet a
"Contact" object identified by the ID "edumbill" with properties
"name", "role" and "organization."
Techniques similar to this, known colloquially as "screen
scraping," have been used for some time on the Web. Common
applications include the extraction of data from search engines for
use in Perl scripts or the extraction of headline information from
news sources. For these applications the problem has been the shifting
nature of the design of HTML pages and, thus, the need to readjust the
scrapers whenever the design changes. A page marked up using the
technique shown above would enable reliable scripts to interface with
the HTML.
As web application providers consider adding SOAP and similar
interfaces to their systems to allow remote-application access, they
could actually be saved the effort of maintaining twin APIs (browser
and SOAP) by embedding machine-readable information in the HTML
itself. There is still a lot of value and utility in simpler web
technologies.
Once the richer information has been embedded in a page, a program
still needs to transform it into the format it requires. At this point
another W3C technology, XSLT, has a lot to offer. Given an XHTML page
as input, it is useful for selecting and transforming the contents of
that page. It provides an excellent bridge from older HTML technology
to the nascent XML-based Semantic Web applications. A tool of singular
utility when used in conjunction with an XSLT processor is Dave
Raggett's "Tidy," which can take HTML and turn it into XHTML. As most
web authoring
tools still don't have XHTML support, HTML will be created by web
authors for some time to come. Tidy facilitates the processing of
normal HTML with XSLT, enabling authors of such documents to
participate in the Semantic Web.
Although there have been several proposals for embedding RDF inside
HTML pages, the technique of using XSLT transformations has a much
broader appeal. Few people want to learn RDF, and so it presents a
barrier to the creation of semantically rich web pages. Using XSLT
provides a way for web developers to add semantic information with
minimal extra effort. Dan Connolly of the W3C has conducted quite a
number of experiments in this area, including HyperRDF, which extracts
RDF statements from suitably marked-up XHTML pages.
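To give a feel for what extracting statements from marked-up XHTML
involves, here is a minimal sketch. It deliberately swaps XSLT for
Python's standard library so that it runs without an XSLT processor,
and it is not a reimplementation of HyperRDF. The base URI, the
vocabulary namespace, and the rule "an element with an id is a subject,
and its class-bearing children are properties" are all assumptions made
for illustration.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/page"      # hypothetical URI of the XHTML page
VOCAB = "http://example.org/vocab#"   # hypothetical vocabulary namespace

SNIPPET = """
<p>
For more information, contact:
<span class="contact" id="edumbill">
<span class="name">Edd Dumbill</span>,
<span class="role">Managing Editor</span>,
<span class="organization">XML.com</span>
</span>
</p>
"""


def xhtml_to_statements(xhtml, base=BASE, vocab=VOCAB):
    """Return (subject, predicate, object) tuples found in the markup."""
    root = ET.fromstring(xhtml.strip())
    statements = []
    for elem in root.iter():
        ident = elem.get("id")
        if ident is None:
            continue
        subject = f"{base}#{ident}"
        # Assumed mapping: each direct child with a class attribute
        # becomes one property of the subject.
        for child in elem:
            cls = child.get("class")
            if cls:
                statements.append((subject, vocab + cls, (child.text or "").strip()))
    return statements


def to_ntriples(statements):
    """Serialize statements one per line, in an N-Triples-like form."""
    lines = []
    for s, p, o in statements:
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


statements = xhtml_to_statements(SNIPPET)
print(to_ntriples(statements))
```

The point of the sketch is that the page author only adds "class" and
"id" attributes; the mapping to statements lives entirely in the
transformation, whether that is written in XSLT or, as here, in a
general-purpose language.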