Instant RDF?
by Leigh Dodds
August 30, 2000
This week the XML Deviant returns from holiday to find that,
although the W3C's Resource Description Framework (RDF) technology appears to be gaining
supporters, developers still have many concerns about the complexity of
its syntax.
Mining the Web
Complexity is one criticism that RDF has had difficulty shaking
off. Both the RDF model and its serialization syntax
have fallen foul of this issue at various points in RDF's development. Efforts to
produce a simpler serialization syntax have led to several alternative
proposals, including one from Tim Berners-Lee ("The Strawman Proposal") and
one from Sergey Melnik ("Simplified Syntax for RDF"). For non-RDF-aficionados,
the serialization syntax is the representation of the RDF data model
as XML. (XML is only one possible means of representing this information.)
While technical concerns have been raised about specific details of the
RDF syntax, the main aim of simplification is to make it easier to generate
RDF from existing (and future) XML documents--documents which were not produced with
RDF applications in mind. Given the slow adoption of RDF, this seems a useful approach.
While discussion of the finer points of the RDF syntax is no doubt beneficial,
for developers seeking to gain some benefit from using RDF this transitional step from XML to RDF
is important. An increasingly large amount of XML data, coupled with a vast amount of HTML (suitably
tidied for well-formedness), provides
a rich data source for bootstrapping RDF applications.
A recent discussion on the W3C RDF Interest mailing list has highlighted
some different viewpoints on how this might be achieved.
Simplifying the Syntax
Broadly speaking, there are two viewpoints in this debate. They differ
in terms of how much the structure
of an XML document should be affected by RDF. Or, in other words, how much effort needs to
be invested up front to allow a document type to be processed as RDF.
The conventional approach assumes that the
XML document should contain RDF markup. A parser can then process the document directly,
extracting the "triples" (statements made up of a subject, a predicate, and an object), which
are the core data items in RDF. (See "Abbreviated Syntax"
in the RDF Specification.)
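To make the idea of triples concrete, here is a minimal sketch in Python of what "directly processing" a document might look like. It assumes the simplest possible case (an rdf:Description with literal-valued child properties) and is in no way a conforming RDF parser. The sample document, its example.org URL, and the function name extract_triples are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample document; the subject URL is illustrative only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/article">
    <dc:title>Instant RDF?</dc:title>
    <dc:creator>Leigh Dodds</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for literal-valued properties.

    Only handles rdf:Description elements with an rdf:about attribute and
    simple text-valued children; everything else in the RDF syntax is ignored.
    """
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.iter("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            # ElementTree writes qualified names as "{namespace}local";
            # joining the two parts gives the predicate URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples


if __name__ == "__main__":
    for triple in extract_triples(SAMPLE):
        print(triple)
```

Each tuple is one statement: the resource being described, the property, and its value. The abbreviated forms, nested resources, and containers that make real RDF parsers considerably more involved are deliberately left out.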
The other viewpoint suggests that RDF should not be allowed to impact the structure
of XML documents at all. Instead, tools should be provided to generate RDF from these documents
before they are processed by an RDF application. This has been termed "screen scraping" in RDF circles. Aaron Swartz offered a
proposal discussing this approach:
What is needed is a way to allow RDF parsers to extract RDF triples from
regular XML. This would be an amazing boost for RDF, allowing any existing
XML format to be easily used as RDF information.
Swartz suggests that XSLT transformations could be one mechanism suitable for
generating the desired RDF. Ora Lassila, co-editor of the RDF specification, welcomed a simpler
syntax but urged caution when transforming XML into RDF:
I feel that although a simpler, more intuitive syntax would
be a good idea, transforming *any* XML to "something like RDF" is somewhat
dangerous unless you make sure that any intended semantics is preserved.
Syntactic transformation is only half of the battle...
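As a rough illustration of the screen-scraping idea, the following Python sketch pulls triples out of an XML document that contains no RDF markup at all. Swartz's suggestion was XSLT; this sketch swaps in a plain lookup table instead, purely to keep the example short. The sample document, the mapping table, and the function name scrape_triples are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical "ordinary" XML with no RDF markup; not taken from any real format.
PLAIN_XML = """<book id="http://example.org/books/1">
  <title>An Example Book</title>
  <author>A. N. Author</author>
  <price>10.00</price>
</book>"""

# Assumed mapping from element names to predicate URIs. Someone who knows
# what the source format means has to write this table by hand.
MAPPING = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "author": "http://purl.org/dc/elements/1.1/creator",
}


def scrape_triples(xml_text, mapping, subject_attr="id"):
    """Generate (subject, predicate, object) tuples from plain XML.

    The root element's `subject_attr` attribute is taken as the subject, and
    each child element named in `mapping` becomes one statement about it.
    """
    root = ET.fromstring(xml_text)
    subject = root.get(subject_attr)
    triples = []
    for child in root:
        predicate = mapping.get(child.tag)
        if predicate is None:
            continue  # unmapped elements are skipped rather than guessed at
        triples.append((subject, predicate, (child.text or "").strip()))
    return triples


if __name__ == "__main__":
    for triple in scrape_triples(PLAIN_XML, MAPPING):
        print(triple)
```

The mapping table is where Lassila's caution applies. The code can only rename elements, so whether "author" in the source really means the same thing as the Dublin Core creator property is a judgment the person writing the mapping has to make. The price element, for which no meaning has been asserted, is simply dropped.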
Swartz's suggestion has parallels with recent work that uses similar "screen scraping"
techniques to generate RDF. This includes work
by Dan Connolly to generate RDF databases from email archives
and to extract Dublin Core metadata from XHTML files.
Dan Brickley has also posted research notes describing the use of XSLT in RDF screen scraping.
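For a flavour of what extracting Dublin Core metadata from XHTML can involve, here is a small Python sketch. It is not Connolly's or Brickley's method (both worked with XSLT). It assumes the common convention of embedding Dublin Core in meta elements whose names begin with "DC.", and it assumes the page is already well-formed XHTML. The sample page, the page URI, and the function name dublin_core_triples are illustrative.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical well-formed XHTML page using "DC."-prefixed meta names.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Instant RDF?</title>
    <meta name="DC.title" content="Instant RDF?" />
    <meta name="DC.creator" content="Leigh Dodds" />
    <meta name="keywords" content="rdf, xml" />
  </head>
  <body><p>...</p></body>
</html>"""


def dublin_core_triples(xhtml, page_uri):
    """Turn DC.* meta elements into (subject, predicate, object) tuples.

    `page_uri` is supplied by the caller because the document itself
    does not say where it lives.
    """
    root = ET.fromstring(xhtml)
    triples = []
    for meta in root.iter("{%s}meta" % XHTML_NS):
        name = meta.get("name", "")
        if name.startswith("DC."):
            # "DC.title" maps to the Dublin Core property URI ending in "title".
            triples.append((page_uri, DC_NS + name[3:], meta.get("content", "")))
    return triples


if __name__ == "__main__":
    for triple in dublin_core_triples(PAGE, "http://example.org/instant-rdf"):
        print(triple)
```

The keywords element is ignored because it does not follow the "DC." naming convention. Real-world HTML would first need tidying into well-formed XHTML before a parser like this could read it.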
Of course, the two approaches are not mutually exclusive. RDF can be scraped
from existing documents, while new formats can include RDF markup directly. Existing
formats could also be revised to this end. It's likely that a combination of these
techniques will yield the best results.
The recent proposal for a
revised RSS format takes this approach--the format includes RDF markup, while associated
tools allow the generation of RSS from XHTML documents.
This gives authors an easy route to producing RSS documents without encumbering them with
a new syntax.
Ease of Use
One area of clear agreement is that RDF needs to be user-friendly, as Jonathan
Borden noted:
It should be easy for people to add RDF statements into otherwise
mundane XML documents in ways that minimally interfere with the chosen
document structure.
Charles McCathieNevile agreed, but saw better ways of tackling the problem than
the syntax:
It should be a trivial matter of making the statements in
their favourite authoring tool, or of using a simple point click drag
interface to specify arcs and nodes of meaning. Sitting around writing pointy
brackets is like telling the poor country astrophysicist to use only a slide
rule because it's better - sure, it works, but there are better ways.
Ora Lassila also saw syntax difficulties as a minor issue:
Personally I am not opposed to a new RDF syntax (the current looks a bit
like it was "designed by committee" :-). But ultimately the syntax shouldn't
matter all that much since I am sure everyone is hoping that most of RDF
will be both read and *written* by machines (not humans).
There are echoes here of other RDF debates in which the fabled "killer app" is seen as the
most important goal, rather than a quest for simplicity. However, given the slow acceptance of RDF, several developers disagreed with Lassila's viewpoint,
believing syntax to be very important at this stage of RDF's development. Greg FitzPatrick
observed that simple syntax contributed to the success of HTML:
HTML is also read by machines. But if HTML
had been difficult to comprehend and not mnemonic it would not have started
a landslide.
Bill de Hora highlighted the use of RDF in the RSS 1.0 proposal, and commented that this
is a good opportunity for
RDF to increase its profile:
It would be shame to miss out on the opportunity to piggyback RDF on the
popularity of RSS feeds, that is, to miss piggybacking on a network
amplification effect, on the assumption that there is no pressing need to
adapt the syntax because tools will appear to automate serialization of RDF
anyway. That's not the case for legions of people using RSS now. So at this
point in time, the syntax is perhaps very important: it is after all the
concrete expression of the model and is what people will have to manipulate.
I'm less concerned about the precise syntax (once the model is invariant),
than about missing a golden opportunity to seed RDF.
While it's too early to say whether RSS will be its proving ground, RDF's
supporters are keen to see more adoption. Dan Brickley has suggested to developers
that effort should be spent
on producing interoperability tests for the increasing
range of available RDF parsers:
The XSLT / Semantic Web Screenscraping threads on this
list have shown how we can extract RDF models from all manner of well
managed XML data. There are a fair number of RDF 1.0 parsers now, and
significant effort has gone into creating these. I would rather see our
time go on developing interoperability tests for these to get them up to
production grade, learning through doing so about any grey areas in the
syntax spec.