Make Your XML RDF-Friendly
6. Be careful about the use of container elements.
The good news is that a given resource can be both the object of
one or more RDF statements and the subject of others. For example, the
following shows that Bridget Fonda's father is Peter Fonda and that
Peter Fonda's father is Henry Fonda. Peter is the object of the
statement made by the outer triple and the subject of the inner
one.
<Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
  <gc:father>
    <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter">
      <gc:father>
        <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Henry"/>
      </gc:father>
    </Entertainer>
  </gc:father>
</Entertainer>
There's no limit to the level of nesting, as long as the line of
descendants keeps alternating: counting the outermost resource as
level zero, elements at even-numbered levels are resources and elements
at odd-numbered levels are predicates. This alternating relationship is
known in RDF circles as striping.
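To make the striping pattern concrete, here is a minimal Python sketch that walks the Fonda example and labels each element by its depth. It is only an illustration: the function name stripes is made up, the namespace declarations are borrowed from the validation example later in this article so that the fragment parses on its own, and the code assumes the striping is already correct rather than checking it.

```python
import xml.etree.ElementTree as ET

# The Fonda example, with namespace declarations added so it parses alone.
FONDA = """
<Entertainer xmlns="http://www.cyc.com/2002/04/08/cyc.daml#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:gc="http://www.daml.org/2001/01/gedcom/gedcom#"
    rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
  <gc:father>
    <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter">
      <gc:father>
        <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Henry"/>
      </gc:father>
    </Entertainer>
  </gc:father>
</Entertainer>
""".strip()


def stripes(element, depth=0):
    """Yield (depth, role, local name) for an element and its descendants.

    Depth 0 is the outermost resource; roles simply alternate from there.
    """
    role = "resource" if depth % 2 == 0 else "predicate"
    local_name = element.tag.rsplit("}", 1)[-1]  # drop the {namespace} part
    yield depth, role, local_name
    for child in element:
        yield from stripes(child, depth + 1)


if __name__ == "__main__":
    for depth, role, name in stripes(ET.fromstring(FONDA)):
        print("  " * depth + f"{name} ({role})")
```

Printed with indentation, the output shows Entertainer and father taking turns all the way down, which is the stripe.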
The bad news is that many common uses of container elements throw
this striping pattern off. The following example, which omits the
document element and namespace declarations, is otherwise perfectly
good RDF until the attachments element.
<email rdf:about="msg001">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments>
    <attachment>data\sample1.txt</attachment><!-- RDF parser chokes here -->
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
Up to that point, an RDF parser knows that the resource with the
ID "msg001" has a from value of "bram@snee.com", a
to value of "bela@snee.com", and so on, but what is the
attachments value? If its contents were an XML element, it
would have to be just one element, with an identifier that named it as
a specific resource. Having more than one element -- which is the
whole point of the wrapper, because a given e-mail message may have
more than one attachment -- is something that RDF can't handle when
represented this way. It thinks that the attachments property
of the email resource has two properties of its own (the two
attachment elements). Properties can't have properties, but
resources can.
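The same diagnosis can be sketched in a few lines of Python. This is a rough check and not an RDF parser: the function name striping_problems is hypothetical, the xmlns:rdf declaration is added only so the fragment parses, and the check knows nothing about RDF/XML features beyond plain striping. It treats the children of a resource as properties and flags any property that holds more than one child element.

```python
import xml.etree.ElementTree as ET

# The msg001 example; the xmlns:rdf declaration is an addition for parsing.
EMAIL = r"""
<email xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="msg001">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments>
    <attachment>data\sample1.txt</attachment>
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
""".strip()


def striping_problems(resource):
    """Return the names of property elements holding more than one child element."""
    problems = []
    for prop in resource:  # children of a resource are its properties
        children = list(prop)
        if len(children) > 1:
            problems.append(prop.tag)
        for child in children:
            # A child element of a property is a nested resource; check it too.
            problems.extend(striping_problems(child))
    return problems


if __name__ == "__main__":
    print(striping_problems(ET.fromstring(EMAIL)))  # ['attachments']
```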
There are two obvious options for giving this email
element the resource-predicate-resource-predicate descendant structure
that RDF expects: either remove a layer of containment or add
one. Removing the attachments container would make each
attachment element a sibling of from, to,
and the email element's other children, and email
wouldn't have any grandchildren:
<email rdf:about="msg002">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachment>data\sample1.txt</attachment>
  <attachment>data\sample2.txt</attachment>
  <cc>frank@snee.com</cc>
</email>
The problem with this is that you may have a good reason to use
that container. For example, when processing your XML e-mail messages
using an event-based model such as the SAX API, maybe there's
something specific you want to do when you reach the end of the
attachment list. How do you know you've reached the end of that list
when processing this version of the email element? When you
reach the cc element? What if cc is optional?
Nothing says "end of attachment list" like an
</attachments>.
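As an illustration of why the container is convenient for event-based processing, the following sketch uses Python's standard xml.sax module on the original msg001 message. The class name AttachmentListHandler and what it records are invented for this example, and the xmlns:rdf declaration is added only so the fragment parses on its own.

```python
import xml.sax

# The msg001 example; the xmlns:rdf declaration is an addition for parsing.
EMAIL = r"""
<email xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="msg001">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <attachments>
    <attachment>data\sample1.txt</attachment>
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
""".strip()


class AttachmentListHandler(xml.sax.ContentHandler):
    """Collect attachment names and note when the whole list has been read."""

    def __init__(self):
        super().__init__()
        self.attachments = []
        self.events = []
        self._text = None  # character buffer, active only inside <attachment>

    def startElement(self, name, attrs):
        if name == "attachment":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "attachment":
            self.attachments.append("".join(self._text))
            self._text = None
        elif name == "attachments":
            # The closing tag of the container is the unambiguous signal.
            self.events.append(
                f"end of attachment list: {len(self.attachments)} item(s)")


if __name__ == "__main__":
    handler = AttachmentListHandler()
    xml.sax.parseString(EMAIL.encode("utf-8"), handler)
    print(handler.events)
```

Without the attachments wrapper, the handler would have to guess the end of the list from whichever element happened to follow the last attachment.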
If you must have a container around your attachment
elements, and want to make it proper RDF, one solution is to use one
of RDF's specialized container elements. In this case, you can wrap an
rdf:Bag element around the attachment elements in
the original e-mail example, inside of the attachments
element. (In keeping with guideline 2, the attachments
element has been given an rdf:ID attribute to make it easier
for a parser to refer to it.) The rdf:Bag element describes a
container whose contents aren't ordered in any meaningful way. The
example's rdf:Bag element has an rdf:ID value of
"i2", telling an RDF parser that in addition to having a from
property with a value of "bram@snee.com", as well as the other
properties we saw, the resource with the ID "msg003" also has an
attachments property with resource #i2 as its value. This i2
resource has a type of rdf:Bag, which RDF parsers understand
to be a container of unordered content. The i2 resource has one
attachment with a value of "data\sample1.txt" and another
with a value of "data\sample2.txt". And, unlike the first e-mail
example above, this one causes no error message in the RDF parser.
<email rdf:about="msg003">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments rdf:ID="i1">
    <rdf:Bag rdf:ID="i2">
      <attachment>data\sample1.txt</attachment>
      <attachment>data\sample2.txt</attachment>
    </rdf:Bag>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
In addition to the rdf:Bag container for unordered
content, RDF also offers the rdf:Seq element for ordered (or
"sequenced") content and the less popular rdf:Alt container
to show available alternatives to a specified value.
There is actually a third, even simpler option for converting this
email element's structure into something that won't confuse
the RDF parser: we can use the rdf:parseType attribute to tell the
parser explicitly that the attachments property of the email
element is itself a resource:
<email rdf:about="msg004">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments rdf:parseType="Resource">
    <attachment>data\sample1.txt</attachment>
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
Think about the original problem: the attachments property
of the email element couldn't have its own properties, which
is why the RDF parser choked at the first attachment element
-- it thought that the document was trying to name a property of a
property, which is illegal. Now that the attachments element
is explicitly named as a resource, it can have properties, so the RDF
parser will have no problem with the two attachment children
of this element.
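To see roughly what a parser does with the msg004 example, here is a toy triple generator in Python. It is a deliberately small sketch and nowhere near a conforming RDF/XML parser: toy_triples and the _:b1 style labels are made up for illustration, the xmlns:rdf declaration is added so the fragment parses, names are not expanded into full URIs, and only literal values and rdf:parseType="Resource" are handled.

```python
import itertools
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PARSE_TYPE = f"{{{RDF}}}parseType"
ABOUT = f"{{{RDF}}}about"

# The msg004 example; the xmlns:rdf declaration is an addition for parsing.
EMAIL = r"""
<email xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="msg004">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments rdf:parseType="Resource">
    <attachment>data\sample1.txt</attachment>
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>
""".strip()


def toy_triples(resource, subject, ids=None):
    """Yield (subject, predicate, object) for each property of a resource."""
    ids = ids if ids is not None else itertools.count(1)
    for prop in resource:
        if prop.get(PARSE_TYPE) == "Resource":
            # The property's value is an unnamed resource with its own properties.
            blank = f"_:b{next(ids)}"
            yield subject, prop.tag, blank
            yield from toy_triples(prop, blank, ids)
        else:
            yield subject, prop.tag, (prop.text or "").strip()


if __name__ == "__main__":
    root = ET.fromstring(EMAIL)
    for triple in toy_triples(root, root.get(ABOUT)):
        print(triple)
```

In the output, the attachments property of msg004 points at an unnamed resource, and that resource, not the property, carries the two attachment values.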
7. Eschew mixed content.
Mixed content presents a more advanced version of the problem
caused by containers that throw off the striping pattern. Once you see
that the resources described in RDF statements must either be siblings
of each other or skip an odd number of generations when descendants of
each other, and that predicates must be descendants found at the
levels between those, it's clear how the typically irregular patterns
of mixed content can throw off RDF striping. Mixed content can also
put strings of PCDATA in odd places -- or at least in places that seem
odd if you're looking for regular recurring patterns.
This doesn't mean that you can't have RDF in a document with mixed
content. The "Moby Dick" example at the beginning of this article has
mixed content, and the rdf:RDF element showing publishing
metadata such as the work's creator and availability date is kept
separately in an RDF header section.
RDF statements in a mixed content document can even use elements
within the mixed content as resources. The following example has an
rdf:RDF header element that contains a made-up
imgLink element linking the character in-line
element to an image on a remote server.
<article xmlns="http://www.snee.com/ns/dummy#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <imgLink rdf:about="#c1">
      <image rdf:resource=
        "http://www.keele.ac.uk/depts/as/Literature/Moby-Dick/images/Moby.gif"/>
    </imgLink>
  </rdf:RDF>
  <body>
    <title>Moby Dick</title>
    <para>Call me <character rdf:ID="c1">Ishmael</character>.</para>
    <para>Just don't call me late for supper.</para>
  </body>
</article>
An RDF parser will find the statement linking the
character element to the Moby.gif picture and will have no
problem with the mixed content along the way.
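An application reading this document could follow the same link without a full RDF toolkit. The sketch below is one possible approach and not how any particular parser works: the helper header_links is hypothetical, and it only matches rdf:about values of the form "#id" against rdf:ID attributes in the same document.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_ID = f"{{{RDF}}}ID"
RDF_ABOUT = f"{{{RDF}}}about"
RDF_RESOURCE = f"{{{RDF}}}resource"

ARTICLE = """<article xmlns="http://www.snee.com/ns/dummy#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <imgLink rdf:about="#c1">
      <image rdf:resource=
        "http://www.keele.ac.uk/depts/as/Literature/Moby-Dick/images/Moby.gif"/>
    </imgLink>
  </rdf:RDF>
  <body>
    <title>Moby Dick</title>
    <para>Call me <character rdf:ID="c1">Ishmael</character>.</para>
    <para>Just don't call me late for supper.</para>
  </body>
</article>"""


def header_links(root):
    """List (element name, element text, linked URI) for header statements
    whose rdf:about refers to an rdf:ID elsewhere in the same document."""
    by_id = {el.get(RDF_ID): el for el in root.iter() if el.get(RDF_ID) is not None}
    links = []
    for header in root.iter(f"{{{RDF}}}RDF"):
        for description in header:
            about = description.get(RDF_ABOUT, "")
            target = by_id.get(about[1:]) if about.startswith("#") else None
            if target is None:
                continue
            for prop in description:
                uri = prop.get(RDF_RESOURCE)
                if uri is not None:
                    name = target.tag.rsplit("}", 1)[-1]  # drop the namespace
                    links.append((name, target.text, uri))
    return links


if __name__ == "__main__":
    for name, text, uri in header_links(ET.fromstring(ARTICLE)):
        print(f"<{name}> '{text}' is linked to {uri}")
```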
8. Find an RDF parser to check that your RDF statements are okay.
When learning any new language, you want to be sure that what you
think you're saying is really what you're saying. Most RDF parsers
make this easy by outputting a subject-predicate-object triple for
each RDF statement they find. For example, the W3C's RDF Validation Service
turns this document
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:imdb="http://us.imdb.com/Name?"
    xmlns="http://www.cyc.com/2002/04/08/cyc.daml#"
    xmlns:gc="http://www.daml.org/2001/01/gedcom/gedcom#">
  <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
    <gc:father>
      <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter"/>
    </gc:father>
  </Entertainer>
</rdf:RDF>
into this (carriage returns added):
<http://us.imdb.com/Name?Fonda,%20Peter>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.cyc.com/2002/04/08/cyc.daml#Entertainer> .
<http://us.imdb.com/Name?Fonda,%20Bridget>
<http://www.daml.org/2001/01/gedcom/gedcom#father>
<http://us.imdb.com/Name?Fonda,%20Peter> .
<http://us.imdb.com/Name?Fonda,%20Bridget>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.cyc.com/2002/04/08/cyc.daml#Entertainer> .
Or, in English, using only the last part of each URI:
Peter Fonda has a type value of Entertainer.
Bridget Fonda has a father value of Peter Fonda.
Bridget Fonda has a type value of Entertainer.
In general, using a utility to convert RDF to triples helps you to
understand exactly what is being said if you read the
subject-predicate-object triple "X, Y, Z" as "X has a Y value of Z."
All the natural language descriptions of RDF statements in this
article were checked this way.
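The reading rule is mechanical enough to script. The following Python sketch turns one triple line of the kind shown above into an "X has a Y value of Z" sentence. It rests on simplifying assumptions: the helper names are made up, only triples whose three parts are all URIs in angle brackets are handled (no literals), and names are shortened by keeping whatever follows the last "#", "?", or "/".

```python
import re
from urllib.parse import unquote

# Three <...> terms separated by whitespace and closed by a period.
TRIPLE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")


def short_name(uri):
    """Keep the text after the last '#', '?', or '/' and decode %-escapes."""
    return unquote(re.split(r"[#?/]", uri)[-1])


def to_english(triple_line):
    """Render a URI-only triple as 'X has a Y value of Z.'"""
    match = TRIPLE.fullmatch(triple_line.strip())
    if match is None:
        raise ValueError(f"not a URI-only triple: {triple_line!r}")
    subject, predicate, obj = (short_name(uri) for uri in match.groups())
    return f"{subject} has a {predicate} value of {obj}."


if __name__ == "__main__":
    line = ("<http://us.imdb.com/Name?Fonda,%20Bridget> "
            "<http://www.daml.org/2001/01/gedcom/gedcom#father> "
            "<http://us.imdb.com/Name?Fonda,%20Peter> .")
    print(to_english(line))
    # Fonda, Bridget has a father value of Fonda, Peter.
```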
As RDF tools become more widely available and easier to use, you'll
have more resources for improving the metadata management of your
own data. Even if you're not ready to build serious RDF applications
just yet, making more of your own data RDF-friendly will do more than
increase the number of applications that can use it. For many people,
the kinds of things that RDF is good at become clearer when they see
it used with data that is important to their business or to them
personally, such as an address or appointment file. Using RDF tools
to play with your own data will help you better understand the strong
points of RDF and, perhaps, even the strong points of your own data.