RDF and Metadata
by Tim Bray
June 09, 1998
This article has now been updated to incorporate changes in the RDF spec and the growth of the RDF community. You can find a newer
version in the article "What is RDF?".
The Right Way to Find Things
RDF stands for Resource Description Framework. RDF is built for the Web, but let's leave Web-land behind for a few minutes and think about how we find things in the real world.
Scenario 1: The Library
You're in a library to find books on raising donkeys as pets.
In most libraries these days you'd use the computer lookup system, basically an electronic version of the
old card file.
This system allows you to list books by author, title, call-number, and
subject.
The list includes the date, author, title, and lots of other
useful information, including (most important of all) where each book is.
Scenario 2: The Video Store
You're in a video store and you want a movie by John Huston.
A large modern video store offers a lookup facility that's similar
to the library's.
Of course, the things you can search on are different (director,
actors, and so on) but the results are more or less the same.
Scenario 3: The Phone Book
You're working late at a customer's office in South Denver, and it seems that a pizza is essential if work is to continue.
Fortunately, every office comes equipped with a set of Yellow Pages
that, properly used, can lead to quick pizza delivery.
The Common Thread
What do all these situations have in common, and
what differences lie behind the scenes?
First of all, each of these systems is based on metadata, or
information about information.
In each case, you need a piece of information (the book's location, the
video's name, the pizza joint's phone number).
In each case, you use metadata (information about information) to get it.
We're all used to this stuff; the usual setup is that metadata comes in
named chunks (subject, director, business category) that associate
lookup information ("donkeys", "John Huston", "Pizza, South Side") with
the real info that you're after.
Here's a subtle but important point: in theory, metadata is not really
necessary. In principle, you could go through the library one book at a time
looking for donkey books; or through the video store shelves until you found
your movie; or call all the numbers in your area code until you find pizza
delivery.
But that would be very wasteful -- in fact, downright stupid.
Metadata is the way to go.
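To make the contrast concrete, here is a minimal sketch in Python of the two ways of finding donkey books: walking every shelf, or building a card file once and looking things up in it. The catalogue records, titles, and shelf codes are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical catalogue: each record holds the "real info" (title, shelf)
# plus named metadata chunks such as author and subject.
BOOKS = [
    {"title": "Donkeys as Companions", "author": "A. Smith",
     "subject": "donkeys", "shelf": "SF361"},
    {"title": "The Maltese Falcon: A Study", "author": "B. Jones",
     "subject": "film", "shelf": "PN1997"},
    {"title": "Keeping Miniature Donkeys", "author": "C. Lee",
     "subject": "donkeys", "shelf": "SF362"},
]


def brute_force(books, subject):
    """Walk every shelf: look at each book in turn."""
    return [b["shelf"] for b in books if b["subject"] == subject]


def build_index(books, field):
    """Build the 'card file': map each value of a metadata field to shelf locations."""
    index = defaultdict(list)
    for b in books:
        index[b[field]].append(b["shelf"])
    return index


subject_index = build_index(BOOKS, "subject")
print(subject_index["donkeys"])   # same answer as brute_force(BOOKS, "donkeys")
```

Both approaches give the same answer; the difference is that the index is built once and then answers each question without touching every book again.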
It's All Different Behind the Scenes
In each of our scenarios, we used metadata, and used it in a remarkably
similar way.
Does this mean that the library, the video store, and the phone company all
use the same metadata setup?
Of course not -- to start with, every library has a choice among
at least two systems for
organizing its books, and among many
vendors who will sell it software to do the looking-up.
The same is true, obviously, for video stores and phone companies.
In fact, most such products define their own system of metadata and
their own facilities for storing and managing it; they typically do
not offer any facilities for sharing or interchanging it.
This doesn't cause too much of a problem, assuming they do a decent job
with the user interface.
We are comfortable enough with the general process we call "looking things up"
(really, searching via metadata) that we are able to adapt and use all these
different systems.
Not Just For Searching
The most common day-to-day use of metadata is to help us find things.
But there are lots of other uses going on behind the scenes: the library and
video store are both keeping other metadata that you don't see, concerning how
often the books and videos are being used, how much it cost to buy them, where
to go for a replacement; running a library or a video store would be
unthinkable without metadata.
Similarly, the phone company uses its metadata most obviously to
print the Yellow Pages, but also for many other internal management
and administration tasks.
What About the Web?
The Web is a lot like a really REALLY big library, in that
there are millions of things out there, and if you know the URL (in effect an
electronic "call number") you can get them.
Since the Web has books, movies, and pizza joints,
the list of things you might need to look things up by includes everything
a library uses, plus everything the video store uses, plus
everything the Yellow Pages use, and lots more.
The problem at the moment is that there is hardly any metadata on the Web.
So how do we find things? Mostly, using dumb brute-force techniques.
The dumb brute force is supplied by the Web robots of search engine sites like
AltaVista, Infoseek, and Excite. These sites do the equivalent of going through the
library, reading every book, and allowing us to look things up based on the
words in the text.
It's not surprising that people complain about search results, or that the
robots are always way behind the growth and change of the Web.
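As a rough illustration of what "reading every book" amounts to, the following Python sketch builds a word index over a few pages. It is a toy under stated assumptions, not a description of how any real search engine works; the pages and their example.org URLs are invented.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for what a robot might fetch.
PAGES = {
    "http://example.org/pets": "Raising donkeys as pets is rewarding.",
    "http://example.org/films": "John Huston directed many films.",
    "http://example.org/huston-tx": "Pizza delivery near Huston Street.",
}


def build_word_index(pages):
    """Read every page and record which URLs each word appears in."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index


word_index = build_word_index(PAGES)
# A word search cannot tell the director from the street name.
print(sorted(word_index["huston"]))
```

The search for "huston" returns both the film page and the pizza page, because the index knows only which words occur, not what role they play. A Director field would not have that problem.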
In fact, there is one metadata-based general purpose lookup facility: Yahoo!, which is the
most visited Web site of all.
Yahoo! doesn't use a robot. When you search through Yahoo!, you're searching
through human-generated subject categories and site labels.
Compared to the amount of metadata that a library maintains for its books,
Yahoo! is pitiful; but its popularity is clear evidence of the power of
(even limited) metadata.
Divine Metadata for the Web
People who have thought about these problems, including many of the world's librarians
and webmasters, generally agree that the Web urgently needs
metadata.
What would it look like?
If the Web had an all-powerful Grand Organizing
Directorate (at www.GOD.org), they would think up a set of lookup
fields such as Author, Title, Date, Subject, and so on.
The Directorate, being, after all, GOD, would simply decree that all Web pages
start using this divine Metadata, and that would be that.
Of course there would be some details such as how the Web sites ought to
package up and interchange the metadata, and we all know that the Devil is in
the details, but GOD can lick the Devil any day.
In fact, there is no www.GOD.org.
For this reason, there is no chance that everyone will agree to start using
the same metadata facilities.
If libraries, which have been in existence for thousands of years, can't agree on
a single standard, there's not much chance that the Web will.
Does this mean that there is no chance for metadata? That everyone is going to
have to build their own lookup keys and values and software, and that we're
going to be stuck using dumb brute-force robots forever?
No -- because as we observed with our three search scenarios,
metadata operations have an awful lot in common, even when the metadata is different.
RDF is an effort to identify these common threads and provide a way for Web architects to use them to provide useful Web metadata without divine
intervention.
Introducing RDF
Resource Description Framework, as its name implies, is a framework for
describing and interchanging metadata.
It is built on the following rules:
- A Resource is anything that can have a URI; this
includes all the world's
Web pages, as well as individual elements of an XML document.
An example of a resource is a draft of the document you are now reading;
its URL is
http://www.textuality.com/RDF/Why.html
- A PropertyType is a Resource that has a name and can be used as a
property, for example
Author or Title.
In many cases, all we really care about is the name; but a PropertyType needs
to be a resource so that it can have its own properties.
- A Property is the combination of a Resource, a PropertyType, and a
value.
An example would be: "The Author of
http://www.textuality.com/RDF/Why.html
is Tim Bray."
The Value can just be a string, for example "Tim Bray" in the previous
example, or it can be another resource, for example
"The Home-Page of
http://www.textuality.com/RDF/Why.html
is http://www.textuality.com."
- There is a straightforward method for expressing these abstract Properties
in XML, for example:
<RDF:Description href='http://www.textuality.com/RDF/Why.html'>
<Author>Tim Bray</Author>
<Home-Page RDF:href='http://www.textuality.com' />
</RDF:Description>
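The same rules can be sketched as plain data structures. The Python below is one possible reading of them, not anything defined by the RDF spec: the class names are my own, and the example.org URIs for the PropertyTypes are assumptions, since the text above gives only their names.

```python
from typing import NamedTuple, Union


class Resource(NamedTuple):
    """Anything that can have a URI."""
    uri: str


class Property(NamedTuple):
    """One RDF Property: a Resource, a PropertyType, and a value."""
    resource: Resource
    property_type: Resource          # a PropertyType is itself a Resource
    value: Union[str, Resource]      # a plain string or another Resource


# Hypothetical URIs for the PropertyTypes; the article gives only their names.
AUTHOR = Resource("http://example.org/terms/Author")
HOME_PAGE = Resource("http://example.org/terms/Home-Page")

page = Resource("http://www.textuality.com/RDF/Why.html")

properties = [
    Property(page, AUTHOR, "Tim Bray"),
    Property(page, HOME_PAGE, Resource("http://www.textuality.com")),
]

for p in properties:
    kind = "resource" if isinstance(p.value, Resource) else "string"
    print(p.property_type.uri, "->", kind)
```

Note that the first Property has a string value and the second has a Resource value, matching the two cases described above.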
RDF is carefully designed to have the following
characteristics:
- Independence: Since a PropertyType is a resource, any independent organization
(or even person) can invent them.
I can invent one called Author, and you can invent one called
Director (which would only apply to resources that are associated with
movies), and someone else can invent one called Restaurant-Category.
This is necessary since we don't have www.GOD.org to take care of it for
us.
- Interchange: Since RDF Properties can be converted into XML, they are easy for us to
interchange. This would probably be necessary even if we did have www.GOD.org.
- Scalability: RDF properties are simple three-part records (Resource,
PropertyType, Value), so they are easy to handle and look things up by, even in
large numbers.
The Web is already big and getting bigger, and we are probably going to have
(literally) billions of these floating around (millions even for a big
Intranet), so this is important.
- PropertyTypes are Resources: This means that they can have their own properties and
can be found and manipulated like any other Resource.
This is important because there are going to be lots of them; too many to look
at one by one.
For example, I might want to know if anyone
out there has defined a PropertyType that describes the
genre of a movie, with values like Comedy, Horror, Romance, and Thriller.
I'll need metadata to help with that.
- Values Can Be Resources: For example, most Web pages will have a property named Home-Page which
points at the home page of their site.
So the values of properties, which obviously have to include things like
title and author's name, also have to include Resources.
- Properties Can Be Resources: So they can have properties too.
Since there's no www.GOD.org to provide useful assertions for all the
resources, and since the Web is way too big for us to provide our own, we're
going to need to
do lookups based on other people's metadata (as we do today with Yahoo!).
This means that we'll want, given any Property such as "The Subject of this
Page is Donkeys", to be able to ask "Who said so? And When?"
One useful way to do this would be with metadata; so Properties will need to
have Properties.
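Two of these characteristics, Scalability and Properties Can Be Resources, are easy to sketch in code. The Python below keeps three-part records in a small indexed store and then attaches properties to a Property. It is a simplification: the real RDF mechanism for turning a Property into a Resource is more involved, and the names TripleStore, Asserted-By, and Asserted-On, as well as the example.org URI, are made up for this sketch.

```python
from collections import defaultdict


class TripleStore:
    """Stores (resource, property_type, value) records with lookup indexes."""

    def __init__(self):
        self.triples = []
        self.by_resource = defaultdict(list)
        self.by_type_value = defaultdict(list)

    def add(self, resource, property_type, value):
        triple = (resource, property_type, value)
        self.triples.append(triple)
        self.by_resource[resource].append(triple)
        self.by_type_value[(property_type, value)].append(triple)
        return triple

    def about(self, resource):
        """Every Property recorded for one resource."""
        return self.by_resource.get(resource, [])

    def find(self, property_type, value):
        """Resources whose property_type has the given value."""
        return [t[0] for t in self.by_type_value.get((property_type, value), [])]


store = TripleStore()
page = "http://www.textuality.com/RDF/Why.html"
claim = store.add(page, "Subject", "Donkeys")

# The Property itself is used as a resource: "Who said so? And when?"
# (the names Asserted-By and Asserted-On are made up for this sketch)
store.add(claim, "Asserted-By", "http://example.org/some-reviewer")
store.add(claim, "Asserted-On", "1998-06-09")

print(store.find("Subject", "Donkeys"))
print(store.about(claim))
```

Because every record has the same three-part shape, the same two indexes serve a page's Subject and the claims made about that Subject; nothing special is needed for the second level.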
Why Not Just Use XML?
XML allows you to invent tags, and for the tags to contain both text data
and other tags.
Also, XML has a built-in distinction between element types, for example
the IMG element type in HTML, and elements, for example an
individual <IMG SRC='Madonna.jpg'>; this corresponds naturally
to the distinction between PropertyTypes and Properties.
So it seems as though XML documents should be a natural vehicle for exchanging
general purpose metadata.
XML, however, falls apart on the Scalability design goal.
There are two problems:
- The order in which elements appear in an XML document is significant and
often very meaningful.
This seems highly unnatural in the metadata world. Who cares whether a movie's
Director or Title is listed first, as long as both are available for
lookups?
Furthermore, maintaining the correct order of millions of data items is
expensive and difficult, in practice.
- XML allows constructions like this:
<Description>The value of this property contains some
text, mixed up with child properties such as its temperature
(<Temp>48</Temp>) and longitude
(<Longt>101</Longt>). [&Disclaimer;]</Description>
When you represent general XML documents in computer memory, you get weird data
structures that mix trees, graphs, and character strings.
In general, these are hard to handle in even moderate amounts, let alone
by the billion.
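To see the second problem first-hand, here is a short Python sketch that parses the example above with the standard library's ElementTree and lists where the pieces end up. The only change to the example is that the &Disclaimer; entity reference is left out, since a parser would reject it without a declaration.

```python
import xml.etree.ElementTree as ET

# The article's example, minus the &Disclaimer; entity reference, which would
# need a DTD declaration before a parser could accept it.
doc = (
    "<Description>The value of this property contains some "
    "text, mixed up with child properties such as its temperature "
    "(<Temp>48</Temp>) and longitude "
    "(<Longt>101</Longt>).</Description>"
)

root = ET.fromstring(doc)

# The content is scattered: the parent's leading text, each child element,
# and the "tail" text that follows each child.
pieces = [root.text]
for child in root:
    pieces.append((child.tag, child.text))
    pieces.append(child.tail)

for piece in pieces:
    print(repr(piece))
```

A single "value" comes back as five fragments of three different kinds, and their order matters. Compare that with a three-part record whose value is one string or one Resource.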
On the other hand, something like XML is an absolutely necessary part of
the solution to RDF's Interchange design goal.
XML is unequalled as an exchange format on the Web; but by itself, it doesn't
provide what you need in a metadata framework.
The Devil is in the Details
The four general rules given above define the central ideas of RDF.
It turns out that it takes quite a lot of abstract terminology and XML syntax
to define them precisely enough that people can write computer programs to
process them.
In particular, turning Properties into Resources is quite tricky.
Also, it turns out that in a (very) few cases, you do need to order your
properties, and this requires quite a bit of syntax.
This article is not going to try to explain all these details; there are a
variety of
excellent resources to be found at
http://www.w3.org/RDF
that are designed to do just that.
Vocabularies
RDF, as we've seen, provides a model for metadata, and a syntax so that
independent parties can exchange it and use it.
What it doesn't provide, though, is any PropertyTypes of its own.
That is to say, RDF doesn't define Author or Title or Director or
Business-Category.
That would be a job for www.GOD.org, if there were one.
Since there isn't, it's a job for everyone.
One PropertyType standing by itself is unlikely to be very
useful.
It is expected that these will come in packages; for example, a set of basic
bibliographic PropertyTypes like Author, Title, Date, and so on.
Then a more elaborate set from OCLC, and a competing one from the
Library of Congress.
These packages are called Vocabularies; it's easy to imagine
PropertyType vocabularies describing books, videos, pizza joints, fine wines,
mutual funds, and many other species of Web wildlife.
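One way to picture a vocabulary is as a named package of PropertyTypes, where the package's own URI keeps its names from colliding with anyone else's. The Python sketch below assumes that arrangement; the two vocabulary URIs and their contents are hypothetical and do not correspond to any published vocabulary.

```python
# Hypothetical vocabulary URIs; none of these are real published vocabularies.
BIBLIO = "http://example.org/vocab/biblio#"
MOVIES = "http://example.org/vocab/movies#"

VOCABULARIES = {
    BIBLIO: {"Author", "Title", "Date"},
    MOVIES: {"Director", "Title", "Genre"},
}


def property_type(vocabulary, name):
    """Return the full URI of a PropertyType, checking that the vocabulary defines it."""
    if name not in VOCABULARIES[vocabulary]:
        raise KeyError(f"{name!r} is not defined in {vocabulary}")
    return vocabulary + name


# Both vocabularies define "Title", but the full URIs keep them distinct.
book_title = property_type(BIBLIO, "Title")
film_title = property_type(MOVIES, "Title")
print(book_title)
print(film_title)
```

Since each PropertyType ends up with its own URI, two independent parties can both define a Title without asking anyone's permission, which is the point of not having a www.GOD.org.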
What RDF Might Mean
The Web is too big for any one person to stay on top of.
In fact, it contains information about a huge number of subjects, and for most
of those subjects (such as fine wines, home improvement, and cancer
therapy), the Web has too much information for any one person to stay on top
of and also have a real job.
This means that opinions, pointers, indexes, and anything that helps people
"look things up" are going to be commodities of very high value.
That is to say, vocabularies.
Nobody thinks that everyone will use the same vocabulary (nor should they),
but with RDF we can have a marketplace in vocabularies.
Anyone can invent
them, advertise them, and sell them.
The good (or best-marketed) ones will survive and prosper.
Probably, most
niches of information will come to be dominated by a small number of
vocabularies, the way that library catalogues are today.
And even among people who are sharing the use of metadata vocabularies,
there's no need to share the same software.
RDF makes it possible to use multiple different pieces of software to process
the same metadata, and to use a single piece of software to process (at least
in part) many different metadata vocabularies.
With any luck, this should make the Web more like a library,
or a video store, or a phone book, than it is today.