Creative Comments: On the Uses and Abuses of Markup
by Kendall Grant Clark
January 15, 2003
Whether you think of the Semantic Web as a new and exciting
promise or as a fantastic and impractical threat, it will not be a
separate web; rather, it will overlay the existing one. The
Semantic Web isn't a replacement; it's a supplement. Both the
existing web (the "Human Web") and the Semantic Web (for my
purposes here, the "Machine Web") will inhabit the same conceptual
space and share considerable infrastructure, including Unicode,
URIs, HTTP, and XML.
How, then, do we distinguish the Human from the Machine Web? The
easiest way is by distinguishing the identity and nature of each
web's dominant agent. The dominant agent of the Human Web is the
natural person. The Human Web is made for humans; the information
and knowledge it contains are intended for human consumption. The
Human Web's primary language, HTML, is best suited to presenting
information to human agents.
Of course there are some machine agents at play in the fields of
the Human Web, but they are massively outnumbered, don't know the
game well, and their play is generally hampered by the strange
environs. As every programmer who's tried to screen scrape someone
else's web site knows, HTML isn't a very good way to express
information to a machine. When it works, it does so only because of
considerable, daily care and feeding.
The Machine Web's dominant agent is a computer process, a
machine. The information contained in the Machine Web is intended
for machine consumption. RDF, the Machine Web's primary language, at
least for now, is best suited to describing information for
machine agents.
Thus far, I've made no new claims, having merely laid out the
conventional picture. In the remainder of this article I want to
draw your attention to the transitional period -- the period during
which the Machine and Human Webs will begin to inhabit the same
conceptual space and technical infrastructure. We are now living in
the early days of this transitional period and there are some
issues specific to it which may be worth considering.
Machine Content and Human Comments
The issue I want to raise here is the increasingly widespread
practice of embedding information -- mainly using, but not limited
to, RDF -- intended for machine consumption in a format,
HTML comments, which is intended for human
consumption.
When I realized people were embedding RDF in HTML comments,
claiming that the resulting document is part of the Semantic Web, I
was confused. Surely, I wondered, they know that putting RDF into
HTML comments is an inelegant way of relating human and
machine-consumable resources?
Creative Commons,
which has taken on the laudable task of creating RDF descriptions
of common licensing terms for intellectual property, suggests its
users associate machine-consumable licensing terms such as
this:
<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <License rdf:about="http://creativecommons.org/licenses/by-nc-sa/1.0">
    <requires rdf:resource="http://web.resource.org/cc/Attribution" />
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" />
    <requires rdf:resource="http://web.resource.org/cc/ShareAlike" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
  </License>
</rdf:RDF>
with the web resources to which they apply by embedding RDF
directly in HTML comments, like this:
<!-- <rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <License rdf:about="http://creativecommons.org/licenses/by-nc-sa/1.0">
    <requires rdf:resource="http://web.resource.org/cc/Attribution" />
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" />
    <requires rdf:resource="http://web.resource.org/cc/ShareAlike" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
  </License>
</rdf:RDF> -->
(For what it's worth, Movable
Type's TrackBack system also works by embedding RDF
descriptions of web resources into (X)HTML comments; most of what I
say about the Creative Commons case applies to TrackBack, too.)
From a conceptual point of view, setting aside the exigencies of
the actual world for the moment, this is not a sound or even
coherent strategy. The point of describing the licensing terms of a
web resource in RDF is to enable a machine to consume those
licensing terms and, based on choices a programmer has already
made, take appropriate action with regard to that web resource. In
other words, licensing terms constitute a constraint on what a
machine may legally do with the resource they address; for example,
to distribute copies of the resource or to refrain from
distributing copies. The point of HTML comments is to allow humans
to include information which is solely intended for
human consumption in a resource. In short, markup language comments
are for communicating with humans, not with machines. The problem
with incoherent strategies is that it's not always possible to
predict all the ways in which they will fail or go bad.
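To make "consume those licensing terms and take appropriate action"
concrete, here is a minimal sketch in Python. It reads the Creative
Commons description shown above with the standard library's plain XML
parser rather than a real RDF parser, an assumption that only holds
for this particular serialization. The names license_terms and
may_distribute, and the policy they encode, are hypothetical
illustrations and not anything Creative Commons specifies.

```python
import xml.etree.ElementTree as ET

CC = "http://web.resource.org/cc/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The licensing description from the article, used as sample input.
LICENSE_RDF = """<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <License rdf:about="http://creativecommons.org/licenses/by-nc-sa/1.0">
    <requires rdf:resource="http://web.resource.org/cc/Attribution" />
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" />
    <requires rdf:resource="http://web.resource.org/cc/ShareAlike" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
  </License>
</rdf:RDF>"""


def license_terms(rdf_xml):
    """Collect the permits/requires/prohibits terms of each License.

    Assumption: the RDF is serialized exactly in the element style
    shown above, so a plain XML parse is enough to read it.
    """
    root = ET.fromstring(rdf_xml)
    terms = {"permits": set(), "requires": set(), "prohibits": set()}
    for lic in root.iter("{%s}License" % CC):
        for child in lic:
            kind = child.tag.replace("{%s}" % CC, "")
            target = child.get("{%s}resource" % RDF, "")
            if kind in terms and target.startswith(CC):
                terms[kind].add(target[len(CC):])
    return terms


def may_distribute(terms, commercial):
    """Hypothetical policy: the choice a programmer has already made."""
    if "Distribution" not in terms["permits"]:
        return False
    if commercial and "CommercialUse" in terms["prohibits"]:
        return False
    return True


terms = license_terms(LICENSE_RDF)
print(may_distribute(terms, commercial=False))
print(may_distribute(terms, commercial=True))
```

For the license above, the first call allows distribution and the
second refuses it, because the terms prohibit commercial use.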
From a practical standpoint, embedding RDF in XML or (X)HTML
comments works, but only within a limited range of contexts and
conditions. Consider what you have to do, in the general case, to
consume Creative Commons RDF licensing terms in an XHTML comment of
a web resource. You have to decide whether to consume the XHTML web
resource as XHTML -- in other words, to pass it to an XML
parser and then to interact with it by means of some API -- or as
an opaque string of characters. If you've decided to treat XHTML as
XHTML, your XML processing framework has to preserve XML comments
and then make them available programmatically. (And you still have
to sort through all the comments contained in the parsed
representation of an XML resource, trying to figure out if any of
the comments contain something that looks like or is RDF, which you
can do either by using a regular expression on the contents of each
comment or by trying to parse the contents of each comment as
XML...) Otherwise, you don't have a choice: you must treat the
resource as a string of characters. Some XML parsing frameworks do
not preserve comments, and it's hard to see how they can be said to
be doing the wrong thing by not preserving them.
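The XML-parser route described above can be sketched in Python as
follows. This assumes Python 3.8 or later, where the standard
library's TreeBuilder accepts insert_comments=True; without that
option the parser silently discards comments, which is exactly the
problem. The function name rdf_in_comments and the sample page are
made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"

# A made-up XHTML page with one ordinary comment and one RDF comment.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <!-- an ordinary note for human readers -->
    <p>Some licensed content.</p>
    <!-- <rdf:RDF xmlns="http://web.resource.org/cc/"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <License rdf:about="http://creativecommons.org/licenses/by-nc-sa/1.0">
        <permits rdf:resource="http://web.resource.org/cc/Distribution" />
      </License>
    </rdf:RDF> -->
  </body>
</html>"""


def rdf_in_comments(xhtml):
    """Return parsed rdf:RDF elements found inside XML comments."""
    # Ask the tree builder to keep comments in the tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xhtml, parser=parser)
    found = []
    for node in root.iter():
        if node.tag is not ET.Comment:
            continue  # a regular element, not a comment
        try:
            # Try to parse the comment text as XML in its own right.
            candidate = ET.fromstring((node.text or "").strip())
        except ET.ParseError:
            continue  # an ordinary comment meant for humans
        if candidate.tag == RDF_ROOT:
            found.append(candidate)
    return found


for rdf in rdf_in_comments(XHTML):
    for lic in rdf:
        print(lic.tag)
```

Even in this small sketch, every comment in the document has to be
parsed a second time just to learn whether it was meant for a
machine, and the sketch only sees comments that sit inside the root
element.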
If, on the other hand, you've decided to treat an XHTML
resource as an opaque string of characters -- refusing to take
advantage of all the value offered by XHTML resources in the first
place -- you're stuck with the task of using, say, regular
expressions to comb through the string, looking for bits of text
which look like a particular kind of RDF -- a brittle operation at
best. Once you've identified some bits of text which may be RDF,
and which may be RDF descriptions of licensing terms, you still
have to consume them, either by writing an ad hoc
parser or by parsing them as RDF.
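The opaque-string approach can be sketched the same way. The pattern
below is deliberately naive and is only one of many a scraper might
use: it assumes the embedded RDF uses the literal prefix rdf: and
fills the entire comment. The name scrape_rdf and the sample page are
hypothetical, and the matched text is read with a plain XML parser
rather than a real RDF parser.

```python
import re
import xml.etree.ElementTree as ET

# Naive pattern: a comment whose whole body is an <rdf:RDF> element.
COMMENT_RDF = re.compile(
    r"<!--\s*(<rdf:RDF\b.*?</rdf:RDF>)\s*-->", re.DOTALL
)

# A made-up tag-soup page that no XML parser would accept.
PAGE = """<html><body><p>Unclosed paragraph<br>
<!-- <rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<License rdf:about="http://creativecommons.org/licenses/by-nc-sa/1.0">
<permits rdf:resource="http://web.resource.org/cc/Distribution" />
</License>
</rdf:RDF> -->
</body></html>"""


def scrape_rdf(page):
    """Return parsed elements for comment bodies that look like RDF."""
    blocks = []
    for match in COMMENT_RDF.finditer(page):
        try:
            blocks.append(ET.fromstring(match.group(1)))
        except ET.ParseError:
            pass  # looked like RDF but was not well-formed XML
    return blocks


print(len(scrape_rdf(PAGE)))

# The same RDF with a different, equally legal namespace prefix.
renamed = PAGE.replace("rdf:", "r:").replace("xmlns:rdf=", "xmlns:r=")
print(len(scrape_rdf(renamed)))
```

The second call finds nothing, although the renamed document carries
exactly the same RDF: the namespace prefix is arbitrary, but the
regular expression depends on it. That is the brittleness in
miniature.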
The sole advantage of embedding RDF into markup language
comments is that it's simple. It doesn't require the person doing
the embedding to understand much about the web beyond
cut-and-paste. That is a real advantage, but it's not clear how
much it's worth, especially when there are alternatives. The main
alternatives to embedding RDF in (X)HTML comments -- and I don't
see any good reason to think these alternatives cannot coexist --
are to turn the machine-consumable licensing terms into a
first-class web resource or to put them into (or associate them
with) an RSS file. I am agnostic as to which of these solutions is best, in
large part because "best" is context-dependent and
interest-relative. What is best in this case depends almost
entirely on what you need to do and where you need to do it.
RSS and Linking
The RSS 1.0 and 2.0 communities have both managed, in their own
distinct ways, to accommodate the Creative Commons project. There's
an RSS 1.0 Creative Commons module,
mod_cc, and there's also an RSS 2.0 creativeCommons
RSS Module. Despite the various divergences and differences of
opinion between these two communities, their Creative Commons
solutions are similar. In each case, the approach is to associate
the license terms of a web resource with an RSS file, which is
itself a machine-consumable, alternate version of a web resource or
a collection of resources. What's shaking out in the transitional
period, during which the Human and Machine Webs are learning to
cohabitate, is that RSS, of whatever variety, is becoming, as a
matter of convention and social agreement, the place to put
machine-consumable metadata about a resource or a collection of
resources (i.e., a "site").
The key aspect of RSS's success is convention and social
agreement. The part of this story which is yet to be told
is whether there will be any widespread convention and social
agreement about the third way of dealing with RDF and (X)HTML. In
this approach, the various permutations of RDF licensing terms become
first-class web resources of their own, which means giving them a
URI (as, for example, Creative Commons has done). Once the
licensing terms you prefer -- which may be a mixture of Creative
Commons RDF vocabularies and other RDF predicates and terms; there
is no reason you cannot include, say, Dublin Core predicates and
terms in your licensing resource -- are web resources, you
associate them with the web resources you wish to license by
linking to them. One way to do that, and the way RSS
communities have used to foster automatic discovery of RSS
resources, is by placing a link element inside the
head of the resource in question. For example:
<link rel="license-terms" href="/License.rdf" type="application/rdf+xml" />
The content of the rel attribute is, in my view, the
conventional, yet crucial, bit. If widespread convention and social
agreement evolve around the content of the rel attribute,
then people will be able to program machines to look for
link children in head elements that have the
conventional rel-attribute value, with some justified
confidence that the resource which the link points to is
one which contains a machine-consumable description of licensing
terms for the resource in question. That's not ideal in every
circumstance, but it's sane, elegant, and clean enough to become a
viable alternative. The problem, so far, is that, unlike RSS, the
link solution has no natural constituency or community to
push it, which means that the requisite convention, based on social
agreement, has been slow to coalesce.
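Programming a machine to follow such a convention takes very little
code. The Python sketch below looks for link children of head whose
rel attribute carries the value used in the example above. That
value, "license-terms", is this article's illustration and not an
established convention; the class name LicenseLinkFinder, the function
license_links, and the example.org URLs are likewise hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LicenseLinkFinder(HTMLParser):
    """Collect href values of <link> elements in <head> with a given rel."""

    def __init__(self, rel="license-terms"):
        super().__init__()
        self.rel = rel
        self.in_head = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attrs = dict(attrs)
            # rel may hold several space-separated values.
            rels = (attrs.get("rel") or "").lower().split()
            if self.rel in rels and attrs.get("href"):
                self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def license_links(page, base_url):
    """Return absolute URLs of the licensing resources a page links to."""
    finder = LicenseLinkFinder()
    finder.feed(page)
    return [urljoin(base_url, href) for href in finder.hrefs]


PAGE = """<html><head><title>An essay</title>
<link rel="license-terms" href="/License.rdf" type="application/rdf+xml" />
</head><body><p>Text.</p></body></html>"""

print(license_links(PAGE, "http://example.org/essays/one.html"))
```

The machine never has to look inside a comment or guess at what a
run of text might be. It fetches the linked resource and hands it to
an RDF parser, and the page itself can be XHTML or tag soup.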
Transitional periods are exciting, interesting times. But they
are also dangerous because it's never quite clear which temporary,
transitional solutions -- ones which everyone agrees are ugly,
inelegant hacks -- are going to outlast the transition. The history
of technology is full of examples of transitional strategies that
wouldn't die, and couldn't be or simply weren't killed, turning into
the very problems which the next big solution is designed to solve.
Among the dangers of the present transitional period, as we move
from a Human Web to a Human Web with a Machine supplement, embedding
machine-consumable information within human-consumable comments may
well turn out to be one we end up living with for far longer than
anyone intended or imagined.