What's in a Name?
by Leigh Dodds
November 29, 2000
This week, XML-Deviant looks at an XML-DEV discussion on the
best practices for identifying XML resources; then wonders why more
developers aren't taking advantage of entity management systems.
Entity Management
Correctly naming resources and objects is widely regarded as one of
the most difficult problems in computing (another being caching). As
the saying goes, any problem in computing can be solved by adding
another level of indirection. One step toward solving naming problems
is to add indirection by separating the name of the resource from its
address. This is a common pattern, which we see in a number of areas
from pointers in C to Persistent URLs (PURLs)
on the web.
XML 1.0 offers a separation between the naming and addressing of
resources or entities referred to in XML documents. Broadly speaking,
a SYSTEM identifier gives the address of an actual resource, which is
dereferenced to retrieve the entity in question. A PUBLIC identifier
simply gives a name for the required resource. It says nothing about
where that resource may be retrieved from.
Of course life isn't really that simple, and it's likely that some
readers are already objecting. The short but heated XML-URI debate
earlier this year testifies to the disagreement on this issue. A
SYSTEM identifier is specified as a URI, which can just as easily be a
Uniform Resource Name (URN) as the more commonly found URL. A URN is
more like a PUBLIC identifier, as it simply names the resource in
question. Yet there is still no widely deployed means of using URNs.
This glosses over disagreements about whether a URI is actually a
name or address, a completely different debate. For most purposes,
this distinction is probably the most useful: A SYSTEM identifier is an
address, a PUBLIC identifier is a name.
We've covered some of these issues previously in XML-Deviant (see
"Filling In The Gaps"), when a discussion about identifiers took place
on XML-DEV back in April. The advice given then was to provide PUBLIC
identifiers with your documents and maintain a local catalog of
identifiers (i.e., store the addresses associated with those PUBLIC
names). These cataloging facilities are often referred to as "entity
management systems," as they can do more than just provide a
look-up table of names and addresses.
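To make the look-up idea concrete, here is a minimal sketch of such a catalog, assuming it is held as a simple in-memory table. The `CATALOG` contents, the local file path, and the function names are illustrative assumptions, not part of any particular entity manager. The PUBLIC and SYSTEM identifiers used are those of the XHTML 1.0 Strict document type declaration, and the whitespace normalization follows the matching rule XML 1.0 gives for public identifiers.

```python
# Hypothetical local catalog: PUBLIC identifier (a name) -> local address.
# The file path below is an illustrative assumption.
CATALOG = {
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        "file:///usr/local/share/dtd/xhtml1-strict.dtd",
}


def normalize_public_id(public_id):
    """Collapse runs of whitespace and trim the ends before matching."""
    return " ".join(public_id.split())


def resolve(public_id, system_id, catalog=CATALOG):
    """Return the address to fetch for an external entity.

    Prefer the catalog entry for the PUBLIC identifier; fall back to the
    SYSTEM identifier given in the document when the name is unknown.
    """
    if public_id is not None:
        address = catalog.get(normalize_public_id(public_id))
        if address is not None:
            return address
    return system_id


# A known name is mapped to the local copy.
local = resolve(
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
)

# An unknown name falls through to its SYSTEM identifier.
remote = resolve("-//Example//DTD Unknown//EN", "http://example.org/unknown.dtd")
```

The name stays the same wherever the document travels; only the catalog needs to know where the local copy lives.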
Fragile Resources
Always keeping a keen eye on interoperability issues, Simon
St. Laurent observed that many
parsers fail if they cannot properly dereference a SYSTEM
identifier.
SYSTEM identifiers, or more properly, the SystemLiteral which
contains the content of the SYSTEM identifier, are defined as URIs,
conforming to RFC 2396. These URIs are "meant to be dereferenced to obtain
input for the XML processor to construct the entity's replacement text."
In common practice, that's meant using URLs, typically HTTP-based
URLs. Validating (and some non-validating) XML parsers tend to report errors
when they can't retrieve the content referenced by a SystemLiteral, since
effectively it means that they can't validate the document.
SYSTEM identifiers are, therefore, a possible failure point in your
XML application. Norm Walsh recommended
using an entity management system and PUBLIC identifiers to
improve robustness.
At the very least, you should use a PUBLIC identifier as well since
that allows an entity manager to do the right thing even in the presence of
varying system identifiers.
Michael Mealling, author of a recent RFC
describing an IANA-managed XML registry, also said
entity management was the right solution.
My hope is that XML parsers will make sure that they have entity
resolvers that allow the local parser to match URIs used in the parsing
process, thus ensuring that parsers don't need access to a network in order
to be able to work. It seems kind of problematic to me to require that your
parser is part of a network in order to use the DTD that you have locally...
Potential failures therefore encompass not only incorrect
identifiers but also the possibility that a resource is
unavailable. The Internet operates on a best-effort basis. Is this
really acceptable for a mission-critical XML application?
Free tools for entity management have been
available for some time, as we have previously reported. Yet few
developers seem to use them.
What approach should you take to achieve the greatest degree of
interoperability: SYSTEM or PUBLIC identifiers? Simon St. Laurent advocated
using both as a best practice.
This is the approach the W3C takes with XHTML. I'd suggest this
makes more sense than the alternatives, since the PUBLIC identifier allows
processors which support entity resolution to use it (as Mozilla does with
XHTML) but provides a canonical URL which developers who've never heard of
entity resolvers (lots of them) can still use.
...For DTDs and schemas, resolvability really matters. I'd stick
to the combination of a public identifier and a 'guaranteed' URI..., but
make clear that the public identifier is the critical piece and the
SystemLiteral is only provided for backup.
Among other things, that would let people without entity resolvers
point to local URLs while still identifying the document with the right
PUBLIC identifier.
One might wonder why these kinds of problems aren't surfacing
daily. It may be that the lack of runtime validation in many
applications means that remote resources are not being retrieved. Tim
Bray noted that retrieving
DTDs and schemas is an infrequent operation.
...across the universe of XML processing, the proportion of times
that the DTD or schema actually gets fetched is pretty small; for example,
your average XHTML agent is not going to go chasing after DTDs in the course
of displaying web pages, and your average b2b code probably doesn't do a lot
of DTD munging.
In Bray's opinion, SYSTEM identifiers are the better option, although this
doesn't preclude the use of entity management systems: they just naturally
grow to encompass caching of retrieved resources, etc. For example, W3C XML
Schema includes a schemaLocation attribute, which
...provides hints from the author to a processor regarding the
location of schema documents.
...Note that the schemaLocation is only a hint and some processors
and applications will have reasons to not use it; For example, an HTML
editor may have a built-in HTML schema.
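As a rough illustration of treating schemaLocation as a hint, the sketch below assumes the attribute value is a whitespace-separated list of namespace and location pairs, as in the XML Schema Recommendation's instance-document attribute (the draft current at the time of this discussion may differ in detail). The `BUILT_IN` table, the `builtin:xhtml` label, and the function names are hypothetical.

```python
def parse_schema_location(value):
    """Split a schemaLocation value into (namespace, location) pairs."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("schemaLocation must hold namespace/location pairs")
    return list(zip(tokens[0::2], tokens[1::2]))


# Hypothetical schemas bundled with the application, keyed by namespace.
BUILT_IN = {"http://www.w3.org/1999/xhtml": "builtin:xhtml"}


def choose_schema(namespace, hinted_location, built_in=BUILT_IN):
    """Treat the author's location as a hint: a built-in schema wins."""
    return built_in.get(namespace, hinted_location)


pairs = parse_schema_location(
    "http://www.w3.org/1999/xhtml http://example.org/xhtml.xsd "
    "http://example.org/po http://example.org/po.xsd"
)
chosen = [choose_schema(namespace, location) for namespace, location in pairs]
```

Here the application ignores the hinted location for the namespace it already knows and follows the hint only for the one it does not.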
If and when generic XML browsers start appearing, we may see these
issues occurring more frequently: such applications will have to
flexibly handle new document types, namespaces, and schemas as they
are delivered, most of which aren't likely to be built-in. A
properly-layered XML application will allow an entity management
system to be plugged in to support retrieval of any required
resource.
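One way such a plug-in point can look is sketched below, using the `EntityResolver` interface from Python's standard `xml.sax` package. This is an illustration under stated assumptions rather than a reference design: the `CatalogResolver` class, its two look-up tables, and the placeholder DTD bytes are all hypothetical.

```python
import io
from xml.sax.handler import EntityResolver
from xml.sax.xmlreader import InputSource


class CatalogResolver(EntityResolver):
    """Serve known entities from a local store instead of the network."""

    def __init__(self, by_public_id, by_system_id):
        # Both tables map an identifier to the entity's content as bytes.
        self._by_public_id = by_public_id
        self._by_system_id = by_system_id

    def resolveEntity(self, publicId, systemId):
        # Try the name first, then the address.
        content = self._by_public_id.get(publicId)
        if content is None:
            content = self._by_system_id.get(systemId)
        if content is None:
            # Unknown entity: return the SYSTEM identifier unchanged so the
            # caller can fall back to ordinary retrieval.
            return systemId
        source = InputSource(systemId)
        source.setPublicId(publicId)
        source.setByteStream(io.BytesIO(content))
        return source


XHTML_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Strict//EN"
XHTML_SYSTEM_ID = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"

resolver = CatalogResolver(
    by_public_id={XHTML_PUBLIC_ID: b"<!-- placeholder for a local DTD copy -->"},
    by_system_id={},
)

hit = resolver.resolveEntity(XHTML_PUBLIC_ID, XHTML_SYSTEM_ID)
miss = resolver.resolveEntity(None, "http://example.org/other.dtd")
```

An object like this can be registered on a SAX parser with `setEntityResolver`. Whether the parser consults it for a given external entity depends on how the parser is configured, for example on the `feature_external_ges` setting.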