Entities and XSLT
by Bob DuCharme
March 14, 2001
In XML, entities are named units of storage. Their names are
assigned and associated with storage units in a DTD's entity
declarations. These units may be internal entities, whose contents are
specified as a string in the entity declaration itself, or they may be
external entities, whose contents are outside of the entity
declaration. Typically, this means that the external
entity is a file outside of the DTD file which contains the entity
declaration, but we don't say "file" in the general case because XML
and XSLT work on operating systems that don't use the concept of
files.
A DTD might declare an internal entity to act like a constant in a
programming language. For example, if a document has many copyright
notices that refer to the current year, declaring an entity
cpdate to store the string "2001" and then putting the entity
reference "&cpdate;" throughout the document means that updating
the year value to "2002" for the whole document will only mean
changing the declaration.
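As a minimal sketch of that idea (the notices and notice element names and the sample text are assumptions made up for illustration), the declaration and its references might look like this:

```xml
<!DOCTYPE notices [
  <!-- Internal general entity: the replacement text is given right here. -->
  <!ENTITY cpdate "2001">
]>
<notices>
  <notice>Chapter 1 copyright &cpdate; all rights reserved</notice>
  <notice>Chapter 2 copyright &cpdate; all rights reserved</notice>
</notices>
```

Changing "2001" to "2002" in the one declaration would update both notices the next time the document is parsed.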
Internal entities are especially popular for representing characters
not available on computer keyboards. For example, while you could
insert the "ñ" character in your document using the numeric
character reference "&#241;" (or the hexadecimal version
"&#xF1;"), storing this character reference in an entity called
ntilde lets you put "España" in an XML document as
"Espa&ntilde;a", which is much easier to read than
"Espa&#241;a" or "Espa&#xF1;a". (It has the added bonus of
being familiar to those who used the same entity reference in HTML --
perhaps without even knowing that it was an entity reference.)
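A small sketch of such a declaration follows; the phrase element name is hypothetical and stands in for whatever element holds the text.

```xml
<!DOCTYPE phrase [
  <!-- Decimal 241 (hexadecimal F1) is the code point for small n with tilde. -->
  <!ENTITY ntilde "&#241;">
]>
<phrase>Espa&ntilde;a</phrase>
```

A parser reading this document passes the text "España" to the application, exactly as if the character reference had been typed in place.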
An external entity can be a file that stores part of a DTD, which
makes it an external parameter entity, or it can store part of a
document, which makes it an external general entity. For example, the
following XML document declares and references the external general
entity ext1. (Comments in sample documents refer to filenames
in this zip file.)
<!-- xq226.xml -->
<!DOCTYPE poem [
<!ENTITY ext1 SYSTEM "lines938-939.xml">
]>
<poem>
<verse>I therefore, I alone first undertook</verse>
<verse>To wing the desolate Abyss, and spy</verse>
&ext1;
<verse>Better abode, and my afflicted Powers</verse>
<verse>To settle here on Earth or in mid-air</verse>
</poem>
An XML parser reading this document will look for an external
entity named lines938-939.xml and report an error if it doesn't find
it. If it does find a file named lines938-939.xml that looks like
this,
<!-- xq227.xml (lines938-939.xml) -->
<verse>This new created World, whereof in Hell</verse>
<verse>Fame is not silent, here in hope to find</verse>
it will pass something like the following to the application using
that XML parser (for example, an XSLT processor):
<poem>
<verse>I therefore, I alone first undertook</verse>
<verse>To wing the desolate Abyss, and spy</verse>
<verse>This new created World, whereof in Hell</verse>
<verse>Fame is not silent, here in hope to find</verse>
<verse>Better abode, and my afflicted Powers</verse>
<verse>To settle here on Earth or in mid-air</verse>
</poem>
Because an XSLT stylesheet is an XML document, you can store and
reference pieces of it using the same technique, but you'll find that
the xsl:include and xsl:import instructions give
you more control over how your pieces fit together. See my November
column "Combining Stylesheets with Include and Import" for more detail.
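For comparison, here is a rough sketch of the two instructions in use. The filenames base.xsl and verses.xsl and the poemOut element are assumptions for illustration only.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <!-- xsl:import must come before any other top-level element; rules in
       this stylesheet take precedence over the imported ones. -->
  <xsl:import href="base.xsl"/>
  <!-- xsl:include pulls in templates as if they were written here. -->
  <xsl:include href="verses.xsl"/>

  <xsl:template match="poem">
    <poemOut><xsl:apply-templates/></poemOut>
  </xsl:template>
</xsl:stylesheet>
```

Unlike an external entity, which is spliced in by the XML parser before the XSLT processor sees anything, these instructions are handled by the XSLT processor itself, which is what makes import precedence possible.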
All these categories of entities are known as parsed entities
because an XML parser reads them in, replaces each entity reference
with the entity's contents, and parses them as part of the
document. XML documents can also use unparsed entities to incorporate
non-XML data; these are referenced not with entity references but as
the values of specially declared attributes.
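A hedged sketch of an unparsed entity in use is shown here. The report and figure elements, the src attribute, the chart1 entity, and the chart1.jpg filename are all hypothetical.

```xml
<!DOCTYPE report [
  <!-- The notation names the non-XML format. -->
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!-- NDATA marks this as an unparsed entity in that notation. -->
  <!ENTITY chart1 SYSTEM "chart1.jpg" NDATA jpeg>
  <!ELEMENT report (figure)>
  <!ELEMENT figure EMPTY>
  <!-- An attribute of type ENTITY takes the entity name as its value. -->
  <!ATTLIST figure src ENTITY #REQUIRED>
]>
<report>
  <figure src="chart1"/>
</report>
```

Note that the attribute value is the bare entity name, with no ampersand or semicolon.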
When you apply an XSLT stylesheet to a document, if entities are
declared and referenced in that document, your XSLT processor won't
even know about them. An XSLT processor leaves the job of parsing the
input document (reading it and figuring out what's what) to an XML
parser; that's why the installation of some XSLT processors requires
you to identify the XML parser you want them to use. (Others include
an XML parser as part of their installation.) An important part of an
XML parser's job is to resolve all entity references, so that if the
input document's DTD declares a cpdate entity as having the
value "2001" and the document has the line "copyright &cpdate; all
rights reserved", the XML parser will pass along the text node
"copyright 2001 all rights reserved" to put on the XSLT source
tree. Newcomers to XSLT often ask how they can check for entity
references such as "&#160;" or "&lt;" in the source tree, and
the answer is: you can't. By the time the document's content reaches
the source tree, it's too late.
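To make that concrete, here is a sketch of a template that tests a source text node; the para and flagged element names are assumptions for illustration. Because the stylesheet is itself parsed as XML, the "&lt;" in the test expression reaches the XSLT processor as a plain less-than character, so the test can only ask whether the character is present, never how the source document spelled it.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="para">
    <!-- True whether the source had "&lt;", "&#60;", or a CDATA section
         containing "<"; all arrive as the same character. -->
    <xsl:if test="contains(., '&lt;')">
      <flagged><xsl:apply-templates/></flagged>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```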
How about entities in your result tree? You can't add entity
declarations there, because although XSLT can add a document type
declaration to a result tree, it can't add one with an internal DTD
subset, which is the only way to add DTD declarations to a document
entity.
There are, however, ways to add entity references. If you create an
XML document in your result tree, and you add references to any
entities other than the five that all XML processors are required to
handle, whether they're declared or not (lt, gt,
apos, quot, and amp), then your document
must have a document type declaration that points to a DTD with
declarations for your entities. If you're creating an HTML document,
entity declarations aren't required, and most web browsers understand
a wide variety of entity references for special characters such as
"&eacute;" for the "é" character and "&ntilde;" for the
"ñ" character.
Let's look at various approaches to creating an entity reference in
a result tree. We'll use the following one-line document as a source
document and try to add a text node to the result that includes the
entity reference "&ntilde;" for the "ñ" character.
<test>Dagon his Name, Sea Monster</test>
If the stylesheet document has the appropriate entity declaration,
the XML parser that feeds the stylesheet and source document to the
XSLT processor will replace this entity reference in the stylesheet
with the replacement text declared for it. For this stylesheet, it
will replace "&ntilde;" with the Unicode value for the "ñ"
character:
<!-- xq230.xsl: converts xq229.xml into xq231.xml -->
<!DOCTYPE stylesheet [
<!ENTITY ntilde "&#241;" ><!-- small n, tilde -->
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:template match="test">
<testOut>
The Spanish word for "Spain" is "Espa&ntilde;a".
<xsl:apply-templates/>
</testOut>
</xsl:template>
</xsl:stylesheet>
The actual "ñ" character and not an entity
reference to it shows up in the result:
<?xml version="1.0" encoding="utf-8"?><testOut>
The Spanish word for "Spain" is "España".
Dagon his Name, Sea Monster</testOut>
Normally, your stylesheet doesn't need a DOCTYPE declaration, but
if the stylesheet has references to any entities besides the five
predeclared ones listed above, you must declare them inside a DOCTYPE
declaration. The XML parser that reads in the stylesheet for your XSLT
processor will replace any entity references with their entity values
before giving the stylesheet to the XSLT processor.
This is handy, but not what we're looking for. We want to see an
entity reference, not the entity it refers to, in the result
document. XSLT offers no way to tell the XML processor not to make
entity replacements. (Certain XSLT processors such as Xalan offer this
option as a non-standard feature.) However, XSLT does offer a way to
turn off its automatic "escaping" of certain characters -- that is, an
XSLT processor's substitution of the entity reference "&amp;" for
ampersands and "&lt;" for less-than characters in result tree text
nodes. You can turn it off for your entire result tree with an
xsl:output instruction that has a method attribute
value of "text", and you can turn it off for a single
xsl:text element by setting its
disable-output-escaping attribute to equal "yes".
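The difference can be sketched with two xsl:text elements side by side. The escaped and raw element names and the "AT&T" sample text are made up for illustration, and the comments describe what a processor that supports disable-output-escaping would be expected to write.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="test">
    <testOut>
      <!-- Default behavior: the ampersand is escaped, so the serialized
           result contains AT&amp;T. -->
      <escaped><xsl:text>AT&amp;T</xsl:text></escaped>
      <!-- Escaping disabled: the serialized result contains a bare
           ampersand, which makes the output ill-formed XML. -->
      <raw><xsl:text disable-output-escaping="yes">AT&amp;T</xsl:text></raw>
    </testOut>
  </xsl:template>
</xsl:stylesheet>
```

The second case shows why the attribute needs care: once escaping is off, nothing stops the stylesheet from producing a result that an XML parser will reject.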
The disabling of output escaping is used too often in situations
where it shouldn't be -- in particular, to create a less-than
character that starts a tag or declaration that could be added to a
result tree with a more appropriate XSLT instruction. Because it's
essentially turning off something that the XSLT processor is supposed
to do, it should be used sparingly.
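As a sketch of the alternative, the stylesheet below creates a p element (a hypothetical choice of result element) without touching output escaping; the discouraged approach appears only inside a comment.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:template match="test">
    <!-- Discouraged: building a tag from escaped text, e.g.
         <xsl:text disable-output-escaping="yes">&lt;p&gt;</xsl:text> -->

    <!-- Preferred: a literal result element (or xsl:element when the
         name must be computed) lets the processor create the tags. -->
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```

With the preferred form, the processor guarantees that the start-tag and end-tag match, which hand-built tags cannot promise.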
The following version of the stylesheet resembles the previous one
except for the replacement text specified in the ntilde
declaration. It's an xsl:text instruction with
"&amp;ntilde;" as its contents.
<!-- xq232.xsl: converts xq229.xml into xq233.xml -->
<!DOCTYPE stylesheet [
<!ENTITY ntilde
"<xsl:text disable-output-escaping='yes'>&amp;ntilde;</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output doctype-system="testOut.dtd"/>
<xsl:template match="test">
<testOut>
The Spanish word for "Spain" is "Espa&ntilde;a".
<xsl:apply-templates/>
</testOut>
</xsl:template>
</xsl:stylesheet>
The XML parser that reads the stylesheet and hands it off to the
XSLT processor will replace that "&amp;" with a "&", but
because the xsl:text element has its
disable-output-escaping attribute set to "yes", the XSLT
processor will pass along the "&ntilde;" string to the result tree
without trying to resolve it. (If it did try to resolve it, it would
cause an error, because having "&ntilde;" as the replacement text
for the ntilde entity would be an illegal recursive entity
declaration.) With the same test document, the new stylesheet
creates this output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE testOut SYSTEM "testOut.dtd">
<testOut>
The Spanish word for "Spain" is "Espa&ntilde;a".
Dagon his Name, Sea Monster</testOut>
The new stylesheet has one more difference from the earlier one: it
includes an xsl:output element. This element doesn't need a
method attribute, because the default value of "xml" is fine,
but the doctype-system attribute is important. If the result
document has an "&ntilde;" entity reference, that entity must be
declared somewhere. XSLT doesn't offer a way to include such
declarations in an internal DTD subset of the document's DOCTYPE
declaration, although some stylesheet developers have assembled hacks
to add these declarations using disable-output-escaping
kludges. The best way to ensure that these entities are properly
declared is to give the result tree a DOCTYPE declaration with a
SYSTEM identifier that points to a DTD with that declaration. The
example above adds a SYSTEM identifier that points to a
testOut.dtd file that should include a declaration for the
ntilde entity.
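The column does not show testOut.dtd itself, so the following is only a guess at a minimal version; the content model for testOut is an assumption, and the entity declaration mirrors the one used in the first stylesheet.

```xml
<!-- testOut.dtd: a sketch; the content model is an assumption -->
<!ELEMENT testOut (#PCDATA)>
<!ENTITY ntilde "&#241;"><!-- small n, tilde -->
```

With this file next to the result document, a parser reading the result can resolve the "&ntilde;" reference that the stylesheet wrote out.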
This trick works for any general entity reference you want in your
result tree, whether it references an internal entity whose contents
are included in the declaration (like the ntilde entities in
the examples above) or an external entity whose contents are stored in
an external file like the ext1 one that references the
lines938-939.xml file at the beginning of this column.
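Applied to the ext1 entity, the same technique might be sketched as follows. The poem.dtd filename is hypothetical; the sketch assumes that DTD contains the real declaration of ext1 as an external entity pointing at lines938-939.xml.

```xml
<!DOCTYPE stylesheet [
<!ENTITY ext1
  "<xsl:text disable-output-escaping='yes'>&amp;ext1;</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <!-- poem.dtd (hypothetical) must declare ext1 for the result document. -->
  <xsl:output doctype-system="poem.dtd"/>
  <xsl:template match="poem">
    <poem>
      <xsl:copy-of select="verse"/>
      <!-- Expands to the xsl:text element declared above, which writes
           the entity reference itself to the result. -->
      &ext1;
    </poem>
  </xsl:template>
</xsl:stylesheet>
```

Within the stylesheet, ext1 is an internal entity whose replacement text is the xsl:text element; only in the result document does the name refer to the external file.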
To review, you can add any kind of entity reference you want to
your result tree with the following two steps:
1. Add an entity reference to your result tree.
2. Declare the entity's contents in the stylesheet's DOCTYPE
declaration to be an ampersand, the entity name, and a semicolon all
inside of an xsl:text element with its
disable-output-escaping attribute set to "yes".