There is a strange kind of correspondence between the desired and
actual results. What the actual result tree is saying might be
translated as "The angle brackets in the following lines are not to be
treated as markup delimiters, but as literal characters." And guess
what? That's exactly how the CDATA section in this (or any other)
source document suggests markup-significant characters should be
treated. Whoever created that document evidently imagined himself or
herself to be doing the downstream application a favor -- as though by
shrouding the embedded HTML markup in a CDATA section it was protected
from tampering by alien forces (like one of those blasted XSLT
processors). In fact, what wrapping in CDATA did was to announce to
any markup-aware application, "This looks like markup but
really isn't -- it's not even HTML." Under the circumstances, the
assumptions made by the XSLT processor are quite reasonable.
All that said, here's something for you to try. (It's worked for me
with both the MSXML and Saxon XSLT processors.) In your XSLT
stylesheet, include this top-level element:
<xsl:output method="text"/>
This approach may seem counterintuitive, even weird. After all, if
the problem resides in the input side of the transformation, what good
would specifying the output's characteristics do?
But in the absence of any xsl:output element at all, the
XSLT processor attempts to figure out the stylesheet's intentions by
examining the result tree from the transformation. This figuring-out
uses a series of tests whose purpose is to determine whether the
result tree is HTML (and by default, the version is HTML 4.0,
not XHTML); if not, the result tree is assumed to be a
well-formed XML general parsed entity. (Such an entity may or may not
be a well-formed document. For instance, the root node may
contain two child elements.) The four tests of an HTML result tree
(all of which must be true) are:
- the result tree's root node has an element child (that is, it
has a root element);
- the local name of the root element (discounting any namespace
prefix) is "html";
- the root html element has no namespace URI associated with
it; and
- the only text nodes preceding the result tree's root element
are whitespace-only text nodes.
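To make the tests concrete, here is a sketch of three hypothetical result trees and the output method each would get by default when the stylesheet has no xsl:output element. The element content is invented for illustration, and the third case is shown as it would be serialized.

```xml
<!-- Passes all four tests: defaults to the html output method. -->
<html>
  <body><p>Hello</p></body>
</html>

<!-- Fails the namespace test: the root element is in the XHTML
     namespace, so the default is the xml output method. -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>Hello</p></body>
</html>

<!-- Fails the first test: the result tree is a lone text node
     with no element child, so the default is the xml output
     method and the angle brackets are escaped on output. -->
&lt;p&gt;Hello&lt;/p&gt;
```

Note that the second case looks like HTML to a human reader but is not treated as HTML by the processor's default rules.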
In the case of a document like the one you describe, these tests
are almost immaterial: no matter how much it looks like it contains
markup, a CDATA section by definition contains only literal text. So
by default, there is no "root element" in the above result tree, whether
html or anything else. There's just a string of literal
characters which happens to start with a literal <
character. Since the result tree fails the HTML test, the processor
guesses the result tree is simply a well-formed general parsed entity
-- consisting, in this case, of a single text node.
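But by specifying method="text", you short-circuit the
processor's default behaviors, instructing it not to make any
assumptions at all about the nature of the result: the text nodes in
the result tree are simply written out as they are, with no escaping
of characters like < and &.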
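As a minimal sketch of the trick, assume a source document like the one shown in the comment below. The element names item and description are hypothetical, since the original source document isn't reproduced here.

```xml
<!-- Assumed (hypothetical) source document:

     <item>
       <description><![CDATA[<p>A <b>bold</b> claim</p>]]></description>
     </item>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Write text nodes out as-is, with no escaping of < or & -->
  <xsl:output method="text"/>

  <!-- Emit only the string value of the description element -->
  <xsl:template match="/">
    <xsl:value-of select="item/description"/>
  </xsl:template>

</xsl:stylesheet>
```

With the xsl:output element in place, the output should be the literal string <p>A <b>bold</b> claim</p>. Without it, the same text node would be serialized as XML, giving something like &lt;p&gt;A &lt;b&gt;bold&lt;/b&gt; claim&lt;/p&gt; (typically preceded by an XML declaration).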
(There are two dangers in using this little trick, by the way.
First, it's global: you can't apply it selectively to some
sections of the source/result trees but not to others. Second, and
more importantly, if the "markup" within the CDATA section isn't
well-formed, it will simply be passed without complaint to the result
tree. If the downstream application meant to consume this result tree
is XML- or HTML-aware, you may be faced with disastrous downstream
complications.)
Q: I keep losing a trailing space inside my empty-element
tags.
To keep my XHTML compatible with older browsers (like Netscape
4.77), my XSLT transformation includes a space before the trailing
slash on empty XHTML elements, like this:
<xsl:template match="model/name">
<em>Model Name: </em>
<xsl:apply-templates/><br />
<!-- Note space ^ -->
</xsl:template>
However, the output of the transformation ends up looking something like this:
<em>Model Name: </em> Nimbus
2000<br/>
<!-- No space ^ -->
That's fine for newer browsers, but older browsers don't recognize
<br/> as a <br> tag, and hence
ignore it, which is just no good. I've looked at a number of
techniques for controlling whitespace in XML (Bob DuCharme's series,
for instance), but all of these techniques focus on the content of
elements, not the element tags themselves. I recognize that XML has
its reasons for handling whitespace the way it does, and that from an
XML perspective trying to control whitespace within a tag is a
little batty. But does anyone know of a workaround, short of fixing it
with, say, a Perl script after the transformation?
A: A Perl script? After the transformation?
<shudder/> I mean, I love Perl, but still....
There are a couple of approaches to resolve this issue.
First, remember that an empty element can be represented by a
contiguous start tag/end tag pair, like:
<br></br>
So you may be able to put this into the result tree instead of the
empty-tag form, <br/> (with or without the space
before the slash). One problem with this solution is that some
older browsers may interpret this as two br elements in sequence.
A better solution is a variation of the answer to the first
question in this month's column. As I described above, the XSLT
processor makes an educated guess about the result tree. I don't know
why this educated guess is failing to recognize your result tree as
HTML 4.0 (which is readable by both older and newer browsers). But you
can force the interpretation with this top-level element:
<xsl:output method="html"/>
In this case, for instance, when your stylesheet includes an
XML-compliant <br/> tag (again, with or without the
space), a compliant processor will output it in the HTML-compliant
<br> form.
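Here is a sketch of the whole arrangement, built around the template from the question. It assumes the literal result elements are in no namespace, because the html output method serializes elements that have a namespace URI (such as the XHTML namespace) as XML. The input shown in the comment is a guess at the questioner's document.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize the result tree as HTML 4.0 instead of XML -->
  <xsl:output method="html"/>

  <!-- Assumed input: <model><name>Nimbus 2000</name></model> -->
  <xsl:template match="model/name">
    <em>Model Name: </em>
    <xsl:apply-templates/><br />
  </xsl:template>

</xsl:stylesheet>
```

The output should look something like <em>Model Name: </em>Nimbus 2000<br>, although line breaks and indentation may differ from one processor to another.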
I realize this may introduce an unwanted wrinkle to your problem:
it forces the output to be not XHTML but plain old dumb
HTML 4.0. Unfortunately we're at a transitional stage in both browser
and XHTML development. If I were you, I'd leverage the still-forgiving
nature of the newer browsers rather than coding to XHTML strict
standards and hoping that older browsers will somehow function as
expected. (They often didn't comply with standards in place at the
time the browsers were built; it's no wonder they adhere to
newer standards even less rigorously.)