
New and Improved String Handling
by Bob DuCharme
August 06, 2003
In my June column last
year, I discussed XSLT 1.0 techniques for comparing two strings for
equality and doing the equivalent of a "search and replace" on your source
document. XSLT 2.0 makes both of these so much easier that describing the
new techniques won't quite fill up a column, so I'll also describe some
1.0 and 2.0 functions for concatenating strings. Notice that I say "1.0"
and "2.0" without saying "XSLT"; that's because these are actually XPath
functions available to XQuery users as well as XSLT 2.0 users. The
examples we'll look at demonstrate what they bring to XSLT
development.
String Comparison
The string comparison techniques described before were
really boolean tests that told you whether two strings were equal or
not. The new compare() function does more than that: it tells
whether the first string is less than, equal to, or greater than the
second according to the rules of collation used. "Rules of collation"
refers to the sorting rules, which can apparently be tweaked to account
for the spoken language of the content. (The XQuery 1.0 and XPath 2.0
Functions and Operators document tells us that "Some collations,
especially those based on the Unicode
Collation Algorithm can be 'tailored' for various purposes. This
document does not discuss such tailoring.")
The following stylesheet, which can be run with any document
as a source document, has six calls to the compare()
function. (All XSLT 2.0 examples were tested with version 7.6.5 of Michael Kay's
Saxon XSLT processor.)
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:variable name="color">red</xsl:variable>
<xsl:template match="/">
1. 'qed' and 'red': <xsl:value-of select="compare('qed','red')"/>
2. 'red' and $color: <xsl:value-of select="compare('red',$color)"/>
3. 'red' and ' red ': <xsl:value-of select="compare('red',' red ')"/>
4. 'red' and normalize-space(' red '): <xsl:value-of
select="compare('red',normalize-space(' red '))"/>
5. 'RED' and $color: <xsl:value-of select="compare('RED',$color)"/>
6. upper-case('RED') and upper-case($color): <xsl:value-of
select="compare(upper-case('RED'),upper-case($color))"/>
</xsl:template>
</xsl:stylesheet>
Before discussing the individual calls, let's look at the
result of running the stylesheet:
1. 'qed' and 'red': -1
2. 'red' and $color: 0
3. 'red' and ' red ': 1
4. 'red' and normalize-space(' red '): 0
5. 'RED' and $color: -1
6. upper-case('RED') and upper-case($color): 0
The compare() function returns a -1 if a sort would
put the string in its first argument before the one in its second
argument, 1 if it would come after, and 0 if the two arguments are
equal. (The function only works with strings. Use the <, =, and >
operators to compare other data types such as numbers, dates, and
booleans.) Line 1 of the stylesheet result shows that "qed" is
alphabetically less than "red" because "q" comes before "r" in the
alphabet. Line 2 shows the use of a variable as an argument to
compare(); a variable storing the string "red" is equal to the
literal string "red".
Lines 3 and 4 demonstrate an issue from my earlier column on
comparing strings: dealing with extra spaces. A space character gets
sorted before any letter of the alphabet, so " red " sorts before "red"
and the call to compare() in line 3 returns a 1. Line 4 shows that
enclosing the string " red " in a call to the normalize-space()
function trims the leading and trailing spaces, thereby passing the
string "red" to the compare() function. This is particularly
handy when comparing the contents of an element to another string because
the use of spaces in XML documents is often inconsistent.
The last two lines demonstrate the effect of case on string
comparison. Line 5 shows that a sort would put the upper-case string
"RED" before the lower-case string "red". While the compare()
function offers no option for a case-insensitive string comparison, it's
easy enough to do: use the new upper-case() function to convert
both arguments to upper-case and compare those. This way, whether your two
arguments are "red" and "RED" or "rEd" and "ReD", the string comparison
won't care about the case of the letters.
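The two techniques combine naturally when testing element content. The following is a minimal sketch, not part of the original examples: it assumes a source document that contains color elements, and the message text is made up for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed input: any document containing color elements. -->
  <xsl:template match="color">
    <!-- Trim surrounding whitespace and fold case before comparing,
         so " Red ", "RED", and "red" are all treated as equal. -->
    <xsl:if test="compare(upper-case(normalize-space(.)),'RED') = 0">
      <xsl:text>found a red color element&#10;</xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- Suppress all other text so that only the messages are output. -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Normalizing both the whitespace and the case inside the call keeps the test itself a simple comparison against 0.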
Search and Replace
The XPath 1.0 translate()
function lets you map individual characters to other characters, but if
your search target or replacement string are more than one character long,
it isn't much help. A recursive named template can do the job, but it's a
lot of trouble for programmers used to text manipulation languages such as
awk, Perl, and Python, where arbitrary string replacement can be done with
much less code. The XPath 2.0 replace()
function makes this much easier.
The function takes three required parameters: the string to
act on, the pattern to search for in the first argument's string (a
regular expression, so a plain string such as 'finish' simply matches
itself), and the string to replace any matches of that pattern. An
optional fourth parameter lets you specify flags such as "m" to
operate in multiline mode and "i" to ignore case.
The function returns a copy of the first argument after
making any replacements. This is so much simpler than the XSLT 1.0 hack
for doing the same thing (which certainly didn't bother with multiline
mode or case sensitivity options) that the 43-line stylesheet from my
earlier column on comparing and replacing strings can be rewritten in 15
lines using 2.0:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="text()">
    <xsl:value-of select="replace(.,'finish','FINISH')"/>
  </xsl:template>
  <xsl:template match="@*|*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
The input and output are identical to the 1.0 version
shown in the earlier column.
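To see the optional flags argument at work, here is a small sketch that can be run against any source document. It assumes a processor that follows the XPath 2.0 function library, in which the second argument of replace() is treated as a regular expression; the sample strings are invented for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Fourth argument "i": match "finish" regardless of case. -->
    <xsl:value-of
        select="replace('Finish the finish','finish','FINISH','i')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- The pattern is a regular expression: [0-9]+ matches any run
         of digits, not just one literal string. -->
    <xsl:value-of
        select="replace('11 flavors, 31 tastes','[0-9]+','N')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The first call should produce "FINISH the FINISH" and the second "N flavors, N tastes".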
Performing Multiple Search and Replace Operations
What if, in addition to replacing "finish" with "FINISH" as
shown above, I also want to replace the string "flavors" with "tastes" and
"11" with "22"? In a procedural programming language, you might do this to
a string called "testString" with code following this model:
; pseudo-code. NOT XSLT!
testString = replace(testString,'finish','FINISH');
testString = replace(testString,'flavors','tastes');
testString = replace(testString,'11','22');
XSLT, however, is not a procedural language. Like its
ancestors Lisp and Scheme, it's a functional language. We
don't write a series of instructions to be executed one after the other;
we combine functions into expressions that return values. In the following
revision of the match="text()" template rule from above, the string returned
by each call to replace() is passed as the first argument to
another call:
<xsl:template match="text()">
  <xsl:value-of select="replace(
                          replace(
                            replace(.,'11','22'),
                            'flavors',
                            'tastes'
                          ),
                          'finish',
                          'FINISH'
                        )"/>
</xsl:template>
I tried to use some Lisp/Scheme whitespace conventions to
make it more readable, but as you can see, it wasn't entirely
successful.
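If the nesting gets hard to follow, one alternative is to give each intermediate result a name. The sketch below is only one way to do it; the variable names step1 and step2 are hypothetical. Because an XSLT variable can't be reassigned, each step gets its own variable instead of reusing one as the pseudo-code did.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="text()">
    <!-- Each variable holds the result of one replacement and feeds
         the next call; step1 and step2 are illustrative names. -->
    <xsl:variable name="step1" select="replace(.,'11','22')"/>
    <xsl:variable name="step2"
        select="replace($step1,'flavors','tastes')"/>
    <xsl:value-of select="replace($step2,'finish','FINISH')"/>
  </xsl:template>

  <!-- Copy everything else unchanged, as in the earlier stylesheet. -->
  <xsl:template match="@*|*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The replacements happen in the same order as in the nested version, so the result should be the same.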
Concatenating Strings
The XPath 1.0 concat() function returns the two or
more strings passed to it as one string. We saw its use in the column on
the XSLT 1.0 version of search and replace, as well as in the column on Setting and Using
Variables and Parameters. Of course, adding two text nodes to the
result tree one right after the other essentially concatenates them, and
this is done even more often than calling the concat() function.
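As a quick illustration of both approaches, the following sketch (which can be run with any source document) builds the same sentence twice. The color variable and the sentence text are made up for the example.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:variable name="color">red</xsl:variable>

  <xsl:template match="/">
    <!-- One call to concat() with three arguments. -->
    <xsl:value-of select="concat('The color is ',$color,'.')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- The same text built by writing adjacent text nodes. -->
    <xsl:text>The color is </xsl:text>
    <xsl:value-of select="$color"/>
    <xsl:text>.</xsl:text>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Both halves of the template should output the line "The color is red."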
One classic XML element manipulation problem is the output
of a collection of nodes as a delimited list. For example, to output the
values of the color elements in the following source document as
a comma-delimited list, we can't just output each one with a comma after
it, because we don't want to put a comma after the last one.
<colors>
  <color>red</color>
  <color>blue</color>
  <color>yellow</color>
  <color>green</color>
</colors>
A typical XSLT 1.0 approach is to use an
xsl:for-each element to output them and an xsl:if to
output a comma after each one unless it's the last node in the list being processed.
<xsl:template match="colors">
  <xsl:for-each select="color">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">
      <xsl:text>, </xsl:text>
    </xsl:if>
  </xsl:for-each>
</xsl:template>
XPath 2.0's string-join()
function lets you do this much more concisely. It takes two arguments: a
sequence (an "ordered collection of zero or more items" according to the
XQuery 1.0 and
XPath 2.0 Data Model document) and a delimiter to use when returning
the list. Look how much less code is necessary to achieve the same result
in XSLT 2.0:
<xsl:template match="colors">
  <xsl:value-of select="string-join(color,', ')"/>
</xsl:template>
This is really doing the opposite of the
tokenize() function that we learned about in the May column.
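A short sketch of that round trip, runnable with any source document: tokenize() splits a delimited string into a sequence, and string-join() puts the sequence back together with a different delimiter. The sample string and the " | " delimiter are chosen only for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Split on commas, then rejoin the pieces with " | ". -->
    <xsl:value-of
        select="string-join(tokenize('red,blue,yellow,green',','),' | ')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

This should output "red | blue | yellow | green".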
New features such as data typing and a new data model may make XSLT and
XPath 2.0 look radically different from their 1.0 counterparts, but many
of these new features are straightforward functions that are familiar from
other popular programming languages. The compare(),
replace(), and string-join() functions, which will make
common coding tasks go more quickly with less room for error, are great
examples of this.