
Splitting and Manipulating Strings
by Bob DuCharme
May 01, 2002
XSLT is a language for manipulating XML documents, and XML
documents are text. When you're manipulating text, functions for
searching strings and pulling out substrings are indispensable for
rearranging documents to create new documents. The XPath string
functions incorporated by XSLT give you a lot of power when you're
manipulating element character data, attribute values, and any other
strings of text that your stylesheet can access. We'll start by
looking at ways to use these functions to split up strings and how a
PCDATA element might be split into subelements.
To demonstrate the first few functions, we'll use the following
simple document:
<poem>
<verse>Seest thou yon dreary Plain, forlorn and wild,</verse>
<verse>

       The seat of desolation, void of light,

</verse>
</poem>
(Note how the second verse element begins and ends with
some extra spaces and carriage returns -- we'll learn about a function
that tells the XSLT processor to ignore them.) The following template
adds the complete contents of each verse element in the
sample document above to the result tree at line 1 and then
demonstrates various ways to pull substrings out of them. Curly braces
in the result make it easier to see exactly which substrings are
getting pulled out of the verse elements. (Complete
stylesheets with these sample templates, along with the input and
output used to demonstrate them, are available in this zip file.)
<!-- xq319.xsl: converts xq318.xml into xq320.txt -->
<xsl:template match="verse">
1. By itself: {<xsl:value-of select="."/>}
2. {<xsl:value-of select="substring(.,7,6)"/>}
3. {<xsl:value-of select="substring(.,12)"/>}
4. {<xsl:value-of select="substring-before(.,'dreary')"/>}
5. {<xsl:value-of select="substring-after(.,'desolation')"/>}
</xsl:template>
Before talking about the individual functions, let's look at what
this stylesheet does to the sample document:
1. By itself: {Seest thou yon dreary Plain, forlorn and wild,}
2. {thou y}
3. {yon dreary Plain, forlorn and wild,}
4. {Seest thou yon }
5. {}
1. By itself: {

       The seat of desolation, void of light,

}
2. {   The}
3. {e seat of desolation, void of light,

}
4. {}
5. {, void of light,

}
The source document has two verse elements, so the "verse"
template rule adds two sets of lines 1 through 5 to the result. Each
line 1 in the result shows the complete contents of the verse
element. For the second verse element, line 1 includes the
extra whitespace around the source document's text.
Lines 2 and 3 of the stylesheet demonstrate the
substring() function. In line 2, the function call
substring(.,7,6) takes the verse element's contents
(because "." abbreviates self::node()) and, starting
at its seventh character, gets six characters. For the first
verse element, it skips the first six characters ("Seest " with its
trailing space) to start at the seventh and get the six-character
string "thou y". For the second verse element, the six characters to
skip on the way to that seventh character are two carriage returns and
four spaces, so that the six-character string starting at the seventh
character is "   The" (three spaces followed by the three letters you
see). Line 3 of the stylesheet has no third parameter to specify the
length of the substring to extract, so the substring(.,12)
function call starts at the twelfth character and gets everything to
the end of the string. For the second verse element, this
includes the two carriage returns that end it.
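A few edge cases of substring() are easy to check for yourself. The following is a minimal sketch, not one of the stylesheets from the zip file: the variable name "line" is made up for the illustration, and because the template only reads that variable, it should run against any XML input document.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical test string: the text of the first verse element. -->
  <xsl:variable name="line"
      select="'Seest thou yon dreary Plain, forlorn and wild,'"/>

  <xsl:template match="/">
    <!-- Positions are counted from 1, so this gets the first five characters. -->
    <xsl:text>a. {</xsl:text>
    <xsl:value-of select="substring($line,1,5)"/>
    <xsl:text>}&#10;</xsl:text>
    <!-- A length that runs past the end returns whatever is left. -->
    <xsl:text>b. {</xsl:text>
    <xsl:value-of select="substring($line,42,100)"/>
    <xsl:text>}&#10;</xsl:text>
    <!-- A start position past the end returns an empty string. -->
    <xsl:text>c. {</xsl:text>
    <xsl:value-of select="substring($line,100)"/>
    <xsl:text>}&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

With a conforming XSLT 1.0 processor, this should produce "a. {Seest}", "b. {wild,}", and "c. {}" on three separate lines.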

The function call substring-before(.,'dreary') in line 4
of the stylesheet looks for the string passed as the second argument
in the string passed as the first argument (., or
self::node()). If it finds it, it returns everything in the
first parameter's string before that occurrence of the second
string. When looking for "dreary" in the first verse element,
the function finds it and returns the string "Seest thou yon "; in the
second verse element, it doesn't find it, and nothing appears
between the curly braces of the fourth line for that element.
The function call substring-after(.,'desolation')
resembles substring-before except that if it finds the second
argument in the first argument's text, it returns the string
after that text. The first verse element doesn't have
the string "desolation", so nothing appears between the curly braces
of the first line 5. The second verse element does have this
string, and the XSLT processor puts the characters after it (the
string ", void of light," followed by two carriage returns) between
the curly braces of the result document's second line 5.
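Both functions split at the first occurrence of the search string, which makes them handy for taking delimited values apart. Here is a small sketch under assumed data: the variable "date" and its value are invented for the example, and the template ignores the content of whatever input document it is given.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical delimited value; not part of the sample documents. -->
  <xsl:variable name="date" select="'2002-05-01'"/>

  <xsl:template match="/">
    <!-- Everything before the first hyphen. -->
    <xsl:text>year: </xsl:text>
    <xsl:value-of select="substring-before($date,'-')"/>
    <!-- Everything after the first hyphen, later hyphens included. -->
    <xsl:text>&#10;rest: </xsl:text>
    <xsl:value-of select="substring-after($date,'-')"/>
    <!-- Nesting the two calls isolates the middle field. -->
    <xsl:text>&#10;month: </xsl:text>
    <xsl:value-of
        select="substring-before(substring-after($date,'-'),'-')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The expected result is "year: 2002", "rest: 05-01", and "month: 05", each on its own line.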
The next stylesheet demonstrates a more diverse group of XPath
string functions.
<!-- xq321.xsl: converts xq318.xml into xq322.txt -->
<xsl:template match="verse">
1. {<xsl:value-of select="concat('length: ',string-length(.))"/>}
2. <xsl:if test="contains(.,'light')">
<xsl:text>light: yes!</xsl:text>
</xsl:if>
3. <xsl:if test="starts-with(.,'Seest')">
<xsl:text>Yes, starts with "Seest"</xsl:text>
</xsl:if>
4. {<xsl:value-of select="normalize-space(.)"/>}
5. {<xsl:value-of select="translate(.,'abcde','ABCD')"/>}
</xsl:template>
With the same source document as the previous example, this new
stylesheet creates this result:
1. {length: 46}
2.
3. Yes, starts with "Seest"
4. {Seest thou yon dreary Plain, forlorn and wild,}
5. {Sst thou yon DrAry PlAin, forlorn AnD wilD,}
1. {length: 49}
2. light: yes!
3.
4. {The seat of desolation, void of light,}
5. {

       Th sAt of DsolAtion, voiD of light,

}
Line 1 of this stylesheet demonstrates two functions:
string-length(), which returns the number of characters in
the string passed as an argument, and concat(), which
concatenates its argument strings into one string. The function call
concat('length: ',string-length(.)) shows that its arguments
don't have to be literal strings; you can use functions that return
strings (or values that can easily be converted into strings, like the
integer returned by the string-length() function) as arguments as
well. This, along with its ability to accept any number of arguments
greater than one, makes concat() a very flexible
function.
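To see concat() mixing literal strings, a variable, and a numeric function result in a single call, consider this minimal sketch. The variable "word" is a made-up name used only here, and the template does not depend on the input document's content.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical test value, borrowed from the second verse. -->
  <xsl:variable name="word" select="'desolation'"/>

  <xsl:template match="/">
    <!-- Four arguments: a variable, a literal, a number, and a literal.
         The number from string-length() is converted to a string
         automatically. -->
    <xsl:value-of
        select="concat($word,': ',string-length($word),' characters')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

This should write the line "desolation: 10 characters" to the result.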
Lines 2 and 3 of the stylesheet (which each take up more than one
line of the stylesheet) each have an xsl:if instruction that
uses a boolean string function: one that evaluates a certain
condition about a string or strings and returns a boolean true if the
condition is true. The first function call,
contains(.,'light'), checks whether its first argument
contains the string passed as the second argument and returns a
boolean true if it does. For the source document's first
verse element it doesn't, so nothing appears after the first
"2" in the result. The second verse element does, so the
message "light: yes!" appears in the result.
Line 3's xsl:if instruction has a similar function call in
its test attribute: starts-with(.,'Seest'), which
only returns true if the string in its first argument starts with the
string in its second. This is true for the first verse
element, so the message 'Yes, starts with "Seest"' appears on the
result tree, but it isn't true for the second verse element, so there
is nothing after its "3".
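The same boolean functions work in the test attribute of xsl:when, and both compare characters exactly, so case matters. The following sketch assumes a made-up variable named "line" that holds the first verse's text; it is an illustration rather than one of the column's stylesheets.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical test string: the text of the first verse element. -->
  <xsl:variable name="line"
      select="'Seest thou yon dreary Plain, forlorn and wild,'"/>

  <xsl:template match="/">
    <xsl:choose>
      <!-- False: the string starts with an upper-case "S". -->
      <xsl:when test="starts-with($line,'seest')">
        <xsl:text>starts with lower-case seest</xsl:text>
      </xsl:when>
      <!-- True: "Plain" occurs in the middle of the string. -->
      <xsl:when test="contains($line,'Plain')">
        <xsl:text>contains Plain</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>no match</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Because the first test fails on the case difference and the second succeeds, the result should be the single line "contains Plain".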
Line 4's normalize-space(.) function call accepts one
argument, strips whitespace at its beginning and end, replaces any
sequence of whitespace in the string with a single space character,
and returns the resulting string. The whitespace characters it
targets are the space, the tab, the carriage return, and the line
feed. The first verse element's text looks the same when
processed by this function, but the second verse element's
text is definitely different: all the leading and trailing space
characters have been removed. An XML processor does this to the spaces
in most kinds of attributes, and it's handy to be able to do it to
element character data as well, especially when you want to compare
two strings of element character data whose only difference may be the
spacing around them in their source document, as we'll see in next
month's column.
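As a quick illustration of why this matters for comparisons, here is a minimal sketch that compares two strings with and without normalize-space(). The variables "a" and "b" are invented for the example, and the template ignores the input document's content.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Two hypothetical strings that differ only in their spacing. -->
  <xsl:variable name="a" select="'The seat of desolation'"/>
  <xsl:variable name="b" select="'  The  seat of   desolation '"/>

  <xsl:template match="/">
    <!-- The raw strings are not equal because of the extra spaces. -->
    <xsl:text>raw comparison: </xsl:text>
    <xsl:value-of select="$a = $b"/>
    <!-- After normalization, both strings are identical. -->
    <xsl:text>&#10;normalized comparison: </xsl:text>
    <xsl:value-of select="normalize-space($a) = normalize-space($b)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The first comparison should report "false" and the second "true".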
Line 5's translate() function gives you a way to map one
set of characters to another. It goes through the string in the first
argument and replaces any characters that are also in the second
argument with the corresponding character in the third argument. If
the third argument has no corresponding character, then the XSLT
processor deletes the one found in the first string. In the example,
the function call translate(.,'abcde','ABCD') maps the
letters "a", "b", "c", and "d" to their upper-case
equivalents. Because the letter "e" is in the second argument but not
the third, it's mapped to nothing; any occurrences of it are removed
from the copy of the first argument's string that the function
returns.
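Two common uses of translate() in XSLT 1.0 are case conversion over the whole alphabet and deleting unwanted characters. The sketch below assumes made-up variable names ("lower", "upper", and "code") and a made-up test value; it does not depend on the input document's content.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical helper variables holding the two alphabets. -->
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <!-- Hypothetical test value. -->
  <xsl:variable name="code" select="'15a-7'"/>

  <xsl:template match="/">
    <!-- Each lower-case letter maps to the upper-case letter at the
         same position; digits and the hyphen are left alone. -->
    <xsl:text>upper case: </xsl:text>
    <xsl:value-of select="translate($code,$lower,$upper)"/>
    <!-- An empty third argument deletes every hyphen. -->
    <xsl:text>&#10;no hyphen: </xsl:text>
    <xsl:value-of select="translate($code,'-','')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The result should be "upper case: 15A-7" followed by "no hyphen: 15a7".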
Let's look at a more realistic example of some of these string
manipulation functions. In the following, the binCode element
represents a wine brand's location on the wine store shelf. The first
two characters are its row, the third character its shelf, and the
text after the hyphen is its product number.
<winelist>
<wine>
<winery>Lindeman's</winery>
<product>Bin 65</product>
<year>1998</year>
<price>6.99</price>
<binCode>15A-7</binCode>
</wine>
<wine>
<winery>Benziger</winery>
<product>Carneros</product>
<year>1997</year>
<price>7.55</price>
<binCode>15C-5</binCode>
</wine>
<wine>
<winery>Duckpond</winery>
<product>Merit Selection</product>
<year>1996</year>
<price>14.99</price>
<binCode>12D-1</binCode>
</wine>
</winelist>
The following template rule separates the three components of the
binCode element type into separate elements: row,
shelf, and prodNum, all inside of a
productLocation container element.
<!-- xq324.xsl: converts xq323.xml to xq325.xml -->
<xsl:template match="binCode">
  <productLocation>
    <row><xsl:value-of select="substring(text(),1,2)"/>
    </row>
    <shelf><xsl:value-of select="substring(.,3,1)"/>
    </shelf>
    <prodNum><xsl:value-of select="substring-after(text(),'-')"/>
    </prodNum>
  </productLocation>
</xsl:template>
The call to substring() that creates the row
element has text() as its first argument. For the purposes of
this stylesheet, this means the same thing as
".". (Technically, text() refers to the text node
child of the context node and "." refers to a string
representation of the node's contents when used as the first parameter
to the substring() function.) The result XML looks like the
input except that the XSLT processor has replaced each
binCode element with the productLocation element and
its three child elements:
<?xml version="1.0" encoding="UTF-8"?>
<winelist>
<wine>
<winery>Lindeman's</winery>
<product>Bin 65</product>
<year>1998</year>
<price>6.99</price>
<productLocation><row>15</row><shelf>A</shelf>
<prodNum>7</prodNum></productLocation>
</wine>
<wine>
<winery>Benziger</winery>
<product>Carneros</product>
<year>1997</year>
<price>7.55</price>
<productLocation><row>15</row><shelf>C</shelf>
<prodNum>5</prodNum></productLocation>
</wine>
<wine>
<winery>Duckpond</winery>
<product>Merit Selection</product>
<year>1996</year>
<price>14.99</price>
<productLocation><row>12</row><shelf>D</shelf>
<prodNum>1</prodNum></productLocation>
</wine>
</winelist>
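The binCode template rule handles only binCode elements, so for the rest of the wine list to pass through unchanged, a complete stylesheet needs something that copies everything else. One common way to do that is an identity template. The following sketch is an assumption about how such a stylesheet could be put together, not necessarily what xq324.xsl in the zip file contains; it uses "." throughout where the column's version mixes "." and text().

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copies any node or attribute that no more
       specific template rule matches. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- More specific rule: split the bin code into its three parts. -->
  <xsl:template match="binCode">
    <productLocation>
      <!-- First two characters: the row. -->
      <row><xsl:value-of select="substring(.,1,2)"/></row>
      <!-- Third character: the shelf. -->
      <shelf><xsl:value-of select="substring(.,3,1)"/></shelf>
      <!-- Everything after the hyphen: the product number. -->
      <prodNum><xsl:value-of select="substring-after(.,'-')"/></prodNum>
    </productLocation>
  </xsl:template>
</xsl:stylesheet>
```

Because a template that matches an element by name has a higher default priority than one that matches node(), the processor picks the binCode rule for binCode elements and falls back to the identity template for everything else.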
Next month, we'll look at how to compare two elements to see if
they're the same. We'll also look at a way to implement a global
string replace with an XSLT stylesheet. (If you can't wait until then,
see my book, XSLT
Quickly, from which these columns are excerpted.)