An Introduction to Streaming Transformations for XML
by Oliver Becker, Paul Brown, Petr Cimprich
February 26, 2003
Overview
This article introduces Streaming Transformations for XML (STX), a template-based XML
transformation language that operates on streams of SAX events. STX resembles XSLT 1.0, the tree-driven
transformation language for XML, but STX offers unique features and
advantages for some applications.
XSLT's popularity has grown over the past three years, both aiding and
riding on the adoption of XML. In
comparison to API-level programming with Document Object Model (DOM), XSLT provides a loosely-typed,
declarative environment tailored for tree-oriented transformation of XML
documents, which has achieved wide adoption as a general-purpose XML
manipulation tool despite the proscription:
...XSLT is not intended as a completely general-purpose XML
transformation language. Rather it is designed primarily for the kinds of
transformations that are needed when XSLT is used as part of XSL. —
XSLT 1.0
DOM versus SAX
"Which is better, DOM or SAX?" is a common question for
newcomers on XML-related discussion lists. (And, one could legimately also
ask "DOM or JDOM or DOM4J or XOM?" and
"SAX or XNI"?)
DOM provides an overall view of an XML document through tree traversal
and manipulation. DOM is heavyweight, however, in that it typically
imposes a memory footprint of around five times the size of the underlying
XML text for simple documents. DOM also imposes a significant time
overhead for creating the necessary objects.
SAX provides a sequential view
of an XML document through a stream of events. SAX-based programs
typically maintain some amount of state information that encapsulates
already-received events, but SAX processing requires a negligible amount
of memory (typically only the representation of the current event and the
buffer for the parser).
SAX is the event-oriented sibling of the DOM API. (See the sidebar for a short discussion of DOM versus
SAX.) STX provides a streaming analog for XSLT by adopting some of the now
familiar concepts from XSLT (e.g., matching based on templates and an XPath 1.0-like expression
language) but using SAX as the underlying interface to the XML
document. In line with the proscription about
XSLT, STX is neither a general purpose XML transformation language,
nor is it an attempt to improve, extend, or replace XSLT.
Like SAX, STX is a completely free, grassroots effort by the XML
community, initiated by Petr Cimprich; the specification and a mailing
list are hosted on SourceForge. The current version of the
STX specification contains a list of other contributors. There are
currently two STX processor implementations:
-
Joost, a Java-based
processor by Oliver Becker
-
STX::XML,
a Perl-based processor by Petr Cimprich.
An Illustrative Example
This section illustrates some of the relative strengths of STX through an example. The
scenario is one of a nursery that stocks a selection of trees and flowers
and receives orders in a simple XML format, illustrated below. (The full
text of the example order document is available in
XML.) Every order consists of the name, an optional
quantity which defaults to 1, and a price per
unit quantity.
<?xml version="1.0"?>
<order-list>
<order>
<name>Achillea Millefolium</name>
<quantity>50</quantity>
<price>2.93</price>
</order>
...
<order>
<name>Camelia Japonica</name>
<quantity>20</quantity>
<price>39.90</price>
</order>
<order>
<name>Ginkgo Biloba</name>
<price>19.90</price>
</order>
...
</order-list>
Example 1. An order
The initial goal is to create a stylesheet that computes the total
price (the sum over all items of the the quantity multiplied by the unit
price). This goal could be accomplished with XSLT, using recursive
invocation of named templates with parameter passing.
Here is XML source for the stylesheet; and
here is a pretty-printed HTML
rendition.
The output from the transformation is, as expected, an XHTML document containing the order formatted into
a table and the total cost computed.
Running the Example with Joost
To execute the transformation with Joost under JDK 1.4 (the SAX parser, a
version of Crimson,
bundled with JDK 1.4 is less than perfect; Ælfred2
or Xerces-J2 both
make better choices for general use XML parsers), download binary
distributions of Joost and log4j and
then run:
java -cp joost.jar;log4j-1.2.7.jar net.sf.joost.Main plants.xml order-sum.stx
Example 3. Running the stylesheet with
Joost under
JDK 1.4
The process should complete almost instanteously. On a reasonable PC,
Joost will process around 1Mb
per second through the example stylesheet. This
was estimated by timing the processing of a 250,000 line-item order.
Running under JDK 1.3
requires adding a JAXP-compliant SAX parser to the classpath.
Running the stylesheet with
STX::XML
To execute the transformation with STX::XML under
Perl, install the STX::XML module
and the prerequisite Perl modules
XML::SAX and XML::NamespaceSupport from CPAN. (The modules
XML::SAX::Expat and XML::SAX::Writer are also
recommended but not required.) Then execute the stxcmd.pl script
(located in the STX::XML
installation directory) as follows:
perl stxcmd.pl order-sum.stx plants.xml
Example 4. Running the stylesheet with
STX::XML
under Perl
Similarities with XSLT
A quick glance over the example stylesheet
should remind the reader of an XSLT stylesheet:

STX stylesheets are well-formed XML
and can thus be edited with any text or XML editor.

STX stylesheets consist of a mixture
of instructions and declarations in the STX namespace, literal
result elements, and content.
Furthering the resemblance, the STX namespace
(http://stx.sourceforge.net/2002/ns) shares many elements (by
local name) with the XSLT 1.0 namespace
(http://www.w3.org/1999/XSL/Transform), and these elements
provide similar functions to their XSLT 1.0 counterparts:

The
transform element
encloses the stylesheet. (STX does not support the "literal result
element as stylesheet" syntax described in
section
2.3 of the XSLT 1.0 specification.)

STX provides variables, both global
and local. (STX and XSLT 1.0 treat variables
differently,
however.)

STX uses declarative, rule-based
matching to drive the flow of stylesheet execution. (STX matching
is different from
XSLT 1.0 in some
important ways; read
on.)

STX has a
value-of element
to place the value of an expression on the output.

STX uses a loosely-typed expression
syntax derived from [XPath2.0], so the basic syntax is similar to
the version of XPath ([XPath1.0]) used in
XSLT 1.0.
STX also has choose, if, param, and
with-param that behave similarly to their XSLT 1.0
counterparts.
Differences Between STX and
XSLT 1.0
Nodes and Events
The most important difference between XSLT and STX is the difference
between XPath 1.0
nodes and SAX
events. The fundamental, atomic unit of operation in
XPath 1.0 is a node that exists in the context of its containing
document and inter-node relationships (parents, siblings, children, and so
on). In contrast, the fundamental unit of operation in SAX is an event
that exists without any additional context.
Programming at the level of SAX events is non-trivial and typically
involves tracking the context of an event, directing the event to an
appropriate handler based on its context, and performing an operation. To
this end, an STX processor adds a thin layer of context to a stream of SAX
events to create a notion of node robust enough to support a useful subset
of XPath expressions -- called "STXPath" -- without requiring the creation
of a DOM or similar tree-like data structure. The available context for a
node in STXPath includes:
-
Current node. The current event along with all of its
intrinsic information is accessible. The STX processor takes care
of combining sequential characters() events into a single
node.
-
Ancestor stack. The ancestor stack consists of the
ancestors of the current node (i.e., the current node plus any
startElement() events that haven't yet had the matching
endElement() occur) together with their contexts.
-
Position counter. The position counter tracks the index
of the current node among its siblings with respect to node type
and node name particles (local name, prefix, and qname). The
position index is exposed via the position() function
and.
-
Lookahead. The STX processor always reads one node
ahead so that the value of an immediately subsequent
characters() event (if any) is included in the context for
an element and made available as the value of the
node.
stx:process-* versus
xsl:apply-templates
STX does not include an analog of the apply-templates
instruction from XSLT, since an STX processor receives SAX events
in document order as opposed to being able to look for nodes in an object
model. STX provides several different process-* instructions to
control the handling of subsequent SAX events (with an optional
group="..." specification; see below for
more about groups).
Figure 1. A schematic for
process-*
One way to think about the process-* instructions is as valves
that let a certain set of events pass through part of the stylesheet,
returning control to the current template once those events have been
consumed.
-
The process-children instruction causes the STX
processor to receive and handle the child nodes, if any, of the
current node, and then return.
-
The process-attributes instruction causes the STX
processor to handle the attributes on the current node, if
any.
-
The process-siblings instruction causes the STX processor to
handle the siblings of the current node with optional while="..."
and until="..." expressions that allow it to return before all of
the siblings have been consumed. (Note that the children of the original
node and any consumed siblings will not be back, since STX processors can
only process any given node once.) Combined with the ability to create
separate startElement() and endElement() events (below), this gives STX powerful capabilities to
restructure XML content.
-
The process-self instruction causes the STX processor
to re-handle the current node, possibly using a different template
group.
Variables in STX
In contrast to side-effect-free XSLT, variables in STX are mutable and
can be freely reassigned at any point just like variables in programming
languages like Java, C, Python, and Perl. The example
stylesheet uses this to maintain the sum during the execution
of the stylesheet and to re-populate the name, quantity,
and price variables for each successive order.
Unique and Distinguishing Features
of STX
The example STX stylesheet illustrates some of
the similarities and differences between STX and XSLT 1.0, but STX
has a host of additional features.
-
Discrete
startElement() and endElement()generation. Because
an STX processor operates synchronously, it can generate literal events to
the output, and the start-element and end-element will
create the corresponding events on the output. This feature is useful when
adding structure to a flat document; for example, here is a stylesheet that groups the line items in the plants example XML document by the first letter
of the plant name, and here is the HTML
output.
-
Template groups. STX
allows templates to be formally grouped together under group
elements. Groups can expose or conceal contained templates from matching,
and shadowing of group-level variables is controlled at the template level
with the new-scope="..." attribute..
-
Event buffers. STX provides FIFO event buffers that can
be declared in stylesheets or within groups and follow the same
hierarchical rules as variables. Buffers are populated by wrapping a
portion of a template in a result-buffer element. Once the buffer
is filled, a call to process-buffer routes the events in the
buffer back through the input, where they are processed as though they had
occurred in the source document.
Buffers are useful for shuffling document fragments. (For example,
Oliver Becker implemented a
bubble sort algorithm in STX.)
Conclusion
We have demonstrated that STX is an approachable means for transforming
streaming XML with two usable implementations and an active,
community-driven development effort.