Table of Contents
XML performance and XSLT
Scoring System
Result Verification
The Tests
Results
Acknowledgments
XSLTMark is a benchmark for the comprehensive measurement of XSLT
processor performance. It consists of forty test cases designed to
assess important functional areas of an XSLT processor. The latest
release, version 2.0, has been used to assess ten different
processors. This article describes the benchmark methodology and
provides a brief overview of the results.
Important Update (2001/04/04). Since the first publication
of this article, errors in the methodology have been discovered. Although
they do not affect the "headline news," they are significant. For full
details, please read the complete
explanation below in the "Results" section. The performance
chart has been updated to reflect the corrections.
XML performance and XSLT
The performance of XML processing in general is of considerable
concern to customers and engineers alike. With more and more
XML-encoded data being transmitted and processed, the ability to both
predict and improve XML performance is critical to delivering scalable
and reliable solutions. While XSLT is a big part of delivering on the
overall value proposition of XML (by allowing XML-XML data interchange
and XML-HTML content presentation), it also presents the greatest
performance challenge. Early anecdotal evidence showed wide
disparities in real-life results, and no comprehensive benchmark tools
were available to obtain more systematic assessments and
comparisons. We first created and started using XSLTMark internally at
DataPower in mid-2000 and released it publicly in November. By using a
well-balanced test base it is possible to make predictions about
likely application behavior and to select the best processor for a
given application. Considerable improvements have been made by some
XSLT engines over the past six months, but it is also clear that
further performance improvements will be required to support the
growth of XML.
Scoring System
We considered many possible scoring systems for measuring XSLT
performance. A survey of existing non-XML benchmark platforms revealed
that many benchmarks use abstract unit-less scores to rate
performance. These scores are often composed of weighted averages of
separate benchmark components that would not be otherwise
aggregated. While the abstract scoring method is excellent for
relative studies of performance, it lacks the value of a "real-world"
number in a standard unit.
Most of the other efforts to assess XSLT performance have centered
around execution time measurements for a small number of test cases
(where lower scores are better). These numbers are again great for
relative comparison, but they are hard to assess in an absolute sense
(i.e., in relation to other types of computer processing).
For these reasons, and because XML is increasingly becoming part of
the network, XSLTMark uses kilobytes-per-second as its overall score,
where kilobytes are the average of input and output document
size. This provides a score that is tied to both the document size and
the time expended for processing. The variations between scores for
different test cases are then attributable to the complexity and
specifics of processing performed by the stylesheet and the structure
of the input document. By examining the detailed data for the
individual cases a great deal of additional knowledge can be
gleaned. We conducted some preliminary tests to obtain
nodes-per-second measurements, but in the end we settled on
kilobytes-per-second as the best way to characterize real-world
performance.
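As a rough illustration of the per-test score described above, the following Python sketch computes kilobytes per second from the average of input and output document size. The function name, the iterations parameter and the choice of 1024 bytes per kilobyte are assumptions made for this example, not details taken from XSLTMark itself.

```python
def kb_per_second(input_bytes, output_bytes, elapsed_seconds, iterations=1):
    """Throughput for one test case, in KB/s (hypothetical helper)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    # "Kilobytes" are the average of input and output document size.
    # Assumption: 1 KB = 1024 bytes.
    kb_per_iteration = (input_bytes + output_bytes) / 2 / 1024
    return kb_per_iteration * iterations / elapsed_seconds


# 100 iterations over a 40,960-byte input producing a 20,480-byte output,
# taking 2.0 seconds in total.
score = kb_per_second(40_960, 20_480, 2.0, iterations=100)
print(f"{score:.1f} KB/s")  # prints "1500.0 KB/s"
```

Because the score divides by elapsed time, a stylesheet that does more work per byte shows up directly as a lower number for that test case.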
The first two releases of XSLTMark reported as their total score an
aggregate KB/s measurement, computed from the total execution
time and total kilobytes processed. We like this calculation because
it strictly measures the overall performance of the processor on a
very broad range of tasks. It is important to understand that this
aggregate score gives more weight to computationally intensive test
cases: since the score is based on the total execution time, the
"slower" test cases will have a greater effect on the score. This
contrasts with an arithmetic mean of individual test case scores,
which is weighted in favor of "faster" test cases.
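The difference in weighting is easy to see with two made-up test cases. The sketch below is only an illustration; the function names and the sample figures are invented for the example and are not XSLTMark data.

```python
def aggregate_score(cases):
    """Total kilobytes divided by total seconds, across all test cases."""
    total_kb = sum(kb for kb, _ in cases)
    total_seconds = sum(seconds for _, seconds in cases)
    return total_kb / total_seconds


def arithmetic_mean_score(cases):
    """Plain average of the individual KB/s scores."""
    return sum(kb / seconds for kb, seconds in cases) / len(cases)


# Two hypothetical test cases as (kilobytes processed, seconds elapsed):
# a "fast" one at 1000 KB/s and a "slow" one at 10 KB/s.
cases = [(100.0, 0.1), (100.0, 10.0)]

print(f"aggregate:       {aggregate_score(cases):.1f} KB/s")
print(f"arithmetic mean: {arithmetic_mean_score(cases):.1f} KB/s")
```

With these figures the aggregate comes out near 19.8 KB/s, close to the slow case, while the arithmetic mean is about 505 KB/s, dominated by the fast case.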
XSLTMark 2.0 introduces a geometric mean score in addition to the
aggregate score. We include this measurement because it provides an
average of test case scores in a manner that is not weighted by the
qualities of individual test cases. Specifically, scaling the
throughput of a single test case results in a scaling of the geometric
mean by a factor that does not depend on which test case is
scaled.
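The scaling property can be checked numerically. In this sketch (sample scores are invented), doubling the throughput of any one of n test cases multiplies the geometric mean by the same factor, 2 ** (1/n), whichever case is chosen.

```python
import math


def geometric_mean(scores):
    """Geometric mean of positive KB/s scores, computed via logarithms."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


scores = [1000.0, 250.0, 40.0, 10.0]  # hypothetical per-test KB/s values
baseline = geometric_mean(scores)

# Double one test case at a time and compare against the baseline.
ratios = []
for i in range(len(scores)):
    scaled = list(scores)
    scaled[i] *= 2
    ratios.append(geometric_mean(scaled) / baseline)
    print(f"case {i} doubled: geometric mean x {ratios[-1]:.4f}")
```

Every line reports the same factor (about 1.1892 for four cases), which is the sense in which the geometric mean is not weighted toward fast or slow test cases.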
In order to support both C/C++ and Java processors, XSLTMark uses
wall clock time (elapsed real world time, rather than CPU seconds) as
obtained by gettimeofday() or Java's System.currentTimeMillis(). This
means that benchmarking must occur on an unloaded system, and tests
should execute enough iterations that real-time clock granularity
and interrupt effects do not distort the results. Considerable time was
invested in ensuring that this approach produced precise and accurate
measurements.
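A minimal sketch of this style of measurement, written in Python rather than the C or Java of the actual drivers, might look like the following. The helper name, the min_seconds threshold and the stand-in workload are assumptions for illustration; this is not the XSLTMark harness.

```python
import time


def time_wall_clock(run_once, min_seconds=1.0):
    """Repeat run_once() until at least min_seconds of wall-clock time pass.

    Returns (iterations, elapsed_seconds).
    """
    iterations = 0
    start = time.time()  # wall-clock time, comparable to gettimeofday()
    elapsed = 0.0
    while elapsed < min_seconds:
        run_once()
        iterations += 1
        elapsed = time.time() - start
    return iterations, elapsed


# Stand-in workload; a real harness would invoke the XSLT processor here.
def fake_transform():
    sum(i * i for i in range(10_000))


iterations, elapsed = time_wall_clock(fake_transform, min_seconds=0.2)
print(f"{iterations} iterations in {elapsed:.3f} s "
      f"({elapsed / iterations * 1000:.3f} ms each)")
```

Running for a minimum total duration, instead of timing a single pass, keeps the measured interval large relative to the clock's resolution. In Python, time.perf_counter() is a reasonable alternative when the system clock may be adjusted during a run.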
Result Verification
The spotty compliance of many XSLT processors meant that we had to
spend considerable time manually verifying the output of various
tests. DataPower's internal projects also required that results be
verified, so basic compliance checking was built into XSLTMark early
on. The intent is not to provide a compliance test suite; although
XSLTMark is comprehensive in its functional area coverage and presents
a balanced performance assessment, it is not comprehensive enough for
a full compliance suite. We look forward to the compliance efforts of
OASIS and W3C. XSLTMark's compliance testing exists to ensure that
largely incomplete processors do not receive unfairly high
benchmark scores. This is especially important because implementing many
parts of the XSLT specification correctly means a certain performance
penalty. Often a processor that does well on a subset of cases but
fails many others will be considerably slower by the time it achieves
full compliance. (This was the case for Transformiix, Mozilla's XSLT
processor, which has made great progress in compliance but at a cost
to performance.)
Result verification is achieved by normalizing the output using
DataPower's "dgnorm" (for "normalizer") tool. This simple C program is
a SAX processor that removes insignificant whitespace, handles HTML
peculiarities, sorts attributes alphabetically and does some other
processing to make the output of XSLTMark stylesheets directly
amenable to "diff" and byte-wise comparison. After normalization, a
simple comparison of a reference result and the output is
performed. (Purists correctly protest that dgnorm is not a general XML
normalizer; it is only suitable for normalizing the results of
XSLTMark test cases.)
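The normalize-then-compare idea can be sketched in a few lines of Python. This is not dgnorm: it builds a tree with the standard library instead of using SAX, treats all whitespace-only text as insignificant, and ignores the HTML peculiarities mentioned above. The function names and sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET


def normalize(xml_text):
    """Return a simplified normal form suitable for byte-wise comparison."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Assumption: whitespace-only text between tags is insignificant.
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        if elem.tail is not None and not elem.tail.strip():
            elem.tail = None
        # Re-insert attributes in alphabetical order.
        ordered = sorted(elem.attrib.items())
        elem.attrib.clear()
        elem.attrib.update(ordered)
    return ET.tostring(root, encoding="unicode")


def matches_reference(output_xml, reference_xml):
    """Compare processor output with a reference after normalization."""
    return normalize(output_xml) == normalize(reference_xml)


reference = '<table border="1" width="80%"><tr><td>42</td></tr></table>'
output = ('<table width="80%" border="1">\n  <tr>\n'
          '    <td>42</td>\n  </tr>\n</table>')

print(matches_reference(output, reference))  # prints True
```

Differences in attribute order and indentation disappear after normalization, so any remaining mismatch points at the content itself.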
Sometimes there is more than one correct
result, so it is still necessary to inspect every "CHK OUTPUT"
line to confirm that it reflects a real compliance problem. This
is why some benchmark results have a few manually corrected
scores. Previous XSLTMark releases triggered comments from a number of
prominent XSLT implementers, and some of the thorny compliance
ambiguities have been resolved in the current version. The "number" test
case was difficult to assess due to ambiguity in the XSLT
specification and widespread disagreement among processors, so we
omitted the associated reference file; hence the "NO REFERENCE" found
in the detailed results.