Mastering DocBook Indexes
by Jirka Kosek
Generating an Index
The DocBook XSL stylesheets generate indexes automatically.
The only thing we have to do is place an empty
index element into a
location where a real index should appear. This is usually
somewhere near the end of a document.
The stylesheets adapt the index appearance to the output
format. An index on an HTML page does not contain page
numbers, but instead uses section or chapter titles that link back to the index term occurrence in the document flow. If the output
format is HTML Help, an HTML Help index is built
instead of a simple HTML page with links.
However, print output is not without obstacles. Generating an
index in XSL is a two-phase process. The first phase is an
XSLT transformation that converts a source DocBook document
into a set of abstract formatting objects. Page numbers for
the index entries are not known at this point. The actual
rendering and page-number evaluation take place during the
second, formatting phase, which is performed by an FO processor
like FOP, XEP, or XSL Formatter.
Problems arise when one
index term occurs more than once on a page. In this case, the
index contains duplicate page numbers for this entry. We will
see how to deal with this in the following parts of this
article.
Indexes for non-English languages represent another issue.
Generating the index consists of grouping the index terms
with the same initial letter and then alphabetically sorting the entries within each letter group. The stylesheets
implement exactly this algorithm, which is unfortunately
insufficient for many languages.
For example, some languages
treat "ch" as a single
letter that should sort between "c" and "d" in traditional Spanish or between
"h" and "i" in Czech.
Diacritics add further complexity. Some
languages ignore them completely, while others apply complex rules. In
Czech, for example, words starting with the letters
"u" and "ú" belong to
the same index group, but words starting with
"c" and "č" belong to
two different groups. And we don't even want to start thinking
about the CJKV languages.
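The grouping step described above can be illustrated with a small Java sketch. It is not taken from the DocBook stylesheets; the class and method names (IndexGroups, naiveGroup, czechGroup) are hypothetical, Java 8 or later is assumed, and the Czech rule covers only the three cases mentioned in the text ("ch", "ú", and "č").

```java
import java.util.Locale;

class IndexGroups {
    // Naive rule: the group is the upper-cased first character of the term.
    static String naiveGroup(String term) {
        return term.substring(0, 1).toUpperCase(Locale.ROOT);
    }

    // Hypothetical Czech-aware rule; it handles only the cases named in
    // the text and assumes a non-empty term.
    static String czechGroup(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        if (lower.startsWith("ch")) {
            return "CH";                     // "ch" is a letter of its own
        }
        if (lower.charAt(0) == '\u00FA') {   // u with acute accent
            return "U";                      // shares a group with plain "u"
        }
        // Everything else, including c with caron, keeps its own letter.
        return lower.substring(0, 1).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of file encoding.
        String[] terms = {"cesta", "chata", "\u010D\u00E1st", "ucho", "\u00FAnor"};
        for (String term : terms) {
            System.out.println(term + ": naive=" + naiveGroup(term)
                    + ", czech=" + czechGroup(term));
        }
    }
}
```

The naive rule files "chata" under "C", whereas the Czech-aware rule gives it a "CH" group of its own. Putting the resulting groups into the correct collating order is a separate problem that this sketch does not address.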
XSLT offers very poor support for
grouping, which is why index generation is so
difficult. If you want to implement locale-aware indexing in
XSLT, you will reach the limits of the language. Fortunately, many
XSLT processors offer extensions to the XSLT core -- so we will
see next how internationalized indexing is supported in
the DocBook XSL stylesheets.
Removing Duplicate Page Numbers from a Printed Index
As I mentioned earlier, the current combination of the
XSLT and XSL-FO standards does not provide a mechanism for
removing duplicate page numbers from a printed index. This
serious drawback can be overcome in two ways. The first
solution relies on an FO processor that implements a vendor
extension for index generation. The other possibility is
to use multiple passes over a document to detect and remove
the duplicates.
The vendor extensions are supported in the two best-known
commercial FO processors -- XEP and XSL Formatter.
The DocBook XSL stylesheets contain support
for these FO implementations; we just tell the
stylesheets to use these extensions by turning on the
appropriate parameter. For instance, XEP should be invoked
with the following command line:
xep -xml document.xml -xsl .../fo/docbook.xsl -param xep.extensions=1
In the real world we usually change the behavior of the
stylesheets by customizing more than one parameter. The best
practice is then to create a customization layer, which
imports the stock stylesheets and sets all necessary parameters.
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>
  <xsl:param name="paper.type" select="'A4'"/>
  <xsl:param name="xep.extensions" select="1"/>
</xsl:stylesheet>
If you prefer XSL Formatter over XEP, you can use the similar parameter
axf.extensions to turn on the XSL Formatter support.
Either parameter results in removing duplicate page numbers and in creating
a page range for continuous sequences of page numbers. For example, if a single
index entry occurs on the following pages:
5, 5, 8, 9, 10, 37
the index will show them in this more compact and readable
form:
5, 8–10, 37
In the future such vendor extensions will no longer be
necessary, because the upcoming version 1.1 of XSL-FO
supports indexes directly.
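The effect these extensions have on the page list can be sketched in a few lines of Java. This is only an illustration of the result, not how XEP or XSL Formatter implement it; the PageRanges class and its collapse method are hypothetical names, and Java 8 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class PageRanges {
    // Drops duplicate page numbers and merges runs of consecutive
    // pages into "first-last" ranges joined by an en dash.
    static String collapse(int... pages) {
        TreeSet<Integer> unique = new TreeSet<>();   // sorted, no duplicates
        for (int page : pages) {
            unique.add(page);
        }
        List<String> parts = new ArrayList<>();
        boolean open = false;    // true once a run has been started
        int start = 0;
        int end = 0;
        for (int page : unique) {
            if (open && page == end + 1) {
                end = page;                          // extend the current run
            } else {
                if (open) {
                    parts.add(format(start, end));   // close the previous run
                }
                start = page;
                end = page;
                open = true;
            }
        }
        if (open) {
            parts.add(format(start, end));
        }
        return String.join(", ", parts);
    }

    // A run of one page is printed as a single number.
    static String format(int start, int end) {
        return start == end ? String.valueOf(start) : start + "\u2013" + end;
    }

    public static void main(String[] args) {
        // The page list from the example above.
        System.out.println(collapse(5, 5, 8, 9, 10, 37));
    }
}
```

For the page list from the example, collapse returns the string "5, 8–10, 37". Note that this sketch also turns two adjacent pages into a range; a real formatter may treat that case differently.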
When using another FO processor, we must employ a more
difficult procedure. This is also the case for the open-source
FOP processor. We must process the document twice. The first
pass is done with the make.index.markup
parameter set.
The resulting PDF will contain XML markup
for index entries and page numbers. This PDF can be converted
to plain text, from which the XML markup is extracted. The
duplicates are then removed, and the modified XML fragment of
the index is used to produce the final PDF. This process is
a real hack, and it does not work very well for languages
that use characters outside ISO Latin 1, because FOP does
not insert the proper Unicode mapping vector for embedded
fonts. This technique was invented by G. Ken Holman.
Internationalized Indexes
The DocBook XSL stylesheets adapt their output to the document
language. The document language can be specified by using the
lang attribute.
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE book PUBLIC '-//OASIS//DTD DocBook XML V4.3//EN'
'http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd'>
<book lang="de">
... German book ...
</book>
Due to the previously mentioned limitations of XSLT, the
stylesheets cannot use different grouping criteria for each
language. Fortunately, several XSLT processors offer
extensions to XSLT that can be used to overcome this
limitation. As these extensions are not backward compatible
with pure XSLT, they cannot be included in the default
stylesheet. If we want to generate an internationalized
index, we must use an EXSLT-aware XSLT processor that
supports user-defined functions. These functions can then be
used in the definition of a lookup key (xsl:key). These criteria are met
by Saxon; xsltproc still has some unresolved
issues at the time of this writing.
If we want to use the internationalized indexing features of the
stylesheets, we must create a customization layer that
overrides the default index-generating templates by including the
small autoidx-ng.xsl stylesheet.
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>
  <xsl:include href="http://docbook.sourceforge.net/release/xsl/current/fo/autoidx-ng.xsl"/>
  <!--
    Parameter settings and other modifications of stylesheet
  -->
</xsl:stylesheet>
The internationalized indexing is implemented for both HTML
and print (FO) output. Each output format has its own
autoidx-ng.xsl file in the
corresponding directory. The stylesheets currently support
the internationalized indexing for the following languages:
Czech, Danish, German, English, Spanish, French, and Turkish.
The described method of internationalization places each
index term into the correct letter group and the groups are
sorted in proper collating order. Sorting of entries within
one group is left to the XSLT processor, which may be a
problem because many XSLT processors support only the English
sort order out-of-the-box. Saxon 6.5.3 (the recommended
version for use with the DocBook stylesheets) can be easily
extended to support user-defined collation.
We first create
a simple implementation of a TextComparer, which must be named after the
language code. For example, for German we must create a class
named Compare_de.
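package com.icl.saxon.sort;

import java.text.Collator;
import java.util.Locale;

// The class name carries the language code ("de" for German).
public class Compare_de extends TextComparer
{
    // Create the collator once instead of on every comparison.
    private final Collator deCollator =
        Collator.getInstance(new Locale("de", "DE"));

    int caseOrder = UPPERCASE_FIRST;

    // Delegate the comparison of two strings to the German collator.
    public int compare(Object a, Object b)
    {
        return deCollator.compare(a, b);
    }

    public Comparer setCaseOrder(int caseOrder)
    {
        this.caseOrder = caseOrder;
        return this;
    }
}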
Then we must compile this class into Java bytecode:
javac -classpath /path/to/saxon.jar Compare_de.java
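The resulting file Compare_de.class
must be on the CLASSPATH when Saxon is invoked in order
to get the proper German sorting. Because the class belongs to the
com.icl.saxon.sort package, the class file must sit in a matching
com/icl/saxon/sort directory under a CLASSPATH entry. The same procedure applies
to other languages.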
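To see what the collator inside Compare_de changes, the following standalone sketch sorts the same words twice: once by plain string comparison and once with the German java.text.Collator. It does not depend on Saxon; the GermanSortDemo class and its method names are hypothetical, and Java 8 or later is assumed.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class GermanSortDemo {
    // Plain String ordering compares UTF-16 code units, so a word
    // starting with A-umlaut lands after every word starting with A-Z.
    static List<String> sortByCodeUnit(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // The German collator treats A-umlaut as a variant of A.
    static List<String> sortGerman(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collator german = Collator.getInstance(Locale.GERMANY);
        copy.sort(german);
        return copy;
    }

    public static void main(String[] args) {
        // \u00C4 is A-umlaut; the escape avoids source-encoding problems.
        List<String> words = Arrays.asList("Zebra", "\u00C4pfel", "Apfel");
        System.out.println("code units: " + sortByCodeUnit(words));
        System.out.println("German:     " + sortGerman(words));
    }
}
```

With plain comparison "Äpfel" ends up after "Zebra", while the German collator places it directly after "Apfel". This is the ordering that Compare_de makes available to Saxon.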