Online Magazines with Apache Cocoon
by Steve Punte
April 16, 2003
In this article I demonstrate what I call XML-directed solutions by
using Apache Cocoon to create an online magazine.
XML-directed solutions are those where XML, rather than a programming
language, is used to control the application. If you are considering
entering into the world of online publications or are thinking about
upgrading your existing technology, consider how elegantly Apache
Cocoon provides a publishing framework.
Overview
You are reading this article via the web magazine XML.com. Ever
wonder what goes on behind the scenes? It can be really quite simple.
Web publications are content presentation services, unlike an
interactive application such as a stock trading service. Publication
services typically have very little state information, while a stock
trading service contains significant mutable information: customer
data, equity trades, transactions, etc.

In this article I examine a very simple and elegant two-layer
solution for online publishing; it presents articles stored in a local
repository or directly utilizes feeds from other online magazines and
news services. It turns out that the management of articles and of
news stories is very similar, and much of this type of content is
converging on the use of RSS. Thus, an appropriate architectural
tactic is to divide the problem into two parts: the article repository
layer and the presentation layer. Figure 1 represents a top-level
perspective.
Figure 1: Top Level Design
A key architectural feature of this solution is that no
application-specific Java or other procedural software is required.
All necessary functionality and operation is achieved using existing
off-the-shelf Apache Cocoon components, supplying them with
appropriate XML configuration information. Such solutions are
"XML-directed architectures" and are expected to play an increasingly
dominant role thanks to the software component interoperability that
XML provides.
Design and Implementation
RSS
In a nutshell, RSS
is an XML vocabulary for describing content such as the headlines of a
news site or the latest articles of an online magazine.
RSS is one of those standards that fights like hell not to be a
standard. To begin with, there is no agreement on the acronym RSS.
To make matters worse, the dominant versions of RSS are incompatible.
Mark Pilgrim's article, "What Is
RSS?", is a good place to get caught up with RSS.
The presentation layer assumes the news feed is in one of the
dominant RSS formats, converting it into the RSS 1.0 format for
uniformity. RSS only specifies the delivery of the content headlines,
not the body of the story or article.
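For orientation, a minimal RSS 1.0 document looks roughly like the following. This is only a sketch: the channel and item values are placeholders (example.org is not a real feed), and only the element structure is taken from the RSS 1.0 format. Note that each item carries a headline, a link, and a short description, but not the article body.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <!-- The channel describes the feed itself (placeholder values) -->
  <channel rdf:about="http://example.org/rss-feed.rss">
    <title>Example Magazine</title>
    <link>http://example.org/</link>
    <description>Latest articles (placeholder text)</description>
    <!-- Table of contents: one rdf:li per item in the feed -->
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.org/articles/1.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <!-- One headline; the link points at the full article -->
  <item rdf:about="http://example.org/articles/1.html">
    <title>A placeholder headline</title>
    <link>http://example.org/articles/1.html</link>
    <description>A one-sentence summary of the article.</description>
  </item>
</rdf:RDF>
```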
Apache Cocoon Architecture
Solutions realized by the Apache Cocoon framework are constructed
by way of "pipelines" (see Getting
Started with Cocoon for an introductory tutorial). In a nutshell,
each pipeline is a sequence of XML processing steps beginning with a
"generator" (represented in Figure 2 below as a pentagon-shaped
block), followed by any number of "transformers" (triangle-shaped
blocks), and finally terminated by a "serializer" (hexagon-shaped
block).
Two standard Cocoon components make up nearly all of this
particular solution. They are, first, the URI Generator, which simply
retrieves XML content given any URI; and, second, the XSL Transformer,
which can be configured to utilize any number of XSLT engines (by
default it uses Apache Xalan).
Apache Cocoon offers a wide variety of standard components which can
be further examined in the Apache Cocoon User
Docs.
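As a rough illustration of the generator, transformer, serializer sequence, a minimal sitemap fragment might look like this. The pattern and file names (hello.html, content/hello.xml, stylesheets/hello2html.xsl) are invented for the example and are not part of the magazine application.

```xml
<!-- Hypothetical example: pattern and file names are invented for illustration. -->
<map:match pattern="hello.html">
  <!-- Generator: reads XML from a URI and emits it as SAX events -->
  <map:generate src="content/hello.xml"/>
  <!-- Transformer: applies an XSLT stylesheet to the event stream -->
  <map:transform type="xslt" src="stylesheets/hello2html.xsl"/>
  <!-- Serializer: writes the result out as an HTML character stream -->
  <map:serialize type="html"/>
</map:match>
```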
Architecture and Design
The entire architecture consists of four Cocoon pipelines as shown
in Figure 2. Only two pipelines (i.e. the "/home" and the "/article"
pipeline) are intended for the end-user.
Figure 2: Internal Pipeline Design
The "/home" pipeline and associated URL portion exist for the
purpose of displaying summaries of the top available articles. The Apache
Cocoon sitemap pipeline construct is shown below. The first step
of the pipeline is to retrieve the appropriate RSS document. This
could be from the local RSS repository or could be a well known remote
source depending upon which magazine is selected (i.e. variable {1}).
Notice that this solution uses the Apache Cocoon sitemap "One of N"
switch functionality (<map:select>). This
construct provides a simple mechanism to uniquely post-process a
particular feed source. In the case of NewsForge, we convert its
RSS-0.91 format into RSS-1.0 using a standard XSLT component
configured with stylesheet document rss-91.xsl. Finally,
the feed is converted to HTML and the appropriate styling and graphics
are added; see Figure 3 for the results.
<!-- HOME PAGE APACHE COCOON PIPELINE FRAGMENT -->
<!-- Use local or remote RSS feed to populate home page. -->
<map:match pattern="home/*.html">
  <!-- Use second field of URI to determine RSS source. -->
  <!-- These values are hardcoded here and in common.xsl -->
  <map:select type="parameter">
    <map:parameter name="parameter-selector-test" value="{1}"/>
    <!-- Obtain on-line from O'Reilly Net. -->
    <map:when test="oreillynet">
      <map:generate src="http://www.oreillynet.com/meerkat/?_fl=rss10&amp;t=ALL&amp;c=47"/>
    </map:when>
    <!-- Obtain headlines from this local file inside application. -->
    <map:when test="local">
      <map:generate src="http://localhost:8080/cocoon-mag/rss-feed.rss"/>
    </map:when>
    <!-- Obtain on-line from NewsForge. -->
    <!-- Note: Format is in RSS-0.91 -->
    <map:when test="newsforge">
      <map:generate src="http://www.newsforge.com/newsforge.rss"/>
      <map:transform type="xslt" src="rss-91.xsl"/>
    </map:when>
  </map:select>
  <!-- Presentation Layer: Convert RSS-1.0 to our HTML -->
  <map:transform type="xslt2" src="home.xsl">
    <map:parameter name="global-source" value="{1}"/>
  </map:transform>
  <!-- Send off as HTML character stream -->
  <map:serialize type="html"/>
</map:match>
Figure 3: Top Level Magazine Home Page
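To give a feel for what the RSS-0.91 to RSS-1.0 conversion step involves, here is a minimal XSLT 1.0 sketch. It is not the rss-91.xsl file from the demonstration software; it assumes a plain RSS 0.91 document (rss/channel/item with title, link, and description children) and copies only those fields.

```xml
<?xml version="1.0"?>
<!-- Sketch only: an assumed RSS 0.91 to RSS 1.0 mapping, not the article's rss-91.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">

  <!-- Wrap everything in the RDF root that RSS 1.0 requires -->
  <xsl:template match="/">
    <rdf:RDF>
      <xsl:apply-templates select="rss/channel"/>
      <xsl:apply-templates select="rss/channel/item"/>
    </rdf:RDF>
  </xsl:template>

  <!-- Channel metadata plus the table of contents (one rdf:li per item) -->
  <xsl:template match="channel">
    <channel rdf:about="{link}">
      <title><xsl:value-of select="title"/></title>
      <link><xsl:value-of select="link"/></link>
      <description><xsl:value-of select="description"/></description>
      <items>
        <rdf:Seq>
          <xsl:for-each select="item">
            <rdf:li rdf:resource="{link}"/>
          </xsl:for-each>
        </rdf:Seq>
      </items>
    </channel>
  </xsl:template>

  <!-- In RSS 1.0 the items are siblings of the channel, not children -->
  <xsl:template match="item">
    <item rdf:about="{link}">
      <title><xsl:value-of select="title"/></title>
      <link><xsl:value-of select="link"/></link>
      <xsl:if test="description">
        <description><xsl:value-of select="description"/></description>
      </xsl:if>
    </item>
  </xsl:template>
</xsl:stylesheet>
```

The main structural change is that RSS 0.91 nests items inside the channel, while RSS 1.0 lists them in the channel and then describes each one as a separate top-level element.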
The second user pipeline is the "article pipeline" shown below.
The URL intercepted and processed by this pipeline is rather lengthy
and has embedded in it the actual source location (local or remote) of
the article (i.e. the part matched by the "**" wildcard in the
pattern). The article is retrieved as HTML, then optional custom
filtering (see the article.xsl file in the source code) may be applied to
remove undesired portions; finally, the presentation is applied. The
results of publishing an article from NewsForge in our exemplar
magazine are shown in Figure 4 (note that the URL in the address bar has
the NewsForge location embedded in it).
<!-- ARTICLE PAGE APACHE COCOON PIPELINE FRAGMENT -->
<!-- Retrieve an article, even from a remote feed, and wrap it
     with our magazine. -->
<map:match pattern="article/*/**">
  <!-- Retrieve article from (possibly remote) source -->
  <map:generate type="html" src="http://{2}?">
    <map:parameter name="copy-parameters" value="true"/>
  </map:generate>
  <!-- Format into HTML -->
  <map:transform type="xslt" src="article.xsl">
    <map:parameter name="global-source" value="{1}"/>
    <map:parameter name="global-path" value="{2}"/>
  </map:transform>
  <map:serialize type="html"/>
</map:match>
Figure 4: Article imported from NewsForge
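For the article pipeline to be reached, the headline links on the home page have to be rewritten into the "article/source/location" form. One way a presentation stylesheet could do this is sketched below. This is an assumption about the approach, not a copy of home.xsl from the demonstration software; the HTML layout is deliberately bare, and only the global-source parameter name comes from the sitemap shown earlier.

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumed link rewriting, not the article's home.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rss="http://purl.org/rss/1.0/"
    exclude-result-prefixes="rdf rss">

  <!-- Set by the sitemap: <map:parameter name="global-source" value="{1}"/> -->
  <xsl:param name="global-source"/>

  <!-- One list entry per RSS 1.0 item -->
  <xsl:template match="/rdf:RDF">
    <html>
      <head>
        <title><xsl:value-of select="rss:channel/rss:title"/></title>
      </head>
      <body>
        <ul>
          <xsl:apply-templates select="rss:item"/>
        </ul>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="rss:item">
    <li>
      <!-- "http://host/path" becomes "../article/{source}/host/path",
           which the "article/*/**" pattern splits back into {1} and {2} -->
      <a href="../article/{$global-source}/{substring-after(normalize-space(rss:link), 'http://')}">
        <xsl:value-of select="rss:title"/>
      </a>
    </li>
  </xsl:template>
</xsl:stylesheet>
```

With a source of "newsforge", a headline whose link is http://www.newsforge.com/ followed by some path would be published under ../article/newsforge/www.newsforge.com/ followed by the same path.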
The Local Sources
To achieve uniformity and simplicity, the local magazine content is
made available as two web services: a local RSS feed at URL location
"/rss-feed.rss" and the article feed at "/article-feed/*/body.html".
Both services are trivial two-component Cocoon pipelines. See the
demonstration software for additional details.
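As a sketch of what such two-component pipelines could look like, each one pairs a generator with a serializer. The file layout is partly assumed: local.xml and the articles directory are mentioned in this article, but the body.html file name inside each article directory is a guess for illustration, and the actual sitemap in the demonstration software may differ.

```xml
<!-- Local RSS feed: read the headline file and send it out as XML -->
<map:match pattern="rss-feed.rss">
  <map:generate src="local.xml"/>
  <map:serialize type="xml"/>
</map:match>

<!-- Local article feed: {1} is the article ID taken from the URL.
     The file name body.html is an assumption for this sketch. -->
<map:match pattern="article-feed/*/body.html">
  <map:generate type="html" src="articles/{1}/body.html"/>
  <map:serialize type="html"/>
</map:match>
```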
Distinguishing Characteristics of this Solution
Component Reuse
A repeated theme in this and previous articles is the use of the
XML-directed architecture philosophy. The entire solution is achieved
by way of reusable components directed by XML documents: in this case
three XSL stylesheets and the sitemap file. No Java or any other type
of custom procedural software was written. Granted, this is a very
simple design, and a more feature-rich magazine would possibly require
such procedural software. Nonetheless, the trend seems to be that
more and more solutions are taking on this reuse paradigm, achieving
more functionality with less effort.
Simplicity
Again, the architectural goal is simplicity. Following this
philosophy, a decision was made early on to not use a relational
database. Instead all content is stored in the file system. The file
system is probably the most under-appreciated subsystem of the modern
OS. It is capable of nearly unlimited storage, fast retrieval, and
efficient and automatic caching. The key concept is that no
relational queries are needed in this application. Thus the use of a
relational database or even an XML database adds no value.
Performance
While I have yet to measure performance, I am confident that this
solution should hold its own against any other system. First, the file
system is used as the primary means of persistence. File systems are
typically very efficient and finely optimized over many years of
evolution. Second, all key components in Apache Cocoon utilize the Jakarta Avalon
framework and model for component pooling and reuse. Like file
systems, this approach is highly efficient and optimized. Apache
Cocoon allows and supports pooling configuration for every component
in the pipeline. Third, Apache Cocoon also provides content caching.
Each component in the chain can ask the quick question of the previous
component: "do you have anything different than last time?"
If not, a final component like a serializer can make the decision to
simply reuse the last generated content and forgo nearly all pipeline
processing. Fourth, a performance improvement can be achieved by
embedding the local RSS and article feed into the two user pipelines.
This would eliminate an unnecessary conversion of the document between
text and SAX events. Ideally the framework would be smart enough to
do this automatically. Last, the XSLT transformers are the only
building blocks likely to be troublesome. XSLT technology is
still fairly new and has shown sluggishness in the past. However,
tremendous efforts are underway to improve performance. (See the
article "Fast
XSLT" for a detailed consideration of XSLT performance
issues.)
In summary, the Apache Cocoon framework has provision for all
major optimization tactics and allows them to be engaged and activated
with simple configuration adjustments.
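Two of these adjustments can be sketched as sitemap fragments. Both are illustrations under assumptions rather than tuned or tested settings: the pool sizes are arbitrary numbers, and the internal cocoon: protocol is assumed to be available in the Cocoon release in use, so check the sitemap documentation for your version before relying on either.

```xml
<!-- In <map:components>: pooling attributes on a component declaration.
     The numbers are illustrative, not tuned values. -->
<map:transformer name="xslt"
                 src="org.apache.cocoon.transformation.TraxTransformer"
                 pool-max="32" pool-min="8" pool-grow="2"/>

<!-- In the "/home" pipeline: read the local feed through the internal
     cocoon: protocol instead of an HTTP round trip to localhost. -->
<map:when test="local">
  <map:generate src="cocoon:/rss-feed.rss"/>
</map:when>
```

The second fragment is one way to approach the embedding of the local feeds described above, since the request no longer leaves the servlet container.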
Trying it Out
Installation
A J2EE war file (cocoon-mag.war)
that uses Apache Cocoon to implement the "Generic Online
Magazine" can be downloaded here.
This software has been tested against Tomcat 4.1.12 and requires no
other packages. Simply place the downloaded war file into the
~tomcat/webapps directory and direct a browser to the
application's URL, typically http://localhost:8080/cocoon-mag.
Adding New Articles
You can add a new article by simply installing it in the directory
space ~tomcat/webapps/cocoon-mag/articles/<id>/. By
convention, the article ID is a unique numeric value. The second step
is to add an RSS reference to the article in the file
~tomcat/webapps/cocoon-mag/local.xml. This will cause it to
appear on the top page headlines.
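Assuming local.xml is an RSS 1.0 document, the reference for a new article could look like the following sketch. The article ID 42, the headline text, and the exact link form are hypothetical; the link is built from the "/article-feed/*/body.html" location described earlier.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://localhost:8080/cocoon-mag/rss-feed.rss">
    <!-- existing title, link, and description stay as they are -->
    <items>
      <rdf:Seq>
        <!-- 1. List the new article in the channel's table of contents -->
        <rdf:li rdf:resource="http://localhost:8080/cocoon-mag/article-feed/42/body.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <!-- 2. Describe the new article (hypothetical ID 42) -->
  <item rdf:about="http://localhost:8080/cocoon-mag/article-feed/42/body.html">
    <title>Headline of the new article</title>
    <link>http://localhost:8080/cocoon-mag/article-feed/42/body.html</link>
    <description>A short summary shown on the home page.</description>
  </item>
</rdf:RDF>
```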
Conclusion
While still in their infancy, component solutions directed by XML
configurations are becoming viable and production-worthy ways of
building web applications. Apache Cocoon excels in the territory of
content presentation solutions and is making progress at addressing
more interactive behavior situations with Apache
Struts-like additions. The entire application presented in this
article is contained in one Cocoon sitemap file and a handful of XSLT
templates. Both kinds of file define behavior and can be seen as an
application layer on top of a generic, technology-agnostic XML
framework. In my next article for XML.com, I will present a
generalization of such a framework, which I call X2EE.