XML in News Syndication
by Edd Dumbill
July 17, 2000
Introduction
Table of Contents
The Big Picture
NITF
XMLNews
NewsML
PRISM
ICE
Putting It All Together
The news industry has spent the last few years trying to figure out how to
deal with the new challenges the Web presents. It was difficult enough to
handle the initial challenge of adopting business models that focused on
giving content away for free, but the technical issues have also proven
enormous.
The requirements of simultaneously creating content for paper, web,
and archive destinations are more demanding on news production
systems than paper-only delivery,
and place particular demands on the
transmission and storage of news content--specifically, the
granularity, structure, and precision of information.
Typically, human intervention is employed at many stages in the
production process of print publications. This has led to a situation
where many of the data formats and conventions within news
organizations could not be reliably parsed by computer. As a
trivial example, consider the "byline" of a story. In an ASCII
news story (as might be sent down a newswire), this might be
represented thus:
IMPORTANT HAPPENING IN ONLINE NEWS
by Edd Dumbill
Although this example doesn't seem too hard to parse with a
simple Perl script, consider the variations "by Edd Dumbill, News
Reporter" and "by Edd Dumbill, Fred Smith, and Sally Jones". Is
"News Reporter" a person, or "Fred Smith" a job title?
If we were encoding the news today in XML, we might expect something a
little more like this:
<headline>IMPORTANT HAPPENING IN ONLINE NEWS</headline>
<by><person>Edd Dumbill</person></by>
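To make the contrast concrete, here is a minimal sketch in Python (standing in for the Perl script mentioned above). The function names and the wrapping `<story>` element are invented for this illustration; the wrapper only gives the XML fragment a single root so that it parses.

```python
import re
import xml.etree.ElementTree as ET

def names_from_text(byline: str) -> list[str]:
    """Guess at the names in a plain-text byline (hypothetical helper)."""
    # Drop the leading "by " if present.
    text = re.sub(r"^\s*by\s+", "", byline, flags=re.IGNORECASE)
    # Split on commas (optionally followed by "and") or on a bare "and".
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]

def names_from_xml(fragment: str) -> list[str]:
    """Read the names from marked-up content; no guessing required."""
    # Assumption: wrap the fragment in <story> so it has one root element.
    root = ET.fromstring(f"<story>{fragment}</story>")
    return [p.text for p in root.findall("by/person")]

# The text parser cannot tell that "News Reporter" is a job title.
print(names_from_text("by Edd Dumbill, News Reporter"))

# The XML version says exactly which strings are people.
print(names_from_xml(
    "<headline>IMPORTANT HAPPENING IN ONLINE NEWS</headline>"
    "<by><person>Edd Dumbill</person></by>"
))
```

The first call returns both "Edd Dumbill" and "News Reporter" as if they were names, while the second returns only "Edd Dumbill", because the markup rather than the parser decides what counts as a person.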
Because news used to be handled by journalists who understood the
context of the content, ambiguity of this kind didn't matter too much. The
web, however, has boosted the requirement for automatic handling of news
content, which needs to support more granularity than straight
text. It goes almost without saying that this is one of the areas
in which XML excels, and it should come as no surprise to find a
great deal of XML activity in the news distribution world.
The XML applications outlined in this article have all arisen in response to the problems posed by multi-destination delivery,
and to the new business models in
content that have emerged.
The Big Picture
If we look at the conventional way news has been transmitted
between organizations--the newswire--we can identify various
components:
- Protocol: the conventions used to carry information between
parties over the transmission medium
- Envelope: the conventions used to identify a segment of
information
- Header: the conventions for identifying metadata about a news
item
- Content: the convention used for the actual content of the
item
I've invented some of this separation for the convenience of
illustrating the roles played by the XML-based applications
discussed in this article. In the old wire specifications, some of
these components were merged together. On the Internet, transmission is no longer as simple as
broadcast or point-to-point, and pulling out the components
separately provides a cleaner, more reusable, approach.
Here's an outline of where today's news syndication
technologies sit:
| Protocols | Envelope | Metadata | Content |
| --- | --- | --- | --- |
| ICE | NewsML | XMLNews-Meta | NITF |
| HTTP | | PRISM | XMLNews-Story |
| FTP | | | Existing multimedia formats |
Note that there are some points of overlap between these
classifications, but this table serves as a useful indicator of
the purposes of the separate news initiatives. The names picked
out in italics are those formats and protocols that
together are likely to become the most ubiquitous platform over the
next few years.
The rest of this article discusses each of the XML-based technologies
featured in the table.
NITF
The News Industry Text Format is a well-established XML
application for marking up news items. In fact, it was originally an SGML
application before the advent of XML.
NITF was developed jointly between the
International Press Telecommunications Council
(IPTC) and the Newspaper Association of America, the
two major standards organizations for the news industry. The
intention was to supersede the ANPA1312/IPTC7901 binary wire
formats for the delivery of news, which, as alluded to in the
introduction, were geared exclusively towards print
applications.
To give you a taste for what NITF does, here's a sample story
marked up in NITF:
<nitf version="-//IPTC-NAA//DTD NITF-XML 1.3c//EN" change.date="31 October 1999" change.time="1900">
  <head>
    <title>XML News Formats</title>
  </head>
  <body>
    <body.head>
      <hedline>
        <hl1>XML-based Formats in the News Industry</hl1>
      </hedline>
      <byline>
        <person>Edd Dumbill</person>
        <virtloc>edd@xml.com</virtloc>
      </byline>
      <dateline><story.date>Friday July 14 2000</story.date></dateline>
    </body.head>
    <body.content>
      <p>
        The advent of the web was a large problem for the
        news industry, and in more ways than one.
        Economically speaking, print publications faced
        the challenge of giving away their content for free
        on the web, and adapting to new business
        models. Technically, too, the web raised many issues
        for news providers.
      </p>
    </body.content>
  </body>
</nitf>
You can see that NITF is inspired in part by HTML. NITF has been
cleverly designed to be a flexible DTD, in that users can put as
little or as much embedded markup as they wish into the
story. To markup fans this may seem a little
counterproductive, but it is an essential feature: it lowers the cost of
moving to NITF by reducing the amount of re-engineering required within
production systems. So, both of the following fragments are valid
NITF, but they differ in how much it costs to generate them.
The riot took place in north London last Monday.
The <event>riot</event> took place in
<location>north London</location>
<chron norm="20000717">last Monday</chron>.
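A short sketch of what the richer fragment buys a consumer, using Python's standard library. The enclosing `<p>` is added here so that each fragment parses as a single element, and reading `norm` as a `YYYYMMDD` date is an assumption based on the value shown above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

plain = "<p>The riot took place in north London last Monday.</p>"
rich = (
    "<p>The <event>riot</event> took place in "
    "<location>north London</location> "
    '<chron norm="20000717">last Monday</chron>.</p>'
)

def reading_text(xml: str) -> str:
    """Return the text a human reader sees, ignoring inline markup."""
    return "".join(ET.fromstring(xml).itertext())

# Both fragments read identically to a human...
assert reading_text(plain) == reading_text(rich)

# ...but only the rich one lets software pull out the facts.
p = ET.fromstring(rich)
location = p.findtext("location")
# Assumption: norm holds a YYYYMMDD date, as the sample value suggests.
when = datetime.strptime(p.find("chron").get("norm"), "%Y%m%d").date()

print(location, when.isoformat())
```

The cheap fragment costs nothing extra to produce, but "last Monday" stays a phrase only a person can resolve; the marked-up one turns it into a date a program can sort and search on.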
As the NITF Implementors' Guide puts it,
"[P]artial implementation can be introduced into most editorial computer systems without
large-scale modifications. NITF can be tested on a specific project, such as sports agate, without involving
other departments. This gives publishers a chance to see how NITF works without making a large
investment."
NITF has evolved through several versions, including making the
transition from SGML to XML. The most recent version is v1.3,
released on October 31, 1999. The DTD is available from the NITF web site.
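Returning to the sample story, here is a minimal sketch of how a receiving system might pull out its key fields. Only element names that appear in the sample are used; the abbreviated document, the dictionary keys, and the variable names are choices made for this illustration, and a production system would presumably validate against the DTD first.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample story (attributes dropped, body text shortened).
nitf_doc = """
<nitf>
  <head><title>XML News Formats</title></head>
  <body>
    <body.head>
      <hedline><hl1>XML-based Formats in the News Industry</hl1></hedline>
      <byline>
        <person>Edd Dumbill</person>
        <virtloc>edd@xml.com</virtloc>
      </byline>
      <dateline><story.date>Friday July 14 2000</story.date></dateline>
    </body.head>
    <body.content>
      <p>The advent of the web was a large problem for the news industry.</p>
    </body.content>
  </body>
</nitf>
"""

root = ET.fromstring(nitf_doc)
story = {
    "title": root.findtext("head/title"),
    "headline": root.findtext("body/body.head/hedline/hl1"),
    "author": root.findtext("body/body.head/byline/person"),
    "date": root.findtext("body/body.head/dateline/story.date"),
    # Flatten each paragraph to plain text, ignoring any inline markup.
    "paragraphs": [
        "".join(p.itertext()).strip()
        for p in root.findall("body/body.content/p")
    ],
}

for key, value in story.items():
    print(f"{key}: {value}")
```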
Further reading: Robin Cover
on NITF
XMLNews
XMLNews is probably the most widely deployed news syndication format on
the web at the moment. It was designed by David Megginson. Its story format,
a subset of the September 21, 1998 release of NITF, is known as
"XMLNews-Story." XMLNews also contains "XMLNews-Meta," an RDF application for
describing news content.
Because XMLNews-Story is similar to NITF (as described above), I won't
explain it in detail here. It is worth noting, however, that with
the most recent revision of NITF (which includes simplifications
and improves ease-of-use), XMLNews-Story is no longer a
compliant subset. Megginson's work with XMLNews
made NITF radically more accessible and understandable for many, and
has fulfilled a valuable function
in enabling software support from the likes of Wavo and iSyndicate. However, it looks as though future development will happen exclusively in NITF.
XMLNews-Meta is an extensible vocabulary for describing news
resources. In contrast to NITF, which is used for the content
itself, XMLNews-Meta describes the content. Its main features are
the ability to describe the following:
- Identification (assigning a unique ID to the resource being
described)
- Header Information (such as language, title, description)
- Milestones (publication, release, receipt and expiry
times)
- Provenance (the route through news providers taken by a
story)
- Rights (copyright and distribution rights)
- Subject Matter (machine readable classification
information)
- Linking (describing inter-story relationships, e.g., previous
versions)
As XMLNews-Meta is expressed in RDF, it is inherently extensible,
and organizations can extend it for their own purposes by using a
vocabulary in a custom namespace. XMLNews-Meta also has the distinction
of being one of the few RDF applications currently in everyday use.
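The extension mechanism can be sketched as follows. The RDF namespace URI is the real one; everything else (the `http://example.org/...` vocabularies, the property names, and the resource URL) is invented for illustration and is not taken from the XMLNews-Meta specification.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
META = "http://example.org/news-meta#"    # hypothetical shared vocabulary
HOUSE = "http://example.org/acme-desk#"   # hypothetical in-house extension

description = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:meta="{META}" xmlns:house="{HOUSE}">
  <rdf:Description rdf:about="http://example.org/stories/001">
    <meta:title>XML News Formats</meta:title>
    <meta:language>en</meta:language>
    <house:deskCode>TECH</house:deskCode>
  </rdf:Description>
</rdf:RDF>
"""

def properties(xml: str, namespace: str) -> dict[str, str]:
    """Collect the properties of the first description that belong to one namespace."""
    desc = ET.fromstring(xml).find(f"{{{RDF}}}Description")
    prefix = "{" + namespace + "}"
    return {
        child.tag[len(prefix):]: child.text
        for child in desc
        if child.tag.startswith(prefix)
    }

# A consumer that knows only the shared vocabulary ignores the extension...
print(properties(description, META))
# ...while the originating organization can also read its own properties.
print(properties(description, HOUSE))
```

Because each property is qualified by its namespace, the in-house additions travel alongside the shared vocabulary without colliding with it or confusing receivers that have never heard of them.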
Further reading: Announcement and background on XMLNews, Robin
Cover on XMLNews
NewsML
A younger specification than NITF, NewsML is also being developed
under the auspices of the IPTC. NewsML is an envelope format for
news content, designed to help solve the problem of transporting
news items irrespective of their encoding.
In the same way that NITF supersedes IPTC7901, NewsML is an
XML-based successor to the IPTC's "Information Interchange
Model." NewsML is still very much in development, but its core features
are support for the following:
- All formats and media-types: "NewsML makes no assumption about the media type, format, or encoding of news. NewsML provides a structure within which news
objects, of whatever type, relate to each other. NewsML can equally
represent text, video, audio, graphics, and photos."
- Collections of news items, either as journalistic
packages or results of automatic collation
- Named relationships between news items: much like
the linking part of XMLNews-Meta
- Multi-part structure with internal relationships:
e.g., text with supporting images or video
- Tracking revision of news items over time
- Alternative representations of item parts, for
instance HTML, RTF, and PDF encoding of text
- Inclusion and exclusion of news item parts
- Attached metadata
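To picture how several of these features fit together, here is a rough data-structure sketch in Python. It is not NewsML: the class names, fields, and file names are invented for illustration, and the beta DTD defines its own element names. It shows only the idea of a revisioned item made of parts, each offered in alternative representations.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Representation:
    """One encoding of a part, e.g. the same text as HTML or PDF."""
    media_type: str
    href: str

@dataclass
class Part:
    """A component of a news item with alternative representations."""
    role: str
    representations: list[Representation] = field(default_factory=list)

    def best(self, accepted: list[str]) -> Optional[Representation]:
        """Pick the first representation the receiver says it can handle."""
        for media_type in accepted:
            for rep in self.representations:
                if rep.media_type == media_type:
                    return rep
        return None

@dataclass
class NewsItem:
    """An envelope: revision tracking plus a multi-part structure."""
    item_id: str
    revision: int
    parts: list[Part] = field(default_factory=list)

item = NewsItem(
    item_id="story-001",
    revision=2,
    parts=[
        Part("main-text", [
            Representation("text/html", "story.html"),
            Representation("application/pdf", "story.pdf"),
        ]),
        Part("supporting-image", [Representation("image/jpeg", "riot.jpg")]),
    ],
)

# A receiver that prefers PDF but also accepts HTML.
chosen = item.parts[0].best(["application/pdf", "text/html"])
print(chosen.href)
```

The envelope never looks inside the representations, which is the point: the same structure can carry text, images, or video without caring how each is encoded.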
Although NewsML will allow the implementation of an envelope,
incorporating existing content formats like NITF, and the inclusion
of external metadata descriptions like PRISM (see below), it is also
designed to be
self-sufficient. That is, it will be possible to use NewsML alone
for the envelope, metadata, and text parts of a story, albeit with
less flexibility. Further detail on the relationship between NITF
and NewsML can be found in this thread from the NewsML mailing list.
A beta DTD for NewsML can be found at http://www.iptc.org/NewsML,
and the final specification is due for release in early October.
Further reading:
IPTC's NewsML web site
PRISM
The Publishing Requirements for Industry Standard Metadata
initiative takes a wider view than the IPTC-sponsored NITF and
NewsML. Operated under the aegis of the IDEAlliance, PRISM seeks to "develop an XML metadata vocabulary for the magazine, catalogue, mainstream journal, news, and book industries."
Before the Web, news syndication was largely the domain of large
organizations, such as news wires, who could afford the staffing
and the infrastructure to make the business profitable. As it has
with many other things, the Web has overturned that model. Standards like PRISM and ICE
address themselves to a larger audience than the traditional large
news organizations.
The PRISM authoring group expects to release their metadata
vocabulary this fall. Until that point, there is not much information
publicly available about the vocabulary. The PRISM home page does, however, outline some encouraging goals, particularly the re-use of existing metadata standards such as RDF and the Dublin Core.
PRISM, when it is released, will serve much the same purpose that
XMLNews-Meta serves today, but with a broader focus. The
dual-document approach of XMLNews (one document for the story, another
for its metadata) provides the pattern that PRISM itself will follow.
Further reading: PRISM web site.
ICE
The Information and Content Exchange specification is one of the
most established applications of XML in this area. It defines both a vocabulary
and a protocol for the transport and business rules aspects of
content syndication.
ICE provides the protocol by which content syndicators can offer
content to potential subscribers, and subscribers can receive it.
It does this by defining XML-over-HTTP exchanges.
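To give a feel for what an XML-over-HTTP exchange involves, the sketch below builds (but does not send) an HTTP POST whose body is a small XML message. The element names, attributes, identifiers, and URL are placeholders invented for this illustration; they are not taken from the ICE specification, which defines its own message vocabulary.

```python
import urllib.request
import xml.etree.ElementTree as ET

def build_request(url: str, subscriber_id: str, subscription_id: str) -> urllib.request.Request:
    """Build an HTTP POST carrying a hypothetical 'get package' message."""
    # Hypothetical message format: NOT the element names ICE actually uses.
    message = ET.Element("syndication-request", {"sender": subscriber_id})
    ET.SubElement(message, "get-package", {"subscription": subscription_id})
    body = ET.tostring(message, encoding="utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )

# Placeholder URL and identifiers; nothing is sent over the network here.
req = build_request("http://syndicator.example.com/ice", "subscriber-42", "daily-tech")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

A real subscriber would pass the request to `urllib.request.urlopen` and parse the XML that comes back; the syndicator's reply would be another XML message in the same vocabulary.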
Most of the members of the ICE Authoring Group don't
really come from the traditional news world. This is reflected in the fact
that the ICE syndication model is a lot more sophisticated than
the models previously used by the big players. Traditional
newswires adopted a broadcast or point-to-point philosophy, where
all the business rules governing subscription happened out-of-band
between humans. Other syndication occurred by drop-off using ISDN,
bulletin boards, or even by fax.
Perhaps ICE's most notable feature is its ability to codify some
of the business aspects of syndication. These include delivery
schedules, activating subscriptions, and even "surprise" content
requests, to handle one-off transactions. This ability means it is
practical for syndicators to maintain large client bases. Also, for
content aggregators, ICE makes it easier to conduct transactions of
known reliability with multiple suppliers. ICE servers and clients
provide a shrink-wrap replacement for a host of Perl and shell
scripting that previously conducted these operations.
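What "codifying the business aspects" might look like in software can be sketched in a few lines. This is an assumption-laden illustration rather than ICE's actual data model: the class names, fields, and the weekday/time-window form of the schedule are all invented here.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class DeliverySchedule:
    """When a syndicator has agreed to make content available (hypothetical)."""
    weekdays: set[int]   # 0 = Monday ... 6 = Sunday
    start: time
    end: time

    def allows(self, moment: datetime) -> bool:
        """True if the moment falls on an agreed day and inside the time window."""
        return (moment.weekday() in self.weekdays
                and self.start <= moment.time() <= self.end)

@dataclass
class Subscription:
    """One subscriber's agreement to one offer (hypothetical)."""
    subscriber: str
    offer: str
    schedule: DeliverySchedule
    active: bool = False

    def activate(self) -> None:
        self.active = True

    def may_deliver(self, moment: datetime) -> bool:
        """A delivery is in order only for an active subscription, on schedule."""
        return self.active and self.schedule.allows(moment)

sub = Subscription(
    subscriber="aggregator-7",
    offer="technology-headlines",
    schedule=DeliverySchedule({0, 1, 2, 3, 4}, time(6, 0), time(22, 0)),
)
monday_noon = datetime(2000, 7, 17, 12, 0)

print(sub.may_deliver(monday_noon))   # subscription not yet activated
sub.activate()
print(sub.may_deliver(monday_noon))   # active, and inside the agreed window
```

Once rules like these live in data rather than in a contract drawer, a syndicator can apply them uniformly to hundreds of subscribers, which is what makes large client bases practical.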
ICE now has multiple implementations in the field, and is being applied wherever a reliable publish/subscribe model is required for electronic asset exchange. Examples include parts catalogs as well as more traditional media areas.
Although the ICE specification is publicly available, the ICE Authoring Group and Network are fee-based organizations, which means they pay for the work of developing the standard themselves. This has had a knock-on effect on the speed with which information about ICE is disseminated, however, and the ICE AG hasn't been able to benefit from the same groundswell of open source implementation that W3C XML specs often receive.
According to a message earlier this year from Sami Khoury, one of ICE's authors, an open source reference implementation of ICE will soon be available from the authoring group themselves.
Further reading: ICE web
site, Robin Cover on ICE
Putting It All Together
The technologies mentioned in this article are at varying levels
of completeness and implementation. Also, they each have slight
overlaps as they have been developed by differing groups and with
different goals. What is very encouraging, though, is that each of
the initiatives is being pursued with both extensibility and
compatibility in mind, and they look as though they will play
nicely with each other.
For pursuing content syndication right now, the two most stable
activities are NITF and ICE, both of which are good places to
start looking. For an accessible start into syndication, XMLNews
is still a good choice, and has commercial software support--this
means that as the newer specifications such as PRISM and NewsML
come on-stream, an upgrade path is likely to be provided.
What does the future hold? Hopefully before too long, the
various technologies will be joined together, so you may find
news marked up in NITF, described by
PRISM, packaged in NewsML, and delivered via ICE.
Acknowledgements
I'd like to thank Deren Hansen of Wavo Corporation for his
assistance in compiling this article.