Television Listings and XMLTV
by Kyle Downey
February 18, 2004
Introduction
For several years I've wanted to assemble my own PC. Every
time I decided to replace my computer, I would tell myself
that maybe this time I would get around to building my own.
This resolution lasted about as long as it took me to go to
Dell, price its latest model, and compare that to the amount
of free time I had, which is very little.
The emergence of Linux-based packages for building
personal video recorders (PVRs) like TiVo -- something I
would probably never be able to justify buying just for
its own sake -- offered me the chance I was waiting for. A
mini PC with a TV capture card, a WiFi card, a monster
hard drive (you can get up to a quarter terabyte
nowadays), and a Linux package like MythTV can not only do
almost everything a TiVo can do, but can also serve up MP3
files, act as a Windows file server with
Samba, run a web server, and more.
One critical element of a DIY TiVo is TV listings. Without
these, all the fancy hardware in the world won't do much
good. But there's an open source, Perl- and XML-based
solution by Edward Avis called XMLTV that many of the
TV-on-your-PC packages like Freevo and MythTV support. With
support for screen-scraping data for many countries' cable
systems, XMLTV can take various sources and create a
consistent stream of XML.
Here's a snippet to give you an idea of the kind of
information you can get:
<tv>
  <programme channel="C54amc.zap2it.com"
             start="20031230002000 -0500" stop="20031230022000 -0500">
    <title>Mystic Pizza</title>
    <desc>Three teenage girls come of age one summer working in a
    pizza parlor in Mystic, Conn.</desc>
    <date>1988</date>
    <category>Comedy</category>
    <rating system="VCHIP">
      <value>14</value>
    </rating>
    <rating system="MPAA">
      <value>R</value>
    </rating>
    <star-rating>
      <value>2.5/4</value>
    </star-rating>
  </programme>
</tv>
If you have an iCal-compliant viewer (like
Mozilla)
you can even convert this to a calendar using Irving
Probst's
XSLT stylesheet (screenshot).
Getting started
As a first step I grabbed the latest Windows version of
XMLTV from the
SourceForge project. (For OS X, RPM-based
Linux systems, and Debian package-based systems you also
get packages; see the home page for details.) This gives
you a binary "xmltv.exe" at the top level of the directory
where you unpack the ZIP file. Like any good tool with a
UNIX heritage, XMLTV is meant to act as a filter chained
together with other programs. Once you set it up (in my
case to point to the North American listings), you can run
the program and get a stream of XML suitable for your
homegrown electronic program guide:
C:\writing\xmltv-0.5.24-win32>xmltv tv_grab_na --configure
Timezone is -0500
Welcome to XMLTV 0.5.24 (tv_grab_na V3.20031101) for Canada and US tv listings
Please report any problems, bugs or suggestions to:
xmltv-users@lists.sourceforge.net
For more information consult http://sourceforge.net/projects/xmltv
checking XMLTV release information..
Warning: failed to get current release information from:
http://sourceforge.net/projects/xmltv
If this problem persists, look for a new XMLTV release.
starting manual configuration process..
how many times do you want to retry on www site failures ? (default=2)
how many seconds do you want to between retries ? (default=30)
what is your postal/zip code ? 11375
getting list of providers for postal/zip code 11375, be patient..
Choose a service provider:
0: DIRECTV New York - New York (128766)
1: DISH New York - New York (128719)
2: RCN Cable (Microwave) - New York - Digital Rebuild (70946)
3: RCN Cable (Microwave) - New York - Rebuild (70945)
4: RCN Cable (Microwave) - New York (70944)
5: Time Warner Cable - Brooklyn - Cable Ready (71328)
6: Time Warner Cable - Brooklyn - Digital (71329)
7: Time Warner Cable - Brooklyn (71327)
8: Time Warner Forest Hills - Forest Hills - Cable Ready (71440)
9: Time Warner Forest Hills - Forest Hills (71439)
10: C-Band - USA (87341)
11: DIRECTV - USA (62044)
12: DISH Network - USA (62046)
13: VOOM - USA (179304)
14: Local Broadcast Listings (137303)
Select one: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14 (default=0)] 6
you chose 71329 # Time Warner Cable - Brooklyn - Digital
getting channel list, be patient..
After a few moments you get:
got channel list
add channel 1 NY1 ? [yes,no,all,none (default=yes)] A
.
.
.
add channel 1000 MOONDEM ? yes
add channel 1020 ADONDEM ? yes
add channel 1031 HBODEM ? yes
add channel 1032 CINDEM ? yes
add channel 1033 SHOWDEM ? yes
add channel 1034 TMCDEM ? yes
updating C:\/.xmltv/tv_grab_na.conf..
configuration step complete, let the games begin !
My first impression? Digital cable gives you way too many
choices for good health.
Looking at the Format
Preamble
The top-level element,
<tv>, contains no big surprises:
<tv date="20031230133339 -0500" generator-info-name="tv_grab_na V3.20031101"
    generator-info-url="http://sourceforge.net/projects/xmltv"
    source-info-name="Zap2It" source-info-url="http://www.zap2it.com">
  ...
</tv>
In date, the timestamp given (including the
GMT offset for timezone) lets you know when the original
source generated the listing data. The attributes
source-info-url
and
source-info-name provide a glimpse into how
xmltv the program works: for the U.S. it screen-scrapes
HTML from a website providing channel listings by ZIP
code. We'll be reading right past this information for our
example program below.
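If you do want to read the timestamps, here is a minimal sketch of one way to parse them in Java. It assumes every value has the full form seen in this feed (fourteen digits, a space, and a numeric offset); the class name XmltvTimestamps is my own invention and not part of XMLTV.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

final class XmltvTimestamps {
    // Matches values such as "20031230133339 -0500": date and time
    // digits, one space, then a numeric offset from GMT.
    // Assumption: shorter or differently zoned values are not handled.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss Z");

    /** Parses one timestamp attribute, keeping its original offset. */
    static OffsetDateTime parse(String value) {
        return OffsetDateTime.parse(value.trim(), FORMAT);
    }

    public static void main(String[] args) {
        OffsetDateTime generated = parse("20031230133339 -0500");
        System.out.println("local time: " + generated);
        System.out.println("as GMT:     " + generated.toInstant());
    }
}
```

The same helper should work for the start and stop attributes on the program listings, since they use the same layout in this feed.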
This brings up an important question: what's the legal
status of XMLTV? For personal use, at least, the Zap2it
license seems to be broad enough to allow for it:
While you may interact with or download a single copy of any portion
of the Content for your own personal, non-commercial entertainment,
information or use, you may not and may not authorize others to
reproduce, sell, publish, distribute, modify, display, repost or
otherwise use any portion of the Content in any other way or for any
other purpose without the prior written consent of TMS. Requests
regarding use of the Content for any purpose other than personal,
non-commercial use should be directed to Feedback at Zap2it.com.
Other services in other countries have shut out XMLTV. And
it's possible that they'd make a bigger issue of it
if there were more Linux PVRs out there pulling down their
data. Even if there were no legal concerns about XMLTV
sourcing, there is also the technical risk: every time the
HTML layout on Zap2it changes, XMLTV will break. There
seems to be a small market of people who might pay an
annual fee for reliable XML-formatted EPG (electronic
program guide) data, but one debate in the XMLTV forum on
the DigiGuide pay service pointed out that
North American TV listings are a duopoly, and Bill Gates
paid $6 million for his listings for WebTV. It would be
hard to make a profit off homegrown DIY users wanting
commercial-grade TV listings, especially given the risk
of providing the data in a format that is so easy to
redistribute. The whole issue brings to mind the MP3
debate: do people use software like XMLTV because there's
no good pay alternative, or because they wouldn't use it
unless it was free?
No matter what happens with the listing sources, XMLTV
itself is still useful to understand and handle, and it's
a good example of XML's strengths in syndication and
bridging diverse applications.
Channel information
Next up in the format we have multiple
<channel> tags describing all the
available channels in your area. XMLTV maps this
information to the program listings by an ID which we'll
see again later; the ID should follow
RFC 2838: Uniform Resource Identifiers
for Television Broadcasts but the DTD obviously
can't enforce this. Channels can include an optional
icon and an optional URL.
<channel id="C2wcbs.zap2it.com">
  <display-name>2 WCBS</display-name>
  <display-name>2</display-name>
  <icon src="http://tvlistings2.zap2it.com/tms_network_logos/cbs_30.jpg"/>
</channel>
XMLTV supports basic localization by a "lang" attribute,
e.g. fr_FR. (In a perfect world the DTD would have used
xml:lang instead.) It thus allows for multiple display
names. Thankfully one variant offered for at least my feed
is the channel number itself, which will be needed for PVR
software.
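As a rough sketch of how a program might pick that channel number out of the display names, the Java below treats the first all-digit display name as the number. That rule is only a guess based on what my feed happens to contain, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ChannelNames {
    /** Returns every display-name of a channel element, in document order. */
    static List<String> displayNames(Element channel) {
        List<String> names = new ArrayList<>();
        NodeList nodes = channel.getElementsByTagName("display-name");
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent().trim());
        }
        return names;
    }

    /** Returns the first all-digit display name, or null if there is none. */
    static String channelNumber(Element channel) {
        for (String name : displayNames(channel)) {
            if (name.matches("\\d+")) {
                return name;
            }
        }
        return null;
    }

    /** Parses one channel element from a string (demonstration helper). */
    static Element parseChannel(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse channel", e);
        }
    }

    public static void main(String[] args) {
        Element channel = parseChannel(
                "<channel id=\"C2wcbs.zap2it.com\">"
                + "<display-name>2 WCBS</display-name>"
                + "<display-name>2</display-name>"
                + "</channel>");
        System.out.println(channel.getAttribute("id") + " is channel "
                + channelNumber(channel));
    }
}
```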
Program information
The mother lode of information in XMLTV is in the program
listings: what programs play on what channel ID, starting
and stopping at what times. Here's an example:
<programme channel="C2wcbs.zap2it.com" start="20031230043000 -0500"
           stop="20031230050000 -0500">
  <title>CBS Morning News</title>
  <desc>News reports on current events.</desc>
  <category>News</category>
  <audio>
    <stereo>stereo</stereo>
  </audio>
  <subtitles type="teletext"/>
</programme>
The DTD allows for a lot of optional information,
including icon, URL, language, year, country, credits
(director, actor, writer, etc.), star ratings, audio
metadata, video aspect ratio, whether it has subtitles,
etc. We're going to stick with title for the example; for
a serious application you might need a commercial feed
(should one ever become available) with more reliable and
detailed information.
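Because so much of a listing is optional, code that reads one has to cope with missing elements. The following is a small sketch, not a complete reader: the helper childText (a name I made up) returns the text of the first direct child with a given name, or an empty Optional when the feed left it out.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ProgrammeFields {
    /** Text of the first direct child element with the given name, if any. */
    static Optional<String> childText(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && name.equals(n.getNodeName())) {
                return Optional.of(n.getTextContent().trim());
            }
        }
        return Optional.empty();
    }

    /** Parses one programme element from a string (demonstration helper). */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse programme", e);
        }
    }

    public static void main(String[] args) {
        Element programme = parse(
                "<programme channel=\"C2wcbs.zap2it.com\""
                + " start=\"20031230043000 -0500\" stop=\"20031230050000 -0500\">"
                + "<title>CBS Morning News</title>"
                + "<category>News</category>"
                + "</programme>");
        System.out.println("title:     " + childText(programme, "title").orElse("(none)"));
        System.out.println("sub-title: " + childText(programme, "sub-title").orElse("(none)"));
    }
}
```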
Episodes
Episodic programs get special treatment in the XMLTV
format. Here's an example from the feed I pulled:
<programme channel="C2wcbs.zap2it.com" start="20031230030700 -0500"
           stop="20031230033700 -0500">
  <title>Becker</title>
  <sub-title>Small Wonder</sub-title>
  <desc>Reggie and the gang dispute Becker's crazy theory
  that little people are bad luck.</desc>
  <episode-num system="xmltv_ns"> . . 0/3</episode-num>
  <audio>
    <stereo>stereo</stereo>
  </audio>
  <subtitles type="teletext"/>
  <rating system="VCHIP">
    <value>PG</value>
  </rating>
</programme>
The "system" attribute in <episode-num> has two
allowed values: "xmltv_ns", which is used here, and
"onscreen". The latter provides the human-displayable
version; the former has more structured data. It's
supposed to be three numbers (with "." as a separator):
the season number, the episode number within that
season, and finally the part number. Slashes indicate out
of how many, and numbers begin at zero; so "0/3" means the
first of three. The DTD provides a good set of examples:
The first episode of the second series is '1.0.0/1'. If it were a two-part
episode, then the first half would be '1.0.0/2' and the second half '1.0.1/2'.
If you know that an episode is from the first season, but you don't know
which episode it is or whether it is part of a multiparter, you could
give the episode-num as '0..'. Here the second and third numbers have
been omitted. If you know that this is the first part of a three-part
episode, which is the last episode of the first series of thirteen,
its number would be '0 . 12/13 . 0/3'. The series number is just '0'
because you don't know how many series there are in total - perhaps
the show is still being made!
Easy, right? But look at the actual data. As you can
probably guess, this "Becker" episode is not a
three-parter, and the first two fields are missing
entirely. We're looking at dirty data: no season number,
no episode number, and an unreliable last segment. You
couldn't run a real electronic program guide off of XMLTV,
which is probably good for the developer's legal exposure.
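To make the structure concrete, here is a hedged sketch of a parser for "xmltv_ns" values, written only from the DTD examples quoted above. The class EpisodeNum and its Field type are hypothetical names of mine; blank fields, like the first two in the Becker listing, come back as null rather than being guessed.

```java
final class EpisodeNum {
    /** One dot-separated field: a zero-based index and an optional total. */
    static final class Field {
        final Integer index; // null when the feed left the field blank
        final Integer total; // null when no "/total" was given

        Field(Integer index, Integer total) {
            this.index = index;
            this.total = total;
        }
    }

    final Field season;
    final Field episode;
    final Field part;

    private EpisodeNum(Field season, Field episode, Field part) {
        this.season = season;
        this.episode = episode;
        this.part = part;
    }

    /** Parses values such as "1.0.0/2", "0..", or " . . 0/3". */
    static EpisodeNum parse(String text) {
        // A limit of -1 keeps trailing empty fields, so "0.." yields three.
        String[] raw = text.split("\\.", -1);
        if (raw.length != 3) {
            throw new IllegalArgumentException("expected three fields: " + text);
        }
        return new EpisodeNum(field(raw[0]), field(raw[1]), field(raw[2]));
    }

    private static Field field(String raw) {
        String trimmed = raw.trim();
        if (trimmed.isEmpty()) {
            return new Field(null, null);
        }
        String[] halves = trimmed.split("/");
        Integer index = Integer.valueOf(halves[0].trim());
        Integer total = halves.length > 1 ? Integer.valueOf(halves[1].trim()) : null;
        return new Field(index, total);
    }

    public static void main(String[] args) {
        EpisodeNum becker = parse(" . . 0/3");
        System.out.println("season known?  " + (becker.season.index != null));
        System.out.println("part " + (becker.part.index + 1) + " of " + becker.part.total);
    }
}
```

A parser like this can tell you what the feed claims, but it cannot tell you whether the claim is true, which is the real problem with the Becker listing.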
Playing Around
grep is a good way to scan through XML for
fragments of interest, but if you want to process XMLTV
programmatically you'll want heftier tools. One of my
favorite tools for processing XML with minimal programming
effort is XPath. The
Jaxen
project provides a good implementation in Java, my
language of choice, but the open source community has
provided a wealth of options in your pick of languages. If
your only goal is to produce HTML, you could also consider
using XSLT.
XPath packs a lot of information into a very small space,
so mixing it with your procedural and OO code can make for
compact, expressive code. It's also very easy to store
XPath fragments in XML, databases, and property files, so
you can make your program more configurable. Here's the
path to find all programs:
//programme
and then all programs on CBS, using the channel ID for our area:
//programme[@channel='C2wcbs.zap2it.com']
and all programs with a rating of PG or G:
//programme[rating/value='PG' or rating/value='G']
Let's say you want to develop a "coming up" program
schedule for a fan homepage for Becker. You might even be
thinking of turning the fragment into a portlet to collect
all those Becker fan pages. (I promise the code will be
more realistic than the premise.) We can find all the
Becker episode titles with a single line of XPath code:
//programme[@channel='C2wcbs.zap2it.com' and title='Becker']/sub-title/text()
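Because these expressions are plain strings, they are easy to keep outside the code. The sketch below loads named expressions from properties-file text and runs them with the XPath API that ships with the JDK (javax.xml.xpath), which I've swapped in for Jaxen here only so the example has no external dependencies. The class name and the property keys are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ConfiguredQueries {
    private final Properties queries = new Properties();

    /** Loads "name=xpath" pairs from properties-file text. */
    ConfiguredQueries(String propertiesText) {
        try {
            queries.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException("unreadable query configuration", e);
        }
    }

    /** Runs the named expression and returns the text of each matching node. */
    List<String> run(String queryName, String xml) {
        String expression = queries.getProperty(queryName);
        if (expression == null) {
            throw new IllegalArgumentException("no query named " + queryName);
        }
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)),
                            XPathConstants.NODESET);
            List<String> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                results.add(nodes.item(i).getTextContent().trim());
            }
            return results;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical keys; in practice this text would live in a file.
        ConfiguredQueries queries = new ConfiguredQueries(
                "becker.episodes=//programme[title='Becker']/sub-title/text()\n");
        String feed = "<tv><programme channel=\"C2wcbs.zap2it.com\">"
                + "<title>Becker</title><sub-title>Small Wonder</sub-title>"
                + "</programme></tv>";
        System.out.println(queries.run("becker.episodes", feed));
    }
}
```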
Next we need source data. After configuring your feed
source, you can run the following, for example from a
nightly cron job, to get a full two weeks' worth of
listings:
xmltv tv_grab_na --days 14 > feed.xml
Next we need to process it. The following sample Java code
loads the file into a DOM Document and uses Jaxen to
select and print the episode titles under the nodes. (Note
this example excludes all the error handling, reasonable
argument processing, and modular design you'd expect from
production code.)
import java.io.File;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.jaxen.Navigator;
import org.jaxen.XPath;
import org.jaxen.dom.DocumentNavigator;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XMLTV {
    public static void main(String[] args) throws Exception {
        // set up Java XML processing
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = dbf.newDocumentBuilder();

        // parse the feed named on the command line
        File srcFile = new File(args[0]);
        Document doc = docBuilder.parse(srcFile);

        // get an instance of Jaxen's DOM handler
        Navigator navigator = DocumentNavigator.getInstance();

        // pre-compile the XPath expressions
        XPath channelXpath = navigator.parseXPath("/tv/channel");
        XPath beckerXpath = navigator.parseXPath("//programme[title='Becker']");

        // create a mapping from channel ID to its first display name
        Map<String, String> channelMap = new HashMap<String, String>();
        List<?> channelNodes = channelXpath.selectNodes(doc);
        for (int ii = 0; ii < channelNodes.size(); ii++) {
            Element channelElem = (Element) channelNodes.get(ii);
            Element displayNameElem = (Element) channelElem
                    .getElementsByTagName("display-name").item(0);
            // store the element's text, not the DOM text node itself
            channelMap.put(channelElem.getAttribute("id"),
                    displayNameElem.getTextContent());
        }

        // find the episode nodes!
        List<?> nodeList = beckerXpath.selectNodes(doc);
        System.out.println(nodeList.size() + " matches found");
        for (int ii = 0; ii < nodeList.size(); ii++) {
            Element programElem = (Element) nodeList.get(ii);
            // sub-title is optional, so guard against listings without one
            Element subTitleElem = (Element) programElem
                    .getElementsByTagName("sub-title").item(0);
            String subTitle = (subTitleElem == null)
                    ? "(unknown)" : subTitleElem.getTextContent();
            System.out.print("Episode title = '" + subTitle);
            System.out.println("'; channel = " + channelMap.get(
                    programElem.getAttribute("channel")));
        }
    }
}
The example does a little more than get the episode
title. It first maps channel ID to channel name, then
finds all the matching programme elements. This is
something that you can do very quickly in Perl or Java but
that might take a little more work in XSLT. Of course,
emitting HTML based on the output would be much easier in
XSLT, arguing for a combination of the two -- creating a
pipeline with an XMLTV producer, a Java processor, and then
a stylesheet using Cocoon might be one way to do it.
For a real tool you might consider SAX2 despite the
greater complexity, since it streams through the feed
instead of holding the whole document in memory, and
implement page caching using a package like
OSCache or produce the HTML in a nightly
batch as well. XMLTV creates a lot of data, and a web app
that transforms even a static file of that size on every
request has the potential to be very slow.
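For comparison, here is a rough sketch of the same Becker lookup done with SAX. It uses only the parser bundled with the JDK; the handler class and its find method are hypothetical names, and like the DOM example it skips serious error handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class EpisodeTitleHandler extends DefaultHandler {
    private final String wantedTitle;
    private final List<String> subTitles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String title;
    private String subTitle;

    EpisodeTitleHandler(String wantedTitle) {
        this.wantedTitle = wantedTitle;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if ("programme".equals(qName)) {
            title = null;     // forget the previous listing
            subTitle = null;
        }
        text.setLength(0);    // start collecting this element's text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            title = text.toString().trim();
        } else if ("sub-title".equals(qName)) {
            subTitle = text.toString().trim();
        } else if ("programme".equals(qName)
                && wantedTitle.equals(title) && subTitle != null) {
            subTitles.add(subTitle);
        }
    }

    /** Streams through the feed and returns sub-titles of matching programs. */
    static List<String> find(String xml, String wantedTitle) {
        EpisodeTitleHandler handler = new EpisodeTitleHandler(wantedTitle);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read feed", e);
        }
        return handler.subTitles;
    }
}
```

Only one listing's worth of state is alive at a time, which is the point: memory use stays flat no matter how many days of listings are in the file.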
A Wish List
XMLTV is an evolving format; the version covered in this
article is 0.5. A revised but convertible 0.6 format is on
the way. For the future, I have a short wish list, all XML
technical issues. (The content aspect already seems quite
complete.)
- It would be nice to have a standard namespace so
one could consider weaving XMLTV content together
with other XML vocabularies.
- An XML schema would be useful here to allow
stricter validation; a DTD can't cover the typed
data XMLTV carries around. It would also provide a
structured way to make visible the great
documentation hidden away in comments in the DTD
now.
- The application itself emits a DOCTYPE with a
relative location for the DTD; an HTTP URL might
be more appropriate, especially since the
application already requires access to the Web.
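Until that last item changes, a consumer that doesn't have the DTD sitting next to the feed can sidestep the relative reference. This is a minimal sketch of one common workaround, not something XMLTV itself provides: an EntityResolver that hands the parser an empty DTD. It assumes the feed does not depend on entities or default values declared in the DTD, and the DOCTYPE in the sample is my guess at what the tool emits.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DtdFreeParsing {
    /** Parses a feed while ignoring whatever external DTD it points to. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Answer every external entity request with empty content,
            // so the parser never goes looking for the DTD file.
            builder.setEntityResolver((publicId, systemId) ->
                    new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed DOCTYPE, for illustration only.
        Document doc = parse(
                "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE tv SYSTEM \"xmltv.dtd\">"
                + "<tv><channel id=\"C2wcbs.zap2it.com\"/></tv>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```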
Wrapping Up
Good software can be used as a building block to make
other software, and by this measure XMLTV -- both the de
facto standard and the software -- is very
useful. Although people's dreams of combining computers
with televisions have yet to pan out, now there are solid
mechanisms that let you combine Internet data with live
video, and insert your own software in between. People
have already done work combining closed captioning with
full-text indexing to find video clips of interest;
obviously a lot of work has been done to enable PVR
functionality. But the really exciting element is not what
has been done, but the convergence of three things:
interesting information, easy access and processing through
XML-based formats like XMLTV, and freely available, powerful
software. With those building blocks I am sure we will see
more and more innovative combinations of television and
computing in the future.