Using PHP 5's SimpleXML
by Adam Trachtenberg, coauthor of
PHP Cookbook
01/15/2004
XML is great, but I've constantly wondered why it's so difficult to parse.
Most languages provide you with three options: SAX, DOM, and XSLT. Each has its
own problems:
- SAX's event-based design forces you to track elements manually, by pushing
and popping them on and off of a stack.
- DOM is bulky and cumbersome. While comprehensive, it takes seven lines just
to read <hello>.
- XSLT? If I wanted to program in a functional language, I'd use Lisp instead
of PHP.
SimpleXML is a new and unique feature of PHP 5 that solves these problems
by turning an XML document into a data structure you can iterate through like a
collection of arrays and objects. It excels when you're only interested in an
element's attributes and text and you know the document's layout ahead of time.
SimpleXML is easy to use because it handles only the most common XML tasks,
leaving the rest for other extensions.
This article shows how to use SimpleXML to read an XML file, parse the
results into a useful form, and query the document with XPath. I use RSS for
the examples, since some versions of RSS are nice and easy. Then there's RSS
1.0. It uses RDF, multiple namespaces, and defines a default namespace for its
elements. (Not so nice and easy.)
Along the way, there's a brief discussion of XML namespaces and XPath, since
they're necessary to process XML documents that expand beyond the basics. In
particular, to handle RSS 1.0, you need to work with these XML
specifications.
To try SimpleXML, you need a copy of PHP 5 Beta 3, as not everything
described here works in earlier versions. SimpleXML also requires
libxml2, an open source XML parsing library that all of PHP 5's
XML extensions now use. SimpleXML support is enabled by default, so it's
automatically installed when you build PHP 5.
Like PHP 5, SimpleXML is beta quality. There are still a few bugs, memory
leaks, and unimplemented features, but overall it's coming together nicely.
Reading XML
The first set of examples uses the following chunk of RSS, which is stored in
rss-0.91.xml:
<?xml version="1.0" encoding="utf-8" ?>
<rss version="0.91">
  <channel>
    <title>PHP: Hypertext Preprocessor</title>
    <link>http://www.php.net/</link>
    <description>The PHP scripting language web site</description>
  </channel>
  <item>
    <title>PHP 5.0.0 Beta 3 Released</title>
    <link>http://www.php.net/downloads.php</link>
    <description>PHP 5.0 Beta 3 has been released. The third beta
    of PHP is also scheduled to be the last one (barring unexpected
    surprises).</description>
  </item>
  <item>
    <title>PHP Community Site Project Announced</title>
    <link>http://shiflett.org/archive/19</link>
    <description>
    Members of the PHP community are seeking volunteers to help
    develop the first web site that is created both by the community and for
    the community.</description>
  </item>
</rss>
To begin, create a new SimpleXML object. For XML on disk, use
simplexml_load_file('/path/to/file.xml'). If the XML is stored in a PHP
variable, use simplexml_load_string($xml). To load the RSS:
$s = simplexml_load_file('rss-0.91.xml');
Element text is accessed like object properties:
print $s->channel->title . "\n";
PHP: Hypertext Preprocessor
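As a minimal sketch of the other loading path, here is simplexml_load_string() with a basic failure check. The inline document is a cut-down stand-in for rss-0.91.xml, and the $title variable is my own naming. I'm assuming the behavior of released PHP 5 versions, where both load functions return false when the XML can't be parsed.

```php
<?php
// Cut-down stand-in for rss-0.91.xml, kept inline so the sketch is self-contained.
$xml = <<<XML
<rss version="0.91">
  <channel>
    <title>PHP: Hypertext Preprocessor</title>
  </channel>
</rss>
XML;

$s = simplexml_load_string($xml);
if ($s === false) {
    // The load functions return false when the document is not well-formed.
    die("Could not parse XML\n");
}

// Cast to string to get the element's text instead of the SimpleXML object.
$title = (string) $s->channel->title;
print $title . "\n";
```

The explicit (string) cast matters once you do more than print: without it, $title holds an object, not text.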
If there's more than one element with the same name at the same level of the
document, they're placed inside an array. In this example, there's only one
<channel>, but two <item>s. To access
an <item>, use its location in the array:
print $s->item[0]->title . "\n";
PHP 5.0.0 Beta 3 Released
To print all titles, use a foreach loop:
foreach ($s->item as $item) {
    print $item->title . "\n";
}
PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced
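If you'd rather work with ordinary PHP data than with SimpleXML objects, one option is to copy the text into a plain array as you loop. This is only a sketch: the two-item document and the $titles variable are placeholders of mine, not part of rss-0.91.xml.

```php
<?php
// Placeholder document with two <item>s at the same level.
$xml = '<rss version="0.91">'
     . '<item><title>First</title></item>'
     . '<item><title>Second</title></item>'
     . '</rss>';
$s = simplexml_load_string($xml);

// Collect each title as a plain string; the cast detaches it from the XML tree.
$titles = array();
foreach ($s->item as $item) {
    $titles[] = (string) $item->title;
}

print count($titles) . " items\n";
print $titles[1] . "\n";
```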
Use array notation to read element attributes:
print $s['version'] . "\n";
0.91
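When you don't know the attribute names ahead of time, released PHP 5 versions also let you loop over them with the attributes() method. A rough sketch, assuming that method is available in your build; the generator attribute is a made-up example and not part of RSS 0.91.

```php
<?php
// The "generator" attribute is hypothetical; it is here only to show iteration.
$s = simplexml_load_string('<rss version="0.91" generator="example"><channel/></rss>');

// Array notation reads one attribute by name.
$version = (string) $s['version'];

// attributes() yields name => value pairs in document order.
$attrs = array();
foreach ($s->attributes() as $name => $value) {
    $attrs[$name] = (string) $value;
}

print $version . "\n";
print implode(', ', array_keys($attrs)) . "\n";
```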
Other XML features, like comments and processing instructions, are
unsupported. You can't (yet) access these nodes. However, since most XML
documents don't place vital information in comments or use processing
instructions, this isn't a big drawback.
Querying with XPath
SimpleXML lets you query a document with XPath.
Find and print all the text inside title elements with:
foreach ($s->xsearch('//title') as $title) {
print "$title\n";
}
PHP: Hypertext Preprocessor
PHP 5.0.0 Beta 3 Released
PHP Community Site Project Announced
The xsearch() method searches a SimpleXML object and returns
an array of matching nodes. Pass your XPath query as the argument. In this
case, //title finds all title elements regardless of location in
the tree. Or, restrict the search to only <title>s inside
of <item>s with //item/title.
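To make the difference between the two queries concrete, here's a sketch that runs both against a trimmed copy of the feed. One assumption to flag: I call the method xpath(), which is the name I know from released PHP 5 versions; on Beta 3, substitute the xsearch() name used in this article. The texts() helper is my own.

```php
<?php
// Trimmed copy of the feed: one channel title and two item titles.
$xml = '<rss version="0.91">'
     . '<channel><title>PHP: Hypertext Preprocessor</title></channel>'
     . '<item><title>PHP 5.0.0 Beta 3 Released</title></item>'
     . '<item><title>PHP Community Site Project Announced</title></item>'
     . '</rss>';
$s = simplexml_load_string($xml);

// Hypothetical helper: convert an array of matched nodes to plain strings.
function texts($nodes) {
    $out = array();
    foreach ($nodes as $node) {
        $out[] = (string) $node;
    }
    return $out;
}

$all       = texts($s->xpath('//title'));      // every <title> in the tree
$itemsOnly = texts($s->xpath('//item/title')); // only <title>s directly inside <item>s

print count($all) . " titles overall, " . count($itemsOnly) . " inside items\n";
```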
If you've used XSLT, you're familiar with XPath. XSLT templates use XPath
expressions to determine when to process a node. For more on XPath, read John
E. Simpson's XPath and XPointer (O'Reilly) or John's XML.com article, Top Ten Tips to
Using XPath and XPointer. Additionally, Chapter 9 of XML
in a Nutshell, by Elliotte Rusty Harold and W. Scott Means (O'Reilly), covers XPath and is available free online.
While these examples are somewhat trivial, XPath is quite useful with
complex documents, as you can create sophisticated queries to return finely
tuned results.
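As a small taste of a more selective query, this sketch uses an XPath predicate to fetch the link of every item whose title contains the word "Beta". As before, xpath() is the method name from released PHP 5 versions (use xsearch() on Beta 3), and the $links variable is my own.

```php
<?php
// Two items from the feed, with titles and links only.
$xml = '<rss version="0.91">'
     . '<item><title>PHP 5.0.0 Beta 3 Released</title>'
     . '<link>http://www.php.net/downloads.php</link></item>'
     . '<item><title>PHP Community Site Project Announced</title>'
     . '<link>http://shiflett.org/archive/19</link></item>'
     . '</rss>';
$s = simplexml_load_string($xml);

// The predicate [contains(title, "Beta")] keeps only the matching <item>s;
// the trailing /link then selects each surviving item's <link> child.
$links = array();
foreach ($s->xpath('//item[contains(title, "Beta")]/link') as $link) {
    $links[] = (string) $link;
}

print implode("\n", $links) . "\n";
```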