Googling for XML
by Bob DuCharme
February 11, 2004
The introduction of the O'Reilly book Google Hacks
tells us that the filetype: query qualifier restricts
your Google search to files whose names end with a particular
extension. The book's first example of this is
homeschooling filetype:pdf, a query that searches for the word
"homeschooling" in Adobe Acrobat files. The second example,
"leading economic indicators" filetype:ppt, looks for the
phrase "leading economic indicators" in Microsoft PowerPoint
presentations. (Of course, Google checks the file extension and
not the actual format; if an Excel spreadsheet with a "ppt" file
extension is in Google's index, the second search will look for
the target phrase there, and if a PowerPoint presentation with an
extension of "pres" is in the index, the same search will ignore
it.)
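The distinction in that parenthetical can be sketched in a few lines of Python. This assumes a purely extension-based match; the function name matches_filetype and the example.com URLs are hypothetical, and nothing here describes how Google actually implements the qualifier.

```python
import posixpath
from urllib.parse import urlparse

def matches_filetype(url, extension):
    """Return True if the file name in the URL ends with the given extension.

    Hypothetical helper: it looks only at the name, never at the content.
    """
    path = urlparse(url).path
    _, ext = posixpath.splitext(path)
    return ext.lower() == "." + extension.lower()

# An Excel spreadsheet saved with a "ppt" extension would still match...
print(matches_filetype("http://example.com/budget.ppt", "ppt"))
# ...while a real PowerPoint file with a "pres" extension would not.
print(matches_filetype("http://example.com/slides.pres", "ppt"))
```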
Being an XML geek, I had to run immediately to Google's
homepage to try searching XML files with this trick. Simply
entering filetype:xml
as a Google query returns nothing, so I entered filetype:xml
test to search for XML files with the word "test" in them, and
Google reported 329,000 hits. (All "hits" figures listed here will
have changed by the time you read this.) The query filetype:xml
-test, which searches for files with an extension of "xml"
that don't have the word "test" in their contents, gave me
1,080,000 hits. Adding the two figures together, my rough guess is
that about 1.4 million files with an extension of "xml" are in
Google's index.
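The arithmetic behind that estimate is simple enough to sketch. Every indexed file either contains the probe word or doesn't, so the hit counts of the two queries should add up to an approximate total. The function name estimate_total is hypothetical; the counts are the ones quoted above.

```python
def estimate_total(hits_with_term, hits_without_term):
    """Estimate how many files with a given extension are indexed.

    The two queries ("filetype:xml test" and "filetype:xml -test")
    split the files into those that contain the probe word and those
    that don't, so their hit counts are simply added.
    """
    return hits_with_term + hits_without_term

total = estimate_total(329000, 1080000)
print(total)  # 1409000, or roughly 1.4 million
```

The same pair of searches works for any extension, although the result is only as reliable as the hit counts that Google reports.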
Of course, it's a very rough guess. As you read about my
further experiments in searching only XML files of particular
document types, such as DocBook files and TEI files, as well as my
Google searches through RSS, FOAF and other RDF files, remember
that I based it all on hunches and guesswork. The
technical-sounding term for this exploration into Google
capabilities is "reverse engineering," but the most appropriate
term is the one that gave the name to the popular O'Reilly series:
hacks.
Googling for Specific Document Types
Running my test/-test pair of searches for files with an
extension of "xhtml" showed about half a million in Google's
index. This is useful to the many XML developers who know that
these HTML files are much more likely to be properly well-formed,
and maybe even valid against a DTD or schema, than files with
extensions of "html" or "htm".
Many XHTML files have an extension of "xml" as well, and
these present a problem when searching for XML documents of other
document types besides XHTML. For example, a search (filetype:xml
docbook) for files with an extension of "xml" that mention DocBook, a DTD popular for
technical writing and computer books since SGML days, will find
XHTML files that discuss DocBook as well as actual DocBook
files.
Let's look at some strategies for locating DocBook files
and then return to this issue of XHTML files that discuss
DocBook. Technically, DocBook has no namespace URI associated with
it, but when mixing DocBook elements with elements from other
namespaces, many people want to assign a namespace URI to those
elements, and "http://www.oasis-open.org/docbook/" seems to be
popular. As the ancestor directory for many DocBook DTD files,
this URL shows up in the SYSTEM parameter of a lot of DOCTYPE
declarations. A
search for files with an extension of "xml" that contain this
string turns up about 1,170 hits; many of these are DocBook files,
and many aren't.
The context phrases that Google search results show around these
hits often show tags from the DocBook DTD, making it easier to see
which ones are really DocBook documents. A search of XML files for
the quoted phrase
"oasis dtd docbook xml" gets about 1,560 hits because Google,
which ignores punctuation, often finds that phrase in a public
identifier string like "-//OASIS//DTD DocBook XML V4.2//EN". Some
of these files are actually HTML representations of complete
DocBook files, perhaps with numbers to show them as the "source
code" for some project.
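To see why the quoted phrase can match a public identifier, here is a rough sketch. It assumes that punctuation is simply treated as a word separator and that matching is case-insensitive. That is a guess at the effect, not a description of Google's tokenizer, and the helper names normalize and phrase_matches are hypothetical.

```python
import re

def normalize(text):
    """Lower-case the text and replace each run of non-alphanumerics with one space."""
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()

def phrase_matches(phrase, text):
    """Check whether the normalized phrase occurs as whole words in the normalized text."""
    return " " + normalize(phrase) + " " in " " + normalize(text) + " "

public_id = "-//OASIS//DTD DocBook XML V4.2//EN"
print(normalize(public_id))
print(phrase_matches("oasis dtd docbook xml", public_id))
```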
I tried adding the quoted string "doctype article" to
that last search and found some
surprising results. While Google supposedly doesn't index tag
names or the contents of DOCTYPE declarations, it apparently does
in certain circumstances. (Again: guesswork! Reverse engineering!
Hacks!) Several results for this query show a document "title"
(for HTML files, the part in the head
element's title element) that begins like this:
"<html> <head> </head><body><pre>".
Following
one of these links shows no such HTML tags. Following the
corresponding
link to the Google cache shows that the document was
"converted" to HTML for Google's cache by mapping all less-than
signs to &lt; entity references and then wrapping the whole
document in the appropriate HTML tags to make it one big
HTML pre element.
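A sketch of the kind of conversion that the cached copy suggests, assuming nothing more than the two steps just described. The function name wrap_as_pre is hypothetical. The opening tags are modeled on the "title" text shown in the search results, and the closing tags are my assumption.

```python
def wrap_as_pre(xml_text):
    """Escape less-than signs and wrap the document in one HTML pre element."""
    # Only "<" is mapped, as described above; a careful converter
    # would escape ampersands as well.
    escaped = xml_text.replace("<", "&lt;")
    return "<html> <head> </head><body><pre>" + escaped + "</pre></body></html>"

source = ('<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN">\n'
          '<article><title>Test</title></article>')
print(wrap_as_pre(source))
```

After this step the DOCTYPE declaration and the tag names are ordinary text inside the pre element, which would explain why they become searchable.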
This is good news for two reasons: first, while a Google
search for XML files of a particular document type may show you
plenty of XHTML documents that discuss that document type, as
opposed to actually being documents of that type, don't let a
string of HTML tags in Google's result listing discourage you --
the file might be a document of the type you're interested in
after all. Second, when Google does this, it apparently indexes
the entire DocBook document as the contents of an
HTML pre element, putting tag names and attribute values
in the index as well, because it just considers them to be
more pre content. When element and attribute names,
attribute values, and other markup metadata are in the index, you
can use them as search terms, which is why I got DocBook hits from
a search for "doctype article".
Another DTD that's been popular since SGML days is the
one developed by the Text Encoding
Initiative, a non-profit group that has worked to make it
easier to encode literary and linguistic texts since 1987. I had
disappointing results with a search of filetype:xml
"TEI DTD" ("TEI DTD" being a phrase in its public identifier),
but eventually figured out that "tei" is a more popular extension
for these files than "xml". For example, a search for
filetype:tei tei gave me 2,630 hits.
XHTML and TEI files aren't the only XML documents that
often don't have extensions of "xml". Running my test/-test pair
of searches for files with an extension of "rss" showed about
116,000 files in Google's index. Of course, they're not
necessarily all well-formed XML. Specialized RSS search engines do
exist, but the ability to search these files with Google means that you
can use all the other search techniques described here and in the
"Google Hacks" book to search RSS files. For example, a search
of
filetype:rss http://purl.org/rss/1.0 looks for files with an
"rss" extension that include the namespace URL for RSS 1.0 in
their content, resulting in 10,800 hits. Searching for the same
URL in files with an "rdf" extension (
filetype:rdf http://purl.org/rss/1.0) gave me 34,500 hits.
To search in both filetypes at once, use Google's OR
operator. (Remember to enter it in upper-case.) The search
filetype:rdf OR filetype:rss http://purl.org/rss/1.0 gave me
47,200 hits, and a more specific search for the term "XForms" in
RSS 1.0 files with an extension of "rdf" or "rss" (
filetype:rdf OR filetype:rss http://purl.org/rss/1.0 xforms)
found 21 files. Remember that not all of the documents found are
necessarily RSS 1.0, but odds are that most files with an
extension of "rss" or "rdf" that have the string
"http://purl.org/rss/1.0" in them are RSS 1.0 files.
Googling for RDF
RDF is used for more than RSS. FOAF, or Friend Of A
Friend, files are an experiment in the RDF community to store
personal metadata -- where people live and work, what their
interests are, and who their friends are. A typical FOAF file (mine, for example)
doesn't list all of a person's friends, but only those who
have FOAF files themselves; the growing collection of FOAF-to-FOAF
links provides sample data for various RDF experiments.
There are conventions for FOAF filenames, but no set
rules, so to search for FOAF files in Google, instead
of filetype: I used the inurl: qualifier. This
searches for URLs that have the specified string in them. Just
entering inurl:foaf
as a search term gave me 37,200 hits, but that included the FOAF
specs, articles about it, and associated software. Adding the FOAF
namespace URL to create a search query of
inurl:foaf http://xmlns.com/foaf/0.1/ gave me 1,090 hits, with
a much higher percentage of hits on the first Google result page
being actual FOAF RDF files. You can add search terms to this to
search within those files for a specific term -- for example, to
see how many of those FOAF files specify a value for the
FOAF workplacehomepage property, enter
inurl:foaf http://xmlns.com/foaf/0.1/ workplacehomepage.
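The queries used so far all follow one pattern: optional filetype: qualifiers joined with OR, an optional inurl: qualifier, then the search terms, with a leading hyphen on any term to exclude. A small sketch of a helper that assembles them follows. The names build_query and search_url are hypothetical, and the search URL format is an assumption rather than anything this article documents.

```python
from urllib.parse import urlencode

def build_query(terms=(), filetypes=(), inurl=None, excluded=()):
    """Assemble a query string from qualifiers, terms, and excluded terms."""
    parts = []
    if filetypes:
        # OR is written in upper-case so that it acts as an operator.
        parts.append(" OR ".join("filetype:" + ft for ft in filetypes))
    if inurl:
        parts.append("inurl:" + inurl)
    parts.extend(terms)
    parts.extend("-" + term for term in excluded)
    return " ".join(parts)

def search_url(query):
    """Percent-encode the query for a search URL (the base URL is an assumption)."""
    return "https://www.google.com/search?" + urlencode({"q": query})

rss_query = build_query(["http://purl.org/rss/1.0", "xforms"], filetypes=["rdf", "rss"])
foaf_query = build_query(["http://xmlns.com/foaf/0.1/", "workplacehomepage"], inurl="foaf")
print(rss_query)
print(foaf_query)
print(search_url(foaf_query))
```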
FOAF files and RSS 1.0 are the two most popular uses of
RDF that I know of. The OWL Web Ontology
Language provides infrastructure for the ontology part of the
Semantic Web. How popular is this set of RDF properties? A check
for filetype:rdf
owl showed 456 hits; repeated checks over time will give clues
about the progress of its popularity. Once the number of hits gets
into four figures, Semantic Web experiments are going to get
easier and easier.
What kind of experiments can we do with the RDF out
there? I've started playing a bit to answer this related
question: what else do people use RDF for besides FOAF and RSS
1.0? By searching for files that have an extension of "rdf" but
don't mention FOAF or RSS, I hope to find out. A Google query
of
filetype:rdf -rss -foaf ("show me files with a filetype of
'rdf' that don't have the strings 'rss' or 'foaf' in them") gave
me 150,000 hits. Of course, many turn out to be RSS or FOAF files
anyway, but this particular query reduces their percentage. Using
the Google API and a
simple Perl script described in "Google Hacks," I can pull down
URLs for some of these RDF files, and then a batch file that uses
the wget
utility can pull down the files themselves.
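For anyone without wget, the download step could be sketched in Python as well. This is a stand-in for my batch file, not a copy of it. The function names, the example.org URLs, and the numbered file names in a data directory are all assumptions made for illustration.

```python
import os
import urllib.request

def plan_downloads(urls, directory="data"):
    """Pair each URL with a numbered local path: data/1.rdf, data/2.rdf, ..."""
    return [(url, directory + "/" + str(i) + ".rdf")
            for i, url in enumerate(urls, start=1)]

def download_all(urls, directory="data"):
    """Fetch each URL to its numbered path, reporting network or file errors."""
    os.makedirs(directory, exist_ok=True)
    for url, path in plan_downloads(urls, directory):
        try:
            urllib.request.urlretrieve(url, path)
        except OSError as err:
            # URLError and HTTPError are both subclasses of OSError.
            print("could not fetch " + url + ": " + str(err))

# Placeholder URLs, not real search results.
example_urls = ["http://example.org/a.rdf", "http://example.org/b.rdf"]
print(plan_downloads(example_urls))
```

Calling download_all(example_urls) would attempt the actual transfers; the example above only prints the plan.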
Loading the RDF triples from these randomly collected
files into a single RDF triple store will create an interesting
collection of RDF to play with. I certainly can't assume that all
the files will contain good RDF, but using rdflib
and Python's exception handling ability, the following short
script rejects any RDF files that it can't parse, reads the rest
into a single triple store, and at the end saves it all as a
single RDF/XML file:
#! /usr/bin/python
from rdflib.TripleStore import TripleStore

store = TripleStore()

# Try to read files data/1.rdf, data/2.rdf ... data/34.rdf into a
# single TripleStore, then save that as test.rdf.
for i in range(1, 35):
    filename = "data/" + str(i) + ".rdf"
    try:
        store.load(filename)
    except Exception:
        # Reject any file that rdflib can't parse and move on.
        print "bad XML: " + filename

# Write the combined triples out as one file.
store.save("test.rdf")
It's only an experiment, and it's just a start, but I'm
confident that as I scale it up, analysis of the results will
reveal valuable information about how people use RDF. Repeating
the same experiment every six or eight months is bound to be
interesting as well, showing increases and decreases in the
popularity of various aspects of RDF.
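One possible first step in that analysis is to count which namespace URIs the collected files declare, since each namespace hints at a vocabulary in use. The sketch below uses only Python's standard library rather than rdflib. The function names are hypothetical, and the three sample documents are made up for illustration, not taken from my collection.

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET

def declared_namespaces(xml_bytes):
    """Return the set of namespace URIs declared anywhere in one XML document."""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    return {uri for _, (prefix, uri) in events}

def count_namespaces(documents):
    """Count how many documents declare each namespace, skipping unparseable ones."""
    counts = Counter()
    for doc in documents:
        try:
            counts.update(declared_namespaces(doc))
        except ET.ParseError:
            continue  # same idea as rejecting "bad XML" files
    return counts

samples = [
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    b'xmlns:foaf="http://xmlns.com/foaf/0.1/"/>',
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    b'xmlns="http://purl.org/rss/1.0/"/>',
    b'<broken',
]
for uri, n in count_namespaces(samples).most_common():
    print(n, uri)
```

Running the same count every few months would give a crude measure of which vocabularies are gaining or losing ground.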
Googling for...
Whether you're interested in RDF or any other kind of
XML, the presence of this freely accessible, constantly updated,
massive index of XML files known as Google is quite a
resource. Combining the techniques shown here with others in the
book "Google Hacks" gives you a lot to play with. You've seen one
of my ideas for future research to take advantage of this
resource. I look forward to seeing yours.