Building an XML-based Metasearch Engine on the Server
by Ralf Westphal
July 08, 1999
Applied XML Tutorial
In my last article I showed you how XML can make life so much easier for
metasearch engines.
I set up a scenario where two database driven address directory sites (called
"All Addresses" and
"Best of Addresses")
allowed access to their data through a simple query interface. In addition to a
regular search engine user interface, they also returned results in an XML
format.
These XML formats were then the foundation on which I built a client side metasearch
engine. After the user entered some search criteria, it queried the
address directory sites and consolidated the returned XML data into one
homogeneous result list (see Figure 1 for a layout of this scenario).
Figure 1: Information flow of the client side metasearch engine
The metasearch engine worked just fine using XML, XSL and XQL (XSL pattern
matching). It had only one drawback: it was very dependent on Internet Explorer
5.0 and its MSXML XML engine. Only clients running IE5 were able to use it.
Today I'd like to show you how we can move the metasearch process to the
server and deliver browser independent HTML to any client. (I hope you don't mind that
this solution will also rely on the MSXML component; but this time it's only
needed in one place: on the server.) [Download the code samples]
1. Moving the Metasearch to the Server
First let's have a look at the server side metasearch engine. It's implemented as an ASP page. When
serverside.asp is loaded, it displays the same user interface as the client side metasearch
engine did. Let's try it:
You'll notice that some of the addresses are displayed with a yellow
background. This is to distinguish the data coming from the
different address directory databases. Addresses retrieved from site "Best
of Addresses" are highlighted; the unmarked ones are from "All
Addresses".
Name                            | Street                 | ZIP/City      | Tel          | Fax
Firma Karl-Heinz Rosowski       | Maikstraße 14          | 22041 Hamburg | 721 99 64    | 21110111
Fa. Kehlenbeck & Marquardt GmbH | Kanalstr. 47 a         | 22041 Hamburg | 280 68 17    | 354827
Firma Dieter Schreyack          | Zum Meeschensee 65     | 22041 Hamburg | 04193/783 90 | 2514250
Firma Willi H. Matschuck & Sohn | Poppenbütteler Weg 90  | 22041 Hamburg | 538 20 24/25 | 6429234
Firma Hans-Jürgen Knaak         | Am Schiffbeker Berg 10 | 22041 Hamburg | 732 77 44    | 6547105
Table 1: Sample result of the server side metasearch engine
From a user's point of view the difference between the client side and the
server side metasearch engine is small. The user interface looks the same. So
where's the difference? For one, the server side solution can be used with any
browser, provided it produces browser independent HTML, e.g. HTML 3.2. But let's have a closer look at the
information flow of the server side solution (see Figure 2).
Figure 2: Information flow of the server side metasearch engine
As you can see, there's a bit more traffic between the client and the server,
as is normal for server side database applications. The server sends a form to
the client to fill out, the client sends back some information to the server,
the server then does its database work and returns a result page to the client.
Nothing unusual here.
What is unusual in this scenario is that, in order to produce a result to send back
to the client, the server contacts other servers on the Internet! The metasearch
engine server thus temporarily becomes an Internet client itself. Whereas in
Figure 1 most of the traffic flowed between the client and the database
servers, now the traffic is between the metasearch engine server and the
database servers.
Retrieving XML Data from other Servers
The similarity in functionality and information flow between our former
client side metasearch engine and the server side solution suggests that there
should not be much of a difference in how the server side metasearch engine
works. Let's take a look at the code:
...
<%
if request.form("Searchword") <> "" then
    Dim xml1, xml2
    ' Get the XML results from database site "All Addresses" into xml1
    ' Get the XML results from database site "Best of Addresses" into xml2

    Dim xml, adr, adrList
    Set xml = CreateObject("Microsoft.XMLDOM")
    xml.appendChild xml.createElement("SearchResult")

    Set adrList = xml1.selectNodes("Addresses/Address")
    For Each adr In adrList
        xml.documentElement.appendChild adr.cloneNode(True)
    Next
...
This should look very familiar to you. We are reading XML results from the
database sites into XML DOMs (xml1 and xml2) and then consolidating
them in the XML DOM xml.
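The consolidation step itself does not depend on ASP or MSXML. As a rough illustration, here is a minimal Python sketch that uses the standard library's ElementTree in place of the XML DOM. The sample documents and the function name consolidate are assumptions made for this example; they are not part of the original code.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample results; the real documents come from the two sites.
ALL_ADDRESSES = "<Addresses><Address><Name>A</Name></Address></Addresses>"
BEST_OF_ADDRESSES = "<Addresses><Address><Name>B</Name></Address></Addresses>"


def consolidate(*documents):
    """Copy every Address element of each document under one SearchResult root."""
    result = ET.Element("SearchResult")
    for doc in documents:
        for address in doc.findall("Address"):
            # deepcopy plays the role of cloneNode(True): the source stays intact.
            result.append(copy.deepcopy(address))
    return result


xml1 = ET.fromstring(ALL_ADDRESSES)
xml2 = ET.fromstring(BEST_OF_ADDRESSES)
merged = consolidate(xml1, xml2)
print(ET.tostring(merged, encoding="unicode"))
```

Copying rather than moving the elements keeps the source documents untouched, which mirrors the cloneNode(True) call in the ASP code.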
But as you can see, I've left out a very important point: how do we read in
the XML results? On the client side we used IE5 XML islands. But there is no
(D)HTML page on the server; it is only in the process of being generated.
Instead we can use the MSXML component directly:
set xml1 = CreateObject("Microsoft.XMLDOM")
xml1.async = false
xml1.Load "http://www.ralfw.de/xml-com/metasearch/alladdr/searchxml.asp?whereField=" & _
    request.form("Field") & _
    "&pattern=" & request.form("Searchword") & _
    "&orderbyField=" & request.form("SortField")

set xml2 = CreateObject("Microsoft.XMLDOM")
xml2.async = false
xml2.Load "http://www.ralfw.de/xml-com/metasearch/bestOfAddr/findxml.asp?whereField=" & _
    request.form("Field") & _
    "&pattern=" & request.form("Searchword") & _
    "&orderbyField=" & request.form("SortField")
This technique works as long as the servers we want to query provide an HTTP
GET request "interface". That means that as long as we can pass all query
parameters as URL parameters, we can ask the MSXML parser to retrieve the XML
data from a URL (instead of a local file on the server).
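One detail worth keeping in mind when passing user input as URL parameters is escaping: a search word containing a space or an ampersand would otherwise corrupt the query string. The following Python sketch shows the idea with the standard library's urlencode. The endpoint is the one used in the article; the helper name build_query_url and the sample values are assumptions for illustration. In classic ASP, Server.URLEncode serves a similar purpose.

```python
from urllib.parse import urlencode

# Endpoint taken from the article; the helper name below is hypothetical.
ALL_ADDRESSES_URL = "http://www.ralfw.de/xml-com/metasearch/alladdr/searchxml.asp"


def build_query_url(base_url, field, pattern, sort_field):
    """Return a GET URL carrying the three query parameters, percent-encoded."""
    params = {
        "whereField": field,
        "pattern": pattern,
        "orderbyField": sort_field,
    }
    return base_url + "?" + urlencode(params)


url = build_query_url(ALL_ADDRESSES_URL, "Name", "Knaak & Sohn", "ZIP")
print(url)
```

The ampersand inside the search word is encoded, so the receiving server still sees exactly three parameters.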
Things become more complicated when the servers to be queried have only an
HTTP POST request "interface". I'll tackle a solution for that in a
future article, when I talk about more bidirectional XML communication,
e.g. in B2B scenarios.
Transforming and Formatting
As you can imagine, the lack of XML islands also makes changes to the use of
the XSL stylesheets necessary. We need a stylesheet for transforming the XML
data from "Best of Addresses" to our "canonical" XML address
format (which "All Addresses" already provides). And we need another
stylesheet for sorting the consolidated data in xml as well as
transforming it into plain HTML.
dim xml2Transformed, ssBestOfAddr
set xml2Transformed = CreateObject("Microsoft.XMLDOM")
set ssBestOfAddr = CreateObject("Microsoft.XMLDOM")
ssBestOfAddr.Load server.MapPath("/") & "/.../bestofaddr.xsl"
xml2.transformNodeToObject ssBestOfAddr.documentElement, xml2Transformed

Set adrList = xml2Transformed.selectNodes("Addresses/Address")
For Each adr In adrList
    xml.documentElement.appendChild adr.cloneNode(True)
Next
As in the above example, we compensate for the lack of XML islands with the explicit use of an XML
DOM object: ssBestOfAddr. The metasearch engine loads the stylesheet, and xml2
applies it to itself, thereby producing an XML DOM (xml2Transformed)
containing the transformed XML element tree.
The last task remaining is transforming the consolidated address list in xml
to HTML. As on the client side, we do this by applying another stylesheet and
sending the resulting HTML <table> back to the client:
if xml.documentElement.hasChildNodes then
    dim ss, searchResults
    set ss = CreateObject("Microsoft.XMLDOM")
    ss.Load server.MapPath("/") & "/.../serverSideAddresses.xsl"
    ' Insert the requested sort order into the stylesheet
    ss.selectSingleNode("//@order-by").nodeValue = "+" & request.form("sortfield")

    set searchResults = CreateObject("Microsoft.XMLDOM")
    xml.transformNodeToObject ss.documentElement, searchResults
    response.write searchResults.xml
end if
Of course we only need to do the transformation if there were any addresses
returned from the database sites we queried. The stylesheet serverSideAddresses.xsl
looks just like the stylesheet we used on the client. And as before we need to
tweak it a little bit by inserting the requested sort order.
But there's a small thing I added. As you noticed in Table 1 above, the
addresses are color coded according to their origin. This is accomplished in two
steps:
1. After receiving the XML address data each address "record" is
tagged. The consolidation process simply adds an attribute (source) to
each <Address>-element while copying it to xml.
Dim xml, adr, adrList, sourceAttr, clone
...
Set sourceAttr = xml.createAttribute("source")
sourceAttr.nodeValue = "AllAddresses"

Set adrList = xml1.selectNodes("Addresses/Address")
For Each adr In adrList
    set clone = adr.cloneNode(True)
    clone.Attributes.setNamedItem sourceAttr.cloneNode(True)
    xml.documentElement.appendChild clone
Next
2. Within the stylesheet for transforming xml to an HTML table the source-attribute
is checked, and if it designates an address from site "Best of
Addresses" the name column is highlighted by adding a background color to
its <td>-element:
<xsl:for-each select="Address" order-by="+ZIP">
  <tr>
    <td width="20%" valign="top">
      <xsl:choose>
        <xsl:when match="*[@source = 'BestOfAddresses']">
          <xsl:attribute name="bgcolor">yellow</xsl:attribute>
        </xsl:when>
      </xsl:choose>
      <small><font face="Arial"><xsl:value-of select="Name" /></font></small>
    </td>
With <xsl:attribute> the attribute is added to the XML output node it is
located in, which is the <td> node.
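The two steps above (tag each address with its origin, then highlight by origin when rendering) can be sketched without XSL. This is a minimal Python sketch under stated assumptions: the sample data and the function names tag_and_copy and render_rows are hypothetical, and the row markup is reduced to the name column.

```python
import copy
import xml.etree.ElementTree as ET
from html import escape


def tag_and_copy(addresses, source, target):
    """Step 1: clone each Address into target and mark where it came from."""
    for address in addresses:
        clone = copy.deepcopy(address)
        clone.set("source", source)
        target.append(clone)


def render_rows(result):
    """Step 2: emit one table row per Address, highlighting one source."""
    rows = []
    for address in result.findall("Address"):
        is_best_of = address.get("source") == "BestOfAddresses"
        highlight = ' bgcolor="yellow"' if is_best_of else ""
        name = escape(address.findtext("Name", default=""))
        rows.append(f"<tr><td{highlight}>{name}</td></tr>")
    return "\n".join(rows)


# Hypothetical sample data standing in for the two sites' results.
xml1 = ET.fromstring("<Addresses><Address><Name>A &amp; Co</Name></Address></Addresses>")
xml2 = ET.fromstring("<Addresses><Address><Name>B</Name></Address></Addresses>")
result = ET.Element("SearchResult")
tag_and_copy(xml1.findall("Address"), "AllAddresses", result)
tag_and_copy(xml2.findall("Address"), "BestOfAddresses", result)
html_rows = render_rows(result)
print(html_rows)
```

Because the origin travels with each record as an attribute, the rendering step needs no knowledge of how many sites were queried.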
That's it
We've finished moving the metasearch engine to the server. It wasn't
necessary to change any of its workings, except for replacing the XML islands
with explicit XML DOM objects. This demonstrates nicely how easy it is to set up
client-server communication using XML, as well as server-to-server
communication. Given a well defined interface (how to pass parameters to the
server plus an XML data format for the resulting data), an XML DOM component like
Microsoft's MSXML COM component is sufficient for the job.
2. Paged Display of XML Result Sets
Since it was so easy to put the metasearch engine on the server, I'd like to
add a little twist to it before I leave you alone with it. The question I'd
like to raise is: how can we limit the addresses displayed to a certain number
per page? It's a must-have feature for all search engines: don't throw
thousands of result items at the user, but show just a subset of them at a
time.
So far we are transforming all the address data we retrieved from several
database sites into an HTML table and sending it to the client. But how can we limit
the display to a certain number of addresses at a time without
sacrificing our XSL solution?
Using XSL to Display Subsets
In a "traditional" ASP-solution we'd have a recordset and a loop
to iterate over it, for example:
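Set rs = CreateObject("ADODB.Recordset")
rs.Open "select...", ...
rs.PageSize = 20
rs.AbsolutePage = 3   ' jump to the first record of page 3
For i = 1 To rs.PageSize
    If rs.EOF Then Exit For   ' the last page may be shorter
    response.write rs.fields("name").value & "<br>"
    rs.MoveNext
Next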
XSL, however, is essentially descriptive, not algorithmic. Still, it
provides looping constructs, and we are already using them:
<xsl:for-each select="Address" order-by="+ZIP">
  <tr>
  ...
<xsl:for-each> implicitly iterates through the list of <Address>-elements below
the document element. What we now have to do is find a way to output the
content of the <xsl:for-each>-element only for a specific range of
records, e.g. addresses 10 to 20.
First I thought the solution would come easily by adding a twist of XQL to
the select-attribute. XSL patterns provide a function (index())
to get at the index of a node in its parent nodelist. I added a filter to the
query and felt very confident:
<xsl:for-each select="Address[index() $ge$ 0 and index() $le$ 9]" order-by="+ZIP">
The select-attribute now limits the <Address>-elements to
the ones with indexes from 0 to 9. So far no problem. But when I looked at the
first page, although it was limited to just a couple of addresses, it contained
the wrong ones. The XSL engine had worked properly, but I had misjudged the
order in which the select- and order-by-attributes are processed. Stupid me.
Instead of first applying the sort clause and then selecting the first couple of
records, of course it worked the other way around. I was presented with a
selection taken from the unsorted list, which contained only entries from the
first database site. So I had to go back to the drawing board and see how I could
1. sort all addresses, and 2. select just the ones I wanted.
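The difference between the two processing orders is easy to reproduce in isolation. The following Python sketch uses a plain list of hypothetical ZIP codes in document order (the values and the page size are assumptions for illustration) and compares "filter by position, then sort" with "sort, then filter by position".

```python
# Hypothetical ZIP codes in document order: first site's entries, then the second's.
zips = ["22041", "80331", "50667", "10115", "20095", "01067"]

PAGE_SIZE = 3  # assumed page size for the example

# What the filtered select-attribute did: pick by position first, sort afterwards.
filter_then_sort = sorted(zips[:PAGE_SIZE])

# What a paged display needs: sort everything, then take the first page.
sort_then_filter = sorted(zips)[:PAGE_SIZE]

print(filter_then_sort)  # ['22041', '50667', '80331']
print(sort_then_filter)  # ['01067', '10115', '20095']
```

Only the second variant yields the first page of the sorted result; the first merely sorts whatever happened to come first in the document.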
Rescue came by means of the <xsl:if> element.
<xsl:for-each select="Address" order-by="+ZIP">
  <xsl:if test="context()[index() $ge$ 0 and index() $le$ 9]">
    <tr>
    ...
<xsl:for-each> selects and sorts <Address>-elements
as before and iterates over all (!) of them. But now we decide which ones
we'll actually output by checking their index within the loop. However,
since <xsl:if> and its test-attribute are independent of the
elements selected by <xsl:for-each>, we have to explicitly grab that
list (the context of the current XSL element) with the context()-function.
What was left was setting the range of indexes dynamically according to the
page requested.
set xslNode = ss.selectSingleNode("//xsl:for-each")
xslNode.attributes.getNamedItem("order-by").nodeValue = "+" & session("sortfield")
xslNode.selectSingleNode("xsl:if").attributes.getNamedItem("test").nodeValue = _
    "context()[index() $ge$ " & ((page-1)*PAGESIZE) & _
    " and index() $lt$ " & ((page-1)*PAGESIZE + PAGESIZE) & "]"
It works just like setting the sort order before: find the <xsl:if>-element
in the XSL stylesheet and set its test-attribute to an XSL pattern
containing a range of indexes depending on the current page and the page size.
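The index arithmetic is the part that is easiest to get wrong by one. As a worked example, this Python sketch builds the same pattern string for a given page; the function name page_test_pattern and the page size of 10 are assumptions for illustration, since the value of the PAGESIZE constant is not shown above.

```python
PAGE_SIZE = 10  # assumed value; the article's PAGESIZE constant is not shown


def page_test_pattern(page, page_size=PAGE_SIZE):
    """Build the test pattern for a 1-based page number (indexes are 0-based)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    first = (page - 1) * page_size
    last_exclusive = first + page_size
    return f"context()[index() $ge$ {first} and index() $lt$ {last_exclusive}]"


print(page_test_pattern(1))  # context()[index() $ge$ 0 and index() $lt$ 10]
print(page_test_pattern(3))  # context()[index() $ge$ 20 and index() $lt$ 30]
```

Using an inclusive lower bound and an exclusive upper bound means consecutive pages neither overlap nor leave gaps.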
Now let's have a look at how this is working out.
MSXML and the Standards
The code presented here relies heavily on the Microsoft XML COM component
MSXML, which is included with Internet Explorer 5.0. Unfortunately, the
component does not implement the current XSLT draft. For example, the order-by-attribute
used above is not compatible with the XSL draft. Instead the <xsl:sort>-element should
be used, which is not yet implemented in MSXML. Also, Microsoft has added some
proprietary methods to the XML DOM, e.g. selectSingleNode.
But still I'm using the component in my examples. Why is that, you might
ask? Because it works for the things I want to demonstrate here. This column
is concerned with "How and where to use XML, XQL etc.?", not with
"Which tool is the best?" or "Which tool most closely adheres to
the standards?" (have a look at http://www.webstandards.org
if you are concerned with this question). Please take the sample code I'm
showing you as a bag of ideas. For example: using the proprietary method selectSingleNode
instead of reaching the same effect with standard XML DOM methods simply helps
to get the point across: "An XSL stylesheet is XML data and you can
manipulate it using the XML DOM." This is what you should take home from
it.
Please don't get me wrong, I'm all in favour of standards. But waiting for
standards is no excuse for not learning about the benefits of the concepts and
technologies to be standardized before they get standardized. Plus, if there
are working, pragmatic solutions out there, why not use them?
Keeping the Result Set across Page Calls
Only one problem remains to be solved: How do we keep the search result
across page calls? Surely we don't want to requery all the database sites
whenever the user just wants to flip to another page in the same result set. One
way would be to store the consolidated XML DOM as raw XML in a session variable.
That would be a trivial solution, but it would cost quite a bit of performance: we'd have
to serialize the XML DOM and deserialize it for each page change. On the other
hand there would be no threading problems, since only plain text would get stored
in an ASP session variable.
Fortunately there's a much better solution. MSXML provides a free threaded
version of its XML-parser component. So we can actually keep the whole XML DOM
alive across page calls by storing an object reference in an ASP session
variable.
if request.form("Searchword") <> "" or _
   (request.querystring("page") <> "" and isObject(session("xml"))) then
    if request.querystring("page") = "" then
        Dim xml, xml1, xml2, ss
        ...
        set xml = CreateObject("Microsoft.FreeThreadedXMLDOM")
        ...
        set session("xml") = xml
        session("sortfield") = request.form("sortfield")

        set ss = CreateObject("Microsoft.FreeThreadedXMLDOM")
        ss.Load server.MapPath("/") & "/.../serverSideAddressesPaged.xsl"
        set session("ss") = ss
    else
        set xml = session("xml")
        set ss = session("ss")
    end if
    ...
Whenever the user issues a new query, the metasearch engine retrieves data
from other sites, consolidates it in xml and stores xml as well as
the stylesheet (ss) that transforms xml into an HTML table in session
variables. Both XML DOM objects are created using the progID Microsoft.FreeThreadedXMLDOM.
Because they are created as free threaded objects, they don't glue the ASP page to a
certain execution thread within the Internet Information Server. This is very
important for high performance web sites.
But be careful: If you use the free threaded version of the XML DOM, you
can't exchange nodes with instances of the non-free-threaded version. So be
sure to use only one threading model for the XML DOM within an ASP page.
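Leaving the threading model aside, the control flow of this caching scheme is small enough to sketch on its own. In the following Python sketch a plain dictionary stands in for the ASP session, and run_metasearch is a hypothetical placeholder for querying and consolidating the remote sites; a counter shows that a page flip does not trigger a second query.

```python
import xml.etree.ElementTree as ET

fetch_count = 0  # counts how often the remote sites would have been queried


def run_metasearch(searchword):
    """Hypothetical stand-in for querying and consolidating the remote sites."""
    global fetch_count
    fetch_count += 1
    result = ET.Element("SearchResult")
    result.set("query", searchword)
    return result


def handle_request(session, searchword="", page=""):
    """Return the result tree, querying the sites only for a new search."""
    if searchword:
        session["xml"] = run_metasearch(searchword)
    elif not (page and "xml" in session):
        return None  # neither a new search nor a page flip on a stored result
    return session["xml"]


session = {}  # a plain dict standing in for the ASP session object
first = handle_request(session, searchword="Hamburg")
second = handle_request(session, page="2")
print(first is second, fetch_count)  # True 1
```

The parsed tree is kept as a live object rather than as serialized text, which is the point of the object-reference approach described above.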
There isn't more to server side paged display of XML data: keep the data
around across page calls and know thy stylesheet. Or maybe you want to use the
XML DOM directly to generate the HTML, which can be faster, since you don't
have to iterate over all the elements in order to display one page.
Oh, and there is one last thing you could improve: You could sort the
consolidated XML data in xml only once before saving a reference to the
object in the session variable. Then in the stylesheet you could do without the order-by-attribute
and gain performance. But I'll leave that to you as an exercise ;-)
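As a starting point for that exercise, here is one possible shape of the idea, sketched in Python with ElementTree rather than MSXML. The function names sort_addresses and page_of, the sample data and the page size are assumptions for illustration; this is a sketch of the approach, not the article's implementation.

```python
import xml.etree.ElementTree as ET

PAGE_SIZE = 2  # assumed page size for the example


def sort_addresses(result, sort_field):
    """Reorder the Address children in place, once, before caching the tree."""
    result[:] = sorted(result, key=lambda adr: adr.findtext(sort_field, default=""))


def page_of(result, page, page_size=PAGE_SIZE):
    """Return the Address elements of a 1-based page without visiting the rest."""
    first = (page - 1) * page_size
    return result[first:first + page_size]


# Hypothetical consolidated result in unsorted document order.
result = ET.fromstring(
    "<SearchResult>"
    "<Address><ZIP>80331</ZIP></Address>"
    "<Address><ZIP>22041</ZIP></Address>"
    "<Address><ZIP>10115</ZIP></Address>"
    "</SearchResult>"
)
sort_addresses(result, "ZIP")
print([adr.findtext("ZIP") for adr in page_of(result, 1)])  # ['10115', '22041']
print([adr.findtext("ZIP") for adr in page_of(result, 2)])  # ['80331']
```

Sorting happens once per query, and each page request then touches only the elements it displays.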
If you like, let me know if this
article was of any value to you.