Building A Web Spider
by Chris Payne
Introduction
Web spiders are probably one of the most useful tools ever developed for the
internet. After all, with millions of separate and different sites out there
today, how else can you gather all that information?
A spider does one thing - it goes out on the web and collects information.
A typical spider (like Yahoo's) works by looking at one page and
finding the relevant information. It then follows all the links on that
page, collecting relevant information from each page it reaches, and so on.
Pretty soon, you'll end up with thousands of pages and bits of information
in your database. This web of paths is where the term 'spider' comes from.
So how do you create a web spider? We'll explain that below, but first we'll
need to outline some concepts.
Fundamentals
Web spiders can be built to search many things. In fact, there are several
specific commercial spiders out there, and these applications draw big bucks
($300k to license Altavista's technology, for example). Here are the
fundamentals of a web spider:
- Collects information from a variety of sources
  Ideally, a spider should be able to draw on any source, without
  limits. The more sources, the better.
- Accurate
  We all know the complaints about search engines returning a million
  or more results when only the last two are what you are looking for
  (or even worse, the middle two). A spider should be accurate in the
  items it returns, and in many cases, specific (i.e., a spider that
  only returns a certain type of information, such as the gaming
  spiders on www.enfused.com).
- Relatively up-to-date
  This depends on the technique you use to implement the spider (see
  section below), but a spider should return up-to-date information,
  or at least reasonably so. There's no point (in most cases) in having
  a spider if it only returns items that are 5 years old.
- Relatively quick
  The point of a spider is to make information gathering faster. It
  doesn't matter how accurate your spider is if it takes forever to
  return results.
Techniques
There are a few ways to spider. The first, which I'll call general
spidering, simply grabs a page, and searches it for whatever you're looking
for - for instance, a search phrase. The second, specific spidering, grabs
only a certain portion of a page. This scenario is useful in cases where
you might want to grab news headlines from another site.
General spidering is the easier of the two. First of all, you don't need
to have any knowledge of the page beforehand. Simply look within that page
for your search term, and links to other pages. If you want to get fancy,
you can build in functionality to ignore links that are within the same
site.
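As a rough sketch of that last idea, the two helper functions below compare the host of a link with the host of the page it was found on. The names GetHost and IsSameSite are made up for this illustration, and the code assumes plain URLs of the form scheme://host/path; ports, query strings directly after the host, and schemes such as mailto: are not handled.

```vbscript
' Returns the host portion of an absolute URL, lower-cased,
' or "" when the URL contains no "://" (for example a relative link).
Function GetHost(strURL)
    Dim intStart, intEnd
    GetHost = ""
    intStart = InStr(1, strURL, "://", vbTextCompare)
    If intStart = 0 Then Exit Function
    intStart = intStart + 3                      ' first character of the host
    intEnd = InStr(intStart, strURL, "/")        ' first slash after the host
    If intEnd = 0 Then intEnd = Len(strURL) + 1  ' URL ends right after the host
    GetHost = LCase(Mid(strURL, intStart, intEnd - intStart))
End Function

' True when strLink points at the same site as strPageURL.
' Relative links have no host of their own, so they count as same-site.
Function IsSameSite(strPageURL, strLink)
    Dim strLinkHost
    strLinkHost = GetHost(strLink)
    IsSameSite = (strLinkHost = "" Or strLinkHost = GetHost(strPageURL))
End Function
```

A spider that should only leave the current site would follow a link when IsSameSite returns False and skip it otherwise.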
A specific spider usually requires you to have some knowledge of the page
beforehand, such as table layout. For instance, if you're looking for news
headlines on a page, then you should know what HTML tags delimit the
headlines, so you only search the right portion of the page. In this case,
it is usually not important to spider each link on the page, especially
since your spider might not work on different pages.
There are also two times at which you can run a spider: beforehand, or in
real time. Doing it beforehand means that any information you collect while
your spider is running is stored in a database, for access later. You
obviously won't have the most recent data, but if you run the spider often
enough, it won't matter.
Doing it in real time means that you don't store any information - you run
the spider every time you need it. For instance, if you had a search
function on your web site, spidering in real time would mean that whenever
a user enters a search term and presses submit, you would run the spider,
versus simply querying a database of items created beforehand. While this
will ensure that you always have the latest data, this option is usually
not preferred because of the time required to spider and return anything
of value. Use this option only when the material you are spidering is very
time sensitive.
From an ASP?
So how can you implement a spider from an Active Server Page? With the
magic of the Internet Transfer Control (ITC). This control, provided by
Microsoft, allows you to access internet resources from an ASP (check
here for a good reference). You can create this object in an ASP, and use
it to grab web pages, access FTP servers, and even submit POST requests.
(Note: for this article, we will only be focusing on the first capability
listed here.)
There are a few drawbacks, however. For one thing, Active Server Pages are
not allowed to access the Windows registry, which means that certain
constants and values that the ITC normally stores there will not be
available. Normally, you can get around this issue by not allowing the
ITC to use default values - specify the values every time.
Another, more serious, problem involves licensing issues. ASPs do not have
the ability to invoke the license manager (a feature of Windows that makes
sure components and controls are being used legally). The license manager
checks the key in the actual component, and compares it to the one in the
Windows registry. If they're not the same, the component won't work.
Therefore, if you decide to deploy your ITC to another computer that
doesn't have the necessary key, it breaks. A way around this is to bundle
up the ITC in another VB component that basically duplicates the ITC's
methods and properties, and then deploy that. It's a horrible pain, but
unfortunately must be done. Read this MSDN article for more info.
Show me some examples!
You can create and set up the ITC with the following code:
set Inet1 = CreateObject("InetCtls.Inet")
Inet1.protocol = 4 'HTTP
Inet1.accesstype = 1 'Direct connection to internet
Inet1.requesttimeout = 60 'in seconds
Inet1.URL = strURL
strHTML = Inet1.OpenURL 'grab HTML page
strHTML now holds the entire HTML content of the page specified by strURL.
To create a general spider, you can now do a simple call to an instr()
function to determine if the string you're looking for is there. You can
also look for href attributes, parse out the actual URL and set it to the
URL property of the internet control, and open up another page. The best
way to look through all the links this way would be to use recursion (see
this article for a lesson on recursion).
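The steps just described might look something like the sketch below. FetchPage simply wraps the ITC setup code shown earlier; ExtractLinks and Spider are names invented for this example. The sketch assumes links are written as href="..." with double quotes, only follows absolute links (relative ones would need the site address added first), and uses a depth limit plus a dictionary of visited pages so the recursion cannot run forever.

```vbscript
' Wraps the ITC code from above and returns the HTML of strURL.
Function FetchPage(strURL)
    Dim Inet1
    Set Inet1 = CreateObject("InetCtls.Inet")
    Inet1.Protocol = 4            ' HTTP
    Inet1.AccessType = 1          ' direct connection to internet
    Inet1.RequestTimeout = 60     ' in seconds
    Inet1.URL = strURL
    FetchPage = Inet1.OpenURL
    Set Inet1 = Nothing
End Function

' Collects the values of double-quoted href attributes in strHTML.
' Returns a Scripting.Dictionary whose keys are the distinct URLs found.
Function ExtractLinks(strHTML)
    Dim dictLinks, intPos, intEnd, strLink
    Set dictLinks = CreateObject("Scripting.Dictionary")
    intPos = InStr(1, strHTML, "href=""", vbTextCompare)
    Do While intPos > 0
        intPos = intPos + Len("href=""")          ' first character of the URL
        intEnd = InStr(intPos, strHTML, """")     ' closing quote
        If intEnd = 0 Then Exit Do                ' unterminated attribute: stop
        strLink = Mid(strHTML, intPos, intEnd - intPos)
        If Len(strLink) > 0 And Not dictLinks.Exists(strLink) Then
            dictLinks.Add strLink, True
        End If
        intPos = InStr(intEnd + 1, strHTML, "href=""", vbTextCompare)
    Loop
    Set ExtractLinks = dictLinks
End Function

' Visits strURL, reports it if it contains strSearch, then recurses into
' its links until intDepth runs out. dictVisited prevents repeat visits.
Sub Spider(strURL, strSearch, intDepth, dictVisited)
    Dim strHTML, dictLinks, strLink
    If intDepth < 0 Or dictVisited.Exists(strURL) Then Exit Sub
    dictVisited.Add strURL, True
    strHTML = FetchPage(strURL)
    If InStr(1, strHTML, strSearch, vbTextCompare) > 0 Then
        Response.Write strURL & "<br>"
    End If
    Set dictLinks = ExtractLinks(strHTML)
    For Each strLink In dictLinks.Keys
        ' Only absolute links are followed in this sketch.
        If InStr(strLink, "://") > 0 Then
            Spider strLink, strSearch, intDepth - 1, dictVisited
        End If
    Next
End Sub
```

To start it, create an empty Scripting.Dictionary and call, for instance, Spider strURL, "headlines", 2, dictVisited.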
Note, however, that while this method is pretty easy to implement, it is
not very accurate or robust. Many search engines out there today perform
additional logic checks, such as the number of times a phrase appears in
a page, the proximity of related words, and some even claim to judge the
context of the search phrase. I'll leave these to you as you explore
spiders. For more info on detailed searches, here's a good article on
creating a spider program to rival "Ask Jeeves."
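The simplest of those extra checks, counting how often a phrase appears in a page, could be sketched as follows. CountOccurrences is a name chosen for this example; it counts case-insensitive, non-overlapping matches and makes no attempt at proximity or context.

```vbscript
' Returns how many times strPhrase occurs in strText
' (case-insensitive, non-overlapping matches).
Function CountOccurrences(strText, strPhrase)
    Dim intPos, intCount
    intCount = 0
    If Len(strPhrase) > 0 Then
        intPos = InStr(1, strText, strPhrase, vbTextCompare)
        Do While intPos > 0
            intCount = intCount + 1
            ' Continue searching just past the match we found.
            intPos = InStr(intPos + Len(strPhrase), strText, strPhrase, vbTextCompare)
        Loop
    End If
    CountOccurrences = intCount
End Function
```

A spider could then rank pages by calling CountOccurrences(strHTML, strSearch) and listing those with the highest counts first.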
A specific spider is a bit more complicated. As we mentioned earlier, a
specific spider will grab a certain portion of a page, and that requires
knowing ahead of time which portion. For instance, let's look at the
following HTML page:
<HTML>
<HEAD>
<TITLE>My News Page</TITLE>
<META Name="keywords" Content="News, headlines">
<META Name="description" Content="The current news headlines.">
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#FF3300"
VLINK="#CC0000" ALINK="#0000FF">
<p><h3>Headlines</h3></p>
<!--put headlines here-->
<a href="/news/8094.asp">Stock prices fall</a>
<a href="/news/8095.asp">New movies today</a>
<a href="/news/8096.asp">Bush and Gore to debate tonight</a>
<a href="/news/8097.asp">Fall TV lineup</a>
<!--end headlines-->
</BODY>
</HTML>
In this page, you really only care about the stuff between the "put
headlines here" and "end headlines" comment tags. You could
build a function that would return only this section:
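Function GetText(strText, strStartTag, strEndTag)
    ' Returns the text between strStartTag and strEndTag (case-insensitive),
    ' or an empty string if either delimiter is missing.
    Dim intStart, intEnd
    GetText = ""
    intStart = InStr(1, strText, strStartTag, vbTextCompare)
    If intStart > 0 Then
        intStart = intStart + Len(strStartTag)   ' first character after the start tag
        intEnd = InStr(intStart, strText, strEndTag, vbTextCompare)
        If intEnd > 0 Then
            GetText = Mid(strText, intStart, intEnd - intStart)
        End If
    End If
End Function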
Using the ITC example above, you would simply pass in strHTML,
"<!--put headlines here-->", and "<!--end headlines-->" as parameters
to the GetText function.
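Once you have that section, you will usually want to split it into individual headlines. The sketch below is one possible way to do that for the sample page above. ParseHeadlines is a name invented for this example, and it assumes every headline is written exactly as <a href="...">title</a> with double quotes; GetText and strHTML are the ones from this article.

```vbscript
' Parses <a href="...">title</a> entries out of strBlock.
' Returns a Scripting.Dictionary mapping each link to its headline text.
Function ParseHeadlines(strBlock)
    Dim dictHeadlines, intPos, intQuote, intClose, intEndTag
    Dim strLink, strTitle
    Set dictHeadlines = CreateObject("Scripting.Dictionary")
    intPos = InStr(1, strBlock, "<a href=""", vbTextCompare)
    Do While intPos > 0
        intPos = intPos + Len("<a href=""")          ' first character of the URL
        intQuote = InStr(intPos, strBlock, """")     ' quote that ends the URL
        If intQuote = 0 Then Exit Do
        intClose = InStr(intQuote, strBlock, ">")    ' end of the opening tag
        If intClose = 0 Then Exit Do
        intEndTag = InStr(intClose, strBlock, "</a>", vbTextCompare)
        If intEndTag = 0 Then Exit Do
        strLink = Mid(strBlock, intPos, intQuote - intPos)
        strTitle = Trim(Mid(strBlock, intClose + 1, intEndTag - intClose - 1))
        dictHeadlines(strLink) = strTitle
        intPos = InStr(intEndTag, strBlock, "<a href=""", vbTextCompare)
    Loop
    Set ParseHeadlines = dictHeadlines
End Function

' Usage: extract the headline section, then list each headline with its link.
Dim dictHeadlines, strLink
Set dictHeadlines = ParseHeadlines( _
    GetText(strHTML, "<!--put headlines here-->", "<!--end headlines-->"))
For Each strLink In dictHeadlines.Keys
    Response.Write Server.HTMLEncode(dictHeadlines(strLink)) & " (" & strLink & ")<br>"
Next
```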
Note that the start and end tags do not have to be actual HTML tags - they
can be any text delimiter you wish. Often, you won't find nice
HTML tags to delimit the sections you're looking for. You'll have to use
whatever is available - for instance, your start and end tags could
look like:
strStartTag = "/td><td><font face=""arial"" size=""2""><p><b><u>"
strEndTag = "<p></td></tr><tr><td><o:ums>"
Make sure to find something unique in the HTML page so that you extract
exactly what you need. You can also follow the links in the portion of
text you return, but beware that if you don't know the format of those
pages, your spider could return nothing.
Storing the info
In most cases, you're going to want to store the information that you
collect in a database for easy access later. Your needs here may vary
widely, but here are a few things to keep in mind:
- Check for the latest information in your database
  If you run this spider often to check a site for new headlines,
  make sure that you take note of the newest headline that is already
  in your database. Then compare that to what the spider returns, and
  only add the new ones. That way, you won't end up having a lot of
  duplicate data in your database.
- Update information
  You may not want to add new information to your database at all.
  For instance, if you are maintaining an online index of US state
  populations, then you'll only want to update the information in
  your database - there will never be a need to insert new
  information in the table (until we get a new state, that is).
- Store everything you need, and build what you don't have
  For instance, if you spider headlines, make sure you also grab
  the links that the headlines point to, and store those in your
  database. If no usable link is supplied, you may need to build
  one. For example, say I'm spidering headlines from www.yoursite.com
  to display on www.mysite.com. If a headline links to a story that
  resides on your web site, I will have to put
  http://www.yoursite.com in front of whatever link is on your
  server before storing it in my database, so that the links work
  correctly.
| A link on www.yoursite.com... | On www.mysite.com turns into...                  |
| /stories/news/980345.html     | http://www.yoursite.com/stories/news/980345.html |
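Two of the points above, adding only new items and building complete links, can be sketched without committing to a particular database. MakeAbsolute and NewHeadlinesOnly are names made up for this example. The sketch assumes the spidered headlines are held in a Scripting.Dictionary keyed by link, and that dictKnown holds the links already stored; loading dictKnown from your database is left out. Note that links without a leading slash are treated as relative to the site root, which is a simplification that is only right for pages at the top level of the site.

```vbscript
' Prefixes strBaseURL (for example "http://www.yoursite.com") to a
' root-relative link such as "/stories/news/980345.html".
' Links that already contain "://" are returned unchanged.
Function MakeAbsolute(strBaseURL, strLink)
    If InStr(strLink, "://") > 0 Then
        MakeAbsolute = strLink
    ElseIf Left(strLink, 1) = "/" Then
        MakeAbsolute = strBaseURL & strLink
    Else
        ' Simplification: assume the link is relative to the site root.
        MakeAbsolute = strBaseURL & "/" & strLink
    End If
End Function

' Returns a dictionary holding only those entries of dictSpidered whose
' absolute link is not yet a key of dictKnown (the links already stored).
Function NewHeadlinesOnly(dictSpidered, dictKnown, strBaseURL)
    Dim dictNew, strLink, strFullLink
    Set dictNew = CreateObject("Scripting.Dictionary")
    For Each strLink In dictSpidered.Keys
        strFullLink = MakeAbsolute(strBaseURL, strLink)
        If Not dictKnown.Exists(strFullLink) And Not dictNew.Exists(strFullLink) Then
            dictNew.Add strFullLink, dictSpidered(strLink)
        End If
    Next
    Set NewHeadlinesOnly = dictNew
End Function
```

Only the entries returned by NewHeadlinesOnly would then need to be inserted into the database.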
Conclusion
This article should give you a very good idea about how to build a more
complete spider. All of the basic functionality is laid out here; all you
have to do is add the bells and whistles.
This type of application begs to be placed in a COM object or in a separate
application by itself. Placing this functionality in an ASP would be very
convenient, but you would gain speed and security benefits by moving your
code elsewhere (not to mention the fact that it would be easier to
package and sell).
Happy scripting!