A Review of the IRIS DMC Web Search Engine
In the past few years, the IRIS DMC World Wide Web presence has grown appreciably. Data access services, online manuals, and earthquake references abound, making it increasingly difficult to guide visitors to the information they are looking for. The sensible solution has been to add a search feature to the web site to help users quickly and easily find web pages of interest.
About a week of effort back in 1997 produced a web crawler engine that scans web pages and keeps a "dictionary" of those pages for quick reference. A web crawler is an automated program that follows links to Web pages, in the same way that a user would click on links in a Web browser to view other pages. The name given to the IRIS DMC web crawler is "WARP." In its four years of operation, the service has functioned reliably and is virtually unchanged since its inception.
On a regular basis, the WARP web crawler starts its scan with the IRIS Home Page, notes down all of the words in the document, and then collects all of the links on that page. Each of those links is then examined one by one, and all of their contents and links are recorded as well. This search pattern repeats over an ever-growing collection of links and web pages, much like a squirrel running up each branch of a tree, and each page that is accessed is added to the dictionary. It is this dictionary that you reference when you use the search engine on the IRIS Home Page.
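The crawl described above can be sketched in a few lines. This is only an illustration and not WARP's actual code: the in-memory `PAGES` table, the `crawl` function, and the choice to visit links in first-found order are all assumptions made for the example, and a real crawler would fetch pages over HTTP and parse links out of the HTML.

```python
from collections import deque
import re

# Hypothetical in-memory "web site": URL -> (page text, outgoing links).
PAGES = {
    "/": ("IRIS home page seismology data", ["/manuals", "/quakes"]),
    "/manuals": ("online manuals for data access", ["/", "/quakes"]),
    "/quakes": ("earthquake references", ["/manuals"]),
}

def crawl(start, fetch):
    """Visit every page reachable from `start` and build a word dictionary."""
    dictionary = {}          # word -> set of URLs whose text contains it
    visited = set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue         # each page is recorded only once
        visited.add(url)
        page = fetch(url)
        if page is None:
            continue         # dead link: nothing to record
        text, links = page
        # Note down every word on the page.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            dictionary.setdefault(word, set()).add(url)
        # Collect the page's links so they are examined in turn.
        queue.extend(link for link in links if link not in visited)
    return dictionary

index = crawl("/", PAGES.get)
```

Looking up `index["data"]` then returns the set of pages that mention that word, which is the kind of lookup a search against the dictionary performs.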
So, what stops WARP from growing and growing until it searches the whole Internet? The answer lies in a special reference that each web server must include in order to be searched. Therefore, when we provide a link to a university, news center, or commercial site on one of our web pages, those organizations don't have to worry about having their entire site searched. Special arrangements have been made with institutions such as the Federation of Digital Seismic Networks and the Albuquerque Seismic Laboratory in New Mexico, so those remote sites are indexed regularly and are searchable from the IRIS Home Page. It's what you could call a 'neighborhood' web crawler.
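One way to picture this "neighborhood" limit is a check that the crawler runs before following a link. The article does not say what form the special reference takes, so the sketch below simply assumes the crawler ends up with a set of host names that have opted in; `OPTED_IN_HOSTS`, `in_neighborhood`, and the example host names are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical opt-in list. The real mechanism and host names are not
# given in the article; these are placeholders for illustration.
OPTED_IN_HOSTS = {"www.example-home.org", "partner.example.org"}

def in_neighborhood(url, allowed_hosts=OPTED_IN_HOSTS):
    """Return True only if the link points at a server that opted in."""
    host = urlparse(url).hostname   # None for relative or malformed URLs
    return host is not None and host in allowed_hosts

links = [
    "http://partner.example.org/stations.html",
    "http://news.example.com/story.html",
]
to_follow = [link for link in links if in_neighborhood(link)]
```

Links that fail the check can still appear on a page as ordinary links; the crawler just does not follow them or index what lies behind them.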
In order to help visitors make effective use of the IRIS search feature, two concepts will be illustrated here. The first is how to enter an effective search query. The second is how to interpret the results. Unlike other search pages that you might be familiar with, the IRIS search page does not use logical glue words like AND and OR, and it does not use special characters or quotes to add meaning to the query. Things are kept simple: you enter a list of words relevant to your topic of interest, separating them with spaces. The search is case-insensitive and ignores punctuation.
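The query rules above amount to a very small normalization step. The following is a minimal sketch under stated assumptions: `parse_query` is a hypothetical name, and treating each punctuation mark as a space (rather than deleting it outright) is a guess about behavior the article does not specify.

```python
import re

def parse_query(query):
    """Lowercase the query, drop punctuation, and split on whitespace."""
    # Anything that is not a letter, digit, underscore, or space
    # is replaced with a space before splitting.
    cleaned = re.sub(r"[^\w\s]", " ", query.lower())
    return cleaned.split()

keywords = parse_query('"Loma Prieta" AND earthquake!')
# keywords == ['loma', 'prieta', 'and', 'earthquake']
```

Note that under these rules a word like AND is not an operator; it is just one more keyword, and a common one that is unlikely to narrow the results.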
To make the best query, try listing a few unique words related to the subject of interest. Using common words will give you matches to a lot of pages you wouldn't be interested in. Many times, one or two keywords are sufficient to give a good match, but if the results you get are not what you wanted, try entering different keywords, adding more keywords specific to your search, or entering them in a different order. Another tip is to enter a short phrase such as 'Loma Prieta earthquake', since the search engine will match strongly to such ordered word patterns. When the results come back, the 50 most relevant matches are shown on the screen. This degree of relevance is determined by a scoring system that assigns 'points' according to how well each web page matches the search query. The score appears as a number in parentheses at the end of each result listing, and can be used as a rule of thumb to judge how well a given page matched your query.
The score is built from a set of test conditions applied to each web page, with the results added together. Some test conditions can add a large amount to the total, while others result in just small gains. The overall score should be a good reflection of how appropriate the web page is to your search parameters. To illustrate this further, the scoring conditions are as follows:
| SEARCH QUERY | ADDED SCORE |
| fits to text in the URL address | 50 per keyword |
| fits to phrase in page title bar | 20 times number of words in phrase |
| fits to phrase in first few sentences of web page text | 10 times number of words in phrase |
| matches word found in web page | equal to number of occurrences of word in page up to a max of 10 |
| partially matches word in web page | 1 |
| matches more than one word | double the added word match score |
By looking at these scoring values, you can see that fairly good matches start at a score of 20 or greater, and even better candidates appear at 50 or greater. The best results come from matches to the title of a page or to its first few sentences, which generally best reflect the topic of a web page. Having the score values as a guide will allow you to navigate the search results intelligently and choose the best route to what you are looking for.
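To make the arithmetic concrete, here is a rough sketch of how the scoring conditions in the table might be combined. It is an interpretation, not WARP's implementation. Several details are assumptions: the "phrase" is taken to be the whole query appearing as consecutive words, a "partial match" is taken to mean the keyword appears inside a longer word, and the doubling is applied to the word-match subtotal when more than one keyword is found. The names `score_page` and `top_matches` and the example page are hypothetical.

```python
import re

def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(haystack, phrase):
    """True if the list `phrase` appears as consecutive items in `haystack`."""
    n = len(phrase)
    return n > 0 and any(
        haystack[i:i + n] == phrase for i in range(len(haystack) - n + 1)
    )

def score_page(keywords, url, title, lead, body):
    score = 50 * sum(1 for k in keywords if k in url.lower())  # URL: 50 per keyword
    if contains_phrase(words(title), keywords):
        score += 20 * len(keywords)                            # title phrase
    if contains_phrase(words(lead), keywords):
        score += 10 * len(keywords)                            # first sentences
    body_words = words(body)
    word_score, matched = 0, 0
    for k in keywords:
        count = body_words.count(k)
        if count:
            word_score += min(count, 10)                       # occurrences, max 10
            matched += 1
        elif any(k in w for w in body_words):
            word_score += 1                                    # partial match
    if matched > 1:
        word_score *= 2                                        # more than one word
    return score + word_score

def top_matches(keywords, pages, limit=50):
    """Score (url, title, lead, body) tuples and keep the best `limit`."""
    scored = [(score_page(keywords, *page), page[0]) for page in pages]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]

example = (
    "http://www.example.org/quakes/loma.html",
    "The Loma Prieta Earthquake",
    "A summary of the 1989 event.",
    "A summary of the 1989 event. Loma Prieta shook the Bay Area. "
    "Loma Prieta aftershocks continued.",
)
total = score_page(["loma", "prieta"], *example)
```

Under these assumptions the example page scores 98: 50 for "loma" in the URL, 40 for the two-word phrase in the title, and 8 for the word matches (two occurrences of each keyword, doubled because both keywords were found).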
The WARP search engine at the IRIS DMC is intended as a service to our users, to make their visits to our Web site and our affiliated Web sites as productive and convenient as possible. It is our hope that you find the search feature useful. If you find problems you wish to discuss with us, please drop us a line at
.
Submitted by Rob Casey