Evaluation
of Web-Based Search Engines Using User-Effort Measures
Muh-Chyun Tang and Ying Sun
4 Huntington St.
School of Information, Communication and Library Studies
Rutgers University,
New Brunswick, NJ 08901,
U.S.A.
muhchyun@scils.rutgers.edu
Ysun@scils.rutgers.edu
Abstract
This paper presents
a study of the applicability of three user-effort-sensitive evaluation
measures—“first 20 full precision,” “search
length,” and “rank correlation”—on four Web-based
search engines (Google, AltaVista, Excite and Metacrawler). The authors
argue that these measures are better alternatives than precision and
recall in Web search situations because of their emphasis on the quality
of ranking. Eight sets of search topics were collected from four Ph.D.
students in four different disciplines (biochemistry, industrial engineering,
economics, and urban planning). Each participant was asked to provide
two topics along with the corresponding query terms. Their relevance
and credibility judgment of the Web pages were then used to compare
the performance of the search engines using these three measures. The
results show consistency among these three ranking evaluation measures,
more so between “first 20 full precision” and search length
than between rank correlation and the other two measures. Possible reasons
for rank correlation’s disagreement with the other two measures
are discussed. Possible future research to improve these measures is
also addressed.
Introduction
The explosive growth of information on the World Wide Web poses a challenge
to traditional information retrieval (IR) research. Other than the sheer
amount of information, some structural factors make searching for relevant
and quality information on the Web a formidable task. The freewheeling
nature of publishing on the Web is a blessing for the flow of ideas,
but it has also complicated the process of retrieving relevant information.
In contrast to traditional IR, there are no consistent indexing and
classification principles for organizing materials on the Web. Nor are
there any filtering practices at hand to ensure the quality and credibility
of the documents. Furthermore, certain features of Web search situations
also distinguish the Web from the traditional IR setting. It has been
shown that ordinary Web searchers tend to give little input (Jasen et
al. 1998) and are very sensitive to the time and effort put into the
search (Silverstein et al. 1998). The issues of credibility and user
efforts peculiar to the Web search environment are not addressed properly
by traditional precision and recall measures. Several measures that
focus on user efforts have been proposed, yet there has been little
investigation of their validity.
Literature Review
Several studies have explored the applicability of traditional IR evaluation
criteria, i.e., precision and recall, on search engine performance (Chu
and Rosenthal 1996, Leighton and Srivastava 1997, Clarke and Willett
1997, Wishard 1998). Chu and Rosenthal studied the precision of ten
queries on three search engines (AltaVista, Excite, and Lycos). Instead
of a binary measure of relevance (relevant/non-relevant), they adopted
a three-point scale to distinguish among relevant, partially relevant,
and non-relevant documents. Clarke and Willett also used a three-point
scale in assigning relevance scores, with a slight modification: pages
that were considered irrelevant in themselves but that led to relevant
pages were judged partially relevant. Clarke and Willett provided by
far the most feasible method for measuring recall on the Web. Previously,
in the absence of a predefined set of relevant documents, it had been
very difficult to assess recall on the Web. Clarke and Willett constructed
a relative recall measure by using the merged outputs of all three search
engines tested as the pool of relevant documents.
Some characteristics
of Web searching, however, require performance criteria other than the
precision and recall measures developed in traditional IR. The enormous
amount of information and the wide variety of sources on the Web seem
to make quality of ranking a much more important dimension in assessing
search engine performance since users in general spend less time and
effort to sort through the retrieved pages. This is supported by studies
of users' searching behaviors on the Web. Silverstein et al. found that
about 85 percent of users look only at the first screen with results
(Silverstein et al. 1998). Su, Chen, and Dong called for a more user-centered
evaluation framework in Web searching environments (Su et al. 1998).
They applied five criteria—relevance, efficiency, utility, user
satisfaction, and connectivity—to evaluate the performance of
four search engines (AltaVista, Infoseek, Lycos, and Open Text). Furthermore,
instead of submitting simple text queries as in most search engine evaluations,
they used real user search strategies and judgment in the searching
and evaluating process.
In contrast to traditional
IR searchers, the majority of Web users are laypersons who are more
sensitive to time and effort spent on finding information. The ability
to optimize search order thus becomes an even more salient dimension
of search engine performance. The notion of Expected Search Length (ESL),
first proposed by Cooper (1968) some 30 years ago, seems to be an ideal
notion to test how well a search engine is able to deliver the most
relevant documents at the top of retrieved sets (Agata et al. 1997;
Su, Chen and Dong 1998; Oppenheim et al. 2000, Chignell et al. 1999).
According to Cooper, the primary function of a retrieval system is to
save users as much labor as possible in the search for relevant documents
by perusing and discarding irrelevant ones. We tested several measures
that emphasize user efforts: first 20 “full” precision,
search length, and rank correlation with the view of investigating their
applicability and validity. We were interested in seeing how these three
measures correlated with one another. The consistency or the lack of
it among these measures would be an indicator of each measure’s
validity in reflecting a user’s effort and, therefore, search
engines’ performance.
Research
Procedures
Eight sets of topics were collected from four Ph.D. students in four
different disciplines (biochemistry, industrial engineering, economics
and urban planning). Each participant provided two topics along with
the query terms s/he considered suitable for submission to a search
engine.
Table 1 shows one
of the query topics used in the study.
Subject Domain:
Economics
Statement 1: Find information about non-parametric estimation of
factor analysis in term structure models and a test for arbitrage.
Query: functional data analysis |
Table 1. Topic (See
Appendix 1 for entire collection of statements and queries)
The search engines
selected for comparison were Google, AltaVista, Excite, and Metacrawler.
The four search engines were selected mainly due to their popularity.
We were also interested in seeing whether certain ranking techniques
would lead to better results using our measures. Google was selected
because of its growing popularity and its incorporation of citing behaviors
on the Web. The only meta-search engine, Metacrawler, was selected for
its claim to provide a ranking algorithm of the composite results. Its
power-search feature allows the user to select up to 11 major search
engines and has the option to sort results by relevance, by source search
engine, or by originating site. During our run on Metacrawler, the other
three search engines tested were selected as the input sources for the
power-search feature, and the results were sorted according to relevance.
It should be noted that the main purpose of this study is to test the
applicability of user-effort measure rather than to compare search engine
performance. An experiment was conducted to compare the performance
of four search engines (AltaVista, Excite, Google, and Metacrawler)
in academic contexts. Since different search engines have different
capabilities, to ensure comparability, we decided to adopt a minimalist
approach, using only simple, unstructured query terms in lower-case
characters. No phrases or Boolean symbols were included. We believe
that this also reflect the majority of Web search engine user’s
search behaviors.
The authors collected
queries from the real users and submitted them on their behalf to the
search engines within a two-hour time frame on December 5, 2000. All
searches were conducted on computers with the same properties and located
in the same LAN in order to avoid the effects of differences in computer
performance and network speed. We decided to collect only the top 20
links among the thousands retrieved in light of previous studies showing
that 80 percent of users view only the first two pages of results (Jansen
et al. 1998). Several representation issues also need to be addressed
before returned links can be present to the subjects. It was recognized
that varying representation used by the search engines might factor
in the subjects' judgment. Furthermore, the order in which returned
links were presented might also influence subjects' judgment since they
might develop different relevance criteria during the course of examining
the Web pages. To avoid sequence and user preference bias, the returned
hits from all four search engines for each query were mixed together
and stripped of all graphic cues. We then presented the URLs in Microsoft
Word files that allowed the subject to examine the real page by clicking
on its URL.
Participants were then asked to judge each Web page according to its
relevance and credibility on a five-point scale using "0"
to indicate non-relevance or a lack of credibility and "4"
to indicate high relevance or high credibility. Relevance was defined
as a result that provided information that is considered useful by the
participant for his or her question. The subjects were also told to
judge a source's credibility by its authorship, source of its content,
disclosure, and currency.
Participants were
asked to mark but not to judge duplicate links (those with the same
URLs). We could in this way avoid assigning different scores to the
same page. Duplicate links were examined later by the authors. When
duplicate links were retrieved by the same search engine, the second
document was treated as non-relevant; those from different systems'
result sets were assigned the same values for relevance and credibility.
Broken link ratio
was an indication of how frequently and thoroughly the engine checked
the links in its database for currency. In the analysis of relevance
and credibility, broken links were treated as non-relevant documents
with zero relevance and credibility.
Evaluation Measures
First 20 Full Precision
Precision measures the ratio of relevant documents within the total
set of returned documents. The binary relevance judgment widely adopted
in traditional IR evaluation, however, does not take into account the
different amounts of relevant information contained in each document.
In this regard, Chignell et al. proposed a "full" precision
measure that sums up the total amount of relevant information contained
in the first 20 documents, which seems to reflect better than binary
relevance each search engine's ranking capacity (Chignell, Gwizdka,
and Bodner 1999).
According to Chignell et al., the first 20 "full" precision
is calculated by the following equation:

Where:
• scorei—score assigned to the i-th hit by the judges;
• 20—number of measured hits;
• 4—maximum score that can be assigned to one hit.
User Effort Measure—Search Length i
Cooper's concept of expected search length measures user effort in terms
of the number of non-relevant documents that a user must examine before
finding i relevant documents. Cooper illustrated several scenarios in
which different i (that is, numbers of relevant documents) may be desired
based on the user's need for thoroughness. In our study we decided to
set the desired number at two. The most relevant web page is defined
as documents with a relevance score of three or four. Thus the search
length is operationalized as
"The number of Web pages one has to examine (including relevant
and non-relevant documents) before two documents with relevance score
of 3 or above are found."
Rank Correlation
Su, Chen, and Dong proposed comparing the user's relevance and the system's
relevance ranking in order to measure a search engine's ranking performance.
The measure they proposed involved correlating the rank order assigned
by the search engine and the user's preference. Our rank correlation
was designed to reflect the same notion of evaluating how closely a
system's ranking reflected user preference, but with a slight modification
in procedure. We decided to include all 20 pages that appeared first
instead of only the top five pages as was used in Su et. al. Since we
do not have access to the actual ranking scores, we used a document’s
position within the top 20 returned hits as the approximate ranking
score. The higher a document appeared in the list, the higher the ranking
score presumably assigned by the system. Thus, the top five hits were
given a score of four; the next five hits were given a score of three,
and so forth. The higher the correlation between the ranking scores
and the user judgment scores, the more efficiently the system is able
to relieve user efforts.
Results and Discussion
Overlap of Results
There was low overlap of returned hits among the three non meta-search
engines. Since we customized Metacrawler on the basis of the other three
systems, it is not surprising that most duplicate links occurred in
Metacrawler output. For the topic Urban Planning 1, the Metacrawler
returned hit sets in which 19 of 20 documents had appeared in the other
three search engines' sets. For the topic of Economics 1, the comparable
figure was 15 out of 20. Overall, the average overlap was 11. Notably,
nearly half of the hits returned by Metacrawler were not included among
the top 20 of the other three search engines, which demonstrates the
re-ranking function of Metacrawler. The effectiveness of re-ranking
will be discussed below in the Ranking Results section.
Currency Results
Broken-link ratio can be an indication of how frequently and thoroughly
an engine checks the links in its database for currency. The numbers
and percentage of broken links are shown in Table 2.
| |
AltaVista |
Google |
Excite |
Metacrawler |
| Broken link |
7 (4.38%) |
5 (3.13%) |
10 (6.25%) |
9 (5.63%) |
Table
2. Broken Links
Credibility Results
Across all four disciplines used for the test searches, the mean and
standard deviation of credibility score of each search engine are listed
in Table 3.
| |
AltaVista
M(SD) |
Google
M(SD) |
Excite
M(SD) |
Metacrawler
M(SD) |
| Credibiliy |
2.22(1.38) |
2.68(1.28) |
2.32(1.20) |
2.54(1.28) |
Table
3. Mean and Standard Deviation of Credibility Score
Among all pair relationships, Google was significantly better than AltaVista
and Excite (Google-AltaVista: t (159) = 3.31, p < .01; Google-Excite:
t (159) = 2.79, p < .01). Metacrawler provided more credible hits
than AltaVista but not more than Excite (Metacrawler-AltaVista: t (159)
= 2.31, p < .5; Metacrawler-Excite: t (159) = 1.89, ns). There was
no significant difference between Google and Metacrawler (t (159) =
1.15, ns) nor between Excite and AltaVista (t (159) = .74, ns).
User's Effort Results: Search length 2
As discussed before, user effort was measured by a modified search length
measure, that is, the number of links the user has to go through to
find two relevant documents. The mean and standard deviation for this
performance measure are displayed in Table 4.
| |
AltaVista
M(SD) |
Google
M(SD) |
Excite
M(SD) |
Metacrawler
M(SD) |
| Search length
2 |
10.5(7.71) |
4.25(1.83) |
11.0(7.01) |
9.00(4.78) |
Table 4. Mean and
Standard Deviation of Search Length
In general, at the time this study was carried out, Google significantly
outperformed the other three systems on this measure as one had to go
through fewer pages to find the first two satisfying and relevant documents.
The performances of AltaVista, Excite, and Metacrawler were nearly equal.
Rank Correlation
| |
AltaVista |
Google |
Excite |
Metacrawler |
| Correlations |
.092 |
.131 |
.013 |
-.012 |
Table 5. Overall
System Ranking vs. User Relevance Judgment
Overall, none of the correlations between system-assigned order and
user's judgment score was significant although Google and AltaVista
showed a slightly higher than the others. A further analysis of system
performance of each individual query shows the same pattern. The only
significant correlations occurred in queries submitted to AltaVista
and Google. Google also had only two negative scores while the correlations
of Excite and Metacrawler tended to be near zero or negative.
| |
AltaVista |
Google |
Excite |
Metacrawler |
| Bio1 |
-.25 |
-.20 |
-.22 |
-.07 |
| Bio2 |
-.11 |
.31 |
.15 |
.09 |
| IE1 |
-.19 |
.33 |
-.24 |
.22 |
| IE2 |
.46* |
.06 |
-.20 |
.00 |
| Econ1 |
.55* |
.55* |
-.14 |
-.09 |
| Econ2 |
.19 |
-.07 |
-.28 |
-.09 |
| Urban1 |
.37 |
.00 |
.30 |
-.14 |
| Urban2 |
-.24 |
.11 |
.19 |
-.08 |
| Across all
queries |
.09 |
.13 |
.01 |
-.01 |
Table 6. System
Rank Order vs. User Relevance Judgment (* p<.05)
First 20 “full” Precision Results
| |
AltaVista |
Google |
Excite |
Metacrawler |
| Biology |
.31 |
.34 |
.21 |
.32 |
| Industrial
Engineering |
.23 |
.39 |
.14 |
.21 |
| Economics |
.53 |
.74 |
.55 |
.60 |
| Urban Planning |
.35 |
.66 |
.46 |
.58 |
| Average |
.35 |
.53 |
.33 |
.43 |
Table 7. First
20 “full” Precision
Table 7 shows the
full precision scores for the four search engines in the four respective
domains. Figure 1 shows the profile plots of full precision across systems
and domains. From the figure, one can see that Google performed best
in all four domains although the difference is not necessarily significant.
The varying results among the disciplines suggest that the domain will
affect search engine performance. As Figure 1 makes clear, all search
engines performed much better on the subset including Urban Planning
and Economics than on the other.

One should, however, be cautious about making assertions regarding any
difference in search engine performance based on discipline, considering
that there was only one participant in each discipline. Differing performance
might be attributable to idiosyncrasies of the individual users.
Consistency
| Rank |
First 20 "Full"
Precision |
Search Length
2 |
Rank Correlation |
Credibility |
| 1st |
Google |
Google |
Google |
Google |
| 2nd |
Metacrawler |
Metacrawler |
Alta Vista |
Metacrawler |
| 3rd |
Alta Vista |
Alta Vista |
Excite |
Excite |
| 4th |
Excite |
Excite |
Metacrawler |
Alta Vista |
Table 9. Consistency
— Ranking of search engine performance according to four measures.
Discussion
and Conclusions
In this experiment, Google was found to outperform the other search
engines on the basis of credibility, first 20 full precision, and user
effort. This may be due to Google's incorporation of hyperlink information
in its ranking algorithm. Metacrawler, our sample meta-search engine,
failed to achieve the expected high level of performance.
The results obtained in the experiment showed little overlap in the
documents returned by the different search engines except in the case
of Metacrawler, confirming previous observations by Ding and Marchionini
(1996). Search engines' result sets tend to have relatively low overlap
because they employ different ranking techniques for indexing coverage.
There was no significant effect of interaction between search engines
and subject domains on the first 20 full precision of returned hits
in the experiment. Beyond its overall superior performance, however,
Google seems to be particularly adept at handling natural science and
engineering topics. Further analysis of the returned hits showed that
Google returned more academic articles given high scores for relevance
in biochemistry and industrial engineering while AltaVista and Excite
returned more pages with faculty or departmental information.
The concern of the study, however, was not so much to determine the
best search engine as to test various performance measures and investigate
their theoretical implications. As pointed out earlier, we felt that
the most widely used measures in traditional IR evaluation, namely,
precision and recall, do not seem to be well suited to the Web search
environment. During our literature review, we found that Cooper's notion
of search length provided an appropriate evaluative framework for the
Web environment, in which users' search behaviors suggest that ranking
capacity is crucial. Our three ranking measures can all be properly
explained by Cooper's principle of optimizing search order, that is,
an ideal search engine should be able to deliver most relevant documents
higher in the rank list. We were able to demonstrate in our findings
a certain degree of consistency in these ranking measures and in credibility
and currency measures (see Table 9). More important, the consistency
also presents among three ranking evaluation measures, more so between
first 20 full precision and search length 2 and to a less degree between
rank correlation and the other two measures. We suspect the lower level
of consistency between rank correlation and the other two measures is
the result of the imprecise representation of system-assigned scores.
Our transformation of system assigned scores (1-20) into the same scale
as user judgment score (0-4) might not correctly capture the real ranking
scores assigned by the systems. An accurate assessment of the applicability
of the rank correlation measure is more likely when using actual ranking
scores assigned by the search engines.
Whereas Cooper discussed different scenarios of search thoroughness,
in our study the measure of search length 2 only took into account one
specific situation in which the thoroughness of the search was limited.
We feel this is proper considering that the evidence shows that average
Web users are sensitive to the effort they have to expend when using
search engines. This may not be the case when thoroughness of search
is desired. Further studies are needed to explore the tradeoff between
user effort and information acquired when different degrees of thoroughness
of search are desired.
References
Agata, T., Nozue, M., Hattori, N., & Ueda, S. (1997). A measure
for evaluating search engines on the World Wide Web: Retrieval test
with ESL. Library and Information, 37, 1-11.
Chignell, M. H., Gwizdka, J., & Bodner, R. C. (1999). Discriminating
meta-search: A framework for evaluation. Information processing
and management, 35(3), 337-362.
Chu, H., & Rosenthal, A.U. (1996). Search engines for the world
wide web: A comparative study and evaluation methodology. Proceedings
of the 59th Annual Meeting of the American Society for Information Science,
Baltimore, M.D., 127-135.
Clark, S. J., & Willett, P. (1997). Estimating the recall performance
of Web search engines. Aslib Proceedings, 49 (7). 184-189.
Cooper, W. S. (1968).
Expected search length: A single measure of retrieval effectiveness
based on the weak ordering action of retrieval systems. Journal
of American Society of Information Science, 19(1), 30-41.
Ding, W., & Marchionini, G. (1996). A comparative study of web search
service performance. Proceedings of the 59th Annual Meeting of the
American Society for Information Science, Baltimore, M.D., 136-142.
Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998). Real
life information retrieval: A study of user queries on the Web. SIGIR
Forum, 32(1), 5-17.
Leighton, H. V., & Srivastava, J. (1997). Precision among World
Wide Web search services (search engines): Alta Vista, Excite, Hotbot,
Infoseek, Lycos. [online ] available at www.winona.edu/library/webind2/webind2.html
Oppenheim, C., Morris, A., & McKnight, C. (2000). The evaluation
of WWW search engines. Journal of Documentation, 56 (1), 71-90.
Silverstein, C., Henzinger, M., Marais, J. & Moricz, M. (1998).
Analysis of a very large Alta Vista query log. Technical Report
1998-014, COMPAQ Systems Research Center, Palo Alto, Ca, USA, 1998.
Su. L. T., Chen, H. L., & Dong, X. Y. (1998). Evaluation of Web-based
search engines from an end-user's perspective: A pilot study. Proceedings
of the 61st Annual Meeting of the American Society for Information Science,
Pittsburgh, PA., 348-361.
Wishard, L. (1998). Precision among Internet search engines: An earth
sciences case study. Issues in Science and Technology Librarianship
1998.
[online] available at http://www.library.ucsb.edu/istl/98-spring
/articles5.html