Score Aggregation Techniques in Retrieval Experimentation
Ravana, S.D. and Moffat, A.
Comparative evaluations of information retrieval systems
are based on a number of key premises, including that
representative topic sets can be created, that suitable relevance
judgements can be generated, and that systems
can be sensibly compared based on their aggregate performance
over the selected topic set. This paper considers
the role of the third of these assumptions - that the performance
of a system on a set of topics can be represented by
a single overall performance score such as the average, or
some other central statistic. In particular, we experiment
with score aggregation techniques including the arithmetic
mean, the geometric mean, the harmonic mean, and the
median. Using past TREC runs we show that an adjusted
geometric mean provides more consistent system rankings
than the arithmetic mean when a significant fraction of the
individual topic scores are close to zero, and that score
standardization (Webber et al., SIGIR 2008) achieves the
same outcome in a more consistent manner.
Cite as: Ravana, S.D. and Moffat, A. (2009). Score Aggregation Techniques in Retrieval Experimentation. In Proc. Twentieth Australasian Database Conference (ADC 2009), Wellington, New Zealand. CRPIT, 92. Bouguettaya, A. and Lin, X., Eds. ACS. 59-67.
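
As an illustration of the aggregation techniques the abstract compares, the following Python sketch computes the arithmetic, geometric, and harmonic means, the median, and a GMAP-style adjusted geometric mean over a set of per-topic scores, plus a standardization step in the spirit of Webber et al. (SIGIR 2008). The smoothing constant eps, the function names, and the example scores are illustrative assumptions, not values taken from the paper.

import math
import statistics

EPS = 1e-5  # assumed GMAP-style smoothing constant; the paper's value may differ

def aggregate_scores(scores, eps=EPS):
    """Aggregate per-topic effectiveness scores with several central statistics."""
    n = len(scores)
    arithmetic = sum(scores) / n
    # The plain geometric mean collapses to zero if any topic score is zero.
    geometric = math.prod(scores) ** (1.0 / n)
    # The harmonic mean is undefined when any score is zero.
    harmonic = n / sum(1.0 / s for s in scores) if all(s > 0 for s in scores) else 0.0
    # Adjusted geometric mean: smooth each score by eps before taking logs,
    # then subtract eps, so near-zero topics no longer annihilate the mean.
    adjusted = math.exp(sum(math.log(s + eps) for s in scores) / n) - eps
    return {
        "arithmetic": arithmetic,
        "geometric": geometric,
        "harmonic": harmonic,
        "median": statistics.median(scores),
        "adjusted geometric": adjusted,
    }

def standardize(score, topic_mean, topic_std):
    """Sketch of per-topic score standardization (cf. Webber et al.,
    SIGIR 2008): z-score a run's topic score against reference-run
    statistics for that topic, then map into (0, 1) via the standard
    normal CDF. The reference statistics are assumed to be supplied
    by the caller."""
    z = (score - topic_mean) / topic_std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example: two near-zero topic scores drag the plain geometric mean
# toward zero, while the arithmetic mean is dominated by the two
# strong topics; the adjusted geometric mean sits between the two.
print(aggregate_scores([0.72, 0.65, 0.0004, 0.0001]))

This separation of behaviours on near-zero topics is the effect the paper's ranking-consistency experiments measure.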