Conferences in Research and Practice in Information Technology
  

Online Version - Last Updated - 20 Jan 2012

 

 
Home
 

 
Procedures and Resources for Authors

 
Information and Resources for Volume Editors
 

 
Orders and Subscriptions
 

 
Published Articles

 
Upcoming Volumes
 

 
Contact Us
 

 
Useful External Links
 

 
CRPIT Site Search
 
    

Estimating Set Intersection using Small Samples

Koehler, H.

    The similarity of two sets A, B can be measured by the size of their intersection A ∩ B, relative to the size of A and B. The classic measure here is resemblance or Jaccard similarity, but other useful measures (e.g. subset containment) can be derived from intersection size as well. For large and/or many sets, exact com- putation of intersection size can be expensive though, and requires transmitting entire sets if they are distributed. For this reason a number of different sampling techniques have been developed, which allow us to estimate intersection size (and derived measures) efficiently from the intersection size of smaller sample sets. However, while existing estimation formulas are intuitive and unbiased, they can be quite inaccurate when samples are small. We show that by using more advanced estimation techniques, we can significantly reduce sample sizes without compromising accuracy, or conversely, obtain more accurate results from the same samples.
Cite as: Koehler, H. (2010). Estimating Set Intersection using Small Samples. In Proc. 33rd Australasian Computer Science Conference (ACSC 2010) Brisbane, Australia. CRPIT, 102. Mans, B. and Reynolds, M. Eds., ACS. 71-78
pdf (from crpit.com) pdf (local if available) BibTeX EndNote GS