Comparing Offline and Online Statistics Estimation for Text Retrieval from Overlapped Collections
Student Defense
Date: September 11, 2007
Time: 10:30 AM - 12:00 PM
Contact Person: Bhaumik Chokshi
Contact Email: bhaumik.chokshi@asu.edu
Location: BYENG 455
Defense Type: Master's Thesis Defense
Committee Members
Dr. Subbarao Kambhampati
Dr. Yi Chen
Dr. Hasan Davulcu
In an environment of distributed text collections, the first step in the information retrieval process is to identify which of the available collections are most relevant to a given query and should thus be accessed to answer it. Collection selection is difficult because the relevance of sources varies and because the sources overlap. Some previous collection selection methods have considered the relevance of collections but ignored the overlap among them, and thus make the unrealistic assumption that the collections are all effectively disjoint. Overlap estimation can be done in two ways: offline or online. The main objective of this thesis is to compare these two approaches for estimating statistics. One existing approach, COSCO, takes the offline route: it stores statistics for frequent item sets and uses them to estimate statistics for the user query. This thesis presents ROSCO, which uses a sample-based online approach to estimate the overlap among collections for a given query. In addition, COSCO and ROSCO are compared with ReDDE (which does not consider overlap) under a variety of scenarios. The experiments show that ROSCO is able to outperform existing methods by 8-10% in the presence of overlap among collections.
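To make the idea of sample-based online overlap estimation concrete, the following is a minimal illustrative sketch and not ROSCO's actual algorithm, which the abstract does not describe in detail. It assumes each collection is represented by a small sample of documents, that a document duplicated across collections carries the same identifier, and that a query matches a document when every query term appears in it. The function names `matching_docs` and `estimate_pairwise_overlap` are hypothetical.

```python
from itertools import combinations


def matching_docs(sample, query):
    """Return the ids of sampled documents that contain every query term.

    `sample` maps a document id to its text. Matching by exact term
    containment is an assumption made only for this illustration.
    """
    terms = query.lower().split()
    return {
        doc_id
        for doc_id, text in sample.items()
        if all(term in text.lower().split() for term in terms)
    }


def estimate_pairwise_overlap(samples, query):
    """Estimate query-specific overlap between each pair of collections.

    `samples` maps a collection name to its document sample. The estimate
    is the Jaccard ratio of the two sampled result sets, computed at query
    time rather than from statistics stored in advance.
    """
    hits = {name: matching_docs(sample, query) for name, sample in samples.items()}
    overlap = {}
    for a, b in combinations(sorted(hits), 2):
        union = hits[a] | hits[b]
        # With no matching documents in either sample, report zero overlap.
        overlap[(a, b)] = len(hits[a] & hits[b]) / len(union) if union else 0.0
    return overlap
```

A collection selector could use such estimates to avoid querying a second collection whose results for this query are largely contained in one already chosen. An offline approach would instead look up precomputed statistics for item sets related to the query.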