Improving Scalability and Performance of Random Forest Based Learning-to-Rank Algorithms by Aggressive Subsampling

Ibrahim, M. and Carman, M.

    Random forest based Learning-to-rank (LtR) algorithms exhibit performance competitive with other state-of-the-art algorithms. Traditionally, each tree of the forest is learnt from a bootstrapped copy (sampled with replacement) of the training set, in which approximately 63% of the examples are unique, although some studies show that sampling without replacement also works well. The goal of using a bootstrapped copy instead of the original training set is to reduce correlation among individual trees, thereby making the prediction of the ensemble more accurate. In this study, we investigate whether we can decrease the correlation among the trees even further without compromising accuracy. Among several potential options, we focus on the sub-sample used to learn each individual tree, and investigate the performance of a random forest based LtR algorithm as the size of these sub-samples is reduced. Experiments on LETOR data sets reveal that a substantial reduction in training time can be achieved using only a small fraction of the training data. Moreover, accuracy is likely to increase while maintaining the same level of performance stability as the baseline. Thus, in addition to their existing benefit of being completely parallelizable, this study empirically identifies yet another property of random forest based LtR algorithms that makes them one of the top contenders for large-scale LtR.
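    The core idea described in the abstract can be illustrated with a minimal sketch: instead of giving each tree a full-size bootstrap of the training set, each tree is trained on a small sub-sample drawn without replacement. The sketch below is only illustrative, assuming a pointwise regression formulation over query-document feature vectors and scikit-learn's DecisionTreeRegressor; the paper's actual algorithm, sub-sample sizes, and data handling may differ.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_subsampled_forest(X, y, n_trees=100, sample_frac=0.05, seed=0):
        """Train each tree on a small sub-sample drawn without replacement
        (instead of a full bootstrap) to decorrelate trees and cut training time."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        m = max(1, int(sample_frac * n))          # aggressive sub-sample size
        trees = []
        for _ in range(n_trees):
            idx = rng.choice(n, size=m, replace=False)
            tree = DecisionTreeRegressor(max_features="sqrt")
            tree.fit(X[idx], y[idx])              # y: graded relevance labels
            trees.append(tree)
        return trees

    def score(trees, X):
        """Average the trees' predictions; documents are then ranked by this score."""
        return np.mean([t.predict(X) for t in trees], axis=0)

    Because each tree sees only a small, independently drawn sub-sample, the trees are less correlated and each one is cheap to build, which is the source of both the speed-up and the potential accuracy gain reported in the study.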
Cite as: Ibrahim, M. and Carman, M. (2014). Improving Scalability and Performance of Random Forest Based Learning-to-Rank Algorithms by Aggressive Subsampling. In Proc. Twelfth Australasian Data Mining Conference (AusDM14), Brisbane, Australia. CRPIT, 158. Li, X., Liu, L., Ong, K.L. and Zhao, Y., Eds. ACS. 91-99.