Detecting Topic Labels for Tweets by Matching Features from Pseudo-Relevance Feedback

Zhang, J., Liu, D., Ong, K.L., Li, Z. and Li, M.

Detecting a suitable topic label for short texts, e.g. tweets from Twitter, is an important component in many applications, including diversity ranking, clustering, information retrieval, and information filtering. Detecting topic labels automatically, however, is a major challenge. The character limit of a short text means that it lacks a feature space large enough to adequately describe its content in relation to other short texts in a given collection. Therefore, methods like LDA, TF-IDF or similarity measures all fail due to their sensitivity to a small feature space. And when a collection of related short texts is considered, e.g., from a Twitter search, the result set collectively exhibits sparsity and high dimensionality, a nightmare for information processing. A solution to this problem is to expand the feature space through a process known as pseudo-relevance feedback. Unfortunately, existing pseudo-relevance feedback methods disappoint when subjected to real-world conditions. The fundamental problem lies in the level of noise present in both the short texts and the feedback source, which is often the World Wide Web. We propose a novel pseudo-relevance feedback algorithm to accurately identify topic labels for short texts. Our algorithm robustly handles noise in both the short texts and the feedback source through a method called 'feature matching'. Empirical results confirm the efficacy of our algorithm.
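To make the sparsity problem concrete, the following is a minimal Python sketch of generic pseudo-relevance feedback expansion. It is not the feature matching algorithm proposed in the paper and does no noise handling. The function names, the `top_k` parameter, the example tweets, and the hard-coded feedback documents (stand-ins for results retrieved from the Web) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def features(text):
    """Bag-of-words term counts over lowercased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand(text, feedback_docs, top_k=3):
    """Naive expansion: add the top_k most frequent feedback terms.

    feedback_docs is a hypothetical stand-in for documents retrieved by
    issuing the short text as a query against a feedback source.
    """
    expanded = features(text)
    pool = Counter()
    for doc in feedback_docs:
        pool.update(features(doc))
    for term, _count in pool.most_common(top_k):
        expanded[term] += 1
    return expanded


# Two related tweets that share no terms (illustrative data only).
tweet_a = "apple unveils new iphone"
tweet_b = "cupertino keynote today"

# Hypothetical feedback documents for each tweet.
feedback_a = ["apple iphone launch keynote in cupertino",
              "iphone keynote apple event"]
feedback_b = ["apple keynote cupertino event today",
              "cupertino keynote apple iphone"]

before = cosine(features(tweet_a), features(tweet_b))
after = cosine(expand(tweet_a, feedback_a), expand(tweet_b, feedback_b))
print(before, after)
```

With no shared terms, the raw tweets have a cosine similarity of zero even though they concern the same event; after expansion the two vectors overlap. The sketch trusts the feedback documents blindly, so any off-topic feedback would be added as readily as relevant terms, which is the noise problem the abstract describes.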

Cite as: Zhang, J., Liu, D., Ong, K.L., Li, Z. and Li, M. (2012). Detecting Topic Labels for Tweets by Matching Features from Pseudo-Relevance Feedback. In Proc. Data Mining and Analytics 2012 (AusDM 2012), Sydney, Australia. CRPIT, 134. Zhao, Y., Li, J., Kennedy, P.J. and Christen, P. Eds., ACS. pp. 9-20.
