Unsupervised Text Segmentation using LDA and MCMC

Yu, K., Li, Z., Guan, G., Wang, Z. and Feng, D.

In this paper, we propose a data driven approach to text segmentation, while most of the existing unsupervised methods determine segmentation boundaries by empirically exploring similarity measurement between adjacent units (e.g. sentences). Firstly, we train a latent Dirichlet allocation (LDA) model with the large scale Wikipedia Corpus to avoid the problem of vocabulary mismatch, which makes our approach domain-independent. Secondly, each segment unit is represented with a distribution of the topics, instead of a set of word tokens. Finally, a text input is modeled as a sequence of segment units and Markov Chain Monte Carlo technique is employed to decide the appropriate boundaries. The major advantage of using MCMC is its ability to detect both strong and weak boundaries. Experimental results demonstrate that our proposed approach achieve promising results on a widely used benchmark dataset when compared with the state-of-the-art methods.

Cite as: Yu, K., Li, Z., Guan, G., Wang, Z. and Feng, D. (2012). Unsupervised Text Segmentation using LDA and MCMC. In Proc. Data Mining and Analytics 2012 (AusDM 2012) Sydney, Australia. CRPIT, 134. Zhao, Y., Li, J. , Kennedy, P.J. and Christen, P. Eds., ACS. 21 - 26

(from crpit.com) (local if available)