A probabilistic suffix tree (PST) is a stochastic model that uses a suffix tree as an index structure to store conditional probabilities associated with subsequences. PSTs have been used successfully to model and predict protein families using a global approach. The global approach takes the entire sequence into account and is therefore not suitable for partially conserved families. We develop two variants of PST for local prediction: multiple-domain prediction and best-domain prediction. The multiple-domain method predicts the probability that a protein belongs to a family based on one or more significantly conserved regions, while the best-domain method bases the prediction on the most conserved region in the query sequence. The time complexity of both approaches is the same as that of global prediction, namely O(Lm), where L is the depth bound of the tree and m is the length of the query sequence. We tested our algorithms on the Pfam database of protein families and compared the results with the global prediction method. The experimental results show that our approaches achieve higher prediction accuracy than the global approach. We also show that our local prediction approach is an effective way to extract motifs/domains. Our approaches employ a linear-time method for building the PST, adapted from the linear-time construction of probabilistic automata reported by A. Apostolico et al.
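The difference between global, best-domain, and multiple-domain scoring can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary-based context table (in place of a real suffix tree), the add-one smoothing, the uniform background model, the zero-reset segmentation rule, and all function names and the `threshold` parameter are assumptions made here for illustration only.

```python
import math
from collections import defaultdict

def build_pst(sequences, max_depth):
    """Count next-symbol frequencies for every context of length 0..max_depth.
    (Assumption: a plain dict of contexts stands in for the suffix tree.)"""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for d in range(min(max_depth, i) + 1):
                counts[seq[i - d:i]][sym] += 1
    return counts

def next_prob(pst, context, sym, alphabet_size):
    """P(sym | longest stored suffix of context), with add-one smoothing."""
    ctx = context
    while ctx and ctx not in pst:
        ctx = ctx[1:]  # drop the oldest symbol and retry with a shorter suffix
    table = pst.get(ctx, {})
    total = sum(table.values())
    return (table.get(sym, 0) + 1) / (total + alphabet_size)

def log_odds(pst, query, alphabet_size, max_depth):
    """Per-position log-odds against a uniform background (an assumption)."""
    scores = []
    for i, sym in enumerate(query):
        context = query[max(0, i - max_depth):i]
        p = next_prob(pst, context, sym, alphabet_size)
        scores.append(math.log(p * alphabet_size))
    return scores

def global_score(scores):
    """Average log-odds over the whole query sequence."""
    return sum(scores) / len(scores) if scores else 0.0

def segment_peaks(scores):
    """Peak cumulative score of each run; a run ends when its sum drops to <= 0."""
    peaks, run, peak = [], 0.0, 0.0
    for s in scores:
        run += s
        if run <= 0:
            if peak > 0:
                peaks.append(peak)
            run, peak = 0.0, 0.0
        else:
            peak = max(peak, run)
    if peak > 0:
        peaks.append(peak)
    return peaks

def best_domain_score(scores):
    """Score of the single best-scoring region."""
    return max(segment_peaks(scores), default=0.0)

def multiple_domain_score(scores, threshold=0.0):
    """Sum of the scores of all regions whose peak reaches the threshold."""
    return sum(p for p in segment_peaks(scores) if p >= threshold)

# Toy usage: a tiny hypothetical "family" and a query with two conserved regions.
family = ["ACDEACDE", "ACDEACDF"]
pst = build_pst(family, max_depth=2)
query = "ACDE" + "W" * 20 + "ACDE"
scores = log_odds(pst, query, alphabet_size=20, max_depth=2)
print(global_score(scores), best_domain_score(scores), multiple_domain_score(scores, 1.0))
```

In this toy query the long unconserved stretch drags the global average down, while the two local scores depend only on the conserved regions, which is the motivation the abstract gives for local prediction.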
Cite as: Sun, Z. and Deogun, J.S. (2004). Local Prediction Approach of Protein Classification using Probabilistic Suffix Trees. In Proc. Second Asia-Pacific Bioinformatics Conference (APBC2004), Dunedin, New Zealand. CRPIT, 29. Chen, Y.-P. P., Ed. ACS. 357-362.