Locality-Sensitive Hashing for Protein Classification
Buckingham, L., Hogan, J.M., Geva, S. and Kelly, W.
Determination of sequence similarity is a central issue in
computational biology, a problem addressed primarily
through BLAST, an alignment-based heuristic which has
underpinned much of the analysis and annotation of the
genomic era. Despite their success, alignment-based
approaches scale poorly with increasing data set size, and
are not robust under structural sequence rearrangements.
Successive waves of innovation in sequencing
technologies – so-called Next Generation Sequencing
(NGS) approaches – have led to an explosion in data
availability, challenging existing methods and motivating
novel approaches to sequence representation and
similarity scoring, including adaptation of existing
methods from other domains such as information
retrieval.
In this work, we investigate locality-sensitive hashing of
sequences through binary document signatures, applying
the method to a bacterial protein classification task. Here,
the goal is to predict the gene family to which a given
query protein belongs. Experiments carried out on a pair
of small but biologically realistic datasets (the full protein
repertoires of families of Chlamydia and Staphylococcus
aureus genomes respectively) show that a measure of
similarity obtained by locality-sensitive hashing gives
highly accurate results while offering a number of
avenues which will lead to substantial performance
improvements over BLAST.
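
As a rough illustration of the idea of comparing sequences through binary signatures, the sketch below builds a fixed-width bit signature from the k-mers of a protein sequence and assigns a query the label of the reference with the smallest Hamming distance. It is not the authors' implementation: the k-mer length, the signature width, the use of SHA-256 to derive pseudo-random bit patterns, the nearest-neighbour rule, and all function names are assumptions made for this example.

```python
import hashlib


def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence (empty if len(seq) < k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def signature(seq, k=3, bits=64):
    """Build a binary signature for a sequence (illustrative scheme only).

    Each k-mer is hashed to a pseudo-random pattern of +1/-1 votes, the
    votes are summed per bit position, and a bit is set where the sum is
    positive. Sequences sharing many k-mers tend to share many bits.
    """
    assert bits % 8 == 0 and bits <= 256, "SHA-256 supplies at most 256 bits"
    votes = [0] * bits
    for km in kmers(seq, k):
        digest = hashlib.sha256(km.encode("ascii")).digest()
        pattern = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            votes[b] += 1 if (pattern >> b) & 1 else -1
    sig = 0
    for b in range(bits):
        if votes[b] > 0:
            sig |= 1 << b
    return sig


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def classify(query, labelled, k=3, bits=64):
    """Return the label of the reference whose signature is closest.

    `labelled` is a list of (sequence, family_label) pairs; both the data
    layout and the nearest-neighbour rule are assumptions of this sketch.
    """
    qsig = signature(query, k, bits)
    best = min(labelled, key=lambda item: hamming(qsig, signature(item[0], k, bits)))
    return best[1]
```

A call such as `classify(query_sequence, [(seq_a, "family A"), (seq_b, "family B")])` returns the label of the closest reference. In practice the reference signatures would be computed once and stored, so that each query costs only one signature computation plus cheap bitwise comparisons.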
Cite as: Buckingham, L., Hogan, J.M., Geva, S. and Kelly, W. (2014). Locality-Sensitive Hashing for Protein Classification. In Proc. Twelfth Australasian Data Mining Conference (AusDM14), Brisbane, Australia. CRPIT, 158. Li, X., Liu, L., Ong, K.L. and Zhao, Y. Eds., ACS. 141-147.