Conferences in Research and Practice in Information Technology
  

Online Version - Last Updated - 20 Jan 2012

 

 
Multi-stage Redundancy Reduction: Effective Utilisation of Small Protein Datasets

Hawkins, J. and Boden, M.

    In many important bioinformatics problems, data sets contain considerable redundancy, due both to the evolutionary processes that generate the data and to biases in data collection procedures. Standard practice in bioinformatics is to remove redundancy so that no two sequences in a data set are more than forty percent similar. For small data sets this can dilute already impoverished data beyond the boundary of practicality. One can instead include all available data and ensure only that the required redundancy gap holds between the training and test samples. However, this encourages overfitting by exposing the model to a highly redundant training set. We outline a process of multi-stage redundancy reduction, whereby scarce data can be utilised effectively without compromising the integrity of the model or the testing procedure.
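The two baseline options described in the abstract (reducing the whole data set, or enforcing a redundancy gap only between training and test samples) can be sketched roughly as follows. This is a minimal illustration, not the multi-stage procedure of the paper, which the abstract does not spell out. The function names `similarity`, `reduce_redundancy` and `enforce_gap` are hypothetical, and `difflib.SequenceMatcher` is only a toy stand-in for the alignment-based similarity a real pipeline would use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; an assumption standing in for
    alignment-based percent identity between two sequences."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(seqs, threshold=0.4):
    """Greedy reduction: keep a sequence only if it is at most
    `threshold` similar to every sequence already kept."""
    kept = []
    for s in seqs:
        if all(similarity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept


def enforce_gap(train, test, threshold=0.4):
    """Keep all test sequences, but drop any training sequence that is
    more than `threshold` similar to some test sequence. Redundancy
    within the training set itself is left untouched."""
    return [t for t in train
            if all(similarity(t, s) <= threshold for s in test)]


if __name__ == "__main__":
    data = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
    print(reduce_redundancy(data))
    print(enforce_gap(["AAAAAAAAAA", "GGGGGGGGGG"], ["AAAAAAAAAC"]))
```

The greedy filter depends on input order, so which member of a redundant group survives is arbitrary in this sketch. The second function shows why the gap-only option can leave a highly redundant training set: it never compares training sequences with each other.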
Cite as: Hawkins, J. and Boden, M. (2006). Multi-stage Redundancy Reduction: Effective Utilisation of Small Protein Datasets. In Proc. 2006 Workshop on Intelligent Systems for Bioinformatics (WISB 2006), Hobart, Australia. CRPIT, 73. Boden, M. and Bailey, T. L., Eds. ACS. 55-59.
 

 

© Copyright Australian Computer Society Inc. 2001-2014.
Comments should be sent to the webmaster at crpit@scem.uws.edu.au.
This page last updated 16 Nov 2007