Modern biology increasingly relies on high-throughput techniques. This trend challenges computational biologists to quickly extract as much useful information from the data as possible. On the genomic side, this primarily means correlating phenotypic differences with observed nucleotide sequence variations. On the protein side, the challenge is generally to annotate protein function at reasonable levels of accuracy. We believe that nucleic and amino acid sequences contain a large portion of the information necessary to address both of these challenges.
Our main goal is to develop fast, accurate, and meaningful ways of analyzing this growing deluge of biological data and to bring these developments bench-side (or patient-side). To make our predictions, we rely on a number of sequence-based features (including evolutionary information and the results of other predictors) and use a variety of machine learning methods (including neural networks, support vector machines (SVMs), and random forests).
The active projects in the lab include:
- Development of an in silico mutagenesis methodology that will identify functionally important residues in protein sequences. This direction addresses questions in non-synonymous SNP (nsSNP) analysis, mutation combinatorics (possibly applicable to phylogenetics), and function prediction.
- Analysis of the effects of genomic SNPs (non-coding or synonymous) on overall organism fitness. Initial steps in this direction focus on data collection and on outlining SNP characteristics that can be used to differentiate between functionally important and non-important SNPs.
- Computational literature analysis (natural language processing) to extract, from free text (scientific publications, lab records, etc.), information relevant to the two goals above.
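
To make the in silico mutagenesis idea from the first project concrete, here is a minimal sketch of its enumeration step: generating every single-residue substitution of a protein sequence. It is an illustration only and not the lab's actual pipeline. The function name `single_point_mutants`, the restriction to the 20 standard amino acids, and the toy sequence `MKT` are all assumptions made for the example. Scoring each mutant for functional impact, which is the substantive part of such a methodology, is deliberately left out.

```python
# Illustrative sketch only; names and the toy sequence are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids (one-letter codes)


def single_point_mutants(sequence):
    """Yield (label, mutant_sequence) for every single-residue substitution.

    Labels use the common wild-type/position/substitute form, e.g. 'M1A',
    with 1-based positions.
    """
    for position, wild_type in enumerate(sequence):
        for substitute in AMINO_ACIDS:
            if substitute == wild_type:
                continue  # skip the unchanged residue
            label = f"{wild_type}{position + 1}{substitute}"
            mutant = sequence[:position] + substitute + sequence[position + 1:]
            yield label, mutant


mutants = list(single_point_mutants("MKT"))
print(len(mutants))  # 3 positions x 19 substitutions = 57
print(mutants[0])    # ('M1A', 'AKT')
```

In a full analysis, each mutant would then be passed to a predictor of functional effect, and positions where most substitutions are predicted to be disruptive would be candidates for functionally important residues.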