|
A major challenge posed by the rapid accumulation of whole-genome
sequence data has been finding ways to interpret such data
in a biologically meaningful way (i.e. finding an effective
data mining technique that produces new biological insight).
One of the first steps in DNA data mining is genome annotation:
the process by which putative gene sequences are identified
by establishing homology of open reading frames to existing
genes. Current homology search methods can annotate only part
of most newly sequenced genomes. Algorithms
such as BLAST have inherent shortcomings that prevent them
from detecting weak homology, especially when dealing with
genomes of unusual composition. In this research project, we
are developing algorithms to improve such methods of homology
searching.
We believe that mutation biases (cases in which the four
types of DNA nucleotide do not mutate to one another with
equal frequency) interact with the biochemical properties of
amino acids and the codon assignments of the genetic code to
produce complex variation in the patterns by which amino
acids substitute for one another in different genes and different
genomes.
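
As a rough illustration of how a nucleotide-level bias can propagate through the genetic code, the sketch below sums, for every pair of amino acids, the rates of all single-nucleotide changes that convert a codon for one into a codon for the other. It is a minimal sketch, not our actual method: the function name `aa_substitution_weights`, the example rate values, and the assumption that all codons are used equally often are illustrative only, and the biochemical properties of the amino acids are ignored.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def aa_substitution_weights(rates):
    """Sum single-nucleotide mutation rates for each amino acid pair.

    `rates` maps (from_base, to_base) to a relative mutation rate.
    Assumption: every codon is equally frequent; stop codons are skipped.
    """
    weights = defaultdict(float)
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos in range(3):
            for new_base in BASES:
                if new_base == codon[pos]:
                    continue
                mutant = codon[:pos] + new_base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                # Ignore synonymous changes and changes to a stop codon.
                if new_aa == "*" or new_aa == aa:
                    continue
                weights[(aa, new_aa)] += rates[(codon[pos], new_base)]
    return dict(weights)


# Unbiased mutation: every base changes to every other base at rate 1.
uniform = {(a, b): 1.0 for a in BASES for b in BASES if a != b}

# Hypothetical AT-biased mutation: G->A and C->T are three times as frequent.
at_biased = dict(uniform)
at_biased[("G", "A")] = 3.0
at_biased[("C", "T")] = 3.0

unbiased_weights = aa_substitution_weights(uniform)
biased_weights = aa_substitution_weights(at_biased)
```

Under the unbiased rates, alanine to valine and valine to alanine receive the same weight. Under the hypothetical AT-biased rates, alanine to valine (which requires C to T at the second codon position) becomes three times as heavily weighted, while the reverse change is unaffected.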
We plan to use this knowledge to create biochemical
profiles that would allow us to predict the way a gene would “look” in
a different genome.
Next, we plan to extend the kind of search
carried out by PHI-BLAST (a modified version of BLAST):
instead of a query sequence, we would provide BLAST with
a biochemical
search image.
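
One way to picture a biochemical search image is as a list of target property values, one per position, that is slid along a candidate protein sequence. The sketch below is only an assumption about what such a search could look like. It does not reproduce PHI-BLAST or any BLAST interface; it uses a single property (the Kyte-Doolittle hydropathy scale), and the helper names `profile_from_sequence` and `best_match` are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def profile_from_sequence(seq):
    """Turn a protein sequence into a list of hydropathy values."""
    return [KD[aa] for aa in seq]


def best_match(profile, target):
    """Slide the profile along `target` and return (start, distance) for the
    window whose mean absolute hydropathy difference is smallest.

    Returns None if the target is shorter than the profile.
    """
    n = len(profile)
    best = None
    for start in range(len(target) - n + 1):
        window = target[start:start + n]
        dist = sum(abs(KD[aa] - p) for aa, p in zip(window, profile)) / n
        if best is None or dist < best[1]:
            best = (start, dist)
    return best


# The query "LIV" is absent from the target, but the hydrophobic stretch
# "VLI" has a similar property profile and is reported as the best window.
image = profile_from_sequence("LIV")
hit = best_match(image, "DDKVLIEE")
```

Because the comparison is made on properties rather than residue identities, a window can score well without sharing a single identical amino acid with the query, which is the behaviour we would want when sequence-level homology is weak.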
We expect that this approach will allow us
to reduce the number of unannotated putative genes, while
at the same time revealing insights into the mechanisms
involved in protein evolution.
|