Predictive Oncology & Intervention Strategies
Molecular Basis of Oncogenesis & Cancer Control
February 7 - 10, 2004, Hotel Westminster, Nice, France

Non-synonymous cSNP and InDel discovery by bioinformatic data mining

CH Lai MS, LY Hu MS, JY Chiu MS, W-C Lin PhD

Institute of Biomedical Sciences, Academia Sinica, Taiwan

AIM: A single nucleotide polymorphism (SNP) is a single base-pair position in genomic DNA at which different sequence alternatives exist among normal individuals in a population, with the least frequent allele having an abundance of 1% or greater. It is estimated that roughly one SNP occurs per 1,000 base pairs and that more than 3 million SNPs exist in the human genome. Non-synonymous cSNPs, which occur in protein-coding regions, can alter the amino acid composition and thus the functionality of the gene product. They are therefore potential culprits in inherited human diseases. Although insertion or deletion (InDel) of a single amino acid residue in the coding region is a less frequent event, such alterations might also have a significant impact on gene function. This study develops bioinformatic tools for data-mining cSNP and InDel functional variants from human EST databases.

METHODS: In addition to direct genome sequencing, dbEST has been used for SNP discovery because of the rich information contained in its millions of ESTs. Our laboratory has devoted much effort to developing bioinformatic tools for experimental research, and we have used the human dbEST database for the discovery of human genes. Novel human genes can be identified by searching EST databases with query sequences of xenolog origin, an approach we term comparative gene identification (CGI) (Genome Research 2000, 10: 703). The CGI approach was shown to be beneficial because protein sequence queries provide a better alignment scaffold. In this study, we modified our CGI bioinformatic tools to data-mine such low-frequency functional variants from human ESTs.

RESULTS: We modified the CGI computer program and used alternative selection filters to extract these low-frequency functional variants from human ESTs and human reference proteins. With 17,234 human reference proteins as starting alignment scaffolds and 4.8 million human EST entries, we identified more than 217,473 potential non-synonymous cSNPs and 1,477 InDel variants in the human reference proteins. In addition, our approach can be used to validate and correct the amino acid sequences of human reference proteins. Additional data integration with the OMIM database yields new potential disease-related cSNPs.

CONCLUSIONS: By combining bioinformatic computation with database integration, we could identify new potential disease targets and their variants that are useful for clinical diagnosis and prevention.
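
The core idea of using a reference protein as an alignment scaffold for translated ESTs can be illustrated with a minimal Python sketch. It is not the CGI program itself: the function names `call_variants` and `recurrent_variants`, the gapped-string input format, and the `min_support` filter are all assumptions made for illustration. The sketch takes pairwise alignments that have already been computed, reports amino acid substitutions and single-residue InDels within the region each EST covers, and keeps only variants seen in several ESTs.

```python
from collections import Counter


def call_variants(ref_aln, est_aln):
    """Compare one gapped alignment of a reference protein and a translated EST.

    Both arguments are equal-length strings that use '-' for gaps (a
    hypothetical input format). Returns (kind, position, ref_residue,
    est_residue) tuples, where position is 1-based on the ungapped reference.
    """
    if len(ref_aln) != len(est_aln):
        raise ValueError("aligned strings must have equal length")

    # ESTs are partial transcripts, so gaps outside the EST's aligned span
    # mean "not covered", not "deleted"; only columns inside the span count.
    est_cols = [i for i, e in enumerate(est_aln) if e != "-"]
    if not est_cols:
        return []
    first, last = est_cols[0], est_cols[-1]

    variants = []
    pos = 0  # reference residues consumed so far
    for i, (r, e) in enumerate(zip(ref_aln, est_aln)):
        if r != "-":
            pos += 1
        if i < first or i > last or r == e:
            continue
        if r == "-":
            # extra residue in the EST, inserted after reference residue pos
            variants.append(("insertion", pos, "-", e))
        elif e == "-":
            variants.append(("deletion", pos, r, "-"))
        else:
            variants.append(("substitution", pos, r, e))
    return variants


def recurrent_variants(alignments, min_support=2):
    """Count each variant once per EST and keep those seen in >= min_support ESTs.

    min_support is an assumed filter meant to separate recurrent variants
    from isolated sequencing errors; the abstract does not state its filters.
    """
    counts = Counter()
    for ref_aln, est_aln in alignments:
        counts.update(set(call_variants(ref_aln, est_aln)))
    return {v: n for v, n in counts.items() if n >= min_support}
```

For example, `recurrent_variants([("MKTAYIAK", "--TAHIAK"), ("MKTAYIAK", "MKTAHI--"), ("MKTAYIAK", "MKTAYIAK")])` keeps only the Y-to-H substitution at reference position 5, which is supported by two of the three ESTs. A real pipeline would also need to handle frameshifts, ambiguous residues, and paralogous matches, which this sketch ignores.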

Paper presented at the International Symposium on Predictive Oncology and Intervention Strategies; Nice, France; February 7 - 10, 2004; in poster session 893 (Molecular pathology).