Abstract

An alternative method for analyzing proteins is proposed. Currently, protein search engines available on the internet use domains (predefined sequences of amino acids) to align proteins. The method presented encodes a protein sequence using 1200 numeric codes, each representing a unique three-amino-acid sequence. Each numeric code starts with one of three specific amino acids, followed by any two additional amino acids. With the use of the FPC (FingerPrinted Contig) program, the total protein database (including “redundant” records) from the National Center for Biotechnology Information (NCBI) has been processed and placed into “bins/contigs” based on associations of these triplet codes. When analyzed with FPC, proteins are “contigged” together based on the number of shared fragments, regardless of order. These associations were supported by additional analysis with the standard BLASTP utility from NCBI. Within the created contig sets, there are numerous examples of proteins (allotypes and orthotypes) that have evolved into different, seemingly unrelated proteins. The power of this domain-free technique has yet to be fully explored; however, the ability to bin proteins together with no a priori knowledge of domains may prove a powerful tool in the characterization of the hundreds of thousands of available, yet undescribed, expressed protein and open reading frame sequences.
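
The triplet coding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the three starting amino acids, so `START_RESIDUES` below is a hypothetical choice, and the code numbering, the use of overlapping windows, and the helper names `encode` and `shared_codes` are likewise assumptions made for illustration. The count of shared codes only mimics the order-independent comparison that the abstract attributes to FPC; it does not call FPC itself.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
START_RESIDUES = "KRH"  # hypothetical: the abstract does not say which three are used

# Build the code table: 3 starting residues x 20 x 20 = 1200 triplets.
# The numbering (1..1200) is an arbitrary assumption.
TRIPLET_CODES = {}
for start in START_RESIDUES:
    for second, third in product(AMINO_ACIDS, repeat=2):
        TRIPLET_CODES[start + second + third] = len(TRIPLET_CODES) + 1


def encode(sequence):
    """Return the set of numeric codes found in a protein sequence.

    Every overlapping three-residue window is checked (an assumption);
    windows that do not begin with a starting residue are skipped.
    """
    seq = sequence.upper()
    codes = set()
    for i in range(len(seq) - 2):
        code = TRIPLET_CODES.get(seq[i:i + 3])
        if code is not None:
            codes.add(code)
    return codes


def shared_codes(seq_a, seq_b):
    """Count codes present in both sequences, regardless of their order."""
    return len(encode(seq_a) & encode(seq_b))
```

Under these assumptions, `shared_codes("KAAGG", "GGKAA")` counts the triplet `KAA` as shared even though it sits at different positions in the two sequences, which is the order-independent behavior the abstract describes.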

Link to publisher version

http://dx.doi.org/10.1166/gl.2004.041