Gilbert White, William Seffens*
*Corresponding author
Keywords: Amino acids, Backtranslation, Genetic code, Neural network, Nucleic acids
A neural network (NN) was trained on amino acid and nucleic acid sequences to test its ability to predict a nucleic acid sequence given only an amino acid sequence. A multi-layer backpropagation network with one hidden layer of 5 to 9 neurons was used. Different network configurations with varying numbers of input neurons were used to represent the amino acids, while a constant representation was used for the output layer representing the nucleic acids. In the best-trained network, 93% of the overall bases, 85% of the degenerate bases, and 100% of the fixed bases were correctly predicted for randomly selected test sequences. The training set was composed of 60 human sequences in a window of 10 to 25 codons starting at the coding sequence start site. Different NN configurations encoding the amino acids under increasing window sizes were evaluated to predict the behavior of the NN with a significantly larger training set. This genetic data analysis effort will assist in understanding human gene structure. Benefits include computational tools that could backtranslate amino acid sequences more reliably, which would be useful for degenerate PCR cloning and may assist the identification of human gene coding sequences (CDS) among open reading frames in DNA databases.
Degenerate primers or probes, usually designed from partially sequenced peptides or from conserved regions identified by comparing several proteins, have been widely used in the polymerase chain reaction (PCR), DNA library screening, and Southern blot analysis. The degenerate nature of the genetic code prevents backtranslation of amino acids into codons with certainty. Numerous statistical studies have established that codon frequencies are not random (Karlin and Brendel, 1993). Many cDNA sequences have been mapped onto a "DNA walk", and long-range power-law correlations were found (Peng et al., 1992). In view of these long-range correlations in DNA, a neural network approach may identify sequence patterns in coding regions that could be used to improve the accuracy of backtranslation. Neural networks are able to form generalizations and can identify patterns in noisy data sets. To list just a few biological applications, neural networks have been used successfully to identify coding regions in genomic DNA (Snyder and Stormo, 1993), to detect mRNA splice sites (Ogura et al., 1997), and to predict the secondary structure of proteins (Holley and Karplus, 1989; Chandonia and Karplus, 1996). Neural networks have also been used to study the structure of the genetic code. One such network was trained to classify the 61 nucleotide triplets of the genetic code into 20 amino acid categories (Tolstrup et al., 1994); it was able to correlate the structure of the genetic code with measures of amino acid hydrophobicity. Most neural network methods for identifying patterns in sequences can be classified as search by signal or search by content (Granjeon and Tarroux, 1995). Search by signal consists of identifying specific sites, such as splice sites; this method suffers from a lack of reliability when variable signals delimit the regions of interest. Search-by-content algorithms use local constraints, such as compositional bias, to characterize regions of DNA. The goal of the research reported here is to apply these successful NN techniques to analyze and generalize codon usage in mRNA sequences beginning at the CDS start site. Local and global patterns of codon usage in genes may be identifiable by neural networks of suitable architecture. This paper reports initial trials of altering the encoding of amino acids for the input neural layer; future studies will address the architecture of the hidden layer to optimize the NN's ability to detect codon usage patterns in genes.
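To make the degeneracy problem concrete, the short sketch below (an illustration added here, not code from the original study) backtranslates a peptide into IUPAC degenerate codons and counts how many bases the genetic code fixes outright versus how many remain ambiguous. The example peptide and the simplified handling of the six-fold degenerate amino acids are assumptions for illustration only; any improvement over such a lookup-table baseline must come from learned codon usage patterns, which is what the NN approach targets.

```python
# Illustrative sketch (not from the original study): backtranslate a peptide
# into IUPAC degenerate codons and count which bases the genetic code fixes.

DEGENERATE_CODONS = {
    'A': 'GCN', 'C': 'TGY', 'D': 'GAY', 'E': 'GAR', 'F': 'TTY',
    'G': 'GGN', 'H': 'CAY', 'I': 'ATH', 'K': 'AAR', 'M': 'ATG',
    'N': 'AAY', 'P': 'CCN', 'Q': 'CAR', 'T': 'ACN', 'V': 'GTN',
    'W': 'TGG', 'Y': 'TAY',
    # Leu, Arg and Ser are six-fold degenerate; only the four-codon family
    # is listed here, so these three entries are a simplification.
    'L': 'CTN', 'R': 'CGN', 'S': 'TCN',
}


def backtranslate(peptide: str) -> str:
    """Return the IUPAC degenerate coding sequence for a peptide."""
    return ''.join(DEGENERATE_CODONS[aa] for aa in peptide.upper())


def fixed_and_degenerate(cds: str) -> tuple:
    """Count bases pinned to A/C/G/T versus ambiguous IUPAC symbols."""
    fixed = sum(base in 'ACGT' for base in cds)
    return fixed, len(cds) - fixed


if __name__ == '__main__':
    cds = backtranslate('MKTAYIAK')        # hypothetical peptide
    print(cds)                             # ATGAARACNGCNTAYATHGCNAAR
    print(fixed_and_degenerate(cds))       # (17, 7): 7 bases remain ambiguous
```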
Training set. Human mRNA sequences were obtained from GenBank on the basis of several criteria. The coding sequences were relatively short in order to avoid splicing and other variants of the mRNA. The sequences were identified by keywords indicating that a complete mRNA could be reconstructed, such as complete coding sequence (CDS), 5' and 3' untranslated regions (UTR), and poly(A) site. Multiple members of gene families were excluded to prevent overtraining on those sequences. The sequences were downloaded from Entrez at the NIH web site (http://www.ncbi.nlm.nih.gov/Entrez/) and the coding sequence from each was saved into a file. Up to the first 75 nucleotides of the CDS were selected for this study, in a window starting at the methionine ATG start site.

Binary representations. In order to train the neural network (NN), it was necessary to formulate an encoding scheme, because the NN architecture accepts only binary inputs and does not allow a direct representation of nucleic or amino acid sequences. Therefore, a binary numeric representation was used to encode the amino acid data. Several Microsoft Word 97 macros were recorded to convert amino acids and nucleic acids into numerical values, using the find and replace commands for each of the twenty amino acids and the four nucleotides. The individual numeric-encoded sequence files were then joined into groups. For this study a total of sixty mRNAs were examined with different window lengths, which changed the total size of the training set (White, 1998). The nomenclature for each group identifies the number of sequences used and the number of codons taken from each sequence. For example, Training Set 60S-10C contains sixty sequences with a window of ten codons taken from each sequence, giving 600 codons in the set. A related study predicting bases in tRNA sequences used a window size of 15 bases (Sun et al., 1995), while this study used a window of 10 codons, or 30 bases.

Neural network. All work with the NN was performed on a Sun SPARCstation 20 computer. The NN used was a utility of Partek 2.0b4 called a multi-layer perceptron (MLP). An MLP is a NN with at least three layers: the input layer, the output layer, and one or more hidden layers. Each layer is attached to the next by connection weights that are changed during the training process to reduce the overall error; this allows the network to "learn" patterns in the mRNA sequences. Training was stopped when the change in the total output error became less than 0.1% from the previous iteration, which usually occurred after 500 to 1200 iterations using the backpropagation learning method. Test sets were assembled to assess the predictive accuracy of the trained NN. Each test set consisted of 3 randomly selected human gene sequences from the same group of sequences from which the training set was selected. The predicted output was measured in 3 categories: the overall percent correct, the percent correct for degenerate bases, and the percent correct for fixed bases. These measures allow assessment of the various schemes used to encode the amino acids.
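The original encoding was carried out with Microsoft Word 97 macros and the Partek 2.0b4 MLP, so no source code accompanies the paper. The following hypothetical sketch only illustrates, under assumed one-hot encodings for both amino acids and bases, how a single training pair could be built from the first codons of a CDS window such as those in Training Set 60S-10C; the codon-table subset and example sequence are placeholders.

```python
# Hypothetical data-preparation sketch; the study itself used Microsoft Word 97
# macros and the Partek 2.0b4 MLP, not this code. One-hot encodings for both
# amino acids and bases are assumptions made for illustration.

from typing import List, Tuple

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
BASES = 'ACGT'

# Minimal codon-table subset covering the example CDS below (standard code).
CODON_TABLE = {'ATG': 'M', 'AAA': 'K', 'ACT': 'T', 'GCC': 'A', 'TAC': 'Y'}


def one_hot(symbol: str, alphabet: str) -> List[float]:
    """1-of-N binary vector for a symbol drawn from the given alphabet."""
    return [1.0 if symbol == s else 0.0 for s in alphabet]


def encode_window(cds: str, n_codons: int) -> Tuple[List[float], List[float]]:
    """Build one training pair from the first n_codons of a coding sequence.

    Inputs:  one-hot amino acids, 20 units per residue.
    Targets: one-hot bases, 4 units per nucleotide (12 per codon).
    """
    inputs, targets = [], []
    for i in range(0, 3 * n_codons, 3):
        codon = cds[i:i + 3]
        inputs.extend(one_hot(CODON_TABLE[codon], AMINO_ACIDS))
        for base in codon:
            targets.extend(one_hot(base, BASES))
    return inputs, targets


if __name__ == '__main__':
    x, y = encode_window('ATGAAAACTGCCTAC', n_codons=5)
    print(len(x), len(y))   # 100 input units and 60 target units for 5 codons
```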
Encoding the amino acids

Adding degeneracy information

Binary encoding

Comparing the schemes
One of the possible uses of this research is to improve the design of oligonucleotide probes (Eberhardt, 1992). One primer-design study found an overall homology greater than 82% between the predicted probe and the target sequence when codon utilization and dinucleotide frequencies were taken into account (Lathe, 1985). When sequence stretches lacking serine, arginine, and leucine were selected, the overall homology rose to 85.7% in Lathe's study. Our best network correctly predicted 85% of the degenerate bases and 93% of the overall bases. The data set used in Lathe's study contained 13,000 nucleotides, while our largest training set had 4,500 nucleotides. Therefore, an increase in our network or training set size could lead to even greater accuracy by detecting patterns of codon choice within the mRNA sequences. The architecture of the amino acid encoding method apparently does not have a large impact on predictive accuracy, as found in this study. Therefore, other factors, such as computational time or memory size, may be the criteria used to select an encoding scheme for a larger training set. It is also interesting to note that the network that predicted the highest percentage of correct overall bases did so on a test set containing eight leucines, one arginine, and two serines. These amino acids present difficulties for algorithms based on codon lookup tables, such as Lathe's method or common primer selection programs (e.g. Nash, 1993). The work reported here demonstrates that a NN approach may yield improvements in predictive accuracy for PCR primer selection.
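As a minimal sketch of how the three accuracy measures defined in the methods (overall, degenerate-base, and fixed-base percent correct) could be computed, the illustration below assumes that a position counts as fixed when the amino acid's IUPAC degenerate codon already specifies a single base; it is not the authors' scoring code.

```python
# Minimal sketch (assumed, not the authors' scoring code) of the three accuracy
# measures reported above: overall, degenerate-base and fixed-base percent
# correct. A position counts as fixed when the amino acid's IUPAC degenerate
# codon already specifies a single base (A, C, G or T).

def score_prediction(predicted: str, actual: str, degenerate_cds: str):
    """Return (overall, degenerate, fixed) percent-correct values.

    degenerate_cds holds one IUPAC symbol per position, e.g. 'GCN' for Ala,
    aligned with the predicted and actual nucleotide sequences.
    """
    deg_correct = deg_total = fix_correct = fix_total = 0
    for pred, act, code in zip(predicted, actual, degenerate_cds):
        if code in 'ACGT':              # base fixed by the genetic code
            fix_total += 1
            fix_correct += pred == act
        else:                           # degenerate position (N, R, Y, H, ...)
            deg_total += 1
            deg_correct += pred == act

    def pct(correct, total):
        return 100.0 * correct / total if total else 0.0

    overall = pct(deg_correct + fix_correct, deg_total + fix_total)
    return overall, pct(deg_correct, deg_total), pct(fix_correct, fix_total)


if __name__ == '__main__':
    # Toy example: one alanine codon (GCN) predicted as GCC, actual GCG.
    print(score_prediction('GCC', 'GCG', 'GCN'))   # ~66.7% overall, 0% degenerate, 100% fixed
```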
References

Chandonia, J. and Karplus, M. (1996) The importance of larger data sets for protein secondary structure prediction with neural networks. Protein Science 5:768-774.
Demeler, B. and Zhou, G. (1991) Neural network optimization for E. coli promoter prediction. Nucleic Acids Research 19:1593-1599.
Eberhardt, N. (1992) A shell program for the design of PCR primers using Genetics Computer Group software. BioTechniques 13:914-916.
Granjeon, E. and Tarroux, P. (1995) Detection of compositional constraints in nucleic acids using neural networks. CABIOS 11:29-37.
Holley, L. and Karplus, M. (1989) Protein secondary structure prediction with a neural network. Proceedings of the National Academy of Sciences USA 86:152-156.
Karlin, S. and Brendel, V. (1993) Patchiness and correlations in DNA sequences. Science 259:677-680.
Lapedes, A., Barnes, C., Burks, C., Farber, R. and Sirotkin, K. (1990) Application of neural networks and other machine learning algorithms to DNA sequence analysis. In: Bell, G. and Marr, T. (eds.), Computers and DNA: SFI Studies in the Sciences of Complexity, Vol. 7. Addison-Wesley, Reading, MA, pp. 157-182.
Lathe, R. (1985) Synthetic oligonucleotide probes deduced from amino acid sequence data. Journal of Molecular Biology 183:1-12.
Nash, J. (1993) A computer program to calculate and design oligonucleotide primers from amino acid sequences. CABIOS 9:469-471.
Ogura, H., Agata, H., Xie, M., Odaka, T. and Furutani, H. (1997) A study of learning splice sites of DNA sequence by neural networks. Computers in Biology and Medicine 27:67-75.
O'Neill, M. (1991) Training back-propagation neural networks to define and detect DNA-binding sites. Nucleic Acids Research 19:313-318.
Peng, C., Buldyrev, S., Goldberger, A., Havlin, S., Sciortino, F., Simons, M. and Stanley, H. (1992) Long-range correlations in nucleotide sequences. Nature 356:168-170.
Snyder, E. and Stormo, G. (1993) Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucleic Acids Research 21:607-613.
Sun, J., Song, W.-Y., Zhu, L.-H. and Chen, R.-S. (1995) Analysis of tRNA gene sequences by neural network. Journal of Computational Biology 2:409-416.
Tolstrup, N., Toftgard, J., Engelbrecht, J. and Brunak, S. (1994) Neural network model of the genetic code is strongly correlated to the GES scale of amino acid transfer free energies. Journal of Molecular Biology 243:816-820.
Uberbacher, E. and Mural, R. (1991) Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proceedings of the National Academy of Sciences USA 88:11261-11265.
White, G. (1998) Detection of codon usage patterns for backtranslation using a neural network. Masters Thesis, Biology Department, Clark Atlanta University, Atlanta, GA.
Table caption: Shown are the percent of correctly predicted degenerate bases in a test set composed of three sequences selected randomly from the same group of sequences from which the training set was assembled.