Prediction of xylanase optimal temperature by support vector regression

Guangya Zhang*1 · Huihua Ge1

1Huaqiao University, College of Chemical Engineering, Xiamen, Fujian, PR China

*Corresponding author: zhgyghh@hqu.edu.cn

Financial support: This work was supported by the National Natural Science Foundation of China (No. 20806031), the Cultivation Project of Huaqiao University for the China National Funds for Distinguished Young Scientists (No. JB-GJ1006) and the Program for New Century Excellent Talents in Universities of Fujian Province (No. 07176C02).

Keywords: amino acid composition, optimum temperature, support vector machine, uniform design, xylanase.
Background: Support vector machine (SVM), a powerful novel machine learning technique, was used to develop a non-linear quantitative structure-property relationship (QSPR) model of G/11 family xylanases based on amino acid composition. The uniform design (UD) method was applied for the first time to optimize the running parameters of SVM.

Results: The optimum temperatures predicted by leave-one-out (LOO) cross-validation fitted the experimental optimum temperatures very well when the running parameters C, Ɛ and γ were 50, 0.001 and 1.5, respectively. The root-mean-square error (RMSE) of the LOO cross-validation was 9.53ºC, while that of the back propagation neural network (BPNN) was 11.55ºC. The predictive ability of SVM is a minor improvement over BPNN, but it is superior to the reported method based on stepwise regression. Two experimental examples validated the model for predicting the optimal temperature of xylanase.

Conclusion: The results indicate that UD can be an effective method for optimizing the parameters of SVM, which could serve as an alternative powerful modeling tool for QSPR studies of xylanase.

Xylanase has a wide range of potential biotechnological applications. Interest in xylanase has markedly increased recently due to its potential industrial uses, particularly in pulping and bleaching processes (Beg et al. 2001; Diaz et al. 2004; Oliveira et al. 2006). The thermo-alkalophilic conditions of xylanase-aided bleaching (60-80ºC, pH 8-10), combined with the demand for a high level of activity, require a set of characteristics not usually found in native xylanases. An alternative route to new thermostable enzymes is to modify presently used xylanases to be more stable under extreme conditions. During the last twenty years, rational site-directed mutation (Moreau et al. 1994) and irrational directed evolution (Fenel et al. 2006) have become routine approaches for engineering xylanases toward this goal. Meanwhile, the so-called ‘semi-rational’ approach, which uses computational techniques to screen protein sequences in silico or to enhance the efficiency of directed evolution, has become an emerging area in protein engineering; however, it has not yet been employed in xylanase engineering. This approach has been applied elsewhere with some success (Hayes et al. 2002; Mildvan, 2004), and researchers believe it may pave the way to exciting areas of enzyme research, including efficient engineering of existing biocatalysts (Chica et al. 2005). Protein design algorithms (mathematical models) that provide quantitative structure-property relationships (QSPR) of proteins are the core of the ‘semi-rational’ approach.

The support vector machine (SVM) is a promising new classification and regression method developed by Vapnik (1998). SVM has two distinct features: it has high generalization ability, and it requires only a small number of training samples. According to the literature, SVM has shown promising results on several biological problems and is becoming established as a standard tool in bioinformatics (Ward et al. 2003; Cai et al. 2004; Chen et al. 2006). In the present investigation, SVM was used to establish a model for predicting the optimum temperature of xylanases in the G/11 family. During the process, the uniform design method was applied to optimize the running parameters of SVM.
The aim was to establish a new QSPR model and to confirm the possibility of predicting the optimum temperature of xylanases. The performance of SVM was better than those of the back propagation neural network (BPNN) and the reported models, and the model may be useful for computational virtual screening in the engineering of more thermostable xylanases.

Dataset construction

To reduce redundancy, we downloaded the xylanase sequences from UniProt, as it contains records with full manual annotation or computer-assisted, manually-verified annotation performed by biologists and based on published literature and sequence analysis (Bairoch et al. 2005). The optimum temperatures of the xylanases, obtained from Liu’s work (Liu et al. 2006), are shown in Table 1. Altogether, 25 xylanase sequences and their corresponding optimum temperatures were obtained.

Support vector regression (Cortes and Vapnik, 1995; Vapnik, 1998)

SVM can be applied to both classification and regression; here we used support vector regression (SVR). The basic idea of SVR is to map the data X into a higher-dimensional feature space F via a nonlinear mapping Φ and then to perform linear regression in this space. Regression approximation thus addresses the problem of estimating a function from a given data set G = \{(x_i, d_i)\}_{i=1}^{n} (x_i is the input vector, d_i is the desired value, and n is the total number of data patterns), and SVM approximates the function using Equation 1:

f(x) = w \cdot \Phi(x) + b    [Equation 1]
where Φ(x) denotes the high-dimensional feature space that is nonlinearly mapped from the input space x. The coefficients w and b are estimated by minimizing the regularized risk function given by Equation 2:

R(C) = C \frac{1}{n} \sum_{i=1}^{n} L_\varepsilon(d_i, y_i) + \frac{1}{2} \|w\|^2    [Equation 2]
The first term in Equation 2 is the empirical error (risk), measured by the Ɛ-insensitive loss function given by Equation 3:

L_\varepsilon(d, y) = \begin{cases} |d - y| - \varepsilon, & |d - y| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases}    [Equation 3]

The second term in Equation 2 is the regularization term. C is the regularization constant and determines the trade-off between the empirical risk and the regularization term. Ɛ is called the tube size and is equivalent to the approximation accuracy placed on the training data points. Both C and Ɛ are user-prescribed parameters. To obtain the estimates of w and b, Equation 2 is transformed into the primal function given by Equation 4 by introducing the positive slack variables ξ_i and ξ_i^* as follows:

\text{minimize } R(w, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)
\text{subject to } d_i - w \cdot \Phi(x_i) - b \le \varepsilon + \xi_i, \quad w \cdot \Phi(x_i) + b - d_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0    [Equation 4]

Finally, the regression function given by Equation 1 takes the following explicit form (Equation 5):

f(x) = \sum_{i=1}^{n_{SV}} (a_i - a_i^*) K(x, x_i) + b    [Equation 5]

where n_{SV} is the number of support vectors (SVs), a_i and a_i^* are the introduced Lagrange multipliers, which satisfy a_i \cdot a_i^* = 0, a_i \ge 0, a_i^* \ge 0, and K is the kernel function, given by Equation 6:

K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)    [Equation 6]
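As a concrete illustration, the Ɛ-insensitive loss of Equation 3 can be written in a few lines of code. This is a minimal sketch in Python (the original study used WEKA; the function name and example values below are our own, not from the paper):

    import numpy as np

    def epsilon_insensitive_loss(d, y, eps):
        """Epsilon-insensitive loss (Equation 3): deviations smaller than
        eps fall inside the tube and cost nothing; larger deviations are
        penalized linearly."""
        return np.maximum(np.abs(d - y) - eps, 0.0)

    # Example: desired values d, predicted values y, tube size eps = 2.0
    d = np.array([50.0, 60.0, 75.0])
    y = np.array([51.0, 65.0, 74.5])
    print(epsilon_insensitive_loss(d, y, eps=2.0))  # -> [0. 3. 0.]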
Linear and radial basis function (RBF, Gaussian function) kernels are two commonly used kernels in SVR (Smola and Schölkopf, 1998); they are given by Equation 7 and Equation 8, respectively.

Linear kernel:

K(x_i, x_j) = x_i \cdot x_j    [Equation 7]
RBF kernel:

K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)    [Equation 8]

where γ is a constant, the kernel parameter; it controls the width of the Gaussian kernel (although it is not itself the width) and therefore controls the generalization ability of SVM. The generalization performance of SVR depends on a good setting of the parameters C and Ɛ, the kernel type, and the corresponding kernel parameters. Here, uniform design was employed to optimize the running parameters.

Uniform design

Uniform design (UD) was first proposed by Fang (1980), based on theoretical results in number-theoretic methods. Generally speaking, UD is a form of ‘space filling’ design. Suppose the experimental domain consists of s factors and h(x) is a response of the experiment. In many cases, this domain can be assumed to be the unit cube C^s. The expected value of h(x) over the experimental domain, Eh(x), can be estimated by the sample mean

\hat{h} = \frac{1}{n} \sum_{x_k \in \rho} h(x_k)

where ρ is a set of n experimental points over the domain. The famous Koksma-Hlawka inequality gives an upper bound on the error of this estimate of Eh(x):

|Eh(x) - \hat{h}| \le V(h) D(\rho)

where V(h) is a measure of the variation of h and D(ρ) is the discrepancy of the set ρ. The inequality indicates that the more uniformly the set ρ is scattered over the experimental region, the more accurately ĥ estimates Eh(x) (Zhang et al. 1998). Thus, obtaining experimental points that are most uniformly scattered over the domain is the key step in uniform design. Uniform design has its own merits, such as flexibility in arranging experimental runs and robustness against model uncertainty. For more detailed information, see the works of Fang and Liang (Fang and Yang, 2000; Liang et al. 2001).

The overall performances of SVM and BPNN were evaluated in terms of the root-mean-square error (RMSE) and the mean absolute error (MAE), defined as

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \quad MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

where y_i and ŷ_i stand for the actual value and the fitted value (or predicted value), respectively.

Cross-validation

The performance and robustness of the models were evaluated by cross-validation. The jackknife test (leave-one-out, LOO) was used; it is deemed the most rigorous and objective test with the least arbitrariness, as demonstrated by an incisive analysis in a recent review (Chou and Shen, 2007). We used 24 data points to train the models and tested them on the remaining one. This was repeated 25 times, each time leaving a different data point out of the training set and using it to validate the resulting model.

Software and computation environment

The 20 amino acid compositions of the xylanases were calculated with the BioEdit software (version 5.0.9), and each xylanase in the data set was then characterized by a vector x_i (i = 1, …, N). The input vector x_i has 20 coordinates, one for each amino acid composition (in percentages). SVR and BPNN were performed with the software WEKA, a Java package providing an environment for the implementation of a large number of machine learning and statistical algorithms (Frank et al. 2004). All computations were carried out on a Pentium IV computer with a 2.7 GHz processor and 512 MB of RAM.

Optimizing the parameters of linear kernel SVM based on UD

Similar to other multivariate statistical models, the regression performance of SVM depends on the combination of several parameters: the penalty value C, the Ɛ of the Ɛ-insensitive loss function, the kernel type K, and its corresponding parameters. To obtain the best generalization ability, some strategy is needed for optimizing these factors.
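To make the procedure concrete, the sketch below mimics a 16-run uniform-design search over (C, Ɛ, γ), scoring each run by LOO cross-validation MAE. This is a minimal illustration in Python with scikit-learn rather than the WEKA setup actually used; the level values are placeholders, and the simple lattice construction only approximates a true U16 uniform design table:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import LeaveOneOut, cross_val_predict

    # X: 25 x 20 matrix of amino acid compositions (%), y: optimal temperatures (ºC).
    # Placeholder random data stands in for the real dataset of Table 1.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 15, size=(25, 20))
    y = rng.uniform(30, 90, size=25)

    # 16 runs over three factors; a true uniform design would take the rows
    # from a published U16 table rather than this good-lattice-style shortcut.
    levels_C = np.geomspace(0.1, 500, 16)
    levels_eps = np.geomspace(1e-4, 1.0, 16)
    levels_gamma = np.geomspace(0.01, 10, 16)
    design = [(levels_C[i], levels_eps[(5 * i) % 16], levels_gamma[(9 * i) % 16])
              for i in range(16)]  # rows spread evenly over the 16^3 grid

    best = None
    for C, eps, gamma in design:
        model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=gamma)
        pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
        mae = np.mean(np.abs(y - pred))  # LOO cross-validation MAE
        if best is None or mae < best[0]:
            best = (mae, C, eps, gamma)
    print("best LOO MAE %.2f at C=%.3g, eps=%.3g, gamma=%.3g" % best)

In practice one would inspect the training error of each run as well, since the text below chooses the final parameters from both criteria.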
There are four possible choices of kernel function: linear, polynomial, radial basis function (RBF), and sigmoid. For the regression task we selected the linear and RBF kernels. The linear kernel has only two parameters, C and Ɛ. Here, the UD method was employed to optimize the combination of parameters based on both LOO cross-validation and training. The UD table for 2 factors with 16 levels was used, and the results are shown in Table 2. For the linear kernel, we found that when the regularization parameter (C) and the Ɛ of the Ɛ-insensitive loss function were 1 and 0.005, respectively, the MAEs of LOO cross-validation and training were 7.56ºC and 1.35ºC, respectively, with an average MAE of 4.46ºC. The LOO cross-validation results of the 9th run (R9) were slightly better, but its training results were much worse. On the other hand, the training results of the 6th run (R6) were slightly better than those of our chosen run (R12), but the LOO cross-validation results of R6 were much worse (by about 5.64ºC). From the results of R7, one can see that parameters should not be chosen based only on the training error, as this can easily lead to over-fitting. The optimal C and Ɛ for the linear SVM were therefore chosen as 1 and 0.005. The training and LOO cross-validation results of the linear SVM are shown in Table 1 and Figure 1.

Optimizing the parameters of RBF kernel SVM based on UD

The RBF kernel has three parameters: C, Ɛ and γ. Here, the UD method was again employed to optimize the combination of parameters based on both LOO cross-validation and training. The UD table for 3 factors with 16 levels was used, and the results are shown in Table 3. From Table 3, we can see that different combinations of the three parameters resulted in different MAE and RMSE values. When C, Ɛ and γ were 50, 0.001 and 1.5, respectively, the MAEs of LOO cross-validation and training were 6.88ºC and 0.04ºC, respectively, with an average MAE of 3.46ºC. Although the LOO cross-validation results of the 5th run (R5) were slightly better (by about 0.28ºC), its training results were much worse (by about 5.5ºC). From the results of R5, one can see that parameters should not be chosen based only on the LOO cross-validation error. The optimal C, Ɛ and γ for the RBF SVM were therefore chosen as 50, 0.001 and 1.5. The training and LOO cross-validation results of the RBF SVM are also shown in Table 1 and Figure 1.

According to Table 2 and Table 3, many different combinations of parameters resulted in the same LOO cross-validation and training errors, which means that SVMs are not very sensitive to their parameters. Meanwhile, the RBF kernel was superior to the linear kernel, in accordance with previous studies on support vector regression tasks (Xue et al. 2004; Liu et al. 2005).

Comparison with back propagation neural network (BPNN)

Recently, several studies have shown that SVM yields better results than alternative machine learning techniques such as BPNN. In this study, we compared the performance of SVM and BPNN on the same dataset. The architecture of BPNN was also optimized by UD, and the results are shown in Table 4. During the process, the maximum number of iterations was set to 1000. According to Table 4, the learning rate (η), the momentum parameter and the number of neurons in the hidden layer were chosen as 0.04, 0.6 and 11, respectively. The MAEs of LOO cross-validation and training were 7.73ºC and 0.97ºC, respectively, with an average MAE of 4.35ºC. The training and LOO cross-validation results of BPNN are shown in Table 1 and Figure 1.
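For readers who want to reproduce this kind of comparison outside WEKA, the sketch below pits an RBF SVR against a small single-hidden-layer network under the same LOO protocol. This is a minimal illustration in Python with scikit-learn; MLPRegressor is only a stand-in for the WEKA BPNN, and the data are placeholders:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import LeaveOneOut, cross_val_predict

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 15, size=(25, 20))   # amino acid compositions (%)
    y = rng.uniform(30, 90, size=25)        # optimal temperatures (ºC)

    models = {
        "RBF SVR": SVR(kernel="rbf", C=50, epsilon=0.001, gamma=1.5),
        "BPNN": MLPRegressor(hidden_layer_sizes=(11,), learning_rate_init=0.04,
                             momentum=0.6, solver="sgd", max_iter=1000,
                             random_state=0),
    }
    for name, model in models.items():
        pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
        rmse = np.sqrt(np.mean((y - pred) ** 2))
        mae = np.mean(np.abs(y - pred))
        print(f"{name}: LOO RMSE = {rmse:.2f} ºC, MAE = {mae:.2f} ºC")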
According to Table 4, one can observe that different combinations of parameters resulted in different LOO cross-validation and training MAEs and RMSEs. This suggests that BPNN is more sensitive to its running parameters than the SVMs, especially the RBF SVM. Moreover, the LOO cross-validation results of BPNN varied widely across the different training/validation splits: the maximum and minimum LOO cross-validation MAEs of BPNN were 38.89ºC and 0.2ºC, respectively, while the corresponding MAEs of the RBF SVM were 27.44ºC and 0.03ºC. The prediction errors of all 25 runs of BPNN and SVM are shown in Figure 2. For the linear SVM, 13 samples differed from their experimental optimal temperatures by less than 5ºC (│ERROR│ < 5ºC); the RBF SVM also had 13 such samples, while BPNN had 11. The predictive RMSEs of the linear SVM, RBF SVM and BPNN were 9.92ºC, 9.55ºC and 11.52ºC, respectively. As analyzed above, the SVM-based models were somewhat more robust than BPNN. This is consistent with the inherent advantages of SVM over BPNN, which is not robust, especially when only a small number of training samples is available.

To validate the prediction models, we present two examples. First, we cloned the xylanase gene of Bacillus pumilus, sequenced it and expressed it in Escherichia coli. The accession number of the gene in NCBI is EF090270 and the protein ID is ABM54186.1 (http://www.ncbi.nlm.nih.gov/nuccore/EF090270). The optimal temperature of the purified xylanase was 50ºC, as shown in Figure 3. We calculated the amino acid composition of the xylanase and used the models to predict its optimal temperature. The linear kernel SVM predicted 49.89ºC, the RBF kernel SVM 50.02ºC, and BPNN 49.94ºC; the corresponding MAEs were 0.11ºC, 0.02ºC and 0.06ºC. Second, we designed a new thermophilic xylanase, synthesized its coding gene de novo and expressed it in Escherichia coli. Its optimal temperature was 60ºC, and it retained over 50% of its activity after one hour at 70ºC (Fu et al. 2012). The predictions of the linear kernel SVM, RBF kernel SVM and BPNN were 54.77ºC, 55.25ºC and 56.05ºC, with MAEs of 5.23ºC, 4.75ºC and 3.95ºC, respectively. From these two experimental examples, we conclude that the proposed model may serve as a useful tool for QSPR studies of xylanases and facilitate the engineering of new ones.
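The feature extraction behind these predictions is simply a 20-dimensional composition vector. The sketch below computes it for a protein sequence and shows how it would feed a fitted model; a minimal Python illustration (BioEdit performed this step in the study; the short example sequence and the `model` are placeholders, not ABM54186.1 or the published model):

    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def aa_composition(seq):
        """Return the 20-dimensional amino acid composition (%) of a sequence."""
        seq = seq.upper()
        counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
        return 100.0 * counts / counts.sum()

    # Usage: x is the input vector for the QSPR model; with a fitted SVR
    # (see the earlier sketches), the predicted optimal temperature would be
    # model.predict(x.reshape(1, -1)).
    x = aa_composition("MKLVNSILASALLLAGSAAA")  # placeholder fragment
    print(np.round(x, 2))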
Some important parameters (C, Ɛ and γ) had to be optimized during SVM training and testing in order to obtain good predictive performance of the SVR model. Some studies have dealt with the optimization of running parameters (Xue et al. 2004; Yao et al. 2004; Liu et al. 2005), but all of them fixed two of the parameters and examined the curve of RMSE versus the remaining one to find its optimal value. The fixed values of the parameters were often selected based on human expertise or even mere experience. For example, researchers know that too small a value of C places insufficient stress on fitting the training data, while too large a value of C leads to over-fitting. But which values are neither too small nor too large? Different researchers may choose different values of C that they consider acceptable. Using uniform design to optimize the parameters therefore has at least two advantages over these methods. First, it allows a much larger search space over combinations of the parameters, which increases the chance of finding the optimal combination. Second, uniform design is much more economical: it needed only 16 runs for 3 factors with 16 levels, while their methods would need 48 runs. This is due to the inherent advantages of uniform design.

Recently, two linear models relating single residues and dipeptides to the optimum temperature of G/11 family xylanases were established by stepwise regression (Liu et al. 2006). The training RMSEs of these models were 5.03ºC and 1.91ºC, respectively, and they calculated the maximal and minimal optimum temperatures of xylanase as 120.84ºC and 10.83ºC. From these results we conclude that the model established here is much more reliable. This indicates that the relationship between amino acid composition and the xylanase optimum temperature is very complicated, and satisfactory results may not be obtained from simple linear models, whereas SVM is a more powerful tool for modeling nonlinearities.

Using the crystal structure information of a xylanase, one can pinpoint the residues that may be suitable for mutation. Saturation mutagenesis (in which all 20 native amino acids are tested at each pinpointed position) can then be applied to generate large virtual libraries of mutants. Our model for predicting xylanase optimal temperatures can be used to pre-screen these virtual libraries, as sketched below. The optimal sequences are chosen based on their predicted optimal temperatures; the corresponding mutants are then generated experimentally by mutagenesis and recombination. The model can thus shrink the sequence space, while maintaining broad diversity, to a number easily amenable to experimental screening.

As analyzed above, SVM showed only a minor improvement over BPNN in our study, and the large variation in prediction error (from 0.03ºC to 27.44ºC) indicates that it should be used with some caution. Moreover, the MAE of LOO cross-validation was 6.88ºC and the mean absolute percentage error was 12.8%, which is not yet good enough for directing xylanase engineering. Further improvement may be achieved by collecting more datasets of higher quality; it should be possible to increase the number of data entries and eliminate noisy entries from the updated databases. We think that when the MAE of LOO cross-validation is within 5ºC, the model may be good enough for directing xylanase engineering, and our results are close to this objective.
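As an illustration of this pre-screening workflow, the sketch below enumerates single-site saturation mutants of a sequence, scores each by its predicted optimal temperature, and keeps the hottest candidates. This is a minimal Python sketch, not the authors' pipeline; the positions, the wild-type sequence and the fitted `model` (an SVR from the earlier sketches) are hypothetical:

    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def aa_composition(seq):
        counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
        return 100.0 * counts / counts.sum()

    def prescreen(model, seq, positions, top=10):
        """Enumerate saturation mutants at the given positions and rank them
        by the model's predicted optimal temperature (highest first)."""
        candidates = []
        for pos in positions:               # pinpointed from the crystal structure
            for aa in AMINO_ACIDS:
                if aa == seq[pos]:
                    continue                # skip the wild-type residue
                mutant = seq[:pos] + aa + seq[pos + 1:]
                t_pred = model.predict(aa_composition(mutant).reshape(1, -1))[0]
                candidates.append((t_pred, f"{seq[pos]}{pos + 1}{aa}"))
        return sorted(candidates, reverse=True)[:top]

    # Usage (model fitted beforehand on the 25-sequence dataset):
    #   for t, mut in prescreen(model, wild_type_seq, positions=[10, 34, 91]):
    #       print(f"{mut}: predicted Topt = {t:.1f} ºC")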
References

BAIROCH, A.; APWEILER, R.; WU, C.H.; BARKER, W.C.; BOECKMANN, B.; FERRO, S.; GASTEIGER, E.; HUANG, H.; LOPEZ, R.; MAGRANE, M.; MARTIN, M.J.; NATALE, D.A.; O'DONOVAN, C.; REDASCHI, N. and YEH, L.S. (2005). The universal protein resource (UniProt). Nucleic Acids Research, vol. 33, no. 1, p. D154-D159.

BEG, Q.A.; KAPOOR, M.; MAHAJAN, G. and HOONDAL, S. (2001). Microbial xylanases and their industrial applications: A review. Applied Microbiology and Biotechnology, vol. 56, no. 3-4, p. 326-338.

CAI, C.Z.; HAN, L.Y.; JI, Z.L. and CHEN, Y.Z. (2004). Enzyme family classification by support vector machines. Proteins: Structure, Function, and Bioinformatics, vol. 55, no. 1, p. 66-76.

CHEN, C.; TIAN, Y.X.; ZOU, X.Y.; CAI, P.X. and MO, J.Y. (2006). Using pseudo-amino acid composition and support vector machine to predict protein structural class. Journal of Theoretical Biology, vol. 243, no. 3, p. 444-448.

CHICA, R.A.; DOUCET, N. and PELLETIER, J.N. (2005). Semi-rational approaches to engineering enzyme activity: Combining the benefits of directed evolution and rational design. Current Opinion in Biotechnology, vol. 16, no. 4, p. 378-384.

CHOU, K.C. and SHEN, H.B. (2007). Recent progresses in protein subcellular location prediction. Analytical Biochemistry, vol. 370, no. 1, p. 1-16.

CORTES, C. and VAPNIK, V. (1995). Support-vector networks. Machine Learning, vol. 20, no. 3, p. 273-297.

DIAZ, M.; RODRIGUEZ, S.; FERNÁNDEZ-ABALOS, J.M.; RIVAS, J.D.L.; RUIZ-ARRIBAS, A.; SHNYROV, V.L. and SANTAMARÍA, R.I. (2004). Single mutations of residues outside the active center of the xylanase Xys1Δ from Streptomyces halstedii JM8 affect its activity. FEMS Microbiology Letters, vol. 240, no. 2, p. 237-243.

FANG, K.T. (1980). The uniform design: Application of number-theoretic methods in experimental design. Acta Mathematicae Applicatae Sinica, vol. 3, p. 363-372.

FANG, K.T. and YANG, Z.H. (2000). On uniform design of experiments with restricted mixtures and generation of uniform distribution on some domains. Statistics and Probability Letters, vol. 46, no. 2, p. 113-120.

FENEL, F.; ZITTING, A.J. and KANTELINEN, A. (2006). Increased alkali stability in Trichoderma reesei endo-1,4-β-xylanase II by site directed mutagenesis. Journal of Biotechnology, vol. 121, no. 1, p. 102-107.

FRANK, E.; HALL, M.; TRIGG, L.; HOLMES, G. and WITTEN, I.H. (2004). Data mining in bioinformatics using Weka. Bioinformatics, vol. 20, no. 15, p. 2479-2481.

FU, X.P.; WANG, W.Y. and ZHANG, G.Y. (2012). Construction of an expression vector with elastin-like polypeptide tag to purify the xylanase with non-chromatographic method. Acta Microbiologica Sinica. In press.

HAYES, R.J.; BENTZIEN, J.; ARY, M.L.; HWANG, M.Y.; JACINTO, J.M.; VIELMETTER, J.; KUNDU, A. and DAHIYAT, B.I. (2002). Combining computational and experimental screening for rapid optimization of protein properties. Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 25, p. 15926-15931.

LIANG, Y.Z.; FANG, K.T. and XU, Q.S. (2001). Uniform design and its applications in chemistry and chemical engineering. Chemometrics and Intelligent Laboratory Systems, vol. 58, no. 1, p. 43-57.

LIU, H.X.; YAO, X.J.; XUE, C.X.; ZHANG, R.S.; LIU, M.C.; HU, Z.D. and FAN, B.T. (2005). Study of quantitative structure-mobility relationship of the peptides based on the structural descriptors and support vector machines. Analytica Chimica Acta, vol. 542, no. 2, p. 249-259.
LIU, L.; DONG, H.; WANG, S.; CHEN, H. and SHAO, W. (2006). Computational analysis of di-peptides correlated with the optimal temperature in G/11 xylanase. Process Biochemistry, vol. 41, no. 2, p. 305-311.

MILDVAN, A.S. (2004). Inverse thinking about double mutants of enzymes. Biochemistry, vol. 43, no. 46, p. 14517-14520.

MOREAU, A.; SHARECK, F.; KLUEPFEL, D. and MOROSOLI, R. (1994). Increase in catalytic activity and thermostability of the xylanase A of Streptomyces lividans 1326 by site-specific mutagenesis. Enzyme and Microbial Technology, vol. 16, no. 5, p. 420-424.

OLIVEIRA, L.A.; PORTO, A.L.F. and TAMBOURGI, E.B. (2006). Production of xylanase and protease by Penicillium janthinellum CRC 87m-115 from different agricultural wastes. Bioresource Technology, vol. 97, no. 6, p. 862-867.

SMOLA, A.J. and SCHÖLKOPF, B. (1998). A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030. Royal Holloway College, University of London, UK.

VAPNIK, V. (1998). Statistical learning theory. John Wiley and Sons, New York. 740 p. ISBN 0-471-03003-1.

WARD, J.J.; MCGUFFIN, L.J.; BUXTON, B.F. and JONES, D.T. (2003). Secondary structure prediction with support vector machines. Bioinformatics, vol. 19, no. 13, p. 1650-1655.

XUE, C.X.; ZHANG, R.S.; LIU, H.X.; LIU, M.C.; HU, Z.D. and FAN, B.T. (2004). Support vector machines-based quantitative structure-property relationship for the prediction of heat capacity. Journal of Chemical Information and Computer Sciences, vol. 44, no. 4, p. 1267-1274.

YAO, X.J.; PANAYE, A.; DOUCET, J.P.; ZHANG, R.S.; CHEN, H.F.; LIU, M.C.; HU, Z.D. and FAN, B.T. (2004). Comparative study of QSAR/QSPR correlations using support vector machines, radial basis function neural networks, and multiple linear regressions. Journal of Chemical Information and Modeling, vol. 44, no. 4, p. 1257-1266.

ZHANG, L.; LIANG, Y.Z.; JIANG, J.H.; YU, R.Q. and FANG, K.T. (1998). Uniform design applied to nonlinear multivariate calibration by ANN. Analytica Chimica Acta, vol. 370, no. 1, p. 65-77.