

Authors: 赵伟, 许尤厚, 郑甲, 王玉光, 周洪波

Pages: 2543-2550


Key words: amino acid composition; n-peptide composition; di-residue coupling; protein stability; support vector machines

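Abstract: A total of 77 thermophilic and 65 thermostable lipase sequences of microbial origin were retrieved from the GenBank database. The occurrence frequencies of the 20 amino acids, the differences in dipeptide and tripeptide fragment occurrences, and the preferences for non-adjacent residue pairs in these sequences were analyzed statistically. On this basis, support vector machines (SVMs) were used to classify the sequences. The results show that, among the 20 natural amino acid residues, leucine, proline, methionine, phenylalanine, tryptophan and tyrosine occur with statistically higher frequency in the thermophilic sequences than in the thermostable ones. The dipeptide fragments KC, EE, KE, RE, VE, YI, EK, VK, EV, YV, EY, KY, VY and YY occur significantly more often in the thermophilic proteins than in the thermostable proteins. The occurrence frequencies of tripeptide fragments and the sequence preferences of non-adjacent residue pairs are also significantly correlated with protein thermostability. The classification accuracy is 99.65% on the training set and 98.41% on the real dataset.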


The amino acid compositions, the distributions of n-peptide fragments (n = 2, 3; neighboring residues) and the non-adjacent di-residue coupling patterns in the sequences of 65 thermostable and 77 thermophilic lipases retrieved from GenBank were systematically analyzed. Based on this information, a statistical method using support vector machines (SVMs) was developed to discriminate thermophilic from thermostable lipases. The results show that the hydrophobic residues Leu, Pro, Met, Phe and Trp, as well as the polar residue Tyr, occur more frequently in thermophilic lipases than in thermostable ones. The dipeptides KC, EE, KE, RE, VE, YI, EK, VK, EV, YV, EY, KY, VY and YY are significantly more frequent in thermophilic proteins than in thermostable ones. The dipeptide, tripeptide and non-adjacent di-residue compositions contain more information than the amino acid composition alone, and this information points to a possible thermostability mechanism of microbial lipases. The accuracy of the method is 99.65% on the training dataset and 98.41% on the testing datasets.
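
The sequence descriptors named above (amino acid composition, n-peptide composition, and non-adjacent di-residue coupling) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the normalisation by window count, the handling of non-standard residues, and the choice of a single gap of one residue are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
_VALID = set(AMINO_ACIDS)


def _frequencies(fragments, n):
    """Turn a list of n-letter fragments into relative frequencies over all 20**n keys."""
    # Assumption: fragments containing non-standard letters (e.g. X) are skipped.
    fragments = [f for f in fragments if set(f) <= _VALID]
    counts = Counter(fragments)
    total = len(fragments)
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def npeptide_composition(seq, n=1):
    """Relative frequency of each contiguous n-residue fragment (n=1 gives amino acid composition)."""
    seq = seq.upper()
    windows = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return _frequencies(windows, n)


def gapped_pair_composition(seq, gap=1):
    """Relative frequency of residue pairs separated by `gap` intervening residues."""
    seq = seq.upper()
    step = gap + 1
    pairs = [seq[i] + seq[i + step] for i in range(len(seq) - step)]
    return _frequencies(pairs, 2)


def feature_vector(seq):
    """Concatenate the 20 + 400 + 400 descriptors into one list (hypothetical layout)."""
    parts = (
        npeptide_composition(seq, 1),
        npeptide_composition(seq, 2),
        gapped_pair_composition(seq, 1),
    )
    return [value for part in parts for value in part.values()]
```

Vectors of this kind, one per sequence and labelled thermophilic or thermostable, could then be passed to any SVM implementation; the kernel and parameters used in the paper are not stated in this abstract.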


