Applied Sciences, Vol. 9, Issue 20 (2019)
Title

Sample Reduction Strategies for Protein Secondary Structure Prediction

Sema Atasever, Zafer Aydin, Hasan Erbay and Mostafa Sabzekar

Abstract

Predicting the secondary structure from protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding the function of proteins. As new genes and proteins are discovered, the size of the protein databases and datasets that can be used for training prediction models grows considerably. A two-stage hybrid classifier, which employs dynamic Bayesian networks and a support vector machine (SVM), has been shown to provide state-of-the-art prediction accuracy for protein secondary structure prediction. However, SVM is not efficient for large datasets due to the quadratic optimization involved in model training. In this paper, two techniques are implemented on the CB513 benchmark for reducing the number of samples in the train set of the SVM. The first method randomly selects a fraction of data samples from the train set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the train set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the data samples with a hierarchical clustering algorithm and replaces the train set samples with the nearest neighbors of the cluster centers in order to reduce the training time. The number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the train set by 26% without reducing the prediction accuracy. Among the clustering techniques, Ward's method provided the best accuracy on test data.
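
The two sample reduction ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (stratified_subsample, ward_clusters, cluster_reduce) are hypothetical, the clustering is a naive pure-Python agglomeration using Ward's merge cost, and the nearest neighbors of each cluster center are assumed to be searched over the whole train set. The SVM training step itself is not shown.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices of a random subset that keeps roughly
    `fraction` of the samples from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        # Keep at least one sample per class so no class disappears.
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def _centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's merge cost.
    Returns a list of clusters, each a list of sample indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            ca = _centroid([points[i] for i in clusters[a]])
            for b in range(a + 1, len(clusters)):
                cb = _centroid([points[i] for i in clusters[b]])
                na, nb = len(clusters[a]), len(clusters[b])
                # Ward's criterion: increase in within-cluster variance
                # caused by merging clusters a and b.
                cost = na * nb / (na + nb) * _sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def cluster_reduce(points, n_clusters, n_neighbors):
    """Keep the n_neighbors samples closest to each cluster center.
    Assumption: neighbors are searched over all samples."""
    kept = set()
    for members in ward_clusters(points, n_clusters):
        center = _centroid([points[i] for i in members])
        ranked = sorted(range(len(points)),
                        key=lambda i: _sq_dist(points[i], center))
        kept.update(ranked[:n_neighbors])
    return sorted(kept)
```

In a setting like the one the abstract describes, n_clusters and n_neighbors would be chosen by comparing prediction accuracy on validation sets. The naive clustering loop above is only practical for small inputs; a library implementation would be needed at the scale of a real train set.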
