Algorithms, Vol. 14, Issue 10 (2021)

XGB4mcPred: Identification of DNA N4-Methylcytosine Sites in Multiple Species Based on an eXtreme Gradient Boosting Algorithm and DNA Sequence Information

Xiao Wang    
Xi Lin    
Rong Wang    
Kai-Qi Fan    
Li-Jun Han and Zhao-Yuan Ding    

Abstract

DNA N4-methylcytosine (4mC) plays an important role in numerous biological functions and is an epigenetic mechanism of particular importance. Accurate identification of 4mC sites in DNA sequences is therefore necessary to understand its functional mechanism. Although some effective computational tools have been proposed for identifying DNA 4mC sites, improving identification accuracy and generalization ability remains challenging. This study therefore proposes XGB4mcPred, a predictor for the identification of 4mC sites trained using an extreme gradient boosting algorithm (XGBoost) and DNA sequence information. First, we applied one-hot encoding to adjacent and spaced nucleotides, dinucleotides, and trinucleotides of the original 4mC site sequences to obtain feature vectors. Then, the feature importance values from a pre-trained XGBoost model were used as a threshold to filter redundant features, which significantly improved the identification accuracy of the resulting XGB4mcPred predictor. The analysis shows a clear difference in nucleotide preference between 4mC and non-4mC site sequences in six datasets from multiple species, and the optimized features better distinguish 4mC sites from non-4mC sites. Cross-validation and independent test results on six different species show that XGB4mcPred significantly outperformed other state-of-the-art predictors, with improvements of varying degrees. Additionally, a user-friendly webserver for the XGB4mcPred predictor was developed and made freely accessible.
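The two steps described in the abstract (one-hot encoding of adjacent and spaced k-mers, then filtering features by an importance threshold) can be illustrated with a minimal sketch. The abstract does not specify the exact encoding layout, gap sizes, or threshold value, so the function names, the `max_gap` parameter, and the way windows are concatenated below are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_vocab(k):
    """Map each of the 4**k possible k-mers to an index (lexicographic order)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def one_hot_kmers(seq, k=1, gap=0):
    """One-hot encode every k-mer whose letters are (gap + 1) positions apart.

    gap=0 yields adjacent k-mers; gap=g skips g nucleotides between letters.
    Returns a flat list of 0/1 values of length n_windows * 4**k.
    (Hypothetical layout: the paper's exact scheme is not given in the abstract.)
    """
    seq = seq.upper()
    vocab = kmer_vocab(k)
    step = gap + 1
    span = (k - 1) * step + 1  # sequence positions covered by one window
    features = []
    for start in range(len(seq) - span + 1):
        kmer = "".join(seq[start + j * step] for j in range(k))
        block = [0] * len(vocab)
        if kmer in vocab:  # ambiguous bases such as N leave the block all zero
            block[vocab[kmer]] = 1
        features.extend(block)
    return features


def encode(seq, max_gap=1):
    """Concatenate nucleotide, dinucleotide, and trinucleotide encodings."""
    features = one_hot_kmers(seq, k=1)
    for k in (2, 3):
        for gap in range(max_gap + 1):
            features.extend(one_hot_kmers(seq, k=k, gap=gap))
    return features


def select_by_importance(importances, threshold):
    """Return indices of features whose importance exceeds the threshold."""
    return [i for i, value in enumerate(importances) if value > threshold]
```

In practice the `importances` list would come from a model fitted on the encoded training sequences; for example, a fitted `xgboost.XGBClassifier` exposes a `feature_importances_` attribute. The threshold itself would need to be tuned, since the abstract does not report the value used.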
