Algorithms, Vol. 15, Issue 4 (2022)

KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis

Deyou Tang, Daqiang Tan, Weihao Xiao, Jiabin Lin and Juan Fu

Abstract

Background: K-mer frequency counting is an upstream step in many bioinformatics data analysis workflows. KMC3 and CHTKC are representative partition-based and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms across multiple hardware contexts and datasets and presents their best applicable scenarios and potential improvements. Results: KMC3 uses less memory and runs faster than CHTKC on a regular-configuration server. CHTKC is efficient on high-performance computing platforms with large available memory, many threads, and low I/O bandwidth. When tested with various datasets, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mers. CHTKC performs better than KMC3 on counting tasks with large-scale datasets, high sequencing quality, and short k-mers. Both algorithms are affected by I/O bandwidth, and reducing the influence of the I/O bottleneck is critical: our tests show an improvement from filtering and compressing consecutive first-occurring k-mers in KMC3. Conclusions: KMC3 is more competitive for k-mer counting on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the I/O bottleneck is essential for optimizing k-mer counting algorithms, and filtering and compressing low-frequency k-mers is critical in relieving the I/O impact.
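To make the distinction between non-partition-based and partition-based counting concrete, the following is a minimal Python sketch. It is not the implementation of KMC3 or CHTKC. The function names, the toy hash, and the in-memory lists standing in for disk bins are all assumptions made for illustration. Canonical k-mers (merging a k-mer with its reverse complement) are omitted for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_single_table(reads, k):
    """Non-partition-based idea: one in-memory hash table holds all k-mers."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def count_partitioned(reads, k, n_bins=4):
    """Partition-based idea: split k-mers into bins, then count each bin alone.

    Real tools typically write bins to disk; plain lists stand in for them here.
    """
    bins = [[] for _ in range(n_bins)]
    for read in reads:
        for kmer in kmers(read, k):
            # Deterministic toy hash (hypothetical): identical k-mers
            # always land in the same bin, so bins can be counted independently.
            bins[sum(map(ord, kmer)) % n_bins].append(kmer)
    counts = Counter()
    for b in bins:
        counts.update(Counter(b))  # only one bin's table is built at a time
    return counts


reads = ["ACGTACGT", "CGTACG"]  # made-up example reads
single = count_single_table(reads, 3)
parted = count_partitioned(reads, 3)
```

Both functions return the same counts. The single-table variant needs memory proportional to the number of distinct k-mers, while the partitioned variant trades extra passes over the bins (I/O, in a disk-based tool) for a smaller table per pass. This is consistent with the trade-off the abstract describes between memory and I/O bandwidth.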
