ARTICLE
TITLE

Neo-heterogeneous Programming and Parallelized Optimization of a Human Genome Re-sequencing Analysis Software Pipeline on TH-2 Supercomputer

Xiangke Liao    
Shaoliang Peng    
Yutong Lu    
Yingbo Cui    
Chengkun Wu    
Heng Wang    
Jiajun Wen    

Abstract

Biological big data is growing far faster than the compute power growth described by Moore's Law. Genomic data has been accumulating explosively, which demands an enormous amount of computing power, while current computational methods cannot scale with the data explosion. In this paper, we use large-scale computing resources to address the big data problems of genome processing on the TH-2 supercomputer. TH-2 adopts a neo-heterogeneous architecture and has 16,000 compute nodes, comprising 32,000 Intel Xeon CPUs and 48,000 Xeon Phi MICs. The heterogeneity, scalability, and parallel efficiency pose great challenges for deploying the genome analysis software pipeline on TH-2. Runtime profiling shows that SOAP3-dp and SOAPsnp are the most time-consuming parts of the whole pipeline (up to 70% of total runtime), so they require deep parallel optimization and large-scale deployment. To address this issue, we first design a series of new parallel algorithms for SOAP3-dp and SOAPsnp, respectively, to eliminate spatial-temporal redundancy. We then propose a CPU/MIC collaborative parallel computing method within one node to fully fill the CPU/MIC time slots. We also propose a series of scalable parallel algorithms and large-scale programming methods to reduce the amount of communication between nodes. Finally, we deploy and evaluate our work on the TH-2 supercomputer at different scales. At the largest scale, the whole process takes 8.37 hours using 8,192 nodes to analyze a 300 TB dataset of whole-genome sequences from 2,000 human beings, a task that can take as long as 8 months on a commodity server. The speedup is about 700x.
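The reported speedup can be checked with a short back-of-the-envelope calculation. The sketch below is not part of the paper: it assumes a 30-day month to convert "8 months" into hours (the abstract does not define the conversion), and the variable names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Assumption (not from the paper): one month is taken as 30 days.
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30


def speedup(baseline_hours: float, parallel_hours: float) -> float:
    """Ratio of baseline runtime to parallel runtime."""
    return baseline_hours / parallel_hours


# Figures taken from the abstract.
baseline_months = 8      # commodity server, "as long as 8 months"
th2_hours = 8.37         # TH-2 run on 8,192 nodes
dataset_tb = 300         # total input size in terabytes

baseline_hours = baseline_months * DAYS_PER_MONTH * HOURS_PER_DAY
observed_speedup = speedup(baseline_hours, th2_hours)

# Aggregate throughput of the whole run, in terabytes per hour.
throughput_tb_per_hour = dataset_tb / th2_hours

print(f"baseline: {baseline_hours} h")
print(f"speedup: {observed_speedup:.0f}x")
print(f"throughput: {throughput_tb_per_hour:.1f} TB/h")
```

Under this assumption the ratio comes out slightly below 700, which is consistent with the rounded "about 700x" in the abstract; a 31-day month would push it slightly above.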
