Algorithms, Vol. 14, No. 12 (2021)

Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants

Zahra Tayebi, Sarwan Ali and Murray Patterson

Abstract

The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unmatched for any virus before it. On the one hand, this will help biologists, policymakers, and other authorities make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help to deal more effectively with any possible future pandemic. Because SARS-CoV-2 has many variants, each with different mutations, analyzing such data is difficult, particularly given its size. It is well known that much of the variation in the SARS-CoV-2 genome occurs disproportionately in the spike region of the genome sequence, the relatively short region that codes for the spike protein(s). In this paper, we propose a robust feature-vector representation of biological sequences that, when combined with an appropriate feature selection method, allows different downstream clustering approaches to perform well on a variety of measures. We use this approach with an array of clustering techniques to cluster spike protein sequences in order to study the behavior of the known variants, which are increasing at a very high rate throughout the world. We first use a k-mers based approach to generate a fixed-length feature vector representation of the spike sequences. We then show that, with appropriate feature selection, we can efficiently and effectively cluster the spike sequences by variant. Using a publicly available set of SARS-CoV-2 spike sequences, we cluster these sequences with both hard and soft clustering methods and show that, with our feature selection methods, we can achieve higher F1 scores for the clusters, as well as better clustering quality metrics, than the baselines.
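The k-mers step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the value k = 3, the 20-letter amino acid alphabet, the function names, and the toy sequence are all assumptions made for illustration. Each sequence is mapped to a vector of counts over every possible k-mer, so sequences of different lengths yield vectors of the same length.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids; the abstract does not name an alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Map a sequence to a fixed-length count vector of size len(alphabet)**k.

    Positions follow the lexicographic order of all possible k-mers over
    the alphabet, so every sequence shares the same feature layout.
    k-mers containing symbols outside the alphabet are ignored.
    """
    counts = kmer_counts(sequence, k)
    vocabulary = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts.get(kmer, 0) for kmer in vocabulary]


# Hypothetical toy fragment, used only to exercise the functions.
toy = "MFVFLVLLPLVSSQ"
vec = kmer_vector(toy, k=3)
print(len(vec), sum(vec))
```

With k = 3 the vector has 20^3 = 8000 entries, most of them zero for any single sequence. That sparsity is one reason a feature selection step before clustering, as the abstract describes, is plausible.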

Similar articles

       
 
Adil Redaoui, Amina Belalia and Kamel Belloulata    
Deep network-based hashing has gained significant popularity in recent years, particularly in the field of image retrieval. However, most existing methods only focus on extracting semantic information from the final layer, disregarding valuable structura...
Journal: Information

 
Wenting Li, Xiuhui Zhang, Yunfeng Dong, Yan Lin and Hongjue Li    
Multi-stage launch vehicles are currently the primary tool for humans to reach extraterrestrial space. The technology of recovering and reusing rockets can effectively shorten rocket launch cycles and reduce space launch costs. With the development of de...
Journal: Aerospace

 
Yuji Takubo and Masahiro Kanazaki    
Landing of supersonic transport (SST) suffers from a large uncertainty due to its highly sensitive aerodynamic properties in the subsonic domain, as well as the wind gusts around runways. At the vehicle design stage, a landing trajectory optimization und...
Journal: Aerospace

 
Yang Shi, Zhenbo Wang, Tim J. LaClair, Chieh (Ross) Wang, Yunli Shao and Jinghui Yuan    
The advent of connected vehicle (CV) technology offers new possibilities for a revolution in future transportation systems. With the availability of real-time traffic data from CVs, it is possible to more effectively optimize traffic signals to reduce co...
Journal: Applied Sciences

 
Shumin Lai, Longjun Huang, Ping Li, Zhenzhen Luo, Jianzhong Wang and Yugen Yi    
In this paper, we present a novel unsupervised feature selection method termed robust matrix factorization with robust adaptive structure learning (RMFRASL), which can select discriminative features from a large amount of multimedia data to improve the p...
Journal: Algorithms