Algorithms, Vol. 13, No. 3 (2020)
TITLE

Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

Athanasios Alexopoulos    
Georgios Drakopoulos    
Andreas Kanavos    
Phivos Mylonas and Gerasimos Vonitsanos    

Abstract

At the dawn of the 10V, or big data, era there is a considerable number of data sources, such as smartphones, IoT devices, social media, smart city sensors, and the health care system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storage and processing. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms, and as a consequence many algorithmic techniques tailored for these platforms have been developed. This article relies extensively, and in two ways, on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a large number of classifiers is applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes, which in turn drive the MLlib classifiers. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar, if not better, level of accuracy, recall, and $F_1$. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. Experiments on the same Spark cluster indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
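The two-step idea described in the abstract (an SVD first yields a small set of transformed attributes, which then feed a classifier) can be illustrated on a single machine. The following is only a minimal sketch under stated assumptions, not the authors' Spark pipeline: it uses NumPy instead of MLlib, synthetic Gaussian data instead of Higgs or PAMAP, and a nearest-centroid rule as a stand-in for the MLlib classifiers. All function names (`fit_svd`, `fit_centroids`, `predict`, `run_demo`) are hypothetical.

```python
import numpy as np


def fit_svd(X_train, k):
    """Step 1: top-k right singular vectors of the (uncentered) data matrix."""
    # Economy-size SVD: X_train = U @ diag(s) @ Vt, rows of Vt sorted by
    # decreasing singular value.
    _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
    return Vt[:k].T  # shape (n_features, k)


def fit_centroids(Z, y):
    """Step 2 (stand-in classifier): one centroid per class in the reduced space."""
    labels = np.unique(y)
    centroids = np.stack([Z[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def predict(Z, labels, centroids):
    """Assign each row of Z to the class of its nearest centroid."""
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]


def run_demo(seed=0, n_per_class=200, n_features=20, k=2):
    """Synthetic two-class problem; returns (reduced test shape, test accuracy)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_class, n_features))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, 0] += 6.0  # assumption: classes differ along the first feature

    # Shuffle, then hold out 25% of the rows for testing.
    order = rng.permutation(len(y))
    X, y = X[order], y[order]
    split = int(0.75 * len(y))
    X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

    V_k = fit_svd(X_tr, k)             # learned on training data only
    Z_tr, Z_te = X_tr @ V_k, X_te @ V_k  # transformed attributes
    labels, centroids = fit_centroids(Z_tr, y_tr)
    accuracy = float((predict(Z_te, labels, centroids) == y_te).mean())
    return Z_te.shape, accuracy


if __name__ == "__main__":
    shape, accuracy = run_demo()
    print("reduced test matrix:", shape, "accuracy:", accuracy)
```

The projection matrix is fitted on the training rows only and then reused for the test rows, so the classifier in the second step works with `k` attributes instead of the original `n_features`. Whether the reduced representation preserves accuracy on real data depends on the dataset and on the choice of `k`.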

Similar articles

       
 
Emily E. Waddell, Jeppe H. Rasmussen and Ana Širović
Passive acoustic monitoring is a method that is commonly used to collect long-term data on soniferous animal presence and abundance. However, these large datasets require substantial effort for manual analysis; therefore, automatic methods are a more eff...

 
Jaeun Choi and Yongsung Kim    
With the widespread use of over-the-top (OTT) media, such as YouTube and Netflix, network markets are changing and innovating rapidly, making it essential for network providers to quickly and efficiently analyze OTT traffic with respect to pricing plans ...
Journal: Applied Sciences

 
Emilie Tew-Kai, Victor Quilfen, Marie Cachera and Martial Boutet    
In the context of maritime spatial planning and the implementation of spatialized Good Environmental Status indicators in the Marine Strategy Framework Directive (MSFD), the definition of a mosaic composed of coherent and standardised spatial units is ne...

 
Xinghua Lin, Jianguo Wu and Qing Qin    
Fish can sense their surrounding environment by their lateral line system (LLS). In order to understand the extent to which information can be derived via LLS and to improve the adaptive ability of autonomous underwater vehicles (AUVs), a novel strategy ...

 
L. Pastor-Jabaloyes, F. J. Arregui and R. Cobacho    
Disaggregating residential water end use events through the available commercial tools needs a great investment in time to manually process smart metering data. Therefore, it is extremely difficult to achieve a homogenous and sufficiently large corpus of...
Journal: Water