JOURNAL
Algorithms

TITLE

Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

Athanasios Alexopoulos, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas and Gerasimos Vonitsanos

Abstract

At the dawn of the 10V or big data era, there is a considerable number of sources such as smartphones, IoT devices, social media, smart city sensors, as well as the health care system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms and, as a consequence, many algorithmic techniques have been developed tailored for these platforms. This article relies extensively, in two ways, on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a vast number of classifiers are applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes, which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar, if not better, level of the metrics of accuracy, recall, and F1. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The experiments, run on the same Spark cluster, indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
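The two-step idea described in the abstract (an SVD step that produces transformed attributes, followed by a classifier trained on them) can be illustrated with a small, single-machine sketch. This is not the authors' Spark/MLlib implementation. Everything below is an assumption made for illustration: the synthetic four-attribute data stands in for Higgs or PAMAP, power iteration with deflation stands in for a library SVD, a nearest-centroid rule stands in for the MLlib classifiers, and all function names are hypothetical.

```python
import math
import random


def top_right_singular_vectors(rows, k, iters=100, seed=0):
    """Approximate the top-k right singular vectors of a row-major matrix A
    via power iteration on A^T A with deflation (stand-in for a library SVD).
    Assumes A has rank of at least k."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    basis = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for _ in range(iters):
            # w = A^T (A v)
            av = [sum(a * b for a, b in zip(row, v)) for row in rows]
            w = [sum(row[j] * av[i] for i, row in enumerate(rows))
                 for j in range(n_cols)]
            # Deflation: remove components along vectors already found.
            for u in basis:
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        basis.append(v)
    return basis


def project(rows, basis):
    """Step 1: map each row onto the k singular directions (A V_k)."""
    return [[sum(a * b for a, b in zip(row, v)) for v in basis] for row in rows]


def fit_centroids(points, labels):
    """Step 2 (training): one mean vector per class in the reduced space."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for j, x in enumerate(p):
            acc[j] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, points):
    """Step 2 (inference): assign each point to its nearest class centroid."""
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(centroids, key=lambda y: dist2(p, centroids[y])) for p in points]


def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, recall, F1) for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, f1


def make_data(n_per_class, rng):
    """Synthetic two-class data with four attributes (illustrative only)."""
    rows, labels = [], []
    for label, center in ((0, 0.0), (1, 5.0)):
        for _ in range(n_per_class):
            rows.append([center + rng.gauss(0.0, 0.5) for _ in range(4)])
            labels.append(label)
    return rows, labels


rng = random.Random(42)
train_rows, train_labels = make_data(50, rng)
test_rows, test_labels = make_data(20, rng)

basis = top_right_singular_vectors(train_rows, k=2)  # 4 attributes -> 2
centroids = fit_centroids(project(train_rows, basis), train_labels)
predictions = predict(centroids, project(test_rows, basis))
accuracy, recall, f1 = binary_metrics(test_labels, predictions)
print(f"accuracy={accuracy:.3f} recall={recall:.3f} F1={f1:.3f}")
```

The classifier in the second step only ever sees the k transformed attributes, which is where the reduction in complexity would come from. In a distributed setting, each of the stand-ins would be replaced by the corresponding library routine.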

Keywords

Apache Spark - Apache MLlib - PySpark - big data - machine learning - 10V data - two-step classification - ensemble classification - SVD - SparkQL - computing performance - F1 Metric - dataframe

ISSUE

Volume: 13, Issue: 3 (2020)

SUBJECTS

ENGINEERING AND CIVIL CONSTRUCTION
TECHNOLOGY

DOI

https://doi.org/10.3390/a13030071
