Applied System Innovation, Vol. 4, Issue 1 (2021)
ARTICLE
TITLE

SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features

Mimi Mukherjee and Matloob Khushi    

Abstract

Real-world datasets are often heavily skewed, with some classes significantly outnumbered by others. In these situations, machine learning algorithms fail to achieve substantial efficacy when predicting the underrepresented instances. To address this problem, many variations of the synthetic minority oversampling technique (SMOTE) have been proposed to balance datasets with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique available. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE-Encoded Nominal and Continuous), in which nominal features are encoded as numeric values such that the difference between two values reflects the change in association with the minority class. Our experiments show that classification models using SMOTE-ENC offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and when there is some association between the categorical features and the target class. Additionally, our proposed method addresses one of the major limitations of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets that contain both continuous and nominal features, and cannot function if all the features are nominal. Our method has been generalized to apply to both mixed datasets and nominal-only datasets.
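
The encoding idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the encoding formula, so the score used here, (observed - expected) / expected minority count per category, is an assumption chosen for illustration. The function name `encode_nominal` and the toy data are hypothetical, and the sketch covers only the encoding step, not the oversampling itself.

```python
from collections import Counter


def encode_nominal(values, labels, minority_label):
    """Map each category of one nominal feature to a numeric code.

    Assumed scoring rule (not taken from the abstract):
        code = (observed - expected) / expected
    where `observed` is the number of minority rows with that category and
    `expected` is the category's row count times the overall minority rate.
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    minority_rate = sum(1 for y in labels if y == minority_label) / len(labels)
    if minority_rate == 0:
        raise ValueError("no minority rows found")

    totals = Counter(values)  # rows per category
    observed = Counter(v for v, y in zip(values, labels) if y == minority_label)

    codes = {}
    for category, n_rows in totals.items():
        expected = n_rows * minority_rate
        codes[category] = (observed[category] - expected) / expected
    return codes


# Toy data: category "b" co-occurs with the minority class (label 1) most often.
colour = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c"]
target = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
codes = encode_nominal(colour, target, minority_label=1)
print(sorted(codes, key=codes.get))  # categories ordered by minority association
```

Under this assumed rule, categories over-represented in the minority class receive positive codes and under-represented ones receive negative codes, so the gap between two codes tracks how differently the two categories associate with the minority class.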
