Applied System Innovation, Vol. 4, No. 1 (2021)
ARTICLE
TITLE

SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features

Mimi Mukherjee and Matloob Khushi    

Abstract

Real-world datasets are often heavily skewed, with some classes significantly outnumbered by others. In these situations, machine learning algorithms fail to achieve substantial efficacy when predicting the underrepresented instances. To solve this problem, many variations of the synthetic minority oversampling technique (SMOTE) have been proposed to balance datasets that have continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique available to balance the data. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE-Encoded Nominal and Continuous), in which nominal features are encoded as numeric values, and the difference between two such numeric values reflects the amount of change of association with the minority class. Our experiments show that classification models using the SMOTE-ENC method offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and when there is some association between the categorical features and the target class. Additionally, our proposed method addresses one of the major limitations of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets that contain both continuous and nominal features, and it cannot function if all the features of the dataset are nominal. Our novel method has been generalized so that it can be applied to both mixed datasets and nominal-only datasets.
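
The encoding idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact formula, so the scoring rule used here, (observed - expected) / expected minority count per category, is an assumption chosen only to show how a nominal label could be mapped to a number that reflects its association with the minority class. The function name `encode_nominal` and its parameters are hypothetical and are not taken from the paper.

```python
from collections import Counter


def encode_nominal(values, labels, minority_label):
    """Map each category to a numeric score (illustrative assumption).

    The score is (observed - expected) / expected, where `observed` is the
    number of minority-class rows carrying the category and `expected` is
    the number that would carry it if the category were independent of
    the class. This rule is an assumption, not the paper's stated formula.
    """
    n = len(values)
    minority_total = sum(1 for y in labels if y == minority_label)
    if n == 0 or minority_total == 0:
        raise ValueError("need at least one minority-class sample")

    minority_rate = minority_total / n
    totals = Counter(values)
    minority_counts = Counter(
        v for v, y in zip(values, labels) if y == minority_label
    )

    encoding = {}
    for category, total in totals.items():
        expected = total * minority_rate
        observed = minority_counts.get(category, 0)
        encoding[category] = (observed - expected) / expected
    return encoding


# Toy data: category "b" co-occurs with the minority class (label 1).
values = list("aaaabbbbcc")
labels = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
encoding = encode_nominal(values, labels, minority_label=1)
```

In this toy data, "b" receives a positive score while "a" and "c" receive negative scores, so the numeric gap between two categories grows with the difference in their association with the minority class. Once every nominal feature is numeric, distance-based neighbour search and interpolation can in principle proceed even when the dataset has no continuous features.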

Similar articles

Xiaodong Cui, Zhuofan He, Yangtao Xue, Keke Tang, Peican Zhu and Jing Han    
Underwater Acoustic Target Recognition (UATR) plays a crucial role in underwater detection devices. However, due to the difficulty and high cost of collecting data in the underwater environment, UATR still faces the problem of small datasets. Few-shot le...

 
Bahaa Yamany, Mahmoud Said Elsayed, Anca D. Jurcut, Nashwa Abdelbaki and Marianne A. Azer    
Ransomware is a type of malicious software that encrypts a victim's files and demands payment in exchange for the decryption key. It is a rapidly growing and evolving threat that has caused significant damage and disruption to individuals and organizatio...
Journal: Information

 
Catarina Palma, Artur Ferreira and Mário Figueiredo    
The presence of malicious software (malware), for example, in Android applications (apps), has harmful or irreparable consequences to the user and/or the device. Despite the protections app stores provide to avoid malware, it keeps growing in sophisticat...
Journal: Information

 
Yugen Yi, Haoming Zhang, Ningyi Zhang, Wei Zhou, Xiaomei Huang, Gengsheng Xie and Caixia Zheng    
As the feature dimension of data continues to expand, the task of selecting an optimal subset of features from a pool of limited labeled data and extensive unlabeled data becomes more and more challenging. In recent years, some semi-supervised feature se...
Journal: Information

 
Abdelghani Azri, Adil Haddi and Hakim Allali    
Collaborative filtering (CF), a fundamental technique in personalized Recommender Systems, operates by leveraging user-item preference interactions. Matrix factorization remains one of the most prevalent CF-based methods. However, recent advancements in ...
Journal: Information