Applied Sciences, Vol. 10, No. 23 (2020)

Comparison of Deep Learning Models and Various Text Pre-Processing Techniques for the Toxic Comments Classification

Viera Maslej-Krešňáková
Martin Sarnovský    
Peter Butka and Kristína Machová    

Abstract

The emergence of anti-social behaviour in online environments is a serious issue in today's society, and automatic detection and identification of such behaviour are becoming increasingly important. Modern machine learning and natural language processing methods can provide effective tools for detecting different types of anti-social behaviour in text. In this work, we present a comparison of various deep learning models used to identify toxic comments in Internet discussions. Our main goal was to explore the effect of data preparation on model performance. Working from the assumption that traditional pre-processing methods may remove characteristic traits specific to toxic content, we compared several popular deep learning and transformer language models. We analyzed the influence of different pre-processing techniques and text representations, including standard TF-IDF and pre-trained word embeddings, and also explored currently popular transformer models. Experiments were performed on the dataset from the Kaggle Toxic Comment Classification competition, and the best-performing model was compared with similar approaches using standard data analysis metrics.

Similar articles

       
 
Thomas Kopalidis, Vassilios Solachidis, Nicholas Vretos and Petros Daras    
Recent technological developments have enabled computers to identify and categorize facial expressions to determine a person's emotional state in an image or a video. This process, called "Facial Expression Recognition (FER)", has become one of the most ...
Journal: Information

 
Maryan Rizinski, Andrej Jankov, Vignesh Sankaradas, Eugene Pinsky, Igor Mishkovski and Dimitar Trajanov    
The task of company classification is traditionally performed using established standards, such as the Global Industry Classification Standard (GICS). However, these approaches heavily rely on laborious manual efforts by domain experts, resulting in slow...
Journal: Information

 
Ligang Yuan, Jing Liu, Haiyan Chen, Daoming Fang and Wenlu Chen    
Scene taxiing time is an important indicator for assessing the operational efficiency of airports as well as green airports, and it is also a fundamental parameter in flight regularity statistics. The accurate prediction of taxiing time can help decision...
Journal: Aerospace

 
Fenfang Li, Zhengzhang Zhao, Li Wang and Han Deng    
Sentence Boundary Disambiguation (SBD) is crucial for building datasets for tasks such as machine translation, syntactic analysis, and semantic analysis. Currently, most automatic sentence segmentation in Tibetan adopts the methods of rule-based and stat...
Journal: Applied Sciences

 
Dimitris Papadopoulos and Vangelis D. Karalis    
Sample size is a key factor in bioequivalence and clinical trials. An appropriately large sample is necessary to gain valuable insights into a designated population. However, large sample sizes lead to increased human exposure, costs, and a longer time f...
Journal: Applied Sciences