Arabic Toxic Tweet Classification: Leveraging the AraBERT Model

Amr Mohamed El Koshiry

Entesar Hamed I. Eliwa

Tarek Abd El-Hafeez and Ahmed Omar

Abstract

Social media platforms have become the primary means of communication and information sharing, facilitating interactive exchanges among users. Unfortunately, these platforms also witness the dissemination of inappropriate and toxic content, including hate speech and insults. While significant efforts have been made to classify toxic content in the English language, the same level of attention has not been given to Arabic texts. This study addresses this gap by constructing a standardized Arabic dataset specifically designed for toxic tweet classification. The dataset is annotated automatically using Google's Perspective API and the expertise of three native Arabic speakers and linguists. To evaluate the performance of different models, we conduct a series of experiments using seven models: long short-term memory (LSTM), bidirectional LSTM, a convolutional neural network, a gated recurrent unit (GRU), bidirectional GRU, multilingual bidirectional encoder representations from transformers, and AraBERT. Additionally, we employ word embedding techniques. Our experimental findings demonstrate that the fine-tuned AraBERT model surpasses the performance of other models, achieving an impressive accuracy of 0.9960. Notably, this accuracy value outperforms similar approaches reported in recent literature. This study represents a significant advancement in Arabic toxic tweet classification, shedding light on the importance of addressing toxicity in social media platforms while considering diverse languages and cultures.

Keywords

Arabic toxic - toxic classification - Arabic NLP - BERT

Access

PAGES

pp. 0 - 0

ISSUE

Volume: 7 Issue: 4 (2023)

SUBJECTS

INFRASTRUCTURE

DOI

https://doi.org/10.3390/bdcc7040170

