Future Internet, Vol. 13, Issue 5 (2021)

Collecting a Large Scale Dataset for Classifying Fake News Tweets Using Weak Supervision

Stefan Helmstetter and Heiko Paulheim    

Abstract

The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is collecting sufficiently large training corpora: manual annotation of tweets as fake or non-fake news is expensive and tedious, and recent approaches based on distributional semantics require large amounts of training data. In this paper, we introduce an alternative approach for creating a large-scale dataset for tweet classification with minimal user intervention. The approach relies on weak supervision and automatically collects a large-scale, but very noisy, training dataset comprising hundreds of thousands of tweets. As a weak supervision signal, we label tweets by their source, i.e., as coming from a trustworthy or an untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., distinguishing fake from non-fake tweets. The labels are inaccurate with respect to the new classification target, since not all tweets from an untrustworthy source are fake news, and not all tweets from a trustworthy source are true. Nevertheless, we show that with this noisy dataset, the results are comparable to those achieved using a manually labeled set of tweets. Moreover, we show that combining the large-scale noisy dataset with a human-labeled one yields better results than either of the two alone.
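The source-based labeling idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the account names, the toy tweets, and the choice of a multinomial naive Bayes classifier are all assumptions made for illustration. Each tweet inherits a weak label from its source, and the resulting classifier is then applied to individual tweets.

```python
import math
from collections import Counter

# Hypothetical source lists; the account names are invented for illustration.
TRUSTED_SOURCES = {"reliable_daily", "city_gazette"}
UNTRUSTED_SOURCES = {"shock_report", "viral_truths"}


def weak_label(source):
    """Return 1 for an untrustworthy source, 0 for a trustworthy one, None if unknown."""
    if source in UNTRUSTED_SOURCES:
        return 1
    if source in TRUSTED_SOURCES:
        return 0
    return None


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return text.lower().split()


def train_naive_bayes(tweets):
    """Train on (source, text) pairs, using the source-derived weak label as target."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for source, text in tweets:
        label = weak_label(source)
        if label is None:
            continue  # skip tweets whose source is on neither list
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the more likely class for a tweet, with Laplace-smoothed word likelihoods."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
tweets = [
    ("reliable_daily", "council approves new budget for schools"),
    ("city_gazette", "local team wins regional final"),
    ("shock_report", "shocking secret cure doctors hide from you"),
    ("viral_truths", "you will not believe this shocking secret"),
    ("unknown_user", "just had lunch"),
]
model = train_naive_bayes(tweets)
```

Calling `predict(model, some_text)` returns 1 when the tweet resembles those from the untrustworthy sources and 0 otherwise. As the abstract notes, such a prediction is only a noisy proxy for whether the tweet itself is fake news.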
