Inicio  /  Information  /  Vol: 14 Par: 10 (2023)  /  Artículo
ARTÍCULO
TITULO

A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers

Mohamed Hesham Ibrahim Abdalla    
Simon Malberg    
Daryna Dementieva    
Edoardo Mosca and Georg Groh    

Resumen

As generative NLP can now produce content nearly indistinguishable from human writing, it is becoming difficult to identify genuine research contributions in academic writing and scientific publications. Moreover, information in machine-generated text can be factually wrong or even entirely fabricated. In this work, we introduce a novel benchmark dataset containing human-written and machine-generated scientific papers from SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica, as well as papers co-created by humans and ChatGPT. We also experiment with several types of classifiers?linguistic-based and transformer-based?for detecting the authorship of scientific text. A strong focus is put on generalization capabilities and explainability to highlight the strengths and weaknesses of these detectors. Our work makes an important step towards creating more robust methods for distinguishing between human-written and machine-generated scientific papers, ultimately ensuring the integrity of scientific literature.

 Artículos similares

       
 
Sufyan Danish, Asfandyar Khan, L. Minh Dang, Mohammed Alonazi, Sultan Alanazi, Hyoung-Kyu Song and Hyeonjoon Moon    
Bioinformatics and genomics are driving a healthcare revolution, particularly in the domain of drug discovery for anticancer peptides (ACPs). The integration of artificial intelligence (AI) has transformed healthcare, enabling personalized and immersive ... ver más
Revista: Information

 
Chuanbo Wang, Amirreza Mahbod, Isabella Ellinger, Adrian Galdran, Sandeep Gopalakrishnan, Jeffrey Niezgoda and Zeyun Yu    
Wound care professionals provide proper diagnosis and treatment with heavy reliance on images and image documentation. Segmentation of wound boundaries in images is a key component of the care and diagnosis protocol since it is important to estimate the ... ver más
Revista: Information

 
Linhua Zhang, Ning Xiong, Wuyang Gao and Peng Wu    
With the exponential growth of remote sensing images in recent years, there has been a significant increase in demand for micro-target detection. Recently, effective detection methods for small targets have emerged; however, for micro-targets (even fewer... ver más
Revista: Information

 
Samuel de Oliveira, Oguzhan Topsakal and Onur Toker    
Automated Machine Learning (AutoML) is a subdomain of machine learning that seeks to expand the usability of traditional machine learning methods to non-expert users by automating various tasks which normally require manual configuration. Prior benchmark... ver más
Revista: Information

 
Shubhendu Kshitij Fuladi and Chang-Soo Kim    
In the real world of manufacturing systems, production planning is crucial for organizing and optimizing various manufacturing process components. The objective of this paper is to present a methodology for both static scheduling and dynamic scheduling. ... ver más
Revista: Algorithms