Future Internet, Vol. 15, Issue 11 (2023)

Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification

Panagiotis Skondras, Panagiotis Zervas and Giannis Tzimas

Abstract

In this article, we investigate the potential of synthetic resumes as a means for the rapid generation of training data and their effectiveness in data augmentation, especially in categories with sparse samples. The widespread use of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, saving time and cost for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is equally crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To address it, we used the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed and then used to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on a multiclass resume classification task. When trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model achieved exceptional accuracy. The FFNN, as expected, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially when training data are limited. They also support the suitability of the BERT model for such classification tasks.
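
The 60/40 mix of real and synthetic resumes mentioned in the abstract can be illustrated with a short sketch. The abstract does not say how the authors assembled the augmented dataset, so everything here is an assumption made for illustration: the function name `mix_real_and_synthetic`, the `(text, label)` sample format, and the choice to subsample both pools to the largest size that keeps the requested ratio.

```python
import random
from typing import List, Tuple

# Hypothetical sample format: (resume_text, job_category).
Sample = Tuple[str, str]


def mix_real_and_synthetic(
    real: List[Sample],
    synthetic: List[Sample],
    real_percent: int = 60,
    seed: int = 0,
) -> List[Sample]:
    """Build a shuffled dataset in which about `real_percent` % of samples are real.

    Illustrative only; this is not the procedure used in the article.
    """
    if not 0 < real_percent < 100:
        raise ValueError("real_percent must be between 1 and 99")

    synth_percent = 100 - real_percent
    # Largest total size that both pools can supply at the requested ratio.
    total = min(
        len(real) * 100 // real_percent,
        len(synthetic) * 100 // synth_percent,
    )
    n_real = min(len(real), total * real_percent // 100)
    n_synth = min(len(synthetic), total - n_real)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy data standing in for scraped and generated resumes.
    real_pool = [(f"real resume {i}", "nurse") for i in range(30)]
    synth_pool = [(f"synthetic resume {i}", "nurse") for i in range(50)]
    dataset = mix_real_and_synthetic(real_pool, synth_pool, real_percent=60)
    print(len(dataset), "samples in the augmented set")
```

In practice the ratio would probably be applied per job category, so that sparse classes receive proportionally more synthetic samples, but the abstract gives no detail on this point.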

Similar articles

       
 
Xin Yao, Juan Yu, Jianmin Han, Jianfeng Lu, Hao Peng, Yijia Wu and Xiaoqian Cao    
Generating differentially private synthetic human mobility trajectories from real trajectories is a commonly used approach for privacy-preserving trajectory publishing. However, existing synthetic trajectory generation methods suffer from the drawbacks o...

 
Claudia Alessandra Libbi, Jan Trienes, Dolf Trieschnigg and Christin Seifert    
A major hurdle in the development of natural language processing (NLP) methods for Electronic Health Records (EHRs) is the lack of large, annotated datasets. Privacy concerns prevent the distribution of EHRs, and the annotation of data is known to be cos...
Journal: Future Internet

 
Chengbin Deng, Xiaoyu Dong, Huihai Wang, Weiying Lin, Hao Wen, John Frazier, Hung Chak Ho and Louisa Holmes    
Walking is the most common, environment-friendly, and inexpensive type of physical activity. To perform in-depth walkability analysis, one option is to objectively evaluate different aspects of built environment related to walkability. In this study, we ...

 
Hossein Bagheri, Michael Schmitt and Xiaoxiang Zhu    
So-called prismatic 3D building models, following the level-of-detail (LOD) 1 of the OGC City Geography Markup Language (CityGML) standard, are usually generated automatically by combining building footprints with height values. Typically, high-resolutio...

 
Alexandros Stergiou, Grigorios Kalliatakis and Christos Chrysoulas    
To deal with the richness in visual appearance variation found in real-world data, we propose to synthesise training data capturing these differences for traffic sign recognition. The use of synthetic training data, created from road traffic sign templat...