JOURNAL
Future Internet

ARTICLE

TITLE

Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records

Claudia Alessandra Libbi

Jan Trienes

Dolf Trieschnigg and Christin Seifert

Abstract

A major hurdle in the development of natural language processing (NLP) methods for Electronic Health Records (EHRs) is the lack of large, annotated datasets. Privacy concerns prevent the distribution of EHRs, and annotating data is known to be costly and cumbersome. Synthetic data is a promising solution to the privacy concern, provided that it has utility comparable to real data and preserves the privacy of patients. However, generating synthetic text alone is not useful for NLP, because the generated text lacks annotations. In this work, we propose the use of neural language models (LSTM and GPT-2) for generating artificial EHR text jointly with annotations for named-entity recognition. Our experiments show that artificial documents can be used to train a supervised named-entity recognition model for de-identification, which outperforms a state-of-the-art rule-based baseline. Moreover, we show that combining real data with synthetic data improves the recall of the method, without manual annotation effort. We conduct a user study to gain insights into the privacy of artificial text. We highlight privacy risks associated with language models to inform future research on privacy-preserving automated text generation and on metrics for evaluating privacy preservation during text generation.
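
The abstract does not spell out how annotations are represented in the generated text. As a minimal sketch of the general idea, assume the language model emits entity spans with inline markup, which is then converted into token-level BIO labels for a named-entity recognition model. The `<PHI:TYPE>...</PHI>` tag format, the `to_bio` helper, and the sample sentence are hypothetical and are not taken from the article.

```python
import re

# Hypothetical inline markup for annotated spans: <PHI:TYPE> ... </PHI>
# (an assumption for illustration, not the format used in the article).
TAG = re.compile(r"<PHI:(\w+)>(.*?)</PHI>")

def to_bio(text):
    """Convert text with inline entity markup into (token, BIO-label) pairs."""
    pairs = []
    pos = 0
    for match in TAG.finditer(text):
        # Tokens before the entity lie outside any annotation.
        pairs += [(tok, "O") for tok in text[pos:match.start()].split()]
        label = match.group(1)
        # First token of the span gets B-, the remaining tokens get I-.
        for i, tok in enumerate(match.group(2).split()):
            pairs.append((tok, ("B-" if i == 0 else "I-") + label))
        pos = match.end()
    # Remaining tokens after the last entity.
    pairs += [(tok, "O") for tok in text[pos:].split()]
    return pairs

# Invented example sentence standing in for a generated document.
sample = "Patient <PHI:NAME>Jan Jansen</PHI> was seen on <PHI:DATE>12 May</PHI> ."
for token, label in to_bio(sample):
    print(token, label)
```

Whitespace tokenization keeps the sketch short; a real pipeline would likely use the tokenizer of the downstream NER model.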

Keywords

natural language processing - medical records - privacy protection - synthetic text - generative language models - named-entity recognition - natural language generation

ISSUE

Volume: 13 Issue: 5 (2021)

SUBJECTS

INFRASTRUCTURE

DOI

https://doi.org/10.3390/fi13050136
