Building a text corpus for automatic biographical facts extraction from Russian texts

A.V. Glazkova

Abstract

Tasks in computational linguistics and machine learning related to natural language processing (NLP) often require text corpora. A text corpus is a specially prepared collection of documents equipped with markup containing morphological, syntactic, semantic, or other information. Data obtained from text corpora are used in supervised machine learning to build classifiers for natural-language texts and in other NLP and computational linguistics tasks. The kind of information presented in a corpus, as well as the type of texts, is determined by the aims and tasks of the particular study. This article presents a tool for building a corpus of biographical texts in Russian. Building the corpus involves two stages: collecting the texts and marking them up. In the first stage, we collected texts suitable for markup: the corpus includes biographical articles freely available on Wikipedia. For this purpose, we developed an automatic parser based on open Python libraries. The second stage is the semantic markup of sentences and the selection of biographical facts; it was carried out in a semi-automatic mode. The article describes the features of the process of building the corpus of biographical facts, the taxonomy of biographical facts used in our work, the software implementation for text collection and markup, the representation of texts in the corpus, and the characteristics of the prepared corpus.
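The semi-automatic markup stage can be illustrated with a minimal sketch, assuming a rule-based pre-labeling step whose suggestions a human annotator then confirms or corrects. The abstract does not list the taxonomy, the trigger words, or the corpus format, so the category names (`birth`, `education`, `death`), the patterns in `RULES`, the `MarkedSentence` record, and the naive sentence splitter are all hypothetical and are not the author's implementation.

```python
import json
import re
from dataclasses import dataclass, asdict
from typing import List, Optional

# Hypothetical fact categories and trigger patterns (assumption: the
# article's real taxonomy is not given in the abstract).
RULES = {
    "birth": re.compile(r"\bродил(ся|ась)\b", re.IGNORECASE),
    "education": re.compile(r"\bокончил(а)?\b", re.IGNORECASE),
    "death": re.compile(r"\bумер(ла)?\b|\bскончал(ся|ась)\b", re.IGNORECASE),
}

# Naive boundary: end punctuation, whitespace, then a capital letter.
# Abbreviations followed by a capitalized word would be split wrongly;
# a real pipeline would need a proper sentence tokenizer.
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[А-ЯЁA-Z])")


@dataclass
class MarkedSentence:
    text: str
    suggested: Optional[str]          # label proposed automatically
    confirmed: Optional[str] = None   # label set later by the annotator


def split_sentences(text: str) -> List[str]:
    """Split raw text into sentences using the naive boundary rule."""
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]


def suggest_label(sentence: str) -> Optional[str]:
    """Return the first category whose pattern matches, else None."""
    for label, pattern in RULES.items():
        if pattern.search(sentence):
            return label
    return None


def premark(text: str) -> List[MarkedSentence]:
    """Automatic half of the markup: one record per sentence."""
    return [MarkedSentence(s, suggest_label(s)) for s in split_sentences(text)]


SAMPLE = (
    "Иван Петров родился в 1900 году в Москве. "
    "В 1922 году он окончил университет. "
    "Он любил шахматы. "
    "Умер в 1970 году."
)

if __name__ == "__main__":
    records = premark(SAMPLE)
    # ensure_ascii=False keeps the Cyrillic text readable in the output.
    print(json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2))
```

Keeping `suggested` and `confirmed` as separate fields is one way to preserve the automatic proposal alongside the human decision, so the two can be compared later.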

PAGES

pp. 97 - 103

ISSUE

Volume: 7 Issue: 1 Part: 0 (2019)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY
