ARTICLE
TITLE

Building a text corpus for automatic biographical facts extraction from Russian texts

A.V. Glazkova    

Abstract

Tasks in computational linguistics and machine learning related to natural language processing (NLP) often require text corpora. A text corpus is a specially prepared collection of documents with markup that carries morphological, syntactic, semantic, or other information. Data obtained from text corpora are used in supervised machine learning to build classifiers for natural-language texts and in other NLP and computational linguistics tasks. The kind of information a corpus carries, as well as the type of texts it contains, is determined by the aims and tasks of the particular study. This article presents a tool for building a corpus of biographical texts in Russian. Building a text corpus involves two stages: collecting the texts and marking them up. In the first stage, we collected texts suitable for markup; the corpus includes biographical articles freely available on Wikipedia. For this purpose, we developed an automatic parser based on open Python libraries. The second stage is the semantic markup of sentences and the selection of biographical facts, which was carried out in a semi-automatic mode. The article describes the features of the process of building the corpus of biographical facts, the taxonomy of biographical facts used in our work, the software implementation for text collection and markup, the representation of texts in the corpus, and the characteristics of the prepared corpus.
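The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, and every name in it (`ParagraphExtractor`, `extract_paragraphs`, `split_sentences`, `suggest_label`, `CUES`) is hypothetical. The fact categories and Russian keyword cues are assumptions made up for illustration and are not the taxonomy used in the article. Downloading pages is left out; the sketch starts from HTML that has already been fetched.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the visible text of every <p> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Stage 1 (collection): pull paragraph text out of a fetched page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def split_sentences(text):
    """Naive splitter: break after . ! ? when an uppercase letter follows.

    A real pipeline would need to handle abbreviations and initials.
    """
    return re.split(r"(?<=[.!?])\s+(?=[A-ZА-ЯЁ])", text.strip())


# Hypothetical categories and cues, NOT the article's taxonomy.
CUES = {
    "birth": ("родился", "родилась"),
    "education": ("окончил", "окончила", "учился", "училась"),
}


def suggest_label(sentence):
    """Stage 2 (semi-automatic markup): propose a label for an annotator
    to confirm or reject. Returns None when no cue matches."""
    lowered = sentence.lower()
    for label, cues in CUES.items():
        if any(cue in lowered for cue in cues):
            return label
    return None


if __name__ == "__main__":
    page = (
        "<html><body><h1>Пример</h1>"
        "<p>Иван <b>Петров</b> родился в 1900 году в Москве. "
        "В 1922 году он окончил университет.</p>"
        "</body></html>"
    )
    for paragraph in extract_paragraphs(page):
        for sentence in split_sentences(paragraph):
            print(suggest_label(sentence), "|", sentence)
```

In a semi-automatic workflow of this kind, the suggested label would only pre-fill the annotation; a human annotator makes the final decision for each sentence.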
