Information, Vol. 14, No. 4 (2023)
ARTICLE
TITLE

Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform

Rafal Jaworski, Sanja Seljan and Ivan Dunder

Abstract

Parallel corpora have been widely used in the fields of natural language processing and translation, as they provide crucial multilingual information. They are used to train machine translation systems, compile dictionaries, or generate inter-language word embeddings. Many corpora are publicly available; however, support for some languages is still limited. In this paper, the authors present a framework for collecting, organizing, and storing corpora. The solution was originally designed to obtain data for less-resourced languages, but it proved to work very well for the collection of high-value domain-specific corpora. The scenario is based on the collective work of a group of people who are motivated by means of gamification. The rules of the game motivate the participants to submit large resources, and a peer-review process ensures quality. More than four million translated segments have been collected so far.
