ARTICLE
TITLE

Defining Semantically Close Words of Kazakh Language with Distributed System Apache Spark

Dauren Ayazbayev    
Andrey Bogdanchikov    
Kamila Orynbekova and Iraklis Varlamis    

Abstract

This work focuses on identifying semantically close words and, more generally, on using semantic similarity to improve performance in information retrieval tasks. Measuring the semantic similarity of words is an important task with many applications, from information retrieval to spell checking, document clustering, and classification. Although the methods and tools for this task are well established for languages with rich linguistic resources, some languages lack such tools. The first step in our experiment is to represent the words of a collection in vector form and then to define the semantic similarity of the terms using a vector similarity method. The complexity of the task depends on the number of word pairs (and, consequently, vector pairs) that have to be compared in order to find the semantically closest word pairs. To tame this complexity, a distributed method that runs on Apache Spark is designed to reduce the calculation time by running comparison tasks in parallel. Three alternative implementations are proposed and tested using a list of target words, seeking for each of them the most semantically similar words in a lexicon. In a second step, we employ pre-trained multilingual sentence transformers to capture content semantics at the sentence level, together with a vector-based semantic index to accelerate the searches. The code follows the MapReduce programming model, and the experiments and results show that the proposed methods can provide an interesting solution for finding similar words or texts in the Kazakh language.
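
The first step summarized in the abstract (word vectors compared with a vector similarity method, one independent comparison task per target word) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: cosine similarity is used as one common choice of vector similarity, the three-dimensional vectors and the four Kazakh words are toy data, and the names `cosine_similarity`, `closest_words`, and `LEXICON` are hypothetical. The sketch uses plain Python rather than Apache Spark; the final `map` over target words only marks the step that a distributed engine could run in parallel.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_words(target, lexicon, k=1):
    """Return the k lexicon words most similar to `target`, best match first."""
    target_vec = lexicon[target]
    scored = (
        (cosine_similarity(target_vec, vec), word)
        for word, vec in lexicon.items()
        if word != target  # never report a word as its own neighbour
    )
    return [word for _, word in heapq.nlargest(k, scored)]


# Toy vectors, invented for this sketch (not taken from the paper).
LEXICON = {
    "кітап": [0.9, 0.1, 0.0],   # "book"
    "дәптер": [0.8, 0.2, 0.1],  # "notebook"
    "алма": [0.0, 0.9, 0.3],    # "apple"
    "алмұрт": [0.1, 0.8, 0.4],  # "pear"
}

targets = ["кітап", "алма"]

# Each target word is an independent task, so this map is the part
# that a distributed framework could spread across workers.
results = dict(map(lambda t: (t, closest_words(t, LEXICON, k=1)), targets))
print(results)
```

Because every target word is scored against the lexicon independently, the work grows with the number of target-lexicon pairs and splits naturally into parallel tasks, which is the property the distributed design relies on.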

Similar articles

       
 
Xiran Zhou, Xiao Xie, Yong Xue, Bing Xue, Kai Qin and Weijiang Dai    
High-resolution digital elevation models (DEMs) and their derivatives (e.g., curvature, slope, aspect) offer a great possibility of representing the details of Earth's surface in three-dimensional space. Previous research investigations concerning geomorph...

 
Korawit Orkphol and Wu Yang    
Words have different meanings (i.e., senses) depending on the context. Disambiguating the correct sense is an important and challenging task for natural language processing. An intuitive way is to select the highest similarity between the context and sens...
Journal: Future Internet

 
Soumaya Trabelsi Ben Ameur, Dorra Sellami, Laurent Wendling and Florence Cloppet    
In this work, we build a computer-aided diagnosis (CAD) system of breast cancer for high-risk patients considering the breast imaging reporting and data system (BIRADS), mapping main expert concepts and rules. Therefore, a bag of words is built based on ...

 
Massimo Stella    
Early language acquisition is a complex cognitive task. Recent data-informed approaches showed that children do not learn words uniformly at random but rather follow specific strategies based on the associative representation of words in the mental lexic...

 
Xiangfeng Luo and Yawen Yi    
Nowadays, massive texts are generated on the web, which contain a variety of viewpoints, attitudes, and emotions for products and services. Subjective information mining of online comments is vital for enterprises to improve their products or services an...
Journal: Future Internet