Computation, Vol. 10, No. 11 (2022)

Greedy Texts Similarity Mapping

Aliya Jangabylova    
Alexander Krassovitskiy    
Rustam Mussabayev and Irina Ualiyeva    

Abstract

Document similarity metrics are an important tool in tasks such as topic identification, plagiarism detection, and other problems that require capturing the semantic, syntactic, or structural similarity of texts. The results of a similarity measure depend on the type of word representation and on the problem statement, and computing them can be time-consuming. In this paper, we present greedy texts similarity mapping (GTSM), a problem-independent similarity metric that is computationally efficient enough to be applied to large datasets with any preferred word vectorization model. GTSM maps words in two texts based on a decision rule that evaluates word similarity and the words' importance to the texts. We compare it with the well-known word mover's distance (WMD) algorithm on a k-nearest neighbors text classification problem and find that it leads to similar or better results. In a task that evaluates the correlation of similarity measures with human-judged scores, GTSM achieves higher correlation scores than WMD and sentence mover's similarity (SMS), which shows that it is a decent alternative for both word-level and sentence-level tasks.
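
The abstract does not spell out the decision rule, so the following is only a minimal sketch of what a greedy word mapping of this general kind could look like; it is not the authors' GTSM. It assumes toy two-dimensional word vectors (`TOY_VECTORS`), relative term frequency as the importance weight, one-to-one matching, and a hypothetical function name `greedy_similarity`. All of these are illustrative choices rather than details taken from the paper.

```python
import math
from collections import Counter

# Toy word vectors (assumption): any word vectorization model could supply these.
TOY_VECTORS = {
    "cat": (1.0, 0.0),
    "kitten": (0.9, 0.1),
    "car": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_similarity(tokens_a, tokens_b, vectors):
    """Hypothetical greedy mapping score between two tokenized texts.

    Importance weight (assumption): relative term frequency within each text.
    Words without a vector are ignored.
    """
    weight_a = {w: c / len(tokens_a) for w, c in Counter(tokens_a).items() if w in vectors}
    weight_b = {w: c / len(tokens_b) for w, c in Counter(tokens_b).items() if w in vectors}

    # Score every cross-text word pair, then visit pairs from most to least similar.
    pairs = sorted(
        ((cosine(vectors[a], vectors[b]), a, b) for a in weight_a for b in weight_b),
        reverse=True,
    )

    used_a, used_b = set(), set()
    score = 0.0
    for sim, a, b in pairs:
        # Greedy step: accept the best remaining pair whose words are both unmatched.
        if a in used_a or b in used_b:
            continue
        used_a.add(a)
        used_b.add(b)
        # Weight the pair by the smaller of the two importance values.
        score += sim * min(weight_a[a], weight_b[b])
    return score


if __name__ == "__main__":
    print(greedy_similarity(["cat", "car"], ["kitten", "car"], TOY_VECTORS))
    print(greedy_similarity(["cat"], ["car"], TOY_VECTORS))
```

Sorting all word pairs once and accepting them greedily avoids solving an optimal transport problem, which is the kind of saving a greedy mapping can offer over WMD. The price is that the resulting mapping is not guaranteed to be optimal.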

Similar articles

       
 
Sardar Parhat, Mutallip Sattar, Askar Hamdulla and Abdurahman Kadir    
In this study, based on a morpheme segmentation framework, we researched a text keyword extraction method for Uyghur, Kazakh and Kirghiz languages, which have similar grammatical and lexical structures. In these languages, affixes and a stem are joined t...
Journal: Information

 
Hao Wang, Miao Li, Jianyong Duan, Li He and Qing Zhang    
Previous work has demonstrated that end-to-end neural sequence models work well for document-level event role filler extraction. However, the end-to-end neural network model suffers from the problem of not being able to utilize global information, result...
Journal: Applied Sciences

 
Tao Peng, Kun She, Yimin Shen, Xiangliang Xu and Yue Yu    
Requirement traceability links are an essential part of requirement management software and are a basic prerequisite for software artifact changes. The manual establishment of requirement traceability links is time-consuming. When faced with large projec...
Journal: Information

 
Kirill Tyshchuk, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev and Alexander Panchenko    
Embeddings, i.e., vector representations of objects, such as texts, images, or graphs, play a key role in deep learning methodologies nowadays. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based ...
Journal: Information

 
Mikel Penagarikano, Amparo Varona, Germán Bordel and Luis Javier Rodriguez-Fuentes    
In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an...
Journal: Applied Sciences