Algorithms, Vol. 16, No. 4 (2023)
ARTICLE
TITLE

Model of Lexico-Semantic Bonds between Texts for Creating Their Similarity Metrics and Developing Statistical Clustering Algorithm

Liliya Demidova    
Dmitry Zhukov    
Elena Andrianova and Vladimir Kalinin    

Abstract

To cluster texts into semantic groups, we propose a model of a unified lexico-semantic bond between texts and a similarity matrix based on it. Lexico-semantic analysis methods yield "term-document" matrices built either from the occurrence frequencies of words and n-grams or from the degrees of nodes in their semantic network; cosine metrics of text similarity are then calculated from these matrices. When the text similarity matrix is constructed by lexical or semantic analysis, the cosine of the angle between the vectors describing a pair of texts determines their degree of similarity in the lexical or semantic representation, respectively. The averaging procedure described in this paper then gives a matrix of cosine metric values that describes the lexico-semantic bonds between texts. We also propose an algorithm for solving text clustering problems. It uses the statistical characteristics of the distribution functions of the element values in the rows of the cosine metric matrix in the model of the lexico-semantic bond between documents. The algorithm can also describe, one at a time, the cosine metric matrices obtained from the lexical or the semantic properties of texts alone. Our research has shown that the developed model for the lexico-semantic representation of texts slightly increases the accuracy of their subsequent clustering. The statistical text clustering algorithm based on this model shows excellent results, comparable to those of the widely used affinity propagation algorithm. Additionally, our algorithm does not require specifying the degree of similarity at which vectors are combined into a common cluster, or other configuration parameters. The suggested model and algorithm significantly expand the list of known approaches to determining text similarity metrics and clustering texts.
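As a rough illustration of the pipeline the abstract outlines, the sketch below builds word-frequency vectors, computes a cosine similarity matrix for them, and averages it with a second cosine matrix. It is not the authors' implementation. The abstract does not specify the averaging procedure, so a plain element-wise mean is assumed here. The function names, the sample texts, and the "semantic" vectors (standing in for node-degree features of a semantic network) are all hypothetical.

```python
import math

def term_frequency_vectors(texts):
    """One row per text: word counts over a shared, sorted vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({word for doc in tokenized for word in doc})
    return [[doc.count(word) for word in vocab] for doc in tokenized]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarity between all row vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

def average_matrices(a, b):
    """ASSUMPTION: element-wise mean; the paper's own averaging may differ."""
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

texts = ["cats chase mice", "cats chase birds", "stocks fell sharply"]
lexical_sim = similarity_matrix(term_frequency_vectors(texts))

# Hypothetical stand-in for semantic-network features of the same three texts.
semantic_vectors = [[3, 1, 0], [2, 2, 0], [0, 0, 4]]
semantic_sim = similarity_matrix(semantic_vectors)

bond = average_matrices(lexical_sim, semantic_sim)
```

Each row of `bond` holds the combined similarity of one text to every other text. Rows of this kind are what a row-wise statistical clustering step, such as the one the abstract mentions, would take as input.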

Similar articles

       
 
Saad Said Alqahtany, Ahmad B. Alkhodre, Abdulwahid Al Abdulwahid and Manar Alohaly    
Steganography is a widely used technique for concealing confidential data within images, videos, and audio. However, using text for steganography has not been sufficiently explored. Text-based steganography has the advantage of a low bandwidth overhead, ...
Journal: Applied Sciences

 
Yiming Liu, Hongtao Shan, Feng Nie, Gaoyu Zhang and George Xianzhi Yuan    
The current popular approach to the extraction of document-level relations is mainly based on either a graph structure or serialization model method for the inference, but the graph structure method makes the model complicated, while the serialization mo...
Journal: Information

 
Luis-Gil Moreno-Jiménez, Juan-Manuel Torres-Moreno and Roseli Suzi. Wedemann    
In this paper, we describe a model for the automatic generation of literary sentences in French. Although there has been much recent effort directed towards automatic text generation in general, the generation of creative, literary sentences that is not ...
Journal: Algorithms

 
Sardar Parhat, Mutallip Sattar, Askar Hamdulla and Abdurahman Kadir    
In this study, based on a morpheme segmentation framework, we researched a text keyword extraction method for Uyghur, Kazakh and Kirghiz languages, which have similar grammatical and lexical structures. In these languages, affixes and a stem are joined t...
Journal: Information

 
Loris Belcastro, Domenico Carbone, Cristian Cosentino, Fabrizio Marozzo and Paolo Trunfio    
Since the advent of Bitcoin, the cryptocurrency landscape has seen the emergence of several virtual currencies that have quickly established their presence in the global market. The dynamics of this market, influenced by a multitude of factors that are d...
Journal: Algorithms