JOURNAL
Algorithms

ARTICLE

TITLE

Model of Lexico-Semantic Bonds between Texts for Creating Their Similarity Metrics and Developing Statistical Clustering Algorithm

Liliya Demidova, Dmitry Zhukov, Elena Andrianova and Vladimir Kalinin

Abstract

To cluster texts into semantic groups, we propose a model of a unified lexico-semantic bond between texts and a similarity matrix based on it. Lexico-semantic analysis methods make it possible to build “term–document” matrices based both on the occurrence frequencies of words and n-grams and on the degrees of nodes in their semantic network, and then to calculate cosine metrics of text similarity. When the text similarity matrix is constructed using lexical or semantic analysis methods, the cosine of the angle between the pair of vectors describing two texts determines their degree of similarity in the lexical or semantic representation, respectively. The averaging procedure described in this paper yields a matrix of cosine metric values that describes the lexico-semantic bonds between texts. We propose an algorithm for solving text clustering problems. This algorithm uses the statistical characteristics of the distribution functions of element values in the rows of the cosine metric value matrix in the model of the lexico-semantic bond between documents. In addition, the algorithm can separately describe the matrices of cosine metric values obtained from the lexical or the semantic properties of texts alone. Our research has shown that the developed model for the lexico-semantic representation of texts slightly increases the accuracy of their subsequent clustering. The statistical text clustering algorithm based on this model shows excellent results that are comparable to those of the widely used affinity propagation algorithm. Additionally, our algorithm does not require specifying the degree of similarity for combining vectors into a common cluster or other configuration parameters. The suggested model and algorithm significantly expand the list of known approaches for determining text similarity metrics and clustering texts.
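The pipeline outlined in the abstract (two document-vector representations, a cosine similarity matrix for each, then an averaged matrix) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the averaging procedure, so a plain element-wise arithmetic mean is assumed here, and the function names `cosine_similarity_matrix` and `combined_similarity` as well as the toy vectors are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(doc_vectors):
    """Pairwise cosine similarity between the rows (documents) of doc_vectors."""
    x = np.asarray(doc_vectors, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # all-zero rows yield similarity 0 instead of NaN
    unit = x / norms
    return unit @ unit.T


def combined_similarity(lexical_vectors, semantic_vectors):
    """Element-wise mean of the two cosine matrices.

    ASSUMPTION: a simple arithmetic mean stands in for the averaging
    procedure of the paper, which the abstract does not spell out.
    """
    lexical = cosine_similarity_matrix(lexical_vectors)
    semantic = cosine_similarity_matrix(semantic_vectors)
    return (lexical + semantic) / 2.0


# Toy data (made up): 3 documents, 4 features per representation.
# Rows of lexical_vectors mimic term frequencies; rows of semantic_vectors
# mimic node degrees in a semantic network.
lexical_vectors = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1]])
semantic_vectors = np.array([[3, 1, 0, 1], [2, 2, 0, 1], [0, 1, 4, 0]])

bond = combined_similarity(lexical_vectors, semantic_vectors)
print(np.round(bond, 3))
```

In this toy data the first two documents share most of their features in both representations, so their entry in the combined matrix is much larger than either document's entry with the third. A clustering step would then operate on the rows of `bond`.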

Keywords

lexico-semantic model of text - matrix of lexico-semantic bond between texts - text vectorization - statistical text clustering algorithm

ISSUE

Volume: 16 Issue: 4 (2023)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY

SIMILAR JOURNALS

Algorithms
Applied Sciences
Informatics

DOI

https://doi.org/10.3390/a16040198

Similar articles

A Dynamic Multi-Layer Steganography Approach Based on Arabic Letters’ Diacritics and Image Layers

Saad Said Alqahtany, Ahmad B. Alkhodre, Abdulwahid Al Abdulwahid and Manar Alohaly

Steganography is a widely used technique for concealing confidential data within images, videos, and audio. However, using text for steganography has not been sufficiently explored. Text-based steganography has the advantage of a low bandwidth overhead, ...

Journal: Applied Sciences

Document-Level Relation Extraction with Local Relation and Global Inference

Yiming Liu, Hongtao Shan, Feng Nie, Gaoyu Zhang and George Xianzhi Yuan

The current popular approach to the extraction of document-level relations is mainly based on either a graph structure or serialization model method for the inference, but the graph structure method makes the model complicated, while the serialization mo...

Journal: Information

Automatic Generation of Literary Sentences in French

Luis-Gil Moreno-Jiménez, Juan-Manuel Torres-Moreno and Roseli Suzi Wedemann

In this paper, we describe a model for the automatic generation of literary sentences in French. Although there has been much recent effort directed towards automatic text generation in general, the generation of creative, literary sentences that is not ...

Journal: Algorithms

Uyghur–Kazakh–Kirghiz Text Keyword Extraction Based on Morpheme Segmentation

Sardar Parhat, Mutallip Sattar, Askar Hamdulla and Abdurahman Kadir

In this study, based on a morpheme segmentation framework, we researched a text keyword extraction method for Uyghur, Kazakh and Kirghiz languages, which have similar grammatical and lexical structures. In these languages, affixes and a stem are joined t...

Journal: Information

Enhancing Cryptocurrency Price Forecasting by Integrating Machine Learning with Social Media and Market Data

Loris Belcastro, Domenico Carbone, Cristian Cosentino, Fabrizio Marozzo and Paolo Trunfio

Since the advent of Bitcoin, the cryptocurrency landscape has seen the emergence of several virtual currencies that have quickly established their presence in the global market. The dynamics of this market, influenced by a multitude of factors that are d...

Journal: Algorithms
