Information, Vol. 14, Issue 5 (2023)

Quantifying the Dissimilarity of Texts

Benjamin Shade and Eduardo G. Altmann    

Abstract

Quantifying the dissimilarity of two texts is an important aspect of a number of natural language processing tasks, including semantic information retrieval, topic classification, and document clustering. In this paper, we compared the properties and performance of different dissimilarity measures D using three different representations of texts (vocabularies, word frequency distributions, and vector embeddings) and three simple tasks (clustering texts by author, subject, and time period). Using the Project Gutenberg database, we found that the generalised Jensen–Shannon divergence applied to word frequencies performed strongly across all tasks, that D's based on vector embedding representations led to stronger performance for smaller texts, and that the optimal choice of approach was ultimately task-dependent. We also investigated, both analytically and numerically, the behaviour of the different D's when the two texts varied in length by a factor h. We demonstrated that the (natural) estimator of the Jaccard distance between vocabularies was inconsistent and computed explicitly the h-dependency of the bias of the estimator of the generalised Jensen–Shannon divergence applied to word frequencies. We also found numerically that the Jensen–Shannon divergence and embedding-based approaches were robust to changes in h, while the Jaccard distance was not.
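
As a rough illustration of two of the measures named in the abstract, the following Python sketch computes a Jaccard distance between vocabularies and a Jensen–Shannon divergence between word frequency distributions. It is a minimal sketch under stated assumptions, not the authors' implementation: the whitespace tokenizer, the function names, the use of natural logarithms, and the order-alpha entropy H_alpha(p) = (sum_i p_i^alpha - 1) / (1 - alpha) used for the generalised divergence are all choices made here for illustration and are not specified in the abstract.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def jaccard_distance(tokens_a, tokens_b):
    # 1 - |A ∩ B| / |A ∪ B|, computed on the two vocabularies (sets of word types).
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def word_frequencies(tokens):
    # Relative frequency of each word type in the text.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def generalised_entropy(probs, alpha=1.0):
    # Assumed definition: Shannon entropy (natural log) for alpha == 1,
    # otherwise (sum_i p_i**alpha - 1) / (1 - alpha).
    probs = [p for p in probs if p > 0]
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** alpha for p in probs) - 1.0) / (1.0 - alpha)


def jensen_shannon_divergence(p, q, alpha=1.0):
    # Entropy of the equal-weight mixture minus the mean entropy of p and q.
    words = set(p) | set(q)
    mixture = [(p.get(w, 0.0) + q.get(w, 0.0)) / 2.0 for w in words]
    mean_entropy = 0.5 * (
        generalised_entropy(p.values(), alpha)
        + generalised_entropy(q.values(), alpha)
    )
    return generalised_entropy(mixture, alpha) - mean_entropy


if __name__ == "__main__":
    text_a = tokenize("the cat sat on the mat")
    text_b = tokenize("the dog sat on the log")
    print(jaccard_distance(text_a, text_b))
    print(jensen_shannon_divergence(word_frequencies(text_a),
                                    word_frequencies(text_b)))
```

Both functions return 0 for identical inputs. Comparing a short excerpt of a text against the full text is one simple way to explore the length dependence (the factor h) that the abstract discusses.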

Similar articles

Shuang Wang, Amin Beheshti, Yufei Wang, Jianchao Lu, Quan Z. Sheng, Stephen Elbourn and Hamid Alinejad-Rokny    
Instructors face significant time and effort constraints when grading students' assessments on a large scale. Clustering similar assessments is a unique and effective technique that has the potential to significantly reduce the workload of instructors in...
Journal: Algorithms

 
Philipp Gabler, Bernhard C. Geiger, Barbara Schuppler and Roman Kern    
Superficially, read and spontaneous speech (the two main kinds of training data for automatic speech recognition) appear as complementary, but are equal: pairs of texts and acoustic signals. Yet, spontaneous speech is typically harder for recognition. This...
Journal: Information

 
Anastasia Fedotova, Aleksandr Romanov, Anna Kurtukova and Alexander Shelupanov    
This article is the third paper in a series aimed at the establishment of the authorship of Russian-language texts. This paper considers methods for determining the authorship of classical Russian literary texts, as well as fanfiction texts. The process ...
Journal: Algorithms

 
Xuyang Wang, Yajun Du, Danroujing Chen, Xianyong Li, Xiaoliang Chen, Yongquan Fan, Chunzhi Xie, Yanli Li and Jia Liu    
Domain-generalized few-shot text classification (DG-FSTC) is a new setting for few-shot text classification (FSTC). In DG-FSTC, the model is meta-trained on a multi-domain dataset, and meta-tested on unseen datasets with different domains. However, previ...
Journal: Applied Sciences

 
Shuang Lu, Jianyun Huang and Jing Wu    
In the contexts of global climate change and the urbanization process, urban flooding poses significant challenges worldwide, necessitating effective rapid assessments to understand its impacts on various aspects of urban systems. This can be achieved th...
Journal: Water