JOURNAL
Information


ARTICLE

TITLE

Quantifying the Dissimilarity of Texts

Benjamin Shade and Eduardo G. Altmann

Abstract

Quantifying the dissimilarity of two texts is an important aspect of a number of natural language processing tasks, including semantic information retrieval, topic classification, and document clustering. In this paper, we compared the properties and performance of different dissimilarity measures D using three different representations of texts (vocabularies, word frequency distributions, and vector embeddings) and three simple tasks (clustering texts by author, subject, and time period). Using the Project Gutenberg database, we found that the generalised Jensen–Shannon divergence applied to word frequencies performed strongly across all tasks, that D's based on vector embedding representations led to stronger performance for smaller texts, and that the optimal choice of approach was ultimately task-dependent. We also investigated, both analytically and numerically, the behaviour of the different D's when the two texts varied in length by a factor h. We demonstrated that the (natural) estimator of the Jaccard distance between vocabularies was inconsistent and computed explicitly the h-dependency of the bias of the estimator of the generalised Jensen–Shannon divergence applied to word frequencies. We also found numerically that the Jensen–Shannon divergence and embedding-based approaches were robust to changes in h, while the Jaccard distance was not.
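
As a rough illustration of two of the measures named in the abstract, the following Python sketch computes a Jaccard distance between vocabularies and a Jensen–Shannon divergence between word frequency distributions. The abstract does not define either measure, so the forms used here are assumptions: the Jaccard distance is taken as 1 - |A ∩ B| / |A ∪ B|, and the generalised divergence of order `alpha` is taken as H_α((p+q)/2) - (H_α(p) + H_α(q))/2 with H_α(p) = (Σ p_i^α - 1)/(1 - α), which reduces to the Shannon case as α → 1. The whitespace tokenizer and all function names are hypothetical and are not the authors' implementation.

```python
from collections import Counter
import math


def tokenize(text):
    # Hypothetical, deliberately naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def jaccard_distance(tokens_a, tokens_b):
    # Distance between vocabularies (sets of distinct word types).
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def generalised_entropy(p, alpha=1.0):
    # Assumed form: (sum_i p_i^alpha - 1) / (1 - alpha); Shannon entropy at alpha = 1.
    if alpha == 1.0:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (sum(x ** alpha for x in p if x > 0) - 1.0) / (1.0 - alpha)


def jensen_shannon_divergence(tokens_a, tokens_b, alpha=1.0):
    # Divergence between the relative word frequencies of two non-empty texts.
    if not tokens_a or not tokens_b:
        raise ValueError("both texts must contain at least one token")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    words = sorted(set(counts_a) | set(counts_b))
    p = [counts_a[w] / n_a for w in words]
    q = [counts_b[w] / n_b for w in words]
    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return generalised_entropy(mix, alpha) - 0.5 * (
        generalised_entropy(p, alpha) + generalised_entropy(q, alpha)
    )


if __name__ == "__main__":
    a = tokenize("the cat sat on the mat")
    b = tokenize("the dog sat on the log")
    print(jaccard_distance(a, b))
    print(jensen_shannon_divergence(a, b, alpha=1.0))
```

Both functions take token lists, so the length ratio h discussed in the abstract could be explored by truncating one of the two lists before the call.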

Keywords

text distance - text representation - Jaccard distance - Jensen–Shannon divergence - entropy - word frequency distribution - document embedding - quantitative linguistics - authorship attribution - Project Gutenberg


ISSUE

Volume: 14 Issue: 5 (2023)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY


DOI

https://doi.org/10.3390/info14050271

