Future Internet, Vol. 12, No. 9 (2020)

Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment

Svetlana S. Bodrunova    
Andrey V. Orekhov    
Ivan S. Blekanov    
Nikolay S. Lyudkevich and Nikita A. Tarasov    

Abstract

The paper addresses the problem of optimal text classification in the area of automated detection of text typology. In conventional approaches to topicality-based text classification (including topic modeling), the number of clusters must be set by the researcher, and both the optimal number of clusters and the quality of the model that defines the proximity of texts to each other remain unresolved questions. We propose a novel approach to the automated determination of the optimal number of clusters that also incorporates an assessment of word proximity of texts, combined with a text encoding model based on sentence embeddings. Our approach combines Universal Sentence Encoder (USE) data pre-processing, agglomerative hierarchical clustering by Ward's method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the "e-2" hypothesis. We set up an experiment on two datasets of real-world labeled data: News20 and BBC. The proposed model is tested against more traditional text representation methods, such as bag-of-words and word2vec, to show that it provides substantially better clustering quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop when the number of clusters grows. Thus, we get close to the convergence of text clustering and text classification.
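The pipeline described in the abstract (embed texts, merge clusters with Ward's method, stop merging at an automatically chosen moment) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, synthetic 2-D points stand in for USE sentence embeddings, and the stopping rule is a simple largest-jump heuristic on the merge cost, not the Markov stopping moment or the "e-2" hypothesis used in the paper.

```python
import random


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    (ca, na, _), (cb, nb, _) = a, b
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq_dist


def merge(a, b):
    """Merge two clusters stored as (centroid, size, member_indices)."""
    (ca, na, ma), (cb, nb, mb) = a, b
    n = na + nb
    centroid = [(na * x + nb * y) / n for x, y in zip(ca, cb)]
    return (centroid, n, ma + mb)


def ward_history(points):
    """Run Ward agglomeration; return (merge_cost, partition_before_merge) per step."""
    clusters = [(list(p), 1, [i]) for i, p in enumerate(points)]
    history = []
    while len(clusters) > 1:
        # Naive search for the cheapest pair to merge (fine for small inputs).
        cost, i, j = min(
            (ward_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        history.append((cost, [c[2] for c in clusters]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


def stop_at_largest_jump(history):
    """Stand-in stopping rule (NOT the paper's Markov moment):
    keep the partition that exists just before the largest jump in merge cost."""
    if len(history) < 2:
        return history[0][1] if history else []
    costs = [cost for cost, _ in history]
    jumps = [costs[k] - costs[k - 1] for k in range(1, len(costs))]
    k = jumps.index(max(jumps)) + 1
    return history[k][1]


# Synthetic stand-in for sentence embeddings: three well-separated groups.
random.seed(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    [random.gauss(cx, 0.3), random.gauss(cy, 0.3)]
    for cx, cy in centers
    for _ in range(10)
]

clusters = stop_at_largest_jump(ward_history(points))
print(len(clusters), [sorted(c) for c in clusters])
```

With real data, `points` would be replaced by the sentence embedding vectors of the texts, and the naive pairwise search by an optimized routine such as `scipy.cluster.hierarchy.linkage` with `method="ward"`. The stopping function is the place where a more principled criterion would go.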
