ARTICLE
TITLE

Improving the Decision Value of Hierarchical Text Clustering Using Term Overlap Detection

Nilupulee Nathawitharana    
Damminda Alahakoon    
Sumith Matharage    

Abstract

Humans are used to expressing themselves in written language, which provides a medium for describing experiences in detail and with individuality. Although documents are a rich source of information, it becomes very difficult to identify, extract, summarize and search that information when vast numbers of documents are collected, especially over time. Document clustering is a widely used technique for grouping documents by similarity of content, as represented by the words they contain. Once key groups are identified, hierarchical clustering facilitates further drill-down into sub-groupings. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and cluster accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms, which can be repeated across documents associated with different topics. Therefore, in contrast to numerical and categorical data, text data cannot be considered a direct "coding" of a particular experience or situation, and term overlap is a very common characteristic in text clustering. In this paper, we propose a new technique and methodology for capturing term overlap in text documents, highlight the different situations such overlap could signify, and discuss why this understanding is important for obtaining value from text clustering. Experiments were conducted on a widely used text document collection, where the proposed methodology made it possible to explore the term diversity of a given document collection and obtain clusters with minimum term overlap.

Similar articles

       
 
Ive Botunac, Jurica Bosna and Maja Matetic    
Investment decision-makers increasingly rely on modern digital technologies to enhance their strategies in today's rapidly changing and complex market environment. This paper examines the impact of incorporating Long Short-term Memory (LSTM) models into ...
Journal: Information

 
Sai Wang, Guoping Fu, Yongduo Song, Jing Wen, Tuanqi Guo, Hongjin Zhang and Tuantuan Wang    
The development of intelligent oceans requires exploration and an understanding of the various characteristics of the oceans. The emerging Internet of Underwater Things (IoUT) is an extension of the Internet of Things (IoT) to underwater environments, an...

 
Mfowabo Maphosa, Wesley Doorsamy and Babu Paul    
The role of academic advising has been conducted by faculty-student advisors, who often have many students to advise quickly, making the process ineffective. The selection of the incorrect qualification increases the risk of dropping out, changing qualif...
Journal: Algorithms

 
Mohammad Shokouhifar, Mohamad Hasanvand, Elaheh Moharamkhani and Frank Werner    
Heart disease is a global health concern of paramount importance, causing a significant number of fatalities and disabilities. Precise and timely diagnosis of heart disease is pivotal in preventing adverse outcomes and improving patient well-being, there...
Journal: Algorithms

 
James Oduor Oyoo, Jael Sanyanda Wekesa and Kennedy Odhiambo Ogada    
Road traffic collisions are among the world's critical issues, causing many casualties, deaths, and economic losses, with a disproportionate burden falling on developing countries. Existing research has been conducted to analyze this situation using diff...