ARTICLE
TITLE

Development of methods for pre-clustering and virtual merging of short documents for building domain dictionaries

Oleksii Kungurtsev    
Svitlana Zinovatna    
Iana Potochniak    
Nataliia Novikova    

Abstract

The aim of this research is to improve the quality of domain dictionaries by expanding the corpus of documents under study to include short documents. A document model is proposed that makes it possible to identify a short document and to determine whether it must be combined with other documents in order to extract verbose (multi-word) terms. An algorithm for extracting the substantive part of a document has been developed, since the heading and closing parts of a short document usually contain terms unrelated to the studied domain. A method for preliminary clustering of short documents has been developed; it is based on extracting nouns (one-word terms) and counting their occurrences across all analyzed documents. The concept of document proximity is introduced, determined by a combination of two criteria: the relative number of matching terms and the relative frequency of occurrence of matching terms. The way documents are grouped at the customer's site often does not correspond to the grouping required for building a domain dictionary. In a short document it is usually impossible to isolate a verbose term because terms are repeated too rarely. A method has been developed for virtually merging short documents so as to achieve the necessary repetition of one-word terms. The merged document has the highest possible term frequency for the cluster it belongs to. At the same time, the original text of each document is preserved, as is the ability to associate an extracted verbose term with the documents that contain it. The experiment made it possible to find the best ratio between the elements of the document proximity coefficient and confirmed the effectiveness of the proposed preliminary clustering method.
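
The abstract names the two proximity criteria and the virtual merge but does not give their formulas, so the following Python sketch is only an illustration under stated assumptions. The function names `proximity` and `virtual_merge`, the weight `alpha`, the use of the union of terms and of total occurrences as denominators, and the linear combination of the two criteria are all hypothetical choices, not the authors' definitions. Noun extraction is assumed to have been done already, so each document is represented as a count of one-word terms.

```python
from collections import Counter


def proximity(a: Counter, b: Counter, alpha: float = 0.5) -> float:
    """Illustrative proximity of two documents given their noun counts.

    Assumed criterion 1: shared distinct terms / all distinct terms.
    Assumed criterion 2: occurrences of shared terms / all occurrences.
    alpha is a hypothetical weight balancing the two criteria.
    """
    shared = a.keys() & b.keys()
    all_terms = a.keys() | b.keys()
    total = sum(a.values()) + sum(b.values())
    if not all_terms or total == 0:
        return 0.0
    share_of_terms = len(shared) / len(all_terms)
    share_of_occurrences = sum(a[t] + b[t] for t in shared) / total
    return alpha * share_of_terms + (1 - alpha) * share_of_occurrences


def virtual_merge(docs: dict[str, Counter]) -> tuple[Counter, dict[str, list[str]]]:
    """Sum term counts over a cluster without touching the source texts.

    Returns the merged counts and, for every term, the ids of the
    documents that contain it, so a term can be traced back later.
    """
    merged: Counter = Counter()
    sources: dict[str, list[str]] = {}
    for doc_id, counts in docs.items():
        merged.update(counts)
        for term in counts:
            sources.setdefault(term, []).append(doc_id)
    return merged, sources


# Made-up example data: two short documents reduced to noun counts.
doc_a = Counter({"invoice": 2, "payment": 1, "customer": 1})
doc_b = Counter({"invoice": 1, "payment": 2, "supplier": 1})

print(proximity(doc_a, doc_b))  # 0.625 with alpha = 0.5
merged, sources = virtual_merge({"a": doc_a, "b": doc_b})
print(merged["invoice"], sources["supplier"])  # 3 ['b']
```

In this toy case two of four distinct terms match (0.5) and six of eight occurrences belong to matching terms (0.75), which gives 0.625 at equal weights. The abstract reports that the best ratio between the two elements was found experimentally, so a fixed `alpha` of 0.5 is only a placeholder.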
