ARTICLE
TITLE

Comparative Analysis of the Accuracy of Methods for Visualizing the Structure of a Text Collection

Fedor Krasnov    

Abstract

Visualizing multidimensional data is one of the most important stages of data exploration. Decisions about the subsequent stages of a study are often made from a flat view of the data, based on "rough proportions". Projecting multidimensional vectors onto the plane while preserving distances is clear and persuasive, and it has been used successfully with distributional semantics models (Word2Vec, GloVe, NaVec). On the other hand, an inaccurate two-dimensional projection can lead to time wasted searching for multidimensional structures that do not exist. The author set out to evaluate the accuracy of dimensionality reduction methods under two constraints: the high dimensionality arises from the vector representation of text documents, and the reduction is aimed at visualization on the plane. Among the numerous dimensionality reduction methods, there is no separate class of approaches designed specifically for visualization. To measure accuracy, the author used labeled data and quantified how well the labels are preserved when the dimension is reduced. The author investigated 12 dimensionality reduction methods on two labeled data sets, one in Russian and one in English. According to the Silhouette Coefficient, the most accurate visualization method for text data was UMAP with the Hellinger distance as its metric.
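The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the author's pipeline. The function names, the toy 2D points, and the labels are all hypothetical, and the silhouette is computed with plain Euclidean distance on the plane. The idea is that a projection which keeps same-label documents together scores closer to 1, while one that mixes labels scores lower.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0
    )


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    n = len(points)
    scores = []
    for i in range(n):
        # Distances from point i to every other point, grouped by label.
        by_label = {}
        for j in range(n):
            if i != j:
                d = math.dist(points[i], points[j])
                by_label.setdefault(labels[j], []).append(d)
        own = by_label.pop(labels[i], [])
        if not own or not by_label:
            # Singleton cluster or a single label: score taken as 0 here.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own label
        b = min(sum(d) / len(d) for d in by_label.values())  # nearest other label
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


# Hypothetical 2D projections of six labeled documents (toy data).
labels = [0, 0, 0, 1, 1, 1]
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mixed = [(0, 0), (10, 10), (1, 0), (0, 1), (10, 11), (11, 10)]

score_separated = silhouette_coefficient(separated, labels)
score_mixed = silhouette_coefficient(mixed, labels)
print(score_separated > score_mixed)
```

In practice one would more likely rely on library implementations, for example `sklearn.metrics.silhouette_score` for the score and `umap.UMAP(metric="hellinger")` for the projection. Whether those match the exact configuration used in the study is an assumption.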
