Algorithms, Vol. 17, Issue 4 (2024)
ARTICLE
TITLE

The Impact of Data Preparation and Model Complexity on the Natural Language Classification of Chinese News Headlines

Torrey Wagner, Dennis Guhl and Brent Langhals

Abstract

Given the emergence of China as a political and economic power in the 21st century, there is increased interest in analyzing Chinese news articles to better understand developing trends in China. Because of the volume of material, automatically categorizing Chinese-language news articles by their headlines or titles can be an effective way to sort them for efficient review. A 383,000-headline dataset from the Toutiao website, labeled with 15 categories, was evaluated via natural language processing to predict topic categories. The influence of six data preparation variations on the predictive accuracy of four algorithms was studied. The simplest model (Naïve Bayes) achieved 85.1% accuracy on a holdout dataset, while the most complex model (a neural network using BERT) achieved 89.3% accuracy. The most useful data preparation steps were identified, and the underlying complexity and computational costs of automating the categorization process were also examined. Compared with Naïve Bayes, the BERT model required 170x more time to train, was slower to predict by a factor of 18,600, and required 27x more disk space to save, indicating that it may be the best choice for low-volume applications when the highest accuracy is needed. However, for larger-scale operations where a slight performance degradation can be tolerated, the Naïve Bayes algorithm could be the best choice. Nearly one in four records in the Toutiao dataset are duplicates, and this is the first published analysis with duplicates removed.
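
The two ideas in the abstract that are easiest to make concrete are duplicate removal and Naïve Bayes headline classification. The following is a minimal sketch of both, not the authors' pipeline: the toy headlines, the label names, the character-level `tokenize` function, and the smoothing value `alpha=1.0` are all assumptions made for illustration, and the abstract does not say which tokenization or data preparation steps the study found most useful.

```python
import math
from collections import Counter, defaultdict


def deduplicate(records):
    """Drop exact duplicate (headline, label) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for headline, label in records:
        if (headline, label) not in seen:
            seen.add((headline, label))
            unique.append((headline, label))
    return unique


def tokenize(headline):
    """Character-level tokens: an assumed stand-in for real word segmentation."""
    return [ch for ch in headline if not ch.isspace()]


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def fit(self, records, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for headline, label in records:
            self.class_counts[label] += 1
            for tok in tokenize(headline):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, headline):
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n / self.total)
            denom = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
            for tok in tokenize(headline):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    count = self.token_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; the real dataset has 15 categories.
records = [
    ("股市今日大涨", "finance"),
    ("股市今日大涨", "finance"),  # exact duplicate, removed below
    ("央行宣布降息", "finance"),
    ("球队赢得冠军", "sports"),
    ("足球比赛今晚开始", "sports"),
]
unique = deduplicate(records)
model = NaiveBayes().fit(unique)
finance_guess = model.predict("股市下跌")
sports_guess = model.predict("足球冠军")
```

Removing duplicates before splitting the data matters because an identical headline appearing in both the training and holdout sets would inflate the measured accuracy.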
