ARTICLE
TITLE

The Impact of Data Preparation and Model Complexity on the Natural Language Classification of Chinese News Headlines

Torrey Wagner    
Dennis Guhl and Brent Langhals    

Abstract

Given the emergence of China as a political and economic power in the 21st century, there is increased interest in analyzing Chinese news articles to better understand developing trends in China. Because of the volume of material, automatically categorizing Chinese-language news articles by headline or title can be an effective way to sort them for efficient review. A 383,000-headline dataset from the Toutiao website, labeled with 15 categories, was evaluated with natural language processing to predict topic categories. The influence of six data preparation variations on the predictive accuracy of four algorithms was studied. The simplest model (Naïve Bayes) achieved 85.1% accuracy on a holdout dataset, while the most complex model (a neural network using BERT) achieved 89.3%. The most useful data preparation steps were identified, and the underlying complexity and computational cost of automating the categorization process were also examined. The BERT model required 170x more time to train, was slower to predict by a factor of 18,600, and required 27x more disk space to save, indicating that it may be the best choice for low-volume applications where the highest accuracy is needed. For larger-scale operations where a slight performance degradation is tolerable, however, the Naïve Bayes algorithm could be the best choice. Nearly one in four records in the Toutiao dataset is a duplicate, and this is the first published analysis with duplicates removed.
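
The two ideas the abstract combines, removing duplicate records and classifying headlines with Naïve Bayes, can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy headlines, the two labels (`finance`, `sports`), the choice of character bigrams as features, and the rule that a duplicate means an identical headline string are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def deduplicate(records):
    """Keep the first record for each headline text (assumed duplicate rule)."""
    seen = set()
    unique = []
    for headline, label in records:
        if headline not in seen:
            seen.add(headline)
            unique.append((headline, label))
    return unique


def char_ngrams(text, n=2):
    """Overlapping character n-grams; avoids needing a Chinese word segmenter."""
    text = text.strip()
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, records):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for headline, label in records:
            self.class_counts[label] += 1
            for gram in char_ngrams(headline):
                self.feature_counts[label][gram] += 1
                self.vocab.add(gram)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, headline):
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for label, count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / self.total)
            denom = sum(self.feature_counts[label].values()) + vocab_size
            for gram in char_ngrams(headline):
                score += math.log((self.feature_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data, not taken from the Toutiao dataset.
records = [
    ("股市今日大涨", "finance"),
    ("股市今日大涨", "finance"),  # exact duplicate, dropped below
    ("银行下调利率", "finance"),
    ("球队赢得比赛", "sports"),
    ("球员打破纪录", "sports"),
]
unique = deduplicate(records)
model = NaiveBayes().fit(unique)
print(len(records), "->", len(unique), "records after deduplication")
print(model.predict("股市下跌"), model.predict("球队比赛"))
```

Deduplicating before splitting off a holdout set matters because an identical headline appearing in both training and holdout data would make the measured accuracy look better than it is.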
