JOURNAL
Algorithms

ARTICLE

TITLE

On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2 - Applicability Domain and Outliers

Cindy Trinh, Silvia Lasala, Olivier Herbinet and Dimitrios Meimaroglou

Abstract

This article investigates the applicability domain (AD) of machine learning (ML) models trained on high-dimensional data, for the prediction of the ideal gas enthalpy of formation and entropy of molecules via descriptors. The AD is crucial as it describes the space of chemical characteristics in which the model can make predictions with a given reliability. This work studies the AD definition of an ML model throughout its development procedure: during data preprocessing, model construction and model deployment. Three AD definition methods, commonly used for outlier detection in high-dimensional problems, are compared: isolation forest (iForest), random forest prediction confidence (RF confidence) and k-nearest neighbors in the 2D projection of descriptor space obtained via t-distributed stochastic neighbor embedding (tSNE2D/kNN). These methods compute an anomaly score that can be used instead of the distance metrics of classical low-dimension AD definition methods, the latter being generally unsuitable for high-dimensional problems. Typically, a molecule is considered to lie within the AD if its distance from the training domain (in low-dimensional problems) or its anomaly score (in high-dimensional problems) is below a given threshold. During data preprocessing, the three AD definition methods are used to identify outlier molecules and the effect of their removal is investigated. A more significant improvement of model performance is observed when outliers identified with RF confidence are removed (e.g., for a removal of 30% of outliers, the MAE (Mean Absolute Error) of the test dataset is divided by 2.5, 1.6 and 1.1 for RF confidence, iForest and tSNE2D/kNN, respectively). While these three methods identify X-outliers, the effect of other types of outliers, namely Model-outliers and y-outliers, is also investigated. In particular, the elimination of X-outliers followed by that of Model-outliers enables us to divide MAE and RMSE (Root Mean Square Error) by 2 and 3, respectively, while reducing overfitting. The elimination of y-outliers does not display a significant effect on the model performance. During model construction and deployment, the AD serves to verify the position of the test data and of different categories of molecules with respect to the training data and associate this position with their prediction accuracy. For the data that are found to be close to the training data, according to RF confidence, and display high prediction errors, tSNE 2D representations are deployed to identify the possible sources of these errors (e.g., representation of the chemical information in the training data).
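The threshold rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the 2D projection of descriptor space (e.g., from tSNE) has already been computed, and it uses the mean distance to the k nearest training points as the anomaly score. The function names, the toy coordinates, `k`, the removed fraction and the threshold are all hypothetical choices made for illustration.

```python
import math

def knn_anomaly_scores(train, queries, k=3):
    """Mean Euclidean distance from each query to its k nearest training points."""
    scores = []
    for q in queries:
        dists = sorted(math.dist(q, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores

def drop_outliers(points, fraction, k=3):
    """Preprocessing step: remove the given fraction of points with the
    highest leave-one-out kNN anomaly score."""
    scores = []
    for i, p in enumerate(points):
        others = points[:i] + points[i + 1:]
        scores.append(knn_anomaly_scores(others, [p], k)[0])
    n_drop = int(fraction * len(points))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [p for i, p in enumerate(points) if i not in dropped]

def in_domain(scores, threshold):
    """A point lies within the AD if its anomaly score is below the threshold."""
    return [s < threshold for s in scores]

# Toy 2D embedding of training molecules (hypothetical coordinates).
train_2d = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]

# Data preprocessing: drop the 20% most isolated training points.
cleaned = drop_outliers(train_2d, fraction=0.2)

# Deployment: check whether new molecules fall inside the AD.
new_2d = [(0.4, 0.6), (8.0, 8.0)]
flags = in_domain(knn_anomaly_scores(cleaned, new_2d), threshold=2.0)
print(cleaned)
print(flags)
```

In this toy setting the isolated training point is removed during preprocessing, and only the query close to the remaining cluster is flagged as inside the AD. In practice the threshold and the removed fraction would have to be tuned against prediction error, as the study does when comparing removal percentages.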

Keywords

machine learning - QSPR/QSAR - high-dimensional data - descriptors - thermodynamic properties - applicability domain - outlier detection

ISSUE

Volume: 16 Issue: 12 (2023)

SUBJECTS

ENGINEERING AND CIVIL CONSTRUCTION
TECHNOLOGY

DOI

https://doi.org/10.3390/a16120573
