Gene Association Classification for Autism Spectrum Disorder: Leveraging Gene Embedding and Differential Gene Expression Profiles to Identify Disease-Related Genes

Suratanee, Apichat; Plaimas, Kitiporn

doi:10.3390/app13158980

by Apichat Suratanee ^1,2 and Kitiporn Plaimas ^3,4,*

¹ Department of Mathematics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand

² Intelligent and Nonlinear Dynamic Innovations Research Center, Science and Technology Research Institute, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand

³ Advanced Virtual and Intelligent Computing (AVIC) Center, Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand

⁴ Omics Science and Bioinformatics Center, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand

^* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(15), 8980; https://doi.org/10.3390/app13158980

Submission received: 14 July 2023 / Revised: 2 August 2023 / Accepted: 3 August 2023 / Published: 5 August 2023

(This article belongs to the Special Issue Artificial Intelligence in Bioinformatics: Current Status and Future Prospects)

Abstract:

Identifying genes associated with autism spectrum disorder (ASD) is crucial for understanding the underlying mechanisms of the disorder. However, ASD is a complex condition involving multiple mechanisms, and this has resulted in an unclear understanding of the disease and a lack of precise knowledge concerning the genes associated with ASD. To address these challenges, we conducted a systematic analysis that integrated multiple data sources, including associations among ASD-associated genes and gene expression data from ASD studies. With these data, we generated both a gene embedding profile that captured the complex relationships between genes and a differential gene expression profile (built from the gene expression data). We utilized the XGBoost classifier and leveraged these profiles to identify novel ASD associations. This approach revealed 10,848 potential gene–gene associations and inferred 125 candidate genes, with DNA Topoisomerase I, ATP Synthase F1 Subunit Gamma, and Neuronal Calcium Sensor 1 being the top three candidates. We conducted a statistical analysis to assess the relevance of candidate genes to specific functions and pathways. Additionally, we identified sub-networks within the candidate network to uncover sub-groups of associations that could facilitate the identification of potential ASD-related genes. Overall, our systematic analysis, which integrated multiple data sources, represents a significant step towards unraveling the complexities of ASD. By combining network-based gene associations, gene expression data, and machine learning, we contribute to ASD research and facilitate the discovery of new targets for molecularly targeted therapies.

Keywords:

gene associations; embeddings; network analysis; machine learning

1. Introduction

Autism spectrum disorder (ASD) is a neurodevelopmental disease characterized by difficulties in social interaction, impaired communication, and the presence of restricted, repetitive behaviors, interests, or activities [1,2,3,4]. The genetically heterogeneous nature of ASD presents a formidable and multifaceted challenge. It exerts its impact across multiple systems, intertwining genetic, epigenetic, and environmental factors [5,6,7,8]. There has been a growing focus on genetic variants linked to ASD, while immune dysbiosis and gut microbiota have emerged as new focuses in ASD research since 2015 [9]. Significant progress has been made in understanding the genetics, epigenetics, neuropathology, neuroanatomy, neurochemistry, and neuroimaging of ASD [10,11]. In addition, large-scale studies using whole-genome sequencing have proposed several ASD-risk genes and ASD-associated CNVs [12,13,14,15,16]. Recent advances in genomic research have led to the identification of numerous genes associated with ASD. According to data from AutDB [17,18], 1404 genes and 38,296 protein interactions have been implicated in ASD. Additionally, AutismKB [19] has found 1379 genes implicated in ASD. These findings highlight the intricate network of genetic factors contributing to this disorder. Recently, the impact of molecular processes and pathways, such as neurotransmitter release regulation, synaptic function and plasticity, immune system function, and signaling pathways, has become a focus of ASD research [20,21,22]. Despite the identification of several risk genes associated with ASD, the precise genes implicated in ASD development and the specific mechanisms by which they contribute to the disorder are still not fully understood [18]. Experimental studies in ASD can be challenging, time-consuming, and resource-intensive due to ethical and practical constraints. However, computational approaches save time and resources, enabling researchers to simulate experiments in controlled environments and to gain insights across scenarios beyond those covered by specific experiments. Systematic computational analysis is therefore a useful complement to experimental work.

Network-based approaches are systematic analysis tools that have been successfully applied to identify disease-related genes in various disorders [21,22,23,24,25,26]. Furthermore, certain studies have sought to capture the essence of networks or graphs using low-dimensional features while still preserving the network information. Consequently, graph-embedding techniques have been employed. Among these techniques, node embedding has emerged as a successful approach for identifying disease-related genes or risk genes in several diseases [27,28,29,30]. Ata et al. [27] predicted disease-related genes using embeddings obtained from protein–protein interaction networks and keywords extracted from UniProt [31] as features for genes. These features were applied to four classification models using feature-selection and oversampling techniques to successfully predict disease-related genes for Alzheimer’s disease, breast cancer, colorectal cancer, diabetes mellitus, lung cancer, obesity, and prostate cancer. Lagisetty et al. [15] considered gene–gene interactions in order to identify risk genes for Alzheimer’s disease. They calculated edge weights for an individual’s network using the perturbation scores of the connected genes. These edge weights were then averaged to construct one case-specific network and one control-specific network. Subsequently, node embedding techniques were employed on these networks. The resulting embeddings were utilized to measure the distances between nodes in the case and control networks. Wang et al. [14] employed graph-embedding methods and ensemble learning to predict disease–gene associations. They constructed a heterogeneous network using gene–disease associations, gene–chemical associations, disease–chemical associations, and gene–gene associations as edges of the network. Nodes in the network represented genes, diseases, and chemicals. The embedding vectors of the genes and diseases were merged to represent gene–disease pairs and random forest classifiers were applied. Ten top novel gene–disease pairs were proposed, along with supporting evidence. These studies successfully employed network-based approaches to identify promising risk genes and associations between diseases and genes.

In this study, we have merged various data sources pertaining to ASD by establishing a differential gene expression profile and constructing a gene association network to derive a gene embedding profile for ASD. By combining these diverse datasets, we have employed machine learning algorithms to identify potential gene associations. These associations have subsequently been utilized to discover novel genes associated with ASD. Furthermore, we have employed a network clustering algorithm to further investigate a specific group of genes and have conducted functional and pathway enrichment analyses to pinpoint the functions and pathways associated with the identified candidate genes.

2. Materials and Methods

2.1. Data Preparation

We gathered gene expression data associated with ASD from the Gene Expression Omnibus (GEO) [10]. This study utilized two datasets: a whole-genome transcriptomic dataset (GSE6575), which included microarray data from children with autism as well as children from the general population, and an ASD-related dataset (GSE28521) consisting of human post-mortem brain tissue samples. To preprocess the data, we employed GEO2R and performed a log2 transformation to achieve normalization. These processed gene expression data were then utilized to calculate differential gene expression values between the two groups (ASD and control). These differential gene expression values served as features in our analysis. To acquire protein–protein interaction data for genes associated with ASD, we referred to AutDB [16,32], a curated database that includes all known direct interactions between proteins, including protein binding, RNA binding, promoter binding, protein modification, auto-regulation, and direct regulation. We specifically selected interactions involving Homo sapiens and eliminated any redundant interactions. As a result, we obtained a total of 25,057 unique known interactions involving 12,480 genes. We collected known ASD genes from the Simons Foundation Autism Research Initiative (SFARI) database [33]. Genes in SFARI with scores ranging from 1 to 3 were included in our analysis. We also retrieved ASD-related genes with high confidence scores (core dataset) from the AutismKB 2.0 database [15].

2.2. Differential Gene Expression Profile

The probe IDs of the gene expression data were mapped to gene IDs and HGNC names [14] using the R package gProfiler2 [34]. We collapsed the gene expression data with the same HGNC names using the average method to obtain a single expression value for each gene. The GSE6575 dataset comprises 44 ASD-related samples (18 individuals with ASD with regression, 17 individuals with ASD without regression, and 9 individuals with mental retardation or developmental delay) and 12 control samples. To identify the differential gene expression profile of a gene, we compared the expression values between the disease and control groups. By calculating the differential expression values for all possible pairs of one ASD-related sample and one control sample, we obtained 44 × 12 = 528 differential expression features for the GSE6575 dataset. For the GSE28521 dataset, there were 39 ASD samples and 40 control samples. Consequently, we obtained 39 × 40 = 1560 differential expression features for each gene. Therefore, in total, we derived 528 + 1560 = 2088 features per gene from both datasets. Known ASD protein interactions were retrieved from the AutDB database [35,36]. We retained only interactions whose two nodes were both among our 5330 genes. Ultimately, we obtained 3611 known ASD interactions.
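The pairwise construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a "differential expression value" is the simple difference between the log2 expression of one ASD-related sample and one control sample, which the text does not state explicitly. The function name and the random toy matrices are hypothetical.

```python
import numpy as np

def pairwise_differential_features(case_expr, control_expr):
    """Return a (genes, n_case * n_control) matrix of case-minus-control values.

    case_expr:    shape (genes, n_case), log2 expression of ASD-related samples
    control_expr: shape (genes, n_control), log2 expression of control samples
    ASSUMPTION: the differential value is a plain difference of log2 values.
    """
    case_expr = np.asarray(case_expr, dtype=float)
    control_expr = np.asarray(control_expr, dtype=float)
    # Broadcasting (genes, n_case, 1) - (genes, 1, n_control) pairs every
    # case sample with every control sample for each gene.
    diffs = case_expr[:, :, None] - control_expr[:, None, :]
    return diffs.reshape(case_expr.shape[0], -1)

# Toy data with the sample counts reported in the text (values are random).
rng = np.random.default_rng(0)
n_genes = 5
gse6575 = pairwise_differential_features(
    rng.normal(size=(n_genes, 44)), rng.normal(size=(n_genes, 12)))
gse28521 = pairwise_differential_features(
    rng.normal(size=(n_genes, 39)), rng.normal(size=(n_genes, 40)))
profile = np.hstack([gse6575, gse28521])
print(gse6575.shape, gse28521.shape, profile.shape)
# (5, 528) (5, 1560) (5, 2088)
```

The printed shapes reproduce the feature counts given in the text: 528 and 1560 features per dataset, and 2088 features per gene after concatenation.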

2.3. Gene Embedding Profile

In the context of graph representation, a node can be represented as a low-dimensional feature in vector space using a node-embedding method, effectively preserving the maximum amount of information. In our study, we focused on exploring the protein interaction network, where we associated each node in the network with its corresponding gene. As a result, node-embedding methods allow us to effectively capture and encode information about the nodes and their relationships within the graph by generating a low-dimensional feature representation for each node [37]. These feature representations are commonly referred to as gene embeddings.

For this purpose, we employed the DeepWalk algorithm [38] in our study. This algorithm uses random walks to generate random paths starting from each node in the graph. These paths were then used to train a Word2Vec model with the Skip-gram algorithm to predict nearby nodes and obtain gene embeddings. The gene connections in this study were derived from interactions retrieved from STRING [39], and we only considered interactions with a confidence score of 900 or higher. These interactions linked two gene products that correspond to the genes found in the gene expression datasets. The network used in our analysis consisted of 5330 nodes and 45,247 interactions. The DeepWalk algorithm was applied to this network (walk length: 10; number of walks: 80). The embedding size was set to 128 and the window size was set to 5, with 3 iterations. As a result, we obtained 128 features for each gene in the form of a gene embedding profile. We utilized a Python implementation of DeepWalk available at https://github.com/shenweichen/GraphEmbedding (accessed on 16 February 2023) that used the Word2Vec model with the Skip-gram algorithm from the Python library ‘Gensim’ [40].
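To make the first stage of DeepWalk concrete, the sketch below generates uniform random walks on a small undirected graph using only the Python standard library. It is an illustration under stated assumptions rather than the implementation used in the study: the toy edge list and node names are hypothetical, and "number of walks: 80" is interpreted here as 80 walks started from every node. The second stage (training a Skip-gram Word2Vec model on the walks) is only indicated in a comment.

```python
import random

def build_adjacency(edges):
    """Turn an undirected edge list into {node: sorted list of neighbours}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return {node: sorted(nbrs) for node, nbrs in adjacency.items()}

def random_walk(adjacency, start, walk_length, rng):
    """Uniform random walk containing at most walk_length nodes."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:  # dead end (cannot occur for nodes taken from edges)
            break
        walk.append(rng.choice(neighbours))
    return walk

def generate_walks(adjacency, num_walks, walk_length, seed=0):
    """Start num_walks walks from every node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(adjacency, node, walk_length, rng))
    return walks

# Hypothetical toy network; the study used 5330 nodes and 45,247 interactions.
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g1"), ("g3", "g4")]
adjacency = build_adjacency(toy_edges)
walks = generate_walks(adjacency, num_walks=80, walk_length=10)
print(len(walks), len(walks[0]))  # 320 10

# Each walk is then treated as a "sentence" of node names. With gensim 4.x the
# embedding step would look roughly like
#   Word2Vec(walks, vector_size=128, window=5, sg=1, epochs=3)
# (not executed here).
```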

2.4. Classification of Gene Associations for ASD

Using the differential gene expression profile and gene embedding profile, we calculated association features for a gene pair using the Hadamard product. The Hadamard product, also known as the element-wise product, takes two matrices of the same dimensions (here, the feature vectors of the two genes) and returns a matrix of the same dimensions in which each element is the product of the corresponding elements. Subsequently, these association features for gene pairs were used as inputs to the classification models. In this study, we investigated four classifiers: XGBoost, naïve Bayes, neural network, and random forest classifiers. The aim was to classify known and unknown associations. To reduce the number of features and enhance classification performance, we employed the recursive feature elimination (RFE) algorithm. Logistic regression was used as the estimator, and three-fold cross-validation was applied to filter and select the most relevant features. RFE first trains the estimator on the initial set of features and determines the importance of each feature from the fitted model. The least important features are then eliminated, and the procedure is repeated recursively on the remaining features until the desired number of features is reached. After the RFE process was complete, the classifiers were trained using the data with the reduced features. To assess the performance of the models, we employed five-fold cross-validation repeated ten times. The hyperparameters of each classifier were optimized using a grid search with cross-validation. Supplementary Table S1 shows the hyperparameter sets of the classification models. Each machine learning process can be implemented in Python with the ‘xgboost’ library [41] for the XGBoost classifier, the ‘scikit-learn’ library [42] for the naïve Bayes and random forest classifiers, the RFE algorithm, and logistic regression, and the ‘keras’ library for the neural network from https://github.com/keras-team/keras (accessed on 16 February 2023).
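The Hadamard-product step can be illustrated with a short sketch. The gene names and the four-dimensional profiles below are hypothetical toy values; in the study each profile has 2216 entries (2088 differential expression features plus 128 embedding features).

```python
from itertools import combinations

import numpy as np

def association_features(profile_a, profile_b):
    """Element-wise (Hadamard) product of two gene profiles of equal length."""
    a = np.asarray(profile_a, dtype=float)
    b = np.asarray(profile_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("profiles must have the same shape")
    return a * b

# Hypothetical toy profiles (one row of features per gene).
toy_profiles = {
    "g1": [1.0, -2.0, 0.5, 3.0],
    "g2": [2.0, 0.5, -4.0, 1.0],
    "g3": [0.0, 1.0, 2.0, -1.0],
}

# One row of association features per unordered gene pair.
pairs = list(combinations(sorted(toy_profiles), 2))
X = np.vstack([association_features(toy_profiles[a], toy_profiles[b])
               for a, b in pairs])
print(pairs)
print(X.shape)  # (3, 4)
```

Because the product is commutative, the feature vector of a pair does not depend on the order of the two genes, which suits an undirected association.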

2.5. Enrichment Analysis

The candidate genes associated with ASD were utilized to identify their respective functions in terms of gene ontology [43] and pathways. We conducted a search of the gene list for GO biological processes, GO molecular functions, and GO cellular components to determine the functions of these genes. To identify relevant pathways, we searched the Kyoto Encyclopedia of Genes and Genomes (KEGG) [15] and Reactome (the European Bioinformatics Institute pathway database [14]). These gene set enrichment analyses were performed using the R package Enrichr [31]. We employed an adjusted p-value calculated using the Benjamini–Hochberg method.
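For readers unfamiliar with the Benjamini–Hochberg adjustment, the following sketch shows how adjusted p-values are obtained from raw p-values. It is a generic NumPy implementation of the standard procedure, not the code path of the enrichment package, and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p sorted ascending
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each adjusted value is the minimum of the scaled
    # values at its own rank and at all higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]                # hypothetical raw p-values
print(benjamini_hochberg(raw))
```

For these four values the adjusted p-values are 0.02, 0.04, 0.04, and 0.02, so all four terms would pass a 0.05 threshold.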

3. Results

3.1. Overview of the Study

Figure 1 gives an overview of our study, in which we aimed to predict gene associations for ASD by integrating gene expression data for ASD and network information related to ASD. We constructed differential gene expression profiles and gene embedding profiles using the available data. These profiles were then combined to create association features for gene pairs. Machine learning algorithms were applied to infer the gene associations, and the performances of the models were evaluated using the AutDB database [15,36]. By utilizing the predicted associations, we were able to identify genes potentially related to ASD. Subsequently, network clustering was employed to detect significant sub-networks associated with ASD. Finally, statistical, functional, and literature-based investigations were conducted on these genes to gain further insights.

3.2. Gene Association Predictions Using Differential Gene Expression and Gene Embedding Profiles

In order to acquire a differential gene expression profile for ASD, we extracted gene expression data from the GEO database [14] for the GSE6575 and GSE28521 datasets. The GSE6575 dataset encompassed 22,237 genes, whereas the GSE28521 dataset comprised 8363 genes. By comparing these datasets, we identified 8268 genes that were common to both. Subsequently, we narrowed down our selection to only include genes present in the STRING database [41], and this resulted in a final set of 5330 genes. For each gene, we calculated the differential gene expression values, resulting in 2088 features. Additionally, we performed gene embeddings on the ASD-associated gene network and obtained 128 features. By combining the 2088 differential expression features with the 128 embedding features, we obtained a total of 2216 features for each gene. These differential gene expression profiles and gene embedding profiles were concatenated for all 5330 genes. After this, these profiles were used to calculate association features for gene pairs, and this resulted in a total of 14,201,785 gene pairs (all unordered pairs of the 5330 genes), each with 2216 association features. Among these pairs, 3611 pairs were recognized in the AutDB database [42,44] and thus defined as the positive set for gene association prediction. To train and test the classification models, we employed a balanced binary classification approach in which the number of positive instances and the number of negative instances were equal in order to ensure unbiased predictions for both classes. To achieve this, we employed the undersampling technique, randomly selecting negative samples in the same quantity as the positive samples, which amounted to 3611 instances per class. During the training processes, we applied the RFE algorithm to find the optimal features for each classifier. In this study, the naïve Bayes classifier, neural network classifier, random forest classifier, and the XGBoost algorithm were tested for their classification performance. The XGBoost algorithm proved to be the best classifier. The performances of these classification methods and the top relevant features recognized by the RFE algorithm are compared and discussed in this section. The promising associations detected by the algorithm are reported in the next section.
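The pair count and the balanced sampling described above can be checked with a short sketch. It assumes that negatives are drawn uniformly at random from all gene pairs that are not in the positive set, which is one plausible reading of the undersampling step; the toy gene names and positive pairs are hypothetical.

```python
import math
import random
from itertools import combinations

# Number of unordered pairs among 5330 genes, as reported in the text.
print(math.comb(5330, 2))  # 14201785

def undersample_negatives(genes, positive_pairs, seed=0):
    """Return (positives, negatives) with equally many pairs in each class.

    ASSUMPTION: negatives are sampled uniformly from pairs without a known
    interaction. Pairs are stored as alphabetically sorted tuples.
    """
    positives = sorted({tuple(sorted(pair)) for pair in positive_pairs})
    positive_set = set(positives)
    candidates = [pair for pair in combinations(sorted(genes), 2)
                  if pair not in positive_set]
    rng = random.Random(seed)
    negatives = rng.sample(candidates, k=len(positives))
    return positives, negatives

toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_positives = [("g2", "g1"), ("g3", "g4"), ("g1", "g5")]
positives, negatives = undersample_negatives(toy_genes, toy_positives)
print(len(positives), len(negatives))  # 3 3
```

At the scale of the study, enumerating all candidate pairs in a list is memory-intensive; sampling pair indices instead would avoid materializing them, at the cost of a slightly longer sketch.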

Using the complete association features generated from the differential expression profiles and the gene embedding profiles, the XGBoost classifier yielded the best performance, with an average area under the curve (AUC) of 0.8249 and an average accuracy of 0.7532. Furthermore, we also investigated the impact of different feature sets on classification performance. Therefore, we discarded some of the association features generated from the gene embedding profile and used the same procedures for the classifications. The performance slightly decreased, with an average AUC of 0.8074 and an average accuracy of 0.7368. Moreover, we excluded the differential expression profile from the association features and performed the same classification procedures. The performance further decreased, with an average AUC of 0.7805 and an average accuracy of 0.7106. These results indicate that both the differential gene expression profile and the gene embedding profile are essential for the association classifications.

Furthermore, we explored other classifiers, i.e., the naïve Bayes, neural network, and random forest classifiers, using the complete set of association features. The XGBoost classifier outperformed all of them. The random forest classifier performed well, with an average AUC of 0.7779, but remained inferior to the XGBoost classifier in our experiments. The naïve Bayes and neural network classifiers performed worse, with average AUCs of 0.6946 and 0.6539, respectively. Thus, the XGBoost classifier, which used features from both the differential gene expression and gene embedding profiles, demonstrated the best performance. A summary of the performances is shown in Figure 2. Figure S1 displays the ROC curves of each classifier, and Table S3 provides a comprehensive summary of the evaluation metrics, including F1 measure, precision, and recall, for each classifier.

To understand which association features were relevant for distinguishing positive pairs from negative pairs, we investigated the top features obtained from the RFE algorithm for all classifiers. We observed that 80% (8 out of 10) of the top 10 features were association features calculated from gene embedding profiles. As we extended the analysis to the top 20, 30, 40, and 50 features, we found that the proportion of association features calculated from gene embedding profiles decreased to 65%, 50%, 40%, and 34%, respectively. Supplementary Table S2 provides a comprehensive overview of the percentages of association features calculated from the top selected features (ranging from 10 to 100) in both the gene embedding profiles and the differential expression profiles. We further investigated the distribution of the top 10 features between the positive and negative sets by calculating the correlation between the feature values and the class labels. The positive class was labeled as 1, and the negative class was labeled as 0. The results indicate that 9 out of these 10 features exhibited negative correlations, meaning that higher values of these features indicate a lower likelihood that a gene pair belongs to the positive class. However, the absolute correlation values were relatively low. This suggests that using only one feature may not be sufficient to accurately predict an association, and combining various features could lead to a more effective predictive model. It is important to note that these top ten features were derived from voting or counting based on the top selected features obtained from the RFE algorithm. They were not directly combined to train a machine learning model. Supplementary Figure S2 shows a plot of the correlation values of these top ten features.
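The correlation between a single feature and the binary class labels can be computed as sketched below. The text does not name the correlation coefficient, so this sketch assumes the Pearson coefficient (which, for a 0/1 label, is the point-biserial correlation). The feature values and labels are hypothetical.

```python
import numpy as np

def feature_label_correlation(feature_values, labels):
    """Pearson correlation between one feature and 0/1 class labels."""
    x = np.asarray(feature_values, dtype=float)
    y = np.asarray(labels, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical example: the feature tends to be larger for negative pairs
# (label 0) than for positive pairs (label 1).
toy_feature = [0.9, 0.7, 0.8, 0.2, 0.3, 0.1]
toy_labels = [0, 0, 0, 1, 1, 1]
r = feature_label_correlation(toy_feature, toy_labels)
print(r < 0)  # True
```

A negative coefficient, as in this toy case, corresponds to the situation described above in which higher feature values are associated with the negative class.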

3.3. ASD-Related Gene Identification

We employed the classification models to evaluate all possible gene pairs. Each classifier assigned a score of 1 if it predicted that the gene pair was associated with ASD, and a score of 0 if it predicted otherwise. The prediction score for a gene pair was then computed as the average value derived from the classifiers. A gene pair achieved the highest prediction score when all of the classification models consistently classified it as associated with ASD. Ultimately, our analysis yielded 10,848 association pairs with the highest possible prediction scores. These associated genes were then investigated to identify novel ASD candidate genes. For each pair, we considered a gene that was connected to a gene in the set of known ASD genes collected from the SFARI Gene database and the AutismKB 2.0 database. The candidate genes were compiled and their frequencies were calculated. Among them, three genes showed a frequency of six (the highest frequency): TOP1 (DNA Topoisomerase I), ATP5F1C (ATP Synthase F1 Subunit Gamma), and NCS1 (Neuronal Calcium Sensor 1). The group of genes with the second-highest frequency (five) totaled nine: ARNT, CALU, GALT, GPC2, HDAC2, IGSF8, RIOK3, UBC, and SNRPN. Additionally, there were 113 genes with a frequency of four (the third-highest value). A list of genes with frequencies of six, five, and four is provided in Supplementary Table S4. These genes were used for further analysis.
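The scoring and counting steps above can be sketched as follows. This is an interpretation, not the authors' code: it assumes that a candidate gene is a gene outside the known ASD set that is paired with a known ASD gene, and that its frequency is the number of top-scoring pairs in which this happens. Gene names (k = known, c = candidate) and votes are hypothetical.

```python
from collections import Counter

def prediction_score(votes):
    """Average of the 0/1 predictions given by the classifiers for one pair."""
    return sum(votes) / len(votes)

def candidate_frequencies(pairs, known_genes):
    """Count how often each non-known gene is paired with a known ASD gene."""
    known = set(known_genes)
    counts = Counter()
    for a, b in pairs:
        if a in known and b not in known:
            counts[b] += 1
        elif b in known and a not in known:
            counts[a] += 1
    return counts

# Hypothetical votes from four models for five gene pairs.
toy_votes = {
    ("k1", "c1"): [1, 1, 1, 1],
    ("k2", "c1"): [1, 1, 1, 1],
    ("k1", "c2"): [1, 1, 0, 1],   # not unanimous, so it is dropped
    ("c1", "c2"): [1, 1, 1, 1],   # no known gene involved, so it is not counted
    ("k1", "k2"): [1, 1, 1, 1],   # both genes already known, so it is not counted
}
top_pairs = [pair for pair, votes in toy_votes.items()
             if prediction_score(votes) == 1.0]
freq = candidate_frequencies(top_pairs, known_genes={"k1", "k2"})
print(freq.most_common())  # [('c1', 2)]
```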

3.4. Gene Ontology Enrichment for Candidate Genes

To investigate the functions of the candidate genes, we performed an enrichment analysis. First, we examined the cellular component domain and identified several interesting findings related to different cellular compartments. In ‘Mitochondrial Membrane’ (GO:0031966), we observed the enrichment of genes such as DNAJC19, COA3, ABCB6, MRPS22, BNIP3, MRPL17, TIMM22, ATP5F1C, ATP5F1D, COQ5, BNIP1, ALDH18A1, and SLC25A4, with an adjusted p-value of 0.00325. ‘Organelle Inner Membrane’ (GO:0019866) showed the enrichment of genes such as DNAJC19, COA3, MRPS22, MRPL17, TIMM22, ALDH18A1, ATP5F1C, SIRT1, SLC25A4, ATP5F1D, and COQ5, with an adjusted p-value of 0.00325. This GO term has direct descendants related to inner membranes, such as the nuclear inner membrane, mitochondrial inner membrane, and plastid inner membrane. The genes enriched in GO:0019866, excluding SIRT1, were also enriched in ‘Mitochondrial Inner Membrane’ (GO:0005743), with an adjusted p-value of 0.00589.

Furthermore, we observed the enrichment of genes in ‘Focal Adhesion’ (GO:0005925), with an adjusted p-value of 0.02691, and in ‘Cell–Substrate Junction’ (GO:0030055), with an adjusted p-value of 0.02691, suggesting the importance of cell junctions that connect cells to the extracellular matrix. Key genes in these categories include RPL3, RPS19, ACTN2, CLTC, ANXA5, CHP1, ILK, ATP6V0C, and FERMT2. Additionally, the enriched genes involved in ‘Preribosome, Small Subunit Precursor’ (GO:0030688) were LTV1 and RIOK3, with an adjusted p-value of 0.03996. Finally, the enrichment of genes, including DNAJC19, TSFM, POLG2, MRPS22, ATP5F1C, ATP5F1D, COQ5, and BCAT2, was observed in ‘Mitochondrial Matrix’ (GO:0005759), with an adjusted p-value of 0.03996. The enriched GO terms are presented in Figure 3a.

Focusing on the biological process, our enrichment analysis revealed a statistically significant enrichment of the GO term ‘Negative Regulation of Macromolecule Biosynthetic Process’ (GO:0010558), with an adjusted p-value of 0.04077. This suggests the involvement of mechanisms that decrease the rate or extent of the chemical reactions and pathways involved in macromolecule formation. The enriched genes for this term were FXR1, RBM4, HDAC2, GSTP1, CLTC, and RACK1. Additionally, we observed a significant enrichment of the GO term ‘Cell–Matrix Adhesion’ (GO:0007160), with an adjusted p-value of 0.0408. The enriched genes for this term included ACTN2, ILK, PTPRK, FERMT2, PPFIA2, and BCAT2. This term indicated that the adhesive properties by which a cell adheres to the extracellular matrix play a crucial role in cellular processes. Our analysis identified the GO term ‘Translation’ (GO:0006412) as significantly enriched, with an adjusted p-value of 0.04364, implying the significance of cellular metabolic processes at the level of protein formation. The key genes for this term included TSFM, RPL3, EEF2K, RPS19, MRPS22, RACK1, RPL24, and MRPL17. The GO term ‘Negative Regulation Of I-kappaB kinase/NF-kappaB Signaling’ (GO:0043124) showed significant enrichment, with an adjusted p-value of 0.04364, indicating the potential modulation of the I-kappaB kinase/NF-kappaB signaling pathway. The enriched genes for this GO term included TLE1, RIOK3, GSTP1, and SIRT1. A bar plot illustrating these GO terms is shown in Figure 3b.

In terms of molecular function, we identified several significantly enriched terms related to protein binding and kinase activity. Particularly, ‘Protein Kinase Binding’ (GO:0019901) and ‘Kinase Binding’ (GO:0019900) exhibited highly significant enrichment, each with an adjusted p-value of 0.00319. These findings suggest that our gene set may play a crucial role in catalyzing the transfer of a phosphate group. Furthermore, ‘RNA Binding’ (GO:0003723) also showed significant enrichment, with an adjusted p-value of 0.00392, indicating a potential involvement in the process of binding to an RNA molecule. Additionally, ‘Protein Serine/Threonine Kinase Activity’ (GO:0004674) was significantly enriched, with an adjusted p-value of 0.01512, suggesting that the genes may play a role in the catalysis of reactions. The enriched genes included PRKCG, BRSK1, PRKAB2, TBK1, RIOK3, CSNK2B, MAP3K20, ILK, and TOP1. We also observed the enrichment of the term ‘Pre-mRNA Binding’ (GO:0036002), with an adjusted p-value of 0.03421. This finding provides valuable insights into the mechanisms involved in binding to a pre-messenger RNA. The enriched genes associated with this term were RBM4, TARBP2, and U2AF1L4. Figure 3c shows the enriched molecular function gene ontology terms.

3.5. Pathway Enrichment for Candidate Genes

Furthermore, we conducted an enrichment analysis using the KEGG pathway database [45] to identify the significantly enriched biological pathways associated with our gene set. The analysis revealed several KEGG pathways. Interestingly, the pathway of neurodegeneration (adjusted p-value of 0.00851) emerged as a highly enriched pathway, suggesting its relevance to our study. Additionally, specific neurodegenerative diseases, such as Huntington’s disease (adjusted p-value of 0.01291) and Parkinson’s disease (adjusted p-value of 0.03177), were identified as enriched pathways, indicating their potential involvement in our gene set. Mitophagy (adjusted p-value of 0.03177), shigellosis (adjusted p-value of 0.03177), which is also known as Shigella infection and is caused by the invasion of the epithelium lining the terminal ileum, colon, and rectum by Shigella species [46], and Salmonella infection (adjusted p-value of 0.03177) also exhibited significant enrichment, implying their association with cellular processes, infection, and neurodegeneration. The synaptic vesicle cycle (adjusted p-value of 0.04044), taurine and hypotaurine metabolism (adjusted p-value of 0.04984), GABAergic synapse (adjusted p-value of 0.04984), and morphine addiction (adjusted p-value of 0.04984) pathways also demonstrated enrichment, suggesting their potential involvement in neuronal function and addiction-related processes. Figure 4 shows a bar plot of these KEGG pathways.

In addition, we conducted an enrichment analysis using the Reactome pathway database [47]. The ‘MHC Class II Antigen Presentation’ pathway (R-HSA-2132295) showed the highest enrichment (an adjusted p-value of 0.00318), indicating the involvement of our genes in immune responses and antigen presentation. Significant enrichment was also observed in the following pathways: ‘Cellular Responses To Stimuli’ (R-HSA-8953897), ‘Organelle Biogenesis And Maintenance’ (R-HSA-1852241), and ‘Golgi-to-ER Retrograde Transport’ (R-HSA-8856688) (an adjusted p-value of 0.01448). These findings highlight the importance of our genes in cellular responses to external molecular and physical signals, the biogenesis of subcellular structures, and intracellular transport. Furthermore, the ‘Metabolism’ (R-HSA-1430728) and ‘Cellular Responses To Stress’ (R-HSA-2262752) pathways exhibited significant enrichment (adjusted p-values of 0.0216 and 0.02234, respectively), suggesting their involvement in cellular metabolic processes and their ability to modulate molecular processes in response to external and internal stresses.

Furthermore, the ‘Mitochondrial Biogenesis’ (R-HSA-1592230) and ‘Metabolism Of RNA’ (R-HSA-8953854) pathways showed significant enrichment (adjusted p-values of 0.02699 and 0.02713, respectively), suggesting their roles in mitochondrial processes and processes involved in the modification of RNA transcription products. Significant enrichment was also observed in the following pathways: ‘Membrane Trafficking’ (R-HSA-199991), ‘Metabolism of Carbohydrates’ (R-HSA-71387), and ‘Signaling By WNT’ (R-HSA-195721), with adjusted p-values of 0.03311, 0.03384, and 0.03777, respectively. Additionally, the ‘Selective Autophagy’ (R-HSA-9663891), ‘Vesicle-mediated Transport’ (R-HSA-5653656), ‘Transmission Across Chemical Synapses’ (R-HSA-112315), ‘Intra-Golgi And Retrograde Golgi-to-ER Traffic’ (R-HSA-6811442), ‘mRNA Splicing-Major Pathway’ (R-HSA-72163), and ‘POLB-Dependent Long Patch Base Excision Repair’ (R-HSA-110362) pathways exhibited significant enrichment (adjusted p-values of 0.03778, 0.03955, 0.04793, 0.04793, 0.04793, and 0.04901, respectively). A list of enriched Reactome pathways is shown in Figure 5. Supplementary Table S5 provides comprehensive information on enriched GO terms and pathways.
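The adjusted p-values reported above come from an over-representation analysis. This section does not restate the exact test or background set used, so the following is only a minimal sketch of one common formulation: a one-sided hypergeometric test per pathway followed by a Benjamini-Hochberg correction. The function names, the toy gene identifiers, and the background size of 100 are illustrative assumptions, not the authors' settings or the internals of any particular enrichment tool.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n belong to the pathway."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(candidates, pathways, background_size):
    """Map each pathway name to its adjusted over-representation p-value."""
    names = sorted(pathways)
    raw = []
    for name in names:
        members = pathways[name]
        overlap = len(candidates & members)
        raw.append(
            hypergeom_sf(overlap, background_size, len(members), len(candidates))
        )
    return dict(zip(names, benjamini_hochberg(raw)))


# Toy data (hypothetical identifiers, not genes from the study).
toy_candidates = {f"g{i}" for i in range(1, 11)}
toy_pathways = {
    # 8 members, 5 of which are candidates.
    "toy_pathway_A": {"g1", "g2", "g3", "g4", "g5", "x1", "x2", "x3"},
    # 20 members, 2 of which are candidates.
    "toy_pathway_B": {"g6", "g7"} | {f"y{i}" for i in range(1, 19)},
}
toy_result = enrich(toy_candidates, toy_pathways, background_size=100)
for pathway, p_adj in toy_result.items():
    print(pathway, f"{p_adj:.5f}")
```

In this toy setting, the pathway sharing five of its eight members with the candidate list passes a 0.05 threshold, while the larger pathway sharing only two members does not. The choice of background set strongly affects such p-values, which is why it is passed explicitly here.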

3.6. Candidate Network and Sub-Networks

With our predicted associations, we focused on the pairs containing at least one candidate gene. Among these pairs, some were known associations, while others were not. Likewise, some of the genes involved were known ASD-related genes, while others were novel candidate genes. We represented these relationships as a graph: Figure 6 shows the network of candidate genes and their connecting genes. This network contained interesting sub-groups of genes, which prompted us to analyze it further using ClusterViz [48], a plugin for Cytoscape [49], with the EAGLE clustering algorithm [50]. With this algorithm, we applied a clique size threshold of 3 and a complex size threshold of 2. The results revealed three sub-networks, and these are shown in Figure 7.
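To make the two steps above concrete, the sketch below first keeps only the pairs that contain at least one candidate gene and then enumerates maximal cliques of at least three genes. Clique enumeration with a size threshold is only the seeding stage of EAGLE as we understand it; the subsequent agglomeration and dendrogram cut are not reproduced, and ClusterViz is not called. All function names and gene identifiers are hypothetical.

```python
def pairs_with_candidate(pairs, candidates):
    """Keep pairs in which at least one gene is a candidate gene."""
    return [(a, b) for a, b in pairs if a in candidates or b in candidates]


def build_adjacency(pairs):
    """Build an undirected adjacency map from a list of gene pairs."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def maximal_cliques(adj, min_size=3):
    """Enumerate maximal cliques (Bron-Kerbosch, no pivoting) of >= min_size nodes."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy predicted associations (hypothetical identifiers).
toy_pairs = [
    ("g1", "g2"), ("g1", "g3"), ("g2", "g3"),
    ("g3", "g4"),
    ("g4", "g5"), ("g4", "g6"), ("g5", "g6"),
    ("g7", "g8"),  # no candidate gene in this pair, so it is dropped
]
toy_candidates = {"g1", "g2", "g4", "g5"}

kept_pairs = pairs_with_candidate(toy_pairs, toy_candidates)
toy_cliques = maximal_cliques(build_adjacency(kept_pairs), min_size=3)
print(len(kept_pairs), toy_cliques)
```

Note that once the network is restricted to pairs containing a candidate gene, every triangle must include at least two candidate genes, because each of its three edges needs one.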

Figure 7a shows a sub-network consisting of 53 genes and 87 associations. The three genes with the highest degree values were CUL3, MECP2, and TOP1 (with degree values of 33, 31, and 18, respectively). CUL3 and MECP2 are known ASD-related genes. DNA Topoisomerase I (TOP1) is therefore the interesting candidate gene in this sub-network, owing to its connections to several known ASD-related genes, including AP3B2, CCT4, CUL3, MECP2, and SLC12A5. Some of these links were already known associations found in the database. It has been reported that TOP1 inhibition reduces the expression of long genes linked to synapses and autism [14,15]. Additionally, this sub-network revealed other genes of interest that were connected to known ASD-related genes, including RBM17, RPL3, SNRPN, and UBC.
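The degree value used to rank genes in a sub-network is simply the number of distinct associations a gene takes part in. A minimal sketch of such a ranking is given below, assuming an undirected edge list in which duplicate pairs and self-pairs should be ignored; the function name and the toy identifiers are illustrative and do not come from the study's data.

```python
from collections import Counter


def degree_ranking(edges):
    """Rank nodes by degree (descending), breaking ties alphabetically."""
    # Treat edges as undirected, drop duplicates and self-loops.
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for edge in unique:
        for node in edge:
            degree[node] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Toy sub-network (hypothetical identifiers): two hubs and four partners.
toy_edges = [
    ("H1", "a"), ("H1", "b"), ("H1", "c"), ("H1", "d"),
    ("H2", "a"), ("H2", "b"), ("H2", "c"),
    ("a", "H1"),  # duplicate of ("H1", "a") in the other orientation
]
toy_ranking = degree_ranking(toy_edges)
print(toy_ranking[:3])
```

A useful sanity check on any such table is that the degrees sum to twice the number of distinct associations, since every association contributes to the degree of both of its genes.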

Figure 7b shows a sub-network comprising 33 genes and 60 associations. RBFOX1 and TSC1 had the highest degree values (30 and 29, respectively). However, these two genes were already known to be ASD-related genes. General Transcription Factor IIIC Subunit 3 (GTF3C3) is an intriguing gene in this sub-network. This gene exhibited a slightly higher betweenness centrality value (0.0625) than the other genes in the sub-network. Interestingly, GTF3C3 was connected to two known ASD-related genes, SDC2 and RBFOX1. The protein encoded by GTF3C3 is a part of the TFIIIC2 complex, which is involved in the recruitment of RNA polymerase III. GTF3C3 has been identified as a potential ASD gene [48].
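Betweenness centrality measures how often a gene lies on shortest paths between other genes. The text does not state which software or normalization produced the value of 0.0625, so the sketch below is only an illustration: it uses Brandes' algorithm for unweighted, undirected graphs and divides by the number of node pairs excluding the node itself, which is one common convention. The function names and the toy graphs are assumptions.

```python
from collections import deque


def build_adjacency(pairs):
    """Build an undirected adjacency map from a list of gene pairs."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def betweenness_centrality(adj):
    """Normalized betweenness for an unweighted, undirected graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}   # BFS distance from s
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Each unordered pair is visited twice, so this equals dividing the
    # pair count by (n - 1)(n - 2) / 2.
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: bc[v] * scale for v in adj}


# Toy graphs (hypothetical identifiers): a five-node path and a five-node star.
path_bc = betweenness_centrality(
    build_adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
)
star_bc = betweenness_centrality(
    build_adjacency([("hub", "p"), ("hub", "q"), ("hub", "r"), ("hub", "s")])
)
print(path_bc["c"], star_bc["hub"])
```

Under this normalization, a node through which every shortest path between other nodes must pass, such as the center of a star, reaches the maximum value of 1, and a leaf scores 0.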

The sub-network shown in Figure 7c contains 64 genes and 124 associations. CTNNB1 and TOP3B were the two genes with the highest degree values (62). Nevertheless, these two genes were already known to be ASD-related genes. CTNNB1 has two known associations (to DDB1 and SIRT1), while TOP3B exhibited known associations with CLTC, PTPRK, HCFC1, FXR1, EEF2K, ALDH19A1, SYMPK, and DGKZ. These connecting genes could potentially be candidates for ASD. All information regarding the associations, association statuses, and gene statuses of all the sub-networks is provided in Supplementary Table S6.

4. Discussion

Our findings support several conclusions, the first of which concerns the value of data integration. Using the differential gene expression profile, we achieved good performance in association classification, and integrating network information from the gene embedding profile enhanced the classification performance further. Although the gene embedding profile alone yielded slightly lower performance, combining data from multiple sources clearly improved the overall results. We therefore conclude that data integration can improve association classification, as evidenced by our results.

In addition to utilizing diverse data characteristics, such as the differential gene expression profile and the gene embedding profile, we also integrated data concerning associations between genes related to ASD from the AutDB database [51,52] and information on known ASD-related genes from the SFARI and AutismKB databases to identify significant genes. It was observed that genes found in the association database occasionally did not appear in the ASD-related gene databases (SFARI and AutismKB). This could be due to our stringent selection process, whereby we only classified core genes and genes with high confidence scores from the databases as known ASD-related genes. Nevertheless, both datasets played a valuable role in identifying potential ASD-related genes within the sub-networks obtained through candidate network clustering.

From the results of the candidate gene inference, the top three genes, namely TOP1, ATP5F1C, and NCS1, emerged as the most frequent candidates in our association predictions. It has been reported that inhibiting TOP1 leads to the downregulation of long genes associated with synaptic function in neurons [53]. The study suggested that TOP1 mutations may contribute to ASD and other neurodevelopmental disorders. ATP5F1C, which encodes a subunit of mitochondrial ATP synthase [54], is involved in the conversion of physical energy into chemical energy [55,56,57]. While direct evidence linking ATP5F1C to ASD is lacking, alterations in energy homeostasis have been observed in individuals with ASD [58], making ATP5F1C a potential gene of interest. Additionally, NCS1 has been associated with ASD because a missense mutation (R102Q) in the human NCS1 protein was identified in an autistic patient [59]. TOP1, ATP5F1C, and NCS1 are therefore noteworthy candidate genes that warrant further investigation. Moreover, Rahman and colleagues [60] studied the gene expression profile of the brain cortex, analyzing 15 individuals with ASD and 15 control subjects using RNA-Seq transcriptomics. Their meta-analysis revealed 1567 differentially expressed genes. We compared these to our candidates and identified six genes in common: ACOT7, DNM1, DYNC1I1, GAD2, GPC2, and PSMB8. These genes merit further investigation in the context of ASD.

The GO enrichment analysis of our gene list revealed significant enrichment in the mitochondrial membrane and mitochondrial inner membrane in the cellular component category. This finding suggests that our candidate genes may be closely linked to the mitochondrial compartment, which is involved in respiratory chain function and has been associated with ASD [61,62]. Additionally, we observed enrichment in genes related to the negative regulation of macromolecule biosynthetic processes in the biological process category. This term covers processes that decrease the rate at which chemical reactions and pathways form macromolecules. However, direct evidence establishing a significant relationship between this process and ASD is currently lacking. Another notable term in the biological process category is ‘Cell–Matrix Adhesion’, which is a form of cell adhesion that induces interactions among cells and between these cells and the extracellular matrix, as well as intracellular signal generation [63]. The extracellular matrix plays crucial roles during brain development [64], suggesting that cell–matrix interactions might be related to brain development [65] and might have implications for ASD. Regarding the molecular function category, our gene list exhibited enrichment in terms related to protein kinase binding and kinase binding. Several studies have reported an association between ASD and the enrichment of the protein kinase binding terms in their gene sets [66,67]. Protein kinase binding is the process of binding to a protein kinase, which is an enzyme responsible for catalyzing the transfer of a phosphate group to a protein substrate. This function is vital for providing energy to carry out various cellular processes that may be relevant to ASD. ASD is a neurodevelopmental condition for which evidence of neurodegeneration has been reported [42,68], and, consistent with this, our KEGG pathway enrichment analysis revealed enrichment in the neurodegeneration pathways. 
In addition, our gene list showed enrichment in the Huntington’s disease pathway, suggesting a potential connection between this neurodegenerative disease and ASD [67]. Additionally, we conducted a pathway enrichment analysis on Reactome [42] and identified enrichment in ‘MHC Class II Antigen Presentation’ (R-HSA-2132295). This pathway is a part of the ‘Adaptive Immune System (Homo sapiens)’ (R-HSA-1280218) pathway, and previous research has highlighted the significant role of the adaptive immune response in the development of autism [18,68]. Overall, our enrichment analysis revealed that our candidate genes exhibit significant associations with mitochondrial function, the negative regulation of macromolecule biosynthesis, cell–matrix adhesion, protein kinase binding, neurodegeneration pathways, and immune system involvement, emphasizing their potential relevance to ASD. A study by Gevezova et al. (2023) has shown that upregulated genes in peripheral blood mononuclear cells (PBMCs) are associated with immune responses and defense mechanisms against viral infections [18]. Our study also identified enrichment in immune system involvement and adaptive immune responses. This suggests that immune dysregulation may contribute to ASD pathogenesis. In addition, downregulated genes in the central nervous system (CNS) enriched in mitochondrial dysfunctions, including the electron transport chain (ETC) and ATP production, have also been reported in [18]. We also identified significant associations with the mitochondrial membrane and the mitochondrial inner membrane, suggesting links between respiratory chain function and ASD. We highlighted enrichment in neurodegeneration pathways. This indicates that processes related to neurodegeneration may play a role in ASD development. Furthermore, potential drug targets for treating or preventing ASD were identified. Specific genes in the brain and peripheral blood have been reported in [18], while we identified genes such as TOP1, ATP5F1C, and NCS1 as high-frequency candidates with potential connections to ASD.

5. Conclusions

Our study employed a systematic analysis and data integration approaches to investigate genes associated with ASD. By analyzing the gene association network, integrating gene expression data, and applying a machine learning algorithm, we identified novel ASD associations and potential gene–gene interactions that may have been disregarded in earlier studies. Our method uncovered a total of 10,848 associations, all of which had the highest prediction score. From these associations, 125 candidate genes were inferred to be related to ASD. A statistical analysis allowed us to assess the relevance of candidate genes to specific functions and pathways associated with ASD. The candidate gene list exhibited enrichment in essential terms related to cellular components, biological processes, and molecular functions, e.g., the mitochondrial membrane, the negative regulation of macromolecule biosynthetic processes, cell–matrix adhesion, and kinase binding. Furthermore, our gene list demonstrated enrichment in significant pathways, including neurodegeneration and MHC class II antigen presentation. The enrichment of these terms suggests that multiple underlying mechanisms are involved in ASD. Lastly, we constructed and clustered a network comprising candidate genes, allowing us to explore the sub-interactions occurring between different groups of genes. Through an analysis of these sub-networks, we identified novel and promising genes that are linked to known ASD-related genes or serve as connecting genes in established associations, even though they were not present in the database of known ASD-related genes. Several genes, such as TOP1, RBM17, RPL3, SNRPN, UBC, GTF3C3, DDB1, SIRT1, CLTC, PTPRK, HCFC1, FXR1, EEF2K, ALDH19A1, SYMPK, and DGKZ, emerged as potential candidates associated with ASD. These candidates justify further investigation. Overall, our approach, combining network-based gene associations, gene expression analysis, and machine learning, offers a comprehensive framework for ASD research. It represents a significant step towards unraveling the complexities of ASD and provides a foundation for further investigations into the development of effective treatments and interventions for this complex disorder.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app13158980/s1, Figure S1: ROC curves from (a) XGBoost with complete association features, (b) XGBoost with association features based on differential expression profile, (c) XGBoost with association features based on gene embedding profile, (d) random forest classifier, (e) naïve Bayes classifier, and (f) neural network; Figure S2: Average correlation values of top-ranked features; Table S1: List of hyperparameters for classification models; Table S2: Proportion of association features calculated from the top selected features of the gene embedding profile and differential expression profile; Table S3: The summary of the performance metrics for all classifiers; Table S4: List of 125 candidate genes and their frequencies; Table S5: Enrichment analysis results for cellular components, biological processes, molecular functions, and pathways; Table S6: List of gene associations in sub-networks.

Author Contributions

Conceptualization, A.S. and K.P.; formal analysis, A.S. and K.P.; funding acquisition, A.S.; methodology, A.S. and K.P.; writing—original draft preparation, A.S.; writing—review and editing, A.S. and K.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by King Mongkut’s University of Technology North Bangkok, Contract no. KMUTNB-65-KNOW-16.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors acknowledge the NSTDA Supercomputer Center (ThaiSC) for providing computing resources.

Conflicts of Interest

The authors declare no conflict of interest.

References

An, J.Y.; Claudianos, C. Genetic heterogeneity in autism: From single gene to a pathway perspective. Neurosci. Biobehav. Rev. 2016, 68, 442–453.
Hirota, T.; King, B.H. Autism Spectrum Disorder: A Review. JAMA 2023, 329, 157–168.
Masini, E.; Loi, E.; Vega-Benedetti, A.F.; Carta, M.; Doneddu, G.; Fadda, R.; Zavattari, P. An Overview of the Main Genetic, Epigenetic and Environmental Factors Involved in Autism Spectrum Disorder Focusing on Synaptic Activity. Int. J. Mol. Sci. 2020, 21, 8290.
Havdahl, A.; Niarchou, M.; Starnawska, A.; Uddin, M.; van der Merwe, C.; Warrier, V. Genetic contributions to autism spectrum disorder. Psychol. Med. 2021, 51, 2260–2273.
Sato, A.; Ikeda, K. Genetic and Environmental Contributions to Autism Spectrum Disorder Through Mechanistic Target of Rapamycin. Biol. Psychiatry Glob. Open Sci. 2022, 2, 95–105.
Jiang, M.; Lu, T.; Yang, K.; Li, X.; Zhao, L.; Zhang, D.; Li, J.; Wang, L. Autism spectrum disorder research: Knowledge mapping of progress and focus between 2011 and 2022. Front Psychiatry 2023, 14, 1096769.
Hyman, S.L.; Levy, S.E.; Myers, S.M.; Council on Children with Disabilities; Section on Developmental and Behavioral Pediatrics. Identification, Evaluation, and Management of Children With Autism Spectrum Disorder. Pediatrics 2020, 145, e20193447.
Nisar, S.; Haris, M. Neuroimaging genetics approaches to identify new biomarkers for the early diagnosis of autism spectrum disorder. Mol. Psychiatry 2023.
Satterstrom, F.K.; Kosmicki, J.A.; Wang, J.; Breen, M.S.; De Rubeis, S.; An, J.Y.; Peng, M.; Collins, R.; Grove, J.; Klei, L.; et al. Large-Scale Exome Sequencing Study Implicates Both Developmental and Functional Changes in the Neurobiology of Autism. Cell 2020, 180, 568–584.e523.
Abrahams, B.S.; Arking, D.E.; Campbell, D.B.; Mefford, H.C.; Morrow, E.M.; Weiss, L.A.; Menashe, I.; Wadkins, T.; Banerjee-Basu, S.; Packer, A. SFARI Gene 2.0: A community-driven knowledgebase for the autism spectrum disorders (ASDs). Mol. Autism 2013, 4, 36.
Velinov, M. Genomic Copy Number Variations in the Autism Clinic-Work in Progress. Front. Cell Neurosci. 2019, 13, 57.
Sanders, S.J.; Murtha, M.T.; Gupta, A.R.; Murdoch, J.D.; Raubeson, M.J.; Willsey, A.J.; Ercan-Sencicek, A.G.; DiLullo, N.M.; Parikshak, N.N.; Stein, J.L.; et al. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 2012, 485, 237–241.
Doan, R.N.; Lim, E.T.; De Rubeis, S.; Betancur, C.; Cutler, D.J.; Chiocchetti, A.G.; Overman, L.M.; Soucy, A.; Goetze, S.; Autism Sequencing, C.; et al. Recessive gene disruptions in autism spectrum disorder. Nat. Genet. 2019, 51, 1092–1098.
Pereanu, W.; Larsen, E.C.; Das, I.; Estevez, M.A.; Sarkar, A.A.; Spring-Pearson, S.; Kollu, R.; Basu, S.N.; Banerjee-Basu, S. AutDB: A platform to decode the genetic architecture of autism. Nucleic Acids Res. 2018, 46, D1049–D1054.
Basu, S.N.; Kollu, R.; Banerjee-Basu, S. AutDB: A gene reference resource for autism research. Nucleic Acids Res. 2009, 37, D832–D836.
Yang, C.; Li, J.; Wu, Q.; Yang, X.; Huang, A.Y.; Zhang, J.; Ye, A.Y.; Dou, Y.; Yan, L.; Zhou, W.Z.; et al. AutismKB 2.0: A knowledgebase for the genetic evidence of autism spectrum disorder. Database 2018, 2018, bay106.
Cheroni, C.; Caporale, N.; Testa, G. Autism spectrum disorder at the crossroad between genes and environment: Contributions, convergences, and interactions in ASD developmental pathophysiology. Mol. Autism 2020, 11, 69.
Gevezova, M.; Sbirkov, Y.; Sarafian, V.; Plaimas, K.; Suratanee, A.; Maes, M. Autistic spectrum disorder (ASD)–Gene, molecular and pathway signatures linking systemic inflammation, mitochondrial dysfunction, transsynaptic signalling, and neurodevelopment. Brain Behav. Immun. Health 2023, 30, 100646.
Jiang, C.C.; Lin, L.S.; Long, S.; Ke, X.Y.; Fukunaga, K.; Lu, Y.M.; Han, F. Signalling pathways in autism spectrum disorder: Mechanisms and therapeutic implications. Signal Transduct. Target. Ther. 2022, 7, 229.
Apte, M.; Kumar, A. Correlation of mutated gene and signalling pathways in ASD. IBRO Neurosci. Rep. 2023, 14, 384–392.
Janyasupab, P.; Suratanee, A.; Plaimas, K. Network diffusion with centrality measures to identify disease-related genes. Math. Biosci. Eng. 2021, 18, 2909–2929.
Suratanee, A.; Buaboocha, T.; Plaimas, K. Prediction of Human-Plasmodium vivax Protein Associations From Heterogeneous Network Structures Based on Machine-Learning Approach. Bioinform. Biol. Insights 2021, 15, 13350.
Suratanee, A.; Plaimas, K. Reverse Nearest Neighbor Search on a Protein-Protein Interaction Network to Infer Protein-Disease Associations. Bioinform. Biol. Insights 2017, 11, 1177932217720405.
Kim, Y.; Park, J.H.; Cho, Y.R. Network-Based Approaches for Disease-Gene Association Prediction Using Protein-Protein Interaction Networks. Int. J. Mol. Sci. 2022, 23, 7411.
Barabasi, D.L.; Bianconi, G.; Bullmore, E.; Burgess, M.; Chung, S.; Eliassi-Rad, T.; George, D.; Kovacs, I.A.; Makse, H.; Papadimitriou, C.; et al. Neuroscience needs Network Science. arXiv 2023, arXiv:2305.06160.
Galindez, G.; Sadegh, S.; Baumbach, J.; Kacprowski, T.; List, M. Network-based approaches for modeling disease regulation and progression. Comput. Struct. Biotechnol. J. 2023, 21, 780–795.
Wang, X.; Gong, Y.; Yi, J.; Zhang, W. Predicting gene-disease associations from the heterogeneous network using graph embedding. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 504–511.
Ata, S.K.; Ou-Yang, L.; Fang, Y.; Kwoh, C.K.; Wu, M.; Li, X.L. Integrating node embeddings and biological annotations for genes to predict disease-gene associations. BMC Syst. Biol. 2018, 12, 138.
UniProt, C. UniProt: The Universal Protein Knowledgebase in 2023. Nucleic Acids Res 2023, 51, D523–D531.
Lagisetty, Y.; Bourquard, T.; Al-Ramahi, I.; Mangleburg, C.G.; Mota, S.; Soleimani, S.; Shulman, J.M.; Botas, J.; Lee, K.; Lichtarge, O. Identification of risk genes for Alzheimer’s disease by gene embedding. Cell Genom. 2022, 2, 162.
Barrett, T.; Wilhite, S.E.; Ledoux, P.; Evangelista, C.; Kim, I.F.; Tomashevsky, M.; Marshall, K.A.; Phillippy, K.H.; Sherman, P.M.; Holko, M.; et al. NCBI GEO: Archive for functional genomics data sets—Update. Nucleic Acids Res. 2013, 41, D991–D995.
Seal, R.L.; Braschi, B.; Gray, K.; Jones, T.E.M.; Tweedie, S.; Haim-Vilmovsky, L.; Bruford, E.A. Genenames.org: The HGNC resources in 2023. Nucleic Acids Res 2023, 51, D1003–D1009.
Kolberg, L.; Raudvere, U.; Kuzmin, I.; Vilo, J.; Peterson, H. gprofiler2–An R package for gene list functional enrichment analysis and namespace conversion toolset g:Profiler. F1000Research 2020, 9, ELIXIR-709.
Wang, D.; Cui, P.; Zhu, W. Structural Deep Network Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1225–1234.
Perozzi, B.; Al-Rfou, R. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710.
Szklarczyk, D.; Kirsch, R.; Koutrouli, M.; Nastou, K.; Mehryary, F.; Hachilif, R.; Gable, A.L.; Fang, T.; Doncheva, N.T.; Pyysalo, S.; et al. The STRING database in 2023: Protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023, 51, D638–D646.
Rehurek, R.; Sojka, P. Gensim–Python Framework for Vector Space Modelling; NLP Centre, Faculty of Informatics, Masaryk University: Brno, Czech Republic, 2011; Volume 3.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Gene Ontology, C.; Aleksander, S.A.; Balhoff, J.; Carbon, S.; Cherry, J.M.; Drabkin, H.J.; Ebert, D.; Feuermann, M.; Gaudet, P.; Harris, N.L.; et al. The Gene Ontology knowledgebase in 2023. Genetics 2023, 224, iyad031.
Kanehisa, M.; Furumichi, M.; Sato, Y.; Kawashima, M.; Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023, 51, D587–D592.
Gillespie, M.; Jassal, B.; Stephan, R.; Milacic, M.; Rothfels, K.; Senff-Ribeiro, A.; Griss, J.; Sevilla, C.; Matthews, L.; Gong, C.; et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022, 50, D687–D692.
Kuleshov, M.V.; Jones, M.R.; Rouillard, A.D.; Fernandez, N.F.; Duan, Q.; Wang, Z.; Koplev, S.; Jenkins, S.L.; Jagodnik, K.M.; Lachmann, A.; et al. Enrichr: A comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016, 44, W90–W97.
Kotloff, K.L.; Riddle, M.S.; Platts-Mills, J.A.; Pavlinac, P.; Zaidi, A.K.M. Shigellosis. Lancet 2018, 391, 801–812.
Wang, J.; Zhong, J.; Chen, G.; Li, M.; Wu, F.X.; Pan, Y. ClusterViz: A Cytoscape APP for Cluster Analysis of Biological Network. IEEE/ACM Trans. Comput. Biol. Bioinform. 2015, 12, 815–822.
Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N.S.; Wang, J.T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13, 2498–2504.
Shen, H.; Cheng, X.; Cai, K.; Hu, M.-B. Detect overlapping and hierarchical community structure in networks. Phys. A Stat. Mech. Appl. 2009, 388, 1706–1712.
King, I.F.; Yandava, C.N.; Mabb, A.M.; Hsiao, J.S.; Huang, H.S.; Pearson, B.L.; Calabrese, J.M.; Starmer, J.; Parker, J.S.; Magnuson, T.; et al. Topoisomerases facilitate transcription of long genes linked to autism. Nature 2013, 501, 58–62.
Mabb, A.M.; Kullmann, P.H.; Twomey, M.A.; Miriyala, J.; Philpot, B.D.; Zylka, M.J. Topoisomerase 1 inhibition reversibly impairs synaptic function. Proc. Natl. Acad. Sci. USA 2014, 111, 17290–17295.
Ji, X.; Kember, R.L.; Brown, C.D.; Bucan, M. Increased burden of deleterious variants in essential genes in autism spectrum disorder. Proc. Natl. Acad. Sci. USA 2016, 113, 15054–15059.
Wen, W.X.; Mead, A.J.; Thongjuea, S. MARVEL: An integrated alternative splicing analysis platform for single-cell RNA sequencing data. Nucleic Acids Res. 2023, 51, e29.
Fiorillo, M.; Scatena, C.; Naccarato, A.G.; Sotgia, F.; Lisanti, M.P. Bedaquiline, an FDA-approved drug, inhibits mitochondrial ATP production and metastasis in vivo, by targeting the gamma subunit (ATP5F1C) of the ATP synthase. Cell Death Differ. 2021, 28, 2797–2817.
Walker, J.E. The ATP synthase: The understood, the uncertain and the unknown. Biochem. Soc. Trans. 2013, 41, 1–16.
Fiorillo, M.; Ozsvari, B.; Sotgia, F.; Lisanti, M.P. High ATP Production Fuels Cancer Drug Resistance and Metastasis: Implications for Mitochondrial ATP Depletion Therapy. Front. Oncol. 2021, 11, 740720.
Gevezova, M.; Minchev, D.; Pacheva, I.; Todorova, T.; Yordanova, R.; Timova, E.; Ivanov, I.; Sarafian, V. Association of NGF and Mitochondrial Respiration with Autism Spectrum Disorder. Int. J. Mol. Sci. 2022, 23, 11917.
Piton, A.; Michaud, J.L.; Peng, H.; Aradhya, S.; Gauthier, J.; Mottron, L.; Champagne, N.; Lafreniere, R.G.; Hamdan, F.F.; S2D Team; et al. Mutations in the calcium-related gene IL1RAPL1 are associated with autism. Hum. Mol. Genet. 2008, 17, 3965–3974.
Rahman, M.R.; Petralia, M.C.; Ciurleo, R.; Bramanti, A.; Fagone, P.; Shahjaman, M.; Wu, L.; Sun, Y.; Turanli, B.; Arga, K.Y.; et al. Comprehensive Analysis of RNA-Seq Gene Expression Profiling of Brain Transcriptomes Reveals Novel Genes, Regulators, and Pathways in Autism Spectrum Disorder. Brain Sci. 2020, 10, 747.
Frye, R.E.; Lionnard, L.; Singh, I.; Karim, M.A.; Chajra, H.; Frechet, M.; Kissa, K.; Racine, V.; Ammanamanchi, A.; McCarty, P.J.; et al. Mitochondrial morphology is associated with respiratory chain uncoupling in autism spectrum disorder. Transl. Psychiatry 2021, 11, 527.
Tang, G.; Gutierrez Rios, P.; Kuo, S.H.; Akman, H.O.; Rosoklija, G.; Tanji, K.; Dwork, A.; Schon, E.A.; Dimauro, S.; Goldman, J.; et al. Mitochondrial abnormalities in temporal lobe of autistic brain. Neurobiol. Dis. 2013, 54, 349–361.
Meldolesi, J. Pharmacology of the cell/matrix form of adhesion. Pharmacol. Res. 2016, 107, 430–436.
Soles, A.; Selimovic, A.; Sbrocco, K.; Ghannoum, F.; Hamel, K.; Moncada, E.L.; Gilliat, S.; Cvetanovic, M. Extracellular Matrix Regulation in Physiology and in Brain Disease. Int. J. Mol. Sci. 2023, 24, 7049.
Dwivedi, I.; Caldwell, A.B.; Zhou, D.; Wu, W.; Subramaniam, S.; Haddad, G.G. Methadone alters transcriptional programs associated with synapse formation in human cortical organoids. Transl. Psychiatry 2023, 13, 151.
Gabrielli, A.P.; Manzardo, A.M.; Butler, M.G. GeneAnalytics Pathways and Profiling of Shared Autism and Cancer Genes. Int. J. Mol. Sci. 2019, 20, 1166.
Nomura, J.; Mardo, M.; Takumi, T. Molecular signatures from multi-omics of autism spectrum disorders and schizophrenia. J. Neurochem. 2021, 159, 647–659.
Yousefi, B.; Kokhaei, P.; Mehranfar, F.; Bahar, A.; Abdolshahi, A.; Emadi, A.; Eslami, M. The role of the host microbiome in autism and neurodegenerative disorders and effect of epigenetic procedures in the brain functions. Neurosci. Biobehav. Rev. 2022, 132, 998–1009.
Kern, J.K.; Geier, D.A.; Sykes, L.K.; Geier, M.R. Evidence of neurodegeneration in autism spectrum disorder. Transl. Neurodegener. 2013, 2, 17.
Piras, I.S.; Picinelli, C.; Iennaco, R.; Baccarin, M.; Castronovo, P.; Tomaiuolo, P.; Cucinotta, F.; Ricciardello, A.; Turriziani, L.; Nanetti, L.; et al. Huntingtin gene CAG repeat size affects autism risk: Family-based and case-control association study. Am. J. Med. Genet. B Neuropsychiatr. Genet. 2020, 183, 341–351.
Cohly, H.H.; Panja, A. Immunological findings in autism. Int. Rev. Neurobiol. 2005, 71, 317–341.

Figure 1. Overview of gene association analysis framework.

Figure 2. Bar charts comparing the performance of the XGBoost, random forest, naïve Bayes, and neural network classifiers. AUC is the area under the ROC curve and ACC is the accuracy. Error bars indicate the standard deviations of the scores obtained from ten repetitions of five-fold cross-validation.
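The evaluation protocol summarized in the Figure 2 caption (five-fold cross-validation repeated ten times, with AUC and ACC reported as a mean and a standard deviation) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' code: the synthetic one-dimensional data, the midpoint-threshold classifier standing in for XGBoost and the other models, the stratified splitting, and all function names are assumptions made only for illustration.

```python
import random
import statistics

def accuracy(y_true, y_pred):
    """ACC: fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_stratified_kfold(y, k=5, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs: k stratified folds, reshuffled `repeats` times."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    for _ in range(repeats):
        rng.shuffle(pos)
        rng.shuffle(neg)
        for f in range(k):
            test = pos[f::k] + neg[f::k]  # every k-th index of each class
            held_out = set(test)
            train = [i for i in range(len(y)) if i not in held_out]
            yield train, test

# Synthetic data (assumption): one feature, two overlapping Gaussian classes.
data_rng = random.Random(42)
y = [1] * 100 + [0] * 100
x = [data_rng.gauss(1.0 if t == 1 else -1.0, 1.0) for t in y]

acc_scores, auc_scores = [], []
for train, test in repeated_stratified_kfold(y, k=5, repeats=10):
    # Toy stand-in classifier: threshold halfway between the training class means.
    mean_pos = statistics.mean([x[i] for i in train if y[i] == 1])
    mean_neg = statistics.mean([x[i] for i in train if y[i] == 0])
    threshold = (mean_pos + mean_neg) / 2
    y_test = [y[i] for i in test]
    scores = [x[i] - threshold for i in test]
    preds = [1 if s > 0 else 0 for s in scores]
    acc_scores.append(accuracy(y_test, preds))
    auc_scores.append(auc(y_test, scores))

# Bar height = mean over the 50 folds; error bar = standard deviation.
print(f"ACC {statistics.mean(acc_scores):.3f} ± {statistics.stdev(acc_scores):.3f}")
print(f"AUC {statistics.mean(auc_scores):.3f} ± {statistics.stdev(auc_scores):.3f}")
```

Each repetition reshuffles the samples before splitting, so the 50 fold scores reflect variation across different partitions of the same data; that spread is what the error bars in a chart of this kind convey.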

Figure 3. (a) Enriched cellular component gene ontology terms; (b) enriched biological process gene ontology terms; (c) enriched molecular function gene ontology terms.

Figure 4. Results of KEGG pathway enrichment analysis.

Figure 5. Results of Reactome pathway enrichment analysis.

Figure 6. The network of candidate genes. The red nodes represent known ASD-related genes and the green nodes represent new candidate genes. The bold lines represent known associations and the dashed lines represent unknown associations.

Figure 7. Three sub-networks created by clustering the candidate network: (a) the sub-network consists of 53 genes and 87 associations, (b) the sub-network consists of 33 genes and 60 associations, and (c) the sub-network consists of 64 genes and 124 associations. The red nodes represent known ASD-related genes and the green nodes represent new candidate genes. The bold lines represent known associations and the dashed lines represent unknown associations.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
